Danilo S. Carvalho, Ph.D.

Formal Semantic Controls over Language Models

Tutorial @ LREC-COLING 2024

Date: May 20, 2024

Address: Lingotto Conference Centre - Torino, Italy

Text embeddings provide a concise representation of the semantics of sentences and larger spans of text, rather than individual words, capturing a wide range of linguistic features. They have found increasing application in a variety of NLP tasks, including machine translation and natural language inference. While most recent breakthroughs in task performance are being achieved by large-scale distributional models, there is a growing disconnect between their knowledge representation and traditional semantics, which hinders efforts to capture such knowledge in human-interpretable form or to explain model inference behaviour.

In this tutorial, we cover the analysis and control of text representations, from the basics to cutting-edge research, aiming to narrow the gap between deep latent semantics and formal symbolics. This includes considerations on knowledge formalisation, the linguistic information that can be extracted and measured from distributional models, and intervention techniques that enable explainable reasoning and controllable text generation, covering methods from pooling to LLM-based approaches.

Materials now online!

Tutorial slides

LangVAE example code notebook

LangSpace example code notebook

Instructors / Organisation

Danilo S. Carvalho
National Biomarker Centre - CRUK-MI, University of Manchester, UK
Bio: Danilo Carvalho is a Principal Clinical Informatician (Research Associate) at the National Biomarker Centre - Cancer Research UK - Manchester Institute, University of Manchester, working on Safe and Explainable Artificial Intelligence (AI) architectures. He has experience in both industry and academia, having presented work at multiple international conferences over the past 10 years, such as EACL and ESANN. His main area of expertise is representation learning for NLP, and his research interests include explainable AI and legal and patent text processing.

Yingji Zhang
Department of Computer Science, University of Manchester, UK
Bio: Yingji Zhang is a third-year PhD student at the University of Manchester. His research interests include natural language inference, controllable natural language generation, and disentangled representation learning.

Andre Freitas
Department of Computer Science, University of Manchester, UK
National Biomarker Centre - CRUK-MI, University of Manchester, UK
Idiap Institute, Switzerland
Bio: Andre Freitas is a Senior Lecturer at the Department of Computer Science at the University of Manchester. He leads the Neuro-symbolic AI group at Idiap and at the Department of Computer Science at the University of Manchester. His main research interests lie in enabling the development of AI methods to support abstract, explainable and flexible inference. In particular, he investigates how the combination of neural and symbolic data representation paradigms can deliver better inference. His research topics include explanation generation, natural language inference, explainable question answering, knowledge graphs and open information extraction.

Outline

The evolutionary arc from word embeddings to LLMs vs. formal linguistics
Contrastive learning and conceptual modeling
Interpretability and formal linguistics
Disentanglement and separability
Control mechanisms for text generation and inference over latent spaces
The role of compositionality in improving representations
Employing autoencoders for efficiency and control
Controlling the semantic properties of large language models
Probing sentence latent spaces: geometrical and linguistic properties