Projects

LangVAE & LangSpace

LangVAE is a Python library for training and running language models using Variational Autoencoders (VAEs). It provides an easy-to-use interface to train VAEs on text data, allowing users to customize the model architecture, loss function, and training parameters.
LangSpace is a Python library for evaluating and probing language models using Variational Autoencoders (VAEs). It provides an easy-to-use interface to perform a variety of analyses on pretrained LangVAE models.

Project repos: https://github.com/neuro-symbolic-ai/LangVAE, https://github.com/neuro-symbolic-ai/LangSpace

Related publications:

Danilo S. Carvalho, Yingji Zhang, Harriet Unsworth, André Freitas. LangVAE and LangSpace: Building and Probing for Language Model VAEs

SAF-Datasets

The saf-datasets library provides easy access to Natural Language Processing (NLP) datasets, and tools to facilitate annotation at document, sentence and token levels. It is built upon the Simple Annotation Framework (SAF) library, which provides its data model and API.

Project repo: https://github.com/neuro-symbolic-ai/saf_datasets

SAF

The Simple Annotation Framework (SAF) is a lightweight Python library for annotating text data. It provides a simple and flexible way to create, manipulate, and export annotations in various formats. Its minimalistic data model is flexible enough to be used for most types of linguistic annotation, and can store other types of data associated with the language items (e.g., statistics, data sources, schemas).

Project repo: https://github.com/dscarvalho/saf

Term Definition Vectors (TDV)

TDV is a high-dimensional, sparse vector representation of lexical items (terms) ranging from morphemes to phrases, based on the definitions of their meanings as presented in Wiktionary. It contrasts with distributional representation methods, such as word2vec and GloVe, which define term meanings from their usage patterns (context windows). Compared to distributional methods, TDV performs better at semantic similarity computation, whereas the former perform better at semantic relatedness. It also provides useful features for Natural Language Processing, such as sense polarity between terms, multi-language representation, and the ability to disambiguate senses using Part of Speech (POS) information.
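To make the idea of a definition-based sparse vector concrete, here is a minimal toy sketch. It is not the TDV algorithm or its API: the definitions are hand-written stand-ins rather than Wiktionary entries, and the names `DEFINITIONS`, `STOPWORDS`, `definition_vector` and `cosine` are hypothetical. It only illustrates that terms with overlapping definitions get similar vectors, and that keying entries by part of speech keeps senses apart.

```python
import math
from collections import Counter

# Hypothetical, hand-written definitions (NOT actual Wiktionary entries),
# keyed by (term, part of speech) so senses can be separated by POS.
DEFINITIONS = {
    ("car", "noun"): ["a wheeled motor vehicle used for transport"],
    ("automobile", "noun"): ["a motor vehicle with wheels used for transport"],
    ("road", "noun"): ["a paved way on which vehicles travel"],
    ("bank", "noun"): ["an institution that keeps and lends money"],
    ("bank", "verb"): ["to tilt an aircraft sideways when turning"],
}
STOPWORDS = {"a", "an", "the", "for", "on", "which", "that",
             "and", "to", "with", "when", "used"}


def definition_vector(term, pos):
    """Sparse vector: one dimension per content word in the definitions."""
    counts = Counter()
    for definition in DEFINITIONS[(term, pos)]:
        counts.update(w for w in definition.split() if w not in STOPWORDS)
    return dict(counts)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(dim, 0.0) for dim, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


car = definition_vector("car", "noun")
automobile = definition_vector("automobile", "noun")
road = definition_vector("road", "noun")

# Near-synonyms share most definition words; "road" is related to "car"
# in usage but shares no definition words with it in this toy data.
print(cosine(car, automobile))
print(cosine(car, road))
```

In this toy data the near-synonym pair scores high while the merely related pair scores zero, which mirrors the similarity versus relatedness distinction described above.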

Project repo: https://github.com/dscarvalho/tdv

Related publications:

Danilo S. Carvalho and Minh Le Nguyen. Building Lexical Vector Representations from Concept Definitions.
Danilo S. Carvalho and Minh Le Nguyen. WikTDV: Data extraction and vector representation resource for Wiktionary senses

EasyESA

EasyESA is an implementation of Explicit Semantic Analysis (ESA) based on the Wikiprep-ESA code from Çağatay Çallı https://github.com/faraday/wikiprep-esa. It runs as a JSON webservice which can be queried for the semantic relatedness measure, concept vectors or the context windows.

Explicit Semantic Analysis (ESA) is a technique for text representation that uses Wikipedia as a commonsense knowledge base, treating each article as a concept and relying on the co-occurrence of words within articles. An article’s words are associated with its concept using TF-IDF scoring, so a word can be represented as a vector of its associations to each concept, and the “semantic relatedness” between any two words can be measured by the cosine similarity of their vectors. A document containing a string of words is represented as the centroid of the vectors representing its words.
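The steps described above (TF-IDF association of words to concepts, cosine similarity between word vectors, and a centroid for documents) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not EasyESA's code or web service API: the three "articles" are toy stand-ins for Wikipedia, the TF-IDF variant is one simple choice among many, and the names `ARTICLES`, `build_index`, `cosine` and `centroid` are hypothetical.

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia articles; each title plays the role of a concept.
ARTICLES = {
    "Cat": "cat feline pet animal whiskers pet",
    "Dog": "dog canine pet animal bark",
    "Car": "car vehicle engine wheel road",
}


def build_index(articles):
    """Map each word to a sparse {concept: TF-IDF weight} vector."""
    n_articles = len(articles)
    counts = {c: Counter(text.split()) for c, text in articles.items()}
    # Document frequency: in how many articles each word occurs.
    df = Counter(word for tf in counts.values() for word in tf)
    index = {}
    for concept, tf in counts.items():
        total = sum(tf.values())
        for word, k in tf.items():
            idf = math.log(n_articles / df[word])
            index.setdefault(word, {})[concept] = (k / total) * idf
    return index


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(weight * v.get(c, 0.0) for c, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(words, index):
    """Represent a document as the mean of its known words' vectors."""
    vectors = [index[w] for w in words if w in index]
    result = {}
    for vec in vectors:
        for concept, weight in vec.items():
            result[concept] = result.get(concept, 0.0) + weight / len(vectors)
    return result


index = build_index(ARTICLES)
doc = centroid("cat pet".split(), index)

print(cosine(index["cat"], index["feline"]))  # both occur only in "Cat"
print(cosine(index["cat"], index["car"]))     # no shared concept
print(cosine(doc, index["dog"]))              # linked through "pet"
```

Storing the vectors as dictionaries keeps them sparse, which matters at Wikipedia scale where most words are associated with only a small fraction of the concepts.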

Project repo: https://github.com/dscarvalho/easyesa

Publication: Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry, EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis, In Proceedings of the 13th International Semantic Web Conference (ISWC), Riva del Garda, 2014. (Demonstration Paper in Proceedings).

Graphia

Graphia is a framework which extracts structured data graphs from factual unstructured texts. Instead of extracting simple relations, or committing to a specific conceptual model, Graphia aims at the extraction of graphs which can represent the complexity of contexts present in texts.

The graph representation adopted by the framework (SDGs – Structured Discourse Graphs) can be naturally serialized as an entity-centric RDF graph, which facilitates the integration and the use of the graph with other resources and applications. Additionally, the graph representation supports a pay-as-you-go / semantic best-effort extraction, where a comprehensive extraction is prioritized over accuracy and where the quality of the extracted graph evolves over time.
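As a rough illustration of what an entity-centric RDF serialization looks like, the sketch below groups triples by subject entity and writes them out as N-Triples. It is not Graphia's implementation and does not reproduce the SDG model: the `http://example.org/` namespace, the property names, and the helpers `term`, `group_by_entity` and `to_ntriples` are all assumptions made for the example.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace, for illustration only


def term(value):
    """Render an IRI in angle brackets, anything else as a quoted literal."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def group_by_entity(triples):
    """Entity-centric view: collect (predicate, object) pairs per subject."""
    grouped = defaultdict(list)
    for subject, predicate, obj in triples:
        grouped[subject].append((predicate, obj))
    return grouped


def to_ntriples(triples):
    """Serialize triples as N-Triples, keeping each entity's lines together."""
    lines = []
    for subject, pairs in group_by_entity(triples).items():
        for predicate, obj in pairs:
            lines.append(f"<{subject}> <{predicate}> {term(obj)} .")
    return "\n".join(lines)


# Made-up triples using the hypothetical namespace above.
triples = [
    (EX + "entity/Graphia", EX + "prop/extracts",
     EX + "entity/StructuredDiscourseGraph"),
    (EX + "entity/StructuredDiscourseGraph", EX + "prop/label",
     "Structured Discourse Graph"),
    (EX + "entity/Graphia", EX + "prop/label", "Graphia"),
]

ntriples = to_ntriples(triples)
print(ntriples)
```

Because every statement about an entity is keyed by that entity's IRI, a graph serialized this way can be merged with other RDF resources that use the same identifiers.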

Features included in the framework:

Structured Discourse Graph extraction and visualization.
Named entity resolution to DBpedia entities.
Co-reference resolution.
Serialization as RDF.

Project page: graphia.dcc.ufrj.br
Online demo: graphia.dcc.ufrj.br/OnlineDemo

Related publications:

A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia
In Proceedings of the 1st Workshop on the Web of Linked Entities (WoLE 2012) at the 11th International Semantic Web Conference (ISWC), 2012 (Workshop Full Paper)
Graphia: Extracting Contextual Relation Graphs from Text
In Proceedings of the 10th Extended Semantic Web Conference (ESWC), Montpellier, France, 2013. (Demonstration Paper in Proceedings).
Representing Texts as Contextualized Entity-Centric Linked Data Graphs
In Proceedings of the 12th International Workshop on Web Semantics and Web Intelligence (WebS 2013), 24th International Conference on Database and Expert Systems Applications (DEXA), Prague, 2013. (Workshop Full Paper)

Danilo S. Carvalho, Ph.D. Applied Research - Artificial Intelligence / NLP
