Danilo S. Carvalho, Ph.D. Applied Research - Artificial Intelligence / NLP

Projects

LangVAE & LangSpace

LangVAE is a Python library for training and running language models using Variational Autoencoders (VAEs). It provides an easy-to-use interface to train VAEs on text data, allowing users to customize the model architecture, loss function, and training parameters.
LangSpace is a Python library for evaluating and probing language models using Variational Autoencoders (VAEs). It provides an easy-to-use interface to perform a variety of analyses on pretrained LangVAE models.

Project repos: https://github.com/neuro-symbolic-ai/LangVAE, https://github.com/neuro-symbolic-ai/LangSpace

Related publications:


SAF-Datasets

The saf-datasets library provides easy access to Natural Language Processing (NLP) datasets, and tools to facilitate annotation at document, sentence and token levels. It is built upon the Simple Annotation Framework (SAF) library, which provides its data model and API.

Project repo: https://github.com/neuro-symbolic-ai/saf_datasets


SAF

The Simple Annotation Framework (SAF) is a lightweight Python library for annotating text data. It provides a simple and flexible way to create, manipulate, and export annotations in various formats. Its minimalistic data model is flexible enough to be used for most types of linguistic annotation, and can store other types of data associated with the language items (e.g., statistics, data sources, schemas, etc.).

Project repo: https://github.com/dscarvalho/saf


Term Definition Vectors (TDV)

TDV is a high-dimensional, sparse vector representation of lexical items (terms) ranging from morphemes to phrases, based on the definitions of their meanings as presented in Wiktionary. It contrasts with distributional representation methods, such as word2vec and GloVe, which define term meanings from their usage patterns (context windows). TDV performs better at computing semantic similarity, whereas distributional methods perform better at semantic relatedness. It also provides useful features for Natural Language Processing, such as sense polarity between terms, multi-language representation, and the ability to disambiguate senses using Part of Speech (POS) information.

Project repo: https://github.com/dscarvalho/tdv

Related publications:


EasyESA

EasyESA is an implementation of Explicit Semantic Analysis (ESA) based on the Wikiprep-ESA code from Çağatay Çallı (https://github.com/faraday/wikiprep-esa). It runs as a JSON webservice which can be queried for the semantic relatedness measure, concept vectors or the context windows.

Explicit Semantic Analysis (ESA) is a technique for text representation that uses Wikipedia as a commonsense knowledge base, exploiting the co-occurrence of words in the text of its articles. Each article is treated as a concept, and the words of an article are associated with that concept using TF-IDF scoring. A word can thus be represented as a vector of its associations to each concept, and the “semantic relatedness” between any two words can be measured by means of cosine similarity between their vectors. A document containing a string of words is represented as the centroid of the vectors representing its words.
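
The ESA idea described above can be illustrated with a minimal, self-contained Python sketch. It is not EasyESA's code or API (EasyESA is queried as a JSON webservice); the three toy "articles" and the function names build_word_vectors, text_vector and relatedness are hypothetical, chosen only to show TF-IDF concept vectors, centroids and cosine similarity on a tiny scale.

```python
import math
from collections import Counter

# Toy "concepts": hypothetical stand-ins for Wikipedia articles.
CONCEPTS = {
    "Cat": "cat feline pet animal cat purr mouse",
    "Dog": "dog canine pet animal bark cat chase",
    "Car": "car engine wheel road vehicle drive road",
}


def build_word_vectors(concepts):
    """Map each word to a sparse {concept: TF-IDF weight} vector."""
    n = len(concepts)
    tokenized = {name: text.split() for name, text in concepts.items()}
    # Document frequency: number of concept articles containing each word.
    df = Counter(w for words in tokenized.values() for w in set(words))
    vectors = {}
    for name, words in tokenized.items():
        for word, count in Counter(words).items():
            weight = (count / len(words)) * math.log(n / df[word])
            if weight > 0:
                vectors.setdefault(word, {})[name] = weight
    return vectors


def text_vector(text, vectors):
    """Centroid of the concept vectors of the known words in the text."""
    words = [w for w in text.lower().split() if w in vectors]
    centroid = {}
    for w in words:
        for concept, weight in vectors[w].items():
            centroid[concept] = centroid.get(concept, 0.0) + weight / len(words)
    return centroid


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relatedness(a, b, vectors):
    """Semantic relatedness of two texts as cosine of their centroids."""
    return cosine(text_vector(a, vectors), text_vector(b, vectors))


if __name__ == "__main__":
    vecs = build_word_vectors(CONCEPTS)
    print(relatedness("cat", "dog", vecs))  # share the "Dog" concept
    print(relatedness("cat", "car", vecs))  # share no concept
```

In this toy setting, "cat" and "dog" come out as related because "cat" also occurs in the "Dog" article, while "cat" and "car" share no concept. A real ESA index is built over the full set of Wikipedia articles, with preprocessing that this sketch omits.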

Project repo: https://github.com/dscarvalho/easyesa

Publication: Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry, EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis, In Proceedings of the 13th International Semantic Web Conference (ISWC), Riva del Garda, 2014. (Demonstration Paper in Proceedings).


Graphia

Graphia is a framework which extracts structured data graphs from factual unstructured texts. Instead of extracting simple relations, or committing to a specific conceptual model, Graphia aims at the extraction of graphs which can represent the complexity of contexts present in texts.

The graph representation adopted by the framework (SDGs – Structured Discourse Graphs) can be naturally serialized as an entity-centric RDF graph, which facilitates the integration and the use of the graph with other resources and applications. Additionally, the graph representation supports a pay-as-you-go / semantic best-effort extraction, where a comprehensive extraction is prioritized over accuracy and where the quality of the extracted graph evolves over time.

Features included in the framework:

  1. Structured Discourse Graph extraction and visualization.
  2. Named entity resolution to DBpedia entities.
  3. Co-reference resolution.
  4. Serialization as RDF.

Project page: graphia.dcc.ufrj.br
Online demo: graphia.dcc.ufrj.br/OnlineDemo

Related publications: