Danilo S. Carvalho, Ph.D. Applied Research - Artificial Intelligence / NLP

Projects

Term Definition Vectors (TDV)

TDV is a high-dimensional, sparse vector representation of lexical items (terms), ranging from morphemes to phrases, based on the definitions of their meanings as presented in Wiktionary. It contrasts with distributional representation methods, such as word2vec and GloVe, which derive term meanings from their usage patterns (context windows). TDV performs better at semantic similarity computation, whereas distributional methods perform better at semantic relatedness. It provides useful features for Natural Language Processing, such as sense polarity between terms, multi-language representation, and the ability to disambiguate senses using Part of Speech (POS) information.

Project repo: https://github.com/dscarvalho/tdv

Related publications:


EasyESA

EasyESA is an implementation of Explicit Semantic Analysis (ESA) based on the Wikiprep-ESA code from Çağatay Çallı (https://github.com/faraday/wikiprep-esa). It runs as a JSON web service that can be queried for the semantic relatedness measure, concept vectors, or context windows.

Explicit Semantic Analysis (ESA) is a technique for text representation that uses Wikipedia as a commonsense knowledge base, relying on the co-occurrence of words in article text. Each Wikipedia article is treated as a concept, and the words in an article are associated with that concept using TF-IDF scoring. A word can then be represented as a vector of its associations to each concept, and the “semantic relatedness” between any two words can be measured as the cosine similarity of their vectors. A document containing a string of words is represented as the centroid of the vectors representing its words.
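The idea can be illustrated with a minimal sketch in plain Python. This is not EasyESA code: the three toy "articles" in `CONCEPTS` and the helper names `build_index`, `cosine`, and `centroid` are assumptions made for illustration, and the TF-IDF weighting is one simple variant among several.

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia articles (concept title -> article text).
# Hypothetical data, for illustration only.
CONCEPTS = {
    "Cat": "cat feline pet animal whiskers",
    "Dog": "dog canine pet animal bark",
    "Car": "car vehicle engine wheel road",
}


def build_index(concepts):
    """Map each word to a sparse vector {concept: TF-IDF weight}."""
    n_concepts = len(concepts)
    term_counts = {name: Counter(text.split()) for name, text in concepts.items()}
    # Number of concepts (articles) in which each word occurs.
    doc_freq = Counter(word for counts in term_counts.values() for word in counts)
    index = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        for word, count in counts.items():
            idf = math.log(n_concepts / doc_freq[word])
            index.setdefault(word, {})[name] = (count / total) * idf
    return index


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(words, index):
    """Represent a document as the mean of its known words' concept vectors."""
    vectors = [index[w] for w in words if w in index]
    if not vectors:
        return {}
    summed = {}
    for vector in vectors:
        for concept, weight in vector.items():
            summed[concept] = summed.get(concept, 0.0) + weight
    return {concept: weight / len(vectors) for concept, weight in summed.items()}


index = build_index(CONCEPTS)
print(cosine(index["cat"], index["pet"]))      # words sharing a concept
print(cosine(index["cat"], index["engine"]))   # words sharing no concept
print(cosine(centroid(["cat", "dog"], index), index["pet"]))  # document vs. word
```

With real Wikipedia data the concept space has millions of dimensions, so the vectors are kept sparse, as in the dictionaries used here.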

Project repo: https://github.com/dscarvalho/easyesa

Publication: Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry, EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis, In Proceedings of the 13th International Semantic Web Conference (ISWC), Riva del Garda, 2014. (Demonstration Paper in Proceedings).


Graphia

Graphia is a framework which extracts structured data graphs from factual unstructured texts. Instead of extracting simple relations, or committing to a specific conceptual model, Graphia aims at the extraction of graphs which can represent the complexity of contexts present in texts.

The graph representation adopted by the framework (SDGs – Structured Discourse Graphs) can be naturally serialized as an entity-centric RDF graph, which facilitates the integration and the use of the graph with other resources and applications. Additionally, the graph representation supports a pay-as-you-go / semantic best-effort extraction, where a comprehensive extraction is prioritized over accuracy and where the quality of the extracted graph evolves over time.
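As a rough illustration of what an entity-centric RDF serialization looks like, the sketch below groups statements by subject entity and writes them out as N-Triples. It does not reproduce Graphia's actual SDG vocabulary or output: the `http://example.org/` namespace, the property names, the sample entities, and the helpers `iri`, `literal`, and `to_ntriples` are all hypothetical.

```python
# Hypothetical namespace; Graphia's real vocabulary is not shown here.
EX = "http://example.org/"


def iri(local_name):
    """Wrap a local name as an N-Triples IRI in the example namespace."""
    return f"<{EX}{local_name}>"


def literal(text):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(entities):
    """Serialize {subject: [(predicate, object, object_is_entity), ...]}."""
    lines = []
    for subject, statements in entities.items():
        for predicate, obj, object_is_entity in statements:
            obj_term = iri(obj) if object_is_entity else literal(obj)
            lines.append(f"{iri(subject)} {iri(predicate)} {obj_term} .")
    return "\n".join(lines)


# Each entity carries its own statements, so the graph can grow
# incrementally as more of the text is extracted.
sample = {
    "Graphia": [
        ("type", "Framework", True),
        ("extracts", "StructuredDiscourseGraph", True),
        ("label", 'The "Graphia" framework', False),
    ],
}

print(to_ntriples(sample))
```

Keeping the statements keyed by entity is what makes it straightforward to merge the extracted graph with other RDF resources that describe the same entities.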

Features included in the framework:

  1. Structured Discourse Graph extraction and visualization.
  2. Named entity resolution to DBpedia entities.
  3. Co-reference resolution.
  4. Serialization as RDF.

Project page: graphia.dcc.ufrj.br

Online demo: graphia.dcc.ufrj.br/OnlineDemo

Related publications: