Papers by Alexsandro Fonseca

Computational and Corpus-Based Phraseology, 2017
A collocation is a type of multiword expression formed by two parts: a base and a collocate. Usually, in a collocation, the base has a denotative or literal meaning, while the collocate has a connotative meaning. Examples of collocations: pay attention, easy as pie, strongly condemn, lend support, etc. The Meaning-Text Theory created lexical functions to, among other objectives, represent the meaning that links the base and the collocate, or to represent the relation between the base and a support verb. For example, the lexical function Magn represents intensification, while the lexical function Caus, applied to a base, returns the support verb expressing the causation of the action in the collocation. In dependency parsing, each word (the dependent) is directly associated with its governor in a phrase. In this paper, we show how we combine dependency parsing, used to extract collocation candidates, with a lexical network based on lexical functions, used to identify the true collocations among the candidates. The candidates are extracted from a French corpus according to 14 dependency relations. The identified collocations are classified according to the semantic group of the lexical functions modeling them. We obtained an overall precision (across all dependency types) of 76.3%, with a precision higher than 95% for collocations having certain dependency relations. We also found that about 86% of the identified collocations belong to only four semantic categories: qualification, support verb, location and action/event.
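As a rough illustration of this pipeline, the sketch below extracts candidate pairs from dependency relations and filters them against a toy lexical-function table. spaCy's French model stands in for the paper's parser, and the relation subset and tiny network are illustrative assumptions, not the authors' implementation:

```python
# Sketch: extract collocation candidates from dependency relations,
# then keep only pairs attested in a lexical-function table.
# Assumes: pip install spacy && python -m spacy download fr_core_news_sm
import spacy

nlp = spacy.load("fr_core_news_sm")

# Subset of dependency relations used as candidate filters (illustrative;
# the paper uses 14 relations from its own parser's inventory).
CANDIDATE_DEPS = {"amod", "obj", "nmod", "advmod"}

# Toy stand-in for the lexical network: (base, collocate) -> lexical function.
LEXICAL_NETWORK = {
    ("attention", "payer"): "Oper1",   # support verb
    ("critique", "virulent"): "Magn",  # intensification
}

def extract_candidates(text):
    """Yield (governor_lemma, dependent_lemma, relation) triples."""
    for token in nlp(text):
        if token.dep_ in CANDIDATE_DEPS:
            yield token.head.lemma_, token.lemma_, token.dep_

def identify_collocations(text):
    """Keep candidates attested in the lexical network, with their LF."""
    for gov, dep, rel in extract_candidates(text):
        lf = LEXICAL_NETWORK.get((gov, dep))
        if lf is not None:
            yield gov, dep, rel, lf

for hit in identify_collocations("Il a publié une critique virulente."):
    print(hit)
```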
A framework to visualize equivalences between computational models of regular languages
Information Processing Letters, 2002
We discuss how to increase and simplify the understanding of the equivalence relations between machine models and/or language representations of formal languages by means of the animation tool SAGEMoLiC. Our new educational tool permits the simulation of the execution of models of computation, as many other animation systems do, but its philosophy goes further than that of the usual systems, since it allows for a true visualization of the key notions involved in the formal proofs of these equivalences. In contrast with previous systems, our approach to visualizing equivalence theorems is not a simple "step by step" animation of specific conversion algorithms between computational models and/or grammatical representations of formal languages, because we emphasize the key theoretical notions involved in the formal proofs of these equivalences.
Given a particular lexicon, what would be the best strategy to learn all of its lexemes? By using elementary graph theory, we propose a simple formal model that answers this question. We also study several learning strategies by comparing their efficiency on eight digital English dictionaries. It turns out that a simple strategy based purely on the degree of the vertices associated with the lexemes can significantly improve the learning process with respect to other psycholinguistic strategies.
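A minimal sketch of the degree-based strategy, assuming the lexicon is modeled as a graph linking each lexeme to the words in its definition (toy data; networkx stands in for the paper's formal model):

```python
# Sketch: order lexemes for learning by vertex degree in the
# definition graph (highest-degree words first).
import networkx as nx

# Toy dictionary: lexeme -> words appearing in its definition.
definitions = {
    "cat":    ["animal", "small", "fur"],
    "dog":    ["animal", "fur"],
    "animal": ["living", "thing"],
    "fur":    ["hair", "animal"],
}

G = nx.Graph()
for lexeme, words in definitions.items():
    for w in words:
        G.add_edge(lexeme, w)

# Degree-based learning order: most-connected lexemes first.
order = sorted(G.nodes, key=G.degree, reverse=True)
print(order)
```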
This paper presents a comparative analysis based on different classification algorithms and tools for the identification of Portuguese multiword expressions. Our focus is on two-word expressions formed by nouns, adjectives and verbs. The candidates are selected first on the basis of bigram frequency, then on the basis of the grammatical class of each bigram's constituent words. The analysis compares the performance of three different multi-layer perceptron training functions in the task of extracting different patterns of multiword expressions, using and comparing nine different classification algorithms, including decision trees, multi-layer perceptron and SVM. Moreover, it compares two different tools, Text-NSP and Termostat, for the identification of multiword expressions using different association measures.
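A hedged sketch of this kind of comparison using scikit-learn stand-ins; the features, toy data and solver names below are assumptions, not the paper's setup:

```python
# Sketch: compare classifiers on bigram candidates described by
# frequency and POS-pattern features (toy, illustrative data).
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Each row: [bigram frequency, POS pattern id]; label: 1 = MWE, 0 = not.
X = [[120, 0], [3, 1], [85, 0], [2, 2], [40, 1], [1, 2], [60, 0], [5, 1]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

models = {
    # Three MLP training functions, mirroring the kind of comparison in
    # the paper (these solver names are scikit-learn's, an assumption).
    "mlp-lbfgs": MLPClassifier(solver="lbfgs", max_iter=2000),
    "mlp-sgd": MLPClassifier(solver="sgd", max_iter=2000),
    "mlp-adam": MLPClassifier(solver="adam", max_iter=2000),
    "svm": SVC(),
    "tree": DecisionTreeClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=2)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```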

In social network services like Twitter, users are overwhelmed with a huge amount of social data, most of which is short, unstructured and highly noisy. Identifying accurate information in this flood of data is a hard task, and classifying tweets into an organized form helps users access the information they need. Our first contribution relates to part-of-speech filtering and preprocessing of this kind of highly noisy, short data. Our second contribution concerns named entity recognition (NER) in tweets, which requires adapting existing natural language tools to the noisy and nonstandard language of tweets. Our third contribution involves the segmentation of hashtags and a semantic enrichment using a combination of relations from WordNet, which improves the performance of our classification system, including the disambiguation of named entities, abbreviations and acronyms. Graph theory is used to cluster the words extracted from WordNet ...
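As an illustration of the hashtag-segmentation step, here is a minimal dynamic-programming word-break sketch over a toy vocabulary (the paper's own method and lexicon may differ):

```python
# Sketch: segment a hashtag into known words by dynamic programming.
VOCAB = {"big", "data", "machine", "learning", "is", "fun"}

def segment_hashtag(tag):
    """Return one segmentation of tag into VOCAB words, or None."""
    tag = tag.lstrip("#").lower()
    # best[i] holds a segmentation of tag[:i], or None if unreachable.
    best = [None] * (len(tag) + 1)
    best[0] = []
    for i in range(1, len(tag) + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in VOCAB:
                best[i] = best[j] + [tag[j:i]]
                break
    return best[len(tag)]

print(segment_hashtag("#machinelearningisfun"))
# -> ['machine', 'learning', 'is', 'fun']
```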

Lexical functions are a formalism that describes the combinatorial, syntactic and semantic relations among individual lexical units in different languages. Those relations include both paradigmatic relations, i.e. vertical or "in absence" relations, such as synonymy, antonymy and meronymy, and syntagmatic relations, i.e. horizontal or "in presence" relations, such as intensification (deeply committed), confirmative (valid argument) and support verbs (give an order, subject to an interrogation). We present in this paper a new lexical ontology, called Lexical Function Ontology (LFO), as a model to represent lexical functions. The aim is for our ontology to be combined with other lexical ontologies, such as the Lexical Model for Ontologies (lemon) and the Lexical Markup Framework (LMF), and to be used for the transformation of lexical networks into semantic web formats, enriched with the semantic information given by the lexical functions, such as the representation of syntagmatic relations (e.g. collocations) usually absent from lexical networks.
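A hedged sketch of what encoding one such syntagmatic relation as RDF triples could look like; the namespace and property names below are invented for illustration and are not the published LFO vocabulary:

```python
# Sketch: one syntagmatic relation (Magn) between two lexical units
# as RDF triples. The namespace and property names are invented for
# illustration; they are not the published LFO vocabulary.
from rdflib import Graph, Literal, Namespace, RDF

LFO = Namespace("http://example.org/lfo#")
g = Graph()

rel = LFO["rel-magn-criticism"]
g.add((rel, RDF.type, LFO.LexicalFunctionRelation))
g.add((rel, LFO.lexicalFunction, LFO.Magn))
g.add((rel, LFO.keyword, Literal("criticism")))
g.add((rel, LFO.value, Literal("scathing")))

print(g.serialize(format="turtle"))
```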
In this paper, we present a Java API to retrieve lexical information from the French Lexical Network, a lexical resource based on the Meaning-Text Theory's lexical functions, which was previously transformed into an RDF/OWL format. We present four API functions: one that returns all the lexical relations between two given vocables; one that returns all the lexical relations and the lexical functions modeling those relations for two given vocables; one that returns all the lexical relations encoded in the lexical network that are modeled by a specific lexical function; and one that returns the semantic perspectives for a specific lexical function. This API was used in the identification of collocations in a French corpus of 1.8 million sentences and in the semantic classification of those collocations.
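The API itself is in Java; below is a hypothetical Python analogue of the first function (all lexical relations between two given vocables), with every URI and property name invented for illustration rather than taken from the real RDF/OWL schema:

```python
# Hypothetical analogue of the API's first lookup. All property URIs
# are invented; the real French Lexical Network schema is not
# reproduced here.
from rdflib import Graph, Literal

QUERY = """
PREFIX lfn: <http://example.org/lfn#>
SELECT ?relation WHERE {
    ?relation lfn:keyword ?k ;
              lfn:value   ?v .
    ?k lfn:lemma ?a .
    ?v lfn:lemma ?b .
}
"""

def relations_between(graph: Graph, vocable_a: str, vocable_b: str):
    """Return every relation node linking the two given vocables."""
    bindings = {"a": Literal(vocable_a), "b": Literal(vocable_b)}
    return [row.relation for row in graph.query(QUERY, initBindings=bindings)]
```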

Representation of collocations in a lexical network using lexical functions and semantic web formalisms
Collocations pose problems for several applications in natural language processing, such as machine translation, information retrieval and automatic text generation. In particular, the representation of collocations remains an open and little-explored problem. The lexical function formalism was developed to represent the many possible types of relations between words, such as synonymy (voiture - automobile), antonymy (grand - petit) and hyponymy (chat - mammifère), which are paradigmatic relations, and relations such as intensification (critique virulente) and realization (purger une peine), which are syntagmatic relations. The relation between the constituents of a collocation is syntagmatic and can be modeled by lexical functions of the same type. In a relation modeled by a lexical function, three parts can be identified: the function (such as synon...

A lexical function represents a type of relation that exists between lexical units (words or expressions) in any language. For example, antonymy is a type of relation represented by the lexical function Anti: Anti(big) = small. Those relations include both paradigmatic relations, i.e. vertical relations, such as synonymy, antonymy and meronymy, and syntagmatic relations, i.e. horizontal relations, such as objective qualification (legitimate demand), subjective qualification (fruitful analysis), positive evaluation (good review) and support verbs (pay a visit, subject to an interrogation). In this paper, we present the Lexical Functions Ontology Model (lexfom) to represent lexical functions and the relations among lexical units. Lexfom is divided into four modules: lexical function representation (lfrep), lexical function family (lffam), lexical function semantic perspective (lfsem) and lexical function relations (lfrel). Moreover, we show how it combines with the Lexical Model for Ontologies (lemon) for the transformation of lexical networks into semantic web formats. So far, we have implemented 100 simple and 500 complex lexical functions, and encoded about 8,000 syntagmatic and 46,000 paradigmatic relations for the French language.

Semi-supervised learning and social media text analysis towards multi-labeling categorization
2017 IEEE International Conference on Big Data (Big Data)
In traditional text classification, classes are mutually exclusive, i.e. it is not possible to have one text or text fragment classified into more than one class. In multi-label classification, on the other hand, an individual text may belong to several classes simultaneously. This type of classification is required by a large number of current applications, such as big data classification and image and video annotation. Supervised learning is the most common type of machine learning for classification tasks. It requires large quantities of labeled data and the intervention of a human tagger to create the training sets. When the data sets become very large or heavily noisy, this operation can be tedious, error-prone and time-consuming. In this case, semi-supervised learning, which requires only a few labels, is a better choice. In this paper, we study and evaluate several methods to address the problem of multi-label classification using semi-supervised learning and data from social networks. First, we propose a linguistic preprocessing involving tokenisation, recognition of named entities and hashtag segmentation in order to decrease the noise in this type of massive, unstructured real data, and we then perform word sense disambiguation using WordNet. Second, several experiments related to multi-label classification and semi-supervised learning are carried out on these data sets and compared to each other. The paper proposes a method that combines semi-supervised methods with a graph method for the extraction of subjects in social networks using a multi-label classification approach. Experiments show that the proposed model increases classification precision by 4 percentage points compared to a baseline.
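A minimal self-training sketch for the semi-supervised multi-label step, using scikit-learn's one-vs-rest wrapper; the toy data, confidence threshold and loop are illustrative assumptions, not the paper's pipeline:

```python
# Sketch: self-training for multi-label classification. Train on the
# labeled pool, pseudo-label confident unlabeled examples, retrain.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_labeled = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
Y_labeled = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])  # labels not exclusive
X_unlabeled = np.array([[0.85, 0.15], [0.15, 0.85]])

clf = OneVsRestClassifier(LogisticRegression())
for _ in range(3):  # a few self-training rounds
    clf.fit(X_labeled, Y_labeled)
    if len(X_unlabeled) == 0:
        break
    proba = clf.predict_proba(X_unlabeled)
    # Pseudo-label samples whose most extreme label probability is confident.
    confident = np.max(np.abs(proba - 0.5), axis=1) > 0.4
    if not confident.any():
        break
    X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
    Y_labeled = np.vstack([Y_labeled, (proba[confident] > 0.5).astype(int)])
    X_unlabeled = X_unlabeled[~confident]

print(clf.predict(np.array([[0.7, 0.3]])))
```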
Efficient natural language pre-processing for analyzing large data sets
2016 IEEE International Conference on Big Data (Big Data), 2016
Animation of Relations between Computational Models and Their Language Representations
We discuss how to increase and simplify the understanding of the equivalence relations between machine models and language representations of formal languages by means of the animation tool SAGEMoLiC, currently in development. The philosophy of our new educational tool goes further than that of the usual ones: it permits the simulation of the execution of models of computation, as many other animation systems do, as well as the animation of the proofs of the corresponding equivalences. In contrast with previous systems, our philosophy for animating equivalence theorems is not a simple visualization of the direct translation between computational models and their grammatical representations, because we emphasize the key theoretical notions involved in the formal proofs of these translations.
Keywords: automata theory, formal languages, visualization, algorithm animation
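As an example of the kind of equivalence such a tool animates, here is a compact subset-construction sketch (NFA to DFA, one direction of the regular-language equivalences; the tool itself visualizes the proof notions rather than just this algorithm):

```python
# Sketch: subset construction, converting an NFA to an equivalent DFA.
# NFA given as: transitions[(state, symbol)] -> set of successor states.
from itertools import chain

def nfa_to_dfa(start, accepts, transitions, alphabet):
    """Return DFA (start, accepts, transitions) over frozensets of NFA states."""
    dfa_start = frozenset({start})
    dfa_trans, todo, seen = {}, [dfa_start], {dfa_start}
    while todo:
        state = todo.pop()
        for sym in alphabet:
            target = frozenset(chain.from_iterable(
                transitions.get((q, sym), ()) for q in state))
            dfa_trans[(state, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_accepts = {s for s in seen if s & accepts}
    return dfa_start, dfa_accepts, dfa_trans

# NFA for strings over {a, b} ending in "ab".
start, accepts, trans = nfa_to_dfa(
    0, {2},
    {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}},
    ["a", "b"])
print(len({s for s, _ in trans}), "DFA states")
```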
Conference Presentations by Alexsandro Fonseca
In this paper, we present a Java API to retrieve lexical information from the French Lexical Network, a lexical resource based on the Meaning-Text Theory's lexical functions, which was previously transformed into an RDF/OWL format. We present four API functions: one that returns all the lexical relations between two given vocables; one that returns all the lexical relations and the lexical functions modeling those relations for two given vocables; one that returns all the lexical relations encoded in the lexical network that are modeled by a specific lexical function; and one that returns the semantic perspectives for a specific lexical function. This API was used in the identification of collocations in a French corpus of 1.8 million sentences and in the semantic classification of those collocations.
Given a particular lexicon, what would be the best strategy to learn all of its lexemes? By using elementary graph theory, we propose a simple formal model that answers this question. We also study several learning strategies by comparing their efficiency on eight digital English dictionaries. It turns out that a simple strategy based purely on the degree of the vertices associated with the lexemes can significantly improve the learning process with respect to other psycholinguistic strategies.
This paper presents a comparative analysis based on different classification algorithms and tools for the identification of Portuguese multiword expressions. Our focus is on two-word expressions formed by nouns, adjectives and verbs. The candidates are selected first on the basis of bigram frequency, then on the basis of the grammatical class of each bigram's constituent words. The analysis compares the performance of three different multi-layer perceptron training functions in the task of extracting different patterns of multiword expressions, using and comparing nine different classification algorithms, including decision trees, multi-layer perceptron and SVM. Moreover, it compares two different tools, Text-NSP and Termostat, for the identification of multiword expressions using different association measures.

This paper presents a comparative study of different methods for the identification of multiword expressions, applied to a Brazilian Portuguese corpus. First, we selected the candidates based on bigram frequency. Second, we used linguistic information based on the grammatical classes of the words forming the bigrams, together with the frequency information, in order to compare the performance of different classification algorithms. The focus of this study is on classification techniques such as support vector machines (SVM), multi-layer perceptron, naïve Bayesian networks, decision trees and random forests. Third, we evaluated three different multi-layer perceptron training functions in the task of classifying different patterns of multiword expressions. Finally, our study compared two different tools, MWEtoolkit and Text-NSP, for the extraction of multiword expression candidates using different association measures.
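For the association-measure step, here is a small sketch computing pointwise mutual information over bigram counts, one of the measures tools like Text-NSP implement (toy data; the paper's exact measure set is not reproduced):

```python
# Sketch: rank bigram candidates by pointwise mutual information (PMI).
import math
from collections import Counter

tokens = "o menino viu o cachorro e o cachorro viu o menino".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    p_xy = bigrams[(w1, w2)] / (N - 1)
    p_x, p_y = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p_xy / (p_x * p_y))

ranked = sorted(bigrams, key=lambda b: pmi(*b), reverse=True)
for w1, w2 in ranked[:3]:
    print(w1, w2, round(pmi(w1, w2), 2))
```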

Lexical functions are a formalism that describes the combinatorial, syntactic and semantic relations among individual lexical units in different languages. Those relations include both paradigmatic relations, i.e. vertical or "in absence" relations, such as synonymy, antonymy and meronymy, and syntagmatic relations, i.e. horizontal or "in presence" relations, such as intensification (deeply committed), confirmative (valid argument) and support verbs (give an order, subject to an interrogation). We present in this paper a new lexical ontology, called Lexical Function Ontology (LFO), as a model to represent lexical functions. The aim is for our ontology to be combined with other lexical ontologies, such as the Lexical Model for Ontologies (lemon) and the Lexical Markup Framework (LMF), and to be used for the transformation of lexical networks into semantic web formats, enriched with the semantic information given by the lexical functions, such as the representation of syntagmatic relations (e.g. collocations) usually absent from lexical networks.

A lexical function represents a type of relation that exists between lexical units (words or expressions) in any language. For example, antonymy is a type of relation represented by the lexical function Anti: Anti(big) = small. Those relations include both paradigmatic relations, i.e. vertical relations, such as synonymy, antonymy and meronymy, and syntagmatic relations, i.e. horizontal relations, such as objective qualification (legitimate demand), subjective qualification (fruitful analysis), positive evaluation (good review) and support verbs (pay a visit, subject to an interrogation). In this paper, we present the Lexical Functions Ontology Model (lexfom) to represent lexical functions and the relations among lexical units. Lexfom is divided into four modules: lexical function representation (lfrep), lexical function family (lffam), lexical function semantic perspective (lfsem) and lexical function relations (lfrel). Moreover, we show how it combines with the Lexical Model for Ontologies (lemon) for the transformation of lexical networks into semantic web formats. So far, we have implemented 100 simple and 500 complex lexical functions, and encoded about 8,000 syntagmatic and 46,000 paradigmatic relations for the French language.