Papers by Florentina Hristea

As its authors note (Miller et al., 1990), WordNet (WN) is a lexical knowledge base, first developed for English and then adopted for several Western European languages, which was created as a machine-readable dictionary based on psycholinguistic principles. The present study is an attempt to discuss the semiautomatic generation of WNs for languages other than English, a topic of great interest since the existence of such WNs will create the appropriate infrastructure for advanced Information Technology systems. Extending the algorithmic approach introduced in Nikolov and Petrova (2001), we propose a semiautomatic method based on heuristics for the generation of WN type synsets and clusters. The focus is on noun and adjective synsets, since nouns and adjectives have completely different organizations in WN, but verb and adverb synset generation is also addressed. The target language for performing tests will be Romanian. Our approach to WN generation relies on so-called “class methods”...

Computational Intelligence, 2020
As mentioned in our Introduction, one of the first thorough analyses of ambiguity in IR is that performed in Reference 24. Of the several comments made by Krovetz and Croft 24 when concluding their study, we especially retain the one stating that "disambiguation has the potential for improving precision for low recall searches via the use of a sense-disambiguated thesaurus," one that will be indirectly validated in more modern times as well. These authors equally comment that "the senses of the words is only one factor affecting relevance. The relationships that these words have to one another is also important." When mentioning these relationships they refer 24 to syntactic relations among words, by invoking usage of a natural language parser. At this early stage, semantic relations between the concepts that these words lexicalize are not yet taken into account, as will be done soon after, in Reference 25. The automatic indexing process that is developed in Reference 25 uses the "is-a" relations from WordNet (WN) and constructs vectors of senses to represent documents and queries. As commented in Reference 2, the results show that the classical stem-based approach was superior overall, while the sense-based approach improved the results for some of the queries only. In addition, the results in another work by the same author 28 showed that WSD can even be harmful, confirming Sanderson's 16 conclusion that even low error rates for WSD can hurt IR performance. Voorhees 28 refers to "the pitfalls of linguistic processing," one of the strongest general conclusions that the author comes to being the fact that "linguistic techniques must be essentially perfect to help." Both References 29 and 30 make use of semantic indexing based on WN synsets. Mihalcea and Moldovan 30 come to the conclusion that semantic indexing, a new trend at the time, "offers an improvement over current IR techniques." Although both References 29 and 30 report positive results, one should notice that the corresponding tests were conducted on small datasets.
‖ As opposed to the approach described in Reference 16 that makes use of pseudowords. This approach 16 deals with artificial queries that, additionally, represent "broad topic codes, rather than precise statements of information need." 41
** In the experiments performed by Schütze and Pedersen, 41 the context vectors are computed by considering all content words in a symmetric 41-word context window centered on the target.
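
The symmetric context window mentioned in the last note can be illustrated with a short sketch. It only shows how the content words around a target occurrence might be collected; it does not reproduce how Schütze and Pedersen build their vectors. The function name `context_counts`, the `half_width` parameter, and the stopword filter standing in for "content words" are assumptions made for illustration. With `half_width=20`, the window spans 20 + 1 + 20 = 41 words.

```python
from collections import Counter
from typing import FrozenSet, List


def context_counts(
    tokens: List[str],
    target_index: int,
    half_width: int = 20,
    stopwords: FrozenSet[str] = frozenset(),
) -> Counter:
    """Count the words in a symmetric window centered on tokens[target_index].

    Assumption: "content words" are approximated by dropping a stopword set.
    The target word itself is excluded from the counts.
    """
    start = max(0, target_index - half_width)
    end = min(len(tokens), target_index + half_width + 1)
    window = tokens[start:target_index] + tokens[target_index + 1:end]
    return Counter(word for word in window if word not in stopwords)


# Usage: a tiny window (half_width=2) around the ambiguous word "bank".
tokens = "the bank raised interest rates on the loan".split()
counts = context_counts(tokens, 1, half_width=2, stopwords=frozenset({"the", "on"}))
```

Near the beginning or end of a document the window is simply truncated, which is one possible design choice among several.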

Information Processing & Management, 2015
Word sense ambiguity has been identified as a cause of poor precision in information retrieval (IR) systems. Word sense disambiguation and discrimination methods have been defined to help systems choose which documents should be retrieved in relation to an ambiguous query. However, the only approaches that show a genuine benefit for word sense discrimination or disambiguation in IR are generally supervised ones. In this paper we propose a new unsupervised method that uses word sense discrimination in IR. The method we develop is based on spectral clustering and reorders an initially retrieved document list by boosting documents that are semantically similar to the target query. For several TREC ad hoc collections we show that our method is useful in the case of queries which contain ambiguous terms. We are interested in improving the level of precision after 5, 10 and 30 retrieved documents (P@5, P@10, P@30) respectively. We show that precision can be improved by 8% above current state-of-the-art baselines. We also focus on poorly performing queries.
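
The P@5, P@10 and P@30 measures named in this abstract are instances of precision at cutoff k. A minimal sketch of that standard measure follows; the function name `precision_at_k` and the toy document identifiers are illustrative and are not taken from the paper.

```python
from typing import Sequence, Set


def precision_at_k(ranked_doc_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant.

    The denominator is k even when fewer than k documents were retrieved,
    which is the usual convention for P@k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Usage with made-up identifiers: two of the top five documents are relevant.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
p_at_5 = precision_at_k(ranking, relevant, 5)
```

A re-ranking method such as the one described above changes the order of `ranking`, so it can raise P@k without retrieving any new documents.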
Mathematics
Natural language processing (NLP) is one of the most important technologies in use today, especially due to the large and growing amount of online text, which needs to be understood in order to fully ascertain its enormous value [...]

Mathematics, 2022
Hate Speech is a frequent problem occurring among Internet users. Recent regulations are being discussed by U.K. representatives (“Online Safety Bill”) and by the European Commission, which plans on introducing Hate Speech as an “EU crime”. Recently passed legislation to combat this kind of speech places the burden of identification on the hosting websites, often within a tight time frame (24 h in France and Germany). These constraints make automatic Hate Speech detection a very important topic for major social media platforms. However, recent literature on Hate Speech detection lacks a benchmarking system that can evaluate how different approaches compare against each other regarding the prediction made concerning different types of text (short snippets such as those present on Twitter, as well as lengthier fragments). This paper intends to deal with this issue and to take a step forward towards the standardization of testing for this type of natural language ...

In this paper, we present a novel unsupervised algorithm for word sense disambiguation (WSD) at the document level. Our algorithm is inspired by a widely-used approach in the field of genetics for whole genome sequencing, known as the Shotgun sequencing technique. The proposed WSD algorithm is based on three main steps. First, a brute-force WSD algorithm is applied to short context windows (up to 10 words) selected from the document in order to generate a short list of likely sense configurations for each window. In the second step, these local sense configurations are assembled into longer composite configurations based on suffix and prefix matching. The resulting configurations are ranked by their length, and the sense of each word is chosen based on a voting scheme that considers only the top k configurations in which the word appears. We compare our algorithm with other state-of-the-art unsupervised WSD algorithms and demonstrate better performance, sometimes by a very large marg...
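
The assembly and voting steps described in this abstract can be pictured with a small sketch. It is not the authors' implementation: the `Config` representation, the `min_overlap` parameter, the function names, and the sense labels such as `bank#1` are all hypothetical, and the abstract does not specify how configurations are scored or how ties are broken.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Assumed representation: (start position in the document, one sense label per word).
Config = Tuple[int, Tuple[str, ...]]


def merge_configs(a: Config, b: Config, min_overlap: int = 1) -> Optional[Config]:
    """Join a and b when a suffix of a matches a prefix of b on shared positions."""
    (start_a, senses_a), (start_b, senses_b) = a, b
    offset = start_b - start_a
    overlap = len(senses_a) - offset
    # b must start after a, overlap it enough, and extend beyond its end.
    if offset <= 0 or overlap < min_overlap or overlap >= len(senses_b):
        return None
    if senses_a[offset:] != senses_b[:overlap]:
        return None
    return (start_a, senses_a + senses_b[overlap:])


def vote_sense(configs: List[Config], position: int, k: int = 3) -> Optional[str]:
    """Majority vote over the k longest configurations covering a word position."""
    covering = [c for c in configs if c[0] <= position < c[0] + len(c[1])]
    covering.sort(key=lambda c: len(c[1]), reverse=True)
    votes = Counter(c[1][position - c[0]] for c in covering[:k])
    return votes.most_common(1)[0][0] if votes else None


# Usage with made-up sense labels for a four-word document.
a = (0, ("bank#1", "rate#2", "loan#1"))
b = (1, ("rate#2", "loan#1", "fee#1"))
c = (2, ("loan#2", "fee#1"))
merged = merge_configs(a, b)      # a's suffix agrees with b's prefix
rejected = merge_configs(a, c)    # senses disagree at position 2
chosen = vote_sense([merged, a, b, c], position=2, k=3)
```

Ranking by length means the short, isolated configuration `c` is outvoted by the longer configurations that agree with each other.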

Open Computer Science
Whether or not word sense disambiguation (WSD) can improve information retrieval (IR) results represents a topic that has been intensely debated over the years, with many inconclusive or contradictory conclusions. The most rarely used type of WSD for this task is the unsupervised one, although it has been proven to be beneficial at a large scale. Our study builds on existing research and tries to improve the most recent unsupervised method which is based on spectral clustering. It investigates the possible benefits of “helping” spectral clustering through feature selection when it performs sense discrimination for IR. Results obtained so far, involving large data collections, encourage us to point out the importance of feature selection even in the case of this advanced, state-of-the-art clustering technique that is known for performing its own feature weighting. By suggesting an improvement of what we consider the most promising approach to usage of WSD in IR, and by commenting on ...
ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
On the semiautomatic generation of verb synsets in languages other than English
On single carrier outliers in a linear regression model
On a Dependency-based Semantic Space for Unsupervised Noun Sense Disambiguation with an Underlying Naive Bayes Model
Semiautomatic generation of wordnet type synsets and clusters using class methods. An overview
Revue Roumaine De Linguistique, Jan 8, 2007
On single distribution outlying pairs in a linear regression model
An algorithm for the detection of outliers in the case of the normal distribution
Recent Advances Concerning the Usage of the Naive Bayes Model in Unsupervised Word Sense Disambiguation
International Review on Computers and Software, 2009
On multiple response outliers of additive nature in a linear regression model
N-Gram Features for Unsupervised WSD with an Underlying Naïve Bayes Model
SpringerBriefs in Statistics, 2012
Syntactic Dependency-Based Feature Selection
SpringerBriefs in Statistics, 2012
Semantic WordNet-Based Feature Selection
SpringerBriefs in Statistics, 2012