Florentina Hristea

University of Bucharest, Department of Computer Science, Faculty Member


https://cs.unibuc.ro/~fhristea/


Papers by Florentina Hristea

Statistical Natural Language Processing

Springer eBooks, 2017


Semiautomatic generation of WordNet type synsets and clusters using class methods. An overview

As its authors, Miller et al. (1990), note, WordNet (WN) is a lexical knowledge base, first developed for English and then adopted for several Western European languages, which was created as a machine-readable dictionary based on psycholinguistic principles. The present study discusses the semiautomatic generation of WNs for languages other than English, a topic of great interest since the existence of such WNs will create the appropriate infrastructure for advanced Information Technology systems. Extending the algorithmic approach introduced in Nikolov and Petrova (2001), we propose a semiautomatic method based on heuristics for the generation of WN type synsets and clusters. The focus is on noun and adjective synsets, since nouns and adjectives have completely different organizations in WN, but verb and adverb synset generation is also addressed. The target language for performing tests will be Romanian. Our approach to WN generation relies on so-called “class methods”...


The long road from performing word sense disambiguation to successfully using it in information retrieval: An overview of the unsupervised approach

Computational Intelligence, 2020

As mentioned in our Introduction, one of the first thorough analyses of ambiguity in IR is that performed in Reference 24. Of the several comments made by Krovetz and Croft [24] when concluding their study, we especially retain the one stating that "disambiguation has the potential for improving precision for low recall searches via the use of a sense-disambiguated thesaurus," one that will be indirectly validated in more modern times as well. These authors equally comment that "the senses of the words is only one factor affecting relevance. The relationships that these words have to one another is also important." When mentioning these relationships they refer [24] to syntactic relations among words, by invoking usage of a natural language parser. At this early stage, semantic relations between the concepts that these words lexicalize are not yet taken into account, as will be done soon after, in Reference 25. The automatic indexing process that is developed in Reference 25 uses the "is-a" relations from WordNet (WN) and constructs vectors of senses to represent documents and queries. As commented in Reference 2, the results show that the classical stem-based approach was superior overall, while the sense-based approach improved the results for some of the queries only. In addition, the results in another work by the same author [28] showed that WSD can even be harmful, confirming Sanderson's [16] conclusion that even low error rates for WSD can hurt IR performance. Voorhees [28] refers to "the pitfalls of linguistic processing," one of the strongest general conclusions that the author comes to being the fact that "linguistic techniques must be essentially perfect to help." Both References 29 and 30 make use of semantic indexing based on WN synsets. Mihalcea and Moldovan [30] come to the conclusion that semantic indexing, a new trend at the time, "offers an improvement over current IR techniques." Although both References 29 and 30 report positive results, one should notice that the corresponding tests were conducted on small datasets.

‖ As opposed to the approach described in Reference 16 that makes use of pseudowords. This approach [16] deals with artificial queries that, additionally, represent "broad topic codes, rather than precise statements of information need." [41]

** In the experiments performed by Schütze and Pedersen [41], the context vectors are computed by considering all content words in a symmetric 41-word context window centered on the target.
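The window-based context described in the last note above can be illustrated with a minimal sketch. It is not the construction used by Schütze and Pedersen, which this excerpt does not detail: it only counts the content words that fall inside a symmetric window centered on the target (a first-order co-occurrence count). The function name `context_vector`, the toy `STOPWORDS` set and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "on"}

def context_vector(tokens, target_index, window=41):
    """Count content words in a symmetric window centered on the target.

    A window of 41 words means the target plus 20 words on each side.
    """
    half = window // 2
    lo = max(0, target_index - half)
    hi = min(len(tokens), target_index + half + 1)
    return Counter(
        tok
        for i, tok in enumerate(tokens[lo:hi], start=lo)
        if i != target_index and tok not in STOPWORDS
    )

# Small window so the effect is visible on a short sentence.
tokens = "the bank raised interest rates on the loan".split()
vec = context_vector(tokens, target_index=1, window=5)
```

With `window=5` the target "bank" sees two positions on each side, and the stopword "the" is dropped, so only "raised" and "interest" are counted.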


Word sense discrimination in information retrieval: A spectral clustering-based approach

Information Processing & Management, 2015

Word sense ambiguity has been identified as a cause of poor precision in information retrieval (IR) systems. Word sense disambiguation and discrimination methods have been defined to help systems choose which documents should be retrieved in relation to an ambiguous query. However, the only approaches that show a genuine benefit for word sense discrimination or disambiguation in IR are generally supervised ones. In this paper we propose a new unsupervised method that uses word sense discrimination in IR. The method we develop is based on spectral clustering and reorders an initially retrieved document list by boosting documents that are semantically similar to the target query. For several TREC ad hoc collections we show that our method is useful in the case of queries which contain ambiguous terms. We are interested in improving the level of precision after 5, 10 and 30 retrieved documents (P@5, P@10 and P@30, respectively). We show that precision can be improved by 8% above current state-of-the-art baselines. We also focus on poorly performing queries.
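The P@k measures named in this abstract can be sketched as follows. This is a generic illustration of precision at rank k, not the evaluation code used in the paper; the function name `precision_at_k`, the document identifiers and the relevance judgments are made up for the example.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant (P@k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not by len(top_k)) penalizes lists shorter than k.
    return hits / k

# Hypothetical ranked list and relevance judgments for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d4"}

p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_5 = precision_at_k(ranked, relevant, 5)
```

A re-ranking step that moves relevant documents toward the top of the list raises P@k for small k even when the set of retrieved documents stays the same, which is why the measure suits a method that reorders an initial result list.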


Preface to the Special Issue “Natural Language Processing (NLP) and Machine Learning (ML)—Theory and Applications”

Mathematics

Natural language processing (NLP) is one of the most important technologies in use today, especia...


Towards a Benchmarking System for Comparing Automatic Hate Speech Detection with an Intelligent Baseline Proposal

Mathematics, 2022

Hate Speech is a frequent problem occurring among Internet users. Recent regulations are being discussed by U.K. representatives (“Online Safety Bill”) and by the European Commission, which plans on introducing Hate Speech as an “EU crime”. Recently passed legislation to combat this kind of speech places the burden of identification on the hosting websites, often within a tight time frame (24 h in France and Germany). These constraints make automatic Hate Speech detection a very important topic for major social media platforms. However, recent literature on Hate Speech detection lacks a benchmarking system that can evaluate how different approaches compare against each other regarding the predictions made concerning different types of text (short snippets such as those present on Twitter, as well as lengthier fragments). This paper intends to deal with this issue and to take a step forward towards the standardization of testing for this type of natural language ...


ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing

In this paper, we present a novel unsupervised algorithm for word sense disambiguation (WSD) at the document level. Our algorithm is inspired by a widely-used approach in the field of genetics for whole genome sequencing, known as the Shotgun sequencing technique. The proposed WSD algorithm is based on three main steps. First, a brute-force WSD algorithm is applied to short context windows (up to 10 words) selected from the document in order to generate a short list of likely sense configurations for each window. In the second step, these local sense configurations are assembled into longer composite configurations based on suffix and prefix matching. Finally, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a voting scheme that considers only the top k configurations in which the word appears. We compare our algorithm with other state-of-the-art unsupervised WSD algorithms and demonstrate better performance, sometimes by a very large marg...
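The assembly and voting steps described in this abstract can be sketched roughly as below. This is an illustrative reading of the abstract, not the authors' implementation: the representation of a configuration as a `(start, senses)` pair, the function names `merge_configs` and `vote_sense`, the sense labels and the `min_overlap` parameter are all assumptions, and the brute-force scoring of the first step is omitted.

```python
from collections import Counter

def merge_configs(a, b, min_overlap=1):
    """Merge two sense configurations when a's suffix matches b's prefix.

    Each configuration is (start, senses): the document position of its first
    word and one sense label per word. Returns None when they cannot merge.
    """
    (start_a, senses_a), (start_b, senses_b) = a, b
    offset = start_b - start_a
    overlap = len(senses_a) - offset
    # Require b to start inside a and to extend past the end of a.
    if offset < 0 or overlap < min_overlap or overlap > len(senses_b):
        return None
    # The senses assigned in the shared region must agree.
    if senses_a[offset:] != senses_b[:overlap]:
        return None
    return (start_a, senses_a + senses_b[overlap:])

def vote_sense(configs, position, k=3):
    """Majority vote over the k longest configurations covering a position."""
    covering = [(s, x) for s, x in configs if s <= position < s + len(x)]
    covering.sort(key=lambda c: len(c[1]), reverse=True)
    votes = Counter(x[position - s] for s, x in covering[:k])
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical sense labels for words at document positions 0..3.
left = (0, ["bank#1", "deposit#2", "money#1"])
right = (1, ["deposit#2", "money#1", "loan#1"])
merged = merge_configs(left, right)
conflict = merge_configs(left, (1, ["deposit#1", "money#1", "loan#1"]))
chosen = vote_sense([merged, (2, ["money#2"])], position=2, k=1)
```

Here `left` and `right` agree on the two shared words, so they merge into one four-word configuration, while the conflicting pair is rejected. Ranking by length before voting means the longer merged configuration decides the sense of the word at position 2 when `k=1`.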


Feature selection for spectral clustering: to help or not to help spectral clustering when performing sense discrimination for IR?

Open Computer Science

Whether or not word sense disambiguation (WSD) can improve information retrieval (IR) results represents a topic that has been intensely debated over the years, with many inconclusive or contradictory conclusions. The type of WSD most rarely used for this task is the unsupervised one, although it has been proven to be beneficial at a large scale. Our study builds on existing research and tries to improve the most recent unsupervised method which is based on spectral clustering. It investigates the possible benefits of “helping” spectral clustering through feature selection when it performs sense discrimination for IR. Results obtained so far, involving large data collections, encourage us to point out the importance of feature selection even in the case of this advanced, state-of-the-art clustering technique that is known for performing its own feature weighting. By suggesting an improvement of what we consider the most promising approach to the usage of WSD in IR, and by commenting on ...


ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

On the semiautomatic generation of verb synsets in languages other than English

On single carrier outliers in a linear regression model

On a Dependency-based Semantic Space for Unsupervised Noun Sense Disambiguation with an Underlying Naive Bayes Model

Semiautomatic generation of WordNet type synsets and clusters using class methods. An overview

Revue Roumaine De Linguistique, Jan 8, 2007

On single distribution outlying pairs in a linear regression model

An algorithm for the detection of outliers in the case of the normal distribution

Recent Advances Concerning the Usage of the Naive Bayes Model in Unsupervised Word Sense Disambiguation

International Review on Computers and Software, 2009

On multiple response outliers of additive nature in a linear regression model

N-Gram Features for Unsupervised WSD with an Underlying Naïve Bayes Model

SpringerBriefs in Statistics, 2012

Syntactic Dependency-Based Feature Selection

SpringerBriefs in Statistics, 2012

Semantic WordNet-Based Feature Selection

SpringerBriefs in Statistics, 2012
