Academia.edu

Distributional Similarity Measures

7 papers
0 followers
About this topic
Distributional similarity measures are quantitative techniques used to assess the similarity between linguistic items based on their distributional patterns in large corpora. These measures analyze co-occurrence frequencies and contextual relationships to determine how closely related words or phrases are in meaning, often employed in natural language processing and computational linguistics.
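As a minimal sketch of the idea described above, the snippet below builds co-occurrence count vectors from a toy corpus and compares words by cosine similarity. The corpus, the window size, and the helper names `cooccurrence_vectors` and `cosine` are illustrative assumptions, not taken from any paper listed on this page.

```python
from collections import Counter
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical); real measures rely on large corpora.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]
vecs = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_milk = cosine(vecs["cat"], vecs["milk"])
```

In this toy setting "cat" and "dog" share more contexts than "cat" and "milk", so `sim_cat_dog` comes out higher. Practical systems usually reweight raw counts (for example with pointwise mutual information) before comparing vectors.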

Key research themes

1. How can directional (asymmetric) distributional similarity measures improve lexical expansion and related NLP tasks?

This research theme focuses on developing and analyzing distributional similarity measures that are directional, reflecting asymmetric semantic relations such as hyponymy or lexical entailment. Traditional symmetric measures fail to capture these relations effectively. Directional measures quantify the degree of distributional feature inclusion from a more specific term to a more general term, thereby enhancing lexical expansion, information retrieval, and related tasks where asymmetric semantic relations are critical.
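The notion of distributional feature inclusion described above can be illustrated with a simple weighted inclusion ratio: the share of the narrower term's feature weight that falls on features also observed with the broader term. This is a simplified stand-in, not the averaged-precision measure proposed in the paper summarized here; the function name `inclusion` and the context counts are hypothetical.

```python
def inclusion(narrow, broad):
    """Share of `narrow`'s feature weight on features that also occur with `broad`."""
    total = sum(narrow.values())
    if total == 0:
        return 0.0
    shared = sum(w for f, w in narrow.items() if broad.get(f, 0) > 0)
    return shared / total


# Hypothetical context counts for a specific term and a more general one.
dog = {"bark": 4, "eat": 3, "leash": 3}
animal = {"eat": 5, "bark": 2, "leash": 1, "fly": 4, "swim": 4, "graze": 4}

dog_in_animal = inclusion(dog, animal)   # all of dog's contexts occur with animal
animal_in_dog = inclusion(animal, dog)   # only part of animal's contexts occur with dog
```

The two directions give different scores, which is the property a symmetric measure such as cosine cannot express.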

Key finding: This paper identifies the desired properties of directional distributional similarity measures and proposes a novel measure based on averaged precision that quantifies distributional feature inclusion. Empirical evaluation...
Key finding: Although primarily surveying distance metrics for probability distributions, this work provides theoretical foundations for understanding measures like total variation, Jensen-Shannon, and related divergences. The insights...
Key finding: This comprehensive survey discusses a variety of distance and similarity measures for data represented in vector spaces, covering metric and non-metric spaces. Its analysis highlights limitations of symmetric distances in...
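For readers unfamiliar with the total variation and Jensen-Shannon measures named in the findings above, the following sketch computes both for discrete distributions stored as dictionaries. It follows the standard textbook definitions (base-2 logarithm for Jensen-Shannon); the function names and the dictionary representation are assumptions made for illustration.

```python
from math import log2


def total_variation(p, q):
    """Total variation distance: half the L1 distance between two distributions."""
    keys = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def kl(p, q):
    """Kullback-Leibler divergence in bits; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * log2(pk / q[k]) for k, pk in p.items() if pk > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint mixture."""
    keys = p.keys() | q.keys()
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical context distributions for two words.
p = {"drink": 0.5, "eat": 0.5}
q = {"drink": 0.5, "fly": 0.5}
tv_pq = total_variation(p, q)
js_pq = jensen_shannon(p, q)
```

Both measures are symmetric and bounded by 1 (with the base-2 logarithm), and both reach 0 only for identical distributions.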

2. What are the advantages and computational considerations of embedding complex or variable-length data sequences into metric manifold spaces to facilitate similarity search?

This theme examines methods for representing multivariate, variable-length data sequences—such as text, time series, or trajectories—in a manifold space that preserves meaningful similarity and metric properties. These embeddings address the challenges posed by non-metric and variable-length sequence comparison, enabling effective and computationally feasible similarity search, clustering, and downstream analysis in domains like sensor networks, image retrieval, and linguistics.

Key finding: The paper proposes a semi-supervised manifold learning framework that learns metric embeddings for arbitrary-length multivariate sequences by refining similarity parameters based on instance-level constraints. The approach...
Key finding: This work develops component-wise dissimilarity measures tailored for complex heterogeneous data representations common in pattern recognition, demonstrating that learned weighted Minkowski distances over components yield...
Key finding: This study introduces a novel family of contextual similarity measures between distributions by embedding traditional divergences into a contextual framework, formulates them as convex optimization problems, and applies these...
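The weighted Minkowski distance mentioned in the findings above has a compact general form, sketched below. Here the weights are fixed by hand purely for illustration, whereas the summarized work learns them; the function name and the sample vectors are assumptions.

```python
def weighted_minkowski(x, y, weights, p=2.0):
    """Weighted Minkowski distance: (sum_i w_i * |x_i - y_i|**p) ** (1/p)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(w * abs(a - b) ** p for a, b, w in zip(x, y, weights))
    return total ** (1.0 / p)


# With unit weights and p = 2 this reduces to the Euclidean distance.
d_euclid = weighted_minkowski((0.0, 0.0), (3.0, 4.0), (1.0, 1.0), p=2.0)
# With p = 1 each component difference is scaled by its weight and summed.
d_city = weighted_minkowski((0.0, 0.0), (3.0, 4.0), (2.0, 1.0), p=1.0)
```

Raising a component's weight makes differences in that component count for more, which is what allows such a distance to be tuned to heterogeneous data.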

3. How do different semantic similarity models and pre-processing techniques impact word and document similarity measurement in constrained and morphologically rich language scenarios?

This research theme explores semantic similarity measures applicable to words and senses, focusing on knowledge-based, distributional, and hybrid models. It additionally investigates the effects of language-specific preprocessing, such as root-based and stem-based techniques for morphologically rich languages like Arabic, on similarity computations. The work emphasizes method selection for constrained computing environments (e.g., IoT), the choice of embedding and lexical resource models, and their impacts on semantic similarity accuracy.

Key finding: This paper provides a taxonomy and survey of semantic similarity approaches, distinguishing knowledge-based and distributional methods and discussing their representations and measures. It clarifies the computational and...
Key finding: The authors empirically compare root-based (stemming) and stem-based (light stemming) preprocessing applied to Arabic corpora for semantic similarity computation using Latent Semantic Analysis combined with various...
Key finding: This work analyzes state-of-the-art corpus-based semantic similarity models, including TF-IDF, LSI, Word2Vec, GloVe, fastText, and RoBERTa, under constrained computing scenarios typical in IoT and edge computing. It proposes...
Key finding: Confirming prior results, this study further evaluates root-based and stem-based preprocessing effects on Arabic word similarity estimation via LSA and multiple similarity metrics. The findings reiterate that stem-based...

All papers in Distributional Similarity Measures

At present, many experts in the field of information technology have designed and developed algorithms to solve stemming problems, especially in Arabic. But of the many stemming analyses in Arabic, there is no standardization of a good...
Decisions at the outset of compiling a comparable corpus are of crucial importance for how the corpus is to be built and analysed later on. Several variables and external criteria are usually followed when building a corpus but little is...
Many methods have been applied to automatic construction or expansion of lexical semantic resources. Most follow the distributional hypothesis applied to lexical context of words, eliminating grammatical context (stopwords). This paper...
A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior...
This paper describes a multiple-stage retrieval framework for the task of related entity finding on TREC 2010 Entity Track. In the document retrieval stage, a search engine is used to improve the retrieval accuracy. In the entity extraction...
DIR 2011, the 11th Dutch-Belgian Information Retrieval Workshop, was organized by the Information and Language Processing group (ILPS) of the University of Amsterdam in collaboration with the Centrum Wiskunde en Informatica (CWI). Two...
Identifying the target types of entity-bearing queries can help improve retrieval performance as well as the overall search experience. In this work, we address the problem of automatically detecting the target types of a query with...
Decisions made before compiling a comparable corpus have a great impact on how it will later be built and analysed. Several variables and external criteria are normally followed in the construction of a...
While there are many studies on information retrieval models using full-text, there are presently no comparison studies of full-text retrieval vs. retrieval only over the titles of documents. On the one hand, the full-text of documents...
Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional...
This paper provides an overview of the experiments we carried out at the TREC 2010 Session Track. We propose an approach for interpreting reformulated queries by using query expansions derived from simulated query logs. We show...
Considering the wide offer of mobile applications available nowadays, effective search engines are imperative for a user to find applications that provide a specific desired functionality. Retrieval approaches that leverage topic...
As social media and e-commerce on the Internet continue to grow, opinions have become one of the most important sources of information for users to base their future decisions on. Unfortunately, the large quantities of opinions make it...
The paper presents an approach to interactively refining user search formulations and its evaluation in the new High Accuracy Retrieval from Documents (HARD) track of TREC-12. The method consists of showing to the user a list of noun...
Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and...
Entity retrieval is an important part of any modern retrieval system and often satisfies user information needs directly. Word and entity embeddings are a promising opportunity for new improvements in retrieval, especially in the presence...
Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop, pages 33–38, Edinburgh, Scotland, UK, July 31, 2011. ©2011 Association for Computational Linguistics. Exploring linguistically-rich patterns for question...
Representation of semantic information contained in the words is needed for any Arabic text mining application. More precisely, the purpose is to better take into account the semantic dependencies between words expressed by the...
A novel framework is presented for performing re-ranking in the search results of a Web search engine, incorporating user judgments as registered in their selection of relevant documents. The proposed scheme combines smoothly techniques...
Search engines are becoming an instrument for users to search for needed information. The web search engine is one of the most popular search engines that are successfully implemented in many application areas. A major challenge to a web...
Based on the important progress made in information retrieval (IR) in terms of theoretical models and evaluations, more and more attention has recently been paid to the research in domain specific IR, as evidenced by the organization of...
This version may not include final proof corrections and does not include published layout or pagination.
In this paper, we report the experiments we conducted for our participation in the TREC 2011 Web Track. The experiments we conducted this year aim at discovering how the combination of specific external resources in a language modeling...
This is the first year for the participation of the City University Centre of Interactive System Research (CISR) in the Expert Search Task. In this paper, we describe an expert search experiment based on window-based techniques,...
Knowledge bases store information about the semantic types of entities, which can be utilized in a range of information access tasks. This information, however, is often incomplete, due to new entities emerging on a daily basis. We...
Knowledge Graphs capture the semantic relations between real-world entities and can thus allow end-users to explore different aspects of an entity of interest by traversing through the edges in the graph. Most of the state-of-the-art...
We address the problem of finding descriptive explanations of facts stored in a knowledge graph. This is important in high-risk domains such as healthcare, intelligence, etc. where users need additional information for decision making and...
Providing effective tools to retrieve event-related pictures within media-sharing applications, such as Flickr, is an important but challenging task. One interesting aspect is to search pictures related to a specific event with a given...
The paper describes a semi-supervised approach to extracting multiword aspects of user-written reviews that belong to a given category. The method starts with a small set of seed words representing the target category, and calculates...
We propose an approach to the retrieval of entities that have a specific relationship with the entity given in a query. Our research goal is to investigate whether related entity finding problem can be addressed by combining a measure of...
Modelling term dependence in IR aims to identify co-occurring terms that are too heavily dependent on each other to be treated as a bag of words, and to adapt the indexing and ranking accordingly. Dependent terms are predominantly...
Identifying and extracting named entities from web pages has been the subject of much research. In this paper, we propose and evaluate some new unsupervised language modeling approaches to determine the membership level of a candidate...
Related entity finding is the task of returning a ranked list of homepages of relevant entities of a specified type that need to engage in a given relationship with a given source entity. We propose a framework for addressing this task...
We report on experiments for the Related Entity Finding task in which we focus on only using Wikipedia as a target corpus in which to identify (related) entities. Our approach is based on co-occurrences between the source entity and...
The First International Workshop on Entity-Oriented Search (EOS) was held on July 28, 2011 in Beijing, China, in conjunction with the 34th Annual International ACM SIGIR Conference (SIGIR 2011). The objective for the...
Our goal in participating in the TREC 2009 Entity Track was to study whether relation extraction techniques can help in improving accuracy of the entity finding task. Finding related entities is informational in nature and we...
This paper describes our participation in the Entity List Completion (ELC) task at Entity track 2011. Our approach combined the work done for the Related Entity Finding 2010 task with some new criteria as the proximity or the...
Searching for named entities has been the subject of much research in information retrieval. Our goal in participating in TREC 2010 Entity Ranking track is to look for recognizing any named entity in arbitrary categories and use...