Distributional Similarity Measures

7 papers

0 followers

About this topic

Distributional similarity measures are quantitative techniques for assessing the similarity between linguistic items based on their distributional patterns in large corpora. They analyze co-occurrence frequencies and contextual relationships to determine how closely related words or phrases are in meaning, and they are often employed in natural language processing and computational linguistics.
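
To make the idea of comparing words by their co-occurrence patterns concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a method taken from any paper listed on this page: the toy corpus, the window size of 2, and the helper names `cooccurrence_vectors` and `cosine` are all hypothetical, and cosine is only one of many possible similarity functions.

```python
from collections import Counter
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Context = neighbours on both sides, excluding the word itself.
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus; real systems use very large corpora.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))  # shares "the" and "drinks"
print(cosine(vecs["cat"], vecs["car"]))  # shares only "the"
```

In this toy setting "cat" comes out closer to "dog" than to "car" because the first pair shares more contexts. Practical systems usually reweight raw counts (for example with an association measure) before comparing vectors.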

Key research themes

1. How can directional (asymmetric) distributional similarity measures improve lexical expansion and related NLP tasks?

This research theme focuses on developing and analyzing distributional similarity measures that are directional, reflecting asymmetric semantic relations such as hyponymy or lexical entailment. Traditional symmetric measures fail to capture these relations effectively. Directional measures quantify the degree of distributional feature inclusion from a more specific term to a more general term, thereby enhancing lexical expansion, information retrieval, and related tasks where asymmetric semantic relations are critical.
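
The notion of feature inclusion can be illustrated with a small Python sketch. This is a simple weighted inclusion ratio chosen for illustration only; it is not the averaged-precision measure proposed in the paper listed below, and the function name `inclusion_score` and the feature weights are hypothetical.

```python
def inclusion_score(specific, general):
    """Share of `specific`'s total feature weight carried by features
    that also occur (with positive weight) in `general`."""
    total = sum(specific.values())
    if total == 0:
        return 0.0
    shared = sum(w for f, w in specific.items() if general.get(f, 0) > 0)
    return shared / total


# Hypothetical weighted context features for a narrow and a broad term.
dog = {"bark": 3.0, "eat": 2.0, "run": 1.0}
animal = {"eat": 4.0, "run": 3.0, "sleep": 3.0, "bark": 1.0, "fly": 2.0}

print(inclusion_score(dog, animal))  # 1.0: every feature of "dog" is covered
print(inclusion_score(animal, dog))  # about 0.62: "sleep" and "fly" are not covered
```

The score is asymmetric by construction: swapping the arguments changes the denominator, which is what lets such a measure distinguish the narrower term from the broader one where a symmetric measure would return a single value for the pair.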

Directional Distributional Similarity for Lexical Expansion

by Maayan Zhitomirsky-Geffet

2017

Key finding: This paper identifies the desired properties of directional distributional similarity measures and proposes a novel measure based on averaged precision that quantifies distributional feature inclusion. Empirical evaluation... Read more

Survey of Distances between the Most Popular Distributions

by Mark Kelbert

2023, Analytics

Key finding: Although primarily surveying distance metrics for probability distributions, this work provides theoretical foundations for understanding measures like total variation, Jensen-Shannon, and related divergences. The insights... Read more

A Review of Data and Document Clustering pertaining to various Distance Measures

by Hannah Grace

2025, Salud, Ciencia y Tecnología

Key finding: This comprehensive survey discusses a variety of distance and similarity measures for data represented in vector spaces, covering metric and non-metric spaces. Its analysis highlights limitations of symmetric distances in... Read more

2. What are the advantages and computational considerations of embedding complex or variable-length data sequences into metric manifold spaces to facilitate similarity search?

This theme examines methods for representing multivariate, variable-length data sequences—such as text, time series, or trajectories—in a manifold space that preserves meaningful similarity and metric properties. These embeddings address the challenges posed by non-metric and variable-length sequence comparison, enabling effective and computationally feasible similarity search, clustering, and downstream analysis in domains like sensor networks, image retrieval, and linguistics.

Manifold Learning for Multivariate Variable-Length Sequences With an Application to Similarity Search

by Shen-shyang Ho

2015, IEEE transactions on neural networks and learning systems

Key finding: The paper proposes a semi-supervised manifold learning framework that learns metric embeddings for arbitrary-length multivariate sequences by refining similarity parameters based on instance-level constraints. The approach... Read more

On component-wise dissimilarity measures and metric properties in pattern recognition

by Enrico De Santis

2022, PeerJ Computer Science

Key finding: This work develops component-wise dissimilarity measures tailored for complex heterogeneous data representations common in pattern recognition, demonstrating that learned weighted Minkowski distances over components yield... Read more

A family of contextual measures of similarity between distributions with application to image retrieval

by Florent Perronnin

2013

Key finding: This study introduces a novel family of contextual similarity measures between distributions by embedding traditional divergences into a contextual framework, formulates them as convex optimization problems, and applies these... Read more

3. How do different semantic similarity models and pre-processing techniques impact word and document similarity measurement in constrained and morphologically rich language scenarios?

This research theme explores semantic similarity measures applicable to words and senses, focusing on knowledge-based, distributional, and hybrid models. It additionally investigates the effects of language-specific preprocessing, such as root-based and stem-based techniques for morphologically rich languages like Arabic, on similarity computations. The work emphasizes method selection for constrained computing environments (e.g., IoT), the choice of embedding and lexical resource models, and their impacts on semantic similarity accuracy.

An overview of word and sense similarity

by Federico Martelli

2021, Natural Language Engineering

Key finding: This paper provides a taxonomy and survey of semantic similarity approaches, distinguishing knowledge-based and distributional methods and discussing their representations and measures. It clarifies the computational and... Read more

A comparative study of root-based and stem-based approaches for measuring the similarity between arabic words for arabic text mining applications

by Abdelmonaime Lachkar

2022

Key finding: The authors empirically compare root-based (stemming) and stem-based (light stemming) preprocessing applied to Arabic corpora for semantic similarity computation using Latent Semantic Analysis combined with various... Read more

Comparison of Semantic Similarity Models on Constrained Scenarios

by Diogo Gomes

2023, Information Systems Frontiers

Key finding: This work analyzes state-of-the-art corpus-based semantic similarity models, including TF-IDF, LSI, Word2Vec, GloVe, fastText, and RoBERTa, under constrained computing scenarios typical in IoT and edge computing. It proposes... Read more

A Comparative Study of Root-Based and Stem-Based Approaches for Measuring the Similarity Between Arabic Words for Arabic Text Mining Applications

by Abdelmonaime Lachkar

2021, Advanced Computing: An International Journal

Key finding: Confirming prior results, this study further evaluates Root-based and Stem-based preprocessing effects on Arabic word similarity estimation via LSA and multiple similarity metrics. The findings reiterate that Stem-based... Read more

All papers in Distributional Similarity Measures

Analysis and implementation of computer-based system development of stemming algorithm for finding Arabic root word

by Khaerul Umam

2025, Journal of physics

At present many experts in the field of information technology have designed and developed algorithms to solve stemming problems, especially in Arabic. But of the many stemming analyses in Arabic, there is no standardization of a good... more

Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas?

by Isabel Durán-Muñoz

2024, Linguamática

Decisions at the outset of compiling a comparable corpus are of crucial importance for how the corpus is to be built and analysed later on. Several variables and external criteria are usually followed when building a corpus but little is... more

Unsupervised Grammatical Pattern Discovery from Arabic Extra Large Corpora

by Gilles Bernard

2023, Proceedings of the 13th International Joint Conference on Computational Intelligence

Many methods have been applied to automatic construction or expansion of lexical semantic resources. Most follow the distributional hypothesis applied to lexical context of words, eliminating grammatical context (stopwords). This paper... more

Leveraging Schema Labels to Enhance Dataset Search

by Brian D Davison

2023, Lecture Notes in Computer Science

A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior... more

A Multiple-Stage Framework for Related Entity Finding: FDWIM at TREC 2010 Entity Track

by Haiguang Chen

2023, Text REtrieval Conference

This paper describes a multiple-stage retrieval framework for the task of related entity finding on TREC 2010 Entity Track. In the document retrieval stage, a search engine is used to improve the retrieval accuracy. In the entity extraction... more

Preface Organizing Committee Program Committee Usefulness as the Criterion for Evaluation of Interactive Information Retrieval Systems Semi-supervised Priors for Microblog Language Identification Scope of Negation Detection in Sentiment Analysis a Multi-dimensional Model for Search Intent Result Div

by Stefan de Bruijn

2023

DIR 2011, the 11th Dutch-Belgian Information Retrieval Workshop, was organized by the Information and Language Processing group (ILPS) of the University of Amsterdam in collaboration with the Centrum Wiskunde en Informatica (CWI). Two... more

by Corrado Boscarino

2023

Target Type Identification for Entity-Bearing Queries

by Dario Garigliotti

2023, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval

Identifying the target types of entity-bearing queries can help improve retrieval performance as well as the overall search experience. In this work, we address the problem of automatically detecting the target types of a query with... more

Compiling Specialised Comparable Corpora. Should we always trust (Semi-)automatic Compilation Tools?

by Ruslan Mitkov

2023, Linguamática

Decisions made before compiling a comparable corpus have a major impact on how it will later be built and analysed. Several variables and external criteria are normally followed when building a... more

Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas?

by Ruslan Mitkov

2023, Linguamática

Performance Comparison of Ad-Hoc Retrieval Models over Full-Text vs. Titles of Documents

by AHMED SALEH

2022, Maturity and Innovation in Digital Libraries

While there are many studies on information retrieval models using full-text, there are presently no comparison studies of full-text retrieval vs. retrieval only over the titles of documents. On the one hand, the full-text of documents... more

Performance Comparison of Ad-Hoc Retrieval Models over Full-Text vs. Titles of Documents

by Lukas Galke

2022, Maturity and Innovation in Digital Libraries

Web-scale distributional similarity and entity set expansion

by Pantel Pantel

2022, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 2 - EMNLP '09

Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional... more

Autoadapt at the Session track in TREC 2010

by Maria Fasli

2022, Proceedings of TREC

Abstract. This paper provides an overview of the experiments we carried out at the TREC 2010 Session Track. We propose an approach for interpreting reformulated queries by using query expansions derived from simulated query logs. We show... more

Using Topic Information to Improve Non-exact Keyword-Based Search for Mobile Applications

by Fernando Batista

2022, Information Processing and Management of Uncertainty in Knowledge-Based Systems

Considering the wide range of mobile applications available nowadays, effective search engines are imperative for a user to find applications that provide a specific desired functionality. Retrieval approaches that leverage topic... more

Opinion-Based Entity Ranking using learning to rank

by Abdul Rauf Baig

2022, Applied Soft Computing

As social media and e-commerce on the Internet continue to grow, opinions have become one of the most important sources of information for users to base their future decisions on. Unfortunately, the large quantities of opinions make it... more

Excerpts from the paper:

Query set: In order to test the effectiveness of our approach we need a good set of queries such that each query contains some popular aspects of hotels and cars. This is necessary for building the assumption that users will use our system for ranking entities by providing their desired search aspects about entities directly in the query through keywords. One possible good query set can be obtained from Ganesan and Zhai [17]. The authors construct this query set by first obtaining a manual set of seed queries, where each seed query consists of a combination of aspects obtained from a set of real users. Next, aspects that are mentioned in these queries are randomly combined with each other to generate long multi-aspect queries. Through this random combination, the authors generated 10,000 queries per collection (hotels and cars). The shortest query is one aspect long, and the longest query can touch every aspect of a car or a hotel. Tables 3 and 4 show the aspects and seed queries that Ganesan and Zhai [17] used in their experiments for generating multi-aspect queries. We used their query set for our experiments.

... ratings of these aspects given by real users can be used for generating a reasonable approximation of human relevance judgments. For a query that has only a single aspect, its relevance judgment is generated by averaging the ratings provided by each user. This is called Average Aspect Rating (AAR). For queries that have multiple aspects, relevance judgments are generated by first calculating the AAR scores of the individual aspects and then averaging all AARs. This is called Multi-Aspect AAR (MAAR) and is defined in Eq. (14). However, such a judgment is based on the average ratings of a group of users, and thus it may not reflect the real preferences of any particular user. As a result, the evaluation results using such judgments are only meaningful for relative comparison of different ranking methods, which is our goal.

... steps in genetic programming: (a) initial population creation, and (b) recombination with the existing population to evolve better solutions. Generally, the initial population (generation) is created randomly and is modeled in the form of trees. Each tree represents a solution, structured by several nodes. Nodes can be either operators (functions) or operands (terminals). From the initial population, recombination occurs to evolve better solutions (the next generation's population). This is performed by crossover and mutation operators. In order to produce an improvement in the next population, it is important to select better solutions from the current population in a larger percentage. This selection is done by a fitness function that measures how close an individual gets to solving the problem. The process of recombination iterates until a predefined number of generations has been reached or no further improvements can be observed. Some important parameters in GP are: (a) the population size, (b) the number of generations, (c) the depth of the tree, (d) the function set, and (e) the terminal set.

Figure and table captions from the paper:

Fig. 1. Architecture of our Opinion-based Entity Ranking System.

Fig. 2. Improvement gained with nDCG@10 on the basis of average effectiveness scores of all the individuals and fittest individuals of each generation as the generations evolve.

Table: Characteristics of cars collection.

Table: Aspects and seed queries that are used for generating multi-aspect queries for hotels collection.

Table: Aspects and seed queries that are used for generating multi-aspect queries for cars collection.

Table: Fittest top three individuals evolved with genetic programming after 100 generations on training dataset.

Table: nDCG@10 of different feature sets on cars collection. Bold values indicate the retrieval model achieves high effectiveness.

Table: nDCG@10 of different feature sets on hotels collection. GPrank achieves significantly high effectiveness compared with single ranking features. Bold values indicate the retrieval model achieves high effectiveness.

Table: nDCG@10 of different feature sets on cars collection. GPrank achieves significantly high effectiveness compared with single ranking features.

Table: Top 10 ranked hotels for the query very clean, great views. This ranking has an nDCG of 0.968 after ranking entities using GPrank, while the previous nDCG reported in Ganesan and Zhai [17] for this query is 0.944. All hotels in this list have AARs above 4.5, which is above the average AAR for aspects ‘cleanliness’ and ‘location’.

Table: Top 10 ranked cars for the query very reliable. This ranking has an nDCG of 0.979 after ranking entities using GPrank, while the previous nDCG reported in Ganesan and Zhai [17] for this query is 0.964. All cars in this list have AARs above 4.5, which is above the average AAR for aspect ‘reliability’.
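
The Average Aspect Rating (AAR) and Multi-Aspect AAR (MAAR) judgments described in the excerpt above can be sketched in a few lines of Python. This follows only the prose description (average the user ratings per aspect, then average those averages over the query's aspects); the paper's Eq. (14) is not reproduced on this page, so the exact form may differ. The function names and the ratings are hypothetical.

```python
from statistics import mean


def aar(ratings):
    """Average Aspect Rating: mean of the user ratings for one aspect."""
    return mean(ratings)


def maar(entity_ratings, query_aspects):
    """Multi-Aspect AAR: mean of the per-aspect AARs for the query's aspects."""
    return mean(aar(entity_ratings[aspect]) for aspect in query_aspects)


# Hypothetical user ratings (1 to 5) for one hotel, grouped by aspect.
hotel = {
    "cleanliness": [5, 4, 5],
    "location": [4, 4],
    "service": [3, 5],
}

print(aar(hotel["cleanliness"]))                # single-aspect query
print(maar(hotel, ["cleanliness", "location"]))  # two-aspect query
```

Because each aspect is averaged before the aspects are combined, an aspect with many ratings does not outweigh one with few, which matches the two-step description in the excerpt.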

Use of Noun Phrases in Interactive Search Refinement

by Murat Karamuftuoglu

2022

The paper presents an approach to interactively refining user search formulations and its evaluation in the new High Accuracy Retrieval from Documents (HARD) track of TREC-12. The method consists of showing to the user a list of noun... more

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

by Nandan Thakur

2022, ArXiv

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and... more

Exploring Summary-Expanded Entity Embeddings for Entity Retrieval

by Shahrzad Naseri

2022

Entity retrieval is an important part of any modern retrieval system and often satisfies user information needs directly. Word and entity embeddings are a promising opportunity for new improvements in retrieval, especially in the presence... more

Exploring linguistically-rich patterns for question generation

by Luisa Coheur

2022

Proceedings of the UCNLG+ Eval: Language Generation and Evaluation Workshop, pages 33-38, Edinburgh, Scotland, UK, July 31, 2011. © 2011 Association for Computational Linguistics. Exploring linguistically-rich patterns for question... more

A comparative study of root-based and stem-based approaches for measuring the similarity between arabic words for arabic text mining applications

by Abdelmonaime Lachkar

2022

Representation of semantic information contained in the words is needed for any Arabic Text Mining applications. More precisely, the purpose is to better take into account the semantic dependencies between words expressed by the... more

SerfSIN: Search Engines Results' Refinement using a Sense-driven Inference Network

by Giannis Tzimas

2022, International Conference on Web Information Systems and Technologies

A novel framework is presented for performing re-ranking in the search results of a Web search engine, incorporating user judgments as registered in their selection of relevant documents. The proposed scheme combines smoothly techniques... more

Query Expansion Method for Quran Search Using Semantic Search and Lucene Ranking

by nureize arbaiy

2022

Search engines are becoming an instrument for users to search for needed information. The web search engine is one of the most popular search engines that are successfully implemented in many application areas. A major challenge to a web... more

Query Expansion Method for Quran Search Using Semantic Search and Lucene Ranking

by Nureize Arbaiy

2021

A Comparative Study of Root-Based and Stem-Based Approaches for Measuring the Similarity Between Arabic Words for Arabic Text Mining Applications

by Abdelmonaime Lachkar

2021, Advanced Computing: An International Journal

A proposal for chemical information retrieval evaluation

by Jianhan Zhu

2021, Proceeding of the 1st ACM workshop on Patent information retrieval - PaIR '08

Based on the important progresses made in information retrieval (IR) in terms of theoretical models and evaluations, more and more attention has recently been paid to the research in domain specific IR, as evidenced by the organization of... more

Integrating multiple windows and document features for expert finding

by Jianhan Zhu

2021, Journal of the American Society for Information Science and Technology

This version may not include final proof corrections and does not include published layout or pagination.

Integrating multiple windows and document features for expert finding

by Stefan Rueger

2021, Journal of the American Society for Information Science and Technology

This version may not include final proof corrections and does not include published layout or pagination.

LIA at TREC 2011 Web Track: Experiments on the Combination of Online Resources

by E. Sanjuan

2021

In this paper, we report the experiments we conducted for our participation in the TREC 2011 Web Track. The experiments we conducted this year aim at discovering how the combination of specific external resources in a language modeling... more

Window-based Enterprise Expert Search

by Stephen Robertson

2021, Proceedings of the 15th …

Abstract. This is the first year for the participation of the City University Centre of Interactive System Research (CISR) in the Expert Search Task. In this paper, we describe an expert search experiment based on window-based techniques,... more

Figure 2. Evaluation results on all measures when tuning the b parameter in steps of 0.05. The implication of Figure 2 seems to be that we should turn the b parameter (which controls the extent of document length normalization) right down to zero in this application. This is an interesting conclusion, and diverges from most of our other experiences.

Target Type Identification for Entity-Bearing Queries

by Dario Garigliotti

2019, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17)

NeuType: A Simple and Effective Neural Network Approach for Predicting Missing Entity Type Information in Knowledge Bases

by Dario Garigliotti

2019, arXiv e-prints - arXiv:1907.03007

Knowledge bases store information about the semantic types of entities, which can be utilized in a range of information access tasks. This information, however, is often incomplete, due to new entities emerging on a daily basis. We... more

Hypertext

by Sumit Bhatia

2018

Knowledge Graphs capture the semantic relations between real-world entities and can thus allow end-users to explore different aspects of an entity of interest by traversing the edges in the graph. Most of the state-of-the-art... more

Iswc

by Sumit Bhatia

2018

We address the problem of finding descriptive explanations of facts stored in a knowledge graph. This is important in high-risk domains such as healthcare, intelligence, etc. where users need additional information for decision making and... more

by Massimiliano Ruocco

2018, International Journal of Multimedia Information Retrieval

Providing effective tools to retrieve event-related pictures within media-sharing applications, such as Flickr, is an important but challenging task. One interesting aspect is to search pictures related to a specific event with a given... more

A method for automatic extraction of multiword units representing business aspects from user reviews

by Olga Vechtomova

2017, Journal of the Association for Information Science and Technology

The paper describes a semi-supervised approach to extracting multiword aspects of user-written reviews that belong to a given category. The method starts with a small set of seed words representing the target category, and calculates... more

A domain-independent approach to finding related entities

by Olga Vechtomova

2017, Information Processing & Management

We propose an approach to the retrieval of entities that have a specific relationship with the entity given in a query. Our research goal is to investigate whether related entity finding problem can be addressed by combining a measure of... more

Non-Compositional Term Dependence for Information Retrieval

by C. Lioma

2016

Modelling term dependence in IR aims to identify co-occurring terms that are too heavily dependent on each other to be treated as a bag of words, and to adapt the indexing and ranking accordingly. Dependent terms are predominantly... more

Compiling Specialised Comparable Corpora. Should we always trust (Semi-)automatic Compilation Tools?

by Hernani Costa

2016

The Web as a Source of Evidence for Filtering Candidate Answers to Natural Language Questions

by Michel Benoit

2016

Identifying and extracting named entities from web pages has been the subject of many researches. In this paper, we propose and evaluate some new unsupervised language modeling approaches to determine the membership level of a candidate... more

Web-Scale Distributional Similarity and Entity Set Expansion

by Ana Popescu

2014

by marc bron

2013

Related entity finding is the task of returning a ranked list of homepages of relevant entities of a specified type that need to engage in a given relationship with a given source entity. We propose a framework for addressing this task... more

by marc bron

2013

We report on experiments for the Related Entity Finding task in which we focus on only using Wikipedia as a target corpus in which to identify (related) entities. Our approach is based on co-occurrences between the source entity and... more

The first international workshop on entity-oriented search (EOS)

by Lise Getoor

2013

Abstract: The First International Workshop on Entity-Oriented Search (EOS) was held on July 28, 2011 in Beijing, China, in conjunction with the 34th Annual International ACM SIGIR Conference (SIGIR 2011). The objective for the... more

by ChengXiang Zhai

2013

Abstract: Our goal in participating in the TREC 2009 Entity Track was to study whether relation extraction techniques can help in improving accuracy of the entity finding task. Finding related entities is informational in nature and we... more

Lia-ismart at the TREC 2011 entity track: Entity list completion using contextual unsupervised scores for candidate entities ranking

by Patrice Bellot

2013

Abstract This paper describes our participation in the Entity List Completion (ELC) task at Entity track 2011. Our approach combined the work done for the Related Entity Finding 2010 task with some new criteria as the proximity or the... more

LIA-iSmart at TREC 2010: An Unsupervised Web-Based Approach for Filtering Answers

by Patrice Bellot

2013

Abstract: Searching for named entities has been the subject of much research in information retrieval. Our goal in participating in the TREC 2010 Entity Ranking track is to look into recognizing any named entity in arbitrary categories and use... more
