
Text Similarity Functions

35 papers
2 followers
About this topic
Text similarity functions are computational algorithms used to quantify the degree of similarity between two text documents. These functions analyze various linguistic features, such as word choice, syntax, and semantics, to produce a numerical score that reflects how alike the texts are in content and meaning.

Key research themes

1. What are the primary methodological categories for text similarity functions and how do their strengths and weaknesses compare?

This theme investigates the taxonomy of text similarity measurement methods, primarily categorized into string-based, corpus-based, knowledge-based, and hybrid approaches. It focuses on understanding the methodological foundations, comparative advantages, limitations, and domains of applicability for each category. Knowledge of these categories informs researchers' choices when selecting or designing text similarity functions for tasks such as information retrieval, text clustering, semantic analysis, and plagiarism detection.

Key finding: This comprehensive systematic literature review classifies short text similarity (STS) methods into four main categories: string-based, corpus-based, knowledge-based, and hybrid. It highlights each method's strengths and...
Key finding: This survey partitions text similarity methods into string-based, corpus-based, and knowledge-based approaches, describing key representative algorithms for each. It presents character-based string metrics (e.g., Levenshtein...
Key finding: This paper extensively reviews semantic textual similarity (STS) techniques, categorizing them into topological/knowledge-based, statistical/corpus-based, and string-based methods. It emphasizes taxonomy-based approaches...
Key finding: Proposes a parameterized similarity model leveraging a 'soft cardinality' measure that flexibly represents textual granularity from characters to words and sentences. Unlike traditional static measures, this adaptive function...
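
As a rough illustration of the string-based category named in this theme, the sketch below implements two textbook surface measures: Levenshtein edit distance (character level) and Jaccard overlap of word sets (token level). It is a minimal example, not code from any of the listed papers; the function names, the lowercase whitespace tokenization, and the normalization by the longer string's length are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard coefficient over lowercase, whitespace-split word sets."""
    ta, tb = set(text_a.lower().split()), set(text_b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Both functions look only at surface form, so "car" and "automobile" score near zero; that limitation is what the corpus-based and knowledge-based categories are meant to address.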

2. How can advanced linguistic and semantic resources enhance text similarity detection beyond surface-level measures?

This theme explores the leveraging of lexical ontologies, semantic frames, predicate-argument structures, and distributional semantics to improve the semantic sensitivity of text similarity functions. It investigates how such resources enable capturing synonymy, polysemy, and contextual equivalence beyond mere lexical overlap. This research area is vital for applications requiring deeper language understanding, such as paraphrase detection, textual entailment, and semantic search.

Key finding: Introduces a supervised regression framework combining lexical overlap, shallow syntactic similarity (BLEU on base-phrase labels), and semantically informed metrics leveraging named entity recognition and alignment of verb...
Key finding: Develops a novel corpus-based semantic similarity measure PatternSim leveraging an extended set of finite-state transducer encoded lexico-syntactic patterns to extract hypernymic and synonymic relations from large corpora...
Key finding: Proposes semantic similarity models based on distributional word representations constructed from a large corpus via Random Indexing (RI), Latent Semantic Analysis (LSA) over RI, and improved syntactic representations via...
Key finding: Building on textual entailment methodologies, SAGAN combines multiple WordNet-based semantic similarity metrics at the word-to-word level to approximate sentence-level semantic similarity. The system uses multiple...

3. What are effective parameterized and empirical similarity functions for text comparison and how can parameters be optimized?

This theme focuses on parametric similarity measures that generalize traditional metrics such as PMI, Dice, or cardinality-based coefficients through tunable parameters. It considers theoretical properties ensuring formal constraints and proposes empirical methods to estimate optimal parameters in an unsupervised fashion. This enables adapting similarity functions to diverse representation models and textual datasets, addressing issues like asymmetry, salience, and granularity. Such work provides both theoretical advances and practical optimization strategies for improved text similarity assessments.

Key finding: Investigates the Information Contrast Model (ICM), a parameterized generalization of Pointwise Mutual Information (PMI) with constraints ensuring formal similarity properties. The study proposes an unsupervised estimator for...
Key finding: The proposed soft cardinality measure includes parameterized resemblance coefficients and admits recursion over granularity levels from characters to words to sentences, enabling adaptation to specific textual comparison...
Key finding: Introduces SemSim p, a parametric semantic similarity method leveraging information content weighted ontologies and taxonomic reasoning. The method’s two main parameters control concept weighting and normalization for...
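
The idea of pairing a softened cardinality with a resemblance coefficient can be sketched as follows. The summaries on this page do not give a formula, so this assumes the commonly cited quadratic-time approximation, in which each element contributes the reciprocal of its summed similarity (raised to a softness exponent p) to every element of the set. The word-level similarity (Jaccard over padded character bigrams), the default p = 2, and all function names are illustrative assumptions rather than the authors' exact configuration.

```python
def qgrams(word: str, q: int = 2) -> set:
    """Character q-grams of a word, padded with '#' (padding is an assumption)."""
    padded = "#" * (q - 1) + word + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def word_sim(a: str, b: str) -> float:
    """Hypothetical auxiliary similarity: Jaccard over character bigrams."""
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb)


def soft_cardinality(words, p: float = 2.0) -> float:
    """Assumed approximation: sum over a of 1 / sum over b of sim(a, b)**p.

    The inner sum includes sim(a, a) = 1, so it is never zero. Similar words
    share 'mass', so the result is at most the classical cardinality.
    """
    items = set(words)
    return sum(1.0 / sum(word_sim(a, b) ** p for b in items) for a in items)


def soft_dice(text_a: str, text_b: str, p: float = 2.0) -> float:
    """Dice coefficient with soft cardinalities in place of set sizes."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    card_a, card_b = soft_cardinality(a, p), soft_cardinality(b, p)
    # The intersection is inferred from the union rather than computed directly.
    card_inter = card_a + card_b - soft_cardinality(a | b, p)
    return 2.0 * card_inter / (card_a + card_b)
```

With a crisp Dice coefficient, "cat" and "cats" share no tokens and score 0; here the pair scores above zero because the two words are treated as partially overlapping elements. Raising p pushes the measure back toward classical set cardinality.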

All papers in Text Similarity Functions

This paper presents a novel approach for building adaptive similarity functions based on cardinality using machine learning. Unlike current approaches that build feature sets using similarity scores, we have developed these feature sets...
Soft cardinality has been shown to be a very strong text-overlapping baseline for the task of measuring semantic textual similarity (STS), obtaining 3rd place in SemEval-2012. At the *SEM-2013 shared task, besides the plain text-overlapping...
In this paper we describe the system we used to participate in the Student Response Analysis task (task 7) at SemEval 2013. This system is based on text overlap through the soft cardinality and a new mechanism for weight propagation. Although there...
In this paper we describe our system submitted for evaluation in the CLTE-SemEval-2013 task, which achieved the best results in two of the four data sets, and finished third on average. This system consists of an SVM classifier with...
This paper presents an approach for tackling the authorship identification task. The approach is based on comparing a given unknown document against the known documents using a number of different phrase-level and...
Soft-cardinality spectra (SC spectra) is a new method of approximation for text strings in linear time, which divides text strings into character q-grams of different sizes. The method allows simultaneous use of weighting at...
We present an approach for the construction of text similarity functions using a parameterized resemblance coefficient in combination with a softened cardinality function called soft cardinality. Our approach provides a consistent and...
Soft cardinality proved to be a very strong text-overlapping baseline for the task of semantic textual similarity (STS), obtaining third place in SemEval-2012. This year, besides the plain text-overlapping approach, two...
In this paper, we present a survey and comparative study of semantic textual similarity methods that are based on the WordNet taxonomy. We also propose a new method for measuring semantic similarity between sentences. This proposed...
The ability to identify similarities between narratives has been argued to be central in human interactions. Previous work that sought to formalize this task has hypothesized that narrative similarity can be equated to the existence of a...
Abstract: This work presents the development of a computational system for content-based image retrieval, called SRIM (Sistema de Recuperação de Imagens Mamográficas, or Mammographic Image Retrieval System). SRIM aims to enable the...
This research aims to explore CBAR concepts to implement test oracles that support the testing of TTS systems, assisting humans in quality evaluations. In an automated software testing environment, test oracles represent the...
This paper presents a prototype of a system whose goal is to highlight the opportunity to apply computer vision, through Content-Based Image Retrieval (CBIR), to build test oracles for software that generates graphical...
In the context of software testing, one challenge to be overcome is the testing of programs with graphical output. Content-Based Image Retrieval (CBIR) is a feasible approach for such tests, but its results may vary...
Abstract: Content-based image retrieval has attracted considerable attention, especially for large image collections, where asking users to label each image becomes a costly process that is more susceptible to...
This paper describes our participation in the SemEval-2014 tasks 1, 3 and 10. We used a uniform approach for addressing all the tasks, using the soft cardinality for extracting features from text pairs, and machine learning for predicting...
Classical set theory provides a method for comparing objects using cardinality and intersection, in combination with well-known resemblance coefficients such as Dice, Jaccard, and cosine. However, set operations are intrinsically...
The need to apply the various similarity measures appropriately for clustering has grown over the years as data volumes keep increasing. The issue of deciding which similarity measure is the best and on what kind of dataset...
Soft cardinality (SC) is a softened version of the classical cardinality of set theory. However, given its prohibitive computational cost (exponential order), an approximation that is quadratic in the number of terms in the text has been...