Academia.edu

Short-Text Semantic Similarity

7 papers
2 followers
About this topic
Short-Text Semantic Similarity is a subfield of natural language processing that focuses on measuring the degree of similarity in meaning between short text segments, such as sentences or phrases. It employs various computational techniques, including vector space models and deep learning, to quantify semantic relationships and enhance understanding of textual content.
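
A minimal sketch of the vector space view mentioned above: the snippet below compares two short texts with cosine similarity over simple bag-of-words count vectors. The whitespace tokenization and raw term counts are assumptions for illustration only; practical systems typically use TF-IDF weighting, word embeddings, or deep sentence encoders.

# Minimal sketch: bag-of-words cosine similarity between two short texts.
# Tokenization and weighting are deliberately simplistic (an assumption);
# real systems typically use TF-IDF, embeddings, or sentence encoders.
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    # Lowercased whitespace tokenization into term-count vectors.
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("a cat sat on the mat", "a cat lay on the rug"))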

Key research themes

1. What semantic knowledge sources and methodological frameworks can most effectively measure short-text semantic similarity?

This research theme focuses on classifying and evaluating various semantic similarity measurement techniques by leveraging different semantic knowledge sources such as string-based, corpus-based, knowledge-based, and hybrid methods. Understanding these frameworks is critical to developing effective algorithms that capture semantic similarity beyond surface lexical matching, which is particularly challenging in short texts due to limited context and high ambiguity.

Key finding: This comprehensive review categorizes short-text similarity methods into string-based, corpus-based, knowledge-based, and hybrid techniques, identifying four semantic knowledge bases and eight corpus resources as external... Read more
Key finding: The paper divides STS methods into topological/knowledge-based, statistical/corpus-based, and string-based categories, with special emphasis on WordNet taxonomy for topological methods. It contributes a novel hybrid approach... Read more
Key finding: Proposes a corpus-based semantic word similarity measure integrated with a modified and normalized Longest Common Subsequence (LCS) algorithm for text similarity. Experimentation on multiple datasets shows superior... Read more
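
To make the LCS-based component in the last finding concrete, here is a minimal sketch of a normalized Longest Common Subsequence similarity over word tokens. The normalization by the length of the longer sequence is an assumption for illustration, and the corpus-based word-similarity component described in that work is not reproduced here.

# Minimal sketch: normalized Longest Common Subsequence (LCS) similarity
# over word tokens. The normalization scheme is an assumption; the cited
# work combines a modified LCS with a corpus-based word similarity
# measure, which is not reproduced here.
def lcs_length(a, b):
    # Classic dynamic-programming LCS over two token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def normalized_lcs_similarity(text_a: str, text_b: str) -> float:
    tokens_a, tokens_b = text_a.lower().split(), text_b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    return lcs_length(tokens_a, tokens_b) / max(len(tokens_a), len(tokens_b))

print(normalized_lcs_similarity("the cat sat on the mat", "the cat is on the mat"))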

2. How can lexical, syntactic, and semantic features be integrated via machine learning to improve short-text semantic similarity prediction?

This research area investigates the integration of multiple linguistic feature types—lexical overlap, syntactic structures, semantic relations—through supervised machine learning techniques like Support Vector Machines (SVM). The goal is to construct robust feature representations that capture semantic equivalence or similarity between short texts, enabling systems to generalize across languages and domains, including resource-scarce settings.

Key finding: This work presents an SVM-based system utilizing diverse linguistically motivated features including distributional, conceptual, semantic similarity measures, and multiword expressions. It performed well on SemEval-2015 Task... Read more
Key finding: Employs a supervised learning regression model combining lexical, syntactic, and semantic metrics such as word overlap, BLEU scores on base-phrases, named entity preservation, and predicate-argument alignment. While... Read more
Key finding: Proposes a bag-of-words statistical model augmented with a part-of-speech weighting scheme as proxy for deeper syntactic information, enhancing semantic similarity measurement without requiring resource-heavy parsing. It... Read more
Key finding: Offers a systematic analysis and classification of syntactic information usage—including word order, POS tagging, parsing, semantic role labeling—in STS algorithms, evaluated on the Microsoft Research Paraphrase Corpus.... Read more
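
The findings above share a common pipeline: extract lexical, syntactic, and semantic features from a sentence pair and feed them to a supervised regressor. The sketch below illustrates that setup with two shallow lexical features and scikit-learn's SVR as an assumed stand-in for the models used in these systems; the feature set and toy training pairs are illustrative only and far simpler than those in the cited work.

# Minimal sketch of a feature-based supervised STS regressor.
# The two features (Jaccard word overlap, length ratio) and the toy
# training pairs are assumptions for illustration; the cited systems use
# much richer lexical, syntactic, and semantic feature sets.
from sklearn.svm import SVR

def pair_features(a: str, b: str):
    tok_a, tok_b = set(a.lower().split()), set(b.lower().split())
    overlap = len(tok_a & tok_b) / max(len(tok_a | tok_b), 1)
    len_ratio = min(len(tok_a), len(tok_b)) / max(len(tok_a), len(tok_b), 1)
    return [overlap, len_ratio]

# Toy sentence pairs with gold similarity scores on a SemEval-style 0-5 scale.
train_pairs = [
    ("a man is playing a guitar", "a man plays the guitar", 4.8),
    ("a dog runs in the park", "a cat sleeps on the sofa", 0.5),
    ("children are playing football", "kids play soccer outside", 3.9),
]
X = [pair_features(a, b) for a, b, _ in train_pairs]
y = [score for _, _, score in train_pairs]

model = SVR(kernel="rbf").fit(X, y)
print(model.predict([pair_features("a man is playing music", "a man plays the guitar")]))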

3. What role do lexico-syntactic pattern-based corpus methods play in capturing semantic similarity in short texts without reliance on fine-grained semantic resources?

This theme examines approaches that extract semantic similarity via lexico-syntactic patterns from large corpora instead of curated semantic lexical resources like WordNet, aiming to overcome coverage limitations and resource constraints. It focuses on pattern-based methods employing finite-state transducers and corpus mining to capture semantic relations robustly and their utility in short-text similarity and relation extraction.

Key finding: Introduces PatternSim, a novel corpus-based semantic similarity measure leveraging 18 hand-crafted lexico-syntactic patterns encoded as finite-state transducers applied to massive corpora (WACYPEDIA, UKWAC). Without relying... Read more
Key finding: Applies various similarity matching methods—ranging from simple word overlap to dependency graph matching and feature-based vector similarity incorporating lexical, syntactic, and semantic features—for multiple-choice... Read more
Key finding: Develops a system originally designed for textual entailment that uses multiple WordNet-based word-to-word similarity measures aggregated at sentence level to assess semantic textual similarity. Achieves competitive Pearson... Read more
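
As a rough illustration of the WordNet-based word-to-word measures aggregated at sentence level in the last finding, the sketch below uses NLTK's WordNet interface with path similarity and a greedy best-match average. The choice of path similarity and the simple symmetric averaging are assumptions, not the exact measures used in the cited system.

# Minimal sketch: aggregate WordNet word-to-word similarity to sentence
# level. Path similarity and the "best match, then average" aggregation
# are assumptions for illustration. Requires: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def word_similarity(w1: str, w2: str) -> float:
    # Best path similarity over all synset pairs of the two words.
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def sentence_similarity(a: str, b: str) -> float:
    tok_a, tok_b = a.lower().split(), b.lower().split()
    if not tok_a or not tok_b:
        return 0.0
    # Average each word's best match in the other sentence, in both
    # directions, to keep the measure symmetric.
    ab = sum(max(word_similarity(w, v) for v in tok_b) for w in tok_a) / len(tok_a)
    ba = sum(max(word_similarity(w, v) for v in tok_a) for w in tok_b) / len(tok_b)
    return (ab + ba) / 2

print(sentence_similarity("a dog barks loudly", "a puppy is barking"))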

All papers in Short-Text Semantic Similarity

Semantic Textual Similarity (STS) algorithms play a key role in Natural Language Processing (NLP) studies, since they can support various NLP tasks such as Text Summarization and Information Retrieval. Although we found several STS... more
This study presents a malware classification system designed to classify malicious processes at run-time on production hosts. The system monitors process-level system call activity and uses information extracted from system call traces as... more
We analyze methods for selecting topics in news articles to explain stock returns. We find, through empirical and theoretical results, that supervised Latent Dirichlet Allocation (sLDA) implemented through Gibbs sampling in a stochastic... more
Extracting knowledge from unstructured text and then classifying it is gaining importance after the data explosion on the web. The traditional text classification approaches are becoming ubiquitous, but the hybrid of semantic knowledge... more
Automatic short answer grading is a significant problem in E-assessment. Several models have been proposed to deal with it. Evaluation and comparison of such solutions require the availability of datasets with manual examples. In this paper,... more
Technical writing in professional environments, such as user manual authoring, requires the use of uniform language. Nonuniform language detection is a novel task, which aims to guarantee consistency in technical writing by detecting... more
Semantic Textual Similarity (STS) aims at computing the proximity of meaning transmitted by two sentences. In 2016, the ASSIN shared task targeted STS in Portuguese and released training and test collections. This paper describes the... more
A significant portion of the world's text is tagged by readers on social bookmarking websites. Credit attribution is an inherent problem in these corpora because most pages have multiple tags, but the tags do not always apply with equal... more
Electronic health records (EHRs) contain important clinical information about patients. Some of these data are in the form of free text and require preprocessing to be usable in automated systems. Efficient and effective use of... more
Conventional schemes for document classification need labeled data to build consistent and precise classifiers. On the other hand, labeled data are rarely available, and normally too expensive to obtain. Provided a learning task for which... more
Web page recommendations are generated using the navigational history from web server log files. The Semantic Variable Length Markov Chain Model (SVLMC) is a web page recommendation system used to generate recommendations by combining a... more
Detecting the semantic coherence of a document is a challenging task and has several applications such as in text segmentation and categorization. This paper is an attempt to distinguish between a 'semantically coherent' true document and... more
Text mining to a great extent depends on the various text preprocessing techniques. The preprocessing methods and tools which are used to prepare texts for further mining can be divided into those which are and those which are not... more
Word embedding methods represent words as vectors in a space that is structured using word co-occurrences, so that words with close meanings are close in this space. These vectors are then provided as input to automatic systems to... more
ParsiPardaz Toolkit (Persian Language Processing Toolkit), which is introduced in this paper, is a comprehensive suite of Persian language processing tools, providing many computational linguistic applications. This system can process and... more
When relevance feedback, one of the most popular information retrieval models, is used in an information retrieval system, related words are extracted based on the first retrieval result. These words are then added to the... more
We present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Existing... more
This paper presents the Serbian datasets developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development-AVANTES, intended for the study of Cross-Level Semantic Similarity (CLSS). CLSS measures the... more
This paper presents an overview of the open access datasets in Serbian that have been manually annotated for the tasks of semantic textual similarity and short-text sentiment classification. In addition, it describes several kinds of... more
Understanding and detecting the intended meaning in social media is challenging because social media messages contain a variety of noise and chaos that is irrelevant to the themes of interest. For example, conventional supervised... more
The bag-of-words representation of documents is often unsatisfactory as it ignores relationships between important terms that do not co-occur literally. Improvements might be achieved by expanding the vocabulary with other relevant words,... more
Classifying tweets is an intrinsically hard task, as tweets are short messages, which makes traditional bag-of-words approaches inefficient. In fact, bag-of-words approaches ignore relationships between important terms that do not... more
Topic modeling is a technique for reducing the dimensionality of large corpora of text. Latent Dirichlet allocation (LDA), the most prevalent form of topic modeling, improved upon earlier methods by introducing Bayesian iterative updates,... more
Web behaviour analysis of a collective user has provided a powerful means for studying the collective user interests on the Internet. However, the existing research merely analyses the behaviour of a single user who accesses multiple... more
This paper describes ASAPPpy – a framework fully-developed in Python for computing Semantic Textual Similarity (STS) between Portuguese texts – and its participation in the ASSIN 2 shared task on this topic. ASAPPpy follows other versions... more
Topic models provide insights into document collections, and their supervised extensions also capture associated document-level metadata such as sentiment. However, inferring such models from data is often slow and cannot scale to big... more
Supervised models of NLP rely on large collections of text which closely resemble the intended testing setting. Unfortunately matching text is often not available in sufficient quantity, and moreover, within any domain of text, data is... more
The semantics derived from textual data provide representations for Machine Learning algorithms. These representations are an interpretable form of high-dimensional sparse matrices that are given as input to the machine learning... more
Conventional schemes to document classification need labeled data to build consistent and precise classifiers. On the other hand, labeled data are rarely available, and normally too expensive to obtain. Provided a learning task for which... more
Inferring locations from user texts on social media platforms is a non-trivial and challenging problem relating to public safety. We propose a novel non-uniform grid-based approach for location inference from Twitter messages using... more
Topic modelling is the new revolution in text mining. It is a statistical technique for revealing the underlying semantic structure in large collections of documents. After analysing approximately 300 research articles on topic modeling, a... more
The use of semantic models is relevant in automated learning systems, in solving certain tasks, such as: extracting knowledge from texts, information retrieval, abstracting, checking the correctness of vocabulary terms and definitions,... more
The histogram method is a powerful non-parametric approach for estimating the probability density function of a continuous variable. But the construction of a histogram, compared to the parametric approaches, demands a large number of... more
The structure of membrane proteins is considerably harder to determine experimentally than that of soluble proteins. To develop a reliable model for protein structure prediction, it must be optimized on as large a... more
Topic modeling has emerged as a popular learning technique not only in mining text representations, but also in modeling authors’ interests and influence, as well as predicting linkage among documents or authors. However, few existing... more
Twitter acts as one of the most important media for communication and information sharing. As tweets do not provide sufficient word occurrences, i.e. due to the 140-character limit, classification methods that use traditional approaches like... more
One of the key problems encountered when using text classification learning algorithms is that they require a huge number of labelled examples to learn accurately. The objective of this paper is to propose a novel method of topic... more
Sentiment analysis predicts a one-dimensional quantity describing the positive or negative emotion of an author. Mood analysis extends the one-dimensional sentiment response to a multi-dimensional quantity, describing a diverse set of... more
Huge amounts of unstructured text data are obtained daily from various sources such as emails, tweets, social media posts, customer comments, reviews, and reports in many different fields. Unstructured text data can be analyzed to... more
Small-sample classification is a challenging problem in computer vision. In this work, we show how to efficiently and effectively utilize semantic information of the annotations to improve the performance of small-sample classification.... more
Capital investment in the automated analysis of electronic documents has increased rapidly since the growth of text categorization and classification. In recent times, various works have been done in the context of text mining and... more
This paper reports about results collected during the development of a scalable Information Retrieval system for near real-time analytics on social networks. More precisely, we present the end-user functionalities provided by the system,... more
Twitter Sentiment Analysis is the task of detecting opinions and sentiments in tweets using different algorithms. In our research work, we conducted a study to analyze and compare different Machine Learning Algorithms (MLAs) for the... more
This paper discusses the classification process for medical data. In this paper, we use the data from ACM KDDCup 2008 to demonstrate our classification process based on latent topic discovery. In this data set, the target set and outliers... more
Oftentimes, the question "what is this poem about?" has no trivial answer, regardless of length, style, author, or context in which the poem is found. We propose a simple system of multi-label classification of poems based on... more