
Distributed Representations of Words

7 papers · 0 followers
About this topic
Distributed representations of words refer to a method in natural language processing where words are represented as dense, real-valued vectors in a continuous space. This approach captures semantic relationships and similarities between words, allowing for more effective modeling of language and improving tasks such as machine translation and sentiment analysis.
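As a minimal illustration of the idea, the short Python sketch below uses made-up three-dimensional vectors (real embeddings are learned from large corpora and typically have hundreds of dimensions) to show how closeness in the vector space is usually measured with cosine similarity:

    import numpy as np

    # Toy 3-dimensional "embeddings"; these values are illustrative only,
    # not vectors taken from any trained model.
    embeddings = {
        "king":  np.array([0.80, 0.65, 0.10]),
        "queen": np.array([0.78, 0.70, 0.15]),
        "apple": np.array([0.05, 0.20, 0.90]),
    }

    def cosine_similarity(a, b):
        """Cosine of the angle between two word vectors (1.0 = same direction)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
    print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower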

Key research themes

1. How can global co-occurrence modeling improve the semantic sub-structure of word embeddings compared to local context methods?

This research area investigates the design of word embedding models that leverage global word-word co-occurrence statistics rather than relying solely on local context windows. The goal is to produce word vector spaces with more meaningful linear structures that capture fine-grained syntactic and semantic regularities, thereby improving performance on tasks such as word analogy, similarity, and downstream applications like named entity recognition. Understanding the model properties that enable such global methods to outperform localized embedding techniques can inform more efficient and accurate representation learning.

Key finding: Proposes a weighted least squares model that trains efficiently on the nonzero elements of a word-word co-occurrence matrix, combining the benefits of matrix factorization and local context window methods. Demonstrates that training on global... Read more
Key finding: Compares word2vec and GloVe embeddings trained on over one million biomedical articles, observing that hyperparameter settings significantly affect semantic similarity and relatedness performance in the biomedical domain.... Read more
Key finding: Provides a comprehensive survey of the development of vector space representations, emphasizing the progression from count-based matrix factorization (e.g., Latent Semantic Analysis) to predictive models (e.g., word2vec,... Read more
Key finding: Introduces QVEC, an intrinsic evaluation aligning distributional word vector dimensions with linguistically interpretable dimensions derived from annotated lexical resources. Demonstrates that alignment to global semantic... Read more
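The weighted least-squares idea behind such global co-occurrence methods, of which GloVe is the best-known instance, can be sketched as follows; the toy co-occurrence counts, vector dimensionality, and weighting hyperparameters below are illustrative assumptions rather than settings taken from any of the papers above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy word-word co-occurrence counts X[i, j]; in practice these come from
    # scanning a large corpus, and only the nonzero entries are trained on.
    X = np.array([[0.0, 12.0, 3.0],
                  [12.0, 0.0, 1.0],
                  [3.0, 1.0, 0.0]])
    vocab_size, dim = X.shape[0], 5

    W = rng.normal(scale=0.1, size=(vocab_size, dim))        # word vectors
    W_tilde = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors
    b = np.zeros(vocab_size)        # word biases
    b_tilde = np.zeros(vocab_size)  # context biases

    def weight(x, x_max=100.0, alpha=0.75):
        """Weighting that damps very frequent pairs; zero counts are skipped."""
        return (x / x_max) ** alpha if x < x_max else 1.0

    def weighted_ls_loss():
        """Weighted least-squares fit of dot products to log co-occurrence counts."""
        loss = 0.0
        for i, j in zip(*np.nonzero(X)):          # only nonzero co-occurrences
            diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
            loss += weight(X[i, j]) * diff ** 2
        return loss

    print(weighted_ls_loss())

In a real implementation this loss is minimized by stochastic gradient updates over the nonzero entries, which is what makes training on global statistics computationally feasible.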

2. What methods exploit subword and auxiliary linguistic information to improve representations of rare and out-of-vocabulary words?

This area focuses on addressing the limitations of standard word embeddings for rare and out-of-vocabulary (OOV) words. Due to Zipfian distribution properties in natural language, many words are rare and receive poor, unstable embeddings when trained only on local contexts or even large corpora. Researchers develop models that incorporate subword (morphological or character-level) information or leverage auxiliary data sources (e.g., dictionary definitions) to generate embeddings on-the-fly or to better generalize to infrequent lexical items. These methods improve downstream task performance where domain-specific or low-frequency terms occur.

Key finding: Proposes a method that predicts embeddings of rare words dynamically using auxiliary data such as dictionary definitions or word spellings, trained end-to-end with downstream tasks. Demonstrates significant improvements in... Read more
Key finding: Introduces a character-based embedding generation model employing convolutional neural networks and highway networks to extract subword-level features and generalize pre-trained embeddings to OOV words. Evaluated across... Read more
Key finding: Shows empirically that the L2 norm (vector length) of word embeddings, combined with term frequency, can serve as a measure of word significance, as words consistently used in similar contexts tend to have longer vectors.... Read more
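One common way to realise the subword idea, in the spirit of character n-gram models such as fastText but heavily simplified, is to embed an unseen word by averaging vectors of its character n-grams; the randomly initialised n-gram table below is a stand-in for subword vectors that would normally be learned from a corpus:

    import numpy as np

    rng = np.random.default_rng(42)
    DIM = 8

    # Stand-in table of character n-gram vectors; in a trained model these are
    # learned jointly with the word vectors on a large corpus.
    ngram_vectors = {}

    def char_ngrams(word, n_min=3, n_max=5):
        """Character n-grams of a word, with boundary markers around it."""
        padded = f"<{word}>"
        return [padded[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(padded) - n + 1)]

    def vector_for(word):
        """Embed any word, including out-of-vocabulary ones, from its n-grams."""
        grams = char_ngrams(word)
        for g in grams:
            ngram_vectors.setdefault(g, rng.normal(scale=0.1, size=DIM))
        return np.mean([ngram_vectors[g] for g in grams], axis=0)

    # An unseen, domain-specific term still receives a vector:
    print(vector_for("angiogenesis").shape)  # (8,)

Because rare and out-of-vocabulary words share n-grams with frequent words, their vectors inherit information from better-trained parameters rather than being left at an unreliable, low-frequency estimate.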

3. How can combining multiple heterogeneous word embedding sources improve representation quality through meta-embedding?

This theme addresses the integration of diverse pre-trained word embeddings, which individually capture different semantic and syntactic aspects and vary in vocabulary coverage and dimensionality. By learning meta-embeddings that locally relate neighborhoods and linearly reconstruct embeddings from multiple sources, researchers achieve richer word representations that are sensitive to local semantic variations. Such meta-embedding approaches help overcome issues such as out-of-vocabulary words and improve downstream task accuracy by blending complementary information.

Key finding: Presents a locally linear meta-embedding technique that reconstructs each word’s embedding vector as a linear combination of its neighbors across multiple pre-trained embedding sources with possibly different dimensionalities... Read more
Key finding: Proposes combination strategies for heterogeneous word representations (e.g., LSA, LDA, VSM) to complement semantic coverage and enhance word similarity and relatedness measures. Experiments demonstrate that combined... Read more
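The neighbourhood-reconstruction step of such a locally linear approach can be sketched as a small least-squares problem per word: express its vector as a weighted combination of its nearest neighbours within one source embedding. The random embedding matrix and neighbourhood size below are assumptions made purely for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    vocab, dim, k = 50, 16, 5

    # Stand-in for one pre-trained source embedding (rows are word vectors).
    source = rng.normal(size=(vocab, dim))

    def reconstruction_weights(word_id):
        """Find the k nearest neighbours of a word and the least-squares weights
        that best reconstruct its vector as a linear combination of theirs."""
        x = source[word_id]
        dists = np.linalg.norm(source - x, axis=1)
        dists[word_id] = np.inf                      # exclude the word itself
        neighbours = np.argsort(dists)[:k]
        N = source[neighbours]                        # shape (k, dim)
        w, *_ = np.linalg.lstsq(N.T, x, rcond=None)   # solve N.T @ w ≈ x
        return neighbours, w

    neighbours, w = reconstruction_weights(3)
    print(neighbours, np.round(w, 3))

In the full meta-embedding setting, neighbourhood weights of this kind computed separately in each source space are then used to learn a single space that preserves the local reconstructions across sources, even when the sources differ in dimensionality.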

All papers in Distributed Representations of Words

In this paper, our main contributions are showing that embeddings from relatively small corpora can outperform those from larger corpora, and making the new Swedish analogy test set publicly available. To achieve a good network performance in... more
The under-resourced domain problem is significant in automatic speech recognition, especially for smaller languages such as Hungarian, or in fields where data is often confidential, such as finance and medicine. We introduce a method using word... more
Commonsense can be vital in some applications like Natural Language Understanding (NLU), where it is often required to resolve ambiguity arising from implicit knowledge and underspecification. In spite of the remarkable success of neural... more
In this paper, we propose to learn word embeddings based on the recent fixed-size ordinally forgetting encoding (FOFE) method, which can almost uniquely encode any variable-length sequence into a fixed-size representation. We use FOFE to... more
The availability of different pre-trained semantic models has enabled the quick development of machine learning components for downstream applications. However, even if texts are abundant for low-resource languages, there are very few... more
Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval and question... more
The paper presents an evaluation of word embedding models in clustering of texts in the Polish language. The authors verified six different embedding models, starting from the widely used word2vec, across fastText with character n-grams... more
Distributional semantic models represent the meaning of words as vectors. We introduce a selection method to learn a vector space in which each dimension corresponds to a natural word. The selection method starts from the most frequent words and... more
Urdu is a widely spoken language in South Asia. Though a considerable body of literature exists for the Urdu language, the data is still not enough to naturally process the language with NLP techniques. Very efficient language models exist for the... more
Eliciting semantic similarity between concepts remains a challenging task. Recent approaches based on embedding vectors have gained in popularity, as they have proven able to capture semantic relationships efficiently. The underlying idea is that... more
Word embeddings have been found to provide meaningful representations for words in an efficient way; therefore, they have become common in Natural Language Processing systems. In this paper, we evaluated different word embedding models... more
Distributed word embeddings have shown superior performances in numerous Natural Language Processing (NLP) tasks. However, their performances vary significantly across different tasks, implying that the word embeddings learnt by those... more
We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time specific word representations from BERT embeddings. The results of our experiments in the domain specific... more
This paper proposes an alternative to the Paragraph Vector algorithm, generating fixed-length vectors of human-readable features for natural language corpora. It extends word2vec while retaining its other advantages like speed and accuracy,... more
Aoccdrnig to a reasrech at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny itmopnrat tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll... more
In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of probability theory, SG is considered an implicit probability distribution... more
Neural network based word embeddings, such as Word2Vec and GloVe, are purely data driven in that they capture the distributional information about words from the training corpus. Past works have attempted to improve these embeddings by... more
An interesting method of evaluating word representations is by how much they reflect the semantic representations in the human brain. However, most, if not all, previous works only focus on small datasets and a single modality. In this... more
Recent word embedding techniques represent words in a continuous vector space, moving away from the atomic and sparse representations of the past. Each such technique can further create multiple varieties of embeddings based on different... more
In this work we analyze the performance of two of the most used word embedding algorithms, skip-gram and continuous bag of words, on the Italian language. These algorithms have many hyper-parameters that have to be carefully tuned in... more
Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, a large volume of scientific literature, clinical notes, and other structured and unstructured... more
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a... more
A variety of NLP applications use word2vec skip-gram, GloVe, and fastText word embeddings. These models learn two sets of embedding vectors, but most practitioners use only one of them, or alternatively an unweighted sum of both. This is... more
This paper introduces a framework for both semantic analysis and annotation, called Multilayered Semantic Frame Analysis (MSFA) of text, inspired by the Berkeley FrameNet approach to semantic analysis of natural language text [8, 13].... more
Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input... more
The vast information space produced by the plethora of social media platforms in general, and microblogging in particular, has spawned a slew of new applications and prompted the rise and expansion of sentiment analysis research... more
Role-denoting nouns are more ready for figurative uses
In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of... more
In this paper we report an unsupervised method aimed to identify whether an attribute is discriminative for two words (which are treated as concepts, in our particular case). To this end, we use geometrically inspired vector operations... more
In this paper we examine the pattern of inward FDI at the disaggregated industry level (NIC 3- digit), and test for the industry-specific characteristics that have been significant in attracting foreign investment in India during 2000-10.... more
We propose a new word embedding model, called SPhrase, that incorporates supervised phrase information. Our method modifies traditional word embeddings by ensuring that all target words in a phrase have exactly the same context. We... more
Language, like other natural sequences, exhibits statistical dependencies at a wide range of scales. However, many statistical learning models applied to language impose a sampling scale while extracting statistical... more
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning... more
Word embedding is the process of representing words from a corpus of text as real number vectors. These vectors are often derived from frequency statistics from the source corpus. In the GloVe model as proposed by Pennington et al., these... more
Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups... more
For robots to interact with natural language and handle real-world situations, some ability to perform analogical and associational reasoning is desirable. Consider commands like "Fetch the ball" vs. "Fetch the wagon": the robot needs to... more
Recent work on segmentation-free word embedding (sembei) developed a new word embedding pipeline for unsegmented languages that avoids segmentation as a preprocessing step. However, too many noisy n-grams existing in the embedding... more
This paper illustrates relevant details of an on-going semantic-role annotation work based on a framework called MULTILAYERED/DIMENSIONAL SEMANTIC FRAME ANALYSIS (MSFA for short) (Kuroda and Isahara, 2005b), which is inspired by, if not... more
For many natural language processing applications, estimating similarity and relatedness between words are key tasks that serve as the basis for classification and generalization. Currently, vector semantic models (VSM) have become a... more
In this paper, we propose LexVec, a new method for generating distributed word representations that uses low-rank, weighted factorization of the Positive Point-wise Mutual Information matrix via stochastic gradient descent, employing a... more
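For background on the kind of matrix this entry refers to, a positive pointwise mutual information (PPMI) matrix can be computed from raw co-occurrence counts as in the sketch below; the toy counts are an assumption for illustration, not data from the paper, and the weighted stochastic factorization that LexVec applies to this matrix is not reproduced here:

    import numpy as np

    # Toy word-word co-occurrence counts (rows = target words, cols = context words).
    counts = np.array([[0.0, 4.0, 1.0],
                       [4.0, 0.0, 2.0],
                       [1.0, 2.0, 0.0]])

    def ppmi(counts):
        """Positive pointwise mutual information: max(0, log p(w,c) / (p(w) p(c)))."""
        total = counts.sum()
        p_wc = counts / total
        p_w = counts.sum(axis=1, keepdims=True) / total
        p_c = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
        pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute nothing
        return np.maximum(pmi, 0.0)

    print(np.round(ppmi(counts), 3))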
Word embeddings are increasingly attracting the attention of researchers dealing with semantic similarity and analogy tasks. However, finding the optimal hyper-parameters remains an important challenge due to the resulting impact on the... more
Word embeddings have found their way into a wide range of natural language processing tasks including those in the biomedical domain. While these vector representations successfully capture semantic and syntactic word relations, hidden... more