
Distributed Representations of Words

7 papers · 0 followers
About this topic
Distributed representations of words refer to a method in natural language processing where words are represented as dense, real-valued vectors in a continuous space. This approach captures semantic relationships and similarities between words, allowing for more effective modeling of language and improving tasks such as machine translation and sentiment analysis.
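As a minimal illustration of the idea, the short Python sketch below uses made-up three-dimensional vectors (real embeddings are learned from large corpora and typically have hundreds of dimensions) to show how closeness in the vector space is usually measured with cosine similarity:

    import numpy as np

    # Toy 3-dimensional "embeddings"; these values are illustrative only,
    # not vectors taken from any trained model.
    embeddings = {
        "king":  np.array([0.80, 0.65, 0.10]),
        "queen": np.array([0.78, 0.70, 0.15]),
        "apple": np.array([0.05, 0.20, 0.90]),
    }

    def cosine_similarity(a, b):
        """Cosine of the angle between two word vectors (1.0 = same direction)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
    print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower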

Key research themes

1. How can global co-occurrence modeling improve the semantic sub-structure of word embeddings compared to local context methods?

This research area investigates the design of word embedding models that leverage global word-word co-occurrence statistics rather than relying solely on local context windows. The goal is to produce word vector spaces with more meaningful linear structures that capture fine-grained syntactic and semantic regularities, thereby improving performance on tasks such as word analogy, similarity, and downstream applications like named entity recognition. Understanding the model properties that enable such global methods to outperform localized embedding techniques can inform more efficient and accurate representation learning.

Key finding: Proposes a weighted least squares model that trains efficiently on the nonzero elements of a word-word co-occurrence matrix, combining the benefits of matrix factorization and local context window methods. Demonstrates that training on global... Read more
Key finding: Compares word2vec and GloVe embeddings trained on over one million biomedical articles, observing that hyperparameter settings significantly affect semantic similarity and relatedness performance in the biomedical domain.... Read more
Key finding: Provides a comprehensive survey of the development of vector space representations, emphasizing the progression from count-based matrix factorization (e.g., Latent Semantic Analysis) to predictive models (e.g., word2vec,... Read more
Key finding: Introduces QVEC, an intrinsic evaluation aligning distributional word vector dimensions with linguistically interpretable dimensions derived from annotated lexical resources. Demonstrates that alignment to global semantic... Read more
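The weighted least-squares idea behind such global co-occurrence methods, of which GloVe is the best-known instance, can be sketched as follows; the toy co-occurrence counts, vector dimensionality, and weighting hyperparameters below are illustrative assumptions rather than settings taken from any of the papers above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy word-word co-occurrence counts X[i, j]; in practice these come from
    # scanning a large corpus, and only the nonzero entries are trained on.
    X = np.array([[0.0, 12.0, 3.0],
                  [12.0, 0.0, 1.0],
                  [3.0, 1.0, 0.0]])
    vocab_size, dim = X.shape[0], 5

    W = rng.normal(scale=0.1, size=(vocab_size, dim))        # word vectors
    W_tilde = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors
    b = np.zeros(vocab_size)        # word biases
    b_tilde = np.zeros(vocab_size)  # context biases

    def weight(x, x_max=100.0, alpha=0.75):
        """Weighting that damps very frequent pairs; zero counts are skipped."""
        return (x / x_max) ** alpha if x < x_max else 1.0

    def weighted_ls_loss():
        """Weighted least-squares fit of dot products to log co-occurrence counts."""
        loss = 0.0
        for i, j in zip(*np.nonzero(X)):          # only nonzero co-occurrences
            diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
            loss += weight(X[i, j]) * diff ** 2
        return loss

    print(weighted_ls_loss())

In a real implementation this loss is minimized by stochastic gradient updates over the nonzero entries, which is what makes training on global statistics computationally feasible.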

2. What methods exploit subword and auxiliary linguistic information to improve representations of rare and out-of-vocabulary words?

This area focuses on addressing the limitations of standard word embeddings for rare and out-of-vocabulary (OOV) words. Due to Zipfian distribution properties in natural language, many words are rare and receive poor, unstable embeddings when trained only on local contexts or even large corpora. Researchers develop models that incorporate subword (morphological or character-level) information or leverage auxiliary data sources (e.g., dictionary definitions) to generate embeddings on-the-fly or to better generalize to infrequent lexical items. These methods improve downstream task performance where domain-specific or low-frequency terms occur.

Key finding: Proposes a method that predicts embeddings of rare words dynamically using auxiliary data such as dictionary definitions or word spellings, trained end-to-end with downstream tasks. Demonstrates significant improvements in... Read more
Key finding: Introduces a character-based embedding generation model employing convolutional neural networks and highway networks to extract subword-level features and generalize pre-trained embeddings to OOV words. Evaluated across... Read more
Key finding: Shows empirically that the L2 norm (vector length) of word embeddings, combined with term frequency, can serve as a measure of word significance, as words consistently used in similar contexts tend to have longer vectors.... Read more
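One common way to realise the subword idea, in the spirit of character n-gram models such as fastText but heavily simplified, is to embed an unseen word by averaging vectors of its character n-grams; the randomly initialised n-gram table below is a stand-in for subword vectors that would normally be learned from a corpus:

    import numpy as np

    rng = np.random.default_rng(42)
    DIM = 8

    # Stand-in table of character n-gram vectors; in a trained model these are
    # learned jointly with the word vectors on a large corpus.
    ngram_vectors = {}

    def char_ngrams(word, n_min=3, n_max=5):
        """Character n-grams of a word, with boundary markers around it."""
        padded = f"<{word}>"
        return [padded[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(padded) - n + 1)]

    def vector_for(word):
        """Embed any word, including out-of-vocabulary ones, from its n-grams."""
        grams = char_ngrams(word)
        for g in grams:
            ngram_vectors.setdefault(g, rng.normal(scale=0.1, size=DIM))
        return np.mean([ngram_vectors[g] for g in grams], axis=0)

    # An unseen, domain-specific term still receives a vector:
    print(vector_for("angiogenesis").shape)  # (8,)

Because rare and out-of-vocabulary words share n-grams with frequent words, their vectors inherit information from better-trained parameters rather than being left at an unreliable, low-frequency estimate.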

3. How can combining multiple heterogeneous word embedding sources improve representation quality through meta-embedding?

This theme addresses the integration of diverse pre-trained word embeddings, which individually capture different semantic and syntactic aspects and vary in vocabulary coverage and dimensionality. By learning meta-embeddings that locally relate neighborhoods and linearly reconstruct embeddings from multiple sources, researchers achieve richer word representations that are sensitive to local semantic variations. Such meta-embedding approaches help overcome issues such as out-of-vocabulary words and improve downstream task accuracy by blending complementary information.

Key finding: Presents a locally linear meta-embedding technique that reconstructs each word’s embedding vector as a linear combination of its neighbors across multiple pre-trained embedding sources with possibly different dimensionalities... Read more
Key finding: Proposes combination strategies for heterogeneous word representations (e.g., LSA, LDA, VSM) to complement semantic coverage and enhance word similarity and relatedness measures. Experiments demonstrate that combined... Read more
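The neighbourhood-reconstruction step of such a locally linear approach can be sketched as a small least-squares problem per word: express its vector as a weighted combination of its nearest neighbours within one source embedding. The random embedding matrix and neighbourhood size below are assumptions made purely for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    vocab, dim, k = 50, 16, 5

    # Stand-in for one pre-trained source embedding (rows are word vectors).
    source = rng.normal(size=(vocab, dim))

    def reconstruction_weights(word_id):
        """Find the k nearest neighbours of a word and the least-squares weights
        that best reconstruct its vector as a linear combination of theirs."""
        x = source[word_id]
        dists = np.linalg.norm(source - x, axis=1)
        dists[word_id] = np.inf                      # exclude the word itself
        neighbours = np.argsort(dists)[:k]
        N = source[neighbours]                        # shape (k, dim)
        w, *_ = np.linalg.lstsq(N.T, x, rcond=None)   # solve N.T @ w ≈ x
        return neighbours, w

    neighbours, w = reconstruction_weights(3)
    print(neighbours, np.round(w, 3))

In the full meta-embedding setting, neighbourhood weights of this kind computed separately in each source space are then used to learn a single space that preserves the local reconstructions across sources, even when the sources differ in dimensionality.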

All papers in Distributed Representations of Words

In this paper, our main contributions are showing that embeddings from relatively small corpora can outperform those from larger corpora, and making the new Swedish analogy test set publicly available. To achieve a good network performance in... more
The under-resourced domain problem is significant in automatic speech recognition, especially for smaller languages such as Hungarian, or in fields where data is often confidential, such as finance and medicine. We introduce a method using word... more
Commonsense can be vital in some applications like Natural Language Understanding (NLU), where it is often required to resolve ambiguity arising from implicit knowledge and underspecification. In spite of the remarkable success of neural... more
In this paper, we propose to learn word embeddings based on the recent fixed-size ordinally forgetting encoding (FOFE) method, which can almost uniquely encode any variable-length sequence into a fixed-size representation. We use FOFE to... more
The availability of different pre-trained semantic models has enabled the quick development of machine learning components for downstream applications. However, even if texts are abundant for low-resource languages, there are very few... more
Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval and question... more
The paper presents an evaluation of word embedding models in clustering of texts in the Polish language. The authors verified six different embedding models, starting from the widely used word2vec, across fastText with character n-grams... more
Distributional semantic models represent the meaning of words as vectors. We introduce a selection method to learn a vector space in which each dimension corresponds to a natural word. The selection method starts from the most frequent words and... more
Urdu is a widely spoken language in South Asia. Though a considerable body of literature exists for the Urdu language, the data is still not enough to naturally process the language with NLP techniques. Very efficient language models exist for the... more
Eliciting semantic similarity between concepts remains a challenging task. Recent approaches based on embedding vectors have gained in popularity, as they have proven able to capture semantic relationships efficiently. The underlying idea is that... more
Word embeddings have been found to provide meaningful representations for words in an efficient way; therefore, they have become common in Natural Language Processing systems. In this paper, we evaluated different word embedding models... more
Distributed word embeddings have shown superior performances in numerous Natural Language Processing (NLP) tasks. However, their performances vary significantly across different tasks, implying that the word embeddings learnt by those... more
We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time specific word representations from BERT embeddings. The results of our experiments in the domain specific... more
This paper proposes an alternative to the Paragraph Vector algorithm, generating fixed-length vectors of human-readable features for natural language corpora. It extends word2vec while retaining its other advantages like speed and accuracy,... more
Aoccdrnig to a reasrech at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny itmopnrat tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll... more
In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of probability theory, SG is considered an implicit probability distribution... more
Neural network based word embeddings, such as Word2Vec and GloVe, are purely data driven in that they capture the distributional information about words from the training corpus. Past works have attempted to improve these embeddings by... more
An interesting method of evaluating word representations is by how much they reflect the semantic representations in the human brain. However, most, if not all, previous works only focus on small datasets and a single modality. In this... more
Recent word embedding techniques represent words in a continuous vector space, moving away from the atomic and sparse representations of the past. Each such technique can further create multiple varieties of embeddings based on different... more
In this work we analyze the performance of two of the most used word embedding algorithms, skip-gram and continuous bag of words, on the Italian language. These algorithms have many hyper-parameters that have to be carefully tuned in... more
Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, a large volume of scientific literature, clinical notes, and other structured and unstructured... more
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a... more
A variety of NLP applications use word2vec skip-gram, GloVe, and fastText word embeddings. These models learn two sets of embedding vectors, but most practitioners use only one of them, or alternatively an unweighted sum of both. This is... more
This paper introduces a framework for both semantic analysis and annotation, called Multilayered Semantic Frame Analysis (MSFA) of text, inspired by the Berkeley FrameNet approach to semantic analysis of natural language text [8, 13].... more
Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input... more
The vast information space produced by the plethora of social media platforms in general, and microblogging in particular, has spawned a slew of new applications and prompted the rise and expansion of sentiment analysis research... more
Role-denoting nouns are more ready for figurative uses
In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of... more
In this paper we report an unsupervised method aimed to identify whether an attribute is discriminative for two words (which are treated as concepts, in our particular case). To this end, we use geometrically inspired vector operations... more
In this paper we examine the pattern of inward FDI at the disaggregated industry level (NIC 3- digit), and test for the industry-specific characteristics that have been significant in attracting foreign investment in India during 2000-10.... more
We propose a new word embedding model, called SPhrase, that incorporates supervised phrase information. Our method modifies traditional word embeddings by ensuring that all target words in a phrase have exactly the same context. We... more
Language, like other natural sequences, exhibits statistical dependencies at a wide range of scales. However, many statistical learning models applied to language impose a sampling scale while extracting statistical... more
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning... more
Word embedding is the process of representing words from a corpus of text as real number vectors. These vectors are often derived from frequency statistics from the source corpus. In the GloVe model as proposed by Pennington et al., these... more
Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups... more
For robots to interact with natural language and handle real-world situations, some ability to perform analogical and associational reasoning is desirable. Consider commands like "Fetch the ball" vs. "Fetch the wagon": the robot needs to... more
Recent work on segmentation-free word embedding (sembei) developed a new word embedding pipeline for unsegmented languages that avoids segmentation as a preprocessing step. However, too many noisy n-grams existing in the embedding... more
This paper illustrates relevant details of an on-going semantic-role annotation work based on a framework called MULTILAYERED/DIMENSIONAL SEMANTIC FRAME ANALYSIS (MSFA for short) (Kuroda and Isahara, 2005b), which is inspired by, if not... more
For many natural language processing applications, estimating similarity and relatedness between words are key tasks that serve as the basis for classification and generalization. Currently, vector semantic models (VSM) have become a... more
In this paper, we propose LexVec, a new method for generating distributed word representations that uses low-rank, weighted factorization of the Positive Point-wise Mutual Information matrix via stochastic gradient descent, employing a... more
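For background on the kind of matrix this entry refers to, a positive pointwise mutual information (PPMI) matrix can be computed from raw co-occurrence counts as in the sketch below; the toy counts are an assumption for illustration, not data from the paper, and the weighted stochastic factorization that LexVec applies to this matrix is not reproduced here:

    import numpy as np

    # Toy word-word co-occurrence counts (rows = target words, cols = context words).
    counts = np.array([[0.0, 4.0, 1.0],
                       [4.0, 0.0, 2.0],
                       [1.0, 2.0, 0.0]])

    def ppmi(counts):
        """Positive pointwise mutual information: max(0, log p(w,c) / (p(w) p(c)))."""
        total = counts.sum()
        p_wc = counts / total
        p_w = counts.sum(axis=1, keepdims=True) / total
        p_c = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
        pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute nothing
        return np.maximum(pmi, 0.0)

    print(np.round(ppmi(counts), 3))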
Word embeddings are increasingly attracting the attention of researchers dealing with semantic similarity and analogy tasks. However, finding the optimal hyper-parameters remains an important challenge due to the resulting impact on the... more
Word embeddings have found their way into a wide range of natural language processing tasks including those in the biomedical domain. While these vector representations successfully capture semantic and syntactic word relations, hidden... more