
Andrey Kutuzov
Current position: PhD candidate at Research Group for Language Technology, University of Oslo, Norway
E-mail: andreku AT ifi.uio.no
Supervisors: Erik Velldal and Lilja Øvrelid
Papers by Andrey Kutuzov
The service will be updated with new data yearly.
In this paper, we present vec2graph: a ready-to-use Python 3 library for visualizing vector representations (for example, word embeddings) as dynamic, interactive graphs. It is aimed at users with beginner-level knowledge of software development and can be used to easily produce visualizations suitable for the Web. We describe the key ideas behind vec2graph, its hyperparameters, and its integration into existing word embedding frameworks.
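vec2graph's own API is not reproduced here, but the core idea it implements, turning a word's nearest embedding neighbours into nodes and weighted edges, can be sketched in a few lines of Python. The model file name, similarity threshold, and function name below are illustrative assumptions, not part of the library:

```python
# A minimal sketch of the idea behind graph visualization of embeddings:
# take a target word, pull its nearest neighbours from a trained model,
# and emit nodes and weighted edges for a front-end renderer to draw.
from gensim.models import KeyedVectors

# "model.bin" is a placeholder for any word2vec-format embedding file.
model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

def neighbour_graph(word, topn=10, threshold=0.5):
    """Return (nodes, edges) for `word` and its nearest neighbours."""
    nodes, edges = [word], []
    for neighbour, sim in model.most_similar(word, topn=topn):
        if sim >= threshold:  # keep only sufficiently similar neighbours
            nodes.append(neighbour)
            edges.append((word, neighbour, round(sim, 3)))
    return nodes, edges

print(neighbour_graph("computer"))
```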
The introduction of the test sets allowed us to evaluate several well-established algorithms for semantic shift detection (cast as a classification problem), most of which have never been tested on Russian material. All of these algorithms use distributional word embedding models trained on the corresponding in-domain corpora. The resulting scores provide solid comparison baselines for future studies tackling similar tasks. We publish the datasets, code, and trained models in order to facilitate further research into automatically detecting temporal semantic shifts for Russian words across time periods of different granularities.
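The paper's exact algorithms are not reproduced here, but one widely used baseline in this family compares a word's nearest-neighbour sets across models trained on different periods (second-order similarity), since separately trained vector spaces are not directly comparable. A minimal sketch, assuming two pre-trained gensim models with placeholder file names:

```python
# Hedged sketch of a neighbour-overlap shift-detection baseline:
# the more a word's nearest neighbours change between two periods,
# the stronger the evidence for a semantic shift.
from gensim.models import KeyedVectors

# Placeholder file names for diachronic in-domain models.
model_old = KeyedVectors.load("embeddings_2000.kv")
model_new = KeyedVectors.load("embeddings_2014.kv")

def jaccard_shift(word, topn=10):
    """1 - Jaccard overlap of neighbour sets; higher means more shift."""
    old = {w for w, _ in model_old.most_similar(word, topn=topn)}
    new = {w for w, _ in model_new.most_similar(word, topn=topn)}
    return 1.0 - len(old & new) / len(old | new)

print(jaccard_shift("вирус"))  # score for a word suspected of shifting
```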
In the course of preparing the web service, we evaluated several well-known techniques for representing and comparing documents: TF-IDF, LDA, and Paragraph Vector. On our comparatively small corpus, TF-IDF yielded the best results and was therefore chosen as the primary algorithm working under the hood of RusNLP.
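As a rough illustration of the winning scheme (not the service's actual code), the following sketch indexes a toy document collection with scikit-learn's TF-IDF vectorizer and ranks documents by cosine similarity; the `papers` list is invented placeholder data:

```python
# Minimal TF-IDF document-similarity sketch with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = [
    "word embeddings for russian language",
    "topic modelling of scientific abstracts",
    "neural machine translation experiments",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(papers)  # documents x terms

# Rank all papers by similarity to the first one.
sims = cosine_similarity(matrix[0], matrix).ravel()
ranked = sims.argsort()[::-1]
print(ranked)  # paper indices, most similar first (the query itself is rank 0)
```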
However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing.
In this paper, we survey the current state of academic research related to diachronic word embeddings and semantic shift detection. We start by discussing the notion of semantic shifts, and then continue with an overview of the existing methods for tracing such time-related shifts with word embedding models. We propose several axes along which these methods can be compared, and outline the main challenges facing this emerging subfield of NLP, as well as its prospects and possible applications.
The method we employed was deliberately naive: contexts of ambiguous words were represented as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups corresponding to the ambiguous words' senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction.
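A minimal sketch of this pipeline, assuming a placeholder pre-trained model and using k-means as one example of a mainstream clustering technique (not necessarily the algorithm used in the paper), with invented toy contexts:

```python
# Word sense induction sketch: average the word vectors of each context
# of an ambiguous word, then cluster the context vectors into senses.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# "model.bin" is a placeholder for an off-the-shelf embedding file.
model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

def context_vector(tokens):
    """Average the embeddings of all in-vocabulary tokens in a context."""
    vecs = [model[t] for t in tokens if t in model]
    return np.mean(vecs, axis=0)

contexts = [
    "the bank raised interest rates".split(),
    "we walked along the river bank".split(),
    "the bank approved the loan".split(),
]
X = np.vstack([context_vector(c) for c in contexts])

# Each cluster of context vectors corresponds to one induced sense.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g. [0, 1, 0]
```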
The proposed approach is evaluated on the task of predicting insurgent armed groups based on geographical locations. The gold-standard data for the time span 1994–2010 is extracted from the UCDP Armed Conflicts dataset. The results show that the method is feasible and outperforms the baselines, but also that important work still remains to be done.
Aside from the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain subtle differences in how the models handle the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, we describe the learning curves for both models, showing that the RNC is generally more robust as training material for this task.
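The kind of intrinsic evaluation discussed here can be reproduced with gensim's built-in word-pair evaluation; the model and dataset file names below are placeholders, not the paper's actual files:

```python
# Compare two embedding models on a human-rated similarity dataset
# by their Spearman correlation with the gold scores.
from gensim.models import KeyedVectors

rnc_model = KeyedVectors.load_word2vec_format("rnc.bin", binary=True)
web_model = KeyedVectors.load_word2vec_format("web.bin", binary=True)

for name, model in [("RNC", rnc_model), ("Web", web_model)]:
    # evaluate_word_pairs expects a tab-separated file: word1, word2, score.
    pearson, spearman, oov = model.evaluate_word_pairs("ru_simlex.tsv")
    print(f"{name}: Spearman={spearman[0]:.3f}, OOV={oov:.1f}%")
```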
In this research, we reveal cases of syntactic translationese: dissimilarities between the patterns of selected morphosyntactic and syntactic features (such as part of speech and sentence length) around sentence boundaries, observed in comparable monolingual corpora of learner-translated and non-translated texts in Russian.
To establish these syntactic differences, we resort to a machine learning approach as opposed to the usual statistical significance analyses. To this end, we employ models that predict unnatural sentence boundaries in translations and highlight the factors responsible for their 'foreignness'.
At the first stage of the experiment, we train a decision tree model to describe the contextual features of sentence boundaries in a reference corpus of Russian texts. At the second stage, we use the results of this multifactorial analysis as indicators of learner translators' choices that run counter to the regularities of the standard language variety. The predictors and their combinations are evaluated for their efficiency on this task. As a result, we are able to extract translated sentences whose structure is atypical of Russian texts produced without the constraints of the translation process and which can therefore be tentatively considered less fluent. These sentences represent cases of translationese.
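A toy sketch of this two-stage setup, with invented feature values standing in for the real morphosyntactic features; it is meant only to show the shape of the workflow, not the paper's actual feature set or data:

```python
# Fit a decision tree on contextual features of sentence boundaries in
# reference (non-translated) texts, then flag boundaries in translations
# that the tree considers unlikely.
from sklearn.tree import DecisionTreeClassifier

# Toy features per candidate boundary: [PoS code of last token,
# PoS code of next token, sentence length in tokens].
X_reference = [[2, 1, 18], [3, 1, 25], [2, 2, 12], [1, 1, 30]]
y_reference = [1, 1, 0, 1]  # 1 = genuine boundary in reference texts

tree = DecisionTreeClassifier(max_depth=3).fit(X_reference, y_reference)

# Translated-sentence boundaries with a low predicted probability are
# candidates for translationese.
X_translated = [[2, 2, 11], [3, 1, 24]]
probs = tree.predict_proba(X_translated)[:, 1]
print(probs)  # low values flag atypical, possibly 'foreign' boundaries
```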
Additionally, we report correlation values for the noun subsets related to different phonaesthemes, putatively represented by the initial characters of these nouns.
The best-performing method is used to cluster all adjective-noun bigrams in the Russian National Corpus. The results of this procedure are publicly available and can be used to build a Russian construction dictionary, accelerate theoretical studies of constructions, and facilitate the teaching of Russian as a foreign language.