
Andrey Kutuzov
Current position: PhD candidate at Research Group for Language Technology, University of Oslo, Norway
E-mail: andreku AT ifi.uio.no
Supervisors: Erik Velldal and Lilja Øvrelid
Papers by Andrey Kutuzov
The service will be updated with new data yearly.
In this paper, we present vec2graph: a ready-to-use Python 3 library for visualizing vector representations (for example, word embeddings) as dynamic, interactive graphs. It is aimed at users with beginner-level knowledge of software development and can be used to easily produce visualizations suitable for the Web. We describe the key ideas behind vec2graph, its hyperparameters, and its integration into existing word embedding frameworks.
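vec2graph's own API is not reproduced here, but the core idea it implements, turning a word's nearest embedding neighbours into nodes and weighted edges, can be sketched in a few lines of Python. The model file name, similarity threshold, and function name below are illustrative assumptions, not part of the library:

```python
# A minimal sketch of the idea behind graph visualization of embeddings:
# take a target word, pull its nearest neighbours from a trained model,
# and emit nodes and weighted edges for a front-end renderer to draw.
from gensim.models import KeyedVectors

# "model.bin" is a placeholder for any word2vec-format embedding file.
model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

def neighbour_graph(word, topn=10, threshold=0.5):
    """Return (nodes, edges) for `word` and its nearest neighbours."""
    nodes, edges = [word], []
    for neighbour, sim in model.most_similar(word, topn=topn):
        if sim >= threshold:  # keep only sufficiently similar neighbours
            nodes.append(neighbour)
            edges.append((word, neighbour, round(sim, 3)))
    return nodes, edges

print(neighbour_graph("computer"))
```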
The introduction of the test sets allowed us to evaluate several well-established algorithms for semantic shift detection (cast as a classification problem), most of which have never been tested on Russian material. All of these algorithms use distributional word embedding models trained on the corresponding in-domain corpora. The resulting scores provide solid comparison baselines for future studies tackling similar tasks. We publish the datasets, code, and trained models in order to facilitate further research into automatically detecting temporal semantic shifts for Russian words across time periods of different granularities.
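The paper's exact algorithms are not reproduced here, but one widely used baseline in this family compares a word's nearest-neighbour sets across models trained on different periods (second-order similarity), since separately trained vector spaces are not directly comparable. A minimal sketch, assuming two pre-trained gensim models with placeholder file names:

```python
# Hedged sketch of a neighbour-overlap shift-detection baseline:
# the more a word's nearest neighbours change between two periods,
# the stronger the evidence for a semantic shift.
from gensim.models import KeyedVectors

# Placeholder file names for diachronic in-domain models.
model_old = KeyedVectors.load("embeddings_2000.kv")
model_new = KeyedVectors.load("embeddings_2014.kv")

def jaccard_shift(word, topn=10):
    """1 - Jaccard overlap of neighbour sets; higher means more shift."""
    old = {w for w, _ in model_old.most_similar(word, topn=topn)}
    new = {w for w, _ in model_new.most_similar(word, topn=topn)}
    return 1.0 - len(old & new) / len(old | new)

print(jaccard_shift("вирус"))  # score for a word suspected of shifting
```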
In the course of preparing the web service, we evaluated several well-known techniques for representing and comparing documents: TF-IDF, LDA, and Paragraph Vector. On our comparatively small corpus, TF-IDF yielded the best results and was therefore chosen as the primary algorithm working under the hood of RusNLP.
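As a rough illustration of the winning scheme (not the service's actual code), the following sketch indexes a toy document collection with scikit-learn's TF-IDF vectorizer and ranks documents by cosine similarity; the `papers` list is invented placeholder data:

```python
# Minimal TF-IDF document-similarity sketch with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = [
    "word embeddings for russian language",
    "topic modelling of scientific abstracts",
    "neural machine translation experiments",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(papers)  # documents x terms

# Rank all papers by similarity to the first one.
sims = cosine_similarity(matrix[0], matrix).ravel()
ranked = sims.argsort()[::-1]
print(ranked)  # paper indices, most similar first (the query itself is rank 0)
```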
However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing.
In this paper, we survey the current state of academic research related to diachronic word embeddings and semantic shift detection. We start by discussing the notion of semantic shifts, and then continue with an overview of the existing methods for tracing such time-related shifts with word embedding models. We propose several axes along which these methods can be compared, and outline the main challenges facing this emerging subfield of NLP, as well as its prospects and possible applications.
The method we employed was deliberately naive: contexts of ambiguous words were represented as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups corresponding to the ambiguous words' senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction.
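A minimal sketch of this pipeline, assuming a placeholder pre-trained model and using k-means as one example of a mainstream clustering technique (not necessarily the algorithm used in the paper), with invented toy contexts:

```python
# Word sense induction sketch: average the word vectors of each context
# of an ambiguous word, then cluster the context vectors into senses.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# "model.bin" is a placeholder for an off-the-shelf embedding file.
model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

def context_vector(tokens):
    """Average the embeddings of all in-vocabulary tokens in a context."""
    vecs = [model[t] for t in tokens if t in model]
    return np.mean(vecs, axis=0)

contexts = [
    "the bank raised interest rates".split(),
    "we walked along the river bank".split(),
    "the bank approved the loan".split(),
]
X = np.vstack([context_vector(c) for c in contexts])

# Each cluster of context vectors corresponds to one induced sense.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g. [0, 1, 0]
```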
The proposed approach is evaluated on the task of predicting insurgent armed groups based on geographical locations. The gold-standard data for the time span 1994–2010 is extracted from the UCDP Armed Conflicts dataset. The results show that the method is feasible and outperforms the baselines, but also that important work still remains to be done.
Aside from the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain subtle differences in how the models handle the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, we describe the learning curves for both models, showing that the RNC is generally more robust as training material for this task.
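The kind of intrinsic evaluation discussed here can be reproduced with gensim's built-in word-pair evaluation; the model and dataset file names below are placeholders, not the paper's actual files:

```python
# Compare two embedding models on a human-rated similarity dataset
# by their Spearman correlation with the gold scores.
from gensim.models import KeyedVectors

rnc_model = KeyedVectors.load_word2vec_format("rnc.bin", binary=True)
web_model = KeyedVectors.load_word2vec_format("web.bin", binary=True)

for name, model in [("RNC", rnc_model), ("Web", web_model)]:
    # evaluate_word_pairs expects a tab-separated file: word1, word2, score.
    pearson, spearman, oov = model.evaluate_word_pairs("ru_simlex.tsv")
    print(f"{name}: Spearman={spearman[0]:.3f}, OOV={oov:.1f}%")
```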
In this research, we reveal cases of syntactic translationese: dissimilarities between the patterns of selected morphosyntactic and syntactic features (such as part of speech and sentence length) around sentence boundaries, observed in comparable monolingual corpora of learner-translated and non-translated texts in Russian.
To establish these syntactic differences, we resort to a machine learning approach as opposed to the usual statistical significance analyses. To this end, we employ models that predict unnatural sentence boundaries in translations and highlight the factors responsible for their 'foreignness'.
At the first stage of the experiment, we train a decision tree model to describe the contextual features of sentence boundaries in a reference corpus of Russian texts. At the second stage, we use the results of this multifactorial analysis as indicators of learner translators' choices that run counter to the regularities of the standard language variety. The predictors and their combinations are evaluated for their efficiency on this task. As a result, we are able to extract translated sentences whose structure is atypical of Russian texts produced without the constraints of the translation process and which can therefore be tentatively considered less fluent. These sentences represent cases of translationese.
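A toy sketch of this two-stage setup, with invented feature values standing in for the real morphosyntactic features; it is meant only to show the shape of the workflow, not the paper's actual feature set or data:

```python
# Fit a decision tree on contextual features of sentence boundaries in
# reference (non-translated) texts, then flag boundaries in translations
# that the tree considers unlikely.
from sklearn.tree import DecisionTreeClassifier

# Toy features per candidate boundary: [PoS code of last token,
# PoS code of next token, sentence length in tokens].
X_reference = [[2, 1, 18], [3, 1, 25], [2, 2, 12], [1, 1, 30]]
y_reference = [1, 1, 0, 1]  # 1 = genuine boundary in reference texts

tree = DecisionTreeClassifier(max_depth=3).fit(X_reference, y_reference)

# Translated-sentence boundaries with a low predicted probability are
# candidates for translationese.
X_translated = [[2, 2, 11], [3, 1, 24]]
probs = tree.predict_proba(X_translated)[:, 1]
print(probs)  # low values flag atypical, possibly 'foreign' boundaries
```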
Additionally, we report correlation values for the noun subsets related to different phonaesthemes, putatively represented by the initial characters of these nouns.
The best-performing method is used to cluster all adjective-noun bigrams in the Russian National Corpus. The results of this procedure are publicly available and can be used to build a Russian construction dictionary, accelerate theoretical studies of constructions, and facilitate the teaching of Russian as a foreign language.