Complexity Metric for Code-Mixed Social Media Text
Computación y Sistemas
https://doi.org/10.13053/CYS-21-4-2852…
Abstract
An evaluation metric is essential for measuring both the performance of a system and the complexity of its data. In this paper, we discuss how to determine the level of complexity of code-mixed social media texts, which are growing rapidly due to multilingual interference. In general, texts written in multiple languages are hard to comprehend and analyze. At the same time, in order to meet the demands of analysis, it is also necessary to determine the complexity of a particular document or text segment. Thus, in the present paper, we review the existing metrics for determining the code-mixing complexity of a corpus, their advantages and shortcomings, and propose several improvements to them. The new index better reflects the variety and complexity of a multilingual document. Moreover, the index can be applied to a sentence and seamlessly extended to a paragraph or an entire document. We have employed two existing code-mixed corpora to suit the requirements of our study.
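The improvements discussed in the paper build on existing word-level code-mixing metrics, most notably the Code-Mixing Index (CMI) of Gambäck and Das (2014). A minimal sketch of how such a metric operates, assuming per-token language tags are already available (the tag names here are illustrative):

```python
def cmi(tags):
    """Code-Mixing Index for one utterance.

    tags: per-token language labels, e.g. "en", "hi", or "univ"
    for language-independent tokens (hashtags, emoticons, ...).
    Returns 0 for monolingual or all-universal utterances,
    up to 100 for maximally mixed ones.
    """
    n = len(tags)
    lang_tags = [t for t in tags if t != "univ"]
    u = n - len(lang_tags)            # language-independent tokens
    if n == u:                        # nothing language-tagged to measure
        return 0.0
    max_wi = max(lang_tags.count(lang) for lang in set(lang_tags))
    return 100.0 * (1.0 - max_wi / (n - u))

# A monolingual utterance scores 0; an evenly mixed one scores 50.
print(cmi(["en"] * 6))                         # 0.0
print(cmi(["en", "hi", "en", "hi", "univ"]))   # 50.0
```

A sentence-level score like this averages naturally over a paragraph or document, which is one reason the paper's proposed index can be extended across those levels.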
Related papers
2021
In this paper, I discuss the concept of linguistic complexity, which has been high on the linguistic agenda during the last few decades (Merlini Barbaresi (ed.) 2003, Sampson et al. (eds.) 2009, Moretti 2018 and many others). I first cite the most important definitions of complexity proposed by different scholars, then apply and compare particular elements of these definitions to linguistic phenomena found in two specific languages, Italian and Danish. I focus mainly on the number of propositions per sentence and on the degree of their subordination (as conveyed by verb implicitness and nominalisation), two manifestations of complexity that are numerically measurable and cross-linguistically comparable. I give both cross- and intralinguistic examples taken from comparable texts that exhibit differences in these kinds of complexity, and in this way I demonstrate that linguistic complexity is clearly linked to and dependent on the language type in question as well as its particular uses and users. In the case of Italian, we might talk about a "language-internal multilingualism". However, I conclude the paper by giving a positive answer to my question: Some languages are indeed more complex than others.
Register Studies, 2020
In this article, we present the results of a corpus-based study where we explore whether it is possible to automatically single out different facets of text complexity in a general-purpose corpus. To this end, we use factor analysis as applied in Biber’s multi-dimensional analysis framework. We evaluate the results of the factor solution by correlating factor scores and readability scores to ascertain whether the selected factor solution matches the independent measurement of readability, which is a notion tightly linked to text complexity. The corpus used in the study is the Swedish national corpus, called the Stockholm-Umeå Corpus, or SUC. The SUC contains subject-based text varieties (e.g., hobby), press genres (e.g., editorials), and mixed categories (e.g., miscellaneous). We refer to them collectively as ‘registers’. Results show that it is indeed possible to elicit and interpret facets of text complexity using factor analysis despite some caveats. We propose a tentative text complexi...
ArXiv, 2021
This paper discusses the results obtained for different techniques applied for performing the sentiment analysis of social media (Twitter) code-mixed text written in Hinglish. The various stages involved in performing the sentiment analysis were data consolidation, data cleaning, data transformation and modelling. Various data cleaning techniques were applied, data was cleaned in five iterations and the results of experiments conducted were noted after each iteration. Data was transformed using count vectorizer, one-hot vectorizer, tf-idf vectorizer, doc2vec, word2vec and fasttext embeddings. The models were created using various machine learning algorithms such as SVM, KNN, Decision Trees, Random Forests, Naïve Bayes, Logistic Regression, and ensemble voting classifiers. The data was obtained from a task on the Codalab competition website, listed as Task 9 on the SemEval-2020 competition website. The models created were evaluated using the F1-score (macro). The best F1-score of...
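All of the models above are evaluated with the macro-averaged F1, which weights every class equally regardless of its frequency. A minimal self-contained sketch of the metric, with invented gold and predicted labels for illustration:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: compute precision/recall/F1 per class,
    then average with equal weight, so rare classes count as much
    as common ones."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if p == lab and g != lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels: "pos" F1 = 0.8, "neg" F1 = 0.667, macro = 0.733.
gold = ["pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "pos", "pos"]
print(round(macro_f1(gold, pred), 3))  # 0.733
```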
Proceedings of the Second Workshop on Computational Approaches to Code Switching, 2016
Multilingual users of social media sometimes use multiple languages during conversation. Mixing multiple languages in content is known as code-mixing. We annotate a subset of a trilingual code-mixed corpus (Barman et al., 2014) with part-of-speech (POS) tags. We investigate two state-of-the-art POS tagging techniques for code-mixed content and combine the features of the two systems to build a better POS tagger. Furthermore, we investigate the use of a joint model which performs language identification (LID) and part-of-speech (POS) tagging simultaneously.
2016
The pervasiveness of social media in the present digital era has empowered the ‘netizens’ to be more creative and interactive, and to generate content using free language forms that often are closer to spoken language and hence show phenomena previously mainly analysed in speech. One such phenomenon is code-mixing, which occurs when multilingual persons switch freely between the languages they have in common. Code-mixing presents many new challenges for language processing and the paper discusses some of them, taking as a starting point the problems of collecting and annotating three corpora of code-mixed Indian social media text: one corpus with English-Bengali Twitter messages and two corpora containing English-Hindi Twitter and Facebook messages, respectively. We present statistics of these corpora, discuss part-of-speech tagging of the corpora using both a coarse-grained and a fine-grained tag set, and compare their complexity to several other code-mixed corpora based on a Code-...
2019
The article focuses on identifying, extracting and evaluating syntactic parameters influencing the complexity of Russian academic texts. Our ultimate goal is to select a set of text features effectively measuring text complexity and build an automatic tool able to rank Russian academic texts according to grade levels, building models based on the most promising features by using machine learning methods. The innovative algorithm for designing a predictive model of text complexity is based on a training text corpus and a set of previously proposed and new syntactic features (average sentence length, average number of syllables per word, the number of adjectives, average number of participial constructions, average number of coordinating chains, path number, i.e. average number of sub-trees). Our best model achieves an MSE of 1.15. Our experiments indicate that by adding the abovementioned syntactic features, namely the average number of participial constructions, average number of coordinating ...
Proceedings of the Canadian Conference on Artificial Intelligence
Measuring a document's complexity level is an open challenge, particularly when one is working on a diverse corpus of documents rather than comparing several documents on a similar topic, or working on a language other than English. In this paper, we define a methodology to measure the complexity of French documents, using a new general and diversified corpus of texts, the "French Canadian complexity level corpus", and a wide range of metrics. We compare different learning algorithms on this task and contrast their performances and their observations on which characteristics of the texts are more significant to their complexity. Our results show that our methodology gives a general-purpose measurement of text complexity in French.
2012
This article presents an empirical study where translational complexity is related to a notion of computability. Samples of English-Norwegian parallel texts have been analysed in order to estimate to what extent the given translations could have been produced automatically, assuming a rule-based approach to machine translation. The study compares two text types, fiction and law text, in order to see how these differ with respect to the question of automatisation. A central assumption behind the empirical method is that a specific translation of a given source expression can be predicted, or computed, provided that the linguistically encoded information in the original, together with information about source and target languages, and about their interrelations, provides the information needed to produce that specific target expression. The results of the investigation indicate that automatic translation tools may be helpful in the case of the law texts, and the study concurs with the...
In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
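The unsupervised dictionary-based baseline mentioned above can be sketched in a few lines; the toy word lists and the `unk` fallback tag are assumptions for illustration, not the dataset's actual lexicons or tag set:

```python
# Hypothetical dictionary-based word-level LID baseline: tag each
# token with the language whose word list contains it, falling back
# to "unk" on misses or cross-lexicon ties. Tiny illustrative lexicons:
EN = {"good", "morning", "the", "is", "very"}
BN = {"bhalo", "khub", "sokal"}
HI = {"accha", "bahut", "subah"}

def tag_tokens(tokens):
    tags = []
    for tok in tokens:
        hits = [lang for lang, vocab in
                (("en", EN), ("bn", BN), ("hi", HI)) if tok in vocab]
        # Ambiguous or out-of-vocabulary tokens get "unk"; this blind
        # spot is exactly where the supervised, context-aware models win.
        tags.append(hits[0] if len(hits) == 1 else "unk")
    return tags

print(tag_tokens(["khub", "bhalo", "morning", "xyz"]))
# ['bn', 'bn', 'en', 'unk']
```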
Complexity, 2014
We compared entropy for texts written in natural languages (English, Spanish) and artificial languages (computer software) based on a simple expression for the entropy as a function of message length and specific word diversity. Code text written in artificial languages showed higher entropy than text of similar length expressed in natural languages. Spanish texts exhibit more symbolic diversity than English ones. Results showed that algorithms based on complexity measures differentiate artificial from natural languages, and that text analysis based on complexity measures allows the unveiling of important aspects of their nature. We propose specific expressions to examine entropy-related aspects of texts and estimate the values of entropy, emergence, self-organization and complexity based on specific diversity and message length.
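The entropy expression above is a function of message length and word diversity; as a rough, closely related sketch (the paper's exact formula may differ), plain word-level Shannon entropy already captures the diversity component:

```python
# Word-level Shannon entropy as a crude complexity proxy:
# H = -sum p(w) * log2 p(w) over the text's word distribution.
import math
from collections import Counter

def word_entropy(text):
    counts = Counter(text.lower().split())
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Repeated words lower the entropy; all-distinct words maximise it
# for a given length (H = log2 of the word count).
print(word_entropy("to be or not to be"))
print(word_entropy("all words here are distinct"))
```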