Complexity Metric for Code-Mixed Social Media Text
Computación y Sistemas
https://doi.org/10.13053/CYS-21-4-2852…
Abstract
An evaluation metric is essential for measuring both the performance of a system and the complexity of its data. In this paper, we discuss how to determine the level of complexity of code-mixed social media texts, which are growing rapidly due to multilingual interference. In general, texts written in multiple languages are hard to comprehend and analyze. At the same time, in order to meet the demands of analysis, it is also necessary to determine the complexity of a particular document or text segment. Thus, in the present paper, we review the existing metrics for determining the code-mixing complexity of a corpus, their advantages and shortcomings, and propose several improvements to them. The new index better reflects the variety and complexity of a multilingual document. Moreover, the index can be applied to a sentence and seamlessly extended to a paragraph or an entire document. We have employed two existing code-mixed corpora to suit the requirements of our study.
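The improvements discussed in the paper build on existing word-level code-mixing metrics, most notably the Code-Mixing Index (CMI) of Gambäck and Das (2014). A minimal sketch of how such a metric operates, assuming per-token language tags are already available (the tag names here are illustrative):

```python
def cmi(tags):
    """Code-Mixing Index for one utterance.

    tags: per-token language labels, e.g. "en", "hi", or "univ"
    for language-independent tokens (hashtags, emoticons, ...).
    Returns 0 for monolingual or all-universal utterances,
    up to 100 for maximally mixed ones.
    """
    n = len(tags)
    lang_tags = [t for t in tags if t != "univ"]
    u = n - len(lang_tags)            # language-independent tokens
    if n == u:                        # nothing language-tagged to measure
        return 0.0
    max_wi = max(lang_tags.count(lang) for lang in set(lang_tags))
    return 100.0 * (1.0 - max_wi / (n - u))

# A monolingual utterance scores 0; an evenly mixed one scores 50.
print(cmi(["en"] * 6))                         # 0.0
print(cmi(["en", "hi", "en", "hi", "univ"]))   # 50.0
```

A sentence-level score like this averages naturally over a paragraph or document, which is one reason the paper's proposed index can be extended across those levels.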
Related papers
2021
In this paper, I discuss the concept of linguistic complexity, which has been high on the linguistic agenda during the last few decades (Merlini Barbaresi (ed.) 2003, Sampson et al. (eds.) 2009, Moretti 2018 and many others). I first cite the most important definitions of complexity proposed by different scholars, then apply and compare particular elements of these definitions to linguistic phenomena found in two specific languages, Italian and Danish. I focus mainly on the number of propositions per sentence and on the degree of their subordination (as conveyed by verb implicitness and nominalisation), two manifestations of complexity that are numerically measurable and cross-linguistically comparable. I give both cross- and intralinguistic examples taken from comparable texts that exhibit differences in these kinds of complexity, and in this way I demonstrate that linguistic complexity is clearly linked to and dependent on the language type in question as well as its particular uses and users. In the case of Italian, we might talk about a "language-internal multilingualism". However, I conclude the paper by giving a positive answer to my question: Some languages are indeed more complex than others.
Register Studies, 2020
In this article, we present the results of a corpus-based study where we explore whether it is possible to automatically single out different facets of text complexity in a general-purpose corpus. To this end, we use factor analysis as applied in Biber’s multi-dimensional analysis framework. We evaluate the results of the factor solution by correlating factor scores and readability scores to ascertain whether the selected factor solution matches the independent measurement of readability, which is a notion tightly linked to text complexity. The corpus used in the study is the Swedish national corpus, called the Stockholm-Umeå Corpus, or SUC. The SUC contains subject-based text varieties (e.g., hobby), press genres (e.g., editorials), and mixed categories (e.g., miscellaneous). We refer to them collectively as ‘registers’. Results show that it is indeed possible to elicit and interpret facets of text complexity using factor analysis despite some caveats. We propose a tentative text complexi...
ArXiv, 2021
This paper discusses the results obtained for different techniques applied for performing the sentiment analysis of social media (Twitter) code-mixed text written in Hinglish. The various stages involved in performing the sentiment analysis were data consolidation, data cleaning, data transformation and modelling. Various data cleaning techniques were applied, data was cleaned in five iterations and the results of experiments conducted were noted after each iteration. Data was transformed using count vectorizer, one-hot vectorizer, tf-idf vectorizer, doc2vec, word2vec and fasttext embeddings. The models were created using various machine learning algorithms such as SVM, KNN, Decision Trees, Random Forests, Naïve Bayes, Logistic Regression, and ensemble voting classifiers. The data was obtained from a task on the Codalab competition website, listed as Task 9 on the SemEval-2020 competition website. The models created were evaluated using the F1-score (macro). The best F1-score of...
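All of the models above are evaluated with the macro-averaged F1, which weights every class equally regardless of its frequency. A minimal self-contained sketch of the metric, with invented gold and predicted labels for illustration:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: compute precision/recall/F1 per class,
    then average with equal weight, so rare classes count as much
    as common ones."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if p == lab and g != lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels: "pos" F1 = 0.8, "neg" F1 = 0.667, macro = 0.733.
gold = ["pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "pos", "pos"]
print(round(macro_f1(gold, pred), 3))  # 0.733
```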
Proceedings of the Second Workshop on Computational Approaches to Code Switching, 2016
Multilingual users of social media sometimes use multiple languages during conversation. Mixing multiple languages in content is known as code-mixing. We annotate a subset of a trilingual code-mixed corpus (Barman et al., 2014) with part-of-speech (POS) tags. We investigate two state-of-the-art POS tagging techniques for code-mixed content and combine the features of the two systems to build a better POS tagger. Furthermore, we investigate the use of a joint model which performs language identification (LID) and part-of-speech (POS) tagging simultaneously.
2016
The pervasiveness of social media in the present digital era has empowered the ‘netizens’ to be more creative and interactive, and to generate content using free language forms that often are closer to spoken language and hence show phenomena previously mainly analysed in speech. One such phenomenon is code-mixing, which occurs when multilingual persons switch freely between the languages they have in common. Code-mixing presents many new challenges for language processing and the paper discusses some of them, taking as a starting point the problems of collecting and annotating three corpora of code-mixed Indian social media text: one corpus with English-Bengali Twitter messages and two corpora containing English-Hindi Twitter and Facebook messages, respectively. We present statistics of these corpora, discuss part-of-speech tagging of the corpora using both a coarse-grained and a fine-grained tag set, and compare their complexity to several other code-mixed corpora based on a Code-...
2019
The article focuses on identifying, extracting and evaluating syntactic parameters influencing the complexity of Russian academic texts. Our ultimate goal is to select a set of text features effectively measuring text complexity and build an automatic tool able to rank Russian academic texts according to grade levels, building models based on the most promising features by using machine learning methods. The innovative algorithm for designing a predictive model of text complexity is based on a training text corpus and a set of previously proposed and new syntactic features (average sentence length, average number of syllables per word, the number of adjectives, average number of participial constructions, average number of coordinating chains, path number, i.e. average number of sub-trees). Our best model achieves an MSE of 1.15. Our experiments indicate that by adding the abovementioned syntactic features, namely the average number of participial constructions, average number of coordinating ...
Proceedings of the Canadian Conference on Artificial Intelligence
Measuring a document's complexity level is an open challenge, particularly when one is working on a diverse corpus of documents rather than comparing several documents on a similar topic, or working on a language other than English. In this paper, we define a methodology to measure the complexity of French documents, using a new general and diversified corpus of texts, the "French Canadian complexity level corpus", and a wide range of metrics. We compare different learning algorithms on this task and contrast their performances and their observations on which characteristics of the texts are more significant to their complexity. Our results show that our methodology gives a general-purpose measurement of text complexity in French.
2012
This article presents an empirical study where translational complexity is related to a notion of computability. Samples of English-Norwegian parallel texts have been analysed in order to estimate to what extent the given translations could have been produced automatically, assuming a rule-based approach to machine translation. The study compares two text types, fiction and law text, in order to see how these differ with respect to the question of automatisation. A central assumption behind the empirical method is that a specific translation of a given source expression can be predicted, or computed, provided that the linguistically encoded information in the original, together with information about source and target languages, and about their interrelations, provides the information needed to produce that specific target expression. The results of the investigation indicate that automatic translation tools may be helpful in the case of the law texts, and the study concurs with the...
In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
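The unsupervised dictionary-based baseline mentioned above can be sketched in a few lines; the toy word lists and the `unk` fallback tag are assumptions for illustration, not the dataset's actual lexicons or tag set:

```python
# Hypothetical dictionary-based word-level LID baseline: tag each
# token with the language whose word list contains it, falling back
# to "unk" on misses or cross-lexicon ties. Tiny illustrative lexicons:
EN = {"good", "morning", "the", "is", "very"}
BN = {"bhalo", "khub", "sokal"}
HI = {"accha", "bahut", "subah"}

def tag_tokens(tokens):
    tags = []
    for tok in tokens:
        hits = [lang for lang, vocab in
                (("en", EN), ("bn", BN), ("hi", HI)) if tok in vocab]
        # Ambiguous or out-of-vocabulary tokens get "unk"; this blind
        # spot is exactly where the supervised, context-aware models win.
        tags.append(hits[0] if len(hits) == 1 else "unk")
    return tags

print(tag_tokens(["khub", "bhalo", "morning", "xyz"]))
# ['bn', 'bn', 'en', 'unk']
```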
Complexity, 2014
We compared entropy for texts written in natural languages (English, Spanish) and artificial languages (computer software) based on a simple expression for the entropy as a function of message length and specific word diversity. Code text written in artificial languages showed higher entropy than text of similar length expressed in natural languages. Spanish texts exhibit more symbolic diversity than English ones. Results showed that algorithms based on complexity measures differentiate artificial from natural languages, and that text analysis based on complexity measures allows the unveiling of important aspects of their nature. We propose specific expressions to examine entropy-related aspects of texts and estimate the values of entropy, emergence, self-organization and complexity based on specific diversity and message length.
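The entropy expression above is a function of message length and word diversity; as a rough, closely related sketch (the paper's exact formula may differ), plain word-level Shannon entropy already captures the diversity component:

```python
# Word-level Shannon entropy as a crude complexity proxy:
# H = -sum p(w) * log2 p(w) over the text's word distribution.
import math
from collections import Counter

def word_entropy(text):
    counts = Counter(text.lower().split())
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Repeated words lower the entropy; all-distinct words maximise it
# for a given length (H = log2 of the word count).
print(word_entropy("to be or not to be"))
print(word_entropy("all words here are distinct"))
```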