Papers by Natalia Levshina

Linguistics, Apr 25, 2023
This article argues for a gradient approach to word order, which treats word order preferences, both within and across languages, as a continuous variable. Word order variability should be regarded as a basic assumption, rather than as something exceptional. Although this approach follows naturally from the emergentist usage-based view of language, we argue that it can be beneficial for all frameworks and linguistic domains, including language acquisition, processing, typology, language contact, language evolution and change, and formal approaches. Gradient approaches have been very fruitful in some domains, such as language processing, but their potential is not fully realized yet. This may be due to practical reasons. We discuss the most pressing methodological challenges in corpus-based and experimental research on word order and propose some practical solutions.
Do you think I believe you hope…? The power of recursion
Comparing Bayesian and Frequentist Models of Language Variation
Data and Methods in Corpus Linguistics

Reassessing Scale Effects on Differential Case Marking: Methodological, Conceptual and Theoretical Issues in the Quest for a Universal
It is widely believed that when differential case marking depends on the referential properties of the NP in question, it is governed by a well-defined hierarchy or scale of referential categories, and that the resulting systematicity is one of the most robust generalizations in linguistic typology. This view has recently been called into question, with Sinnemäki (2014) and especially Bickel, Witzlack-Makarevich & Zakharko (2015) claiming that there is now firm typological evidence against such universal scale effects. Since these papers are based on the largest world-wide databases compiled so far, their results are likely to be taken as the current state of the field. In the present paper, we re-examine Bickel, Witzlack-Makarevich & Zakharko's (2015) data from a different perspective and re-evaluate their negative conclusions: First, we complement their analysis in terms of diachronic…

The present paper discusses connectivity and proximity maps of causative constructions. In the first part, we show that the creation of a connectivity map of causatives is not a trivial task due to incomplete descriptions, inconsistent terminology and the problem of determining the semantic nodes, and propose an innovative data-driven solution based on data from a parallel corpus. The second part of the paper focuses on proximity maps based on Multidimensional Scaling and compares the most important semantic distinctions, which are inferred from typological data and from a parallel corpus of film subtitles. The results suggest that corpus-based maps of tokens are more sensitive to cultural differences in the prominence of specific causation scenarios, such as interpersonal letting and forceful causation, than maps based on constructional types, which are described in reference grammars.

Contents:
Introduction (Karsten Schmidtke-Bode)
1. Can cross-linguistic regularities be explained by constraints on change? (Martin Haspelmath)
2. Taking diachronic evidence seriously: Result-oriented vs. source-oriented explanations of typological universals (Sonia Cristofaro)
3. Some language universals are historical accidents (Jeremy Collins)
4. Grammaticalization accounts of word order correlations (Matthew S. Dryer)
5. Preposed adverbial clauses: Functional adaptation and diachronic inheritance (Holger Diessel)
6. Attractor states and diachronic change in Hawkins's "Processing Typology" (Karsten Schmidtke-Bode)
7. Weak universal forces: The discriminatory function of case in differential object marking systems (Ilja A. Seržant)
8. Support from creole languages for functional adaptation in grammar: Dependent and independent possessive person-forms (Susanne Maria Michaelis)
9. Linguistic Frankenstein, or How to test universal constraints without real languages (Natalia Levshina)
10. Diachronic sources, functional motivations and the nature of the evidence: A synthesis

Just because: In search of an objective approach to subjectivity
Aims and theoretical background: It has been well established since Sweetser (1990) that because can be used to express causal relations in the content, epistemic and speech-act domains. In other words, the connective because is a multifunctional linguistic expression that is underspecified for its contexts of use, while other languages like Dutch, French or German have developed so-called “subjective” and “objective” connectives, which may function as strong indicators of the subjectivity level of the causal relation at hand (Stukker & Sanders 2012). The connective because being uninformative in this respect, we aim at investigating whether it is possible to anchor the different uses of because in context (or rather, cotext), examining a large number of syntactic, morphological and semantic cues with a minimal cost of manual annotation. Therefore, we propose an innovative method making use of information available from an English/Dutch parallel corpus to distinguish between differe…
This paper is a quantitative multifactorial study of near-synonymous constructions let + V, allow + to V and permit + to V based on the British National Corpus. We fit a Bayesian multinomial mixed model with twenty formal, semantic, social, collostructional and other variables as fixed effects and the infinitives that fill in the second verb slot as random effects. The model reveals a remarkable alignment of variables that indicate the formal distance between the predicates, conceptual distance between the events they represent and between the speaker and the main arguments, the social and communicative distance between the interlocutors, as well as the looseness of the relationship between the constructions and second verb slot fillers. These results raise fundamental theoretical questions about the relationships between linguistic form, function and use.
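The original analysis is a Bayesian multinomial mixed model with twenty predictors and verb-slot random effects; as a much-simplified, hypothetical sketch of the basic design only, the code below fits a plain multinomial logistic regression to simulated data with two invented predictors and no random effects.

```python
# A much-simplified, hypothetical stand-in for the Bayesian multinomial mixed model:
# a plain multinomial logistic regression predicting the choice among "let",
# "allow" and "permit" from two invented predictors (no random effects, simulated data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 600
# Hypothetical predictors: register formality and conceptual distance causer-causee
X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 1, n)])
# Hypothetical generative assumption: high formality favours "permit", low favours "let"
scores = np.column_stack([1 - X[:, 0], np.full(n, 0.5), X[:, 0] + 0.3 * X[:, 1]])
y = np.array(["let", "allow", "permit"])[scores.argmax(axis=1)]

clf = LogisticRegression(max_iter=1000).fit(X, y)
# Per-class coefficients for the formality predictor
print(dict(zip(clf.classes_, np.round(clf.coef_[:, 0], 2))))
```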

We investigate the correlations between lability of verbal arguments and other typological parameters using large, syntactically annotated corpora of online news in 28 languages. We focus on A-lability, when the A argument alternates with S (e.g., She is singing vs. She is singing a song), and P-lability, when the P argument alternates with S (e.g., She opened the door vs. The door opened). To estimate how much lability is observed in a language, we measure associations between Verbs or Verb + Noun combinations and the alternating constructions in which they occur. Our correlational analyses show that high P-lability scores correlate strongly with the following parameters: little or no case marking; weaker associations between lexemes and the grammatical roles A and P; rigid order of Subject and Object; and a high proportion of verb-medial clauses (SVO). Low P-lability correlates with the presence of case marking, stronger associations between nouns and grammatical roles, relative…
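As an illustration of what an association-based lability score could look like (not necessarily the measure used in the paper), the toy sketch below treats per-verb lability as the normalized entropy of a verb's distribution over two alternating constructions; all counts are invented.

```python
# Hypothetical illustration: per-verb lability scored as the normalized entropy of a
# verb's distribution over two alternating constructions. All counts are invented.
import math
from collections import defaultdict

# (verb, construction) token counts, as they might be extracted from parsed corpora
counts = {
    ("open", "transitive"): 180, ("open", "intransitive"): 95,    # P-labile: the door opened
    ("sing", "transitive"): 40,  ("sing", "intransitive"): 160,   # A-labile: she is singing (a song)
    ("devour", "transitive"): 75, ("devour", "intransitive"): 0,  # non-labile
}

per_verb = defaultdict(dict)
for (verb, cxn), n in counts.items():
    per_verb[verb][cxn] = n

def lability(dist):
    """Normalized entropy over two constructions: 0 = rigid, 1 = maximally labile."""
    total = sum(dist.values())
    probs = [n / total for n in dist.values() if n > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(2)  # max entropy for two constructions = 1 bit

for verb, dist in per_verb.items():
    print(f"{verb:8s} lability = {lability(dist):.2f}")
```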

Zeitschrift für Sprachwissenschaft
The present paper discusses connectivity and proximity maps of causative constructions and combines them with different types of typological data. In the first case study, I show how one can create a connectivity map based on a parallel corpus. This allows us to solve many problems, such as incomplete descriptions, inconsistent terminology and the problem of determining the semantic nodes. The second part focuses on proximity maps based on Multidimensional Scaling and compares the most important semantic distinctions, which are inferred from a parallel corpus of film subtitles and from grammar descriptions. The results suggest that corpus-based maps of tokens are more sensitive to cultural and genre-related differences in the prominence of specific causation scenarios than maps based on constructional types, which are described in reference grammars. The grammar-based maps also reveal a less clear structure, which can be due to incomplete semantic descriptions in grammars. Therefore…
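A proximity map of this kind can be built with standard Multidimensional Scaling once a pairwise dissimilarity matrix between causative situations or constructions is available. The sketch below uses scikit-learn's MDS on a small hypothetical matrix; the labels and values are invented for illustration and are not taken from the paper.

```python
# Minimal sketch of a proximity map via Multidimensional Scaling, assuming a
# precomputed dissimilarity matrix between causative situation types (e.g., derived
# from how often two situations share the same marker across translations).
import numpy as np
from sklearn.manifold import MDS

labels = ["letting", "interpersonal causation", "forceful causation", "indirect causation"]
D = np.array([          # hypothetical symmetric dissimilarities, zero diagonal
    [0.0, 0.6, 0.9, 0.5],
    [0.6, 0.0, 0.7, 0.4],
    [0.9, 0.7, 0.0, 0.6],
    [0.5, 0.4, 0.6, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=42)
coords = mds.fit_transform(D)  # 2D coordinates; nearby points = semantically close situations
for label, (x, y) in zip(labels, coords):
    print(f"{label:25s} {x:6.2f} {y:6.2f}")
```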

Communicative efficiency and the Principle of No Synonymy: predictability effects and the variation of want to and wanna
Language and Cognition
There is ample psycholinguistic evidence that speakers behave efficiently, using shorter and less effortful constructions when the meaning is more predictable, and longer and more effortful ones when it is less predictable. However, the Principle of No Synonymy requires that all formally distinct variants should also be functionally different. The question is how much two related constructions should overlap semantically and pragmatically in order to be used for the purposes of efficient communication. The case study focuses on want to + Infinitive and its reduced variant with wanna, which have different stylistic and sociolinguistic connotations. Bayesian mixed-effects regression modelling based on the spoken part of the British National Corpus reveals a very limited effect of efficiency: predictability increases the chances of the reduced variant only in fast speech. We conclude that efficient use of more and less effortful variants is restricted when two variants are associated w…
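The study fits a Bayesian mixed-effects model; as a rough frequentist sketch of the key test (a predictability-by-speech-rate interaction on the choice of the reduced variant), the code below runs a plain logistic regression on simulated data. All variable names and effect sizes are hypothetical.

```python
# Rough frequentist sketch (the original is a Bayesian mixed-effects model):
# does predictability favour the reduced variant "wanna" only in fast speech?
# Data and effect sizes below are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "predictability": rng.uniform(0, 1, n),   # e.g., forward transition probability of the infinitive
    "fast_speech": rng.integers(0, 2, n),     # 1 = fast articulation rate
})
# Simulated assumption: predictability raises the odds of "wanna" only in fast speech
logit = -1 + 2.5 * df.predictability * df.fast_speech
df["wanna"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = smf.logit("wanna ~ predictability * fast_speech", data=df).fit(disp=False)
print(model.summary().tables[1])  # the interaction term carries the efficiency effect
```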
These are the slides for our presentation at the 11th International Conference on Construction Grammar (ICCG11), Antwerp (Belgium), 18-20 August 2021.
The scarcity of diachronic data represents a serious problem when linguists try to explain a typological universal. To overcome this empirical bottleneck, one can simulate the process of language evolution in artificial language learning experiments. After a brief discussion of the main principles and findings of such experiments, this paper presents a case study of causative constructions showing that language users have a bias towards the efficient organisation of communication. They regularise their linguistic input such that more frequent causative situations are expressed by shorter forms, and less frequent situations are expressed by longer forms. This supports the economy-based explanation of the universal form-meaning mapping found in causative constructions of different languages.

There is ample evidence that human<br> communication is organized efficiently: more<br&g... more There is ample evidence that human<br> communication is organized efficiently: more<br> predictable information is usually encoded by<br> shorter linguistic forms and less predictable<br> information is represented by longer forms.<br> The present study, which is based on the<br> Universal Dependencies corpora, investigates<br> if the length of words can be predicted from<br> the average syntactic information content,<br> which is defined as the average information<br> content of a word given its counterpart in a<br> dyadic syntactic relationship. The effect of<br> this variable is tested on the data from nine<br> typologically diverse languages while<br> controlling for a number of other well-known<br> parameters: word frequency and average<br> word predictability based on the preceding<br> and following words. Poisson generalized<br> linear models and conditional random fore...

Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) can be more strongly correlated with word length, although this tendency is not observed consistently, depending on several methodological choices. The present study, which examines a more diverse sample of languages than the previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish), reveals intriguing cross-linguistic differences, which can be explained by typological properties of the languages. I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters, as well as word frequency, informativity given the previous word and informativity given the next word, applying different methods of bigram processing. The results show consisten…
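To make the measures concrete, the toy sketch below computes word frequency, length and informativity given the previous word (token-weighted average bigram surprisal) for a tiny invented text and correlates each with length; the actual study does this on large Leipzig Corpora Collection corpora under several bigram-processing choices.

```python
# Toy sketch of the measures involved: word frequency, length in characters, and
# informativity given the previous word, correlated with length via Spearman's rho.
import math
from collections import Counter
from scipy.stats import spearmanr

tokens = "the cat saw the small cat and the cat saw the dog".split()  # invented "corpus"
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens[:-1], tokens[1:]))

def informativity_prev(word):
    """Token-weighted average of -log2 P(word | previous word)."""
    surprisals = []
    for (prev, w), n in bigrams.items():
        if w == word:
            surprisals.extend([-math.log2(n / unigrams[prev])] * n)
    return sum(surprisals) / len(surprisals)

types = [w for w in unigrams if any(w2 == w for (_, w2) in bigrams)]
lengths = [len(w) for w in types]
freqs = [unigrams[w] for w in types]
infos = [informativity_prev(w) for w in types]

print("length ~ frequency:    ", spearmanr(lengths, freqs))
print("length ~ informativity:", spearmanr(lengths, infos))
```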

Frontiers in Psychology, 2021
Cross-linguistic studies focus on inverse correlations (trade-offs) between linguistic variables that reflect different cues to linguistic meanings. For example, if a language has no case marking, it is likely to rely on word order as a cue for identification of grammatical roles. Such inverse correlations are interpreted as manifestations of language users’ tendency to use language efficiently. The present study argues that this interpretation is problematic. Linguistic variables, such as the presence of case, or flexibility of word order, are aggregate properties, which do not represent the use of linguistic cues in context directly. Still, such variables can be useful for circumscribing the potential role of communicative efficiency in language evolution, if we move from cross-linguistic trade-offs to multivariate causal networks. This idea is illustrated by a case study of linguistic variables related to four types of Subject and Object cues: case marking, rigid word order of Su…
Chapter 11. Geographic variation of quite: Distinctive collexeme analysis
This chapter introduces distinctive collexeme analysis, which employs bidirectional association measures discussed in the previous chapter. This method is based on the co-occurrence frequencies of words that occur in two near-synonymous constructions, or in two or more dialectal or diachronic variants of the same construction. Here we will compare the variants of quite + ADJ constructions in different national varieties of English. We will first present a canonical distinctive collexeme analysis with only two varieties, British and American English, and then will show how this approach can be extended to more lects, presenting a unified approach to multiple distinctive collexeme analysis.
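A simplified two-variety sketch of the kind of computation involved is given below: for each adjective occurring in quite + ADJ, a 2×2 table of its counts in British vs. American English against all other adjectives is scored with a Fisher exact test. All counts are invented, and the association measure is only one of several used in collostructional work.

```python
# Simplified two-variety sketch of distinctive collexeme analysis with invented counts.
from scipy.stats import fisher_exact

counts_bre = {"good": 120, "nice": 90, "different": 60, "sure": 30}   # quite + ADJ in BrE
counts_ame = {"good": 80, "nice": 20, "different": 75, "sure": 70}    # quite + ADJ in AmE
total_bre, total_ame = sum(counts_bre.values()), sum(counts_ame.values())

for adj in sorted(set(counts_bre) | set(counts_ame)):
    a = counts_bre.get(adj, 0)                       # this adjective with quite in BrE
    b = counts_ame.get(adj, 0)                       # this adjective with quite in AmE
    table = [[a, total_bre - a], [b, total_ame - b]]
    odds_ratio, p_value = fisher_exact(table)
    variety = "BrE" if a / total_bre > b / total_ame else "AmE"
    print(f"{adj:10s} distinctive for {variety} (odds ratio = {odds_ratio:.2f}, p = {p_value:.3g})")
```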

Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories, 2020
Languages differ in the degree of semantic flexibility of their syntactic roles. For example, English and Indonesian are considered more flexible with regard to the semantics of subjects, whereas German and Japanese are less flexible. In Hawkins' classification, more flexible languages are said to have a loose fit, and less flexible ones are those that have a tight fit. This classification has been based on manual inspection of example sentences. The present paper proposes a new, quantitative approach to deriving the measures of looseness and tightness from corpora. We use corpora of online news from the Leipzig Corpora Collection in thirty typologically and genealogically diverse languages and parse them syntactically with the help of the Universal Dependencies annotation software. Next, we compute Mutual Information scores for each language using the matrices of lexical lemmas and four syntactic dependencies (intransitive subjects, transitive subjects, objects and obliques). The new approach allows us not only to reproduce the results of previous investigations, but also to extend the typology to new languages. We also demonstrate that verb-final languages tend to have a tighter relationship between lexemes and syntactic roles, which helps language users to recognize thematic roles early during comprehension.

This paper proposes a quantitative bottom-up corpus-based approach to cross-linguistic comparison, determining how tightly or loosely different lexemes can be mapped on basic syntactic roles. The idea goes back to Hawkins (…: 121-127, 1995; see also Müller-Gotama 1994), who coined the terms 'tight-fit' and 'loose-fit' languages. The former have unique surface forms that map onto more constrained meanings, whereas the latter have more vague forms with less constrained meanings. For instance, Present-Day English has fewer semantic restrictions on the subject and object than Old English, German or Russian. Consider several examples below.

(1) a. Locative: This tent sleeps four.
b. Temporal: 2020 witnessed a spread of the highly infectious coronavirus disease.
c. Instrument: 10 Euros will buy you a meal.
d. Source: The roof leaks water.

While these sentences are perfectly acceptable in English, their German or Russian equivalents would be unacceptable or strange. This means that subjects in English are less semantically restricted than subjects in German and Russian (see also …). Tightness and looseness have several components. Semantic flexibility of arguments is only one of them. Other features of tight languages include formal case marking, avoidance of raisings and long WH-movements and lower reliance on context in interpretation. Languages can change their degree of tightness. English is a well-known example of shifting from tight to loose (Hawkins 1986). As the case was lost, the zero-marked NPs in Middle English became more dependent on the verb for theta-role assignment. This is why the rigid SVO order emerged, which…
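A minimal sketch of the core quantity, mutual information between lexical lemmas and syntactic roles, is shown below on a small invented count table; the paper's pipeline works over full UD-parsed corpora and four dependency types per language.

```python
# Minimal sketch: mutual information between lexical lemmas and syntactic roles,
# computed from a small invented table of (lemma, dependency relation) counts.
# Higher MI = tighter lemma-role fit.
import math
from collections import Counter

pair_counts = Counter({          # hypothetical counts from a UD-parsed corpus
    ("dog", "nsubj"): 50, ("dog", "obj"): 10,
    ("book", "obj"): 60,  ("book", "nsubj"): 5,
    ("year", "obl"): 40,  ("year", "nsubj"): 20,
})

total = sum(pair_counts.values())
lemma_counts, rel_counts = Counter(), Counter()
for (lemma, rel), n in pair_counts.items():
    lemma_counts[lemma] += n
    rel_counts[rel] += n

mi = 0.0
for (lemma, rel), n in pair_counts.items():
    p_xy = n / total
    p_x, p_y = lemma_counts[lemma] / total, rel_counts[rel] / total
    mi += p_xy * math.log2(p_xy / (p_x * p_y))

print(f"MI(lemma; relation) = {mi:.3f} bits")
```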

Corpora, 2017
In this paper, I investigate online film subtitles from a quantitative perspective, treating them as a separate register of communication. Subtitles from films in English and other languages translated into English are compared with registers of spoken and written communication represented by large corpora of British and American English. A series of quantitative analyses based on n-gram frequencies demonstrates that subtitles are not fundamentally different from other registers of English and that they represent a close approximation of British and American informal conversations. However, I show that the subtitles are different from the conversations with regard to several functional characteristics, which are typical of the language of scripted dialogues in films and TV series in general. Namely, the language of subtitles is more emotional and dynamic, but less spontaneous, vague and narrative than that of naturally occurring conversations. The paper also compares subtitles in orig…
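As an illustration of an n-gram-frequency comparison of this kind (not the paper's exact procedure), the toy sketch below builds relative 3-gram frequency profiles for two short invented samples and correlates them over the joint n-gram space.

```python
# Toy sketch of an n-gram-based register comparison with invented "texts".
from collections import Counter
from scipy.stats import spearmanr

def ngram_profile(text, n=3):
    """Relative frequencies of word n-grams in a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = Counter(zip(*[tokens[i:] for i in range(n)]))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

subtitles = "i do not know what you mean i do not care what you think"
conversation = "i do not know i mean you do not care do you think so"

p, q = ngram_profile(subtitles), ngram_profile(conversation)
all_grams = sorted(set(p) | set(q))
rho, _ = spearmanr([p.get(g, 0) for g in all_grams], [q.get(g, 0) for g in all_grams])
print(f"Spearman rho over the joint 3-gram space: {rho:.2f}")
```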

Research in Language, 2017
The present study investigates the cross-linguistic differences in the use of so-called T/V forms (e.g. French tu and vous, German du and Sie, Russian ty and vy) in ten European languages from different language families and genera. These constraints represent an elusive object of investigation because they depend on a large number of subtle contextual features and social distinctions, which should be cross-linguistically matched. Film subtitles in different languages offer a convenient solution because the situations of communication between film characters can serve as comparative concepts. I selected more than two hundred contexts that contain the pronouns you and yourself in the original English versions, which are then coded for fifteen contextual variables that describe the Speaker and the Hearer, their relationships and different situational properties. The creators of subtitles in the other languages have to choose between T and V when translating from English, where the T/V…