And now for something completely different
2020
Abstract
I will discuss a recent court case in the Netherlands in which forensic phonetic expertise was called upon to help settle a dispute over trade name infringement. In 2014, Dutch brewer Grolsch launched a beer called Kornuit /kɔrˈnœyt/. Recently, supermarket chain Lidl released a beer under the name Kordaat /kɔrˈdaːt/. Grolsch asked me to shed light on the phonetic similarity between the two brand names. Measured with the Levenshtein distance metric (Levenshtein, 1966; Heeringa, 2004), the phonetic difference between the names is 29 percent. To show that the similarity between the brand names was very likely intentional rather than accidental (as Lidl would have it), I established the statistical distribution of the similarity of Dutch word pairs. I selected the 3000 most frequent monomorphemic content words from CELEX (Baayen et al., 1995) and computed the Levenshtein distance for all 4,498,500 non-identical word pairs, using the Gabmap software (Leinonen et al., 2016). Distances ≤ 29% occur in 0.5 percent of the word pairs, which arguably shows that the name Kordaat was not chosen by Lidl accidentally. In my talk I will explain the Levenshtein metric and motivate the decisions made to obtain the distribution of distances between Dutch word pairs.
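As a rough illustration of the kind of computation the abstract describes, the following Python sketch computes a unit-cost Levenshtein distance over phonetic segments, normalizes it, and estimates how often word pairs fall at or below a threshold. It is not the Gabmap procedure used in the case. The segmentation of the two names, the normalization by the longer word, and the function names are all assumptions made for illustration, so the sketch is not expected to reproduce the 29 percent figure.

```python
from itertools import combinations
from math import comb


def levenshtein(a, b):
    """Unit-cost edit distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence.

    Assumption: other normalizations (e.g. by alignment length) exist
    and may be what a given tool applies.
    """
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def share_at_or_below(words, threshold):
    """Fraction of unordered, non-identical pairs with distance <= threshold."""
    pairs = list(combinations(words, 2))
    hits = sum(1 for a, b in pairs if normalized_distance(a, b) <= threshold)
    return hits / len(pairs)


# Hypothetical segmentations: each diphthong or long vowel is one segment.
kornuit = ("k", "ɔ", "r", "n", "œy", "t")
kordaat = ("k", "ɔ", "r", "d", "aː", "t")
print(normalized_distance(kornuit, kordaat))

# Number of unordered pairs among 3000 distinct words.
print(comb(3000, 2))

# Toy word list (orthographic, for illustration only).
toy_words = [tuple("kat"), tuple("kap"), tuple("boom"), tuple("huis")]
print(share_at_or_below(toy_words, 0.34))
```

With these assumed segmentations the two names differ in two of six segments. The pair count for 3000 words matches the 4,498,500 pairs mentioned in the abstract. On a real word list, the value returned by `share_at_or_below` plays the role of the 0.5 percent figure.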
Related papers
2022
Many studies have examined the similarity between languages (e.g. Eden 2018; Crowley and Bowern, 2010; Longobardi and Guardiano, 2009, 2017), but none has quantified that similarity. The final goal of this study is to examine whether similarity can be measured and quantified using the scales of the acoustical prominence of several phonetic and phonological properties, while merging them into one universal scale of prominence. However, since there is no research in which similarity is measured by phonetic and phonological features alone, the goal of my thesis was to examine which features should be placed on this scale in the first place. This study contains two experiments, a preliminary one and a main one. In the preliminary experiment, 132 Hebrew speakers rated their level of familiarity with each of the 35 languages that appeared in the main experiment. In the main experiment, 362 Hebrew speakers listened to 20 sets of three recordings, a base language and two additional languages, and were asked which of the two additional languages was more similar to the base language. Similarity was determined by the number of features shared between the base language and the other language, and the features (41 in total) were taken mostly from the World Atlas of Language Structures Online (WALS) and from Bradlow et al. (2010). One of the additional languages shared more features with the base language (the similar language) and the other shared fewer features with it (the dissimilar language). The results showed a significant inclination to choose the more similar language over the dissimilar one. These findings suggest that similarity can be measured by phonetic and phonological features. However, not all features are created equal; thus, this model can be upgraded by weighting the features, so that more prominent features will have more weight in similarity quantification. 
I leave the weighting of the features for future research.
Language Dynamics and Change, 0
Previous work using lexical data from around the world has suggested that distances among language varieties are distributed such that varieties are typically either rather similar, qualifying as dialects of one another, or rather dissimilar, qualifying as different languages, with a scarcity of varieties that are around halfway similar. Wichmann (2019) observed that there is a bimodal distribution of distances with two roughly normal distributions separated by a valley. The previous work was based on a database mostly containing either descriptions of single languages or surveys covering several close varieties, so the bimodal distribution could potentially be an artifact of the underlying sample. Here we test whether a similar distribution is found when using another source of data and an unbiased sample drawn from the cells of a geographical grid (of Central Europe). The data consists of 18 lexemes from 274 doculects. Using Bayesian Beta regression and leave-one-out cross-validation, we show that the data follows a bimodal distribution which is robust to sampling, and also to at least some aspects of the data (coarse- vs. fine-grained phonetic transcriptions).
The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D'Urville (1832) [13]. He collected comparative word lists for various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work concerning the geographical division of the Pacific, he proposed a method for measuring the degree of relation among languages. The method used by modern glottochronology, developed by Morris Swadesh in the 1950s, measures distances from the percentage of shared cognates, which are words with a common historical origin. Recently, we proposed a new automated method which uses the normalized Levenshtein distances among words with the same meaning and averages over the words contained in a list. More recently, another group of scholars, Bakker et al. (2009) [8] and Holman et al. (2008) [9], proposed a refined version of our definition including a second normalization. In this paper we compare the information content of our definition with the refined version in order to decide which of the two can be applied with greater success to resolve relationships among languages.
SSRN Electronic Journal, 2000
This series presents research findings based either directly on data from the German Socio-Economic Panel Study (SOEP) or using SOEP data as part of an internationally comparable data set (e.g. CNEF, ECHP, LIS, LWS, CHER/PACO). SOEP is a truly multidisciplinary household panel study covering a wide range of social and behavioral sciences: economics, sociology, psychology, survey methodology, econometrics and applied statistics, educational science, political science, public health, behavioral genetics, demography, geography, and sport science.
2004
The Levenshtein dialect distance method has proven to be a successful method for measuring phonetic distances between Dutch dialects. The aim of the present investigation is to validate the Levenshtein dialect distance with perceptual data from a language area other than the Dutch one, namely Norway. We calculate the correlation between the Levenshtein distances and the distances between 15 Norwegian dialects as judged by Norwegian listeners. We carry out this analysis to see the degree to which the average Levenshtein distances correspond to the psychoacoustic perception of the speakers of the dialects.
2010
In Ref. , Petroni and Serva discuss the use of Levenshtein distances (LD) between words referring to the same concepts as a tool for establishing overall distances among languages which can then subsequently be used to derive phylogenies. The authors modify the raw LD by dividing the LD by the length of the longer of the two words compared, to produce what could be called LDN (normalized LD). Other scholars have used a further modification, where they divide the LDN by the average LDN among words not referring to the same concept. This produces what could be called LDND. The authors of Ref.
The Journal of the Acoustical Society of America, 2007
This study explores the effects of informational redundancy, as carried by a word's morphological paradigmatic structure, on acoustic duration in read-aloud speech. The hypothesis that the more predictable a linguistic unit is, the less salient its realization, was tested on the basis of the acoustic duration of interfixes in Dutch compounds in two datasets: one for the interfix -s- (1155 tokens) and one for the interfix -e(n)- (742 tokens). Both datasets show that the more probable the interfix is, given the compound and its constituents, the longer it is realized. These findings run counter to the predictions of information-theoretical approaches and can be resolved by the Paradigmatic Signal Enhancement Hypothesis. This hypothesis argues that whenever selection of an element from alternatives is probabilistic, the element's duration is predicted by the amount of paradigmatic support for the element: the most likely alternative in the paradigm of selection is realized longer.
2012
In recent years, dialectometry has gained interest among Catalan dialectologists. As a consequence, a specific dialectometric approach has been developed at the University of Barcelona, which aims at increasing the accuracy of final groupings by means of discriminating the predictable components of the language from its unpredictable ones. Another popular method to obtain dialect distances is the Levenshtein Distance (LD) which has never been applied to a Catalan corpus so far. The goal of this paper is to present the results of applying the LD to a corpus of Catalan linguistic data, and to compare the results from this analysis both with the results from Barcelona and the traditional classifications of Catalan dialectology.
Research Square (Research Square), 2024
The writings of one ancient civilization often overlap in time and space with others. Many of these sources comprise unstructured text in ancient languages, causing scholars studying these civilizations to be siloed, often relying on sources in specific languages. Most recent efforts to extract structured information from historical scripts into place (toponym) and people databases (prosopographies) have followed this pattern, focusing on one civilization and selected sources. The path to creating a common database runs through aligning names or toponyms between sources from disparate languages utilizing different scripts. Existing multilingual orthographic (string-based) comparison often relies on transliteration to a common script (Latin/English). Transliteration often creates multiple options and even more confusion. However, when integrating sources that overlap in space and time, the languages often share a common phonetic background. This commonality may prove beneficial. In this work, we present a benchmark for comparing toponyms from two linguistically and culturally related languages, namely Hebrew and Arabic. The benchmark comprises a set of dataset pairs created from historical sources written in medieval variants of these languages, later historical gazetteers, and a modern dataset curated from Wikidata. We empirically evaluate several toponym comparison approaches over the benchmark: transliteration to a common script, direct transliteration, and phonetic comparison using a common phonetic representation. We discuss the results and the limitations of the various methods and outline future work.
Procedia Computer Science, 2014
Drug name similarity is one of the major causes of medical accidents. One of the best ways to prevent such accidents is to avoid approving drugs whose names are similar to those of existing drugs. It is well known that there are two kinds of drug name similarity: look-alikeness and sound-alikeness. Nabeta et al. proposed a look-alikeness similarity index, which excludes sound-alikeness. Although oral prescription is basically prohibited in Japan, an emergency can force a doctor to prescribe orally. In such a situation, medical accidents can occur.

References
- Baayen, R. H., Piepenbrock, R. & Gulikers, L. (1995). CELEX2 LDC96L14. Web Download. Philadelphia: Linguistic Data Consortium.
- Heeringa, W. J. (2004). Measuring dialect pronunciation differences using Levenshtein distance. Doctoral dissertation, University of Groningen.
- Leinonen, T., Çöltekin, Ç. & Nerbonne, J. (2016). Using Gabmap. Lingua, 178, 71-83.
- Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707-710.