Computational Phraseology Research Papers

Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics (Dagstuhl Seminar 23191)

2025

Using Word Alignments to Determine the Compositionality of Swedish Compound Nouns

by Sara Stymne

2025, Language and Technology Conference

Fixed Similes: Measuring aspects of the relation between MWE idiomatic semantics and syntactic flexibility

by Stella Markantonatou

2025

We shed light on aspects of the relation between the semantics and the syntactic flexibility of multiword expressions by investigating fixed adjective similes (FS), a predicative multiword expression class not studied in this respect... more

Hungarian Corpus of Light Verb Constructions

by Janos Csirik

2025, International Conference on Computational Linguistics

The precise identification of light verb constructions is crucial for the successful functioning of several NLP applications. In order to facilitate the development of an algorithm that is capable of recognizing them, a manually annotated... more

Identifying bilingual Multi-Word Expressions for Statistical Machine Translation

by Nasredine Semmar

2025, Language Resources and Evaluation

MultiWord Expressions (MWEs) represent a key issue for numerous applications in Natural Language Processing (NLP), especially for Machine Translation (MT). In this paper, we describe a strategy for detecting translation pairs of MWEs in a... more

A Hybrid Approach for Automatic Extraction of Bilingual Multiword Expressions from Parallel Corpora

by Nasredine Semmar

2025, Language Resources and Evaluation

Specific-domain bilingual lexicons play an important role for domain adaptation in machine translation. The entries of these types of lexicons are mostly composed of MultiWord Expressions (MWEs). The manual construction of MWEs bilingual... more

Using cross-language information retrieval and statistical language modelling in example-based machine translation

by Nasredine Semmar

2025

In this paper, we present a hybrid approach to align single words, compound words and idiomatic expressions from bilingual parallel corpora. The objective is to automatically develop, improve and maintain translation lexicons. This approach combines linguistic and statistical information in order to improve word alignment results. The linguistic improvements taken into account refer to the use of an existing bilingual lexicon, named entity recognition, grammatical tag matching and detection of syntactic dependency relations between words. Statistical information refers to the number of occurrences of repeated words, their positions in the parallel corpus and their lengths in terms of number of characters. Single-word alignment uses an existing bilingual lexicon, named entity and cognate detection and grammatical tag matching. Compound-word alignment consists in establishing correspondences between the compound words of the source sentence and the compound words of the target sentences. A syntactic analysis is applied to the source and target sentences in order to extract dependency relations between words and to recognize compound words. Idiomatic expression alignment starts with a monolingual term extraction for each of the source and target languages, which provides a list of sequences of repeated words and a list of potential translations. These sequences are represented with vectors which indicate their numbers of occurrences and the numbers of segments in which they appear. Then, the translation relation between the source and target expressions is evaluated with a distance metric. The single and compound word aligners have been evaluated on a subset of 1103 sentences in English and French of the JOC (Official Journal of the European Community) corpus. The results showed that these aligners generate a translation lexicon with 90% precision for single words and 84% precision for compound words. We evaluated the idiomatic expressions aligner on a subset of the Canadian Parliament Hansard corpus and obtained a precision of 81%.

A Hybrid Approach for Automatic Extraction of Bilingual Multiword Expressions from Parallel Corpora

by Nasredine Semmar

2025

Specific-domain bilingual lexicons play an important role for domain adaptation in machine translation. The entries of these types of lexicons are mostly composed of MultiWord Expressions (MWEs). The manual construction of MWEs bilingua... more

Building Multiword Expressions Bilingual Lexicons for Domain Adaptation of an Example-Based Machine Translation System

by Nasredine Semmar

2025, RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning

We describe in this paper a hybrid approach to build automatically bilingual lexicons of Multiword Expressions (MWEs) from parallel corpora. We more specifically investigate the impact of using a domain-specific bilingual lexicon of MWEs... more

Automatic Construction of a MultiWord Expressions Bilingual Lexicon: A Statistical Machine Translation Evaluation Perspective

by Nasredine Semmar

2025

Identifying and translating MultiWord Expressions (MWEs) in a text represents a key issue for numerous applications of Natural Language Processing (NLP), especially for Machine Translation (MT). In this paper, we present a method aiming to... more

Identifying bilingual Multi-Word Expressions for Statistical Machine Translation

by Nasredine Semmar

2025

MultiWord Expressions (MWEs) represent a key issue for numerous applications in Natural Language Processing (NLP), especially for Machine Translation (MT). In this paper, we describe a strategy for detecting translation pairs of MWEs in a... more

A New Approach for Idiom Identification Using Meanings and the Web

by Rakesh Verma

2025

There is a great deal of knowledge available on the Web, which represents a great opportunity for automatic, intelligent text processing and understanding, but the major problems are finding the legitimate sources of information and the... more

New developments on processing European Portuguese verbal idioms

by Jorge Baptista

2024

This paper presents recent developments in processing verbal idioms within a rule-based grammar of European Portuguese. It describes the automatic construction of parsing rules directly from a lexicon-grammar matrix with about 2,500... more

A collocation extraction tool for Romanian

by Luka Nerima

2024

Background. Lexical knowledge, and in particular knowledge on multi-word expressions, is at the cornerstone of language applications such as syntactic parsing or machine translation. Corpus-driven lexical acquisition is one of the major... more

On-line Multilingual Linguistic Services

by Luka Nerima

2024

In this demo, we present our free on-line multilingual linguistic services which allow to analyze sentences or to extract collocations from a corpus directly on-line, or by uploading a corpus. They are available for 8 European languages... more

Bilingual Dictionary Extraction Tools

by Gregor Thurmair

2024

D5.3 English-French and English-Greek parallel corpus for the Environment and Labour Legislation domains (M22) D5.5 English-French and English-Greek bilingual dictionaries for the Environment and Labour Legislation domains D7.4 Panacea... more

Improving Lexical Alignment Using Hybrid Discriminative and Post-Processing Techniques

by Profa. Helena Caseli

2024

Automatic lexical alignment is a vital step for empirical machine translation, and although good results can be obtained with existent models (e.g. Giza++), more precise alignment is still needed for successfully handling complex... more

Multiword Expressions Dataset for Indian Languages

by Sudha Bhingardive

2024

Multiword Expressions (MWEs) are used frequently in natural languages, but understanding the diversity in MWEs is one of the open problems in the area of Natural Language Processing. In the context of Indian languages, MWEs play an... more

Translation Inference across Dictionaries via a Combination of Graph-based Methods and Co-occurrence Statistics

by Philipp Heinrich

2024

This system description explains how to use several bilingual dictionaries and aligned corpora in order to create translation candidates for novel language pairs. It proposes (1) a graph-based approach which does not depend on cyclical... more

MULTIWORD EXPRESSIONS IN COMPARABLE CORPORA

by Peter Ďurčo

2024, Computational Phraseology

On the basis of Aranea Gigaword Web corpora, a family of comparable corpora intended for use in contrastive linguistic research, multilingual lexicography, language teaching and translation studies, we discuss the pros and cons of... more

Automated Paraphrasing for Authoring Aids and Machine Translation

by Anabela Barreiro

2024

LA GEOGRAFIA NEI MODI DI DIRE PER UN'EDUCAZIONE LINGUISTICA INTERCULTURALE Aspetti metodologici e potenzialità didattiche di una comparazione tra italiano e spagnolo

by Diana Peppoloni and

2024, Lingue e linguaggi

Phraseology is a key component of the development of linguistic-communicative competence of advanced level learners of a foreign language (Wray 2002); furthermore, it represents a major teaching challenge, being a highly stereotyped and... more

Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

by Francis Bond

2024

MWEs and constructions in language acquisition and in non-standard language (e.g. tweets, forums, spontaneous speech); evaluation of annotation and processing techniques for MWEs and constructions; retrospective comparative analyses from... more

Challenges on the Automatic Translation of Collocations

by Luisa Coheur

2024

As a linguistic phenomenon, collocations have been the subject of numerous researches both in the fields of theoretical and descriptive linguistics, and, more recently, in automatic Natural Language Processing. In the area of Machine... more

Learning Paraphrasing for Multiword Expressions

by seid yimam

2024, Proceedings of the 12th Workshop on Multiword Expressions

In this paper, we investigate the impact of context for the paraphrase ranking task, comparing and quantifying results for multi-word expressions and single words. We focus on systematic integration of existing paraphrase resources to... more

SAMER: A Semi-Automatically Created Lexical Resource for Arabic Verbal Multiword Expressions Tokens Paradigm and their Morphosyntactic Features

by Abdelati Hawwari

2024

Although MWE are relatively morphologically and syntactically fixed expressions, several types of flexibility can be observed in MWE, verbal MWE in particular. Identifying the degree of morphological and syntactic flexibility of MWE is... more

The Portrait of Dorian Gray: A Corpus-Based Analysis of Translated Verb + Noun (Object) Collocations in Peninsular and Colombian Spanish

by Gloria Corpas Pastor

2024, Computational and Corpus-Based Phraseology

Corpus-based Translation Studies have promoted research on the features of translated language, by focusing on the process and product of translation, from a descriptive perspective. Some of these features have been proposed by Toury... more

Computational Phraseology light: automatic translation of multiword expressions without translation resources

by Ruslan Mitkov

2024, Yearbook of Phraseology

This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and... more

Bilingual Contexts from Comparable Corpora to Mine for Translations of Collocations

by Ruslan Mitkov

2024, Lecture Notes in Computer Science

Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word... more

Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021)

by Jelena Mitrović

2024, HAL (Le Centre pour la Communication Scientifique Directe)

In recent years, language models (LMs) have become almost synonymous with NLP. Pre-trained to "read" a large text corpus, such models are useful as both a representation layer as well as a source of world knowledge. But how well do they... more

Long Word Sequences in the Discourse of Adventure Tourism

by Isabel Durán-Muñoz

2024, Proceedings of the International Conference EUROPHRAS 2022 (short papers, posters and MUMTTT workshop contributions)

Tourism discourse as a domain-specific discourse is characterized by a set of linguistic, pragmatic, and function features that make it different from other discourses and the general language. One of its essential elements is the usage... more

Table 1. Structural classification of the 4-word bundles selected

As shown in Table 1, there is a big difference between the three most frequent structural categories (verbal, nominal, and prepositional bundles, whose representation is over 20%) and the four least recurrent categories (adverbial and adjectival bundles, conjunctions, and full phrases, whose recurrence is below 5%).

Table 2. Functional classification of the 4-word bundles selected

Table 3. Structural frameworks used in terms of the functional classification

Table 3 shows that each broad functional category is mostly characterized by a different structural framework. To put it differently, nominal bundle (43.8%) is the most recurrent structure identified in “research-oriented” sequences (e.g., the edge of the, queensland adventure activity standards), the head describing location, the topic of the texts, quantities, among others. On the other hand, verbal bundle (70.4%) is the most typical structure of “participant-oriented” bundles (e.g., if you have any, give us a call), for the verbs help to engage the readers of the texts and render the writers’ opinions. Finally, prepositional bundle (37.5%) is the most common structure found in “text-oriented” sequences (e.g., as a result of, in the event of), being useful to organize the text.

SAMER: A Semi-Automatically Created Lexical Resource for Arabic Verbal Multiword Expressions Tokens Paradigm and their Morphosyntactic Features

by Mona Diab

2024

Although MWE are relatively morphologically and syntactically fixed expressions, several types of flexibility can be observed in MWE, verbal MWE in particular. Identifying the degree of morphological and syntactic flexibility of MWE is... more

SPaR.txt, a Cheap Shallow Parsing Approach for Regulatory Texts

by Bimal Kumar

2024, Proceedings of the Natural Legal Language Processing Workshop 2021

Automated Compliance Checking (ACC) systems aim to semantically parse building regulations to a set of rules. However, semantic parsing is known to be hard and requires large amounts of training data. The complexity of creating such... more

Support Verb Constructions across the Ocean Sea

by Jorge Baptista

2023, HAL (Le Centre pour la Communication Scientifique Directe)

This paper analyses the support (or light) verb constructions (SVC) in a publicly available, manually annotated corpus of multiword expressions (MWE) in Brazilian Portuguese. The paper highlights several issues in the linguistic... more

MWEs: Support/light Verb Constructions vs Fixed Expressions in Modern Greek and French

by Voula Giouli

2023, HAL (Le Centre pour la Communication Scientifique Directe)

The paper reports on a study aimed at defining the limits between fixed expressions and Support Verb Constructions. To this end, a set of formal criteria that are applicable for the efficient classification of verbal MWEs were checked... more

Corpus-Supported Foreign Language Teaching of Less Commonly Taught Languages

by Nives Mikelic Preradovic

2023, International Journal of Instruction

The study explores the implementation of corpora in teaching of the four courses of Croatian as a foreign language. Croatian, as a less-resourced and less commonly taught language, shares the same issues as other less resourced languages:... more

Temporal Expressions: Comparisons in a Multilingual Corpus

by Duško Vitas

2023, HAL (Le Centre pour la Communication Scientifique Directe)

In this paper we analyze the problem of recognition of temporal expressions and their translations. Starting from named entities describing time (e.g. of the TIMEX type) and using FST cascades, temporal expressions were extracted from a... more

Applying Computational Linguistics and Language Models: From Descriptive Linguistics to Text Mining and Psycholinguistics

by Gerold Schneider

2023

Parser-based analysis of syntax-lexis interactions

by Gerold Schneider

2023, BRILL eBooks

Fixedness in language has been extensively studied in areas like multi-word units, idiomatic expressions, collocations and verb-particle constructions. These have often been treated as relatively fixed non-compositional sequences, which... more

Applying Computational Linguistics and Language Models: From Descriptive Linguistics to Text Mining and Psycholinguistics

by Gerold Schneider

2023

This synopsis presents the application of computational linguistic tools and approaches which were developed by the author for Descriptive Linguistics, Text Mining, and Psycholinguistics. It also describes how the computational linguistic... more

Parser-based analysis of syntax-lexis interactions

by Gerold Schneider

2023, Corpora: Pragmatics and Discourse

Fixedness in language has been extensively studied in areas like multi-word units, idiomatic expressions, collocations and verb-particle constructions. These have often been treated as relatively fixed non-compositional sequences, which... more

Rule-based Automatic Multi-word Term Extraction and Lemmatization

by Cvetana Krstev

2023

In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of... more

The arText prototype: An automatic system for writing specialized texts

by Mireia Montane

2023, Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

This article describes an automatic system for writing specialized texts in Spanish. The arText prototype is a free online text editor that includes different types of linguistic information. It is designed for a variety of end users and... more

Identifying idiomatic expressions using automatic word-alignment

by Begoña Villada Moiron

2023

For NLP applications that require some sort of semantic interpretation it would be helpful to know what expressions exhibit an idiomatic meaning and what expressions exhibit a literal meaning. We investigate whether automatic... more

MWEs: Support/light Verb Constructions vs Fixed Expressions in Modern Greek and French

by Angeliki Fotopoulou

2023, HAL (Le Centre pour la Communication Scientifique Directe)

The paper reports on a study aimed at defining the limits between fixed expressions and Support Verb Constructions. To this end, a set of formal criteria that are applicable for the efficient classification of verbal MWEs were checked... more

Collocation or Free Combination? - Applying Machine Translation Techniques to identify collocations in Japanese

by ELGA LAURA STRAFELLA

2023, Language Resources and Evaluation

This work presents an initial investigation on how to distinguish collocations from free combinations. The assumption is that, while free combinations can be literally translated, the overall meaning of collocations is different from the... more

Translating idioms

by Eric Wehrli

2023

This paper discusses the treatment of fixed word expressions developed for our ITS-2 French-English translation system. This treatment makes a clear distinction between compounds-i.e. multiword expressions of X°-level in which the chunks... more

Automatic Extraction Of Malay Compound Nouns Using A Hybrid Of Statistical And Machine Learning Methods

by Fadl Ba-Alwi

2023, International Journal of Electrical and Computer Engineering (IJECE)

Identifying compound nouns is important for a wide spectrum of applications in the field of natural language processing such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow... more

Figure 1. Extraction and Filtration of Compound Nouns Multiword Units

Corpora have been extensively employed in several NLP tasks as the basis for automatically learning models for language analysis and generation. In this step, we crawl and collect news articles written in Malay from the Malaysian National News Agency (BERNAMA) news source [http://www.bernama.com/bernama/v6/index.php]. The corpus comprises 49,661 news articles and 13,346,381 tokens.

Other methods: in addition to the methods described above, other statistical association measures such as the Dice coefficient, odds ratio, Jaccard (J), Normalized Expectation (NE), Mutual Dependency (MD), and Mutual Expectation (ME) are also used. These measures are widely used in collocation extraction [6]-[9],[17],[24],[25],[32],[34].

The C-value Approach: The C-value method is an efficient domain-independent multi-word term recognition method [35], which combines linguistic and statistical information [13],[14],[36]. C-value is sensitive to nested compounds through its enhanced statistical measure of frequency of occurrence.
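
As an illustration only, and not the paper's own formulation, here is a minimal Python sketch of the C-value score in its commonly cited form. A candidate a that is not nested in any longer candidate scores log2|a| · f(a). Otherwise, the mean frequency of the longer candidates containing a is subtracted from f(a) before scaling. The function names, the toy counts and the use of raw (unadjusted) frequencies for the nesting candidates are all assumptions made for this sketch.

```python
import math


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n, m = len(longer), len(shorter)
    return any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Score candidate terms with a simplified C-value.

    `freq` maps each candidate (a tuple of tokens) to its corpus frequency.
    Single-token candidates always score 0 because log2(1) == 0.
    """
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates in which `a` is nested.
        nesting = [b for b in freq if len(b) > len(a) and contains(b, a)]
        if nesting:
            # Discount by the mean frequency of the nesting candidates.
            adjusted = f_a - sum(freq[b] for b in nesting) / len(nesting)
        else:
            adjusted = f_a
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Hypothetical counts: the bigram occurs 10 times, 4 of them inside the trigram.
toy = {("x", "y"): 10, ("w", "x", "y"): 4}
scores = c_value(toy)
```

In the toy data the bigram is discounted by the frequency of the trigram that contains it, while the trigram, which is not nested in anything longer, keeps its full frequency and gains from the length factor.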

Figure 2. Overall Precision of different measures at different evaluation points

Automatic Extraction of Malay Compound Nouns Using A Hybrid of Statistical and .... (Muneer A.S. Hazaa)

In this phase, all crawled web pages are preprocessed by removing all HTML tags, identifying the main content, removing noise automatically and breaking the content down into a sequence of individual tokens. After that, all-uppercase, capitalized and mixed-case words were lowercased. Punctuation, special symbols and numbers are removed. Table 1 shows the n-gram statistics of our corpus.

Table 2. The performance metrics (Precision, Recall and F-score) for all methods at different evaluation points

Table 3. Top 10 Malay N-N candidates extracted by different methods

[...] precision, recall and F-measure [...] respectively. Experiments show that classification algorithms which combine association scores given by several association measure methods lead to a significant performance improvement in comparison with individual basic methods. In fact, the experimental results obtained are quite satisfactory, especially when compared to results obtained in other works [6],[7]. In [6],[7] a hybrid method of linguistic and statistical approaches has been proposed for identifying compound nouns. It is clear that the hybrid method which combines statistical and machine learning methods outperforms the hybrid of linguistic and statistical approaches.

Table 4. Results for rank combination Method

Table 5. Performance of Classification Methods Combining All Association Measures

A Large Automatically-Acquired All-Words List of Multiword Expressions Scored for Compositionality

by Markus Egg

2023

We present and make available a large automatically-acquired all-words list of English multiword expressions scored for compositionality. Intrinsic evaluation against manually-produced gold standards demonstrates that our compositionality... more

Tokenization as the initial phase in NLP

by Jonathan Webster

2023

In this paper, the authors address the significance and complexity of tokenization, the beginning step of NLP. Notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation,... more
