Learning to Detect English and Hungarian Light Verb Constructions

Veronika Vincze; István Nagy T.; János Zsibrita

doi:10.1145/0000000.0000000

20 pages

Abstract

Light verb constructions consist of a verbal and a nominal component, where the noun preserves its original meaning while the verb has lost it (to some degree). They are syntactically flexible and their meaning can only be partially computed from the meaning of their parts; thus, they require special treatment in natural language processing. For this purpose, the first step is to identify light verb constructions.

Figures (14)

Table II. True light verbs and vague action verbs in English.

Fig. 1. Types of LVCs based on syntactic and semantic criteria.

Fig. 2. Types of LVCs from a morphological point of view.

In addition, we annotated LVCs in 500 randomly selected pieces of short news from the CoNLL-2003 dataset originally developed for named entity recognition [?]. As for the Hungarian corpora, they form part of the Szeged Treebank annotated for LVCs [?]. Among the subcorpora, the domains of law, short business news and newspaper texts were selected for the purpose of this study.

Table IV. The length of LVCs and LVC lemmas including or excluding prepositions and articles.

Table V. Statistical data on LVCs in the corpora.

Table ?? includes some statistics on the length of LVCs. A typical example of a two-token LVC is take care, a three-token one is take a decision, and a four-token one is come to a conclusion. In order to minimize the typological differences between the two languages, we also calculated the length of LVCs and LVC lemmas for English with prepositions and articles omitted. This revealed that, as in Hungarian, most of the LVC lemmas contain only two words.
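The length comparison can be illustrated with a minimal sketch. It assumes whitespace tokenization and a small, hand-picked list of English articles and prepositions; both the word lists and the function name `lvc_length` are hypothetical and only serve the illustration, since the text does not say how function words were identified.

```python
# Illustrative, non-exhaustive function-word lists (an assumption of this sketch).
ARTICLES = {"a", "an", "the"}
PREPOSITIONS = {"to", "into", "in", "on", "at", "under", "of", "with", "for"}


def lvc_length(lvc, skip_function_words=False):
    """Count the tokens of an LVC, optionally ignoring articles and prepositions."""
    tokens = lvc.lower().split()
    if skip_function_words:
        tokens = [t for t in tokens if t not in ARTICLES | PREPOSITIONS]
    return len(tokens)


for example in ["take care", "take a decision", "come to a conclusion"]:
    # Prints the full length and the length without articles/prepositions.
    print(example, lvc_length(example), lvc_length(example, skip_function_words=True))
```

With function words skipped, all three examples from the text reduce to two tokens (verb plus noun), which matches the observation about two-word LVC lemmas.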

corpora, which aptly underlines the domain specificity of the problem, namely, different corpora contain different LVCs.

With the corpora at hand, we were able to examine the proportion of LVC and non-LVC uses of some specific LVC candidates. For instance, the phrase tárgyalást folytat (negotiation-ACC continues) usually means “to conduct a negotiation”, which is an LVC, but in certain contexts it can mean “to continue a(n ongoing) negotiation”, which is not an LVC. In the corpora, there are 13 LVC uses and 1 non-LVC use. However, the sequence megbeszélést tart (meeting-ACC holds) “to have a meeting”, which can also be considered an LVC (out of context), occurs only once in the corpus, and in a non-LVC use: megbeszélést tart célszerűnek (meeting-ACC holds necessary-DAT) “he thinks that a meeting is required”. Thus, non-LVC usage of LVC candidates is not frequent, but the corpora contain some examples.

Table IX. The utility of individual features in Hungarian in terms of recall, precision and F-score.

Table X. The utility of individual features in English in terms of recall, precision and F-score.

was in IOB format, where B-VERBFX labels the first word of an LVC, I-VERBFX labels all subsequent words which are part of the LVC, and O labels non-entities. In our case, the labeling of an LVC was only accepted if all of its members were labeled correctly and no other neighbouring words were marked (true positive, TP). We consider it a false negative (FN) when there was an LVC entity in the running text but the system could not correctly recognize it. In other words, either the system noticed that there was an LVC but got its boundaries wrong, or it missed the entity altogether. In the case of false positives (FP), there was no LVC in the text but the system hypothesized one. To calculate F-scores we define precision and recall as follows:
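A minimal sketch of this exact-match evaluation, assuming the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN) and F = 2PR / (P + R). The function names `iob_spans` and `prf` are hypothetical, and counting a boundary error as both a false positive and a false negative follows common CoNLL-style practice rather than anything stated in the text.

```python
def iob_spans(tags, label="VERBFX"):
    """Return the set of (start, end) spans, end exclusive, encoded by IOB tags."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag != f"I-{label}":
            spans.add((start, i))  # the current span ends before position i
            start = None
        if tag == f"B-{label}":
            start = i  # a new span begins here
    return spans


def prf(gold_tags, pred_tags):
    """Exact-match precision, recall and F-score over LVC spans."""
    gold, pred = iob_spans(gold_tags), iob_spans(pred_tags)
    tp = len(gold & pred)  # all members labeled correctly, boundaries identical
    fp = len(pred - gold)  # predicted span with no identical gold span (assumption)
    fn = len(gold - pred)  # gold LVC missed or recognized with wrong boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


gold = ["O", "B-VERBFX", "I-VERBFX", "O", "B-VERBFX", "I-VERBFX"]
pred = ["O", "B-VERBFX", "I-VERBFX", "O", "B-VERBFX", "O"]
print(prf(gold, pred))  # (0.5, 0.5, 0.5): one exact match, one boundary error
```

In the usage example the second LVC is found but truncated, so it is not accepted as a true positive.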

Table XI. Experimental results on different target and source English domain pairs in terms of F-score. TARGET: in-domain setting. CROSS: cross-domain setting. RB: rule-based methods. DL: dictionary labeling. DiffCROSS: differences between the TARGET and CROSS results. DiffRB: differences between the TARGET and RB results. DiffDL: differences between the TARGET and DL results.

Table XII. Domain adaptation results on English corpora in terms of F-score. DA: domain adaptation setting. ID: training on a limited set of target data. DiffDA: differences between the CROSS and DA results. DiffDA/ID: differences between the DA and ID results.

Table XIII. Experimental results on different target and source Hungarian domain pairs in terms of F-score. TARGET: in-domain setting. CROSS: cross-domain setting. RB: rule-based methods. DL: dictionary labeling. DiffCROSS: differences between the TARGET and CROSS results. DiffRB: differences between the TARGET and RB results. DiffDL: differences between the TARGET and DL results.

58.56%. Dictionary labeling achieved 31.6% on the three corpora, which was exceeded by the TARGET results with 35.54%.

Table XIV. Domain adaptation results on Hungarian corpora in terms of F-score. DA: domain adaptation setting. ID: training on a limited set of target data. DiffDA: differences between the CROSS and DA results. DiffDA/ID: differences between the DA and ID results.

Fig. 3. The effect of the size of the target data on detecting LVCs. DA: domain adaptation setting. ID: training on a limited set of target data. CROSS: cross-domain setting. TARGET: in-domain setting. RB: rule-based methods. DL: dictionary labeling.

Table XVI. Results obtained for LVCs with different lengths on Hungarian corpora.

more similar to each other than any of them is to the legal domain (see Table ??). The special nature of the legal domain is also evident from the baseline results: compared to the other domains, the rule-based system achieves a fairly good result here (58.56%). This suggests that the morphological and syntactic patterns of LVCs in the Hungarian law corpus typically follow the canonical form of Hungarian LVCs and thus can be identified by rules.

Related papers

Ivelina Stoyanova

This paper presents work in progress focused on developing a method for automatic identification of light verb constructions (LVCs) as a subclass of Bulgarian verbal MWEs. The method is based on machine learning and is trained on a set of LVCs extracted from the Bulgarian WordNet (BulNet) and the Bulgarian National Corpus (BulNC). The machine learning uses lexical, morphosyntactic, syntactic and semantic features of LVCs. We trained and tested two separate classifiers using the Java package Weka and two decision tree learning algorithms, J48 and RandomTree. The evaluation of the method includes 10-fold cross-validation on the training data from BulNet (F1 = 0.766 obtained by the J48 decision tree algorithm and F1 = 0.725 by the RandomTree algorithm), as well as evaluation of the performance on new instances from the BulNC (F1 = 0.802 by J48 and F1 = 0.607 by the RandomTree algorithm). Preliminary filtering of the candidates gives a slight improvement in precision.

Hungarian corpus of light verb constructions

Veronika Vincze

2010

The precise identification of light verb constructions is crucial for the successful functioning of several NLP applications. In order to facilitate the development of an algorithm that is capable of recognizing them, a manually annotated corpus of light verb constructions has been built for Hungarian. Basic annotation guidelines and statistical data on the corpus are also presented in the paper. It is also shown how applications in the fields of machine translation and information extraction can make use of such a corpus and an algorithm.

Annotation and Classification of Light Verbs and Light Verb Variations in Mandarin Chinese

Chu-Ren Huang

Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing, 2014

Light verbs pose a challenge in linguistics because of their syntactic and semantic versatility and their unique distribution, which differs from that of regular verbs with higher semantic content and selectional restrictions. Due to their light grammatical content, earlier natural language processing studies typically put light verbs in a stop word list and ignored them. Recently, however, classification and identification of light verbs and light verb constructions have become a focus of study in computational linguistics, especially in the context of multi-word expressions, information retrieval, disambiguation, and parsing. Past linguistic and computational studies on light verbs had very different foci. Linguistic studies tend to focus on the status of light verbs and their various selectional constraints, while NLP studies have focused on light verbs in the context of either a multi-word expression (MWE) or a construction to be identified, classified, or translated, trying to overcome the apparent poverty of semantic content of light verbs. There has been nearly no work attempting to bridge these two lines of research. This paper takes up this challenge by proposing a corpus-based study which classifies and captures syntactic-semantic differences among all light verbs. In this study, we first incorporate results from past linguistic studies to create annotated light verb corpora with syntactic-semantic features. We next adopt a statistical method for automatic identification of light verbs based on these annotated corpora. Our results show that a language resource based methodology optimally incorporating linguistic information can resolve challenges posed by light verbs in NLP.

Linguistic features for Hindi light verb construction identification

Sumeet Agarwal, Ashwini Vaidya

Light verb constructions (LVC) in Hindi are highly productive. Distinguishing a case such as nirnay lenaa 'decision take; decide' from an ordinary verb-argument combination such as kaagaz lenaa 'paper take; take (a) paper' has been shown to aid NLP applications such as parsing (Begum et al., 2011) and machine translation (Pal et al., 2011). In this paper, we propose an LVC identification system using language specific features for Hindi which shows an improvement over previous work (Begum et al., 2011). To build our system, we carry out a linguistic analysis of Hindi LVCs using Hindi Treebank annotations and propose two new features that are aimed at capturing the diversity of Hindi LVCs in the corpus. We find that our model performs robustly across a diverse range of LVCs and our results underscore the importance of semantic features, which is in keeping with the findings for English. Our error analysis also demonstrates that our classifier can be used to further refine LVC annotations in the Hindi Treebank and make them more consistent across the board.

Light Verb Constructions in the SzegedParalellFX English–Hungarian Parallel Corpus

Veronika Vincze

Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC 2012), 2012

In this paper, we describe the first English–Hungarian parallel corpus annotated for light verb constructions, which contains 14,261 sentence alignment units. Annotation principles and statistical data on the corpus are also provided, and English and Hungarian data are contrasted. On the basis of corpus data, a database containing pairs of English–Hungarian light verb constructions has been created as well. The corpus and the database can contribute to the automatic detection of light verb constructions and they can enhance ...

A Hybrid Approach to Extracting and Classifying Verb+Noun Constructions.

Dan Stefanescu, Christopher Gledhill, Dan Tufis

We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first method applies language-independent statistical techniques followed by linguistic filtering, while the second approach, available only for German, is based on a set of lexico-syntactic patterns to extract collocation candidates. To define extraction and filtering patterns, we studied a specific collocation category, the Verb-Noun constructions, using a model inspired by systemic functional grammar and proposing a three-level analysis: lexical, functional and semantic criteria. From a tagged and lemmatized corpus, we identify some contextual morpho-syntactic properties that help to filter the output of the statistical methods and to extract some potentially interesting VN constructions (complex predicates vs complex predicator). The extracted candidates are validated and classified manually.

Learning English light verb constructions: contextual or statistical

Dan Roth

2011

Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011), pages 31–39, Portland, Oregon, USA, 23 June 2011. © 2011 Association for Computational Linguistics. Yuancheng Tu, Department of Linguistics, University of Illinois, ytu@illinois.edu; Dan Roth, Department of Computer Science, University of Illinois, danr@illinois.

A Mix Approach to Extracting and Classifying Verb+Noun Constructions

Dan Ioan Tufis

We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first method applies language-independent statistical techniques followed by linguistic filtering, while the second approach, available only for German, is based on a set of lexico-syntactic patterns to extract collocation candidates. To define extraction and filtering patterns, we studied a specific collocation category, the Verb-Noun constructions, using a model inspired by systemic functional grammar and proposing a three-level analysis: lexical, functional and semantic criteria. From a tagged and lemmatized corpus, we identify some contextual morpho-syntactic properties that help to filter the output of the statistical methods and to extract some potentially interesting VN constructions (complex predicates vs complex predicator). The extracted candidates are v...

Propbank annotation of multilingual light verb constructions

Aous Mansouri

2010

In this paper, we have addressed the task of PropBank annotation of light verb constructions, which like multi-word expressions pose special problems. To arrive at a solution, we have evaluated 3 different possible methods of annotation. The final method involves three passes:

Detecting noun compounds and light verb constructions: a contrastive study

István T. Nagy

ACL HLT 2011, 2011

Veronika Vincze

Proceedings of the Student Research Workshop associated with RANLP, 2011

In this paper, we show how our methods developed for identifying light verb constructions can be adapted to different domains and different types of texts. We experiment with both rule-based methods and machine learning approaches. Our results indicate that existing solutions for detecting light verb constructions can be successfully applied to other domains as well, and we conclude that even a small amount of annotated target data can notably contribute to performance if a bigger corpus from another domain is also ...

Extending corpus-based identification of light verb constructions using a supervised learning framework

Min-Yen Kan

2006

Light verb constructions (LVCs), such as “make a call” in English, can be said to be complex predicates in which the verb plays only a functional role. LVCs pose challenges for natural language understanding, as their semantics differ from usual predicate structures. We extend the existing corpus-based measures for identifying LVCs between verb-object pairs in English by proposing new features that use mutual information and assess other syntactic properties.

Automatic Identification of Light Verb Constructions: A Review

TONG MING LIM

Journal of the Institute of Engineers, Malaysia, 2022

Light verb constructions (LVC) are complex predicates that are present in many languages. They belong to the Multiword Expression (MWE) category known as verbal MWEs and have the canonical form of verb+noun. Examples of LVCs include give help, make decisions, and take walks. LVC identification is essential for many natural language processing (NLP) applications such as machine translation, sentiment analysis, and information extraction. However, the task of LVC identification is challenging due to characteristics of LVCs such as variability, discontinuity, and ambiguity. This paper presents a review of recent work, discusses the gaps that still exist, and proposes some future work that may contribute significant progress in LVC identification.

Syntax-based identification of light-verb constructions

Marie Candito

2019

This paper analyzes results on light-verb construction identification from the PARSEME shared task, distinguishing simple cases that could be directly learned from training data from more complex cases that require an extra level of semantic processing. We propose a simple baseline that beats the state of the art for the simple cases, and couple it with another simple baseline to handle the complex cases. We additionally present two other classifiers based on a richer set of features, with results surpassing the state of the art by 8 percentage points.

English Light Verb Construction Identification Using Lexical Knowledge

Claire Bonial

2015

This research describes the development of a supervised classifier of English light verb constructions, for example, take a walk and make a speech. This classifier relies on features from dependency parses, OntoNotes sense tags, WordNet hypernyms and WordNet lexical file information. Evaluation shows that this system achieves an 89% F1 score (four points above the state of the art) on the BNC test set used by Tu & Roth (2011), and an F1 score of 80.68 on the OntoNotes test set, which is significantly more challenging. We attribute the superior F1 score to the use of our rich linguistic features, including the use of WordNet synset and hypernym relations for the detection of previously unattested light verb constructions. We describe the classifier and its features, as well as the characteristics of the OntoNotes light verb construction test set, which relies on linguistically motivated PropBank annotation.

Classification of verb particle constructions with the Google Web1T Corpus

Jonathan K Kummerfeld

2008

Manually maintaining comprehensive databases of multi-word expressions, for example Verb-Particle Constructions (VPCs), is infeasible. We describe a new type level classifier for potential VPCs, which uses information in the Google Web1T corpus to perform a simple linguistic constituency test. Specifically, we consider the fronting test, comparing the frequencies of the two possible orderings of the given verb and particle. Using only a small set of queries for each verb-particle pair, the system was able to achieve an F-score of 75.7% in our evaluation while processing thousands of queries a second.

Natural Language Inference in Ordinary and Support Verb Constructions

Ignazio Mauro Mirto

Distributed Computing and Artificial Intelligence,17th International Conference; https://link.springer.com/book/10.1007/978-3-030-53036-5, 2021

The family of clause types known as 'support (or 'light') verb construction' (SVC) manifests a distinct syntax-semantics interface if compared with ordinary verb constructions (OVC). If, in e.g. 'She laughed', the verb licenses an argument and assigns it a semantic role, syntacticians of every stripe nowadays agree that it is the noun 'laugh', in 'She gave a laugh', which fulfils the same function. The differences between the two types have been extensively discussed in the linguistics literature (systematic research started in the 1970s), less so in Computational Linguistics. This paper has two objectives. First, it will propose an innovative type of semantic role, which is termed Cognate Semantic Role (CSR) because the verb employed in the notation is etymologically related to the predicate which licenses arguments and assigns them semantic roles. 'She laughed' and 'She gave a laugh' therefore express the same role >the-one-who-laughs<, assigned by 'laughed' and 'a laugh' respectively. Second, it will introduce a tool capable of extracting CSRs automatically from both OVCs and SVCs; thus a device will be used for detecting the construction type. CSRs offer a number of advantages for the formalization of entailments and paraphrases and for Machine Translation.

Cited by

Map Symbols in Video Games: the Example of “Valheim”

Tymoteusz Horbiński

KN - Journal of Cartography and Geographic Information, 2021

The main focus of this article is to examine the interpretation of twelve cartographic symbols on the map in Valheim. The authors set the research goal: to investigate how players and non-players interpret the symbols. The Valheim video game, which was released in 2021, is a survival game set in an open world. The authors noticed that game developers did not provide a direct explanation of the map symbols used, which could result in a different interpretation and experience of the game. The authors adopted a survey on the LimeSurvey platform as research methodology. This survey tool was used to gather information on experiences and interpretations of map symbols among a diverse group of respondents. Using online forums allowed one to disseminate the survey to a large audience of players from all over the world. Then, using the categorisation method for individual questions, a large database of respondents’ answers was created. Through the analysis, the authors checked the interpreta...
