Academia.edu

Multi Words Term Extraction

12 papers
0 followers
About this topic
Multi Words Term Extraction is the process of identifying and extracting phrases or terms composed of multiple words from a text corpus, aimed at enhancing information retrieval, natural language processing, and knowledge discovery by recognizing significant multi-word expressions that convey specific meanings or concepts.

Key research themes

1. How can linguistic, statistical, and hybrid approaches be combined effectively for multi-word term extraction in specialized and unstructured corpora?

This theme investigates methods that fuse linguistic knowledge (syntactic patterns, POS sequences, semantic context) with statistical measures (frequency, co-occurrence, association scores) or machine learning models (CRFs) to accurately identify multi-word terms (MWTs) from domain-specific or unstructured texts. It addresses challenges such as term variability, ambiguity, and limited labeled data by integrating complementary sources of knowledge, aiming for higher precision and adaptability across domains and languages.
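The combination of a linguistic filter with a statistical one can be illustrated with a minimal sketch. It assumes sentences are already POS-tagged with coarse tags such as `ADJ` and `NOUN`; the pattern `(ADJ|NOUN)* NOUN`, the function name `candidate_terms`, and the `min_freq` and `max_len` parameters are illustrative choices, not taken from any of the listed papers.

```python
from collections import Counter

def candidate_terms(tagged_sentences, min_freq=2, max_len=4):
    """Collect multi-word candidates matching (ADJ|NOUN)* NOUN, then
    keep only those seen at least `min_freq` times (hypothetical filter)."""
    counts = Counter()
    for sent in tagged_sentences:
        n = len(sent)
        for i in range(n):
            # Spans of 2..max_len tokens starting at position i.
            for j in range(i + 2, min(i + max_len, n) + 1):
                span = sent[i:j]
                tags = [tag for _, tag in span]
                # Linguistic filter: only adjectives/nouns, ending in a noun.
                if tags[-1] == "NOUN" and all(t in ("ADJ", "NOUN") for t in tags):
                    counts[" ".join(word.lower() for word, _ in span)] += 1
    # Statistical filter: a plain frequency threshold.
    return {term: c for term, c in counts.items() if c >= min_freq}

# Toy input, tagged by hand for illustration only.
sentences = [
    [("Conditional", "ADJ"), ("random", "ADJ"), ("fields", "NOUN"),
     ("label", "VERB"), ("tokens", "NOUN")],
    [("We", "PRON"), ("train", "VERB"), ("conditional", "ADJ"),
     ("random", "ADJ"), ("fields", "NOUN")],
]
terms = candidate_terms(sentences)
print(terms)  # {'conditional random fields': 2, 'random fields': 2}
```

The sketch also returns the nested fragment "random fields" alongside the full term, which is the kind of noise that the ranking measures in the next theme are meant to handle.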

Key finding: Proposed a hybrid methodology integrating Conditional Random Fields (CRF) enriched with shallow linguistic features alongside statistical filtering based on token co-occurrence frequencies to extract complex terms from...
Key finding: Developed a CRF-based term extraction model incorporating linguistic features such as morphosyntactic patterns and contextual observations to capture complex term variations in specialized domains like medical text....
Key finding: Demonstrated that combining dependency-based linguistic patterns with POS tagging significantly improves multi-word expression (including phrasal verbs) extraction accuracy in English movie subtitles. The approach, supported...
Key finding: Applied frequency and keyword analysis methods on a domain-specific isiZulu corpus to semi-automatically extract linguistic terms for dictionary compilation. This approach reveals the utility of corpus linguistics combined...
Key finding: Introduced an unsupervised iterative method that auto-learns syntactic sentence patterns and corresponding POS tag sequences from a few initial seeds to extract multi-word terminologies from scientific texts without...

2. What statistical association measures and ranking techniques optimize multi-word term candidate extraction and filtering, particularly for noisy candidates or nested terms?

This theme explores the development and evaluation of statistical scoring functions—such as C-value, NC-value, pointwise mutual information (PMI), normalized PMI, log-likelihood, TF-IDF, and Kullback-Leibler divergence—for identifying and ranking multi-word term candidates from corpora. A notable challenge addressed is accurate identification of nested terms and filtering out spurious or truncated phrases to improve term extraction precision, especially when corpora are small or contain semantically odd phrases.
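As a concrete instance of one measure named above, normalized PMI for a bigram is NPMI(x, y) = PMI(x, y) / (-log p(x, y)), with PMI(x, y) = log(p(x, y) / (p(x) p(y))). The sketch below estimates all probabilities by dividing raw counts by the token count, which is a simplifying assumption; the function name `npmi_scores` and the `min_count` parameter are hypothetical.

```python
import math
from collections import Counter

def npmi_scores(tokens, min_count=1):
    """Score each adjacent word pair with normalized PMI.
    Probabilities are simple relative frequencies over len(tokens)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue  # optional guard against very rare pairs
        p_xy = c / n
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores

text = ("machine translation is hard but machine translation is useful "
        "and translation is fun")
scores = npmi_scores(text.split())
# The recurring pair outranks a one-off pair that involves a frequent word.
print(scores[("machine", "translation")] > scores[("is", "hard")])  # True
```

On a corpus this small, pairs of words that each occur once also reach the maximum score, which is one reason such measures are usually combined with a frequency threshold (here `min_count`) or a linguistic filter.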

Key finding: Proposed a novel nested term recognition method that combines grammatical correctness with normalized pointwise mutual information (NPMI) to identify weakest collocation points for binary phrase decomposition. Applied to...
Key finding: Developed an extensible open-source platform for comparing various ATR methods (e.g., TF, RIDF, LR) including combinations using foreground and background corpora. Their comparative experiments on GENIA and Eurogene corpora...
Key finding: Performed a large-scale intrinsic and extrinsic evaluation of six unsupervised term scoring methods across four diverse collections, concluding that collection size and prevalence of multi-word terms critically affect method...
Key finding: Presented a rule-based extraction framework using electronic lexical resources and finite-state transducers to model complex syntactic structures of MWTs in Serbian, combined with statistical filtering measures (C-value,...
Key finding: Conducted a comprehensive meta-analysis of decades of keyword and term extraction research, finding that term length is a critical but undervalued factor influencing keyword status and extraction accuracy. The study also...
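The C-value measure mentioned in this theme targets the nested-term problem by discounting a candidate that mostly appears inside longer candidates. The following is a simplified sketch of the commonly cited formulation (log2 of term length times frequency, minus the mean frequency of the longer candidates that contain the term). It uses raw candidate frequencies rather than the incremental bookkeeping of the full algorithm, and the function name `c_values` and the toy frequencies are assumptions for illustration.

```python
import math

def c_values(freqs):
    """freqs maps a candidate term (tuple of words) to its corpus frequency.
    Returns a simplified C-value for each candidate."""
    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous sub-sequence of `longer`.
        k = len(shorter)
        return len(longer) > k and any(
            longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

    scores = {}
    for term, freq in freqs.items():
        containers = [f for other, f in freqs.items() if contains(other, term)]
        if containers:
            # Nested candidate: subtract the mean frequency of its containers.
            freq = freq - sum(containers) / len(containers)
        scores[term] = math.log2(len(term)) * freq
    return scores

# Made-up frequencies for illustration.
freqs = {
    ("conditional", "random", "fields"): 5,
    ("random", "fields"): 6,
    ("conditional", "random"): 5,
}
cv = c_values(freqs)
print(cv[("random", "fields")])       # 1.0
print(cv[("conditional", "random")])  # 0.0
```

The truncated fragment "conditional random" never occurs outside the longer term, so its score drops to zero, while the full three-word term keeps its frequency scaled by log2(3). Single-word candidates would score zero under this form because log2(1) = 0.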

3. How does the incorporation of semantic and contextual information improve disambiguation and ranking in multi-word term extraction?

This research theme focuses on leveraging semantic resources (e.g., domain ontologies, thesauri like UMLS) and contextual similarity measures to distinguish ambiguous terms and improve the ranking of multi-word term candidates. It investigates how deep semantic and syntactic contextual analysis surpasses simple bag-of-words or shallow syntactic filters, allowing for better identification of true domain-specific terms and addressing term variation and sense ambiguity.

Key finding: Enhanced the NC-value statistic by incorporating semantic similarity and richer syntactic representation using UMLS domain-specific semantic categories to weight context words. The method clusters contexts based on syntactic...
Key finding: Introduced semantic and syntactic context weighting leveraging UMLS and corpus linguistic information to identify which parts of the textual context contribute most to multi-word term disambiguation. Proposed a novel semantic...
Key finding: Compiled an extensive repository of Arabic MWEs with manual morphological and semantic annotations that capture context-sensitive features necessary for MWE identification. Developed a deterministic pattern-matching algorithm...

All papers in Multi Words Term Extraction

Part-of-speech (POS) tagging, one of the subtasks of Natural Language Processing (NLP), is essential for other NLP applications. It involves giving each word an appropriate POS tag that defines how it is used in a...
This report describes our work on Bengali part-of-speech (POS) tagging for the NLPAI Machine Learning contest 2006. We use a Hidden Markov Model (HMM) based stochastic tagger. The tagger makes use of morphological and contextual...
Research on Malay Part-of-Speech (POS) tagging has greatly increased over the past few years. Based on previous literature, POS tagging is known as the first phase of automated text analysis; and the development of language technologies...
In this paper we report on an experimental syntactically and morphologically driven rule-based Arabic tagger. The tagger is developed using Arabic language grammatical rules and regulations. The tagger requires no pre-tagged text and is...
A morphological analyzer is the basis for various high-level NLP applications such as information retrieval, spell checking, grammar checking, machine translation, speech recognition, POS tagging and automatic sentence construction. This...
The paper presents an application of Multidimensional (MD) analysis, initially developed for the analysis of register variation in English (Biber, 1988), to the investigation of a genre-diverse corpus, which was built from modern texts of...
Structural priming, i.e., the tendency to repeat linguistic material, can be explained by two alternative representational assumptions: either as the repetition of hierarchical representations generated by syntactic rules, or as the...
Part of speech tagging (POS tagging) has a crucial role in different fields of natural language processing (NLP) including Speech Recognition, Natural Language Parsing, Information Retrieval and Multi Words Term Extraction. This paper...
Natural Language Toolkit (NLTK) is a generic platform for processing data in various natural (human) languages, and it also provides various resources for Indian languages such as Hindi, Bangla and Marathi. In the proposed work, the...
A part-of-speech (POS) tagger plays an important role in Natural Language Applications like Speech Recognition, Natural Language Parsing, Information Retrieval and Multi Words Term Extraction. This study proposes building an efficient...
The importance of creating good stemming rules for Arabic comes from its status as the sixth most used language in the world. Stemming is very important in information retrieval, data mining and language processing. With...
This paper presents ongoing research that aims to construct a sizable and reliable text corpus along with a set of tools to experiment with natural language applications for Arabic. The corpus is used by graduate students at the...
Text tagging is a very important tool for various applications in natural language processing, namely the morphological and syntactic analysis of texts, indexation and information retrieval, "vocalization" of Arabic texts, and...
In this paper, we address the problem of part-of-speech (POS) tagging of Arabic texts with vowel marks. After describing the specificities of the Arabic language and the difficulties they induce for the task of POS tagging, we propose an...
Morphological analysis of the Arabic language is computationally intensive, involves numerous forms and rules, and is intrinsically parallel. The investigation presented in this paper confirms that the effective development of parallel algorithms and...
By: Rahima Bentrcia, Samir Zidat, Farhi Marir. There is an immense need for information systems that rely on Arabic Quranic ontologies to provide precise and comprehensive knowledge to the world. Since semantic relations are a vital...
A tagger is a mandatory component of most text analysis systems, as it assigns a syntactic class (e.g., noun, verb, adjective, or adverb) to every word in a sentence. In this paper, we present a simple part-of-speech tagger for homoeopathy...
A POS tagger is essential software used in creating language translators and extracting information. The problem in POS tagging in natural language processing (NLP) is finding how to tag each word in a given...
In computational linguistics, a vast variety of text data is available and needs to be sorted. Part-of-speech tagging is one of the most important parts of natural language processing, which...
Adverbs are one of the main aspects of grammar in almost all languages, as they play a vital role in the formation of a sentence. The identification and extraction of Multi Word Expressions (MWEs) in Hindi has been done by various researchers...
Tagging is an important prerequisite in the development of every serious natural language processing application. There are many part-of-speech taggers based on various approaches. One of the most widespread approaches for tagging is...
In the world of non-proprietary NLP software the standard, and perhaps the best, HMM-based POS tagger is TnT (Brants, 2000). We argue here that some of the criticism aimed at HMM performance on languages with rich morphology should...
There are a number of collocational constraints in natural languages that ought to play a more important role in natural language parsers. Thus, for example, it is hard for most parsers to take advantage of the fact that wine is typically...
In many applications of natural language processing (NLP), grammatically tagged corpora are needed. Thus part-of-speech (POS) tagging is of high importance in the domain of NLP. Many taggers are designed with different approaches to reach...
In this article, we present a semantic annotation tool for Arabic texts with a strategy adapted to the automatic location of reported information. The method used is that of Contextual Exploration, which consists of using purely...