Key research themes
1. How can linguistic, statistical, and hybrid approaches be combined effectively for multi-word term extraction in specialized and unstructured corpora?
This theme investigates methods that fuse linguistic knowledge (syntactic patterns, POS sequences, semantic context) with statistical measures (frequency, co-occurrence, association scores) or machine learning models such as conditional random fields (CRFs) to accurately identify multi-word terms (MWTs) in domain-specific or unstructured texts. It addresses challenges such as term variability, ambiguity, and limited labeled data by integrating complementary sources of knowledge, aiming for higher precision and adaptability across domains and languages.
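The hybrid pipeline this theme describes can be sketched minimally: a linguistic filter proposes candidates via POS-sequence patterns, and a statistical filter keeps only candidates that recur. The tagged sentences, the ADJ/NOUN pattern, and the frequency threshold below are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of a hybrid MWT extractor: a linguistic filter
# (POS-sequence pattern) feeds a statistical frequency filter.
# Toy pre-tagged corpus and thresholds are illustrative assumptions.
from collections import Counter

# (token, POS) pairs; tags follow the Universal POS tag set.
sentences = [
    [("gene", "NOUN"), ("expression", "NOUN"), ("is", "VERB"),
     ("regulated", "VERB"), ("by", "ADP"), ("transcription", "NOUN"),
     ("factors", "NOUN")],
    [("gene", "NOUN"), ("expression", "NOUN"), ("profiling", "NOUN"),
     ("uses", "VERB"), ("microarray", "NOUN"), ("data", "NOUN")],
]

def candidates(sent, min_len=2, max_len=4):
    # Linguistic filter: contiguous ADJ/NOUN runs ending in a NOUN,
    # length 2-4 -- a common MWT candidate pattern.
    for i in range(len(sent)):
        for j in range(i + min_len, min(i + max_len, len(sent)) + 1):
            span = sent[i:j]
            if (all(tag in ("ADJ", "NOUN") for _, tag in span)
                    and span[-1][1] == "NOUN"):
                yield " ".join(tok for tok, _ in span)

# Statistical filter: keep candidates seen at least twice.
counts = Counter(c for s in sentences for c in candidates(s))
terms = {c: f for c, f in counts.items() if f >= 2}
print(terms)  # only "gene expression" appears in both sentences
```

In a real system the frequency cut-off would be replaced by the association measures discussed under the next theme, and the POS patterns would be tuned per language and domain.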
2. What statistical association measures and ranking techniques optimize multi-word term candidate extraction and filtering, particularly in noisy or nested-term scenarios?
This theme explores the development and evaluation of statistical scoring functions—such as C-value, NC-value, pointwise mutual information (PMI), normalized PMI, log-likelihood, TF-IDF, and Kullback-Leibler divergence—for identifying and ranking multi-word term candidates from corpora. A notable challenge addressed is the accurate identification of nested terms and the filtering of spurious or truncated phrases to improve extraction precision, especially when corpora are small or contain semantically odd phrases.
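Two of the measures named above can be made concrete in a few lines: normalized PMI for scoring bigram candidates, and C-value for handling nested terms by discounting the frequency a term owes to the longer terms that contain it. The corpus size and frequency counts below are toy assumptions for illustration, and nesting is approximated by a simple substring test.

```python
# Sketch of NPMI scoring and C-value ranking for MWT candidates.
# All counts (N, unigram, bigram, freq) are illustrative assumptions.
import math

N = 10_000                                  # toy corpus size in tokens
unigram = {"gene": 150, "expression": 120}  # word frequencies
bigram = {"gene expression": 60}            # candidate pair frequency

def pmi(w1, w2, pair):
    p_xy = bigram[pair] / N
    return math.log2(p_xy / ((unigram[w1] / N) * (unigram[w2] / N)))

def npmi(w1, w2, pair):
    # Normalized PMI lies in [-1, 1]; 1 means perfect co-occurrence.
    return pmi(w1, w2, pair) / -math.log2(bigram[pair] / N)

# C-value: promotes longer candidates and discounts a nested term's
# frequency by the mean frequency of the terms that contain it.
freq = {"gene expression": 60, "gene expression profiling": 25}

def c_value(term):
    longer = [t for t in freq if term in t and t != term]  # naive nesting test
    f = freq[term]
    if longer:
        f -= sum(freq[t] for t in longer) / len(longer)
    return math.log2(len(term.split())) * f

print(round(npmi("gene", "expression", "gene expression"), 3))  # 0.685
print(c_value("gene expression"))            # 35.0 after nesting discount
```

Note that the plain C-value is zero for single-word candidates (log2(1) = 0), which is one reason it is typically applied only to multi-word candidates or combined with contextual weighting, as in NC-value.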
3. How does the incorporation of semantic and contextual information improve disambiguation and ranking in multi-word term extraction?
This research theme focuses on leveraging semantic resources (e.g., domain ontologies and thesauri such as UMLS) and contextual similarity measures to distinguish ambiguous terms and improve the ranking of multi-word term candidates. It investigates how deep semantic and syntactic contextual analysis can outperform simple bag-of-words or shallow syntactic filters, allowing better identification of true domain-specific terms and addressing term variation and sense ambiguity.
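A minimal version of the contextual idea above is to represent each candidate by a bag-of-context vector drawn from a small co-occurrence window and score it by cosine similarity to a vector built from a known in-domain seed term. The corpus, the seed term, the stopword list, and the window size below are all illustrative assumptions; a real system would use richer syntactic context or ontology-anchored seeds.

```python
# Sketch of context-vector ranking: candidates whose contexts resemble
# an in-domain seed term's contexts score higher. Toy data throughout.
import math
from collections import Counter

corpus = ("the bank raised interest rates the cell membrane controls "
          "ion flow ion channels span the cell membrane").split()
STOP = {"the"}  # minimal stopword list (assumption)

def context_vector(target, window=2):
    # Bag of words within +/- `window` tokens of each occurrence.
    vec = Counter()
    for i, tok in enumerate(corpus):
        if tok == target:
            lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
            vec.update(t for t in corpus[lo:hi]
                       if t != target and t not in STOP)
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

domain = context_vector("membrane")  # proxy for a biology seed term
print(cosine(context_vector("cell"), domain))  # shares "controls" context
print(cosine(context_vector("bank"), domain))  # no shared context -> 0.0
```

Even on this toy corpus, "cell" outranks "bank" against the biology seed, which is the effect the theme describes: contextual evidence separates in-domain uses from out-of-domain ones where raw frequency cannot.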