Nested term recognition driven by word connection strength
2015, Terminology
Abstract
Domain corpora are often small, and even important terms may occur in them not as isolated maximal phrases but only within more complex constructions. Appropriate recognition of nested terms can thus influence both the content of the extracted candidate term list and its order. We propose a new method for identifying nested terms that combines two criteria: grammatical correctness and normalised pointwise mutual information (NPMI) computed for all bigrams in a given corpus. NPMI is typically used to recognise strong word connections, but in our solution we use it to find the weakest connection in a phrase, which suggests the best place to divide the phrase into two parts. Because each step creates at most two nested phrases, the method introduces a binary term structure. We evaluate the impact of the proposed method, applied together with the C-value ranking method, on the automatic term recognition task for three corpora, two in Polish and one in English.
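The splitting idea can be illustrated with a minimal sketch. It assumes the standard definition NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)), estimates all probabilities with the total token count as a shared denominator, and omits the grammatical-correctness check entirely. The function names `npmi_table` and `weakest_split`, the `unseen` default, and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def npmi_table(sentences):
    """Compute NPMI for every adjacent word pair in tokenised sentences.

    Assumption: unigram and bigram probabilities both use the total
    token count as denominator, which keeps every score in [-1, 1].
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / total
        p_x = unigrams[x] / total
        p_y = unigrams[y] / total
        # p_xy < 1 always holds here, so the denominator is never zero.
        table[(x, y)] = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
    return table


def weakest_split(phrase, table, unseen=-1.0):
    """Split a phrase (list of tokens) at its lowest-NPMI bigram.

    Assumption: bigrams missing from the table get the minimum score.
    Returns None for phrases shorter than two tokens.
    """
    if len(phrase) < 2:
        return None
    scores = [table.get(pair, unseen) for pair in zip(phrase, phrase[1:])]
    cut = scores.index(min(scores)) + 1  # first weakest link wins on ties
    return phrase[:cut], phrase[cut:]


# Toy corpus, invented for illustration only.
toy_corpus = [
    ["blood", "pressure", "measurement"],
    ["blood", "pressure"],
    ["blood", "pressure"],
    ["measurement", "error"],
    ["measurement", "error"],
]
toy_table = npmi_table(toy_corpus)
print(weakest_split(["blood", "pressure", "measurement"], toy_table))
```

Applying `weakest_split` recursively to each resulting part would yield a binary structure of nested candidates; a fuller version would also reject any split whose parts are not grammatically well-formed phrases.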