A Cascaded Classification Approach to Semantic Head Recognition
Abstract
Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like "hot dog" or "black hole" unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because, unlike other work on MWUs, tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.