

A Cascaded Classification Approach to Semantic Head Recognition

Abstract

Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because – unlike other work on MWUs – tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.
