A Hybrid Approach for Multiword Expression Identification

2010, Lecture Notes in Computer Science

https://doi.org/10.1007/978-3-642-12320-7_9

Abstract

Considerable attention has been given to the problem of Multiword Expression (MWE) identification and treatment in NLP tasks such as parsing and generation, with the aim of improving the quality of results. Statistical methods have often been employed for MWE identification, as an inexpensive and language-independent way of finding co-occurrence patterns. On the other hand, more linguistically motivated identification methods, which employ information such as POS filters and lexical alignment between languages, can produce more targeted candidate lists. In this paper we propose a hybrid approach that combines the strengths of these different sources of information using a machine learning algorithm to produce more robust and precise results. Automatic evaluation on gold standards shows that the performance of our hybrid method is superior to the individual results of statistical and alignment-based MWE extraction approaches for Portuguese and for English. This method can be used to aid lexicographic work by providing a more targeted MWE candidate list.
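To make the idea of combining information sources concrete, the sketch below builds one feature record per bigram candidate from three kinds of evidence: a statistical association score, a POS-pattern filter, and an alignment flag. It is only an illustration under stated assumptions. Pointwise mutual information stands in for whatever association measures the paper uses, and the function names, the toy POS tags, and the `aligned_pairs` set are hypothetical. The abstract does not specify the feature set or the learning algorithm, so no classifier is shown.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information for every adjacent word pair.

    Assumption: PMI is used here only as a familiar example of a
    statistical association measure.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def candidate_features(tokens, pos_tags, aligned_pairs, allowed_patterns):
    """Build one feature dict per bigram candidate.

    aligned_pairs and allowed_patterns are hypothetical inputs that stand
    in for the output of a word aligner and a hand-written POS filter.
    """
    # Remember the POS pattern seen at the first occurrence of each bigram.
    pos_of = {}
    word_pairs = zip(tokens, tokens[1:])
    tag_pairs = zip(pos_tags, pos_tags[1:])
    for pair, tags in zip(word_pairs, tag_pairs):
        pos_of.setdefault(pair, tags)

    features = {}
    for pair, score in pmi_scores(tokens).items():
        features[pair] = {
            "pmi": score,                               # statistical evidence
            "pos_ok": pos_of[pair] in allowed_patterns,  # linguistic filter
            "aligned": pair in aligned_pairs,            # alignment evidence
        }
    return features


if __name__ == "__main__":
    toy_tokens = ["heart", "attack", "is", "a", "heart", "attack"]
    toy_tags = ["N", "N", "V", "DET", "N", "N"]
    feats = candidate_features(
        toy_tokens,
        toy_tags,
        aligned_pairs={("heart", "attack")},
        allowed_patterns={("N", "N")},
    )
    for pair, values in feats.items():
        print(pair, values)
```

Records of this shape could then be labelled against a gold standard and passed to any supervised learner, so that the decision rests on all three signals rather than on a single score threshold.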

This is a pre-print of an article published in the Proceedings of PROPOR 2016. The final authenticated version is available online at: https://doi.org/10.1007/978-3-642-12320-7_9