Hybrid method for automatic extraction of multiword expressions
2018, International Journal of Engineering & Technology
Abstract
A three-phase hybrid method for the automatic extraction of English multiword expressions (MWEs) is proposed. The method is based on linguistic patterns and on the association and context similarity between the constituent words of an MWE. First, candidate expressions are extracted from the raw text as N-grams and filtered using well-defined linguistic patterns. Next, these candidates are filtered again using the association score and the context similarity score between their constituent words. Two association measures, Dice's coefficient and pointwise mutual information (PMI), are used to calculate the association score. The context similarity between words is calculated using Latent Semantic Analysis (LSA). Choosing the best cut-off threshold is a common problem in statistical methods. A two-phase method for deciding this threshold using a training dataset is proposed and employed in the current work. Detailed performance analysis has b...
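As an illustration of the association-scoring step, the following minimal sketch computes Dice's coefficient and PMI for adjacent word pairs using the standard textbook definitions. It is not the authors' implementation: the function name `bigram_association`, the toy sentence, and the restriction to bigrams are assumptions made for this example, and the linguistic-pattern filter, the LSA similarity, and the threshold selection are not shown.

```python
import math
from collections import Counter


def bigram_association(tokens):
    """Return {(w1, w2): (dice, pmi)} for every adjacent word pair.

    Hypothetical helper for illustration; uses raw corpus counts and
    the standard definitions of Dice's coefficient and PMI.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)               # number of word tokens
    n_bi = max(len(tokens) - 1, 1)    # number of adjacent pairs

    scores = {}
    for (w1, w2), f12 in bigrams.items():
        f1, f2 = unigrams[w1], unigrams[w2]
        # Dice: 2 * f(w1, w2) / (f(w1) + f(w2))
        dice = 2 * f12 / (f1 + f2)
        # PMI: log2( P(w1, w2) / (P(w1) * P(w2)) )
        pmi = math.log2((f12 / n_bi) / ((f1 / n_uni) * (f2 / n_uni)))
        scores[(w1, w2)] = (dice, pmi)
    return scores


# Toy corpus (an assumption for this example, not data from the paper).
tokens = "new york is big and new york is old".split()
scores = bigram_association(tokens)
dice_ny, pmi_ny = scores[("new", "york")]
dice_ib, pmi_ib = scores[("is", "big")]
print(f"new york: dice={dice_ny:.3f}, pmi={pmi_ny:.3f}")
print(f"is big:   dice={dice_ib:.3f}, pmi={pmi_ib:.3f}")
```

In this toy corpus "new" and "york" always occur together, so the pair reaches the maximum Dice score, whereas "is big" scores lower. Pairs of words that each occur only once also reach the maximum, which is one reason a frequency or pattern filter and a tuned cut-off are usually applied alongside such measures.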