Extracting bilingual terms from the Web

2015, Terminology

https://doi.org/10.1075/TERM.21.2.04GAI

Abstract

In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights about the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extr...
