Measuring the Relatedness between Documents in Comparable Corpora

https://doi.org/10.13140/RG.2.1.2155.7205

Abstract

This paper investigates the use of textual distributional similarity measures in the context of comparable corpora. We address the issue of measuring the relatedness between documents by extracting, measuring and ranking their common content. For this purpose, we designed and applied a methodology that combines available natural language processing technology with statistical methods. Our findings show that a list of common entities and a simple yet robust set of distributional similarity measures are enough to describe and assess the degree of relatedness between the documents. Moreover, our method demonstrated high performance in the task of filtering out documents with a low level of relatedness. By way of example, one of the measures achieved 100%, 100%, 95% and 90% precision when 5%, 10%, 15% and 20% of noise was injected, respectively.
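The idea of extracting, measuring and ranking common content can be illustrated with a minimal sketch. The abstract does not say which entity extractor or which similarity measures were used, so everything below is an assumption made for illustration: plain lowercased word tokens stand in for the extracted entities, the count of shared tokens stands in for a distributional similarity measure, and the function names (`tokens`, `common_count`, `relatedness_scores`, `rank_documents`) are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def tokens(text):
    """Lowercased word tokens; an assumed stand-in for entity extraction."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def common_count(a, b):
    """Number of distinct tokens shared by two documents."""
    return len(a.keys() & b.keys())


def relatedness_scores(docs):
    """Mean number of shared tokens between each document and all the others."""
    bags = [tokens(d) for d in docs]
    totals = [0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        shared = common_count(bags[i], bags[j])
        totals[i] += shared
        totals[j] += shared
    n_others = max(len(docs) - 1, 1)  # avoid division by zero for a single document
    return [t / n_others for t in totals]


def rank_documents(docs):
    """Document indices ordered from least to most related to the rest."""
    scores = relatedness_scores(docs)
    return sorted(range(len(docs)), key=lambda i: scores[i])


docs = [
    "Travel insurance covers medical expenses abroad",
    "Medical expenses abroad are covered by travel insurance policies",
    "The insurance policy covers travel and medical costs",
    "Football match ended with a late goal",  # injected noise document
]
ranking = rank_documents(docs)
```

In this toy corpus the off-topic document shares no tokens with the others, so it lands at the front of `ranking` and would be the first candidate for filtering. A real setup would replace `tokens` with an actual entity extraction step and `common_count` with the chosen similarity measures.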
