Semantic similarity knowledge and its applications

2007

Abstract

Semantic relatedness refers to the degree to which two concepts or words are related. Humans can easily judge whether two words are related in some way. For example, most people would agree that apple and orange are more related than apple and toothbrush. Semantic similarity is a special case of semantic relatedness. In this article we describe several methods for computing the similarity of two words, following two directions: dictionary-based methods that use WordNet, Roget's thesaurus, or other resources; and corpus-based methods that use frequencies of co-occurrence in corpora (cosine method, latent semantic indexing, mutual information, etc.). We then present results for several applications of word similarity knowledge: solving TOEFL-style synonym questions, detecting words that do not fit their context in order to find speech recognition errors, and choosing synonyms in context for writing-aid tools. We also present a method for computing the similarity of two short texts, based on the similarities of their words. Applications of text similarity knowledge include designing exercises for second-language learning, acquisition of domain-specific corpora, information retrieval, and text categorization. Before concluding, we briefly describe cross-language extensions of the methods for similarity of words and texts.
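
To make the corpus-based direction concrete, here is a minimal Python sketch of two of the measures named above, the cosine of co-occurrence vectors and pointwise mutual information. It is an illustration under stated assumptions, not the article's implementation: the toy corpus, the choice of one sentence as the co-occurrence context, and the helper names `build_counts`, `cooccurrence_vector`, `cosine`, and `pmi` are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus (illustrative assumption): each sentence is one co-occurrence context.
CORPUS = [
    "apple orange fruit juice",
    "apple fruit tree",
    "orange fruit juice",
    "toothbrush toothpaste bathroom",
    "toothbrush bathroom sink",
]

def build_counts(sentences):
    """Count, per word and per unordered word pair, the contexts they appear in."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        word_counts.update(words)
        # Pairs come out alphabetically ordered because `words` is sorted.
        pair_counts.update(combinations(words, 2))
    return word_counts, pair_counts

def cooccurrence_vector(word, pair_counts):
    """Sparse vector mapping each neighbour of `word` to its co-occurrence count."""
    vec = Counter()
    for (a, b), n in pair_counts.items():
        if a == word:
            vec[b] += n
        elif b == word:
            vec[a] += n
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def pmi(w1, w2, word_counts, pair_counts, n_contexts):
    """Pointwise mutual information, log2(P(w1, w2) / (P(w1) * P(w2))), over contexts."""
    joint = pair_counts[tuple(sorted((w1, w2)))]
    if joint == 0:
        return float("-inf")  # the pair never co-occurs in this corpus
    return math.log2(joint * n_contexts / (word_counts[w1] * word_counts[w2]))

word_counts, pair_counts = build_counts(CORPUS)
apple = cooccurrence_vector("apple", pair_counts)
orange = cooccurrence_vector("orange", pair_counts)
toothbrush = cooccurrence_vector("toothbrush", pair_counts)

print(cosine(apple, orange), cosine(apple, toothbrush))
print(pmi("apple", "orange", word_counts, pair_counts, len(CORPUS)))
```

On this toy corpus, apple and orange share neighbours such as fruit and juice, so both measures rank the pair above apple and toothbrush, which share none. A real corpus would need smoothing and a much larger vocabulary, which this sketch leaves out.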

University of Ottawa, School of Information Technology and Engineering
E-mail address: diana@site.uottawa.ca