Part of Speech Based Term Weighting for Information Retrieval

Abstract

Automatic language processing tools typically assign terms so-called 'weights' that reflect how much each term contributes to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the 'POS contexts' in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (a Language Model with Dirichlet prior smoothing) and our best-performing POS-based term weight show consistent retrieval gains across the whole smoothing range of the baseline.
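The abstract does not give the five weight formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a POS-tagged corpus given as lists of (term, tag) pairs, scores each POS n-gram by a simple inverse-frequency measure, averages those scores over the POS contexts in which a term occurs, and multiplies the result into a plain TF-IDF score. The function names, the choice of n = 2, the inverse-frequency scoring, and the multiplicative integration are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tagged_sentence, n):
    """Yield (terms, tags) for every window of n tokens in one tagged sentence."""
    for i in range(len(tagged_sentence) - n + 1):
        window = tagged_sentence[i:i + n]
        yield tuple(term for term, _ in window), tuple(tag for _, tag in window)


def pos_term_weights(tagged_corpus, n=2):
    """Hypothetical POS-based term weight: the mean score of the POS n-grams
    (the 'POS contexts') in which a term occurs.

    ASSUMPTION: a POS n-gram is scored as log(total / count), an inverse-
    frequency measure chosen for illustration, not taken from the paper.
    """
    ngram_count = Counter()
    contexts = defaultdict(list)  # term -> POS n-grams it appeared in
    for sentence in tagged_corpus:
        for terms, tags in pos_ngrams(sentence, n):
            ngram_count[tags] += 1
            for term in terms:
                contexts[term].append(tags)

    total = sum(ngram_count.values())
    ngram_score = {tags: math.log(total / c) for tags, c in ngram_count.items()}
    return {
        term: sum(ngram_score[tags] for tags in ctxs) / len(ctxs)
        for term, ctxs in contexts.items()
    }


def tf_idf_score(query, doc_terms, doc_freq, num_docs, pos_weight=None):
    """Plain TF-IDF document score; if pos_weight is given, each query term's
    contribution is multiplied by its POS-based weight.

    ASSUMPTION: multiplicative integration, with weight 1.0 for unseen terms.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in query:
        if tf[term] == 0 or term not in doc_freq:
            continue
        contribution = tf[term] * math.log(num_docs / doc_freq[term])
        if pos_weight is not None:
            contribution *= pos_weight.get(term, 1.0)
        score += contribution
    return score
```

On a toy corpus the resulting weights are not linguistically meaningful. In practice the POS n-gram statistics would be estimated from a large tagged collection, and the baseline could equally be BM25 or a language model.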
