Part of Speech Based Term Weighting for Information Retrieval
Abstract
Automatic language processing tools typically assign terms so-called 'weights' that reflect how much each term contributes to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the 'POS contexts' in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different baseline retrieval model (a language model with Dirichlet prior smoothing) and our best-performing POS-based term weight show retrieval gains consistently across the whole smoothing range of the baseline.
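To make the idea concrete, the following is a minimal sketch of how a term weight might be derived from POS n-gram statistics and folded into a TF-IDF-style score. It is not one of the five computations proposed in the paper, which the abstract does not spell out. The function names (`pos_term_weights`, `score`), the use of POS trigrams, the negative log relative frequency as a stand-in for the informativeness of a POS n-gram, and the multiplicative integration are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tagged_sentence, n=3):
    """Yield (start_index, POS n-gram) pairs for one tagged sentence."""
    tags = [tag for _, tag in tagged_sentence]
    for i in range(len(tags) - n + 1):
        yield i, tuple(tags[i:i + n])


def pos_term_weights(tagged_corpus, n=3):
    """Map each term to the mean 'informativeness' of its POS contexts.

    Assumption (hypothetical, not from the paper): a POS n-gram is scored
    by the negative log of its relative frequency in the corpus.
    """
    ngram_counts = Counter()
    term_contexts = defaultdict(list)
    for sentence in tagged_corpus:
        for i, gram in pos_ngrams(sentence, n):
            ngram_counts[gram] += 1
            # Every word covered by this n-gram occurs in this POS context.
            for word, _ in sentence[i:i + n]:
                term_contexts[word.lower()].append(gram)
    total = sum(ngram_counts.values())
    gram_info = {g: -math.log(c / total) for g, c in ngram_counts.items()}
    return {
        term: sum(gram_info[g] for g in grams) / len(grams)
        for term, grams in term_contexts.items()
    }


def score(query_terms, doc_tf, idf, pos_weight):
    """TF-IDF-style score with the POS-based weight as an extra factor.

    Assumption: terms without a POS-based weight fall back to 1.0.
    """
    return sum(
        doc_tf.get(t, 0) * idf.get(t, 0.0) * pos_weight.get(t, 1.0)
        for t in query_terms
    )


corpus = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("retrieval", "NN"), ("models", "NNS"), ("rank", "VBP"), ("documents", "NNS")],
]
weights = pos_term_weights(corpus)
doc_score = score(["retrieval", "missing"], {"retrieval": 2}, {"retrieval": 1.5}, weights)
```

In this toy corpus, "the" only ever appears in the most frequent POS trigram, so it receives a lower weight than "retrieval". In practice the POS n-gram statistics would be gathered once from a large tagged corpus, since the weight is meant to describe a term in general rather than within a particular collection.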