Modeling score distributions in information retrieval

2011, Information Retrieval

https://doi.org/10.1007/S10791-010-9145-5

Abstract

We review the history of modeling score distributions, focusing on the normal-exponential mixture and examining the theoretical as well as the empirical evidence supporting its use. We discuss previously suggested conditions that valid binary mixture models should satisfy, such as the Recall-Fallout Convexity Hypothesis, and formulate two new hypotheses concerning the component distributions under some limiting conditions of parameter values. Of all the mixtures suggested in the past, the current theoretical argument points to the two-gamma mixture as the most likely universal model, with the normal-exponential being a usable approximation. Beyond the theoretical contribution, we provide new experimental evidence showing that vector space or geometric models, and BM25, are "friendly" to the normal-exponential, and that the non-convexity problem of the mixture is not severe in practice.
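The non-convexity problem mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual reading of the normal-exponential mixture: scores of relevant documents follow a normal density and scores of non-relevant documents follow an exponential density. The function names, the parameter names (`mix`, `mu`, `sigma`, `rate`), and the parameter values are hypothetical and chosen only for illustration; they are not taken from the paper. Under these assumptions the probability of relevance given a score rises up to roughly `mu + rate * sigma**2` and then falls, because the exponential tail decays more slowly than the normal tail.

```python
import math


def normal_pdf(s, mu, sigma):
    """Density of a normal distribution (assumed model for relevant scores)."""
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(s, rate):
    """Density of an exponential distribution (assumed model for non-relevant scores)."""
    return rate * math.exp(-rate * s) if s >= 0 else 0.0


def prob_relevant(s, mix, mu, sigma, rate):
    """Posterior probability of relevance at score s under the two-component mixture.

    mix is the assumed prior fraction of relevant documents.
    """
    rel = mix * normal_pdf(s, mu, sigma)
    non = (1.0 - mix) * exponential_pdf(s, rate)
    total = rel + non
    return rel / total if total > 0 else 0.0


# Hypothetical parameter values, for illustration only.
MIX, MU, SIGMA, RATE = 0.1, 5.0, 1.0, 1.0

# Score at which the posterior peaks under these assumptions.
PEAK = MU + RATE * SIGMA ** 2

if __name__ == "__main__":
    for score in (2.0, 5.0, PEAK, 9.0, 12.0):
        p = prob_relevant(score, MIX, MU, SIGMA, RATE)
        print(f"score={score:5.1f}  P(relevant | score)={p:.6f}")
```

Running the sketch shows the posterior increasing towards the peak and decreasing beyond it, which is the behavior a convex recall-fallout curve would rule out. How far into the tail this turning point lies depends on the fitted parameters, which is why the problem may have little practical effect.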
