Probabilistic score propagation in information retrieval
Abstract
Information retrieval techniques deal with different units of information, such as terms, topics, or documents. Explicit or implicit link structures usually exist between items of the same unit or between items of different units. For example, hyperlinks between pages in a hypertext collection are explicit structures, while the links between terms in a co-occurrence network are implicit structures. Many traditional information retrieval methods use only the content of the items for retrieval and overlook the link structures. Those that do use link structures fully exploit neither the discriminative power of the content nor all of the useful link information. In this thesis, we propose a general probabilistic score propagation framework for combining content and link information, which can fully take advantage of content information and the link