Distributed hypertext resource discovery through examples
1999
Abstract
We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead, we exploit two properties: pages tend to cite pages on related topics, and, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the user's interest, and (2) a distiller that evaluates a page as an access point for a large neighborhood of relevant pages. Our implementation uses IBM's Universal Database, not only for robust data storage, but also for integrating the computations of the classifier and distiller into the database. This results in a significant increase in I/O efficiency: a factor of ten for the classifier and a factor of three for the distiller. In addition, ad-hoc SQL queries can be used to monitor the crawler and to change crawling strategies dynamically. We report on experiments to establish that our system is efficient, effective, and robust.
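To make the example question in the abstract concrete, the following is a minimal sketch of how such a query might look once pages and links are stored relationally. The schema is an assumption made for illustration only: the table names `page` and `link`, their columns, the topic labels, and the example URLs are hypothetical, and Python's built-in `sqlite3` module stands in for the IBM Universal Database used in the actual system.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical tables: one row per crawled page, one row per hyperlink.
        CREATE TABLE page (url TEXT PRIMARY KEY, topic TEXT);
        CREATE TABLE link (src TEXT, dst TEXT, found TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [
            ("http://example.org/env1", "environmental protection"),
            ("http://example.org/env2", "environmental protection"),
            ("http://example.org/oil1", "oil and natural gas"),
            ("http://example.org/oil2", "oil and natural gas"),
            ("http://example.org/misc", "cycling"),
        ],
    )
    # Dates are ISO-8601 strings, so string comparison orders them correctly.
    conn.executemany(
        "INSERT INTO link VALUES (?, ?, ?)",
        [
            ("http://example.org/env1", "http://example.org/oil1", "1999-03-01"),
            ("http://example.org/env2", "http://example.org/oil1", "1999-05-10"),
            ("http://example.org/env2", "http://example.org/oil2", "1997-06-01"),
            ("http://example.org/env1", "http://example.org/misc", "1999-04-01"),
            ("http://example.org/misc", "http://example.org/oil1", "1999-02-01"),
        ],
    )
    return conn


def count_topic_links(conn, src_topic, dst_topic, since):
    """Count links from src_topic pages to dst_topic pages found on or after `since`."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM link AS l
        JOIN page AS s ON s.url = l.src
        JOIN page AS d ON d.url = l.dst
        WHERE s.topic = ? AND d.topic = ? AND l.found >= ?
        """,
        (src_topic, dst_topic, since),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_demo_db()
    print(
        count_topic_links(
            conn, "environmental protection", "oil and natural gas", "1998-06-01"
        )
    )
```

In this sketch the `topic` column plays the role of the classifier's output, which is why the query can join page contents (topic) with hyperlink structure (the `link` table) and metadata (the `found` date) in a single statement.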