Evaluation of crawling policies for a web-repository crawler
2006
https://doi.org/10.1145/1149941.1149972
Abstract
We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.