Joint optimization of wrapper generation and template detection

2007

https://doi.org/10.1145/1281192.1281287

Abstract

Many websites have large collections of pages generated dynamically from an underlying structured source such as a database. The data of a category are typically encoded into similar pages by a common script or template. In recent years, value-added services such as comparison shopping and vertical search in a specific domain have motivated research on high-accuracy extraction technologies. Almost all previous work assumes that the input pages of a wrapper induction system conform to a common template and that they can be easily identified by a common URL schema. However, we observe that it is hard to distinguish different templates using today's dynamic URLs. Moreover, since extraction accuracy depends heavily on how consistent the input pages are, we argue that it is risky to determine whether pages share a common template based solely on URLs. Instead, we propose a new approach that uses similarity between pages to detect templates. Our approach separates pages with notable inner differences and then generates a wrapper for each resulting group. Experimental results show that the proposed approach is feasible and effective for improving extraction accuracy.
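To make the idea of grouping pages by similarity rather than by URL concrete, the following is a minimal sketch and not the method of the paper. It assumes a structural similarity measure (Jaccard overlap of root-to-element tag paths) and a greedy single-pass grouping with an arbitrary threshold of 0.6; the names `tag_paths`, `jaccard`, and `group_by_template` and the sample pages are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    # Void elements have no closing tag, so they are never pushed on the stack.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_template(pages, threshold=0.6):
    """Assumed greedy grouping: a page joins the first group whose
    representative (its first page) is at least `threshold` similar."""
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(paths, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]


# Hypothetical pages that could sit behind the same dynamic URL pattern.
product_a = ("<html><body><div><h1>Phone</h1>"
             "<table><tr><td>Price</td></tr></table></div></body></html>")
review = ("<html><body><ul><li><p>Great</p></li>"
          "<li><p>Bad</p></li></ul></body></html>")
product_b = ("<html><body><div><h1>Laptop</h1><table><tr><td>Price</td></tr>"
             "<tr><td>Weight</td></tr></table></div></body></html>")

print(group_by_template([product_a, review, product_b]))
```

In this sketch the two product pages share every tag path and land in one group, while the review page shares only the `html` and `html/body` paths and forms its own group. A wrapper would then be induced separately for each group; the similarity measure, threshold, and grouping strategy are all placeholders that a real system would need to tune.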
