Methods For Extracting Content Blocks From Web Pages

Abstract

The Web is perhaps the single largest data source in the world, and the information it covers is wide and diverse. A web page typically mixes the information the user actually requires, its content blocks, with irrelevant material, termed non-content blocks, such as banner ads, navigation bars, and copyright notices. Web mining aims to extract and mine useful knowledge from the Web, but non-content blocks harm web mining. To improve web mining, it is therefore necessary to differentiate content blocks from non-content blocks and to eliminate the non-content blocks from web pages. This paper discusses several techniques and methods for performing this task, which ultimately provide significant storage and time savings by delivering only the content blocks of web pages to the user.
