Study on Web Content Extraction Techniques

2019

https://doi.org/10.5281/ZENODO.3591250

Abstract

The explosive growth of the World Wide Web generates a tremendous amount of web data, and web data mining has consequently become an important technique for discovering useful information and knowledge. Web mining is an active research area closely related to Information Extraction (IE). Automatic content extraction from web pages is a challenging yet significant problem in the fields of information retrieval and data mining. Web content mining refers to the discovery of useful information or knowledge from web page content such as text, images, and videos. Web content extraction is the process of organizing data instances into groups whose members are similar in some way. Content extraction helps the user easily select a topic of interest. Web content mining technology is useful in management information systems. This paper surveys web content extraction techniques. Aye Pwint Phyu | Khaing Khai...
