Tool for Parsing Important Data from Web Pages

Applied Sciences

https://doi.org/10.3390/APP122312031

Abstract

This paper presents a tool for extracting the main text and images (that is, extracting and parsing the important data) from a web document. It describes our proposed algorithm, which is based on the Document Object Model (DOM) and natural language processing (NLP) techniques, together with other approaches for extracting information from web pages using classification techniques such as support vector machines, decision trees, naive Bayes, and K-nearest neighbors. The main aim of the developed algorithm was to identify and extract the main block of a web document that contains the text of the article and the relevant images. The algorithm was applied to a sample of 45 web documents of different types. In addition, web pages were analyzed, from the structure of the document to the use of the DOM for processing it. The DOM was used to load and navigate the document. It also plays an important role in the correct identificatio...
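
To make the idea of "identifying the main block" concrete, the following is a minimal sketch and not the authors' algorithm. It assumes well-formed HTML and uses a plain text-length heuristic with no NLP or classifier: the container element that directly holds the most text is treated as the main block, and the images inside it are treated as the relevant images. The names `Block`, `MainBlockFinder`, `extract_main_block`, and the tag sets are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags act as candidate containers for article content.
BLOCK_TAGS = {"div", "article", "section", "main", "td"}
# Assumption: text inside these tags is never part of the article body.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class Block:
    """Text fragments and image URLs collected for one container element."""

    def __init__(self, tag):
        self.tag = tag
        self.text_parts = []
        self.images = []

    @property
    def text(self):
        return " ".join(self.text_parts)


class MainBlockFinder(HTMLParser):
    """Walks the tag stream and assigns text to the innermost open container."""

    def __init__(self):
        super().__init__()
        self.stack = []       # containers that are currently open
        self.finished = []    # containers that have been closed
        self.skip_depth = 0   # > 0 while inside a SKIP_TAGS element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append(Block(tag))
        elif tag == "img" and self.stack and not self.skip_depth:
            src = dict(attrs).get("src")
            if src:
                self.stack[-1].images.append(src)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.stack:
            self.finished.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and not self.skip_depth:
            self.stack[-1].text_parts.append(text)


def extract_main_block(html):
    """Return the container with the most directly held text, or None."""
    finder = MainBlockFinder()
    finder.feed(html)
    finder.close()
    blocks = finder.finished + finder.stack  # include containers left unclosed
    if not blocks:
        return None
    return max(blocks, key=lambda b: len(b.text))


sample = """
<html><body>
<div id="menu"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content">
  <h1>Beaver</h1>
  <p>The beaver is a large rodent that lives near rivers and builds dams.</p>
  <img src="beaver.jpg">
  <p>It feeds on bark, twigs and aquatic plants.</p>
</div>
<div id="footer">Contact</div>
</body></html>
"""

main = extract_main_block(sample)
print(main.text)
print(main.images)
```

On the sample page, the navigation and footer containers hold only a few words, so the content container is selected. A real extractor would need more than text length, for example link density or the classification techniques named in the abstract, to handle pages where comments or sidebars are longer than the article.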

ZVIERATA-bobor vodny.htm 1871 2418 29.24
5 Bobor vodný-Časopis Poľovníctvo a rybárstvo.htm 1752 2423
9 Kliešte v meste Správy Správy Rodinka.sk.htm 4447 4547 2.25
10 Kliešte ohrozujú najviac seniorov, chýba im očkovanie-Zdravie a prevencia-zdravie.pravda.sk.htm 2194 2401 9.43
11 Jedovaté huby.html 4611 5176 12.25
12 Huby.htm 1251 1251 0.00
13 Huba mesiaca júl 09-Časopis Poľovníctvo a rybárstvo.htm 1485 2160 45.45
14 Do lesa s dozimetrom Máme sa báť zbierať huby-Život.sk.htm 5092 5219 2.49
15 Pozor na huby na tanieri.htm 2825 2951 4.46
16 Najnovší objav vedcov Dinosaurus s vráskavými očami-ADAM.sk.htm 1713 1443 -15.76
17 www.kutilas.estranky.sk-Dinosaury.htm 7585 7981 5.22
18 infovekacik.htm 3044 3195 4.96
19 eQuark.sk-portál pre popularizáciu vedy-Prvý dinosaurus s jedným prstom.htm 1867 1882 0.80
20 Dinosaurus bol iný, ako sme si pôvodne mysleli Veda a technika-články 16-06-2011 Veda a technika Noviny.sk.htm 1424 1505 5.69
21 Plameniak ružový ZOO Bojnice.htm 1000 1329 32.90
22 plameniaky.htm 1451 1914 31.91
23 Plameniaky na ružovo farbia baktérie a beta karotén Zaujímavosti prievidza.sme.sk.htm 1882 1856 -1.38
24 Hadogenes paucidens.html 2535 2441 -3.71
25 eQuark.sk-portál pre popularizáciu vedy-Jed škorpiónov je vhodný do pesticídov.htm 1261 1276 1.19
26 História psa Pes-portál.sk.htm 2195 2195 0.00
27 Poľovníka postrelil pes.htm 980 778 -20.61
28 stvornohykamarat-Plemená psov na C a Č-Český strakatý pes.htm 1732 2538 46.54
29 Aj pes, ktorý šteká, hryzie. Nebezpečne-Zdravie a prevencia-zdravie.pravda.sk.htm 6596 6771 2.65
30 DOBERMAN-História dobermana-DOBERMAN3.htm 5406 6198 14.65
31 Slon africký ZOO Bojnice.htm 2786 3307 18.70
32 Je zaujímavé aká môže byť príroda že-Fotoalbum-Cicavce-Slon africký.htm 1793 2240 24.93
33 Ivan Pleško-O slonoch-Slon Africký (Loxodonta Africana).htm 13539 13541 0.01
34 Zvieratká-Suchozemské zvieratá-Slon Africký.htm 4966 5336 7.45
35 MARŤANKOVIA.htm 1371 1475 7.59
36 Vyspelá civilizácia mravcov.htm 14181 15815 11.52
37 rad Blankokrídlovce-Mravce.htm 3203 3228 0.78
38 Mravce používajú antibiotiká. Pestujú huby a spolupracujú s baktériami Biológia veda.sme.sk.htm 3837 3731 -2.76
39 Mravce-etológia, biológia a ich chov-článok zo serveru www.vivarista.sk.htm 12426 12578 1.22
40 Mravce. Blog-Michal Wiezik (blog.sme.sk).htm 6793 8141 19.84
41 Atlas živočíchov vidlochvost ovocný-Na túru s NATUROU.htm 2060 2126 3.20
42 Moľa DDD služby.htm 1478 1491 0.88
43 Vidlochvost Feniklový Motýle Slovenskej republiky.htm 1763 2017 14.41
44 Babôčka osiková « CASD Liptovský Mikuláš.htm 1533 1937 26.35
45 Hnedáčik Pyštekový Motýle Slovenskej republiky.htmr.html 1838 2092 13.82