Existing focused crawlers (FCs) are based upon a fixed model of the web and are thus deficient in exploiting available information. The premise of this paper is that an ontology can play an important role in enhancing the efficiency of existing agent...
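A minimal sketch of how an ontology can sharpen a crawler's page-relevance estimate: weight term frequencies by concept importance taken from the ontology. The concept list, weights, and scoring rule below are illustrative assumptions, not the paper's actual model.

```python
# Sketch: scoring a page against an ontology's weighted concept terms.
from collections import Counter
import re

# Hypothetical ontology fragment: concept -> weight (broader concepts weigh less)
ONTOLOGY = {"crawler": 1.0, "search engine": 0.8, "indexing": 0.6, "web": 0.3}

def relevance(text: str) -> float:
    """Weighted term-frequency score of a page against the ontology."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    score = 0.0
    for concept, weight in ONTOLOGY.items():
        # Multi-word concepts are counted as phrases, single words as tokens
        hits = text.lower().count(concept) if " " in concept else tokens[concept]
        score += weight * hits
    return score / max(1, sum(tokens.values()))  # normalize by page length

print(relevance("A focused crawler feeds the search engine indexing pipeline."))
```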
The World Wide Web is growing rapidly, and data today is stored in a distributed manner; hence the need to develop a search-engine-based architectural model for people to search the Web. Broad web search engines...
Electrical submersible pumps (ESPs) of high power rating are mainly used in oil-field applications. A long cable is required to connect the ESP system, so harmonics and high-voltage transients are produced due to series...
Abstract: In this paper we describe Agro Explorer, a language-independent search engine with a multilingual information access facility. Instead of searching plain text, it searches a meaning representation, an Interlingua...
Today, the Internet is an integral part of human life, but its growth creates major problems for users: download speed, the quality of downloaded web pages, and finding relevant content among millions of...
Search engines are the instruments for website navigation and search, because the Internet is vast and has expanded greatly. By continuously downloading web pages for processing, search engines provide search facilities and maintain...
Building an efficient search structure is very important at the current scale of the web. Search engines mine information from the web using a program called a web crawler, which efficiently traverses the web. A distributed crawler...
E-Government applications in developing countries still lag behind e-Governments in advanced countries. For example, the use of information integration for Web portal content is still very limited. This paper proposes an automated...
The World Wide Web has grown rapidly, scaling beyond our imagination. Search engines are used to surmount the resulting challenges. One of the most important types of crawler is the focused crawler, which is used to index...
Indexing the Web is becoming a laborious task for search engines as the Web grows exponentially in size and distribution. Presently, the most effective known approach to this problem is the use of focused crawlers. A focused...
A Query based Approach to Reduce the Web Crawler Traffic using HTTP Get Request and Dynamic Web Page
The function of a web crawler is to download information from the web for a search engine. Web pages change without notice, so the crawler has to revisit web sites to download updated and new pages. It is estimated that 40% of current web traffic is...
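The standard HTTP mechanism that traffic-reduction schemes of this kind build on is the conditional GET: the crawler revisits a page, but the server returns a body only if the page has changed since the last fetch. A sketch using the requests library, with the URL and validator storage simplified; this illustrates the general mechanism, not necessarily the paper's exact method.

```python
# Revisit a page, downloading the body only if it changed (HTTP 304 otherwise).
import requests

def refresh(url, etag=None, last_modified=None):
    headers = {}
    if etag:
        headers["If-None-Match"] = etag            # validator from a prior fetch
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:                    # unchanged: no body was sent
        return None, etag, last_modified
    # changed (or first visit): keep the new validators for the next cycle
    return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```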
To search for information on the web, users rely extensively on search engines. As the growth of the World Wide Web has exceeded all expectations, search engines rely on web crawlers to maintain an index of billions of pages for efficient...
The Internet is a vast collection of web pages containing vast amounts of information spread across multiple servers. The sheer size of this collection is a daunting obstacle to retrieving necessary and relevant information. This is where...
The Hidden Web's broad and relevant coverage of dynamic, high-quality content, coupled with the high change frequency of web pages, poses a challenge for maintaining and fetching up-to-date information. For this purpose, it is required to...
Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata to the digital library CiteSeerX...
Cloud computing has become an important paradigm, as it is reliable and provides a cost-effective way of storing data and hosting applications. Cloud storage is growing exponentially, and the data must be monitored in a secure manner. Cloud...
The rapid growth of the World Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In the personalized search domain, an alternative to general-purpose crawlers, called focused crawlers, is...
We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling...
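One repository lookup a crawler like this can perform today is a query against the Internet Archive's public Wayback "availability" API, which returns the archived snapshot closest to a requested date. A sketch under that assumption; the paper's own interfaces to Google, Yahoo and MSN caches may differ.

```python
# Ask the Wayback Machine for the snapshot of a URL closest to a given date.
import requests

def closest_snapshot(url, timestamp="20060101"):
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=10,
    )
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"], snap["timestamp"]
    return None  # nothing archived for this URL

print(closest_snapshot("example.com"))
```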
A web-crawler is a program that automatically and systematically tracks the links of a website and extracts information from its pages. Due to the different formats of websites, the crawling scheme for different sites can differ...
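The core loop such a crawler runs is small: fetch a page, extract its links, and enqueue the unseen ones. A minimal single-site sketch using requests and BeautifulSoup; the start URL, page limit, and "extract the title" step are placeholders for whatever a concrete system scrapes.

```python
# Breadth-first crawl of one host, printing each page's title.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(start, max_pages=20):
    host, seen, queue = urlparse(start).netloc, {start}, deque([start])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue  # skip unreachable pages
        print(url, "->", soup.title.string if soup.title else "(no title)")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # resolve + drop fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
```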
JavaScript client-side hidden web (CSHW) pages contain dynamic material created as a result of specific user activities. The number of CSHW websites is increasing. Crawling the so-called Hidden Web is challenging, particularly when...
Text classification, also called text categorization or text tagging, is a crucial and extensively used approach in Natural Language Processing (NLP) for assigning unseen documents to predefined categories. In this paper, we...
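For orientation, here is the generic baseline that text-classification papers commonly compare against: TF-IDF features with a linear classifier. The tiny dataset is made up for illustration; it is not this paper's corpus or proposed method.

```python
# TF-IDF + logistic regression: a standard text-classification baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["stock markets fell today", "the team won the match",
        "shares rallied on earnings", "the striker scored twice"]
labels = ["finance", "sports", "finance", "sports"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)                                   # learn from labeled docs
print(model.predict(["quarterly earnings beat forecasts"]))  # -> ['finance']
```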
When crawling, resources such as the number of machines, crawl time, and so on are limited, so a crawler has to decide an optimal order in which to crawl and recrawl web pages. Ideally, crawlers should request only those web pages that...
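A common recrawl policy in this setting, sketched below under assumed inputs rather than as the paper's algorithm, is to estimate each page's change rate from past observations and refresh pages in order of expected staleness (change rate times time since last crawl).

```python
# Order pages for recrawl by expected staleness = change_rate * age.
import heapq, time

def schedule(pages, budget):
    """pages: list of (url, change_rate, last_crawl_ts). Returns crawl order."""
    now = time.time()
    heap = [(-rate * (now - last), url) for url, rate, last in pages]
    heapq.heapify(heap)  # max-heap via negated priorities
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]

order = schedule([("a.com", 0.9, 0), ("b.com", 0.1, 0), ("c.com", 0.5, 0)], 2)
print(order)  # most change-prone pages first: ['a.com', 'c.com']
```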
Online notepad services allow users to upload and share free text anonymously. Reviewing Pastebin, one of the most popular online notepad service websites, it is possible to find textual content that could be related to illegal...
Focused crawlers are programs designed to browse the Web and download pages on a specific topic. They are used for answering user queries or for building digital libraries on a topic specified by the user. In this article we will show how...
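The mechanism shared by most focused crawlers is a best-first frontier: URLs are dequeued by predicted topical relevance rather than discovery order. In this sketch, score_page is a stand-in for whatever relevance classifier a concrete system uses; the topic terms are assumptions.

```python
# Best-first frontier: pop the URL whose anchor text looks most on-topic.
import heapq

TOPIC_TERMS = {"crawler", "index", "search"}  # illustrative topic vocabulary

def score_page(text):
    words = text.lower().split()
    return sum(w in TOPIC_TERMS for w in words) / max(1, len(words))

frontier = []  # max-heap via negated scores

def enqueue(url, anchor_text):
    heapq.heappush(frontier, (-score_page(anchor_text), url))

enqueue("http://example.com/a", "focused crawler and search index design")
enqueue("http://example.com/b", "holiday photos")
print(heapq.heappop(frontier)[1])  # -> the topically relevant URL first
```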
One of the main challenges in the domain of competitive intelligence is to harvest large volumes of information from the web and extract the most valuable pieces of information. As the amount of information available on the web grows...
Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from ordinary language. The first step in creating a...
In this paper we present two methodologies for rapidly inducing multiple subject-specific taxonomies from crawled data. The first method uses sentence-level word co-occurrence frequencies to build the taxonomy, while the...
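The first method's basic signal is easy to compute: count how often word pairs co-occur within the same sentence, and treat frequent pairs as candidate taxonomy edges. A sketch on a toy corpus; the corpus and any thresholds are assumptions, as the abstract does not give them.

```python
# Sentence-level word co-occurrence counting over a toy crawled corpus.
from collections import Counter
from itertools import combinations
import re

corpus = ["the crawler downloads web pages", "a focused crawler ranks web pages"]
pair_counts = Counter()
for sentence in corpus:
    words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
    pair_counts.update(combinations(words, 2))  # every unordered pair, once

# Frequent pairs, e.g. ('crawler', 'pages'), are candidate taxonomy edges.
print(pair_counts.most_common(3))
```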
This paper presents a tracked robot composed of the proposed crawler mechanism, in which a planetary gear reducer is employed as the transmission device and provides two outputs in different forms with only one actuator. When the crawler...
The World Wide Web is a huge source of hyperlinked information contained in hypertext documents. Search engines use web crawlers to collect these documents from the web for the purpose of storage and indexing. However, many of these documents...
A focused crawler aims at discovering as many web pages relevant to a target topic as possible, while avoiding irrelevant ones. Reinforcement Learning (RL) has been utilized to optimize focused crawling. In this paper, we propose TRES, an...
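This is not the TRES algorithm itself, only the generic RL framing it builds on: the crawler is an agent whose action is choosing the next URL to fetch, and the reward is whether the fetched page turned out to be on-topic. A minimal epsilon-greedy sketch of that framing.

```python
# Epsilon-greedy URL selection with running-mean value estimates.
import random

q = {}       # url -> estimated value of crawling it
visits = {}  # url -> number of reward updates so far

def pick(frontier, eps=0.1):
    if random.random() < eps:                          # explore a random URL
        return random.choice(frontier)
    return max(frontier, key=lambda u: q.get(u, 0.0))  # exploit the best estimate

def update(url, reward):
    """reward: 1.0 if the fetched page was on-topic, else 0.0."""
    n = visits[url] = visits.get(url, 0) + 1
    q[url] = q.get(url, 0.0) + (reward - q.get(url, 0.0)) / n  # running mean
```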
The rapid growth of the World Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a...
We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as "find the number...
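A toy version of the idea: keep pages and hyperlinks in relational tables so that content and link structure can be combined in a single SQL query. The schema, data, and query below are illustrative, not the system described in the paper.

```python
# Pages and links as tables; one query mixes content filters with link structure.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page(url TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE link(src TEXT, dst TEXT);
    INSERT INTO page VALUES ('a', 'crawler survey'), ('b', 'cooking tips');
    INSERT INTO link VALUES ('b', 'a'), ('a', 'b'), ('b', 'b');
""")
# e.g. "pages whose title mentions 'crawler', ranked by in-degree"
rows = db.execute("""
    SELECT p.url, p.title, COUNT(l.src) AS indegree
    FROM page p LEFT JOIN link l ON l.dst = p.url
    WHERE p.title LIKE '%crawler%'
    GROUP BY p.url ORDER BY indegree DESC
""").fetchall()
print(rows)  # [('a', 'crawler survey', 1)]
```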
The rapid growth of the World Wide Web has made it difficult for general-purpose search engines, e.g. Google and Yahoo, to retrieve the most relevant results in response to user queries. A vertical search engine specialized in a...
Crawler-based search engines are the most widely used search engines among Internet users; they involve web crawling, storing pages in a database, ranking, indexing, and displaying results to the user. But it is noteworthy that because of increasing...
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty...
Design inspiration comes from the continuous stimulation of external information and the continuous accumulation of knowledge. In order to obtain ideal design inspiration from nature, researchers have proposed a large number of...