International Journal of Computer Applications, Apr 30, 2012
Today, information retrieval from the Internet has become a commonplace phenomenon, since information is readily available and accessible to everyone. Whenever a user types a query into a search engine, answers are returned within a few microseconds. However, the results may or may not be accurate, because different websites may give different information about the same entity. So the biggest question is: which website should the user trust? There are many characteristics that users can use to determine the trustworthiness of content provided by Web information sources. In the proposed system, the filtering of website trustworthiness is based on five major areas: Authority, Related Resources, Popularity, Age and Recommendation. The proposed system defines eighteen factors which are categorized under these five major areas. The website trustworthiness is calculated from these eighteen factors for each URL and stored, thereby improving the performance of retrieving trustworthy websites. The objective of the proposed system is to return more trustworthy websites as top results, which would save a considerable amount of searching time.
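The abstract does not give the concrete scoring formula, so the following is only a minimal sketch of how per-URL factor scores grouped into the five areas could be combined into a single trust score; the factor names, the equal area weights, and the normalization to [0, 1] are assumptions for illustration, not the paper's eighteen factors.

```python
# Minimal sketch of area-based trust scoring (assumed equal area weights,
# scores normalized to [0, 1]); factor names below are illustrative placeholders.

def trust_score(factor_scores, area_weights=None):
    """Combine per-area factor scores into one trust score."""
    if area_weights is None:
        # assume the five areas contribute equally
        area_weights = {area: 1.0 / len(factor_scores) for area in factor_scores}
    total = 0.0
    for area, scores in factor_scores.items():
        area_avg = sum(scores) / len(scores) if scores else 0.0
        total += area_weights[area] * area_avg
    return total

example = {
    "authority": [0.8, 0.6],          # e.g. HTTPS present, domain reputation
    "related_resources": [0.7],
    "popularity": [0.9, 0.5, 0.4],    # e.g. inlink count, traffic rank
    "age": [0.6],
    "recommendation": [0.3, 0.7],
}
print(round(trust_score(example), 3))
```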
International Journal of Computer Theory and Engineering, 2011
E-learning has gained importance over traditional classroom learning techniques in the past few decades. A system is required that addresses the learning needs of the learner for better understanding. Thus the resources used to store information on the web have to be organized and stored in a way that their retrieval is meaningful, as compared to just a keyword search. To accomplish this goal, semantics are used to store the resources. This paper demonstrates how semantic metadata can be stored and retrieved to provide better results to the learner, along with personalized learning. Index Terms: E-learning, semantic web. Prof. Jayant Gadge graduated in Computer Engineering from Walchand Institute of Technology, Solapur, and completed a Master's degree in Computer Engineering from the University of Mumbai. He has more than 15 years of teaching experience and has guided around 25 postgraduate dissertations to date. He is Head of the Computer Engineering Department at Thadomal Shahani College of Engineering, Bandra (W), Mumbai, India. His areas of interest are Software Engineering, Computer Networks, Data Mining and Web Mining.
International Journal of Computer Applications, 2011
With the phenomenal growth of the web, there is an ever-increasing volume of data and information published in numerous web pages. It is said that the web is noisy: a web page typically contains a mixture of many kinds of information, e.g. main content, advertisements, navigation panels, copyright blocks, etc. For a particular application only part of this information is useful and the rest is noise, which seriously harms web mining. Advertisements and sponsor images are not important for surfing, so there is a need for a technique that keeps the common navigation structure as it is but removes image advertisements and improves surfing efficiency. In this paper a small application, HTML Tag Differentiator, is created which removes image advertisements using a rule-based classifier.
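As an illustration of the rule-based idea, the sketch below removes img tags whose attributes match simple ad-like rules while leaving the rest of the page intact. The keyword list, banner dimensions, and use of BeautifulSoup are assumptions; the actual HTML Tag Differentiator rules are not reproduced here.

```python
# Minimal sketch: drop image advertisements, keep the navigation structure.
# The rules (ad-like keywords in src/href, banner-like dimensions) are illustrative.
from bs4 import BeautifulSoup

AD_KEYWORDS = ("ads", "banner", "sponsor", "doubleclick")
BANNER_SIZES = {("728", "90"), ("468", "60"), ("300", "250")}  # common ad sizes

def remove_image_ads(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        src = (img.get("src") or "").lower()
        size = (img.get("width"), img.get("height"))
        parent_href = ""
        if img.parent is not None and img.parent.name == "a":
            parent_href = (img.parent.get("href") or "").lower()
        if any(k in src or k in parent_href for k in AD_KEYWORDS) or size in BANNER_SIZES:
            img.decompose()  # remove the ad image tag from the DOM
    return str(soup)
```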
The Web has become the largest available repository of data. The exponential growth and the fast pace of change of the web make it really hard to retrieve all relevant information. Crawling web pages at speed to find a relevant set of documents is ...
Watermarking database relations is needed to deter their piracy; relational data has unique characteristics which pose new challenges for watermarking, and a watermarking system for relational data must provide certain desirable properties. Proving ownership rights on outsourced relational databases is a crucial issue in today's Internet-based application environments and in many content distribution applications. In this paper, a new mechanism is proposed for proof of ownership based on the secure embedding of a robust imperceptible watermark in relational data. The steps of the proposed mechanism for watermarking a relational database mainly involve decoding and encoding on a numerical attribute of the relational database. The first phase is to partition the original data and assign a partition number to each tuple of the relation using a cryptographic hash function (MD5). In the second phase, while changing the data, the desired watermark is selected and bit bi is selected fro...
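The first phase described above (assigning tuples to partitions with an MD5 hash) can be sketched as follows; the secret key, the partition count, and the key concatenation layout are illustrative assumptions rather than the paper's exact construction.

```python
# Minimal sketch of the partitioning phase: each tuple's partition number is derived
# from an MD5 hash of its primary key combined with a secret key (assumed layout).
import hashlib

SECRET_KEY = "owner-secret"   # known only to the data owner (assumption)
NUM_PARTITIONS = 8            # illustrative partition count

def partition_of(primary_key: str) -> int:
    digest = hashlib.md5((SECRET_KEY + primary_key + SECRET_KEY).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

tuples = {"t1": 10.5, "t2": 7.2, "t3": 3.9}   # primary key -> numeric attribute
partitions = {pk: partition_of(pk) for pk in tuples}
print(partitions)
```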
International Journal of Computer Applications, 2012
Information retrieval (IR) is the area of study concerned with searching for documents or for information within documents. The user describes an information need with a query consisting of a number of words. Finding the weight of a query term is useful to determine the importance of the query. Calculating term importance is a fundamental aspect of most information retrieval approaches, and it is traditionally determined through Term Frequency-Inverse Document Frequency (TF-IDF). This paper proposes a new term weighting technique called Concept-Based Term Weighting (CBW) that gives a weight to each query term to determine its significance using the WordNet ontology.
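The exact CBW formula is not given in the abstract, so the sketch below only illustrates the idea of weighting a query term by conceptual information from WordNet: specificity (hypernym depth) discounted by polysemy (number of senses) serves as a stand-in measure. It assumes NLTK with the WordNet corpus installed.

```python
# Minimal sketch of a concept-based term weight from WordNet (stand-in measure, not
# the paper's CBW formula). Requires: nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def concept_weight(term: str) -> float:
    synsets = wn.synsets(term)
    if not synsets:
        return 1.0                                  # unknown term: neutral weight (assumption)
    depth = max(s.max_depth() for s in synsets)     # deeper concepts are more specific
    return depth / len(synsets)                     # discount highly ambiguous terms

query = ["retrieval", "bank", "photosynthesis"]
print({t: round(concept_weight(t), 2) for t in query})
```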
Dimensionality refers to the number of terms in a web page. High dimensionality of web pages causes problems when classifying them, and the main objective of reducing the dimensionality of web pages is to improve the performance of the classifier. Processing time and accuracy are two parameters which influence the performance of a classifier. To reduce processing time, less informative and redundant terms have to be removed from web pages. This research describes a hybrid approach for dimensionality reduction in web page classification using a rough set and a naïve Bayesian method. Feature selection and dimensionality reduction methods are used for reducing the dimensionality: the information gain method is used for feature selection, and the rough-set-based Quick Reduct algorithm is used for dimensionality reduction. The naïve Bayesian method is used for classifying web pages into optimal predefined categories; the assignment of a web page to a category is based on the maximum posterior probability. Words remaining ...
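A minimal sketch of the feature-selection-plus-classification part of this pipeline is shown below, with scikit-learn's mutual-information scorer standing in for information gain; the rough-set Quick Reduct step is omitted for brevity, and the documents and category labels are illustrative.

```python
# Minimal sketch: select the most informative terms, then classify with naive Bayes
# by maximum posterior probability. Data and the k=5 choice are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap flights and hotel deals", "football match score update",
        "stock market price analysis", "travel visa and hotel booking"]
labels = ["travel", "sports", "finance", "travel"]

model = make_pipeline(
    CountVectorizer(stop_words="english"),      # term features from web page text
    SelectKBest(mutual_info_classif, k=5),      # keep the most informative terms
    MultinomialNB(),                            # assign page to most probable category
)
model.fit(docs, labels)
print(model.predict(["hotel prices for a weekend trip"]))
```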
Today, with the rapid development of the Internet, textual information is growing rapidly, so document retrieval, which aims to find and organize relevant information in text collections, is needed. With the availability of large-scale inexpensive storage, the amount of information stored by organizations will increase, and searching for information and deriving useful facts will become more cumbersome. How to extract a lot of information quickly and effectively has become the focus of current research. The state of the art for traditional IR techniques is to find relevant documents by matching words in the user's query with individual words in text collections. The problem with content-based retrieval systems is that documents relevant to a user's query are not retrieved, while many unrelated or irrelevant materials are retrieved. In this paper an information retrieval method is proposed based on the LSI approach. The Latent Semantic Indexing (LSI) model is a concept-based retrie...
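As a sketch of concept-based retrieval with LSI (not the paper's exact implementation), the example below builds a TF-IDF term-document matrix, reduces it with truncated SVD, and ranks documents by cosine similarity to the query in the latent concept space; the corpus, the query, and the choice of two concepts are illustrative.

```python
# Minimal LSI sketch: TF-IDF matrix -> truncated SVD -> cosine similarity in concept space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is driven on the road", "trucks are driven on highways",
        "a chef cooks dinner in the kitchen", "recipes for cooking pasta"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

lsi = TruncatedSVD(n_components=2, random_state=0)   # two latent "concepts"
doc_vecs = lsi.fit_transform(tfidf)

query_vec = lsi.transform(vectorizer.transform(["vehicle on the highway"]))
scores = cosine_similarity(query_vec, doc_vecs)[0]
print(sorted(zip(scores, docs), reverse=True)[0])     # best match in concept space
```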
Information Extraction has become an important task for discovering useful knowledge or information from the Web. A crawler system, which gathers information from the Web, is one of the fundamental necessities of Information Extraction. A search engine uses a crawler to crawl and index web pages, and it takes into account only the informative content for indexing. In addition to informative content, web pages commonly have blocks that are not main content blocks; these are called non-informative blocks, or noise. Noise is generally irrelevant to the main content of the page and affects two major parameters of search engines: the precision of search and the size of the index. In order to improve the performance of information retrieval, cleaning of web pages becomes critical. The main objective of the proposed technique is to eliminate the non-informative content blocks from a web page. In the proposed technique, the extraction of informative content blocks and elimination of ...
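The abstract is truncated before the technique itself, so the sketch below only illustrates one common way to separate informative blocks from noise blocks: treating blocks dominated by anchor (link) text as navigational noise. The tag list and the 0.5 link-density threshold are assumptions, not the paper's criterion.

```python
# Minimal sketch: keep blocks whose text is mostly plain text rather than link text.
# Threshold and tag selection are illustrative assumptions.
from bs4 import BeautifulSoup

def informative_blocks(html: str, max_link_density: float = 0.5):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for block in soup.find_all(["div", "p", "section", "article"]):
        text = block.get_text(" ", strip=True)
        if not text:
            continue
        link_text = " ".join(a.get_text(" ", strip=True) for a in block.find_all("a"))
        density = len(link_text) / len(text)
        if density <= max_link_density:      # mostly plain text -> likely main content
            blocks.append(text)
    return blocks
```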
Computer networks are usually vulnerable to attacks by any unauthorized person trying to misuse the resources, and hence need to be protected against such attacks by Intrusion Detection Systems (IDS). Traditional prevention techniques such as user authentication, data encryption, avoidance of programming errors, and firewalls are only used as the first line of defense. If a password is weak and is compromised, user authentication cannot prevent unauthorized use. Similarly, firewalls are vulnerable to errors in configuration and sometimes have ambiguous or undefined security policies; they fail to protect against malicious mobile code, insider attacks and unsecured modems. Therefore, intrusion detection is required as an additional wall for protecting systems. Many techniques have previously been used for the effective detection of intrusions. One of the major issues, however, is the accuracy of these systems, i.e. an increase in the number of false negatives. Due to the increasi...
Information retrieval is concerned with finding documents relevant to a user's information needs from a collection of documents. The user describes an information need with a query consisting of a number of words. Finding the weight of a query term is important to determine the importance of the query. Calculating term importance is a fundamental aspect of most information retrieval approaches, and it is commonly determined through Term Frequency-Inverse Document Frequency (TF-IDF). This paper proposes a Concept-Based Term Weighting (CBW) technique to determine term importance by finding the weight of a query. The WordNet ontology is used to find the conceptual information of each word in the query.
The Web is a system of interlinked hypertext documents accessed via the Internet, and the Internet is a global system of interconnected computer networks that serves billions of users worldwide. The huge number of documents on the web is challenging for web search engines. The web contains multiple copies of the same content or the same web page; many pages on the Web are duplicates or near duplicates of other pages. Web search engines face substantial problems due to such duplicate and near-duplicate web pages: they enlarge the space required to store the index, increase the cost of serving results, and frustrate users. To assist search engines in providing search results free of redundancy and in presenting distinct useful results on the first page, duplicate and near-duplicate detection is required. The proposed approach detects near-duplicate web pages to increase the search effectiveness and storage efficiency of a search engine.
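The specific detection algorithm is not described above, so the sketch below uses a common stand-in: comparing pages by the Jaccard similarity of their word 3-shingles, with 0.9 as an assumed near-duplicate threshold.

```python
# Minimal sketch of near-duplicate detection via word shingles (illustrative stand-in).

def shingles(text: str, k: int = 3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

def near_duplicate(page_a: str, page_b: str, threshold: float = 0.9) -> bool:
    return jaccard(shingles(page_a), shingles(page_b)) >= threshold

print(near_duplicate("breaking news about the storm hitting the coast today",
                     "breaking news about the storm hitting the coast this morning"))
```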
Clustering of Web Pages Based on Visual Similarity (with Mansi M. Gurbani)
Finding the appropriate information on the web is a very tedious job, so there is a need to organize the data by classifying it into categories. This categorization of web pages can be achieved by clustering. Clustering is usually done by analyzing the content of the HTML page and extracting keywords; based on the keywords extracted, the page is evaluated and clustered. But the visual features of the web page, which are also semantic features, are not considered. The proposed method focuses on the visual content of the web page, i.e. which structures are used to represent the content of the web page. The proposed method cleans the web page, computes the DOM tree, compresses it and calculates similarity. Based on this similarity measure the clustering is performed.
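A minimal sketch of structural (rather than textual) page comparison is given below: the DOM is cleaned, flattened to a tag sequence, and two pages are scored by the ratio of matching tags. The compression step and the paper's actual similarity measure are not reproduced; this is an illustrative stand-in.

```python
# Minimal sketch: compare two pages by DOM structure, ignoring their text content.
from difflib import SequenceMatcher
from bs4 import BeautifulSoup

def tag_sequence(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for noise in soup(["script", "style"]):     # cleaning step: drop non-visual tags
        noise.decompose()
    return [tag.name for tag in soup.find_all(True)]

def structural_similarity(html_a: str, html_b: str) -> float:
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()

a = "<html><body><div><h1>A</h1><p>x</p></div></body></html>"
b = "<html><body><div><h1>B</h1><p>y</p></div></body></html>"
print(structural_similarity(a, b))   # close to 1.0: same layout, different text
```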
International Journal of Computer Applications, 2012
Computer networks are usually vulnerable to attacks by any unauthorized person trying to misuse the resources, and hence need to be protected against such attacks by Intrusion Detection Systems (IDS). Traditional prevention techniques such as user authentication, data encryption, avoidance of programming errors, and firewalls are only used as the first line of defense. If a password is weak and is compromised, user authentication cannot prevent unauthorized use. Similarly, firewalls are vulnerable to errors in configuration and sometimes have ambiguous or undefined security policies; they fail to protect against malicious mobile code, insider attacks and unsecured modems. Therefore, intrusion detection is required as an additional wall for protecting systems. Many techniques have previously been used for the effective detection of intrusions. One of the major issues, however, is the accuracy of these systems. To improve accuracy, data mining programs are used to analyze audit ...
Phishing Sites Detection Based on C4.5 Decision Tree Algorithm
The rapid increase in the usage of the Internet and web services has led to a drastic increase in the number of web attacks. Phishing is a web attack where phishers try to acquire a user's sensitive information for fraudulent purposes. Phishers target a user's sensitive information through a fake website that appears similar to a legitimate site in terms of interface and uniform resource locator (URL) address. Hence, there is an increase in victims falling prey to phishing sites. This paper proposes an efficient way to detect phishing websites using the C4.5 decision tree approach and features of the URL. The method proposed in this paper uses various URL features together with the C4.5 decision tree approach for better results.
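The sketch below illustrates URL-feature-based detection with a small, assumed subset of commonly used URL features; scikit-learn's CART decision tree stands in for C4.5, which it resembles but does not implement, and the training URLs are invented for illustration.

```python
# Minimal sketch: extract a few URL features and train a decision tree (CART as a
# stand-in for C4.5). Features and training data are illustrative assumptions.
import re
from urllib.parse import urlparse
from sklearn.tree import DecisionTreeClassifier

def url_features(url: str):
    host = urlparse(url).netloc
    return [
        len(url),                                   # long URLs are often obfuscated
        1 if "@" in url else 0,                     # '@' can hide the real destination
        1 if re.fullmatch(r"[\d.]+", host) else 0,  # raw IP address instead of domain
        host.count("."),                            # many subdomains
        1 if "-" in host else 0,                    # hyphenated look-alike domains
    ]

train_urls = ["http://192.168.10.5/login-update",
              "http://paypal.com.secure-verify.example.ru/a",
              "https://www.wikipedia.org/wiki/Main_Page",
              "https://github.com/features"]
train_labels = ["phishing", "phishing", "legitimate", "legitimate"]

clf = DecisionTreeClassifier(random_state=0).fit(
    [url_features(u) for u in train_urls], train_labels)
print(clf.predict([url_features("http://10.0.0.7/account-verify@login")]))
```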
Information on the web is growing exponentially. The unprecedented growth of available information, coupled with the vast number of available online activities, has introduced a new wrinkle to the problem of web search: it is difficult to retrieve relevant information. In this context search engines have become a valuable tool for users to retrieve relevant information, yet finding relevant information according to a user's need is still a challenge. Various retrieval models have been proposed and empirically validated to find relevant web pages related to users' queries. The vector space model is one of the most extensively used models for web information retrieval, but it ignores the importance of terms with respect to their position when calculating term weights. In this paper, a new approach based on the vector space model, referred to as the Layered Vector Space model, is proposed and validated. In the Layered Vector Space approach, the importance of terms with respect to their position ...
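Although the abstract is truncated before the weighting scheme, the sketch below illustrates position-aware term weighting in a layered spirit: term occurrences are counted per layer (title, headings, body) and combined with per-layer weights. The three layers and the weights 3.0 / 2.0 / 1.0 are assumptions, not the paper's values.

```python
# Minimal sketch of layer-weighted term counts (layers and weights are assumptions).
from collections import Counter

LAYER_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}

def layered_term_weights(layers: dict) -> dict:
    weights = Counter()
    for layer, text in layers.items():
        for term in text.lower().split():
            weights[term] += LAYER_WEIGHTS.get(layer, 1.0)
    return dict(weights)

page = {
    "title": "information retrieval models",
    "headings": "vector space model layered weighting",
    "body": "the vector space model represents documents as term vectors",
}
print(layered_term_weights(page)["vector"])   # boosted: appears in a heading and the body
```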
Web Scale Information Extraction Using Wrapper Induction Approach
International Journal of Electronics and Electrical Engineering
Information extraction from unstructured, ungrammatical data such as classified listings is difficult because traditional structural and grammatical extraction methods do not apply. The proposed architecture extracts unstructured and ungrammatical data using wrapper induction and shows the result in a structured format. The source data is collected from various post websites. The obtained post data pages are processed by page parsing, cleansing and data extraction to obtain new reference sets. Reference sets are used for mapping the user's search query, which improves the scale of search on unstructured and ungrammatical post data. We validate our approach with experimental results.
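A real wrapper-induction system learns its extraction rules from example pages; the sketch below only shows what applying one such hypothetical induced rule to ungrammatical post listings could look like, with the regular-expression rule and the sample listings invented for illustration.

```python
# Minimal sketch of applying a (hypothetical) induced wrapper rule to post listings.
import re

# hypothetical induced rule: "<make> <model> <year> - $<price>"
WRAPPER_RULE = re.compile(
    r"^(?P<make>\w+)\s+(?P<model>\w+)\s+(?P<year>\d{4})\s*-\s*\$(?P<price>[\d,]+)")

def apply_wrapper(post: str):
    match = WRAPPER_RULE.search(post)
    return match.groupdict() if match else None

posts = ["Honda Civic 2009 - $4,500 great condition call now",
         "Toyota Corolla 2012 - $7,800 single owner"]
print([apply_wrapper(p) for p in posts])
```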