Guest Editor's Introduction: Special Issue on Web Content Mining
2000, Journal of Intelligent Information Systems
https://doi.org/10.1023/B:JIIS.0000019288.63141.E4
Abstract
Research in Web mining is moving the World Wide Web toward a more useful environment in which users can quickly and easily find the information they need. Web mining refers to the discovery and analysis of data, documents, and multimedia from the World Wide Web. It includes hyperlink structure mining, statistical usage mining, and document content mining. Structure mining is concerned with discovering information by analyzing the in-links and out-links of Web pages. This kind of information can establish the authority of a Web page and help in page categorization. Usage mining applies data mining techniques to discover patterns in Web logs. This is useful in defining collaboration between users and refining users' personal preferences. Content mining extracts concepts from the content of Web pages. Information retrieval techniques are applied to unstructured (text), semi-structured (HTML, XML), and structured (database) Web pages to extract semantic meaning. This journal issue presents current research in Web content mining of unstructured and semi-structured Web pages.
Search engines have the responsibility for extracting semantic meaning from the content of Web pages. So much information is now available that a searcher must depend on search engines to identify possible information sources. Because Web content is as diverse as the authors creating Web pages, a search engine must understand the content of individual Web pages if a searcher is to find information effectively. This is not a trivial task. Authors of unstructured and semi-structured text may not be concerned with the automatic extraction of meaning. Text is typically written for a human audience, which is naturally capable of extracting meaning. Extracting semantic meaning requires an understanding of the elements of a Web page and of the relationships between those elements. The extracted meaning must then be placed in a structure that is easily searchable in response to a query.
Basically, a search engine consists of three parts: the user interface, the spider, and the index. The user interface is where the searcher enters keywords as a search query. These keywords represent the searcher's information need. Prior to the query, the spider has found pages on the Web. These pages are indexed by keywords, locations, and other descriptive information. The keywords selected represent the concepts expressed in the page. The search is simply a query of the index. The keywords in the query are matched to the keywords in the index, and the more keywords that match, the better the page is as an information source. The locations (and other indexed information) of the pages with the best matches are returned to the searcher.
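The keyword matching described above can be illustrated with a minimal Python sketch. It is not taken from the issue itself. The page locations, page texts, and the function names `build_index` and `search` are hypothetical, and it makes the simplifying assumption that every whitespace-separated word on a page counts as a keyword, whereas a real engine would select keywords more carefully.

```python
from collections import defaultdict

# Hypothetical pages a spider might have collected: location -> page text.
PAGES = {
    "http://example.org/a": "web content mining extracts concepts from web pages",
    "http://example.org/b": "usage mining discovers patterns in web logs",
    "http://example.org/c": "cooking recipes for busy people",
}


def build_index(pages):
    """Map each keyword to the set of page locations that contain it."""
    index = defaultdict(set)
    for location, text in pages.items():
        # Assumption: every lowercased word is treated as a keyword.
        for keyword in text.lower().split():
            index[keyword].add(location)
    return index


def search(index, query):
    """Rank locations by how many distinct query keywords they match."""
    scores = defaultdict(int)
    for keyword in set(query.lower().split()):
        for location in index.get(keyword, ()):
            scores[location] += 1
    # Best matches first; ties are broken alphabetically by location.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(PAGES)
for location, matches in search(index, "web content mining"):
    print(location, matches)
```

In this sketch a page that matches more of the query keywords ranks higher, and pages that match none are not returned at all. Counting matches is only the simplest reading of "the more keywords that match, the better"; it ignores keyword frequency, page authority, and the other descriptive information an index may hold.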