Semantic annotation, indexing, and retrieval

2004, Journal of Web Semantics

https://doi.org/10.1016/J.WEBSEM.2004.07.005

Abstract

The realization of the Semantic Web depends on the availability of a critical mass of metadata for web content, associated with the respective formal knowledge about the world. We claim that the Semantic Web, at its current stage of development, is in critical need of metadata generation and usage schemata that are specific, well-defined, and easy to understand. This paper introduces our vision of a holistic architecture for semantic annotation, indexing, and retrieval of documents with regard to extensive semantic repositories. A system implementing this concept, called KIM, is presented in brief and used for evaluation and demonstration. A particular schema for semantic annotation with respect to real-world entities is proposed. The underlying philosophy is that practical semantic annotation is impossible without some particular knowledge modelling commitments. Our understanding is that a system for such semantic annotation should be based upon a simple model of real-world entity classes, complemented with extensive instance knowledge. To ensure the efficiency, ease of sharing, and reusability of the metadata, we introduce an upper-level ontology (of about 250 classes and 100 properties), which starts with some basic philosophical distinctions and then goes down to the most common entity types (people, companies, cities, etc.). Thus it encodes many of the domain-independent commonsense concepts and allows straightforward domain-specific extensions. On the basis of the ontology, a large-scale knowledge base of entity descriptions is bootstrapped, and then further extended and maintained. Currently, the knowledge base usually contains between 10^5 and 10^6 descriptions. Finally, this paper presents a semantically enhanced information extraction system, which provides automatic semantic annotation with references to classes in the ontology and to instances. The system has been running over a continuously growing document collection (currently about 0.5 million news articles), so it has been under constant testing and evaluation for some time. On the basis of these semantic annotations, we perform semantic-based indexing and retrieval, where users can mix traditional IR (information retrieval) queries and ontology-based ones. We argue that such large-scale, fully automatic methods are essential for the transformation of the current, largely textual web into a semantic web.
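
As a rough illustration of the pipeline described in the abstract (annotation against an ontology and instance base, followed by retrieval that mixes keyword and ontology-based constraints), the following minimal Python sketch may help. It is not KIM's implementation or API: the toy class hierarchy, the `kb:` identifiers, the alias lists, and the functions `annotate`, `build_index`, and `search` are all hypothetical, and plain string lookup stands in for the information extraction system.

```python
import re
from dataclasses import dataclass

# Hypothetical toy ontology: class name -> parent class (None for the root).
ONTOLOGY = {
    "Entity": None,
    "Agent": "Entity",
    "Person": "Agent",
    "Organization": "Agent",
    "Company": "Organization",
    "Location": "Entity",
    "City": "Location",
}

# Hypothetical instance knowledge: entity id -> (class, known aliases).
KNOWLEDGE_BASE = {
    "kb:Sofia": ("City", ["Sofia"]),
    "kb:AcmeCorp": ("Company", ["Acme Corp"]),
}


@dataclass(frozen=True)
class Annotation:
    start: int          # character offset where the mention begins
    end: int            # character offset just past the mention
    entity_id: str      # reference to an instance in the knowledge base
    entity_class: str   # reference to a class in the ontology


def is_subclass(cls, ancestor):
    """Return True if cls equals ancestor or inherits from it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ONTOLOGY[cls]
    return False


def annotate(text):
    """Find alias mentions by exact string lookup (a stand-in for real IE)."""
    found = []
    for entity_id, (entity_class, aliases) in KNOWLEDGE_BASE.items():
        for alias in aliases:
            start = text.find(alias)
            while start != -1:
                found.append(
                    Annotation(start, start + len(alias), entity_id, entity_class)
                )
                start = text.find(alias, start + 1)
    return sorted(found, key=lambda a: a.start)


def build_index(docs):
    """Store lowercase word tokens and semantic annotations per document."""
    return {
        doc_id: {
            "tokens": set(re.findall(r"\w+", text.lower())),
            "annotations": annotate(text),
        }
        for doc_id, text in docs.items()
    }


def search(index, keyword=None, entity_class=None):
    """Mix a keyword constraint with an ontology-class constraint."""
    hits = []
    for doc_id, entry in index.items():
        if keyword is not None and keyword.lower() not in entry["tokens"]:
            continue
        if entity_class is not None and not any(
            is_subclass(a.entity_class, entity_class) for a in entry["annotations"]
        ):
            continue
        hits.append(doc_id)
    return sorted(hits)


DOCS = {
    "d1": "Acme Corp opened an office in Sofia.",
    "d2": "The office market is growing.",
}
INDEX = build_index(DOCS)
```

With this toy data, `search(INDEX, keyword="office")` matches both documents, while `search(INDEX, keyword="office", entity_class="Location")` keeps only `d1`, because only that document carries an annotation whose class (`City`) is a subclass of `Location`. A system at the scale the abstract mentions would need an inverted index and a real semantic repository rather than the linear scan used here.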
