A Formal Framework for Linguistic Annotation
Abstract
'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions (audio, video and/or physiological recordings) or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, 'named entity' identification, co-reference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focussed on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.
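As an illustration only, since the abstract does not give a concrete representation, an annotation graph can be sketched as a set of labelled arcs between nodes that may or may not carry time offsets. The class names, fields, query methods and sample data below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """A point in the signal; the time offset may be unknown (None)."""
    ident: str
    time: Optional[float] = None


@dataclass(frozen=True)
class Arc:
    """A labelled span between two nodes, e.g. kind 'W' (word), label 'hello'."""
    start: Node
    end: Node
    kind: str
    label: str


@dataclass
class AnnotationGraph:
    """Hypothetical container: a plain list of arcs plus two simple queries."""
    arcs: List[Arc] = field(default_factory=list)

    def add(self, start: Node, end: Node, kind: str, label: str) -> Arc:
        arc = Arc(start, end, kind, label)
        self.arcs.append(arc)
        return arc

    def by_kind(self, kind: str) -> List[Arc]:
        """All arcs of one annotation kind, in insertion order."""
        return [a for a in self.arcs if a.kind == kind]

    def overlapping(self, t0: float, t1: float) -> List[Arc]:
        """Arcs with both endpoints timed whose span intersects (t0, t1)."""
        return [
            a for a in self.arcs
            if a.start.time is not None and a.end.time is not None
            and a.start.time < t1 and a.end.time > t0
        ]


# Illustrative data (invented): two words under one speaker turn.
n1 = Node("n1", 0.0)
n2 = Node("n2")          # word boundary with no known time offset
n3 = Node("n3", 0.9)

graph = AnnotationGraph()
graph.add(n1, n2, "W", "hello")
graph.add(n2, n3, "W", "world")
graph.add(n1, n3, "speaker", "A")

word_labels = [a.label for a in graph.by_kind("W")]
timed_labels = [a.label for a in graph.overlapping(0.5, 1.0)]
```

In this sketch, different annotation layers (words, speaker turns) share the same nodes, and the untimed node `n2` shows that a boundary can be ordered relative to its neighbours without being anchored to the signal.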