Towards a Generic Format for Linguistic Data

2005

Abstract

In the first part of this technical report we describe our approach to designing a new data format based on XML (Extensible Markup Language). The format is intended as a better, unifying alternative to the legacy data formats used in various areas of corpus linguistics, specifically in the field of structured annotation. We introduce the first version of the format, called the Prague Markup Language (PML). This version has already been adopted as the main data format for the upcoming Prague Dependency Treebank 2.0 (PDT 2.0). Finally, we outline our ideas and proposals for further improving PML, based on our experience with using and processing PML data in the PDT 2.0 project.

Contents

  1. The Prague Markup Language
  2. Introduction
  3. PML data types
  4. Atomic data formats
  5. PML roles
  6. Header of a PML instance
  7. PML Schema File
  8. References in PML
  9. Layers of annotation
  10. Tools