Language resources are valuable assets, both for institutions and researchers. To safeguard these resources, requirements for repository systems and data management have been specified by various branch organizations, e.g., CLARIN and the Data Seal of Approval. This paper describes these requirements and some additional ones posed by the authors’ home institutions, and shows how they are met by FLAT, to provide a new home for language resources. The basis of FLAT is formed by the Fedora Commons repository system. This repository system can meet many of the requirements out of the box, but additional configuration and some development work are still needed to meet the remaining ones, e.g., to add support for Handles and Component Metadata. This paper describes design decisions taken in the construction of FLAT’s system architecture via a mix-and-match strategy, with a preference for the reuse of existing solutions. FLAT is developed and used by the a Institute and The Language Archive, but is al...
The ISOcat Data Category Registry (www.isocat.org) has been developed by ISO TC 37 and CLARIN to share and make explicit the semantics of data categories used within the linguistic community. Semantics in this large and diverse community are constantly evolving and sometimes conflicting. The ISOcat open registry allows community members to collaborate in defining the semantics of linguistic data categories. The aim is to create a core of well-specified and widely accepted linguistic data categories, possibly officially standardized. This demonstration will show ISOcat’s features to support direct and indirect collaboration, its efforts to create a set of core data categories for various communities, and possible solutions for current bottlenecks.
This paper describes the development of a CLARIN-compatible repository solution that fulfils both the long-term preservation requirements and the current-day discoverability and usability needs of an online data repository of language resources. The widely used Fedora Commons open source repository framework, combined with the Islandora discovery layer, forms the basis of the solution. On top of this existing solution, additional modules and tools are developed to make it suitable for the types of data and metadata that are used by the participating partners.
The ISOcat Data Category Registry (www.isocat.org) of the Technical Committee ISO/TC 37 (Terminology and other language and content resources) describes field names and values for language resources. Recommended field names and reliable definitions are intended to help language data be reused independently of applications, platforms, and communities of practice (CoP). Data Category Selections can be viewed, printed, exported and, after free registration, also newly created.
The Lexical Markup Framework (ISO 24613:2008) provides a core class diagram and various extensions as the basis for constructing lexical resources. Unfortunately, the informative Document Type Definition provided by the standard and other available LMF serializations lack support for many of the powerful features of the model. This paper describes RELISH LMF, which unlocks the full power of the LMF model by providing a set of extensible modern schema modules. As use cases RELISH LL LMF and support by LEXUS, an online lexicon tool, are described.
When managing data sets in research data workflows, almost all research disciplines are faced with the challenge of how to deal with versioning or, more broadly, tracking provenance. At this stall we propose an extension to the CMD Infrastructure to specify (provenance) relationships among language resources. Although we are particularly interested in use cases for describing relations between corpora (update, enrichment, etc.), we would also like to discuss provenance tracking and provenance use cases in general. Contributions to our work are very welcome.
In the CLARIN infrastructure various national projects have started initiatives to allow users of the infrastructure to create chains or workflows of web services. The Component Metadata (CMD) core model for web services described in this paper tries to align the metadata descriptions of these various initiatives. This should allow chaining/workflow engines to find and invoke matching services. The paper describes the landscape of web services architectures and the state of the national initiatives. Based on this, a CMD core model for CLARIN is proposed, which, within some limits, can be adapted to the specific needs of an initiative by the standard facilities of CMD. The paper closes with the current state and usage of the model and a look into the future.
The ISOcat Data Category Registry contains essentially a flat and easily extensible list of data category specifications. To foster reuse and standardization, only very shallow relationships among data categories are stored in the registry. However, to assist crosswalks, possibly based on personal views, between various (application) domains and to overcome possible proliferation of data categories, more types of ontological relationships need to be specified. RELcat is a first prototype of a Relation Registry, which allows storing arbitrary relationships. These relationships can reflect the personal view of one linguist or a larger community. The basis of the registry is a relation type taxonomy that can easily be extended. On the one hand, this allows existing sets of relations specified in, for example, an OWL (2) ontology or SKOS taxonomy to be loaded. On the other hand, it allows algorithms that query the registry to traverse the stored semantic network to remain ignorant of the original so...
Proceedings 17th International Conference on Data Engineering
Due to the ubiquity and popularity of XML, users are often in the following situation: they want to query XML documents which contain potentially interesting information but they are unaware of the markup structure that is used. For example, it is easy to guess the contents of an XML bibliography file whereas the markup depends on the methodological, cultural and personal background of the author(s). Nonetheless, it is this hierarchical structure that forms the basis of XML query languages. In this paper we exploit the tree structure of XML documents to equip users with a powerful tool, the meet operator, that lets them query databases with whose content they are familiar, but without requiring knowledge of tags and hierarchies. Our approach is based on computing the lowest common ancestor of nodes in the XML syntax tree: e.g., given two strings, we are looking for nodes whose offspring contain these two strings. The novelty of this approach is that the result type is unknown at query formulation time and dependent on the database instance. If the two strings are an author's name and a year, mainly publications of the author in this year are returned. If the two strings are numbers, the result mostly consists of publications that have the numbers as year or page numbers. Because the result type of a query is not specified by the user, we refer to the lowest common ancestor as the nearest concept. We also present a running example taken from the bibliography domain, and demonstrate that the operator can be implemented efficiently.
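The lowest-common-ancestor idea behind the meet operator can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes no claim about their efficiency results: the function names `ancestors` and `meet`, the substring matching on element text, and the `SAMPLE` bibliography are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography document, invented for this illustration.
SAMPLE = """
<bib>
  <article><author>Smith</author><year>1999</year><title>Trees</title></article>
  <article><author>Jones</author><year>2001</year><title>Graphs</title></article>
</bib>
"""

def ancestors(node, parent):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def meet(root, term_a, term_b):
    """Return the lowest common ancestors of elements whose text contains
    term_a and elements whose text contains term_b (naive pairwise version)."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}

    def hits(term):
        return [e for e in root.iter() if e.text and term in e.text]

    results = []
    for a in hits(term_a):
        on_path = set(ancestors(a, parent))
        for b in hits(term_b):
            # Walking up from b, the first node also above a is the LCA.
            for node in ancestors(b, parent):
                if node in on_path:
                    if node not in results:
                        results.append(node)
                    break
    return results

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for concept in meet(root, "Smith", "1999"):
        print(concept.tag, concept.findtext("title"))
```

The caller supplies only the two strings; whether the result is an `article` element or the whole `bib` element depends on where the strings occur in the document, which matches the abstract's point that the result type is not known when the query is formulated.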
2002 IEEE Workshop on Multimedia Signal Processing.
Most of the multimedia objects distributed over the World Wide Web are too unstructured or too poorly meta-indexed to be of any use in retrieval tasks formulated by users in natural language queries. In general these dynamic multimedia objects are manually annotated in terms of textual documents. The high costs involved in manually indexing multimedia objects, which grow in volume and are becoming ever more diverse in type, call for automatic sustainable categorization schemata that are accessible and operational on the Web. These categorization schemata comprise indexing, querying and retrieval schemata. We propose a web-enabled advanced multimedia system as a solution to this categorization problem. We lay bare the physical, mathematical and logical framework underlying our system. We demonstrate that this system pays off especially in semantically user-defined summarisation tasks concerning multimedia presentations.
Papers by M. Windhouwer