Managing Text as Data
Related papers
dsi.unive.it
The markup approach to representing and storing large corpora of annotated textual documents is criticized for several reasons: it poses problems in expressing non-hierarchical structures; it limits annotations in type and complexity; and it makes complex textual analysis programs difficult to write, since it requires generic query languages such as XQuery, which are not well suited to the special needs of the domain. We present a model and a language, called Manuzio, developed to serve as the basis of a new generation of textual document management systems that overcome these shortcomings. The model is object-based, specialized for the specific domain, and has abstraction mechanisms that bear some similarity to those of object-oriented database models. The language has query facilities and allows the development of sophisticated textual analysis applications. A prototype system has been designed and applied to several test cases.
ACM Transactions on Database Systems - TODS
We propose to apply relational database (RDB) development methodologies to the design of lexical databases (LDB), which embody conceptual and linguistic knowledge. We represent the conceptual knowledge as an ontology, and the linguistic knowledge, which depends on each language, in lexicons. Our approach is based on a single language-independent ontology. We also study some conceptual and linguistic requirements, in particular meaning classifications in the ontology, focusing on taxonomies. We have followed a classical software development methodology for implementing lexical information systems in order to obtain robust, maintainable, and integrable RDBs for storing the conceptual and linguistic knowledge. Little attention has been paid to development methodologies for building the software systems that manage LDBs. We claim that software engineering methodology is necessary in order to develop, reuse and integrate the diverse available linguistic inf...
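The split this abstract describes, one language-independent ontology plus per-language lexicons held in a relational database, can be sketched roughly as follows. This is only an illustration: the table names (`concept`, `lexical_entry`), the columns, the sample rows, and the helper `lemmas_for` are all assumptions and do not come from the paper.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (              -- language-independent ontology
    concept_id INTEGER PRIMARY KEY,
    label      TEXT NOT NULL,
    parent_id  INTEGER REFERENCES concept(concept_id)  -- taxonomy link
);
CREATE TABLE lexical_entry (        -- language-dependent lexicons
    entry_id   INTEGER PRIMARY KEY,
    language   TEXT NOT NULL,
    lemma      TEXT NOT NULL,
    concept_id INTEGER NOT NULL REFERENCES concept(concept_id)
);
""")

# Made-up sample data: one small taxonomy, three lexicons.
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?)",
    [(1, "animal", None), (2, "dog", 1)],
)
conn.executemany(
    "INSERT INTO lexical_entry (language, lemma, concept_id) VALUES (?, ?, ?)",
    [("en", "dog", 2), ("es", "perro", 2), ("de", "Hund", 2)],
)


def lemmas_for(concept_label):
    """Return (language, lemma) pairs attached to one ontology concept."""
    return conn.execute(
        "SELECT e.language, e.lemma "
        "FROM lexical_entry AS e "
        "JOIN concept AS c ON c.concept_id = e.concept_id "
        "WHERE c.label = ? "
        "ORDER BY e.language",
        (concept_label,),
    ).fetchall()
```

In this sketch, adding a new language only adds rows to `lexical_entry`; the `concept` table is shared by all lexicons.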
2nd Conference "Hellenic Language and Terminology", Athens, 1999
In the work presented here, new methods for designing and implementing large lexical databases were examined. These lexical databases, or machine-readable dictionaries, are expected to be organized so as to provide fast access to the stored data and efficient memory management. Directed graphs can be used to describe and organize a very large lexical database in a compact manner. These data structures are called matrix lexica, in which the letters are represented as nodes of directed graphs and the lemmata as paths (sets of edges). It is claimed that matrix lexica can efficiently support automated language applications in lexicography, terminology, machine translation and other fields by providing high-speed resolution, a sound mathematical foundation, low memory requirements and, in future developments, the ability to handle distorted input. These methods were evaluated for Modern Greek as a target language.

0. INTRODUCTION

The overall target of the work presented here was the development of a general-purpose automated system for the computational treatment of Modern Greek morphology. Such a computerized system is composed of two major subsystems: (i) the subsystem which analyzes the words, called the "tagger", and (ii) a database containing information about the words, called the "lexical database" or "lexicon". A tagger is expected to provide one accurate analysis for every decomposed word, quickly and with low complexity (in order to improve maintainability). The model of functional decomposition [1] [4] was used to design and implement a tagger for Modern Greek and to evaluate its performance using a large-scale corpus (ECI-Greek Part) containing approximately 1,880,000 words. This tagger used a morpheme-based lexicon with 7800 entries. The morpheme-based lexicon is a lexical database that contains morphemes instead of words. The advantages of such a lexicon are: low computer
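The idea stated in the abstract above, letters as nodes of a directed graph and lemmata as paths through it, can be illustrated with a small sketch. Note the assumption: this is a plain trie (a tree-shaped directed graph), not the authors' matrix lexicon, whose exact layout is not given here. The class name `LetterGraph`, its methods, and the sample words are hypothetical.

```python
class LetterGraph:
    """Directed graph whose edges are letters and whose paths spell lemmata.

    Hypothetical illustration: a plain trie, not the paper's matrix lexicon.
    """

    def __init__(self):
        # Each node keeps its outgoing letter edges and an end-of-lemma flag.
        self.root = {"children": {}, "end": False}

    def add(self, lemma):
        """Insert a lemma by walking (and creating) one node per letter."""
        node = self.root
        for letter in lemma:
            node = node["children"].setdefault(
                letter, {"children": {}, "end": False}
            )
        node["end"] = True

    def contains(self, word):
        """Return True only if the path for `word` ends on a stored lemma."""
        node = self.root
        for letter in word:
            node = node["children"].get(letter)
            if node is None:
                return False
        return node["end"]

    def with_prefix(self, prefix):
        """Return all stored lemmata that start with `prefix`, sorted."""
        node = self.root
        for letter in prefix:
            node = node["children"].get(letter)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, spelled = stack.pop()
            if current["end"]:
                found.append(spelled)
            for letter, child in current["children"].items():
                stack.append((child, spelled + letter))
        return sorted(found)
```

Lookup cost in this sketch depends on the length of the word rather than on the number of stored lemmata, and shared prefixes are stored once, which is the kind of compactness the abstract alludes to.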
Journal of Computer and System Sciences, 1999
In order to enable the database programmer to reason about relations over strings of arbitrary length 12 (1965), 423-434. S. Ginsburg and X. Wang. Pattern matching by rs-operations: towards a unified approach to querying sequenced data.
… of Computer Science, …, 1995
Combined text and relational database support is increasingly recognized as an emerging need of industry, ranging from applications requiring text fields as parts of their data (e.g., for customer support) to those augmenting primary text resources with conventional relational data (e.g., ...
1998
In this paper, we describe some aspects of the TAMIC-P system for German, which interprets natural-language queries to databases in the social insurance domain. These natural-language queries are complex NPs consisting of clusters of NPs and PPs. The parser uses information about co-occurrence and domination of linguistic elements as well as concept hierarchies in
1997
A truly natural language interface to databases also needs to be practical for actual implementation. We developed a new, feasible approach to this problem and tested it successfully in a laboratory environment. The new result is based on metadata search, where the metadata grow in a largely linear manner and the search is linguistics-free (allowing for grammatically incorrect and incomplete input). A new class of reference dictionary integrates four types of enterprise metadata: enterprise information models, database values, user-words, and query cases. The layered and scalable information models allow user-words to stay in their original forms, as users articulated them, as opposed to relying on permutations of individual words contained in the original query. A graphical representation method turns the dictionary into searchable graphs representing all possible interpretations of the input. A branch-and-bound algorithm then identifies optimal interpretations, which lead to SQL implementations of the original queries. Query cases enhance both the metadata and the search of metadata, and also provide case-based reasoning to answer the queries directly. This design assures feasible solutions at the termination of the search, even when the search is incomplete (i.e., the results contain the correct answer to the original query). The necessary condition is that the text input contains at least one entry in the reference dictionary. The sufficient condition is that the text input contains a set of entries corresponding to a complete, correct single SQL query. Laboratory testing shows that the system obtained accurate results for most cases that satisfied only the necessary condition.
2000
Both Hausser [1] and Lee [2][3] proposed Database Semantics as a computational model for natural language semantics that makes use of a database management system (DBMS). As an extension of these efforts, this paper deals with ways of representing linguistic descriptions in table form, because all the data in a relational DBMS is conceived of as being stored in tables. It is claimed here that, if an algorithm can be developed for converting linguistic representations such as trees, logical formulas, and attribute-value matrices into table form, many available tools for natural language processing can be efficiently utilized as part of interface or application programs for a relational database management system (RDBMS).
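One common way to put a tree into table form is an adjacency list, with one row per node holding its own id, its parent's id, and its label. The sketch below illustrates that general technique only; it is not the conversion algorithm the abstract refers to, and the function name `tree_to_rows`, the tuple-based tree encoding, and the sample parse are assumptions made for the example.

```python
from itertools import count


def tree_to_rows(tree):
    """Flatten a nested (label, children) tree into table rows.

    Each row is (node_id, parent_id, label); the root has parent_id None.
    Ids are assigned in depth-first, left-to-right order starting at 1.
    """
    ids = count(1)
    rows = []

    def visit(node, parent_id):
        label, children = node
        node_id = next(ids)
        rows.append((node_id, parent_id, label))
        for child in children:
            visit(child, node_id)

    visit(tree, None)
    return rows


# Hypothetical parse tree for "John sleeps".
parse = ("S", [("NP", [("John", [])]), ("VP", [("sleeps", [])])])
rows = tree_to_rows(parse)
```

Rows of this shape can be loaded directly into a relational table, after which dominance can be queried by joining the table with itself on `parent_id`.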
