Managing Text as Data

Gordana Pavlović-Lažetić

Computational Linguistics

Marek Maurizio

dsi.unive.it

The markup approach to representing and storing large corpora of annotated textual documents has been criticized for several reasons: it has difficulty expressing non-hierarchical structures, it limits the type and complexity of annotations, and it makes complex textual analysis programs hard to write, since it requires generic query languages such as XQuery, which are not well suited to the special needs of the domain. We present a model and a language, called Manuzio, intended as the basis of a new generation of textual document management systems that overcome these shortcomings. The model is object-based, specialized for the domain, and has abstraction mechanisms similar to those of object-oriented database models. The language has query facilities and allows the development of sophisticated textual analysis applications. A prototype system has been designed and applied to several test cases.

Applying Relational Database Development Methodologies to the Design of Lexical Databases

Fernando Juan Saenz

ACM Transactions on Database Systems - TODS

We propose to apply relational database (RDB) development methodologies to the design of lexical databases (LDB), which embody conceptual and linguistic knowledge. We represent the conceptual knowledge as an ontology, and the linguistic knowledge, which depends on each language, in lexicons. Our approach is based on a single language-independent ontology. We also study some conceptual and linguistic requirements, in particular meaning classifications in the ontology, focusing on taxonomies. We have followed a classical software development methodology for implementing lexical information systems in order to obtain robust, maintainable, and integrable RDBs for storing the conceptual and linguistic knowledge. Little attention has been paid to development methodologies for building the software systems that manage LDBs. We claim that the software engineering methodology subject is necessary in order to develop, reuse and integrate the diverse available linguistic inf...

From Annotated Corpora to Databases: the SgmlQL Language

Monique Rolbert

MATRIX LEXICA: AN ALTERNATIVE DESCRIPTION OF LEXICAL DATABASES

Evangelos C Papakitsos

2nd Conference on Hellenic Language and Terminology (ΕΛΛΗΝΙΚΗ ΓΛΩΣΣΑ ΚΑΙ ΟΡΟΛΟΓΙΑ), Athens, 1999

In the work presented here, new methods for designing and implementing large lexical databases were examined. These lexical databases, or machine-readable dictionaries, are expected to be organized so as to provide fast access to the stored data and efficient memory management. Directed graphs can be used to describe and organize a large lexical database in a compact manner. These data structures are called matrix lexica, where the letters are described as nodes of directed graphs and the lemmata as paths (sets of edges). It is claimed that matrix lexica can efficiently support automated language applications in lexicography, terminology, machine translation and other fields, by providing high speed of resolution, a sound mathematical foundation, low memory requirements and the ability to handle distorted input in future developments. These methods were evaluated for Modern Greek as a target language.

0. INTRODUCTION

The overall target of the work presented here was the development of a general-purpose automated system for the computational treatment of Modern Greek morphology. Such a computerized system is composed of two major subsystems: (i) the subsystem that analyzes the words, called the "tagger", and (ii) a database containing information about the words, called the "lexical database" or "lexicon". A tagger is expected to provide one accurate analysis for every decomposed word, fast and with low complexity (in order to improve maintainability). The model of functional decomposition [1] [4] was used to design and implement a tagger for Modern Greek and to evaluate its performance using a large-scale corpus (ECI-Greek Part) containing approximately 1,880,000 words. This tagger used a morpheme-based lexicon with 7800 entries. The morpheme-based lexicon is a lexical database that contains morphemes instead of words. The advantages of such a lexicon are: low computer
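
The idea of storing letters as graph nodes and lemmata as paths can be illustrated with a small sketch. This is a plain prefix tree (a special case of a directed graph), not the authors' matrix-lexicon structure; the class name `LetterGraph`, the end marker `"$"`, and the sample lemmata are all assumptions made for illustration.

```python
from typing import Dict


class LetterGraph:
    """Toy lexicon: each letter is a node, each lemma is a path from the root."""

    END = "$"  # hypothetical end-of-lemma marker; lemmata must not contain it

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def add(self, lemma: str) -> None:
        """Insert a lemma, reusing nodes of any shared prefix."""
        node = self.root
        for letter in lemma:
            node = node.setdefault(letter, {})
        node[self.END] = {}

    def contains(self, lemma: str) -> bool:
        """Follow the path letter by letter; succeed only at an end marker."""
        node = self.root
        for letter in lemma:
            if letter not in node:
                return False
            node = node[letter]
        return self.END in node


# Illustrative data only (transliterated, not taken from the paper).
lexicon = LetterGraph()
for lemma in ("logos", "logia", "lexi"):
    lexicon.add(lemma)
```

Lookup cost grows with the length of the word rather than the size of the lexicon, and lemmata with a common prefix share nodes, which is one way such a structure can stay compact.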

Reasoning about Strings in Databases

Matti Nykänen

Journal of Computer and System Sciences, 1999

In order to enable the database programmer to reason about relations over strings of arbitrary length ...

On a vocabulary data base

Marc Eisinger

1980

Text/relational database management systems: Overview and proposed sql extensions

Ian Davis

… of Computer Science, …, 1995

Combined text and relational database support is increasingly recognized as an emerging need of industry, spanning applications requiring text fields as parts of their data (e.g., for customer support) to those augmenting primary text resources by conventional relational data (e.g., ...

The treatment of noun phrase queries in a natural language database access system

Harald Trost

1998

In this paper, we describe some aspects of the TAMIC-P system for German, which interprets natural-language queries to databases in the social insurance domain. These natural-language queries are complex NPs, consisting of clusters of NPs and PPs. The parser uses information about co-occurrence and domination of linguistic elements as well as concept hierarchies in

A Feasible Approach to Natural Language Database

Veera Boonjing

1997

A truly natural language interface to databases also needs to be practical for actual implementation. We developed a new feasible approach to solve the problem and tested it successfully in a laboratory environment. The new result is based on metadata search, where the metadata grow in a largely linear manner and the search is linguistics-free (allowing for grammatically incorrect and incomplete input). A new class of reference dictionary integrates four types of enterprise metadata: enterprise information models, database values, user-words, and query cases. The layered and scalable information models allow user-words to stay in their original forms as users articulated them, as opposed to relying on permutations of individual words contained in the original query. A graphical representation method turns the dictionary into searchable graphs representing all possible interpretations of the input. A branch-and-bound algorithm then identifies optimal interpretations, which lead to SQL implementation of the original queries. Query cases enhance both the metadata and the search of metadata, as well as providing case-based reasoning to directly answer the queries. This design assures feasible solutions at the termination of the search, even when the search is incomplete (i.e., the results contain the correct answer to the original query). The necessary condition is that the text input contains at least one entry in the reference dictionary. The sufficient condition is that the text input contains a set of entries corresponding to a complete, correct single SQL query. Laboratory testing shows that the system obtained accurate results for most cases that satisfied only the necessary condition.

Developing Database Semantics as a Computational Model

Kiyong Lee

2000

Both Hausser [1] and Lee [2][3] proposed Database Semantics as a computational model for natural language semantics that makes use of a database management system (DBMS). As an extension of these efforts, this paper deals with ways of representing linguistic descriptions in table form, because all data in the relational model of a DBMS is conceived of as being stored in tables. It is claimed here that, if an algorithm can be developed for converting linguistic representations such as trees, logical formulas, and attribute-value matrices into table form, many available natural language processing tools can be used efficiently as part of interface or application programs for a relational database management system (RDBMS).
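
As a rough illustration of converting a tree into table form, the sketch below flattens a small parse tree into rows of a single relational table and loads them into an in-memory SQLite database. It is not the algorithm from the paper; the `node` table, its columns, the nested-tuple tree encoding, and the example sentence are assumptions chosen for brevity.

```python
import sqlite3
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[int], str, int]  # (id, parent_id, label, is_leaf)


def tree_to_rows(tree, parent_id: Optional[int] = None,
                 rows: Optional[List[Row]] = None) -> List[Row]:
    """Flatten a nested-tuple tree (label, child, ...) into rows, pre-order."""
    if rows is None:
        rows = []
    node_id = len(rows) + 1  # ids follow pre-order position
    if isinstance(tree, str):  # a leaf is a bare word
        rows.append((node_id, parent_id, tree, 1))
        return rows
    label, *children = tree
    rows.append((node_id, parent_id, label, 0))
    for child in children:
        tree_to_rows(child, node_id, rows)
    return rows


def load(rows: List[Row]) -> sqlite3.Connection:
    """Store the rows in a hypothetical one-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "label TEXT, is_leaf INTEGER)"
    )
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)
    return conn


# Illustrative tree only: S -> NP VP
EXAMPLE = ("S", ("NP", "dogs"), ("VP", "bark"))
conn = load(tree_to_rows(EXAMPLE))
children_of_root = [
    r[0] for r in conn.execute(
        "SELECT label FROM node WHERE parent_id = 1 ORDER BY id"
    )
]
```

Once the tree is in this form, ordinary SQL can answer structural questions (here, the immediate constituents of the root), which is the kind of reuse of relational tooling the abstract argues for.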

Etienne Kerre

Classical database systems were introduced in the late 1950s and have proved their usefulness in various domains. However, their inability to deal with vague and imprecise information has led to new database designs. On the other hand, the use of linguistic terms has also shown its usefulness. The assignment of linguistic terms to phenomena in order to describe the characteristics or properties of objects is very natural. People make such assignments every day. A drawback of most new database designs is that the natural aspect of making assignments is often lost. In this paper we introduce a new database model based on quasi-order relations (reflexive and symmetric). The proposed model describes the mathematical background of the assignment of values to database attributes, using the theory of evaluation problems and sets. The constructed model offers an interesting new approach to the theory of database design in combination with linguistic terms.

Managing Textual Data Semantically In Relational Databases

SK Ahammad Fahad

2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), 2018

The massive volume of data in databases, web pages, and document files usually causes information to be disorganized and unclear for the user. Information in such an environment can be classified into three forms: structured, semi-structured, or unstructured. Structured information is the best form of information because it facilitates the acquisition and comprehension of knowledge. A Relational Database Management System (RDBMS) has a robust structure that manages, organizes and retrieves data. Nonetheless, RDBMSs contain massive amounts of unstructured data such as textual data. Many attempts have been made to deal with such data. These attempts can be categorized into three groups: within a database schema, by a developed data model within the database, or by query-based techniques in the database. This paper proposes the Textual Virtual Schema Model (TVSM). TVSM performs semantic textual data linking and clustering and is embedded in the relational database structure (schema). Its aims also include linking and converting unstructured information to structured data, improving the quality of textual data clusters, and achieving high query processing efficiency in retrieving data clusters. TVSM was initially developed to assist researchers, developers, and database administrators who are concerned on

Linguistic approach to database theory: DDL-s for hierarchical model

Kazimierz Subieta

Information Systems, 1978

In this paper a new approach to database systems based on the mathematical theory of linguistics is presented. It is assumed that any content of a database is a string of symbols. Then the expression of a data description language (DDL) can be considered a grammar which defines the set of strings; each of them may be the real content of a database. Two types of DDLs are defined: a regular language which corresponds to the class of regular expressions in the theory of finite automata, and a context-free language which corresponds to the class of context-free grammars. Some aspects of computer implementation of the above theoretical concepts are presented.
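
The view of a DDL expression as a grammar over database contents can be made concrete with a minimal sketch for the regular case. Python's `re` module stands in here for the regular expressions of automata theory, and the record encoding (a name field `N...;` followed by one or more phone fields `P...;`) is invented for illustration; it is not the DDL defined in the paper.

```python
import re

# Hypothetical regular "DDL": a database is one or more records, and each
# record is a name field followed by at least one phone field.
DDL = re.compile(r"(?:N[a-z]+;(?:P[0-9]+;)+)+")


def is_valid_content(content: str) -> bool:
    """True if the whole string belongs to the language the DDL defines."""
    return DDL.fullmatch(content) is not None


# Illustrative database contents (assumed encoding).
good = "Nann;P123;P456;Nbob;P9;"
bad = "Nann;"  # a record without a phone field is not in the language
```

Under this reading, checking that a database instance conforms to its description is a membership test: the content string is either generated by the grammar or it is not.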

Design and implementation of a lexical data base

Eric Wehrli

Proceedings of the second conference on European chapter of the Association for Computational Linguistics, 1985

This paper is concerned with the specifications and the implementation of a particular concept of word-based lexicon to be used for large natural language processing systems such as machine translation systems, and compares it with the morpheme-based conception of the lexicon traditionally assumed in computational linguistics. It will be argued that, although less concise, a relational word-based lexicon is superior to a morpheme-based lexicon from a theoretical, computational and also practical viewpoint.

Model of Lexicographical Database: Structure, Basic Functionality, Implementation

Olga Nevzorova

2012

In this paper we describe a model of a lexicographical database and the data model of a Russian-Tatar lexicographical database.

A relational database model and prototype for storing diverse discrete linguistic data

Alexander Magidow

2015

This article describes a model for storing multiple forms of linguistic data within a relational database as developed and tested through a prototype database for storing data from Arabic dialects.

Data models General Terms Languages

Carlo Combi

2015

Time characterizes every aspect of our life, and its management when storing and querying data is very important. In this paper we propose a new temporal query language, called T4SQL, supporting multiple temporal dimensions of data. Besides the well-known valid and transaction times, it encompasses two additional temporal dimensions, namely availability and event times. The availability time records when information is known and treated as true by the information system; the event times record the occurrence times of both the event that starts the valid time and the event that ends it. T4SQL is capable of dealing with different temporal semantics (atemporal aka non-sequenced, current, sequenced, next) with respect to every temporal dimension. Moreover, T4SQL provides a novel temporal grouping clause and an orthogonal management of temporal properties when defining the selection condition(s) and the schema for the output relation. ∗This work was partially supported by contributions ...

Processing of natural language queries to a relational database

Deepika Bhor

2003

A new method is developed to query a relational database in natural language (NL). Results: The method, based on a semantic approach, interprets grammatical and lexical units of a natural language into concepts of the subject domain, which are given in a conceptual scheme. The conceptual scheme is mapped formally onto the logical scheme. We applied the method to query the FlyEx database in natural language. FlyEx contains information on the expression of segmentation genes in Drosophila melanogaster. The method allows formulation of queries in various natural languages simultaneously, and is adaptive to changes in the knowledge domain and users' views. It provides optimal transformation of queries from natural language to SQL, as well as visualization of information as a hyperscheme. The method requires neither specification of all possible language constructions nor strict grammatical accuracy in the formulation of NL queries.

Bayan: A Text Database Management System which Supports a Full Representation of the Arabic Language

Aqil Azmi

IEEE Data(base) Engineering Bulletin, 1989

Data Engineering Bulletin is a quarterly publication of the IEEE Computer Society Technical Committee on Data Engineering. Its scope of interest includes: data structures and models, access strategies, access control techniques, database architecture, database machines, intelligent front ends, mass storage for very large databases, distributed database systems and techniques, database software design and implementation, database utilities, database security and related areas. Contribution to the Bulletin is hereby solicited. News items, letters, technical papers, book reviews, meeting previews, summaries, case studies, etc., should be sent to the Editor. All letters to the Editor will be considered for publication unless accompanied by a request to the contrary.

Technical Letter from the Editor

Several months ago, Won Kim asked me to put together the December 1989 issue. Initially, we agreed on graphical interfaces to databases as the topic. As I began looking into the literature and trying to figure what papers to invite, I noticed that there was very little out there that had to do with non-English interfaces to databases. So I set out to find a broad range of papers, all dealing with interface problems that American and European researchers would find unusual. The individuals who contributed to this issue were asked to put in a lot of effort in a very short amount of time. A few of them had difficulty writing in English. There were also communication problems for the authors located outside the USA. All of the authors deserve a very warm thanks.

There are two papers dealing with Arabic. The first paper, by Elmaghraby, El-Shihaby, and El-Kassas, gives an overview of why Arabic presents unusual problems for the developer of interfaces, and why English-based software and hardware may not be easily adapted to handle Arabic. The second paper represents the Ph.D. thesis work of Ali Morfeq and describes an effort to develop text database management techniques specifically suited for Arabic. The third paper also deals with textual data. It is by Yaacov Choueka and describes a very aggressive effort at developing a database of Hebrew texts. This effort has produced research contributions that are of use to developers of text retrieval systems in general, not just to developers of Hebrew-based systems. The fourth paper is by Fang Sheng Liu and Ju Wei Tai, and concerns the development of graphical interfaces. It suggests that a useful way of constructing and categorizing graphical tokens can be based on Chinese character theory. One interesting aspect of this paper is that it is not dedicated to Chinese-based interfaces. The fifth paper (by Yoshioki Ishii and Hideki Nishimoto) describes the support of Japanese and Korean database interfaces. It gives an overview of why the Japanese and Korean languages present novel problems, and then describes the approach they have taken in developing their system. I hope that the readers of this issue find these papers to be unusual and interesting.

DATABASE LEXICOGRAPHY

Gary Coen

Data and Knowledge Engineering (42:3), 2002

This paper introduces database lexicography, a metadata analysis discipline that applies lexical graph theory to data design.¹ Database lexicography proposes a formal design criterion for data dependencies, and it provides metrics to evaluate the conformance of designs to this criterion. It treats the data dictionary as a first class object encoding design concepts, and its benefits include identification of database dependency architecture; quantification of interdependent data elements' sensitivity to change; categorization of core and peripheral data elements; model integration; and figures of merit by which to fortify data architectures to withstand design fossilization and guide their evolution amidst changing requirements.

1 PROBLEM

The data assets of large enterprises tend to be compartmentalized. Line-of-business software systems typically incorporate legacy data designed for single application contexts. Routinely, only data and process owners comprehend the information content of these assets. Relentless disciplinary specialization of knowledge compounds the problem, resulting in minimal re-use of data across processes and organizations. Under these circumstances, data systems proliferate at the expense of information integration. Successful enterprises typically prevail by exploiting insights into the information content of data. For example, discovery technologies like multidisciplinary optimization and data mining can spot unexpected but useful correlations in heterogeneous data. Integration architectures like CORBA enable data systems to exchange information without requiring client knowledge of information resources. When applied in concert to enterprise data problems, frameworks such as these can support unanticipated information requirements, and from time to time this constitutes a competitive advantage. Nevertheless, this is not necessarily good news. Remedial services provided by such frameworks merely patch the deficiencies of stovepipe data assets that cannot keep up with evolving requirements. Since the data systems involved are often core assets of the enterprise, their inability to keep pace with change does not bode well for the future. The underlying problem is often data design. Typically, the design of legacy data systems is more suitable to initial information requirements than, say, requirements at the mid-point of their economic lifetimes. This deficiency usually reflects the structured

¹ This paper extends logical foundations presented in [1] to the domain of data design and maintenance.
