Tobias Kuhn

Swiss Federal Institute of Technology (ETH), Humanities, Social and Political Sciences, Post-Doc

Papers by Tobias Kuhn

Inheritance Patterns in Citation Networks Reveal Scientific Memes

Memes are the cultural equivalent of genes that spread across human culture by means of imitation. What makes a meme and what distinguishes it from other forms of information, however, is still poorly understood. Our analysis of memes in the scientific literature reveals that they are governed by a surprisingly simple relationship between frequency of occurrence and the degree to which they propagate along the citation graph. We propose a simple formalization of this pattern and validate it with data from close to 50 million publication records from the Web of Science, PubMed Central, and the American Physical Society. Evaluations relying on human annotators, citation network randomizations, and comparisons with several alternative approaches confirm that our formula is accurate and effective, without a dependence on linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.
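
The abstract does not state the formula itself, so the following is only a rough illustration of the idea of combining frequency with propagation along citations. It scores a term on a toy citation graph by multiplying its frequency with a ratio: how often the term appears in papers that cite a paper carrying it, relative to how often it appears in papers that do not. The function name `propagation_score`, the smoothing constant `delta`, and the toy data are all assumptions made for this sketch, not the paper's definition.

```python
def propagation_score(papers, cites, term, delta=1.0):
    """Hypothetical score: frequency * (sticking rate / sparking rate).

    papers: dict mapping paper id -> set of terms it contains
    cites:  dict mapping paper id -> set of paper ids it cites
    delta:  smoothing constant (an assumption) to avoid division by zero
    """
    carries = {p for p, terms in papers.items() if term in terms}
    # Papers that cite at least one paper carrying the term.
    exposed = {p for p in papers if cites.get(p, set()) & carries}
    unexposed = set(papers) - exposed
    # Share of exposed papers that carry the term themselves.
    stick = len(exposed & carries) / (len(exposed) + delta)
    # Share of unexposed papers that carry the term anyway.
    spark = (len(unexposed & carries) + delta) / (len(unexposed) + delta)
    frequency = len(carries) / len(papers)
    return frequency * stick / spark


# Toy data: "graphene" travels along citations, "method" does not.
papers = {
    "p1": {"graphene"},
    "p2": {"graphene"},
    "p3": {"graphene", "method"},
    "p4": {"method"},
    "p5": {"method"},
    "p6": set(),
}
cites = {"p2": {"p1"}, "p3": {"p2"}, "p4": {"p1"}, "p5": {"p6"}, "p6": set()}

graphene_score = propagation_score(papers, cites, "graphene")
method_score = propagation_score(papers, cites, "method")
```

In this toy graph both terms occur equally often, but only "graphene" is inherited from cited papers, so it receives the higher score.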

Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data

Making available and archiving scientific results is for the most part still considered the task of classical publishing companies, despite the fact that classical forms of publishing centered around printed narrative articles no longer seem well-suited in the digital age. In particular, there exist currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. Here we propose to design scientific data publishing as a Web-based bottom-up process, without top-down control of central authorities such as publishing companies. We present a protocol and a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data with formal semantics. We show how this approach allows researchers to produce, publish, retrieve, address, verify, and recombine datasets and their individual nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used for the Semantic Web in general. Our evaluation of the current small network shows that this system is efficient and reliable, and we discuss how it could grow to handle the large amounts of structured data that modern science is producing and consuming.

Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data

To make digital resources on the web verifiable, immutable, and permanent, we propose a technique to include cryptographic hash values in URIs. We call them trusty URIs and we show how they can be used for approaches like nanopublications to make not only specific resources but their entire reference trees verifiable. Digital artifacts can be identified not only on the byte level but on more abstract levels such as RDF graphs, which means that resources keep their hash values even when presented in a different format. Our approach sticks to the core principles of the web, namely openness and decentralized architecture, is fully compatible with existing standards and protocols, and can therefore be used right away. Evaluation of our reference implementations shows that these desired properties are indeed accomplished by our approach, and that it remains practical even for very large files.
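
The general idea of embedding a hash in a URI can be sketched in a few lines. The sketch below is a simplified byte-level illustration, not the trusty URI specification: the choice of SHA-256, the base64url encoding, the `.` separator, and the function names `make_hash_uri` and `verify_hash_uri` are assumptions for this example, and the RDF-graph-level hashing mentioned in the abstract is not covered.

```python
import base64
import hashlib


def make_hash_uri(base_uri: str, content: bytes) -> str:
    """Append a URL-safe hash of the content to the URI (illustrative scheme)."""
    digest = hashlib.sha256(content).digest()
    # base64url without padding keeps the suffix short and URI-safe.
    code = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return f"{base_uri}.{code}"


def verify_hash_uri(uri: str, content: bytes) -> bool:
    """Check that the content matches the hash embedded in the URI."""
    base_uri, _, _code = uri.rpartition(".")
    # Recompute the full URI from the content and compare.
    return bool(base_uri) and make_hash_uri(base_uri, content) == uri


uri = make_hash_uri("http://example.org/data", b"hello")
ok = verify_hash_uri(uri, b"hello")         # unchanged content
tampered = verify_hash_uri(uri, b"hello!")  # modified content
```

Because the hash is part of the identifier, anyone holding the URI can check a retrieved copy without trusting the server that delivered it. If the content itself contains such URIs, verifying it also pins down the resources it references.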

A Survey and Classification of Controlled Natural Languages

What is here called controlled natural language (CNL) has traditionally been given many different names. Especially during the last four decades, a wide variety of such languages have been designed. They are applied to improve communication among humans, to improve translation, or to provide natural and intuitive representations for formal notations. Despite the apparent differences, it seems sensible to put all these languages under the same umbrella. To bring order to the variety of languages, a general classification scheme is presented here. A comprehensive survey of existing English-based CNLs is given, listing and describing 100 languages from 1930 until today. Classification of these languages reveals that they form a single scattered cloud filling the conceptual space between natural languages such as English on the one end and formal languages such as propositional logic on the other. The goal of this article is to provide a common terminology and a common model for CNL, to contribute to the understanding of their general nature, to provide a starting point for researchers interested in the area, and to help developers to make design decisions.

Broadening the Scope of Nanopublications

In this paper, we present an approach for extending the existing concept of nanopublications (tiny entities of scientific results in RDF representation) to broaden their application range. The proposed extension uses English sentences to represent informal and underspecified scientific claims. These sentences follow a syntactic and semantic scheme that we call AIDA (Atomic, Independent, Declarative, Absolute), which provides a uniform and succinct representation of scientific assertions. Such AIDA nanopublications are compatible with the existing nanopublication concept and enjoy most of its advantages such as information sharing, interlinking of scientific findings, and detailed attribution, while being more flexible and applicable to a much wider range of scientific results. We show that users are able to create AIDA sentences for given scientific results quickly and at high quality, and that it is feasible to automatically extract and interlink AIDA nanopublications from existing unstructured data sources. To demonstrate our approach, a web-based interface is introduced, which also exemplifies the use of nanopublications for non-scientific content, including meta-nanopublications that describe other nanopublications.

Collaborative multilingual knowledge management based on controlled natural language

User interfaces are a critical aspect of semantic knowledge representation systems, as users have to understand and use a formal representation language to model a particular domain of interest, which is known to be a difficult task. Things are even more challenging in a multilingual setting, where users speaking different languages have to create a multilingual ontology.

A Multilingual Semantic Wiki Based on Attempto Controlled English and Grammatical Framework

We describe a semantic wiki system with an underlying controlled natural language grammar implemented in Grammatical Framework (GF). The grammar restricts the wiki content to a well-defined subset of Attempto Controlled English (ACE), and facilitates a precise bidirectional automatic translation between ACE and language fragments of a number of other natural languages, making the wiki content accessible multilingually. Additionally, our approach allows for automatic translation into the Web Ontology Language (OWL), which enables automatic reasoning over the wiki content. The developed wiki environment thus allows users to build, query and view OWL knowledge bases via a user-friendly multilingual natural language interface. As a further feature, the underlying multilingual grammar is integrated into the wiki and can be collaboratively edited to extend the vocabulary of the wiki or even customize its sentence structures. This work demonstrates the combination of the existing technologies of Attempto Controlled English and Grammatical Framework, and is implemented as an extension of the existing semantic wiki engine AceWiki.

Attempto Controlled English for Knowledge Representation

Attempto Controlled English (ACE) is a controlled natural language, i.e. a precisely defined subset of English that can automatically and unambiguously be translated into first-order logic. ACE may seem to be completely natural, but is actually a formal language; concretely, it is a first-order logic language with an English syntax. Thus ACE is both human and machine understandable. ACE was originally intended to specify software, but has since been used as a general knowledge representation language in several application domains, most recently for the semantic web. ACE is supported by a number of tools, predominantly by the Attempto Parsing Engine (APE) that translates ACE texts into Discourse Representation Structures (DRS), a variant of first-order logic. Other tools include the Attempto Reasoner RACE, the AceRules system, the ACE View plug-in for the Protégé ontology editor, AceWiki, and the OWL verbaliser.

Mining Images in Biomedical Publications: Detection and Analysis of Gel Diagrams

Authors of biomedical publications use gel images to report experimental results such as protein-protein interactions or protein expressions under different conditions. Gel images offer a concise way to communicate such findings, not all of which need to be explicitly discussed in the article text. This fact together with the abundance of gel images and their shared common patterns makes them prime candidates for automated image mining and parsing. We introduce an approach for the detection of gel images, and present a workflow to analyze them. We are able to detect gel segments and panels at high accuracy, and present preliminary results for the identification of gene names in these images. While we cannot provide a complete solution at this point, we present evidence that this kind of image mining is feasible.

A Principled Approach to Grammars for Controlled Natural Languages and Predictive Editors

Controlled natural languages (CNL) with a direct mapping to formal logic have been proposed to improve the usability of knowledge representation systems, query interfaces, and formal specifications. Predictive editors are a popular approach to solve the problem that CNLs are easy to read but hard to write. Such predictive editors need to be able to "look ahead" in order to show all possible continuations of a given unfinished sentence. Such lookahead features, however, are difficult to implement in a satisfying way with existing grammar frameworks, especially if the CNL supports complex nonlocal structures such as anaphoric references. Here, methods and algorithms are presented for a new grammar notation called Codeco, which is specifically designed for controlled natural languages and predictive editors. A parsing approach for Codeco based on an extended chart parsing algorithm is presented. A large subset of Attempto Controlled English (ACE) has been represented in Codeco. Evaluation of this grammar and the parser implementation shows that the approach is practical, adequate and efficient.
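
To make the lookahead requirement concrete, here is a toy sketch of what a predictive editor has to compute: the set of tokens that may follow an unfinished sentence. It uses a tiny made-up context-free grammar and a brute-force top-down expansion. It is not Codeco and not the extended chart parsing algorithm from the paper, and it ignores anaphoric references entirely. The grammar, vocabulary, and function name `next_tokens` are assumptions for illustration.

```python
# Toy grammar (hypothetical): nonterminals map to lists of productions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["every", "N"], ["a", "N"]],
    "N": [["country"], ["city"]],
    "VP": [["borders", "NP"], ["exists"]],
}


def next_tokens(prefix, symbols=("S",)):
    """Return the set of tokens that can directly follow `prefix`.

    `symbols` is the sequence of grammar symbols still to be matched.
    The grammar must not be left-recursive, or this would not terminate.
    """
    if not symbols:
        return set()  # sentence already complete, nothing can follow
    head, rest = symbols[0], tuple(symbols[1:])
    if head in GRAMMAR:
        # Nonterminal: try every production in its place.
        result = set()
        for production in GRAMMAR[head]:
            result |= next_tokens(prefix, tuple(production) + rest)
        return result
    if not prefix:
        return {head}  # terminal at the cursor position: a valid continuation
    if prefix[0] == head:
        return next_tokens(prefix[1:], rest)  # terminal matches: consume it
    return set()  # mismatch: this derivation is a dead end


start = next_tokens([])
after_noun = next_tokens(["every", "country"])
after_verb = next_tokens(["every", "country", "borders"])
```

An editor built on such a function would offer `every` and `a` at the start, then `borders` and `exists` after `every country`. Real CNL grammars need a far more efficient parser and a way to track which anaphoric references are available at each position, which is the difficulty the abstract describes.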

The Understandability of OWL Statements in Controlled English

Different kinds of controlled natural language (CNL) have been proposed as a front-end for Semantic Web systems, in order to make them more accessible to users with no background in formal notations and methods. This paper investigates whether OWL statements in CNL are indeed easier to understand than in other notations. To this end, an experiment with 64 participants was conducted that compares a controlled natural language to a classical OWL notation. Concretely, Attempto Controlled English was compared to a simplified version of the Manchester OWL Syntax. For a reliable and tool-independent evaluation of understandability, the experiment is based on a novel evaluation framework making use of simple and intuitive diagrams. The results show that CNL is easier to understand, needs less learning time, and is more accepted by its users.

On Controlled Natural Languages: Properties and Prospects

This collaborative report highlights the properties and prospects of Controlled Natural Languages (CNLs). The report poses a range of questions concerning the goals of the CNL, the design, the linguistic aspects, the relationships and evaluation of CNLs, and the application tools. In posing the questions, the report attempts to structure the field of CNLs and to encourage further systematic discussion by researchers and developers.

Controlled English for Knowledge Representation

Knowledge representation is a long-standing research area of computer science that aims at representing human knowledge in a form that computers can interpret. Most knowledge representation approaches, however, have suffered from poor user interfaces. It turns out to be difficult for users to learn and use the logic-based languages in which the knowledge has to be encoded. A new approach to design more intuitive but still reliable user interfaces for knowledge representation systems is the use of controlled natural language (CNL). CNLs are subsets of natural languages that are restricted in a way that allows their automatic translation into formal logic. A number of CNLs have been developed but the resulting tools are mostly just prototypes so far. Furthermore, nobody has yet been able to provide strong evidence that CNLs are indeed easier to understand than other logic-based languages.

Evaluating the fully automatic multi-language translation of the Swiss avalanche bulletin

The Swiss avalanche bulletin is produced twice a day in four languages. Due to the lack of time available for manual translation, a fully automated translation system is employed, based on a catalogue of predefined phrases and predetermined rules of how these phrases can be combined to produce sentences. The system is able to automatically translate such sentences from German into the target languages French, Italian and English without subsequent proofreading or correction. Our catalogue of phrases is limited to a small sublanguage. The reduction of daily translation costs is expected to offset the initial development costs within a few years. After being operational for two winter seasons, we assess here the quality of the produced texts based on an evaluation where participants rate real danger descriptions from both origins, the catalogue of phrases versus the manually written and translated texts. With a mean recognition rate of 55%, users can hardly distinguish between the two types of texts, and give similar ratings with respect to their language quality. Overall, the output from the catalogue system can be considered virtually equivalent to a text written by avalanche forecasters and then manually translated by professional translators. Furthermore, forecasters declared that all relevant situations were captured by the system with sufficient accuracy and within the limited time available.

Coral: Corpus Access in Controlled Language

In this paper, we present Coral, an interface in which complex corpus queries can be expressed in a controlled subset of natural English. With the help of a predictive editor, users can compose queries and submit them to the Coral system, which then automatically translates them into formal AQL statements. We give an overview of the controlled natural language developed for Coral and describe the functionalities of the predictive editor provided for it. We also report on a user experiment in which the system was evaluated. The results show that, with Coral, corpora of annotated texts can be queried more easily and more quickly than with the existing ANNIS interface. Our system demonstrates that complex corpora can be accessed without the need to learn a complicated formal query language.

Image Mining from Gel Diagrams in Biomedical Publications

Authors of biomedical publications often use gel images to report experimental results such as protein-protein interactions or protein expressions under different conditions. Gel images offer a way to concisely communicate such findings, not all of which need to be explicitly discussed in the article text. This fact together with the abundance of gel images and their shared common patterns makes them prime candidates for image mining endeavors. We introduce an approach for the detection of gel images, and present an automatic workflow to analyze them. We are able to detect gel segments and panels at high accuracy, and present first results for the identification of gene names in these images. While we cannot provide a complete solution at this point, we present evidence that this kind of image mining is feasible.

Underspecified Scientific Claims in Nanopublications

The application range of nanopublications (small entities of scientific results in RDF representation) could be greatly extended if complete formal representations were not mandatory. To that end, we present an approach to represent and interlink scientific claims in an underspecified way, based on independent English sentences.

An Evaluation Framework for Controlled Natural Languages

This paper presents a general framework called ontographs that relies on a graphical notation and enables the tool-independent and reliable evaluation of human understandability of knowledge representation languages. An experiment with 64 participants is presented that applies this framework and compares a controlled natural language to a common formal language. The results show that the controlled natural language is easier to understand, needs less learning time, and is more accepted by its users.

Codeco: A Practical Notation for Controlled English Grammars in Predictive Editors

This paper introduces a new grammar notation, called Codeco, designed for controlled natural language (CNL) and predictive editors. Existing grammar frameworks that target either formal or natural languages do not work out particularly well for CNL, especially if they are to be used in predictive editors and if anaphoric references should be resolved in a deterministic way. It is not trivial to build predictive editors that can precisely determine which anaphoric references are possible at a certain position. This paper shows how such complex structures can be represented in Codeco, a novel grammar notation for CNL. Two different parsers have been implemented (one in Prolog and another one in Java) and a large subset of Attempto Controlled English (ACE) has been represented in Codeco. The results show that Codeco is practical, adequate and efficient.

Writing Clinical Practice Guidelines in Controlled Natural Language

Clinicians could benefit from decision support systems incorporating the knowledge contained in clinical practice guidelines. However, the unstructured form of these guidelines makes them unsuitable for formal representation. To address this challenge we translated a complete set of pediatric guideline recommendations into Attempto Controlled English (ACE). One experienced pediatrician, one physician and a knowledge engineer assessed that a suitably extended version of ACE can accurately and naturally represent the clinical concepts and the proposed actions of the guidelines. Currently, we are developing a systematic and replicable approach to authoring guideline recommendations in ACE.