Papers by Benjamin Van Durme
Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
Inferring attributes from search queries
Systems, techniques, and machine-readable instructions for inferring attributes from search queries. In one aspect, a method includes receiving a description of a collection of search queries, inferring attributes of entities from the description of the collection of search queries, associating the inferred attributes with identifiers of entities characterized by the attributes, and making the associations of the attributes and entities available.
Abstract We have created layers of annotation on the English Gigaword v. 5 corpus to render it useful as a standardized corpus for knowledge extraction and distributional semantics. Most existing large-scale work is based on inconsistent corpora that have often had to be re-annotated independently by each research team, each time introducing biases that manifest as results comparable only at a high level.
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium
Probabilistic Counting as an Extension to Randomized Count Storage
Abstract Previous work by Talbot and Osborne (2007a) explored the use of randomized storage mechanisms in language modeling. These structures trade a small amount of error for significant space savings, enabling the use of larger language models on relatively modest hardware.
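The probabilistic counting referred to in the title above can be illustrated with a Morris-style approximate counter, which stores a small exponent instead of a full count and so fits naturally into a randomized store. A minimal sketch, not the paper's exact mechanism:

```python
import random

class MorrisCounter:
    """Morris approximate counter: stores only an exponent c and
    estimates the true count as 2**c - 1, trading a small amount of
    error for logarithmic space."""

    def __init__(self):
        self.c = 0

    def increment(self):
        # Bump the exponent with probability 2**-c, so the estimate
        # tracks the true count in expectation.
        if random.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1
```

Storing such counters in a randomized structure means each key costs only a few bits of exponent rather than a full integer, in the spirit of the extension described above.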
Abstract With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary.
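To make the parameter count concrete: in standard LDA each of K topics is an independent multinomial over a V-word vocabulary, all drawn from one shared Dirichlet prior, so the topics alone contribute K x V free parameters. A minimal sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 10_000, 100     # vocabulary size, number of topics
beta = 0.01            # symmetric Dirichlet hyperparameter

# Each topic is an independent draw from the same Dirichlet, giving
# K * V = 1,000,000 free parameters for the topics alone.
topics = rng.dirichlet(np.full(V, beta), size=K)   # shape (K, V)
```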
Abstract Inferring attributes of discourse participants has been treated as a batch-processing task: data such as all tweets from a given author are gathered in bulk, processed, analyzed for a particular feature, then reported as a result of academic interest. Given the sources and scale of material used in these efforts, along with potential use cases of such analytic tools, discourse analysis should be reconsidered as a streaming challenge.
Abstract Speakers of many different languages use the Internet. A common activity among these users is uploading images and associating these images with words (in their own language) as captions, filenames, or surrounding text. We use these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations. Bilingual pairs of words are proposed as translations if their corresponding images have similar visual features.
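A hedged sketch of the core idea: represent each word by the mean visual feature vector of the images associated with it, then propose cross-lingual pairs whose vectors are similar. The function names and threshold are illustrative, not the paper's:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def propose_translations(src_feats, tgt_feats, threshold=0.8):
    """src_feats/tgt_feats map each word (in the source/target language)
    to the mean feature vector of its associated images."""
    pairs = []
    for src, u in src_feats.items():
        for tgt, v in tgt_feats.items():
            score = cosine(u, v)
            if score >= threshold:
                pairs.append((src, tgt, score))
    return sorted(pairs, key=lambda p: -p[2])
```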
Abstract The JAVELIN system evaluated at TREC 2003 is an integrated architecture for open-domain question answering. JAVELIN employs a modular approach that addresses individual aspects of the QA task in an abstract manner. The system implements a planner that controls execution and information flow, as well as multiple answer-seeking strategies applied differently depending on the type of question.
Abstract Recent studies have shown the applicability of streaming and randomized algorithms in a variety of large-scale language mining tasks. However, the lack of publicly available implementations of these methods has limited their use beyond initial proofs of concept, despite the growing interest in large-scale data.
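As one example of the class of streaming, randomized methods these studies cover, a Count-Min sketch answers token-frequency queries over a stream in fixed memory, overestimating but never underestimating. A generic sketch, not an implementation from the paper:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One deterministic hash per row, reduced to a column index.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        # Collisions only inflate cells, so the minimum over rows
        # upper-bounds the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```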
Topic models for corpus-centric knowledge generalization
Abstract Many previous efforts in generalizing over knowledge extracted from text have relied on manually created word sense hierarchies, such as WordNet. We present initial results on generalizing over textually derived knowledge through the LDA topic model framework, as a first step towards automatically building corpus-specific ontologies.
Abstract Prior work has shown the utility of syntactic tree fragments as features in judging the grammaticality of text. To date such fragments have been extracted from derivations of Bayesian-induced Tree Substitution Grammars (TSGs). Evaluating on discriminative coarse and fine grammaticality classification tasks, we show that a simple, deterministic, count-based approach to fragment identification performs on par with the more complicated grammars of Post (2011).
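A simplified stand-in for the count-based identification step, restricted to depth-one fragments (plain CFG productions) via NLTK; the fragments used above can be deeper, so treat this only as a sketch of the counting idea:

```python
from collections import Counter
from nltk import Tree  # assumes nltk is installed

def frequent_fragments(treebank_strings, min_count=2):
    """Count depth-one fragments (productions) across parse trees and
    keep those seen at least min_count times as candidate features."""
    counts = Counter()
    for s in treebank_strings:
        counts.update(Tree.fromstring(s).productions())
    return {frag: n for frag, n in counts.items() if n >= min_count}
```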
Abstract We report on the large-scale acquisition of class attributes with and without the use of lists of representative instances, as well as the discovery of unary attributes, such as those typically expressed in English through prenominal adjectival modification. Our method employs a system based on compositional language processing, as applied to the British National Corpus.
Abstract Within the larger area of automatic acquisition of knowledge from the Web, we introduce a method for extracting relevant attributes, or quantifiable properties, for various classes of objects. The method extracts attributes such as capital city and President for the class Country, or cost, manufacturer and side effects for the class Drug, without relying on any expensive language resources or complex processing tools.
Deriving Generic Statements using Corpus Acquired Knowledge and WordNet
Abstract Existing work in the extraction of commonsense knowledge from text has been restricted to factoids that serve as statements about what may possibly obtain in the world. We present an approach to deriving stronger general claims from large sets of factoids. The idea is to coalesce the observed nominals for a given predicate argument into a few predominant types, obtained as WordNet synsets.
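The coalescing step can be sketched with NLTK's WordNet interface: map each observed nominal to its hypernyms and keep the few synsets that cover the most nominals. The first-sense restriction and simple voting here are simplifying assumptions, not the paper's exact procedure:

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # assumes the wordnet data is installed

def predominant_types(nominals, top_k=3):
    """Vote each nominal's direct hypernyms (first noun sense only) and
    return the top_k synsets as candidate predominant types."""
    votes = Counter()
    for word in nominals:
        senses = wn.synsets(word, pos=wn.NOUN)
        if senses:
            votes.update(senses[0].hypernyms())
    return votes.most_common(top_k)
```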
Abstract A new approach to large-scale information extraction exploits both Web documents and query logs to acquire thousands of open-domain classes of instances, along with relevant sets of open-domain class attributes, at precision levels previously obtained only on small-scale, manually assembled classes.
Abstract We provide a model that extends the split-merge framework of Petrov et al. (2006) to jointly learn latent annotations and Tree Substitution Grammars (TSGs). We then conduct a variety of experiments with this model, first inducing grammars on a portion of the Penn Treebank and the Korean Treebank 2.0, and next experimenting with grammar refinement from a single nonterminal and from the Universal Part of Speech tagset.
Abstract Documents in languages such as Chinese, Japanese, and Korean sometimes annotate terms with their English translations inside a pair of parentheses. We present a method to extract such translations from a large collection of web documents by building a partially parallel corpus and using a word alignment algorithm to identify the terms being translated. The method generalizes across the translations of different terms and can reliably extract translations that occur only once on the entire web.
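Candidate extraction can be sketched with a regular expression matching a CJK term followed by an English parenthetical; the word-alignment stage that decides how much of the preceding term the parenthetical actually translates is the paper's contribution and is omitted here:

```python
import re

# A run of CJK characters followed by a parenthesized ASCII phrase,
# e.g. "世界贸易组织 (World Trade Organization)"; both ASCII and
# full-width parentheses occur in practice.
PATTERN = re.compile(
    r"([\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]+)"
    r"\s*[(（]\s*([A-Za-z][A-Za-z .,'-]*)\s*[)）]"
)

def candidate_pairs(text):
    """Yield (term, english) candidates for later alignment filtering."""
    for match in PATTERN.finditer(text):
        yield match.group(1), match.group(2).strip()
```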
Method We compare word frequency estimates from the Google N-Gram Corpus to behavioral data from three tasks. Following previous work [1, 3], we model lexical decision and word naming [4]. We extend this approach to picture naming [5, 12]. In each model, we use frequency counts from the written portion of the CELEX database as a baseline. All other frequency terms enter the models as residuals, obtained by regressing the CELEX written frequency out of the relevant count.
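A minimal sketch of the residualization step, assuming log-transformed counts and ordinary least squares; the transform and variable names are assumptions, not details given above:

```python
import numpy as np

def residualize(target_counts, baseline_counts):
    """Regress the baseline (CELEX written frequency) out of another
    frequency measure and return the residuals that enter the model."""
    x = np.log1p(np.asarray(baseline_counts, dtype=float))
    y = np.log1p(np.asarray(target_counts, dtype=float))
    X = np.column_stack([np.ones_like(x), x])   # intercept + baseline
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta
```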
Abstract The everyday intelligence of both humans and machines relies on a large store of background, or common-sense, knowledge. That such a knowledge base is not yet available to machines partially explains the community's inability to provide society with the sort of synthetic intelligence described by futurists such as Turing or Asimov. In response, a variety of methods for automated Knowledge Acquisition (KA) have emerged and are now being actively explored.