Towards Well-Grounded Phrase-Level Polarity Analysis
Lecture Notes in Computer Science, 2011
ABSTRACT We propose a new rule-based system for phrase-level polarity analysis and show how it benefits from empirically validating its polarity composition through surveys with human subjects. The system’s two-layer architecture and its underlying structure, i.e. its composition model, are presented. Two functions for polarity aggregation are introduced that operate on newly defined semantic categories. These categories detach a word’s syntactic behavior from its semantic behavior. An experimental setup is described that we use to carry out a thorough evaluation. It incorporates a newly created German-language data set that is made freely and publicly available. This data set contains polarity annotations at word-level, phrase-level and sentence-level and facilitates comparability between different studies as well as reproducibility of our results.
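To illustrate the general idea of compositional polarity aggregation over semantic categories, the following is a minimal sketch. The lexicon, the category names (polar, shifter, intensifier) and the aggregation rule are hypothetical and do not reproduce the paper's actual composition model or its two aggregation functions.

```python
# Minimal sketch of phrase-level polarity aggregation (illustrative only;
# lexicon, categories and rules are hypothetical, not the paper's model).

# Word-level prior polarities for polar words, in [-1.0, 1.0].
PRIOR = {"gut": 0.8, "schlecht": -0.7}

# Semantic categories, detached from syntactic word class.
CATEGORY = {"gut": "polar", "schlecht": "polar",
            "nicht": "shifter", "sehr": "intensifier"}


def aggregate(words):
    """Left-to-right aggregation: intensifiers scale, shifters invert,
    polar words contribute their scaled prior polarity."""
    polarity, scale = 0.0, 1.0
    for w in words:
        cat = CATEGORY.get(w, "neutral")
        if cat == "intensifier":
            scale *= 1.5
        elif cat == "shifter":
            scale *= -1.0
        elif cat == "polar":
            polarity += scale * PRIOR[w]
            scale = 1.0  # modifier scope ends at the polar word
    return max(-1.0, min(1.0, polarity))  # clamp to [-1.0, 1.0]


if __name__ == "__main__":
    print(aggregate(["nicht", "sehr", "gut"]))  # negated, intensified positive -> negative
    print(aggregate(["sehr", "schlecht"]))      # intensified negative
```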
ABSTRACT We present methods for labeling queries for a specialized search engine: a people search engine. We propose several methods of varying complexity, ranging from simple probabilistic models to Conditional Random Fields. All methods are evaluated on a manually annotated corpus of queries submitted to a people search engine. Additionally, we analyze this corpus with respect to typical search patterns and their distribution.
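As a rough illustration of the simple probabilistic end of that spectrum, the sketch below labels each query token with its most probable label estimated from annotated queries. The tag set and training examples are made up; the paper's actual features, label inventory and CRF configuration are not reproduced here.

```python
# Minimal sketch of a simple probabilistic query labeler (hypothetical tag
# set and toy training data; not the paper's actual method).
from collections import Counter, defaultdict

# Toy training data: (token, label) pairs from annotated people-search queries.
TRAIN = [
    ("john", "FIRSTNAME"), ("smith", "LASTNAME"),
    ("anna", "FIRSTNAME"), ("mueller", "LASTNAME"),
    ("berlin", "LOCATION"), ("lawyer", "PROFESSION"),
]

# Estimate P(label | token) by counting label occurrences per token.
counts = defaultdict(Counter)
for token, label in TRAIN:
    counts[token][label] += 1


def label_query(query, default="OTHER"):
    """Assign each token its most frequent training label, else a default."""
    labels = []
    for token in query.lower().split():
        if counts[token]:
            labels.append(counts[token].most_common(1)[0][0])
        else:
            labels.append(default)
    return list(zip(query.split(), labels))


if __name__ == "__main__":
    print(label_query("Anna Mueller Berlin"))
```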
Language statistics are widely used to characterize and better understand language. In parallel, the number of text mining and information retrieval methods has grown rapidly over the last decades, with many algorithms evaluated on standardized corpora, often drawn from newspapers. However, there have so far been almost no attempts to link the areas of natural language processing and language statistics in order to properly characterize those evaluation corpora and to help others pick the most appropriate algorithms for their particular corpus. We believe no results in the field of natural language processing should be published without quantitatively describing the corpora used. Only then can the real value of proposed methods be determined and their transferability to corpora originating from different genres or domains be estimated. We lay the groundwork for a language engineering process by gathering and defining a set of textual characteristics we consider valuable with respect to building natural language processing systems. We carry out a case study on the analysis of automotive repair orders and explicitly call upon the scientific community to provide feedback and help establish a good practice of corpus-aware evaluations.
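The sketch below computes a few simple surface statistics of the kind such a quantitative corpus description might include. The chosen metrics (token count, vocabulary size, type-token ratio, average sentence length) are illustrative assumptions, not the paper's defined set of textual characteristics.

```python
# Minimal sketch of computing simple corpus characteristics (illustrative
# metric selection; not the paper's actual characteristic set).
import re


def corpus_characteristics(sentences):
    """Compute simple surface statistics over a list of sentences."""
    tokens = [t.lower() for s in sentences for t in re.findall(r"\w+", s)]
    types = set(tokens)
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "vocabulary_size": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    corpus = ["Replace front brake pads.", "Check engine oil and replace filter."]
    print(corpus_characteristics(corpus))
```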
Recently, textual characteristics, i.e. certain language statistics, have been proposed to compare corpora originating from different genres and domains, to give guidance in language engineering processes and to estimate the transferability of natural language processing algorithms from one corpus to another. However, until now it has been unclear how these textual characteristics behave for different-sized corpora. We monitor the behavior of 7 textual characteristics across 4 genres (news articles, Wikipedia articles, general web text and forum posts) and 10 corpus sizes, ranging from 100 to 3,000,000 sentences. We show that certain textual characteristics are almost constant across corpus sizes and thus might be used to reliably compare different-sized corpora, while others are highly corpus-size-dependent and thus may only be used to compare similar- or same-sized corpora. Moreover, we find that although textual characteristics vary from genre to genre, their behavior for increasing corpus sizes is quite similar.
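The following sketch shows one way such size-dependence could be probed: subsample increasingly large sub-corpora and track a single characteristic (here the type-token ratio, which is well known to be size-dependent). The sample sizes and the characteristic are illustrative assumptions, not the 7 characteristics and 10 sizes used in the study.

```python
# Minimal sketch of tracking one textual characteristic across corpus sizes
# (illustrative; not the study's actual characteristics or size grid).
import random
import re


def type_token_ratio(sentences):
    tokens = [t.lower() for s in sentences for t in re.findall(r"\w+", s)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ttr_by_corpus_size(sentences, sizes, seed=0):
    """Sample increasingly large sub-corpora and report the TTR of each."""
    rng = random.Random(seed)
    return {n: type_token_ratio(rng.sample(sentences, min(n, len(sentences))))
            for n in sizes}


if __name__ == "__main__":
    corpus = [f"sentence number {i} about topic {i % 50}" for i in range(10_000)]
    print(ttr_by_corpus_size(corpus, sizes=[100, 1_000, 10_000]))
```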
An analysis of a diachronically organised corpus of German-language newspaper articles and blog posts on economy and finance is presented using a prototype dictionary of affect in German. The changes in the frequency of occurrence of positive and negative polarity words are rendered as return time series and the properties of this time series are described. The returns and
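As a small illustration of rendering frequency changes as a return time series, the sketch below computes log returns over consecutive frequency observations. The frequency values are made up, and the abstract does not specify whether simple or log returns were used; log returns are assumed here.

```python
# Minimal sketch of turning a frequency time series into log returns
# (hypothetical frequencies; return definition assumed, not taken from the paper).
import math


def log_returns(series):
    """r_t = log(f_t / f_{t-1}) for consecutive, strictly positive frequencies."""
    return [math.log(curr / prev) for prev, curr in zip(series, series[1:])]


if __name__ == "__main__":
    # Hypothetical daily relative frequencies of negative polarity words.
    neg_freq = [0.012, 0.015, 0.011, 0.014, 0.013]
    print(log_returns(neg_freq))
```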