Computer Science Papers (Contributor) by Peter McClanahan
We define a probabilistic morphological analyzer using a data-driven approach for Syriac in order to facilitate the creation of an annotated corpus. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. We introduce novel probabilistic models for segmentation, dictionary linkage, and morphological tagging and connect them in a pipeline to create a probabilistic morphological analyzer requiring only labeled data. We explore the performance of models with varying amounts of training data and find that with about 34,500 labeled tokens, we can outperform a reasonable baseline trained on over 99,000 tokens and achieve an accuracy of just over 80%. When trained on all available training data, our joint model achieves 86.47% accuracy, a 29.7% reduction in error rate over the baseline.
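For concreteness, the reported figures can be checked with a few lines of Python: error rate is one minus accuracy, and the reduction is measured against the baseline's error. The 80.75% baseline accuracy below is back-computed from the paper's numbers, not quoted from it.

```python
def error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Fraction of the baseline's error rate eliminated by the model."""
    baseline_err = 1.0 - baseline_acc
    model_err = 1.0 - model_acc
    return (baseline_err - model_err) / baseline_err

# With the joint model at 86.47% accuracy, a 29.7% error reduction implies
# a baseline of roughly 80.75% -- consistent with "just over 80%".
print(f"{error_reduction(0.8075, 0.8647):.1%}")  # ~29.7%
```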
Papers by Peter McClanahan

Active learning for part-of-speech tagging
Proceedings of the Linguistic Annotation Workshop (LAW '07), 2007
In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual tagging efforts in order to deliver an annotation of highest quality. In this context, we find that active learning is always helpful. We focus on Query by Uncertainty (QBU) and Query by Committee (QBC) and report on experiments with several baselines and new variations of QBC and QBU, inspired by weaknesses particular to their use in this application. Experiments on English prose and poetry test these approaches and evaluate their robustness. The results allow us to make recommendations for both types of text and raise questions that will lead to further inquiry.
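A minimal sketch of the two query strategies named above, assuming a trained tagger that exposes per-token tag distributions. The entropy-based uncertainty score and the disagreement count are standard formulations, not necessarily the paper's exact variants.

```python
import math
from typing import Callable, Sequence

def qbu_score(tag_probs: Sequence[Sequence[float]]) -> float:
    """Query by Uncertainty: total per-token entropy of the tagger's
    tag distributions; tag_probs[i][t] = P(tag t | token i)."""
    return sum(-sum(p * math.log(p) for p in dist if p > 0.0)
               for dist in tag_probs)

def qbc_score(committee_tags: Sequence[Sequence[str]]) -> int:
    """Query by Committee: number of token positions at which the
    committee members' tag sequences disagree."""
    return sum(len(set(col)) > 1 for col in zip(*committee_tags))

def select_next(unlabeled, score: Callable):
    """Ask the human to annotate the most informative sentence."""
    return max(unlabeled, key=score)
```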
Modeling the Annotation Process for Ancient Corpus
Charles University, 2007
In corpus creation, human annotation is expensive. Annotation costs can be minimized through machine learning and active learning; however, there are many complex interactions among the machine learner, the active learning technique, the annotation cost, human annotation accuracy, the annotator user interface, and several other elements of the process. For example, we show that changing the way in which annotators are paid can drastically change the performance of active learning techniques. To date, these interactions have been poorly understood. We introduce a decision-theoretic model of the annotation process suitable for ancient corpus annotation that clarifies these interactions and can guide the development of a corpus creation project.
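The decision-theoretic framing can be sketched as choosing, at each step, the affordable action with the best expected gain per unit cost. Everything below (the action structure, the utility ratio) is a hypothetical illustration, not the paper's model.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str             # e.g. "hand-annotate" or "correct machine output"
    cost_dollars: float   # expected payment to the annotator
    expected_gain: float  # expected improvement in corpus quality

def best_action(actions: list[Action], budget_remaining: float) -> Action:
    """Pick the affordable action with the highest gain per dollar.
    Note how changing the payment scheme (cost_dollars) can reorder the
    ranking -- echoing the abstract's observation about annotator pay."""
    affordable = [a for a in actions if a.cost_dollars <= budget_remaining]
    return max(affordable, key=lambda a: a.expected_gain / a.cost_dollars)
```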

Proceedings of LREC 2010
Expert human input can contribute in various ways to facilitate automatic annotation of natural language text. For example, a part-of-speech tagger can be trained on labeled input provided offline by experts. In addition, expert input can be solicited by way of active learning to make the most of annotator expertise. However, hiring individuals to perform manual annotation is costly both in terms of money and time. This paper reports on a user study that was performed to determine the degree of effect that a part-of-speech dictionary has on a group of subjects performing the annotation task. The user study was conducted using a modular, web-based interface created specifically for text annotation tasks. The user study found that, for both native and non-native English speakers, a dictionary with greater than 60% coverage was effective at reducing annotation time and increasing annotator accuracy. On the basis of this study, we predict that using a part-of-speech tag dictionary with coverage greater than 60% can reduce the cost of annotation in terms of both time and money.
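One plausible reading of the coverage figure is the fraction of tokens whose correct tag appears in the dictionary; the helper below illustrates that definition and is an assumption, not the study's measure.

```python
def dictionary_coverage(tokens, gold_tags, tag_dict) -> float:
    """Fraction of tokens for which the dictionary lists the gold tag.
    tag_dict maps a word form to the set of tags it may take."""
    hits = sum(tag in tag_dict.get(tok, set())
               for tok, tag in zip(tokens, gold_tags))
    return hits / len(tokens)

# Per the study, dictionaries above ~60% coverage reduced annotation
# time and improved annotator accuracy.
```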
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010
We are interested in diacritizing Semitic languages, especially Syriac, using only diacritized texts. Previous methods have required the use of tools such as part-of-speech taggers, segmenters, morphological analyzers, and linguistic rules to produce state-of-the-art results. We present a low-resource, data-driven, and language-independent approach that uses a hybrid word- and consonant-level conditional Markov model. Our approach rivals the best previously published results in Arabic (15% WER with case endings), without the use of a morphological analyzer. In Syriac, we reduce the WER over a strong baseline by 30% to achieve a WER of 10.5%. We also report results for Hebrew and English.
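A stripped-down sketch of the consonant-level idea: predict each diacritic from the current consonant and the previous diacritic decision. The paper's model is a feature-based conditional Markov model combined with a word-level component; the count-based estimator and greedy decoder here are simplifying assumptions for illustration.

```python
from collections import defaultdict

def train_consonant_model(diacritized_words):
    """Estimate P(diacritic | consonant, previous diacritic) by counting.
    Each training word is a list of (consonant, diacritic) pairs,
    with '' marking an absent diacritic."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in diacritized_words:
        prev = '<s>'
        for consonant, diacritic in word:
            counts[(consonant, prev)][diacritic] += 1
            prev = diacritic
    return counts

def diacritize_word(consonants, counts):
    """Greedy left-to-right decoding: pick the most frequent diacritic
    given the consonant and the previous decision."""
    out, prev = [], '<s>'
    for c in consonants:
        options = counts.get((c, prev))
        d = max(options, key=options.get) if options else ''
        out.append(c + d)
        prev = d
    return ''.join(out)
```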
Assessing the Costs of Machine-Assisted Corpus Annotation through a User Study
Proceedings of LREC 2008
Fixed, limited budgets often constrain the amount of expert annotation that can go into the construction of annotated corpora. Estimating the cost of annotation is the first step toward using annotation resources wisely. We present here a study of the cost of annotation. This study includes the participation of annotators at various skill levels and with varying backgrounds. Conducted over the web, the study consists of tests that simulate machine-assisted pre-annotation, requiring correction by the annotator rather than annotation from ...
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Short Papers (HLT '08), 2008
Traditional Active Learning (AL) techniques assume that the annotation of each datum costs the same. This is not the case when annotating sequences; some sequences will take longer than others. We show that the AL technique which performs best depends on how cost is measured. Applying an hourly cost model based on the results of an annotation user study, we approximate the amount of time necessary to annotate a given sentence. This model allows us to evaluate the effectiveness of AL sampling methods in terms of time spent in annotation. We achieve a 77% reduction in hours from a random baseline to achieve 96.5% tag accuracy on the Penn Treebank. More significantly, we make the case for measuring cost in assessing AL methods.
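The cost-sensitive evaluation can be sketched as follows. The linear form of the timing model (a per-sentence constant plus a per-token rate) and its constants are hypothetical stand-ins, not the user study's fitted model.

```python
def annotation_hours(sentences, setup_sec: float = 5.0,
                     per_token_sec: float = 2.0) -> float:
    """Hypothetical hourly cost model: each sentence costs a fixed setup
    time plus time proportional to its length in tokens."""
    total_sec = sum(setup_sec + per_token_sec * len(s) for s in sentences)
    return total_sec / 3600.0

# Cost-sensitive evaluation: compare AL sampling methods by plotting
# tagger accuracy against cumulative annotation hours, not sentence counts.
```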
