Computer Science Papers (Contributor) by Peter McClanahan
We define a probabilistic morphological analyzer using a data-driven approach for Syriac in order to facilitate the creation of an annotated corpus. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. We introduce novel probabilistic models for segmentation, dictionary linkage, and morphological tagging and connect them in a pipeline to create a probabilistic morphological analyzer requiring only labeled data. We explore the performance of models with varying amounts of training data and find that with about 34,500 labeled tokens, we can outperform a reasonable baseline trained on over 99,000 tokens and achieve an accuracy of just over 80%. When trained on all available training data, our joint model achieves 86.47% accuracy, a 29.7% reduction in error rate over the baseline.
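For concreteness, the reported figures can be checked with a few lines of Python: error rate is one minus accuracy, and the reduction is measured against the baseline's error. The 80.75% baseline accuracy below is back-computed from the paper's numbers, not quoted from it.

```python
def error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Fraction of the baseline's error rate eliminated by the model."""
    baseline_err = 1.0 - baseline_acc
    model_err = 1.0 - model_acc
    return (baseline_err - model_err) / baseline_err

# With the joint model at 86.47% accuracy, a 29.7% error reduction implies
# a baseline of roughly 80.75% -- consistent with "just over 80%".
print(f"{error_reduction(0.8075, 0.8647):.1%}")  # ~29.7%
```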
Papers by Peter McClanahan

Active learning for part-of-speech tagging
Proceedings of the Linguistic Annotation Workshop (LAW '07), 2007
In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual tagging efforts in order to deliver an annotation of highest quality. In this context, we find that active learning is always helpful. We focus on Query by Uncertainty (QBU) and Query by Committee (QBC) and report on experiments with several baselines and new variations of QBC and QBU, inspired by weaknesses particular to their use in this application. Experiments on English prose and poetry test these approaches and evaluate their robustness. The results allow us to make recommendations for both types of text and raise questions that will lead to further inquiry.
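A minimal sketch of the two query strategies named above, assuming a trained tagger that exposes per-token tag distributions. The entropy-based uncertainty score and the disagreement count are standard formulations, not necessarily the paper's exact variants.

```python
import math
from typing import Callable, Sequence

def qbu_score(tag_probs: Sequence[Sequence[float]]) -> float:
    """Query by Uncertainty: total per-token entropy of the tagger's
    tag distributions; tag_probs[i][t] = P(tag t | token i)."""
    return sum(-sum(p * math.log(p) for p in dist if p > 0.0)
               for dist in tag_probs)

def qbc_score(committee_tags: Sequence[Sequence[str]]) -> int:
    """Query by Committee: number of token positions at which the
    committee members' tag sequences disagree."""
    return sum(len(set(col)) > 1 for col in zip(*committee_tags))

def select_next(unlabeled, score: Callable):
    """Ask the human to annotate the most informative sentence."""
    return max(unlabeled, key=score)
```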
Modeling the Annotation Process for Ancient Corpus
Charles University, 2007
In corpus creation, human annotation is expensive. Annotation costs can be minimized through machine learning and active learning; however, there are many complex interactions among the machine learner, the active learning technique, the annotation cost, human annotation accuracy, the annotator user interface, and several other elements of the process. For example, we show that changing the way in which annotators are paid can drastically change the performance of active learning techniques. To date, these interactions have been poorly understood. We introduce a decision-theoretic model of the annotation process suitable for ancient corpus annotation that clarifies these interactions and can guide the development of a corpus creation project.
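The decision-theoretic framing can be sketched as choosing, at each step, the affordable action with the best expected gain per unit cost. Everything below (the action structure, the utility ratio) is a hypothetical illustration, not the paper's model.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str             # e.g. "hand-annotate" or "correct machine output"
    cost_dollars: float   # expected payment to the annotator
    expected_gain: float  # expected improvement in corpus quality

def best_action(actions: list[Action], budget_remaining: float) -> Action:
    """Pick the affordable action with the highest gain per dollar.
    Note how changing the payment scheme (cost_dollars) can reorder the
    ranking -- echoing the abstract's observation about annotator pay."""
    affordable = [a for a in actions if a.cost_dollars <= budget_remaining]
    return max(affordable, key=lambda a: a.expected_gain / a.cost_dollars)
```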

Proceedings of LREC 2010
Expert human input can contribute in various ways to facilitate automatic annotation of natural language text. For example, a part-of-speech tagger can be trained on labeled input provided offline by experts. In addition, expert input can be solicited by way of active learning to make the most of annotator expertise. However, hiring individuals to perform manual annotation is costly both in terms of money and time. This paper reports on a user study that was performed to determine the degree of effect that a part-of-speech dictionary has on a group of subjects performing the annotation task. The user study was conducted using a modular, web-based interface created specifically for text annotation tasks. The user study found that, for both native and non-native English speakers, a dictionary with greater than 60% coverage was effective at reducing annotation time and increasing annotator accuracy. On the basis of this study, we predict that using a part-of-speech tag dictionary with coverage greater than 60% can reduce the cost of annotation in terms of both time and money.
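One plausible reading of the coverage figure is the fraction of tokens whose correct tag appears in the dictionary; the helper below illustrates that definition and is an assumption, not the study's measure.

```python
def dictionary_coverage(tokens, gold_tags, tag_dict) -> float:
    """Fraction of tokens for which the dictionary lists the gold tag.
    tag_dict maps a word form to the set of tags it may take."""
    hits = sum(tag in tag_dict.get(tok, set())
               for tok, tag in zip(tokens, gold_tags))
    return hits / len(tokens)

# Per the study, dictionaries above ~60% coverage reduced annotation
# time and improved annotator accuracy.
```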
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010
We are interested in diacritizing Semitic languages, especially Syriac, using only diacritized texts. Previous methods have required the use of tools such as part-of-speech taggers, segmenters, morphological analyzers, and linguistic rules to produce state-of-the-art results. We present a low-resource, data-driven, and language-independent approach that uses a hybrid word- and consonant-level conditional Markov model. Our approach rivals the best previously published results in Arabic (15% WER with case endings), without the use of a morphological analyzer. In Syriac, we reduce the WER over a strong baseline by 30% to achieve a WER of 10.5%. We also report results for Hebrew and English.
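A stripped-down sketch of the consonant-level idea: predict each diacritic from the current consonant and the previous diacritic decision. The paper's model is a feature-based conditional Markov model combined with a word-level component; the count-based estimator and greedy decoder here are simplifying assumptions for illustration.

```python
from collections import defaultdict

def train_consonant_model(diacritized_words):
    """Estimate P(diacritic | consonant, previous diacritic) by counting.
    Each training word is a list of (consonant, diacritic) pairs,
    with '' marking an absent diacritic."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in diacritized_words:
        prev = '<s>'
        for consonant, diacritic in word:
            counts[(consonant, prev)][diacritic] += 1
            prev = diacritic
    return counts

def diacritize_word(consonants, counts):
    """Greedy left-to-right decoding: pick the most frequent diacritic
    given the consonant and the previous decision."""
    out, prev = [], '<s>'
    for c in consonants:
        options = counts.get((c, prev))
        d = max(options, key=options.get) if options else ''
        out.append(c + d)
        prev = d
    return ''.join(out)
```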
Assessing the Costs of Machine-Assisted Corpus Annotation through a User Study
Proceedings of LREC 2008
Fixed, limited budgets often constrain the amount of expert annotation that can go into the construction of annotated corpora. Estimating the cost of annotation is the first step toward using annotation resources wisely. We present here a study of the cost of annotation. This study includes the participation of annotators at various skill levels and with varying backgrounds. Conducted over the web, the study consists of tests that simulate machine-assisted pre-annotation, requiring correction by the annotator rather than annotation from ...
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Short Papers (HLT '08), 2008
Traditional Active Learning (AL) techniques assume that the annotation of each datum costs the same. This is not the case when annotating sequences; some sequences will take longer than others. We show that the AL technique which performs best depends on how cost is measured. Applying an hourly cost model based on the results of an annotation user study, we approximate the amount of time necessary to annotate a given sentence. This model allows us to evaluate the effectiveness of AL sampling methods in terms of time spent in annotation. We achieve a 77% reduction in hours from a random baseline to achieve 96.5% tag accuracy on the Penn Treebank. More significantly, we make the case for measuring cost in assessing AL methods.
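The cost-sensitive evaluation can be sketched as follows. The linear form of the timing model (a per-sentence constant plus a per-token rate) and its constants are hypothetical stand-ins, not the user study's fitted model.

```python
def annotation_hours(sentences, setup_sec: float = 5.0,
                     per_token_sec: float = 2.0) -> float:
    """Hypothetical hourly cost model: each sentence costs a fixed setup
    time plus time proportional to its length in tokens."""
    total_sec = sum(setup_sec + per_token_sec * len(s) for s in sentences)
    return total_sec / 3600.0

# Cost-sensitive evaluation: compare AL sampling methods by plotting
# tagger accuracy against cumulative annotation hours, not sentence counts.
```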
