Accelerating Corpus Annotation through Active Learning
Abstract
The paper discusses methods for accelerating corpus annotation through active learning techniques, emphasizing the use of sequence labeling for predicting tags and the Viterbi algorithm to ensure the best overall sequence of tags. It highlights the importance of incorporating various features for part-of-speech tagging, including lexical and contextual information, to enhance the efficiency of data annotation while minimizing the effort required from human annotators.
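The Viterbi decoding mentioned in the abstract can be sketched in a few lines of Python. The toy tag set, transition, and emission probabilities below are illustrative assumptions, not values from the paper:

```python
# Minimal Viterbi decoding sketch for a toy HMM-style POS tagger.
# All probabilities here are made up for illustration.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    # V[i][t]: best probability of any tag path ending in tag t at position i
    V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][p] * trans_p[p][t] * emit_p[t].get(words[i], 1e-6), p)
                for p in tags
            )
            V[i][t] = prob
            back[i][t] = prev
    # Trace back from the best final tag
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "cat": 0.5},
    "VB": {"barks": 0.8},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# -> ['DT', 'NN', 'VB']
```

The dynamic program guarantees the globally best tag sequence rather than the best tag per word, which is why taggers pair per-position scores with Viterbi decoding.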
Related papers
Language Resources and Evaluation, 2012
This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system achieves 97.75% accuracy on the French Treebank, an error reduction of 25% (38% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
Lecture Notes in Computer Science, 2008
Sequence labeling problems are commonly encountered in many natural language and query processing tasks. SVMstruct is a supervised learning algorithm that provides a flexible and effective way to solve such problems. However, a large number of training examples is often required to train SVMstruct, which can be costly for many applications that generate long and complex sequence data. This paper proposes an active learning technique that selects the most informative subset of unlabeled sequences for annotation by choosing the sequences with the largest uncertainty in their predictions. A unique aspect of active learning for sequence labeling is that it should take into consideration the effort spent on labeling sequences, which depends on sequence length. A new active learning technique is proposed that uses dynamic programming to identify the best subset of sequences to be annotated, taking into account both uncertainty and labeling effort. Experimental results show that our SVMstruct active learning technique can significantly reduce the number of sequences to be labeled while outperforming other existing techniques.
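The effort-aware selection described above can be illustrated with a plain 0/1 knapsack dynamic program: maximize total prediction uncertainty subject to a token-count labeling budget. This is a simplified stand-in for the paper's method, and the scores and lengths are made-up values:

```python
# Sketch: pick sequences maximizing total uncertainty under a labeling budget.
# A generic 0/1 knapsack DP, used here as a simplified stand-in for the
# paper's dynamic program; uncertainty scores and lengths are illustrative.

def select_sequences(uncertainty, length, budget):
    """Return indices of sequences maximizing summed uncertainty,
    subject to total length <= budget (0/1 knapsack)."""
    n = len(uncertainty)
    # best[b] = (score, chosen indices) achievable with budget b
    best = [(0.0, []) for _ in range(budget + 1)]
    for i in range(n):
        for b in range(budget, length[i] - 1, -1):
            cand = best[b - length[i]][0] + uncertainty[i]
            if cand > best[b][0]:
                best[b] = (cand, best[b - length[i]][1] + [i])
    return best[budget][1]

# Three candidate sequences: (uncertainty score, token count)
unc = [0.9, 0.5, 0.4]
lens = [10, 4, 5]
print(select_sequences(unc, lens, budget=9))  # -> [1, 2]
```

With a budget of 9 tokens, the long high-uncertainty sequence does not fit, so the two shorter sequences are chosen instead; this is exactly the uncertainty-versus-effort trade-off the abstract describes.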
Proceedings of the 35th annual meeting on Association for Computational Linguistics -, 1997
We present an algorithm that automatically learns context constraints using statistical decision trees. We then use the acquired constraints in a flexible POS tagger. The tagger is able to use information of any degree: n-grams, automatically learned context constraints, linguistically motivated manually written constraints, etc. The sources and kinds of constraints are unrestricted, and the language model can be easily extended, improving the results. The tagger has been tested and evaluated on the WSJ corpus. * This research has been partially funded by the Spanish Research Department (CICYT) and inscribed as TIC96-1243-C03-02
2000
We have applied the inductive learning of statistical decision trees and relaxation labelling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part-of-Speech Tagging). The learning process is supervised and obtains a language model oriented to resolving POS ambiguities, consisting of a set of statistical decision trees expressing the distribution of tags and words in some relevant contexts. The acquired decision trees have been directly used in a tagger that is both relatively simple and fast, and which has been tested and evaluated on the Wall Street Journal (WSJ) corpus with remarkable accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation-labelling-based tagger. In this direction we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine-learned decision trees. Simultaneously, we address the problem of tagging when only limited training material is available, which is crucial in any process of constructing an annotated corpus from scratch. We show that high levels of accuracy can be achieved with our system in this situation, and report some results obtained when using it to develop a 5.5-million-word Spanish corpus from scratch.
2019
This paper proposes a method to predict word grammatical classes using automatically generated discrete-time Markov chains to model typical sentences. The method's advantage lies in the ready availability of the input resources needed to build an efficient and effective solution for virtually any language, dialect, or domain lingo. One of its main strengths is its simplicity when compared to more sophisticated approaches based on Hidden Markov Models or even more complex formalisms. The proposed method is instantiated on an example, and we show that the efficiency and effectiveness achieved bring advantages over traditional similar solutions.
Current Issues in Linguistic Theory, 2000
We present an algorithm that automatically acquires a statistically-based language model for POS tagging, using statistical decision trees. The learning algorithm deals with more complex contextual information than simple collections of n-grams and is able to use information of different natures. The acquired models are independent enough to be easily incorporated, as a statistical core of constraints/rules, into any flexible tagger. They are also complete enough to be directly used as sets of POS disambiguation rules. We have implemented a simple and fast tagger that has been tested and evaluated on the WSJ corpus with remarkable accuracy. Comparative results are reported.
2004
Active learning (AL) promises to reduce the cost of annotating labeled datasets for trainable human language technologies. Contrary to expectations, when creating labeled training material for HPSG parse selection and later reusing it with other models, gains from AL may be negligible or even negative. This has serious implications for using AL, showing that additional cost-saving strategies may need to be adopted. We explore one such strategy: using a model during annotation to automate some of the decisions. Our best results show an 80% reduction in annotation cost compared with labeling randomly selected data with a single model.
Amazigh is used by tens of millions of people, mainly for oral communication. However, like all newly investigated languages in natural language processing, it is resource-scarce. The main aim of this paper is to present our POS tagging results based on two state-of-the-art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, making use of a small manually annotated corpus of only 20k tokens. Since creating labeled data is a very time-consuming task while obtaining unlabeled data is less so, we have gathered a set of unlabeled Amazigh data that we have preprocessed and tokenized. The paper also addresses the use of semi-supervised techniques to improve POS tagging accuracy. An adapted self-training algorithm, combining a confidence measure with a function of Out-Of-Vocabulary words to select data for self-training, has been used. Using this language-independent method, we have managed to obtain encouraging results.
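A selection step of this kind, combining tagger confidence with an Out-Of-Vocabulary criterion, can be sketched as follows. The threshold values and the exact combination rule are hypothetical assumptions for illustration, not the paper's actual settings:

```python
# Sketch of a data-selection step for self-training: keep automatically
# tagged sentences with high confidence and few Out-Of-Vocabulary tokens.
# The thresholds below are illustrative assumptions.

def select_for_self_training(sentences, confidence, vocab,
                             min_conf=0.9, max_oov_ratio=0.2):
    """Keep auto-tagged sentences whose confidence is at least min_conf
    and whose fraction of OOV tokens is at most max_oov_ratio."""
    selected = []
    for sent, conf in zip(sentences, confidence):
        oov_ratio = sum(w not in vocab for w in sent) / len(sent)
        if conf >= min_conf and oov_ratio <= max_oov_ratio:
            selected.append(sent)
    return selected

vocab = {"the", "dog", "barks", "cat"}
sents = [["the", "dog", "barks"], ["unseen", "words", "here"]]
confs = [0.95, 0.97]
print(select_for_self_training(sents, confs, vocab))
# -> [['the', 'dog', 'barks']]
```

The second sentence is rejected despite high confidence because nearly all of its tokens are OOV, where confidence estimates tend to be least reliable.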
1996
Corpus-based methods for natural language processing often use supervised training, requiring expensive manual annotation of training corpora. This paper investigates methods for reducing annotation cost by sample selection. In this approach, during training the learning program examines many unlabeled examples and selects for labeling (annotation) only those that are most informative at each stage. This avoids redundantly annotating examples that contribute little new information. This paper extends our previous work on committee-based sample selection for probabilistic classifiers. We describe a family of methods for committee-based sample selection, and report experimental results for the task of stochastic part-of-speech tagging. We find that all variants achieve a significant reduction in annotation cost, though their computational efficiency differs. In particular, the simplest method, which has no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
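One common criterion in the committee-based family is vote entropy: examples on which the committee's members disagree most are sent for annotation. The sketch below uses vote entropy with toy committee votes; it illustrates the general idea, not the specific variants evaluated in the paper:

```python
# Sketch of committee-based sample selection via vote entropy:
# a committee of taggers votes on each example, and examples with
# high disagreement (entropy) are selected for human annotation.
# The committee votes below are toy placeholders.
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's vote distribution for one example."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_informative(examples, committee_votes, k=1):
    """Return the k examples the committee disagrees on most."""
    ranked = sorted(examples, key=lambda ex: -vote_entropy(committee_votes[ex]))
    return ranked[:k]

votes = {
    "bank": ["NN", "NN", "VB"],   # disagreement -> informative
    "the": ["DT", "DT", "DT"],    # unanimous -> redundant
}
print(select_informative(["bank", "the"], votes, k=1))  # -> ['bank']
```

Unanimous examples score zero entropy and are skipped, which is how this family of methods avoids redundantly annotating examples that contribute little new information.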
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies Short Papers - HLT '08, 2008
Traditional Active Learning (AL) techniques assume that the annotation of each datum costs the same. This is not the case when annotating sequences; some sequences will take longer than others. We show that the AL technique which performs best depends on how cost is measured. Applying an hourly cost model based on the results of an annotation user study, we approximate the amount of time necessary to annotate a given sentence. This model allows us to evaluate the effectiveness of AL sampling methods in terms of time spent in annotation. We achieve a 77% reduction in hours over a random baseline while reaching 96.5% tag accuracy on the Penn Treebank. More significantly, we make the case for measuring cost when assessing AL methods.
