Accelerating Corpus Annotation through Active Learning
Abstract
The paper discusses methods for accelerating corpus annotation through active learning techniques, emphasizing the use of sequence labeling for predicting tags and the Viterbi algorithm to ensure the best overall sequence of tags. It highlights the importance of incorporating various features for part-of-speech tagging, including lexical and contextual information, to enhance the efficiency of data annotation while minimizing the effort required from human annotators.
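The Viterbi decoding mentioned in the abstract can be sketched in a few lines of Python. The toy tag set, transition, and emission probabilities below are illustrative assumptions, not values from the paper:

```python
# Minimal Viterbi decoding sketch for a toy HMM-style POS tagger.
# All probabilities here are made up for illustration.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    # V[i][t]: best probability of any tag path ending in tag t at position i
    V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][p] * trans_p[p][t] * emit_p[t].get(words[i], 1e-6), p)
                for p in tags
            )
            V[i][t] = prob
            back[i][t] = prev
    # Trace back from the best final tag
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "cat": 0.5},
    "VB": {"barks": 0.8},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# -> ['DT', 'NN', 'VB']
```

The dynamic program guarantees the globally best tag sequence rather than the best tag per word, which is why taggers pair per-position scores with Viterbi decoding.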
Related papers
Language Resources and Evaluation, 2012
This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system achieves 97.75% accuracy on the French Treebank, an error reduction of 25% (38% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
Lecture Notes in Computer Science, 2008
Sequence labeling problems are commonly encountered in many natural language and query processing tasks. SVMstruct is a supervised learning algorithm that provides a flexible and effective way to solve such problems. However, a large number of training examples is often required to train SVMstruct, which can be costly for many applications that generate long and complex sequence data. This paper proposes an active learning technique that selects the most informative subset of unlabeled sequences for annotation by choosing the sequences with the largest uncertainty in their predictions. A unique aspect of active learning for sequence labeling is that it should take into consideration the effort spent on labeling sequences, which depends on sequence length. A new active learning technique is proposed that uses dynamic programming to identify the best subset of sequences to be annotated, taking into account both uncertainty and labeling effort. Experimental results show that our SVMstruct active learning technique can significantly reduce the number of sequences to be labeled while outperforming other existing techniques.
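The effort-aware selection described above can be illustrated with a plain 0/1 knapsack dynamic program: maximize total prediction uncertainty subject to a token-count labeling budget. This is a simplified stand-in for the paper's method, and the scores and lengths are made-up values:

```python
# Sketch: pick sequences maximizing total uncertainty under a labeling budget.
# A generic 0/1 knapsack DP, used here as a simplified stand-in for the
# paper's dynamic program; uncertainty scores and lengths are illustrative.

def select_sequences(uncertainty, length, budget):
    """Return indices of sequences maximizing summed uncertainty,
    subject to total length <= budget (0/1 knapsack)."""
    n = len(uncertainty)
    # best[b] = (score, chosen indices) achievable with budget b
    best = [(0.0, []) for _ in range(budget + 1)]
    for i in range(n):
        for b in range(budget, length[i] - 1, -1):
            cand = best[b - length[i]][0] + uncertainty[i]
            if cand > best[b][0]:
                best[b] = (cand, best[b - length[i]][1] + [i])
    return best[budget][1]

# Three candidate sequences: (uncertainty score, token count)
unc = [0.9, 0.5, 0.4]
lens = [10, 4, 5]
print(select_sequences(unc, lens, budget=9))  # -> [1, 2]
```

With a budget of 9 tokens, the long high-uncertainty sequence does not fit, so the two shorter sequences are chosen instead; this is exactly the uncertainty-versus-effort trade-off the abstract describes.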
Proceedings of the 35th annual meeting on Association for Computational Linguistics -, 1997
We present an algorithm that automatically learns context constraints using statistical decision trees. We then use the acquired constraints in a flexible POS tagger. The tagger is able to use information of any degree: n-grams, automatically learned context constraints, linguistically motivated manually written constraints, etc. The sources and kinds of constraints are unrestricted, and the language model can be easily extended, improving the results. The tagger has been tested and evaluated on the WSJ corpus. * This research has been partially funded by the Spanish Research Department (CICYT) and inscribed as TIC96-1243-C03-02
2000
We have applied the inductive learning of statistical decision trees and relaxation labelling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part-of-Speech Tagging). The learning process is supervised and obtains a language model oriented to resolving POS ambiguities, consisting of a set of statistical decision trees expressing the distribution of tags and words in some relevant contexts. The acquired decision trees have been directly used in a tagger that is both relatively simple and fast, and which has been tested and evaluated on the Wall Street Journal (WSJ) corpus with remarkable accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation-labelling-based tagger. In this direction we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine-learned decision trees. Simultaneously, we address the problem of tagging when only limited training material is available, which is crucial in any process of constructing an annotated corpus from scratch. We show that high levels of accuracy can be achieved with our system in this situation, and report some results obtained when using it to develop a 5.5-million-word Spanish corpus from scratch.
2019
This paper proposes a method to predict word grammatical classes using automatically generated discrete-time Markov chains to model typical sentences. The method's advantage lies in the ready availability of the input resources needed to build an efficient and effective solution for virtually any language, dialect, or domain lingo. One of its main strengths is its simplicity when compared to more sophisticated approaches based on Hidden Markov Models or even more complex formalisms. The proposed method is instantiated on an example, and we show that the efficiency and effectiveness achieved bring advantages over traditional similar solutions.
Current Issues in Linguistic Theory, 2000
We present an algorithm that automatically acquires a statistically-based language model for POS tagging, using statistical decision trees. The learning algorithm deals with more complex contextual information than simple collections of n-grams and is able to use information of different natures. The acquired models are independent enough to be easily incorporated, as a statistical core of constraints/rules, into any flexible tagger. They are also complete enough to be directly used as sets of POS disambiguation rules. We have implemented a simple and fast tagger that has been tested and evaluated on the WSJ corpus with remarkable accuracy. Comparative results are reported.
2004
Active learning (AL) promises to reduce the cost of annotating labeled datasets for trainable human language technologies. Contrary to expectations, when creating labeled training material for HPSG parse selection and later reusing it with other models, gains from AL may be negligible or even negative. This has serious implications for using AL, showing that additional cost-saving strategies may need to be adopted. We explore one such strategy: using a model during annotation to automate some of the decisions. Our best results show an 80% reduction in annotation cost compared with labeling randomly selected data with a single model.
Amazigh is used by tens of millions of people, mainly for oral communication. However, like all newly investigated languages in natural language processing, it is resource-scarce. The main aim of this paper is to present our POS tagging results based on two state-of-the-art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, making use of a small manually annotated corpus of only 20k tokens. Since creating labeled data is a very time-consuming task while obtaining unlabeled data is less so, we have gathered a set of unlabeled Amazigh data that we have preprocessed and tokenized. The paper also addresses the use of semi-supervised techniques to improve POS tagging accuracy. An adapted self-training algorithm, combining a confidence measure with a function of Out-Of-Vocabulary words to select data for self-training, has been used. Using this language-independent method, we have managed to obtain encouraging results.
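A selection step of this kind, combining tagger confidence with an Out-Of-Vocabulary criterion, can be sketched as follows. The threshold values and the exact combination rule are hypothetical assumptions for illustration, not the paper's actual settings:

```python
# Sketch of a data-selection step for self-training: keep automatically
# tagged sentences with high confidence and few Out-Of-Vocabulary tokens.
# The thresholds below are illustrative assumptions.

def select_for_self_training(sentences, confidence, vocab,
                             min_conf=0.9, max_oov_ratio=0.2):
    """Keep auto-tagged sentences whose confidence is at least min_conf
    and whose fraction of OOV tokens is at most max_oov_ratio."""
    selected = []
    for sent, conf in zip(sentences, confidence):
        oov_ratio = sum(w not in vocab for w in sent) / len(sent)
        if conf >= min_conf and oov_ratio <= max_oov_ratio:
            selected.append(sent)
    return selected

vocab = {"the", "dog", "barks", "cat"}
sents = [["the", "dog", "barks"], ["unseen", "words", "here"]]
confs = [0.95, 0.97]
print(select_for_self_training(sents, confs, vocab))
# -> [['the', 'dog', 'barks']]
```

The second sentence is rejected despite high confidence because nearly all of its tokens are OOV, where confidence estimates tend to be least reliable.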
1996
Corpus-based methods for natural language processing often use supervised training, requiring expensive manual annotation of training corpora. This paper investigates methods for reducing annotation cost by sample selection. In this approach, during training the learning program examines many unlabeled examples and selects for labeling (annotation) only those that are most informative at each stage. This avoids redundantly annotating examples that contribute little new information. This paper extends our previous work on committee-based sample selection for probabilistic classifiers. We describe a family of methods for committee-based sample selection, and report experimental results for the task of stochastic part-of-speech tagging. We find that all variants achieve a significant reduction in annotation cost, though their computational efficiency differs. In particular, the simplest method, which has no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
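One common criterion in the committee-based family is vote entropy: examples on which the committee's members disagree most are sent for annotation. The sketch below uses vote entropy with toy committee votes; it illustrates the general idea, not the specific variants evaluated in the paper:

```python
# Sketch of committee-based sample selection via vote entropy:
# a committee of taggers votes on each example, and examples with
# high disagreement (entropy) are selected for human annotation.
# The committee votes below are toy placeholders.
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's vote distribution for one example."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_informative(examples, committee_votes, k=1):
    """Return the k examples the committee disagrees on most."""
    ranked = sorted(examples, key=lambda ex: -vote_entropy(committee_votes[ex]))
    return ranked[:k]

votes = {
    "bank": ["NN", "NN", "VB"],   # disagreement -> informative
    "the": ["DT", "DT", "DT"],    # unanimous -> redundant
}
print(select_informative(["bank", "the"], votes, k=1))  # -> ['bank']
```

Unanimous examples score zero entropy and are skipped, which is how this family of methods avoids redundantly annotating examples that contribute little new information.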
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies Short Papers - HLT '08, 2008
Traditional Active Learning (AL) techniques assume that the annotation of each datum costs the same. This is not the case when annotating sequences; some sequences will take longer than others. We show that the AL technique which performs best depends on how cost is measured. Applying an hourly cost model based on the results of an annotation user study, we approximate the amount of time necessary to annotate a given sentence. This model allows us to evaluate the effectiveness of AL sampling methods in terms of time spent in annotation. We achieve a 77% reduction in hours over a random baseline while reaching 96.5% tag accuracy on the Penn Treebank. More significantly, we make the case for measuring cost when assessing AL methods.
