Active learning and the total cost of annotation
2004
Abstract
Active learning (AL) promises to reduce the cost of annotating labeled datasets for trainable human language technologies. Contrary to expectations, when creating labeled training material for HPSG parse selection and later reusing it with other models, gains from AL may be negligible or even negative. This has serious implications for using AL, showing that additional cost-saving strategies may need to be adopted. We explore one such strategy: using a model during annotation to automate some of the decisions. Our best results show an 80% reduction in annotation cost compared with labeling randomly selected data with a single model.