Topic Labeled Text Classification
2014, Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval
https://doi.org/10.1145/2600428.2609565

Abstract
Supervised text classifiers require extensive human expertise and labeling effort. In this paper, we propose a weakly supervised text classification algorithm based on labeling Latent Dirichlet Allocation (LDA) topics, exploiting the generative property of LDA. In our algorithm, an annotator assigns one or more class labels to each topic, based on its most probable words; a document is then classified from its posterior topic proportions together with the class labels of its topics. We further enhance the approach by incorporating domain knowledge in the form of labeled words. We evaluate the approach on four real-world text classification datasets. The results show that it is more accurate than semi-supervised techniques from previous work. A central contribution of this work is an approach that delivers effectiveness comparable to state-of-the-art supervised techniques in hard-to-classify domains, with very low overheads in terms of manual knowledge engineering.
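The classification rule the abstract describes (combine a document's posterior topic proportions with the annotator's topic-to-class labels) can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the function name, the toy numbers, and the choice to split a multi-labeled topic's proportion uniformly across its labels are assumptions.

```python
import numpy as np

def classify_by_topic_labels(theta, topic_labels, n_classes):
    """Score each class by summing a document's posterior topic
    proportions over the topics annotated with that class.
    A topic carrying k labels contributes 1/k of its proportion
    to each (an assumed tie-splitting convention)."""
    scores = np.zeros((theta.shape[0], n_classes))
    for t, labels in enumerate(topic_labels):
        for c in labels:
            scores[:, c] += theta[:, t] / len(labels)
    return scores.argmax(axis=1)

# theta: posterior topic proportions for 3 documents over 4 LDA topics,
# as would be inferred by a fitted LDA model (made-up values here)
theta = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.2, 0.2, 0.3, 0.3],
])
# the annotator labels each topic from its most probable words;
# topic 3 is ambiguous and receives both class labels
topic_labels = [[0], [0], [1], [0, 1]]
print(classify_by_topic_labels(theta, topic_labels, n_classes=2))  # [0 1 0]
```

The second document is assigned class 1 because most of its probability mass sits on topic 2, which the annotator labeled with class 1; the topic model itself is trained without any document labels, which is what makes the scheme weakly supervised.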