Theme Topic Mixture Model for Document Representation
Abstract
In automatic text processing tasks, documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We review a family of document density estimation models for representing documents. Within this family we derive a further model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data: topics link words to each other, and themes gather documents that share a particular distribution over the topics. An experiment compares the performance of the different models in this family on a common task.
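To make the two levels of structure concrete, the following is a minimal sketch of one generative reading of the description above. It is not the model as specified by the authors: the abstract gives no equations, so the sampling steps (one theme per document, one topic per word), the function names `sample_document` and `bag_of_words`, and all parameter names and values are assumptions chosen for illustration only.

```python
import random


def sample_document(theme_prior, theme_topic, topic_word, length, rng):
    """Sample one document as a list of word indices.

    Assumed process (hypothetical, not taken from the paper):
      theme_prior[t]     probability of theme t
      theme_topic[t][k]  probability of topic k under theme t
      topic_word[k][w]   probability of word w under topic k
    """
    # One theme per document: themes group documents with a shared topic profile.
    theme = rng.choices(range(len(theme_prior)), weights=theme_prior)[0]
    words = []
    for _ in range(length):
        # One topic per word, drawn from the theme's distribution over topics.
        topic_weights = theme_topic[theme]
        topic = rng.choices(range(len(topic_weights)), weights=topic_weights)[0]
        # The word itself is drawn from the topic's distribution over the vocabulary.
        word_weights = topic_word[topic]
        word = rng.choices(range(len(word_weights)), weights=word_weights)[0]
        words.append(word)
    return theme, words


def bag_of_words(words, vocab_size):
    """Collapse a word sequence into a vector of per-word counts."""
    counts = [0] * vocab_size
    for w in words:
        counts[w] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative values: 2 themes, 2 topics, 4 vocabulary words.
    theme_prior = [0.5, 0.5]
    theme_topic = [[0.9, 0.1], [0.2, 0.8]]
    topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
    theme, words = sample_document(theme_prior, theme_topic, topic_word, 10, rng)
    print(theme, bag_of_words(words, vocab_size=4))
```

Under this reading, the bag-of-words vector is what is observed, while the theme and the per-word topics are latent variables that a density estimation model of this kind would have to infer.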