

Probabilistic Explicit Topic Modeling

2013

Abstract

Joshua A. Hansen, Department of Computer Science, BYU, Master of Science

Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information from knowledge sources such as Wikipedia, and it is difficult for humans to interpret due to a lack of meaningful topic labels. This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives the inference method from f...
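To make the count-substitution idea concrete, here is a minimal Python sketch of a collapsed Gibbs sampler (in the style of Griffiths and Steyvers) in which the topic-word distributions are held fixed rather than re-estimated from the corpus, which is what substituting precomputed counts for the LDA topic-word counts amounts to. The variable names, the toy corpus, and the static_topic_word counts are illustrative assumptions, not the thesis implementation:

    import numpy as np

    # Sketch of LDA-STWD-style inference: a collapsed Gibbs sampler in which
    # the topic-word counts are fixed in advance (e.g., precomputed from
    # Wikipedia articles) instead of being re-estimated from the corpus, so
    # only the document-topic counts change during sampling.
    # All names and the toy data below are assumptions for illustration.

    rng = np.random.default_rng(0)

    V = 5        # vocabulary size
    T = 2        # number of explicit topics
    alpha = 0.1  # symmetric document-topic Dirichlet prior

    # Precomputed, static topic-word counts (rows: topics; columns: word types),
    # lightly smoothed so no word has zero probability under a topic.
    static_topic_word = np.array([[8.0, 6.0, 1.0, 0.5, 0.5],
                                  [0.5, 1.0, 2.0, 7.0, 9.0]])
    # Fixed topic-word distributions derived once from the static counts.
    phi = static_topic_word / static_topic_word.sum(axis=1, keepdims=True)

    # Toy corpus: each document is a list of word-type indices.
    docs = [[0, 1, 1, 2], [3, 4, 4, 2], [0, 4, 1, 3]]

    # Random initial topic assignment per token, plus document-topic counts.
    z = [[int(rng.integers(T)) for _ in doc] for doc in docs]
    doc_topic = np.zeros((len(docs), T))
    for d, doc in enumerate(docs):
        for i in range(len(doc)):
            doc_topic[d, z[d][i]] += 1

    for _ in range(200):  # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                doc_topic[d, z[d][i]] -= 1  # remove current assignment
                # Full conditional: the fixed phi term replaces the usual
                # corpus-estimated topic-word count ratio.
                p = phi[:, w] * (doc_topic[d] + alpha)
                z[d][i] = int(rng.choice(T, p=p / p.sum()))
                doc_topic[d, z[d][i]] += 1  # record the new assignment

    # Per-document topic proportions under the fixed explicit topics.
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    print(theta)

Because phi never changes, each topic keeps a fixed identity across runs and corpora, which is the identifiability property the abstract describes latent topic models as lacking.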

List of Figures

  PLSA
  Graphical model for Latent Dirichlet Allocation
  A flowchart representing the Lau et al. algorithm. The complexity of the algorithm makes it difficult to implement and to apply.
  Graphical model for Explicit Dirichlet Allocation
  Log-likelihood convergence plot for LDA-STWD on SOTU Chunks across 50 iterations.
  Log-likelihood convergence plot for LDA-STWD on Reuters 21578 across 50 iterations.
  Topic count calibration results
  Topic label quality user study prompt
  Document label quality user study prompt

List of Tables

  A typology of topical representations of documents
  Outcome of topic label quality experiments
  Outcome of document label quality experiments with LDA-STWD
  Outcome of document label quality experiments with EDA
  Topic label quality evaluation results

References (31)

  1. D. M. Blei, "Introduction to probabilistic topic models," Communications of the ACM, 2011. [Online]. Available: http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf.
  2. D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
  3. J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei, "Reading tea leaves: How humans interpret topic models," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, Neural Information Processing Systems Foundation, 2009.
  4. P. Cimiano, A. Schultz, S. Sizov, P. Sorg, and S. Staab, "Explicit versus latent concept models for cross-language information retrieval," in Proceedings of the 21st International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, 2009, pp. 1513-1518.
  5. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
  6. E. Gabrilovich and S. Markovitch, "Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, vol. 6, 2007, p. 12.
  7. M. J. Gardner, J. Lutes, J. Lund, J. Hansen, D. Walker, E. Ringger, and K. Seppi, "The Topic Browser: An interactive tool for browsing topic models," in NIPS Workshop on Challenges of Data Visualization, Neural Information Processing Systems Foundation, 2010.
  8. T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, pp. 5228-5235, Jan. 2004. doi: 10.1073/pnas.0307752101. [Online]. Available: http://www.pnas.org/content/101/suppl.1/5228.abstract.
  9. T. Hofmann, "Unsupervised learning by Probabilistic Latent Semantic Analysis," Machine Learning, vol. 42, no. 1, pp. 177-196, 2001.
  10. T. Hofmann and J. Puzicha, "Unsupervised learning from dyadic data," International Computer Science Institute, Tech. Rep., Dec. 1998. [Online]. Available: ftp://ftp.icsi.berkeley.edu/pub/techreports/1998/tr-98-042.pdf.
  11. Y. Hu, J. Boyd-Graber, and B. Satinoff, "Interactive topic modeling," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, vol. 1, 2011, pp. 248-257.
  12. J. Kruschke, "Bayesian estimation supersedes the t test," Journal of Experimental Psychology: General, 2012. doi: 10.1037/a0029146.
  13. M. de Kunder, The size of the world wide web (the internet), Website, 2012. [Online]. Available: http://www.worldwidewebsize.com/ (visited on 08/28/2012).
  14. J. H. Lau, K. Grieser, D. Newman, and T. Baldwin, "Automatic labelling of topic models," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, vol. 1, Portland, Oregon, 2011, pp. 1536-1545.
  15. J. H. Lau, D. Newman, S. Karimi, and T. Baldwin, "Best topic word selection for topic labelling," in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, ser. COLING '10, Beijing, China: Association for Computational Linguistics, 2010, pp. 605-613. [Online]. Available: http://dl.acm.org/citation.cfm?id=1944566.1944635.
  16. D. D. Lewis, "An evaluation of phrasal and clustered representations on a text categorization task," in Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '92, Copenhagen, Denmark: Association for Computing Machinery, 1992, pp. 37-50, isbn: 0-89791-523-2. doi: 10.1145/133160.133172. [Online]. Available: http://doi.acm.org/10.1145/133160.133172.
  17. D. Magatti, S. Calegari, D. Ciucci, and F. Stella, "Automatic labeling of topics," in Proceedings of the Ninth International Conference on Intelligent Systems Design and Applications (ISDA '09), IEEE, 2009, pp. 1227-1232.
  18. Q. Mei, X. Shen, and C. X. Zhai, "Automatic labeling of multinomial topic models," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, 2007, pp. 490-499.
  19. T. Minka and J. Lafferty, "Expectation-propagation for the generative aspect model," in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Association for Uncertainty in Artificial Intelligence, 2002, pp. 352-359.
  20. D. Ramage, D. Hall, R. Nallapati, and C. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, vol. 1, 2009, pp. 248-256.
  21. J. E. Short, R. E. Bohn, and C. Baru, "How much information?," Global Information Industry Center, Tech. Rep., Dec. 2010. [Online]. Available: http://hmi.ucsd.edu/pdf/HMI_2010_EnterpriseReport_Jan_2011.pdf.
  22. R. Snow, B. O'Connor, D. Jurafsky, and A. Ng, "Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2008, pp. 254-263.
  23. V. I. Spitkovsky and A. X. Chang, "A cross-lingual dictionary for English Wikipedia concepts," in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey: European Language Resources Association (ELRA), May 23-25, 2012, isbn: 978-2-9517408-7-7.
  24. V. Stodden, R. LeVeque, and I. Mitchell, "Reproducible research for scientific computing: Tools and strategies for changing the culture," Computing in Science & Engineering, vol. 14, no. 4, pp. 13-17, Jul. 2012, issn: 1521-9615. doi: 10.1109/MCSE.2012.38.
  25. L. Taycher, Books of the world, stand up and be counted! All 129,864,880 of you. Website. [Online]. Available: http://booksearch.blogspot.com/2010/08/books-of-world-stand-up-and-be-counted.html (visited on 08/05/2010).
  26. @twitter, Twitter turns six, Blog post. [Online]. Available: http://blog.twitter.com/2012/03/twitter-turns-six.html (visited on 03/21/2012).
  27. Various, List of Wikipedias, Wiki page. [Online]. Available: http://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias&oldid=4072100 (visited on 08/28/2012).
  28. D. Walker, W. Lund, and E. Ringger, "Evaluating models of latent document semantics in the presence of OCR errors," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2010, pp. 240-250.
  29. D. Walker, E. Ringger, and K. Seppi, "Evaluating supervised topic models in the presence of OCR errors," in IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2013, pp. 865812-865812.
  30. YouTube, Press statistics, Website. [Online]. Available: http://www.youtube.com/t/press_statistics (visited on 08/28/2012).
  31. K. Zhai, J. Boyd-Graber, N. Asadi, and M. Alkhouja, "Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce," in Proceedings of the 21st International Conference on World Wide Web, Association for Computing Machinery, Lyon, France, 2012.