The Author-Topic Model for Authors and Documents

2004, Proceedings of the 20th …

Abstract

We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics, and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with its authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets, so we use Gibbs sampling to estimate the topic and author distributions. We compare its performance with that of two other generative models for documents, both of which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model and demonstrate applications to computing similarity between authors and the entropy of author output.
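The generative process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not perform Gibbs sampling; it only draws synthetic tokens from the model. Several details are assumptions that the abstract does not state: symmetric Dirichlet priors on the author-topic and topic-word distributions, an author chosen uniformly at random from the document's co-authors for each word, and all sizes and hyperparameter values. The names `sample_dirichlet`, `generate_document`, and `document_topic_mixture` are hypothetical.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(authors, theta, phi, n_words, rng):
    """Return a list of (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        # Assumption: each word's author is uniform over the co-authors.
        x = rng.choice(authors)
        # Topic drawn from the chosen author's distribution over topics.
        z = rng.choices(range(len(theta[x])), weights=theta[x])[0]
        # Word drawn from the chosen topic's distribution over words.
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]
        tokens.append((x, z, w))
    return tokens


def document_topic_mixture(authors, theta):
    """Document topic distribution as an equal-weight mixture over its authors."""
    n_topics = len(theta[authors[0]])
    return [
        sum(theta[a][t] for a in authors) / len(authors)
        for t in range(n_topics)
    ]


# Illustrative sizes and hyperparameters (assumptions, not from the paper).
N_AUTHORS, N_TOPICS, VOCAB_SIZE = 3, 4, 10
rng = random.Random(0)

# theta[a]: author a's distribution over topics.
theta = [sample_dirichlet(0.5, N_TOPICS, rng) for _ in range(N_AUTHORS)]
# phi[t]: topic t's distribution over word ids.
phi = [sample_dirichlet(0.5, VOCAB_SIZE, rng) for _ in range(N_TOPICS)]

doc_authors = [0, 2]  # a document co-written by authors 0 and 2
tokens = generate_document(doc_authors, theta, phi, 20, rng)
mixture = document_topic_mixture(doc_authors, theta)
```

Under the uniform-author assumption, `mixture` is the distribution from which each word's topic is effectively drawn once the author is marginalized out. With a single author per document, it reduces to that author's own topic distribution.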

References

  1. D. M. Blei, A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022.
  2. D. Cohn and T. Hofmann (2001). The missing link: A probabilistic model of document content and hypertext connectivity. Neural Information Processing Systems 13.
  3. W. Gilks, S. Richardson, and D. Spiegelhalter (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.
  4. T. L. Griffiths and M. Steyvers (2004). Finding scientific topics. Proceedings of the National Academy of Sciences.
  5. T. Hofmann (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99).
  6. D. Holmes and R. Forsyth (1995). The Federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing 10(2): 111-127.
  7. S. Lawrence, C. L. Giles, and K. Bollacker (1999). Digital Libraries and Autonomous Citation Indexing. IEEE Computer 32(6): 67-71.
  8. T. Minka and J. Lafferty (2002). Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence, Proceedings of the Eighteenth Conference.