Authorship Attribution with Topic Models
2013, Computational Linguistics
https://doi.org/10.1162/COLI_A_00173
Abstract
Authorship attribution deals with identifying the authors of anonymous texts. Traditionally, research in this field has focused on formal texts, such as essays and novels, but recently more attention has been given to texts generated by on-line users, such as e-mails and blogs. Authorship attribution of such on-line texts is a more challenging task than traditional authorship attribution, because such texts tend to be short, and the number of candidate authors is often larger than in traditional settings. We address this challenge by using topic models to obtain author representations. In addition to exploring novel ways of applying two popular topic models to this task, we test our new model that projects authors and documents to two disjoint topic spaces. Utilizing our model in authorship attribution yields state-of-the-art performance on several data sets, containing either formal texts written by a few authors or informal texts generated by tens to thousands of on-line users. We also present experimental results that demonstrate the applicability of topical author representations to two other problems: inferring the sentiment polarity of texts, and predicting the ratings that users would give to items such as movies.
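To make the idea of topical author representations concrete, the following minimal sketch attributes a document to the candidate author whose topic distribution is closest to the document's. It is an illustration under stated assumptions, not the model described in the abstract: the topic distributions are hypothetical toy values (in practice they would be inferred by a topic model), the names `hellinger` and `attribute` are made up for this example, and the Hellinger distance is just one reasonable way to compare probability distributions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


def attribute(doc_topics, author_topics):
    """Return the candidate author whose topic distribution is closest to the document's."""
    return min(author_topics, key=lambda name: hellinger(doc_topics, author_topics[name]))


# Hypothetical toy distributions over three topics (assumed, not taken from the paper).
author_topics = {
    "author_a": [0.70, 0.20, 0.10],
    "author_b": [0.10, 0.30, 0.60],
}
doc_topics = [0.60, 0.30, 0.10]

predicted = attribute(doc_topics, author_topics)
print(predicted)
```

With these toy values the document is attributed to `author_a`, whose distribution puts most of its mass on the same topic as the document. A nearest-distribution rule like this scales naturally to many candidate authors, since each author is summarized by a single low-dimensional vector.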