Authorship Verification based on Linguistic Features

2017

Abstract

This thesis addresses the problem of authorship verification, a subdomain of authorship analysis with its origins in stylometry. Most research in authorship analysis has focused on authorship identification, while authorship verification remains comparatively unexplored. As the number of digital documents and authors grows, authorship identification solutions become very difficult to apply, and authorship verification solutions are needed instead. This research uses digital documents of 1000 words, written in English, to address the authorship verification problem: reaching a conclusion about the authorship of a disputed text by analyzing texts written by a candidate author. Three machine learning models were designed using two feature sets, both made up of linguistic features that are suggested to characterize a person's writing style: one consisting of stylometric features and the other of word frequency based features. One-class and two-class support vector machines are used as the machine learning models. The results suggest that a one-class support vector machine with the selected stylometric features does not handle the problem well, while a two-class classification model with stylometric features, trained on a known author class and an unknown author class, shows potential if the unknown author class can be properly represented. A one-class support vector machine with word frequency based features shows promising results in solving the authorship verification problem.

Acknowledgements

Through this research I have developed an immense interest in stylometry and natural language processing and gained my first experience of the research world. I would like to thank my supervisor, Dr. A. R. Weerasinghe, for introducing me to the project, advising me whenever needed and trusting my capabilities. I would also like to thank Dr. H. Ekanayake for effectively coordinating the final year project in computer science and for helping students through difficult situations. I am immensely grateful to my batchmates, who helped me in many ways and gave suggestions to improve the project. I am also thankful to my family, who gave me great support and courage to carry out the research and provided a suitable academic environment. Lastly, my appreciation goes to everyone who helped during this attempt.

Definitions

Document - A digital file containing text written entirely by one person.
Known document - A document whose author is known in advance.
Unknown document - A document whose author is not known.
Known author - The author who is suspected of having written the unknown document.
Unknown author - The author of the unknown document, if it was written by someone other than the known author.
Target class - The class that contains all documents of known authorship from the suspected author.
Outlier class - The class that contains all documents from authors other than the suspected author.
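
The target class and outlier class terminology can be illustrated with a minimal sketch of one-class verification on word frequency features. It is not the thesis's implementation: the thesis uses a one-class support vector machine, whereas this sketch swaps in a simpler centroid-and-radius rule so that it runs with the Python standard library alone. The word list `FUNCTION_WORDS`, the `slack` factor and all function names are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical vocabulary; the thesis's actual word frequency features are not listed here.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]


def word_frequency_features(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fit_target_class(known_documents):
    """Learn a centroid and radius from target-class documents only.

    No outlier-class documents are needed, which is the defining
    property of a one-class approach.
    """
    vectors = [word_frequency_features(doc) for doc in known_documents]
    center = [sum(column) / len(vectors) for column in zip(*vectors)]
    radius = max(distance(vector, center) for vector in vectors)
    return center, radius


def is_same_author(unknown_document, model, slack=1.5):
    """Accept the unknown document if it lies within slack * radius of the centroid."""
    center, radius = model
    features = word_frequency_features(unknown_document)
    return distance(features, center) <= slack * radius
```

In use, `fit_target_class` would be given the known documents of the suspected author, and `is_same_author` would then be called on the unknown document. A one-class support vector machine plays the same role as the centroid-and-radius rule here, but learns a more flexible boundary around the target class.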
