A Systemic Functional Approach to Automated Authorship Analysis

2013

https://doi.org/10.1002/ASI.20553

Abstract

Most text analysis and retrieval work to date has focused on the topic of a text; that is, what it is about. However, a text also contains much useful information in its style, or how it is written. This includes information about its author, its purpose, the feelings it is meant to evoke, and more. This article develops a new type of lexical feature for use in stylistic text classification, based on taxonomies of various semantic functions of certain choice words or phrases. We demonstrate the usefulness of such features for the stylistic text classification tasks of determining author identity and nationality, the gender of literary characters, a text's sentiment (positive/negative evaluation), and the rhetorical character of scientific journal articles. We further show how the use of functional features aids in gaining insight into stylistic differences among different kinds of texts.
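As an illustration of what a taxonomy-based lexical feature might look like, the following minimal Python sketch counts how often words from a small functional taxonomy occur in a text and reports each node's frequency relative to its parent node. The taxonomy contents, the tokenizer, the relative-frequency definition, and the names `TAXONOMY` and `functional_features` are all assumptions made for this example; they are not taken from the article and should not be read as its actual feature set.

```python
import re
from collections import Counter

# Hypothetical toy taxonomy (an assumption for illustration only):
# each word maps to its path of functional categories, root first.
TAXONOMY = {
    "and": ("CONJUNCTION", "EXTENSION"),
    "but": ("CONJUNCTION", "EXTENSION"),
    "because": ("CONJUNCTION", "ENHANCEMENT"),
    "then": ("CONJUNCTION", "ENHANCEMENT"),
    "might": ("MODALITY", "PROBABILITY"),
    "probably": ("MODALITY", "PROBABILITY"),
    "must": ("MODALITY", "OBLIGATION"),
    "should": ("MODALITY", "OBLIGATION"),
}


def functional_features(text):
    """Return {node path: frequency of the node relative to its parent}."""
    # Naive lowercase word tokenizer; a real system would need something richer.
    tokens = re.findall(r"[a-z']+", text.lower())

    counts = Counter()
    for tok in tokens:
        path = TAXONOMY.get(tok)
        if path is None:
            continue  # word has no entry in the taxonomy
        # Credit every category on the path, then the word itself as a leaf.
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
        counts[path + (tok,)] += 1

    features = {}
    for node, n in counts.items():
        if len(node) < 2:
            continue  # root categories have no parent to normalize by
        features["/".join(node)] = n / counts[node[:-1]]
    return features


if __name__ == "__main__":
    sample = "It might rain, but we should go because we must."
    for name, value in sorted(functional_features(sample).items()):
        print(f"{name}: {value:.2f}")
```

Normalizing each node by its parent, rather than by total text length, makes a feature express a choice among options (for example, obligation versus probability within modality). The resulting dictionary could then be fed to any standard text classifier as a feature vector.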
