A Systemic Functional Approach to Automated Authorship Analysis

Moshe Koppel

doi:10.1002/ASI.20553

2013

21 pages

Abstract

Most text analysis and retrieval work to date has focused on the topic of a text; that is, what it is about. However, a text also contains much useful information in its style, or how it is written. This includes information about its author, its purpose, feelings it is meant to evoke, and more. This article develops a new type of lexical feature for use in stylistic text classification, based on taxonomies of various semantic functions of certain choice words or phrases. We demonstrate the usefulness of such features for the stylistic text classification tasks of determining author identity and nationality, the gender of literary characters, a text's sentiment (positive/negative evaluation), and the rhetorical character of scientific journal articles. We further show how the use of functional features aids in gaining insight about stylistic differences among different kinds of texts.
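
As a rough illustration of what a taxonomy-based lexical feature could look like, the sketch below maps a few conjunctions to options in a CONJUNCTION-like taxonomy and computes the relative frequency of each option in a text. The lexicon entries, the `LEXICON` and `functional_features` names, and the choice of relative frequency as the feature value are all assumptions made for this example; they are not taken from the article's lexicon or feature definitions.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: word -> path in a CONJUNCTION-like taxonomy.
# These entries are illustrative assumptions, not the article's lexicon.
LEXICON = {
    "and": ("CONJUNCTION", "Extension"),
    "but": ("CONJUNCTION", "Extension"),
    "or": ("CONJUNCTION", "Extension"),
    "because": ("CONJUNCTION", "Enhancement"),
    "then": ("CONJUNCTION", "Enhancement"),
    "indeed": ("CONJUNCTION", "Elaboration"),
    "namely": ("CONJUNCTION", "Elaboration"),
}


def functional_features(text):
    """Return the share of lexicon hits that fall under each taxonomy option."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(LEXICON[t] for t in tokens if t in LEXICON)
    total = sum(counts.values())
    if total == 0:
        return {}  # no indicator words found in this text
    return {"/".join(path): n / total for path, n in counts.items()}


features = functional_features(
    "He came and went, but she stayed because it rained."
)
for name, value in sorted(features.items()):
    print(f"{name}: {value:.3f}")
```

Feature vectors of this kind could then be passed to any standard text classifier alongside ordinary function-word counts.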

Figures (18)

TABLE 1. The authorship attribution corpus, comprising the chapters in a set of 20 19th-century novels. The number of chapters in each book and the average number of words per chapter in each book are as shown.

FIG. 1. Ten-fold cross-validation accuracy for authorship attribution in 19th-century literature. Baseline (majority class) classification would give 24% accuracy.

FIG. 2. Ten-fold cross-validation accuracy for book attribution in 19th-century literature. Baseline (majority class) classification would give 9.6% accuracy.
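
The majority-class baseline quoted in these captions is the accuracy obtained by always predicting the most frequent class, so it equals that class's share of the corpus. A minimal sketch follows; the label list is made up for illustration and does not reflect the article's corpus, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical label distribution, not the article's data.
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
baseline = majority_baseline_accuracy(labels)
print(f"baseline accuracy: {baseline:.1%}")
```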

We may tentatively examine differences between the tasks based on which functional features seemed to help the most. Examination of the features shows some clear differences between the two tasks in terms of which sorts of features were most significant for classification, from which we can draw some tentative conclusions. Book discrimination involves a

TABLE 2. The top 20 features (by rank sum) for each of book, authorship, and nationality attribution.

FIG. 3. Ten-fold cross-validation accuracy for nationality attribution in 19th-century literature. Baseline (majority class) classification would give 54.0% accuracy.

TABLE 3. Oppositions from the 15 highest ranked systemic features for each class, with weights taken from the model learned using FW + Com + Mod for Nationality attribution.

TABLE 4. Composition of the corpus of characters’ speeches from Shakespeare. The table shows the number and average total speech length of selected characters of each gender from each play. The text has more detail on the construction of the corpus.

FIG. 4. Accuracies for various feature sets over the corpus.

TABLE 5. Oppositions found in the top 20 features from both genders of Shakespearean characters. Note that they are organized categorically, not in rank order.

FIG. 5. Movie review classification results using SMO with default parameters and a linear kernel with various feature sets; see text for further details.

TABLE 6. Top 15 features for positive and negative reviews from BoW + App, showing wo (multiplied by 10,000 for scaling).

FIG. 6. Ten-fold cross-validation accuracies for SMO with various feature sets, on the corpus of science articles.

TABLE 7. Summary of the geology and paleontology journals used in the scientific literature corpus study, giving the number of articles from each journal in the corpus, and the average number of words per article.

TABLE 8. Oppositions from the 20 highest ranked systemic features in geology and paleontology articles, from the model learned using FW + Con + Com + Mod + App.

FIG. A1. The CONJUNCTION system (Matthiessen, 1995). Options here are disjunctive; examples of lexical realizations for the leaves are given in italics. Different patterns of CONJUNCTION usage lead to markedly different textual styles. Frequent use of Extension can give a text with high information density which can give a “panoramic” effect of touring through a conceptual landscape, but if done poorly may overwhelm and lose a reader in too many facts. On the other hand, Elaboration can be used to good effect to create textual coherence around a single focused storyline. We note, too, that many of the standard function words traditionally used in computational stylistic studies are types of CONJUNCTION, which further argues for this system’s importance for stylistic text analysis.

FIG. A2. The MODALITY system networks (Matthiessen, 1995).

FIG. A3. Examples of indicator features for various combinations of MODALITY options. Note that not all combinations are realized in the language; note also the ambiguity of some of the indicators.

FIG. A4. Options in the Attitude network, with examples of appraisal adjectives from our lexicon.

References (78)

Androutsopoulos, I., Koutsias, J., Chandrinos, K., Paliouras, G., & Spyropoulos, C. (2000). An evaluation of naive bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age (pp. 9-17). New York: ACM Press.
Appelt, D., Hobbs, J., Bear, J., Israel, D., & Tyson, M. (1993). FASTUS: A finite-state processor for information extraction from real-world text. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 1172-1178).
Argamon, S., Dodick, J., & Chase, P. (2005). The languages of science: A corpus-based study of experimental and historical science articles. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society (pp. 157-162).
Argamon, S., Koppel, M., Fine, J., & Shimony, A.R. (2003). Gender, genre, and writing style in formal written texts. Text, 23(3), 321-346.
Argamon, S., & Levitan, S. (2005). Measuring the usefulness of function words for authorship attribution. In Proceedings of the 2005 ACH/ALLC Conference. Retrieved from http://web.uvic.ca/hrd/achallc2005/abstracts.htm
Argamon, S., & Olsen, M. (2006, April). Toward meaningful computing. Communications of the ACM, 49(4), 33-35.
Baayen, R.H., Halteren, H. van, & Tweedie, F. (1996). Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 7, 91-109.
Belkin, N.J. (1993). Interaction with texts: Information retrieval as information-seeking behavior. Information Retrieval, 93, 55-66.
Berry, M.J., & Linoff, G. (1997). Data mining techniques: For marketing, sales, and customer support. New York: Wiley.
Burrows, J.F. (1987). Computation into criticism: A study of Jane Austen's novels and an experiment in method. Oxford, England: Clarendon Press.
Butler, C.S. (2003). Structure and function-A guide to three major structural-functional theories (No. 63-64). Amsterdam: John Benjamins.
Chaski, C.E. (1999). Linguistic authentication and reliability. In Proceedings of the National Conference on Science and the Law (pp. 97-148). San Diego, CA: National Institute of Justice.
Chen, S.Y., Magoulas, G.D., & Dimakopoulos, D. (2005). A flexible interface design for web directories to accommodate different cognitive styles. Journal of the American Society for Information Science and Technology, 56(1), 70-83.
Cowie, J., & Lehnert, W. (1996). Information extraction. Communications of the ACM, 39(1), 80-91.
Cristea, D., Marcu, D., Ide, N., & Tablan, V. (1999). Discourse structure and co-reference: An empirical study. In D. Cristea, N. Ide, & D. Marcu (Eds.), The relation of discourse/dialogue structure and reference (pp. 46-53). New Brunswick, NJ: Association for Computational Linguistics.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
de Vel, O. (2000). Mining e-mail authorship. In ACM International Conference on Knowledge Discovery and Data Mining Workshop on Text Mining. Boston. Retrieved from http://www.cs.cmu.edu/~dunja/WshKDD2000.html
de Vel, O., Corney, M., Anderson, A., & Mohay, G. (2002). Language and gender author cohort analysis of e-mail for computer forensics. In Proceedings of the Digital Forensic Research Workshop. Syracuse, NY. (pp. 7-9).
Fawcett, R.P., & Tucker, G.H. (1990). Demonstration of GENESYS: A very large, semantically based systemic functional grammar. In Proceedings of the 13th International Conference on Computational Linguistics (COLING-90) (pp. 47-49). Helsinki, Finland.
Finn, A., Kushmerick, N., & Smyth, B. (2002). Genre classification and domain transfer for information filtering. In F. Crestani, M. Girolami, & C.J. van Rijsbergen (Eds.), Proceedings of the 24th European Colloquium on Information Retrieval Research. Glasgow, United Kingdom: Springer Verlag, Heidelberg, DE.
Firth, J. (1968). A synopsis of linguistic theory 1930-1955. In F. Palmer (Ed.), Selected papers of J.R. Firth 1952-1959. London: Longman.
Fritch, J.W., & Cromwell, R.L. (2001). Evaluating internet resources: Identity, affiliation, and cognitive authority in a networked world. Journal of the American Society for Information Science and Technology, 52(6), 498-507.
Grossman, D., & Frieder, O. (1998). Information retrieval: Algorithms and heuristics. Dordrecht, The Netherlands: Kluwer.
Halliday, M.A.K. (1994). Introduction to functional grammar (2nd ed.). London: Arnold.
Halliday, M.A.K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Holmes, D.I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3), 111-117.
Hoover, D. (2002). Frequent word sequences and statistical stylistics. Literary and Linguistic Computing, 17, 157-180.
Kamps, J., Marx, M., Mokken, R.J., & Rijke, M. de. (2002). Words with attitude. In Proceedings of the 1st International Conference on Global WordNet. Mysore, India (pp. 332-341).
Karlgren, J. (2000). Stylistic experiments for information retrieval. Unpublished doctoral dissertation, SICS.
Kehagias, A., Petridis, V., Kaburlasos, V., & Fragkou, P. (2003). A comparison of word- and sense-based text categorization using several classification algorithms. Journal of Intelligent Information Systems, 21(3), 227-247.
Kessler, B., Nunberg, G., & Schütze, H. (1997). Automatic detection of text genre. In P.R. Cohen & W. Wahlster (Eds.), Proceedings of the 35th annual meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (pp. 32-38). Somerset, NJ: Association for Computational Linguistics.
Kjell, B., & Frieder, O. (1992). Visualization of literary style. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (pp. 656-661). Chicago: IEEE Press.
Koppel, M., Akiva, N., & Dagan, I. (2003). A corpus-independent feature set for style-based text categorization. In Workshop on Computational Approaches to Style Analysis and Synthesis, 18th International Joint Conference on Artificial Intelligence.
Kushmerick, N. (1999). Learning to remove internet advertisement. In O. Etzioni, J.P. Müller, & J.M. Bradshaw (Eds.), Proceedings of the 3rd International Conference on Autonomous Agents (agents'99) (pp. 175-181). Seattle, WA: ACM Press.
Labov, W. (1973). Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.
Lewis, D., Schapire, R.E., Callan, J.P., & Papka, R. (1996). Training algorithms for linear text classifiers. In Proceedings of the 19th International Conference on Research and Development in Information Retrieval (pp. 298-306). New York: ACM Press.
Marcu, D. (1997). The rhetorical parsing of natural language texts. In Meeting of the Association for Computational Linguistics (pp. 96-103). Morristown, NJ: ACL.
Marcu, D. (1999). A decision-based approach to rhetorical parsing. In Proceedings of the ACL'99 (pp. 365-372). Morristown, NJ: ACL.
Martin, J.R., & White, P.R.R. (2005). The language of evaluation: Appraisal in English. London: Palgrave. http://www.grammatics.com/appraisal/
Matthews, R.A.J., & Merriam, T.V.N. (1997). Distinguishing literary styles using neural networks. In E. Fiesler & R. Beale (Eds.), Handbook of neural computation (pp. 8). New York: IOP Publishing and Oxford University Press.
Matthiessen, C. (1983). Systemic grammar in computation: The nigel case. In Proceedings of the Meeting of the European Association for Computational Linguistics (pp. 155-164). Morristown, NJ: ACL.
Matthiessen, C. (1995). Lexico-grammatical cartography: English systems. Tokyo: International Language Sciences.
Matthiessen, C., & Bateman, J.A. (1991). Text generation and systemic-functional linguistics: Experiences from English and Japanese. London, New York: Pinter, St. Martin's Press.
McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the 17th International Conference on Machine Learning. Stanford, CA (pp. 591-598).
McEnery, A., & Oakes, M. (2000). Authorship studies/textual statistics. In R. Dale, H. Moisl, & H. Somers (Eds.), Handbook of natural language processing (pp. 234-248). Philadelphia: Dekker.
McKinney, V., Yoon, K., & Zahedi, F.M. (2002). The measurement of web-customer satisfaction: An expectation and disconfirmation approach. Information Systems Research, 13(3), 296-315.
McMenamin, G. (2002). Forensic linguistics: Advances in forensic stylistics. Boca Raton, FL: CRC Press.
Miller, G., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. (1990). Wordnet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-312.
Moore, J.D., & Pollack, M.E. (1992). A problem for RST: The need for multi-level discourse analysis. Computational Linguistics, 18(4), 537-544.
Mosteller, F., & Wallace, D.L. (1964). Inference and disputed authorship: The federalist. Reading, MA: Addison-Wesley.
Ng, V., & Cardie, C. (2002). Improving machine learning approaches to coreference resolution. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 104-111). Morristown, NJ: ACL.
O'Donnell, M. (1993). Reducing complexity in a systemic parser. In Proceedings of the 3rd International Workshop on Parsing Technologies. Tilburg, the Netherlands (pp. 10-13).
Osgood, C.E., Succi, G.J., & Tannenbaum, P.H. (1957). The measurement of meaning. Urbana: University of Illinois Press.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd ACL (pp. 271-278). Morristown, NJ: ACL.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (pp. 79-86). Morristown, NJ: ACL.
Patrick, J. (2004). The ScamSeek project: Text mining for financial scams on the internet. In S. Simoff & G. Williams (Eds.), Proceedings of the 3rd Australasian Data Mining Conference (pp. 33-38).
Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C.J.C. Burges, & A.J. Smola (Eds.), Advances in Kernel Methods-Support vector learning. Cambridge, MA: MIT Press.
Ponte, J.M., & Croft, W.B. (1998). A language modeling approach to information retrieval. In Proceedings of ACM SIGIR. New York: ACM Press.
Roth, D., & Yih, W. (2001). Relational learning via propositional algorithms: An information extraction case study. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 1257-1263).
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Schauer, H., & Hahn, U. (2001). Anaphoric cues for coherence relations. In G. Angelova, K. Bontcheva, R. Mitkov, N. Nicolov, & N. Nikolov (Eds.), Proceedings of the Euroconference Recent Advances in Natural Language Processing (RANLP-2001) (pp. 228-234). Tzigov, Bulgaria.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.
Shakespeare, W. (n.d.). The complete Moby Shakespeare. http://www-tech.mit.edu/Shakespeare/
Stamatatos, E., Fakotakis, N., & Kokkinakis, G.K. (2000). Automatic text categorization in terms of genre, author. Computational Linguistics, 26(4), 471-495.
Taboada, M., & Grieve, J. (2004). Analyzing appraisal automatically. In AAAI Spring Symposium on Exploring Attitude and Affect in Text. Menlo Park, CA: AAAI Press.
Tang, R., Ng, K.B., Strzalkowski, T., & Kantor, P.B. (2003). Toward machine understanding of information quality. In Proceedings of Annual Meeting of American Society for Information Science and Technology (Vol. 40, pp. 213-220).
Teich, E. (1995). A proposal for dependency in systemic functional grammar-Metasemiosis in computational systemic functional linguistics. Unpublished doctoral dissertation, University of the Saarland and GMD/IPSI, Darmstadt, Germany.
Torvik, V.I., Weeber, M., Swanson, D.R., & Smalheiser, N.R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158.
Trudgill, P. (2001). Sociolinguistics: An introduction to language and society (4th ed.). New York: Penguin.
Turney, P.D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting of the ACL (pp. 417-424). Morristown, NJ: ACL.
Tweedie, F., Singh, S., & Holmes, D. (1996). Neural network applications in stylometry: The Federalist Papers. Computers and the Humanities, 30(1), 1-10.
Whitelaw, C., Garg, N., & Argamon, S. (2005, May). Using appraisal taxonomies for sentiment analysis. In Proceedings of the 2nd Midwest Computational Linguistic Colloquium (MCLC 2005).
Wiebe, J., McKeever, K., & Bruce, R. (1998). Mapping collocational properties into machine learning features. In Proceedings of the 6th Workshop on Very Large Corpora (pp. 225-233). Morristown, NJ: ACL.
Wiebe, J., Wilson, T., & Bell, M. (2001). Identifying collocations for recognizing opinions. In Proceedings of ACL/EACL 2001 Workshop on Collocation (pp. 24-31).
Winograd, T. (1972). Understanding natural language. Orlando, FL: Academic Press.
Witten, I.H., & Frank, E. (2000). Data mining: Practical machine learning tools with java implementations. San Francisco: Kaufmann.
Yule, G.U. (1944). Statistical study of literary vocabulary. Cambridge, UK: Cambridge University Press.
