PAN 2018 explores several authorship analysis tasks enabling a systematic comparison of competitive approaches and advancing research in digital text forensics. More specifically, this edition of PAN introduces a shared task in...
morePAN 2018 explores several authorship analysis tasks enabling a systematic comparison of competitive approaches and advancing research in digital text forensics. More specifically, this edition of PAN introduces a shared task in cross-domain authorship attribution, where texts of known and unknown authorship belong to distinct domains, and another task in style change detection that distinguishes between single-author and multi-author texts. In addition, a shared task in multimodal author profiling examines, for the first time, a combination of information from both texts and images posted by social media users to estimate their gender. Finally, the author obfuscation task studies how a text by a certain author can be paraphrased so that existing author identification tools are confused and cannot recognize the similarity with other texts of the same author. New corpora have been built to support these shared tasks. A relatively large number of software submissions (41 in total) was received and evaluated. Best paradigms are highlighted while baselines indicate the pros and cons of submitted approaches. (stylistic fingerprint) but she also shares some properties with other people of similar background (age, gender, education, etc.) It is quite challenging to define or measure both personal style (for each individual author) and collective style (males, females, young people, old people, etc.). In addition, it remains unclear what one should modify in her texts in order to attempt to hide her identity or to mimic the style of another author. This edition of PAN deals with these challenging issues. Author identification puts emphasis on the personal style of individual authors. The most common task is authorship attribution where there is a set of candidate authors (suspects), with samples of their texts, and one of them is selected as the most likely author of a text of disputed authorship [31]. This can be a closed-set (one of the suspects is surely the true author) or an open-set (the true author may not be among the suspects) attribution case. This edition of PAN focuses on closed-set cross-domain authorship attribution, that is, when the texts unquestionably written by the suspects and the texts of disputed authorship belong to different domains. This is a realistic scenario suitable for several applications. For example, imagine the case of a crime novel published anonymously when all candidate authors have only published fantasy novels [13] or a disputed tweet when the available texts written by the suspects are newspaper articles. To be able to control the domain of texts, we turned to so-called fanfiction [11]. This term refers to the large body of contemporary fiction that is nowadays created by non-professional authors ('fans'), who write in the tradition of a well-known source work, such as the Harry Potter series by J.K. Rowling, that is sometimes called the 'canon'. These writings or 'fics' within such a 'fandom' heavily borrow characters, motives, settings, etc. from the source fandom. Fanfiction provides excellent material to study cross-domain authorship attribution since most fans are active in multiple fandoms. Another important dimension in author identification is to intrinsically analyse a document, possibly written by multiple authors and identify the contribution of each co-author. The previous edition of PAN aimed to find the exact border positions within a document where the authorship changes. Taking the respective results into account which have shown that the problem is quite hard [39], we substantially relaxed the task this year and broke it down to the simple question: Given a document, are there any style changes or not? An alternative formulation would thus be to solely predict whether a document is written by a single author or by multiple collaborators, whereby it is irrelevant to the task to identify the exact border positions between authors. While the evaluation of the two preceding tasks relied on the Webis-TRC-12 data set [21], we created a novel data set by utilizing the StackExchange network 1. Containing millions of publicly available questions and answers regarding several topics and subtopics, it represents a rich source which we exploited to build a comprehensive, but still realistic data set for the style change detection task. When the collective style of groups of authors is considered, author profiling attempts to predict demographic and social characteristics, like age, gender, education, and personality traits. It is a research area associated with important applications in social media analytics and marketing as well as cyber forensics. In this edition of PAN, for the first time, multimodal information is considered. Both texts and images posted by social media users are used to predict their gender.