Papers by Richard Forsyth
This program uses an evolutionary (Darwinian) optimization technique to perform clustering, i.e. it identifies within a dataset groups of items which in some sense belong together. An important point about CUES is that it decides on the number of groups as part of the optimization process, without having to be given the number to find as input -- unlike many well-established clustering algorithms. It has been written in Python3 and is released under the GNU Public License for general usage.
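CUES itself is not shown here in code form; purely as an illustration of the general approach, a minimal Python sketch of evolutionary clustering in which the number of groups emerges from the search might look as follows (the function names, fitness criterion, and mutation scheme are assumptions for the example, not details of CUES):

```python
import random
import statistics

def fitness(labels, data):
    """Score a clustering: reward tight clusters, penalise extra clusters.
    (Illustrative criterion only; CUES's actual objective may differ.)"""
    clusters = {}
    for label, point in zip(labels, data):
        clusters.setdefault(label, []).append(point)
    # mean within-cluster spread (1-D data for simplicity)
    spread = sum(statistics.pstdev(c) if len(c) > 1 else 0.0
                 for c in clusters.values()) / len(clusters)
    return -(spread + 0.1 * len(clusters))   # fewer, tighter clusters score higher

def evolve(data, pop_size=30, generations=200, max_k=8):
    pop = [[random.randrange(max_k) for _ in data] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, data), reverse=True)
        survivors = pop[:pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            i = random.randrange(len(child))        # point mutation:
            child[i] = random.randrange(max_k)      # reassign one item
            children.append(child)
        pop = survivors + children
    best = max(pop, key=lambda g: fitness(g, data))
    return best, len(set(best))   # grouping, plus the number of groups it found

# Example: two obvious clumps; the search should settle on about two groups.
data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
labels, k = evolve(data)
print(k, labels)
```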

Recently there has been an upsurge of interest in the problem of text categorization, e.g. of newswire stories (Hayes & Weinstein, 1991). However, classifying documents is not a new problem: workers in the field of stylometry have been grappling with it for over a hundred years. Typically, they have given most attention to authorship attribution, while more modern research in text categorization, conducted from within the paradigm of Artificial Intelligence, has concentrated on discrimination based on subject matter. Nevertheless both fields share similar aims, and it is the contention of the present author that they could profit from being more aware of each other. Accordingly, the present study addresses an issue common to both approaches, the problem of finding an effective set of attributes or features for text discrimination. Stylometers, in their quest to capture consistent and distinctive features of linguistic style, have proposed and used a wide variety of textual features or markers, including measures of vocabulary richness, grammatical transition frequencies, rates of usage of frequent function words, and preferences for words in certain semantic categories. In many text-categorization tasks the choice of textual features is a crucial determinant of success, yet it is not usually treated as a major focus of attention. This is often true of AI-based text-categorization studies as well. It would be desirable if this part of the process were better understood. This paper, therefore, reports an empirical comparison of nine different methods of textual feature-finding that: (1) do not depend on subjective judgement; (2) do not need background knowledge external to the texts being analyzed, such as a lexicon or thesaurus; (3) do not presuppose that the texts being analyzed are in the English language; and (4) do not presume that words (or word-based measures) are the only possible textual descriptors. Results of a benchmark test on 13 representative text-classification problems suggest that one of these techniques, here designated Monte-Carlo Feature-Finding, has certain advantages that merit consideration by future workers seeking to characterize stylistic habits efficiently without imposing many preconceptions.
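The paper itself specifies the procedure; the following is only an illustrative sketch of the general idea behind Monte-Carlo feature-finding as described above (quasi-random sampling of candidate substrings, then keeping the most discriminating ones). The names, scoring rule, and toy data are assumptions made for the example, not details from the paper:

```python
import random

def monte_carlo_features(texts_a, texts_b, n_candidates=500, n_keep=20,
                         min_len=2, max_len=6, seed=0):
    """Sample random substrings from the training texts and keep those whose
    rate of occurrence differs most between the two classes (illustrative only)."""
    rng = random.Random(seed)
    pool = texts_a + texts_b
    candidates = set()
    while len(candidates) < n_candidates:
        text = rng.choice(pool)
        length = rng.randint(min_len, max_len)
        start = rng.randrange(max(1, len(text) - length))
        candidates.add(text[start:start + length])

    def rate(substr, texts):
        total = sum(len(t) for t in texts)
        return sum(t.count(substr) for t in texts) / total

    scored = [(abs(rate(s, texts_a) - rate(s, texts_b)), s) for s in candidates]
    scored.sort(reverse=True)
    return [s for _, s in scored[:n_keep]]

# Tiny usage example with made-up texts
class_a = ["the cat sat on the mat", "the cat chased the mouse"]
class_b = ["a dog barked at a van", "a dog dug in a garden"]
print(monte_carlo_features(class_a, class_b, n_candidates=50, n_keep=5))
```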

The recent revival of Connectionism has led to an upsurge of interest in trainable pattern associators and pattern classifiers of many types. However, one training method currently dominates the field -- the back propagation algorithm. This method is crowding out other neural learning algorithms and other inductive techniques. The present paper reports some empirical trials comparing seven different neural learning algorithms (including two versions of back propagation) on four test problems. Though limited in scope, the present study does shed light on the performance of a variety of learning techniques, compared under relatively uniform conditions. The results cast some doubt on the status of back propagation as an 'industrial strength' learning algorithm. It appears to scale up rather poorly; and on two pattern recognition tasks it gave a higher error rate than a commonly used statistical technique. These results suggest that the neurocomputing community as a whole may be in danger of becoming fixated at a local optimum, just like some of its algorithms.
Lecture Notes in Computer Science, 1996
Instance-based methods of classification are easy to implement, easy to explain and relatively robust. Furthermore, they have often been found in empirical studies to be competitive in accuracy with more sophisticated classification techniques.

Literary and Linguistic Computing, 2013
Most authorship attribution studies have focused on works which are available in the language used by the original author, since this provides a direct way of examining an author's linguistic habits. Sometimes, however, questions of authorship arise regarding a work only surviving in translation. One example is Constance, the putative "last play" of Oscar Wilde, only existing in a supposed French translation of a lost English original. The present study aims to take a step towards dealing with cases of this kind by addressing two related questions: (1) to what extent are authorial differences preserved in translation; (2) to what extent does this carry-over depend on the particular translator? With these aims we analyzed 262 letters written by Vincent van Gogh and by his brother Theo, dated between 1888 and 1890, each available in the original French and in an English translation. We also performed a more intensive investigation of a subset of this corpus, comprising 48 letters, for which two different English translations were obtainable. Using three different indices of discriminability (classification accuracy, Hedges' g, and area under the ROC curve), we found that much of the stylistic discriminability between the two brothers was preserved in the English translations. Subsidiary analyses were used to identify which lexical features were contributing most to inter-author discriminability. Discrimination between translation sources was possible, though less effective than between authors. We conclude that "handprints" of both author and translator can be found in translated texts, using appropriate techniques.
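As a purely illustrative aside, the two effect-size indices named above can be computed from per-text feature scores along the following lines (a minimal sketch with invented data; the study's own computations may differ):

```python
import math

def hedges_g(sample_a, sample_b):
    """Hedges' g: standardized mean difference with small-sample correction."""
    na, nb = len(sample_a), len(sample_b)
    ma, mb = sum(sample_a) / na, sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = (ma - mb) / pooled_sd
    correction = 1 - 3 / (4 * (na + nb) - 9)   # Hedges' small-sample correction
    return d * correction

def auc(sample_a, sample_b):
    """Area under the ROC curve: probability that a random score from
    sample_a exceeds one from sample_b (ties count half)."""
    wins = sum((a > b) + 0.5 * (a == b) for a in sample_a for b in sample_b)
    return wins / (len(sample_a) * len(sample_b))

# Hypothetical per-letter rates of some lexical feature for two authors
vincent = [0.012, 0.015, 0.011, 0.017, 0.014]
theo    = [0.008, 0.007, 0.010, 0.009, 0.006]
print(hedges_g(vincent, theo), auc(vincent, theo))
```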

Literary and Linguistic Computing, 1999
Assigning a date to a text is an important task in stylometry. Most previous researchers, however, have worked on intractable problems, where a true chronology will never be known with certainty, such as the works of Plato, Shakespeare or Marlowe. It is argued here that stylochronometric methods should be extensively tested on unproblematic texts before being used in disputed cases. As part of such testing, the present study applies a novel technique, Monte-Carlo Feature-Finding (Forsyth, 1997), to the verse of W.B. Yeats, where the dating is relatively well documented. Yeats insisted that his language changed as he grew older, and most readers would concur; yet scholars have not reached agreement on the nature of this linguistic change. A quasi-random search algorithm was used to find marker substrings in 142 poems of Yeats. To test their distinctiveness, four trials were performed: (1) assignment of 10 poems absent from the training sample to their correct period; (2) detecting differences in two poems written by Yeats in his twenties and revised when he was sixty; (3) constructing a regression formula; (4) classifying two prose extracts written 46 years apart. Assigning short poems (median length 114 words) to their correct chronological period is a non-trivial task. Nevertheless, counting of distinctive substrings gave the right assignment in 9 out of 10 unseen cases. Moreover, these substring frequencies were sensitive enough to detect authorial revision in two early poems revised by Yeats many years after he originally wrote them, and robust enough to classify a pair of short prose extracts correctly; as well as accounting for 71% of the variance when used in a regression to predict the year in which 13 poems, absent from the training sample, were composed. These results suggest that short substrings found by a Monte-Carlo process warrant further investigation as stylistic indicators.
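Trial (3) above builds a regression formula from substring frequencies; a minimal sketch of that kind of date prediction, using ordinary least squares on one marker's rate per poem with invented numbers (not the study's actual data or model), might look like this:

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x (single predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

# Invented data: rate of one marker substring per 1,000 words in each poem,
# against the poem's known year of composition.
marker_rate = [4.1, 3.6, 3.0, 2.2, 1.8, 1.1]
year        = [1893, 1899, 1904, 1912, 1921, 1933]

a, b = fit_line(marker_rate, year)
print(f"predicted year for a rate of 2.5: {a + b * 2.5:.0f}")
```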

Kybernetes, 1981
BEAGLE (Biological Evolutionary Algorithm Generating Logical Expressions) is a computer package for producing decision‐rules by induction from a database. It works on the principle of "Naturalistic Selection", whereby rules that fit the data badly are "killed off" and replaced by "mutations" of better rules or by new rules created by "mating" two better adapted rules. The rules are Boolean expressions represented by tree structures. The software consists of two Pascal programs, HERB (Heuristic Evolutionary Rule Breeder) and LEAF (Logical Evaluator And Forecaster). HERB improves a given starting set of rules by running over several simulated generations. LEAF uses the rules to classify samples from a database where the correct membership may not be known. Preliminary tests on three different databases have been carried out -- on hospital admissions (classing heart patients as deaths or survivors), on athletic physique (classing Olympic finalists as long‐distance runners or sprinters) and...
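To make the "rules as Boolean expression trees" idea concrete, here is a deliberately minimal sketch of that style of evolutionary rule induction, written in Python rather than the original Pascal, with made-up attribute names, data, and a crude selection scheme; it is not the HERB/LEAF implementation:

```python
import random

FIELDS = ["age", "bp", "weight"]          # hypothetical attribute names

def random_rule(depth=2):
    """Build a random Boolean expression tree over the attributes."""
    if depth == 0 or random.random() < 0.3:
        return (random.choice(FIELDS), random.choice(["<", ">"]), random.randint(1, 100))
    return (random.choice(["and", "or"]), random_rule(depth - 1), random_rule(depth - 1))

def evaluate(rule, record):
    op = rule[0]
    if op == "and":
        return evaluate(rule[1], record) and evaluate(rule[2], record)
    if op == "or":
        return evaluate(rule[1], record) or evaluate(rule[2], record)
    field, comp, threshold = rule
    return record[field] < threshold if comp == "<" else record[field] > threshold

def fitness(rule, records):
    """Proportion of records whose known class the rule predicts correctly."""
    return sum(evaluate(rule, r) == r["outcome"] for r in records) / len(records)

def breed(a, b):
    """Crude 'mating': combine one parent's operator with the other's subtree."""
    if a[0] in ("and", "or") and b[0] in ("and", "or"):
        return (a[0], a[1], b[2])
    return random.choice([a, b])

def evolve(records, pop_size=40, generations=60):
    pop = [random_rule() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda r: fitness(r, records), reverse=True)
        keep = pop[:pop_size // 2]                      # badly fitting rules are killed off
        pop = keep + [breed(random.choice(keep), random.choice(keep)) for _ in keep]
    return max(pop, key=lambda r: fitness(r, records))

# Toy data: high blood pressure tends to go with outcome True
records = [{"age": random.randint(20, 80), "bp": bp, "weight": random.randint(50, 110),
            "outcome": bp > 60} for bp in [random.randint(1, 100) for _ in range(200)]]
print(evolve(records))
```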
Journal of Experimental & Theoretical Artificial Intelligence, 1994
This paper describes a method of simplifying inductively generated discrimination trees using a measure of tree quality based on the principle of information economy, which takes into account both the size of the tree and the size of the outcome data after (notional) encoding by that tree. Results of testing this method on a selection of data sets show that it has some practical advantages over previously used techniques for tree-pruning. Some of the theoretical implications of the present method are also discussed.
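The "information economy" criterion described above is in the spirit of minimum-description-length pruning: a tree is preferred when the bits needed to describe the tree, plus the bits needed to encode the outcomes it fails to account for, are jointly smallest. A rough, purely illustrative sketch of such a score (not the paper's exact coding scheme) follows:

```python
import math

def tree_cost(num_nodes, bits_per_node=8.0):
    """Notional cost of transmitting the tree itself."""
    return num_nodes * bits_per_node

def data_cost(errors, n, num_classes=2):
    """Notional cost of encoding the outcomes given the tree: roughly,
    say which cases the tree misclassifies and what they really were."""
    if errors == 0:
        return 0.0
    return math.log2(math.comb(n, errors)) + errors * math.log2(num_classes)

def tree_quality(num_nodes, errors, n):
    """Lower total description length = better tree (illustrative criterion)."""
    return tree_cost(num_nodes) + data_cost(errors, n)

# A larger tree with fewer errors versus a pruned tree with a few more errors:
print(tree_quality(num_nodes=15, errors=2, n=200))   # unpruned
print(tree_quality(num_nodes=5, errors=8, n=200))    # pruned -- scores better here
```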

Empirical Studies of the Arts, 2000
This article describes a preliminary study of linguistic attributes that differentiate popular from obscure poems in English. Following in the footsteps of Simonton (1989), Martindale (1990) and others, frequency of appearance in anthologies was used as an index of poetic popularity. Twenty general anthologies published between 1966 and 1997 were selected and all poems appearing in more than five of them were taken as a reference sample. This gave eighty-five poems by fifty-four different authors. (The two most popular were Matthew Arnold's Dover Beach with 16 occurrences and Kubla Khan by Samuel T. Coleridge with 15.) As a control group, fifty-four other poets were selected by finding a less eminent poet of the same sex born within ten years of each poet in the reference sample. The same number of poems were chosen (as near as possible randomly) from each obscure poet as from the matching popular poet. This gave eighty-five obscure poems, also by fifty-four different authors. ...

Computers in Human Behavior, 1995
There has recently been a great deal of interest in case-based reasoning, the generation of solutions to new problems using methods which have served for similar problems in the past. Much of the commonly available computer software is, however, concerned with "case-retrieval". The latter involves the matching of an observation for which the outcome is not known to a database of examples for which the outcome is known. Various types of case retrieval, or "classification by similarity" (CBS), algorithms are discussed. Several CBS algorithms, as well as various other techniques, were applied to two small datasets. Although more comparisons are required, the CBS algorithms were found to perform significantly better than a linear discriminant analysis on a predominantly binary dataset. A single-nearest-neighbor technique, first developed in the 1950s, performed particularly well on this dataset. A more sophisticated CBS algorithm, based upon a type of neural network, performed consistently well on both datasets. As CBS techniques generally encourage the ...
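The single-nearest-neighbor technique mentioned above is simple enough to show in a few lines; this is a generic illustration of classification by similarity with a hypothetical case base, not one of the specific algorithms compared in the paper:

```python
import math

def nearest_neighbour(query, cases):
    """Classify `query` with the outcome of the most similar stored case.
    `cases` is a list of (feature_vector, outcome) pairs."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best_features, best_outcome = min(cases, key=lambda c: distance(query, c[0]))
    return best_outcome

# Hypothetical case base: (features, known outcome)
case_base = [([1, 0, 1], "survived"), ([0, 1, 1], "died"), ([1, 1, 0], "survived")]
print(nearest_neighbour([1, 0, 0], case_base))   # -> "survived"
```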
Causal Models and Intelligent Data Management, 1999
"As we near the end of the twentieth century, the printed book also appears to be drawing near the end of its five-century career." -- Philip Davies Roberts (1986).
Cordial thanks are also due to the inter-library-loan staff at the Bolland Library of the University of the West of England (U.W.E.). In addition, this project has benefitted from access to two public-domain text repositories, Project Gutenberg and the Oxford Text Archive, as well as from computing support provided by the Faculty of Computer Science and Mathematics at U.W.E. Last but not least, I would also like to thank my wife, Mei Lynn, for support, suggestions and encouragement during the long period that this thesis took up, my daughter, Frances, for advice that altered the course of this research, and my son, Edward, for putting up with a good deal of thesis-induced absence and absent-mindedness. Naturally, none of those named above should be held responsible for any blunders or blemishes in what follows.

Poznan Studies in Contemporary Linguistics, 2015
This paper focuses on detecting and measuring traces of "formulaic language". For this purpose, we test a number of computational formulae that quantify the degree to which a text type incorporates inflexible sequences of words. We assess these candidate indices using a number of reference corpora representing a wide variety of text types, both routine and creative. We adopt the concept of "phrase-frame" proposed by Fletcher (2002–2007) as a means of exploring phraseological pattern variability. To date, there have been few studies explicitly addressing this issue, with the exception of Roemer (2010). We examine ten productivity indices, including Roemer's VPR, the Herfindahl-Hirschman index, Simpson's diversity index and relative Shannon entropy. We report that a novel measure, which we term Hapaxity, best meets our criteria, and show how this index of micro-productivity (in phrase-frames) may be used to assess macro-productivity (in text registers), thus...
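Several of the indices named above have standard forms; the following is a purely illustrative sketch with invented filler counts for one phrase-frame. The "hapaxity" shown here is simply the proportion of variants occurring only once, an assumption that may not match the paper's exact definition:

```python
import math
from collections import Counter

def indices(variant_counts):
    """Productivity indices over the fillers of one phrase-frame.
    `variant_counts` maps each distinct variant to its frequency."""
    counts = list(variant_counts.values())
    total = sum(counts)
    props = [c / total for c in counts]

    herfindahl = sum(p * p for p in props)                  # concentration
    simpson_diversity = 1 - herfindahl                      # Gini-Simpson form
    entropy = -sum(p * math.log2(p) for p in props)
    relative_entropy = entropy / math.log2(len(counts)) if len(counts) > 1 else 0.0
    hapaxity = sum(1 for c in counts if c == 1) / len(counts)   # illustrative definition

    return {"herfindahl": herfindahl, "simpson": simpson_diversity,
            "relative_entropy": relative_entropy, "hapaxity": hapaxity}

# Invented fillers for the frame "at the * of": how varied is the open slot?
frame_variants = Counter({"end": 14, "time": 9, "top": 4, "height": 1,
                          "mercy": 1, "edge": 1, "expense": 1})
print(indices(frame_variants))
```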

Literary and Linguistic Computing, 1995
The Federalist Papers, twelve of which are claimed by both Alexander Hamilton and James Madison, have long been used as a testing-ground for authorship attribution techniques, despite the fact that the styles of Hamilton and Madison are unusually similar. This paper assesses the value of three novel stylometric techniques by applying them to the Federalist problem. The techniques examined are a multivariate approach to vocabulary richness, analysis of the frequencies of occurrence of sets of common high-frequency words, and use of a machine-learning package based on a 'genetic algorithm' to seek relational expressions characterizing authorial styles. All three approaches produce encouraging results to what is acknowledged to be a difficult problem. During 1787 and 1788, seventy-seven articles were printed in four of New York City's five newspapers, with the aim of persuading New Yorkers to support ratification of the proposed new constitution of the United States of America. These articles appeared under the pseudonym Publius and, as it happens, were unsuccessful: 56% of the citizens of New York state voted against ratifying the constitution. Undeterred by this setback, Publius re-issued these propaganda pieces in book form in May 1788, together with an additional eight essays that had not previously been published, so that delegates at the Constitutional Convention, then sitting, might be swayed by their case in favour of federalism. The New York delegation did eventually abandon opposition to the constitution, but mainly, it is thought, because nine of the thirteen states ratified, leaving New York potentially isolated. The book, however, has remained in print for over 200 years. Speculation concerning the identity of Publius was widespread at the time, and gradually it became accepted that General Alexander Hamilton had been heavily involved in the composition of the Federalist Papers but that he had not written them all alone. Hamilton died in a duel with Aaron Burr in 1804, and in 1807 a Philadelphia periodical received a list, said to have been made by Hamilton just before his fatal duel, assigning specific papers to specific authors -- himself, John Jay, and James Madison (the fourth president of the United States). Not until he retired from the presidency did Madison concern himself with asserting authorship of particular Federalist papers, but in 1818...

Literary and Linguistic Computing, 2013
Quantifying the similarity or dissimilarity between documents is an important task in authorship attribution, information retrieval, plagiarism detection, text mining, and many other areas of linguistic computing. Numerous similarity indices have been devised and used, but relatively little attention has been paid to calibrating such indices against externally imposed standards, mainly because of the difficulty of establishing agreed reference levels of inter-text similarity. The present article introduces a multi-register corpus gathered for this purpose, in which each text has been located in a similarity space based on ratings by human readers. This provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement, i.e. external to the texts themselves. We describe the results of a benchmarking study in five different languages in which some widely used measures perform comparatively poorly. In particular, several alternative correlational measures (Pearson r, Spearman rho, tetrachoric correlation) consistently outperform cosine similarity on our data. A method of using what we call 'anchor texts' to extend this method from monolingual inter-text similarity-scoring to inter-text similarity-scoring across languages is also proposed and tested.
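For readers unfamiliar with the measures being compared, the following illustrative sketch contrasts cosine similarity with Pearson correlation on word-frequency vectors; the counts are hypothetical and are not the study's corpus or feature set:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Hypothetical frequencies of a few function words in two documents.
doc1 = [120, 45, 30, 8, 2]
doc2 = [118, 50, 25, 10, 1]
print(cosine(doc1, doc2), pearson(doc1, doc2))
# Pearson centres the vectors first, so it discounts the shared 'baseline'
# that raw cosine similarity rewards.
```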

Digital Scholarship in the Humanities, 1996
Stylometrists have proposed and used a wide variety of textual features or markers, but until recently very little attention has been focused on the question: where do textual features come from? In many text-categorization tasks the choice of textual features is a crucial determinant of success, yet is typically left to the intuition of the analyst. We argue that it would be desirable, at least in some cases, if this part of the process were less dependent on subjective judgement. Accordingly, this paper compares five different methods of textual feature finding that do not need background knowledge external to the texts being analyzed (three proposed by previous stylometers, two devised for this study). As these methods do not rely on parsing or semantic analysis, they are not tied to the English language only. Results of a benchmark test on 10 representative text-classification problems suggest that the technique here designated Monte-Carlo Feature-Finding has certain advantages that deserve consideration by future workers in this area.

Digital Scholarship in the Humanities, 1999
When his daughter Tullia died in 45 BC, the Roman orator Marcus Tullius Cicero (106-43 BC) was assailed by grief, which he attempted to assuage by writing a philosophical work now known as the Consolatio. Despite its high reputation in the classical world, only fragments of this text -- in the form of quotations by subsequent authors -- are known to have survived the fall of Rome. However, in 1583 a book was printed in Venice purporting to be a rediscovery of Cicero's Consolatio. Its editor was a prominent humanist scholar and Ciceronian stylist called Carlo Sigonio. Some of Sigonio's contemporaries, notably Antonio Riccoboni, voiced doubts about the authenticity of this work; and since that time scholarly opinion has differed over the genuineness of the 1583 Consolatio. The main aim of this study is to bring modern stylometric methods to bear on this question, in order to see whether internal linguistic evidence supports the belief that the Consolatio of 1583 is a fake, very probably perpetrated by Sigonio himself. A secondary objective is to test the application of methods previously used almost exclusively on English texts to a language with a different structure, namely Latin. Our findings show that the language of the 1583 Consolatio is extremely uncharacteristic of Cicero, and indeed that the text is much more likely to have been written during the Renaissance than in classical times. The evidence that Sigonio himself was the author is also strong, though not conclusive.

Representations of spoken discourse must accommodate the phenomenon of simultaneous speech. Linguists and other social scientists have employed numerous transcription conventions for exhibiting the temporal interleaving of multi-speaker talk (e.g. Atkinson and Heritage, 1984; Schiffrin, 1994; Leech et al., 1995; Carter, 2004; MICASE, 2004). Most of these conventions are mutually incompatible. The existence of many different systems is evidence that representing turn-taking in natural dialogue remains a problematic issue. The present study discusses a novel orthographic transcription layout which records how participants contribute to the stream of spoken events based on word timings. To test this method, the Maptask corpus (Anderson et al., 1991) was used because it contains unusually precise information on the timings of vocal events. Using the term vocable to denote words and a small number of short phrases, every vocable in the Maptask corpus has its onset and ending time recorded...

Linguists and other social scientists have employed many transcription conventions to exhibit the temporal interleaving of multi-speaker talk. The existence of many different systems, which are mutually incompatible, is evidence that representing spoken discourse remains problematic. This study proposes a novel orthographic transcription layout based on word timings. To test this method, the Maptask corpus is used because it contains unusually precise information on the timings of vocal events. This makes it possible to evaluate a non-standard talk-division format (TST1) in which the alternation of speakers is not imposed by a transcriber's intuition but emerges from the empirical data. It highlights the prevalence of "echoing" in the joint production of dialogue. Moreover, lengths of speech segments and inter-speaker intervals as defined by this procedure show significant associations with a number of contextual and interactional variables, indicating that this approach has analytic as well as representational benefits.
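The central idea, deriving speech segments from word timings rather than from a transcriber's judgement, can be illustrated with a minimal sketch. The data format and the grouping rule below are assumptions for the example; TST1's actual conventions are not reproduced here:

```python
# Each vocable: (speaker, onset_seconds, end_seconds, word)
vocables = [
    ("A", 0.00, 0.31, "right"), ("A", 0.35, 0.62, "go"), ("A", 0.64, 0.95, "left"),
    ("B", 0.80, 1.02, "uh-huh"),                     # overlaps A's last word
    ("A", 1.10, 1.40, "then"),  ("A", 1.42, 1.80, "stop"),
    ("B", 2.60, 2.95, "okay"),
]

def segments(vocables, gap=0.5):
    """Group consecutive vocables by the same speaker into a segment,
    starting a new segment when the speaker changes or a silence longer
    than `gap` seconds intervenes (an illustrative rule, not TST1 itself)."""
    ordered = sorted(vocables, key=lambda v: v[1])   # order by onset time
    out = []
    for speaker, onset, end, word in ordered:
        if out and out[-1]["speaker"] == speaker and onset - out[-1]["end"] <= gap:
            out[-1]["words"].append(word)
            out[-1]["end"] = end
        else:
            out.append({"speaker": speaker, "start": onset, "end": end, "words": [word]})
    return out

for seg in segments(vocables):
    print(f"{seg['speaker']} {seg['start']:.2f}-{seg['end']:.2f}: {' '.join(seg['words'])}")
```

Run on the toy data, the overlapping back-channel from speaker B becomes its own segment alongside speaker A's continuing turn, which is the kind of "echoing" the segment-based layout makes visible.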

Expert Systems, 1989
These are interesting times for the British expert-systems community. The great wave of public interest has surged and receded (likewise government funding), leaving some exponents of the subject high and dry. If the expert systems phenomenon is not to be simply a passing fad, the time has come to find a new role for knowledge-based systems in the wider world of computing. Or, in the words of Bob Muller, co-Chairman of the ES88 Conference: 'the decision point is now'. In this mood of reappraisal, over 600 people attended ES88, the 8th annual conference of the BCS specialist group on expert systems, at the Brighton Metropole last December. Most had come to find out about (and a few to influence) the future course of knowledge-based computing in the United Kingdom and Europe. A prevalent viewpoint seemed to be that expert systems had come in from the cold and found a respectable niche within the DP environment. This was reflected in the official theme of the conference, which was: 'integrating with mainstream software development'. Not everyone, however, saw this as a meek return to the fold. For instance, Alex Goodall, the other Conference co-Chairman, foresaw a kind of reverse takeover in which knowledge engineering, far from being swallowed up, would become the dominant force within software engineering. This view was central to the opening keynote lecture, by Derek Partridge of Exeter University, entitled 'To Add AI, or not to Add AI?'. He began by observing that the software crisis is twenty years old. AI (Artificial Intelligence) has been proposed as a way of solving it; yet, according to Partridge, it is far from clear that AI methods will prove beneficial to software engineering. In fact, since software engineering is what he called a 'vulnerable technology', adding AI could have an adverse impact, leading to a 'super software crisis' -- which would be progress of a sort. Traditional software development is based on the idea of Specify-and-Verify (SAV), but in reality the requirement for verification is abandoned and it becomes a process of Spec-