Beal, J., Corrigan, K. and Moisl, H. (eds) Creating and Digitizing Language Corpora: Synchronic Databases, Palgrave Macmillan, 1-16, 2007.
Six of the contributions to Volume 1 (Anderson et al.; Anderwald and
Wagner; Barbiers et al.; Sebba and Dray; Kallen and Kirk; Tagliamonte)
arose from invited presentations at the workshop on ‘Models and
Methods in the Handling of Unconventional Digital Corpora’ organized
by the editors of the present volume that was held in April 2004 during
the Fifteenth Sociolinguistics Symposium (SS15) at the University of
Newcastle. The book project then evolved by inviting further contributions
from key corpus creators so that the companion volumes would
contain treatments outlining the models and methods underpinning a
variety of digitized diachronic and synchronic corpora with a view to
highlighting synergies and points of contrast between them.
Papers by Hermann Moisl
The Diachronic Electronic Corpus of Tyneside English (DECTE) is a spoken corpus of interviews with residents of Tyneside and surrounding areas of North East
England. It updates the earlier Newcastle Electronic Corpus of Tyneside English (NECTE),
which combined two sub-corpora dating from the late 1960s and mid 1990s, and supplements
these with materials from an ongoing monitor corpus established in 2007. The first part of this
paper outlines the background and development of the DECTE project. It then reviews
research that has already been conducted on the corpus, comparing the different feature-based
and aggregate analyses that have been employed. In doing so, we hope to highlight the crucial
role that aggregate methods, such as hierarchical cluster analysis, can have in identifying and
explaining the parameters that underpin aspects of language variation, and to demonstrate that
such methods can and do work well in combination with feature-centric approaches.
The discussion is in two main parts: the first part briefly describes NECTE, the second constructs the phonetic variation map.
DECTE is an amalgamation of the existing Newcastle Electronic Corpus of Tyneside English (NECTE) created between 2001 and 2005 (http://research.ncl.ac.uk/necte), and NECTE2, a collection of interviews conducted in the Tyneside area since 2007. It thereby constitutes a rare example of a publicly available on-line corpus presenting dialect material spanning five decades.
The present website is designed for research use. DECTE also, however, includes an interactive website, The Talk of the Toon, which integrates topics and narratives of regional cultural significance in the corpus with relevant still and moving images, and which is designed primarily for use in schools and museums and by the general public.
Cluster analysis has long been used in science and engineering disciplines as a way of identifying interesting structure in data
(refs). The advent of digital electronic natural language text has seen its
application in text-oriented disciplines like information retrieval (refs) and data
mining (refs) and, increasingly, in corpus-based linguistics (refs). In all these
domains, the reliability of cluster analytical results is contingent both on the
nature of the particular clustering algorithm being used and on the
characteristics of the data being analyzed, where 'reliability' is understood as
the extent to which the result identifies structure which really is present in the
domain from which the data was abstracted, given some well defined sense of
'really present'. The present discussion focuses on how the reliability of
cluster analysis can be compromised by one particular characteristic of data
abstracted from natural language corpora.
The characteristic in question arises when the aim is to cluster a collection of
length-varying documents based on the frequency of occurrence of one or
more linguistic or textual features; examples are (refs). Because longer
documents are, in general, likely to contain more examples of the feature or
features of interest than shorter ones, the frequencies of the data variables
representing those features will be numerically greater for the longer
documents than for the shorter ones, which in turn leads one to expect that
the documents will cluster in accordance with relative length rather than with
some more interesting criterion latent in the data; this expectation has been
empirically confirmed (refs). The solution is to eliminate relative document
length as a factor in clustering by adjusting the data frequencies using a
length normalization method such as cosine normalization, which is
extensively used in information retrieval for precisely this purpose (refs). This
solution is not a panacea, however. One or more documents in the collection
might be too short to provide accurate population probability estimates for the
data variables, and, because length normalization methods exacerbate such
inaccuracies, the result would be that analysis based on the normalized data
inaccurately clusters the documents in question.
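As an illustration of the normalization step described above, the following is a minimal sketch of cosine (L2) length normalization applied to a small documents-by-features frequency matrix; the matrix values and variable names are invented for illustration and are not drawn from any corpus discussed here.

```python
# Minimal sketch of cosine length normalization, assuming a documents-by-features
# frequency matrix; the values below are illustrative only.
import numpy as np

# Hypothetical frequency matrix: rows are documents, columns are feature counts.
# The second document is roughly twice as long as the first, so its raw counts
# are correspondingly larger even though its feature proportions are similar.
freq = np.array([
    [10.0,  4.0,  6.0],
    [21.0,  9.0, 11.0],
    [ 2.0, 30.0,  1.0],
])

# Cosine normalization: divide each row by its Euclidean (L2) norm, so that
# every document vector has unit length and relative document length no longer
# dominates the distance calculations used in clustering.
row_norms = np.linalg.norm(freq, axis=1, keepdims=True)
normalized = freq / row_norms

print(normalized)
# After normalization, rows 1 and 2 are nearly identical, reflecting similar
# feature proportions rather than similar raw counts.
```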
The present discussion proposes a way of dealing with short documents in
clustering of length-varying multi-document corpora: that a threshold length
for acceptably accurate variable probability estimation be defined, and that all
documents shorter than that threshold be eliminated from the analysis. The
discussion is in three main parts. The first part outlines the nature of the problem
in detail, the second develops a method for determining a minimum document
length threshold, and the third exemplifies the application of that method to an
actual corpus.
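The elimination step might look as follows in outline. This is a sketch only: the token-count lengths, the threshold value, and the function name are illustrative assumptions, and the method the paper develops for determining the threshold is not reproduced here.

```python
# Minimal sketch of removing documents below a length threshold before clustering.
import numpy as np

def drop_short_documents(freq, doc_lengths, min_length):
    """Remove rows of the frequency matrix whose documents are shorter than
    min_length, returning the reduced matrix and the indices of the rows kept."""
    doc_lengths = np.asarray(doc_lengths)
    keep = doc_lengths >= min_length
    return freq[keep], np.nonzero(keep)[0]

# Hypothetical data: three documents, the last far too short to yield reliable
# probability estimates for the feature variables.
freq = np.array([
    [120, 35, 60],
    [ 98, 40, 55],
    [  2,  1,  0],
])
lengths = [5000, 4200, 150]   # document lengths in tokens (illustrative)

reduced, kept = drop_short_documents(freq, lengths, min_length=1000)
print(kept)      # indices of the documents retained for clustering
print(reduced)   # frequency matrix with the short document removed
```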
This chapter discusses the role of computational and more specifically statistical methods in analyzing large digital electronic
corpora, focussing in particular on cluster analysis. The first part of the
discussion motivates the use of cluster analysis in corpus linguistics, the
second gives an outline account of data creation and clustering with reference
to the Newcastle Electronic Corpus of Tyneside English, and the third is a
selective literature review.
When the variables used to describe the objects being clustered are measured on different numerical scales, those whose scales permit relatively
larger values can have a greater influence on clustering than those whose
scales restrict them to relatively smaller ones, and this can compromise the
reliability of the analysis. The first part of this discussion describes the nature
of that compromise. The second part argues that a widely used method for
removing disparity of variable scale, Z-standardization, is unsatisfactory for
cluster analysis because it eliminates differences in variability among
variables, thereby distorting the intrinsic cluster structure of the
unstandardized data, and instead proposes a standardization method based
on variable means which preserves these differences. The proposed mean-based
method is compared to several other alternatives to Z-standardization,
and is found to be superior to them in cluster analysis applications.
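To make the contrast concrete, the sketch below compares Z-standardization with a mean-based alternative, assuming that the mean-based method divides each variable by its column mean; the data values are invented for illustration and are not taken from the paper.

```python
# Minimal sketch contrasting Z-standardization with a mean-based alternative.
import numpy as np

# Two variables on very different numerical scales (illustrative values).
X = np.array([
    [ 2.0, 200.0],
    [ 4.0, 400.0],
    [ 6.0, 900.0],
])

# Z-standardization: subtract the column mean and divide by the column standard
# deviation. Every variable ends up with standard deviation 1, so differences
# in variability between variables are erased.
z_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Mean-based standardization: divide each variable by its column mean. The
# scales become comparable, but a variable that varies more relative to its
# mean retains that greater variability.
mean_scaled = X / X.mean(axis=0)

print(z_scaled.std(axis=0))     # [1. 1.]  -- variability differences removed
print(mean_scaled.std(axis=0))  # unequal  -- variability differences preserved
```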
The discussion is in three main parts. The first describes data abstraction from corpora, the second outlines the principles of cluster analysis, and the third shows how the results of cluster analysis can be used in the formulation of hypotheses. Examples are based on the Newcastle Electronic Corpus of Tyneside English (NECTE), a corpus of dialect speech (Allen et al. 2007). The overall approach is introductory, and as such the aim has been to make the material accessible to as broad a readership as possible.
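As a concrete illustration of the clustering step, the sketch below runs a hierarchical cluster analysis on a small speakers-by-features frequency matrix using SciPy; the matrix, the speaker labels, and the choice of Ward linkage are illustrative assumptions, not values or settings taken from NECTE.

```python
# Minimal sketch of hierarchical cluster analysis on a frequency matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical speakers-by-phonetic-feature frequency matrix.
data = np.array([
    [12,  3,  7,  1],
    [11,  4,  6,  2],
    [ 2, 14,  1,  9],
    [ 1, 15,  2,  8],
])
labels = ["speaker_A", "speaker_B", "speaker_C", "speaker_D"]

# Agglomerative clustering with Ward's method on Euclidean distances.
Z = linkage(data, method="ward")

# Cut the tree into two clusters and inspect which speakers group together;
# the resulting grouping is the starting point for hypothesis formulation.
clusters = fcluster(Z, t=2, criterion="maxclust")
for name, cluster_id in zip(labels, clusters):
    print(name, cluster_id)
```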
This chapter describes the construction of the Newcastle Electronic Corpus of Tyneside English (NECTE), a legacy corpus based on data collected for two sociolinguistic
surveys conducted on Tyneside in the north-east of England in c.1969 and 1994,
respectively. It focusses on transcription issues relevant for addressing research
questions in phonetics/phonology. There is also discussion of the rationale for the text
encoding systems adopted in the corpus construction phase as well as the
dissemination strategy employed since completion in 2005.
Recent decades have seen an explosive production of electronically encoded information of all
kinds. In the face of this, traditional philological methods for search
and interpretation of data have been overwhelmed by volume, and a
variety of computational methods have been developed in an attempt
to make the deluge tractable. These developments have clear
implications for corpus-based linguistics in general, and for corpus-based
study of historical dialectology in particular: as more and larger
historical text corpora become available, effective analysis of them
will increasingly be tractable only by adapting the interpretative
methods developed by the statistical (Hair et al. 2005; Tabachnick &
Fidell 2006), information retrieval (Belew 2000; Grossman & Frieder
2004), pattern recognition (Bishop 2006), and related communities.
To use such analytical methods effectively, however, issues that arise
with respect to the abstraction of data from corpora have to be
understood. This paper addresses an issue that has a fundamental
bearing on the validity of analytical results based on such data:
variation in document length. The discussion is in four main parts.
The first part shows how a particular class of computational methods,
exploratory multivariate analysis, can be used in historical
dialectology research, the second explains why variation in document
length can be a problem in such analysis, the third proposes document
length normalization as a solution to that problem, and the fourth
points out some difficulties associated with document length
normalization.
This chapter gives an overview of a particular class of analytical tool: exploratory multivariate analysis. The discussion is
in six main parts. The first part is the present
introduction, the second explains what is meant by
exploratory multivariate analysis, the third discusses the
characteristics of data and the implications of these
characteristics for generation and interpretation of
analytical results, the fourth gives an overview of the
various exploratory analytical methods currently
available, the fifth reviews the application of exploratory
multivariate analysis in corpus linguistics, and the sixth
is a select bibliography. The material is presented in an
intuitively accessible way, avoiding formalisms as much
as possible. However, in order to work with multivariate
analytical methods, some background in mathematics and statistics is required.
A variety of computational technologies has been developed for the search and interpretation of digital text. Variation in document length can be a problem for these technologies, and
several normalization methods for mitigating its effects have been proposed. This paper assesses the
effectiveness of such methods in specific relation to exploratory multivariate analysis. The discussion
is in four main parts. The first part states the problem, the second describes some normalization
methods, the third identifies poor estimation of the population probability of variables as a factor that
compromises the effectiveness of the normalization methods for very short documents, and the fourth
proposes elimination of data matrix rows representing documents which are too short to be reliably
normalized and suggests ways of identifying those documents.
This paper addresses an issue that has a fundamental bearing on the validity of analytical results based on data abstracted from corpora: sparsity. The discussion is
in three main parts. The first part shows how a particular class of
computational methods, exploratory multivariate analysis, can be used in
language variation research, the second explains why data sparsity can be a
problem in such analysis, and the third outlines some solutions.
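One simple way of quantifying the sparsity at issue is shown below; the frequency matrix is invented for illustration, and measuring sparsity as the proportion of zero-valued cells is an assumption rather than necessarily the definition used in the paper.

```python
# Minimal sketch of quantifying sparsity in a documents-by-features matrix.
import numpy as np

# Hypothetical frequency matrix: many features are unattested in many documents.
freq = np.array([
    [3, 0, 0, 1, 0, 0],
    [0, 0, 2, 0, 0, 0],
    [1, 0, 0, 0, 0, 4],
])

# Sparsity as the proportion of zero-valued cells: high values indicate that
# many variables are unattested in many documents, which undermines the
# distance calculations used in exploratory multivariate analysis.
sparsity = np.count_nonzero(freq == 0) / freq.size
print(f"sparsity: {sparsity:.2f}")   # 0.72
```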