12-15 - Working Group final reports
2015
Abstract
Working group reports from the four communities represented at the workshop: (1) Archivists, (2) Journal Editors, (3) IT/Big Data, (4) Ordinary Working Linguists. Presented at the first workshop on Developing Standards for Data Citation and Attribution for Reproducible Research in Linguistics, held at the University of Colorado at Boulder, September 18-20, 2015.
Related papers
Corpus Pragmatics
Alongside advances in technology and the advent of computers in language studies, we have witnessed a boom in new books on Corpus Linguistics (see for example Dash & Ramamoorthy, 2019; Paquot & Gries, 2020; Seoane & Biber, 2021). Among the informative books in this fast-growing field is the current one, authored by Barth and Schnell in 2022. The book is organized in 11 chapters, which provide readers with state-of-the-art concepts of theory and practice for conducting research in Corpus Linguistics. The first two chapters function as an introduction in which the authors succinctly shed light on the basic concept of a corpus, its divergence from other approaches, and its convergence with other usage-oriented fields within linguistics such as Sociolinguistics, Linguistic Typology and Language Change. The authors define corpus and Corpus Linguistics, word, lexeme, type and token, as well as basic statistical concepts such as mode, mean and median. They then distinguish between structural context, syntagmatic context and constructional context in order to delineate the role of context in corpora. There are different types of corpora with specific composition criteria, which need to be delineated for readers. In this regard, Chapter three, which is thematically divided into two parts, gives a detailed description of corpus composition criteria and typology. In the first part, the authors enumerate size, balance, representativeness, authenticity and spontaneity as the core criteria for compiling a corpus. Furthermore, a subtle distinction is made between raw, primary
2012
Research in the Humanities is predominantly text-based. For centuries scholars have studied documents such as historical manuscripts, literary works, legal contracts, diaries of important personalities, old tax records etc. Manual analysis of such documents is still the dominant research paradigm in the Humanities. However, with the advent of the digital age this is increasingly complemented by approaches that utilise digital resources. More and more corpora are made available in digital form (theatrical plays, contemporary novels, critical literature, literary reviews etc.). This has a potentially profound impact on how research is conducted in the Humanities. Digitised sources can be searched more easily than traditional, paper-based sources, allowing scholars to analyse texts quicker and more systematically. Moreover, digital data can also be (semi-)automatically mined: important facts, trends and interdependencies can be detected, complex statistics can be calculated and the results can be visualised and presented to the scholars, who can then delve further into the data for verification and deeper analysis. Digitisation encourages empirical research, opening the road for completely new research paradigms that exploit `big data' for humanities research. This has also given rise to Digital Humanities (or E-Humanities) as a new research area. Digitisation is only a first step, however. In their raw form, electronic corpora are of limited use to humanities researchers. The true potential of such resources is only unlocked if corpora are enriched with different layers of linguistic annotation (ranging from morphology to semantics). While corpus annotation can build on a long tradition in (corpus) linguistics and computational linguistics, corpus and computational linguistics on the one side and the Humanities on the other side have grown apart over the past decades. 
We believe that a tighter collaboration between people working in the Humanities and the research community involved in developing annotated corpora is now needed because, while annotating a corpus from scratch still remains a labor-intensive and time-consuming task, today this is simplified by intensively exploiting prior experience in the field. Indeed, such a collaboration is still quite far from being achieved, as a gap still holds between computational linguists (who sometimes do not involve humanists in The ACRH-2 Co-Chairs and Organisers
2021
This preprint contains the text of a submission of written evidence to the UK Parliament, House of Commons Science and Technology Committee inquiry on reproducibility and research integrity (submitted: 24 September 2021. Viewable on the parliament website). In our review of the breadth of the reproducibility crisis within applied linguistics, we emphasise the necessity for full disclosure of data and code as well as full provision of experimental materials and protocols. We also highlight the critical role research funders have in supporting the field-specific open digital infrastructures which are needed to support research reproducibility. Finally, we call for a concerted effort to reduce the power of the large publishing houses and support society-led publishing efforts, and non-profit publication platforms.
2013
(Father Busa in a picture taken in Gallarate, 1956) Preface. Over three consecutive years, the workshop on Annotation of Corpora for Research in the Humanities (ACRH) has established itself as an occasion to foster cooperation between historical, philological and linguistic studies and current corpus and computational linguistics. On the one hand, we started from the impression that there is an undeniable similarity between how the form and meaning of the documents are examined by scholars in the Humanities and the task of corpus annotation. On the other hand, historical and literary documents are complex artifacts that require a multidisciplinary approach.
Corpus linguistics has revolutionised our way of working in historical linguistics. The painstaking job of collecting data and manually analysing them has been made less arduous with the introduction of the machine processing of corpora, which allows for quick and efficient searches. The aim of the present study is two-fold: to show how corpus linguistics has contributed to the ways in which researchers approach the study of the history of English, and to provide an overview of selected corpora available in the field. Setting aside the theoretical debate as to whether corpus linguistics should be considered merely a methodology, a branch of linguistics, or both (Taylor, 2008), it is widely acknowledged that corpus linguistics is of considerable help in any branch of linguistics, be it theoretical or applied. The use of corpora makes it possible to test hypotheses established within a specific linguistic area through the fast and reliable analysis of vast pools of material. As a result, the objective measurement of data is available to scholars, who can thus verify their hypotheses and intuitions, and can quickly amend or qualify their research claims if previous ones are seen to be falsifiable. There is, then, a continuous interaction within theory, as expressed in linguistic postulates, concepts and hypotheses, and an application and validation of these theoretical principles through the use of linguistic corpus analysis. The use of corpora is perhaps a more powerful instrument in the field of historical linguistics than in other fields, since the absence of living informants here makes judgements based on intuitions unreliable, and claims have to be empirically attested using data. This data can be extracted from systematically compiled collections of machine-readable texts, called corpora. However, in considering these undeniably advantageous working tools, some caveats should be borne in mind, as will be discussed in what follows.
System, 2011
The stated purpose of the Routledge Handbooks in Applied Linguistics is to 'provide comprehensive overviews of the key topics in applied linguistics' and it is claimed that they offer 'the ideal resource for both advanced undergraduates and postgraduate students'. Thus, they are conceived primarily as a resource for learners in the field, presumably those who are at a relatively early stage of their corpus work, or deciding whether to undertake a research project in the area. One of the questions, therefore, that this review needs to address is whether, as a whole, the Handbook of Corpus Linguistics provides a good introduction to the field. Is it accessible, informative and reasonably comprehensive, but perhaps equally important, would a novice corpus linguist find it inspiring, exciting and stimulating? The handbook contains 45 chapters divided into eight sections, each of which is in turn composed of several chapters. A general introductory section is followed by seven sections which focus on single areas: building and designing a corpus; analysing a corpus; using a corpus for language research; using a corpus for language pedagogy; designing corpus-based materials for the language classroom; using corpora to study literature and translation; and applying corpus linguistics to other areas of research. There is, then, a clear progression from more basic and general chapters in the first four sections, which deal with the fundamentals of corpus use, to a narrower focus on pedagogic applications in sections V and VI and finally to other specialist applications in the last two sections. One of the advantages of the wide spread of topics and clear organisation is that it provides the student reader with a number of possible entry points to corpus linguistics and these are easily accessible in a single volume. 
After reading the introductory section, for example, an aspiring corpus builder would find it useful to begin with the relevant practical chapter on corpus construction, progressing via the analysis section to an appropriate specialist chapter from the second half of the book, while a student selecting an area for project work might want to get an initial overview of the variety of corpus applications and would find the final sections of most interest. Users are most likely to dip into this volume as need arises rather than read it extensively and, in fact, there is some overlap between chapters (e.g. both O'Keeffe/McCarthy and Tribble discuss the history of concordances). However, this is probably more noticeable to the reviewer who reads the entire work than to the student who consults only certain chapters. In addition to the list of references, each chapter provides 'Further reading', which singles out and describes a few particularly important works. This guidance is especially helpful for beginners, since long lists of references can be overwhelming. Chapters are also well referenced to each other and the index is extensive and thorough. In a volume of this size and diversity, there will always be variation among the chapters, with some more successful than others. Overall, however, the handbook certainly provides an excellent overview of the topics included; all the chapters are accessible and informative for students at the target levels, while several also manage to convey the fascination and excitement of corpus investigation. There is, however, a notable omission. I was surprised that there is no chapter dedicated to the use and interpretation of statistics. As students of applied linguistics, many of the target readers may well be unused to dealing with statistical information and are certainly unlikely to be familiar with specific measures such as mutual information and log-likelihood. 
Given the importance of statistics in corpus linguistics, it would have been useful to include an explanation and discussion of such operations. Although some chapters, e.g. those by Scott and Walter, do describe certain statistical measures, the information is spread throughout the volume and is therefore not easily retrievable. There is, for example, no entry in the index for 'statistics'. In the individual chapters, it is noticeable that contributors seem to adopt one of two main approaches. Some writers take a broad-brush approach, attempting to cover all aspects of the topic and mentioning as many of the most significant studies as possible (e.g. Cheng, McIntyre and Walker, Moon), while others aim for a more in-depth account, going into greater detail about fewer studies (e.g. Atkins and Harvey, Clancy, Rühlemann). There are both advantages and drawbacks to each approach. The value of attempting wide coverage is that the reader gets a clear idea
Research in corpus linguistics, 2023
2006
Since there is currently no gold-standard annotated corpus for this objective, it is necessary to build one, to allow the generation and testing of automatic systems for classifying the purpose or function of a citation referenced in an article. The development of this kind of corpus is subject to two conditions. The first is to present a clear and unambiguous classification scheme. The second is to secure an initial manual labeling process that reaches sufficient inter-coder agreement among annotators to validate the annotation scheme and to make it reproducible even with coders who do not know in depth the topic of the analyzed articles. This paper proposes and validates a methodology for corpus annotation for citation classification in scientific literature that facilitates annotation and produces substantial inter-annotator agreement.
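As an illustration of how inter-coder agreement on such a labeling task might be checked, here is a minimal sketch using Cohen's kappa for two coders. The abstract does not say which agreement statistic the paper uses, so the choice of kappa is an assumption, and the function name `cohen_kappa` and the citation-function labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's label proportions, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical citation-function labels assigned by two coders to six citations.
coder_a = ["background", "use", "use", "comparison", "background", "use"]
coder_b = ["background", "use", "comparison", "comparison", "background", "background"]
kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

On the commonly cited Landis and Koch scale, values between 0.61 and 0.80 are described as "substantial" agreement. Whether the abstract uses the word in that sense is not stated.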
