12-15 - Working Group final reports

Mandana Seyfeddinipur

2015

Abstract

Working group reports from the four communities represented at the workshop: (1) Archivists, (2) Journal Editors, (3) IT/Big Data, (4) Ordinary Working Linguists. Presented at the first workshop on Developing Standards for Data Citation and Attribution for Reproducible Research in Linguistics, held at the University of Colorado at Boulder from September 18 to 20, 2015.

Amelia Joulain

Ole Schützler and Julia Schlüter (eds.). Data and methods in corpus linguistics. Comparative approaches. Cambridge: Cambridge University Press, 2022. 357 pp. ISBN 978-1-10849964-4

Matthias Eitelmann

ICAME Journal

Understanding Corpus Linguistics by Danielle Barth & Stefan Schnell, 2022

Zahra Ghane

Corpus Pragmatics

Alongside advances in technology and the advent of computers in language studies, we have witnessed a boom in new books on Corpus Linguistics (see, for example, Dash & Ramamoorthy, 2019; Paquot & Gries, 2020; Seoane & Biber, 2021). Among the informative books in this fast-growing field is the current one, authored by Barth and Schnell in 2022. The book is organized in 11 chapters, which provide readers with state-of-the-art concepts of theory and practice for conducting research in Corpus Linguistics. The first two chapters function as an introduction in which the authors succinctly shed light on the basic concept of a corpus, its divergence from other approaches, and its convergence with other usage-oriented fields within linguistics such as Sociolinguistics, Linguistic Typology and Language Change. The authors provide the reader with definitions of corpus and Corpus Linguistics, word, lexeme, type and token, as well as some basic statistical concepts such as mode, mean and median. Later on, the authors distinguish between structural context, syntagmatic context and constructional context in order to delineate the role of context in corpora. There are different types of corpora with specific composition criteria, which need to be delineated for readers. In this regard, Chapter three, which is thematically divided into two parts, is a detailed description of corpus composition criteria and typology. In the first part, the authors enumerate size, balance, representativeness, authenticity and spontaneity as the core criteria for compiling a corpus. Furthermore, a subtle distinction is made between raw, primary

Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2)

Francesco Mambrini

2012

Research in the Humanities is predominantly text-based. For centuries scholars have studied documents such as historical manuscripts, literary works, legal contracts, diaries of important personalities, old tax records etc. Manual analysis of such documents is still the dominant research paradigm in the Humanities. However, with the advent of the digital age this is increasingly complemented by approaches that utilise digital resources. More and more corpora are made available in digital form (theatrical plays, contemporary novels, critical literature, literary reviews etc.). This has a potentially profound impact on how research is conducted in the Humanities. Digitised sources can be searched more easily than traditional, paper-based sources, allowing scholars to analyse texts quicker and more systematically. Moreover, digital data can also be (semi-)automatically mined: important facts, trends and interdependencies can be detected, complex statistics can be calculated and the results can be visualised and presented to the scholars, who can then delve further into the data for verification and deeper analysis. Digitisation encourages empirical research, opening the road for completely new research paradigms that exploit `big data' for humanities research. This has also given rise to Digital Humanities (or E-Humanities) as a new research area. Digitisation is only a first step, however. In their raw form, electronic corpora are of limited use to humanities researchers. The true potential of such resources is only unlocked if corpora are enriched with different layers of linguistic annotation (ranging from morphology to semantics). While corpus annotation can build on a long tradition in (corpus) linguistics and computational linguistics, corpus and computational linguistics on the one side and the Humanities on the other side have grown apart over the past decades. 
We believe that a tighter collaboration between people working in the Humanities and the research community involved in developing annotated corpora is now needed because, while annotating a corpus from scratch still remains a labor-intensive and time-consuming task, today this is simplified by intensively exploiting prior experience in the field. Indeed, such a collaboration is still quite far from being achieved, as a gap still holds between computational linguists (who sometimes do not involve humanists in

The ACRH-2 Co-Chairs and Organisers

Reproducibility and research integrity in applied linguistics

Cylcia Bolibaugh

2021

This preprint contains the text of a submission of written evidence to the UK Parliament, House of Commons Science and Technology Committee inquiry on reproducibility and research integrity (submitted: 24 September 2021. Viewable on the parliament website). In our review of the breadth of the reproducibility crisis within applied linguistics, we emphasise the necessity for full disclosure of data and code as well as full provision of experimental materials and protocols. We also highlight the critical role research funders have in supporting the field-specific open digital infrastructures which are needed to support research reproducibility. Finally, we call for a concerted effort to reduce the power of the large publishing houses and support society-led publishing efforts, and non-profit publication platforms.

Proceedings of The Third Workshop on Annotation of Corpora for Research in the Humanities (ACRH-3)

Francesco Mambrini

2013

(Father Busa in a picture taken in Gallarate, 1956)

Preface

Over three consecutive years, the workshop on Annotation of Corpora for Research in the Humanities (ACRH) has established itself as an occasion to foster cooperation between historical, philological and linguistic studies and current corpus and computational linguistics. On the one hand, we started from the impression that there is an undeniable similarity between how the form and meaning of documents are examined by scholars in the Humanities and the task of corpus annotation. On the other hand, historical and literary documents are complex artifacts that require a multidisciplinary approach.

"Corpus Linguistics and the History of English: When the Past Meets the Future". 2016. (With Begoña Crespo)

Isabel de la Cruz Cabanillas, Begoña Crespo

Corpus linguistics has revolutionised our way of working in historical linguistics. The painstaking job of collecting data and manually analysing them has been made less arduous with the introduction of the machine processing of corpora, which allows for quick and efficient searches. The aim of the present study is two-fold: to show how corpus linguistics has contributed to the ways in which researchers approach the study of the history of English, and to provide an overview of selected corpora available in the field. Setting aside the theoretical debate as to whether corpus linguistics should be considered merely a methodology, a branch of linguistics, or both (Taylor, 2008), it is widely acknowledged that corpus linguistics is of considerable help in any branch of linguistics, be it theoretical or applied. The use of corpora makes it possible to test hypotheses established within a specific linguistic area through the fast and reliable analysis of vast pools of material. As a result, the objective measurement of data is available to scholars, who can thus verify their hypotheses and intuitions, and can quickly amend or qualify their research claims if previous ones are seen to be falsifiable. There is, then, a continuous interaction within theory, as expressed in linguistic postulates, concepts and hypotheses, and an application and validation of these theoretical principles through the use of linguistic corpus analysis. The use of corpora is perhaps a more powerful instrument in the field of historical linguistics than in other fields, since the absence of living informants here makes judgements based on intuitions unreliable, and claims have to be empirically attested using data. This data can be extracted from systematically compiled collections of machine-readable texts, called corpora. However, in considering these undeniably advantageous working tools, some caveats should be borne in mind, as will be discussed in what follows.

The Routledge Handbook of Corpus Linguistics

Anne O'Keeffe

System, 2011

The stated purpose of the Routledge Handbooks in Applied Linguistics is to 'provide comprehensive overviews of the key topics in applied linguistics' and it is claimed that they offer 'the ideal resource for both advanced undergraduates and postgraduate students'. Thus, they are conceived primarily as a resource for learners in the field, presumably those who are at a relatively early stage of their corpus work, or deciding whether to undertake a research project in the area. One of the questions, therefore, that this review needs to address is whether, as a whole, the Handbook of Corpus Linguistics provides a good introduction to the field. Is it accessible, informative and reasonably comprehensive, but perhaps equally important, would a novice corpus linguist find it inspiring, exciting and stimulating? The handbook contains 45 chapters divided into eight sections, each of which is in turn composed of several chapters. A general introductory section is followed by seven sections which focus on single areas: building and designing a corpus; analysing a corpus; using a corpus for language research; using a corpus for language pedagogy; designing corpus-based materials for the language classroom; using corpora to study literature and translation; and applying corpus linguistics to other areas of research. There is, then, a clear progression from more basic and general chapters in the first four sections, which deal with the fundamentals of corpus use, to a narrower focus on pedagogic applications in sections V and VI and finally to other specialist applications in the last two sections. One of the advantages of the wide spread of topics and clear organisation is that it provides the student reader with a number of possible entry points to corpus linguistics and these are easily accessible in a single volume. 
After reading the introductory section, for example, an aspiring corpus builder would find it useful to begin with the relevant practical chapter on corpus construction, progressing via the analysis section to an appropriate specialist chapter from the second half of the book, while a student selecting an area for project work might want to get an initial overview of the variety of corpus applications and would find the final sections of most interest. Users are most likely to dip into this volume as need arises rather than read it extensively and, in fact, there is some overlap between chapters (e.g. both O'Keeffe/McCarthy and Tribble discuss the history of concordances). However, this is probably more noticeable to the reviewer who reads the entire work than to the student who consults only certain chapters. In addition to the list of references, each chapter provides 'Further reading', which singles out and describes a few particularly important works. This guidance is especially helpful for beginners, since long lists of references can be overwhelming. Chapters are also well referenced to each other and the index is extensive and thorough. In a volume of this size and diversity, there will always be variation among the chapters, with some more successful than others. Overall, however, the handbook certainly provides an excellent overview of the topics included; all the chapters are accessible and informative for students at the target levels, while several also manage to convey the fascination and excitement of corpus investigation. There is, however, a notable omission. I was surprised that there is no chapter dedicated to the use and interpretation of statistics. As students of applied linguistics, many of the target readers may well be unused to dealing with statistical information and are certainly unlikely to be familiar with specific measures such as mutual information and log-likelihood. 
Given the importance of statistics in corpus linguistics, it would have been useful to include an explanation and discussion of such operations. Although some chapters, e.g. those by Scott and Walter, do describe certain statistical measures, the information is spread throughout the volume and is therefore not easily retrievable. There is, for example, no entry in the index for 'statistics'. In the individual chapters, it is noticeable that contributors seem to adopt one of two main approaches. Some writers take a broad-brush approach, attempting to cover all aspects of the topic and mentioning as many of the most significant studies as possible (e.g. Cheng, McIntyre and Walker, Moon), while others aim for a more in-depth account, going into greater detail about fewer studies (e.g. Atkins and Harvey, Clancy, Rühlemann). There are both advantages and drawbacks to each approach. The value of attempting wide coverage is that the reader gets a clear idea

Review of Egbert, Jesse, Douglas Biber and Bethany Gray. 2022. Designing and Evaluating Language Corpora: A Practical Framework for Corpus Representativeness. Cambridge: Cambridge University Press. ISBN: 978-1-107-15138-3. DOI: https://doi.org/10.1017/9781316584880

Javier Pérez-Guerra

Research in corpus linguistics, 2023

[Encyclopedia of Language & Linguistics] Volume 24 || Association for Computational Linguistics

Graeme Hirst

2006

Since there is currently no gold-standard annotated corpus for this objective, one must be built to allow the generation and testing of automatic systems for classifying the purpose or function of a citation referenced in an article. The development of this kind of corpus is subject to two conditions. The first is to present a clear and unambiguous classification scheme. The second is to secure an initial manual labeling process that reaches sufficient inter-coder agreement among annotators to validate the annotation scheme and to make it reproducible even with coders who do not know in depth the topic of the analyzed articles. This paper proposes and validates a methodology for corpus annotation for citation classification in scientific literature that facilitates annotation and produces substantial inter-annotator agreement.

Shobhana L Chelliah

Linguistics

This paper is a position statement on reproducible research in linguistics, including data citation and attribution, that represents the collective views of some 41 colleagues. Reproducibility can play a key role in increasing verification and accountability in linguistic research, and is a hallmark of social science research that is currently under-represented in our field. We believe that we need to take time as a discipline to clearly articulate our expectations for how linguistic data are managed, cited, and maintained for long-term access.

Corpus Linguistics as Digital Scholarship: Big Data, Rich Data and Uncharted Data

Irma Taavitsainen

From Data to Evidence in English Language Research, 2018

The past ten years have seen the rapid rise of Digital Humanities (DH), which currently subsumes a wide range of digital activities in various humanities disciplines, including linguistics and philology. One often-quoted definition of DH comes from the UCLA Digital Humanities Program, which states that: "Digital Humanities interprets the cultural and social impact of new media and information technologies-the fundamental components of the new information age-as well as creates and applies these technologies to answer cultural, social, historical, and philological questions, both those traditionally conceived and those only enabled by new technologies." Edward Vanhouette (2013) traces various strands of DH back to the common denominator of Humanities Computing. Many series of publications were launched in this multidisciplinary field, which also linked linguistic research and computers. But as computers have become the standard tools of the trade, they tend to be replaced in publication titles by the more data- and technology-oriented label "digital". For example, the journal Literary and Linguistic Computing is now Digital Scholarship in the Humanities, the change of title "reflecting the huge changes that have taken place over recent years". Computers continue to be part of the title of the book series that publishes this volume, which was founded in 1988 with the title Language and Computers: Studies in Practical Linguistics and dedicated to "corpus linguistics and related areas". In 2016 the subtitle of the series was changed to Studies in Digital Linguistics. The series homepage updates its current agenda by saying that "a comprehensive digitization of our textual universe" calls for "a concerted research effort uniting linguistics and other disciplines involved in language-related research." In this interdisciplinary context we may ask whether the term "corpus linguistics" has by now outlived its usefulness. We would not be the first to ask this question. 
It was already raised by Jan Aarts in response to Nancy Belmore's query in the Corpora list twenty years ago in 1998. The point made by Aarts, and revisited by Antoinette Renouf in her contribution to this volume, was that it "is

Reproducibility in Computational Linguistics: Are We Willing to Share?

Josine Rawee

Computational Linguistics, 2018

This study focuses on an essential precondition for reproducibility in computational linguistics: the willingness of authors to share relevant source code and data. Ten years after Ted Pedersen's influential "Last Words" contribution in Computational Linguistics, we investigate to what extent researchers in computational linguistics are willing and able to share their data and code. We surveyed all 395 full papers presented at the 2011 and 2016 ACL Annual Meetings, and identified whether links to data and code were provided. If working links were not provided, authors were requested to provide this information. Although data were often available, code was shared less often. When working links to code or data were not provided in the paper, authors provided the code in about one third of cases. For a selection of ten papers, we attempted to reproduce the results using the provided data and code. We were able to reproduce the results approximately for six papers. For only a single paper did we obtain the exact same results. Our findings show that even though the situation appears to have improved comparing 2016 to 2011, empiricism in computational linguistics still largely remains a matter of faith. Nevertheless, we are somewhat optimistic about the future. Ensuring reproducibility is not only important for the field as a whole, but also seems worthwhile for individual researchers: The median citation count for studies with working links to the source code is higher.

4 Corpus Linguistics Outcomes and Applications in the Digital Era

Diana Otat

2016

Avoiding data graveyards: From heterogeneous data collected in multiple research projects to sustainable linguistic resources

Christian Chiarcos

6th E-MELD …, 2006

A review of Jenset G.B., McGillivray B. Quantitative Historical Linguistics: A Corpus Framework. Oxford University Press, 2017.

Dmitry Nikolaev

Voprosy jazykoznanija, 2020

Inter-coder agreement for computational linguistics

Massimo Poesio

Computational Linguistics, 2008

A shortened version of this article was submitted to the journal Computational Linguistics; this is the full version.

Issues in corpus creation and distribution: The evolution of the linguistic data consortium

Mark Liberman

2000

The Linguistic Data Consortium (LDC) is a non-profit consortium of universities, companies and government research laboratories that supports education, research and technology development in language-related disciplines by collecting or creating, distributing, and archiving language resources, including data and accompanying tools, standards and formats.

Development of Linguistic Linked Open Data Resources for Collaborative Data-Intensive Research in the Language Sciences

Christian Chiarcos

2020

We thank William J. Badecker, NSF Linguistics Program director, whose advice has been invaluable during all stages of this project, and Marc Lowenthal and Anthony Zannino at MIT Press, who guided us to publication. We thank Amy Brand, director of MIT Press, whose pursuit of the development of scholarly communication in the digital age provided support for our project. The Cornell University Library, through Oya Rieger, provided continual advice and support, as well as a critical dimension of library-researcher relations, continuing the early vision of previous Mann Library director, Janet McCue. Emily Bernardski provided key support and coordination for the workshop, as did Carissa Kang and Jonathan Masci, our student support team. Our editors Michelle Melanson and Rebecca Rich Goldweber provided invaluable assistance in volume publication. James Gair provided continual support throughout.

Corpora for computational linguistics

Ruslan Mitkov

2008
