A foundational aspect of documenting an endangered language and preserving that documentation for long-term access is identifying the language itself. The web version of the Ethnologue has become the de facto standard for identifying the more than 6,800 languages spoken in the world today. The system of three-letter codes that uniquely identify each language has been used within SIL for nearly three decades as an in-house standard, but there is now increasing demand for these codes to be used by other organizations and projects. This paper describes four changes that SIL International is implementing in order to make its set of language identification codes better meet the needs of the wider community. The changes seek to strike a balance between becoming more open and becoming more disciplined.
The digital language archiving enterprise is facing serious bottlenecks in scaling up the submission of new materials and the use of already archived materials. This talk explores the strategies of separation of concerns and automation of services in developing an infrastructure for interoperation that can break these bottlenecks.
The Open Language Archives Community (OLAC) provides a comprehensive infrastructure that has allowed our community to index and discover language resources over the past 20 years. However, OLAC infrastructure has fallen behind as the digital libraries community has continued to evolve. New investment is required in order to move OLAC into the digital libraries mainstream. This paper reports on the first 20 years of OLAC and on an agenda leading to a more sustainable future for open language archiving.
The Open Language Archives Community: a 20-year update
The Electronic Library
Purpose: This paper reports on the first 20 years of the Open Language Archives Community (OLAC), a comprehensive infrastructure for indexing and discovering language resources. Design/methodology/approach: We begin with the original vision, assess progress relative to the original requirements, and identify ongoing challenges. Findings: Based on an overview of OLAC history and recent developments and on an analysis of the situation in the language archives area as a whole, the authors propose an agenda for a more sustainable future for open language archiving. Originality/value: This paper examines the progress of OLAC and discusses improvements in such areas as participation, access, and sustainability.
The users of endangered languages struggle to thrive in a digitally-mediated world. We have developed an automated method for assessing how well every language recognized by ISO 639 is faring in terms of digital language support. The assessment is based on scraping the names of supported languages from the websites of 143 digital tools selected to represent a full range of ways that digital technology can support languages. The method uses Mokken scale analysis to produce an explainable model for quantifying digital language support and monitoring it on a global scale. 3 Requirements Following Kornai's (2013) lead, we seek to develop an automated method for assessing digital
On the Verge of Major Business Re-Engineering. "Insanity is doing the same thing over and over again and expecting different results." (Albert Einstein) Seven years ago the senior leadership at SIL International (see Chart 1), a not-for-profit whose purpose is to facilitate language-based development among the peoples of the world, determined that it was time to build an integrated Enterprise Information System. There were three precipitating factors: mission-critical IT systems were almost twenty years old and on the verge of obsolescence, their landscape was dotted with dozens of silo systems, and commitments to new strategic directions demanded significant business re-engineering.
We propose a model for a Resource Description Framework (RDF) database for interlinear glossed text (IGT) created from documents encoded in the Extensible Markup Language (XML) using markup metaschemas. A metaschema, constructed using the Semantic Interpretation Language (SIL) (Simons 2004), maps XML-encoded documents to a common, semantically rich RDF database. The RDF database in turn can be searched using RDF search engines, providing the key functionality of a database management system (DBMS). Simons et al. (2004) give a proof of concept of the model by mapping differently encoded XML lexicons to a common RDF form. Search capability is provided across these data using SeRQL, a SQL-like query language built around the Sesame RDF database program. In this paper, we extend these results to corpora of interlinear glossed text obtained from various sources, including some from the Web following Lewis (2003), combined with a language profile for each language variety, which provides basic grammatical information about that variety.
Language resources are the bread and butter of language documentation and linguistic investigation. They include the primary objects of study such as texts and recordings, the outputs of research such as dictionaries and grammars, and the enabling technologies such as software tools and interchange standards. Increasingly, these resources are maintained in digital form and distributed via the web. However, searching on the web for language resources is a hit-and-miss affair. One problem is that many online resources are hidden behind interfaces to databases, with the result that only a fraction of these resources are being indexed by search engines (He and others 2007). Even when resources are exposed to online search engines, they may not be discoverable since they are described in ad hoc ways that prevent searches from retrieving the desired results with high recall or precision. This paper describes work being done in the context of the Open Language Archives Community (OLAC) to develop a service that uses text mining methods (Weiss and others 2005) to find language resources located within the hidden web of institutional repositories. It then uses the OLAC infrastructure to expose them on the open web and make them discoverable through precise search.
Proceedings of the International Workshop on Digital Language Archives: LangArc 2021, 2021
Papers by Gary Simons