Mining language resources from institutional repositories

Gary Simons

Abstract

Language resources are the bread and butter of language documentation and linguistic investigation. They include the primary objects of study such as texts and recordings, the outputs of research such as dictionaries and grammars, and the enabling technologies such as software tools and interchange standards. Increasingly, these resources are maintained in digital form and distributed via the web. However, searching the web for language resources is a hit-and-miss affair. One problem is that many online resources are hidden behind database interfaces, so that only a fraction of them are indexed by search engines (He and others 2007). Even when resources are exposed to search engines, they may not be discoverable, because they are described in ad hoc ways that prevent searches from retrieving the desired results with high recall or precision. This paper describes work being done in the context of the Open Language Archives Community (OLAC) to develop a service that uses text mining methods (Weiss and others 2005) to find language resources located within the hidden web of institutional repositories. The service then uses the OLAC infrastructure to expose them on the open web and make them discoverable through precise search.

References

Bird, Steven and Gary Simons. 2004. Building an Open Language Archives Community on the DC Foundation. In D. I. Hillmann and E. L. Westbrooks, eds., Metadata in Practice, pp. 203-222. Chicago: American Library Association. <http://www.ldc.upenn.edu/sb/home/papers/mip.pdf>
He, Bin, Mitesh Patel, Zhen Zhang, and Kevin Chen-Chuan Chang. 2007. Accessing the deep web. Communications of the ACM 50(5): 95-101.
Simons, Gary and Steven Bird. 2003. Building an Open Language Archives Community on the OAI Foundation. Library Hi Tech 21(2): 210-218. <http://arxiv.org/abs/cs.CL/0302021>
Weiss, Sholom M., Nitin Indurkhya, Tong Zhang, and Fred J. Damerau. 2005. Text Mining: Predictive Methods for Analyzing Unstructured Information. New York: Springer.
