A Document Recognition System for Early Modern Latin
2006, Chicago Colloquium on Digital Humanities and …
Abstract
Large-scale digitization of manuscripts is facilitated by high-accuracy optical character recognition (OCR) engines. The focus of our work is on using these tools to digitize Latin texts. Many of the texts in the language, especially the early modern, make heavy use of special characters ...
Related papers
2021
Book digitization is being increasingly enhanced, as it facilitates not only the dissemination and preservation of cultural heritage but also the analysis of large amounts of textual data as well as the extraction and discovery of knowledge in a faster, dynamic and interactive way. Quite often, OCR, as the core technology of book digitization, has to address major difficulties related to the condition of the primary source or to scanning issues. The main contribution of this paper is to provide an extensive study on Tesseract, an open-source OCR system, including image pre-processing and text post-processing methods, that overcome a variety of image handling problems. Additionally, a re-trained Greek language model, based on individual fonts training plus pairs of image-text training, is being provided. Finally, this paper proposes a pipeline of methods, including text line detection, that result in enhanced accuracy for Greek Literature documents, even when they consist of distorte...
Jurnal Elektronika dan Telekomunikasi
Printed media remain popular in today's society, but they have several drawbacks. For example, they consume large amounts of storage space, which leads to high maintenance costs. To keep printed information efficient to store and long-lasting, people usually convert it into digital format. In this paper, we build an Optical Character Recognition (OCR) system that automatically converts an image containing a sentence in Latin characters into digital text. The system consists of several interrelated stages: preprocessing, segmentation, feature extraction, classification, modeling, and recognition. In preprocessing, a median filter removes noise from the image and Otsu's method binarizes it. Characters are then segmented using connected component labeling. An artificial neural network (ANN) is used for feature extraction to recognize the characters. The results show that this system is able to recognize the ...
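The preprocessing stage described in this abstract (median filtering followed by Otsu binarization) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `median_filter_3x3`, `otsu_threshold`, and `binarize` are hypothetical, the image is assumed to be a list of rows of 8-bit grey values, and dark pixels are assumed to be ink. Segmentation by connected component labeling and the ANN classifier are not covered.

```python
from statistics import median


def median_filter_3x3(img):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Coordinates outside the image are clamped to the nearest edge pixel.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = int(median(window))
    return out


def otsu_threshold(img):
    """Return the grey level t that maximises the between-class variance
    of the two classes {pixels <= t} and {pixels > t}."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # bright class empty, nothing left to split
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(img, t):
    """Map pixels <= t to 1 (assumed ink) and all others to 0."""
    return [[1 if v <= t else 0 for v in row] for row in img]
```

A typical call sequence under these assumptions would be `clean = median_filter_3x3(img)`, then `t = otsu_threshold(clean)`, then `binarize(clean, t)`. Production systems would normally rely on an image-processing library rather than nested Python lists.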
Zenodo (CERN European Organization for Nuclear Research), 2023
Automated text recognition is an ever-present task in the humanities. A multitude of manuscripts and prints from all epochs and cultures have neither been edited nor fully indexed. Even though a considerable number of them have been digitized, they are usually only accessible as images or PDF files. Machine-actionable transcriptions are often not available, despite being imperative for full-text search, annotation, scholarly editions and text analyses. Consequently, public institutions, including (digital) humanities scholars, require easy-to-use software solutions which enable them to perform high-quality OCR (Optical Character Recognition). This proposed full-day workshop introduces the participants to the completely open-source and free-of-charge software OCR4all which, unlike other available platforms like Transkribus, eScriptorium, or PERO-OCR, [...] project workflows exactly to the specific needs of the material at hand.
Computing Research Repository, 2010
The objective of the paper is to recognize handwritten samples of Roman numerals using the Tesseract open-source Optical Character Recognition (OCR) engine. Tesseract is trained with data samples from different persons to generate one user-independent language model representing the handwritten Roman digit set. The system is trained with 1226 digit samples collected from the different users. The performance is tested on two different datasets, one consisting of samples collected from the known users (those who prepared the training data samples) and the other consisting of handwritten data samples of unknown users. The overall recognition accuracy is 92.1% and 86.59% on these test datasets, respectively.
International Journal of Document Analysis and Recognition (IJDAR), 2007
Recognition of Old Greek Early Christian manuscripts is essential for efficient content exploitation of the valuable Old Greek Early Christian historical collections. In this paper, we focus on the problem of recognizing Old Greek manuscripts and propose a novel recognition technique that has been tested on a large number of important historical manuscript collections which are written in lowercase letters and originate from St. Catherine's Mount Sinai Monastery. Based on an open and closed cavity character representation, we propose a novel, segmentation-free, fast and efficient technique for the detection and recognition of characters and character ligatures. First, we detect open and closed cavities that exist in the skeletonized character body. Then, the classification of a specific character or character ligature is based on the protrusible segments that appear in the topological description of the character skeletons. Experimental results prove the efficiency of the proposed approach.
2019
Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition and post-processing. The drawback of these tools often is their limited applicability by non-technical users like humanist scholars and in particular the combined use of several tools in a workflow. In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. A comfortable GUI allows error corrections not only in the final output, but already in early stages to minimize error propagations. Further on, extensive configuration capabilities are provided to set the degree of automation of the workf...
arXiv (Cornell University), 2023
In this paper, we investigate the usage of fine-grained font recognition on OCR for books printed from the 15th to the 18th century. We used a newly created dataset for OCR of early printed books for which fonts are labeled with bounding boxes. We know not only the font group used for each character, but the locations of font changes as well. In books of this period, we frequently find font group changes mid-line or even mid-word that indicate changes in language. We consider 8 different font groups present in our corpus and investigate 13 different subsets: the whole dataset and text lines with a single font, multiple fonts, Roman fonts, Gothic fonts, and each of the considered fonts, respectively. We show that OCR performance is strongly impacted by font style and that selecting fine-tuned models with font group recognition has a very positive impact on the results. Moreover, we developed a system using local font group recognition in order to combine the output of multiple font recognition models, and show that while slower, this approach performs better not only on text lines composed of multiple fonts but on the ones containing a single font only as well.
After many years of scholarly study, manuscript collections continue to be an important source of novel information for scholars, concerning both the history of earlier times and the development of cultural documentation over the centuries. The D-SCRIBE project aims to support and facilitate current and future efforts in manuscript digitization and processing. It strives toward the creation of a comprehensive software product that can assist content holders in turning an archive of manuscripts into a digital collection using automated methods. In this paper, we focus on the problem of recognizing early Christian Greek manuscripts. We propose a novel digital image binarization scheme for low-quality historical documents, allowing further content exploitation in an efficient way. Based on the existence of closed cavity regions in the majority of characters and character ligatures in these scripts, we propose a novel, segmentation-free, fast and efficient technique that assists the recognition procedure by tracing and recognizing the most frequently appearing characters or character ligatures.
2004
Recognition of old Greek manuscripts is essential for quick and efficient content exploitation of the valuable old Greek historical collections. In this paper, we focus on the problem of recognizing early Christian Greek manuscripts written in lower case letters. Based on the existence of hole regions in the majority of characters and character ligatures in these scripts, we propose a novel, segmentation-free, fast and efficient technique that assists the recognition procedure by tracing and recognizing the most frequently appearing characters or character ligatures. First, we detect hole regions that exist in the character body. Then, the protrusions in the outer contour outline of the connected components that contain the character hole regions are used for the classification of the area around holes to a specific character or a character ligature. The proposed method gives highly accurate results and offers great assistance to old Greek handwritten manuscript OCR.
