A Document Recognition System for Early Modern Latin

Gregory Crane

2006, Chicago Colloquium on Digital Humanities and …

Abstract

Large-scale digitization of manuscripts is facilitated by high-accuracy optical character recognition (OCR) engines. The focus of our work is on using these tools to digitize Latin texts. Many Latin texts, especially early modern ones, make heavy use of special characters ...

Related papers

Fotini Koidaki

2021

Book digitization is being increasingly enhanced, as it facilitates not only the dissemination and preservation of cultural heritage but also the analysis of large amounts of textual data as well as the extraction and discovery of knowledge in a faster, dynamic and interactive way. Quite often, OCR, as the core technology of book digitization, has to address major difficulties related to the condition of the primary source or to scanning issues. The main contribution of this paper is to provide an extensive study on Tesseract, an open-source OCR system, including image pre-processing and text post-processing methods, that overcome a variety of image handling problems. Additionally, a re-trained Greek language model, based on individual fonts training plus pairs of image-text training, is being provided. Finally, this paper proposes a pipeline of methods, including text line detection, that result in enhanced accuracy for Greek Literature documents, even when they consist of distorte...

Latin Letters Recognition Using Optical Character Recognition to Convert Printed Media Into Digital Format

Yogha Bintoro

Jurnal Elektronika dan Telekomunikasi

Printed media is still popular in today's society. Unfortunately, such media has several drawbacks. For example, it consumes a large amount of storage space, which results in high maintenance costs. To keep printed information efficient and long-lasting, people usually convert it into digital format. In this paper, we built an Optical Character Recognition (OCR) system to automatically convert an image containing a sentence in Latin characters into digital text. The system consists of several interrelated stages, including preprocessing, segmentation, feature extraction, classifier, model and recognition. In preprocessing, a median filter is used to remove noise from the image and Otsu's method is used to binarize it. This is followed by character segmentation using connected component labeling. An artificial neural network (ANN) is used for feature extraction to recognize the characters. The result shows that this system is able to recognize the ...
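
To illustrate the binarization step named in the abstract above, here is a minimal pure-Python sketch of Otsu's method. It is not the authors' implementation. The function names `otsu_threshold` and `binarize`, the flat list of 8-bit grey values as input, and the convention that dark pixels are ink are all assumptions made for this example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises between-class variance.

    Pixels with value <= t form the dark class, pixels > t the bright class.
    `pixels` is a flat sequence of integers in range(levels) (an assumption).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey values in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty from here on
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, t):
    """Map grey values to 1 (ink, value <= t) or 0 (background)."""
    return [1 if p <= t else 0 for p in pixels]


if __name__ == "__main__":
    # Hypothetical two-tone image: 50 dark pixels and 50 bright pixels.
    image = [10] * 50 + [200] * 50
    t = otsu_threshold(image)
    print(t, sum(binarize(image, t)))
```

In practice a library routine would normally be used instead. The sketch only shows that the threshold is chosen by scanning all grey levels and keeping the one that best separates the two classes.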

OCR4all - Open-Source OCR and HTR Across the Centuries

Florian Langhanki

Zenodo (CERN European Organization for Nuclear Research), 2023

Automated text recognition is an ever-present task in the humanities. A multitude of manuscripts and prints from all epochs and cultures have neither been edited nor fully indexed. Even though a considerable amount of them have been digitized, they are usually only accessible as images or PDF files. Machine-actionable transcriptions are often not available, despite being imperative for full-text search, annotation, scholarly editions and text analyses. Consequently, public institutions, including (digital) humanities scholars, require easy to use software solutions which enable them to perform high-quality OCR (Optical Character Recognition). This proposed full-day workshop introduces the participants to the completely open-source and free of charge software OCR4all which, unlike other available platforms like Transkribus, eScriptorium, or PERO-OCR, ... project workflows exactly to the specific needs of the material at hand.

Recognition of Handwritten Roman Script Using Tesseract Open source OCR Engine

Subhajit Mandal

Computing Research Repository, 2010

The objective of the paper is to recognize handwritten samples of Roman numerals using the Tesseract open source Optical Character Recognition (OCR) engine. Tesseract is trained with data samples from different persons to generate one user-independent language model, representing the handwritten Roman digit-set. The system is trained with 1226 digit samples collected from different users. The performance is tested on two different datasets, one consisting of samples collected from the known users (those who prepared the training data samples) and the other consisting of handwritten data samples of unknown users. The overall recognition accuracy is 92.1% and 86.59% on these test datasets respectively.

An old greek handwritten OCR system based on an efficient segmentation-free approach

Ioannis Pratikakis

International Journal of Document Analysis and Recognition (IJDAR), 2007

Recognition of Old Greek Early Christian manuscripts is essential for efficient content exploitation of the valuable Old Greek Early Christian historical collections. In this paper, we focus on the problem of recognizing Old Greek manuscripts and propose a novel recognition technique that has been tested in a large number of important historical manuscript collections which are written in lowercase letters and originate from St. Catherine's Mount Sinai Monastery. Based on an open and closed cavity character representation, we propose a novel, segmentation-free, fast and efficient technique for the detection and recognition of characters and character ligatures. First, we detect open and closed cavities that exist in the skeletonized character body. Then, the classification of a specific character or character ligature is based on the protrusible segments that appear in the topological description of the character skeletons. Experimental results prove the efficiency of the proposed approach.

OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

Dr. Christine Grundig

2019

Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition and post-processing. The drawback of these tools often is their limited applicability by non-technical users like humanist scholars and in particular the combined use of several tools in a workflow. In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. A comfortable GUI allows error corrections not only in the final output, but already in early stages to minimize error propagation. Furthermore, extensive configuration capabilities are provided to set the degree of automation of the workf...

Combining OCR Models for Reading Early Modern Printed Books

Janne van der Loop MA (Hons)

arXiv (Cornell University), 2023

In this paper, we investigate the usage of fine-grained font recognition on OCR for books printed from the 15th to the 18th century. We used a newly created dataset for OCR of early printed books for which fonts are labeled with bounding boxes. We know not only the font group used for each character, but the locations of font changes as well. In books of this period, we frequently find font group changes mid-line or even mid-word that indicate changes in language. We consider 8 different font groups present in our corpus and investigate 13 different subsets: the whole dataset and text lines with a single font, multiple fonts, Roman fonts, Gothic fonts, and each of the considered fonts, respectively. We show that OCR performance is strongly impacted by font style and that selecting fine-tuned models with font group recognition has a very positive impact on the results. Moreover, we developed a system using local font group recognition in order to combine the output of multiple font recognition models, and show that while slower, this approach performs better not only on text lines composed of multiple fonts but on the ones containing a single font only as well.

DIGITISATION PROCESSING AND RECOGNITION OF OLD GREEK MANUSCRIPTS (THE D-SCRIBE PROJECT)

Christos Emmanouilidis

After many years of scholarly study, manuscript collections continue to be an important source of novel information for scholars, concerning both the history of earlier times as well as the development of cultural documentation over the centuries. The D-SCRIBE project aims to support and facilitate current and future efforts in manuscript digitization and processing. It strives toward the creation of a comprehensive software product, which can assist the content holders in turning an archive of manuscripts into a digital collection using automated methods. In this paper, we focus on the problem of recognizing early Christian Greek manuscripts. We propose a novel digital image binarization scheme for low quality historical documents allowing further content exploitation in an efficient way. Based on the existence of closed cavity regions in the majority of characters and character ligatures in these scripts, we propose a novel, segmentation-free, fast and efficient technique that assists the recognition procedure by tracing and recognizing the most frequently appearing characters or character ligatures.

Combining OCR Models for Reading Early Modern Books

Nikolaus Weichselbaumer, Janne Van Der Loop

Document Analysis and Recognition - ICDAR 2023, 2023

A segmentation-free recognition technique to assist old greek handwritten manuscript ocr

Ioannis Pratikakis

2004

Recognition of old Greek manuscripts is essential for quick and efficient content exploitation of the valuable old Greek historical collections. In this paper, we focus on the problem of recognizing early Christian Greek manuscripts written in lower case letters. Based on the existence of hole regions in the majority of characters and character ligatures in these scripts, we propose a novel, segmentation-free, fast and efficient technique that assists the recognition procedure by tracing and recognizing the most frequently appearing characters or character ligatures. First, we detect hole regions that exist in the character body. Then, the protrusions in the outer contour outline of the connected components that contain the character hole regions are used for the classification of the area around holes to a specific character or a character ligature. The proposed method gives highly accurate results and offers great assistance to old Greek handwritten manuscript OCR.
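
The first step described in the abstract above, detecting hole regions inside a character body, can be illustrated with a small flood-fill sketch. This is not the authors' algorithm, and it does not cover the later classification step based on contour protrusions. The function name `count_holes`, the nested-list glyph format (1 = ink, 0 = background) and the use of 4-connectivity for background pixels are assumptions made for this example.

```python
from collections import deque


def count_holes(glyph):
    """Count background regions fully enclosed by ink.

    `glyph` is a non-empty rectangular list of rows with 1 = ink and
    0 = background (an assumed format). Background pixels are joined
    with 4-connectivity.
    """
    rows, cols = len(glyph), len(glyph[0])
    seen = [[False] * cols for _ in range(rows)]

    def flood(r0, c0):
        # Breadth-first fill over connected background pixels.
        queue = deque([(r0, c0)])
        seen[r0][c0] = True
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not seen[nr][nc] and glyph[nr][nc] == 0):
                    seen[nr][nc] = True
                    queue.append((nr, nc))

    # Pass 1: background touching the image border is "outside" the glyph.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and glyph[r][c] == 0 and not seen[r][c]:
                flood(r, c)

    # Pass 2: every remaining unvisited background region is a hole.
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if glyph[r][c] == 0 and not seen[r][c]:
                holes += 1
                flood(r, c)
    return holes


if __name__ == "__main__":
    ring = [[1, 1, 1],
            [1, 0, 1],
            [1, 1, 1]]
    open_shape = [[1, 1, 1],
                  [1, 0, 0],
                  [1, 1, 1]]
    print(count_holes(ring), count_holes(open_shape))
```

A closed ring yields one hole, while a shape that opens to the image border yields none, which is the distinction a hole-based detector relies on.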

References (4)

Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27, 1967.
Michael Droettboom, Karl MacMillan, and Ichiro Fujinaga. The Gamera framework for building custom recognition systems. In Symposium on Document Image Understanding Technologies, pages 275-286, 2003.
Okan Kolak and Philip Resnik. OCR post-processing for low density languages. In HLT/EMNLP, 2005.
Jeffrey A. Rydberg-Cox. Automatic disambiguation of Latin abbreviations in early modern texts for humanities digital libraries. In JCDL, pages 372-, 2003.

Matthew Christy

Journal on Computing and Cultural Heritage

Optical character recognition (OCR) engines work poorly on texts published with premodern printing technologies. Engaging the key technological contributors from the IMPACT project, an earlier project attempting to solve the OCR problem for early modern and modern texts, the Early Modern OCR Project (eMOP) of Texas A&M received funding from the Andrew W. Mellon Foundation to improve OCR outputs for early modern texts from the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) proprietary database products, or some 45 million pages. Added to print problems is the poor quality of the page images in these collections, which would be too time consuming and expensive to reimage. This article describes eMOP's attempts to OCR 307,000 documents digitized from microfilm to make our cultural heritage available for current and future researchers. We describe the reasoning behind our choices as we undertook the project based on other relevant studies; discoveries we made; the data and the system we developed for processing it; the software, algorithms, training procedures, and tools that we developed; and future directions that should be taken for further work in developing OCR engines for cultural heritage materials.

CREMMA Medii Aevi: Literary Manuscript Text Recognition in Latin

Malamatenia Vlachou Efstathiou

Journal of open humanities data, 2022

This paper presents a novel segmentation and handwritten text recognition dataset for Medieval Latin, from the 11th to the 16th century. It connects with Medieval French datasets as well as earlier Latin datasets, by enforcing common guidelines, bringing 263,000 new characters and now totaling over a million characters for medieval manuscripts in both languages. We provide our own addition to Ariane Pinche's Old French guidelines to deal with specific Latin cases. We also offer an overview of how we addressed this dataset compilation through the use of pre-existing resources. With a higher abbreviation ratio and a better representation of abbreviating marks, we offer new models that outperform the Old French base model on Latin datasets, improving accuracy by 5% on unknown Latin manuscripts.

Recognizing characters of ancient manuscripts

Robert Sablatnig

Proceedings of SPIE, 2010

Considering printed Latin text, the main issues of Optical Character Recognition (OCR) systems are solved. However, for degraded handwritten document images, basic preprocessing steps such as binarization yield poor results with state-of-the-art methods. In this paper ancient Slavonic manuscripts from the 11th century are investigated. In order to minimize the consequences of false character segmentation, a binarization-free approach based on local descriptors is proposed. Additionally, local information allows the recognition of partially visible or washed out characters. The proposed algorithm consists of two steps: character classification and character localization. Initially Scale Invariant Feature Transform (SIFT) features are extracted, which are subsequently classified using Support Vector Machines (SVM). Afterwards, the interest points are clustered according to their spatial information. Thereby, characters are localized and finally recognized based on a weighted voting scheme of pre-classified local descriptors. Preliminary results show that the proposed system can handle highly degraded manuscript images with background clutter (e.g. stains, tears) and faded out characters.

An old Greek handwritten OCR system

Ioannis Pratikakis

Eighth International Conference on Document Analysis and Recognition (ICDAR'05), 2005

Greek historical collections. In this paper, we focus on the problem of recognizing Old Greek handwritten manuscripts and propose a novel recognition technique that can be applied to a large number of important historical manuscript collections which are written in lower case letters and originate from St. Catherine's Mount Sinai Monastery. Based on an open and closed cavity character representation, we propose a novel, segmentation-free, fast and efficient technique for the detection and recognition of characters and character ligatures. First, we detect open and closed cavities that exist in the skeletonized character body. Then, the recognition of a specific character or character ligature is based on the protrusible segments that appear in the topological description of the character skeletons. Experimental results prove the efficiency of the proposed approach.

New Approaches to OCR for Early Printed Books

Nikolaus Weichselbaumer

DigItalia, 2020

Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data. The OCR-D project, consisting of book historians and computer scientists, aims to address this deficiency by focussing on three major issues. Our first target was to create a tool that identifies font groups automatically in images of historical documents. We concentrated on Gothic font groups that were commonly used in German texts printed in the 15th and 16th century: the well-known Fraktur and the lesser known Bastarda, Rotunda, Textura and Schwabacher. The tool was trained with 35,000 images and reaches an accuracy level of 98%. It can not only differentiate between the above-mentioned font groups but also Hebrew, Greek, Antiqua and Italic. It can also identify woodcut images and irrelevant data (book covers, empty pages, etc.). In a second step, we created an online training infrastructure (okralact), which allows for the use of various open source OCR engines such as Tesseract, OCRopus, Kraken and Calamari. At the same time, it facilitates training for specific models of font groups. The high accuracy of the recognition tool paves the way for the unprecedented opportunity to differentiate between the fonts used by individual printers. With more training data and further adjustments, the tool could help to fill a major gap in historical research.

A Complete Optical Character Recognition Methodology for Historical Documents

Nikolaos Stamatopoulos

2008 The Eighth IAPR International Workshop on Document Analysis Systems, 2008

In this paper a complete OCR methodology for recognizing historical documents, either printed or handwritten without any knowledge of the font, is presented. This methodology consists of three steps: the first two steps refer to creating a database for training using a set of documents, while the third one refers to recognition of new document images. First, a pre-processing step that includes image binarization and enhancement takes place. In the second step, a top-down segmentation approach is used in order to detect text lines, words and characters. A clustering scheme is then adopted in order to group characters of similar shape. This is a semi-automatic procedure since the user is able to interact at any time in order to correct possible errors of clustering and assign an ASCII label. After this step, a database is created in order to be used for recognition. Finally, in the third step, for every new document image the above segmentation approach takes place while the recognition is based on the character database that has been produced in the previous step.

An efficient segmentation-free approach to assist old Greek handwritten manuscript OCR

Ioannis Pratikakis

2006

Recognition of old Greek manuscripts is essential for quick and efficient content exploitation of the valuable old Greek historical collections. In this paper, we focus on the problem of recognizing early Christian Greek manuscripts written in lower case letters. Based on the existence of closed cavity regions in the majority of characters and character ligatures in these scripts, we propose a novel, segmentation-free, fast and efficient technique that assists the recognition procedure by tracing and recognizing the most frequently appearing characters or character ligatures. First, we detect closed cavities that exist in the character body. Then, the protrusions in the outer contour outline of the connected components that contain the character closed cavities are used for the classification of the area around closed cavities to a specific character or a character ligature. The proposed method gives highly accurate results and offers great assistance to old Greek handwritten manuscript OCR. We also provide additional OCR applications that not only prove the robustness of the proposed method but also demonstrate its generic flavor in case segmentation and text location tasks are very difficult to perform.
