OCR4all - Open-Source OCR and HTR Across the Centuries
2023, Zenodo (CERN European Organization for Nuclear Research)
https://doi.org/10.5281/ZENODO.8108008
Abstract
Automated text recognition is an ever-present task in the humanities. A multitude of manuscripts and prints from all epochs and cultures have neither been edited nor fully indexed. Even though a considerable number of them have been digitized, they are usually only accessible as images or PDF files. Machine-actionable transcriptions are often not available, despite being imperative for full-text search, annotation, scholarly editions and text analyses. Consequently, public institutions, including (digital) humanities scholars, require easy-to-use software solutions which enable them to perform high-quality OCR (Optical Character Recognition) [1]. This proposed full-day workshop introduces the participants to the completely open-source and free-of-charge software OCR4all ( ) which, unlike other available platforms like Transkribus [2], eScriptorium [3], or PERO-OCR [4] project [5] workflows exactly to the specific needs of the material at hand.
FAQs
What are the quantifiable improvements in OCR4all's recognition accuracy since its inception?
The software has achieved significant advances in layout analysis, leading to recognition accuracy improvements, although specific figures are not disclosed in recent updates.
How does OCR4all's integration with OCR-D enhance workflow interoperability?
OCR4all's integration with OCR-D fosters interoperability among over fifty open-source OCR solutions, allowing flexible combination into optimized workflows for mass processing.
What defines the mixed models in OCR4all and their training process?
Mixed models in OCR4all leverage heterogeneous training materials, ensuring compatibility with various typefaces such as medieval bastard scripts and 19th-century Fraktur.
What role does LAREX play in enhancing OCR results during the workflow?
LAREX serves as a comprehensive correction tool, enabling precise adjustments of OCR output, which in turn generates high-quality training materials suitable for diverse applications.
When was the first workshop conducted to teach OCR4all's workflow, and what outcomes are expected?
The initial workshop format has been successfully employed across multiple instances, allowing participants to independently execute the entire OCR workflow post-training.
References (13)
- Since processing printed and handwritten (HTR) material is very similar on a technical level, we use the term "OCR" as a general term which refers to both application scenarios.
- Kahle et al. 2017.
- Kiessling et al. 2019.
- Kodym / Hradiš 2021.
- DFG-funded initiative for Optical Character Recognition development, https://ocr-d.de/en
- Reul et al. 2022.
- Reul et al. 2019.
- Neudecker et al. 2019.
- Reul et al. 2022.
Bibliography
- Kahle, Philip / Colutto, Sebastian / Hackl, Günter / Mühlberger, Günter: Transkribus - A Service Platform for Transcription, Recognition and Retrieval of Historical Documents. In: 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 2017, pp. 19-24. URL: https://doi.org/10.1109/ICDAR.2017.307
- Kiessling, Benjamin / Tissot, Robin / Stokes, Peter / Stökl Ben Ezra, Daniel: eScriptorium: An Open Source Platform for Historical Document Analysis. In: International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, NSW, Australia, 2019, pp. 19-19. URL: https://doi.org/10.1109/ICDARW.2019.10032
- Neudecker, Clemens / Baierer, Konstantin / Federbusch, Maria / Boenig, Matthias / Würzner, Kay-Michael / Hartmann, Volker / Herrmann, Elisa: OCR-D: An end-to-end open source OCR framework for historical printed documents. In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage (DATeCH2019). Association for Computing Machinery, New York, NY, USA, 2019, pp. 53-58. URL: https://doi.org/10.1145/3322905.3322917
- Reul, Christian / Christ, Dennis / Hartelt, Alexander / Balbach, Nico / Wehner, Maximilian / Springmann, Uwe / Wick, Christoph / Grundig, Christine / Büttner, Andreas / Puppe, Frank: OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings. In: Applied Sciences 2019. (9) 22. URL: https://www.mdpi.com/2076-3417/9/22/4853
- Reul, Christian / Tomasek, Stefan / Langhanki, Florian / Springmann, Uwe: Open Source Handwritten Text Recognition on Medieval Manuscripts Using Mixed Models and Document-Specific Finetuning. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol. 13237. Springer, Cham. URL: https://doi.org/10.1007/978-3-031-06555-2_28