OCR4all - Open-Source OCR and HTR Across the Centuries

2023, Zenodo (CERN European Organization for Nuclear Research)

https://doi.org/10.5281/ZENODO.8108008

Abstract

Automated text recognition is an ever-present task in the humanities. A multitude of manuscripts and prints from all epochs and cultures have been neither edited nor fully indexed. Even though a considerable number of them have been digitized, they are usually accessible only as images or PDF files. Machine-actionable transcriptions are often not available, despite being imperative for full-text search, annotation, scholarly editions, and text analyses. Consequently, public institutions, including (digital) humanities scholars, require easy-to-use software solutions that enable them to perform high-quality OCR (Optical Character Recognition) 1. This proposed full-day workshop introduces the participants to the completely open-source and free-of-charge software OCR4all ( ) which, unlike other available platforms like Transkribus 2, eScriptorium 3, or PERO-OCR 4 project 5 workflows exactly to the specific needs of the material at hand.

FAQs

What are the quantifiable improvements in OCR4all's recognition accuracy since its inception?

The software has achieved significant advances in layout analysis, leading to recognition accuracy improvements, although specific figures are not disclosed in recent updates.

How does OCR4all's integration with OCR-D enhance workflow interoperability?

OCR4all's integration with OCR-D fosters interoperability among over fifty open-source OCR solutions, allowing flexible combination into optimized workflows for mass processing.

What defines the mixed models in OCR4all and their training process?

Mixed models in OCR4all leverage heterogeneous training materials, ensuring compatibility with various typefaces such as medieval bastard scripts and 19th-century Fraktur.

What role does LAREX play in enhancing OCR results during the workflow?

LAREX serves as a comprehensive correction tool, enabling precise adjustments of OCR output, which in turn generates high-quality training materials suitable for diverse applications.

When was the first workshop conducted to teach OCR4all's workflow, and what outcomes are expected?

The initial workshop format has been successfully employed across multiple instances, allowing participants to independently execute the entire OCR workflow post-training.

References

  1. Since processing printed and handwritten (HTR) material is very similar on a technical level, we use the term "OCR" as a general term which refers to both application scenarios.
  2. Kahle et al. 2017.
  3. Kiessling et al. 2019.
  4. Kodym / Hradiš 2021.
  5. DFG-funded initiative for Optical Character Recognition development, https://ocr-d.de/en
  6. Reul et al. 2022.
  7. Reul et al. 2019.
  8. Neudecker et al. 2019.
  9. Reul et al. 2022.

Bibliography

Kahle, Philip / Colutto, Sebastian / Hackl, Günter / Mühlberger, Günter: Transkribus - A Service Platform for Transcription, Recognition and Retrieval of Historical Documents. In: 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 2017, pp. 19-24. URL: https://doi.org/10.1109/ICDAR.2017.307

Kiessling, Benjamin / Tissot, Robin / Stokes, Peter / Stökl Ben Ezra, Daniel: eScriptorium: An Open Source Platform for Historical Document Analysis. In: International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, NSW, Australia, 2019, pp. 19-19. URL: https://doi.org/10.1109/ICDARW.2019.10032

Neudecker, Clemens / Baierer, Konstantin / Federbusch, Maria / Boenig, Matthias / Würzner, Kay-Michael / Hartmann, Volker / Herrmann, Elisa: OCR-D: An end-to-end open source OCR framework for historical printed documents. In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage (DATeCH2019). Association for Computing Machinery, New York, NY, USA, 2019, pp. 53-58. URL: https://doi.org/10.1145/3322905.3322917

Reul, Christian / Christ, Dennis / Hartelt, Alexander / Balbach, Nico / Wehner, Maximilian / Springmann, Uwe / Wick, Christoph / Grundig, Christine / Büttner, Andreas / Puppe, Frank: OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings. In: Applied Sciences 2019. (9) 22. URL: https://www.mdpi.com/2076-3417/9/22/4853

Reul, Christian / Tomasek, Stefan / Langhanki, Florian / Springmann, Uwe: Open Source Handwritten Text Recognition on Medieval Manuscripts Using Mixed Models and Document-Specific Finetuning. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol. 13237. Springer, Cham. URL: https://doi.org/10.1007/978-3-031-06555-2_28