An Open-Source Workflow for Handwritten Character Recognition
2025, An Open-Source Workflow for Handwritten Character Recognition
https://doi.org/10.2312/DH.20253069
Abstract
The rich manuscript heritage of Italy, preserved in archives and libraries, is becoming increasingly accessible to a wider audience through dedicated digitization initiatives. However, interpreting these manuscripts is often challenging for several reasons: the linguistic complexity of medieval Latin, the early development of vernacular languages, the continuous evolution of handwriting styles, and the extensive use of abbreviation systems devised to conserve space on costly materials such as parchment and paper. Artificial Intelligence (AI) tools can significantly speed up the last step of the digitization process: transcription. In particular, Handwritten Character Recognition (HCR) technology enables the recognition and processing of handwritten text. However, as with all AI tools, especially in the domain of handwritten texts and, more broadly, in the Humanities, training and fine-tuning are required. To support Digital Humanists in tailoring these tools to specific needs, such as transcribing different handwriting styles, a Human-AI collaboration approach was adopted to develop a collaborative web application, named HCR WORKFLOW, designed for the creation of ground-truth data for AI-based manuscript transcription. The platform combines two components: P2PaLA, a neural-network toolkit for document layout analysis that detects text lines, and TrOCR, which pairs an image Transformer encoder with an autoregressive text Transformer decoder to transcribe single lines. This integrated system guides and assists Digital Humanists throughout the entire process, from digitization to transcription supervision. For this study, the platform was used to fine-tune TrOCR on humanistic script, and in particular to create the ground truth based on the Copialettere (Letterbooks) of Isabella d'Este and the letters addressed to her by Lucrezia Borgia.
This research paper will discuss in detail the HCR WORKFLOW platform, the dataset used, the approach to create an AI-oriented transcription, and the results of the fine-tuning of the AI tool for manuscript transcription.
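The human-in-the-loop flow described in the abstract (layout analysis yields text lines, a recognizer proposes a transcription, a human supervises it, and the confirmed lines become ground truth) can be illustrated with a minimal Python sketch. It is not the HCR WORKFLOW implementation: the `LineRecord` fields, the function names, and the JSON Lines export format are all hypothetical, and the recognizer is passed in as a plain callable standing in for a model such as TrOCR.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LineRecord:
    """One text line found by layout analysis (field names are hypothetical)."""
    page: str                          # identifier of the digitized page image
    bbox: Tuple[int, int, int, int]    # (left, top, right, bottom) in pixels
    predicted: str = ""                # transcription proposed by the recognizer
    corrected: Optional[str] = None    # text confirmed or fixed by a human


def propose(records: List[LineRecord],
            recognize: Callable[[LineRecord], str]) -> List[LineRecord]:
    """Fill in a proposed transcription for every line.

    `recognize` is a stand-in for the single-line recognizer; any callable
    that maps a record to a string works here.
    """
    for record in records:
        record.predicted = recognize(record)
    return records


def review(record: LineRecord, human_text: Optional[str] = None) -> LineRecord:
    """Accept the prediction as-is (human_text=None) or store a correction."""
    record.corrected = record.predicted if human_text is None else human_text
    return record


def export_ground_truth(records: List[LineRecord]) -> str:
    """Serialize only the reviewed lines as JSON Lines (an assumed format)."""
    rows = [
        {"page": r.page, "bbox": list(r.bbox), "text": r.corrected}
        for r in records
        if r.corrected is not None  # unreviewed lines are not ground truth
    ]
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)
```

Keeping `predicted` and `corrected` as separate fields is a design choice of this sketch: only lines a human has confirmed are exported, while the model's original proposals stay available for comparison.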