Document digitization lifecycle for complex magazine collection
2005, Proceedings of the 2005 ACM symposium on Document engineering - DocEng '05
https://doi.org/10.1145/1096601.1096650
Abstract
The conversion of large collections of documents from paper to digital formats suitable for electronic archival is a complex, multi-phase process. Creating good-quality images from paper documents is just one phase. Extracting the relevant information these documents contain, with an accuracy that fits the purpose of the target applications, requires both an automated document analysis system and a manual verification/review process. The automated system must perform a variety of analysis and recognition tasks to reach an accuracy level that minimizes the manual correction effort downstream. This paper describes the complete process and the associated technologies, tools, and systems needed to convert a large collection of complex documents and deploy its information-rich content for online web access. We used this process to recapture 80 years of Time magazine issues. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted into a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy.