Segmenting Arabic Handwritten Documents into Text lines and Words
Abstract
In this paper, we present a method for segmenting Arabic handwritten documents into text lines and words. Text line segmentation is addressed with a well-known technique, the horizontal projection profile, in which autocorrelation is used to enhance the self-similarity of the profile and thereby support the estimation of text line spacing. Word extraction is based on an adaptation of a known method, gap metrics. The adaptation derives the gap values from the properties of each input document, which makes the proposed method tolerant of and robust to the nature of Arabic handwriting. Arabic text is often divided into words, sub-words and letters; however, some letters do not connect to the following letter, even in the middle of a word. The gap metric method exploits the membership values of a clustering algorithm to set the segmentation thresholds that separate "within word" gaps from "between words" gaps. The proposed method is tested on the benchmark datasets for Arabic handwritten text recognition research (AHDB), and very promising results were achieved, with an 84.8% correct extraction rate.
FAQs
What techniques enhance text line spacing in Arabic handwritten documents?
The study demonstrates that applying autocorrelation to the horizontal projection profile improves text line spacing estimation in documents, addressing issues with skew and touching lines.
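The idea of autocorrelating the horizontal projection profile can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `estimate_line_spacing`, the ink-equals-1 convention, and the rule of taking the first positive local maximum of the autocorrelation are all assumptions made for the example.

```python
import numpy as np

def estimate_line_spacing(binary_img):
    """Estimate text line spacing (in pixel rows) of a binarized page.

    binary_img: 2-D array with 1 for ink and 0 for background (assumed convention).
    Returns the lag of the first positive local maximum of the profile's
    autocorrelation, or None if no such peak exists.
    """
    # Horizontal projection profile: amount of ink in each pixel row.
    profile = np.asarray(binary_img, dtype=float).sum(axis=1)
    # Remove the mean so that periodic structure, not overall ink density,
    # dominates the autocorrelation.
    profile -= profile.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(profile, profile, mode="full")[profile.size - 1:]
    # The first positive local maximum after lag 0 is taken as the spacing.
    for lag in range(1, ac.size - 1):
        if ac[lag] > 0 and ac[lag] > ac[lag - 1] and ac[lag] >= ac[lag + 1]:
            return lag
    return None
```

On real scans the profile is noisy, so smoothing the profile or restricting the search to a plausible lag range would likely be needed before picking the peak.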
How does the proposed word extraction method improve segmentation accuracy?
The method adapts gap thresholds based on individual document properties, achieving a correct extraction rate of 84.8%, compared to 85.0% from fixed threshold methods.
What distinguishes word extraction from segmentation in Arabic handwritten text?
Word extraction focuses on separating words from handwritten text, whereas segmentation divides words into individual characters, significantly impacting recognition methodologies.
Which clustering algorithm yielded the best results in this study?
Fuzzy C-Means clustering produced the best word extraction accuracy, with the lowest misplaced-word rate (5.5%) among the tested algorithms.
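A small sketch of how Fuzzy C-Means memberships could label gaps is given below. It is written in plain NumPy under stated assumptions: gaps are one-dimensional widths in pixels, there are two clusters, the fuzzifier is m = 2, and each gap is assigned to the cluster with the higher membership. The function names and the assignment rule are hypothetical; the source does not specify how the memberships are turned into thresholds.

```python
import numpy as np

def _memberships(x, centers, m):
    # Standard FCM membership update: u_ij is proportional to d_ij^(-2/(m-1)).
    dist = np.abs(x[:, None] - centers[None, :]) + 1e-12  # avoid division by zero
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_cmeans_1d(x, n_clusters=2, m=2.0, n_iter=100, tol=1e-6):
    """Fuzzy C-Means on 1-D data; returns (centers, membership matrix)."""
    x = np.asarray(x, dtype=float)
    # Deterministic initialization: centers spread over the data range.
    centers = np.linspace(x.min(), x.max(), n_clusters)
    for _ in range(n_iter):
        um = _memberships(x, centers, m) ** m
        new_centers = um.T @ x / um.sum(axis=0)
        converged = np.max(np.abs(new_centers - centers)) < tol
        centers = new_centers
        if converged:
            break
    return centers, _memberships(x, centers, m)

def classify_gaps(gap_widths):
    """Return a boolean array: True where a gap looks like a between-words gap."""
    centers, u = fuzzy_cmeans_1d(gap_widths)
    between = int(np.argmax(centers))  # cluster with the wider typical gap
    return u[:, between] > 0.5
```

Because the cluster centers are recomputed from each document's own gap widths, the resulting decision boundary adapts per document rather than relying on a fixed pixel threshold.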
How does the proposed method compare to previous word extraction techniques?
Unlike previous techniques relying on fixed thresholds, the proposed method dynamically calculates thresholds for each document, enhancing adaptability and robustness in word extraction.