Word Fragments Based Arabic Language Identification
2004
Abstract
We present a word-fragment-based method that efficiently discriminates between Arabic and other languages written in Arabic script. The method combines features characteristic of Arabic, namely function words, prefixes, suffixes, and unigrams representing the character set of the Arabic script. On 180 samples selected randomly from the Internet and representing six Arabic-script languages (Arabic, Persian, Urdu, Pashto, Kurdish, and Uighur), the method achieved 94% recall and 94% precision for Arabic language identification. A key advantage of this approach is that the language model used for identification is transparent and can be tuned and enhanced using linguistic expertise.
FAQs
What is the word fragment algorithm's method of language identification?
The algorithm segments text into tokens using Unicode properties and performs partial morphological analysis with dictionaries. It assigns weights to the detected features and computes scores, then selects the language with the highest score.
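The pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `MODELS` dictionaries, the weights, and the function names `tokenize`, `score`, and `identify` are all hypothetical, and summing the weights is only one plausible way to combine them. Tokenization relies on Unicode general categories from Python's standard `unicodedata` module.

```python
import unicodedata

# Hypothetical toy models (feature -> weight). Real dictionaries would be
# much larger; these entries and weights are illustrative assumptions only.
MODELS = {
    "arabic": {
        "function_words": {"في": 3.0, "من": 3.0, "على": 3.0},
        "prefixes": {"ال": 1.0},
    },
    "persian": {
        "function_words": {"از": 3.0, "که": 3.0, "است": 3.0},
        "prefixes": {},
    },
}

def tokenize(text):
    """Split text into runs of letters and combining marks."""
    tokens, current = [], []
    for ch in text:
        # General categories starting with "L" are letters; "M" are marks
        # (for example Arabic diacritics). Anything else ends the token.
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

def score(tokens, model):
    """Sum the weights of the features found in the tokens (assumed scheme)."""
    total = 0.0
    for tok in tokens:
        if tok in model["function_words"]:
            total += model["function_words"][tok]
            continue
        for prefix, weight in model["prefixes"].items():
            # Require a non-empty remainder after the prefix.
            if tok.startswith(prefix) and len(tok) > len(prefix):
                total += weight
    return total

def identify(text, models=MODELS):
    """Return the language with the highest score, or None if nothing matched."""
    tokens = tokenize(text)
    scores = {lang: score(tokens, model) for lang, model in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling `identify(text)` on a string returns one of the keys of `MODELS`, or `None` when no feature fires. Because every feature and weight is a plain dictionary entry, the model stays inspectable and easy to adjust by hand.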
How effective is the word fragment method for Arabic language identification?
The method achieved a recall of 94%, a precision of 94%, and an overall accuracy of 96.6% across 180 test files. This demonstrates its strong capability to discriminate Arabic from other languages that use Arabic script.
What features are crucial for identifying the Arabic language?
Key features include function words, prefixes, and suffixes, which are distinctive in Arabic morphology. Longer prefixes and suffixes receive progressively higher weights, enhancing identification accuracy.
What computational advantages does the proposed method offer?
The system processes approximately 5 Gigabytes of text per hour on a Pentium 4 processor, with a memory footprint of only 1.5 MB. These metrics suggest significant efficiency for industrial applications.
How do prefix and suffix characteristics affect language identification performance?
Longer prefixes and suffixes are more distinctive, and a bonus score is granted when both are found in compatible contexts. This reduces confusion between Arabic and other languages that use Arabic script.
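A small sketch can make the length-based weighting and the compatibility bonus concrete. Everything here is an assumption made for illustration: the affix lists, the choice of weight equal to affix length, the `BONUS` value, and the compatibility rule are hypothetical and do not come from the paper. The rule used is a simplification of one Arabic constraint: a word carrying the definite article does not also take an attached possessive pronoun.

```python
# Hypothetical affix lists. All three prefixes contain the definite article.
PREFIXES = ("ال", "وال", "بال")
SUFFIXES = ("ات", "ون", "ها")

# Assumed compatibility rule: the last suffix is an attached pronoun, which
# does not combine with a definite-article prefix.
INCOMPATIBLE_SUFFIXES = {SUFFIXES[2]}

BONUS = 2.0  # hypothetical bonus for a compatible prefix/suffix pair

def longest_match(token, affixes, at_start):
    """Return the longest affix found at the start or end of token, or ''."""
    matches = [
        a for a in affixes
        if len(token) > len(a)
        and (token.startswith(a) if at_start else token.endswith(a))
    ]
    return max(matches, key=len, default="")

def affix_score(token):
    """Score a token: weight grows with affix length, plus a pair bonus."""
    prefix = longest_match(token, PREFIXES, at_start=True)
    suffix = longest_match(token, SUFFIXES, at_start=False)
    # Assumed weighting: one point per character of each matched affix.
    total = float(len(prefix) + len(suffix))
    both_found = bool(prefix) and bool(suffix)
    leaves_stem = len(prefix) + len(suffix) < len(token)
    if both_found and leaves_stem and suffix not in INCOMPATIBLE_SUFFIXES:
        total += BONUS
    return total
```

With these assumptions, a token with a three-letter prefix scores higher than one with a two-letter prefix, and a token with both a prefix and a compatible suffix scores higher than the sum of the two affix weights alone.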