Papers by Dia Eddin Abuzeina

Exploring the language modeling toolkits for Arabic text
2017 International Conference on Electrical and Computing Technologies and Applications (ICECTA), 2017
Statistical N-gram language models (LMs) have proven very effective in natural language processing (NLP), particularly in automatic speech recognition (ASR) and machine translation. The success of LMs has motivated the introduction of efficient techniques, as well as different model types, across linguistic applications. LMs fall mainly into two types: grammars and statistical language models, the latter also called N-grams. The main difference between them is that statistical language models estimate probabilities for word sequences, while grammars usually carry no probabilities. Although many toolkits can be used to create LMs, this work employs two well-known language modeling toolkits with a focus on Arabic text: the Carnegie Mellon University (CMU)-Cambridge Language Modeling Toolkit and the language modeling tools of the Cambridge University Hidden Markov Model Toolkit (HTK). For clarification, we used a small Arabic text corpus to compute the N-grams for 1-gram, 2-gram, and 3-gram models. In addition, this paper demonstrates the intermediate steps needed to generate ARPA-format LMs using both toolkits.
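The core computation both toolkits perform, counting N-grams and estimating conditional probabilities from the counts, can be sketched in a few lines of Python. This is a minimal maximum-likelihood sketch, not the toolkits' actual code; real LMs add smoothing, and the toy English corpus stands in for an Arabic one.

```python
from collections import Counter

def ngram_probs(tokens, n):
    """Estimate P(w_n | w_1..w_{n-1}) by maximum likelihood from raw counts."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    history = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return {g: c / history[g[:-1]] for g, c in ngrams.items()}

# Toy corpus; a real setup would use an Arabic corpus and apply smoothing
tokens = "the cat sat on the mat".split()
bigrams = ngram_probs(tokens, 2)
print(bigrams[("the", "cat")])  # 0.5: "the" occurs twice, once followed by "cat"
```

The same function yields 1-gram, 2-gram, or 3-gram estimates by varying `n`, which mirrors the three model orders computed in the paper.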

An-Najah University Journal for Research - B (Humanities), May 1, 2021
This paper discusses part of speech (PoS) tagging for Arabic prepositions. Arabic has a number of predefined sets of particles, such as particles of Nasb, particles of Jazm, and particles of Jarr (also called prepositions). Each set has a particular role in the context in which it appears. In general, PoS tagging is the process of assigning a tag to each word (e.g., noun, verb, particle) based on context. PoS tagging is a beneficial tool for many natural language processing (NLP) toolkits. For instance, it is used in syntactic parsing to validate the grammar of the sentence in question, and it helps recover the intended meaning through textual analysis for further processing in search engines. Many other language processing applications utilize PoS tagging, such as machine translation, speech synthesis, speech recognition, and diacritization. Hence, the output quality of many NLP applications depends on the accuracy of the tagging system used. This study examines the Stanford tagger to explore its tag set on the text under examination and its performance in tagging Arabic prepositions. The study also discusses the weaknesses of the Stanford tagger: it does not handle the merging case, in which a preposition joins an adjacent word to form a single word, and it assigns a single tag to particles, such as Jarr and Jazm, that differ in linguistic function. Through our inductive study of prepositions in terms of linguistic functions such as Jazm and Istifham (interrogation), we did not note differences in the tagging of prepositions such as "to" (إلى) and "in".
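The distinction the paper draws between Jarr and Jazm particles can be illustrated with a minimal lookup-based tagger. The particle lists and tag names below are a small illustrative subset chosen for this sketch; they are not the Stanford tagger's behavior (which, as the paper notes, collapses such particles into one tag), and a real tagger must use context rather than a lookup table.

```python
# Hypothetical fine-grained tag inventory; real disambiguation needs context
JARR_PARTICLES = {"إلى", "في", "من", "على"}  # particles of Jarr (prepositions)
JAZM_PARTICLES = {"لم", "لما", "لام الأمر"}   # particles of Jazm (illustrative)

def tag_particle(word):
    """Toy tagger: assign a fine-grained particle tag by set membership."""
    if word in JARR_PARTICLES:
        return "PART_JARR"
    if word in JAZM_PARTICLES:
        return "PART_JAZM"
    return "UNK"

print(tag_particle("إلى"))  # PART_JARR
print(tag_particle("لم"))   # PART_JAZM
```

The point of the sketch is the tag-set design the paper argues for: particles with different linguistic functions receive different tags rather than one shared particle tag.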

Although considerable research has been devoted to English speech recognition, rather less attention has been paid to Arabic speech recognition. Arabic is one of the most widely used languages worldwide and is in need of accurate audio-to-text converters. In this paper, we evaluate the recognition performance of continuous Arabic speech using the Soundflower Mac utility. That is, Soundflower was employed as a speaker-independent continuous speech recognition system to evaluate the word error rate (WER) and the accuracy of Arabic speech. The study also compares recognition performance for male and female native speakers. The experiments were conducted using a broadcast-news modern standard Arabic (MSA) speech corpus of 2.63 hours (10 male and 10 female speakers). The experimental results show that the accuracy is 54.02%, and that the accuracies for male and female speakers are almost the same.

Speech recognition makes an important contribution to new technologies in human-computer interaction. Today, there is a growing need to employ speech technology in daily life and business activities. However, speech recognition is a challenging task that requires several stages before the desired output is obtained. Among the components of automatic speech recognition (ASR) is the feature extraction process, which parameterizes the speech signal to produce the corresponding feature vectors. Feature extraction aims to approximate the linguistic content conveyed by the input speech signal. There are several methods for extracting speech features; however, Mel-frequency cepstral coefficients (MFCC) are the most popular technique. It has long been observed that MFCC is dominantly used in well-known recognizers such as the Carnegie Mellon University (CMU) Sphinx and the Hidden Markov Model Toolkit (HTK). Hence, this paper focuses on the MFCC method...
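The first steps of the MFCC pipeline (pre-emphasis, framing, and windowing) can be sketched in plain Python. The frame parameters below are typical textbook values (25 ms frames with a 10 ms hop at 16 kHz), assumed for illustration; a full MFCC implementation would continue with an FFT, a mel filterbank, a log, and a DCT.

```python
import math

def preemphasize(signal, alpha=0.97):
    """Boost high frequencies: y[t] = x[t] - alpha * x[t-1]."""
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]

def frame_and_window(signal, frame_len=400, hop=160):
    """Split into overlapping frames and apply a Hamming window to each."""
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
               for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * hamming[n] for n in range(frame_len)])
    return frames

signal = [math.sin(0.01 * t) for t in range(1600)]  # 0.1 s of a toy tone at 16 kHz
frames = frame_and_window(preemphasize(signal))
print(len(frames), len(frames[0]))  # 8 frames of 400 windowed samples each
```

Each windowed frame is what the later spectral stages turn into one MFCC feature vector.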

2017 International Conference on Engineering & MIS (ICEMIS), 2017
Linear discriminant analysis (LDA) is a dimensionality reduction technique that is widely used in pattern recognition applications. LDA aims to generate effective feature vectors by reducing the dimensions of the original data (e.g., a bag-of-words textual representation) to a lower-dimensional space. Hence, LDA is a convenient method for text classification, which is known for its huge-dimensional feature vectors. In this paper, we empirically investigated two LDA-based methods for Arabic text classification. The first method computes the generalized eigenvectors of the ratio of the between-class to within-class scatters; the second uses linear classification functions that assume equal population covariance matrices (i.e., a pooled sample covariance matrix). We used a textual data collection of 1,750 documents belonging to five categories. The testing set contains 250 documents belonging to the five categories (50 documents per category). The experiment...
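In the two-class case, the scatter-ratio idea behind the first method reduces to Fisher's classic discriminant direction w = Sw⁻¹(m1 − m2). The sketch below uses tiny hypothetical two-feature data, not the paper's corpus, and hand-codes the 2x2 inverse for self-containment.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def within_scatter(cls, m):
    """2x2 within-class scatter: sum of outer products of centered samples."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in cls:
        d = [v[0] - m[0], v[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

# Two toy classes in a 2-D feature space (e.g., two term frequencies)
c1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
c2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]]
m1, m2 = mean(c1), mean(c2)
s1, s2 = within_scatter(c1, m1), within_scatter(c2, m2)
sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]

# Fisher direction: w = Sw^{-1} (m1 - m2), via the explicit 2x2 inverse
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
diff = [m1[0] - m2[0], m1[1] - m2[1]]
w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
     (sw[0][0] * diff[1] - sw[1][0] * diff[0]) / det]

# Projecting the class means onto w separates the two classes
p1 = w[0] * m1[0] + w[1] * m1[1]
p2 = w[0] * m2[0] + w[1] * m2[1]
print(p1 != p2)  # True: the projected means are distinct
```

For the many-class, huge-dimensional text setting of the paper, the same ratio is optimized through a generalized eigenvector computation rather than a single closed-form direction.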


This paper presents a novel approach to automatic Arabic text clustering. The proposed method combines two well-known information retrieval techniques: latent semantic indexing (LSI) and the cosine similarity measure. The standard LSI technique generates textual feature vectors based on word co-occurrences; the proposed method instead generates the feature vectors from the cosine measures between the documents. The goal is to obtain high-quality textual clusters based on semantically rich features for the benefit of linguistic applications. The performance of the proposed method was evaluated using an Arabic corpus of 1,000 documents belonging to 10 topics (100 documents per topic). For clustering, we used the expectation-maximization (EM) unsupervised clustering technique to cluster the corpus's documents into ten groups. The experimental results show that the proposed method outperforms the standard LSI method by about 15%.
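The feature construction described above, where each document's new feature vector is its cosine similarity to every document in the collection, can be sketched as follows. The toy term-count vectors are illustrative; the paper applies the cosine to LSI-reduced vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy document vectors; the paper uses LSI-reduced vectors instead of raw counts
docs = [[1, 2, 0], [2, 4, 0], [0, 1, 3]]

# Each document is re-represented by its similarities to all documents
features = [[cosine(d, other) for other in docs] for d in docs]
print(round(features[0][1], 3))  # docs 0 and 1 are parallel, so cosine is 1.0
```

The resulting n-by-n similarity matrix is what a clustering algorithm such as EM then operates on.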
Computer Science & Information Technology (CS & IT), Nov 12, 2016
Markov chain theory is an important tool in applied probability that is quite useful in modeling real-world computing applications. For a long time, researchers have used Markov chains for data modeling in a wide range of fields, such as computational linguistics, image processing, communications, bioinformatics, and financial systems. This paper explores Markov chain theory and its extension, hidden Markov models (HMMs), in natural language processing (NLP) applications. It also presents some aspects of Markov chains and HMMs, such as creating transition matrices, calculating data-sequence probabilities, and extracting the hidden states.
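The transition-matrix and sequence-probability aspects mentioned above can be sketched with a toy two-state chain; the states and probabilities are illustrative numbers, not from the paper.

```python
# Transition matrix: P[s][t] = probability of moving from state s to state t
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sequence_probability(states, start_probs):
    """P(s1..sn) = P(s1) * product of the transition probabilities along the path."""
    p = start_probs[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= P[prev][cur]
    return p

start = {"sunny": 0.5, "rainy": 0.5}
print(sequence_probability(["sunny", "sunny", "rainy"], start))  # 0.5 * 0.8 * 0.2 ≈ 0.08
```

An HMM adds a second layer on top of this: the chain's states are hidden, and each state emits an observable symbol with its own probability distribution.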

TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES, 2019
The vector space model (VSM) is an algebraic model that is widely used for data representation in text mining applications. However, the VSM poses a critical challenge, as it requires a high-dimensional feature space. Therefore, many feature selection techniques, such as employing roots or stems (i.e., words without infixes, prefixes, and/or suffixes) instead of complete word forms, have been proposed to tackle this space challenge. Recently, the literature has shown that one more basic feature unit can be used to handle textual features: the two-neighboring-character form, which we call a microword. A microword is two consecutive characters, also known as the character-bigram feature. To evaluate this feature type, we measure the accuracy of Arabic text clustering using two feature types: the complete word form and the microword form. In the experiment, principal component analysis (PCA) is used to reduce the feature vector dimensions, while the k-means algorithm is used for clustering. The testing set includes 250 documents in five categories. The entire corpus contains 54,472 words, whereas the vocabulary contains 13,356 unique words. The experimental results show that the complete word form scores 97.2% accuracy while the two-character form scores 96.8%. In conclusion, the accuracies are almost the same; however, the two-character form uses a smaller vocabulary as well as fewer PCA subspaces. These experiments may be a significant indication of the value of considering the character-bigram feature in future text processing and natural language processing applications.
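Extracting the microword (character-bigram) features described above is a one-liner; the sample word is illustrative.

```python
def char_bigrams(word):
    """Return the two-neighboring-character 'microwords' of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

print(char_bigrams("كتاب"))  # ['كت', 'تا', 'اب']
```

Because Arabic uses only a few dozen letters, the set of possible bigrams is far smaller than the full-word vocabulary, which is why this representation shrinks both the vocabulary and the PCA subspace.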

Advances in Fuzzy Systems, 2019
The part of speech (PoS) tagger is a core component in many natural language processing (NLP) applications. PoS taggers contribute as a preprocessing step in various NLP tasks, such as syntactic parsing, information extraction, machine translation, and speech synthesis. In this paper, we examine the performance of a modern standard Arabic (MSA) based tagger on classical (i.e., traditional or historical) Arabic. We employed the Stanford Arabic model tagger to evaluate the imperative verbs in the Holy Quran. The Stanford tagger contains 29 tags; however, this work experimentally evaluates just one: VB, the imperative-verb tag. The testing set contains 741 imperative verbs, which appear in 1,848 positions in the Holy Quran. Despite the previously reported accuracy of the Arabic model of the Stanford tagger (96.26% for all tags and 80.14% for unknown words), the experimental results show that this accuracy is only 7.28% for the imperative verbs.
Modelling of Cross-word Pronunciation Variation for Arabic ASRs: A Knowledge-Based Approach
Journal of Communications and Computer Engineering, Aug 27, 2011
A chapter in a forthcoming Speech Recognition book, ISBN 979-953-307-790-0, to be published by InTech, a global leader in Open Access publishing for the international science, technology, and medical community.

Journal of King Saud University - Computer and Information Sciences, 2017
Cosine similarity is one of the most popular distance measures in text classification problems. In this paper, we used this important measure to investigate the performance of Arabic text classification. For textual features, the vector space model (VSM) is generally used to represent textual information as numerical vectors. However, latent semantic indexing (LSI) is a better textual representation technique, as it maintains semantic information between the words. Hence, we used the singular value decomposition (SVD) method to extract textual features based on LSI. In our experiments, we compared several well-known classification methods: Naïve Bayes, k-nearest neighbors, neural network, random forest, support vector machine, and classification tree. We used a corpus that contains 4,000 documents of ten topics (400 documents per topic). The corpus contains 2,127,197 words with about 139,168 unique words. The testing set contains 400 documents, 40 per topic. As a weighting scheme, we used Term Frequency-Inverse Document Frequency (TF.IDF). This study reveals that the classification methods that use LSI features significantly outperform the TF.IDF-based methods. It also reveals that k-nearest neighbors (based on the cosine measure) and support vector machine are the best-performing classifiers.
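The TF.IDF weighting used as the baseline can be sketched with the standard formulation tf(t, d) * log(N / df(t)); the toy documents below are illustrative, not from the paper's corpus.

```python
import math

def tfidf(docs):
    """Weight term t in document d as tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [{t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
            for doc in docs]

docs = [["news", "sport", "goal"],
        ["news", "market", "stock"],
        ["news", "goal", "sport"]]
weights = tfidf(docs)
# "news" appears in every document, so its IDF (and weight) is zero,
# while the rarer "goal" keeps a positive weight
print(weights[0]["news"] < weights[0]["goal"])  # True
```

LSI goes one step further than this weighting: it applies SVD to the weighted term-document matrix so that semantically related terms end up near each other in the reduced space.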

Capturing the Common Syntactical Rules for the Holy Quran: A Data Mining Approach
2013 Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences, 2013
This paper presents a novel approach to capturing the common syntactical rules of the Holy Quran. By syntactical rules, we mean the common relationships between word tags that appear frequently in the Quran. Arabic, like other languages, has a number of tags, including nouns, verbs, and pronouns, each with a number of sub-types. In this paper, we use a data mining approach to extract the common syntactical rules, which can then be offered to natural language processing applications. The Stanford part of speech tagger (29 tags) is used to tag the Quran's words. Then, the data mining tool WEKA (the PredictiveApriori algorithm) is used to find the most frequent syntactical rules. The extracted syntactical rules do not require the word tags to be adjacent; that is, they capture long-distance relations. The most common syntactical rule found is: tag1=RP tag2=NN tag3=WP 91 ⇒ tag4=VBD 90, acc:(0.97912). The phrase matching this rule (which is part of an ayah) appeared in 89 ayahs in 20 different surahs; the study used Mushaf Al-Madinah Al-Munawwarah (published by the King Fahd Complex for Printing the Holy Quran).
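The kind of association rule mined above, a tag pattern with a support count and a confidence, can be illustrated by direct counting over tag sequences. The sequences below are hypothetical, not from the Quranic corpus, and the matching allows gaps to reflect the long-distance property noted in the paper.

```python
def rule_confidence(sequences, antecedent, consequent):
    """Confidence of 'antecedent tags => consequent tag' over tag sequences.
    Matching preserves order but allows gaps (long-distance relations)."""
    def contains(seq, pattern):
        pos = 0
        for tag in seq:
            if pos < len(pattern) and tag == pattern[pos]:
                pos += 1
        return pos == len(pattern)

    matches = [s for s in sequences if contains(s, antecedent)]
    hits = [s for s in matches if contains(s, antecedent + [consequent])]
    return len(hits) / len(matches) if matches else 0.0

sequences = [["RP", "NN", "WP", "VBD"],
             ["RP", "NN", "WP", "VBD"],
             ["RP", "NN", "WP", "NN"]]
print(rule_confidence(sequences, ["RP", "NN", "WP"], "VBD"))  # 2 of 3 matches
```

PredictiveApriori searches the space of such rules automatically and ranks them by a predictive accuracy measure instead of enumerating them by hand.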
An Overview of Speech Recognition Systems
SpringerBriefs in Electrical and Computer Engineering, 2011
This chapter presents an introduction to automatic speech recognition systems. It includes the mathematical formulation of speech recognizers and introduces the main components of speech recognition systems: front-end signal processing, acoustic models, decoding, training, the language model, and the pronunciation dictionary. Additionally, a brief literature review of speech recognition systems is provided, and the Viterbi and Baum-Welch algorithms are discussed as the fundamental techniques for the decoding and training phases, respectively.
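The Viterbi decoding discussed in the chapter can be sketched for a tiny HMM; the states, observations, and probabilities below are illustrative numbers, not a speech model.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # V[t][s] = (best probability of reaching state s at time t, best path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-2][prev][1] + [s])
                for prev in states
            )
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

states = ["hot", "cold"]
start_p = {"hot": 0.6, "cold": 0.4}
trans_p = {"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}}
emit_p = {"hot": {"high": 0.8, "low": 0.2}, "cold": {"high": 0.1, "low": 0.9}}
print(viterbi(["high", "high", "low"], states, start_p, trans_p, emit_p))
# ['hot', 'hot', 'cold']
```

In a speech recognizer the hidden states are phone (sub-word) states, the observations are acoustic feature vectors, and the same dynamic program recovers the best word sequence.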
Cross-Word Pronunciation Variations
SpringerBriefs in Electrical and Computer Engineering, 2011
This chapter presents the cross-word problem of the Arabic language. It also covers the main sources of this problem: Idgham (merging), Iqlaab (changing), deletion of Hamzat Al-Wasl, and the merging of two consecutive unvoweled letters. Illustrative examples of the cross-word problem are also provided.
Modeling of Arabic Cross-Word Pronunciation Variations
SpringerBriefs in Electrical and Computer Engineering, 2011
This chapter discusses some Arabic phonological rules that can be used to capture the variations occurring at word junctures. The rules include Idgham, Iqlaab, Hamzat Al-Wasl, and the merging of two consecutive unvoweled letters. An algorithm to model this problem is then provided.
Performance and Evaluation
SpringerBriefs in Electrical and Computer Engineering, 2011
This chapter presents the results achieved by modeling the cross-word pronunciation variation problem of MSA. We practically investigated two MSA phonological rules (Idgham and Iqlaab), which significantly enhanced recognition accuracy. Three ASR metrics were measured: word error rate (WER), out-of-vocabulary (OOV) rate, and perplexity (PP).
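Of the three metrics, perplexity can be computed directly from the language model's per-word probabilities on the test set; the probabilities below are toy values, not the study's model.

```python
import math

def perplexity(word_probs):
    """PP = exp(-(1/N) * sum(log p_i)) over the test-set word probabilities."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model assigning probability 0.25 to every test word behaves like a
# uniform choice among 4 words, so its perplexity is 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity means the language model is less "surprised" by the test text, which generally correlates with lower WER.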
The Baseline System
SpringerBriefs in Electrical and Computer Engineering, 2011
This chapter presents the main components of the baseline system that was used to test the proposed knowledge-based method. The Arabic speech corpus, Arabic phoneme set, Arabic language model, and Arabic pronunciation dictionary are described. The chapter also details how to build each of these Arabic automatic speech recognition (AASR) components.
Arabic Speech Recognition Systems
SpringerBriefs in Electrical and Computer Engineering, 2011
This chapter presents a brief overview of the evolution of Arabic speech recognition systems. It provides a literature survey of Arabic speech recognition systems and discusses some of the challenges Arabic poses from a speech recognition point of view.