This paper describes an Arabic Automatic Speech Recognition system developed on 15 hours of Multi-Genre Broadcast (MGB-3) data from YouTube, plus 1,200 hours of Multi-Dialect and Multi-Genre MGB-2 data recorded from the Aljazeera Arabic TV channel. We report our investigations of a range of signal pre-processing, data augmentation, topic-specific language model adaptation, accent-specific retraining, and deep-learning-based acoustic modeling topologies, such as feed-forward Deep Neural Networks (DNNs), Time-delay Neural Networks (TDNNs), Long Short-term Memory (LSTM) networks, Bidirectional LSTMs (BLSTMs), and a Bidirectional version of the Prioritized Grid LSTM (BPGLSTM) model. We propose a combination of three purely sequence-trained recognition systems based on lattice-free maximum mutual information, 4-gram language model re-scoring, and system combination using the minimum Bayes risk decoding criterion. The best word error rate we obtained on the MGB-3 Arabic development set using a 4-gram re-scoring strategy is 42.25% for a chain BLSTM system, compared to a 65.44% baseline for a DNN system.
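The word error rates quoted above follow the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is only an illustration of that definition, not the scoring tool used in the paper; the function names `word_error_rate` and `relative_reduction` are hypothetical, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: words are whitespace-separated
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` (both error rates)."""
    return (baseline - improved) / baseline
```

Applied to the figures reported in the abstract, `relative_reduction(65.44, 42.25)` is about 0.354, i.e. a relative word error rate reduction of roughly 35% from the DNN baseline to the chain BLSTM system.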
As a continuation of our efforts towards tackling the problem of spoken Dialect Identification (DID) for Arabic languages, we present the QCRI-MIT Advanced Dialect Identification System (QMDIS). QMDIS is an automatic spoken DID system for Dialectal Arabic (DA). In this paper, we report a comprehensive study of the three main components used in the spoken DID task: phonotactic, lexical, and acoustic. We use Support Vector Machines (SVMs), Logistic Regression (LR), and Convolutional Neural Networks (CNNs) as backend classifiers throughout the study. We perform all our experiments on a publicly available dataset and present new state-of-the-art results. QMDIS discriminates between the five most widely used dialects of Arabic, namely Egyptian, Gulf, Levantine, North African, and Modern Standard Arabic (MSA). We report ≈ 73% accuracy for system combination. All the data and the code used in our experiments are publicly available for research.
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, Jul 2014
Recent studies show that Gaussian mixture model (GMM) weights carry less, yet complementary, information to GMM means
for language and dialect recognition. However, state-of-the-art language recognition
systems usually do not use this information. In this research, a non-negative factor analysis (NFA) approach
is developed for GMM weight decomposition and adaptation. This modeling, which is conceptually simple and
computationally inexpensive, suggests a new low-dimensional utterance representation method using a factor
analysis similar to that of the i-vector framework.
The obtained subspace vectors are then applied in conjunction with i-vectors to the language/dialect
recognition problem.
The suggested approach is evaluated on the NIST
2011 and RATS language recognition evaluation (LRE) corpora and
on the QCRI Arabic dialect recognition evaluation (DRE) corpus.
The assessment results show that the proposed adaptation method yields more accurate recognition results
compared to three conventional weight adaptation approaches, namely maximum likelihood re-estimation, non-negative matrix
factorization, and a subspace multinomial model. Experimental results also show that the
intermediate-level fusion of i-vectors and NFA subspace vectors improves the performance
of the state-of-the-art i-vector framework especially for the case of short utterances.
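One of the baselines named above, maximum likelihood re-estimation of GMM weights, can be sketched as follows. This is not the NFA method itself, whose details the abstract does not give; it is a toy illustration assuming one-dimensional features and fixed means and variances, with hypothetical function names. Each utterance's weights are re-estimated by expectation-maximization as the average posterior occupation of each component.

```python
import math


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def reestimate_weights(frames, weights, means, variances, iterations=10):
    """Adapt only the mixture weights to `frames`; means and variances stay fixed."""
    w = list(weights)
    for _ in range(iterations):
        counts = [0.0] * len(w)
        for x in frames:
            # E-step: posterior probability of each component for this frame.
            likes = [wc * gaussian_pdf(x, m, v)
                     for wc, m, v in zip(w, means, variances)]
            total = sum(likes)
            for c, like in enumerate(likes):
                counts[c] += like / total
        # M-step: new weight = average occupation count of the component.
        w = [n / len(frames) for n in counts]
    return w
```

The resulting weights are non-negative and sum to one by construction, which are the same two constraints any weight adaptation method, including a factor-analysis one, has to respect.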
Papers by Ahmed Ali