In this paper we present a new method for text-independent speaker verification that combines segmental dynamic time warping (SDTW) and the d-vector approach. The d-vectors, generated from a feed-forward deep neural network trained to distinguish between speakers, are used as features to perform alignment and hence calculate the overall distance between the enrolment and test utterances. We present results on the NIST 2008 data set for speaker verification, where the proposed method outperforms the conventional i-vector baseline with PLDA scores, as well as the d-vector approach with local distances based on cosine and PLDA scores. Moreover, score combination with the i-vector/PLDA baseline leads to significant gains over both methods.
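As a minimal sketch of the core idea, the following code aligns two sequences of per-frame d-vectors with cosine local distances. It uses plain DTW rather than the paper's segmental variant, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_distance(a, b):
    # Local distance between two d-vectors: 1 - cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dtw_distance(enroll, test):
    # enroll, test: (T1, D) and (T2, D) arrays of per-frame d-vectors.
    T1, T2 = len(enroll), len(test)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = cosine_distance(enroll[i - 1], test[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Normalize by path length so scores are comparable across utterances.
    return D[T1, T2] / (T1 + T2)
```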
Generative Pre-trained Transformer (GPT) models have shown remarkable capabilities for natural language generation, but their performance for machine translation has not been thoroughly investigated. In this paper, we present a comprehensive evaluation of GPT models for machine translation, covering various aspects such as the quality of different GPT models in comparison with state-of-the-art research and commercial systems, the effect of prompting strategies, and robustness towards domain shifts and document-level translation. We experiment with eighteen different translation directions involving high and low resource languages, as well as non-English-centric translations, and evaluate the performance of three GPT models: ChatGPT, GPT-3.5 (text-davinci-003), and text-davinci-002. Our results show that GPT models achieve very competitive translation quality for high resource languages, while having limited capabilities for low resource languages. We also show that hybrid approaches, which combine GPT models with other translation systems, can further enhance the translation quality. We perform comprehensive analysis and human evaluation to further understand the characteristics of GPT translations. We hope that our paper provides valuable insights for researchers and practitioners in the field and helps to better understand the potential and limitations of GPT models for translation.
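As one concrete illustration of a prompting strategy, here is a hypothetical zero-/few-shot prompt template of the kind such evaluations compare; the exact prompts used in the paper are not reproduced here.

```python
def build_translation_prompt(src_lang, tgt_lang, text, examples=()):
    # examples: iterable of (source, reference) pairs for few-shot prompting;
    # leave empty for the zero-shot case.
    lines = [f"Translate the following {src_lang} sentence into {tgt_lang}."]
    for src, ref in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {ref}")
    lines.append(f"{src_lang}: {text}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)
```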
Multilingual Neural Machine Translation has been showing great success using transformer models. Deploying these models is challenging because they usually require large vocabulary (vocab) sizes for various languages. This limits the speed of predicting the output tokens in the last vocab projection layer. To alleviate these challenges, this paper proposes a fast vocabulary projection method via clustering which can be used for multilingual transformers on GPUs. First, we split the vocab search space offline into disjoint clusters given the hidden context vector of the decoder output, which results in much smaller vocab columns for vocab projection. Second, at inference time, the proposed method predicts the clusters and candidate active tokens for hidden context vectors at the vocab projection. This paper also includes an analysis of different ways of building these clusters in multilingual settings. Our results show end-to-end speed gains in float16 GPU inference of up to 25% while maintaining the BLEU score and only slightly increasing memory cost. The proposed method speeds up the vocab projection step itself by up to 2.6x. We also conduct an extensive human evaluation to verify that the proposed method preserves the quality of the translations from the original model.
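A minimal sketch of the two-stage projection at inference time, assuming nearest-centroid cluster assignment over decoder hidden states and precomputed per-cluster token shortlists (the offline step, not shown); all names are illustrative, not the paper's implementation.

```python
import numpy as np

def assign_cluster(h, centroids):
    # Pick the cluster whose centroid is closest to the decoder state h.
    return int(np.argmin(np.linalg.norm(centroids - h, axis=1)))

def fast_vocab_projection(h, centroids, shortlists, W, b):
    # W: (V, D) output embedding matrix, b: (V,) bias vector.
    # shortlists[c]: candidate token ids for cluster c, built offline.
    c = assign_cluster(h, centroids)
    cand = shortlists[c]
    logits = W[cand] @ h + b[cand]       # project over a small vocab slice only
    return cand[int(np.argmax(logits))]  # greedy next-token choice
```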
This paper proposes a simple yet effective method to improve direct (X-to-Y) translation for both cases: zero-shot and when direct data is available. We modify the input tokens at both the encoder and decoder to include signals for the source and target languages. We show a performance gain when training from scratch, or finetuning a pretrained model, with the proposed setup. In the experiments, our method shows a gain of nearly 10.0 BLEU points on in-house datasets, depending on the checkpoint selection criteria. In a WMT evaluation campaign, From-English performance improves by 4.17 and 2.87 BLEU points in the zero-shot setting and when direct data is available for training, respectively, while X-to-Y performance improves by 1.29 BLEU over the zero-shot baseline and 0.44 over the many-to-many baseline. In the low-resource setting, we see a 1.5 to 1.7 point improvement when finetuning on X-to-Y domain data.
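A minimal sketch of the input modification, assuming simple angle-bracket language tags; the actual token format used in the paper may differ.

```python
def tag_example(src_tokens, tgt_tokens, src_lang, tgt_lang):
    # Add source- and target-language signals on both the encoder and
    # decoder sides. The "<xx>" / "<2xx>" tag formats are illustrative.
    encoder_input = [f"<{src_lang}>", f"<2{tgt_lang}>"] + src_tokens
    decoder_input = [f"<{tgt_lang}>"] + tgt_tokens
    return encoder_input, decoder_input
```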
This paper describes our submission to the WMT20 sentence filtering task. We combine scores from (1) a custom LASER model built for each source language, (2) a classifier built to distinguish positive and negative pairs by semantic alignment, and (3) the original scores included in the task devkit. For the mBART finetuning setup provided by the organizers, our method shows 7% and 5% relative improvement over the baseline in sacreBLEU score on the test set for Pashto and Khmer, respectively.
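A minimal sketch of one way to combine the three scores, assuming min-max normalization followed by a weighted sum; the weights shown are illustrative, and the submission's actual combination scheme may differ.

```python
import numpy as np

def combine_scores(laser, classifier, devkit, weights=(0.4, 0.4, 0.2)):
    # Min-max normalize each score array to [0, 1] before mixing, so the
    # three scores live on a comparable scale.
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-9)
    w1, w2, w3 = weights
    return w1 * norm(laser) + w2 * norm(classifier) + w3 * norm(devkit)
```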
Correlation based predictive adaptation of hidden Markov models
5th European Conference on Speech Communication and Technology (Eurospeech 1997)
This paper describes the Microsoft Egypt Development Center (EgDC) submission to the constrained track of the WMT21 shared news translation task. We focus on the three relatively low resource language pairs Bengali ↔ Hindi, English ↔ Hausa and Xhosa ↔ Zulu. To overcome the limited amount of parallel data, we train a multilingual model using a multitask objective employing both parallel and monolingual data. In addition, we augment the data using back-translation. We also train a bilingual model incorporating back-translation and knowledge distillation, then combine the two models using sequence-to-sequence mapping. We see around 70% relative gain in BLEU for En ↔ Ha and around 25% relative improvement for Bn ↔ Hi and Xh ↔ Zu compared to bilingual baselines.
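A minimal sketch of the back-translation augmentation step, with `reverse_model.translate` standing in as a placeholder for any target-to-source MT system; it is not an API from the paper.

```python
def back_translate(mono_tgt_sentences, reverse_model):
    # Translate monolingual target-side text back into the source language.
    # The resulting synthetic (source', target) pairs augment the scarce
    # genuine parallel data.
    return [(reverse_model.translate(t), t) for t in mono_tgt_sentences]
```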
In this paper we present an extension of our previously proposed feature space stereo-based stochastic mapping (SSM). As distinct from the auxiliary stereo Gaussian mixture model in the front-end in our previous work, a stereo HMM model in the back-end is used. The basic idea, as in feature space SSM, is to form a joint space of the clean and noisy features, but to train a Gaussian mixture HMM in the new space. The MMSE estimation, which is the conditional expectation of the clean speech given the sequence of noisy observations, leads to clean speech predictors at the granularity of the Gaussian distributions in the HMM model. Because the Gaussians are not known during decoding, N-best hypotheses are employed. This results in a clean speech predictor which is a weighted (by posteriors) sum of the estimates from the different Gaussian distributions. In experimental evaluation on the Aurora 2 database, the proposed method gives better performance than the MST model, with about 10%-20% relative improvement under unseen noise conditions.
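In equation form, the posterior-weighted predictor described above can be written as follows. This is a simplified single-observation view with our own notation; the paper conditions on the full noisy sequence through N-best hypotheses.

```latex
% MMSE clean-speech estimate as a posterior-weighted sum over the
% Gaussian components k of the stereo HMM:
\hat{x} = \mathrm{E}[x \mid y] = \sum_{k} p(k \mid y)\,\mathrm{E}[x \mid y, k],
% where each component-conditional predictor is the joint-Gaussian
% conditional mean over the stereo (clean, noisy) space:
\mathrm{E}[x \mid y, k] = \mu_{x,k} + \Sigma_{xy,k}\,\Sigma_{yy,k}^{-1}\bigl(y - \mu_{y,k}\bigr).
```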
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007
This paper reports on experiments to quantify the impact of Automatic Speech Recognition (ASR) in general, and discriminatively trained ASR in particular, on Machine Translation (MT) performance. The Minimum Phone Error (MPE) training method is employed for building the discriminative ASR acoustic models, and a Weighted Finite State Transducer (WFST) based method is used for MT. The experiments are performed on a two-way English/Dialectal-Arabic speech-to-speech (S2S) translation task in the military/medical domain. We demonstrate the relationship between ASR and MT performance, measured by BLEU and human judgment, for both directions of the translation. Moreover, we question the use of the BLEU metric for assessing MT quality, present our observations, and draw some conclusions.
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007
We propose a Gaussian mixture language model (GMLM) for speech recognition. Two potential benefits of using this model are smoothing unseen events and ease of adaptation. It is shown how this model can be used alone or in conjunction with a conventional N-gram model to calculate word probabilities. An interesting feature of the proposed technique is that many methods developed for acoustic models can be easily ported to the GMLM. We developed two implementations of the proposed model for large vocabulary Arabic speech recognition, with results comparable to a conventional N-gram model.
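A hedged sketch of how such a model could assign probabilities, with f(w, h) a continuous representation of the word and its history; the notation and the interpolation form are our assumptions, not necessarily the paper's exact formulation.

```latex
% GMM score over a continuous representation f(w, h) of word w and
% history h, interpolated with a conventional N-gram:
p_{\mathrm{GMLM}}(w \mid h) \propto \sum_{m} c_m\,
    \mathcal{N}\bigl(f(w, h);\, \mu_m, \Sigma_m\bigr),
\qquad
p(w \mid h) = \lambda\, p_{\mathrm{GMLM}}(w \mid h)
            + (1 - \lambda)\, p_{N\text{-gram}}(w \mid h).
```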
Proceedings of the Workshop on Medical Speech Translation - MST '06, 2006
In this paper, we describe IBM MASTOR, a speech-to-speech translation system that can translate spontaneous free-form speech in real-time on both laptops and hand-held PDAs. Challenges include speech recognition and machine translation in adverse environments, lack of training data and linguistic resources for under-studied languages, and the need to rapidly develop capabilities for new languages. Another challenge is designing algorithms and building models in a scalable manner that performs well even on memory- and CPU-constrained hand-held computers. We describe our approaches, experience, and success in building working free-form S2S systems that can handle two language pairs (including a low-resource language).
This paper presents a theoretical framework for environment normalization training and adaptation in the context of mixture stochastic trajectory models. The presented approach extends the currently successful technique of environment normalization, used in adapting hidden Markov models, to segment-based models. It also adds to the environment normalization framework a novel method for representing and combining different sources of variability. In our approach the normalization and adaptation are performed using linear transformations. When applied to speaker and noise adaptation in a continuous speech recognition task, our method led to up to 34% improvement in recognition accuracy for speaker adaptation compared to unadapted models. For noise adaptation, the technique outperformed environment-dependent models in some of the tested cases. It was also observed that using environment normalization training in conjunction with transformation adaptation outperforms conventional MLLR.
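As a hedged illustration, per-environment linear normalization and one possible way of combining two sources of variability might look like the following; the specific parameterization is our assumption, not the paper's.

```latex
% Per-environment linear normalization of an observation o:
\hat{o} = A_e\,o + b_e,
% and a composition of transforms for two sources of variability,
% e.g., speaker and noise:
\hat{o} = A_{\mathrm{spk}}\bigl(A_{\mathrm{noise}}\,o + b_{\mathrm{noise}}\bigr) + b_{\mathrm{spk}}.
```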
Adaptation techniques that benefit from distribution correlation are important in practical situations with sparse adaptation data. The so-called extended MAP (EMAP) algorithm provides an optimal, though expensive, solution. In this article we start from EMAP and propose an approximate optimisation criterion based on maximising a set of local densities. We then obtain expressions for these local densities based on the principle of minimum cross-entropy (MCE). The solution to the MCE problem is obtained using an analogy with MAP estimation, and avoids the use of complex numerical procedures, thus resulting in a simple adaptation algorithm. The implementation of the proposed method for the adaptation of HMMs with mixture Gaussian densities is discussed, and its efficiency is evaluated on an alphabet recognition task.
An adaptation algorithm using the theoretically optimal maximum a posteriori (MAP) formulation, while at the same time accounting for parameter correlation between different classes, is desirable, especially when using sparse adaptation data. However, a direct implementation of such an approach may be prohibitive in many practical situations. In this letter, we present an algorithm that approximates the above-mentioned correlated MAP algorithm by iteratively maximizing the set of posterior marginals. With some simplifying assumptions, expressions for these marginals are then derived using the principle of minimum cross-entropy. The resulting algorithm is simple, and includes conventional MAP estimation as a special case. The utility of the proposed method is tested in adaptation experiments for an alphabet recognition task.
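For reference, the conventional MAP estimate of a Gaussian mean, the special case recovered when cross-class correlations are ignored, takes the standard form:

```latex
% Standard MAP update of a Gaussian mean from n adaptation samples
% x_1, ..., x_n, with prior mean \mu_0 and prior weight \tau:
\hat{\mu} = \frac{\tau\,\mu_0 + \sum_{t=1}^{n} x_t}{\tau + n}.
```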
In this work we extend our previously proposed stochastic mixture trajectory models to modelling time correlation. To achieve this extension we explicitly model the time evolution of an observed trajectory by the sum of a first-order AR process and a mean component. This approach generalizes that employed in Digalakis et al. by using a mixture of trajectories to represent a phone in a parameter space. This generalization is necessary, from our experience, to account for different contextual variants of a phone. Optimum parameter estimates are obtained by two embedded EM algorithms. Evaluated on an 850-word vocabulary continuous speech recognition task, the new method reduced the recognition error rate by about 25%.
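Written out, the trajectory model described above takes a form like the following; the notation is ours, and the paper's exact parameterization may differ.

```latex
% Observed trajectory as a mean component plus a first-order AR process:
o_t = \mu_t + s_t, \qquad s_t = A\,s_{t-1} + e_t, \qquad e_t \sim \mathcal{N}(0, \Sigma),
% with a mixture of such trajectories representing each phone, and the
% parameters estimated by two embedded EM algorithms.
```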
Arabic has a large number of affixes that can modify a stem to form words. In automatic speech recognition (ASR) this leads to a high out-of-vocabulary (OOV) rate for typical lexicon sizes, and hence a potential increase in WER. This is even more pronounced for dialects of Arabic, where additional affixes are often introduced and the available data is typically sparse. To address this problem we introduce a simple word decomposition algorithm which only requires a text corpus and a predefined list of affixes. Using this algorithm to create the lexicon for Iraqi Arabic ASR results in about 10% relative improvement in word error rate (WER). Using the union of the segmented and unsegmented vocabularies and interpolating the corresponding language models results in further WER reduction; the net WER improvement is about 13%.
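A minimal sketch of such a decomposition, assuming greedy longest-affix-first stripping with a corpus frequency check on the residual stem; these details are illustrative assumptions, not the paper's exact rules.

```python
def decompose(word, prefixes, suffixes, stem_counts, min_stem_len=3):
    # Greedy affix stripping: try longer affixes first and accept a split
    # only when the residual stem is attested in the corpus (assumed rule).
    for pre in sorted(prefixes, key=len, reverse=True):
        if not word.startswith(pre):
            continue
        rest = word[len(pre):]
        for suf in sorted(suffixes, key=len, reverse=True):
            if not rest.endswith(suf):
                continue
            stem = rest[:len(rest) - len(suf)] if suf else rest
            if len(stem) >= min_stem_len and stem_counts.get(stem, 0) > 0:
                # "+" marks morph boundaries so segments can be rejoined
                # after decoding.
                return ([pre + "+"] if pre else []) + [stem] + \
                       (["+" + suf] if suf else [])
    return [word]
```

Including the empty string in both affix lists lets the sketch consider prefix-only and suffix-only splits as well.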
2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010
This paper focuses on a comparison of two continuous space language modeling techniques, namely Tied-Mixture Language Modeling (TMLM) and Neural Network based Language Modeling (NNLM). Additionally, we report on using alternative feature representations for the words and histories used in TMLM. Besides bigram co-occurrence based features, we consider using NNLM based input features for training TMLMs. We also describe how we improve certain steps in building TMLMs. We demonstrate that TMLMs provide significant relative improvements in Character Error Rate (CER) of over 16% and 10% for Mandarin speech recognition over the trigram and NNLM models, respectively, in a speech-to-speech translation task.