About vocabulary adaptation for automatic speech recognition of video data
This paper discusses the adaptation of vocabularies for automatic speech recognition. The context is the transcription of videos in French, English and Arabic. Baseline automatic speech recognition systems have been developed using available data. However, the available text data, including the GigaWord corpora from LDC, are getting quite old with respect to the recent videos to be transcribed. The paper presents the collection of recent textual data from the internet for updating the speech recognition vocabularies and training the language models, as well as the elaboration of the development data sets needed for the vocabulary selection process. The paper also compares the coverage of the training data collected from the internet, and of the GigaWord data, with finite-size vocabularies made of the most frequent words. Finally, the paper presents and discusses the amount of out-of-vocabulary word occurrences, before and after the update of the vocabularies, for the three languages.
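The coverage comparison described above can be illustrated with a minimal sketch: build a vocabulary from the most frequent words of a training corpus, then measure the fraction of a development set it fails to cover (the OOV rate). The function name and the toy data below are hypothetical, not the paper's actual corpora or tooling.

```python
from collections import Counter

def oov_rate(train_tokens, dev_tokens, vocab_size):
    """Build a vocabulary of the `vocab_size` most frequent training words
    and return the fraction of development-set tokens it misses (OOV rate)."""
    vocab = {w for w, _ in Counter(train_tokens).most_common(vocab_size)}
    oov = sum(1 for t in dev_tokens if t not in vocab)
    return oov / len(dev_tokens)

# Toy example (hypothetical data):
train = "the cat sat on the mat the cat ran".split()
dev = "the dog sat on the mat".split()
print(oov_rate(train, dev, vocab_size=4))  # 'dog' and 'mat' are OOV: 2/6
```

Updating the vocabulary with recent text lowers this rate when the development set reflects the recent videos, which is exactly what the elaborated development sets are for.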
This paper discusses the adaptation of speech recognition vocabularies for automatic speech transcription. The context is the transcription of YouTube videos in French, English and Arabic. Baseline automatic speech recognition systems have been developed using previously available data. However, the available text data, including the GigaWord corpora from LDC, are getting quite old with respect to the recent YouTube videos to be transcribed. After a discussion of the performance of the baseline ASR systems, the paper presents the collection of recent textual data from the internet for updating the speech recognition vocabularies and for training the language models, as well as the elaboration of the development data sets needed for the vocabulary selection process. The paper also compares the coverage of the training data collected from the internet, and of the GigaWord data, with finite-size vocabularies made of the most frequent words. Finally, the paper presents and discusses the amount of out-of-vocabulary word occurrences, before and after the update of the vocabularies, for the three languages.
Code-switching (CS) is the phenomenon that occurs when a speaker alternates between two or more languages within an utterance or discourse. In this work, we investigate the existence of code-switching in formal text, namely the proceedings of multilingual institutions. Our study is carried out on Arabic-English code-mixing in a parallel corpus extracted from official documents of the United Nations. We build a parallel code-switched corpus with two reference translations, one in pure Arabic and the other in pure English. We also carry out a human evaluation of this resource, with the aim of using it to evaluate the translation of code-switched documents. To the best of our knowledge, no corpus of this kind exists; the one we propose is unique. This paper examines several methods for translating a code-switched corpus: conventional statistical machine translation, end-to-end neural machine translation, and multitask learning.
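For Arabic-English mixing, the presence of code-switching in a sentence can be detected quite directly, because the two languages use different scripts. The sketch below is an illustrative heuristic, not the paper's extraction method; the function names and the example sentence are made up, and only the basic Arabic Unicode block is checked.

```python
def token_script(token):
    """Classify a token as 'arabic', 'latin', or 'other' by its first
    alphabetic character (checks only the basic Arabic block U+0600-U+06FF)."""
    for ch in token:
        if '\u0600' <= ch <= '\u06FF':
            return 'arabic'
        if ch.isascii() and ch.isalpha():
            return 'latin'
    return 'other'

def has_code_switch(tokens):
    """True if both Arabic-script and Latin-script tokens occur."""
    scripts = {token_script(t) for t in tokens}
    return {'arabic', 'latin'} <= scripts

# Hypothetical mixed UN-style fragment: Arabic words around an English name
sent = ['قرار', 'General', 'Assembly', 'رقم', '61/16']
print(has_code_switch(sent))  # True: both scripts present
```

A real pipeline would also have to handle transliterated words and shared tokens such as numbers, which a script-based test alone cannot classify.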
Investigating Data Sharing in Speech Recognition for an Under-Resourced Language: The Case of Algerian Dialect
The Arabic language has many varieties, including its standard form, Modern Standard Arabic (MSA), and its spoken forms, namely the dialects. These dialects are representative examples of under-resourced languages, for which automatic speech recognition remains an unresolved issue. To address this issue, we recorded several hours of spoken Algerian dialect and used them to train a baseline model. This model was then boosted by taking advantage of other languages that influence this dialect, integrating their data into one large corpus and investigating three approaches: multilingual training, multitask learning and transfer learning. The best performance was achieved using a limited and balanced amount of acoustic data from each additional language, relative to the data size of the studied dialect. This approach led to an improvement of 3.8% in terms of word error rate over the baseline system trained only on the dialect data.
In this paper, we present the first results of the AMIS project (Access Multilingual Information opinionS), funded by Chist-Era. The main goal of this project is to make the content of a video in a foreign language understandable. In this work, we consider understanding as the ability to capture the most important ideas contained in media expressed in a foreign language. In other words, understanding is approached through the global meaning of the content, not through the meaning of each fragment of a video. Several stumbling blocks remain before the goal is reached, concerning the following aspects: video summarization, speech recognition, machine translation and speech segmentation. All these issues are discussed, and the methods used to develop each of these components are presented. A first implementation has been achieved, and each component of the system is evaluated on representative test data. We also propose a protocol for a global subjective evaluation of AMIS.
Language modeling is a very important step in many NLP applications. Most current language models are based on probabilistic methods. In this paper, we propose a new language modeling approach based on possibility theory. Our goal is to suggest a method for estimating the possibility of a word sequence and to test this new approach in a machine translation system. We propose a word-sequence possibilistic measure, which can be estimated from a corpus. We proceeded in two ways: first, we checked the behavior of the new approach against existing work; second, we compared the new language model with the probabilistic one used in statistical MT systems. The results, in terms of the METEOR metric, show that the possibilistic language model is better than the probabilistic one. However, in terms of BLEU and TER scores, the probabilistic model remains better.
Arabic Language Processing: From Theory to Practice, 2019
In this paper, we present and evaluate a method for extractive text-based summarization of Arabic videos. The algorithm is proposed within the scope of the AMIS project, which aims at helping a user understand videos in a foreign language (Arabic). To that end, the project proposes several strategies to translate and summarize the videos. One of them consists in transcribing the Arabic videos, summarizing the transcriptions, and translating the summary. In this paper we describe the video corpus collected from YouTube, and present and evaluate the transcription-summarization part of this strategy. Moreover, we present the Automatic Speech Recognition (ASR) system used to transcribe the videos, and show how we adapted this system to the Algerian dialect. Then, we describe how we automatically segment the sequence of words provided by the ASR system into sentences, and how we summarize the resulting sequence of sentences. We evaluate our approach both objectively and subjectively. Results show that the ASR system performs well in terms of Word Error Rate on MSA, but needs to be adapted to deal with Algerian dialect data. The subjective evaluation shows the same behaviour as the ASR results: transcriptions of videos containing dialectal data were scored better than those of videos containing only MSA data. However, summaries based on the transcriptions are not rated as well, even when the transcriptions themselves are better rated. Lastly, the study shows that features such as the lengths of transcriptions and summaries, and the subjective score of transcriptions, explain only 31% of the subjective score of summaries.
The aim of this work is to report the results of the Chist-Era project AMIS (Access Multilingual Information opinionS). The purpose of AMIS is to answer the following question: how can information in a foreign language be made accessible to everyone? The issue is not limited to translating a source video into a target-language video, since the objective is to provide only the main idea of an Arabic video in English. This objective requires research in several areas that have not all reached maturity: video summarization, speech recognition, machine translation, audio summarization and speech segmentation. In this article we present several possible architectures to achieve our objective, but focus on only one of them. The scientific obstacles are presented, and we explain how to deal with them. One of the big challenges of this work is to devise a way to evaluate objectively a system composed of several components, knowing that each of them has its limits and that errors can propagate from one component to the next. A subjective evaluation procedure is also proposed, in which several annotators were mobilized to assess the quality of the produced summaries.
This paper addresses the comparability of comments extracted from YouTube. The comments concern spoken Algerian, which may be local Arabic, Modern Standard Arabic or French. This diversity of expression gives rise to numerous data-processing problems. In this article, several alignment methods are proposed and tested. The method that aligns best is a Word2Vec-based approach applied iteratively. This recurrent use of Word2Vec significantly improves the comparability results: a dictionary-based approach leads to a Recall of 4, while our approach achieves a Recall of 33 at rank 1. Using this approach, we built CALYOU, a comparable corpus of spoken Algerian, from YouTube.
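The core of an embedding-based alignment such as the one described can be sketched as follows: represent each comment by the average of its word vectors and rank candidate comments by cosine similarity. This is a minimal illustration under assumed toy 2-d embeddings; the function names and vectors are hypothetical, and a real system would use trained Word2Vec vectors and iterate the alignment as the paper does.

```python
import math

def avg_vector(tokens, embeddings):
    """Mean of the word vectors of the tokens found in the embedding table
    (tokens without a vector are skipped)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_match(query, candidates, embeddings):
    """Return the index of the candidate comment closest to the query."""
    q = avg_vector(query, embeddings)
    scores = [cosine(q, avg_vector(c, embeddings)) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 2-d embeddings (hypothetical; a real system learns these with Word2Vec)
emb = {'good': [1.0, 0.1], 'bad': [-1.0, 0.1],
       'great': [0.9, 0.2], 'awful': [-0.9, 0.2]}
print(best_match(['good'], [['awful'], ['great']], emb))  # 1: 'great' wins
```

The iterative variant would re-estimate the embeddings (or the alignment dictionary) from the pairs retrieved at the previous pass, which is what drives the reported Recall improvement.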
The statistical phrase-based approach dominated machine translation research for the last twenty years. Recently, a new paradigm based on neural networks has been proposed: Neural Machine Translation (NMT). Even though challenges remain, NMT shows promising results, outperforming Statistical Machine Translation (SMT) on some language pairs. The baseline architecture used in NMT systems is a single large neural network that translates a whole source sentence into a target one. Several powerful and advanced techniques have been proposed to improve this baseline system and achieve performance comparable to the state-of-the-art approach. This article describes some of these techniques and compares them with the conventional SMT approach on the task of Arabic-English machine translation. The result obtained by the NMT system is close to that obtained by the SMT system on our data set.
Proceedings of the Third Arabic Natural Language Processing Workshop
Automatic speech recognition for Arabic is a very challenging task. Although the classical techniques for Automatic Speech Recognition (ASR) can be applied effectively to Arabic, it is essential to take the specificities of the language into account to improve system performance. In this article, we focus on Modern Standard Arabic (MSA) speech recognition. We introduce the challenges related to the Arabic language, namely its complex morphology and the absence of short vowels in written text, which leads to several potential, often conflicting, vowelizations for each grapheme. We develop an ASR system for MSA using the Kaldi toolkit. Several acoustic and language models are trained. We obtain a Word Error Rate (WER) of 14.42 for the baseline system, and a 12.2% relative improvement by rescoring the lattice and rewriting the output with the right hamza above or below Alif.
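The WER figures quoted above are the standard word-level edit distance between the reference transcript and the system output, normalized by the reference length. A minimal sketch of that computation (illustrative only, with a made-up English example rather than Arabic data):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))
# 1 substitution + 1 deletion over 6 reference words: 2/6
```

Post-processing steps such as the hamza rewriting mentioned above reduce WER by turning near-miss graphemic forms back into exact matches against the reference.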
Papers by Amine Menacer