Papers by Diamantino Caseiro
In this work, we present a model-based approach to contextual biasing that improves quality without drastically increasing model computation during inference. Specifically, we look at injecting text data during training that is representative of the contextually relevant content that will be seen at inference, using a modality-matching text injection method known as JOIST. As JOIST injects text data directly into the E2E model, there is no additional model computation during inference, a major difference from most model-based biasing techniques. We find that our proposed approach, when combined with an FST-based context model, improves recognition of contacts by 5-15% relative.
This paper describes some experiments with pronunciation modeling for spontaneous speech in European Portuguese. The transducer framework provides an elegant way to combine a pronunciation lexicon of canonical forms with alternative pronunciation rules. The main phonological aspects that the rules are intended to cover are: vowel devoicing, deletion and coalescence, voicing assimilation, and simplification of consonantal clusters, both within words and across word boundaries. Our aligner proved sufficiently robust to process fairly long dialogs with overlapping turns, despite many limitations, namely the absence of models for voice quality changes.
A Mobile Virtual Assistant (MVA) is a communication agent that recognizes and understands free speech, and performs actions such as retrieving information and completing transactions. One essential characteristic of MVAs is their ability to learn and adapt without supervision. This paper describes our ongoing research in developing more intelligent MVAs that recognize and understand very large vocabulary speech input across a variety of tasks. In particular, we present our architecture for unsupervised acoustic and language model adaptation. Experimental results show that unsupervised acoustic model learning approaches the performance of supervised learning when adapting on 40-50 device-specific utterances. Unsupervised language model learning results in an 8% absolute drop in word error rate.

Interspeech 2017, 2017
We present a new method for estimating the sparse non-negative model (SNM) by using a small amount of held-out data and the multinomial loss that is natural for language modeling; we validate it experimentally against the previous estimation method, which uses leave-one-out on training data and a binary loss function, and show that it performs equally well. Being able to train on held-out data is very important in practical situations where training data is mismatched from held-out/test data. We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model, which is the only model component estimated using gradient descent; the bulk of model parameters are relative frequencies counted on training data. A second contribution is a comparison between SNM and the related class of Maximum Entropy language models. While much cheaper computationally, we show that SNM achieves slightly better perplexity results for the same feature set and the same speech recognition accuracy on voice search and short message dictation.

Preface: The design and study of spoken dialog systems is a relatively young research field compared to other speech technologies such as recognition and synthesis. In recent years, however, as these core technologies have improved, the field of spoken dialog systems has been generating increased interest both in the research community and in industry. While most of the early work originated from the artificial intelligence community and addressed high-level issues such as discourse planning, the development and deployment of actual usable systems has led to the emergence of a wide range of new issues such as error handling in dialog, multimodal integration, or rapid system development. At the same time, researchers from a variety of disciplines including speech and language technologies, robotics, and human-computer interaction have started to bring their unique skills and backgrounds to bear on these issues. Unfortunately, while this richness and variety of interests constitute ...

Spoken language processing using weighted finite state transducers
The main goal of this paper is to illustrate the advantages of weighted finite state transducers (WFSTs) for spoken language processing, namely in terms of their capacity to efficiently integrate different types of knowledge sources. We illustrate their applicability in several areas: large vocabulary continuous speech recognition, automatic alignment using pronunciation modeling rules, grapheme-to-phone conversion, and speech-to-speech translation. The impact of the use of WFSTs in spoken language processing for European Portuguese was particularly noticeable in the area of broadcast news recognition, in which we used a specialized composition algorithm for composing the lexicon with the language model. Among other properties, this algorithm allows the on-the-fly generation of the composition WFST, making it amenable to embedding in a dynamic recognition system. The WFST approach achieved a 6-fold reduction in decoding time relative to a previous decoder not based on WFSTs.
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Word Alignment in Digital Talking Books Using WFSTs
Lecture Notes in Computer Science, 2002
... The hybrid acoustic models used in the alignment of spoken books were originally developed for a dictation task [3], in an ... 3. Example of a rule specified using the rule specification language. ... R is applied 3 times (SR = π2(S ◦ R ◦ R ◦ R)), to allow the application of one rule to the ...
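The formula in the snippet above, SR = π2(S ◦ R ◦ R ◦ R), composes the canonical pronunciation S with the rule transducer R three times and then projects onto the output side, so one rule can feed another. A minimal Python sketch of that idea, with transducers modeled as finite string relations; the phone strings and the final-vowel-deletion rule below are illustrative assumptions, not the paper's actual lexicon or rules:

```python
# Toy illustration of SR = pi2(S o R o R o R): a "transducer" is modeled
# here as a finite relation, i.e. a set of (input, output) string pairs.

def compose(a, b):
    """Relation composition: pairs (x, z) with (x, y) in a and (y, z) in b."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

def project_output(rel):
    """pi2: keep only the output side of each pair."""
    return {z for (_, z) in rel}

def apply_rule(strings):
    """R: an optional final-vowel-deletion rule (hypothetical).

    Every string maps to itself and, if it ends in 'u', also to the
    form without the final vowel — optional rules keep both variants.
    """
    rel = set()
    for s in strings:
        rel.add((s, s))
        if s.endswith("u"):
            rel.add((s, s[:-1]))
    return rel

# S: the canonical pronunciation of one word (identity on its phone string).
S = {("kazu", "kazu")}

# Apply R three times, as in SR = pi2(S o R o R o R).
current = S
for _ in range(3):
    R = apply_rule(project_output(current))
    current = compose(current, R)

print(sorted(project_output(current)))  # ['kaz', 'kazu']
```

Both the canonical form and the reduced variant survive, which is exactly why the rule is composed in rather than applied destructively: the aligner can pick whichever pronunciation matches the audio.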
The Spoken Language Systems Lab was formally created in 2001, bringing together the expertise of several research groups that shared a common goal: to bridge the gap between natural spoken language and the underlying semantic information, focusing on European Portuguese. This paper describes our efforts towards this long-term goal, starting with the two main areas of activity: semantic processing of multimedia contents, and multimodal dialogue systems. These strongly interdisciplinary areas integrate several core technologies developed in the lab, namely speech recognition and synthesis. This interdisciplinarity is also strongly evident in the lab's most recent activities, such as speech-to-speech translation.
Automatic speech recognition (ASR) machine learning models are used to recognize spoken commands or queries from users. End-to-end ASR models, which directly map a sequence of input acoustic features into a sequence of words, greatly simplify ASR system building and maintenance. This disclosure describes techniques to improve the performance of ...

Automatic spoken language identification is the problem of identifying the language being spoken from a sample of speech by an unknown speaker. Current language identification systems vary in their complexity. The systems that use higher-level information have the best performance. Nevertheless, that information is hard to collect for each new language. In this work, we present a state-of-the-art language identification system which uses very little linguistic information and is thus easily extendable to new languages. In fact, the presented system needs only one language-specific phone recogniser (in our case the Portuguese one), and is trained with speech from each of the other languages. We studied the problem of language identification in the context of the European languages (including, for the first time, European Portuguese), which allowed us to study the effect of language proximity in Indo-European languages. The results reveal a significant impact on the identification of some languages.

This paper reports an experience in producing manual word alignments over six different language pairs (all combinations between Portuguese, English, French and Spanish) (Graca et al., 2008). Word alignment of each language pair is made over the first 100 sentences of the common test set from the Europarl corpora (Koehn, 2005), corresponding to 600 newly annotated sentences. This collection is publicly available at http://www.l2f.inesc-id.pt/resources/translation/. It contains, to our knowledge, the first word alignment gold set for the Portuguese language, with three other languages. Moreover, it is, to our knowledge, the first multi-language manually word-aligned parallel corpus, where the same sentences are annotated for each language pair. We started by using the guidelines presented in (Marino, 2005) and performed several refinements: some due to under-specifications in the original guidelines, others because of disagreement on some choices. This led to the development of an extens...
Spoken Language Technologies Appl
Digital Talking Books (DTBs) offer to visually impaired users an evolution of analogue talking books that mimics the interaction possibilities of print books. This paper describes a new DTB player which tries to improve the usability and accessibility of current players, through the combination of the possibilities offered by multimodal interaction and interface adaptability, and the integration of several language processing components. Besides the potential for a greater enjoyment of the reader in general, these modifications also pave the way to the use of DTBs in different domains, from e-inclusion to e-learning applications.

Considerable effort has been devoted to increasing and broadening our speech and text data resources. Digital Talking Books (DTBs), comprising both speech and text data, are, as such, an invaluable asset as multimedia resources. Furthermore, these DTBs have undergone a speech-to-text alignment procedure, either word- or phone-based, to increase their potential in research activities. This paper thus describes the motivation and the method that we used to accomplish this goal of aligning DTBs. This alignment allows specific access interfaces for persons with special needs, as well as tools for easily detecting and indexing units (words, sentences, topics) in the spoken books. The alignment tool was implemented in a Weighted Finite State Transducer framework, which provides an efficient way to combine different types of knowledge sources, such as alternative pronunciation rules. With this tool, a 2-hour long spoken book was aligned in a single step in much less than real time. Last but not least ...

9th European Signal Processing Conference (EUSIPCO 1998), 1998
Automatic spoken language identification is the problem of identifying the language being spoken from a sample of speech by an unknown speaker. In this paper we studied the problem of language identification in the context of the European languages, which allowed us to study the effect of language proximity in Indo-European languages. The results reveal a significant impact on the identification of some languages. Current language identification systems vary in their complexity. The systems that use higher-level information have the best performance. Nevertheless, that information is hard to collect for each new language. The system presented in this work is easily extendable to new languages because it uses very little linguistic information. In fact, the presented system needs only one language-specific phone recogniser (in our case the Portuguese one), and is trained with speech from each of the other languages. With the SpeechDat-M corpus, with 6 European languages (English, Fre...
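The single-recognizer setup described above is the classic phonotactic approach: one phone recognizer produces a phone string, and a separate phone n-gram model per language scores it; the language whose model assigns the highest log-probability wins. A minimal sketch under assumed data (the toy phone strings, the smoothing constants, and the phone-inventory size of 30 are all made-up illustrations, not the paper's actual models):

```python
# Hedged sketch of phonotactic language ID: per-language phone-bigram
# models score the phone string emitted by a single phone recognizer.
import math
from collections import defaultdict

def train_bigram(phone_strings, smoothing=0.1):
    """Estimate smoothed bigram probabilities from phone strings
    (one character per phone; 30 is an assumed phone-inventory size)."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in phone_strings:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1.0
    model = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        model[a] = {b: (c + smoothing) / (total + smoothing * 30)
                    for b, c in nxt.items()}
    return model

def score(model, s, floor=1e-4):
    """Log-probability of the phone string under one language's model;
    unseen bigrams get a small floor probability."""
    return sum(math.log(model.get(a, {}).get(b, floor))
               for a, b in zip(s, s[1:]))

# Toy training phone strings per language (hypothetical examples).
models = {
    "pt": train_bigram(["suN", "sAu", "kaza", "muzi"]),
    "en": train_bigram(["DIs", "D@t", "TINk", "wIT"]),
}

utterance = "kaza"  # phone string from the (Portuguese) phone recognizer
best = max(models, key=lambda lang: score(models[lang], utterance))
print(best)  # 'pt'
```

The design mirrors why the system extends easily to new languages: adding a language needs only speech to decode into phone strings, not a new recognizer or linguistic resources.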
The goal of this work was to develop an algorithm for the integration of the lexicon with the language model which would be computationally efficient in terms of memory requirements, even in the case of large trigram models. Two specialized versions of the algorithm for transducer composition were implemented. The first is basically a composition algorithm that uses the precomputed set of the output labels that can be reached from a particular epsilon edge of the lexicon; the second includes an "on-the-fly" implementation of the pushing of weights and output labels. Very significant memory savings were obtained with the proposed algorithms compared with the general determinization algorithm for weighted transducers.
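The memory savings come from never materializing the full composed transducer: composed states are pairs (lexicon state, LM state) whose arcs are generated only when first expanded. A minimal sketch of that lazy expansion, with made-up toy arcs and "-" standing in for an epsilon output label (the data structures are illustrative assumptions, not the paper's decoder):

```python
# Hedged sketch of on-the-fly composition of a lexicon with an LM.
# Arcs: state -> list of (input_label, output_label, next_state).
lexicon = {0: [("k", "-", 1)], 1: [("a", "cat_word", 2)], 2: []}   # toy
lm = {0: [("cat_word", "cat_word", 1)], 1: []}

# "-" marks an epsilon output on the lexicon side: the LM does not move.
cache = {}

def expand(state):
    """Lazily compute (and memoize) arcs leaving a composed state."""
    if state in cache:
        return cache[state]
    s1, s2 = state
    arcs = []
    for ilab, olab, n1 in lexicon[s1]:
        if olab == "-":                      # epsilon: LM state stays put
            arcs.append((ilab, "-", (n1, s2)))
        else:                                # match against LM input labels
            for lm_in, lm_out, n2 in lm[s2]:
                if lm_in == olab:
                    arcs.append((ilab, lm_out, (n1, n2)))
    cache[state] = arcs
    return arcs

# Walk the composed machine from its start state without ever building it.
state = (0, 0)
path = []
while expand(state):
    ilab, olab, state = expand(state)[0]
    path.append((ilab, olab))
print(path)  # [('k', '-'), ('a', 'cat_word')]
```

Only the states a decoder actually visits ever exist, which is the property that lets large trigram models fit in memory; the paper's refinements (precomputed reachable output-label sets, on-the-fly weight and label pushing) prune and reorder these lazily generated arcs further.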

Interspeech 2021
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER) [1], and latency [2], measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus, E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model [3], we explore using a Hybrid Autoregressive Transducer (HAT) [4] factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to improve decoder latency, we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.

Improving Entity Recall in Automatic Speech Recognition with Neural Embeddings
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Automatic speech recognition (ASR) systems often have difficulty recognizing long-tail entities such as contact names and local restaurant names, which usually do not occur, or occur infrequently, in the system’s training data. In this work, we present a method which uses learned text embeddings and nearest neighbor retrieval within a large database of entity embeddings to correct misrecognitions. Our text embeddings are produced by a neural network trained so that the embeddings of acoustically confusable phrases have low cosine distances. Given the embedding of the text of a potential entity misrecognition and a precomputed database containing entities and their corresponding embeddings, we use fast, scalable nearest neighbor retrieval algorithms to find candidate corrections within the database. The inserted candidates are then scored using a function of the original text’s cost in the lattice and the distance between the embedding of the original text and the embedding of the candidate correction. Using this lattice augmentation technique, we demonstrate a 46% reduction in word error rate (WER) and 46% reduction in oracle word error rate (OWER) on an evaluation set with popular film queries.
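The retrieval-and-scoring step described above can be sketched in a few lines. Everything here is a stand-in: the entity names, the 3-dimensional embeddings, and the additive scoring weight are made-up illustrations (the paper uses learned embeddings and scalable approximate nearest-neighbor search rather than this brute-force loop):

```python
# Hedged sketch of correcting a misrecognition via embedding retrieval:
# score each database entity by (original lattice cost) +
# alpha * (cosine distance between embeddings), keep the best k.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

# Hypothetical precomputed entity database: name -> embedding.
entity_db = {
    "La La Land": [0.9, 0.1, 0.0],
    "Moana": [0.0, 0.8, 0.6],
    "Dunkirk": [0.1, 0.0, 0.95],
}

def correct(hyp_embedding, hyp_lattice_cost, k=2, alpha=1.0):
    """Return the k best candidate corrections as (score, name) pairs;
    brute force stands in for scalable nearest-neighbor retrieval."""
    scored = [
        (hyp_lattice_cost + alpha * cosine_distance(hyp_embedding, emb), name)
        for name, emb in entity_db.items()
    ]
    return sorted(scored)[:k]

# Embedding of the misrecognized text "lala land" (made-up values that
# sit close to the correct entity's embedding):
candidates = correct([0.85, 0.15, 0.05], hyp_lattice_cost=2.0)
print(candidates[0][1])  # 'La La Land'
```

The key property the training objective buys is that acoustically confusable phrases land near each other, so a misrecognition's embedding is close to the intended entity's and the retrieval step surfaces it as a cheap candidate.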

Interspeech 2017
Maximum Entropy (MaxEnt) language models are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework with a convex loss. MaxEnt models also have the advantage of scaling to large model and training data sizes. We present the following two contributions to MaxEnt training: (1) by leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of Automatic Speech Recognition (ASR); (2) a novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We evaluate the impact of these approaches on Google's state-of-the-art ASR for the task of voice-search transcription and dictation. Training 10B-parameter models on a corpus of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Also, human evaluations show significant improvements on a wide range of domains from using non-linguistic features. For example, adapting to geographical domains (e.g., US states and cities) affects about 4% of test utterances, with a 2:1 win-to-loss ratio.
This document contains detailed guidelines for performing word-alignment annotations. These guidelines were proposed in [4]; they are based on the guidelines described in [8] for Spanish/English, with some changes and refinements that are described in section 4.3. Other sources of information were the alignment guidelines defined in the Blinker project [9] for English/French and the guidelines defined in [7] for Czech/English, although the latter two differ in some general principles.