Papers by Sarangarajan Parthasarathy
Language models for ASR are traditionally trained on a sentence-level corpus. In this internship, we explore the potential of taking advantage of context beyond the current sentence for next-word prediction. We show that adding an attention mechanism to LSTM allows modeling of long contexts.
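A minimal sketch of the idea in PyTorch (dimensions and names are illustrative assumptions, not the internship's actual architecture): an LSTM language model that attends over hidden states cached from earlier sentences, so each prediction can draw on context beyond the current sentence.

```python
import torch
import torch.nn as nn

class AttentiveLSTMLM(nn.Module):
    """LSTM LM with dot-product attention over cached cross-sentence states."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, tokens, memory):
        # tokens: (B, T) ids of the current sentence
        # memory: (B, M, H) LSTM states cached from previous sentences
        h, _ = self.lstm(self.embed(tokens))                    # (B, T, H)
        att = torch.softmax(
            torch.einsum("bth,bmh->btm", h, memory), dim=-1)    # attention
        ctx = att @ memory                                      # (B, T, H)
        logits = self.out(torch.cat([h, ctx], dim=-1))          # (B, T, V)
        return logits, h.detach()  # detached states extend the memory
```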
Automatic Speech Recognition for Wireless Mobile Devices

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In recent years, end-to-end (E2E) based automatic speech recognition (ASR) systems have achieved great success due to their simplicity and promising performance. Neural Transducer based models are increasingly popular in streaming E2E based ASR systems and have been reported to outperform the traditional hybrid system in some scenarios. However, the joint optimization of the acoustic model, lexicon and language model (LM) in the neural Transducer also brings challenges when adapting the ASR system using only adaptation text. This drawback might prevent their application in practice. To address this issue, we propose a novel model, the factorized neural Transducer, which factorizes the blank and vocabulary prediction and adopts a standalone language model for the vocabulary prediction. This factorization is expected to transfer improvements in the standalone language model to the Transducer for speech recognition, allowing various language model adaptation techniques to be applied. We demonstrate that the proposed factorized neural Transducer yields 15.4% to 19.4% WER improvements when out-of-domain text data is used for language model adaptation, at the cost of a minor WER degradation on a general test set.
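A conceptual sketch of the factorized output, under assumed names and shapes (not the paper's exact equations): the blank logit comes from the usual joint network, while vocabulary scores combine an acoustic term with log-probabilities from a standalone LM that can be adapted or swapped independently.

```python
import torch

def factorized_joint(blank_logit, acoustic_vocab_logits, lm_log_probs):
    # blank_logit: (B, 1)            from the conventional joint network
    # acoustic_vocab_logits: (B, V)  acoustic contribution per vocab token
    # lm_log_probs: (B, V)           from the standalone, adaptable LM
    vocab_scores = acoustic_vocab_logits + lm_log_probs
    return torch.cat([blank_logit, vocab_scores], dim=-1)  # (B, 1 + V)
```

At adaptation time, only the standalone LM needs fine-tuning on in-domain text; the rest of the Transducer is untouched.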
Hybrid segmental-LVQ/HMM for large vocabulary speech recognition
[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992
The authors have assessed the possibility of modeling phone trajectories to accomplish speech recognition. This approach has been considered as one way to model context dependency in speech recognition, based on the acoustic variability of phones in the current database. A hybrid segmental learning vector quantization/hidden Markov model (SLVQ/HMM) system has been developed and evaluated on a telephone speech database. The authors obtained 85.27% correct phrase recognition with SLVQ alone. By combining the likelihoods issued by SLVQ and by HMM, they obtained 94.5% correct phrase recognition, a small improvement over that obtained with HMM alone.
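A toy illustration of fusing two recognizers' scores (the weights and the exact combination scheme here are assumptions; the paper combines SLVQ and HMM likelihoods but does not reduce to this snippet): pick the phrase hypothesis that maximizes a weighted sum of per-system log-likelihoods.

```python
def combine(slvq_loglik, hmm_loglik, w=0.5):
    """Log-linear interpolation of per-hypothesis log-likelihoods."""
    return {hyp: w * slvq_loglik[hyp] + (1 - w) * hmm_loglik[hyp]
            for hyp in slvq_loglik}

scores = combine({"call home": -12.3, "call Rome": -13.1},   # SLVQ
                 {"call home": -10.8, "call Rome": -11.9})   # HMM
best = max(scores, key=scores.get)  # -> "call home"
```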
The Journal of the Acoustical Society of America, 1988

ArXiv, 2020
LSTM language models (LSTM-LMs) have proven to be powerful and have yielded significant performance improvements over count-based n-gram LMs in modern speech recognition systems. Due to their infinite history states and computational load, most previous studies focus on applying LSTM-LMs in the second pass for rescoring purposes. Recent work shows that it is feasible and computationally affordable to adopt LSTM-LMs in first-pass decoding within a dynamic (or tree-based) decoder framework. In this work, the LSTM-LM is composed with a WFST decoder on the fly for first-pass decoding. Furthermore, motivated by the long-term history nature of LSTM-LMs, the use of context beyond the current utterance is explored for first-pass decoding in conversational speech recognition. The context information is captured by the hidden states of LSTM-LMs across utterances and can be used to guide the first-pass search effectively. The experimental results on our internal meeting transcription…
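A hedged sketch of the cross-utterance mechanism (hypothetical interface; the actual system composes the LM with a WFST decoder on the fly, which is not shown here): the LSTM's hidden state simply survives utterance boundaries, so the second utterance is scored conditioned on the first.

```python
import torch
import torch.nn as nn

class CrossUtteranceLM(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, vocab_size)

    def score(self, tokens, state=None):
        # state: (h, c) carried over from the previous utterance,
        # or None at the start of a conversation.
        out, state = self.lstm(self.embed(tokens), state)
        return torch.log_softmax(self.proj(out), dim=-1), state

# Usage: keep `state` alive between utterances in a session.
# logp1, state = lm.score(utt1_ids)         # first utterance
# logp2, state = lm.score(utt2_ids, state)  # conditioned on utt1
```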
This is a companion document to our Interspeech 2018 paper "What to Expect from Expected Kneser-Ney Smoothing". It provides additional details and considerations regarding the derivation, optimizations and extensions of n-gram smoothing on expected fractional counts, a technique originally introduced in Zhang et al., 2014.

arXiv: Computation and Language, Nov 11, 2019
We explore neural language modeling for speech recognition where the context spans multiple sentences. Rather than encode history beyond the current sentence using a cache of words or document-level features, we focus our study on the ability of LSTM and Transformer language models to implicitly learn to carry context across sentence boundaries. We introduce a new architecture that incorporates an attention mechanism into the LSTM to combine the benefits of recurrent and attention architectures. We conduct language modeling and speech recognition experiments on the publicly available LibriSpeech corpus. We show that conventional training on a paragraph-level corpus results in significant reductions in perplexity compared to training on a sentence-level corpus. We also describe speech recognition experiments using long-span language models in second-pass re-ranking, and provide insights into the ability of such models to take advantage of context beyond the current sentence.
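A minimal sketch of the second-pass re-ranking step (the interface and weights are assumptions, and `lm_score` stands in for a long-span LM scorer the snippet does not implement): each n-best hypothesis is rescored by the LM conditioned on the running conversation history, then interpolated with its first-pass score.

```python
def rerank(nbest, history_ids, lm_score, lam=0.5):
    # nbest: list of (first_pass_score, token_ids)
    # lm_score(context_ids, ids): long-span LM log-prob (assumed helper)
    scored = [(fp + lam * lm_score(history_ids, ids), ids)
              for fp, ids in nbest]
    return max(scored, key=lambda s: s[0])[1]  # best hypothesis ids
```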
Articulatory analysis and synthesis of speech
Fourth IEEE Region 10 International Conference TENCON
Recent progress in developing automatic articulatory analysis-synthesis procedures is described. The goal of the research is to find ways to fully exploit the advantages of articulatory modeling in producing natural-sounding speech from text and in low-bit-rate coding. Estimation of articulatory parameters by analysis-synthesis appears to be the most effective way of obtaining large amounts of data on human articulation. …
Directory Retrieval using Voice Form-Filling
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007
Accurate retrieval of entries from large directories is a difficult task. Practical systems attempt to achieve acceptable performance by using dialog to restrict the size of the directory. For instance, knowledge of city and state can be used to restrict the entries in a …
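A toy illustration of the dialog strategy described above (the directory entries are fabricated): slots gathered by voice form-filling, such as city and state, shrink the candidate set before name recognition is attempted.

```python
directory = [
    {"name": "A. Smith", "city": "Austin", "state": "TX"},
    {"name": "B. Jones", "city": "Boston", "state": "MA"},
]

def restrict(entries, city, state):
    """Keep only entries matching the city/state slots from the dialog."""
    return [e for e in entries
            if e["city"] == city and e["state"] == state]

candidates = restrict(directory, "Austin", "TX")  # 1 entry instead of 2
```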
Interspeech 2017, 2017
We explore character-level neural network models for inferring punctuation from text-only input. Punctuation inference is treated as a sequence tagging problem where the input is a sequence of un-punctuated characters, and the output is a corresponding sequence of punctuation tags. We experiment with six architectures, all of which use a long short-term memory (LSTM) network for sequence modeling. They differ in the way the context and lookahead for a given character are derived: from simple character embedding and delayed output to enable lookahead, to complex convolutional neural networks (CNNs) to capture context. We demonstrate that the accuracy of the proposed character-level models is competitive with that of a state-of-the-art word-level Conditional Random Field (CRF) baseline with carefully crafted features.
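A minimal sketch of the simplest variant mentioned above, character embedding with delayed output (tag set, delay, and dimensions are illustrative; the paper compares six architectures, not reproduced here):

```python
import torch
import torch.nn as nn

TAGS = ["<none>", ",", ".", "?"]   # tag emitted after each character
DELAY = 5                          # characters of lookahead

class CharPunctuator(nn.Module):
    def __init__(self, n_chars, dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.tags = nn.Linear(dim, len(TAGS))

    def forward(self, chars):
        # Delaying the targets by DELAY steps lets the LSTM read DELAY
        # future characters before committing to a tag for position t:
        # logits[t] is trained against the tag for chars[t - DELAY].
        out, _ = self.lstm(self.embed(chars))
        return self.tags(out)
```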
Interspeech 2018, Sep 2, 2018
In language modeling, it is difficult to incorporate entity relationships from a knowledge base. One solution is to use a reranker trained with global features derived from n-best lists. However, training such a reranker requires manually annotated n-best lists, which are expensive to obtain. We propose a method based on contrastive estimation that alleviates the need for such data. Experiments in the music domain demonstrate that global features, as well as features extracted from an external knowledge base, can be incorporated into our reranker. Our final model, a simple ensemble of a language model and the reranker, achieves a 0.44% absolute word error rate improvement over an LSTM language model on the blind test data.
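A hedged sketch of the final ensemble (the interpolation weight and feature function are assumptions, not the paper's tuned values): each hypothesis gets the LM score interpolated with the reranker's weighted sum of global and knowledge-base features.

```python
def ensemble_score(hyp, lm_logprob, reranker_weights, features, alpha=0.7):
    # features(hyp) -> dict of global / knowledge-base features (assumed)
    global_score = sum(reranker_weights.get(name, 0.0) * value
                       for name, value in features(hyp).items())
    return alpha * lm_logprob + (1 - alpha) * global_score
```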

Interspeech 2020, 2020
Because of its streaming nature, the recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition. In this paper, we describe our recent development of RNN-T models with reduced GPU memory consumption during training, a better initialization strategy, and advanced encoder modeling with future lookahead. When trained on Microsoft's 65 thousand hours of anonymized training data, the developed RNN-T model surpasses a very well trained hybrid model with both better recognition accuracy and lower latency. We further study how to customize RNN-T models to a new domain, which is important for deploying E2E models in practical scenarios. Comparing several methods that leverage text-only data in the new domain, we found that updating RNN-T's prediction and joint networks using text-to-speech audio generated from domain-specific text is the most effective.
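A sketch of that customization recipe, assuming a generic RNN-T with `.encoder`, `.prediction`, and `.joint` submodules (the names and training-loop details are illustrative, not the paper's implementation): freeze the acoustic encoder and fine-tune only the prediction and joint networks on TTS audio synthesized from in-domain text.

```python
def customize(model, tts_batches, optimizer, loss_fn):
    # Keep the acoustic encoder fixed; adapt only label-side components.
    for p in model.encoder.parameters():
        p.requires_grad = False
    for audio, text in tts_batches:   # audio = TTS of domain-specific text
        loss = loss_fn(model(audio, text), text)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```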

2021 IEEE Spoken Language Technology Workshop (SLT), 2021
External language model (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR), which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models, with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain. With ILME, the internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM. The internal LM scores are approximated as the output of an E2E model when eliminating its acoustic components. ILME can alleviate the domain mismatch between training and testing…
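The ILME-based inference score in sketch form, following the description above (the interpolation weights are hypothetical): interpolate the E2E and external LM scores, then subtract the estimated internal LM score so the source-domain LM is not counted twice.

```python
def ilme_score(log_p_e2e, log_p_ext_lm, log_p_internal_lm,
               lam_ext=0.6, lam_ilm=0.3):
    # Internal LM score is approximated by running the E2E model with its
    # acoustic components eliminated, then subtracted during inference.
    return log_p_e2e + lam_ext * log_p_ext_lm - lam_ilm * log_p_internal_lm
```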

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method [1]. In this method, the internal LM score is subtracted, during inference, from the score obtained by interpolating the E2E score with the external LM score. To improve ILME-based inference, we propose an internal LM training (ILMT) method that minimizes an additional internal LM loss by updating only the E2E model components that affect the internal LM estimation. ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy. After ILMT, the more modular E2E model with matched training and inference criteria enables a more thorough elimination of the source-domain internal LM, and therefore leads to a more effective integration of the target-domain external LM. Experimented with 30K-hour trained recurrent neural network transducer…
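The ILMT objective in sketch form (the weight is hypothetical): the usual E2E loss plus an internal LM loss computed from the label-side components alone, so a standalone LM forms inside the E2E model.

```python
def ilmt_loss(e2e_loss, internal_lm_loss, alpha=0.5):
    # internal_lm_loss: cross-entropy of the internal LM's next-token
    # prediction, evaluated with acoustic components eliminated; only the
    # components that affect the internal LM receive its gradients.
    return e2e_loss + alpha * internal_lm_loss
```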
Word-phrase-entity language models: getting more mileage out of n-grams
Interspeech 2014
Interspeech 2018
Kneser-Ney smoothing on expected counts was proposed recently in [1]. In this paper we revisit this technique and suggest a number of optimizations and extensions. We then analyze its performance in several practical speech recognition scenarios that depend on fractional sample counts, such as training on uncertain data, language model adaptation and Word-Phrase-Entity models. We show that the proposed approach to smoothing outperforms known alternatives by a significant margin.
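An illustration of the expected counts the smoothing operates on (toy numbers): with uncertain training data such as ASR n-best lists, an n-gram's count is the posterior-weighted sum of its counts across hypotheses, and is therefore fractional rather than integral.

```python
from collections import Counter, defaultdict

def expected_counts(nbest):
    # nbest: list of (posterior, token_list) pairs for one utterance
    exp = defaultdict(float)
    for post, tokens in nbest:
        for bigram, c in Counter(zip(tokens, tokens[1:])).items():
            exp[bigram] += post * c
    return exp

counts = expected_counts([(0.7, ["play", "some", "jazz"]),
                          (0.3, ["play", "sam", "jazz"])])
# counts[("play", "some")] == 0.7, counts[("play", "sam")] == 0.3
```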
Interspeech 2016, 2016
The recently introduced framework of Word-Phrase-Entity (WPE) language modeling is applied to recurrent neural networks and leads to improvements similar to those reported for n-gram language models. In the proposed architecture, RNN LMs do not operate in terms of lexical items (words), but consume sequences of tokens that can be words, word phrases or classes such as named entities, with the optimal representation for a particular input sentence determined in an iterative manner. We show how auxiliary techniques previously described for n-gram WPE language models, such as token-level interpolation and personalization, can also be realized with recurrent networks and lead to similar perplexity improvements.
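A toy sketch of the WPE token representation (the segmentation below is hand-picked for illustration, not the iterative optimal-representation search the paper uses): the LM consumes tokens that may be single words, multi-word phrases, or entity classes.

```python
PHRASES = {("new", "york"): "new_york"}
ENTITIES = {"john": "<PERSON>", "seattle": "<CITY>"}

def wpe_tokenize(words):
    """Greedily map a word sequence to WPE tokens (toy example)."""
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            tokens.append(PHRASES[pair])
            i += 2
        else:
            tokens.append(ENTITIES.get(words[i], words[i]))
            i += 1
    return tokens

print(wpe_tokenize("call john in new york".split()))
# ['call', '<PERSON>', 'in', 'new_york']
```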
System and method for spelled text input recognition using speech and non-speech input