Papers by Carol Espy-Wilson

arXiv (Cornell University), May 25, 2022
Data augmentation has proven to be a promising approach for improving the performance of deep learning models by adding variability to the training data. In previous work developing a noise-robust acoustic-to-articulatory speech inversion (SI) system, we showed the importance of noise augmentation for improving the performance of speech inversion in noisy speech conditions. In this work, we extend this idea of data augmentation to improve SI systems on both clean and noisy speech data by experimenting with three data augmentation methods. We also propose a bidirectional gated recurrent neural network as the speech inversion system in place of the previously used feed-forward neural network. The inversion system uses mel-frequency cepstral coefficients (MFCCs) as the input acoustic features and six vocal tract variables (TVs) as the output articulatory targets. The performance of the system was measured by computing the correlation between estimated and actual TVs on the Wisconsin X-ray Microbeam database. The proposed speech inversion system shows a 5% relative improvement in correlation over the baseline noise-robust system on clean speech data. The pre-trained model, when adapted to each unseen speaker in the test set, improves the average correlation by another 6%.
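A minimal PyTorch sketch (not the authors' code) of the proposed model: MFCC frames in, six tract variables out, through a stacked bidirectional GRU, plus the per-TV correlation metric mentioned above. The MFCC dimension, hidden size, and layer count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiGRUInversion(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_tvs=6):
        super().__init__()
        self.gru = nn.GRU(n_mfcc, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tvs)     # 2x: both directions

    def forward(self, mfcc):                        # (batch, frames, n_mfcc)
        h, _ = self.gru(mfcc)                       # (batch, frames, 2*hidden)
        return self.out(h)                          # (batch, frames, n_tvs)

def tv_correlation(est, ref):
    """Per-TV Pearson correlation, the evaluation metric used above.
    est, ref: (frames, n_tvs) tensors for one utterance."""
    est = est - est.mean(0)
    ref = ref - ref.mean(0)
    return (est * ref).sum(0) / (est.norm(dim=0) * ref.norm(dim=0) + 1e-8)

tvs = BiGRUInversion()(torch.randn(4, 200, 13))     # 4 utterances, 200 frames
```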

arXiv (Cornell University), Mar 11, 2022
Multi-resolution spectro-temporal features of a speech signal represent how the brain perceives sounds by tuning cortical cells to different spectral and temporal modulations, producing a higher-dimensional representation of the speech signal. The purpose of this paper is to evaluate how well this auditory cortex representation of speech signals contributes to estimating the articulatory features of the corresponding signals. Since obtaining articulatory features from the acoustic features of speech signals has been a challenging topic of interest across speech communities, we investigate the possibility of using this multi-resolution representation of speech signals as the acoustic features. We used the University of Wisconsin X-ray Microbeam (XRMB) database of clean speech signals to train a feed-forward deep neural network (DNN) to estimate articulatory trajectories of six tract variables. The optimal set of multi-resolution spectro-temporal features was chosen by tuning the scale and rate vector parameters for the best-performing model. Experiments achieved a correlation of 0.675 with ground-truth tract variables. We compared the performance of this speech inversion system with prior experiments conducted using mel-frequency cepstral coefficients (MFCCs).
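As a rough illustration of this front end, the sketch below filters a log-mel spectrogram with separable 2-D modulation kernels at a few spectral "scales" and temporal "rates". This is a simplified stand-in for the cortical scale-rate model in the paper; the kernel design, window sizes, and parameter values are all assumptions.

```python
import numpy as np
import librosa
from scipy.signal import convolve2d

def modulation_features(wav, sr, scales=(0.05, 0.1, 0.25), rates=(2, 4, 8)):
    S = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=64)
    logS = np.log(S + 1e-8)                        # (mel channels, frames)
    frame_rate = sr / 512                          # librosa default hop = 512
    f = np.arange(-8, 9)                           # spectral support (channels)
    t = np.arange(-8, 9) / frame_rate              # temporal support (seconds)
    feats = []
    for sc in scales:                              # cycles per mel channel
        for rt in rates:                           # cycles per second (Hz)
            kern = np.outer(np.cos(2 * np.pi * sc * f) * np.hanning(17),
                            np.cos(2 * np.pi * rt * t) * np.hanning(17))
            feats.append(convolve2d(logS, kern, mode="same"))
    return np.stack(feats)                         # (scales*rates, mels, frames)

feats = modulation_features(librosa.tone(220, sr=16000, duration=1.0), 16000)
```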

The problem addressed in this work is that of enhancing speech signals corrupted by additive noise and improving the performance of automatic speech recognizers in noisy conditions. The enhanced speech signals can also improve the intelligibility of speech in noisy conditions for human listeners with hearing impairment, as well as for normal-hearing listeners. The original Phase Opponency (PO) model, proposed to detect tones in noise, simulates the processing of information in neural discharge times and exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery, along with cross-auditory-nerve-fiber coincidence detection, to extract temporal cues. The Modified Phase Opponency (MPO) model proposed here alters the components of the PO model in such a way that its basic functionality is maintained but the various properties of the model can be analyzed and modified independently of each other. This work presents a detailed mathematical formulation of the MPO model and the relation between the properties of the narrowband signal to be detected and the properties of the MPO model.
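The detection statistic at the heart of the PO/MPO models can be written schematically (the notation here is assumed, not copied from the thesis): two auditory filters whose phase responses differ by pi at the target frequency feed a coincidence detector, so a tone at that frequency drives the filter outputs into antiphase and lowers the coincidence count relative to noise alone.

```latex
% Schematic PO detection statistic; notation assumed, not the thesis's.
\[
  y_i(t) = (h_i * x)(t), \quad i = 1, 2,
  \qquad \angle H_1(f_0) - \angle H_2(f_0) = \pi,
\]
\[
  C = \int_0^T g\big(y_1(t)\big)\, g\big(y_2(t)\big)\, dt,
\]
% where g(.) models half-wave rectification at the auditory nerve and a
% tone at f_0 is signalled by C falling below the noise-alone baseline.
```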

Conference of the International Speech Communication Association, Aug 27, 2007
In this paper, we compare a probabilistic landmark-based speech recognition system (LBS), which uses knowledge-based acoustic parameters (APs) as the front end, with an HMM-based recognition system that uses mel-frequency cepstral coefficients as its front end. The advantages of LBS based on APs are that (1) the APs are normalized for extra-linguistic information; (2) acoustic analysis at different landmarks may be performed with different resolutions and with different APs; (3) LBS outputs multiple acoustic landmark sequences that signal perceptually significant regions in the speech signal; (4) it may be easier to port this system to another language, since the phonetic features captured by the APs are universal; and (5) LBS can be used as a tool for uncovering and subsequently understanding variability. LBS also has a probabilistic framework that can be combined with pronunciation and language models in order to make it more scalable to large-vocabulary recognition tasks.

Interspeech 2007, 2007
The North American rhotic liquid has two maximally distinct articulatory variants, the classic "retroflex" and the classic "bunched" tongue postures. In this study, the evidence for acoustic differences between these two variants is reexamined using magnetic resonance images of the vocal tract. Two subjects with similar vocal tract dimensions but different tongue postures for sustained /r/ are used. It is shown that the two variants have similar patterns of F1-F3 and zero frequencies. However, the "retroflex" variant has a larger difference between F4 and F5 than the "bunched" one (around 1400 Hz vs. around 700 Hz). This difference can be explained by the geometric differences between the two variants, in particular the shorter and more forward palatal constriction of the "retroflex" /r/ and the sharper transitions between the palatal constriction and its anterior and posterior cavities. This formant pattern difference is confirmed by measurements from the acoustic data of several additional subjects.
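The F4/F5 spacing reported above can be measured from audio with standard LPC root-finding; the sketch below is a generic method, not the authors' measurement procedure. The file name, LPC order, and formant-picking heuristic are assumptions.

```python
import numpy as np
import librosa

def formants(wav, sr, order=14):
    a = librosa.lpc(wav, order=order)              # LPC polynomial coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]              # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)     # rad/sample -> Hz
    bws = -np.log(np.abs(roots)) * sr / np.pi      # pole radius -> bandwidth
    keep = (freqs > 90) & (bws < 400)              # crude formant heuristic
    return np.sort(freqs[keep])

y, sr = librosa.load("sustained_r.wav", sr=10000)  # hypothetical recording
F = formants(y, sr)                                # assumes F1..F5 all found
print("F5 - F4 (Hz):", F[4] - F[3])                # retroflex ~1400, bunched ~700
```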

Interspeech 2015, 2015
Speech acoustic patterns vary significantly as a result of coarticulation and lenition processes that are shaped by segmental context or by performance factors such as production rate and degree of casualness. The resulting acoustic variability continues to pose serious challenges for the development of automatic speech recognition (ASR) systems. Articulatory phonology provides a formalism for understanding coarticulation through spatiotemporal changes in the patterns of underlying gestures. This paper studies the coarticulation occurring in certain fast spoken utterances using articulatory constriction tract variables (TVs) estimated from acoustic features. The TV estimators are trained on the University of Wisconsin X-ray Microbeam (XRMB) database. The utterances analyzed are from a different corpus containing simultaneous acoustic and electromagnetic articulograph (EMA) data. Plots of the estimated TVs show that the estimation procedure successfully detected the articulatory constrictions even in highly coarticulated utterances, including constrictions that a state-of-the-art phone recognition system failed to detect. These results highlight the potential of TV trajectory estimation methods for improving the performance of phone recognition systems, particularly when sounds are reduced or deleted.

Interspeech 2018, 2018
In previous work, we showed that using articulatory features derived from a speech inversion system trained on synthetic data can significantly improve the robustness of an automatic speech recognition (ASR) system. This paper presents results from the first of two steps needed to explore whether the same holds true for a speech inversion system trained with natural speech. Specifically, we developed a noise-robust, multi-speaker acoustic-to-articulatory speech inversion system. A feed-forward neural network was trained using contextualized mel-frequency cepstral coefficients (MFCCs) as the input acoustic features and six tract-variable (TV) trajectories as the output articulatory features. Experiments were performed on the University of Wisconsin X-ray Microbeam (XRMB) database with 8 noise types artificially added at 5 different SNRs. Performance of the system was measured by computing the correlation between estimated and actual TVs. The performance of the multi-condition trained system was compared to that of the clean-speech trained system, and the effect of speech enhancement on TV estimation was also evaluated. Experiments showed a 10% relative improvement in correlation over the baseline clean-speech trained system.
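The multi-condition training data described above requires mixing noise into clean speech at controlled SNRs. A minimal sketch of that step follows; the SNR grid and the signal arrays are illustrative stand-ins, since the abstract only states that 8 noise types were used at 5 SNRs.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    if len(noise) < len(speech):                   # loop noise if too short
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)                     # stand-ins for real audio
clean_utt = rng.standard_normal(16000)
babble = rng.standard_normal(48000)
augmented = [mix_at_snr(clean_utt, babble, snr) for snr in (0, 5, 10, 15, 20)]
```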

The Journal of the Acoustical Society of America, 2019
Speech inversion is a well-known ill-posed problem, and the addition of speaker differences typically makes it even harder. Normalizing speaker differences is essential for effectively using multi-speaker articulatory data to train a speaker-independent speech inversion system. This paper explores a vocal tract length normalization (VTLN) technique to transform the acoustic features of different speakers to a target speaker's acoustic space such that speaker-specific details are minimized. The speaker-normalized features are then used to train a deep feed-forward neural network based speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients. The articulatory features are represented by six tract-variable (TV) trajectories, which are relatively speaker invariant compared to flesh-point data. Experiments are performed with ten speakers from the University of Wisconsin X-ray microbeam database. Results show that the prop...
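The VTLN idea above amounts to warping each speaker's frequency axis by a factor alpha toward a target speaker before computing cepstra. Below is a sketch using a common piecewise-linear warp; the knee position and the alpha value are conventional choices, not necessarily the paper's exact transform.

```python
import numpy as np

def vtln_warp(freqs, alpha, f_nyq):
    """Piecewise-linear warp: scale by alpha up to a knee frequency,
    then continue linearly so the Nyquist endpoint maps to itself."""
    knee = 0.8 * f_nyq / max(alpha, 1.0)           # keeps warp inside [0, f_nyq]
    slope = (f_nyq - alpha * knee) / (f_nyq - knee)
    return np.where(freqs <= knee, alpha * freqs,
                    alpha * knee + slope * (freqs - knee))

# warp hypothetical triangular-filter edge frequencies before computing
# cepstra; alpha != 1 stretches or compresses the spectrum
edges = np.linspace(0.0, 8000.0, 26)
warped_edges = vtln_warp(edges, alpha=0.92, f_nyq=8000.0)
```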

Speech Communication, 2017
Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. But learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in both the articulatory and acoustic spaces increase the complexity of the speech-to-articulatory mapping, which is already an ill-posed problem due to its inherent nonlinearity and non-unique nature. This work explores using deep neural networks (DNNs) and convolutional neural networks (CNNs) for mapping speech data into its corresponding articulatory space. Our speech inversion results indicate that the CNN models perform better than their DNN counterparts. In addition, we use these inverse models to generate articulatory information from speech for two separate speech recognition tasks: the WSJ1 and Aurora-4 continuous speech recognition tasks. This work proposes a hybrid convolutional neural network (HCNN), in which two parallel layers are used to jointly model the acoustic and articulatory spaces, and the decisions from the parallel layers are fused at the output context-dependent (CD) state level. The acoustic model performs time-frequency convolution on filterbank-energy-level features, whereas the articulatory model performs time convolution on the articulatory features. The performance of the proposed architecture is compared to that of CNN- and DNN-based systems using gammatone filterbank energies as acoustic features, and the results indicate that the HCNN-based model yields lower word error rates than the CNN/DNN baseline systems.
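A condensed PyTorch sketch of the hybrid architecture described above: a 2-D convolutional stream over filterbank features, a 1-D (time-only) convolutional stream over the six TVs, and fusion of the two streams' CD-state scores at the output. Channel counts, kernel sizes, the number of states, and the equal-weight fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HCNN(nn.Module):
    def __init__(self, n_states=2000):
        super().__init__()
        self.acoustic = nn.Sequential(             # input: (B, 1, freq, time)
            nn.Conv2d(1, 32, kernel_size=(8, 5), padding=(4, 2)),
            nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_states))
        self.artic = nn.Sequential(                # input: (B, 6 TVs, time)
            nn.Conv1d(6, 32, kernel_size=5, padding=2),
            nn.ReLU(), nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_states))

    def forward(self, fbank, tvs):
        # fuse the two streams' decisions at the CD-state output level
        return 0.5 * (self.acoustic(fbank) + self.artic(tvs))

logits = HCNN()(torch.randn(8, 1, 40, 11), torch.randn(8, 6, 11))
```
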
IEEE International Conference on Acoustics Speech and Signal Processing, 2002
Coping with inter-speaker variability (i.e., differences in the vocal tract characteristics of speakers) is still a major challenge for automatic speech recognizers. In this paper, we discuss a method that compensates for differences in speaker characteristics. In particular, we demonstrate that when a continuous-density hidden Markov model based system is used as the back-end, a knowledge-based front end (KBFE) can outperform traditional mel-frequency cepstral coefficients (MFCCs), particularly when there is a mismatch in the gender and ages of the subjects used to train and test the recognizer.

2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010
The production of lateral sounds generally involves a linguo-alveolar contact and one or two lateral channels along the parasagittal sides of the tongue. The acoustic effect of these articulatory features is not clearly understood. In this study, we compare two productions of /l/ in American English by one subject, one a dark /l/ and the other a light /l/. Three-dimensional vocal tract models derived from magnetic resonance images were analyzed. It was shown that zeros in the vocal tract acoustic response are produced in the F3-F5 region in both /l/ productions, but the number of zeros and their frequencies are affected by the length of the linguo-alveolar contact and by the presence or absence of lateral linguopalatal contacts. The dark /l/ has one zero below 5 kHz, produced by the cross mode posterior to the linguo-alveolar contact, while the light /l/ has three zeros below 5 kHz, produced by the asymmetrical lateral channels, the supralingual cavity, and the cross mode posterior to the linguo-alveolar contact.

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
Glossectomy changes the properties of the tongue and negatively affects patients' speech production. Among the most difficult consonants for post-glossectomy speakers to produce, the sibilant fricatives /s/ and /sh/ are often problematic. To better understand these production problems, this study analyzed acoustic and articulatory data for /s/ and /sh/ from three subjects: one normal speaker and two post-glossectomy speakers with abnormal /s/ or /sh/. Based on cine magnetic resonance images, three-dimensional vocal tract reconstructions, tongue surface shapes behind the constrictions, and area functions were analyzed. Our results show that in each patient, unlike in the normal speaker, /s/ and /sh/ were quite similar in acoustic spectra, tongue surface shapes, and constriction locations. In the abnormal /s/, the missing unilateral tongue tissue created an airflow bypass, which moved the constriction further back. The abnormal /sh/ may be explained by the lack of precise tongue control after surgery. In addition, the tongue surfaces in the patients were more asymmetric in the back and were not grooved for /s/ anterior to the constriction.

2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011
Mel-frequency cepstral coefficients (MFCC) have been the dominant features in speaker recognition as well as in speech recognition. However, based on theories of speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high-frequency range of speech. This insight suggests that a linear frequency scale may offer advantages in speaker recognition over the mel scale. Based on two state-of-the-art speaker recognition back-end systems (a Joint Factor Analysis system and a Probabilistic Linear Discriminant Analysis system), this study compares the performance of MFCC and LFCC (linear frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results on SRE10 show that, while the two are complementary, LFCC consistently outperforms MFCC, mainly due to its better performance in the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech: LFCC benefits female speech by better capturing the spectral characteristics of the high-frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. It is concluded that LFCC should be more widely used, at least for female trials, by the mainstream speaker recognition community.
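The only difference between the two front ends compared above is the filterbank spacing: MFCC places its triangular filters on the mel scale, LFCC spaces them linearly; the log and DCT stages are identical. A sketch follows, with the filter count and FFT size as assumptions.

```python
import numpy as np
import librosa
from scipy.fft import dct

def cepstra(wav, sr, scale="mel", n_filt=24, n_fft=512, n_ceps=13):
    S = np.abs(librosa.stft(wav, n_fft=n_fft)) ** 2     # power spectrogram
    if scale == "mel":
        fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_filt)
    else:                                   # "linear": evenly spaced edges
        bins = np.floor(np.linspace(0, sr / 2, n_filt + 2)
                        / sr * n_fft).astype(int)
        fb = np.zeros((n_filt, n_fft // 2 + 1))
        for i in range(n_filt):
            lo, ce, hi = bins[i], bins[i + 1], bins[i + 2]
            if ce > lo:
                fb[i, lo:ce] = np.linspace(0, 1, ce - lo, endpoint=False)
            if hi > ce:
                fb[i, ce:hi] = np.linspace(1, 0, hi - ce, endpoint=False)
    return dct(np.log(fb @ S + 1e-8), axis=0, norm="ortho")[:n_ceps]
```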

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
This paper presents a deep neural network (DNN) for extracting articulatory information from the speech signal and explores different ways to use such information in a continuous speech recognition task. The DNN was trained to estimate articulatory trajectories from input speech, where the training data is a corpus of synthetic English words generated by the Haskins Laboratories' task-dynamic model of speech production. Speech parameterized as cepstral features was used to train the DNN, and we explored different cepstral features to observe their effect on the accuracy of articulatory trajectory estimation. The best feature was used to train the final DNN system, which was then used to predict articulatory trajectories for the training and test sets of Aurora-4, the noisy Wall Street Journal (WSJ0) corpus. This study also explored the use of hidden variables in the DNN pipeline as a potential acoustic feature candidate for speech recognition, with encouraging results. Word recognition results on Aurora-4 indicate that the articulatory features from the DNN improve speech recognition performance when fused with other standard cepstral features; however, when used by themselves, they failed to match the baseline performance.
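The fusion experiment described above combines the DNN's articulatory estimates with standard cepstral features at the frame level. A trivial sketch of that concatenation; shapes and names are placeholders, not values from the paper.

```python
import numpy as np

mfcc = np.random.randn(500, 39)            # (frames, cepstra with deltas)
tvs_est = np.random.randn(500, 8)          # DNN-estimated trajectories
fused = np.concatenate([mfcc, tvs_est], axis=1)    # (frames, 47) ASR input
```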

2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009
In this paper we present a technique for obtaining vocal tract (VT) time functions from the acoustic speech signal. Knowledge-based acoustic parameters (APs) are extracted from the speech signal, and a pertinent subset is used to obtain the mapping between them and the VT time functions. Eight vocal tract constriction variables were considered in this study: five constriction degree variables, namely lip aperture (LA), tongue body (TBCD), tongue tip (TTCD), velum (VEL), and glottis (GLO); and three constriction location variables, namely lip protrusion (LP), tongue tip (TTCL), and tongue body (TBCL). The TAsk Dynamics Application model (TADA) is used to create a synthetic speech dataset along with its corresponding VT time functions. We explore support vector regression (SVR) followed by Kalman smoothing to map between the APs and the VT time functions.
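A sketch of the mapping stage described above: one support vector regressor per VT time function, with a simple causal random-walk Kalman filter standing in for the paper's Kalman smoothing. Kernel settings and noise variances are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

def kalman_filter_1d(z, q=1e-3, r=1e-1):
    """Causal 1-D random-walk Kalman filter over a trajectory z."""
    x, p, out = z[0], 1.0, np.empty_like(z)
    for t, zt in enumerate(z):
        p += q                              # predict (state is a random walk)
        k = p / (p + r)                     # Kalman gain
        x += k * (zt - x)                   # update with the SVR estimate
        p *= 1 - k
        out[t] = x
    return out

# aps_*: (frames, n_aps) acoustic parameters; vt_train: (frames, 8) targets
def train_and_smooth(aps_train, vt_train, aps_test):
    preds = []
    for j in range(vt_train.shape[1]):      # one SVR per VT time function
        svr = SVR(kernel="rbf", C=1.0, epsilon=0.01)
        svr.fit(aps_train, vt_train[:, j])
        preds.append(kalman_filter_1d(svr.predict(aps_test)))
    return np.column_stack(preds)
```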

Speech Communication, 2004
Of all the sounds in any language, nasals are the only class of sounds with dominant speech output from the nasal cavity rather than the oral cavity. This gives nasals some special properties, including the presence of zeros in the spectrum, concentration of energy at lower frequencies, higher formant density, higher losses, and stability. In this paper we propose acoustic correlates for the linguistic feature nasal. In particular, we focus on the development of acoustic parameters (APs) that can be extracted automatically and reliably in a speaker-independent way. These APs were tested in a classification experiment between nasals and semivowels, the two classes of sounds that together form the class of sonorant consonants. Using the proposed APs with a support vector machine based classifier, we obtained classification accuracies of 89.53%, 95.80%, and 87.82% for prevocalic, postvocalic, and intervocalic sonorant consonants, respectively, on the TIMIT database. As additional evidence of the strength of these parameters, we compared the performance of a hidden Markov model (HMM) based system that included the APs for nasals as part of the front end with an HMM system that did not. In this digit recognition experiment, we obtained a 60% reduction in error rate on the TI46 database.
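The classification experiment above reduces to a binary SVM over AP feature vectors. A minimal sklearn sketch with placeholder data; the real features are the APs proposed in the paper, which are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# aps: (n_segments, n_aps) AP values per sonorant consonant segment
# labels: 1 = nasal, 0 = semivowel  (random placeholders, not real data)
rng = np.random.default_rng(0)
aps = rng.standard_normal((1000, 10))
labels = rng.integers(0, 2, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(aps, labels, test_size=0.2)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```
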
The Journal of the Acoustical Society of America, 2004
A Matlab-based computer program for vocal tract acoustic response calculation (VTAR) has been developed. Based on a frequency-domain vocal tract model [Z. Zhang and C. Espy-Wilson, J. Acoust. Soc. Am. (2004)], VTAR is able to model various complex sounds such as nasals, rhotics, and liquids. With input in the form of vocal tract cross-sectional area functions, VTAR calculates the vocal tract acoustic response function and the formant frequencies and bandwidths. The user-friendly interface allows direct data input for defined categories: vowels, nasals, nasalized sounds, consonants, laterals, and rhotics. The program also provides an interface for input and modification of arbitrary vocal tract geometry configurations, which is ideal for research applications. [Work supported by NIH Grant 1 R01 DC05250-01.]
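What a program like VTAR computes can be illustrated with a lossless chain-matrix (transmission-line) model: cascade the ABCD matrices of the uniform tube sections given by the area function and read off the volume-velocity transfer function. VTAR itself models losses, side branches, nasal coupling, and more; none of that appears in this sketch, and all values are illustrative.

```python
import numpy as np

RHO, C = 1.14e-3, 3.5e4            # air density (g/cm^3), sound speed (cm/s)

def vt_response(areas, lengths, freqs):
    """|U_lips / U_glottis| for a cascade of uniform tubes, ideal open end."""
    H = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        k = 2 * np.pi * f / C                       # wavenumber
        M = np.eye(2, dtype=complex)
        for A, L in zip(areas, lengths):            # glottis -> lips
            Zc = RHO * C / A                        # characteristic impedance
            M = M @ np.array([[np.cos(k * L), 1j * Zc * np.sin(k * L)],
                              [1j * np.sin(k * L) / Zc, np.cos(k * L)]])
        H[i] = 1.0 / abs(M[1, 1])                   # U_out/U_in with P_out = 0
    return H

# uniform 17.5 cm tube: resonances near 500, 1500, 2500 Hz
freqs = np.linspace(50.0, 3000.0, 2000)
H = vt_response([5.0] * 35, [0.5] * 35, freqs)
print("strongest resonance near:", freqs[np.argmax(H)], "Hz")
```
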
The Journal of the Acoustical Society of America, 2007
VTAR, a Matlab-based computer program for vocal tract acoustic response calculation [Zhou et al., J. Acoust. Soc. Am. 115(5), 2543 (2004)], has been used in both research and teaching. Its latest version includes several new features: speech sound synthesis with source model options, formant sensitivity function calculation, susceptance plot calculation (particularly useful for nasalized vowel analysis), and a new set of area function data for liquid sounds extracted from MR (magnetic resonance) images. These new features, along with the user-friendly interface, significantly enhance the usability of VTAR for both teaching and research. [Work supported by NIH Grant 1 R01 DC05250-01.]

The Journal of the Acoustical Society of America, 2013
Production of fricatives involves a narrow supraglottal constriction along the vocal tract. Air flows through the constriction and generates turbulent noise source(s) by impinging on obstacles downstream. In post-glossectomy speakers, the production of /s/ and /sh/ is often problematic, mainly because the tongue surgery changes tongue properties such as volume, motility, and symmetry, preventing the tongue from creating proper constrictions. The purpose of this study was to gain insight into how the vocal tracts of abnormal /s/ and /sh/ are shaped and what their corresponding acoustic consequences are. Based on cine magnetic resonance images, we built 3-D vocal tract models of /s/ and /sh/ from two post-glossectomy speakers (one with abnormal /s/ and the other with abnormal /sh/). Due to the missing part of the tongue, the reconstructed vocal tracts are asymmetric, with either an air-flow bypass or a side branch formed near the constrictions. Two coupled phy...