We explore discriminative training of HMM parameters that directly minimizes the expected error rate. In discriminative training one is interested in training a system to minimize a desired error function, such as word error rate, phone error rate, or frame error rate. We review a recent method which introduces an analytic expression for the gradient of the expected error rate. The analytic expression leads to a perceptron-like update rule, which is adapted here for training HMMs in an online fashion. While the proposed method can work with any error function used in speech recognition, we evaluated it on phoneme recognition on TIMIT, with frame error rate as the error function used for training. Except for the case of a GMM with a single mixture component per state, the proposed update rule provides lower error rates, both in terms of frame error rate and phone error rate, than other approaches, including MCE and large-margin training.
Index Terms: hidden Markov models, online learning, direct error minimization, discriminative training, automatic speech recognition, minimum phone error, minimum frame error
IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2009
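The perceptron-like flavor of the update can be illustrated with a minimal sketch. This is not the paper's exact rule (which updates HMM/GMM parameters via the gradient of the expected error rate); it is a hedged stand-in that shows the same idea at the frame level with an assumed linear per-state scorer `W`: whenever the highest-scoring state disagrees with the reference frame label, pull the reference state's score up and push the competitor's down.

```python
import numpy as np

def frame_perceptron_epoch(frames, labels, W, lr=0.1):
    """One online pass of a perceptron-like update aimed at frame error.

    Illustrative stand-in for the paper's update on HMM parameters:
    W is a (num_states, dim) linear scorer over acoustic frames, an
    assumption made for this sketch. Returns the updated W and the
    frame error rate observed during the pass.
    """
    errors = 0
    for x, y in zip(frames, labels):
        pred = int(np.argmax(W @ x))   # best-scoring state for this frame
        if pred != y:                  # a frame error: apply the update
            errors += 1
            W[y] += lr * x             # raise the reference state's score
            W[pred] -= lr * x          # lower the competing state's score
    return W, errors / len(frames)
```

Run over the training frames for several epochs, the observed frame error rate shrinks on separable data, mirroring the online training regime described above.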
We consider learning acoustic feature transformations using an additional view of the data, in this case video of the speaker's face. Specifically, we consider a scenario in which clean audio and video are available at training time, while at test time only noisy audio is available. We use canonical correlation analysis (CCA) to learn linear projections of the acoustic observations that have maximum correlation with the video frames. We provide an initial demonstration of the approach on a speaker recognition task using data from the VidTIMIT corpus. The projected features, in combination with baseline MFCCs, outperform the baseline recognizer in noisy conditions. The techniques we present are quite general, although here we apply them to the case of a specific speaker recognition task. This is the first work of which we are aware in which multiple views are used to learn an acoustic feature projection at training time, while using only the acoustics at test time.
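The CCA step above can be sketched with standard linear algebra: whiten each view's covariance, take the SVD of the whitened cross-covariance, and un-whiten the singular vectors to get the projections. This is a generic textbook CCA, not the paper's code; the function name, the ridge term `reg`, and the synthetic shapes are assumptions for illustration.

```python
import numpy as np

def cca_projections(X, Y, k, reg=1e-6):
    """Learn rank-k CCA projections between paired views.

    X: (n, dx) acoustic frames; Y: (n, dy) video frames, rows paired.
    Returns A (dx, k) and B (dy, k) so that the columns of X_c @ A and
    Y_c @ B are maximally correlated, plus the canonical correlations.
    The small ridge `reg` keeps the whitening numerically stable.
    """
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]    # projection for the acoustic view
    B = Wy @ Vt[:k].T    # projection for the video view (training only)
    return A, B, s[:k]
```

At test time only `A` is needed: `(X_test - X.mean(0)) @ A` gives projected acoustic features from the noisy audio alone, which would then be appended to the baseline MFCCs in the setting described above.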
Papers by Mark Stoehr