Papers by Nicolas Scheffer

Annual Conference of the International Speech Communication Association, 2006
Building acoustic events and their sequence analysis (AES system) is a method that proved its efficiency in previous work. Indeed, the methodology combines the power of the world-model GMM, used in state-of-the-art speaker detection systems, for extracting speaker-independent events with an analysis of these event sequences via tools usually used in so-called high-level speaker detection systems. The efficiency of this system has been validated at the last NIST evaluation campaign. This paper aims at proposing a new framework by applying an AES system on multiple classes, C-AES. The originality of this work is to consider that intraclass sequence analysis can bring more information than a global analysis of the whole speaker utterance. This paper also proposes a method to take into account the a priori knowledge of the classes within the scoring process. The results support the fact that intraclass information is discriminant for speaker verification, as a combination with a state-of-the-art GMM brings a 12% relative gain at the DCF.
This paper presents some experiments on discriminative training for GMM/UBM-based speaker recognition systems. We propose two MMIE adaptation methods for GMM component weights suitable for speaker recognition. The impact of these training methods on performance is compared to the standard weight estimation/adaptation criteria, MLE and MAP, on standard GMM-based systems and on SVM-based systems. The results …
Annual Conference of the International Speech Communication Association, 2009
Intersession variability (ISV) compensation in speaker recognition is well studied with respect to extrinsic variation, but little is known about its ability to model intrinsic variation. We find that ISV compensation is remarkably successful on a corpus of intrinsic variation that is highly controlled for channel (a dominant component of ISV). The results are particularly surprising because the ISV training …
Novel approaches using high-level features have recently emerged in the speaker recognition field. They consist of modeling speakers using linguistic features such as words, phonemes, and idiolect. The benefit of these features was demonstrated in NIST campaigns. Their main disadvantage is the large amount of data they need to be effective. The purpose of this study is to generalize this approach by using acoustic events, generated by a GMM, as input features. A methodology to build a dictionary and to model speakers using symbol sequences from this dictionary is derived. Different experiments on the NIST SRE 2004 database show that the information produced is speaker-specific and that a fusion experiment with a GMM verification system improves performance.
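The event-dictionary idea above can be sketched in a few lines. This is a minimal illustration, not the paper's system: here each frame is assigned to the nearest component centroid by Euclidean distance (the paper scores full Gaussians), and the "sequence analysis" step is reduced to bigram counting; all data and centroid values are made up for the example.

```python
import numpy as np

def frames_to_events(frames, means):
    """Map each feature frame to the index of the nearest event centroid.

    Simplified stand-in for top-scoring GMM component selection:
    Euclidean distance only, no covariances or weights.
    """
    # (n_frames, n_components) distance matrix via broadcasting
    d = np.linalg.norm(frames[:, None, :] - means[None, :, :], axis=-1)
    return d.argmin(axis=1)

def bigram_counts(symbols, n_symbols):
    """Count symbol bigrams -- a toy version of the sequence analysis."""
    counts = np.zeros((n_symbols, n_symbols), dtype=int)
    for a, b in zip(symbols[:-1], symbols[1:]):
        counts[a, b] += 1
    return counts

# Toy data: 2-D frames and a 3-symbol "dictionary" of event centroids
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
frames = np.array([[0.1, 0.2], [4.9, 5.1], [5.2, 4.8], [0.0, 4.9]])
events = frames_to_events(frames, means)
print(events.tolist())                  # [0, 1, 1, 2]
print(bigram_counts(events, 3)[1, 1])   # 1
```

The resulting n-gram statistics would then feed a sequence model or kernel, as in the high-level systems the abstract generalizes.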
Calibration and multiple system fusion for spoken term detection using linear logistic regression
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014

A noise robust i-vector extractor using vector taylor series for speaker recognition
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
We propose a novel approach for noise-robust speaker recognition, where a model of the distortions caused by additive and convolutive noises is integrated into the i-vector extraction framework. The model is based on a vector Taylor series (VTS) approximation widely successful in noise-robust speech recognition. The model allows for extracting “cleaned-up” i-vectors which can be used in a standard i-vector back end. We evaluate the proposed framework on the PRISM corpus, a NIST-SRE-like corpus where noisy conditions were created by artificially adding babble noises to clean speech segments. Results show that VTS i-vectors yield significant improvements in all noisy conditions compared to a state-of-the-art baseline speaker recognition system. More importantly, the proposed framework is robust to noise, as improvements are maintained when the system is trained on clean data.
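The VTS mismatch function that underlies this kind of model can be sketched as follows. This is a simplification, not the paper's extractor: it is written in the log-spectral domain (the cepstral version adds DCT matrices), and only shows the zeroth-order relation between clean speech, noise, and the noisy observation plus the Jacobian used for linearization.

```python
import numpy as np

def vts_noisy_mean(x, n, h):
    """Zeroth-order VTS mismatch function in the log-spectral domain.

    y = x + h + log(1 + exp(n - x - h)), with x clean speech log-energy,
    n additive-noise log-energy, and h a convolutive (channel) offset.
    """
    return x + h + np.log1p(np.exp(n - x - h))

def vts_jacobian(x, n, h):
    """dy/dx -- used to linearize the model around the expansion point."""
    return 1.0 / (1.0 + np.exp(n - x - h))

x = np.array([10.0, 10.0])
# Noise far below the speech level barely moves the observation ...
print(vts_noisy_mean(x, np.array([0.0, 0.0]), 0.0))
# ... while noise at the same level adds about log(2) ~ 0.693
print(vts_noisy_mean(x, np.array([10.0, 10.0]), 0.0))
```

In a VTS-based extractor, this relation (linearized via the Jacobian) replaces the clean observation model so that the latent "clean" statistics can be inferred from noisy features.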

2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010
We propose a new method to characterize a speaker within the Joint Factor Analysis (JFA) framework. Scoring within the JFA framework can be costly, and a new method was previously proposed to produce an accurate score in a fast manner. However, this method is nonsymmetric and performs badly without any score normalization. We propose a new JFA scoring method that is both symmetrical and efficient. In the same way as means of Gaussians can be concatenated to form a supervector, we use several estimates of speaker factors from the eigenvoice space to build a supervector of factors that we call superfactors. We motivate the use of such factors in the current JFA model through comparison with a Tied Factor Analysis model. We show that this method substantially improves the performance of a system that uses only the standard speaker factors to produce scores, and usually outperforms the baseline system. We also show that this method is relatively effective even when score normalization is not an option.
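The stacking-and-symmetric-scoring idea can be illustrated schematically. This is a sketch under stated assumptions, not the paper's scoring function: the superfactor is simply a concatenation of several speaker-factor estimates, and cosine similarity stands in as one common symmetric scoring choice; the factor values are invented for the example.

```python
import numpy as np

def superfactor(factor_estimates):
    """Stack several speaker-factor estimates into one 'superfactor'
    vector, mirroring how Gaussian means are stacked into a supervector."""
    return np.concatenate(factor_estimates)

def symmetric_score(sf_a, sf_b):
    """Cosine similarity: symmetric by construction, since the dot
    product and norms do not depend on which side is enroll or test."""
    return float(sf_a @ sf_b / (np.linalg.norm(sf_a) * np.linalg.norm(sf_b)))

a = superfactor([np.array([1.0, 0.0]), np.array([0.5, 0.5])])
b = superfactor([np.array([1.0, 0.0]), np.array([0.5, 0.5])])
print(symmetric_score(a, b))   # ~1.0 for identical inputs
print(symmetric_score(a, b) == symmetric_score(b, a))   # True: symmetric
```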
International Conference on Acoustics, Speech, and Signal Processing, 2010
Prosodic information has been successfully used for speaker recognition for more than a decade. The best-performing prosodic system to date has been one based on features extracted over syllables obtained automatically from speech recognition output. The features are then transformed using a Fisher kernel, and speaker models are trained using support vector machines (SVMs). Recently, a simpler …

2006 IEEE Odyssey - The Speaker and Language Recognition Workshop, 2006
In the past few years, discriminative approaches to speaker detection have shown good results and attracted increasing interest. Among these methods, SVM-based systems have many advantages, especially their ability to deal with a high-dimensional feature space. Generative systems such as UBM-GMM systems show the greatest performance among other systems in speaker verification tasks. Combining generative and discriminative approaches is not a new idea and has been studied several times by mapping a whole speech utterance onto a fixed-length vector. This paper presents a straightforward, computationally inexpensive method to combine the two approaches, using only a UBM model to drive the experiment. We show that the use of the TFLLR kernel, while closely related to a reduced form of the Fisher mapping, yields performance close to a standard GMM/UBM-based speaker detection system. Moreover, we show that a combination of both outperforms the systems taken independently.
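The TFLLR-style feature mapping mentioned above is compact enough to sketch. This is a minimal illustration, not the paper's exact formulation: utterance-level symbol frequencies are scaled by the inverse square root of hypothetical background probabilities, so that a linear SVM kernel on the mapped vectors approximates a log-likelihood-ratio comparison; the counts and background values are made up.

```python
import numpy as np

def tfllr_map(counts, background_probs):
    """TFLLR-style mapping: normalize symbol counts to frequencies,
    then scale each dimension by 1/sqrt(p_i), where p_i is the
    background probability of symbol i."""
    freqs = counts / counts.sum()
    return freqs / np.sqrt(background_probs)

counts = np.array([8.0, 2.0])         # symbol counts in one utterance
background = np.array([0.64, 0.36])   # hypothetical background unigram probs
phi = tfllr_map(counts, background)
print(phi)   # [1.0, ~0.333]

# A linear kernel on the mapped vectors is then the SVM's similarity
k = float(phi @ tfllr_map(np.array([4.0, 6.0]), background))
```

The scaling de-emphasizes symbols that are frequent everywhere, which is what ties this simple kernel to a likelihood-ratio view.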
2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009
Combination strategies for a factor analysis phone-conditioned speaker verification system
SRI Submissions to Chinese-English PatentMT NTCIR10
Method and apparatus for speaker-calibrated speaker detection
Multi-sample conversational voice verification
Method and apparatus for audio characterization
Content matching for short duration speaker recognition
SRI’s Submissions to Chinese-English PatentMT NTCIR10 Evaluation
Bilinear Factor Analysis for iVector Based Speaker Verification
Trial-based calibration for speaker recognition in unseen conditions

The purpose of this work is to show how recent developments in cepstral-based systems for speaker recognition can be leveraged for the use of Maximum Likelihood Linear Regression (MLLR) transforms. Speaker recognition systems based on MLLR transforms have been shown to be greatly beneficial in combination with standard systems, but most of the advances in speaker modeling techniques have been implemented for cepstral features. We show how these advances based on factor analysis, such as eigenchannel and i-vector, can be easily employed to achieve very high accuracy. We show that they outperform the current state-of-the-art MLLR-SVM system that SRI submitted during the NIST SRE 2010 evaluation. The advantages of leveraging the new approaches are manifold: the ability to process a large amount of data, working in a reduced dimensional space, importing any advances made for cepstral systems to the MLLR features, and the potential for system combination at the i-vector level.

Towards noise-robust speaker recognition using probabilistic linear discriminant analysis
This work addresses the problem of speaker verification where additive noise is present in the enrollment and testing utterances. We show how the current state-of-the-art framework can be effectively used to mitigate this effect. We first look at the degradation a standard speaker verification system is subjected to when presented with noisy speech waveforms. We designed and generated a corpus with noisy conditions, based on the NIST SRE 2008 and 2010 data, built using open-source tools and freely available noise samples. We then show how adding noisy training data to the current i-vector-based approach followed by probabilistic linear discriminant analysis (PLDA) can bring significant gains in accuracy at various signal-to-noise ratio (SNR) levels. We demonstrate that this improvement is not feature-specific, as we present positive results for three disparate sets of features: standard mel-frequency cepstral coefficients, prosodic polynomial coefficients, and maximum likelihood linear regression (MLLR) transforms.
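The corpus-construction step described above (mixing noise into clean speech at a target SNR) follows a standard recipe that can be sketched directly. This is a generic illustration, not the paper's actual tooling; the signals here are synthetic random samples standing in for speech and noise waveforms.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale a noise segment so that mixing it with the speech yields
    the requested signal-to-noise ratio (in dB), then add it."""
    power_speech = np.mean(speech ** 2)
    power_noise = np.mean(noise ** 2)
    target_noise_power = power_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_noise_power / power_noise)

# Synthetic stand-ins for one second of 16 kHz speech and noise
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = add_noise_at_snr(speech, noise, 10.0)

# The achieved SNR matches the request, since only the noise is scaled
measured = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
print(round(measured, 6))
```

Repeating this over a grid of SNR levels and noise types is how multi-condition training sets like the one in the abstract are typically assembled.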