Speaker Verification

2,075 papers

6,176 followers

About this topic

Speaker verification is a biometric authentication process that uses voice characteristics to confirm an individual's identity. It involves analyzing vocal attributes, such as pitch, tone, and speech patterns, to determine if the speaker matches a pre-registered voice model, ensuring secure access to systems or information.

Key research themes

1. How can speaker verification systems be robustly defended against diverse spoofing attacks including voice conversion, speech synthesis, and replay?

This research area focuses on understanding the vulnerabilities of automatic speaker verification (ASV) systems to a broad range of spoofing attacks, such as voice conversion, speech synthesis, and replay attacks, which pose severe security threats. It also investigates the design and evaluation of anti-spoofing countermeasures, including databases, protocols, and methodologies to detect and mitigate both known and unknown spoofing types, particularly in the context of text-independent ASV systems. The work is significant because spoofing can undermine the reliability of ASV systems deployed in real-world applications such as call centers, banking, and forensic investigations.

Anti-Spoofing for Text Independent Speaker Verification

by International Journal of Scientific Research in Science, Engineering and Technology IJSRSET

2017

Key finding: This study introduces the first comprehensive spoofing and anti-spoofing (SAS) database comprising nine diverse spoofing techniques (including multiple speech synthesis and voice conversion systems) for text-independent...

ASVspoof: The Automatic Speaker Verification Spoofing and Countermeasures Challenge

by Tomi Kinnunen

2022, IEEE Journal of Selected Topics in Signal Processing

Key finding: Describes the community-driven ASVspoof initiative that addresses the lack of common datasets and standardized protocols by providing the ASVspoof 2015 dataset and organizing competitive evaluations, demonstrating the...

Spoofing and countermeasures for automatic speaker verification

by Tomi Kinnunen

2021

Key finding: Provides a detailed survey of vulnerabilities unique to text-independent ASV systems, emphasizing how prior countermeasures often rely on known spoofing attacks and lack generalizability. It highlights the need for standard...

Joint Speaker Verification and Antispoofing in the i-Vector Space

by Tomi Kinnunen

2016, IEEE Transactions on Information Forensics and Security

Key finding: Presents a novel joint modeling approach in the i-vector subspace that simultaneously addresses speaker verification and voice conversion spoofing attack detection without relying on tailored discriminative features. By...

Voice Spoofing Countermeasures: Taxonomy, State-of-the-art, experimental analysis of generalizability, open challenges, and the way forward

by Awais A. Khan

2024, arXiv (Cornell University)

Key finding: Provides an extensive taxonomy and comprehensive experimental comparison of spoofing countermeasures across diverse feature extraction and classification paradigms, examining their generalizability on ASVspoof2019 and VSDC...

2. What techniques improve speaker verification performance and robustness under practical conditions such as limited data, language mismatch, recording channel variability, and multi-speaker environments?

This research theme focuses on enhancing speaker verification accuracy and reliability in realistic and challenging conditions. It includes methods dealing with limited-duration speech segments, channel distortions (e.g., GSM transcoded speech), multilingual and cross-lingual mismatches, and speaker overlap situations. The research addresses acoustic feature design, fusion of complementary feature sets, model adaptation, and joint optimization strategies to maintain verification performance in heterogeneous real-world scenarios.

i-Vector-Based Speaker Verification on Limited Data Using Fusion Techniques

by jayanthi kumari

2023, Journal of Intelligent Systems

Key finding: Demonstrates that combining vocal tract features (MFCC, LPCC) with excitation source features (LPR, LPRP) using feature- and score-level fusion significantly reduces equal error rate (EER) in i-vector based speaker...

The impact of mismatched recordings on an automatic-speaker-recognition system and human listeners

by Radek Skarnitzl

2024, Acta Universitatis Carolinae. Philologica

Key finding: Empirically shows that both automatic speaker recognition systems based on i-vectors/x-vectors and human listeners experience performance degradation when comparing recordings that differ in language and recording time. The...

Robust speaker verification from GSM-transcoded speech based on decision fusion and feature transformation

by Man-wai Mak

2024, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

Key finding: Proposes a novel data-dependent score fusion algorithm that computes adaptive weights for fusing multiple utterance scores in GSM-transcoded speech speaker verification, using prior knowledge from enrollment scores. This...

Speaker Verification Based on Single Channel Speech Separation

by Mijit Ablimit

2025, IEEE Access

Key finding: Introduces an integrated approach combining feature-scale single-channel speech separation with back-end speaker verification, using neural network-based separation models and MFCC-T features. The proposed method trains both...

Comparison of Text Independent Speaker Identification Systems using GMM and i-Vector Methods

by ab kh

2019

Key finding: Finds that i-vector-based speaker identification systems outperform Gaussian mixture model (GMM) methods, especially when combined with PLDA classifiers and features like PNCC and RASTA-PLP, and that augmenting features with...

3. How can speaker verification fairness across demographic and language groups be improved without requiring subgroup labels or creating reliance on balanced data samples?

This research area addresses performance disparities in speaker verification systems arising from imbalanced representation of demographic groups such as gender and nationality, or language variability. The focus is on algorithmic fairness approaches that automatically identify underperforming groups without explicit annotations, using adversarial learning, group-adapted embeddings, fusion networks, and reweighting schemes. This direction is crucial for equitable deployment of speaker verification in diverse real-world populations and for mitigating biases inherent in training data.

Adversarial Reweighting for Speaker Verification Fairness

by Andreas Stolcke

2024, arXiv (Cornell University)

Key finding: Reformulates adversarial reweighting (ARW) for speaker verification with metric learning, enabling the adversarial network to assign higher weights to poorly performing instances without subgroup annotations. Demonstrates...

Improving Fairness in Speaker Verification via Group-Adapted Fusion Network

by Andreas Stolcke

2024, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Key finding: Proposes a modular network architecture combining group-specific embedding adaptation and score fusion to mitigate model unfairness caused by imbalanced gender representation during training. Experiments show that this...

Enhancing speaker verification accuracy with deep ensemble learning and inclusion of multifaceted demographic factors

by International Journal of Electrical and Computer Engineering (IJECE) and

2023, International Journal of Electrical and Computer Engineering (IJECE)

Key finding: Develops an ensemble-based deep learning framework integrating gender and ethnicity classifiers with a Siamese verification network, and demonstrates improved equal error rates and decision cost functions on the large-scale...

All papers in Speaker Verification

Inter Dataset Variability Compensation for Speaker Recognition

by Hagai Aronowitz

Recently satisfactory results have been obtained in NIST speaker recognition evaluations. These results are mainly due to accurate modeling of a very large development dataset provided by LDC. However, for many realistic scenarios the use...

An Impact of Wideband Speech Codec Mismatch on a Performance of GMM-UBM Speaker Verification over Telecommunication Channel

by Peter Pocta

Proceedings of the 11th International Conference ELEKTRO 2016

Automatic verification of a person's identity from their voice is a part of modern telecommunication services. In order to execute a verification task, a speech signal has to be transmitted to a remote server. So, the performance of the...

Fig. 1. Speaker identification

Depending upon the application, two different tasks are defined under the general heading of ASA: identification and verification. In the case of speaker identification, a voice sample from an unknown speaker is compared with a set of labeled speaker models. When it is known that the set of speaker models includes all speakers of interest, the task is referred to as closed-set identification. The label of the best matching speaker is taken to be the identified speaker. Most speaker identification applications are open-set, meaning that it is possible that an unknown speaker is not included in the set of speaker models. In this case, if no satisfactory match is obtained, the “no-match” decision is provided. In the case of speaker verification, an identity claim is provided or asserted along with the voice sample. In this case, the unknown voice sample is compared only with the speaker model whose label corresponds to the identity claim. If the match is satisfactory, the identity claim is accepted; otherwise the claim is rejected. The speaker verification decision mode is intrinsic to most access control applications. This principle of speaker identification and verification is displayed in Fig. 1 and Fig. 2 respectively [1].
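The three decision modes described above (closed-set identification, open-set identification with a "no-match" outcome, and verification of an identity claim) can be sketched as follows. This is only an illustration: the speaker models are assumed to be fixed-length vectors compared by cosine similarity, and the names `identify`, `verify` and the toy `models` dictionary are hypothetical, not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(sample, models, threshold=None):
    """One-to-many comparison.

    With threshold=None this is closed-set identification: the best
    matching label is always returned. With a threshold it is open-set:
    None stands for the "no-match" decision.
    """
    best = max(models, key=lambda label: cosine(sample, models[label]))
    if threshold is not None and cosine(sample, models[best]) < threshold:
        return None
    return best

def verify(sample, claimed_label, models, threshold):
    """One-to-one comparison against the claimed speaker's model only."""
    return cosine(sample, models[claimed_label]) >= threshold

# Toy speaker models (assumed data, for illustration only).
models = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
```

For example, `identify([0.9, 0.1], models)` picks the closest model, while `verify([0.9, 0.1], "bob", models, 0.8)` rejects the claim because only the claimed model is consulted.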

Fig. 3. Choosing different operating points results in different FPR and FNR

Demands on the accuracy of biometric systems are different depending on their use in a particular application. For example, in forensic applications such as criminal identification, the FNR measure is very important for the design of the system. There is a need to identify a criminal even despite the risk of manually examining a large number of potentially incorrect matches produced by the biometric system. On the other hand, the FPR measure is one of the most important factors in high security applications. The main aim here is to discourage impostors.
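A small sketch of how the operating point trades FPR against FNR, and how an approximate equal error rate (EER) can be read off a set of scores. The score lists are invented for illustration, and the convention "accept when score >= threshold" is an assumption.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (FPR, FNR) when a trial is accepted if score >= threshold."""
    fnr = sum(s < threshold for s in target_scores) / len(target_scores)
    fpr = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fpr, fnr

def approx_eer(target_scores, impostor_scores):
    """Scan observed scores as thresholds; pick the one where FPR and FNR are closest."""
    candidates = sorted(set(target_scores) | set(impostor_scores))

    def gap(t):
        fpr, fnr = error_rates(target_scores, impostor_scores, t)
        return abs(fpr - fnr)

    best = min(candidates, key=gap)
    fpr, fnr = error_rates(target_scores, impostor_scores, best)
    return best, (fpr + fnr) / 2

# Invented scores: higher means "more likely the claimed speaker".
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]

low_threshold = error_rates(targets, impostors, 0.25)   # few misses, many false accepts
high_threshold = error_rates(targets, impostors, 0.75)  # no false accepts, more misses
```

A forensic-style system would sit near the low threshold (small FNR), whereas a high security system would sit near the high threshold (small FPR).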

Fig. 4. DET curves of individual conditions

EER (in %) for the fully mismatched and fully matched two verification scenarios

...coding/degradation situations. Such codec mismatch situations are represented in this study by the third scenario, defined above as the fully-codec mismatched scenario. In fact, it is not a simple task to select a codec for training offering a good performance over a wide range of WB codecs currently deployed in telecommunication networks, because a verification application mostly does not have access to a communication/signalization protocol and thus does not have any information about the codec deployed for the corresponding voice transmission/communication. As has already been shown above, the performance of the verification system is very codec-specific. Moreover, a good performance in codec mismatch situations (different codecs involved in the enrollment and testing phases) is one of the main requirements for a design of a robust system covering all the prospective...

Standard audio format encapsulation (SAFE)

by Homayoon Beigi

2011, Telecommunication Systems

One characteristic that distinguishes speaker recognition (identification, verification, classification, tracking, etc.) from other biometrics is that it is designed to operate with devices and over channels that were created for other...

Voice forgery using alisp: Indexation in a client memory

by Patrick Perrot and

2005, ICASSP 2005

This article deals with a technique of voice forgery using the ALISP (Automatic Language Independent Speech Processing) approach. Such a technique allows one to transform the voice of an arbitrary person (the impostor), forging the identity...

Effects of device mismatch, language mismatch and environmental mismatch on speaker verification

by Man-wai Mak

2007, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Device, language and environmental mismatch adversely affect speaker verification (SV) performance. We investigate such effects empirically based on the M3 (multibiometric, multilingual and multi-device) Corpus [1]. Device mismatch (among...

Survey of approaches and challenges in multimodal biometric authentication systems

by Chiung Ching (Peter) Ho and

2006, The Proceedings of the 3rd International Conference …

Authentication is the process whereby a user proves his claim to identity. This paper aims to review existing MMBAS and multimodal biometric datasets commonly used for testing and benchmarking results. At the same time, a brief overview...

ONLINE BUS PASS GENERATION USING QR CODE

by IJIRAE - International Journal of Innovative Research in Advanced Engineering

2019, IJIRAE:: AM Publications,India

The main objectives of this work are to describe the online bus pass generation and ticket booking using QR code. Online bus pass generation is helpful to people who are suffering issues with the present technique for the generation of...

Multi Filter Bank Approach for Speaker Verification Based on Genetic Algorithm

by Bruno Gas and

2007, Lecture Notes in Computer Science

After the success of NOLISP'03, the NOLISP'04 summer school and NOLISP'05, we are pleased to present NOLISP'07, the fourth event in a series of events related to non-linear speech processing.

Fundamentals of Speaker Recognition

by Homayoon Beigi

2011

Multi filter bank approach for speaker verification based on genetic algorithm

by Denilson Cruz

2007

Hidden Markov Models based text-to-speech (HMM-TTS) synthesis is a technique for generating speech from trained statistical models where spectrum, pitch and durations of basic speech units are modelled altogether. The aim of this work is...

The QUT-NOISE-SRE Protocol for the Evaluation of Noisy Speaker Recognition

by Ahilan Kanagasundaram and

2015, 16th Annual Conference of the International Speech Communication Association, Interspeech 2015

The QUT-NOISE-SRE protocol is designed to mix the large QUT-NOISE database, consisting of over 10 hours of background noise, collected across 10 unique locations covering 5 common noise scenarios, with commonly used speaker recognition...

An improved method for voice pathology detection by means of a HMM-based feature space transformation

by J Godino-Llorente and

2010, Pattern Recognition

This paper presents a new feature transformation technique applied to improve the screening accuracy for the automatic detection of pathological voices. The statistical transformation is based on Hidden Markov Models, obtaining a...

Fig. 1. Two different approaches of the processes involved in a pattern recognition system for automatic screening of voice disorders.

Fig. 2. Steps for transforming a new observation sequence, given a HMM and its associated transformation. Note that the subscripts of the matrices W_i used to transform each single observation keep the same order as the state subscripts in the Viterbi path.

Fig. 3. ROC curves for the best accuracy of the system training with three different criteria using the MEEI database.

Fig. 4. ROC curves of the system trained with the MCE criterion for three different FST using the MEEI database.

Fig. 5. ROC curves for the best accuracy of the system obtained using the UPM database, and training with the different criteria.

Fig. 6. ROC curves training the system using the UPM database with the MCE criterion, and using three different transformation techniques.

Fig. 7. Evolution of the MCE training method along the iterations: (a) accuracy in the training phase; (b) distance between HMM (see Eq. (11)); and (c) loss function (see Eq. (9)).

Summary of previous research works on voice pathology detection, detailing the number of patients in the database (pathologic+normal), the features employed, the transformation technique (if any), the classification method, and the accuracy reported.
* Results obtained without a rigorous validation methodology.
> Results obtained with biased methods (see Section 1).

AUC for the best results obtained using the MEEI database.

Best results obtained with the different FST using the MEEI database. Note that the number of samples for each class is different. * The value (-) is the number of features of the transformed space. > Mean value + standard deviation.

AUC for the best results obtained using the UPM database. Table 5

Best results obtained with the different FST using the UPM database. * The value (-) is the number of features of the transformed space. > Mean value + standard deviation Table 4

A new nonlinear feature extraction algorithm for speaker verification

by Bruno Gas and

2004

In this paper we propose a new feature extraction algorithm based on nonlinear prediction: the Neural Predictive Coding model which is an extension of the classical LPC one. This model is applied to speaker verification by the...

Saudi accented Arabic voice bank

by Fayez A . Alhargan

2019

Learning Speaker Representations with Mutual Information

by Mirco Ravanelli

Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the...

Automatic Speaker Recognition system using Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) approach

by Abdullah - al - mamun

In this paper, an automatic speaker recognition system is implemented by combining feature extraction and feature matching techniques. The feature extraction method is implemented using Mel Frequency Cepstral Coefficients (MFCC). The Vector...

Speech is one of the natural forms of human communication. Modern technology has made it the basis of security systems built on speaker recognition. Speaker recognition technology makes it possible to control access to secret services, for example giving commands to a computer, phone access to banking, database services, shopping or voice mail, and access to secure equipment by a speaker’s voice. Here we discuss a simple model of a speaker recognition system that can be applied to the speech of an unknown speaker while achieving higher accuracy.

Figure 1: Speaker recognition application on the computer as a password system.

Figure 3: Speaker Verification system. Speaker Verification determines whether the caller is who he/she claims to be from the utterance of a speaker. This is one-to-one comparison.

Figure 2: Speaker Identification system. Speaker Identification determines who the caller is out of a set of known speakers, using an utterance from the speaker. This is a one-to-many comparison.

A speaker recognition system is mainly composed of the following four parts:

Figure 4: Block diagram of speaker recognition system.

Figure 5: Steps for Mel Frequency Cepstral...

...Voronoi region, and it is defined by: $V_i = \{x \in R^k : \|x - y_i\| \le \|x - y_j\| \text{ for all } j \ne i\}$. The set of Voronoi regions partitions the entire space $R^k$ such that $\bigcup_{i=1}^{N} V_i = R^k$ and $V_i \cap V_j = \emptyset$ for all $i \ne j$.

Figure 6: Codeword resides in its own Voronoi region for a given vector space.

The clustering can be done by a clustering algorithm such as K-means. K-means clustering is a method of classifying or grouping items into k groups (where k is the number of clusters). This grouping is computed by minimizing the sum of Euclidean distances between all samples and the corresponding centroid. The algorithm is composed of the following steps:

Figure 7: Steps for k-means clustering algorithm

The K-means algorithm partitions the X feature vectors into N centroids. The algorithm first chooses N cluster centroids among the X feature vectors. Then each feature vector is assigned to the nearest centroid and the new centroids are calculated. This procedure is continued until a stopping criterion is met, that is, the mean square error between the feature vectors and the cluster centroids is below a certain threshold or there is no more change in the cluster-centre assignment.
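The assign-then-update loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: seeding the centroids with the first N vectors and stopping when assignments no longer change are assumptions made to keep the example deterministic, and the toy `points` are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, n_centroids, max_iter=100):
    # Assumption: seed with the first n_centroids vectors (real systems
    # usually pick seeds at random or by codebook splitting).
    centroids = [list(v) for v in vectors[:n_centroids]]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every feature vector to its nearest centroid.
        new_assignment = [
            min(range(n_centroids), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Stopping criterion: no change in the cluster assignment.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its members.
        for c in range(n_centroids):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Invented 2-D "feature vectors" forming two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids, assignment = kmeans(points, 2)
```

In a VQ-based recogniser the resulting centroids would serve as the codebook for one speaker.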

Table 2: Test result for speaker recognition system.

According to the database description, the numbers of samples collected from different speakers are shown below.

Verified speaker localization utilizing voicing level in split-bands

by A. Asaei

2009, Signal Processing

This paper proposes a joint verification-localization structure based on split-band analysis of speech signal and the mixed voicing level. To address the problems in reverberant acoustic environments, a new fundamental frequency...

Comparison of clustering methods: A case study of text-independent speaker modeling

by Tomi Kinnunen

2011, Pattern Recognition Letters

Clustering is needed in various applications such as biometric person authentication, speech coding and recognition, image compression and information retrieval. Hundreds of clustering methods have been proposed for the task in various...

Articulatory feature-based conditional pronunciation modeling for speaker verification

by Man-wai Mak

2004

Because of the differences in education background, accents, etc., different persons have their unique way of pronunciation. This paper exploits the pronunciation characteristics of speakers and proposes a new conditional pronunciation...

Score Stabilization for Speaker Recognition Trained on a Small Development Set

by Hagai Aronowitz

Nowadays state-of-the-art speaker recognition systems obtain quite accurate results for both text-independent and text-dependent tasks as long as they are trained on a fair amount of development data from the target domain (assuming clean...

PLDA based Speaker Recognition on Short Utterances

by Ahilan Kanagasundaram

2012, In The Speaker and Language Recognition Workshop (Odyssey 2012)

This paper investigates the effects of limited speech data in the context of speaker verification using a probabilistic linear discriminant analysis (PLDA) approach. Being able to reduce the length of required speech data is important to...

Improving PLDA Speaker Verification using WMFD and Linear-weighted Approaches in Limited Microphone Data Conditions

by Ahilan Kanagasundaram

2015, 16th Annual Conference of the International Speech Communication Association, Interspeech 2015

This paper proposes the addition of a weighted median Fisher discriminator (WMFD) projection prior to length-normalised Gaussian probabilistic linear discriminant analysis (GPLDA) modelling in order to compensate the additional session...

One Class Projections in Speaker Verification

by Anthony Brew

Speaker verification might be considered a binary classification problem in that the objective is to determine whether or not an utterance is from the individual whose identity is claimed. Several factors make speaker verification different from a standard binary problem. Speaker verification is challenging because of the open nature of the problem: if the utterances of an individual are examples of the class to be recognised, then the non-class examples cover everything else. It is also challenging due to the format of the data to be classified: the data consists of sentences whose lengths depend on their phonetic content and the speaking rate of the underlying speaker.
One class classifiers have emerged as a set of techniques for situations where labelled data exists for only one of the classes in a two-class problem. A related problem arises where non-class examples exist but the non-class distribution cannot be characterised, as in speaker verification. The approach taken by one class classifiers is to develop a classifier that characterises the target class, and thus can distinguish it from all counter-examples.
Traditional speaker verification systems relied on one class classifiers in order to make a decision on the validity of the claim. A popular approach used Gaussian mixture models to create a score for variable length utterances which could then be thresholded. More recently these underlying one class classifiers have been successfully used to either project variable length utterances into a fixed dimensional space or provide a characterisation which can be compared against. This thesis investigates the use of one class classifiers in speaker verification, first by casting the problem as a one class problem and then by using one class classifiers to pre-process variable length utterances so that they can be used by any standard binary classifier.
This thesis has found that by using one class classifiers not in their traditional setting but as a tool to model or characterise utterances, they can be harnessed to enable binary learners to perform discriminative learning of variable length utterances.

For each class y the likelihood that a new item x comes from y is given by

Figure 2.2: An SVM finds the hyper-plane with the maximum margin. The margin is the...

Support Vector Machines

The training problem then becomes:

In most real world scenarios the training data will not necessarily be linearly separable and so no maximum margin decision boundary can be found. To resolve this problem the constraints are relaxed by allowing errors to occur; this is achieved by introducing an additional hyper-parameter C. The hyper-parameter C controls the trade-off between maximising the margin and minimising the number of training errors (Vapnik [1998]). The training problem then becomes:
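The equation itself is not reproduced on this page. For orientation only, the usual textbook soft-margin problem that matches this description is given below; the symbols $w$, $b$, $\xi_i$ and the labels $y_i \in \{-1, +1\}$ are assumed notation and are not taken from the thesis.

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

A large C penalises training errors heavily and narrows the margin; a small C tolerates more errors in exchange for a wider margin.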

Figure 2.3: The k-nearest neighbour method has many variants; one approach is to count the number of items from each class that make up the k-nearest neighbours and assign the class which had the most items.

Figure 2.4: The four basic components of error can be broken down into the following rates: True Positive (TP, Positive items correctly classified), False Positive (FP, Negative items being classified as Positive), False Negative (FN, Positive items incorrectly labelled as Negative) and True Negative (TN, Negative items correctly classified).
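As a quick illustration of the four outcomes in the caption above, the following sketch counts them from a list of true labels and predicted labels. The function name and the toy data are hypothetical.

```python
def confusion_counts(labels, predictions):
    """Count TP, FP, FN and TN for boolean labels (True = Positive)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(labels, predictions):
        if actual and predicted:
            counts["TP"] += 1      # Positive item correctly classified
        elif not actual and predicted:
            counts["FP"] += 1      # Negative item classified as Positive
        elif actual and not predicted:
            counts["FN"] += 1      # Positive item labelled as Negative
        else:
            counts["TN"] += 1      # Negative item correctly classified
    return counts

# Invented example: three target trials followed by two impostor trials.
labels = [True, True, True, False, False]
predictions = [True, False, True, True, False]
counts = confusion_counts(labels, predictions)
false_positive_rate = counts["FP"] / (counts["FP"] + counts["TN"])
```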

Figure 2.5: (a) Classifiers will generate a set of scores for each class; the scores from the negative distribution (red) will have a lower mean than the scores from the positive (green) distribution. (b) A ROC curve (red) shows the performance of a classifier at various thresholds. The black line shows the performance of a classifier that operates at random. A popular measure of classifier performance is the AUC, the area below the ROC curve.
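The AUC can be computed without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen positive score exceeds a randomly chosen negative score (ties counted as one half). The sketch below uses that pairwise form; the score lists are invented for illustration.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0    # positive ranked above negative
            elif p == n:
                wins += 0.5    # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))

# Invented scores: one positive (0.4) falls below one negative (0.6).
positives = [0.9, 0.8, 0.4]
negatives = [0.6, 0.3, 0.1]
area = auc(positives, negatives)
```

A classifier that scores at random gives an AUC of about 0.5, which corresponds to the black diagonal in the figure.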

Figure 2.6: A DET curve differs from a ROC curve only by how the axes are scaled. It is plotted using the false reject rate on the x-axis and the false accept rate on the y-axis; both axes are scaled using the normal deviate scale (i.e. plotting on double probability paper). If the score distributions of the negative class and the positive class are drawn from separate normal distributions then the resulting curve will be linear [Bradley, 1997].
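The linearity claim can be checked numerically. In the sketch below the two score distributions are assumed to be N(2, 1) for targets and N(0, 1) for impostors (values chosen only for illustration); each threshold is mapped to a DET point by applying the inverse normal CDF (the normal deviate scale) to both error rates.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def probit(p):
    """Normal deviate scale: inverse CDF of the standard normal."""
    return STANDARD_NORMAL.inv_cdf(p)

def det_point(threshold, target_dist, impostor_dist):
    """Return (x, y) = (probit(false reject rate), probit(false accept rate))."""
    false_reject = target_dist.cdf(threshold)            # targets scoring below threshold
    false_accept = 1.0 - impostor_dist.cdf(threshold)    # impostors scoring above it
    return probit(false_reject), probit(false_accept)

# Assumed score distributions, for illustration only.
target_dist = NormalDist(mu=2.0, sigma=1.0)
impostor_dist = NormalDist(mu=0.0, sigma=1.0)

points = [det_point(t, target_dist, impostor_dist) for t in (0.5, 1.0, 1.5)]
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]
```

With equal variances the slope between consecutive points stays at -1, so the DET curve is a straight line.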

The number of models is selected as the number which maximises the likelihood on the

Figure 2.8: Blue circles are examples of the target class while the rest are non-target examples. In Strict One Class Classification only the blue circles are available to the classifier for training. In some cases, statistically unrepresentative examples of the outlier class (orange circles) may be available to help the learner find a tighter boundary.

...product in the above problem with a ‘kernel’ function so that more flexible decision boundaries may be found. Once the optimisation problem is solved, new points that reside outside the sphere are labelled as outliers. By introducing slack variables ξ_i, relaxing the constraint that all the items in the training set must reside within the sphere allows the SVDD to be trained to reject a given fraction of the data.

Figure 2.11: By using some measure of distance to the k-nearest neighbours as a score to threshold against, the k-NN classifier can be modified to operate as a one class classifier.

Vector Quantisation
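A minimal sketch of the one-class k-NN idea from the caption for Figure 2.11. The caption only says "some measure of distance"; using the distance to the k-th nearest training example is one common choice and is an assumption here, as are the function names and the toy data.

```python
import math

def knn_score(x, target_examples, k):
    """Distance from x to its k-th nearest neighbour among the target examples."""
    distances = sorted(math.dist(x, t) for t in target_examples)
    return distances[k - 1]

def is_target(x, target_examples, k, threshold):
    """Accept x as belonging to the target class if it lies close enough."""
    return knn_score(x, target_examples, k) <= threshold

# Only target-class examples are needed for training (invented 2-D points).
target_examples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

inside = is_target((0.5, 0.5), target_examples, k=2, threshold=1.0)
outside = is_target((5.0, 5.0), target_examples, k=2, threshold=1.0)
```

The threshold plays the same role as the operating point of any verification system: lowering it rejects more outliers at the cost of rejecting more genuine target examples.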

Figure 2.12: Vector quantisation, representative codebook vectors are found so that they...

...create two codebook vectors by shifting the codebook slightly left and right, and then an iterative k-means clustering (as in [MacQueen, 1967]) that takes the codebooks as seeds is used until the codebook meets a convergence criterion. The resulting codebook vectors are then each split as before and the k-means process is again run until the convergence criterion is met. This process of split and converge is repeated until the required number of codebook vectors is found. The resulting number of codebook vectors is a power of two, i.e. an exact number of bits to encode each frame of the input sequence to a codebook...
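The split-and-converge procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's code: the initial codebook is taken to be the global mean, the "slight shift" is a fixed offset `eps`, and convergence means the codebook no longer changes. The toy `frames` are invented.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def refine(vectors, codebook, max_iter=50):
    """Iterative k-means seeded with the given codebook."""
    for _ in range(max_iter):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        # Empty cells keep their old codeword.
        updated = [mean(cell) if cell else cw for cell, cw in zip(cells, codebook)]
        if updated == codebook:    # convergence criterion (assumed)
            break
        codebook = updated
    return codebook

def split_and_converge(vectors, bits, eps=0.01):
    """Grow a codebook of 2**bits vectors by repeated splitting."""
    codebook = [mean(vectors)]     # assumption: start from the global mean
    for _ in range(bits):
        split = []
        for cw in codebook:
            split.append([c - eps for c in cw])   # shifted slightly "left"
            split.append([c + eps for c in cw])   # shifted slightly "right"
        codebook = refine(vectors, split)
    return codebook

# Invented 1-D frames with two well separated groups.
frames = [(0.0,), (1.0,), (10.0,), (11.0,)]
one_bit = split_and_converge(frames, bits=1)
two_bit = split_and_converge(frames, bits=2)
```

Each split doubles the codebook, which is why the final size is a power of two and each frame can be encoded with exactly `bits` bits.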

Figure 3.1: The components of a speaker verification system.

Figure 3.2: In open set speaker verification the system is required to distinguish the target...

...the system: it is assumed that the voice must come from a fixed set of speakers. Thus...

For the problem of open-set speaker identification, it is generally assumed that the...

Figure 3.3: The various parts of our vocal system that contribute to how we make sounds

Figure 3.4: The process of extracting MFCC features from a speech signal

Figure 3.5: In the above depiction, the six closest speaker models to the target speaker “A” are shown. While the closest speakers to the speaker are useful in making up the impostor model, they may not cover the space surrounding A’s model well. By selecting the most diverse set from the speakers close to A, good coverage of the models that are close to A’s model is found.

When a new utterance is to be classified, it is projected into the anchor space and the distance to the closest speaker characterisation vector is used to assign the utterance to its target speaker. The distance/similarity measure used in the anchor space has been widely studied [Mami and Charlet, 2002, Collet et al., 2005a]. Collet et al. [2005b] note that the use of distance to the speaker characterisation vector does not take into account intra-speaker variability and suggest that, instead of modelling each speaker as a single point, speakers be represented by a distribution. More recently, anchor models have been investigated as a fixed dimensional projection to be used by an SVM classifier for verification [Zhao et al., 2007, Charlet et al., 2008].

is projected to a fixed dimensional vector known as the Speaker Characterisation vector.

Figure 4.2: Looking at each speaker in turn, using only a single cepstral frame to classify whether the speaker was an ‘outlier’ or a ‘target’. The corresponding AUC, false positive and false negative rates are shown in box and whisker diagrams. It is clear that, using only a single cepstral frame, the classifiers do not perform well at deciding the predicted class. The best false positive rate over the 16 speakers was 0.56, while the worst false positive rate, 0.86, was observed for the SVDD.

Chapter 4: Evaluating One Class Classifiers for Speaker Verification

Figure 4.4: It can be seen that for some individuals one class classification techniques yield good results when compared against other speakers. It is also noted that certain classification strategies performed better for some speakers than for others, indicating that model selection may also need to be considered when building a one class classifier for a given speaker.

Figure 5.1: If the score coming from the target model (green) is simply thresholded to make decisions, sections of the feature space that do not discriminate well (e.g. 0 to -1) will be accepted, while other areas that do (e.g. 1 to 2) will be rejected if the non-target model (red) is not used as part of the decision.

Chapter 5: Outlier Models To Aid Performance

Figure 5.2: By building a background model from outlier data the decision surface around the target class is tightened and error rates are significantly improved.

Figure 5.3: Data projected into 2D score space using two models. The x-axis is the log-likelihood score obtained from the target speaker and the y-axis is the log-likelihood score of the cohort model. It can be seen that the Bayes decision boundary does a poor job of separating other speakers (i.e. non cohort) from the target speaker.

Figure 5.4: In the above depiction we show the six closest speaker models to the target speaker “A”. While the closest speakers to the speaker are useful in making up the impostor model, they may not cover the space surrounding A’s model well. By selecting the most diverse set from the speakers close to A, good coverage of the models that are close to A’s model is found.

and Mariéthoz, 2001]. This work was extended to show that this boundary can be further

Figure 5.5: When the DET curve is plotted for the different techniques that use the cohort of speakers it can be seen that the SVM learns a better boundary similar to the UBM for the background model made up by the cohorts. When the cohort are allowed to “speak for themselves” in the full cohort projection, a weighted sum of their scores is learnt by the SVM and accuracies are further improved.

is projected using the models that make up the cohort, the UBM and the model built on the target speaker. Similar to the result found for the individual cohort projection, it allows each model in the transform to ‘speak for itself’ whilst the classification boundary is being learnt from the data. While projections only using the cohort or the UBM have both been shown to improve accuracy, what is important here is that in combination they can go further, improving the EER from 4% to 1.4% on the YOHO dataset, 7.1% to 4.3% on the KING dataset and 5.7% to 4.5% on the OGI dataset. This indicates that the score space made up by the members of the cohort and the score space of the UBM hold different information about the identity of the speaker and that in combination the benefits of both can be realised.
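For readers unfamiliar with the metric, the equal error rate (EER) quoted above is the operating point at which the false acceptance and false rejection rates coincide. The following is a rough sketch of one way to estimate it from two lists of scores; the function name `equal_error_rate` and the simple threshold sweep are assumptions for illustration and are not taken from the source.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: try every observed score as a threshold and
    return the mean of FAR and FRR where the two are closest."""
    best = None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a target trial scoring below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scoring at or above it.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return best[1]

# Toy scores (assumed values): one impostor overlaps the target range.
targets = [2.0, 3.0, 4.0, 5.0]
impostors = [0.0, 1.0, 1.5, 2.5]
print(equal_error_rate(targets, impostors))
```

With few trials the two error rates rarely cross exactly, so the sketch reports the midpoint at the closest threshold; evaluation toolkits typically interpolate instead.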

Figure 5.7: These DET plots show how the combination of the UBM and cohort scores further improves the decision made by the SVM. This indicates that the fusion of cohort and UBM techniques through the projection into the score space allows information that derives from both models to be exploited by the SVM.

YOHO dataset using the train test scenario described in Appendix 4. Figure 6.2 shows

Figure 6.3: Using a background model and using a relative score aids classification. This figure shows how, as the size of the codebook is increased for the client model, performance plateaus. It is clear that the use of a background model aids performance.

Figure 6.4: Updating the vector quantisation supervector.

Figure 6.6: ERR Rates and Model Size Changes. This leads to these codebooks contributing 0 entries to the supervector, i.e. sparse regions of the supervector. For the YOHO utterances (short) the sparsity for a model size of 512 was 70% and for a model size of 1024 it was 82%; for the KING (conversational) dataset the sparsity for a model size of 512 was 13% and for a model size of 1024 it was 25%.

training set to find a value that worked well for all speakers, on each sub-problem.


Locally Recurrent Probabilistic Neural Networks with Application to Speaker Verification

by Todor Ganchev

This paper introduces Locally Recurrent Probabilistic Neural Networks (LRPNN) as an extension of the well-known Probabilistic Neural Networks (PNN). An LRPNN, in contrast to a PNN, is sensitive to the context in which events occur, and...

Figure 1: Architecture of the Locally Recurrent Probabilistic Neural Network

Figure 2: Speaker verification score distribution for the: a) PNN system, and b) LRPNN system.

Figure 3: DET plots for the PNN and LRPNN systems.


PERFORMANCE ANALYSIS OF SPEAKER IDENTIFICATION USING HMM AND SVM UNDER VARIOUS NOISE LEVELS

by SABIQ P V

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. Speaker recognition is basically divided into two classifications: speaker verification and...

I-vector based speaker recognition using advanced channel compensation techniques

by Ahilan Kanagasundaram

2013, Computer Speech and Language

This paper investigates advanced channel compensation techniques for the purpose of improving i-vector speaker verification performance in the presence of high intersession variability using the NIST 2008 and 2010 SRE corpora. The...

Automatic speaker verification on narrowband and wideband lossy coded clean speech

by Peter Pocta and

2017, IET Biometrics

Substantial progress has been achieved in voice-based biometrics in recent times but a variety of challenges still remain for the speech research community. One such obstacle is reliable speaker authentication from speech signals degraded by...

UBM-GMM Driven Discriminative Approach for Speaker Verification

by Nicolas Scheffer and

2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop

In the past few years, discriminative approaches to perform speaker detection have shown good results and an increasing interest. Among these methods, SVM based systems have lots of advantages, especially their ability to deal with a high...

E-mail authorship verification for forensic investigation

by Farkhund Iqbal

2010

The Internet provides a convenient platform for cyber criminals to anonymously conduct their illegitimate activities, such as phishing and spamming. As a result, in recent years, authorship analysis of anonymous e-mails has received some...

An Impact of Narrowband Speech Codec Mismatch on a Performance of GMM-UBM Speaker Recognition over Telecommunication Channel

by Peter Pocta

2016

The automatic identification of a person’s identity from their voice is a part of modern telecommunication services. In order to execute the identification task, the speech signal has to be transmitted to a remote server. So a performance of...

Robust text-independent speaker identification using Gaussian mixture speaker models

by Vaishali Shinde

1995, Speech and Audio Processing, IEEE …

Fig. 1. Mel-scale cepstral feature analysis.

voice from others. LPC spectral representations, such as LPC cepstral and reflection coefficients, have been used extensively for speaker recognition; however, these model-based representations can be severely affected by noise [5]. Recent studies have found directly computed filterbank features to be more robust for noisy speech recognition [6]. In this paper we use cepstral coefficients derived from a mel-frequency filterbank to represent the short-time speech spectra.

Fig. 2. Depiction of an M component Gaussian mixture density. A Gaussian mixture density is a weighted sum of Gaussian densities, where $p_i$, $i = 1, \ldots, M$, are the mixture weights and $b_i()$, $i = 1, \ldots, M$, are the component Gaussians.
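The weighted sum in the Fig. 2 caption can be written out directly. The sketch below is illustrative only: it assumes diagonal covariance matrices (the caption does not say which form is used), works in the log domain for numerical stability, and the names `gaussian_logpdf` and `gmm_logpdf` are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_logpdf(x, weights, means, variances):
    """log of sum_i p_i * b_i(x), computed with the log-sum-exp trick."""
    terms = [math.log(w) + gaussian_logpdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Two-component mixture over 2-D features (assumed toy parameters).
weights = [0.3, 0.7]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
print(gmm_logpdf([2.5, 3.1], weights, means, variances))
```

Summing `gmm_logpdf` over the frames of an utterance gives the utterance log-likelihood under the speaker model, assuming the frames are treated as independent.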

Fig. 3. Comparison of distribution modeling: (a) Histogram of a single cepstral coefficient from a 25 second utterance by a male speaker; (b) maximum likelihood unimodal Gaussian model; (c) GMM and its 10 underlying component densities; (d) histogram of the data assigned to the VQ centroid locations of a 10-element codebook.

a linear combination of Gaussian basis functions is capable of representing a large class of sample distributions. One of the powerful attributes of the GMM is its ability to form smooth approximations to arbitrarily-shaped densities. The classical unimodal Gaussian speaker model represents a speaker’s feature distribution by a position (mean vector) and an elliptic shape (covariance matrix) and the VQ model represents a speaker’s distribution by a discrete set of characteristic templates. In some sense the GMM acts as a hybrid between these two models by using a discrete set of Gaussian functions, each with their own mean and covariance matrix, to allow a better modeling capability. Fig. 3 compares the densities obtained using a unimodal Gaussian model, a GMM and a VQ model. Plot (a) shows the histogram of a single cepstral coefficient from a 25 second utterance by a male speaker; plot (b) shows the maximum likelihood unimodal Gaussian model; plot (c) shows the GMM and its 10 underlying component densities; and plot (d) shows a histogram of the data assigned to the VQ centroid locations of a 10-element codebook. The GMM not only provides a smooth overall distribution fit, its components also clearly detail the multi-modal nature of the density.

The basic idea of the EM algorithm is, beginning with an initial model $\lambda$, to estimate a new model $\bar{\lambda}$, such that $p(X \mid \bar{\lambda}) \geq p(X \mid \lambda)$. The new model then becomes the initial model for the next iteration and the process is repeated until some convergence threshold is reached. This is the same basic technique used for estimating HMM parameters via the Baum-Welch reestimation algorithm [26].

On each EM iteration, the following reestimation formulas are used which guarantee a monotonic increase in the model’s likelihood value:

where $\sigma_i^2$, $x_t$, and $\mu_i$ refer to arbitrary elements of the vectors $\vec{\sigma}_i^2$, $\vec{x}_t$, and $\vec{\mu}_i$, respectively. The a posteriori probability for acoustic class $i$ is given by
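Since the reestimation formulas themselves did not survive extraction here, the following sketch uses the standard textbook EM updates for a one-dimensional GMM rather than anything recovered from this page. The function names, the variance floor `var_floor`, and the toy data are assumptions for illustration.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under the current model."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances, var_floor=1e-3):
    """One EM iteration for a 1-D GMM; returns the re-estimated parameters."""
    # E-step: a posteriori probability of each component for each sample.
    post = []
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: re-estimate weights, means and variances from the posteriors.
    new_w, new_m, new_v = [], [], []
    for i in range(len(weights)):
        n_i = sum(p[i] for p in post)
        mean_i = sum(p[i] * x for p, x in zip(post, data)) / n_i
        var_i = sum(p[i] * x * x for p, x in zip(post, data)) / n_i - mean_i ** 2
        new_w.append(n_i / len(data))
        new_m.append(mean_i)
        new_v.append(max(var_i, var_floor))  # assumed safeguard, see note
    return new_w, new_m, new_v

# Toy data drawn around -2 and +2 (assumed values).
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
w, m, v = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(10):
    w, m, v = em_step(data, w, m, v)
print(m, log_likelihood(data, w, m, v))
```

The variance floor is a practical safeguard that is not part of the basic algorithm; if it becomes active, the monotonic increase in likelihood is no longer strictly guaranteed.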

Fig. 4. Speaker identification performance as a function of the number of component densities per speaker model.

Fig. 5. Speaker identification performance versus model order for models trained with 30, 60 and 90 seconds of speech. The test utterance length is 5 seconds.

Fig. 6. Identification performance for different spectral variability compensation techniques applied to telephone speech.

Fig. 7. Speaker identification performance versus test utterance length for population sizes of 16, 32, and 49 speakers: (a) Clean speech performance; (b) telephone speech performance.

Fig. 8. TGMM and RBF model structure. In each model, speakers are represented by a weighted combination of a common pool of Gaussian or basis functions.


CHANNEL ADAPTATION OF PLDA FOR TEXT-INDEPENDENT SPEAKER VERIFICATION

by Kong Lee and

Probabilistic linear discriminant analysis (PLDA) has been shown to be effective for modeling channel variability in the i-vector space for text-independent speaker verification. Speaker verification is a binary hypothesis testing problem. Given a...

Signature with Text-Dependent and Text-Independent Speech for Robust Identity Verification

by Gérard Chollet

Authentication System (BAS) based on the fusion of two user-friendly biometric modalities: signature and speech. All biometric data used in this work were extracted from the BIOMET multimodal database. The Signature

Robustness to telephone handset distortion in speaker recognition by discriminative feature design

by Kemal Sonmez

2000, Speech Communication

A method is described for designing speaker recognition features that are robust to telephone handset distortion. The approach transforms features such as mel-cepstral features, log spectrum, and prosody-based features with a non-linear...

SPEAKERS IN THE WILD (SITW): The QUT Speaker Recognition System

by Ahilan Kanagasundaram and

This paper presents the QUT speaker recognition system, as a competing system in the Speakers In The Wild (SITW) speaker recognition challenge. Our proposed system achieved an overall ranking of second place, in the main core-core...

THE SRI NIST 2008 speaker recognition evaluation system

by Luciana Ferrer

2009

The SRI speaker recognition system for the 2008 NIST speaker recognition evaluation (SRE) incorporates a variety of models and features, both cepstral and stylistic. We highlight the improvements made to specific subsystems and analyze...

Robust Speaker Verification Using Improved PNCC Based on GMM-UBM

by Co. SEP

Focused on the issue that the robustness of traditional Mel Frequency Cepstral Coefficient (MFCC) feature degrades drastically in speaker verification in noisy environments, a kind of suitable extraction method for low SNR environments...

Each period of speech feature distribution has the nature of interdependence and diversity. However, different channels as well as additive noise will destroy the feature distribution. The purpose of feature warping is to warp the distribution of a cepstral feature stream to a standardized distribution over a specified time interval, and make the cepstral features tend to be consistent. The basic framework of feature warping is shown in figure 4. This method can compensate for channel effects and limit the effects of additive noise. Figure 5 shows the effect of feature warping for the first cepstrum feature of traditional MFCC which is extracted under 5dB White Noise.
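As a rough sketch of the warping step described above: each value in a cepstral stream is replaced by the standard normal quantile of its rank inside a sliding window. The function name `feature_warp`, the choice of a standard normal target, the default window of 301 frames, and the truncated windows at the stream edges are all assumptions; the paper's exact framework (its figure 4) is not reproduced on this page.

```python
from statistics import NormalDist

def feature_warp(stream, window=301):
    """Warp one cepstral coefficient stream towards a standard normal
    distribution using a sliding window centred on each frame."""
    std_normal = NormalDist()
    half = window // 2
    warped = []
    for t, value in enumerate(stream):
        lo, hi = max(0, t - half), min(len(stream), t + half + 1)
        segment = stream[lo:hi]
        rank = sum(v < value for v in segment)  # 0-based rank in the window
        # Map the rank to a probability strictly inside (0, 1), then to
        # the corresponding standard normal quantile.
        warped.append(std_normal.inv_cdf((rank + 0.5) / len(segment)))
    return warped

# Toy stream (assumed values); the window covers the whole stream here.
stream = [5.0, 1.0, 3.0]
print(feature_warp(stream, window=7))
print(feature_warp([2.0 * x + 7.0 for x in stream], window=7))
```

The two printed lists are identical because the warped value depends only on rank order within the window, so any monotonic distortion of the stream, such as a gain and offset, leaves the output unchanged.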

FIG. 5 THE EFFECT OF FEATURE WARPING ON THE FIRST CEPSTRUM OF MFCC

White Noise and Babble Noise come from the NOISEX-92 database [8]; Music Noise was recorded by ourselves. The training data are kept untouched, but the noises are added to the test files with a given average segmental SNR. We consider five SNR levels: 0dB, 5dB, 10dB, 15dB, 20dB.
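To make the noise-mixing step concrete, here is a simplified sketch. Note the important assumption: it scales the noise to a target global SNR over the whole utterance, whereas the text above specifies an average segmental SNR, which would be computed over short frames. The function name `mix_at_snr` and the synthetic signals are hypothetical.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches snr_db
    (global SNR, a simplification), then add it sample by sample."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Synthetic stand-ins for a test file and a noise recording (assumed).
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [0.3 * (-1) ** i for i in range(1200)]
for snr_db in (0, 5, 10, 15, 20):
    noisy = mix_at_snr(speech, noise, snr_db)
    print(snr_db, len(noisy))
```

Leaving the training data clean and corrupting only the test files, as the text describes, means the sketch would be applied to test utterances alone.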

Figure 8 presents the experimental results in terms of EER under White, Babble and Music noises respectively. First, GFCC performs better than MFCC under all noise types, which illustrates that the Gammatone filter bank is more robust than the Mel filter bank. Second, PNCC has stronger anti-noise performance than GFCC because the power-bias subtraction algorithm removes the background noise effectively. The power-law nonlinearity enhances anti-noise capability as well. Third, a relatively consistent improvement is obtained by the use of improved PNCC, especially in noisy conditions below 10dB. This result indicates that CMVN and feature warping are effective methods to enhance robustness. We use half raised-sine cepstral liftering to weight the cepstrum coefficients and VTLN to eliminate the vocal tract diversity. These methods improve performance and noise robustness to some extent.

FIG. 8 EVALUATION OF FOUR FEATURES IN TERMS OF EQUAL ERROR RATES UNDER WHITE, BABBLE, MUSIC NOISES ON THE NOISY SPEECH TEST SET FOR SNR VALUES OF {0, 5, 10, 15, 20} DB

FIG. 2 THE ORIGINAL GAMMATONE FILTER AND NORMALIZED FILTER

Compared with the Mel filter bank of MFCC, the Gammatone filter is smoother. When the number of filters increases, its overlap will increase because the bandwidth of a Gammatone filter is determined by its center frequency, and this filter will reduce the frequency spectrum energy leakage between adjacent band filter groups. A normalization is applied in this paper to increase the weight of low frequencies; the original Gammatone filter and the normalized filter are shown in figure 2. The order of the filter is taken to be 32.

Vocal Tract Length Normalization

Figure 3 shows the effect of CMVN for the first cepstrum feature of traditional MFCC which is extracted under 5dB White Noise.


Robust speaker recognition using MAP estimation of additive noise in i-vectors space

by Waad Ben Kheder and

In the last few years, the use of i-vectors along with a generative back-end has become the new standard in speaker recognition. An i-vector is a compact representation of a speaker utterance extracted from a low dimensional total...

Speaker verification using adapted Gaussian mixture models

by Sowmyan Kousthubadharan

2000, Digital signal processing

In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around...

FIG. 1. Likelihood ratio-based speaker detection system.

The single-speaker detection task can be restated as a basic hypothesis test between

FIG. 2. Data and model pooling approaches for creating a UBM. (a) Data from subpopulations are pooled prior to training the UBM via the EM algorithm. (b) Individual subpopulation models are trained then combined (pooled) to create final UBM.

FIG. 3. Pictorial example of two steps in adapting a hypothesized speaker model. (a) The training vectors (x’s) are probabilistically mapped into the UBM mixtures. (b) The adapted mixture parameters are derived using the statistics of the new data and the UBM mixture parameters. The adaptation is data dependent, so UBM mixture parameters are adapted by different amounts.
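The two steps in the FIG. 3 caption can be sketched for the means of a one-dimensional UBM. This is an illustration under assumptions: it adapts means only, uses a relevance factor to control how far each mixture moves, and the names `map_adapt_means` and `relevance` as well as the default value 16.0 are hypothetical choices rather than values confirmed by this page.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Return adapted means for a 1-D UBM; weights and variances are
    left unchanged in this sketch."""
    counts = [0.0] * len(means)  # soft count n_i for each mixture
    first = [0.0] * len(means)   # posterior-weighted sum of the data
    # Step (a): probabilistically map each training vector to the mixtures.
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        for i, j in enumerate(joint):
            counts[i] += j / total
            first[i] += (j / total) * x
    # Step (b): blend the new statistics with the UBM means. Mixtures
    # that saw more data (larger n_i) move further from the UBM.
    adapted = []
    for n_i, f_i, mu in zip(counts, first, means):
        alpha = n_i / (n_i + relevance)
        e_x = f_i / n_i if n_i > 0 else mu
        adapted.append(alpha * e_x + (1.0 - alpha) * mu)
    return adapted

# Toy UBM with two mixtures and enrolment data near the second (assumed).
ubm_w, ubm_m, ubm_v = [0.5, 0.5], [-5.0, 5.0], [1.0, 1.0]
enrol = [6.0] * 16
print(map_adapt_means(enrol, ubm_w, ubm_m, ubm_v))
```

In the toy run only the second mixture shifts noticeably, since nearly all of the enrolment data is assigned to it; this is the data-dependent behaviour the caption refers to.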

FIG. 4. Pictorial example of HNORM compensation. This picture shows log-likelihood ratio score distributions for two speakers before (left column) and after (right column) HNORM has been applied. After HNORM, the non-speaker score distribution for each handset type has been normalized to zero mean and unit standard deviation.
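The normalisation in the FIG. 4 caption amounts to a per-handset shift and scale of the scores. The sketch below assumes the handset type of each trial is already known and uses the population standard deviation; the function names `fit_hnorm` and `hnorm`, the handset labels, and the scores are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev

def fit_hnorm(impostor_scores_by_handset):
    """Estimate the (mean, std) of non-speaker scores for each handset type."""
    return {h: (mean(s), pstdev(s))
            for h, s in impostor_scores_by_handset.items()}

def hnorm(score, handset, params):
    """Normalise a raw log-likelihood ratio score for its handset type."""
    mu, sigma = params[handset]
    return (score - mu) / sigma

# Hypothetical non-speaker scores for one speaker model, by handset type.
impostor = {"carbon": [-1.0, 0.0, 1.0], "electret": [1.0, 2.0, 3.0]}
params = fit_hnorm(impostor)
print(hnorm(2.5, "electret", params))
print(hnorm(2.5, "carbon", params))
```

The same raw score maps to different normalised values depending on the handset, which is the point of the compensation: a single decision threshold can then be shared across handset types.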

FIG. 5. DET curves for three UBM compositions: Pooled male and female data, separate male and female models, and pooled male and female models. Results are on the NIST 1998 summer-development single-speaker data using all scores.

FIG. 6. DET curves for systems using UBMs with 16-2048 mixtures. Results are on the NIST 1998 summer-development single-speaker data using all scores.

FIG. 7. DET curves for adaptation of different combinations of parameters. W = weights, M = means, V = variances. Results are on the NIST 1998 summer-development single-speaker data using all scores.

FIG. 8. Comparison of GMM-UBM system with and without HNORM. Results are on the NIST 1999 SRE single-speaker data using all scores.

Reynolds, Quatieri, and Dunn: Speaker Verification Using Adapted GMMs

FIG. 9. Comparison of GMM-UBM system with and without HNORM, using different poolings of files in the 1999 NIST SRE single-speaker data set. SNST = Same-Number, Same-Type, DNST = Different-Number, Same-Type, DNDT = Different-Number, Different-Type.


Low-Variance Multitaper MFCC Features: A Case Study in Robust Speaker Verification

by Tomi Kinnunen

2000, IEEE Transactions on Audio, Speech, and Language Processing

In speech and audio applications, short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage but variance...

Domain Adaptation for Text Dependent Speaker Verification

by Hagai Aronowitz

Recently we have investigated the use of state-of-the-art text-dependent speaker verification algorithms for user authentication and obtained satisfactory results mainly by using a fair amount of text-dependent development data from the...

Speaker recognition using neural networks and conventional classifiers

by Khaled Assaleh

1994, IEEE Transactions on Speech and Audio Processing


An Evaluation of One-Class Classification Techniques for Speaker Verification

by Anthony Brew

Speaker verification is a challenging problem in speaker recognition where the objective is to determine whether a segment of speech in fact comes from a specific individual. In supervised machine learning terms this is a challenging...

Optimized Discriminative Kernel for SVM Scoring and its Application to Speaker Verification

by Austin S.-X. ZHANG


Browsing and Retrieval of Full Broadcast-Quality Video

by Reha Civanlar

1999

In this paper we describe a system we have developed for automatic broadcast-quality video indexing that successfully combines results from the fields of speaker verification, acoustic analysis, very large vocabulary speech recognition,...

An evaluation of VTS and IMM for speaker verification in noise

by Suhadi Suhadi

2003

The performance of speaker verification (SV) systems degrades rapidly in noise, rendering them unsuitable for security-critical applications in mobile phones, where false acceptance rates (FAR) of $\sim 10^{-4}$ are required. However, less...

Speaker verification using target and background dependent linear transforms and multi-system fusion

by upendra chaudhari

2001

This paper describes a GMM-based speaker verification system that uses speaker-dependent background models transformed by speaker-specific maximum likelihood linear transforms to achieve a sharper separation between the target and the...

PLDA based Speaker Verification with Weighted LDA Techniques

by Ahilan Kanagasundaram

2012, The Speaker and Language Recognition Workshop (Odyssey 2012)

This paper investigates the use of the dimensionality-reduction techniques weighted linear discriminant analysis (WLDA) and weighted median Fisher discriminant analysis (WMFD) before probabilistic linear discriminant analysis (PLDA)...

Using Discrete Probabilities With Bhattacharyya Measure for SVM-Based Speaker Verification

by Kong Aik Lee

2000, IEEE Transactions on Audio, Speech, and Language Processing

Support vector machines (SVMs), and kernel classifiers in general, rely on the kernel functions to measure the pairwise similarity between inputs. This paper advocates the use of discrete representation of speech signals in terms of the...
