Automatic Speaker Verification Research Papers

Using Reversed MFCC and IT-EM for Automatic Speaker Verification

2024, Mehran University Research Journal of Engineering and Technology

This paper proposes a text-independent automatic speaker verification system using IMFCC (Inverse/Reverse Mel Frequency Coefficients) and ITEM (Information Theoretic Expectation Maximization). To perform speaker verification, feature... more

Evaluating Deep Learning-Based Speaker Verification Systems: A Comparative Study Across Open-Source and Forensic Datasets

by Seyed Sahand Mohammadi Ziabari

2024

Speaker verification (SV) is the process of verifying whether speech from two audio signals originates from the same speaker or different speakers. Current state-of-the-art SV systems are based on deep neural networks, predominantly... more

Modeling Distance Normalization Technique in Multilingual Speaker Verification

by Kshirod Sarmah

2024

Speaker modeling distance normalization (D-Norm) is one of the important score normalization techniques in speaker verification (SV) systems. D-Norm implementation does not need any additional speech data or external speaker... more

Speaker Verification Using Acoustic and Prosodic Features

by Kshirod Sarmah

2024, Advanced Computing: An International Journal

In this paper we report an experiment carried out on a recently collected speaker recognition database, the Arunachali Language Speech Database (ALS-DB), to make a comparative study on the performance of acoustic and prosodic features for... more

A comparison of features for large population speaker identification

by Norman Baloyi

2024

Bibliography: leaves 95-104. Speech recognition systems all have one criterion in common: they perform better in a controlled environment using clean speech. Though performance can be excellent, even exceeding human capabilities for clean... more

Bandwidth expansion of narrowband speech using non-negative matrix factorization

by Dhananjay Bansal

2024, Interspeech 2005

In this paper, we present a novel technique for the estimation of the high frequency components (4-8 kHz) of speech signals from narrow-band (0-4 kHz) signals using convolutive Non-Negative Matrix Factorisation (NMF). The proposed... more

Margin-Mixup: A Method for Robust Speaker Verification In Multi-Speaker Audio

by Kris Demuynck

2024, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper is concerned with the task of speaker verification on audio with multiple overlapping speakers. Most speaker verification systems are designed with the assumption of a single speaker being present in a given audio segment.... more

Fraud Detection in Voice-Based Identity Authentication Applications and Services

by Reza Sotudeh

2024, 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW)

Keeping track of the multiple passwords, PINs, memorable dates and other authentication details needed to gain remote access to accounts is one of modern life's less appealing challenges. The employment of a voice-based verification as a... more

Automatic emotion recognition in compressed speech using acoustic and non-linear features

by Julian Arias-Londoño

2024

Automatic recognition of emotions in speech has attracted the attention of the research community in recent years. Some of the most relevant proposed applications of it are in call-centers. In these scenarios the speech is distorted by... more

Hybrid PSO-ANFIS for Speaker Recognition

by samiya silarbi

2024, International Journal of Cognitive Informatics and Natural Intelligence

This paper introduces an evolutionary approach for training the adaptive network-based fuzzy inference system (ANFIS). Previous works are based on gradient descent (GD); this algorithm converges very slowly and gets stuck down at... more

Digital speech watermarking for anti-spoofing attack in speaker recognition

by Mohammad Ali Nematollahi

2024, 2014 IEEE REGION 10 SYMPOSIUM

This paper presents a new method for improving the security of speaker recognition against spoofing attacks. In the proposed technique, digital speech watermarking is applied to the speech signal to increase robustness. To achieve this... more

CN-Celeb: Multi-genre speaker recognition

by Thomas Zheng

2024, Speech Communication

Research on speaker recognition is extending to address the vulnerability in the wild conditions, among which genre mismatch is perhaps the most challenging, for instance, enrollment with reading speech while testing with conversational... more

Review of wideband speech noise reduction techniques

by chris forrester

2024

Transfer Learning from Audio Domains a Valuable Tool for Structural Health Monitoring

by Homayoon Beigi

2024, Springer eBooks

Today, the application of artificial neural network tools to define models that mimic the dynamic behavior of structural systems is a widespread approach. A fundamental issue in developing these strategies for damage assessment in civil structures is the unbalanced nature of the available databases, which commonly contain plenty of data from the structure under healthy operational conditions and very few samples from the system in unhealthy conditions, since the structure would have failed by that time. Consequently, the learning task, carried out with standard deep learning approaches, becomes case-dependent and tends to be specialized for a particular structure and for a very limited number of damage scenarios. This work presents a framework for damage classification in structural systems intended to overcome such limitations. In this methodology, the model is trained on a rich acoustic dataset (source domain), acquiring higher-level features that characterize vibration traits. This knowledge is then transferred to a target domain with much less training data, such as a structural system, in order to classify its structural condition. The framework starts with constructing a Time-Delay Neural Network (TDNN) structure, trained on the VoxCeleb dataset, in the speech domain. The input of the network consists of cepstral and pitch features extracted from the audio records. Higher-level features, the x-vectors (speaker embeddings that capture the outputs of the network's intermediate layers), are derived and then used to train a Probabilistic Linear Discriminant Analysis (PLDA) model to provide a probabilistic discriminant model for speaker comparison. These features collect generic information regarding the source domain and characterize a classification process based on the frequency content of signals, which is not strictly dependent on the original acoustic domain.
Because of the non-case-dependent nature of the x-vector embeddings (features), they can be used to train an alternative PLDA model to address a damage classification task, considering vibration measurements coming from a different system, a structural one which represents the target domain. The simulated data from the 12 degrees of freedom benchmark shear-building structure provided by the IASC-ASCE Structural Health Monitoring Group are studied to verify the proposed framework's effectiveness.

Voice Telephony for Individuals with Hearing Loss

by Paula Tucker

2024

This paper describes three studies conducted with a total of 114 individuals with hearing loss and 12 hearing controls, with the goal of investigating the impact of audio quality parameters on the accessibility of voice... more

Robust Self-Supervised Speaker Representation Learning Via Instance Mix Regularization

by Abderrahim Fathan

2024, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Over the recent years, various self-supervised contrastive embedding learning methods for deep speaker verification were proposed. The performance of the self-supervised contrastive learning framework highly depends on the data... more

Deep Learning-based F0 Synthesis for Speaker Anonymization

by Nils Peters

2023, arXiv (Cornell University)

Voice conversion for speaker anonymization is an emerging concept for privacy protection. In a deep learning setting, this is achieved by extracting multiple features from speech, altering the speaker identity, and waveform synthesis.... more

Fraud Detection in Voice-Based Identity Authentication Applications and Services

by Hock Gan

2023

Keeping track of the multiple passwords, PINs, memorable dates and other authentication details needed to gain remote access to accounts is one of modern life's less appealing challenges. The employment of a voice-based verification as a... more

Improving Speaker Recognition in Environmental Noise With Adaptive Filter

by Wemerson D. Parreira

2023, IEEE Access

Speaker recognition is challenging in real-world environments. Typically, studies approach noises only in an additive manner. However, real environments commonly present reverberating conditions that worsen speech processing. When not... more

Two-stage text feature selection method using fuzzy entropy measure and ant colony optimization

by majid hemmati

2023, 20th Iranian Conference on Electrical Engineering (ICEE2012)

Text categorization is widely used when organizing documents in a digital form. Due to the increasing number of documents in digital form, automated text categorization has emerged as an appropriate tool to classify documents into... more

Speech2Phone: A Novel and Efficient Method for Training Speaker Recognition Models

by Sandra Maria Aluísio

2023, arXiv (Cornell University)

In this paper we present an efficient method for training models for speaker recognition using small or under-resourced datasets. This method requires less data than other SOTA (State-Of-The-Art) methods, e.g. the Angular Prototypical and... more

Table 1: Preprocessed Speaker Verification datasets

Table 3: Results of Speech2Phone Experiments

A Multi-View Approach To Audio-Visual Speaker Verification

by nayan singhal

2023, arXiv (Cornell University)

Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be... more

A straightforward and efficient implementation of the factor analysis model for speaker verification

by D. Matrouf

2023, Interspeech 2007

For a few years, the problem of session variability in text-independent automatic speaker verification has been tackled actively. A new paradigm based on a factor analysis model has successfully been applied to this task. While very... more

An Improved GMM-SVM System based on Distance Metric for Voice Pathology Detection

by mohamed fezari

2023, Applied mathematics & information sciences

As acoustic signal generated from vocal folds is directly affected by vocal tract pathologies, it can be an effective tool for diagnosis purpose. In this work, we present an efficient method for voice pathology detection based on speech... more

Deep Discriminative Embeddings for Duration Robust Speaker Verification

by deyi tuo

2023

Embedding-based deep convolutional neural networks (CNNs) have proven effective for text-independent speaker verification systems with short utterances. However, the duration robustness of the existing deep CNN-based algorithms... more

Syllable-Dependent Discriminative Learning for Small Footprint Text-Dependent Speaker Verification

by deyi tuo

2023, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

This study proposes a novel scheme of syllable-dependent discriminative speaker embedding learning for small footprint text-dependent speaker verification systems. To suppress undesired syllable variation and enhance the power of... more

Deep Discriminative Embeddings for Duration Robust Speaker Verification

by deyi tuo

2023, Interspeech 2018

Embedding-based deep convolutional neural networks (CNNs) have proven effective for text-independent speaker verification systems with short utterances. However, the duration robustness of the existing deep CNN-based algorithms... more

High-frequency Audiometry Hearing on Monitoring of Individuals Exposed to Occupational Noise: A Systematic Review

by Cleonice Aparecida

2023, International Archives of Otorhinolaryngology

Introduction: The literature reports on high-frequency audiometry as one of the exams used in hearing monitoring of individuals exposed to high sound pressure in their work environment, due to the method's greater sensitivity in early... more

Design and Simulation of Determining the Unknown Sampling Frequency Testing Procedures in Analog-Digital-Conversions

by Bless G . Ampuan

2023, Iconic Research and Engineering Journals

In this study, the researcher investigates one of the testing procedures for determining the unknown sampling frequency, which is relevant to Analog-to-Digital Conversion (ADC) in Digital Signal Processing. With the help of the Simulink tool... more

Figure 2: The input frequency used is 1kHz. The maximum output frequency measured with the Spectrum Analyzer is also approximately 1kHz.

Figure 3: The input frequency used is 2kHz. The maximum output frequency measured with the Spectrum Analyzer is also approximately 2kHz.

Figure 4: The input frequency used is 3kHz. The maximum output frequency measured with the Spectrum Analyzer is also approximately 3kHz.

Figure 5: The input frequency used is 4kHz. The maximum output frequency measured with the Spectrum Analyzer is also approximately 4kHz.

Figure 6: The input frequency used is 5kHz.

Figure 8: The input frequency used is 4.994kHz. Since the discrepancy occurs between the input frequencies 4kHz and 5kHz, the researcher varied the input frequency from 4kHz to 5kHz, as shown in Figures 8 and 9. An input frequency of 4.994kHz resulted in an output frequency approximately equal to 4.994kHz. This was found to be the last point at which the input and output frequencies show no discrepancy.

Figure 9: The input frequency used is 4.995kHz. The maximum output frequency measured with the Spectrum Analyzer is approximately 2.044kHz, a discrepancy of almost 3kHz. The output frequency starts to show a large discrepancy when the input frequency reaches 4.995kHz.
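As general background rather than anything taken from the paper, an ideal sampler reports a tone above half the sampling rate at a folded (aliased) frequency, which is why the point where input and output frequencies start to disagree is informative about the sampling rate. The sketch below assumes an ideal sampler and a real-valued sinusoid; the function name and the 8 kHz rate in the usage note are hypothetical and are not the values used in the study.

```python
def apparent_frequency(f_in_hz: float, fs_hz: float) -> float:
    """Frequency at which a real tone of f_in_hz appears after ideal sampling at fs_hz.

    Assumption: ideal sampling, no anti-aliasing filter (hypothetical helper,
    not the procedure used in the paper).
    """
    if fs_hz <= 0:
        raise ValueError("sampling frequency must be positive")
    # Fold the input into one sampling period, then mirror around fs/2.
    folded = f_in_hz % fs_hz
    return min(folded, fs_hz - folded)
```

For example, with a hypothetical 8 kHz sampling rate, a 3 kHz tone is reported unchanged, while a 5 kHz tone folds back to 3 kHz.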

Automatic Detection of Parkinson’s Disease from Compressed Speech Recordings

by Elmar Noeth

2023, Lecture Notes in Computer Science

The impact of speech compression in the automatic classification of speakers with Parkinson's disease (PD) and healthy controls (HC) is tested. The set of codecs considered to compress the speech recordings includes G.722, G.726, GSM-EFR,... more

Enhancement and modification of automatic speaker verification by utilizing hidden Markov model

by ali najdet علي نجدت

2023, Indonesian Journal of Electrical Engineering and Computer Science

The purpose of this study is to discuss the design and implementation of autonomous surface vehicle (ASV) systems. There’s a lot riding on the advancement and improvement of ASV applications, especially given the benefits they provide... more

Using vector quantization in Automatic Speaker Verification

by Mohamed Tayeb Laskri

2023, 2012 International Conference on Information Technology and e-Services

We aim to describe different approaches for vector quantization in Automatic Speaker Verification. We designed our novel architecture based on multiple codebooks representing the speakers and the impostor model called universal background... more

x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition

by David Snyder

2023, Interspeech 2019

State-of-the-art text-independent speaker recognition systems for long recordings (a few minutes) are based on deep neural network (DNN) speaker embeddings. Current implementations of this paradigm use short speech segments (a few... more

Figure 1: Block diagrams of the three network architectures used in this work. Gray boxes indicate that the parameters are frozen after pre-training. The position of the star indicates where the embedding is extracted.

Table 1: Extended TDNN x-vector architecture

The third group, layers 12 to 13, is a feed-forward network with a bottleneck layer and serves as a classifier that outputs posterior probabilities for the training speakers. The x-vector is extracted from layer 11 prior to the ReLU non-linearity. The bottleneck structure of the net is used to achieve a dimensionality reduction of the embedding (512 dimensions). The total number of parameters of the DNN used in the experiments (using 7,168 training speakers) is approximately 8 million, of which only 4 million are needed for extracting the x-vector.

Table 2: Performance on SITW core-core task.

The JHU Speaker Recognition System for the VOiCES 2019 Challenge

by David Snyder

2023, Interspeech 2019

This paper describes the systems developed by the JHU team for the speaker recognition track of the 2019 VOiCES from a Distance Challenge. On this far-field task, we achieved good performance using systems based on state-of-the-art deep... more

Secured vocal access to telephone servers

by D. Genoud

2023, Proceedings of IVTTA '96. Workshop on Interactive Voice Technology for Telecommunications Applications

Speech Bandwidth Extension with Wavenet

by Yannis Assael

2023, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

Large-scale mobile communication systems tend to contain legacy transmission channels with narrowband bottlenecks, resulting in characteristic 'telephone-quality' audio. While higher quality codecs exist, due to the scale and... more

An overview of the PICASSO project research activities in speaker verification for telephone applications

by Chafic Mokbel

2023, 6th European Conference on Speech Communication and Technology (Eurospeech 1999)

IRISA (1) KTH (2) KUN (3) ENST (4) UBS-Ubilab (5) IDIAP (6)

The 30-month European LE-Telematics project PICASSO (PIoneering Caller Authentication for Secure Service

by Chafic Mokbel

2023

IRISA (1) KTH (2) KUN (3) ENST (4) UBS-Ubilab (5) IDIAP (6)

Text-independent speaker verification based on broad phonetic segmentation of speech

by Mohamed Adel

2023, Digital Signal Processing

The effectiveness of introducing deep neural networks into conventional speaker recognition pipelines has been broadly shown to benefit system performance. A novel text-independent speaker verification (SV) framework based on the triplet loss and a very deep convolutional neural network architecture (i.e., Inception-Resnet-v1) is investigated in this study, where a fixed-length speaker discriminative embedding is learned from sparse speech features and utilized as a feature representation for the SV tasks. A concise description of the neural network based speaker discriminative training with triplet loss is presented. A Euclidean distance similarity metric is applied in both network training and SV testing, which ensures that the SV system follows an end-to-end fashion. By replacing the final max/average pooling layer with a spatial pyramid pooling layer in the Inception-Resnet-v1 architecture, the fixed-length input constraint is relaxed and an obvious performance gain is achieved compared with the fixed-length input speaker embedding system. For datasets with more severe training/test condition mismatches, the probabilistic linear discriminant analysis (PLDA) back end is further introduced to replace the distance based scoring for the proposed speaker embedding system. Thus, we reconstruct the SV task with a neural network based front-end speaker embedding system and a PLDA that provides channel and noise variability compensation in the back end. Extensive experiments are conducted to provide useful hints that lead to a better testing performance. Comparison with the state-of-the-art SV frameworks on three public datasets (i.e., a prompt speech corpus, a conversational speech Switchboard corpus, and NIST SRE10 10 s-10 s condition) justifies the effectiveness of our proposed speaker embedding system.

Index Terms: Speaker recognition, very deep convolutional neural networks, i-vector, PLDA, triplet loss, spatial pyramid pooling.

I. INTRODUCTION

Speaker verification (SV) is a binary classification problem which aims to verify a claimed identity based on the claimed/enrolled speaker model. According to different ...
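The triplet loss with a Euclidean distance metric mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common hinge form of the loss on plain (non-squared) Euclidean distances for a single triplet, and the function name and the margin value of 0.2 are hypothetical choices for illustration.

```python
import math
from typing import Sequence


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 0.2) -> float:
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    Assumptions (not from the paper): plain Euclidean distance and a
    hypothetical default margin of 0.2.
    """
    d_pos = math.dist(anchor, positive)  # same-speaker distance
    d_neg = math.dist(anchor, negative)  # different-speaker distance
    # The loss is zero once the negative is farther than the positive by the margin.
    return max(d_pos - d_neg + margin, 0.0)
```

In training, a loss of this kind is typically averaged over many triplets of embeddings; at test time the same distance can be thresholded to accept or reject a trial.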

DFRWS 2018 Europe: Proceedings of the Fifth Annual DFRWS Europe. Speaker verification from codec distorted speech for forensic investigation through serial combination of classifiers

by P. Sathidevi

2023

Forensic investigation often uses biometric evidence as important aids for identifying the culprits. Speech is one of the easily available biometrics in today's hi-tech world. But, most of the speech biometric evidence acquired for... more

Integration of graph clustering with ant colony optimization for feature selection

by Parham Moradi

2023, Knowledge-Based Systems

Feature selection is an important preprocessing step in machine learning and pattern recognition. The ultimate goal of feature selection is to select a feature subset from the original feature set to increase the performance of learning... more

Digit-Based Speaker Verification in Spanish Using Hidden Markov Models

by Juan Carlos Moreno Rodríguez

2023, Res. Comput. Sci.

In this paper we propose a digit-based text-dependent speaker verification system (SVS) in Spanish. The system uses word-level Hidden Markov Models (HMM) as classifiers and Mel Frequency Cepstral Coefficients (MFCC) with Cepstral Mean... more

Deep Speaker Verification Model for Low-Resource Languages and Vietnamese Dataset

by Trang Nguyen Thi Thu

2023

Speaker verification is an essential task in speech processing with great authentication and surveillance applications. Large-scale datasets have hugely contributed to the success of neural networks for speaker verification. However, in... more

ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

by driss matrouf

2023, Computer Speech & Language

Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof challenge initiative was created to foster research on anti-spoofing and to provide common platforms for the assessment and comparison of spoofing countermeasures. The first edition, ASVspoof 2015, focused upon the study of countermeasures for detecting text-to-speech synthesis (TTS) and voice conversion (VC) attacks. The second edition, ASVspoof 2017, focused instead upon replay spoofing attacks and countermeasures. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access.
It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona fide utterances even by human subjects. It is expected that the ASVspoof 2019 database, with its varied coverage of different types of spoofing data, could further foster research on anti-spoofing.

Figure 1: LA access and PA access adopted for ASVspoof 2019

Figure 2: Partitions and protocols of ASVspoof 2019 database. Top part shows divisions of training, development, and evaluation sets common to logical access and physical access conditions and numbers of speakers included in each set. Middle part shows database partitions for logical access condition, and bottom part shows those for physical access condition.

Figure 3: Visualization of bona fide and spoofed speech data of ASVspoof 2019 LA subset. Black circle denotes range of bona fide data within mean ± 3 standard deviations.

Figure 4: Agglomerative clustering of LA attacks using same input data as Figure 3.

Figure 5: Illustration of ASVspoof 2019 physical access (PA) scenario. Replay attacks are simulated within an acoustic environment/room of dimensions x × y (7.5 m in the example above) with controllable reverberation. Recordings of bona fide presentations are acquired from distance D_a from the talker. Bona fide or replay presentations are made at distance D_s from ASV microphone.

Figure 6: Illustration of set of higher harmonic frequency responses (HHFRs) for arbitrary smart-tablet device estimated using synchronized-swept-sine approach to nonlinear system identification based on nonlinear convolution. H_1 is linear component, while H_2 to H_5 are higher-order non-linear components. Blue/shaded region is occupied bandwidth, difference in frequency between points where integrated power crosses 0.5% and 99.5% of total power in spectrum.

Figure 7: Characteristics measured from 40 different loudspeaker devices listed in Table 5. The top plot shows the operational bandwidth (OB). The middle plot shows the lower bound of the OB (minF). The bottom plot shows the linear-to-non-linear power ratio (LNLR) in the range of the OB. Change label of lower plot to LNLR.

Figure 8: Illustration of PA simulation process based on impulse response (IR) modeling approach. Simulations take into account size of acoustic environment S and level of reverberation R. Bona fide access attempts are made at distance D_s from ASV microphone, whereas surreptitious recordings are made at distance D_a from talker before being presented to ASV microphone, also at distance D_s. Effects of digital-to-analogue conversion, signal amplification and replay (using loudspeaker) are all modelled, and represented with single device quality indicator Q.

Figure 9: As for Figure 3, except for PA data. Bona fide utterances and replayed versions according to 9 different replay configurations are completely overlapping. Mini-clusters correspond to speaker-utterance combinations.

Figure 10: ASV score distributions for target bona fide, non-target bona fide, and spoofed data from A16. Vertical black line denotes classification threshold between spoofed and target bona fide data. Dot and square shaded areas correspond to false reject errors and false accept errors, respectively.

Figure 12: Summary of LA subset (evaluation section) results in terms of t-DCF (left) and EER (right). First row shows results for three categories of synthetic speech: (i) TTS, (ii) VC (TTS), and (iii) VC (Human). Next row shows results for four types of acoustic models: (i) neural-network-based pipeline TTS, (ii) neural-network-based end-to-end TTS, (iii) neural-network-based VC, and (iv) statistical-model-based VC. Last row shows results for different waveform generation methods: (i) neural waveform models, (ii) classic speech vocoders, (iii) waveform concatenation, (iv) spectral filtering, (v) waveform filtering, and (vi) others.

Figure 13: An illustration of baseline results for the PA scenario of the ASVspoof 2019 database. Results illustrated for individual replay configurations (pooled acoustic environments) and for: (top panel) standalone ASV results in terms of EER (%) with target and zero-effort impostor trials (black bars) and target and replay spoofing trials (gray bars); (middle panel) standalone replay spoofing in terms of EER (%) for baselines B1 and B2; (bottom panel) combined ASV and CM results illustrated in terms of the min-tDCF.
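Several of the captions here report results as an equal error rate (EER). As a rough illustration only, the sketch below shows one simple way to estimate an EER from two lists of scores. The function name `compute_eer`, the threshold sweep over observed scores, and the convention that higher scores mean "accept" are assumptions made for this example; they are not taken from the ASVspoof 2019 evaluation tools.

```python
def compute_eer(target_scores, nontarget_scores):
    """Estimate the EER and the threshold at which it occurs.

    Assumes higher scores favour acceptance. Every observed score is tried
    as a threshold; the one where the false-accept and false-reject rates
    are closest is kept, and the EER is reported as their average.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False reject: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False accept: a non-target (or spoofed) trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, invented for illustration.
    eer, threshold = compute_eer([1.0, 2.0, 3.0, 4.0], [0.0, 1.5, 2.5, 3.5])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

The same routine can be applied with zero-effort impostor scores or with spoofing-attack scores as the second argument, which is the distinction the black and gray bars in the caption draw.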

Figure 14: As for Figure 13 except for results in terms of individual acoustic environments (pooled replay configurations).

Figure 15: DET curves based on human assessment of similarity to target speakers (left) and speech quality (right)

Table 1: Summary of LA spoofing systems. * indicates neural networks. For abbreviations in this table, please refer to Section 3. Note that A04 and A16 use same waveform concatenation TTS algorithm, and A06 and A19 use same VC algorithm.

Table 2: Environment is defined as triplet (S, R, D_s), each element of which takes one value in set (a, b, c) as a categorical value.

Table 3: Replay attack is defined as duple (D_a, Q), each element of which takes one value in set (A, B, C) as a categorical value.

Table 4: Definition of replay device quality (Q). OB refers to occupied bandwidth, minF is lower bound of the OB, LNLR is linear-to-non-linear power ratio.
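The categorical definitions in Tables 2 and 3 imply a fixed number of configurations: three elements with three values each for an environment, and two elements with three values each for a replay attack. A minimal sketch of that enumeration follows; the variable names and the string encoding of each configuration (for example "aab" or "BC") are assumptions made for this example.

```python
from itertools import product

ENV_LEVELS = ("a", "b", "c")      # categorical values for each environment element
ATTACK_LEVELS = ("A", "B", "C")   # categorical values for each replay attack element

# Environment triplet: room size, reverberation level, talker-to-microphone distance.
environments = ["".join(combo) for combo in product(ENV_LEVELS, repeat=3)]

# Replay attack duple: recording distance, device quality.
attacks = ["".join(combo) for combo in product(ATTACK_LEVELS, repeat=2)]

if __name__ == "__main__":
    print(len(environments), "environments, e.g.", environments[:3])
    print(len(attacks), "replay configurations, e.g.", attacks[:3])
```

The count of replay configurations produced this way agrees with the "9 different replay configurations" mentioned in the caption of Figure 9.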

Table 5: List of real devices from which measurements were taken for simulation of replay attack presentation. Q indicates device quality (B high, C low). Device code signifies device type: bluetooth (BT); headphone (H); mobile smartphone (M); larger consumer and professional loudspeaker (LS); tablet (T); laptop (LT). Level indicates volume (high, low) used during device characterisation. Right-most column indicates whether measured device characteristics were used for the simulation of utterances in training and development set (known devices) or evaluation set (unknown devices).

Table 6: ASV performance in terms of EER (%) on LA subset of ASVspoof 2019 dataset for baseline and different attack conditions for development and evaluation set. Note that A16 used same TTS algorithm as A04, and A19 used same VC algorithm as A06.

... expectation-maximization (EM) algorithm. Finally, scores are the log-likelihood ratio between the two hypotheses, namely that a given trial is either bona fide or spoofed speech. Baseline CMs are trained separately for LA and PA scenarios using designated CM training data.
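The log-likelihood ratio score described above can be sketched as follows. This is a minimal illustration, not the baseline implementation: it assumes each hypothesis is modelled by a diagonal-covariance GMM whose parameters are already trained, and it assumes the utterance score is the per-frame average, which the text does not state. The names `gmm_loglik` and `llr_score` and the `(weights, means, variances)` tuple layout are hypothetical.

```python
import math


def gmm_loglik(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(frame, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(ll)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def llr_score(frames, bona_fide_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio (assumed averaging).

    Positive values favour the bona fide hypothesis.
    """
    bona = sum(gmm_loglik(f, *bona_fide_gmm) for f in frames) / len(frames)
    spoof = sum(gmm_loglik(f, *spoof_gmm) for f in frames) / len(frames)
    return bona - spoof


if __name__ == "__main__":
    # Toy one-dimensional, single-component models, invented for illustration.
    bona_fide = ([1.0], [[0.0]], [[1.0]])
    spoofed = ([1.0], [[2.0]], [[1.0]])
    print(llr_score([[0.0], [0.5]], bona_fide, spoofed))
```

In practice the GMM parameters would come from EM training on the designated CM training data, with one pair of models per scenario.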

Table 7: Performance of integrated system in terms of min-tDCF and of standalone countermeasures in terms of EER (%) on development set (LA subset) of ASVspoof 2019 dataset. Results are shown for two baselines, B1 (CQCC-GMM) and B2 (LFCC-GMM), separately, combined with fixed ASV system based on x-vector. Last row describes results for “pooled condition” when trials from all the attacks are considered for evaluation.

Table 8: Same as Table 7 but for evaluation set.


Additive noise compensation in the i-vector space for speaker recognition

by Driss Matrouf

2023, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The performance of state-of-the-art speaker recognition systems degrades considerably in noisy environments, even though they achieve very good results in clean conditions. In order to deal with this strong limitation, we aim in this work to...


Speaker verification from partially encrypted compressed speech for forensic investigation

by F223128 Muhammad Murtaza Baig

2023, Digital Investigation

Speaker verification has recently been introduced to the forensic field as a new and complementary approach to other forensic methods. With the advancement in speech communication technologies including voice over IP and wireless...


Robust speaker verification in colored noise environment

by Rogerio Alves

2023, The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003

Noise robustness of automatic speaker verification systems is crucial in real-life applications. A study on the performance of several spectral subtraction-based speech enhancement techniques...


Segmental Approaches for Automatic Speaker Verification

by Dijana Petrovska-delacrétaz

2023, Digital Signal Processing

Speech is composed of different sounds (acoustic segments). Speakers differ in their pronunciation of these sounds. The segmental approaches described in this paper are meant to exploit these differences for speaker verification purposes...

FIG. 1. Global and segmental speaker verification systems.

Current text-independent speaker verification systems are usually based on modeling globally the probability density function of the speaker feature vectors. In such systems, denoted here as global systems, the temporal information is not taken into account and all classes are represented using a unique speaker model. As noted in the Introduction, another possibility consists of dividing the speech signal into distinct categories (also called classes or segments) and of performing the speaker modeling independently for each class. Such systems are denoted here as segmental systems. In such cases, the speaker verification task is divided into two parts: speech segmentation followed by speaker modeling for each of the classes. The general flowcharts of global and segmental speaker verification systems are shown in Fig. 1.
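The contrast between global and segmental scoring can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's system: frames are assumed to arrive with class labels already assigned by some segmentation step, each scorer is an arbitrary callable returning a per-frame score, and the per-class scores are merged with a weighted linear combination. The names `global_score`, `segmental_score`, `class_scorers` and `class_weights` are hypothetical.

```python
from collections import defaultdict


def global_score(frames, score_frame):
    """Global system: one speaker model scores every frame, ignoring class."""
    return sum(score_frame(f) for f in frames) / len(frames)


def segmental_score(frames, labels, class_scorers, class_weights):
    """Segmental system: score each class with its own model, then recombine.

    Recombination here is a weighted average of per-class mean scores
    (an assumed linear scheme; the weights are illustrative).
    """
    frames_by_class = defaultdict(list)
    for frame, label in zip(frames, labels):
        frames_by_class[label].append(frame)

    total, weight_sum = 0.0, 0.0
    for label, class_frames in frames_by_class.items():
        scorer = class_scorers[label]
        class_mean = sum(scorer(f) for f in class_frames) / len(class_frames)
        total += class_weights[label] * class_mean
        weight_sum += class_weights[label]
    return total / weight_sum


if __name__ == "__main__":
    # Toy scalar "frames" and scorers, invented for illustration.
    frames = [1.0, 2.0, 3.0, 5.0]
    labels = ["v", "v", "c", "c"]
    scorers = {"v": lambda f: f, "c": lambda f: 2 * f}
    weights = {"v": 1.0, "c": 3.0}
    print(global_score(frames, lambda f: f))
    print(segmental_score(frames, labels, scorers, weights))
```

Keeping segmentation outside these functions mirrors the two-part split described above: segmentation first, then per-class speaker modeling.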

FIG. 2. DET curves for global GMM and global MLP systems, showing the performances for matched train/test conditions (SN) and mismatched train/test conditions (DT). Data from NIST 1998, training conditions 2F (2 min or more), and 30 s for test segment duration.

FIG. 3. DET curves for segmental MLP systems, showing the performances of five of eight classes, for matched train/test conditions (SN). Data from NIST 1998, training conditions 2F (2 min or more), and 30 s for test segment duration.

FIG. 4. DET curves for global and segmental MLP systems. MLPGlobC55 stands for the global system with 11 input frames and MLPSegC22RLin indicates the segmental system with linear score recombination using five input frames. Performances are reported for matched train/test conditions (SN) and mismatched train/test conditions (DT). Data from NIST1998, training conditions 2F (2 min or more), and 30 s for test segment duration.

FIG. 5. DET curves for global GMM (as baseline comparison), with global and segmental MLP systems. MLPGlobC55 stands for the global system with 11 input frames and MLPSegC22RLin indicates the segmental systems (using five input frames for the MLP), with linear score recombination. Performances are reported for matched train/test conditions (SN) and mismatched train/test conditions (DT). Data from NIST1998, training conditions 2F (2 min or more), and 30 s for test segment duration.

FIG. 6. Results for global and segmental GMM systems. Performances are reported for matched train/test conditions (SN) and mismatched train/test conditions (DT). Data from ELISA-1 (subset of NIST1998); training conditions 2s (2 min), and 3 s for test segment duration.


Barlow Twins self-supervised learning for robust speaker recognition

by Mohammad Mohammadamini

2023, Interspeech 2022

Acoustic noise is a major challenge for speaker recognition systems. State-of-the-art speaker recognition systems are based on a deep neural network speaker embedding extractor called the x-vector extractor. A noise-robust x-vector extractor is highly...

