Academia.edu

Speech separation

17 papers
9 followers
About this topic
Speech separation is a computational and signal processing technique aimed at isolating individual speech signals from a mixture of sounds, typically in environments with overlapping voices. It involves algorithms that analyze audio signals to enhance intelligibility and clarity, facilitating better communication and understanding in various applications such as telecommunications and hearing aids.
In this article, I explain how to run distributed machine learning workloads on a Kubernetes cluster using PyTorch’s Distributed Data Parallel (DDP) approach. DDP enables efficient multi-node training by replicating the model across... more
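The snippet above names PyTorch DDP but gives no code. The following is a self-contained sketch of DDP's core idea only, simulated in plain numpy rather than with `torch.distributed` (all data, shard counts, and the learning rate are illustrative assumptions): each replica computes gradients on its own shard, an all-reduce averages them, and every replica applies the identical update, so the models stay in sync.

```python
import numpy as np

np.random.seed(0)

def grad_mse(w, X, y):
    """Gradient of mean squared error for a linear model y_hat = X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

# Synthetic regression data, split into one shard per simulated worker.
X = np.random.randn(64, 3)
y = X @ np.array([1.0, -2.0, 0.5])
shards = np.array_split(np.arange(64), 4)

# Every replica starts from the same weights (DDP broadcasts them at init).
replicas = [np.zeros(3) for _ in shards]

for _ in range(200):
    # Each worker computes a gradient on its local shard only.
    grads = [grad_mse(replicas[i], X[idx], y[idx]) for i, idx in enumerate(shards)]
    avg = np.mean(grads, axis=0)      # the "all-reduce" averaging step
    for i in range(len(replicas)):
        replicas[i] -= 0.1 * avg      # identical update on every replica
```

Because every replica applies the same averaged gradient, the copies never diverge, which is exactly the invariant real DDP maintains via NCCL/Gloo all-reduce.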
This study proposes a blind speech separation algorithm that employs a single-channel technique. The algorithm's input signal is a segment of a mixture of speech for two speakers. At first, filter bank analysis transforms the input from... more
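The abstract mentions filter bank analysis of a single-channel mixture as the first step. As a hedged illustration (the paper's actual filter bank is not specified here), a short-time Fourier analysis splits a two-speaker-style mixture into time-frequency frames; the tone frequencies, frame length, and hop are arbitrary choices for the demo:

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Windowed short-time analysis: (num_frames, frame_len // 2 + 1) complex bins."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

fs = 8000
t = np.arange(fs) / fs
# Toy "mixture": two stationary tones standing in for two talkers.
mixture = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
spec = stft(mixture)
# Energy concentrates in the bins nearest 200 Hz and 1200 Hz, which is what
# lets later stages assign time-frequency cells to one source or the other.
```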
SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper... more
We present a joint spatial and spectral denoising front-end for Track 1 of the 2nd CHiME Speech Separation and Recognition Challenge based on the Flexible Audio Source Separation Toolbox (FASST). We represent the sources by nonnegative... more
Speaker verification (SV) is the process of verifying whether speech from two audio signals originates from the same speaker or different speakers. Current state-of-the-art SV systems are based on deep neural networks, predominantly... more
Voice control is the most prominent feature of smart home environments. In this paper, we propose a voice command module that enables hands-free user interaction with the smart home environment. We present three components required for... more
Open-vocabulary keyword spotting (KWS) aims to detect arbitrary keywords from continuous speech, which allows users to define their personal keywords. In this paper, we propose a novel location guided end-to-end (E2E) keyword spotting... more
Programming Language Processing (PLP) using machine learning has made vast improvements in the past few years. Increasingly more people are interested in exploring this promising field. However, it is challenging for new researchers and... more
While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion,... more
This study presents UX-Net, a time-domain audio separation network (TasNet) based on a modified U-Net architecture. The proposed UX-Net works in real-time and handles either single or multi-microphone input. Inspired by the... more
In recent years, significant progress has been made in automatic lip reading. But these methods require large-scale datasets that do not exist for many low-resource languages. In this paper, we have presented a new multipurpose... more
Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in... more
Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is... more
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In this work, we introduce... more
Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length... more
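The x-vector snippet above refers to statistics pooling, which maps a variable-length sequence of frame-level features to a fixed-size utterance embedding. A minimal sketch of that pooling step alone (feature dimensions here are arbitrary, and the real TDNN layers before and after it are omitted):

```python
import numpy as np

def stats_pool(frames):
    """Concatenate per-dimension mean and std: (T, D) -> (2 * D,)."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

short = np.random.randn(120, 512)   # a short utterance (120 frames)
long_utt = np.random.randn(900, 512)  # a long utterance (900 frames)

# Both map to the same fixed-size vector, regardless of duration.
assert stats_pool(short).shape == stats_pool(long_utt).shape == (1024,)
```

The design point is that downstream fully connected layers can then operate on a fixed-size input even though utterances vary in length.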
We propose two approaches for speaker adaptation in end-to-end (E2E) automatic speech recognition systems. One is Kullback-Leibler divergence (KLD) regularization and the other is multi-task learning (MTL). Both approaches aim to address... more
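The snippet names KLD regularization for speaker adaptation. As an illustrative sketch only (the posteriors, the interpolation weight `rho`, and the placeholder task loss are all assumptions, not values from the paper), the idea is to penalize the adapted model when its output distribution drifts from the speaker-independent model's:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

p_si = np.array([0.7, 0.2, 0.1])       # speaker-independent model posteriors
p_adapted = np.array([0.6, 0.3, 0.1])  # adapted-model posteriors

task_loss = 0.9   # placeholder cross-entropy value (assumed)
rho = 0.5         # regularization weight (assumed)

# Interpolated objective: fit the adaptation data, but stay close to the
# original model's outputs so the small adaptation set cannot overfit.
total = (1 - rho) * task_loss + rho * kl_divergence(p_si, p_adapted)
```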
The application of deep neural networks to the task of acoustic modeling for automatic speech recognition (ASR) has resulted in dramatic decreases of word error rates, allowing for the use of this technology in smart phones and personal... more
The Cocktail Party Problem is the task of isolating a voice signal of interest in a noisy environment. Two popular methods for resolving this issue are blind source extraction and the Wiener filtering... more
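Wiener filtering, mentioned above, can be shown in its simplest per-bin form. This is a minimal sketch, not the paper's method: the gain in each frequency bin is the ratio of (estimated) speech power to speech-plus-noise power, and the power values below are made up for illustration.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    """Per-bin Wiener gain: S / (S + N), in [0, 1]."""
    return speech_psd / (speech_psd + noise_psd)

speech_psd = np.array([4.0, 1.0, 0.25])  # assumed speech power per bin
noise_psd = np.array([1.0, 1.0, 1.0])    # assumed noise power per bin

gain = wiener_gain(speech_psd, noise_psd)
# Bins where speech dominates keep most of their energy (gain near 1);
# noise-dominated bins are strongly attenuated.
```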
Jitter and shimmer measurements have been shown to be carriers of voice quality and prosodic information which enhance the performance of tasks like speaker recognition, diarization or automatic speech recognition (ASR). However, such features... more
This paper outlines structures of different automatic speech recognition systems, hybrid and end-to-end, pros and cons for each of them, including the comparison of training data and computational resources requirements. Three main... more
This paper is focused on Natural Language Processing (NLP) and speech area, describes the most prominent approaches and techniques, provides requirements to datasets for text and speech model training, compares major toolkits and... more
In this paper, we present an algorithm called Reliable Mask Selection-Phase Difference Channel Weighting (RMS-PDCW) which selects the target source masked by a noise source using the Angle of Arrival (AoA) information calculated using the... more
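The RMS-PDCW abstract relies on Angle of Arrival estimation from inter-microphone phase differences. The sketch below shows only that underlying geometry, not the algorithm itself; the microphone spacing, analysis frequency, and source angle are assumed values, and the spacing is kept small enough that the phase difference does not wrap.

```python
import numpy as np

c = 343.0    # speed of sound, m/s
d = 0.05     # microphone spacing, m (assumed)
f = 1000.0   # analysis frequency, Hz (assumed)
theta_true = np.deg2rad(30.0)  # source angle relative to broadside (assumed)

# Forward model: a source at angle theta delays one microphone's signal
# by d * sin(theta) / c, which appears as a phase difference at frequency f.
tau = d * np.sin(theta_true) / c
phase_diff = 2 * np.pi * f * tau

# Invert the model to recover the angle of arrival from the phase difference.
theta_est = np.arcsin(phase_diff * c / (2 * np.pi * f * d))
```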
We describe a modulation-domain loss function for deeplearning-based speech enhancement systems. Learnable spectro-temporal receptive fields (STRFs) were adapted to optimize for a speaker identification task. The learned STRFs were then... more
Long short-term memory (LSTM) acoustic models have recently achieved state-of-the-art results on speech recognition tasks. As a type of recurrent neural network, LSTMs potentially have the ability to model long-span phenomena relating the... more
Public sources like parliament meeting recordings and transcripts provide ever-growing material for the training and evaluation of automatic speech recognition (ASR) systems. In this paper, we publish and analyse the Finnish Parliament... more
Speaker recognition, recognizing speaker identities based on voice alone, enables important downstream applications, such as personalization and authentication. Learning speaker representations, in the context of supervised learning,... more
This manuscript proposes a novel robust procedure for the extraction of a speaker of interest (SOI) from a mixture of audio sources. The estimation of the SOI is performed via independent vector extraction (IVE). Since the blind IVE... more
We propose a novel approach for semi-supervised extraction of a moving audio source of interest (SOI) applicable in reverberant and noisy environments. The blind part of the method is based on independent vector extraction (IVE) and uses... more
This work presents a large-scale audiovisual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audiovisual (A/V) dataset of... more
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied to data augmentation for automatic speech recognition (ASR) systems in low/mediumresource scenarios. Through extensive experiments, we show... more
On-device personalization of an all-neural automatic speech recognition (ASR) model can be achieved efficiently by finetuning the last few layers of the model. This approach has been shown to be effective for adapting the model to... more
We present a systematic analysis on the performance of a phonetic recogniser when the window of input features is not symmetric with respect to the current frame. The recogniser is based on Context Dependent Deep Neural Networks (CD-DNNs)... more
While most modern speech Language Identification methods are closed-set, we want to see if they can be modified and adapted for the open-set problem. When switching to the open-set problem, the solution gains the ability to reject an... more
With the rise of bidirectional encoder representations from Transformer models in natural language processing, the speech community has adopted some of their development methodologies. Therefore, the Wav2Vec models were introduced to... more
Extraction of a speaker embedding vector plays an important role in deep learning-based speaker verification. In this contribution, to extract speaker discriminant utterance level embeddings, we propose a hybrid neural network that... more
In recent years, various self-supervised contrastive embedding learning methods for deep speaker verification have been proposed. The performance of the self-supervised contrastive learning framework highly depends on the data... more
Deep Attractor Network (DANet) is the state-of-the-art technique in speech separation field, which uses Bidirectional Long Short-Term Memory (BLSTM), but the complexity of the DANet model is very high. In this paper, a simplified and... more
Speech separation is very important in real-world applications such as human-machine interaction, hearing aid devices, and automatic meeting transcription. In recent years, significant improvements have been made in solutions based on... more
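Mask estimation is the common thread of the neural separators in this listing (DANet included). The toy sketch below is not any paper's model: it applies an ideal ratio mask to a hand-made two-source magnitude "spectrogram" to show what the networks are trained to approximate.

```python
import numpy as np

# Toy 2x2 time-frequency magnitudes for two sources (made-up values).
s1 = np.array([[3.0, 0.1], [0.2, 2.0]])   # speaker 1
s2 = np.array([[0.5, 2.5], [1.5, 0.3]])   # speaker 2
mix = s1 + s2                              # observed mixture

# Ideal ratio mask for speaker 1: each cell's share of the mixture energy.
mask1 = s1 / (s1 + s2)

# Applying the mask to the mixture recovers speaker 1's magnitudes; a
# separator network is trained to predict such a mask from the mixture alone.
est1 = mask1 * mix
```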
Recent large-scale Spoken Language Understanding datasets focus predominantly on English and do not account for language-specific phenomena such as particular phonemes or words in different lects. We introduce ITALIC, the first large-scale... more
Deep neural network models for speech recognition have achieved great success recently, but they can learn incorrect associations between the target and nuisance factors of speech (e.g., speaker identities, background noise, etc.), which... more
Speech enhancement is a crucial front-end in many speech processing systems. Single-channel speech enhancement faces a number of technological challenges. Due to the advent of cloud-based technology and the use of deep learning... more
Speaker diarization, or the task of finding "who spoke and when", is now used in almost every speech processing application. Nevertheless, its fairness has not yet been evaluated because there was no protocol to study its biases one by... more
We describe the system used by our team for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC 2022) in the speaker diarization track. Our solution was designed around a new combination of voice activity detection algorithms that... more