Papers by Alicia Lozano-Diez

arXiv (Cornell University), Dec 31, 2019
This chapter describes a number of signal-processing and statistical-modeling techniques that are commonly used to calculate likelihood ratios in human-supervised automatic approaches to forensic voice comparison. Techniques described include mel-frequency cepstral coefficient (MFCC) feature extraction, Gaussian mixture model - universal background model (GMM-UBM) systems, i-vector - probabilistic linear discriminant analysis (i-vector PLDA) systems, deep neural network (DNN) based systems (including senone posterior i-vectors, bottleneck features, and embeddings / x-vectors), mismatch compensation, and score-to-likelihood-ratio conversion (aka calibration). Empirical validation of forensic-voice-comparison systems is also covered. The aim of the chapter is to bridge the gap between general introductions to forensic voice comparison and the highly technical automatic-speaker-recognition literature from which the signal-processing and statistical-modeling techniques are mostly drawn. Knowledge of the likelihood-ratio framework for the evaluation of forensic evidence is assumed. It is hoped that the material presented here will be of value to students of forensic voice comparison and to researchers interested in learning about statistical-modeling techniques that could potentially also be applied to data from other branches of forensic science.
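As an illustration of the last technique named above, the following is a minimal sketch of score-to-likelihood-ratio conversion via logistic-regression calibration. The scores and the prior handling are synthetic placeholders, not the chapter's exact procedure.

```python
# Minimal sketch: calibrate raw comparison scores into log-likelihood ratios
# with logistic regression. All scores here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic raw scores: same-speaker trials score higher on average.
same = rng.normal(2.0, 1.0, 500)
diff = rng.normal(-1.0, 1.0, 500)
scores = np.concatenate([same, diff]).reshape(-1, 1)
labels = np.concatenate([np.ones(500), np.zeros(500)])

# Logistic regression maps a raw score to a posterior probability.
cal = LogisticRegression().fit(scores, labels)

def score_to_llr(s, prior=0.5):
    """Convert a raw score to a log-likelihood ratio by subtracting the
    prior log-odds from the calibrated posterior log-odds."""
    p = cal.predict_proba(np.atleast_2d(s))[:, 1]
    posterior_log_odds = np.log(p / (1 - p))
    prior_log_odds = np.log(prior / (1 - prior))
    return posterior_log_odds - prior_log_odds

print(score_to_llr(2.5))   # a strongly positive LLR is expected here
print(score_to_llr(-2.0))  # a strongly negative LLR is expected here
```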

IEEE Access, 2021
Over the last decades, the volume of multimedia content posted on social networks has grown exponentially, and such information is immediately propagated and consumed by a significant number of users. In this scenario, the disruption caused by fake-news providers and bot accounts spreading propaganda and sensitive content throughout the network has fostered applied research on automatically measuring the reliability of social-network accounts via Artificial Intelligence (AI). In this paper, we present a multilingual approach for addressing the bot-identification task on Twitter via Deep Learning (DL) approaches, to support end-users when checking the credibility of a certain Twitter account. To do so, several experiments were conducted using state-of-the-art multilingual language models to generate an encoding of the text-based features of the user account, which is then concatenated with the rest of the metadata to build a potential input vector on top of a dense network denoted as Bot-DenseNet. Consequently, this paper assesses the language constraint of previous studies, where the encoding of the user account considered either the metadata information alone or the metadata together with some basic semantic text features. Moreover, the Bot-DenseNet produces a low-dimensional representation of the user account which can be used for any application within the Information Retrieval (IR) framework.
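The concatenation idea behind Bot-DenseNet can be sketched as follows. This is not the authors' code: the text-embedding dimension (768), the metadata size, and the layer widths are illustrative assumptions.

```python
# Sketch of a Bot-DenseNet-style model: a language-model text embedding is
# concatenated with account metadata and fed through a small dense network
# that also exposes a low-dimensional account representation.
import torch
import torch.nn as nn

class BotDenseNet(nn.Module):
    def __init__(self, text_dim=768, meta_dim=16, hidden=128, bottleneck=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + meta_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, bottleneck),  # low-dimensional account representation
            nn.ReLU(),
        )
        self.classifier = nn.Linear(bottleneck, 1)  # bot-vs-human logit

    def forward(self, text_emb, metadata):
        x = torch.cat([text_emb, metadata], dim=-1)
        rep = self.net(x)                  # reusable representation for IR tasks
        return self.classifier(rep), rep

model = BotDenseNet()
logit, rep = model(torch.randn(4, 768), torch.randn(4, 16))
print(logit.shape, rep.shape)  # torch.Size([4, 1]) torch.Size([4, 32])
```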

Interspeech 2022, Sep 18, 2022
End-to-end neural diarization (EEND) is nowadays one of the most prominent research topics in speaker diarization. EEND presents an attractive alternative to standard cascaded diarization systems, since a single system is trained at once to deal with the whole diarization problem. Several EEND variants and approaches have been proposed; however, all these models require large amounts of annotated data for training, and available annotated data are scarce. Thus, EEND works have mostly used simulated mixtures for training. However, simulated mixtures do not resemble real conversations in many aspects. In this work we present an alternative method for creating synthetic conversations that resemble real ones, using statistics about the distributions of pauses and overlaps estimated on genuine conversations. Furthermore, we analyze the effect of the source of the statistics, of different augmentations, and of the amount of data. We demonstrate that our approach performs substantially better than the original one, while reducing the dependence on the fine-tuning stage. Experiments are carried out on 2-speaker telephone conversations from Callhome and DIHARD 3. Together with this publication, we release our implementations of EEND and of the method for creating simulated conversations.
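As a rough illustration of the simulation idea (not the released implementation), the toy sketch below places alternating speaker turns on a timeline with pauses and overlaps drawn from exponential distributions. In the paper such statistics are estimated from genuine conversations; the parameters here are invented placeholders.

```python
# Toy simulated-conversation generator: alternating 2-speaker turns with
# pauses/overlaps sampled from (placeholder) exponential distributions.
import numpy as np

rng = np.random.default_rng(0)

def simulate_conversation(seg_durs_spk0, seg_durs_spk1, p_overlap=0.2,
                          pause_scale=0.5, overlap_scale=0.3):
    """Return (start, end, speaker) tuples for an alternating 2-speaker dialog."""
    timeline, t = [], 0.0
    turns = [iter(seg_durs_spk0), iter(seg_durs_spk1)]
    spk = 0
    while True:
        try:
            dur = next(turns[spk])
        except StopIteration:
            break
        timeline.append((t, t + dur, spk))
        if rng.random() < p_overlap:
            # Next turn starts before this one ends (overlap).
            t = t + dur - rng.exponential(overlap_scale)
        else:
            # Next turn starts after a sampled pause.
            t = t + dur + rng.exponential(pause_scale)
        spk = 1 - spk
    return timeline

for start, end, spk in simulate_conversation(rng.uniform(1, 4, 3),
                                             rng.uniform(1, 4, 3)):
    print(f"spk{spk}: {start:.2f}-{end:.2f}s")
```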

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once. Many flavors of end-to-end models have been proposed, but all of them require (so far non-existent) large amounts of annotated data for training. The compromise solution consists of generating synthetic data, and the recently proposed simulated conversations (SC) have shown remarkable improvements over the original simulated mixtures (SM). In this work, we create SC with multiple speakers per conversation and show that they allow for substantially better performance than SM, while also reducing the dependence on a fine-tuning stage. We also create SC with wide-band public audio sources and present an analysis on several evaluation sets. Together with this publication, we release the recipes for generating such data and models trained on public sets, as well as the implementation to efficiently handle multiple speakers per conversation and an auxiliary voice activity detection loss.
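The auxiliary voice activity detection loss can be illustrated with a short sketch like the one below. It is a deliberate simplification: it omits the permutation-invariant training used in end-to-end diarization, and the shapes and weighting factor are illustrative assumptions.

```python
# Sketch: add an auxiliary VAD loss to a per-speaker diarization loss.
# VAD targets are derived from the speaker labels (speech = any speaker active).
import torch
import torch.nn.functional as F

def diarization_with_vad_loss(spk_logits, vad_logits, spk_labels, vad_weight=0.1):
    """spk_logits/spk_labels: (batch, time, n_speakers); vad_logits: (batch, time)."""
    diar_loss = F.binary_cross_entropy_with_logits(spk_logits, spk_labels)
    # A frame contains speech if any speaker is active in it.
    vad_labels = (spk_labels.sum(dim=-1) > 0).float()
    vad_loss = F.binary_cross_entropy_with_logits(vad_logits, vad_labels)
    return diar_loss + vad_weight * vad_loss

loss = diarization_with_vad_loss(
    torch.randn(2, 100, 3), torch.randn(2, 100),
    (torch.rand(2, 100, 3) > 0.7).float())
print(loss.item())
```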
Efficient Transformers for End-to-End Neural Speaker Diarization
IberSPEECH 2022

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks
Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) have recently outperformed other state-of-the-art approaches, such as i-vector and Deep Neural Networks (DNNs), in automatic Language Identification (LID), particularly when dealing with very short utterances (~3 s). In this contribution we present an open-source, end-to-end, LSTM RNN system running on limited computational resources (a single GPU) that outperforms a reference i-vector system on a subset of the NIST Language Recognition Evaluation (8 target languages, 3 s task) by up to 26%. This result is in line with previously published research using proprietary LSTM implementations and huge computational resources, which made those former results hardly reproducible. Further, we extend those previous experiments to model unseen languages (out-of-set, OOS, modeling), which is crucial in real applications. Results show that an LSTM RNN with OOS modeling is able to detect these languages and generalizes robustly to unseen languages.
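A minimal sketch of the kind of model involved is shown below, with one extra output for the out-of-set class. The feature dimension, layer sizes, and the use of the last hidden state for classification are illustrative choices, not the paper's exact recipe.

```python
# Sketch: LSTM-based language identification with an extra out-of-set class.
import torch
import torch.nn as nn

class LstmLid(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_target_langs=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # +1 output for the out-of-set "none of the targets" class
        self.out = nn.Linear(hidden, n_target_langs + 1)

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        _, (h, _) = self.lstm(feats)     # h: (num_layers, batch, hidden)
        return self.out(h[-1])           # per-language logits incl. OOS

model = LstmLid()
logits = model(torch.randn(4, 300, 40))  # ~3 s of speech at 10 ms per frame
print(logits.shape)                      # torch.Size([4, 9])
```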

In this paper, we summarize our efforts in the NIST Language Recognition Evaluations (LRE) 2017, which resulted in systems providing very competitive, state-of-the-art performance. We provide both descriptions and analysis of the systems that we included in our submission. We explain our partitioning of the datasets provided to us by NIST for training and development, and we follow by describing the features, DNN models and classifiers that were used to produce the final systems. After covering the architecture of our submission, we concentrate on post-evaluation analysis. We compare different DNN bottleneck features, i-vector systems of different sizes and architectures, and different classifiers, and we present experimental results with data augmentation and with an improved architecture of the system based on DNN embeddings. We present the performance of the systems in the Fixed condition (where participants are required to use only predefined data sets) and in addition...
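For readers unfamiliar with bottleneck features, the schematic sketch below (not the submission's actual networks) shows the general idea: activations from a deliberately narrow hidden layer of a DNN trained for another task (e.g., senone classification) are reused as frame-level features.

```python
# Schematic bottleneck-feature extractor: train a classifier with a narrow
# hidden layer, then keep only that layer's activations as features.
import torch
import torch.nn as nn

class BottleneckDnn(nn.Module):
    def __init__(self, in_dim=40, hidden=512, bottleneck=80, n_senones=3000):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck),               # narrow bottleneck layer
        )
        self.back = nn.Sequential(
            nn.ReLU(), nn.Linear(bottleneck, n_senones)  # training-time classifier
        )

    def forward(self, x):
        return self.back(self.front(x))

    def extract_bottleneck(self, x):
        # At feature-extraction time, discard the classifier head.
        return self.front(x)

dnn = BottleneckDnn()
bnf = dnn.extract_bottleneck(torch.randn(100, 40))  # 100 frames -> 80-dim features
print(bnf.shape)                                    # torch.Size([100, 80])
```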

Fundamentals of voice biometrics: classical and machine learning approaches
Voice Biometrics: Technology, trust and security, 2021

Interspeech 2015, 2015
In this work, we propose an end-to-end approach to the language identification (LID) problem based on Convolutional Deep Neural Networks (CDNNs). The use of CDNNs is mainly motivated by the ability they have shown when modeling speech signals, and by their relatively low cost with respect to other deep architectures in terms of the number of free parameters. We evaluate different configurations on a subset of 8 languages within the NIST Language Recognition Evaluation 2009 Voice of America (VOA) dataset, for the task of short test durations (segments of up to 3 seconds of speech). The proposed CDNN-based systems achieve performance comparable to our baseline i-vector system, while drastically reducing the number of parameters to tune (at least 100 times fewer parameters). Then, we combine these CDNN-based systems and the i-vector baseline with a simple fusion at score level. This combination outperforms our best standalone system (by up to 11% relative improvement in terms of EER).
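The sketch below conveys the low-parameter CDNN idea on log-mel spectrogram inputs. The architecture, input size, and resulting parameter count are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Sketch: a small convolutional network over log-mel spectrogram patches
# for 8-language identification on short segments.
import torch
import torch.nn as nn

class CdnnLid(nn.Module):
    def __init__(self, n_langs=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),       # pool over remaining time/frequency
        )
        self.out = nn.Linear(32, n_langs)

    def forward(self, spec):               # spec: (batch, 1, n_mels, frames)
        return self.out(self.conv(spec).flatten(1))

model = CdnnLid()
print(sum(p.numel() for p in model.parameters()))  # 13512: a very small model
print(model(torch.randn(4, 1, 40, 300)).shape)     # torch.Size([4, 8])
```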

6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020
This paper describes BUT's efforts in the development of the system for the CHiME-6 challenge with far-field dinner-party recordings. Our experiments cover both the diarization and speech recognition parts of the system. For diarization, we employ the VBx framework, which uses a Bayesian hidden Markov model with eigenvoice priors on x-vectors. For acoustic modeling, we explore using different subsets of data for training, different neural network architectures, discriminative training, more robust i-vectors, and semi-supervised training on VoxCeleb data. In addition, we perform experiments with a neural network-based language model, exploring how to overcome the small size of the text corpus and how to incorporate cross-segment context. When fusing our best systems, we achieve 41.21% / 42.55% WER on Track 1 (development and evaluation, respectively), and 55.15% / 69.04% on Track 2 (development and evaluation, respectively). Aside from the techniques used in our final submitted systems, we also describe our efforts in end-to-end diarization and end-to-end speech recognition.

Interspeech 2020, 2020
In this paper, we present the winning BUT submission for the text-dependent task of the SdSV Challenge 2020. Given the large amount of training data available in this challenge, we explore successful techniques from text-independent systems in the text-dependent scenario. In particular, we trained x-vector extractors on both in-domain and out-of-domain datasets and combined them with i-vectors trained on concatenated MFCCs and bottleneck features, which have proven effective for the text-dependent scenario. Moreover, we propose the use of a phrase-dependent PLDA backend for scoring and its combination with a simple phrase recognizer, which brings up to 63% relative improvement on our development set with respect to using standard PLDA. Finally, we combine our different i-vector- and x-vector-based systems using a simple linear logistic-regression score-level fusion, which provides 28% relative improvement on the evaluation set with respect to our best single system.
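Linear logistic-regression score-level fusion of the kind mentioned above can be sketched in a few lines. The scores below are synthetic placeholders, and the exact fusion training (e.g., calibration-aware variants) may differ from what was used in the submission.

```python
# Sketch: fuse per-trial scores from several subsystems with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_systems = 1000, 4
labels = rng.integers(0, 2, n_trials)            # 1 = target trial
# Each subsystem produces a noisy score correlated with the label.
scores = labels[:, None] * 2.0 + rng.normal(0, 1.5, (n_trials, n_systems))

fusion = LogisticRegression().fit(scores, labels)
fused = fusion.decision_function(scores)         # fused score per trial
print(fusion.coef_)                              # learned per-system weights
print(fused[:5])
```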
The Speaker and Language Recognition Workshop (Odyssey 2020), 2020
We present a condensed description and analysis of the joint submission of the ABC team for NIST SRE 2019, by BUT, CRIM, Phonexia, Omilia and UAM. We concentrate on challenges that arose during development, and we analyze the results obtained on the evaluation data and on our development sets. The conversational telephone speech (CMN2) condition is challenging for current state-of-the-art systems, mainly due to the language mismatch between training and test data. We show that a combination of adversarial domain adaptation, backend adaptation and score normalization can mitigate this mismatch. On the VAST condition, we demonstrate the importance of deploying diarization when dealing with multi-speaker utterances and the drastic improvements that can be obtained by combining audio and visual modalities.
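One standard form of the score normalization mentioned above is symmetric normalization (s-norm). The sketch below is a generic illustration, assuming cohort scores are available for both sides of a trial; it is not necessarily the exact variant (e.g., adaptive s-norm with top-N cohort selection) used in the submission.

```python
# Sketch: symmetric score normalization (s-norm). A trial score is z-normalized
# against cohort scores of the enrollment and test sides, then averaged.
import numpy as np

def s_norm(score, enroll_cohort_scores, test_cohort_scores):
    """score: raw trial score; *_cohort_scores: scores of each side vs. a cohort."""
    z = (score - enroll_cohort_scores.mean()) / enroll_cohort_scores.std()
    t = (score - test_cohort_scores.mean()) / test_cohort_scores.std()
    return 0.5 * (z + t)

rng = np.random.default_rng(0)
print(s_norm(3.0, rng.normal(0, 1, 200), rng.normal(0.5, 1, 200)))
```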

Odyssey 2018 The Speaker and Language Recognition Workshop, Jun 26, 2018
In this work, we analyze different designs of a language identification (LID) system based on embeddings. In our case, an embedding represents a whole utterance (or a speech segment of variable duration) as a fixed-length vector (similar to the i-vector). Moreover, this embedding aims to capture information relevant to the target task (LID), and it is obtained by training a deep neural network (DNN) to classify languages. In particular, we trained a DNN based on bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) layers, whose frame-by-frame outputs are summarized into mean and standard deviation statistics for each utterance. After this pooling layer, we add two fully connected layers whose outputs are used as embeddings, which are afterwards modeled by a Gaussian linear classifier (GLC). For training, we add a softmax output layer and train the whole network with a multi-class cross-entropy objective to discriminate between languages. We analyze the effect of using data augmentation in the DNN training, as well as different input features and architecture hyper-parameters, obtaining configurations that gradually improved the performance of the embedding system. We report our results on the NIST LRE 2017 evaluation dataset and compare the performance of embeddings with a reference i-vector system. We show that the best configuration of our embedding system outperforms the strong reference i-vector system by 3% relative, and that this is further pushed up to a 10% relative improvement via a simple score-level fusion.
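The pooling-plus-embedding design described above can be sketched as follows. The dimensions, number of languages, and activation placement are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: BLSTM frame outputs -> mean/std statistics pooling -> two fully
# connected layers whose outputs serve as utterance-level embeddings.
import torch
import torch.nn as nn

class BlstmEmbedding(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, emb_dim=200, n_langs=14):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.fc1 = nn.Linear(4 * hidden, emb_dim)  # mean+std of 2*hidden features
        self.fc2 = nn.Linear(emb_dim, emb_dim)
        self.softmax_layer = nn.Linear(emb_dim, n_langs)  # training-time classifier

    def forward(self, feats):                      # (batch, frames, feat_dim)
        h, _ = self.blstm(feats)                   # (batch, frames, 2*hidden)
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)
        emb_a = self.fc1(stats)                    # embeddings taken here...
        emb_b = self.fc2(torch.relu(emb_a))        # ...and/or here
        return self.softmax_layer(torch.relu(emb_b)), emb_a, emb_b

logits, emb_a, emb_b = BlstmEmbedding()(torch.randn(2, 300, 40))
print(logits.shape, emb_a.shape, emb_b.shape)
```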

PLOS ONE, 2018
Speech analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT), which implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a mel-frequency-scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common speech features in speech processing for the last decades. These features were particularly well suited to the previous Hidden Markov Model/Gaussian Mixture Model (HMM/GMM) state of the art in ASR. In particular, they produced highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal-covariance GMMs, for dealing with the curse of dimensionality, and for the limited computing resources of a decade ago. Currently most ASR systems use Deep Neural Networks (DNNs) instead of GMMs for modeling the acoustic features, which provides more flexibility regarding the definition of the features. In particular, acoustic features can be highly correlated and can be much larger in size, because DNNs are very powerful at processing high-dimensionality inputs. Also, computing hardware has reached a level of evolution that makes computational cost in speech processing a less relevant issue. In this context we have decided to revisit the problem of time-frequency resolution in speech analysis, and in particular to check whether multi-resolution speech analysis (both in time and frequency) can help improve acoustic modeling using DNNs. Our experiments start from several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations: concatenating different spectra computed using different time-frequency resolutions, and different post-processed and speaker-adapted features using different time-frequency resolutions. Our experiments show that using a multi-resolution speech representation tends to improve over the baseline single-resolution speech representation, which seems to confirm our main hypothesis. However, combining multi-resolution with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yields only very modest improvements.
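The multi-resolution representation can be illustrated with a short sketch that computes log-mel features at several STFT window lengths and stacks them. The window lengths, hop size, and use of librosa are assumptions for illustration, not the Kaldi setup used in the paper.

```python
# Sketch: concatenate log-mel spectrograms of the same signal computed with
# several STFT window lengths (different time-frequency resolutions).
import numpy as np
import librosa

def multires_features(y, sr=16000, win_lengths=(200, 400, 800),
                      hop=160, n_mels=40):
    """Stack log-mel features computed at several time-frequency resolutions."""
    feats = []
    for win in win_lengths:
        m = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=1024, win_length=win, hop_length=hop, n_mels=n_mels)
        feats.append(np.log(m + 1e-10))
    return np.concatenate(feats, axis=0)   # (len(win_lengths)*n_mels, frames)

y = np.random.randn(16000).astype(np.float32)  # 1 s of noise as a stand-in signal
print(multires_features(y).shape)              # (120, 101)
```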