Academia.edu

Speech Reconstruction

18 papers
2 followers

About this topic
Speech reconstruction is the process of recreating spoken language from various forms of input, such as brain activity, audio signals, or text. It involves the application of signal processing, machine learning, and linguistic analysis to accurately restore or synthesize intelligible speech, often for applications in communication aids and neuroscience.
Recent advancements in voice conversion systems have been largely driven by deep learning techniques, enabling the high-quality synthesis of human speech. However, existing models often fail to generate emotionally expressive speech,... more
This paper presents a zero-shot learning framework for speech-based emotion recognition across languages using contrastive learning. Traditional emotion AI models often depend on large, labeled corpora in a specific language, limiting... more
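Contrastive objectives of this kind typically take an InfoNCE-like form that pulls a speech embedding toward its paired target embedding and pushes it away from the rest of the batch. The paper's exact loss is not shown in this excerpt, so the following PyTorch snippet is only a minimal sketch under that assumption (the name `info_nce` and the embedding shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(speech_emb: torch.Tensor, target_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss: row i of speech_emb should match
    row i of target_emb, and every other row in the batch is a negative."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = s @ t.T / temperature          # (B, B) cosine-similarity matrix
    labels = torch.arange(s.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# toy usage: a batch of 8 paired 256-dim embeddings
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```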
The presented research focuses on the challenging task of creating lip-sync facial videos that align with a specified target speech segment. A novel deep-learning model has been developed to produce precise synthetic lip movements... more
Lipreading involves using visual data to recognize spoken words by analyzing the movements of the lips and surrounding area. It is a hot research topic with many potential applications, such as human-machine interaction and enhancing... more
Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker... more
Computational speech reconstruction algorithms have the ultimate aim of returning natural sounding speech to aphonic and dysphonic individuals. These algorithms can also be used by unimpaired speakers for communicating sensitive or... more
An articulatory speech synthesiser is presented that extends the Recurrent Gradient-based Motor Inference Model for Speech Resynthesis [1] with a classifier optimizing for semantic discrimination. The new design features show promising... more
A growing need for on-device machine learning has led to an increased interest in lightweight neural networks that lower model complexity while retaining performance. While a variety of general-purpose techniques exist in this context,... more
Recent advances in brain decoding have made it possible to classify image categories based on neural activity. Increasing numbers of studies have further attempted to reconstruct the image itself. However, because images of objects and... more
Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech; however, they require a large quantity of training data. This makes creating models for multiple styles expensive and... more
Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a... more
Much of the recent literature on automatic speech recognition (ASR) is taking an end-to-end approach. Unlike English, where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose... more
We present SL-ReDu, a recently commenced innovative project that aims to exploit deep-learning progress to advance the state-of-the-art in video-based automatic recognition of Greek Sign Language (GSL), while focusing on the use-case of... more
Dubbing contributes to a larger international distribution of multimedia documents. It aims to replace the original voice in a source language with a new one in a target language. For now, the target voice selection procedure, called voice... more
Speech-driven facial animation is useful for a variety of applications such as telepresence, chatbots, etc. The necessary attributes of a realistic face animation are (1) audiovisual synchronization and (2) identity preservation of the... more
Previously proposed FullSubNet has achieved outstanding performance in the Deep Noise Suppression (DNS) Challenge and attracted much attention. However, it still encounters issues such as input-output mismatch and coarse processing for... more
While Automatic Speech Recognition (ASR) models have shown significant advances with the introduction of unsupervised or self-supervised training techniques, these improvements are still only limited to a subsection of languages and... more
In the absence of large-scale in-domain supervised training data, ASR models can achieve reasonable performance through pre-training on additional data that is unlabeled, mismatched or both. Given such data constraints, we compare... more
The aim of this work is to investigate the impact of cross-modal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose... more
Personal assistants are becoming more pervasive in our environments but still do not provide natural interactions. Their lack of realism in terms of expressiveness and their lack of visual feedback can create frustrating experiences and... more
Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing... more
Lip-reading is a technique to understand speech by observing a speaker’s lip movements. It has numerous applications; for example, it is helpful for hearing-impaired persons and for understanding speech in noisy environments. Most of the... more
This is the age of instant gratification. Browsing an entire publication is slow, so we propose an application that summarizes your reading in a fraction of the time using AI technologies. The system is composed of Optical... more
Given an arbitrary face image and an arbitrary speech clip, the proposed work attempts to generate the talking face video with accurate lip synchronization. Existing works either do not consider temporal dependency across video frames... more
Silent speech decoding (SSD), based on articulatory neuromuscular activities, has become a prevalent task of brain–computer interfaces (BCIs) in recent years. Many works have been devoted to decoding surface electromyography (sEMG) from... more
In this work, a novel direct speech-to-speech methodology for translation is proposed; it is based on an LSTM neural network structure which has proven useful for translation in the classical way, i.e., the one consisting of three stages:... more
Talking head generation is to synthesize a lip-synchronized talking head video by inputting an arbitrary face image and corresponding audio clips. Existing methods ignore not only the interaction and relationship of cross-modal... more
The task of talking head generation is to synthesize a lip-synchronized talking head video by inputting an arbitrary face image and audio clips. Most existing methods ignore the local driving information of the mouth muscles. In this... more
Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms... more
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has... more
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions,... more
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate... more
In this paper, we present an audio-visual model to perform speech super-resolution at large scale-factors (8× and 16×). Previous works attempted to solve this problem using only the audio modality as input and thus were limited to low... more
In this work, we rethink the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent... more
We address the problem of generating speech from silent lip videos for any speaker in the wild. Previous works train either on large amounts of data of isolated speakers or in laboratory settings with a limited vocabulary.... more
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and... more
This work introduces a predominantly parallel, end-to-end TTS model based on normalizing flows. It extends prior parallel approaches by additionally modeling speech rhythm as a separate generative distribution to facilitate variable token... more
In this paper we propose WaveGlow: a flow-based network capable of generating high-quality speech from mel-spectrograms. WaveGlow combines insights from Glow [1] and WaveNet [2] in order to provide fast, efficient and high-quality audio... more
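The exact-likelihood training that flow models like WaveGlow rely on comes from invertible coupling layers. The sketch below shows a generic Glow-style affine coupling layer, not WaveGlow's actual architecture (the mel-spectrogram conditioning and invertible 1x1 convolutions are omitted, and the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Glow-style affine coupling: half the channels pass through untouched
    and parameterize a scale/shift of the other half, so the layer is
    exactly invertible with a cheap log-determinant."""
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),  # emits log_s and b
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, b = self.net(xa).chunk(2, dim=1)
        yb = xb * torch.exp(log_s) + b
        log_det = log_s.sum(dim=(1, 2))     # feeds the exact likelihood
        return torch.cat([xa, yb], dim=1), log_det

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        log_s, b = self.net(ya).chunk(2, dim=1)
        return torch.cat([ya, (yb - b) * torch.exp(-log_s)], dim=1)

# sanity check: a (batch, channels, time) tensor round-trips exactly
layer = AffineCoupling(8)
x = torch.randn(2, 8, 100)
y, _ = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)
```

Because half the channels pass through unchanged, the inverse is available in closed form and the Jacobian log-determinant reduces to a sum of the predicted log-scales, which is what makes fast invertible synthesis possible.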
Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused by data transmission failures in packet switched networks. This is a common problem, and of increasing importance as end-to-end VoIP telephony and... more
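A useful reference point for PLC methods is the classical baseline: repeat the last correctly received frame and fade it out over consecutive losses. The NumPy snippet below shows that baseline only, not the approach of the paper above; the frame size and fade rate are illustrative:

```python
import numpy as np

def conceal(last_frame: np.ndarray, n_lost: int) -> np.ndarray:
    """Classical PLC baseline: repeat the last correctly received frame
    for each lost packet, with a linear fade-out so long gaps decay to
    silence instead of producing an audible buzz."""
    frames = [last_frame * max(0.0, 1.0 - i / 4) for i in range(n_lost)]
    return np.concatenate(frames)

# toy usage: 20 ms frames at 16 kHz, a burst of 3 lost packets
last = np.random.randn(320).astype(np.float32)
filler = conceal(last, 3)   # 960 samples bridging the gap
```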
One of the main challenges for end-to-end speech translation is data scarcity. We leverage pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech translation model. This provides 8.3 and 5.7 BLEU gains over a... more
We present a voice conversion solution using recurrent sequence-to-sequence modeling for DNNs. Our solution takes advantage of recent advances in attention-based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech... more
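The attention mechanism such seq2seq models borrow from NMT reduces, at each decoder step, to a softmax-weighted sum over encoder frames. A minimal scaled dot-product version follows (names and dimensions are illustrative, not the paper's exact formulation):

```python
import torch

def attention(query: torch.Tensor, keys: torch.Tensor,
              values: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: score the decoder query against all
    encoder keys, softmax the scores, and return the weighted sum of values."""
    scores = query @ keys.transpose(-2, -1) / keys.size(-1) ** 0.5
    weights = torch.softmax(scores, dim=-1)   # attention over encoder frames
    return weights @ values

# toy usage: one decoder step attending over 50 encoder frames, 128-dim
context = attention(torch.randn(1, 1, 128),
                    torch.randn(1, 50, 128),
                    torch.randn(1, 50, 128))
```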
This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes... more
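Estimating the gradient of the log conditional density is commonly trained via a denoising objective: noise the clean waveform at a random level and regress the injected noise. The sketch below illustrates that generic training step, not WaveGrad 2's exact implementation (the `model` signature, the conditioning input, and the schedule are all assumptions):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, waveform, cond, alpha_bar):
    """Denoising objective: corrupt the clean waveform with Gaussian noise
    at a random level from the schedule, then train the model to predict
    the injected noise (proportional to the gradient of the log density)."""
    b = waveform.size(0)
    t = torch.randint(0, alpha_bar.size(0), (b,))   # random noise level
    a = alpha_bar[t].view(b, 1)
    eps = torch.randn_like(waveform)
    noisy = a.sqrt() * waveform + (1 - a).sqrt() * eps
    return F.mse_loss(model(noisy, cond, t), eps)

# toy usage with a stand-in model that ignores its conditioning
dummy_model = lambda noisy, cond, t: torch.zeros_like(noisy)
schedule = torch.linspace(0.99, 0.01, 50)           # toy alpha-bar schedule
loss = diffusion_loss(dummy_model, torch.randn(4, 16000), None, schedule)
```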
Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly... more
Audiovisual speech enhancement (AVSE) methods use both audio and visual features for the task of speech enhancement and the use of visual features has been shown to be particularly effective in multi-speaker scenarios. In the majority of... more
This paper presents an audio-visual approach for voice separation which outperforms state-of-the-art methods at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained... more