Academia.edu

Speech Reconstruction

18 papers
2 followers

About this topic
Speech reconstruction is the process of recreating spoken language from various forms of input, such as brain activity, audio signals, or text. It involves the application of signal processing, machine learning, and linguistic analysis to accurately restore or synthesize intelligible speech, often for applications in communication aids and neuroscience.
Recent advancements in voice conversion systems have been largely driven by deep learning techniques, enabling the high-quality synthesis of human speech. However, existing models often fail to generate emotionally expressive speech,... more
This paper presents a zero-shot learning framework for speech-based emotion recognition across languages using contrastive learning. Traditional emotion AI models often depend on large, labeled corpora in a specific language, limiting... more
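Contrastive objectives of this kind typically take an InfoNCE-like form that pulls a speech embedding toward its paired target embedding and pushes it away from the rest of the batch. The paper's exact loss is not shown in this excerpt, so the following PyTorch snippet is only a minimal sketch under that assumption (the name `info_nce` and the embedding shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(speech_emb: torch.Tensor, target_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss: row i of speech_emb should match
    row i of target_emb, and every other row in the batch is a negative."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = s @ t.T / temperature          # (B, B) cosine-similarity matrix
    labels = torch.arange(s.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# toy usage: a batch of 8 paired 256-dim embeddings
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```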
The presented research focuses on the challenging task of creating lip-sync facial videos that align with a specified target speech segment. A novel deep-learning model has been developed to produce precise synthetic lip movements... more
Lipreading involves using visual data to recognize spoken words by analyzing the movements of the lips and surrounding area. It is a hot research topic with many potential applications, such as human-machine interaction and enhancing... more
Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker... more
Computational speech reconstruction algorithms have the ultimate aim of returning natural sounding speech to aphonic and dysphonic individuals. These algorithms can also be used by unimpaired speakers for communicating sensitive or... more
An articulatory speech synthesiser is presented that extends the Recurrent Gradient-based Motor Inference Model for Speech Resynthesis [1] with a classifier optimizing for semantic discrimination. The new design features show promising... more
A growing need for on-device machine learning has led to an increased interest in lightweight neural networks that lower model complexity while retaining performance. While a variety of general-purpose techniques exist in this context,... more
Recent advances in brain decoding have made it possible to classify image categories based on neural activity. Increasing numbers of studies have further attempted to reconstruct the image itself. However, because images of objects and... more
Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech; however, they require a large quantity of training data. This makes creating models for multiple styles expensive and... more
Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a... more
Much of the recent literature on automatic speech recognition (ASR) is taking an end-to-end approach. Unlike English, where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose... more
We present SL-ReDu, a recently commenced innovative project that aims to exploit deep-learning progress to advance the state-of-the-art in video-based automatic recognition of Greek Sign Language (GSL), while focusing on the use-case of... more
Dubbing contributes to a larger international distribution of multimedia documents. It aims to replace the original voice in a source language with a new one in a target language. For now, the target voice selection procedure, called voice... more
Speech-driven facial animation is useful for a variety of applications such as telepresence, chatbots, etc. The necessary attributes of a realistic face animation are (1) audiovisual synchronization and (2) identity preservation of the... more
Previously proposed FullSubNet has achieved outstanding performance in the Deep Noise Suppression (DNS) Challenge and attracted much attention. However, it still encounters issues such as input-output mismatch and coarse processing for... more
While Automatic Speech Recognition (ASR) models have shown significant advances with the introduction of unsupervised or self-supervised training techniques, these improvements are still only limited to a subsection of languages and... more
In the absence of large-scale in-domain supervised training data, ASR models can achieve reasonable performance through pre-training on additional data that is unlabeled, mismatched or both. Given such data constraints, we compare... more
The aim of this work is to investigate the impact of cross-modal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose... more
Personal assistants are becoming more pervasive in our environments but still do not provide natural interactions. Their lack of realism in terms of expressiveness and their lack of visual feedback can create frustrating experiences and... more
Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing... more
Lip-reading is a technique to understand speech by observing a speaker’s lip movements. It has numerous applications; for example, it is helpful for hearing-impaired persons and for understanding speech in noisy environments. Most of the... more
This is the age of instant gratification. Browsing an entire publication is slow, so we propose an application that summarizes your reading in a fraction of the time using AI technologies. The system is composed of Optical... more
Given an arbitrary face image and an arbitrary speech clip, the proposed work attempts to generate the talking face video with accurate lip synchronization. Existing works either do not consider temporal dependency across video frames... more
Silent speech decoding (SSD), based on articulatory neuromuscular activities, has become a prevalent task of brain–computer interfaces (BCIs) in recent years. Many works have been devoted to decoding surface electromyography (sEMG) from... more
In this work, a novel direct speech-to-speech methodology for translation is proposed; it is based on an LSTM neural network structure which has proven useful for translation in the classical way, i.e., the one consisting of three stages:... more
Talking head generation is to synthesize a lip-synchronized talking head video by inputting an arbitrary face image and corresponding audio clips. Existing methods ignore not only the interaction and relationship of cross-modal... more
The task of talking head generation is to synthesize a lip-synchronized talking head video by inputting an arbitrary face image and audio clips. Most existing methods ignore the local driving information of the mouth muscles. In this... more
Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms... more
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has... more
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions,... more
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate... more
In this paper, we present an audio-visual model to perform speech super-resolution at large scale-factors (8× and 16×). Previous works attempted to solve this problem using only the audio modality as input and thus were limited to low... more
In this work, we rethink the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent... more
We address the problem of generating speech from silent lip videos for any speaker in the wild. Previous works train either on large amounts of data of isolated speakers or in laboratory settings with a limited vocabulary.... more
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and... more
This work introduces a predominantly parallel, end-to-end TTS model based on normalizing flows. It extends prior parallel approaches by additionally modeling speech rhythm as a separate generative distribution to facilitate variable token... more
In this paper we propose WaveGlow: a flow-based network capable of generating high-quality speech from mel-spectrograms. WaveGlow combines insights from Glow [1] and WaveNet [2] in order to provide fast, efficient and high-quality audio... more
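The exact-likelihood training that flow models like WaveGlow rely on comes from invertible coupling layers. The sketch below shows a generic Glow-style affine coupling layer, not WaveGlow's actual architecture (the mel-spectrogram conditioning and invertible 1x1 convolutions are omitted, and the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Glow-style affine coupling: half the channels pass through untouched
    and parameterize a scale/shift of the other half, so the layer is
    exactly invertible with a cheap log-determinant."""
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),  # emits log_s and b
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, b = self.net(xa).chunk(2, dim=1)
        yb = xb * torch.exp(log_s) + b
        log_det = log_s.sum(dim=(1, 2))     # feeds the exact likelihood
        return torch.cat([xa, yb], dim=1), log_det

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        log_s, b = self.net(ya).chunk(2, dim=1)
        return torch.cat([ya, (yb - b) * torch.exp(-log_s)], dim=1)

# sanity check: a (batch, channels, time) tensor round-trips exactly
layer = AffineCoupling(8)
x = torch.randn(2, 8, 100)
y, _ = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)
```

Because half the channels pass through unchanged, the inverse is available in closed form and the Jacobian log-determinant reduces to a sum of the predicted log-scales, which is what makes fast invertible synthesis possible.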
Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused by data transmission failures in packet switched networks. This is a common problem, and of increasing importance as end-to-end VoIP telephony and... more
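A useful reference point for PLC methods is the classical baseline: repeat the last correctly received frame and fade it out over consecutive losses. The NumPy snippet below shows that baseline only, not the approach of the paper above; the frame size and fade rate are illustrative:

```python
import numpy as np

def conceal(last_frame: np.ndarray, n_lost: int) -> np.ndarray:
    """Classical PLC baseline: repeat the last correctly received frame
    for each lost packet, with a linear fade-out so long gaps decay to
    silence instead of producing an audible buzz."""
    frames = [last_frame * max(0.0, 1.0 - i / 4) for i in range(n_lost)]
    return np.concatenate(frames)

# toy usage: 20 ms frames at 16 kHz, a burst of 3 lost packets
last = np.random.randn(320).astype(np.float32)
filler = conceal(last, 3)   # 960 samples bridging the gap
```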
One of the main challenges for end-to-end speech translation is data scarcity. We leverage pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech translation model. This provides 8.3 and 5.7 BLEU gains over a... more
We present a voice conversion solution using recurrent sequence-to-sequence modeling for DNNs. Our solution takes advantage of recent advances in attention-based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech... more
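The attention mechanism such seq2seq models borrow from NMT reduces, at each decoder step, to a softmax-weighted sum over encoder frames. A minimal scaled dot-product version follows (names and dimensions are illustrative, not the paper's exact formulation):

```python
import torch

def attention(query: torch.Tensor, keys: torch.Tensor,
              values: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: score the decoder query against all
    encoder keys, softmax the scores, and return the weighted sum of values."""
    scores = query @ keys.transpose(-2, -1) / keys.size(-1) ** 0.5
    weights = torch.softmax(scores, dim=-1)   # attention over encoder frames
    return weights @ values

# toy usage: one decoder step attending over 50 encoder frames, 128-dim
context = attention(torch.randn(1, 1, 128),
                    torch.randn(1, 50, 128),
                    torch.randn(1, 50, 128))
```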
This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes... more
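Estimating the gradient of the log conditional density is commonly trained via a denoising objective: noise the clean waveform at a random level and regress the injected noise. The sketch below illustrates that generic training step, not WaveGrad 2's exact implementation (the `model` signature, the conditioning input, and the schedule are all assumptions):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, waveform, cond, alpha_bar):
    """Denoising objective: corrupt the clean waveform with Gaussian noise
    at a random level from the schedule, then train the model to predict
    the injected noise (proportional to the gradient of the log density)."""
    b = waveform.size(0)
    t = torch.randint(0, alpha_bar.size(0), (b,))   # random noise level
    a = alpha_bar[t].view(b, 1)
    eps = torch.randn_like(waveform)
    noisy = a.sqrt() * waveform + (1 - a).sqrt() * eps
    return F.mse_loss(model(noisy, cond, t), eps)

# toy usage with a stand-in model that ignores its conditioning
dummy_model = lambda noisy, cond, t: torch.zeros_like(noisy)
schedule = torch.linspace(0.99, 0.01, 50)           # toy alpha-bar schedule
loss = diffusion_loss(dummy_model, torch.randn(4, 16000), None, schedule)
```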
Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly... more
Audiovisual speech enhancement (AVSE) methods use both audio and visual features for the task of speech enhancement and the use of visual features has been shown to be particularly effective in multi-speaker scenarios. In the majority of... more
This paper presents an audio-visual approach for voice separation which outperforms state-of-the-art methods at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained... more