Academia.edu

Voice Conversion

398 papers
328 followers
About this topic
Voice conversion is a technology and research field focused on transforming a source speaker's voice characteristics to sound like those of a target speaker, while preserving the linguistic content. It involves signal processing, machine learning, and speech synthesis techniques to manipulate vocal attributes such as pitch, timbre, and accent.

Key research themes

1. How can residual prediction techniques improve spectral detail and naturalness in voice conversion?

This theme investigates methods to predict or reconstruct the residual (excitation) signals in voice conversion frameworks, aiming to enhance the spectral details and naturalness of the converted speech. Since spectral envelope transformation alone often results in over-smoothed or synthetic-sounding speech, incorporating accurate residual prediction is critical. Research focuses on comparing residual prediction techniques, modeling the correlation between spectral features and residuals, and developing methods that better preserve speaker-dependent excitation characteristics.

Key finding: Compared several existing residual prediction methods and proposed a novel approach that predicts target residuals conditioned on converted spectral features using line spectral frequency-based transforms. Experimental... Read more
Key finding: Identified that probabilistic spectral envelope transformations based on Gaussian mixture models suffer from over-smoothing and modeling errors, leading to degraded speech quality. Proposed a novel deterministic spectral... Read more
Key finding: Applied deep neural networks and Gaussian mixture models to convert esophageal speech features into laryngeal speech vocal tract features, using a voice conversion approach specifically accounting for pathological vocal tract... Read more
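The split between spectral envelope and residual that this theme relies on can be illustrated with plain linear prediction. The sketch below is an assumption-laden illustration, not the method of any paper listed here: it fits an all-pole envelope to one frame with the autocorrelation method, inverse-filters the frame to obtain the residual (excitation), and shows that filtering the residual back through the envelope recovers the frame. The function names `lpc_coefficients`, `lpc_residual`, and `lpc_synthesize` are hypothetical helpers introduced for this example.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC (hypothetical helper, for illustration only)."""
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])  # x[t] is predicted as sum_k a[k] * x[t-1-k]

def lpc_residual(frame, a):
    """Inverse filter: e[t] = x[t] - sum_k a[k] * x[t-1-k]."""
    order = len(a)
    padded = np.concatenate([np.zeros(order), frame])
    e = np.empty(len(frame))
    for t in range(len(frame)):
        past = padded[t : t + order][::-1]  # x[t-1], x[t-2], ..., x[t-order]
        e[t] = frame[t] - np.dot(a, past)
    return e

def lpc_synthesize(residual, a):
    """All-pole synthesis: x[t] = e[t] + sum_k a[k] * x[t-1-k]."""
    order = len(a)
    out = np.zeros(len(residual) + order)
    for t in range(len(residual)):
        past = out[t : t + order][::-1]
        out[t + order] = residual[t] + np.dot(a, past)
    return out[order:]
```

In a conversion setting only the envelope parameters are transformed, so the residual passed to the synthesis filter has to be predicted for the target speaker rather than copied from the source; that prediction step is the subject of the papers above and is not modelled here.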

2. What strategies enable non-parallel voice conversion by establishing frame-level or sequence-level mappings without shared parallel data?

Non-parallel voice conversion (VC) methods aim to build conversion systems without the need for parallel utterances of source and target speakers, which is significant for practical deployment. This research theme explores algorithms to discover correspondences between frames or segments of unaligned source and target speech through clustering, recognition, or iterative alignment methods. Approaches include DNN-HMM-based frame recognition, iterative nearest-neighbor alignment with temporal context, and latent space embeddings to create mapping functions, prioritizing alignment accuracy and quality of synthesized speech.

Key finding: Introduced a method that uses a DNN-HMM speech recognizer to generate phoneme posterior (pseudo likelihood) vectors for source and target frames, enabling clustering and frame mapping without parallel data via similarity... Read more
Key finding: Presented Temporal-Context INCA (TC-INCA), a generalization of the iterative nearest neighbor and conversion alignment (INCA) method, that incorporates temporal context vectors (sequences of features) rather than single... Read more
Key finding: Explored flow-based generative models to achieve non-parallel voice conversion without requiring text transcriptions or phonetic alignment by learning invertible, lossless encodings of speech spectrograms. Demonstrated... Read more
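The iterative alignment idea mentioned in this theme can be sketched as a loop that alternates a nearest-neighbour search with re-estimation of a conversion function. The code below is a deliberately simplified illustration under stated assumptions: it uses an affine least-squares map where published systems typically train a statistical conversion model, and it adds temporal context by stacking neighbouring frames before the search. All names (`stack_context`, `nearest_neighbours`, `align_nonparallel`) and parameters (`width`, `iterations`) are hypothetical and do not come from the papers listed.

```python
import numpy as np

def stack_context(frames, width):
    """Concatenate each frame with `width` neighbours per side (edge frames repeat)."""
    n = len(frames)
    offsets = np.arange(-width, width + 1)
    idx = np.clip(np.arange(n)[:, None] + offsets[None, :], 0, n - 1)
    return frames[idx].reshape(n, -1)

def nearest_neighbours(a, b):
    """For every row of a, the index of the closest row of b (squared Euclidean)."""
    d2 = (a ** 2).sum(1)[:, None] - 2 * a @ b.T + (b ** 2).sum(1)[None, :]
    return d2.argmin(axis=1)

def align_nonparallel(src, tgt, width=1, iterations=5):
    """Alternate nearest-neighbour pairing and affine re-fitting (simplified sketch)."""
    design = np.hstack([src, np.ones((len(src), 1))])  # source frames plus bias term
    tgt_ctx = stack_context(tgt, width)
    converted = src.copy()
    for _ in range(iterations):
        # Pair each converted source frame (with context) to its closest target frame
        pairs = nearest_neighbours(stack_context(converted, width), tgt_ctx)
        # Re-fit the affine map from original source frames to the paired targets
        W, *_ = np.linalg.lstsq(design, tgt[pairs], rcond=None)
        converted = design @ W
    return pairs, W
```

The loop carries no convergence guarantee in this form; it is meant only to show how pairing and conversion feed each other when no parallel utterances are available.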

3. How can system fusion and hybrid methods enhance voice conversion performance by leveraging complementary strengths of distinct approaches?

This theme addresses the combination of multiple voice conversion techniques to harness their complementary advantages, such as statistical robustness, spectral detail preservation, and prosodic naturalness. By fusing systems like Gaussian mixture models (GMM) and frequency warping (FW), or exemplar-based and parametric methods, researchers can create hybrid frameworks that yield better speaker similarity and naturalness than individual methods. The research evaluates the feasibility, integration designs, and empirical gains in objective and subjective metrics.

Key finding: Proposed a system fusion framework combining Gaussian mixture model (GMM)-based statistical parametric and frequency warping (FW) based voice conversion methods to leverage GMM's modeling of spectral envelopes and FW's... Read more
Key finding: Developed a fast locally linear embedding (LLE) algorithm for exemplar-based non-parametric VC, precomputing local clusters offline to reduce online matrix inversion complexity. The fast-LLE achieves comparable output quality... Read more
Key finding: Designed a multi-speaker parameter concatenation-based formant synthesizer using stored parameter sets and linear transformation functions to generate different speaker voices from a base set. The approach uses linear... Read more

All papers in Voice Conversion

The basic goal of a voice conversion system is to mimic the characteristics of the target speaker's voice while keeping the linguistic and paralinguistic information intact. The characteristics of a speaker in speech reflect at different level... more
Recent advancements in voice conversion systems have been largely driven by deep learning techniques, enabling the high-quality synthesis of human speech. However, existing models often fail to generate emotionally expressive speech,... more
The creation of the dataset has been supported by Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) on behalf of the German Ministry for Economic Cooperation and Development.
Two speech processing systems have been developed for real-time and non-real-time voice conversion. Using the real-time processing the user can apply conversion during voice over IP (VoIP) calls imitating identity of a specified target... more
A real-time pitch modification system has been developed. The implemented processing scheme is based on hybrid deterministic/stochastic decomposition of the signal and includes extraction of instantaneous pitch, pitch-synchronous... more
Radio frequencies for controller pilot communication are becoming a scarce resource due to increasing air traffic worldwide. The controller pilot data link communication (CPDLC) technology, which is already mandatory in new aircraft,... more
The NAM-to-speech conversion proposed by Toda and colleagues which converts Non-Audible Murmur (NAM) to audible speech by statistical mapping trained using aligned corpora is a very promising technique, but its performance is still... more
In this paper, we describe a concatenative synthesis system which was first designed for a realistic synthesis of melodic phrases. It has since been augmented to become an experimental TTS (Text-to-Speech) synthesizer. Today, it is... more
Spectral envelopes are very useful in sound analysis and synthesis because of their connection with production and perception models, and their ability to capture and to manipulate important properties of sound using easily understandable... more
The recent explosive growth of voice over IP (VoIP) solutions calls for accurate modelling of VoIP traffic. This paper presents measurements of ON and OFF periods of VoIP activity from a significantly large database of VoIP call... more
This paper describes the implementation of a unit selection text-to-speech system that incorporates a statistical model Cost (sCost), in addition to target and join costs, for controlling the selection of unit candidates. sCost, a quality... more
This work explores the use of artificial intelligence in creating resources for listening comprehension practice in the teaching of ancient Greek as a spoken language. Starting from the importance of the acquisition of this... more
A method for modifying voice quality attributes, i.e. breathiness and roughness, is presented in the context of voice conversion. Both breathiness and roughness of a speaker are collectively modelled by harmonic peak-to-valley ratio... more
This paper presents a novel approach for enhancing esophageal speech using voice conversion techniques. Esophageal speech (ES) is an alternative voice that allows a patient with no vocal cords to produce sounds after total laryngectomy:... more
In this article, we present an Automatic Speech Recognition (ASR) system combining acoustic data and visual data. This audiovisual recognition system uses as its recognition engine... more
The paper presents and evaluates a speaker de-identification technique using speech recognition and two speech synthesis techniques. The phoneme recognition system is built using HMM-based acoustical models of context-dependent diphone... more
Memorandum 4152. Explicit Modelling of State Duration Correlations in Hidden Markov Models. M J Russell and L Siine, September 1988. Abstract: In recent years considerable effort has been directed... more
In this paper, new techniques to improve whisper-to-speech conversion are investigated, in the framework of silent speech telephone communication. A preliminary conversion method from Non-Audible Murmur (NAM) to modal speech, based on... more
The compact representation of harmonic amplitudes in the sinusoidal coding of voiced speech is often achieved by the all-pole modeling of a spectral envelope. The perceptual accuracy of the representation may be enhanced by the use of... more
This paper presents a pitch modification scheme, based on the recursive least-squares (RLS) adaptive algorithm, for speech and singing voice signals. The RLS filter is used to determine the linear prediction (LP) model on a... more
In science, the efficient storage and management of data are crucial for research, analysis, and collaboration. Various file formats have been developed to meet the specific needs of... more
This paper describes the evaluation process of an emotional speech database recorded for standard Basque in order to determine its adequacy for the analysis of emotional models and its use in speech synthesis. The corpus consists of seven... more
In this paper, we propose a real-time method for duration modification of speech for packet based communication system. While there is rich literature available on duration modification, it fails to clearly address the issues in real-time... more
The evaluation procedure to choose a low bit rate voice coding algorithm is described for the Australian land mobile satellite system. The procedure is designed to assess both the inherent quality of the codec under 'normal'... more
This paper addresses a phase-related feature that is time-shift invariant, and that expresses the relative phases of all harmonics with respect to that of the fundamental frequency. We identify the feature as Normalized Relative Delay... more
Vowel recognition is frequently based on Linear Prediction (LP) analysis and formant estimation techniques. However, the performance of these techniques decreases in the case of female or child speech because at high pitch frequencies... more
Residual prediction is a technique that aims at recovering the spectral details of speech that was encoded using parameterizations such as linear predictive coefficients. Example applications of residual prediction are hidden Markov model-based... more
1.4. Contribution of the Thesis. The contributions of this thesis can be divided into three parts.
This paper presents an empirical study of several policies for managing the effect of delay jitter on the playout of audio and video in computer-based conferences. The problem addressed is that of managing the fundamental tradeoff between... more
There is a great demand to assess video quality transmitted in real time over packet networks, and to make this assessment in real time too. Quality assessment is achieved using two types of methods: objective or subjective. Subjective... more
by An Ji
This paper is NOT THE PUBLISHED VERSION; but the author's final, peer-reviewed manuscript. The published version may be accessed by following the link in the citation below.
Unconstrained face recognition remains a challenging problem due to intra-class variations caused by occlusion, disguise, varying orientations, facial expressions, age variations, and illumination in real circumstances, etc. The... more
We propose a novel application based on acoustic-to-articulatory inversion (AAI) towards quality assessment of voice converted speech. The ability of humans to speak effortlessly requires coordinated movements of various articulators,... more
Recently, major services provided by mobile communications systems are shifting from voice conversations to data communications over the Internet. There is a strong demand for increasing the data transmission rate. However, an important... more
Speaker’s identity is the most crucial information exploited (implicitly) by an Automatic Speaker Verification (ASV) system. Numerous attacks can be obliterated simultaneously if privacy preservation is exercised for a speaker’s identity.... more
We present a multi-speaker formant synthesizer based on parameter concatenation. The user can choose among three speakers, two males and one female. The synthesizer stores all the parameters for the basic speaker and linear transformation... more
Generally defined, speech modification is the process of changing certain perceptual properties of speech while leaving other properties unchanged. Among the many types of speech information that may be altered are rate of articulation,... more