Introduction to the Special Section on Voice Transformation

Olivier Rosec; Chung-hsien Wu

doi:10.1109/TASL.2010.2051826

Outline

Natural Language Processing

Introduction to the Special Section on Voice Transformation

Olivier Rosec

Chung-hsien Wu

2000, IEEE Transactions on Audio, Speech, and Language Processing

https://doi.org/10.1109/TASL.2010.2051826

visibility

…

description

3 pages

link

1 file

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract
AI

Voice Transformation encompasses the manipulation of non-linguistic speech signal information, such as voice quality and individuality. It includes diverse research areas, from speech production and perception to modeling speaking style. Unlike speaker-dependent technologies, Voice Transformation requires the effective modification of individual voice characteristics to ensure natural-sounding transformed speech. High-quality systems must acknowledge the nonlinear nature of speech and the interaction between vocal tract and source features, promoting advanced techniques for style mapping and transformation.

jitendra dhiman

2013

Generally defined, speech modification is the process of changing certain perceptual properties of speech while leaving other properties unchanged. Among the many types of speech information that may be altered are rate of articulation, pitch and formant characteristics.Modifying the speech parameters like pitch, duration and strength of excitation by desired factor is termed as prosody modification. In this thesis prosody modifications for voice conversion framework are presented. Among all the speech modifications for prosody two things are important firstly modification of duartion and pauses (Time scale modification) in a speech utterance and secondly modification of the pitch(pitch scale modification).Prosody modification involves changing the pitch and duration of speech without affecting the message and naturalness.In this work time scale and pitch scale modifications of speech are discussed using two methods Time Domain Pitch Synchronous Overlapped-Add (TD-PSOLA) and epoch b...

downloadDownload free PDF View PDFchevron_right

Parametric Formant Modelling and Transformation in Voice Conversion

S. Vaseghi

International Journal of Speech Technology, 2005

This paper presents a method for the estimation and mapping of parametric models of speech resonance at formants for voice conversion. The spectral features at formants that contribute to voice characteristics are the trajectories of the frequencies, the bandwidths and intensities of the resonance at formants. The formant features are extracted from the poles of a linear prediction (LP) model of speech. The statistical distributions of formants are modelled by a two-dimensional hidden Markov model (HMM) spanning the time and frequency dimensions. Experimental results are presented which show a close match between HMM-based formant models and the histograms of formants. For voice conversion two alternative methods are explored for mapping the formants of a source speaker to those of a target speaker. The first method is based on an adaptive formant-tracking warping of the frequency response of the LP model and the second method is based on the rotation of the poles of the LP model of speech. Both methods transform all spectral parameters of the resonance at formants of the source speaker towards those of the target speaker. In addition, the issues affecting the selection of the warping ratios for the mapping functions are investigated. Experimental results of formant estimation and perceptual evaluation of voice morphing based on parametric formant models are presented.

downloadDownload free PDF View PDFchevron_right

Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora

Olivier Rosec

IEEE Transactions on Audio, Speech, and Language Processing, 2000

In Voice Conversion (VC), the speech of a source speaker is modified to resemble that of a particular target speaker. Currently, standard VC approaches use Gaussian mixture model (GMM)-based transformations that do not generate high-quality converted speech due to "over-smoothing" resulting from weak links between individual source and target frame parameters. Dynamic Frequency Warping (DFW) offers an appealing alternative to GMM-based methods, as more spectral details are maintained in transformation; however, the speaker timbre is less successfully converted because spectral power is not adjusted explicitly. Previous work combines separate GMM-and DFW-transformed spectral envelopes for each frame. This paper proposes a more effective DFW-based approach that 1) does not rely on the baseline GMM methods, and 2) functions on the acoustic class level. To adjust spectral power, an amplitude scaling function is used that compares the average target and warped source log spectra for each acoustic class. The proposed DFW with Amplitude scaling (DFWA) outperforms standard GMM and hybrid GMM-DFW methods for VC in terms of both speech quality and timbre conversion, as is confirmed in extensive objective and subjective testing. Furthermore, by not requiring time-alignment of source and target speech, DFWA is able to perform equally well using parallel or nonparallel corpora, as is demonstrated explicitly.

downloadDownload free PDF View PDFchevron_right

Vivos Voco: A Survey of Recent Research on Voice Transformations at IRCAM

Stefan Huber

IRCAM has a long experience in analysis, synthesis and transformation of voice. Natural voice transformations are of great interest for many applications and can be combine with text-to-speech system, leading to a powerful creation tool. We present research conducted at IRCAM on voice transformations for the last few years. Transformations can be achieved in a global way by modifying pitch, spectral envelope, durations etc. While it sacrifices the possibility to attain a specific target voice, the approach allows the production of new voices of a high degree of naturalness with different gender and age, modified vocal quality, or another speech style. These transformations can be applied in realtime using ircamTools TRAX.Transformation can also be done in a more specific way in order to transform a voice towards the voice of a target speaker. Finally, we present some recent research on the transformation of expressivity.

downloadDownload free PDF View PDFchevron_right

Dynamic programming approach to voice transformation

Özgül SALOR

Speech Communication, 2006

This paper presents a voice transformation algorithm which modifies the speech of a source speaker such that it is perceived as if spoken by a target speaker. A novel method which is based on dynamic programming approach is proposed. The designed system obtains speaker-specific codebooks of line spectral frequencies (LSFs) for both source and target speakers. Those codebooks are used to train a mapping histogram matrix, which is used for LSF transformation from one speaker to the other. The baseline system uses the maxima of the histogram matrix for LSF transformation. The shortcomings of this system, which are the limitations of the target LSF space and the spectral discontinuities due to independent mapping of subsequent frames, have been overcome by applying the dynamic programming approach. Dynamic programming approach tries to model the long-term behaviour of LSFs of the target speaker, while it is trying to preserve the relationship between the subsequent frames of the source LSFs, during transformation. Both objective and subjective evaluations have been conducted and it has been shown that dynamic programming approach improves the performance of the system in terms of both the speech quality and speaker similarity.

downloadDownload free PDF View PDFchevron_right

A short review of research on voice transformations at IRCAM

Pierre Lanchantin, Axel Roebel

Abstract IRCAM has a long experience in analysis, synthesis and transformation of voice. Natural voice transformations are of great interest for many applications and can be combine with text-to-speech system, leading to a powerful creation tool. We present research conducted at IRCAM on voice transformations for the last few years. Transformations can be achieved in a global way by modifying pitch, spectral envelope, durations etc.

downloadDownload free PDF View PDFchevron_right

Voice Transformation Using Two-Level Dynamic Warping and Neural Networks

Todd Moon

Signals, 2021

Voice transformation, for example, from a male speaker to a female speaker, is achieved here using a two-level dynamic warping algorithm in conjunction with an artificial neural network. An outer warping process which temporally aligns blocks of speech (dynamic time warp, DTW) invokes an inner warping process, which spectrally aligns based on magnitude spectra (dynamic frequency warp, DFW). The mapping function produced by inner dynamic frequency warp is used to move spectral information from a source speaker to a target speaker. Artifacts arising from this amplitude spectral mapping are reduced by reconstructing phase information. Information obtained by this process is used to train an artificial neural network to produce spectral warping information based on spectral input data. The performance of the speech mapping compared using Mel-Cepstral Distortion (MCD) with previous voice transformation research, and it is shown to perform better than other methods, based on their reporte...

downloadDownload free PDF View PDFchevron_right

Voice conversion application (VOCAL)

Liliana Liliana

2011 International Conference on Uncertainty Reasoning and Knowledge Engineering, 2011

Recently, a lot of works has been done in speech technology. Text-to-Speech and Automatic Speech Recognition have been the priorities in research efforts to improve the human-machine interaction. The ways to improve naturalness in human-machine interaction is becoming an inportant matter of concern. Voice conversion can be served as a useful tools to provide new insights related to personification of speech enabled systems. In this research, there are two main parameters are considered vocal tract structure and pitch. For conversion process speech is resolved in two components, excitation component and filtered component using Linear Predictive Coding (LPC). Ptich is determined by autocorrelation. After obtained the acoustic components from source speaker and target speaker, then the acoustic components will be mapped one-to-one to replaced the the acoustic feature from source speaker to target speaker. At least, signal is modified by resynthesis so the resulted speech would perceive as if spoken by target speaker.

downloadDownload free PDF View PDFchevron_right

Including dynamic and phonetic information in voice conversion systems

Antonio Bonafonte

Proc. of the ICSLP'04, 2004

Voice Conversion (VC) systems modify a speaker voice (source speaker) to be perceived as if another speaker (target speaker) had uttered it. Previous published VC approaches using Gaussian Mixture Models [1] performs the conversion in a frame-by-frame basis using only spectral information. In this paper, two new approaches are studied in order to extend the GMM-based VC systems. First, dynamic information is used to build the speaker acoustic model. So, the transformation is carried out according to sequences of frames. Then, phonetic information is introduced in the training of the VC system. Objective and perceptual results compare the performance of the proposed systems.

downloadDownload free PDF View PDFchevron_right

Expressive speech style transformation: Voice quality and prosody modification using a harmonic plus noise model

Àngel Calzada

Proc. Speech …, 2010

This paper proposes an approach to transform speech from a neutral style into other expressive styles using both prosody and voice quality (VoQ). The main aim is to validate the usefulness of VoQ in the enhancement of expressive synthetic speech. A Harmonic plus Noise Model (HNM) is used to modify speech following a set of rules extracted from an expressive speech corpus with five categories (neutral, happy, sensual, aggressive and sad). Finally, modified speech utterances were used to perform a perceptual test. These results indicate that listeners prefer prosody together with VoQ transformation instead of only prosody modification.

downloadDownload free PDF View PDFchevron_right

Loading Preview

Sorry, preview is currently unavailable. You can download the paper by clicking the button above.

S. Vaseghi

2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), 2003

This paper presents a voice conversion method based on analysis and transformation of the characteristics that define a speaker's voice. Voice characteristic features are grouped into three main categories: (a) the spectral features at formants, (b) the pitch and intonation pattern and (c) the glottal pulse shape. Modelling and transformation methods of each group of voice features are outlined. The spectral features at formants are modelled using a two-dimensional phoneme-dependent HMMs. Subband frequency warping is used for spectrum transformation where the subbands are centred on estimates of formant trajectories. The F0 contour, extracted from autocorrelation-based pitchmarks, is used for modelling the pitch and intonation patterns of speech. A PSOLA based method is used for transformation of pitch, intonation patterns and speaking rate. Finally a method based on de-convolution of the vocal tract is used for modelling and mapping of the glottal pulse. The experimental results present illustrations of transformations of the various characteristics and perceptual evaluations.

downloadDownload free PDF View PDFchevron_right

Voice conversion based on parameter transformation

juancho Mntr

1998

This paper describes a voice conversion system based on parameter transformation [1]. Voice conversion is the process of making one person's voice "source" sound like another person's voice "target"[2]. We will present a voice conversion scheme consisting of three stages. First an analysis is performed on the natural speech to obtain the acoustical parameters. These parameters will be voiced and unvoiced regions, the glottal source model, pitch, energy, formants and bandwidths. Once these parameters have been obtained for two different speakers they are transformed using linear functions. Finally the transformed parameters are synthesized by means of a formant synthesizer. Experiments will show that this scheme is effective in transforming the speaker individuality. It will also be shown that the transformation can not be unique from one speaker to another but it has to be divided in several functions each to transform a certain part of the speech signal. Segmentation based on spectral stability will divide the sentence into parts, for each segment a transformation function will be applied.

downloadDownload free PDF View PDFchevron_right

Voice morphing based on spectral features and prosodic modification

Arslan Qavi

17th IEEE International Multi Topic Conference 2014, 2014

This paper is aimed at morphing the speech uttered by a source speaker in a manner that it seems to be spoken by another target speakera new identity is given while preserving the original content. The proposed method transforms the vocal tract parameters and glottal excitation of the source speaker into target speaker's acoustic characteristics. It relates to the development of appropriate vocal tract models that can capture information specific to the speaker and estimate the model parameters that closely relate to the model of the target speaker. It detects the pitch, separates the glottal excitation and vocal tract spectral features. The glottal excitation of the source is taken, voice/un-voice decision is made, the prosody information is found, PSOLA is used to modify the pitch, the spectral features are found, and finally speech is modified using target spectral features and prosody. The subjective experiment shows that the proposed method improves the quality of conversion and contains the original vocal and glottal characteristics of the target speaker.

downloadDownload free PDF View PDFchevron_right

Voice conversion through transformation of spectral and intonation features

S. Vaseghi

2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004

This paper presents a voice conversion method based on transformation of the characteristic features of a source speaker towards a target. Voice characteristic features are grouped into two main categories: (a) the spectral features at formants and (b) the pitch and intonation patterns. Signal modelling and transformation methods for each group of voice features are outlined. The spectral features at formants are modelled using a set of two-dimensional phoneme-dependent HMMs. Subband frequency warping is used for spectrum transformation with the subbands centred on the estimates of the formant trajectories. The F0 contour is used for modelling the pitch and intonation patterns of speech. A PSOLA based method is employed for transformation of pitch, intonation patterns and speaking rate. The experiments present illustrations and perceptual evaluations of the results of transformations of the various voice features.

downloadDownload free PDF View PDFchevron_right

Transformation Of Vocal Characteristics: A Review Of Literature

Minghui Dong

2009

The transformation of vocal characteristics aims at modifying voice such that the intelligibility of aphonic voice is increased or the voice characteristics of a speaker (source speaker) to be perceived as if another speaker (target speaker) had uttered it. In this paper, the current state-of-the-art voice characteristics transformation methodology is reviewed. Special emphasis is placed on voice transformation methodology and issues for improving the transformed speech quality in intelligibility and naturalness are discussed. In particular, it is suggested to use the modulation theory of speech as a base for research on high quality voice transformation. This approach allows one to separate linguistic, expressive, organic and perspective information of speech, based on an analysis of how they are fused when speech is produced. Therefore, this theory provides the fundamentals not only for manipulating non-linguistic, extra-/paralinguistic and intra-linguistic variables for voice tra...

downloadDownload free PDF View PDFchevron_right

Voice quality transformation using an extended source-filter speech model

Stefan Huber

In this paper we present a flexible framework for parametric speech analysis and synthesis with high quality. It constitutes an extended source-filter model. The novelty of the proposed speech processing system lies in its extended means to use a Deterministic plus Stochastic Model (DSM) for the estimation of the unvoiced stochastic component from a speech recording. Further contributions are the efficient and robust means to extract the Vocal Tract Filter (VTF) and the modelling of energy variations. The system is evaluated in the context of two voice quality transformations on natural human speech. The voice quality of a speech phrase is altered by means of re-synthesizing the deterministic component with different pulse shapes of the glottal excitation source. A Gaussian Mixture Model (GMM) is used in one test to predict energies for the re-synthesis of the deterministic and the stochastic component. The subjective listening tests suggests that the speech processing system is able to successfully synthesize and arise to a listener the perceptual sensation of different voice quality characteristics. Additionally, improvements of the speech synthesis quality compared to a baseline method are demonstrated.

downloadDownload free PDF View PDFchevron_right

SPEECH STYLE CONVERSION BASED ON THE STATISTICS OF VOWEL SPECTROGRAMS AND NONLINEAR FREQUENCY MAPPING

Toshio Irino, Hideki Kawahara

A simple, efficient, and high-quality speech style conversion algorithm is proposed based on STRAIGHT. A very highquality VOCODER STRAIGHT consists of instantaneousfrequency based F0 and source information extraction part and F0-adaptive time-frequency smoothing part to eliminate preriodicity interferences. The proposed method uses only vowel information to design the desired conversion functions and parameters. So, it is possible to reduce the amount of training data required for conversion. The processing of the proposed method is : 1) to produce abstract spectra that is represented on the perceptual frequency axis and is derived as average spectrum for each vowel and each style; 2) to decompose the original spectrum into the abstract spectrum and the residual, fine structure; 3) to replace the abstract spectrum from the original to the target style; 4) to map the fine structure with nonlinear frequency warping for adapting the target style fine structure; 5) then to add them together to produce target speech. An efficient algorithm for this conversion was developed using an orthogonal transformation referred to as warped-DCT. An informal listening test indicated that the proposed method yields more natural and high-quality speech style conversion than the previous methods.

downloadDownload free PDF View PDFchevron_right

Experiments in voice quality modification of natural speech signals: the spectral approach

Christophe d'Alessandro

1998

Voice quality is currently a key issue in speech synthesis research. The lack of realistic intra-speaker voice quality variation is an important source of concern for concatenation-based synthesis methods. A challenging problem is to reproduce the voice quality changes that are occuring in natural speech when the vocal e ort is varying. A new method for voice quality modi cation is presented. It takes advantage of a spectral theory for voice source signal representation. An algorithm based on periodic-aperiodic decomposition and spectral processing (using the short-term Fourier transform) is described. The use of adaptive inverse ltering in this framework is also discussed. Applications of this algorithm may include: pre-processing of speech corpora, modi cation of voice quality parameters together with intonation in synthesis, voice transformation. Some experiments are reported, showing convincing voice quality modi cations for various speakers.

downloadDownload free PDF View PDFchevron_right

Mapping Articulatory-Features to Vocal-Tract Parameters for Voice Conversion

Tsuneo Nitta

IEICE Transactions on Information and Systems, 2014

In this paper, we propose voice conversion (VC) based on articulatory features (AF) to vocal-tract parameters (VTP) mapping. An artificial neural network (ANN) is applied to map AF to VTP and to convert a speaker's voice to a target-speaker's voice. The proposed system is not only text-independent VC, in which it does not need parallel utterances between source and target-speakers, but can also be used for an arbitrary sourcespeaker. This means that our approach does not require source-speaker data to build the VC model. We are also focusing on a small number of target-speaker training data. For comparison, a baseline system based on Gaussian mixture model (GMM) approach is conducted. The experimental results for a small number of training data show that the converted voice of our approach is intelligible and has speaker individuality of the targetspeaker.

downloadDownload free PDF View PDFchevron_right

High-quality and light-weight voice transformation enabling extrapolation without perceptual and objective breakdown

Hideki Kawahara

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2010

A voice transformation method that only relies on vowel information is proposed. The method is based on empirical cumulative distributions of perceptually relevant spectral distances, which are used to design mapping functions from distance to proximity. A set of operators are optimized in the design phase to implement onthe-fly compilation of executable transformations used in the transformation phase. Proximity of the current input parameters to the speaker's own vowel templates is used in this compilation. The proposed method deforms the source speaker's parameter space using a set of monotonic and continuous mapping functions. This smooth and topology-preserving mapping yields high-quality modification of existing speech resources.

downloadDownload free PDF View PDFchevron_right

Introduction to the Special Section on Voice Transformation

Sign up for access to the world's latest research

AbstractAI

Related papers

Related papers

Related topics

Abstract
AI