Papers by Jitendra Dhiman

Speech Communication, Dec 1, 2020
We address the problem of suppressing musical noise from speech enhanced using a short-time processing algorithm. Enhancement algorithms rely on noise statistics, and errors in estimating the statistics lead to residual noise in the enhanced signal. A frequently encountered residual noise type is the so-called musical noise, which is a consequence of spurious peaks occurring at random locations in the time-frequency (t-f) plane. Typically, speech enhancement algorithms operate on a short-time basis and perform attenuation of noisy speech spectral coefficients, effectively leading to a spectrotemporal gain function. We show that in the case of speech distorted by musical noise, the spectrotemporal gain function has a distinct signature: the musical noise components are sparse in the t-f domain, whereas the spectrotemporal gain corresponding to the speech region exhibits a low-rank structure. Based on this observation, we propose a low-rank and sparse matrix decomposition of the spectrotemporal gain function. We show that musical noise can be effectively suppressed by reconstructing the speech signal using only the low-rank component. Performance comparison in terms of subjective scores and spectrographic analysis shows that the proposed technique is superior to two benchmark techniques. The proposed technique could be used in tandem with any speech enhancement algorithm that gives rise to musical noise.
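
The low-rank and sparse split described in this abstract can be illustrated with a generic robust-PCA style decomposition. The sketch below is an assumption-laden illustration, not the authors' algorithm: it uses a standard principal component pursuit iteration (singular-value thresholding for the low-rank part, elementwise shrinkage for the sparse part), and the function names, the default weight `lam`, and the penalty heuristic `mu` are all choices made here for illustration. The input `gain` stands for a frequency-by-time gain matrix and is assumed not to be all zeros.

```python
import numpy as np


def soft_threshold(x, tau):
    """Elementwise shrinkage: pull every entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def singular_value_threshold(x, tau):
    """Shrink the singular values of x by tau, which promotes low rank."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt


def low_rank_sparse_split(gain, lam=None, n_iter=1000, tol=1e-7):
    """Split `gain` into a low-rank part and a sparse part whose sum approximates it.

    Hypothetical helper for illustration; parameter defaults are common
    heuristics, not values taken from the paper.
    """
    gain = np.asarray(gain, dtype=float)
    m, n = gain.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))           # usual weight on the sparse term
    mu = m * n / (4.0 * np.abs(gain).sum())      # heuristic penalty parameter
    low_rank = np.zeros_like(gain)
    sparse = np.zeros_like(gain)
    dual = np.zeros_like(gain)
    for _ in range(n_iter):
        # Update each component in turn, then the dual variable.
        low_rank = singular_value_threshold(gain - sparse + dual / mu, 1.0 / mu)
        sparse = soft_threshold(gain - low_rank + dual / mu, lam / mu)
        residual = gain - low_rank - sparse
        dual += mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(gain):
            break
    return low_rank, sparse
```

In the setting of the abstract, only the low-rank output would be applied to the noisy spectral coefficients before resynthesis, while the sparse output would collect the isolated peaks associated with musical noise.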

Generally defined, speech modification is the process of changing certain perceptual properties of speech while leaving other properties unchanged. Among the many types of speech information that may be altered are rate of articulation, pitch, and formant characteristics. Modifying speech parameters such as pitch, duration, and strength of excitation by a desired factor is termed prosody modification. In this thesis, prosody modifications for a voice conversion framework are presented. Two prosody modifications are of particular importance: first, modification of duration and pauses in a speech utterance (time-scale modification), and second, modification of the pitch (pitch-scale modification). Prosody modification involves changing the pitch and duration of speech without affecting the message and naturalness. In this work, time-scale and pitch-scale modifications of speech are discussed using two methods: Time Domain Pitch Synchronous Overlap-Add (TD-PSOLA) and an epoch-based approach. Although there are many variations of TD-PSOLA, the variant discussed in this thesis works directly on speech in the time domain to apply the desired modifications. The epoch-based approach involves modification of the LP residual. Among the various perceptual properties of speech, the pitch contour plays a key role, since it defines the intonation patterns of a speaker. Prosody modification in a voice conversion framework involves modifying the source pitch contour to follow the pitch contour of the target, which requires prediction of the target pitch contour. The mean/variance method for pitch contour prediction is explored. Sinusoidal modeling has been successfully applied to a broad range of speech processing problems. It offers advantages over linear predictive modeling and the short-time Fourier transform for speech analysis/synthesis and modification. The parameter estimation of sinusoidal modeling, which permits flexible time- and frequency-scale voice modifications, is presented. Speech synthesis using three models (sinusoidal, harmonic, and harmonic-plus-residual) is discussed.
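
The mean/variance method for pitch contour prediction mentioned in this abstract is commonly understood as a linear mapping that matches the mean and standard deviation of the source pitch to those of the target. The sketch below assumes that reading; the function name, the convention that unvoiced frames carry F0 = 0, and the choice to work on F0 in Hz (the same mapping is often applied to log-F0) are assumptions made here, not details from the thesis.

```python
import numpy as np


def convert_pitch_contour(source_f0, src_mean, src_std, tgt_mean, tgt_std):
    """Map a source F0 contour onto target statistics, frame by frame.

    Frames with F0 <= 0 are treated as unvoiced and left at zero
    (an assumed convention for this illustration).
    """
    f0 = np.asarray(source_f0, dtype=float)
    voiced = f0 > 0
    converted = np.zeros_like(f0)
    # Normalize by source statistics, then rescale to target statistics.
    converted[voiced] = tgt_mean + (tgt_std / src_std) * (f0[voiced] - src_mean)
    return converted
```

The source and target statistics would normally be estimated from the voiced frames of training utterances for each speaker.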

A Spectro-temporal Technique for Estimating Aperiodicity and Voiced/unvoiced Decision Boundaries of Speech Signals
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
In contrast to a 1-D short-time analysis of speech, 2-D approaches aim at characterizing the speech signal attributes jointly in time and frequency. In this paper, we focus on the quasi-periodicity of a voiced spectro-temporal patch and quantify it by proposing an aperiodicity measure defined using the underlying frequency modulations in the patch. We further propose a time-frequency aperiodicity map obtained by overlapping and adding the aperiodicity measures across patches. The proposed aperiodicity map is utilized to obtain band-wise aperiodicity parameters, which are essential for high-quality speech synthesis. The aperiodicity in unvoiced patches is addressed by identifying them using the coherence of the patch. In addition, the proposed technique also provides voiced/unvoiced decision boundaries of a speech signal. The effectiveness of the proposed band-wise aperiodicity parameters and voiced/unvoiced decisions is verified by incorporating them in an existing state-of-the-art vocoder for speech synthesis. Subjective listening tests show that the quality of the reconstructed speech is on par with that of the state-of-the-art WORLD vocoder in terms of mean opinion score, indicating that spectrotemporal approaches are highly promising for speech analysis and synthesis applications.

Novel speech duration modifier for packet based communication system
Interspeech 2014
In this paper, we propose a real-time method for duration modification of speech for packet-based communication systems. While there is rich literature on duration modification, it fails to clearly address the issues in real-time implementation. Most duration modification methods rely on accurate estimation of pitch marks, which is not feasible in a real-time scenario. The proposed method modifies the duration of the linear prediction (LP) residual of individual frames without using any look-ahead delay or knowledge of pitch marks. In this method, multiples of the pitch period are repeated or removed from a frame depending on a scheduling algorithm. The subjective quality of the proposed method was found to be better than that of the waveform similarity overlap and add (WSOLA) technique as well as the Linear Prediction Pitch Synchronous Overlap and Add (LP-PSOLA) technique.
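
The core frame operation in this entry, repeating or removing whole multiples of the pitch period in an LP residual frame, can be sketched as follows. This is a minimal illustration under stated assumptions: the pitch period (in samples) is taken as given, the function name and the choice to repeat or drop samples at the end of the frame are hypothetical, and the scheduling algorithm that decides how many periods to add or remove per frame is not described in the abstract, so it is left to the caller.

```python
import numpy as np


def change_frame_duration(residual_frame, pitch_period, n_periods):
    """Lengthen (n_periods > 0) or shorten (n_periods < 0) one residual frame.

    Works in whole pitch periods so that the periodic structure of the
    residual is preserved. `pitch_period` is in samples.
    """
    frame = np.asarray(residual_frame, dtype=float)
    if pitch_period <= 0 or pitch_period > len(frame):
        raise ValueError("pitch_period must be in 1..len(frame)")
    if n_periods >= 0:
        # Repeat the last full pitch period n_periods times.
        last_period = frame[-pitch_period:]
        return np.concatenate([frame] + [last_period] * n_periods)
    n_remove = -n_periods * pitch_period
    if n_remove >= len(frame):
        raise ValueError("cannot remove the whole frame")
    # Drop whole pitch periods from the end of the frame.
    return frame[:-n_remove]
```

The modified residual would then be passed through the LP synthesis filter of the frame to obtain the time-scaled speech.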

Interspeech 2018
In contrast to 1-D short-time analysis of speech, 2-D modeling of spectrograms provides a characterization of speech attributes directly in the joint time-frequency plane. Building on existing 2-D models to analyze a spectrogram patch, we propose a multicomponent 2-D AM-FM representation for spectrogram decomposition. The components of the proposed representation comprise a DC component, a fundamental frequency carrier and its harmonics, and a spectrotemporal envelope, all in 2-D. The number of harmonics required is patch-dependent. The estimation of the AM and FM is done using the Riesz transform, and the component weights are estimated using a least-squares approach. The proposed representation provides an improvement over existing state-of-the-art approaches, for both male and female speakers. This is quantified using reconstruction SNR and the perceptual evaluation of speech quality (PESQ) metric. Further, we perform an overlap-add on the DC component, pooling all the patches, and obtain a time-frequency (t-f) aperiodicity map for the speech signal. We verify its effectiveness in improving speech synthesis quality by using it in an existing state-of-the-art vocoder.

Interspeech 2019
We address the problem of estimating the time-varying spectral envelope of a speech signal using a spectro-temporal demodulation technique. Unlike the conventional spectrogram, we consider a pitch-adaptive spectrogram and model a spectrotemporal patch using an amplitude- and frequency-modulated two-dimensional (2-D) cosine signal. We employ a demodulation technique based on the Riesz transform that we proposed recently to estimate the amplitude and frequency modulations. The amplitude modulation (AM) corresponds to the vocal-tract filter magnitude response (or envelope) and the frequency modulation (FM) corresponds to the excitation. We consider the AM and demonstrate its effectiveness by incorporating it as an acoustic feature for local conditioning in the statistical WaveNet vocoder for the task of speech synthesis. The quality of the synthesized speech obtained with the Riesz envelope is compared with that obtained using the envelope estimated by the WORLD vocoder. Objective measures and subjective listening tests on the CMU-Arctic database show that the quality of synthesis is superior to that obtained using the WORLD envelope. This study thus establishes the Riesz envelope as an efficient alternative to the WORLD envelope.

Interspeech 2017
Decomposing speech signals into periodic and aperiodic components is an important task, finding applications in speech synthesis, coding, denoising, etc. In this paper, we construct a time-frequency coherence function to analyze spectro-temporal signatures of speech signals for distinguishing between deterministic and stochastic components of speech. The narrowband speech spectrogram is segmented into patches, which are represented as 2-D cosine carriers modulated in amplitude and frequency. Separation of carrier and amplitude/frequency modulations is achieved by 2-D demodulation using the Riesz transform, which is the 2-D extension of the Hilbert transform. The demodulated AM component reflects contributions of the vocal tract to the spectrogram. The frequency-modulated carrier (FM-carrier) signal exhibits properties of the excitation. The time-frequency coherence is defined with respect to the FM-carrier and a coherence map is constructed, in which highly coherent regions represent nearly periodic and deterministic components of speech, whereas the incoherent regions correspond to unstructured components. The coherence map shows a clear distinction between deterministic and stochastic components in speech characterized by jitter, shimmer, lip radiation, type of excitation, etc. Binary masks prepared from the time-frequency coherence function are used for periodic-aperiodic decomposition of speech. Experimental results are presented to validate the efficiency of the proposed method.

Interspeech 2017
We consider a two-dimensional demodulation framework for spectro-temporal analysis of the speech signal. We construct narrowband (NB) speech spectrograms, and demodulate them using the Riesz transform, which is a two-dimensional extension of the Hilbert transform. The demodulation results in a time-frequency envelope (amplitude modulation or AM) and a time-frequency carrier (frequency modulation or FM). The AM corresponds to the vocal tract and is referred to as the vocal tract spectrogram. The FM corresponds to the underlying excitation and is referred to as the carrier spectrogram. The carrier spectrogram exhibits a high degree of time-frequency consistency for voiced sounds. For unvoiced sounds, such a structure is lacking. In addition, the carrier spectrogram reflects the fundamental frequency (F0) variation of the speech signal. We develop a technique to determine the F0 from the carrier spectrogram. The time-frequency consistency is used to determine which time-frequency regions correspond to voiced segments. Comparisons with the state-of-the-art F0 estimation algorithms show that the proposed F0 estimator has high accuracy for telephone channel speech and is robust to noise.
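
The Riesz-transform demodulation that recurs in these abstracts can be illustrated with a frequency-domain implementation of the two Riesz components, followed by the usual monogenic-signal amplitude and phase. The sketch below is a generic illustration and not necessarily the estimator used in the papers: the function name, the removal of the patch mean before demodulation, and the use of the unsigned phase are assumptions made here.

```python
import numpy as np


def riesz_demodulate(patch):
    """Return (amplitude, phase) of a 2-D patch via the Riesz transform.

    The patch mean is removed first (an assumption of this sketch), so
    that the patch behaves like an amplitude-modulated 2-D cosine.
    """
    patch = np.asarray(patch, dtype=float)
    patch = patch - patch.mean()
    rows, cols = patch.shape
    fy = np.fft.fftfreq(rows)[:, None]       # vertical frequency grid
    fx = np.fft.fftfreq(cols)[None, :]       # horizontal frequency grid
    radius = np.sqrt(fx**2 + fy**2)
    radius[0, 0] = 1.0                       # avoid 0/0 at DC; multiplier is 0 there
    spectrum = np.fft.fft2(patch)
    # Riesz multipliers -j*f/|f| are the 2-D analogue of the Hilbert multiplier.
    riesz_x = np.real(np.fft.ifft2(-1j * fx / radius * spectrum))
    riesz_y = np.real(np.fft.ifft2(-1j * fy / radius * spectrum))
    riesz_mag = np.hypot(riesz_x, riesz_y)
    amplitude = np.sqrt(patch**2 + riesz_mag**2)   # AM estimate
    phase = np.arctan2(riesz_mag, patch)           # unsigned carrier phase in [0, pi]
    return amplitude, phase
```

For a pure 2-D cosine with constant amplitude, the returned amplitude is constant and `amplitude * cos(phase)` reproduces the zero-mean patch, which mirrors the AM and carrier split described in the abstracts.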