Phillip DeLeon

New Mexico State University, Klipsch School of Electrical & Computer Eng, Faculty Member

Papers by Phillip DeLeon

Efficient Speaker Identification using Distributional Speaker Model Clustering

For large population speaker identification (SI) systems, likelihood computations between an unknown speaker's test feature vectors and speaker models can be very time-consuming and detrimental to applications where fast SI is required. In this paper, we propose a method whereby speaker models are clustered using a distributional distance measure such as KL divergence during the training stage. During the testing stage, only those clusters which are likely to contain high-likelihood speaker models are searched. The proposed method reduces the speaker model search space, which directly results in faster SI. Any loss in identification accuracy can be controlled by trading off speed and accuracy. This paper implements a GMM-UBM based SI system with MAP-adapted speaker models, and the results are presented on the TIMIT, NTIMIT, and NIST-2002 large population speech corpora.
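As a rough illustration of a distributional distance of the kind mentioned above, the sketch below computes the closed-form KL divergence between two Gaussians with diagonal covariances, plus a symmetrized variant. This is only an assumption-laden simplification: the abstract does not say how the divergence between GMM speaker models is computed, and KL divergence between full mixtures has no closed form. The function names are hypothetical.

```python
import math

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariances.

    Each argument is a list of per-dimension means or variances.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total

def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return (kl_diag_gauss(mu_p, var_p, mu_q, var_q)
            + kl_diag_gauss(mu_q, var_q, mu_p, var_p))
```

A symmetric measure is often convenient for clustering because the distance between two models then does not depend on their order.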

Speaker Identification in Room Reverberation using GMM-UBM

Speaker recognition systems tend to degrade if the training and testing conditions differ significantly. Such situations may arise due to the use of different microphones, telephone and mobile handsets, or different acoustic conditions. Recently, the effect of room acoustics on speaker identification (SI) has been investigated, and it has been shown that a loss in accuracy results when using clean training and reverberated testing signals. Various techniques, such as dereverberation, the use of multiple microphones, and compensation, have been proposed to reduce the mismatch and thereby increase SI accuracy. In this paper, we propose to use a Gaussian mixture model-universal background model (GMM-UBM), with the multiple speaker model approach previously proposed, to compensate for the acoustical mismatch. With this approach, SI accuracy improves over conventional GMM-based SI systems in the presence of room reverberation.

Low-Complexity Voice Detector for Mobile Environments

Provisioning of mobile audio and video services is a difficult challenge since, in the mobile environment, bandwidth and processing resources are limited. Audio content is normally present in most multimedia services; however, the user expectation of perceived audio quality differs for speech and non-speech content. Therefore, automatic voice or speech detection is needed in order to maximize perceived audio quality and reduce bandwidth and processing costs. The aim of this work is to find a low-complexity speech detector, suitable for detection of speech in a highly compressed multimedia stream whose audio track may consist of speech, music, broadcast news, or other audio content. Finally, two methods for speech/non-speech detection are proposed and compared.

Support Vector Machine Based Speaker Identification Systems Using GMM Parameters

Speaker identification (SI) is the task of determining which speaker characteristics, from the speakers known to the system, best match the unknown voice sample. SI requires multiple decision alternatives, and implementing an SI system using SVM techniques requires a multi-class SVM classifier. In this paper, speaker model clustering is implemented on an SVM-based SI system. Here, instead of clustering the speakers, we build an SVM classifier which separates a group of speakers. Thus each hyperplane built using SVMs separates a group of speakers, and this procedure is repeated in each sub-group until there is only one speaker in each group. Experiments performed on the NIST-2002 speech corpus show an improvement in accuracy compared to conventional multi-class SVM techniques.

Revisiting the Security of Speaker Verification Systems Against Imposture Using Synthetic Speech

In this paper, we investigate imposture using synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both speaker verification (SV) and speech synthesis have renewed interest in this problem. We use an HMM-based speech synthesizer which creates synthetic speech for a targeted speaker through adaptation of a background model. We use two SV systems: a standard GMM-UBM-based system and a newer SVM-based system. Our results show that when the systems are tested with human speech, there are zero false acceptances and zero false rejections. However, when the systems are tested with synthesized speech, all claims for the targeted speaker are accepted while all other claims are rejected. We propose a two-step process for detection of synthesized speech in order to prevent this imposture. Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech will lead to an unacceptably high false acceptance rate.

Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech

In this paper, we evaluate the vulnerability of a speaker verification (SV) system to synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both SV and speech synthesis have renewed interest in this problem. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and a GMM-UBM-based SV system. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV system has a 0.4% EER. When the system is tested with synthetic speech generated from speaker models derived from the WSJ corpus, 90% of the matched claims are accepted. This result suggests a possible vulnerability in SV systems to synthetic speech. In order to detect synthetic speech prior to recognition, we investigate the use of an automatic speech recognizer (ASR), dynamic time warping (DTW) distance of mel-frequency cepstral coefficients (MFCC), and the previously proposed average inter-frame difference of log-likelihood (IFDLL). Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech can lead to an unacceptably high acceptance rate of synthetic speakers.
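For readers unfamiliar with the equal error rate (EER) quoted above, the following minimal sketch shows one common way to estimate it: sweep a decision threshold over the verification scores and pick the point where the false acceptance rate and false rejection rate are closest. The function name and the scores in the usage note are hypothetical toy values, not data from the paper, and the paper's own evaluation procedure may differ.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from two lists of verification scores.

    A claim is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the threshold where the false
    acceptance rate (FAR) and false rejection rate (FRR) are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]
```

For example, with toy genuine scores `[0.9, 0.8, 0.7, 0.4]` and toy impostor scores `[0.1, 0.2, 0.3, 0.5]`, one claim of each kind is misclassified at a threshold of 0.5, so FAR and FRR are both 25%.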

Hybrid Scalar/Vector Quantization of Mel-Frequency Cepstral Coefficients for Low Bit-Rate Coding of Speech

In this paper, we propose a low bit-rate speech codec based on a hybrid scalar/vector quantization of the mel-frequency cepstral coefficients (MFCCs). We begin by showing that if a high-resolution mel-frequency cepstrum (MFC) is computed, good-quality speech reconstruction is possible from the MFCCs despite the lack of explicit phase information. By evaluating the contribution toward speech quality that individual MFCCs make and applying appropriate quantization, our results show perceptual evaluation of speech quality (PESQ) of the MFCC-based codec matches the state-of-the-art MELPe codec at 600 bps and exceeds the CELP codec at 2000-4000 bps coding rates. The main advantage of the proposed codec is in distributed speech recognition (DSR) since speech features based on MFCCs can be directly obtained from codewords thus eliminating additional decode and feature extract stages.

Detection of Synthetic Speech for the Problem of Imposture

In this paper, we present new results from our research into the vulnerability of a speaker verification (SV) system to synthetic speech. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and both GMM-UBM and support vector machine (SVM) SV systems. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV systems have a 0.35% EER. When the systems are tested with synthetic speech generated from speaker models derived from the WSJ corpus, over 91% of the matched claims are accepted. We propose the use of relative phase shift (RPS) in order to detect synthetic speech and develop a GMM-based synthetic speech classifier (SSC). Using the SSC, we are able to correctly classify human speech in 95% of tests and synthetic speech in 88% of tests, thus significantly reducing the vulnerability.

Experimental results with increased bandwidth analysis filters in oversampled subband acoustic echo cancelers

The motivation for adaptive filtering in subbands stems from two well-known problems in least-mean-square fullband adaptive filtering. First, the convergence and tracking can be very slow if the input correlation matrix is ill-conditioned, as is the case with speech input. Second, very high order adaptive filters are computationally expensive. One problem with adaptive filtering in subbands is the slow, asymptotic convergence associated with oversampled systems. Increasing the bandwidth of analysis filters relative to the synthesis filters is proposed to reduce the slow asymptotic convergence. This letter will motivate this approach and present experimental results illustrating the benefits of this modification.

Radio Frequency Channel Modeling for Proximity Networks on the Martian Surface

NASA's long-term goals for the exploration of Mars include the use of rovers and sensors which communicate through proximity wireless networks. The performance of any such wireless network depends fundamentally on the radio frequency (RF) environment. In order to evaluate and optimize the performance of such a wireless network, a basic understanding or model of the channel is important. In this paper, we present our results concerning the RF environment at selected sites on the surface of Mars with a focus on the link budget and RF coverage patterns. These results take into account the local topography using data from the Mars Global Surveyor, surface reflections, clutter, atmospheric absorption, etc., and contribute to a more accurate RF channel model. We consider a basic wireless network model and demonstrate the possibility for good site coverage and long links despite low antenna heights and radiated power. With such a channel model, mission operators can update elements of the wireless network after deployment with more accurate RF propagation information. Such updates could be used to extend the reach of the network or protect network elements from communication outages due to unforeseen features of the local topography.
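To make the notion of a link budget concrete, the sketch below computes received power using only the textbook free-space path loss formula. This is a deliberately simplified baseline and not the paper's model, which additionally accounts for topography, surface reflections, clutter, and atmospheric absorption. The function names and the parameter values in the usage note are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0

def free_space_path_loss_db(distance_m, freq_hz):
    """Free-space path loss in dB: 20*log10(4*pi*d*f/c)."""
    return 20.0 * math.log10(
        4.0 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT_M_S
    )

def received_power_dbm(tx_power_dbm, tx_gain_dbi, rx_gain_dbi,
                       distance_m, freq_hz):
    """Simplified link budget: transmit power plus antenna gains minus path loss."""
    return (tx_power_dbm + tx_gain_dbi + rx_gain_dbi
            - free_space_path_loss_db(distance_m, freq_hz))
```

For instance, `free_space_path_loss_db(100.0, 2.4e9)` is roughly 80 dB, and each doubling of distance adds about 6 dB of loss. Terrain effects would generally add loss beyond this baseline.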

Subband Transforms for Adaptive, RLS Direct Sequence Spread Spectrum Receivers

Adaptive Direct Sequence Spread Spectrum (DSSS) receivers have advantages over their fixed matched filter counterparts including interference cancellation capabilities and simplification of PN code acquisition. However, convergence using the LMS algorithm will be very slow in situations with relatively high SNR and/or a large number of users. The use of the RLS algorithm will improve convergence speed but at significantly increased computational cost, especially for long PN codes. Unfortunately, computationally efficient, fast RLS algorithms cannot be used because the filter is updated at the symbol rate rather than at every sample. In this paper, we propose a subband version of the RLS-based receiver that utilizes multiple, shorter length adaptive filters. This approach significantly reduces computation and introduces architectural parallelism into the system implementation. We design an optimal subband transform and provide simulation results demonstrating the improved convergence properties as compared with the fullband system. Index Terms: adaptive direct sequence spread spectrum, parallel receiver, subband transforms.

Terrain-Based Simulation of IEEE 802.11a and b Physical Layers on the Martian Surface

This paper presents results concerning the use of IEEE 802.11a and b wireless local area network (WLAN) standards for proximity wireless networks on the Martian surface. The radio frequency (RF) environment on the Martian surface is modeled using high-resolution digital elevation maps (DEMs) of Gusev Crater and Meridiani Planum (Hematite) as sample sites. The resulting propagation path loss models are then used in a physical layer (PHY) simulation. Our results show that Martian terrain, as represented by the sites studied, can create multipath conditions which in turn affect 802.11a and b PHY performance. However, with a few tens of milliwatts of radiated power and antenna heights within 1-2 m, orthogonal frequency division multiplexing (OFDM)-based 802.11a can have very good PHY performance in terms of bit- and packet-error rates for distances up to a few hundred meters; 802.11b, which is based on direct-sequence spread spectrum (DSSS), is found to be much more adversely affected in the multipath environment. The DEM-based simulation methodology presented here may be more useful to mission planners than generic statistical models.

Speaker Model Clustering for Efficient Speaker Identification in Large Population Applications

In large population speaker identification (SI) systems, likelihood computations between an unknown speaker's feature vectors and the registered speaker models can be very time-consuming and impose a bottleneck. For applications requiring fast SI, this is a recognized problem and improvements in efficiency would be beneficial. In this paper, we propose a method whereby GMM-based speaker models are clustered using a simple k-means algorithm. Then, during the test stage, only a small proportion of speaker models in selected clusters are used in the likelihood computations, resulting in a significant speed-up with little to no loss in accuracy. In general, as the number of selected clusters is reduced, the identification accuracy decreases; however, this loss can be controlled through proper trade-off. The proposed method may also be combined with other test stage speed-up techniques, resulting in even greater speed-up gains without additional sacrifices in accuracy.
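The cluster-then-search idea described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: each speaker model is reduced to a single hypothetical representative vector, squared Euclidean distance stands in for the GMM likelihood computation, and the first k models seed the k-means centroids so the result is deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def identify(test_vec, models, centroids, assign, n_clusters=1):
    """Score only speakers in the n_clusters clusters nearest the test vector.

    Returns (best speaker index, number of models actually scored).
    """
    ranked = sorted(range(len(centroids)),
                    key=lambda c: sq_dist(test_vec, centroids[c]))
    selected = set(ranked[:n_clusters])
    candidates = [i for i, a in enumerate(assign) if a in selected]
    best = min(candidates, key=lambda i: sq_dist(test_vec, models[i]))
    return best, len(candidates)
```

Raising `n_clusters` searches more of the population and so trades speed for accuracy, which mirrors the trade-off noted in the abstract.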

Investigating the Option of Removing Anti-Aliasing Filter From Digital Relays

Digital relays traditionally employ sampling rates of less than 100 samples/cycle. In order to avoid aliasing due to fault transients, these relays employ an analog antialiasing filter before critically sampling (at the Nyquist rate) the input waveforms coming from instrument transformers. In many applications of electrical engineering, oversampling (sampling above the Nyquist rate) has long been used to simplify the requirements of an antialiasing filter with a sharp cutoff; in some cases, the filter can even be eliminated. This paper investigates this option for a digital relay. The performance of a traditional digital relay is compared with a method that uses oversampling without an antialiasing filter. By processing a comprehensive array of fault waveforms from Electromagnetic Transients Program simulations, a suitable oversampling rate is suggested. A comparison of phasor estimates using the traditional relay and the proposed method is made for different operating and fault conditions. The results suggest that oversampling can eliminate the antialiasing filter traditionally employed in digital relays.

Low Bit-Rate Speech Coding through Quantization of Mel-Frequency Cepstral Coefficients

In this paper, we propose a low bit-rate speech codec based on vector quantization (VQ) of the mel-frequency cepstral coefficients (MFCCs). We begin by showing that if a high-resolution mel-frequency cepstrum (MFC) is computed, good-quality speech reconstruction is possible from the MFCCs despite the lack of phase information. By evaluating the contribution toward speech quality that individual MFCCs make and applying appropriate quantization, our results show that the MFCC-based codec exceeds the state-of-the-art MELPe codec across the entire range of 600-2400 bps, when evaluated with the perceptual evaluation of speech quality (PESQ) (ITU-T recommendation P.862). The main advantage of the proposed codec is in distributed speech recognition (DSR), since the MFCCs can be directly applied, thus eliminating additional decode and feature extract stages; furthermore, the proposed codec better preserves the fidelity of MFCCs and yields better word accuracy rates compared to CELP and MELPe codecs.

Evaluation of spherically invariant random process parameters as discriminators for speaker verification

… Processing Workshop, 2004 and the 3rd …, Jan 1, 2004

Current methods of speaker identification and verification rely on the complex extraction of hundreds or even thousands of parameters in order to correctly model and identify a speaker. These methods have matured to the point where extremely accurate identification of a speaker (from a large population of speakers) is possible. In this work, we are interested in the potential use of Spherically Invariant Random Processes (SIRPs), described by two parameters, for speaker identification. These random processes have been shown to be a more statistically accurate model for speech than Laplace and Gamma pdfs. Computation of the two SIRP parameters is fast and simple and storage requirements are obviously small. Although the proposed method does not yield the accuracy of current methods, identification rates are better than random guessing. The work demonstrates the first step for potential use of SIRPs in speaker identification. Usage might include an adjunct role where SIRPs could supplement existing methods to further improve identification or be used to reduce the parameter requirements of existing methods while maintaining accuracy rates.

Effective Utilization of Commercial Wireless Networking Technology in Planetary Environments


Design techniques for uniform-DFT, linear phase filter banks

Uniform-DFT filter banks are an important class of filter banks and their theory is well known.

Radio frequency channel modeling for proximity networks on the Martian surface

Computer Networks, Jan 1, 2005

NASA's long-term goals for the exploration of Mars include the use of rovers and sensors which communicate through proximity wireless networks. The performance of any such wireless network depends fundamentally on the radio frequency (RF) environment. In order to evaluate and optimize the performance of such a wireless network, a basic understanding or model of the channel is important. In this paper, we present our results concerning the RF environment at selected sites on the surface of Mars with a focus on the link budget and RF coverage patterns. These results take into account the local topography using data from the Mars Global Surveyor, surface reflections, clutter, atmospheric absorption, etc., and contribute to a more accurate RF channel model. We consider a basic wireless network model and demonstrate the possibility for good site coverage and long links despite low antenna heights and radiated power. With such a channel model, mission operators can update elements of the wireless network after deployment with more accurate RF propagation information. Such updates could be used to extend the reach of the network or protect network elements from communication outages due to unforeseen features of the local topography.

A Design for Satellite Ground Station Receiver Autoconfiguration

International Telemetering …, Jan 1, 2003

In this paper, we propose a receiver design for satellite ground station use which can demodulate a waveform without specific knowledge of the data rate, convolutional code rate, or line code used. Several assumptions, consistent with the Space Network operating environment, are made, including only certain data rates, convolutional code rates and generator polynomials, and types of line encoders. Despite the assumptions, a wide class of digital signaling (covering most of what might be seen at a ground station receiver) is captured. The approach uses standard signal processing techniques to identify the data rate and line encoder class, and a lookup table with coded sync words (a standard feature of telemetry data frame headers) in order to identify the key parameters. As our research has shown, the leading bits of the received coded frame can be used to uniquely identify the parameters. With proper identification, a basic receiver autoconfiguration sequence (data rate, line decoder, convolutional decoder) may be constructed.