Perceptual Models for Speech, Audio, and Music Processing
2007, EURASIP Journal on Audio, Speech, and Music Processing
https://doi.org/10.1155/2007/12687…
Abstract
Recent advances in understanding human auditory perception have led to significant progress across audio, speech, and music processing, including coding, recognition, synthesis, enhancement, and quality estimation. This special issue compiles seven papers that illustrate the range of current research in perceptual modeling, addressing topics such as cochlear-like filters, noise-robust speech recognition, wideband speech coding using psychoacoustic criteria, spectrotemporal modulation denoising, and the impact of listener accent on comprehension. The findings presented here aim both to inform and to inspire ongoing work in the modeling of human auditory perception.
Related papers
Perception & Psychophysics, 1974
Using a two-alternative temporal forced-choice technique, two binaural detection experiments were performed. In the first, the detectability of a 250-Hz, 128-msec tonal signal masked by a gated 70-dB SPL tone of the same frequency and duration was measured as a function of the level of the signal, the phase angle at which the signal was added to the masker, and the interaural phase difference of the signal. In the second experiment, the signal was a wideband (100-3,000 Hz) 128-msec Gaussian noise masked by a continuous Gaussian noise of the same bandwidth and coherent with the signal. The detectability of this noise signal was measured as a function of the same variables investigated in the first experiment. In both experiments detectability was found to follow a simple energy- or power-detection model when the interaural phase difference was 0 deg. When the interaural phase difference was 180 deg, the function relating the signal level required for a constant level of performance to the signal-masker phase angle is such that neither the Webster-Jeffress hypothesis nor Durlach's EC model accounts for the data. The data are reasonably well fit by a model proposed by Hafter and Carrier.
Informatica
We consider that the outer hair cells of the inner ear together with the local structures of the basilar membrane, reticular lamina and tectorial membrane form the primary filters (PF) of the second order. Taking into account a delay in transmission of the excitation signal in the cochlea and the influence of the Reissner membrane, we design a signal filtering system consisting of the PF with the common PF of the neighboring channels. We assess the distribution of the central frequencies of the channels along the cochlea, optimal number of the PF constituting a channel, natural frequencies of the channels, damping factors and summation weights of the outputs of the PF. As an example, we present a filter bank comprising 20 Gaussian-type channels each consisting of five PF. The proposed filtering system can be useful for designing cochlear implants based on biological principles of signal processing in the cochlea.
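The channel construction described above can be illustrated with a short numpy sketch. The center frequency, damping factor, and five-filter cascade below are illustrative choices, not the paper's derived parameters; the point is only that cascading several identical second-order primary filters yields a channel that is markedly narrower and steeper-skirted than a single filter.

```python
import numpy as np

def resonator_mag(f, fc, zeta):
    """Magnitude response |H(j*2*pi*f)| of one second-order primary
    filter, H(s) = wc^2 / (s^2 + 2*zeta*wc*s + wc^2)."""
    s = 1j * 2 * np.pi * f
    wc = 2 * np.pi * fc
    return np.abs(wc**2 / (s**2 + 2 * zeta * wc * s + wc**2))

def bandwidth_3db(f, mag):
    """Width of the region within 3 dB of the peak (assumes one peak)."""
    inside = f[mag >= mag.max() / np.sqrt(2)]
    return inside[-1] - inside[0]

f = np.linspace(100, 4000, 4000)
fc, zeta, n_pf = 1000.0, 0.1, 5       # illustrative channel parameters

single = resonator_mag(f, fc, zeta)   # one primary filter (PF)
channel = single**n_pf                # cascade of five identical PF

bw_single = bandwidth_3db(f, single)
bw_channel = bandwidth_3db(f, channel)
print(round(bw_single, 1), round(bw_channel, 1))
```

Because the normalized cascade response is the single-filter response raised to the fifth power, the region within 3 dB of the peak shrinks, which is the narrowing effect exploited when building Gaussian-like channels from simple second-order sections.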
2000
Summary: We present a model for tone-in-noise detection at low frequencies that includes a physiologically realistic mechanism for processing the information in neural discharge times. The proposed model exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery and uses cross-auditory-nerve-fiber coincidence detection to extract temporal cues. Information in the responses of model coincidence detectors
The Journal of the Acoustical Society of America, 2006
Psychoacoustic masking experiments have been widely used to investigate cochlear function in human listeners. Here we use simultaneous notched-noise masking experiments in normal hearing listeners to characterize the changes in auditory filter shape with stimulus level over the frequency range 0.25-6 kHz. At each frequency a range of fixed signal levels (30-70 dB SPL) and fixed masker levels (20-50 dB SPL spectrum level) are used in order to obtain accurate descriptions of the filter shapes in individual listeners. The notched-noise data for individual listeners are fitted with two filter shape models: a rounded exponential (roex) shape in which the filter skirt changes as a linear function of probe-tone level and the other, in which the gain of the tip filter relative to the filter tail changes as a function of signal level [Glasberg and Moore, J. Acoust. Soc. Am. 108, 2318-2328 (2000)]. The parameters for these fitted models are then described with a simple set of equations that quantify the changes in auditory filter shape across level and frequency. Both these models fitted the data equally well and both demonstrated increasing tip-tail gain as frequency increased.
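The roex(p) filter shape used in such notched-noise fits has a simple closed form. The sketch below (numpy; the fc and p values are illustrative, not fitted to these data) integrates it numerically to get the equivalent rectangular bandwidth and checks the result against the analytic value 4*fc/p.

```python
import numpy as np

def roex(g, p):
    """Rounded-exponential roex(p) weighting, W(g) = (1 + p*g)*exp(-p*g),
    where g = |f - fc| / fc is the normalized deviation from center."""
    return (1 + p * g) * np.exp(-p * g)

fc, p = 1000.0, 25.0                # illustrative values, not fitted data
g = np.linspace(0.0, 1.0, 100001)   # one skirt, out to a full fc away
dg = g[1] - g[0]

# Equivalent rectangular bandwidth: integrate both (symmetric) skirts.
erb_hz = 2 * np.sum(roex(g, p)) * dg * fc
print(round(erb_hz, 1))             # analytic value is 4*fc/p = 160 Hz
```

The closed-form check follows because each skirt integrates to 2/p in g units, so a larger p (steeper skirt) directly means a sharper filter; level-dependent fits like those above amount to letting p (or the tip-tail gain) vary with stimulus level.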
The Journal of the Acoustical Society of America, 1997
A frequency-modulation term has been added to the gammatone auditory filter to produce a filter with an asymmetric amplitude spectrum. When the degree of asymmetry in this "gammachirp" auditory filter is associated with stimulus level, the gammachirp is found to provide an excellent fit to 12 sets of notched-noise masking data from three different studies. The gammachirp has a well-defined impulse response, unlike the conventional roex auditory filter, and so it is an excellent candidate for an asymmetric, level-dependent auditory filterbank in time-domain models of auditory processing.
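That well-defined impulse response is easy to generate. The sketch below builds it with numpy; the envelope order n, bandwidth factor b, and chirp parameter c are commonly quoted defaults assumed here for illustration, and the ERB formula is the standard Glasberg-Moore approximation.

```python
import numpy as np

def erb(fc):
    """Glasberg & Moore equivalent rectangular bandwidth (Hz)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammachirp(fc, fs=16000, n=4, b=1.019, c=-2.0, dur=0.025):
    """Gammachirp impulse response:
    g(t) = t^(n-1) * exp(-2*pi*b*ERB(fc)*t) * cos(2*pi*fc*t + c*ln(t)).
    With c = 0 this reduces to the ordinary gammatone; the c*ln(t)
    frequency-modulation term is what makes the spectrum asymmetric."""
    t = np.arange(1, int(dur * fs)) / fs  # start one sample in: avoids ln(0)
    env = t**(n - 1) * np.exp(-2 * np.pi * b * erb(fc) * t)
    return env * np.cos(2 * np.pi * fc * t + c * np.log(t))

ir = gammachirp(fc=2000.0)
print(len(ir))
```

A level-dependent filterbank of the kind the abstract envisions would make c a function of stimulus level, one such impulse response per channel.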
Temporal masking models have not been previously applied in the Fast Fourier Transform (FFT) domain for speech enhancement applications. This paper presents a novel speech enhancement algorithm using temporal masking in the FFT domain. The proposed algorithm is suitable for the cochlear speech processor and for other speech applications. The input signal is analysed using FFT and then grouped into 22 critical bands. The noise power is estimated using a minimum statistics noise tracking algorithm. A short-term temporal masking threshold is then calculated for each critical band and a gain factor for each band is then computed. The objective and subjective evaluations show that the temporal masking model based speech enhancement scheme outperforms the traditional Wiener filtering approach in the FFT domain.
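The per-frame flow described above can be sketched in numpy as follows. The frame size, the uniform bin edges standing in for the 22 critical bands, and the Wiener-like gain are all simplifications: the paper derives its gains from a short-term temporal masking threshold and estimates noise with a minimum-statistics tracker, neither of which is reproduced here.

```python
import numpy as np

def enhance_frame(frame, noise_psd, band_edges, floor=0.1):
    """FFT analysis, grouping of bins into bands, one gain per band,
    resynthesis. The SNR-based gain below is a stand-in for the
    paper's temporal-masking-threshold rule."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spec)**2
    gains = np.ones_like(power)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        snr = power[lo:hi].mean() / max(noise_psd[lo:hi].mean(), 1e-12)
        gains[lo:hi] = max(floor, 1.0 - 1.0 / (1.0 + snr))  # Wiener-like
    return np.fft.irfft(spec * gains, n=len(frame))

fs, n = 8000, 256
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 440 * np.arange(n) / fs) \
        + 0.3 * rng.standard_normal(n)
noise_psd = np.full(n // 2 + 1, 0.3**2 * n / 4)   # rough white-noise level
bands = np.linspace(0, n // 2 + 1, 23).astype(int)  # 22 bands, as in the paper
out = enhance_frame(frame, noise_psd, bands)
print(out.shape)
```

A complete system would run this over overlapping frames with overlap-add and replace the flat noise floor with a running minimum-statistics estimate.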
Lecture Notes in Computer Science, 2012
In this paper, the results of psychoacoustical experiments on auditory time-frequency (TF) masking using stimuli (masker and target) with maximal concentration in the TF plane are presented. The target was shifted either along the time axis, the frequency axis, or both relative to the masker. The results show that a simple superposition of spectral and temporal masking functions does not provide an accurate representation of the measured TF masking function. This confirms the inaccuracy of simple models of TF masking currently implemented in some perceptual audio codecs. In the context of audio signal processing, the present results constitute a crucial basis for the prediction of auditory masking in the TF representations of sounds. An algorithm that removes the inaudible components in the wavelet transform of a sound while causing no audible difference to the original sound after re-synthesis is proposed. Preliminary results are promising, although further development is required.
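The pruning idea in the last step above can be caricatured in a few lines. The sketch below zeroes every time-frequency coefficient far below its frame's strongest component; a flat relative threshold is a crude stand-in for the measured TF masking function the paper argues is needed.

```python
import numpy as np

def prune_inaudible(coeffs, rel_threshold_db=-40.0):
    """Zero every TF coefficient more than rel_threshold_db below the
    strongest component in its frame (column). A real implementation
    would use a measured time-frequency masking function instead of a
    flat relative threshold."""
    mag = np.abs(coeffs)
    thresh = mag.max(axis=0, keepdims=True) * 10 ** (rel_threshold_db / 20)
    return np.where(mag >= thresh, coeffs, 0.0)

rng = np.random.default_rng(0)
# toy TF matrix: one strong component per frame plus very weak noise
C = 1e-4 * rng.standard_normal((64, 10))
C[5, :] = 1.0
P = prune_inaudible(C)
print(np.count_nonzero(P), np.count_nonzero(C))
```

The measured TF masking functions reported in the paper would replace the flat threshold with one that spreads in both time and frequency around each masker component.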
The Journal of the Acoustical Society of America, 2013
Most usual speech masking situations induce both energetic and informational masking. Energetic masking (E.M.) arises because both signal and maskers contain energy in the same critical bands. Informational masking (I.M.) prevents the listeners from disentangling acoustical streams even when they are well separated in frequency, and is thought to reflect central mechanisms. In order to quantify I.M. without E.M. contamination in complex auditory situations, target and maskers can be presented dichotically. However, this manipulation provides the listeners with important lateralisation cues, which reduces I.M. Therefore, this study aimed at restoring a fair amount of I.M. using complex tones in a new dichotic paradigm. Regularly repeating signals and random-frequency multitone maskers were presented dichotically, but switched from one ear to the other within a 10-s sequence. Switches could appear at either a slow or a rapid rate. We compared listeners' detection performance in these switching situations to that elicited in traditional diotic and dichotic situations. Results showed that the amount of I.M. induced when signal and maskers were rapidly switching throughout a sequence was significantly higher than in classical dichotic situations. Therefore, this paradigm provides an original tool to evaluate auditory perception in situations of pure I.M. using complex tones.
IEEE Transactions on Speech and Audio Processing, 2002
This paper proposes a versatile perceptual audio coding method that achieves high compression ratios and is capable of low encoding/decoding delay. It accommodates a variety of source signals (including both music and speech) with different sampling rates. It is based on separating irrelevance and redundancy reductions into independent functional units. This contrasts traditional audio coding, where both are integrated within the same subband decomposition. The separation allows for the independent optimization of the irrelevance and redundancy reduction units. For both reductions, we rely on adaptive filtering and predictive coding as much as possible to minimize the delay. A psycho-acoustically controlled adaptive linear filter is used for the irrelevance reduction, and the redundancy reduction is carried out by a predictive lossless coding scheme, termed the weighted cascaded least mean squared (WCLMS) method. Experiments are carried out on a database of moderate size which contains mono-signals of different sampling rates and varying nature (music, speech, or mixed). They show that the proposed WCLMS lossless coder outperforms other competing lossless coders in terms of compression ratios and delay, as applied to the pre-filtered signal. Moreover, a subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.
Index Terms: least mean squared (LMS) algorithm, lossless coding, perceptual audio coding, prediction.
Perceptual audio coding removes both "irrelevance" and "redundancy" from a signal. The former is defined as signal components undetectable by the receiver (the ear). Psycho-acoustics defines the masked threshold as the threshold below which distortions cannot be heard; this threshold is time-, frequency-, and signal-dependent. Perceptual audio coding keeps only audible signal components by hiding quantization distortions below the threshold.
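The redundancy-reduction stage discussed above is built from LMS-type adaptive linear predictors. The sketch below (numpy; a single normalized-LMS stage with an illustrative order and step size, not the paper's full weighted cascade) shows the core operation: predict each sample from its recent past and keep only the residual, which for predictable signals carries far less energy.

```python
import numpy as np

def nlms_predict(x, order=8, mu=0.5, eps=1e-8):
    """One NLMS linear predictor: predict x[n] from the previous
    `order` samples and return the prediction residual. WCLMS cascades
    several such predictors and combines their outputs with adaptive
    weights; this single stage is a simplified sketch."""
    w = np.zeros(order)
    residual = np.empty(len(x))
    for n in range(len(x)):
        u = np.zeros(order)
        past = x[max(0, n - order):n][::-1]   # most recent sample first
        u[:len(past)] = past
        e = x[n] - w @ u
        residual[n] = e
        w += mu * e * u / (eps + u @ u)       # normalized LMS update
    return residual

t = np.arange(2000)
x = np.sin(2 * np.pi * 0.01 * t)   # highly predictable test signal
r = nlms_predict(x)
# after adaptation, residual energy is far below signal energy
print(np.mean(r[500:]**2) < 0.05 * np.mean(x**2))
```

In the coder, this residual (from the psychoacoustically pre-filtered signal) is what gets entropy-coded losslessly, which is why prediction quality translates directly into compression ratio.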
2008
It is well known that binaural processing is very useful for separating incoming sound sources as well as for improving the intelligibility of speech in reverberant environments. This paper describes and compares a number of ways in which the classic model of interaural cross-correlation proposed by Jeffress, quantified by Colburn, and further elaborated by Blauert, Lindemann, and others, can be applied to improving the accuracy of automatic speech recognition systems operating in cluttered, noisy, and reverberant environments. Typical implementations begin with an abstraction of cross-correlation of the incoming signals after nonlinear monaural bandpass processing, but there are many alternative implementation choices that can be considered. These implementations differ in the ways in which an enhanced version of the desired signal is developed using binaural principles, in the extent to which specific processing mechanisms are used to impose suppression motivated by the precedence effect, and in the precise mechanism used to extract interaural time differences.
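The cross-correlation abstraction at the heart of the implementations described above can be sketched in a few lines. This broadband version (numpy; real models apply it per frequency band after nonlinear monaural processing, and the signals here are synthetic noise with an imposed delay) picks the interaural lag with the largest correlation as the ITD estimate.

```python
import numpy as np

def itd_from_xcorr(left, right, fs, max_lag_ms=1.0):
    """Jeffress-style ITD estimate: cross-correlate the two ear signals
    over physiologically plausible lags (about +/-1 ms) and return the
    lag, in seconds, with the largest correlation."""
    max_lag = int(max_lag_ms * 1e-3 * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.sum(left[max(0, -l):len(left) - max(0, l)] *
                   right[max(0, l):len(right) - max(0, -l)])
            for l in lags]
    return lags[int(np.argmax(corr))] / fs

fs = 48000
rng = np.random.default_rng(1)
src = rng.standard_normal(4800)
delay = 12                               # samples, i.e. a 0.25-ms ITD
left, right = src[delay:], src[:-delay]  # right-ear copy lags the left
print(itd_from_xcorr(left, right, fs) * 1e3)   # estimated ITD in ms
```

The alternative implementations the paper compares differ mainly in what happens before (band-splitting, envelope extraction, precedence-effect suppression) and after (how the per-band lag pattern steers enhancement) this correlation step.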
