Perceptual Models for Speech, Audio, and Music Processing
2007, EURASIP Journal on Audio, Speech, and Music Processing
https://doi.org/10.1155/2007/12687…
Abstract
Recent advances in understanding human auditory perception have led to significant progress across audio, speech, and music processing, including coding, recognition, synthesis, enhancement, and quality estimation. This special issue compiles seven papers that illustrate the range of current research in perceptual modeling, addressing topics such as cochlear-like filters, noise-robust speech recognition, wideband speech coding using psychoacoustic criteria, spectrotemporal modulation denoising, and the impact of listener accent on comprehension. The findings presented here aim both to inform and to inspire ongoing work in the modeling of human auditory perception.
Related papers
Perception & Psychophysics, 1974
Using a two-alternative temporal forced-choice technique, two binaural detection experiments were performed. In the first, the detectability of a 250-Hz, 128-msec tonal signal masked by a gated 70-dB SPL tone of the same frequency and duration was measured as a function of the level of the signal, the phase angle at which the signal was added to the masker, and the interaural phase difference of the signal. In the second experiment, the signal was a wideband (100-3,000 Hz) 128-msec Gaussian noise masked by a continuous Gaussian noise of the same bandwidth and coherent with the signal. The detectability of this noise signal was measured as a function of the same variables investigated in the first experiment. In both experiments detectability was found to follow a simple energy- or power-detection model when the interaural phase difference was 0 deg. When the interaural phase difference was 180 deg, the function relating the signal level required for a constant level of performance to the signal-masker phase angle is such that neither the Webster-Jeffress hypothesis nor Durlach's EC model accounts for the data. The data are reasonably well fit by a model proposed by Hafter and Carrier.
Informatica
We consider that the outer hair cells of the inner ear together with the local structures of the basilar membrane, reticular lamina and tectorial membrane form the primary filters (PF) of the second order. Taking into account a delay in transmission of the excitation signal in the cochlea and the influence of the Reissner membrane, we design a signal filtering system consisting of the PF with the common PF of the neighboring channels. We assess the distribution of the central frequencies of the channels along the cochlea, optimal number of the PF constituting a channel, natural frequencies of the channels, damping factors and summation weights of the outputs of the PF. As an example, we present a filter bank comprising 20 Gaussian-type channels each consisting of five PF. The proposed filtering system can be useful for designing cochlear implants based on biological principles of signal processing in the cochlea.
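The channel construction described above can be illustrated with a short numpy sketch. The center frequency, damping factor, and five-filter cascade below are illustrative choices, not the paper's derived parameters; the point is only that cascading several identical second-order primary filters yields a channel that is markedly narrower and steeper-skirted than a single filter.

```python
import numpy as np

def resonator_mag(f, fc, zeta):
    """Magnitude response |H(j*2*pi*f)| of one second-order primary
    filter, H(s) = wc^2 / (s^2 + 2*zeta*wc*s + wc^2)."""
    s = 1j * 2 * np.pi * f
    wc = 2 * np.pi * fc
    return np.abs(wc**2 / (s**2 + 2 * zeta * wc * s + wc**2))

def bandwidth_3db(f, mag):
    """Width of the region within 3 dB of the peak (assumes one peak)."""
    inside = f[mag >= mag.max() / np.sqrt(2)]
    return inside[-1] - inside[0]

f = np.linspace(100, 4000, 4000)
fc, zeta, n_pf = 1000.0, 0.1, 5       # illustrative channel parameters

single = resonator_mag(f, fc, zeta)   # one primary filter (PF)
channel = single**n_pf                # cascade of five identical PF

bw_single = bandwidth_3db(f, single)
bw_channel = bandwidth_3db(f, channel)
print(round(bw_single, 1), round(bw_channel, 1))
```

Because the normalized cascade response is the single-filter response raised to the fifth power, the region within 3 dB of the peak shrinks, which is the narrowing effect exploited when building Gaussian-like channels from simple second-order sections.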
2000
Summary: We present a model for tone-in-noise detection at low frequencies that includes a physiologically realistic mechanism for processing the information in neural discharge times. The proposed model exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery and uses cross-auditory-nerve-fiber coincidence detection to extract temporal cues. Information in the responses of model coincidence detectors
The Journal of the Acoustical Society of America, 2006
Psychoacoustic masking experiments have been widely used to investigate cochlear function in human listeners. Here we use simultaneous notched-noise masking experiments in normal hearing listeners to characterize the changes in auditory filter shape with stimulus level over the frequency range 0.25-6 kHz. At each frequency a range of fixed signal levels (30-70 dB SPL) and fixed masker levels (20-50 dB SPL spectrum level) are used in order to obtain accurate descriptions of the filter shapes in individual listeners. The notched-noise data for individual listeners are fitted with two filter shape models: a rounded exponential (roex) shape in which the filter skirt changes as a linear function of probe-tone level and the other, in which the gain of the tip filter relative to the filter tail changes as a function of signal level [Glasberg and Moore, J. Acoust. Soc. Am. 108, 2318-2328 (2000)]. The parameters for these fitted models are then described with a simple set of equations that quantify the changes in auditory filter shape across level and frequency. Both these models fitted the data equally well and both demonstrated increasing tip-tail gain as frequency increased.
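The roex(p) filter shape used in such notched-noise fits has a simple closed form. The sketch below (numpy; the fc and p values are illustrative, not fitted to these data) integrates it numerically to get the equivalent rectangular bandwidth and checks the result against the analytic value 4*fc/p.

```python
import numpy as np

def roex(g, p):
    """Rounded-exponential roex(p) weighting, W(g) = (1 + p*g)*exp(-p*g),
    where g = |f - fc| / fc is the normalized deviation from center."""
    return (1 + p * g) * np.exp(-p * g)

fc, p = 1000.0, 25.0                # illustrative values, not fitted data
g = np.linspace(0.0, 1.0, 100001)   # one skirt, out to a full fc away
dg = g[1] - g[0]

# Equivalent rectangular bandwidth: integrate both (symmetric) skirts.
erb_hz = 2 * np.sum(roex(g, p)) * dg * fc
print(round(erb_hz, 1))             # analytic value is 4*fc/p = 160 Hz
```

The closed-form check follows because each skirt integrates to 2/p in g units, so a larger p (steeper skirt) directly means a sharper filter; level-dependent fits like those above amount to letting p (or the tip-tail gain) vary with stimulus level.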
The Journal of the Acoustical Society of America, 1997
A frequency-modulation term has been added to the gammatone auditory filter to produce a filter with an asymmetric amplitude spectrum. When the degree of asymmetry in this "gammachirp" auditory filter is associated with stimulus level, the gammachirp is found to provide an excellent fit to 12 sets of notched-noise masking data from three different studies. The gammachirp has a well-defined impulse response, unlike the conventional roex auditory filter, and so it is an excellent candidate for an asymmetric, level-dependent auditory filterbank in time-domain models of auditory processing.
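That well-defined impulse response is easy to generate. The sketch below builds it with numpy; the envelope order n, bandwidth factor b, and chirp parameter c are commonly quoted defaults assumed here for illustration, and the ERB formula is the standard Glasberg-Moore approximation.

```python
import numpy as np

def erb(fc):
    """Glasberg & Moore equivalent rectangular bandwidth (Hz)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammachirp(fc, fs=16000, n=4, b=1.019, c=-2.0, dur=0.025):
    """Gammachirp impulse response:
    g(t) = t^(n-1) * exp(-2*pi*b*ERB(fc)*t) * cos(2*pi*fc*t + c*ln(t)).
    With c = 0 this reduces to the ordinary gammatone; the c*ln(t)
    frequency-modulation term is what makes the spectrum asymmetric."""
    t = np.arange(1, int(dur * fs)) / fs  # start one sample in: avoids ln(0)
    env = t**(n - 1) * np.exp(-2 * np.pi * b * erb(fc) * t)
    return env * np.cos(2 * np.pi * fc * t + c * np.log(t))

ir = gammachirp(fc=2000.0)
print(len(ir))
```

A level-dependent filterbank of the kind the abstract envisions would make c a function of stimulus level, one such impulse response per channel.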
Temporal masking models have not been previously applied in the Fast Fourier Transform (FFT) domain for speech enhancement applications. This paper presents a novel speech enhancement algorithm using temporal masking in the FFT domain. The proposed algorithm is suitable for the cochlear speech processor and for other speech applications. The input signal is analysed using FFT and then grouped into 22 critical bands. The noise power is estimated using a minimum statistics noise tracking algorithm. A short-term temporal masking threshold is then calculated for each critical band and a gain factor for each band is then computed. The objective and subjective evaluations show that the temporal masking model based speech enhancement scheme outperforms the traditional Wiener filtering approach in the FFT domain.
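The per-frame flow described above can be sketched in numpy as follows. The frame size, the uniform bin edges standing in for the 22 critical bands, and the Wiener-like gain are all simplifications: the paper derives its gains from a short-term temporal masking threshold and estimates noise with a minimum-statistics tracker, neither of which is reproduced here.

```python
import numpy as np

def enhance_frame(frame, noise_psd, band_edges, floor=0.1):
    """FFT analysis, grouping of bins into bands, one gain per band,
    resynthesis. The SNR-based gain below is a stand-in for the
    paper's temporal-masking-threshold rule."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spec)**2
    gains = np.ones_like(power)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        snr = power[lo:hi].mean() / max(noise_psd[lo:hi].mean(), 1e-12)
        gains[lo:hi] = max(floor, 1.0 - 1.0 / (1.0 + snr))  # Wiener-like
    return np.fft.irfft(spec * gains, n=len(frame))

fs, n = 8000, 256
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 440 * np.arange(n) / fs) \
        + 0.3 * rng.standard_normal(n)
noise_psd = np.full(n // 2 + 1, 0.3**2 * n / 4)   # rough white-noise level
bands = np.linspace(0, n // 2 + 1, 23).astype(int)  # 22 bands, as in the paper
out = enhance_frame(frame, noise_psd, bands)
print(out.shape)
```

A complete system would run this over overlapping frames with overlap-add and replace the flat noise floor with a running minimum-statistics estimate.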
Lecture Notes in Computer Science, 2012
In this paper, the results of psychoacoustical experiments on auditory time-frequency (TF) masking using stimuli (masker and target) with maximal concentration in the TF plane are presented. The target was shifted either along the time axis, the frequency axis, or both relative to the masker. The results show that a simple superposition of spectral and temporal masking functions does not provide an accurate representation of the measured TF masking function. This confirms the inaccuracy of simple models of TF masking currently implemented in some perceptual audio codecs. In the context of audio signal processing, the present results constitute a crucial basis for the prediction of auditory masking in the TF representations of sounds. An algorithm that removes the inaudible components in the wavelet transform of a sound while causing no audible difference to the original sound after re-synthesis is proposed. Preliminary results are promising, although further development is required.
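The pruning idea in the last step above can be caricatured in a few lines. The sketch below zeroes every time-frequency coefficient far below its frame's strongest component; a flat relative threshold is a crude stand-in for the measured TF masking function the paper argues is needed.

```python
import numpy as np

def prune_inaudible(coeffs, rel_threshold_db=-40.0):
    """Zero every TF coefficient more than rel_threshold_db below the
    strongest component in its frame (column). A real implementation
    would use a measured time-frequency masking function instead of a
    flat relative threshold."""
    mag = np.abs(coeffs)
    thresh = mag.max(axis=0, keepdims=True) * 10 ** (rel_threshold_db / 20)
    return np.where(mag >= thresh, coeffs, 0.0)

rng = np.random.default_rng(0)
# toy TF matrix: one strong component per frame plus very weak noise
C = 1e-4 * rng.standard_normal((64, 10))
C[5, :] = 1.0
P = prune_inaudible(C)
print(np.count_nonzero(P), np.count_nonzero(C))
```

The measured TF masking functions reported in the paper would replace the flat threshold with one that spreads in both time and frequency around each masker component.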
The Journal of the Acoustical Society of America, 2013
Most usual speech masking situations induce both energetic and informational masking. Energetic masking (E.M.) arises because both signal and maskers contain energy in the same critical bands. Informational masking (I.M.) prevents the listeners from disentangling acoustical streams even when they are well separated in frequency, and is thought to reflect central mechanisms. In order to quantify I.M. without E.M. contamination in complex auditory situations, target and maskers can be presented dichotically. However, this manipulation provides the listeners with important lateralisation cues, which reduces I.M. Therefore, this study aimed at restoring a fair amount of I.M. using complex tones in a new dichotic paradigm. Regularly repeating signals and random-frequency multitone maskers were presented dichotically, but switched from one ear to the other within a 10-s sequence. Switches could appear at either a slow or a rapid rate. We compared listeners' detection performance in these switching situations to that elicited in traditional diotic and dichotic situations. Results showed that the amount of I.M. induced when signal and maskers were rapidly switching throughout a sequence was significantly higher than in classical dichotic situations. Therefore, this paradigm provides an original tool to evaluate auditory perception in situations of pure I.M. using complex tones.
IEEE Transactions on Speech and Audio Processing, 2002
This paper proposes a versatile perceptual audio coding method that achieves high compression ratios and is capable of low encoding/decoding delay. It accommodates a variety of source signals (including both music and speech) with different sampling rates. It is based on separating irrelevance and redundancy reductions into independent functional units. This contrasts traditional audio coding, where both are integrated within the same subband decomposition. The separation allows for the independent optimization of the irrelevance and redundancy reduction units. For both reductions, we rely on adaptive filtering and predictive coding as much as possible to minimize the delay. A psycho-acoustically controlled adaptive linear filter is used for the irrelevance reduction, and the redundancy reduction is carried out by a predictive lossless coding scheme, termed the weighted cascaded least mean squared (WCLMS) method. Experiments are carried out on a database of moderate size which contains mono-signals of different sampling rates and varying nature (music, speech, or mixed). They show that the proposed WCLMS lossless coder outperforms other competing lossless coders in terms of compression ratios and delay, as applied to the pre-filtered signal. Moreover, a subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.
Index Terms: least mean squared (LMS) algorithm, lossless coding, perceptual audio coding, prediction.
Perceptual audio coding removes both "irrelevance" and "redundancy" from a signal. The former is defined as signal components undetectable by the receiver (the ear). Psycho-acoustics defines the masked threshold as the threshold below which distortions cannot be heard; this threshold is time-, frequency-, and signal-dependent. Perceptual audio coding keeps only audible signal components by hiding quantization distortions below the threshold.
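The redundancy-reduction stage discussed above is built from LMS-type adaptive linear predictors. The sketch below (numpy; a single normalized-LMS stage with an illustrative order and step size, not the paper's full weighted cascade) shows the core operation: predict each sample from its recent past and keep only the residual, which for predictable signals carries far less energy.

```python
import numpy as np

def nlms_predict(x, order=8, mu=0.5, eps=1e-8):
    """One NLMS linear predictor: predict x[n] from the previous
    `order` samples and return the prediction residual. WCLMS cascades
    several such predictors and combines their outputs with adaptive
    weights; this single stage is a simplified sketch."""
    w = np.zeros(order)
    residual = np.empty(len(x))
    for n in range(len(x)):
        u = np.zeros(order)
        past = x[max(0, n - order):n][::-1]   # most recent sample first
        u[:len(past)] = past
        e = x[n] - w @ u
        residual[n] = e
        w += mu * e * u / (eps + u @ u)       # normalized LMS update
    return residual

t = np.arange(2000)
x = np.sin(2 * np.pi * 0.01 * t)   # highly predictable test signal
r = nlms_predict(x)
# after adaptation, residual energy is far below signal energy
print(np.mean(r[500:]**2) < 0.05 * np.mean(x**2))
```

In the coder, this residual (from the psychoacoustically pre-filtered signal) is what gets entropy-coded losslessly, which is why prediction quality translates directly into compression ratio.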
2008
It is well known that binaural processing is very useful for separating incoming sound sources as well as for improving the intelligibility of speech in reverberant environments. This paper describes and compares a number of ways in which the classic model of interaural cross-correlation proposed by Jeffress, quantified by Colburn, and further elaborated by Blauert, Lindemann, and others, can be applied to improving the accuracy of automatic speech recognition systems operating in cluttered, noisy, and reverberant environments. Typical implementations begin with an abstraction of cross-correlation of the incoming signals after nonlinear monaural bandpass processing, but there are many alternative implementation choices that can be considered. These implementations differ in the ways in which an enhanced version of the desired signal is developed using binaural principles, in the extent to which specific processing mechanisms are used to impose suppression motivated by the precedence effect, and in the precise mechanism used to extract interaural time differences.
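The cross-correlation abstraction at the heart of the implementations described above can be sketched in a few lines. This broadband version (numpy; real models apply it per frequency band after nonlinear monaural processing, and the signals here are synthetic noise with an imposed delay) picks the interaural lag with the largest correlation as the ITD estimate.

```python
import numpy as np

def itd_from_xcorr(left, right, fs, max_lag_ms=1.0):
    """Jeffress-style ITD estimate: cross-correlate the two ear signals
    over physiologically plausible lags (about +/-1 ms) and return the
    lag, in seconds, with the largest correlation."""
    max_lag = int(max_lag_ms * 1e-3 * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.sum(left[max(0, -l):len(left) - max(0, l)] *
                   right[max(0, l):len(right) - max(0, -l)])
            for l in lags]
    return lags[int(np.argmax(corr))] / fs

fs = 48000
rng = np.random.default_rng(1)
src = rng.standard_normal(4800)
delay = 12                               # samples, i.e. a 0.25-ms ITD
left, right = src[delay:], src[:-delay]  # right-ear copy lags the left
print(itd_from_xcorr(left, right, fs) * 1e3)   # estimated ITD in ms
```

The alternative implementations the paper compares differ mainly in what happens before (band-splitting, envelope extraction, precedence-effect suppression) and after (how the per-band lag pattern steers enhancement) this correlation step.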
