From audio to content
2014
Abstract
Models for representing the information carried by sound are necessary to describe that information from both the perceptual and the operative point of view. Beyond the models, analysis methods are needed to discover the parameters that allow sound description, ideally without loss with respect to the physical and perceptual properties being described. When aiming at the extraction of information from sound, we need to discard every feature that is not relevant. This process of feature extraction consists of various steps, starting from pre-processing the sound, followed by windowing, extraction, and post-processing procedures. An audio signal classification system can be represented in general terms as in Figure 4.1. Pre-processing consists of noise reduction, equalization, and low-pass filtering. In speech processing (voice has a low-pass behavior), a pre-emphasis is applied by high-pass filtering the signal to smooth the spectrum and achieve a uniform energy distribution spect...
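As a minimal sketch of the pre-emphasis step mentioned in the abstract, assuming the common first-order filter y[n] = x[n] - a * x[n-1]; the coefficient a = 0.97 is an assumed, typical value, not one given in the text:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """High-pass pre-emphasis: y[n] = x[n] - a * x[n-1].

    a = 0.97 is a commonly used coefficient (assumed here; the text
    does not specify one)."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]
    return y
```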
FAQs
What factors influence the choice of window functions in short-time signal analysis?
The window must be short enough for the signal to be considered stationary within it and smooth enough to limit spectral leakage, while the frame rate must be high enough that no parameter variation is missed between frames; the choice is therefore a trade-off between window duration and smoothness.
How does frame rate affect the temporal accuracy in audio signal processing?
The frame rate, defined as the number of frames computed per second, determines the hop size H between consecutive frames and therefore sets the temporal resolution of the time-frequency analysis.
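A short sketch of windowing and framing under these definitions, assuming a Hann window and a hop size H = fs / frame_rate; the frame length, frame rate, and window type are illustrative choices, not values taken from the text:

```python
import numpy as np

def frame_signal(x, fs, frame_len=1024, frame_rate=100):
    """Split x into overlapping Hann-windowed frames.

    The hop size H follows from the frame rate: H = fs / frame_rate,
    so a higher frame rate gives finer temporal resolution."""
    hop = int(fs / frame_rate)                 # H in samples
    window = np.hanning(frame_len)             # smooth taper, reduces leakage
    n_frames = 1 + (len(x) - frame_len) // hop # assumes len(x) >= frame_len
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])
```

For example, at fs = 44100 Hz and a frame rate of 100 frames/s, H = 441 samples: frame starts are 10 ms apart, while each 1024-sample frame still spans about 23 ms.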
What methods are used for pitch detection in audio signals?
Time-domain pitch detectors use the Short-Time Autocorrelation Function and the Short-Time Average Magnitude Difference Function (AMDF): the first maximum of the autocorrelation (or the first minimum of the AMDF) after lag k = 0 gives the period, from which the fundamental frequency F0 is estimated.
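A sketch of both estimators; restricting the lag search to a plausible F0 range (fmin, fmax are hypothetical parameters, used here instead of a literal "first extremum after k = 0" for robustness):

```python
import numpy as np

def f0_autocorrelation(frame, fs, fmin=50.0, fmax=500.0):
    """F0 from the strongest autocorrelation peak in the allowed lag range."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    kmin, kmax = int(fs / fmax), int(fs / fmin)  # frame must exceed fs/fmin samples
    k = kmin + np.argmax(r[kmin:kmax])
    return fs / k

def f0_amdf(frame, fs, fmin=50.0, fmax=500.0):
    """F0 from the deepest AMDF valley in the allowed lag range."""
    kmin, kmax = int(fs / fmax), int(fs / fmin)
    amdf = np.array([np.mean(np.abs(frame[k:] - frame[:-k]))
                     for k in range(kmin, kmax)])
    return fs / (kmin + np.argmin(amdf))
```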
How does Short-Time Average Energy differ from Average Magnitude in signal analysis?
Short-Time Average Energy squares the signal, so it is very sensitive to large sample values; Short-Time Average Magnitude sums absolute amplitudes instead, yielding a measure that is less dominated by large excursions.
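The difference is visible directly in the two definitions; a minimal frame-wise sketch:

```python
import numpy as np

def short_time_energy(frame):
    """E = sum of x[n]^2: squaring emphasizes large sample values."""
    return np.sum(np.asarray(frame, dtype=float) ** 2)

def short_time_magnitude(frame):
    """M = sum of |x[n]|: linear in amplitude, less dominated by peaks."""
    return np.sum(np.abs(frame))
```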
What insights does the Zero-Crossing Rate (ZCR) provide regarding audio signals?
The ZCR counts how many times the signal changes sign within a frame. For narrow-band signals it is directly related to the fundamental frequency, which makes it a computationally cheap frequency estimator.
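A sketch of the ZCR and the resulting F0 estimate, assuming two zero crossings per period, which is exact only for a near-sinusoidal (narrow-band) frame:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive-sample pairs whose signs differ."""
    s = np.sign(frame)
    s[s == 0] = 1                        # count exact zeros as positive
    return np.mean(s[1:] != s[:-1])

def f0_from_zcr(frame, fs):
    """A sinusoid at F0 crosses zero twice per period, so F0 ~ ZCR * fs / 2."""
    return zero_crossing_rate(frame) * fs / 2.0
```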
- 4.1 Sound Analysis
  - 4.1.1 Time domain: Short-time analysis
    - 4.1.1.1 Windowing
    - 4.1.1.2 Short-Time Average Energy and Magnitude
    - 4.1.1.3 Temporal envelope estimation
    - 4.1.1.4 Short-Time Average Zero-Crossing Rate
    - 4.1.1.5 Short-Time Autocorrelation Function
    - 4.1.1.6 Short-Time Average Magnitude Difference Function
    - 4.1.1.7 Pitch detection (f0) by time-domain methods
  - 4.1.2 Frequency domain analysis
    - 4.1.2.1 Spectral Envelope estimation
    - 4.1.2.2 Spectral Envelope and Pitch estimation via Cepstrum
    - 4.1.2.3 Analysis via mel-cepstrum
  - 4.1.3 Auditory model for sound analysis
    - 4.1.3.1 Auditory modeling motivations
    - 4.1.3.2 Auditory analysis: Seneff's model
- 4.2 Audio features for sound description
  - 4.2.1 Temporal features
    - 4.2.1.1 ADSR envelope modelling
  - 4.2.2 Energy features
  - 4.2.3 Spectral features
  - 4.2.4 Harmonic features
  - 4.2.5 Perceptual features
  - 4.2.6 Onset Detection
    - 4.2.6.1 Onset detection in frequency domain
    - 4.2.6.2 Local Energy
  - 4.2.7 A case study
    - 4.2.7.1 Cues from MIDI data
    - 4.2.7.2 Auditory based cues
    - 4.2.7.3 Onset detection
    - 4.2.7.4 Feature Selection
- 4.3 Object description: MPEG-4
  - 4.3.1 Scope and features of the MPEG-4 standard
  - 4.3.2 The utility of objects
  - 4.3.3 Coded representation of media objects
    - 4.3.3.1 Composition of media objects
    - 4.3.3.2 Description and synchronization of streaming data for media objects
  - 4.3.4 MPEG-4 visual objects
  - 4.3.5 MPEG-4 audio
    - 4.3.5.1 Natural audio
    - 4.3.5.2 Synthesized audio
    - 4.3.5.3 Sound spatialization
    - 4.3.5.4 Audio BIFS
- 4.4 Multimedia Content Description: MPEG-7
  - 4.4.1 Introduction
    - 4.4.1.1 Context of MPEG-7
    - 4.4.1.2 MPEG-7 objectives
  - 4.4.2 MPEG-7 terminology
  - 4.4.3 Scope of the Standard
  - 4.4.4 MPEG-7 Application Areas
    - 4.4.4.1 Making audio-visual material as searchable as text
    - 4.4.4.2 Supporting push and pull information acquisition methods
    - 4.4.4.3 Enabling nontraditional control of information
  - 4.4.5 MPEG-7 description tools
  - 4.4.6 MPEG-7 Audio
    - 4.4.6.1 MPEG-7 Audio Description Framework
    - 4.4.6.2 High-level audio description tools (Ds and DSs)