From audio to content
2014
Abstract
Models for representing the information carried by sound are necessary to describe that information from both the perceptual and the operative point of view. Beyond the models, analysis methods are needed to discover the parameters that allow sound description, ideally without loss with respect to the physical and perceptual properties being described. When aiming at the extraction of information from sound, we need to discard every feature that is not relevant. This process of feature extraction consists of various steps, starting from pre-processing the sound, followed by windowing, extraction, and post-processing procedures. An audio signal classification system can be represented in general terms as in Figure 4.1. Pre-processing consists of noise reduction, equalization, and low-pass filtering. In speech processing (voice has a low-pass behavior), a pre-emphasis is applied by high-pass filtering the signal to smooth the spectrum and achieve a uniform energy distribution spect...
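As a minimal sketch of the pre-emphasis step mentioned in the abstract, assuming the common first-order filter y[n] = x[n] - a * x[n-1]; the coefficient a = 0.97 is an assumed, typical value, not one given in the text:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """High-pass pre-emphasis: y[n] = x[n] - a * x[n-1].

    a = 0.97 is a commonly used coefficient (assumed here; the text
    does not specify one)."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]
    return y
```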
FAQs
What factors influence the choice of window functions in short-time signal analysis?
The window must be short enough for the signal to be considered stationary within it and smooth enough to limit spectral leakage, while the frame rate must be high enough that no parameter variation is missed between frames; the choice is therefore a trade-off between window duration and smoothness.
How does frame rate affect the temporal accuracy in audio signal processing?
The frame rate, defined as the number of frames computed per second, determines the hop size H between consecutive frames and therefore sets the temporal resolution of the time-frequency analysis.
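A short sketch of windowing and framing under these definitions, assuming a Hann window and a hop size H = fs / frame_rate; the frame length, frame rate, and window type are illustrative choices, not values taken from the text:

```python
import numpy as np

def frame_signal(x, fs, frame_len=1024, frame_rate=100):
    """Split x into overlapping Hann-windowed frames.

    The hop size H follows from the frame rate: H = fs / frame_rate,
    so a higher frame rate gives finer temporal resolution."""
    hop = int(fs / frame_rate)                 # H in samples
    window = np.hanning(frame_len)             # smooth taper, reduces leakage
    n_frames = 1 + (len(x) - frame_len) // hop # assumes len(x) >= frame_len
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])
```

For example, at fs = 44100 Hz and a frame rate of 100 frames/s, H = 441 samples: frame starts are 10 ms apart, while each 1024-sample frame still spans about 23 ms.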
What methods are used for pitch detection in audio signals?
Time-domain pitch detectors use the Short-Time Autocorrelation Function and the Short-Time Average Magnitude Difference Function (AMDF): the first maximum of the autocorrelation (or the first minimum of the AMDF) after lag k = 0 gives the period, from which the fundamental frequency F0 is estimated.
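A sketch of both estimators; restricting the lag search to a plausible F0 range (fmin, fmax are hypothetical parameters, used here instead of a literal "first extremum after k = 0" for robustness):

```python
import numpy as np

def f0_autocorrelation(frame, fs, fmin=50.0, fmax=500.0):
    """F0 from the strongest autocorrelation peak in the allowed lag range."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    kmin, kmax = int(fs / fmax), int(fs / fmin)  # frame must exceed fs/fmin samples
    k = kmin + np.argmax(r[kmin:kmax])
    return fs / k

def f0_amdf(frame, fs, fmin=50.0, fmax=500.0):
    """F0 from the deepest AMDF valley in the allowed lag range."""
    kmin, kmax = int(fs / fmax), int(fs / fmin)
    amdf = np.array([np.mean(np.abs(frame[k:] - frame[:-k]))
                     for k in range(kmin, kmax)])
    return fs / (kmin + np.argmin(amdf))
```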
How does Short-Time Average Energy differ from Average Magnitude in signal analysis?
Short-Time Average Energy squares the signal, so it is very sensitive to large sample values; Short-Time Average Magnitude sums absolute amplitudes instead, yielding a measure that is less dominated by large excursions.
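The difference is visible directly in the two definitions; a minimal frame-wise sketch:

```python
import numpy as np

def short_time_energy(frame):
    """E = sum of x[n]^2: squaring emphasizes large sample values."""
    return np.sum(np.asarray(frame, dtype=float) ** 2)

def short_time_magnitude(frame):
    """M = sum of |x[n]|: linear in amplitude, less dominated by peaks."""
    return np.sum(np.abs(frame))
```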
What insights does the Zero-Crossing Rate (ZCR) provide regarding audio signals?
The ZCR counts how many times the signal changes sign within a frame. For narrow-band signals it is directly related to the fundamental frequency, which makes it a computationally cheap frequency estimator.
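A sketch of the ZCR and the resulting F0 estimate, assuming two zero crossings per period, which is exact only for a near-sinusoidal (narrow-band) frame:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive-sample pairs whose signs differ."""
    s = np.sign(frame)
    s[s == 0] = 1                        # count exact zeros as positive
    return np.mean(s[1:] != s[:-1])

def f0_from_zcr(frame, fs):
    """A sinusoid at F0 crosses zero twice per period, so F0 ~ ZCR * fs / 2."""
    return zero_crossing_rate(frame) * fs / 2.0
```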
- 4.1 Sound Analysis
  - 4.1.1 Time domain: Short-time analysis
    - 4.1.1.1 Windowing
    - 4.1.1.2 Short-Time Average Energy and Magnitude
    - 4.1.1.3 Temporal envelope estimation
    - 4.1.1.4 Short-Time Average Zero-Crossing Rate
    - 4.1.1.5 Short-Time Autocorrelation Function
    - 4.1.1.6 Short-Time Average Magnitude Difference Function
    - 4.1.1.7 Pitch detection (f0) by time-domain methods
  - 4.1.2 Frequency domain analysis
    - 4.1.2.1 Spectral Envelope estimation
    - 4.1.2.2 Spectral Envelope and Pitch estimation via Cepstrum
    - 4.1.2.3 Analysis via mel-cepstrum
  - 4.1.3 Auditory model for sound analysis
    - 4.1.3.1 Auditory modeling motivations
    - 4.1.3.2 Auditory analysis: Seneff's model
- 4.2 Audio features for sound description
  - 4.2.1 Temporal features
    - 4.2.1.1 ADSR envelope modelling
  - 4.2.2 Energy features
  - 4.2.3 Spectral features
  - 4.2.4 Harmonic features
  - 4.2.5 Perceptual features
  - 4.2.6 Onset Detection
    - 4.2.6.1 Onset detection in frequency domain
    - 4.2.6.2 Local Energy
  - 4.2.7 A case study
    - 4.2.7.1 Cues from MIDI data
    - 4.2.7.2 Auditory based cues
    - 4.2.7.3 Onset detection
    - 4.2.7.4 Feature Selection
- 4.3 Object description: MPEG-4
  - 4.3.1 Scope and features of the MPEG-4 standard
  - 4.3.2 The utility of objects
  - 4.3.3 Coded representation of media objects
    - 4.3.3.1 Composition of media objects
    - 4.3.3.2 Description and synchronization of streaming data for media objects
  - 4.3.4 MPEG-4 visual objects
  - 4.3.5 MPEG-4 audio
    - 4.3.5.1 Natural audio
    - 4.3.5.2 Synthesized audio
    - 4.3.5.3 Sound spatialization
    - 4.3.5.4 Audio BIFS
- 4.4 Multimedia Content Description: MPEG-7
  - 4.4.1 Introduction
    - 4.4.1.1 Context of MPEG-7
    - 4.4.1.2 MPEG-7 objectives
  - 4.4.2 MPEG-7 terminology
  - 4.4.3 Scope of the Standard
  - 4.4.4 MPEG-7 Application Areas
    - 4.4.4.1 Making audio-visual material as searchable as text
    - 4.4.4.2 Supporting push and pull information acquisition methods
    - 4.4.4.3 Enabling nontraditional control of information
  - 4.4.5 MPEG-7 description tools
  - 4.4.6 MPEG-7 Audio
    - 4.4.6.1 MPEG-7 Audio Description Framework
    - 4.4.6.2 High-level audio description tools (Ds and DSs)