Audio Classification

288 papers

76 followers

About this topic

Audio classification is a subfield of machine learning and signal processing that involves the automatic categorization of audio signals into predefined classes based on their features. It utilizes algorithms to analyze audio data, enabling applications such as speech recognition, music genre classification, and environmental sound identification.

Key research themes

1. How can feature extraction and dimensionality reduction improve accuracy in music genre and audio type classification?

This theme investigates the development and application of advanced feature extraction methods combined with dimensionality reduction techniques to enhance audio classification accuracy, particularly in music genre classification and speech/music discrimination. The focus lies on capturing relevant audio characteristics through timbral, spectral, and rhythmic features and optimizing their representation in reduced dimension spaces that preserve class-distinguishing information, facilitating more effective classification algorithms.

MUSICAL GENRE CLASSIFICATION OF AUDIO SIGNALS USING GEOMETRIC METHODS

by Денис Тихогло

2018

Key finding: This study introduced a nonlinear dimensionality reduction technique, Diffusion Maps, applied on timbral texture features for music genre classification. It improved classification accuracy dramatically, achieving 97%... Read more

Construction and evaluation of a robust multifeature speech/music discriminator

by qweqwe qweqwe

2025, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing

Key finding: The research evaluated 13 distinct features related to temporal and spectral characteristics such as 4 Hz modulation energy, spectral rolloff, spectral centroid, spectral flux, and zero-crossing rate, and combined them using... Read more
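Several of the features named in this summary have simple frame-level definitions. The following is a minimal sketch, assuming NumPy, a Hann window, and a 95% rolloff fraction; the function names and parameter choices are illustrative assumptions and are not taken from the paper, and the 4 Hz modulation energy feature is not covered.

```python
import numpy as np

def frame_features(frame, sample_rate, rolloff_fraction=0.95):
    """Return (zero_crossing_rate, spectral_centroid_hz, spectral_rolloff_hz)
    for one audio frame. rolloff_fraction=0.95 is an assumed, common choice."""
    frame = np.asarray(frame, dtype=float)
    # Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
    signs = np.signbit(frame)
    zcr = float(np.mean(signs[1:] != signs[:-1]))
    # Magnitude spectrum of the Hann-windowed frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0:
        return zcr, 0.0, 0.0
    # Centroid: magnitude-weighted mean frequency.
    centroid = float((freqs * spectrum).sum() / total)
    # Rolloff: lowest frequency below which the chosen fraction of magnitude lies.
    cumulative = np.cumsum(spectrum)
    rolloff = float(freqs[np.searchsorted(cumulative, rolloff_fraction * total)])
    return zcr, centroid, rolloff

def spectral_flux(prev_frame, frame):
    """2-norm of the difference between the normalized magnitude spectra
    of two consecutive frames (one common definition, assumed here)."""
    def normalized_spectrum(x):
        mag = np.abs(np.fft.rfft(np.asarray(x, dtype=float)))
        norm = np.linalg.norm(mag)
        return mag / norm if norm > 0 else mag
    return float(np.linalg.norm(normalized_spectrum(frame) - normalized_spectrum(prev_frame)))
```

In a speech/music discriminator these per-frame values would typically be summarized over a longer window before classification; that step is left out of the sketch.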

Optimized Audio Classification and Segmentation Algorithm by Using Ensemble Methods

by MUHAMMAD RASHID

2021, Mathematical Problems in Engineering

Key finding: The paper proposed a hybrid classification strategy combining bagged Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs) using features like Mel-frequency cepstral coefficients (MFCCs) for four-class audio... Read more

Primary Investigation of Sound Recognition for a domotic application using Support Vector

by Dan Istrate

2024, Annals of the University of Craiova, Series: …

Key finding: The study applied Support Vector Machines (SVMs) as a robust classification technique on features derived from environmental sounds including speech, door claps, and alarms in a domotic setting. Its methodological rigor in... Read more

2. What roles do binaural and spatial features play in classifying complex acoustic scenes and spatial audio recordings?

This research theme focuses on the extraction and utilization of binaural spatial cues and spectro-temporal features for the classification of spatial audio scenes recorded with binaural setups. It addresses the classification of complex environments and sound distributions around a listener, which is essential for applications in virtual reality, audio indexing, and scene analysis. The studies explore feature selection, classifier performance, and challenges related to reverberation and source ambiguity in acoustically rich settings.

Automatic Spatial Audio Scene Classification in Binaural Recordings of Music

by Sławomir Zieliński and

2020, Applied Sciences

Key finding: The study demonstrated that binaural cues combined with Mel-frequency cepstral coefficients (MFCCs) enable classification of different spatial audio scenes with accuracies up to 98% on binaural room impulse response (BRIR)... Read more

Feature Extraction of Binaural Recordings for Acoustic Scene Classification

by Hyunkook Lee

2022, 2018 Federated Conference on Computer Science and Information Systems (FedCSIS)

Key finding: By extracting over a thousand spatial and spectro-temporal features from binaural signals, the study showed the superior influence of spectro-temporal features over spatial-only metrics for classification accuracy. Using... Read more

Musical Genre Classification Enhanced by Improved Source Separation Technique

by G. Tsihrintzis

2021

Key finding: Though primarily focused on source separation, this work indirectly relates by enhancing the extraction of instrument-specific audio features which can be spatialized through binaural and multi-channel processing. Using... Read more

3. How are deep learning and neuromorphic approaches advancing audio event classification and bioacoustic signal recognition?

This theme examines the shift towards deep learning architectures, particularly convolutional neural networks (CNNs), and emerging neuromorphic computing techniques including spiking neural networks (SNNs) in audio event detection, environmental sound classification, and bioacoustic signal analysis. The focus lies in leveraging biologically inspired models and data-driven feature representations for improved robustness, scalability, and real-time processing capabilities across diverse audio classification tasks.

Convolutional Neural Network based Audio Event Classification

by Minkyu Lim

2023, Ksii Transactions on Internet and Information Systems

Key finding: The paper showed that treating Mel-scale filter bank features concatenated over frames as images input to CNNs led to an audio event classification accuracy of 81.5% across thirty classes including dog barks and sirens,... Read more
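The idea of treating filter bank features concatenated over frames as images can be sketched as a windowing step. This is an illustrative sketch only: it assumes NumPy, a precomputed matrix of shape (frames, bands), and hypothetical `patch_len` and `hop` parameters; the paper's actual patch size and network are not given here.

```python
import numpy as np

def frames_to_patches(features, patch_len, hop):
    """Cut a (n_frames, n_bands) feature matrix into overlapping 2-D patches
    of shape (n_bands, patch_len), one patch per CNN input image."""
    features = np.asarray(features)
    n_frames, n_bands = features.shape
    starts = range(0, n_frames - patch_len + 1, hop)
    patches = [features[s:s + patch_len].T for s in starts]
    if not patches:
        # Too few frames for a single patch.
        return np.empty((0, n_bands, patch_len))
    return np.stack(patches)
```

Each patch has frequency on one axis and time on the other, so it can be fed to a 2-D convolutional layer like a single-channel image.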

A Review of Automated Bioacoustics and General Acoustics Classification Research

by Leah Mutanu

2023, Sensors

Key finding: This survey highlighted the growing adoption of machine learning, especially ensemble methods and CNNs, in bioacoustic and general acoustic classification. It revealed that deep learning architectures have improved... Read more

Fundamental Survey on Neuromorphic Based Audio Classification

by Amlan Basu

2025, arXiv:2502.15056

Key finding: This comprehensive survey underscored the promise of neuromorphic computing platforms based on spiking neural networks for audio classification, detailing advantages such as energy efficiency, real-time event-based... Read more

Sound Classification Using Python

by Siuli Das

2022, ITM Web of Conferences

Key finding: The paper presents a practical implementation of environmental sound classification applying neural networks trained on MFCC feature sets, illustrating that convolutional and fully connected networks can effectively... Read more

All papers in Audio Classification

Classification of general audio data for content-based retrieval

by Nevenka Dimitrova

2001, Pattern Recognition Letters

In this paper, we address the problem of classification of continuous general audio data (GAD) for content-based retrieval, and describe a scheme that is able to classify audio segments into seven categories consisting of silence, single... more

Discrimination of speech from nonspeech based on multiscale spectro-temporal Modulations

by Shihab Shamma

2000, IEEE Transactions on Audio, Speech and Language Processing

We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from nonspeech... more

Fig. 1. Schematic of the early stages of auditory processing. (1) Sound is analyzed by a model of the cochlea consisting of a bank of 128 constant-Q bandpass filters with center frequencies equally spaced on a logarithmic frequency axis (tonotopic axis). (2) Each filter output is then transduced into auditory-nerve patterns by a hair cell stage which is modeled as a three-step operation: a highpass filter (the fluid-cilia coupling), followed by an instantaneous nonlinear compression (gated ionic channels), and then a lowpass filter (hair cell membrane leakage). (3) Finally, a lateral inhibitory network detects discontinuities in the responses across the tonotopic axis of the auditory nerve array by a first-order derivative with respect to the tonotopic axis, followed by a half-wave rectification. The final output of this stage (auditory spectrogram) is obtained by integrating Yn over a short window, mimicking the further loss of phase-locking observed in the midbrain.
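Stage (3) of the caption above can be illustrated with a few array operations. The sketch below assumes NumPy and an input array of auditory-nerve responses shaped (channels, time); the cochlear filter bank and hair cell stages are not modeled, and the moving-average window is a stand-in assumption for the short-window integration.

```python
import numpy as np

def lateral_inhibition(auditory_nerve, window=8):
    """Approximate stage (3): derivative across the tonotopic axis,
    half-wave rectification, then short-window integration over time.
    `window` (in samples) is an assumed value, not taken from the paper."""
    x = np.asarray(auditory_nerve, dtype=float)
    # First-order difference across channels (axis 0 is the tonotopic axis).
    diff = np.diff(x, axis=0, prepend=x[:1])
    # Half-wave rectification keeps only positive discontinuities.
    rectified = np.maximum(diff, 0.0)
    # Moving average along time as a simple short-window integrator.
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, rectified
    )
```

The output has the same shape as the input as long as `window` does not exceed the number of time samples.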

Fig. 2. (a) Cortical multiscale representation of speech. The auditory spectrogram (the output of the early stage) is analyzed by a bank of spectro-temporal modulation selective filters. The spectro-temporal response field (STRF) of one such filter is shown, which corresponds to a neuron that responds well to a ripple of 4-Hz rate and 0.5 cycle/octave scale. The output from such a filter is computed by convolving the STRF with the input spectrogram. The total output as a function of time from the model is therefore indexed by three parameters: scale, rate, and frequency. (b) Average rate-scale modulation of speech obtained by summing over all frequencies and averaging over each time window (equations (21) and (22)). The right panel with positive rates is the response of downward filters (w+) and the left panel with negative rates is the upward ones (w−).

by projecting data samples on principal axes and keeping only the components that correspond to the largest singular values of that subspace. However, unlike the matrix case in which the best rank-R approximation of a given matrix is obtained from the truncated SVD, this procedure does not result in optimal approximation in the case of tensors. Instead, the optimal best rank-(R_1, R_2, ..., R_N) approximation of a tensor can be obtained by an iterative algorithm in which HOSVD provides the initial values [27].
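The truncated HOSVD mentioned above can be sketched as one SVD per mode unfolding followed by projection onto the leading singular vectors. This is a minimal NumPy sketch under that standard definition; the helper names are illustrative, and the iterative refinement that yields the optimal rank-(R_1, ..., R_N) approximation is not shown.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along `mode` by `matrix` (shape: new_dim x old_dim)."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def truncated_hosvd(tensor, ranks):
    """Return (core, factors) keeping ranks[n] leading singular vectors per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto retained axes
    return core, factors

def reconstruct(core, factors):
    """Map a core tensor back to the original space."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out
```

For a tensor whose multilinear rank does not exceed the requested ranks the reconstruction is exact; otherwise it is a good but generally suboptimal approximation, which is why it serves as an initial value.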

Fig. 4. Total number of retained PCs in each of the subspaces of frequency, rate, and scale as a function of threshold on contribution percentage. The vertical axis indicates the number of PCs in each subspace that have contribution [a from equation (33)] more than the threshold.

Fig. 5. Percentage of correctly classified samples as a function of threshold on contribution percentage.

Fig. 6. Effect of window length on the percentage of correctly classified speech.

Fig. 7. Effect of window length on the percentage of correctly classified nonspeech.

Fig. 8. Effects of white noise on percentage of correctly classified speech for auditory model, multifeature [1], and voicing-energy [2] methods.

Fig. 9. Effects of white noise on percentage of correctly classified nonspeech for auditory model, multifeature [1], and voicing-energy [2] methods.

Fig. 12. Effects of reverberation on percentage of correctly classified speech for auditory model, multifeature [1], and voicing-energy [2] methods.

Fig. 10. Effects of pink noise on percentage of correctly classified speech for auditory model, multifeature [1], and voicing-energy [2] methods.

Fig. 14. Effect of white noise on average spectro-temporal modulations of speech for SNRs −15, 0, and 15 dB. The spectro-temporal representation of noisy speech preserves the speech specific spectro-temporal features (e.g., near 4 Hz, 2 cycle/octave) even at SNR as low as 0 dB.

Fig. 15. Effects of pink noise on average spectro-temporal modulations of speech for different SNRs −15, 0, and 15 dB. The speech specific spectro-temporal features (e.g., near 4 Hz, 2 cycle/octave) are preserved even at SNR as low as 0 dB.

Fig. 16. Effects of reverberation on average spectro-temporal modulations of speech for time delays 200, 400, and 600 ms. Increasing the time delay results in gradual loss of high-rate temporal modulations of speech.

Multimodal music mood classification using audio and lyrics

by Perfecto Herrera

2008, Machine Learning and …

In this paper we present a study on music mood classification using audio and lyrics information. The mood of a song is expressed by means of musical features but a relevant part also seems to be conveyed by the lyrics. We evaluate each... more

Audio analysis for surveillance applications

by Regunathan Radhakrishnan

2005, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005.

We propose a time series analysis based approach for systematic choice of audio classes for detection of crimes in elevators in [1]. Since all the different sounds in a surveillance environment cannot be anticipated, a surveillance system... more

Classification of audio signals using SVM and RBFNN

by suganya palanivel

2009, Expert Systems with Applications

In the age of digital information, audio data has become an important part of many modern computer applications. Audio classification has become a focus of research in audio processing and pattern recognition. Automatic audio... more

For audio retrieval, a new metric has been proposed in Guo and Li (2003), called distance-from-boundary (DFB). When a query audio is given, the system first finds a boundary inside which the query pattern is located. Then, all the audio patterns in the database are sorted by their distances to this boundary. All boundaries are learned by the SVMs and stored together with the audio database. In Mubarak, Ambikairajah, and Epps (2005) a speech/music discrimination system was proposed based on mel-frequency cepstral coefficient (MFCC) and GMM classifier. This system can be used

The paper is organized as follows. The acoustic feature extraction is presented in Section 2, modeling techniques for audio classification are described in Section 3. Experimental results using SVM and RBF are reported in Section 4. Finally, conclusions and future work are given in Section 5.

The μ_i and σ² are calculated by using a suitable clustering algorithm. Here the k-means clustering algorithm is employed to determine the centers. The algorithm is composed of the following steps:

1. Randomly initialize the samples to k means (clusters)

Fig. 5. Performance of RBFNN for audio classification.

Fig. 6. Performance of RBFNN for different means.

Table 1. Audio classification performance using SVM and RBFNN.

RBFNN model. The RBF centers are located using the k-means algorithm. The weights are determined using the least squares algorithm. The values k = 1, 2, and 5 have been used in our studies for each category. The system gives optimal performance for k = 5. For training, the weight matrix is calculated using the least squares algorithm discussed in Section 3.2 for each of the features.
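The two training steps described above, k-means for the RBF centers and least squares for the output weights, can be sketched as follows. This is a hedged NumPy illustration, not the authors' implementation: the Gaussian basis, the single shared width `sigma`, running k-means separately per category, and all function names are assumptions.

```python
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    """Plain k-means: random initial centers, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

def rbf_design(x, centers, sigma):
    """Gaussian activations of every sample for every center."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbfnn(x, y, k, sigma):
    """k centers per category via k-means, output weights via least squares."""
    classes = np.unique(y)
    centers = np.vstack([kmeans(x[y == c], k) for c in classes])
    targets = (y[:, None] == classes[None, :]).astype(float)  # one-hot targets
    weights, *_ = np.linalg.lstsq(rbf_design(x, centers, sigma), targets, rcond=None)
    return {"classes": classes, "centers": centers, "weights": weights, "sigma": sigma}

def predict_rbfnn(x, model):
    scores = rbf_design(x, model["centers"], model["sigma"]) @ model["weights"]
    return model["classes"][scores.argmax(axis=1)]
```

With `k=5` per category the hidden layer has five units per class, matching the setting the excerpt reports as best; the choice of `sigma` is left to the user in this sketch.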

Temporal Integration for Audio Classification With Application to Musical Instrument Classification

by Gael Richard

2000, IEEE Transactions on Audio, Speech, and Language Processing

Nowadays, it appears essential to design automatic indexing tools which provide meaningful and efficient means to describe the musical audio content. There is in fact a growing interest for music information retrieval (MIR) applications... more

Musical genre classification using nonnegative matrix factorization-based features

by Andre Holzapfel

2008, Audio, Speech, and Language …

Gender identification using a general audio classifier

by Hadi Harb

2003

In the context of content-based multimedia indexing gender identification using speech signal is an important task. Existing techniques are dependent on the quality of the speech signal making them unsuitable for the video indexing... more

Music Genre Classification: A Multilinear Approach

by Emmanouil Benetos

2008

In this paper, music genre classification is addressed in a multilinear perspective. Inspired by a model of auditory cortical processing, multiscale spectro-temporal modulation features are extracted. Such spectro-temporal modulation... more

Figure 1. Total number of retained principal components in each subspace (e.g. scale, rate, and frequency) as a function of the portion of variance retained for the GTZAN dataset.

Figure 2. Total number of retained principal components in each subspace (e.g. scale, rate, and frequency) as a function of the portion of variance retained for the ISMIR2004Genre dataset.

tures [20]. Commonly used classifiers are Support Vector Machines (SVMs), Nearest-Neighbor (NN) classifiers, or classifiers which resort to Gaussian Mixture Models, Linear Discriminant Analysis (LDA), etc. Several common audio datasets have been used in experiments to make the reported classification accuracies comparable. Notable results on music genre classification are summarized in Table 1.

Table 1. Notable classification accuracies achieved by music genre classification approaches.

Score-Independent Audio Features for Description of Music Expression

by Giovanni De Poli

2000, IEEE Transactions on Audio, Speech, and Language Processing

During a music performance, the musician adds expressiveness to the musical message by changing timing, dynamics, and timbre of the musical events to communicate an expressive intention. Traditionally, the analysis of music expression is... more

SoundButton: Design of a Low Power Wearable Audio Classification System

by Gerhard Tröster

2003

The paper deals with the design of a sound recognition system focused on an ultra low power hardware implementation in a button like miniature form factor. We present the results of the first design phase focused on selection and... more

New speech/music discrimination approach based on fundamental frequency estimation

by Sebastian Galan

2009, Multimedia Tools and Applications

Automatic discrimination of speech and music is an important tool in many multimedia applications. The paper presents a robust and effective approach for speech/music discrimination, which relies on a set of features derived from... more

Fig. 2 F0 estimate for a representative speech signal of 1 s. a Normalized waveform; b Estimated F0. The thick line corresponds to the segments that are “classified” as voiced by applying the 0.2 threshold to the aperiodicity (Ap0); c Aperiodicity. The dashed line represents the boundary to “classify” the signal frames as voiced or unvoiced

Fig. 3 F0 estimate for a single instrument music signal of 1 s. a Normalized waveform; b Estimated F0. The thick line corresponds to the segments that are “classified” as voiced by applying the 0.2 threshold to the aperiodicity (Ap0); c Aperiodicity. The dashed line represents the boundary to “classify” the signal frames as voiced or unvoiced

Fig. 4 F0 estimate for a vocal (quartet) music signal of 1 s. a Normalized waveform; b Estimated F0. The thick line corresponds to the segments that are “classified” as voiced by applying the 0.2 threshold to the aperiodicity (Ap0); c Aperiodicity. The dashed line represents the boundary to “classify” the signal frames as voiced or unvoiced

has good discrimination capability due to its different behavior for speech and music.

As just stated, all histograms shown in Figures from 5 to 11 have been obtained using a continuous 2-h audio signal representative of the audio classes (speech and music) to be distinguished. The behavior of each feature is illustrated by two normalized histograms (one for each audio class to be distinguished), which evidence the discrimination capability of the proposed F0-derived features. The histograms in figures from 5 to 11 have been normalized so as to resemble probability density functions.
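Normalizing a histogram so that it resembles a probability density function means dividing each bin count by the number of samples times the bin width, so that the bars integrate to one. A minimal pure-Python sketch, with hypothetical argument names and equal-width bins assumed:

```python
def normalized_histogram(values, bins, lo, hi):
    """Return per-bin densities over [lo, hi] whose integral is 1.
    Values outside [lo, hi] are ignored."""
    width = (hi - lo) / bins
    counts = [0] * bins
    n = 0
    for v in values:
        if lo <= v <= hi:
            # The top edge hi is assigned to the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
            n += 1
    if n == 0:
        return [0.0] * bins
    return [c / (n * width) for c in counts]
```

Computing one such histogram per class for the same feature and the same bin edges gives two curves whose overlap indicates how well the feature separates the classes.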

The membership function of a fuzzy set is a generalization of the indicator function in classical sets. In fuzzy logic, it represents the degree of truth as an extension of valuation. Membership functions were introduced by Zadeh in the first paper on fuzzy sets [46]. The membership function which represents a fuzzy set A is usually denoted by μ_A. For an element x of set X, the value μ_A(x) is called the membership degree of x in the fuzzy set A. The membership degree μ_A(x) quantifies the grade of membership of element x to the fuzzy set A. The value 0 means that x is not a

(MUSIC, SPEECH) for the output variable (see Fig. 14).

When the obtained knowledge for the FRBS is not considered good enough to be used, some kind of learning is needed. In such sense, automatic definition of FRBSs can be considered in many cases as an optimization or search process. Genetic Algorithms are known to be capable of finding near optimal solutions in complex search spaces. In this work, the new rules added to the knowledge base of the FRBS have been obtained using Genetic Algorithms-based evolutionary computation (genetic learning algorithms), giving rise to a Genetic Fuzzy System (GFS). This means that the FRBS is evolved by a genetic learning process. A good review of GFS is found in [5]. The main genetic learning algorithms for FRBSs are known as Michigan [1], Pittsburgh [40] and Iterative Rule Learning [43].

The genetic learning algorithm used in this work to evolve the FRBS is the Pittsburgh algorithm. Next, the genetic learning process is described. In the Pittsburgh approach, each chromosome represents an entire base of rules and evolution is accomplished by means of genetic operators applied at the level of fuzzy rule sets. The fitness function evaluates the accuracy of the entire rule base encoded in the chromosome. The genetic learning process proposed in this work for SMD using the Pittsburgh approach is illustrated in Fig. 15.
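The defining traits of the Pittsburgh approach described above (one chromosome per complete rule base, operators acting on rule sets, fitness equal to the accuracy of the whole rule base) can be shown with a toy example. The sketch below is an illustration under strong assumptions and is not the paper's GFS: features are normalized to [0, 1], each has only LOW/HIGH linear membership functions, and the rule encoding, tournament selection, crossover, mutation rate and elitism are all hypothetical choices.

```python
import random

LOW, HIGH, ANY = 0, 1, 2  # linguistic labels; ANY means "don't care"

def membership(label, x):
    """Assumed linear membership functions on [0, 1]."""
    if label == LOW:
        return 1.0 - x
    if label == HIGH:
        return x
    return 1.0

def classify(rule_base, features):
    """Each rule is (label per feature..., output class); strongest rule wins."""
    best_strength, best_class = -1.0, 0
    for *labels, out in rule_base:
        strength = min(membership(l, x) for l, x in zip(labels, features))
        if strength > best_strength:
            best_strength, best_class = strength, out
    return best_class

def fitness(rule_base, data):
    """Accuracy of the entire rule base encoded in one chromosome."""
    return sum(classify(rule_base, f) == y for f, y in data) / len(data)

def random_rule(n_features, rng):
    labels = tuple(rng.choice((LOW, HIGH, ANY)) for _ in range(n_features))
    return labels + (rng.choice((0, 1)),)

def evolve(data, n_features=2, n_rules=3, pop_size=20, generations=15, seed=0):
    """Evolve whole rule bases; n_rules must be at least 2 for crossover."""
    rng = random.Random(seed)
    pop = [[random_rule(n_features, rng) for _ in range(n_rules)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda rb: fitness(rb, data), reverse=True)
        history.append(fitness(pop[0], data))
        children = [pop[0]]  # elitism: the best rule base always survives
        while len(children) < pop_size:
            a, b = (max(rng.sample(pop, 3), key=lambda rb: fitness(rb, data))
                    for _ in range(2))
            cut = rng.randrange(1, n_rules)
            child = a[:cut] + b[cut:]  # crossover at the level of rules
            if rng.random() < 0.3:
                child[rng.randrange(n_rules)] = random_rule(n_features, rng)
            children.append(child)
        pop = children
    best = max(pop, key=lambda rb: fitness(rb, data))
    history.append(fitness(best, data))
    return best, history
```

Because of elitism the best accuracy recorded in `history` never decreases from one generation to the next; a Michigan-style system would instead treat each single rule as an individual.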

F0-based features vs. timbral features for different classifiers

k-NN SPR classifier.

As can be seen in Table 4, the discrimination capability of the F0-derived features is related to the histograms shown in Figs. 5–11. Given a F0-derived feature, the more separated the feature histograms are, the higher the accuracy rate is. From Table 4, it also results that most of the meaningful information is provided by only three features: the average of the estimated F0 (F0_av), the dynamic range of the estimated F0 (D_F0) and the number of notes (N_note). When the three features are combined, the classification percentage goes up to about 90%, which is close to the value obtained when all F0-derived features are considered (93.30%).
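Two of the three features named above can be computed directly from a frame-wise F0 track. The sketch below is a pure-Python illustration under stated assumptions: frames are treated as voiced when their aperiodicity is below the 0.2 threshold mentioned in the figure captions, the dynamic range is taken as maximum minus minimum voiced F0, and the number-of-notes feature is omitted because its definition is not given in this excerpt.

```python
def f0_summary(f0_hz, aperiodicity, threshold=0.2):
    """Summarize an F0 track over voiced frames only.
    Assumption: a frame is voiced when aperiodicity < threshold."""
    voiced = [f for f, ap in zip(f0_hz, aperiodicity) if ap < threshold]
    if not voiced:
        return None  # no voiced frames, features undefined
    return {
        "mean_f0": sum(voiced) / len(voiced),
        # Assumed definition of dynamic range: max minus min voiced F0.
        "dynamic_range": max(voiced) - min(voiced),
        "voiced_fraction": len(voiced) / len(f0_hz),
    }
```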

Further, we are interested in knowing the improvement (if any) due to the F0-

Fig. 18 ROC curves for all considered features when a NN classifier is used

As can be seen in Table 5, a great increase in the classification accuracy rate is always produced when combining each timbral feature with the F0-derived ones. Consequently, the F0-derived features (pitch features) constitute a good complement to the timbral features for SMD. Note that the results obtained when combining each timbral feature with the F0-derived ones are always above the reference (93.30%). The best results are obtained by combining MFCC and the F0-derived features, with an average classification percentage of about 97%.

Now, we investigate the influence of the GFS on the classification accuracy rate for SMD. Table 6 shows the improvement in the accuracy rate (averaged results) due to the inclusion of the GFS within the proposed two-stage classification scheme regarding the case of using only the classifier (first stage). The robustness of the GFS against different widely used classifiers (k-NN, GMM, NN and SVM) has also been assessed in Table 6. The results in Table 6 have been obtained when using the proposed F0-based feature set, the Pittsburgh learning algorithm is applied to evolve the GFS, and the same audio database has been considered for testing the different classification schemes.

... in Table 7, have been obtained in two cases: with and without use of the GFS in the classification scheme. In both cases, the k-NN SPR classifier has been considered

As can be seen in Table 7, the proposed F0-derived features are especially well-suited to discriminate between speech and rap music, because they rely on the pitch rather than on the timbral texture. From Table 7, it results that the F0-derived features outperform MFCC for speech/rap music discrimination, the differences being close to 4% and 3.5% for the first (k-NN) and second (k-NN+GFS) cases, respectively, which involves that the GFS does not give rise to further difference between the F0-derived features and MFCC. Note that the difference between the F0-derived features and MFCC is about 2% for the general case of SMD. Therefore, the F0-derived features can be an interesting alternative to MFCC, especially for the particular case of speech/rap music discrimination.

To corroborate the results in Table 7, the ROC curves shown in Fig. 20 have been

J. E. Muñoz was born in Estepona (Malaga), Spain, in 1970. He received the MSc degree in Telecommunication Engineering from the University of Malaga in 1995. Since 2003, he is assistant professor in Telematics at the Telecommunication Engineering Department of the University of Jaen. His area of research interest is Speech and Audio Analysis. He is involved in research projects of the Spanish Ministry of Science and Education.

P. Vera-Candeas was born in Madrid, Spain, in 1976. He received the M.S. degree in Telecommunication Engineering from the University of Malaga (UMA) in 2000 and a Ph.D. degree from the University of Alcala in 2006. Since 2000, he has worked at the Telecommunication Engineering Department of the University of Jaén. Nowadays, he is an Associate Professor in Signal Processing and Communications Area. His areas of research interest are Signal Processing and its Applications to Audio Analysis and Ultrasonic NDT. He has been involved in research projects of the Spanish Ministry of Science and Education (MEC) and private companies.

F. J. Cañadas was born in Linares (Jaén), Spain, in 1977. He received the M.S. degree in Telecommunication Engineering from the University of Malaga (UMA) in 2004. During 2004-2006, he worked as engineer in an Europe Research Project (INTUITION Network Excellence). Nowadays, he is an assistant professor at the Telecommunication Engineering Department of the University of Jaén. His areas of research interests include automatic music transcription, multi-pitch estimation and sound source separation in polyphonic music signals. His PhD Thesis is focused on multi-pitch estimation techniques and single-channel source separation.

S. Garcia-Galan was born in Lahiguera (Jaen), Spain, in 1969. He received the MSc and PhD degrees in Telecommunication Engineering from the University of Malaga (UMA) and the Technical University of Madrid (UPM), in 1995 and 2004, respectively. Since 1995, he is with the Telecommunication Engineering Department of the University of Jaen. His areas of research are engineering applications and artificial intelligence.

Table 1 Information concerning the signals in Figs. 2, 3 and 4

Fig. 16 ROC curves for all considered features when a k-NN classifier is used

The results in Table 6 show the good behavior of the proposed two-stage classification scheme for SMD, which implies that GFS can be an interesting component to be used in pattern recognition or classification tasks. As expected, the GFS always leads to a better performance of the speech/music discriminator. From results in Table 6, we can say that an average reduction of about 2.35% in the total error rate has been achieved. The GFS leads to similar improvement in the accuracy rate for all considered classifiers, which evidences the robustness of the GFS against the type of classifier. The highest classification accuracy percentages correspond to the NN+GFS scheme. An average accuracy percentage above 97% is achieved by this classification scheme.

Table 7 Speech/rap music discrimination: comparison between F0 features and MFCC

Comparing MFCC and MPEG-7 audio features for feature extraction, maximum likelihood HMM and entropic prior HMM for sports audio classification

by Regunathan Radhakrishnan

2000, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

We present a comparison of 6 methods for classification of sports audio. For the feature extraction we have two choices: MPEG-7 audio features and Mel-scale Frequency Cepstrum Coefficients (MFCC). For the classification we also have two... more

A learning approach to hierarchical feature selection and aggregation for audio classification

by Paul Ruvolo

2010, Pattern Recognition Letters

Audio classification typically involves feeding a fixed set of low-level features to a machine learning method, then performing feature aggregation before or after learning. Instead, we jointly learn a selection and hierarchical temporal... more

Robust speech music discrimination using spectrum's first order statistics and neural networks

by Hadi Harb

2003

Most speech/music discrimination techniques proposed in the literature need a great amount of training data in order to provide acceptable results. Besides, they are usually context-dependent. In this paper, we propose a novel... more

Figure 1. A plot of 1000 s of speech (+) and 1000 s of music (x).

Figure 1 is shown to illustrate the behaviour of speech and music samples in the proposed feature space based on the modelling scheme presented previously. Each point in the plot (+ speech, x music) corresponds to 1 s of audio where one mean vector and one variance vector of FFT spectrum are calculated. The abscissa of each point is the magnitude of its corresponding variance vector, and the ordinate is the magnitude of the corresponding mean vector. One can notice that the decision boundary between these two classes is quite simple in this simplified feature space, demonstrating that the proposed modelling scheme can be effective.
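The plotted coordinates described above can be reproduced in outline: split 1 s of audio into short frames, take the FFT magnitude spectrum of each frame, then compute the per-bin mean and variance across frames and the magnitude of each vector. The following NumPy sketch makes several assumptions not stated in the excerpt, namely the frame length, hop size, Hann window and the use of the Euclidean norm as the "magnitude".

```python
import numpy as np

def mean_variance_point(one_second, frame_len=256, hop=128):
    """Return (abscissa, ordinate) = (|variance vector|, |mean vector|)
    of the frame-wise FFT magnitude spectra of a 1 s signal."""
    x = np.asarray(one_second, dtype=float)
    frames = np.asarray([
        x[i:i + frame_len]
        for i in range(0, len(x) - frame_len + 1, hop)
    ])
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    mean_vec = spectra.mean(axis=0)  # one mean per frequency bin
    var_vec = spectra.var(axis=0)    # one variance per frequency bin
    return float(np.linalg.norm(var_vec)), float(np.linalg.norm(mean_vec))
```

A signal whose spectrum changes over the second produces a larger abscissa than a stationary one with similar average energy.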

In this experiment the effect of the amount of training data was studied. Set 1 was used for extracting training data as well as for testing, and the system with channel-based normalization and MFSC features was evaluated. As one can expect, the classification error rate on the test data (training data is excluded in the evaluation) can be decreased by increasing the amount of training data. The plot of the error rate as a function of the amount of training data is shown in Figure 3.

Figure 3. A plot of the evolution of the error rate when increasing the amount of training data

Figure 2 the architecture of the classifier

Table 1 the composition of the evaluative database

Table 2. Classification accuracy for the two normalization techniques.

Functional Link Expansions for Nonlinear Modeling of Audio and Speech Signals

by Simone Scardapane and

Nonlinear distortions pose a serious problem for the quality preservation of audio and speech signals. To address this problem, such signals are processed by nonlinear models. Functional link adaptive filter (FLAF) is a... more

Feature Selection in Automatic Music Genre Classification

by Celso Kaestner

2008, 2008 Tenth IEEE International Symposium on Multimedia

This paper presents the results of the application of a feature selection procedure to an automatic music genre classification system. The classification system is based on the use of multiple feature vectors and an ensemble approach,... more

Discrimination of speech from nonspeech in broadcast news based on modulation frequency features

by maria markaki

2011, Speech Communication

In audio content analysis, the discrimination of speech and non-speech is the first processing step before speaker segmentation and recognition, or speech transcription. Speech/non-speech segmentation algorithms usually consist of a frame... more

Figure 1: Contribution of the first 25 singular vectors (SVs), j = 1, ..., 25, to the acoustic and modulation frequency subspaces, respectively.

Figure 2: Relevance of the original and compressed modulation spectral features: (a) Mutual information (MI) between the acoustic and modulation frequencies (65 x 125 dimensions) and the speech/non-speech class variable. (b) MI between the first 25 singular vectors in each subspace and the speech/non-speech class variable.

Figure 3: SVM classifier equal error rate (EER) as a function of features selected in terms of relevance or contribution.

Figure 4: (a) Rank-(13, 12) approximation (eq. 8) of |X_l(k,i)| for 500 ms of a speech signal. (b) 21-feature approximation for the same speech signal. Energy at modulations corresponding to pitch (~ 120 Hz) and syllabic and phonetic rates (< 40 Hz) remains prominent.

Figure 5: (a) Rank-(13, 12) approximation of |X_l(k,i)| for 500 ms of a music signal. (b) 21-feature approximation for the same music signal.

Figure 6: (a) Rank-(13, 12) approximation of |X_l(k,i)| for 500 ms of a noise signal (claps and crowd noise outdoors). (b) 21-feature approximation for the same signal.

Figure 7: Detection Error Trade-off (DET) curves for frame- and segment-based SVM classification using cepstral features, and median smoothing of the frame-level scores; a small subset of training and testing sets has been used.

Figure 8: DET curves for segment-based SVM classification using cepstral features (MFCC+Δ+ΔΔ), the 21 most relevant features (MaxRel), and the concatenated feature vector (Fusion) for the same training and testing sets.

Table 1: DCF, Pmiss and Pfalse on the test set. Figure 4.3 presents the DET curves and Table 1 the respective EER, together with the optimal values of DCF, Pmiss and Pfalse, for the systems tested using SVM and the same training data set. MaxRel denotes the system based on the 21 most relevant features. The last column refers to the fusion of cepstral with MaxRel features; the concatenated (78+21=99)-feature vector further reduced DCF down to 4.35%. For comparison, we also report the best EER and DCF when using the first (R1, R2) projections, which were 5.19% and 5.12% respectively for the [13 x 12] PCs. The MaxRel system is better at the low miss probability regions of the DET curve; cepstral features, on the other hand, yield better classification performance at the low false alarm regions. Fusion of the two feature sets then follows the best of the performances across the whole DET curve.
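For readers unfamiliar with the equal error rate (EER) reported here, the following sketch shows one common way to approximate it from classifier scores: sweep a decision threshold and take the operating point where the miss and false-alarm probabilities are closest. The function name and the threshold sweep over observed scores are assumptions made for illustration; this is not the evaluation code used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted as a target (e.g. speech) when score >= threshold.
    Returns the average of P_miss and P_false at the threshold where the
    two error probabilities are closest.
    """
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_false = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_false)
        if gap < best_gap:
            best_gap, eer = gap, (p_miss + p_false) / 2
    return eer
```

Plotting P_miss against P_false over the same sweep gives the points of a DET curve; the EER is where that curve crosses the diagonal.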

Instructional Video Content Analysis Using Audio Information

by C. Dorai

2000, IEEE Transactions on Audio, Speech and Language Processing

Automatic media content analysis and understanding for efficient topic searching and browsing are current challenges in the management of e-learning content repositories. This paper presents our current work on analyzing and... more

Fig. 1. Overview of the proposed system framework. Ying Li, Member, IEEE, and Chitra Dorai, Senior Member, IEEE. ... as Cornell Lecture Browser [1], eClass [2], and BMRC lecture browser [3] all belong to this category. In contrast, work in the second group targets automatic understanding, indexing, and annotation of learning media so as to facilitate topic-related queries and searching. Some research efforts along this direction can be found in [4] and [5].

Fig. 2. (a) Proposed audio classification framework. (b) Proposed discussion scene detection framework. (c.1) waveforms of the linkage phrase “um” and a regular speech signal. (c.2) Spectrograms. (c.3) ZCR curves. (c.4) ASC curves of both signals.

Fig. 3. (a) K-L distance sequence obtained log-histogram. (b) Four-state transition machine for discussion scene extraction. (c) Flowchart of the proposed discussion scene classification framework. ... easily seen, a threshold-based scheme is able to separate the linkage phrases from regular speech. Fig. 2(b) shows the proposed discussion detection framework, which consists of four major modules: audio content preprocessing, instructor modeling and model adaptation, speaker change detection, and a four-state transition machine. Each module is detailed as follows.

Fig. 4. (a) Comparison of SWDS-Precision between the adaptive and global GMM approaches for five test videos. (b) Comparisons of average SWDS-Precision and Recall for the three instructors. (c) Temporal distributions of detected discussion scenes and laughter segments in five test videos. Discussion patterns in two typical discussion scenes: (d) a 2-speaker discussion, and (e) a multispeaker discussion.

CLASSIFICATION RESULTS OF SEVEN SOUND TYPES WHERE EACH NUMBER IS IN UNITS OF SECOND

CLASSIFICATION ACCURACY COMPARISONS BETWEEN THREE DIFFERENT CLASSIFICATION SCHEMES

DISCUSSION SCENE CLASSIFICATION RESULTS. Moreover, when a person talks too infrequently, we tend to miss the cluster formed by his or her speech. On the other hand, experiments show that the initial value assignment of the two thresholds (Tmode and T) does not really affect system performance. Nevertheless, the scene classification accuracy could be improved if we manually adjust the threshold. How to properly determine its value, or to derive a better set of cluster validation criteria, is thus a part of our future work.

Speech/Music Classification Using Occurrence Pattern of ZCR and STE

by Rudrasis Chakraborty

2009

With the rapid growth in audio data volume, research in the area of content-based audio retrieval has gained impetus in the last decade. Audio classification serves as the fundamental step towards it. Accuracy in classifying data relies... more

AUDIO CONCEPT CLASSIFICATION WITH HIERARCHICAL DEEP NEURAL NETWORKS

by Mirco Ravanelli and

Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter , or a revving engine). Conventional Gaussian-Mixture-Models have had some success in classifying a... more

Segmentation and classification of broadcast news audio

by Thomas Hain

1998

Automatic classification of speech and music using neural networks

by Wasfi Al-Khatib

2004

The importance of automatic discrimination between speech signals and music signals has evolved as a research topic over recent years. The need to classify audio into categories such as speech or music is an important aspect of many... more

Broadcast news audio classification using SVM binary trees

by Jozef Juhár

2012, … and Signal Processing ( …

Audio classification is one of the most important tasks in content-based analysis and can be implemented in many audio applications, such as indexing and retrieving. This paper addresses the problem of broadcast news audio classification,... more

Insights into Audio-Based Multimedia Event Classification with Neural Networks

by Mirco Ravanelli and

Multimedia Event Detection (MED) aims to identify events—also called scenes—in videos, such as a flash mob or a wedding ceremony. Audio content information complements cues such as visual content and text. In this paper, we explore the... more

Video mining using combinations of unsupervised and supervised learning techniques

by Kadir Aşkın Peker

2004

We discuss the meaning and significance of the video mining problem, and present our work on some aspects of video mining. A simple definition of video mining is unsupervised discovery of patterns in audio-visual content. Such purely... more

Detecting Semantic Concepts from Video Using Temporal Gradients and Audio Classification

by Mika Rautiainen

2003

In this paper we describe new methods to detect semantic concepts from digital video based on audible and visual content. Temporal Gradient Correlogram captures temporal correlations of gradient edge directions from sampled shot frames.... more

ICBR - Multimedia Management System for Intelligent Content Based Retrieval

by Janko Calic

2004

This paper presents a system designed for the management of multimedia databases that embarks upon the problem of efficient media processing and representation for automatic semantic classification and modelling. Its objectives are... more

Automatic genre classification of North Indian devotional music

by Preeti Rao

2011

The automatic classification of musical genre from audio signals has been a topic of active research in recent years. Although the identification of genre is a subjective task that likely involves high-level musical attributes such as... more

Development of a Reference Platform for Generic Audio Classification

by Michal Kuba

2008, 2008 Ninth International Workshop on Image Analysis for Multimedia Interactive Services

Specific sounds such as applause, laugh, music, environmental noise, etc. are very helpful to understand high level semantic of the multimedia content. The detection of such key sounds is one of the challenges in intelligent management of... more

Acoustic source localization of everyday sounds using wireless sensor networks

by Yukang Guo

2010, Proceedings of the 12th ACM international conference adjunct papers on Ubiquitous computing - Ubicomp '10

Acoustic events are a rich source of information for context-awareness and support various application areas, such as audio surveillance [1], sound sensing [2], intelligent auditory interfaces [3] and speech localization. Acoustic... more

Mixed Type Audio Classification using Sinusoidal Parameters

by karim faez

2008, 2008 3rd International Conference on Information and Communication Technologies: From Theory to Applications

A preprocessing stage is inevitable in every audio application, including music/speech separation, speech or speaker recognition and audio transcription, to determine which class each frame belongs to, namely: speech only, music only... more

Bowed String Sequence Estimation of a Violin Based on Adaptive Audio Signal Classification and Context-Dependent Error Correction

by Tetsuya Ogata

2009, 2009 11th IEEE International Symposium on Multimedia

Audio Features for Noisy Sound Segmentation

by Myriam Desainte-Catherine

Automatic audio classification usually considers sounds as music, speech, silence or noise, but work on the noise class is rare. Audio features are generally specific to speech or music signals. In this paper, we present a new audio... more

Speech/Music Classification Using Empirical Mode Decomposition

by Arijit Ghosal

2011, 2011 Second International Conference on Emerging Applications of Information Technology

Audio classification serves as the fundamental step towards applications like content-based audio retrieval. In this work, we have tried to exploit the inherent difference in the composition of speech and music signals. A music signal has... more

GA-Based Feature Extraction for Clapping Sound Detection

by Michal Kuba

2006, 2006 8th Seminar on Neural Network Applications in Electrical Engineering

Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. In this paper, we introduce a framework for automatic feature subspace selection from a common feature vector. The selected... more

Multitimbral Musical Instrument Classification

by Alexandra Uitdenbogerd

2008, International Symposium on Computer Science and its Applications

The automatic identification of musical instrument timbres occurring in a recording of music has many applications, including music search by timbre, music recommender systems and transcribers. A major difficulty is that most music is... more

Applying Supervised Classifiers Based on Nonnegative Matrix Factorization to Musical Instrument Classification

by Emmanouil Benetos

2006

In this paper, a new approach for automatic audio classification using non-negative matrix factorization (NMF) is presented. Training is performed on each audio class individually, whilst during the test phase each test recording is... more

Analytical features for the classification of percussive sounds: the case of the Pandeiro

by Francois Pachet

2007

There is an increasing need for automatically classifying sounds for MIR and interactive music applications. In the context of supervised classification, we describe an approach that improves the performance of the general bag-of-frame... more

Figure 1. The gestures to produce the six basic Pandeiro sounds. The need for automatically analyzing Pandeiro sounds is twofold. First, MIR applications, for education notably, require the ability to automatically transcribe Pandeiro solos.

In the second phase, an attack is reported if, at a certain frame, the loudness level is greater than the loudness threshold and the norm of the differential curve exceeds the differential threshold. This frame is considered the "attack frame". Figure 2. The attack detector: on the left, the full sound and attack portion; on the right, a zoom of the pre-attack and post-attack portions of the signal.
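The two-threshold rule described above can be sketched roughly as follows. The excerpt does not define how loudness or the differential curve are computed, so this sketch assumes per-frame RMS for loudness and the absolute frame-to-frame loudness change for the differential; both choices, and all names, are hypothetical.

```python
import math


def frame_loudness(samples, frame_len):
    """Per-frame RMS level over non-overlapping frames (assumed loudness measure)."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_attack_frame(samples, frame_len, loudness_thr, diff_thr):
    """Return the index of the first frame satisfying both conditions, or None.

    An attack is reported when the frame loudness exceeds loudness_thr AND the
    absolute change from the previous frame exceeds diff_thr.
    """
    loud = frame_loudness(samples, frame_len)
    for i in range(1, len(loud)):
        if loud[i] > loudness_thr and abs(loud[i] - loud[i - 1]) > diff_thr:
            return i
    return None
```

Requiring both conditions keeps sustained loud passages (high loudness, small change) and small fluctuations in quiet passages (large relative change, low loudness) from being reported as attacks.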

Figure 6. Analytical vs. reference features on attacks

A precise description of operators can be found in [46].

7.2. Annex 2: Reference features

Figure 3. Results on full sounds. IGR stands for Information Gain Ratio. EDS FS denotes our feature selection algorithm based on the F-measure. Train/Test denotes the experiment in which the classifier is trained on the training database and tested on the test database. 10-fold XV denotes the 10-fold cross validation experiment on the test database.

Figure 4. Results obtained on pre-attacks. See above for abbreviations. The "Signal" line gives the performance of classifiers using the input signal directly as a feature.

Figure 5. Results obtained on post-attacks. See above for abbreviations.

Singing voice detection using modulation frequency features

by Andre Holzapfel

2008, Workshop on Statistical and …

Alignment kernels for audio classification with application to music instrument recognition

by Gael Richard

2008

In this paper we study the efficiency of support vector machines (SVM) with alignment kernels in audio classification. The classification task chosen is music instrument recognition. The alignment kernels have the advantage of handling... more

Mixture of experts for audio classification: an application to male female classification and musical genre recognition

by Hadi Harb

2004

In this paper we report the experimental results obtained when applying a mixture of experts to the problem of audio classification for multimedia applications. The mixture of experts is based on neural networks as individual experts and... more

Browsing and Retrieval of Full Broadcast-Quality Video

by Reha Civanlar

1999

In this paper we describe a system we have developed for automatic broadcast-quality video indexing that successfully combines results from the fields of speaker verification, acoustic analysis, very large vocabulary speech recognition,... more

Improving the classification of percussive sounds with analytical features: a case study

by Francois Pachet

2007

There is an increasing need for automatically classifying sounds for MIR and interactive music applications. In the context of supervised classification, we conducted experiments with so-called analytical features, an approach that... more

Building local maps in surgical robotics

by Dominik Henrich

2005, 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems

Currently, robotic systems employ almost exclusively global sensor information for navigation purposes. While a global map facilitates planning, it may have insufficient quality. Especially with autonomous robots, additional information... more

Figure 4: Feature vector histogram for s1*0 (robot moving, miller

Figure 5: Feature vector histogram for the state with robot moving, miller rotating, contact with bone or dura

Figure 3: Feature vector histogram for s0*0 (robot in standstill, miller stopped/rotating)

Figure 1: The RONAF system. From left to right: robot, pneumatic overload protection device, force-torque sensor, surgical milling tool, and skull. ... simultaneous localization and map-building (SLAM) problem for mobile robots, the case is different for stationary robots. We aim at augmenting the world model for these systems by implementing map-building and navigation based on local sensors, with the case of the medical robot system RONAF [3], [7] as an example for orthopedic applications (Figure 1). In Section I, we give an overview of the state of the art in local sensing in stationary robotics. Section III provides a classification of useful navigation principles to be found in robotics. The local sensors used in the presented system are introduced in Section IV. Section V describes the local map data structure and related considerations. The final goal of navigation, the modification of motion paths, is briefly illustrated in Section VI. Finally, Section VII gives an outlook on our current work in this area.

Figure 2: General navigation system architecture including the four navigation principles (A through D), with examples from surgery in brackets. Almost all surgical procedures are subdivided into two main phases, one for the preparation (preoperative phase, e.g. implantation planning) and one for the execution of the surgical intervention (intraoperative phase). The mentioned navigation principles can be described as four sensory feedback cycles over both phases (Figure 2).

The map-building function dy(sx(t), ti, p) performs ... Based on the above-mentioned considerations, the local map structure in the RONAF system is implemented according to the general principle in Figure 7. As the local sensor information shall enhance the available global map data, which originate from rastered imaging sources, we decided on a gridded 3D representation M of the workspace. The resolution (size of the single voxels mp at position p) is chosen high enough to allow discarding the sample location information without significant loss of precision; in our case we chose an isometric resolution of Δx = Δy = Δz = 0.35 mm. (This equals both the relative accuracy of the robot and, in turn, the resolution of the global map used for motion planning before ...

Figure 7: Left: voxel grid of spatial map M with tool tip path p(t) (long black arrows); right: temporal sequence of acquired samples sx(t) from the classifiers. Sample locations along path and time are shown as small circles. Voxels mp ∈ M can point to the beginning of their associated sublists in the sample lists, depending on the existence of such samples. Each sample list is in chronological order. Its composing sublists point to the respective voxels in turn. (Each arrow pair, only two of which are shown, between the spatial and the temporal side represents such a bi-directional association.)
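One way to picture the voxel grid and its two-way link to the chronological sample list is the sketch below. Only the 0.35 mm isometric resolution comes from the text; the class, its methods, the grid origin, and the use of plain string labels as samples are illustrative assumptions rather than the RONAF implementation.

```python
from collections import defaultdict

VOXEL_SIZE_MM = 0.35  # isometric resolution quoted in the text


def voxel_index(position_mm, origin_mm=(0.0, 0.0, 0.0)):
    """Map a 3D position in millimetres to integer voxel coordinates."""
    return tuple(int((p - o) // VOXEL_SIZE_MM)
                 for p, o in zip(position_mm, origin_mm))


class LocalMap:
    """Gridded map with a bi-directional voxel/sample association."""

    def __init__(self):
        self.samples = []                  # chronological: (time, voxel, label)
        self.by_voxel = defaultdict(list)  # voxel -> indices into self.samples

    def add_sample(self, time_s, position_mm, label):
        """Append a classifier sample; its exact location is discarded."""
        voxel = voxel_index(position_mm)
        self.by_voxel[voxel].append(len(self.samples))
        self.samples.append((time_s, voxel, label))

    def samples_at(self, position_mm):
        """All samples recorded in the voxel containing position_mm."""
        indices = self.by_voxel.get(voxel_index(position_mm), [])
        return [self.samples[i] for i in indices]
```

Each stored sample names its voxel and each voxel lists its samples, which mirrors the arrow pairs in the figure; only the voxel index is kept, in line with the remark that the exact sample location can be discarded at this resolution.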

Figure 8: Perspective view on the motion path for concentric milling; originally planned path.

Figure 9: Perspective view on the motion path for concentric milling; with excluded subregion at lower center, after complete replanning.

TABLE 2: RESULTS OF NUMERICAL AUDIO CLASSIFICATION. Testing classification rates against the test set yielded the results shown in Table 2. Except for samples from s1*0, all other states are mostly classified correctly, as a mix-up occurs between s0*0 and s1*0. Judging only from audio data, we cannot reliably distinguish between a standstill and motion of the robot (noise obscures the robot hum). However, robot motion R can be deduced from other sources, as described before. Milling in air (C = 0) and in bone (C = B) can be distinguished clearly, while resonance (C = R) is only mixed up with bone milling, which is a closely related system

An experiment in audio classification from compressed data

by Sean Marlow

2004

In this paper we present an algorithm for automatic classification of sound into speech, instrumental sound/ music and silence. The method is based on thresholding of features derived from the modulation envelope of the frequency limited... more

Content-based recognition of musical instruments

by L. Caponetti

2004, Proceedings of the Fourth IEEE International Symposium on Signal Processing and Information Technology, 2004.

A method for content-based audio classification is presented. In particular we focus on identification of musical instrument sounds based on timbre classification, using a biologically plausible feature extraction technique called... more

NNET BASED AUDIO CONTENT CLASSIFICATION AND INDEXING SYSTEM

by SDIWC Organization

Rapid advancement in computers and internet technology has led to a large volume of multimedia files. The archiving and digitization of old media content also contribute to the growth of the digital library. The usefulness of these... more

Classification of audio sources using neural network applicable in security or military industry

by Milan Navratil

2010

In this paper, classification of audio sources is presented to supplement current work on an existing system for localization of audio sources. The question of achieving the audio classification lies in the convenient discrimination of the... more

Figure 1. Structure of the feed-forward multilayer perceptron neural network. The neurons of the input layer are passive and do not modify the data. They receive a single value on their input, and copy the value to their multiple outputs. In comparison, the neurons of the hidden and output layers are active. This means they modify the data as shown in Figure 2.

Values from the input layer are sent to all of the hidden neurons. The values entering a hidden neuron are multiplied by weights, which are predetermined numbers stored in the program. The weighted inputs are then added to produce a single number. Moreover, an additional neuron is added to the input layer, with its input always having a value of one. When this is multiplied by the weights of the hidden layer, it provides a bias to each transfer function.
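The weighted sum with a constant bias input described above can be sketched as a single layer evaluation. The excerpt does not name the transfer function, so a logistic sigmoid is assumed here purely for illustration, and the function name and weight layout are hypothetical.

```python
import math


def layer_forward(inputs, weights):
    """Evaluate one layer of active neurons.

    weights[j] holds the weights of neuron j: one per input value, followed
    by one weight for the extra input that is always 1 (the bias).
    A logistic sigmoid is assumed as the transfer function.
    """
    extended = list(inputs) + [1.0]  # append the constant bias input
    outputs = []
    for row in weights:
        total = sum(w * x for w, x in zip(row, extended))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs
```

Because the extra input is fixed at one, its weight simply shifts the weighted sum before the transfer function is applied, which is what gives each neuron its own bias.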

As shown in Figure 3, the discrete Fourier transform changes an input signal (N real values) into an output signal containing the amplitudes of the component sine and cosine waves (N/2+1 complex values). Figure 3. Transformation between time and frequency domain.
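The N-real-values-in, N/2+1-complex-values-out relationship can be checked with a short sketch. The naive O(N²) transform below is written for clarity and is not the implementation used in the paper; in practice an FFT routine such as `numpy.fft.rfft`, which returns the same layout, would normally be used.

```python
import cmath
import math


def real_dft(signal):
    """Naive DFT of a real signal, returning bins 0..N/2 (N/2+1 values).

    The real part of each bin relates to the cosine component and the
    imaginary part to the sine component at that frequency.
    """
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal))
        for k in range(n // 2 + 1)
    ]
```

For example, an 8-sample cosine that completes two cycles produces 5 output bins with all of its energy in bin 2.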

Microphone units, equipped with GPS and 2.4 GHz radio modules with parabolic antennas for transmitting the audio signal, are placed around the secured object, in the area between this object and the monitored area. The geometry of the microphone positions can be arbitrary, but they should not form a line. The structure of the whole security system depends on the size of the monitored area. For large objects it is better to use a decentralized structure with a larger number of microphone units because it places fewer requirements on system implementation; this is evident from Figure 4. The boundary of the monitored area is limited by the sensitivity of the microphone units used. A decentralized evaluating unit is located near the secured object. It consists of radio modules and parabolic antennas for receiving radio signals, including audio, GPS and supervisory data, which are processed by a digital signal processor (DSP). As a result of the DSP operation, the time delays from individual microphones, as well as the digitized audio signal from the two microphones nearest to the audio source (each one contains two stereo channels), are sent to the PC. Audio samples are acquired at a sampling frequency of 20 kHz and cut to a fixed length of 1.1 s, where the audio event begins at time 0.1 s.

The scheme of the connections used between individual segments of the current system is shown in Figure 5. Figure 5. Scheme of the localization system

Figure 6. Decentralized evaluating unit with parabolic antennas. Practical tests were carried out at a sport aerodrome, where eight microphone units were placed according to local conditions. The decentralized evaluating unit was located at a suitable place near the base. This unit contained eight parabolic antennas and can be seen in Figure 6. The audio data obtained were stored in memory as wav files; most of them were used to build the training patterns, and the rest were used for testing the neural network.

From frequency analysis of audio signals recorded in a real environment, four types of feature vectors were selected as possible candidates. All of them were derived from the power spectral density (PSD). A one-sided PSD contains the total power of the signal in the frequency interval from the DC component to the Nyquist frequency (half of the sampling frequency), which was 10 kHz in our case. As an illustration, we took one shot from a submachine gun in the form in which it was acquired from the DSP; see Figure 7. Figure 7. Audio signal (submachine gun shot) in the time and frequency domains. For analysis, the first 0.1 s was omitted and a signal segment of length 200 ms was taken. The frequency interval from 100 Hz up to 5 kHz was used because of the microphone frequency characteristic and to eliminate overtones.
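A one-sided PSD restricted to the 100 Hz to 5 kHz band can be sketched as below. The periodogram estimator, the naive DFT, and the function names are assumptions made for illustration (the paper used Matlab and does not state its estimator here); only the 20 kHz sampling rate and the band limits come from the text.

```python
import cmath
import math


def one_sided_psd(signal, fs):
    """Periodogram estimate of the one-sided PSD as (frequency_hz, power) pairs.

    Bins run from DC to the Nyquist frequency fs/2. All bins except DC and
    Nyquist are doubled so that the one-sided PSD carries the total power.
    """
    n = len(signal)
    psd = []
    for k in range(n // 2 + 1):
        xk = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                 for t, x in enumerate(signal))
        power = abs(xk) ** 2 / (fs * n)
        if 0 < k < n / 2:
            power *= 2
        psd.append((k * fs / n, power))
    return psd


def band_limit(psd, f_lo=100.0, f_hi=5000.0):
    """Keep only the bins inside the analysed frequency band."""
    return [(f, p) for f, p in psd if f_lo <= f <= f_hi]
```

With `fs = 20000` the last bin sits at 10 kHz, and multiplying the sum of the PSD values by the bin width `fs / n` gives back the mean square of the signal, which is the sense in which the one-sided PSD contains the total power.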

Only a few significant points (the most powerful components, including frequency information) of the moving-averaged PSD were taken, in order to make the feature vector as small as possible while keeping its information value. Figure 8. PSD vs. only the significant points of the moving-averaged PSD.
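The reduction to a few significant points might look like the sketch below. The excerpt does not specify the averaging window or the selection rule, so a centred moving average and a simple "keep the strongest bins" rule are assumed; the function names and default parameters are hypothetical.

```python
def moving_average(values, width):
    """Centred moving average; the window shrinks at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def significant_points(freqs, psd, width=3, count=5):
    """Return the `count` strongest smoothed PSD bins as (frequency, power).

    The result is ordered by frequency so that it can be flattened into a
    compact feature vector that keeps the frequency information.
    """
    smooth = moving_average(psd, width)
    strongest = sorted(range(len(smooth)),
                       key=lambda i: smooth[i], reverse=True)[:count]
    return [(freqs[i], smooth[i]) for i in sorted(strongest)]
```

Keeping the frequency alongside each power value is what lets such a short vector still describe where in the spectrum the energy lies.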

A software application for audio signal classification was created in Matlab with a graphical user interface (GUI). The ANN was implemented using the Neural Network Toolbox; the Levenberg-Marquardt back-propagation algorithm and the gradient descent with momentum weight/bias learning function were used. Mean squared error (MSE) was chosen as the performance function. The application consists of two main parts: the first is designed for training the artificial neural network, while the second is focused on classification of audio samples. Figure 9. Screenshot of the application in Matlab GUI

PARAMETERS OF NEURAL NETWORK AND ACHIEVED TEST RESULTS. The results correspond to the degree to which the information from the frequency analysis was reduced: the feature vector containing the most information performed best, and the more reduction is applied, the less able the neural network is to learn and classify correctly. The artificial neural network was trained repeatedly under various conditions, varying in particular the network structure and its parameters, the feature vector settings, and others. The best achieved results are summarized in the following table, which has six columns. HN is the number of neurons in the hidden layer. Feature vector represents the input into the network. FV length indicates the number of neurons in the input layer and depends on the feature vector settings. FV time is the duration of the feature vector computation for one sample. MSE is the performance function computed after 30 iterations. The most important parameter is the success classification rate (SCR), which is the ANN's rate of correct classification.

Audio content-based feature extraction algorithms using J-DSP for arts, media and engineering courses

by Mohit shah

2010

J-DSP is a Java-based object-oriented online programming environment developed at Arizona State University for education and research. This paper presents a collection of interactive Java modules for the purpose of introducing... more
