Audio Classification

288 papers

76 followers

About this topic

Audio classification is a subfield of machine learning and signal processing that involves the automatic categorization of audio signals into predefined classes based on their features. It utilizes algorithms to analyze audio data, enabling applications such as speech recognition, music genre classification, and environmental sound identification.

Key research themes

1. How can feature extraction and dimensionality reduction improve accuracy in music genre and audio type classification?

This theme investigates the development and application of advanced feature extraction methods combined with dimensionality reduction techniques to enhance audio classification accuracy, particularly in music genre classification and speech/music discrimination. The focus lies on capturing relevant audio characteristics through timbral, spectral, and rhythmic features and optimizing their representation in reduced dimension spaces that preserve class-distinguishing information, facilitating more effective classification algorithms.

MUSICAL GENRE CLASSIFICATION OF AUDIO SIGNALS USING GEOMETRIC METHODS

by Денис Тихогло

2018

Key finding: This study introduced a nonlinear dimensionality reduction technique, Diffusion Maps, applied on timbral texture features for music genre classification. It improved classification accuracy dramatically, achieving 97%... Read more

Construction and evaluation of a robust multifeature speech/music discriminator

by qweqwe qweqwe

2025, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing

Key finding: The research evaluated 13 distinct features related to temporal and spectral characteristics such as 4 Hz modulation energy, spectral rolloff, spectral centroid, spectral flux, and zero-crossing rate, and combined them using... Read more

Optimized Audio Classification and Segmentation Algorithm by Using Ensemble Methods

by MUHAMMAD RASHID

2021, Mathematical Problems in Engineering

Key finding: The paper proposed a hybrid classification strategy combining bagged Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs) using features like Mel-frequency cepstral coefficients (MFCCs) for four-class audio... Read more

Primary Investigation of Sound Recognition for a domotic application using Support Vector

by Dan Istrate

2024, Annals of the University of Craiova, Series: …

Key finding: The study applied Support Vector Machines (SVMs) as a robust classification technique on features derived from environmental sounds including speech, door claps, and alarms in a domotic setting. Its methodological rigor in... Read more

2. What roles do binaural and spatial features play in classifying complex acoustic scenes and spatial audio recordings?

This research theme focuses on the extraction and utilization of binaural spatial cues and spectro-temporal features for the classification of spatial audio scenes recorded with binaural setups. It addresses the classification of complex environments and sound distributions around a listener, which is essential for applications in virtual reality, audio indexing, and scene analysis. The studies explore feature selection, classifier performance, and challenges related to reverberation and source ambiguity in acoustically rich settings.

Automatic Spatial Audio Scene Classification in Binaural Recordings of Music

by Sławomir Zieliński and

2020, Applied Sciences

Key finding: The study demonstrated that binaural cues combined with Mel-frequency cepstral coefficients (MFCCs) enable classification of different spatial audio scenes with accuracies up to 98% on binaural room impulse response (BRIR)... Read more

Feature Extraction of Binaural Recordings for Acoustic Scene Classification

by Hyunkook Lee

2022, 2018 Federated Conference on Computer Science and Information Systems (FedCSIS)

Key finding: By extracting over a thousand spatial and spectro-temporal features from binaural signals, the study showed the superior influence of spectro-temporal features over spatial-only metrics for classification accuracy. Using... Read more

Musical Genre Classification Enhanced by Improved Source Separation Technique

by G. Tsihrintzis

2021

Key finding: Though primarily focused on source separation, this work indirectly relates by enhancing the extraction of instrument-specific audio features which can be spatialized through binaural and multi-channel processing. Using... Read more

3. How are deep learning and neuromorphic approaches advancing audio event classification and bioacoustic signal recognition?

This theme examines the shift towards deep learning architectures, particularly convolutional neural networks (CNNs), and emerging neuromorphic computing techniques including spiking neural networks (SNNs) in audio event detection, environmental sound classification, and bioacoustic signal analysis. The focus lies in leveraging biologically inspired models and data-driven feature representations for improved robustness, scalability, and real-time processing capabilities across diverse audio classification tasks.

Convolutional Neural Network based Audio Event Classification

by Minkyu Lim

2023, KSII Transactions on Internet and Information Systems

Key finding: The paper showed that treating Mel-scale filter bank features concatenated over frames as images input to CNNs led to an audio event classification accuracy of 81.5% across thirty classes including dog barks and sirens,... Read more

A Review of Automated Bioacoustics and General Acoustics Classification Research

by Leah Mutanu

2023, Sensors

Key finding: This survey highlighted the growing adoption of machine learning, especially ensemble methods and CNNs, in bioacoustic and general acoustic classification. It revealed that deep learning architectures have improved... Read more

Fundamental Survey on Neuromorphic Based Audio Classification

by Amlan Basu

2025, arXiv:2502.15056

Key finding: This comprehensive survey underscored the promise of neuromorphic computing platforms based on spiking neural networks for audio classification, detailing advantages such as energy efficiency, real-time event-based... Read more

Sound Classification Using Python

by Siuli Das

2022, ITM Web of Conferences

Key finding: The paper presents a practical implementation of environmental sound classification applying neural networks trained on MFCC feature sets, illustrating that convolutional and fully connected networks can effectively... Read more

All papers in Audio Classification

Overlapped Music segmentation using a new Effective Feature and Random Forests

by duraid Mohammed

2025, IAES International Journal of Artificial Intelligence (IJ-AI)

In the field of audio classification, audio signals may be broadly divided into three classes: speech, music and events. Most studies, however, neglect that real audio soundtracks can have any combination of these classes simultaneously.... more

Spectrogram Classification Using Dissimilarity Space

by Loris Nanni

2025, Applied Sciences

In this work, we combine a Siamese neural network and different clustering techniques to generate a dissimilarity space that is then used to train an SVM for automated animal audio classification. The animal audio datasets used are (i)... more

Real-Time monophonic and polyphonic audio classification from power spectra

by Raphaël GREFF

2025, Pattern Recognition

This work addresses the recurring challenge of real-time monophonic and polyphonic audio source classification. The whole normalized power spectrum (NPS) is directly involved in the proposed process, avoiding complex and hazardous... more

Spectrogram Classification Using Dissimilarity Space

by Sheryl Brahnam

2025, Applied Sciences

Onset detection in musical audio signals

by Malcolm Macleod

2025, Proceedings of the International Computer …

This paper presents work on changepoint detection in musical audio signals, focusing on the case where there are note changes with low associated energy variation. Several methods are described and results of the best are presented.

Real-time pitch modification system for speech and singing voice

by Maxim Vashkevich

2025

A real-time pitch modification system has been developed. The implemented processing scheme is based on hybrid deterministic/stochastic decomposition of the signal and includes extraction of instantaneous pitch, pitch-synchronous... more

Fundamental Survey on Neuromorphic Based Audio Classification

by Amlan Basu

2025, arXiv:2502.15056

Audio classification is paramount in a variety of applications including surveillance, healthcare monitoring, and environmental analysis. Traditional methods frequently depend on intricate signal processing algorithms and manually crafted... more

Intelligent Audio Signal Processing for Detecting Rainforest Species Using Deep Learning

by Abdulaziz Alhumam

2025, Intelligent Automation & Soft Computing

Hearing a species in a tropical rainforest is much easier than seeing it. Someone in the forest might not be able to look around and see every type of bird and frog that is there, but they can be heard. A forest ranger might...

Audio Type Identification Using EEMD: A Noise Assisted Data Analysis Method

by Sanjay Nalbalwar

2025, International Journal of Innovative Research in Science, Engineering and Technology

Audio classification is the process of assigning a particular class to an audio signal. Classifying audio signals has many applications, such as digital libraries and the automatic organization of databases. In the last several years...

Speech and Music Discrimination based on Signal Modulation Spectrum

by Pavel Balabko

2025

This work is devoted to the problem of automatic speech and music discrimination. As we will see here, speech and music signals have quite distinctive features. However, the efficient distinction between speech and music is still an open... more

Speech Signals Classification

by SUMA SWAMY

2025

Speech signals are the primary source of direct transmitter-to-receiver human communication and fall into the category of acoustic signals. These signals are mechanical waves represented as analog signals and propagate as...

Bowed String Sequence Estimation of a Violin Based on Adaptive Audio Signal Classification and Context-Dependent Error Correction

by Hiroshi G Okuno

2024, 2009 11th IEEE International Symposium on Multimedia

The sequence of strings played on a bowed string instrument is essential to understanding of the fingering. Thus, its estimation is required for machine understanding of violin playing. Audio-based identification is the only viable way to... more

Classification of audio signals using SVM and RBFNN

by RAMALINGAM venkatachalam

2024, Expert Systems with Applications

In the age of digital information, audio data has become an important part of many modern computer applications. Audio classification has become a focus of research in audio processing and pattern recognition. Automatic audio...

Content-Based Sound Retrieval for Web Application

by Chunru Wan

2024, Lecture Notes in Computer Science

It is both challenging and desirable to be able to retrieve sound files relevant to users' interests by searching the Internet. Unlike the traditional way of using keywords as input to search for web pages with relevant texts, query... more

Content-based audio classification and retrieval using the nearest feature line method

by Somayeh Esmaili

2024, IEEE Transactions on Speech and Audio Processing

Advancements in Bangla Speech Emotion Recognition: A Deep Learning Approach with Cross-Lingual Validation

by Khorshed Alam and

2024, 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring)

Speech Emotion Recognition (SER) is a method where computers learn to recognize human emotions from speech to improve communication. In this study, we present an innovative Bangla SER framework, incorporating data augmentations, feature... more

Using Convolutional

by Peter Pham

2024

In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied... more

Improving Music Genre Classification By Short-Time Feature Integration

by Anders Meng

2024, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

Many different short-time features, using time windows in the size of 10-30 ms, have been proposed for music segmentation, retrieval and genre classification. However, often the available time frame of the music to make the actual... more

Audio-Video based Classification using SVM and AANN

by RAMALINGAM venkatachalam

2024, International Journal of Computer Applications

This paper presents a method to classify audio-video data into one of five classes: advertisement, cartoon, news, movie and songs. Automatic audio-video classification is very useful to audio-video indexing, content based audio-video... more

Fig. 1. Combining audio and video classification.

S. Palanivel, Associate Professor, Dept. of Comp. Sci. and Engg., Annamalai University, Chidambaram - 608002

Professor, Dept. of Comp. Sci. and Engg., Annamalai University, Chidambaram - 608002

Acoustic features representing the audio information can be extracted from the speech signal at the segmental level. The segmental features are the features extracted from short (0 to 5 minutes) segments of the speech signal. These features represent the short-time spectrum of the speech signal. The short-time spectrum envelope of the speech signal is attributed primarily to the shape of the vocal tract. Mel-frequency cepstral coefficients (MFCCs) have been commonly used in speech processing. Fig. 2 illustrates the computation of MFCC features for a segment of the audio signal, which is described as follows:
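A common textbook MFCC pipeline for a single frame can be sketched as below: Hamming window, power spectrum, triangular mel filterbank, logarithm, then a DCT-II. This is not necessarily the authors' procedure. The mel formula, the defaults of 20 filters and 13 coefficients, and all function names are assumptions chosen for illustration, and the naive DFT is only practical for short frames.

```python
import cmath
import math


def hz_to_mel(f_hz):
    # Widely used mel formula (an assumption; the paper's variant is not given).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    # Hamming-window the frame, then take a naive O(N^2) DFT of bins 0..N/2.
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        s = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(s) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [f * n_fft / sample_rate for f in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_ceps=13):
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Log mel-band energies; the small floor avoids log(0).
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-12)) for row in bank]
    # DCT-II of the log energies gives the cepstral coefficients.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for c in range(n_ceps)]
```

Calling `mfcc(frame, 8000)` on a 256-sample frame returns 13 coefficients. In practice the frame would first be cut from a longer recording, and an FFT would replace the naive DFT.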

Fig. 4. Architecture of the SVM (Ns is the number of support vectors).

A support vector machine (SVM) has been used for classifying the obtained data (Burges, 1998). The SVM is a supervised learning method used for classification and regression. SVMs belong to a family of generalized linear classifiers. Let us denote a feature vector (termed a pattern) by $x = (x_1, x_2, \ldots, x_p)$ and its class label by $y$ such that $y \in \{+1, -1\}$. Consider, therefore, the problem of separating the set of $n$ training patterns belonging to two classes,
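For a concrete picture of the two-class setting with labels +1 and -1, here is a minimal sketch of a linear SVM trained by subgradient descent on the regularised hinge loss. It is a simplified stand-in, not the kernel SVM or the solver used in the paper. The function names, the learning rate `lr`, and the regularisation constant `lam` are assumptions.

```python
def train_linear_svm(patterns, labels, lam=0.01, lr=0.1, epochs=100):
    """Fit weights w and bias b so that sign(w.x + b) separates the two classes."""
    dim = len(patterns[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(patterns, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Regularisation step: shrink the weights slightly.
            w = [(1.0 - lr * lam) * wj for wj in w]
            # Hinge-loss step: only patterns inside the margin push the boundary.
            if margin < 1.0:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return the class label +1 or -1 for pattern x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0 else -1
```

With a handful of linearly separable patterns, `train_linear_svm` followed by `predict` reproduces the training labels. Non-separable audio features would normally call for a kernel and a proper quadratic-programming solver.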

Fig. 3. Principle of the support vector machine.

Fig. 5(a). A five-layer AANN model.

International Journal of Computer Applications (0975-8887), Volume 44, No. 6, April 2012

Let us consider the five-layer AANN model shown in Fig. 5(a), which has three hidden layers. The processing units in the first and third hidden layers are nonlinear, and the units in the second (compression) hidden layer can be linear or nonlinear.

Fig. 5(b). Two-dimensional output. Fig. 5(c). Probability surface.

Audio and video frames are combined based on a 4:1 ratio of frame shifts. The individual evidences of the audio frames and the video frames are combined according to this ratio. The weight for each modality is decided by the parameter w, which is chosen such that the system gives optimal performance for audio-video based classification. The performance of SVM for audio-video based classification is shown in Fig. 6. This could also be useful for the audio-video indexing and retrieval task.

In this work, the modalities are combined at the score level. Methods to combine the two levels of information present in the audio signal and the video signal have been proposed. The audio-based scores and video-based scores are combined to obtain audio-video based scores as given in equation (9). It is shown experimentally that the combined system outperforms the individual systems, indicating that the modalities are complementary. The weight for each modality is decided empirically.
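Score-level fusion of this kind can be sketched as a weighted sum of per-class scores, followed by picking the class with the highest fused score. Equation (9) is not reproduced on this page, so the convex combination below, with `w` as the audio weight and `1 - w` as the video weight, is an assumption. The function names and the example class labels are also hypothetical.

```python
def fuse_scores(audio_scores, video_scores, w):
    """Combine per-class confidence scores; w in [0, 1] is the audio weight."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    # Assumed fusion rule: convex combination of the two modality scores.
    return {c: w * audio_scores[c] + (1.0 - w) * video_scores[c]
            for c in audio_scores}


def classify(audio_scores, video_scores, w=0.5):
    """Return the class whose fused score is highest."""
    fused = fuse_scores(audio_scores, video_scores, w)
    return max(fused, key=fused.get)
```

For example, with audio scores `{"news": 0.6, "cartoon": 0.3}` and video scores `{"news": 0.2, "cartoon": 0.9}`, an equal weighting favours "cartoon", while a strongly audio-weighted `w` favours "news". Choosing `w` empirically would amount to evaluating this on held-out data for a grid of weights.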

Fig. 7. Performance of audio-video classification using AANN.

The category is decided based on the highest confidence score, which varies from 0 to 1. Audio and video frames are combined based on a 4:1 ratio of frame shifts. The weight for each modality is decided by the parameter w, which is chosen such that the system gives optimal performance for audio-video based classification. The performance of AANN for audio-video based classification is shown in Fig. 7. This could also be useful for the audio-video indexing and retrieval task.

Table 1: Combining audio-video classification results

Automatic Discrimination of Apraxia of Speech and Dysarthria Using a Minimalistic Set of Handcrafted Features

by Michaela Pernon

2024, Interspeech 2020

To assist clinicians in the differential diagnosis and treatment of motor speech disorders, it is imperative to establish objective tools which can reliably characterize different subtypes of disorders such as apraxia of speech (AoS) and... more

Audio Matching via Chroma-Based Statistical Features

by Meinard Müller

2024, International Symposium/Conference on Music Information Retrieval

In this paper, we describe an efficient method for audio matching which performs effectively for a wide range of classical music. The basic goal of audio matching can be described as follows: consider an audio database containing several... more

Audio Interval Retrieval Using Convolutional Neural Networks

by Ievgeniia Kuzminykh

2024, Lecture Notes in Computer Science

Modern streaming services are increasingly labeling videos based on their visual or audio content. This typically augments the use of technologies such as AI and ML by allowing natural speech to be used for searching by keywords and video...

Feature Selection in Automatic Music Genre Classification

by Celso Kaestner

2024

Speech/music discrimination for multimedia applications

by Grace Petrucci

2024, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100)

Automatic discrimination of speech and music is an important tool in many multimedia applications. Previous work has focused on using long-term features such as differential parameters, variances, and time-averages of spectral parameters.... more

Music instrument recognition: from isolated notes to solo phrases

by Thippur Sreenivas

2024, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing

Speech and Audio processing techniques are used along with statistical pattern recognition principles to solve the problem of music instrument recognition. Non temporal, frame level features only are used so that the proposed system is... more

Breath Activity Detection Algorithm

by Ramiro Jordan

2024, ArXiv

This report describes the use of a support vector machine with a novel kernel to determine the breathing rate and inhalation duration of a firefighter wearing a Self-Contained Breathing Apparatus. With this information, an incident...

Fig. 2, The inspiratory phase is distinctly present, and the clicking sounds of the regulator's valves as they reset for the expiratory phase are shown at the end of the sound bite as 2 sharp energy spikes.

The movement of air in from the tank makes a very distinctive sound (Fig. 2) that resembles a fricative sound. Fricative sounds are the part of speech used for forming consonants such as "f". The detection of these sounds, along with other aspects of taking a breath (Fig. 3), has been the subject of several papers [1], [2], [12]. Their intended use is in signal processing for the music and entertainment industry, where these sounds are removed or enhanced for clarity or artistic reasons. The approaches discussed identify the inspiratory or inhalation phase, since the expiratory phase is generally used to generate the formants of speech.

Fig. 1. External and internal views of a commonly used SCBA mask showing the voicemitter port.

The breathing cycle is divided into four different phases: [13]

The mask (Fig. 1) is a rigid structure with a clear plastic face plate and a flexible rubber seal that contacts the forehead, temples, cheeks, and chin of the wearer. The SCBA system noises include low air alarms and air regulator noises as well. [13]

Ruinskiy and Lavner [1] propose automating breath detection by processing the short-time (windowed) cepstrum using an image recognition technique. The generalized approach is shown in Fig. 4. The cepstrum is computed for the sound being filtered using a mel frequency scale. The mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency in an attempt to match how people hear sounds.

Fig. 4, Data flow for automatic breath detection following Ruinskiy and Lavner.

Fig. 5, Part of a voice waveform demonstrating a breath sound located between

The ARINA algorithm is initialized using a set of autocorrelation coefficients derived offline from a sample of the inhalation noise. These coefficients define a 10th-order Linear Predictive Coefficients (LPC) all-pole filter. The filter is then inverted into a Finite Impulse Response (FIR) filter and used to filter the speech signal of the fireman. Kushner et al. also proposed using a moving average to update the autocorrelation coefficients and recomputing the LPC continuously to adapt to different wearers and different masks.
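The general idea of deriving an all-pole model from autocorrelation coefficients and inverting it into an FIR filter can be sketched as follows. This is the standard Levinson-Durbin recursion, shown as an illustration rather than the ARINA implementation itself. The function names are assumptions, and the moving-average update of the autocorrelation is not included.

```python
def autocorrelation(x, max_lag):
    """Biased-sum autocorrelation r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return predictor polynomial a = [1, a1, ..., ap] from autocorrelation r.

    Assumes r[0] > 0 (a non-silent sample of the noise).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # remaining prediction error
    return a


def inverse_filter(x, a):
    """Apply the FIR filter A(z) = 1 + a1 z^-1 + ... + ap z^-p to x.

    A(z) is the inverse of the all-pole model 1/A(z); the output is the
    prediction residual, which is small wherever x matches the model.
    """
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]
```

A 10th-order model as described above would use `levinson_durbin(autocorrelation(noise_sample, 10), 10)`, where `noise_sample` is a hypothetical list of samples of the inhalation noise.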

Fig. 6, The low air alarm is a cyclic sound made with a clacker at approximately 28 Hz.

Regardless of the approach used, it is necessary to identify a set of examples, or exemplars, to be used for training. The process of building exemplars involves isolating multiple instances of the air rushing into the mask in the recorded sound track. These sounds are usually corrupted by the low air alarm sound, which is a clacking noise at approximately 28 Hz (Fig. 6). The regulator inspiration sound with the low air alarm superimposed was used to form the exemplars. Using a sound mixing tool, the exemplars of the sounds are isolated and recorded into wave files. Each exemplar recording is sectioned into 15 ms frames with an overlap of 5 ms. The frames are then windowed with a half-Hamming window to smooth edges and de-emphasize high-frequency components. The windowed frames are pre-emphasized using a first-order difference filter.
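The framing steps described above (15 ms frames with 5 ms overlap, a half-Hamming window, then a first-order difference filter) can be sketched as below. Several details are assumptions: the half-Hamming window is taken to be the falling half of a symmetric Hamming window, the pre-emphasis coefficient `alpha = 0.97` is a conventional value not given in the text, and the function names are illustrative.

```python
import math


def split_frames(signal, sample_rate, frame_ms=15.0, overlap_ms=5.0):
    """Cut the signal into overlapping frames; incomplete tail frames are dropped."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * (frame_ms - overlap_ms) / 1000.0))
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def half_hamming(n):
    """Falling half of a Hamming window: 1.0 at the first sample, 0.08 at the last.

    Requires n > 1.
    """
    return [0.54 + 0.46 * math.cos(math.pi * i / (n - 1)) for i in range(n)]


def pre_emphasize(frame, alpha=0.97):
    """First-order difference filter y[n] = x[n] - alpha * x[n-1]."""
    return [frame[0]] + [frame[i] - alpha * frame[i - 1]
                         for i in range(1, len(frame))]


def prepare_frames(signal, sample_rate):
    """Window each frame, then pre-emphasize it, in the order given in the text."""
    out = []
    for frame in split_frames(signal, sample_rate):
        window = half_hamming(len(frame))
        out.append(pre_emphasize([x * w for x, w in zip(frame, window)]))
    return out
```

At a sample rate of 8000 Hz this gives 120-sample frames with a hop of 80 samples.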

Fig. 7, General data flow for breath detection algorithms reviewed

Fig. 8, The first filter starts at the first point, reaches its peak at the second point, then returns to zero at the 3rd point. The second filter starts at the 2nd point, reaches its maximum at the 3rd, then is zero at the 4th, etc. [4]

Fig. 9, The resulting mean and variance templates for the exemplars.

of the filter and { @, } is a set of autoregression coefficients

Fig. 10, Processing of the recording using a sliding window.

A lifter is used on each column with a half-Hamming window to emphasize the lower cepstral coefficients. This liftering improves the separation between breath sounds and other sounds.

Another method uses a threshold technique to measure the duration of a peak (Fig. 11) and allows for an estimate of the breath duration. These measures are then categorized based on the peaks (greatest similarity between the template and the windowed signal). The peak finding process locates the local maxima (peaks) of the input signal vector, data. A local peak is a data sample that is either larger than its two neighboring samples or is equal to Inf (infinity). Non-Infinite signal endpoints are excluded.
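The peak rule stated above can be written out directly. The sketch below is a literal reading of that rule (strictly larger than both neighbours, or equal to +Inf, with finite endpoints excluded), not a reimplementation of any particular library routine. The optional `min_height` argument and the function name are assumptions added for thresholding.

```python
import math


def find_peaks(data, min_height=None):
    """Return the indices of local maxima in data."""
    peaks = []
    last = len(data) - 1
    for i, v in enumerate(data):
        if math.isinf(v) and v > 0:
            peaks.append(i)        # +Inf always counts, even at an endpoint
        elif 0 < i < last and v > data[i - 1] and v > data[i + 1]:
            peaks.append(i)        # strictly larger than both neighbours
    if min_height is not None:
        # Keep only peaks that reach the detection threshold.
        peaks = [i for i in peaks if data[i] >= min_height]
    return peaks
```

Applied to a normalized index, a call such as `find_peaks(index, min_height=0.25)` keeps only the peaks at or above the threshold. Flat-topped maxima are not reported under this strict reading.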

Fig. 11, Measuring peak duration. The duration of the inhalation noise is fairly long compared to unvoiced speech. So we can use a duration threshold test to eliminate any false detection due to speech. Thus the threshold must be met for N consecutive frames before detection is validated.
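The duration test can be sketched as a scan for runs of frames at or above the threshold, keeping only runs of at least N frames. The function name, the use of `>=` for the comparison, and the half-open `(start, end)` return format are assumptions.

```python
def detect_events(index, threshold, min_frames):
    """Return (start, end) frame ranges where index stays >= threshold.

    Only runs lasting at least min_frames consecutive frames are kept;
    end is exclusive.
    """
    events = []
    start = None
    for i, v in enumerate(index):
        if v >= threshold:
            if start is None:
                start = i                      # a new run begins
        else:
            if start is not None and i - start >= min_frames:
                events.append((start, i))      # run was long enough
            start = None
    # Handle a run that lasts until the final frame.
    if start is not None and len(index) - start >= min_frames:
        events.append((start, len(index)))
    return events
```

Short bursts, such as unvoiced speech that briefly exceeds the threshold, are discarded because they do not last `min_frames` frames.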

Fig. 12 Binary classification of data in a two-dimensional plane.

Fig. 13, The top plot shows the audio track as separated from the training video. The red lines represent the detected regulator inspiration events. Not all events were detected, as this is a function of the threshold chosen and the exemplars used to form the statistics for the breath index calculations. The bottom figure represents the normalized breath index (B/max(B)). The red line represents the threshold used (0.25) and the red crosses the peaks detected by the MATLAB peak detection function.

Fig. 14 Radial error magnitude plot for identifying the duration of a breath using the LPC algorithm and the secondary constraints. The spikes with magnitude of 1 represent events not detected or anticipated by the filter. The shorter spikes represent differences between the reference data and the filter responses.

Fig. 15 Radial error magnitude plot for identifying the duration of a breath using the LPC algorithm and SVM. The spikes with magnitude of 1 represent events not detected or anticipated by the filter. The shorter spikes represent differences between the reference data and the filter responses.

Fig. 16 Radial error magnitude plot for identifying the start of a breathing event using the Breathing Index and Support Vector Machine. The spikes with magnitude of 1 represent events not detected or anticipated by the filter. The shorter spikes represent differences between the reference data and the filter responses.

Fig. 17 Radial error magnitude plot for identifying the start of a breathing event using the LPC algorithm and the secondary constraints. The spikes with magnitude of 1 represent events not detected or anticipated by the filter. The shorter spikes represent differences between the reference data and the filter responses.

Fig. 18 Radial error magnitude plot for identifying the start of a breathing event using the LPC algorithm and a Support Vector Machine. The values at magnitude 1 represent events not detected or anticipated. The other, much smaller spikes represent smaller errors than when using the LPC and secondary constraints.

Fig. 19, Histogram of the breathing durations. Note the outliers. These represent false detections where the time between detected values is extremely short. The values around zero probably represent the intervals where the instructor is talking but is not wearing the mask.

The medical references measure the number of breaths in a minute as the standard. An estimate of the breathing rate of an adult at rest is approximately 15 to 20 breaths per minute. A histogram of the data converted to breaths per minute (Fig. 20) shows the start of a separation of the main data into two modes. The modes represent both the instructor's breathing at the start of the video and also after he has performed some physical activity at the end. The large-valued outliers represent false detections where the time between detected values is extremely short. The values around zero probably represent the intervals where the instructor is talking but is not wearing the mask.

Fig. 20, Histogram of the breathing rates. Note the outliers. These represent false detections where the time between detected values is extremely short. The values around zero probably represent the intervals where the instructor is talking but is not wearing the mask.

MEASURED BREATHING RATES

Outils de navigation dans les fichiers audio

by Christian Wellekens

2024, Citeseer

Next-generation audio services require audio file editing tools that make it possible to handle such a file as simply as a text file. File indexing is a solution that enables access...

Script Identification Using Gabor Feature and SVM Classifier

by Dr. Shailesh Chaudhari

2024, Procedia Computer Science

Script identification is a challenging task in bilingual or multilingual optical character recognition systems. A remarkable amount of research on script identification has been reported in Indian and non-Indian contexts. As many commercial and...

Robust parameter estimation for audio declipping in noise

by Richard Stern

2024, Interspeech 2015

Contemporary audio declipping algorithms often ignore the possibility of the presence of additive channel noise. If and when noise is present, however, the efficacy of any declipping algorithm is critically dependent on the accuracy with... more

Learning Representations for Nonspeech Audio Events Through Their Similarities to Speech Patterns

by Huy Phan

2023, IEEE/ACM Transactions on Audio, Speech, and Language Processing

Multimodal Music Mood Classification Using Audio and Lyrics

by Jens Grivolla

2023

In this paper we present a study on music mood classification using audio and lyrics information. The mood of a song is expressed by means of musical features but a relevant part also seems to be conveyed by the lyrics. We evaluate each... more

Evaluation of supervised learning algorithms in binary and multi-class network anomalies detection

by Abdou Romaric TAPSOBA

2023

Due to age-bound onset of symptoms used for diagnosis of mild to moderate intellectual disability, early diagnosis of these problems has long been a difficult issue. The diagnosis includes tests pertaining to intellectual functioning and... more

Auditory context recognition using SVMs

by Sajjad Ahmadi

2023

We study auditory context recognition for context-aware mobile computing systems. Auditory contexts are recordings of a mixture of sounds, or ambient audio, from mobile users' everyday environments. For training a classifier, a set of...

Application of Vector Quantization for Audio Retrieval

by kamal shah

2023, International Journal of Computer Applications

Due to progress in data storage capabilities and the proliferating use of the Internet, information retrieval systems have attracted considerable interest. Much of this data is in different forms from various sources. So, it... more


Adaptive Combination of Second Order Volterra Filters with NLMS and Sign-NLMS Algorithms for Nonlinear Acoustic Echo Cancellation

by Mircea-florin Vaida

2023

In this paper, starting from a robust statistics (RS) adaptive approach presented in a previous work entitled the combined NLMS-Sign (CNLMS-S) adaptive filter, an automatic combination technique with similar performances is proposed.... more


An algorithm for enhancement of audio content classification

by Arti Bang

2023, Bulletin of Electrical Engineering and Informatics

Presently, the fast proliferation of information imposes novel challenges on content management. Further, computerized audio classification along with content description is considered a valuable method to manage audio contents. In general,... more


Applying neural network on the content-based audio classification

by Xi Shao

2023, Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint

Many audio and multimedia applications would benefit if they could interpret the content of audio rather than relying on descriptions or keywords. These applications include multimedia databases and file systems, digital libraries,... more


Human perception and computer extraction of musical beat strength

by George Tzanetakis

2023, Proceedings of the 2002 Digital Audio …

Musical signals exhibit periodic temporal structure that creates the sensation of rhythm. In order to model, analyze, and retrieve musical signals it is important to automatically extract rhythmic information. To somewhat simplify the... more


Fault Diagnosis Based on Enhancement of Barkhausen Noise Using Hybrid Method Emprical Mode Decomposition-Savitzky-Golay Filter

by Mohammed Khorchef

2023

Barkhausen noise carries important information which can be used in early damage detection and fault diagnosis. Barkhausen noise is corrupted by interference signals from other sources during measurement and the information of... more


Deep Learning for Bioacoustic Recognition on Microcontrollers

by Md Mohaimenuzzaman

2023, PhD Thesis @ Faculty of Information Technology, Monash University

Acoustic monitoring is crucial for the conservation and management of ecosystems and their flora and fauna. Traditional animal monitoring is based on passive acoustic recordings that are manually assessed by human experts, which adds a significant workload. To alleviate this burden, Machine Learning (ML)-based solutions for automating this process have been in the spotlight.

This research explores end-to-end Deep Learning (DL) techniques for species recognition using bioacoustic data from the Internet of Things (IoT) devices installed in remote wild areas. The reason is twofold. First, DL-based techniques extract hierarchical features automatically through multiple layers of abstraction. Therefore, it eliminates the very expensive and time-consuming process of handcrafting the features required in Non-Deep Learning (Non-DL)-based techniques. Second, we try to leverage the resounding success of DL in producing state-of-the-art (SOTA) performance for a wide range of applications to acoustic analysis.

Due to the unavailability of power sources and the internet in remote areas, the extremely resource-constrained IoT devices (battery powered, low memory, and low processing power) cannot send the data to the cloud server for recognition. As a result, the recognition activities must be performed locally on IoT devices. However, DL-based techniques are resource and time-intensive. To harness the power of DL in IoT, the DL models must be significantly resource-optimized.

This study focuses on significantly downsizing the DL model architecture without sacrificing accuracy. It develops a compression-friendly Deep Convolutional Neural Network (Deep-CNN) architecture for end-to-end raw audio classification, a structured compression technique, and a generic pipeline that compresses this Deep-CNN architecture to produce a tiny DL model suitable for deployment on a resource-impoverished Microcontroller Unit (MCU). Beyond theorizing, we demonstrate such a deployment practically.

One of the major challenges in biological applications is the availability of labeled real-world data for the training phase. We demonstrate how Active Learning (AL) can be used with our edge models to address this issue. Our approach goes beyond standard AL by incorporating feature extraction into the AL loop, and we show that this results in superior performance.

To demonstrate the relevance of our work for real-world applications, we go beyond standard benchmarks and test the applicability of our methods with real-world data from conservation biology projects.

FIGURE 2.10: XNOR and POPCOUNT in XNOR-Net. Like structured pruning-based compression works, almost all the works based on XNOR-Net are proposed for computer vision. A few recent SOTA XNOR-Net models for the benchmark image datasets are: [109] for MNIST, [110] for CIFAR-10 and CIFAR-100 and [111] for ImageNet.

FIGURE 2.11: The detailed architecture of commonly used active learning system

While the reasons to move the recognition directly onto small edge devices are particularly compelling in the case of ecological monitoring, similar considerations apply to many other applications of intelligent sound recognition, from industrial safety to consumer devices, for a variety of reasons, including cost and convenience. Specifically, cost is a factor that commonly limits the available compute power and network bandwidth in edge applications.

FIGURE 1.2: Breakdown of the research problem and the scope of this research project. Since the MCUs of this research project do not have the luxury of sending the sensed data to the central server when needed, if a device fails to classify the sensed data (e.g., data from a new species on which they have not been trained), it must wait for the intermittent internet connection to send the data to the cloud server for further processing. Once the unlabeled data from the devices is received by the cloud server, human experts manually label the data. The Machine Learning (ML) model used for recognition in the embedded device is then retrained in the cloud server and sent to the device, replacing the old model with the retrained one. The problem is depicted in a structured manner in Figure 1.2, as are various paths toward potential solutions and the scope of this research project.

FIGURE 1.3: 1-D representation of audio as wave

FIGURE 1.5: Time Series Classification (TSC) Approaches

2.1.2 Deep Learning (DL)-based Approaches

Due to its significant achievements in the fields of computer vision, speech recognition and natural language processing, DL, also known as DNN, is one of the most active study areas of ML. DNN models have been particularly successful because they can learn hierarchical features on their own, are resilient to the curse of dimensionality, and can be trained in parallel using graphical processing units [44]. In the case of TSC, DNN models not only produce results close to COTE and HIVE-COTE [44], but also significantly surpass NN-DTW (see Section 2.1.1.1). In Figure 2.1, we see a general DNN architecture, where x_i represents uniformly spaced input timestamps.

2.1.2.2 Recurrent Neural Network (RNN)

Another popular DNN architecture is RNN (see Figure 2.2), which is mainly used for TS forecasting. It is very rarely used for TSC due to the fact that it is designed to predict output for each timestamp of a TS [82] and the basic version of RNN suffers from the vanishing gradient problem while training long TS. Furthermore, it is difficult to distribute the training, resulting in a very slow process [83]. To avoid vanishing gradient, different variations of RNN such as long short term memory (LSTM), gated recurrent unit (GRU) and Echo State Network (ESN) [84] are introduced. An RNN is comprised of randomly initialized hidden layers [44] where the output weights are learned using algorithms like logistic regression. The hidden state can be calculated using equation 2.5 where I(t) is the internal hidden state, W_in is the weight matrix for input TS, and X(t) is the vector for input TS. The output is calculated according to equation 2.6.

The TSC researchers have been inspired to adopt CNN for TSC from its resounding success in computer vision, speech recognition, and natural language processing. A CNN consists of an input layer, convolution layers, pooling layers, fully connected layers and an output layer at the end [80]. Figure 2.3 represents a typical CNN architecture where x_i denotes evenly spaced input time stamp. A convolution is the point-wise multiplication of data with the filters (i.e., a matrix that moves over the data, also known as kernels or feature detectors) associated with a specific layer to generate feature maps (also known as activation). See Section 9.1 of Deep Learning [80] for a more in-depth understanding of convolution. The advantage of using multiple filters is twofold: it allows you to learn multiple discriminative features across the TS and it requires far fewer model parameters than FCNN. Equation 2.7 represents the general form of convolution on an evenly spaced input time stamp X_t of a TS of length T with filter w of length l [44].
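To make the sliding-filter idea concrete, the following is a minimal sketch of a one-dimensional convolution over a time series. It is not the thesis's Equation 2.7 itself: the function name `conv1d`, the absence of padding, the ReLU non-linearity, and the example filter are all assumptions chosen for illustration.

```python
def conv1d(series, kernel, bias=0.0):
    """Slide `kernel` over `series` (no padding) and apply ReLU.

    Hypothetical helper: each output value is the sum of point-wise
    products between the filter and one window of the time series.
    """
    length = len(kernel)
    feature_map = []
    for t in range(len(series) - length + 1):
        window = series[t:t + length]
        z = sum(w * x for w, x in zip(kernel, window)) + bias
        feature_map.append(max(0.0, z))  # ReLU non-linearity (assumed)
    return feature_map


series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
rising_edge_filter = [-1.0, 0.0, 1.0]  # responds where the signal increases
print(conv1d(series, rising_edge_filter))  # [2.0, 2.0, 0.0, 0.0, 0.0]
```

The same three weights are reused at every position, which is why a convolution layer needs far fewer parameters than a fully connected layer over the same input.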

FIGURE 2.4: Transformer Neural Network (slightly modified from [86]). ... to make the most of the GPU. Figure 2.4 shows the overall structure of TNN.

FIGURE 2.5: Weight sharing by training quantization (taken from [91])

FIGURE 2.7: Example dense matrix and sparse matrix. In an unstructured pruning (i.e., weight pruning) process the unimportant weight values are removed from the weight matrices of different layers in the DNN model by setting them to zeros [91, 100, 101]. Only a few important weights that influence the model’s decision making are left in the weight matrices. These matrices with many zeros are referred to as sparse matrices (Figure 2.7 right matrix). There is a large amount of research work on unstructured pruning. The theoretical compression of the base DNN model provided by this pruning technique is astounding. However, it generates sparse matrix models, which require special representation and hardware to perform the sparse computation. A sparse matrix can be represented in two ways: triplet/array representation and linked representation. Figure 2.8 depicts a triplet/array representation. According to the figure, each non-zero value usually requires 3x memory for implementation.
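The trade-off described above can be sketched in a few lines. This is an illustrative example only: the magnitude threshold, the helper names `prune_by_magnitude` and `to_triplets`, and the 3x3 weight matrix are assumptions, not the pruning criterion or data used in the thesis.

```python
def prune_by_magnitude(matrix, threshold):
    """Unstructured pruning sketch: zero every weight below `threshold` in magnitude."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in matrix]


def to_triplets(matrix):
    """Triplet/array representation: one (row, column, value) entry per non-zero weight."""
    return [(r, c, w)
            for r, row in enumerate(matrix)
            for c, w in enumerate(row)
            if w != 0.0]


dense = [[0.9, 0.01, -0.02],
         [0.03, -0.7, 0.0],
         [0.02, 0.04, 0.5]]

sparse = prune_by_magnitude(dense, threshold=0.1)
triplets = to_triplets(sparse)

dense_cells = sum(len(row) for row in dense)   # 9 stored values
triplet_cells = 3 * len(triplets)              # 3 stored values per non-zero weight
print(triplets)                    # [(0, 0, 0.9), (1, 1, -0.7), (2, 2, 0.5)]
print(dense_cells, triplet_cells)  # 9 9
```

In this toy case two thirds of the weights are pruned, yet the triplet form stores as many numbers as the dense matrix. The representation only saves memory once fewer than a third of the entries are non-zero, which illustrates why the theoretical compression does not translate directly into savings on a device.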

FIGURE 2.8: Triplet/Array Representation of sparse matrix (taken from [102]). Sparse matrix models are not ideal for direct implementation on embedded devices. According to the literature, all such compressed models were deployed on mobile devices or in devices with at least equivalent resources. Oktay et al. [100], for example, deployed their model in the Samsung Galaxy S7 (32GB flash drive and 4GB RAM). Edge-L³, the only work we found on sound classification at the edge by Kumari et al. [101] at the time of this study, requires 12MB of run-time memory, which prevents it from being deployed in the device that this research project is aimed at. Furthermore, the MCUs lack the dedicated hardware and software support for sparse matrix computations required to capitalize on these theoretical savings [99, 103].

FIGURE 3.1: Visual representation of ACDNet model architecture. Unlike other suggested SOTA models, which use pre-extracted features and multi-channel input (such as hand-crafted features and spectrograms), the ACDNet architecture concentrates solely on feature extraction through convolution layers. All the convolution layers are followed by a batch normalization layer and a ReLU activation layer. The maxpool layers have strides equal to their pool size to avoid overlapping. The network is fed with raw audio time series (i.e., single-channel input). ACDNet consists of two feature extraction blocks followed by an output block. The feature extraction blocks are Spectral Feature Extraction Block (SFEB) and Temporal Feature Extraction Block (TFEB). Figure 3.1 depicts the layer structure of different blocks of the ACDNet architecture.
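The non-overlapping pooling rule mentioned above (stride equal to pool size) can be illustrated with a small sketch. The function `max_pool1d` and the input values are hypothetical, and the sketch assumes that any incomplete trailing window is dropped, which the source does not specify.

```python
def max_pool1d(values, pool_size):
    """Max pooling whose stride equals the pool size, so windows never overlap."""
    stride = pool_size
    last_start = len(values) - pool_size
    return [max(values[i:i + pool_size]) for i in range(0, last_start + 1, stride)]


activations = [1, 3, 2, 5, 4, 0, 7]
print(max_pool1d(activations, pool_size=2))  # [3, 5, 4]
```

Each input value contributes to at most one output, so the output length shrinks by roughly the pool size at every pooling layer.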

FIGURE 3.2: (a) Comparison of 80% pruned models from Table 3.8. (b) Comparison of 85% pruned models from Table 3.9. For visualization, the variables are linearly transformed to range [50 - max(variable)].

FIGURE 4.1: Comparison of prediction accuracy between the baseline models (i.e., ACDNet and AclNet) and the micro versions including their quantized versions (QMicro) on ESC-10,...,50 datasets.

FIGURE 4.2: Comparison of prediction accuracy between Baseline, Mini including their XNOR-Net versions and Micro including its quantized version (QMicro) on ESC-10,...,50 datasets. The Baseline network is ACDNet or AclNet. XMini and QMicro versions have approximately the same memory requirements. ... slope is much steeper than that for the other compression methods.

FIGURE 4.3: Comparison of prediction accuracy between Baseline, Mini including their XNOR-Net versions and Micro including its quantized version (QMicro) on the UrbanSound8k dataset. The Baseline network is ACDNet or AclNet. XMini and QMicro versions have approximately the same memory requirements.

FIGURE 4.4: Comparison of prediction accuracy between the Baseline (ACDNet or AclNet), Mini including their XNOR-Net versions and Micro including its quantized version (QMicro) on AudioEvent datasets and its subsets (10, 20 and 28 classes). XMini and QMicro versions have approximately the same memory requirements. To facilitate a direct comparison, we extend our analysis to this dataset, which has also been used in several other studies [157, 169]. These results are consistent with what we have seen above for the most widely used standard benchmarks. Table 4.7 and Figure 4.4 show the performance of our base nets on the AudioEvent dataset and its smaller subsets.

FIGURE 4.5: Comparison of prediction accuracy between BNN-GAP8, GAP8-ACDNet and Nano-ACDNet on AudioEvent dataset for 28 classes. BNN-GAP8 runs on spectrograms; in contrast, GAP8-ACDNet and Nano-ACDNet run on raw audio. GAP8-ACDNet has approximately the same resource requirements as BNN-GAP8. QNano-ACDNet has memory requirements equivalent to the XNOR versions of BNN-GAP8 and XGAP8-ACDNet. How can BNN-GAP8 achieve such good performance on a 28-class problem? The likely reason is that BNN-GAP8 benefits from using spectrograms as input to the network. This means that in the BNN-GAP8 implementation much of the necessary full precision computation required, for which a binary net is not suitable, is encapsulated outside of the network in the conversion of raw audio to spectrograms. The other XNOR networks in this comparison are deprived of this possibility because they perform end-to-end classification for raw audio input and thus must perform all required computations within the network.

FIGURE 4.6: Prediction accuracy of RESNET-18 and its XNOR-Net version on CIFAR-10,20,...,100 datasets

FIGURE 5.1: The detailed architecture of proposed DeepFeatAL. The proposed Deep Active Feature Learning (DeepFeatAL) system refines feature extraction and classification from raw audio. It aims to improve feature extraction quality and recognition accuracy while requiring minimal data annotation effort. Figure 5.1 depicts the detailed construction of the DeepFeatAL.

FIGURE 5.2: ACDNet in different training settings for incremental learning. The first row indicates the performance of the model before incremental learning. ... result, for the remainder of this chapter, we will refer to this task as fine-tuning.

FIGURE 5.3: DeepIcL vs SDAL using ACDNet on ESC-50 dataset. Figure 5.3, which is generated from Table 5.2, shows that AL-RidgeC achieves the highest prediction accuracy achieved by DeepIcL (i.e., 65.43%) in only 4 iterations on the ESC-50 dataset, saving ~43% of the labeling budget. At the end of seven human labeling iterations, AL-RidgeC achieved the highest accuracy of 67.94%. The difference in prediction accuracy ensures that AL aids classifiers in learning particularly difficult samples, allowing them to produce better classification performance. We will compare the performance of AL-RidgeC to that of our proposed DeepFeatAL in Section 5.4.2.2.
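The labeling-budget figures quoted in this chapter appear to follow from a simple ratio: the share of human-labeling iterations that were not needed to match the reference accuracy. The sketch below reproduces several of the quoted numbers under that assumption; the function name `labeling_budget_saved` is hypothetical and the formula is inferred from the reported values rather than stated in the source.

```python
def labeling_budget_saved(iterations_needed, total_iterations):
    """Percentage of labeling iterations saved (assumed definition)."""
    return 100.0 * (total_iterations - iterations_needed) / total_iterations


# ESC-50: reference accuracy matched in 4 of 7 iterations (quoted as ~43%).
print(round(labeling_budget_saved(4, 7)))       # 43
# US8K: matched in 2 of 15 iterations (quoted as nearly 86.67%).
print(round(labeling_budget_saved(2, 15), 2))   # 86.67
# iWingBeat: matched in 10 of 20 iterations (quoted as 50%).
print(labeling_budget_saved(10, 20))            # 50.0
```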

FIGURE 5.4: DeepIcL vs SDAL on US8K dataset. ... significantly better than DeepIcL. Furthermore, after only 2 iterations, AL-LogisticReg achieves the highest prediction accuracy achieved by DeepIcL (i.e., 88.19%), saving nearly 86.67% of the labeling budget. As US8K has accumulated more data, we have completed fifteen iterations of human labeling for incremental learning for ACDNet and the SDAL process. Since AL-LogisticReg outperforms AL-KNC most of the time, we will compare its performance to our proposed DeepFeatAL performance in Section 5.4.2.2.

FIGURE 5.5: DeepIcL vs SDAL on iWingBeat dataset. ... than the accuracy of SDAL models (65.87%, 67.17% and 66.28%). Notably, the performance of the initial model and the model obtained after human annotations (simulated) and fine-tuning do not differ significantly. Since this dataset contains more labeled samples than the previous two, we ran twenty human annotation (simulated) iterations. The performance of models using different learning techniques is shown in Table 5.4. We will compare the performance of DeepIcL with that of our proposed DeepFeatAL.

FIGURE 5.6: AL-RidgeC vs DeepFeatAL on ESC-50. Figure 5.6 compares the performance of AL-RidgeC and DeepFeatAL on the ESC-50 dataset. At the end of the AL loops, DeepFeatAL achieves the highest prediction accuracy. Its counterpart achieves better accuracy only after the fourth iteration. DeepFeatAL performs better in the remaining six cases. Additionally, the SDAL technique’s highest accuracy has been achieved in six iterations of DeepFeatAL, saving 14.28% of the labeling budget. Figure 5.7 displays the statistical significance of the final models obtained after completing all human iterations for DeepIcL, SDAL, and DeepFeatAL on the ESC-50 dataset. According to the figure, all the models perform statistically significantly. It demonstrates that the model produced by DeepFeatAL outperforms other models produced by other methods and ranks first.

FIGURE 5.7: CD diagram for statistical significance test of performance of different learning methods on ESC-50 dataset.

FIGURE 5.8: AL-LogisticReg vs DeepFeatAL on US8K

FIGURE 5.9: CD diagram for statistical significance test of performance of different learning methods on US8K dataset.

FIGURE 5.10: DeepIcL vs DeepFeatAL on iWingBeat dataset. Figure 5.10 compares the performance of DeepIcL, AL-LogisticReg and DeepFeatAL with respect to the iWingBeat dataset. It demonstrates that at the end of the AL loops, DeepFeatAL significantly outperforms its competitors. The highest accuracy achieved by DeepIcL was 68.27%, whereas DeepFeatAL achieved 68.86% after only ten iterations. Thus, it saves 50% of the labeling effort.

FIGURE 5.12: Edge-based DeepIcL vs SDAL on ESC-50 dataset. Figure 5.12, generated from the data presented in Table 5.6, demonstrates that AL-LogisticReg and AL-RidgeC outperform DeepIcL after the second iteration. The highest accuracy produced by DeepIcL is 57.70 ± 0.15%, whereas AL-LogisticReg achieves this accuracy after only three iterations, thus saving more than 47% of the labeling budget. This difference in performance ensures that active learning assists classifiers in mastering particularly difficult samples, ultimately resulting in improved classification performance. These results are consistent with the results for the full-size network.

FIGURE 5.13: Edge-based DeepIcL vs SDAL on US8K dataset. The same comparative analysis was carried out on the US8K dataset. Figure 5.13 demonstrates that during incremental learning, DeepIcL produces the highest prediction accuracy of 83.13%. In contrast, AL-LogisticReg achieves this level of accuracy after only nine rounds of human labeling, thus saving 40% of the labeling budget. Table 5.7 displays 95% CI for all the models. After fifteen iterations of human annotation (simulated), DeepIcL achieves a prediction accuracy of 82.63 ± 0.06%, whereas AL-KNC, AL-LogisticReg and AL-RidgeC achieve 83.15 ± 0.06%, 83.27 ± 0.06% and 82.10 ± 0.06% respectively. As a result, AL-LogisticReg produces the highest accuracy for this dataset as well.

FIGURE 5.14: Edge-based DeepIcL vs SDAL on iWingBeat dataset. ... in previous section as well.

FIGURE 5.15: Edge-based AL-LogisticReg vs DeepFeatAL on ESC-50 dataset

FIGURE 5.16: CD diagram for statistical significance test of performance of different learning methods on ESC-50 dataset. ... ranks first.

FIGURE 5.17: Edge-based AL-LogisticReg vs DeepFeatAL on US8K dataset. The same comparison is shown in Figure 5.17 between the performance of AL-LogisticReg and DeepFeatAL on the US8K dataset. It is evident that during the AL process, DeepFeatAL always produces the most accurate predictions. In fact, the highest level of accuracy achieved by AL-LogisticReg is 83.38%, whereas DeepFeatAL achieved a higher level of accuracy after only five iterations (83.56%). Thus, the labeling effort is reduced by 66.67%.

... and ranks first. Figure 5.19 presents the same performance comparison between DeepIcL and DeepFeatAL on the iWingBeat dataset. It clearly reveals that during human annotation (simulated) iterations, DeepFeatAL almost always produces the highest prediction accuracy. The highest accuracy achieved by DeepIcL was achieved by DeepFeatAL after only eight iterations. Consequently, the labeling effort is reduced by 60%.

FIGURE 5.19: Edge-based DeepIcL vs DeepFeatAL on iWingBeat dataset

FIGURE 5.20: CD diagram for statistical significance test of performance of different learning methods on iWingBeat dataset. DeepFeatAL outperforms other models produced by other methods and ranks first.

FIGURE 6.1: No. of samples known before and after retraining with the 1,000 newly annotated samples

FIGURE 6.2: Retraining vs DeepFeatAL performance

2.2.3 XNOR-Net

FIGURE 2.9: Typical convolution layer vs binary convolution layer. XNOR-Net represents the weights in 1-bit and performs convolutions using bit-wise XNOR operations, thus saving a significant amount of memory and computation. In an XNOR-Net [109], all layers except the first and the last are binary layers. The input, activations, and the weights of the binary layers are represented using either +1 or -1 and are stored efficiently with single bits. Figure 2.9 shows the construction of a typical convolution layer and an equivalent binary convolution layer.
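The XNOR and POPCOUNT idea can be sketched as a dot product between two vectors of +1/-1 values packed into single bits. This is a minimal illustration, not the thesis's implementation: the helper names `pack_bits` and `binary_dot`, the bit encoding (+1 as bit 1, -1 as bit 0), and the example vectors are assumptions, and the real-valued scaling factors used by XNOR-Net are omitted.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into one integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for position, sign in enumerate(signs):
        if sign == 1:
            word |= 1 << position
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and POPCOUNT."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where the two signs agree
    matches = bin(xnor).count("1")     # POPCOUNT
    return 2 * matches - n             # agreements add +1, disagreements add -1


activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * w for x, w in zip(activations, weights))
fast = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
print(reference, fast)  # 2 2
```

Eight multiply-accumulate operations collapse into one XNOR, one population count, and one subtraction, which is the source of the memory and computation savings described above.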

Chapter 3: Baseline DNN Architecture and its MCU-Compatible Version

TABLE 3.1: SOTA models for ESC-10, ESC-50, US8K and AE datasets. Note: The grayed-out rows represent the DNN models that were published after the work in this chapter was completed (i.e., after 2019). Abbreviations: ATTN (Attention), CO (Cochleagram), CRP (Cross Recurrence Plot), CT (Chromagram), DGT (The Discrete Gabor Transform), ENS (Ensemble Model), FBE (FilterBank Energies), GT (GammaTone), LP (Log-Power Spectrogram), Mel (Mel Spectrogram), MFCC (Mel-Frequency Cepstral Coefficients), PE (Phase-Encoded), Raw (Raw audio wave), Spec (Spectrogram), TEO (Teager’s Energy Operator)

TABLE 3.2: ACDNet architecture. Output shape represents (channel, frequency, time), i_len is the input length, n_cls is the number of output classes, sr is the sampling rate in Hz. ... models suitable for MCUs with comparable accuracy.

TABLE 3.3: CV Accuracy and estimated 95% CI of classification accuracy of ACDNet on ESC-10, ESC-50, US8K and AE datasets.

TABLE 3.4: ACDNet in the SOTA leaderboard for raw audio classification on ESC-10&50, US8K and AE datasets. Accuracy values with asterisk (*) are reproduced by us.

TABLE 3.6: Parameters, size, and computation requirements for current SOTA models on the ESC-50 dataset. Here, x denotes the size and FLOPs of ACDNet, for the Size and FLOPs column, respectively. Note: The grayed-out rows represent the DNN models that were published after the work in this chapter was completed (i.e., after 2019).

... present this method in the following form.

3.4.4 Experimental Results

We use the same experimental setup that has been discussed in Section 3.3.2 for this experimental study. ESC-50 (fold-4) is used to fine-tune the network during the compression process. The performance of the resulting network is cross-validated on all the four datasets and the results are reported.

TABLE 3.7: Sparsifying the Weights in ACDNet. The sparsified model is further pruned through channel pruning. We prune 80% and 85% channels using all the discussed methods and present the results in Tables 3.8 and 3.9. Models 1-2, 3-4, 5-6 and 7-8 in both the tables are the resulting models of magnitude-based pruning, Taylor pruning, and the two combinations of our proposed hybrid pruning respectively. All models are iteratively pruned and fine-tuned. The column Fine-tuning Accuracy shows the accuracy obtained at the end of the iterative pruning and fine-tuning.

TABLE 3.8: Models found after 80% channel pruning using magnitude-based ranking, Taylor criteria-based ranking and our hybrid pruning approach. From the tables (Table 3.8 and Table 3.9), we observe that pruning and fine-tuning does not recover the loss in accuracy as hoped for. To achieve the best accuracy, we therefore conduct further full re-training of the networks with their existing weights, and scratch-training by re-initializing their weights. We use the base network’s training settings for re-training and scratch training. The accuracy of these different training processes is reported in the Re-training Accuracy and Scratch-training Accuracy columns respectively. In addition to re-training and scratch-training, we also train the re-initialized fresh networks using knowledge distillation [97], and the result is presented in Table 3.10.

TABLE 3.11: Micro-ACDNet architecture for input length 30225 (approximately 1.51s audio @ 20kHz). Table 3.11 presents the architecture of Micro-ACDNet.

TABLE 3.13: ACDNet, AclNet, Micro-ACDNet and Micro-AclNet performance on different datasets. ... and Micro-ACDNet achieves the highest prediction accuracy, we conduct scratch-training for all the networks, with 5-fold (ESC-10&50) and 10-fold (US8K) CV over five independent runs. Table 3.13 presents their classification accuracy on different datasets. The results in the table clearly indicate that the compressed models generalize well on all the four datasets. Their classification accuracy is very close to their base model’s classification accuracy.

3.4.4.6 Comparing Micro-ACDNet with Other Networks

Chapter 4: Study on Deep Architecture Compression Techniques

TABLE 4.2: Size and computation requirements for ACDNet, AclNet including their Micro and QMicro versions

TABLE 4.3: Prediction accuracy (%) of Micro versions of ACDNet and AclNet including their quantized versions (QMicro) on ESC-10,...,50 datasets. The column header ’#Cls’ stands for ’No. of Classes’ and ’Baseline’ stands for the base model (i.e., ACDNet or AclNet). Although the smaller models require fewer resources, they lose classification accuracy because they have less capacity to learn. According to [42], Micro-ACDNet has 80% less capacity than ACDNet and QMicro-ACDNet is a quarter precision version (8-bit) of Micro-ACDNet. Table 4.3 and Figure 4.1 present a comparison of the accuracy achieved by the three versions of both the baseline networks on ESC-10,...,50 datasets (derived using Equation 4.2. For further details, see Section 4.3.1). The table and the figure show that all the three versions of both the networks produce state-of-the-art and near state-of-the-art accuracy on ESC-10; however, the accuracy drops continuously with an increase in the number of classes, as may be expected.
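To illustrate what a quarter-precision (8-bit) version of a 32-bit model involves, the following is a minimal sketch of symmetric linear quantization. The source does not state which quantization scheme was used, so the scheme, the function names `quantize_int8` and `dequantize`, and the example weights are all assumptions made for illustration.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale (assumed scheme)."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127.0 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.0, 0.07, 0.0, 0.93]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each weight now needs 8 bits instead of 32, at the cost of a rounding
# error of at most half a quantization step per weight.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(worst_error <= scale / 2 + 1e-9)
```

The rounding error introduced per weight is one plausible contributor to the accuracy loss that the quantized models show in the tables.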

TABLE 4.4: Prediction accuracy (%) of Micro versions of ACDNet and AclNet including their quantized versions (QMicro) on UrbanSound8k dataset. Here, ‘Baseline’ stands for the base model (i.e., ACDNet or AclNet). The scenario is not different when we apply the same networks on the UrbanSound8k dataset. Table 4.4 shows that QMicro-ACDNet has lost almost 13.5% accuracy compared to the base network. For QMicro-AclNet, the loss in accuracy is 11.7% compared to the base network. We note that specialized quantization targeted to a particular model can potentially reduce the loss of the accuracy that the models experience during the quantization process. However, improving any technique used to demonstrate the process is beyond the scope of the study of this chapter and while a change in quantization techniques may shift the results, the general trends are expected to remain the same.

TABLE 4.5: Size and computation requirements for the baseline (i.e., ACDNet or AclNet), Mini and their XNOR-Net versions XBaseline and XMini respectively. ... spaces alone already leave hardly any room for the actual computations. Hence, we need smaller versions of the original networks whose resource requirements do not exceed the resources available in the MCUs. To compare the network performance, memory, and computation requirements with QMicro-ACDNet, we create Mini-ACDNet such that its XNOR-Net version has similar requirements to QMicro-ACDNet (see Tables 4.2 and 4.5). To derive Mini-ACDNet, we use the same technique as that used to derive Micro-ACDNet, summarized above and fully detailed in [42]. The same procedure is used to derive Mini-AclNet from AclNet.

TABLE 4.6: Prediction accuracy (%) of Baselines and Mini versions including their XNOR-Net versions on ESC-10,...,50 datasets. The column header ’#Cls’ stands for ’No. of Classes’ and ’Baseline’ stands for the base network (i.e., ACDNet or AclNet). Table 4.5 also shows that XMini versions are extremely small (128.5KB and 133.12KB) and require considerably less computation. This clearly allows the network to be deployed on the current off-the-shelf MCUs. Furthermore, from Table 4.6, we can see that the XNOR networks produce reasonable accuracy for a smaller number of classes (ESC-10 with 10 classes). However, as the number of classes increases, the network performance of the XNOR versions decreases rapidly.

TABLE 4.7: Prediction accuracy (%) of the Baseline (ACDNet or AclNet), Mini including their XNOR-Net versions and Micro including its quantized version (QMicro) on AudioEvent-10,20,28 datasets. The column header ’#Cls’ stands for ’No. of Classes’. XMini and QMicro versions have approximately the same memory requirements.

TABLE 4.8: Prediction accuracy (%) of BNN-GAP8, Nano and GAP8 versions of both the baselines including the XNOR-Net for GAP8 (i.e., XGAP8) and the quantized version of Nano (i.e., QNano) on AudioEvent (28 classes) dataset. However, the larger resource requirements of our base networks do not allow us a direct and fair comparison to the BNN-GAP8 network presented in [176]. Therefore, we derive two additional smaller networks (QNano-ACDNet and XGAP8-ACDNet) with memory requirements equivalent to BNN-GAP8. QNano-ACDNet is an 8-bit quantized version of the associated full-precision network Nano-ACDNet, which is derived by pruning channels from ACDNet. XGAP8-ACDNet is an XNOR network derived from the full-precision network GAP8-ACDNet. GAP8-ACDNet is also derived by pruning channels from ACDNet. The equivalent derivatives are also derived from AclNet. The new models are constructed using the same procedures as described above. The tests of these networks on the AudioEvent dataset used in [176] are summarized in Table 4.8.
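As a rough illustration of what an 8-bit quantized version of a full-precision network involves, the sketch below maps a list of floating-point weights to integer codes in the range 0 to 255 using a simple min/max affine scheme and then reconstructs them. This is a generic scheme chosen for illustration. The source does not specify which quantization method was used for the QNano or QMicro models, and the function names are hypothetical.

```python
# Generic min/max 8-bit quantization sketch (assumed scheme, for illustration).

def quantize_uint8(values):
    """Map floats to integer codes in [0, 255]; return (codes, offset, scale)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 if hi > lo else 1.0  # width of one quantization step
    codes = [min(255, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize(codes, offset, scale):
    """Reconstruct approximate float values from the integer codes."""
    return [offset + c * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, offset, scale = quantize_uint8(weights)
restored = dequantize(codes, offset, scale)

# The reconstruction error per weight is at most about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing one byte per weight instead of a 32-bit float reduces the weight storage by a factor of four, at the cost of the rounding error measured by `max_error`.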

TABLE 4.9: State-of-the-art accuracy (%) for various image datasets. ’#Cls’ stands for ’Number of Classes’. Most of the work on XNOR nets so far has taken place in the image domain. A comparison with this body of work is instructive. We summarize the state of the art for the most widely used datasets in Table 4.9. The accuracy achieved by XNOR nets is impressive (second last column), but it has to be noted that these nets are significantly larger than our target size. None of these models fit on the relevant MCUs with the single exception of the one for MNIST. This is a very simple dataset with only 10 classes.

TABLE 4.10: RESNET-18 and its XNOR-Net version on CIFAR-10,20,...,100 datasets. To verify whether our own results for audio classification are consistent with the image domain, we conducted additional experiments on image classification using XNOR. We have used the XNOR version of RESNET-18 [177] to classify the widely used benchmark image datasets CIFAR-10 and CIFAR-100. We have created variably sized subsets of CIFAR-100 to determine whether the trend of the loss in accuracy is similar to audio classification. The experimental details are provided in Section 4.3. Table 4.10 and Figure 4.6 summarize the results. The sizes of the full precision models are between 42.63MB and 42.80MB and the computation requirements between 95.17M FLOPs and 95.21M FLOPs. For the XNOR-Net version, the model size is approximately 1.34MB using 2.06M FLOPs. The results of this experiment are consistent with those for the audio domain, and it is clearly visible that a similar performance drop for XNOR occurs as the number of classes increases.

TABLE 5.1: ACDNet in different training settings for incremental Learning. The first row indicates the performance of the model before incremental learning.

TABLE 5.2: DeepIcL vs SDAL using ACDNet on ESC-50 dataset

TABLE 5.3: DeepIcL vs SDAL on US8K dataset The third dataset on which the same comparative analysis has been performed is iWingBeat. Figure 5.5, which is populated from Table 5.4, reveals that DeepIcL has the highest prediction accuracy. This scenario is clearly different from the previous two datasets. DeepIcL achieves 68.27% at the end of human annotation (simulated), which is higher

TABLE 5.4: DeepIcL vs SDAL on iWingBeat dataset

TABLE 5.5: DeepIcL vs SDAL vs DeepFeatAL on All Datasets. Furthermore, our significance test shows that the performance of the model produced by DeepFeatAL is statistically significant. We used the statistical significance test described in [44] to determine the statistical significance of the performance of each model. We begin by rejecting the null hypothesis using the Friedman test [188]. Then, we conduct a pairwise post-hoc analysis in accordance with Benavoli et al. [225], utilizing the Wilcoxon signed-rank test [189] and Holm’s alpha (5%) correction [226, 227]. We employ Demšar [228]’s Critical Difference (CD) diagram as a graphical representation. A thick horizontal line in the CD diagram indicates that the difference in accuracy among a group of classifiers is not statistically significant.
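The Holm correction step of the post-hoc analysis can be sketched as follows. The sketch takes p-values that would come from the pairwise Wilcoxon signed-rank tests and applies Holm's step-down procedure at the 5% level. The p-values used here are invented purely for illustration and are not results from the source. The function name is hypothetical.

```python
# Holm step-down correction sketch; the p-values below are made-up examples.

def holm_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected under Holm's method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


comparisons = ["DeepFeatAL vs DeepIcL", "DeepFeatAL vs SDAL", "DeepIcL vs SDAL"]
p_values = [0.004, 0.030, 0.200]  # hypothetical pairwise p-values

decisions = holm_reject(p_values)
significant = [name for name, r in zip(comparisons, decisions) if r]
```

In this invented example the second comparison has p = 0.030, which would pass an uncorrected 5% threshold but is not rejected after correction, because it is compared against 0.05 / 2 = 0.025.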

TABLE 5.6: Edge-based DeepIcL vs SDAL on ESC-50 dataset

TABLE 5.7: Edge-based DeepIcL vs SDAL on US8K dataset

TABLE 5.8: Edge-based DeepIcL vs SDAL on iWingBeat dataset

TABLE 5.9: Edge-based DeepIcL vs SDAL vs DeepFeatAL on all the datasets

For rolling recognition, we applied the trained model to the test set (49h of continuous recordings). Four criteria were used during the post-processing stage. The first one is 80% overlapping samples that are predicted to be true (Algorithm 3). And the rest are a sequence of two, three, and four samples (Algorithm 4) predicted to be true.
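The run-based criteria described above (two, three, or four consecutive samples predicted true) can be sketched as a simple run-length filter over the per-sample predictions. Algorithms 3 and 4 are not reproduced in this text, so the function below is only an assumed reading of the run-based rule. The name `detect_runs` and the example predictions are hypothetical.

```python
# Assumed sketch of a "k consecutive positive samples" post-processing rule.

def detect_runs(predictions, min_run):
    """Return (start, end) pairs, end exclusive, for runs of at least
    min_run consecutive positive predictions."""
    events = []
    start = None
    for i, positive in enumerate(predictions):
        if positive and start is None:
            start = i  # a new run of positives begins
        elif not positive and start is not None:
            if i - start >= min_run:
                events.append((start, i))
            start = None  # the run has ended
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(predictions) - start >= min_run:
        events.append((start, len(predictions)))
    return events


# 1 = sample predicted as a call, 0 = predicted as background (made-up data).
preds = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0]

runs_of_two = detect_runs(preds, 2)
runs_of_four = detect_runs(preds, 4)
```

Raising `min_run` suppresses isolated positives, which lowers false positives but can also discard short true detections.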

TABLE 6.1: Classification of 49h continuous audio recording. Table 6.1 shows the outcome of this post-processing. The results in the Accuracy column are quite impressive. The figures in the Recall column, on the other hand, are not that impressive. In addition, the FN rates are extremely high. Although the FP rate decreases as you progress to the last rows, the other parameters, such as Recall and FN rates, increase significantly. These findings show that, despite a high accuracy rate, we have unacceptable FP and FN rates for a real-world application. The impressive accuracy figure is due to the fact that the test set contains only 0.23% BTF calls and the remaining 99.77% are BGS.
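The effect of the class imbalance on accuracy can be checked with a small worked example. The sketch assumes a hypothetical test set of 10,000 samples with the 0.23% / 99.77% split stated above and a degenerate classifier that labels every sample as BGS. The counts are illustrative and are not taken from Table 6.1.

```python
# Worked example: accuracy vs recall under the 0.23% / 99.77% class split.

def accuracy_and_recall(tp, fp, tn, fn):
    """Compute accuracy and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical test set: 10,000 samples, 23 BTF calls, 9,977 background (BGS).
# A classifier that always predicts BGS finds no calls at all.
accuracy, recall = accuracy_and_recall(tp=0, fp=0, tn=9977, fn=23)
```

The always-BGS classifier reaches 99.77% accuracy with zero recall, which is why accuracy alone is not informative on this test set.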

TABLE 6.2: Training set, validation set and test set details. The remaining audio files (6 audio files) are used to create a test set. For each of the six audio files, there is a subset in the test set. Each subset has approximately 72,000 overlapping samples. There are 426 overlapping BTF calls in the six files (107 actual calls). The details of the training, validation and test sets are presented in Table 6.2.

TABLE 6.3: Results after post-processing of the predictions from ACDNet. We applied the trained models to detect the presence of BTF calls in six audio files (2h each). Post-processing the result from ACDNet yields 21 TPs and 8 FPs. That is, only 29 five-second audio segments, which is 2m 25s out of 12h, are handed to human experts for further verification. In the case of Micro-ACDNet, the post-processing yields 16 TPs and 14 FPs. Despite having fewer TPs and more FPs, it does not miss any file with a BTF call. Only 30 five-second audio segments, which is 2.5m out of 12h, are handed to human experts for further verification. Table 6.3 and Table 6.4 present the results achieved from the above process.
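The review workload figures above follow from simple arithmetic, sketched here. The segment length (five seconds) and total duration (12h) come from the text. The function name is hypothetical.

```python
# Arithmetic check of the human review workload quoted in the text.

def review_workload(n_segments, segment_seconds=5, total_hours=12):
    """Return (seconds to review, fraction of the total recording)."""
    seconds = n_segments * segment_seconds
    fraction = seconds / (total_hours * 3600)
    return seconds, fraction


acdnet_seconds, acdnet_fraction = review_workload(21 + 8)   # 21 TPs + 8 FPs
micro_seconds, micro_fraction = review_workload(16 + 14)    # 16 TPs + 14 FPs

acdnet_min_sec = divmod(acdnet_seconds, 60)  # (minutes, seconds)
micro_min_sec = divmod(micro_seconds, 60)
```

In both cases the expert listens to well under 1% of the 12h of recordings.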

TABLE 6.4: Results after post-processing of the predictions from Micro-ACDNet

TABLE 6.5: Results after post-processing of the predictions from Micro-ACDNet trained on only 34h (i.e., 17 files) of audio recording.

TABLE 6.6: Results after post-processing of the predictions from Micro-ACDNet. The DeepFeatAL models and the retrained models perform similarly to the naked eye, with only a small performance difference between them. However, after examining Figures 6.2 and 6.1, as well as Table 6.7, it is clear that the model found in the DeepFeatAL process produces better results. Micro-ACDNet trained and fine-tuned on 35h of recordings now outperforms Micro-ACDNet trained on 80h of recordings in terms of accuracy. Furthermore, Micro-ACDNet achieved through DeepFeatAL now produces nearly the same number of TPs as its parent model (i.e., ACDNet) with roughly twice the number of FPs. Although the FPs are higher, they are manageable because only 36 five-second audio segments (i.e., 3m) are handed over to the human expert for verification out of 12h of recordings. Table 6.6 displays the detailed record-by-record result of the DeepFeatAL models.

TABLE 6.7: Results after post-processing of the predictions from Micro-ACDNet

6.7 Conclusion


Pruning vs XNOR-Net: A Comprehensive Study of Deep Learning for Audio Classification on Edge-Devices

by Md Mohaimenuzzaman

2023

Deep learning has celebrated resounding successes in many application areas of relevance to the Internet of Things (IoT), such as computer vision and machine listening. These technologies must ultimately be brought directly to the edge to... more


Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices

by Md Mohaimenuzzaman

2023

Significant efforts are being invested to bring state-of-the-art classification and recognition to edge devices with extreme resource constraints (memory, speed, and lack of GPU support). Here, we demonstrate the first deep network for... more

SOTA models for ESC-10, ESC-50, US8K and AE datasets sorted by year of publication. Abbreviations: ATTN (Attention), CO (Cochleagram), CRP (Cross Recurrence Plot), CT (Chromagram), DGT (The Discrete Gabor Transform), ENS (Ensemble Model), FBE (FilterBank Energies), GT (Gammatone), LP (Log-Power Spectrogram), Mel (Mel Spectrogram), MFCC (Mel-Frequency Cepstral Coefficients), P (Phase-Encoded), Raw (Raw audio wave), Spec (Spectrogram), TEO (Teager's Energy Operator).

ACDNet architecture. Output shape represents (channel, frequency, time), i_len is the input length, n_cls is the number of output classes, sr is the sampling rate in Hz.

Fig. 1. Visual representation of ACDNet model architecture.

Fig. 2. (a) Comparison of 80% pruned models from Table 8. (b) Comparison of 85% pruned models from Table 9. For visualization, the variables are linearly transformed to range [50 - max(variable)].

CV Accuracy and estimated 95% CI of classification accuracy of ACDNet on ESC-10, ESC-50, US8K and AE datasets. the he_normal [69] initialization method. At the end of the training and validation, the best model is used as the final model. The loss function optimized is shown in Eq. 5. Here, x is the input mini-batch, f_θ(x) is the approximation and y is the true distribution of labels for the input data. Furthermore, n is the mini-batch size, m is the number of classes, η is the learning rate, and the parameters are updated as θ ← θ − η ∂(loss)/∂θ. Table 3

Parameters, size, and computation requirements for current SOTA models on the ESC-50 dataset. Here, x denotes the size and FLOPs of ACDNet, for the Size and FLOPs column, respectively.

ACDNet in the SOTA leaderboard for raw audio classification on ESC-10&50, US8K and AE datasets. Accuracy values with asterisk (*) are reproduced by us.

ACDNet in overall SOTA leaderboard of ESC-10&50, US8K and AE datasets. Column ’Acc’ presents model accuracy. Accuracy values with asterisk (*) are reproduced by us.

Performance of Micro-ACDNet when trained using Knowledge Distillation.

Models found after 80% channel pruning using magnitude-based ranking, Taylor criteria-based ranking and our hybrid pruning approach.

Models found after 85% channel pruning using magnitude-based ranking, Taylor criteria-based ranking and our hybrid pruning approach.

ACDNet vs AclNet performance after 80% compression.

Micro-ACDNet architecture for input length 30,225 (approximately 1.51s audio @ 20kHz).

Parameters, size and computation requirements for ACDNet and Micro-ACDNet for approximately 1.51s audio @ 20kHz.

ACDNet, AclNet, Micro-ACDNet and Micro-AclNet performance on different datasets.

Prediction accuracy on ESC-50 after quantization.


A Hybrid System for Audio Segmentation and Speech-endpoint Detection of Broadcast News

by Andrey Ronzhin

2023

A hybrid speech/non-speech detector is proposed for the pre-processing of broadcast news. During the first stage speech/non-speech classification of uniform overlapping segments is performed. The accuracy in the detection of boundaries is... more

The speech detection method is based on calculation of the information entropy of the signal spectrum as the measure of uncertainty or disorder in a given distribution [6]. The difference between the entropy of speech segments and the entropy of background noise is used for speech endpoint detection. Such a criterion is less sensitive to variations of the signal amplitude than energy-based methods. The method is a modification of the speech detection approach proposed by J.-L. Shen [7] and adds new levels to the analysis of the speech signal (Figure 1). The audio signal is divided into short segments of 11.6 ms each with 25% overlap. The short-time signal spectrum is computed using FFT, and the calculated spectrum is normalized over all frequency components, giving the probability density function p_i. Acceptable values of the probability density function are upper and lower bounded. This restriction allows us to

where the upper and lower bounds of the probability density have been experimentally determined to be 0.3 and 0.01, respectively. At the next stage the information spectral entropy h is estimated, and median smoothing in a window of 5-9 segments is applied. Finally, a logical-temporal processing of h (Figure 2) takes into account the possible durations of speech and non-speech fragments.
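The entropy computation described above can be sketched as follows. The sketch takes an already computed power spectrum for each segment, normalizes it to a probability density, applies the bounds 0.01 and 0.3 quoted in the text, computes the entropy, and median-smooths the per-segment values. The exact bounding rule is not recoverable from the text, so dropping out-of-range components is an assumption, as are the function names and the window size of 5 chosen from the stated 5-9 range.

```python
# Sketch of bounded spectral entropy with median smoothing (assumed details).
import math
from statistics import median


def bounded_spectral_entropy(power_spectrum, lower=0.01, upper=0.3):
    """Entropy of a normalized spectrum, using only components whose
    probability lies within [lower, upper] (assumed bounding rule)."""
    total = sum(power_spectrum)
    if total == 0:
        return 0.0
    entropy = 0.0
    for power in power_spectrum:
        p = power / total  # normalize so the components sum to 1
        if lower <= p <= upper:
            entropy -= p * math.log(p)
    return entropy


def median_smooth(values, window=5):
    """Median filter over a centered window, truncated at the edges."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


flat_spectrum = [1.0] * 8                   # energy spread over all bins
peaked_spectrum = [100.0, 0.1, 0.1, 0.1]    # energy concentrated in one bin

h_flat = bounded_spectral_entropy(flat_spectrum)
h_peaked = bounded_spectral_entropy(peaked_spectrum)
smoothed = median_smooth([1, 1, 9, 1, 1], window=3)
```

A flat spectrum yields a high entropy and a spectrum dominated by one component yields a low one. The median filter removes isolated spikes in the entropy sequence before any duration-based decision is made.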

Figure 3: DET curves for speech/non-speech segment-based classification. Mean and variance of MFCCs are computed over each segment, with (solid line) or without (dashed line) equal-loudness pre-emphasis and cube-root intensity-loudness compression [5]. The minimal values of the corresponding detection cost functions (DCF) are also presented (circles).

Table 1: Speech/non-speech segment based classification results

Table 2: Speech/silence classification results based on spectrum entropy

4. Conclusions

In this paper we have applied a two-stage speech detection system. During the first stage, segment-based speech/non-speech classification is performed based on MFCC features and Support Vector Machines within 250 ms accuracy. An improvement is reported when loudness equalization and cube-root compression are applied to the power spectrogram after critical-band analysis. Extracted speech segments are further processed through an entropy-based method for speech-endpoint detection within 10 ms accuracy. The proposed system can successfully address the two-fold requirement for robustness and accuracy during the pre-processing stages preceding broadcast speech transcription or speaker diarization.


Spectrogram Classification Using Dissimilarity Space

by Sheryl Brahnam

2023, Applied Sciences


Content-based audio classification using collective network of binary classifiers

by Moncef Gabbouj

2023, 2011 IEEE Workshop on Evolving and Adaptive Intelligent Systems (EAIS)

In this paper, a novel collective network of binary classifiers (CNBC) framework is presented for content-based audio classification. The topic has been studied in several publications before, but in many cases the number of different... more


An interactive query implementation over high precision progressive query scheme

by Moncef Gabbouj

2023, Progressive

The recently proposed Progressive Query method is a dynamic retrieval technique, which is mainly designed to bring an effective solution especially for queries on large-scale multimedia databases and furthermore to provide periodic... more


Content-based audio classification and retrieval using the nearest feature line method

by Nilesh Patil

2023, IEEE Transactions on Speech and Audio Processing

Support vector machines (SVMs) have been recently proposed as a new learning algorithm for pattern recognition. In this paper, the SVMs with a binary tree recognition strategy are used to tackle the audio classification problem. We... more


Fast genre classification and artist identification

by George Tzanetakis

2023, Proc. ISMIR-Mirex

This abstract describes the audio feature extraction and classification algorithm used for the University of Victoria submission to the MIREX (Music Information Retrieval Exchange) 2005. The same audio features and classification... more


Audio Classification

Key research themes

1. How can feature extraction and dimensionality reduction improve accuracy in music genre and audio type classification?

2. What roles do binaural and spatial features play in classifying complex acoustic scenes and spatial audio recordings?

3. How are deep learning and neuromorphic approaches advancing audio event classification and bioacoustic signal recognition?
