Emotion Recognition from Speech Research Papers

Emotion Recognition from Spontaneous Tunisian Dialect Speech

2025, Emotion Recognition from Spontaneous Tunisian Dialect Speech

Emotional expressions are a fundamental aspect of human communication, with speech being one of the most natural modes of interaction. Speech Emotion Recognition (SER) is a significant research topic in Natural Language Processing (NLP),... more

A PROSODY-DRIVEN EXTENSION TO EMBEDDING-GUIDED NEURAL VOICE CONVERSION FOR NATURAL AND EXPRESSIVE SPEECH SYNTHESIS

by A BALA RAJU

2025, Industrial Engineering Journal

Recent advancements in voice conversion systems have been largely driven by deep learning techniques, enabling the high-quality synthesis of human speech. However, existing models often fail to generate emotionally expressive speech,... more

ZERO-SHOT EMOTION RECOGNITION IN CROSS-LANGUAGE SPEECH USING CONTRASTIVE LEARNING

by Vince Campbell

2025

This paper presents a zero-shot learning framework for speech-based emotion recognition across languages using contrastive learning. Traditional emotion AI models often depend on large, labeled corpora in a specific language, limiting... more

Churn Prediction via Multimodal Fusion Learning: Integrating Customer Financial Literacy, Voice, and Behavioral Data

by David Hason

2025, arXiv (Cornell University)

In today's competitive landscape, businesses grapple with customer retention. Churn prediction models, although beneficial, often lack accuracy due to the reliance on a single data source. The intricate nature of human behavior and... more

Leveraged Mel Spectrograms Using Harmonic and Percussive Components in Speech Emotion Recognition

by David Hason

2025, SSRN Electronic Journal

Speech Emotion Recognition (SER) affective technology enables the intelligent embedded devices to interact with sensitivity. Similarly, call centre employees recognise customers' emotions from their pitch, energy, and tone of voice so as... more

Demographic and Clinicohistological Profiles of Women Diagnosed with Breast Cancer at Al-El Wiya Maternity Teaching Hospital / Baghdad: A Retrospective Study

by Maeda H Mohammad

2025, Indian Journal of Forensic Medicine & Toxicology

Knowing the incidence of breast cancer, its diagnosis, and its treatment methods gives a strategic approach for community awareness and rapid management. This study aimed to: estimate the demographic pattern including age, marital status,... more

Advancements in Bangla Speech Emotion Recognition: A Deep Learning Approach with Cross-Lingual Validation

by Khorshed Alam and

2024, 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring)

Speech Emotion Recognition (SER) is a method where computers learn to recognize human emotions from speech to improve communication. In this study, we present an innovative Bangla SER framework, incorporating data augmentations, feature... more

Disentangling Prosody Representations with Unsupervised Speech Reconstruction

by Stefan Wermter

2024, arXiv (Cornell University)

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker... more

Analysis and Assessment of State Relevance in HMM-Based Feature Extraction Method

by Simon Dobrisek

2024, Lecture Notes in Computer Science

In the article we evaluate the importance of different HMM states in an HMM-based feature extraction method used to model paralinguistic information. Specifically, we evaluate the distribution of the paralinguistic information across... more

Human Speech Emotion Recognition Using Artificial Neural Networks Technique

by Md Enamul Haque

2024, IEEE

Speech Emotion Recognition is an essential research analysis for utilizing interconnection with embedded systems to introduce human interference technology. In this study, we proposed a model to recognize emotion from speech using the... more

Speaker Recognition and Gender Identification using Artificial Neural Network and Support Vector Machine

by Ayush Kumar Krishna

2024, International Journal for Research in Applied Science and Engineering Technology

Identity of a person via voice is one of the most interesting techniques used for user identification. Accuracy of identification process depends on the number of feature vectors and the number of speakers. This paper aims to develop a... more

Emotional control and visual representation using advanced audiovisual interaction

by Μ. Στραπατσάκη

2024, International Journal of Arts and Technology

Modern interactive means combined with new digital media processing and representation technologies can provide a robust framework for enhancing user experience in multimedia entertainment systems and audiovisual artistic installations... more

University of Ljubljana System for Interspeech 2011 Speaker State Challenge

by Simon Dobrišek

2024, Interspeech 2011

The paper presents our efforts in the Interspeech 2011 Speaker State Challenge. Both systems, for the Intoxication and the Sleepiness Sub-Challenge, are based on a Universal Background Model (UBM) in a form of a Hidden Markov Model (HMM),... more

Emotional control and visual representation using advanced audiovisual interaction

by Marianne Strapatsakis

2024, International Journal of Arts and Technology

Modern interactive means combined with new digital media processing and representation technologies can provide a robust framework for enhancing user experience in multimedia entertainment systems and audiovisual artistic installations... more

Speaker Recognition and Gender Identification using Artificial Neural Network and Support Vector Machine

by Ayush Krishna

2023, International Journal for Research in Applied Science and Engineering Technology

Identity of a person via voice is one of the most interesting techniques used for user identification. Accuracy of identification process depends on the number of feature vectors and the number of speakers. This paper aims to develop a... more

Demographic and Clinicohistological Profiles of Women Diagnosed with Breast Cancer at Al-El Wiya Maternity Teaching Hospital / Baghdad: A Retrospective Study

by Zaynab Saad

2023, Indian Journal of Forensic Medicine & Toxicology

Knowing the incidence of breast cancer, its diagnosis, and its treatment methods gives a strategic approach for community awareness and rapid management. This study aimed to: estimate the demographic pattern including age, marital status,... more

Churn Prediction via Multimodal Fusion Learning: Integrating Customer Financial Literacy, Voice, and Behavioral Data

by David HASON RUDD

2023, BESC

In today's competitive landscape, businesses grapple with customer retention. Churn prediction models, although beneficial, often lack accuracy due to the reliance on a single data source. The intricate nature of human behavior and... more

Speech Emotion Recognition and Implementation: A Survey

by Mamatha G

2023, International Journal of Engineering Science and Invention (IJESI)

The last few decades have seen a wide range of research projects focusing on automatic emotion recognition based on speech for human-machine communication. Speech is the most fundamental and natural means of communication while... more

Fig. 1 Automated speech emotion detection. Speech is the most common form of communication and is rich in paralinguistic information that conveys emotion, age, gender and other attributes in real time. Over the past few decades, Speech Emotion Recognition (SER) has developed into an interesting research area of Computer Science related to smart home automation, social media, education, health care, and a variety of other Artificial Intelligence (AI) based applications. Fig. 1 shows a simple setup for automated SER. One of the most challenging steps of SER is feature generation for emotions, because the features derived from raw speech signals must be able to distinguish emotion states effectively [6]. Speech emotion recognition is the process of identifying human emotions from recorded speech or in real time, using advanced technologies, algorithms, and accurate datasets to train a machine or system to detect and classify these emotions based on the words used or the tone of voice. Fig. 2 shows the architectural components of an ideal SER system. Because of the gap between acoustic characteristics (intensity and frequency pattern of sound) and human emotions (happy, sad, etc.), automated Speech Emotion Recognition is a challenging procedure that depends greatly on the distinguishable acoustic characteristics captured for a specified recognition task.

Fig. 3 Deep Stride CNN design for speech emotion recognition. A Convolutional Neural Network (CNN) consists of convolution layers, pooling layers, fully connected layers, and a SoftMax unit; this sequential network forms a feature extraction pipeline. Initially, input spectrograms are convolved with different filters during the training phase and feature maps are obtained. Pooling layers retain the maximum activations from the feature maps to reduce their dimensionality. Lastly, the SoftMax unit performs the task of classification.
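
The two simplest stages named in this caption, max pooling and the SoftMax unit, can be sketched in plain Python. This is a minimal illustration only: the function names, the window size of 2, and the example numbers are assumptions, not details taken from the surveyed paper.

```python
import math


def max_pool_1d(feature_map, size=2):
    """Keep the largest activation in each non-overlapping window."""
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical feature map and class scores, for illustration only.
pooled = max_pool_1d([0.1, 0.7, 0.3, 0.2, 0.9, 0.4])
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

Pooling halves the length of the feature map here, and the predicted class is simply the index of the largest SoftMax probability.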

Leveraged Mel Spectrograms Using Harmonic and Percussive Components in Speech Emotion Recognition

by David HASON RUDD

2023, Pacific-Asia Conference on Knowledge Discovery and Data Mining

Speech Emotion Recognition (SER) affective technology enables the intelligent embedded devices to interact with sensitivity. Similarly, call centre employees recognise customers' emotions from their pitch, energy, and tone of voice so as... more

Speech based Emotion Recognition using various Features and SVM Classifier

by Kamalakar Desai

2023, International Journal for Research in Applied Science and Engineering Technology

This paper presents a methodology for recognizing human emotion from features extracted from the speech signal. This speaker-based emotion recognition system recognizes four emotions, namely happiness, sadness, fear and anger. Basically, the aim of this system is to... more

Fusing Utterance-Level Classifiers for Robust Intoxication Recognition from Speech

by Björn Schuller

2022, Proceedings MMCogEmS 2011 Workshop (Inferring Cognitive and Emotional States from Multimodal Measures), held in conjunction with the 13th International Conference on Multimodal Interaction, ICMI

Obtaining speech samples is an attractive non-invasive method to recognize alcohol intoxication. In this paper, we aim to improve accuracy of speech-based intoxication recognition by decision fusion of utterance-level classifiers. On the... more

Singer Identification by Vocal Parts Detection and Singer Classification Using LSTM Neural Networks

by IJRASET Publication

2022, International Journal for Research in Applied Science & Engineering Technology (IJRASET)

Identification of singers is considered an important research area in audio signal processing. It has attracted researchers' interest in two primary branches: 1) recognizing vocal parts of polyphonic music, and 2) classifying singers.... more

In the above figure, we can observe that clicking the Upload Test Audio & Classify button opens the test audio file dialog box, which contains audio clips of unseen data. After selecting an audio clip and clicking Open, the predict function extracts the MFCC features of the clip using librosa and ravels them into a sequence of features, which is resized and reshaped into a NumPy array. This data is then given to the identification and classification model. The argmax function assigns the class with the maximum probability: for example, if a voice is scored as 80% female and 20% male, the result is female. A singer can be classified in the same way; observe the results below after uploading an audio clip. The window displays the results as: Uploaded Vocal Parts identified as Female; Uploaded Audio file classified as singer name: Annar. In this paper, we propose a supervised system for singer identification that combines several stages of processing blocks involving mainly deep recurrent neural networks. The strategy used here is to first recognize the vocal parts and classify the gender of the singer, and then perform the identification of the singer. Together with a suitable feature vector, the obtained results show the efficiency of our strategy and the proposed system in terms of determining the vocal parts of the singer and incorporating the optimal feature vectors using a deep auto-encoder. Moreover, a similarity measure could be defined to detect similar styles among singers. The use of other processing steps, such as drop-out and batch normalization, will certainly increase the overall accuracy of the system as well. In future work, it would be interesting to determine the singer’s vocal types and integrate the optimal feature vectors using a
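
The argmax decision described in this passage can be illustrated with a few lines of Python. This is only a sketch of the decision rule: the helper name `pick_label` is hypothetical, and the probabilities are the 80%/20% example from the text, not model output.

```python
def pick_label(probabilities):
    """Return the label with the highest probability (the argmax)."""
    return max(probabilities, key=probabilities.get)


# The 80% / 20% example from the text above.
gender_scores = {"female": 0.80, "male": 0.20}
predicted_gender = pick_label(gender_scores)
```

The same rule applies to singer classification, with one probability per candidate singer instead of one per gender.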

Drink and speak: on the automatic classification of alcohol intoxication by acoustic, prosodic and text-based features

by Elmar Noeth

2022, Interspeech 2011

This paper focuses on the automatic detection of a person's blood alcohol level based on automatic speech processing approaches. We compare 5 different feature types with different ways of modeling. Experiments are based on the ALC corpus... more

Use of prosodic speech characteristics for automated detection of alcohol intoxication

by Elmar Nöth

2022

In this paper we describe our methodology for automatic detection of speaker alcoholization. Our task is restricted to detection of considerable alcoholization (alcohol blood level ≥ 0.8 per mille), so that a two-class classification... more

Analysis and Assessment of State Relevance in HMM-Based Feature Extraction Method

by Rok Gajsek

2022, Lecture Notes in Computer Science

In the article we evaluate the importance of different HMM states in an HMM-based feature extraction method used to model paralinguistic information. Specifically, we evaluate the distribution of the paralinguistic information across... more

Acoustic modeling for speech recognition in telephone based dialog system using limited audio resources

by Rok Gajsek

2022, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

In the article we evaluate different techniques of acoustic modeling for speech recognition in the case of limited audio resources. The objective was to build different sets of acoustic models, the first was trained on a small set of... more

Hybrid Speech Enhancement with Wiener filters and Deep LSTM Denoising Autoencoders

by Leandro Di Persia

2022, 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI)

Over the past several decades, numerous speech enhancement techniques have been proposed to improve the performance of modern communication devices in noisy environments. Among them, there is a large range of classical algorithms (e.g.... more

Affective Model Based Speech Emotion Recognition Using Deep Learning Techniques

by Akalya Devi C - PSGCT

2022, Indian Journal of Computer Science

Background: Breast cancer is the most commonly identified dangerous cancer in females. While breast cancer has been recorded to be a source of female mortality in many developing countries, studies have shown that bronchogenic carcinoma... more

Development and Evaluation of the Emotional Slovenian Speech Database - EmoLUKS

by Janez Zibert

2022, Lecture Notes in Computer Science

This paper describes a speech database built from 17 Slovenian radio dramas. The dramas were obtained from the national radio-and-television station (RTV Slovenia) and were given at the university's disposal with an academic license for... more

Emotional control and visual representation using advanced audiovisual interaction

by Marianne Strapatsakis

2022, International Journal of Arts and Technology

Modern interactive means combined with new digital media processing and representation technologies can provide a robust framework for enhancing user experience in multimedia entertainment systems and audiovisual artistic installations... more

Hybrid Speech Enhancement with Wiener filters and Deep LSTM Denoising Autoencoders

by Hugo Leonardo Rufiner

2022, 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI)

Over the past several decades, numerous speech enhancement techniques have been proposed to improve the performance of modern communication devices in noisy environments. Among them, there is a large range of classical algorithms (e.g.... more

Use of prosodic speech characteristics for automated detection of alcohol intoxication

by Elmar Nöth

2022, ISCA Tutorial and Research …

In this paper we describe our methodology for automatic detection of speaker alcoholization. Our task is restricted to detection of considerable alcoholization (alcohol blood level ≥ 0.8 per mille), so that a two-class classification... more

Emotion Speech Recognition using Deep Learning

by OTHMAN O KHALIFA

2022

Emotion Speech Recognition (ESR) is recognizing the formation and change of speaker’s emotional state from his/her speech signal. The main purpose of this field is to produce a convenient system that is able to effortlessly communicate... more

Fig. 2. Speech emotion recognition system block diagram.

Pre-emphasis: pre-emphasizing filters the sampled word to increase the energy of the signal at higher frequencies. Frame blocking: this step segments the signal into short frames of 20 ms to 40 ms. The speech signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N); the values of M and N are 100 and 256 respectively. Fast Fourier Transform: the FFT converts the signal from the time domain into the frequency domain. Mel filter bank: the FFT output of the voice signal is passed through triangular band-pass filters. To obtain a smooth magnitude spectrum and reduce the size of the features involved, a set of triangular filters is multiplied by the magnitude frequency response. Classification: this step produces the final categorized emotion. Various classification methods have been used to categorize utterance-level features; well-known examples are neural networks, hidden Markov models (HMM), and Gaussian mixture models (GMM). However, conventional classification methods like HMM and GMM may not give as high and accurate a classification rate as deep learning and neural network methods. Moreover, hybrid methods [21] may give better results than individual methods. Pitch is the most important property of speech. Speech is a wave generated by vibrating objects in a medium such as air. Research on speech emotion recognition has always considered the characteristics of pitch in order to achieve accurate emotion classification. Many methods have been invented to extract features from pitch, for example the cepstral method. In the cepstral method, the signal goes through several stages. First, the analog signal is sampled and quantized to digitize it. The digital signal is then framed in a suitable size, which can be done by passing the signal through a Hamming window
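
The first stages of this front end (pre-emphasis, frame blocking with N = 256 and M = 100, and Hamming windowing) can be sketched in plain Python. The pre-emphasis coefficient of 0.97, the 440 Hz test tone, the 16 kHz sampling rate and all function names are assumptions made for illustration; the source gives only the values of N and M.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n - 1]."""
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def frame_blocking(signal, frame_len=256, hop=100):
    """Cut the signal into frames of N samples that start M samples apart."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def hamming(n_points):
    """Hamming window coefficients for a frame of n_points samples."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]


# Hypothetical input: 1000 samples of a 440 Hz tone sampled at 16 kHz.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1000)]
frames = frame_blocking(pre_emphasis(signal))
window = hamming(256)
windowed = [[x * w for x, w in zip(frame, window)] for frame in frames]
```

Each windowed frame would then go to the FFT and the mel filter bank, which are omitted here to keep the sketch short.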

Fig. 5. Conventional neural networks [23]. Conventional neural networks are constructed from simple, connected neurons, also known as processors. These neurons produce a sequence of real-valued activations; some input neurons are activated by sensors that perceive the environment, while others are activated through weighted connections. Fig. 5 shows a simple neural network with neuron inputs X_i, weights W_i, and a neuron that sums the inputs multiplied by their weights.
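
The weighted sum performed by the neuron in Fig. 5 can be written in a couple of lines. This is a bare sketch: the function name and the example inputs and weights are illustrative assumptions, and no bias term or activation function is applied because the caption does not mention them.

```python
def neuron_sum(inputs, weights):
    """Sum each input multiplied by its corresponding weight."""
    return sum(x * w for x, w in zip(inputs, weights))


# Hypothetical inputs and weights.
activation = neuron_sum([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
```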

Fig. 7. Illustration of: (a) a Multilayer Perceptron and (b) a Recurrent Neural Network. connection between the layers and the visibility and the hiddenness. It is interesting to note that in DBNs, there is no connection between units if they are from the same layer. The greedy strategy adapts just the highlights of one layer at any given moment and it never straightens

Fig. 8. The connection between the layers.

Fig. 9. Left: A three-layer Deep Belief Network and a three-layer Deep Boltzmann Machine. Right: Pretraining consists of learning a stack of modified RBMs, which are then composed to create a deep Boltzmann machine [30].

Fig. 10. Autoencoder of the SDAs model. Stacked Denoising Autoencoders (SDAs) are a DL architecture first introduced in 2008 by Vincent et al. [33],[34]. An SDA is an extension of a classical autoencoder, as shown in Fig. 10. What makes SDAs a “deep” architecture is that they are constructed by stacking multiple autoencoders together. The unsupervised pretraining of each autoencoder is performed in a greedy, layer-by-layer manner. Once an SDA has been learnt, its output is used as the input representation of a supervised learning algorithm for recognition tasks. The main idea is that the reconstructed input should have as little error as possible, so that it is close to the real value. Notably, in this method the input is corrupted, yet the output comes out clean. The idea of an autoencoder can be described briefly: given a set of data points x = {x_1, x_2, ..., x_m}, map x to another set of data points y = {y_1, y_2, ..., y_n}, where n < m. From the compressed set y, reconstruct a set ~x that approximates the original data x. The mapping x → y is called “encoding” and the mapping y → ~x is called “decoding”.
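
In symbols, one common way to write the encoding and decoding maps is sketched below. The weight matrices W and W', the biases b and b', the nonlinearities f and g, and the corruption distribution q are notation assumed here for illustration; they do not appear in the source.

```latex
\begin{aligned}
\hat{x} &\sim q(\hat{x} \mid x) && \text{(corrupt the input; denoising variant only)} \\
y &= f(W \hat{x} + b), \quad y \in \mathbb{R}^{n},\ n < m && \text{(encoding)} \\
\tilde{x} &= g(W' y + b'), \quad \tilde{x} \in \mathbb{R}^{m} && \text{(decoding)} \\
\min_{W,\, b,\, W',\, b'} &\ \lVert x - \tilde{x} \rVert^{2} && \text{(reconstruction error against the clean } x \text{)}
\end{aligned}
```

For a classical autoencoder the corruption step is skipped and the clean x is encoded directly.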

Fig. 11. The proposed architecture of the system (training stage). Taking into consideration that different databases might be tested in this project, it is important that all the audio files go through an antialiasing low-pass filter in order to be resampled. The resulting sampling rate should be 16 kHz before any processing is implemented. The utterances will then be converted into their respective spectrograms. The spectrogram has two axes, time and frequency, where time is shown on the horizontal axis and frequency on the vertical axis. The spectrogram is simply a picture that shows the variation of energy across frequencies at different times. The intensity of the energy is represented by either darkness or different colors. The fact that every period of vocal vibration is associated with a glottal pulse gives special importance to the vocal fold vibration. Furthermore, in the preprocessing stage of the proposed model, all the audio files will be converted into wide-band spectrograms. The Hamming window will be fixed along with the overlap, and the number of DFT points will also be fixed (e.g. a 4 ms Hamming window with 4 ms overlap, and 512 DFT points). Notice that any frequency greater than 4000 Hz will be discarded, because 4 kHz is considered to be sufficient for speech emotion recognition according to many studies.
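
As a worked example of the numbers quoted here (16 kHz sampling rate, 512 DFT points, 4000 Hz cutoff), the following sketch computes which DFT bins survive the cutoff. The variable and function names are illustrative, and the assumption that the spectrum is one-sided (bins 0 to N/2) is not stated in the source.

```python
SAMPLE_RATE_HZ = 16000
N_DFT_POINTS = 512
CUTOFF_HZ = 4000

# Frequency spacing between neighbouring DFT bins.
bin_width_hz = SAMPLE_RATE_HZ / N_DFT_POINTS

# A one-sided spectrum of a real signal has N/2 + 1 bins (0 Hz to Nyquist).
n_bins = N_DFT_POINTS // 2 + 1

# Index of the last bin at or below the cutoff frequency.
cutoff_bin = int(CUTOFF_HZ / bin_width_hz)


def crop_to_cutoff(magnitudes):
    """Drop every bin whose centre frequency lies above the cutoff."""
    return magnitudes[:cutoff_bin + 1]


# Hypothetical all-zero spectrum for one frame, used only to show the sizes.
frame_spectrum = [0.0] * n_bins
kept = crop_to_cutoff(frame_spectrum)
```

With these values each bin covers 31.25 Hz, so discarding everything above 4 kHz keeps roughly the lower half of the one-sided spectrum.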

Fig. 13. (a) Learning curve and overfitting (b) Results of initial model in detecting 10 targeted classes. Majlesi Journal of Electrical Engineering

Fig. 14. Learning curve after splitting the data into training, validation and testing. Another problem with this model is that the testing set scored a higher accuracy than the validation dataset, this indicated that there is a data leakage problem in which the data used in the validation dataset are the same as the one used in the testing dataset. To validate this theory of a data leakage problem, the RAVDESS dataset was divided into 3 different sections: training, validation, and testing, to ensure that data are not used
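
A leakage-free three-way split of the kind described here can be sketched as follows. The 70/15/15 proportions, the fixed seed, the file names and the function name are assumptions for illustration; the source does not state the ratios that were used.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into three non-overlapping parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the audio clips.
files = [f"clip_{i:04d}.wav" for i in range(100)]
train, val, test = split_dataset(files)
```

Because each clip lands in exactly one slice of the shuffled list, no sample can appear in both the validation and the testing set.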

Fig. 15. (a) Learning curve and overfitting (b) Results of initial model in detecting classes of female emotions. After splitting the dataset, the number of targeted classes was reduced to 5; the data were then fed into the model in two separate runs, and the results were as follows:

Fig. 16. (a) Learning curve and overfitting (b) Results of initial model in detecting classes of male data.

Fig. 17. (a) Learning curve and overfitting (b) Results of improved model in detecting two emotional classes (Male samples).

Fig. 18. (a) Learning curve and overfitting (b) Results of improved model in detecting three emotional classes (Male samples).

Fig. 19. (a) Learning curve and overfitting (b) Results of improved model in detecting five emotional classes (Male samples).

Fig. 23. Two classes emotional classifier with (Noise Adding + Time Shifting). This step took place to make sure that the model will not have an underfitting problem. Initially we implemented the augmentation method on the model to classify 2 targeted classes; the goal was to determine the best data augmentation method. The next step was implementing the chosen augmentation method on the model to classify 3 and 5 targeted classes.

The result in the figure above shows a dramatic increase in the validation accuracy. Nevertheless, it is clear from the results obtained that the training valid loss shows a relatively high loss value which is a clear sign of an underfitting problem in the model. We figured that this problem occurred because the model was not fed enough data. The reason that we did not face this problem in the first experiment is that we did not split the dataset into female and male set, the splitting process caused the number of training data to drop significantly.

The two methods used were Noise Adding combined with Time Shifting, and Noise Adding combined with Pitch Tuning. The results were as follows:
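
The Noise Adding and Time Shifting combination can be sketched on a list of samples as below. The Gaussian noise model, the noise level of 0.005, the circular (wrap-around) shift and the function names are all assumptions for illustration; the source does not describe how either augmentation was implemented.

```python
import random


def add_noise(signal, noise_level=0.005, seed=0):
    """Add small Gaussian noise to every sample (seeded for repeatability)."""
    rng = random.Random(seed)
    return [s + noise_level * rng.gauss(0.0, 1.0) for s in signal]


def time_shift(signal, shift):
    """Circularly shift the signal to the right by `shift` samples."""
    if not signal:
        return []
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)


# Hypothetical five-sample clip.
clip = [0.0, 0.1, 0.2, 0.3, 0.4]
augmented = time_shift(add_noise(clip), 2)
```

Each augmented copy keeps the original emotion label, so the training set grows without any new recordings.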

Fig. 24. Two classes emotional classifier with (Pitch Shifting + Noise Adding).

Fig. 25. Three classes emotional classifier with (Noise Adding + Time Shifting). As we can see from the results above, the Noise Adding + Time Shifting method produced the best accuracy; hence I carried the method forward and tested it on the two remaining groups (3-class emotions, 5-class emotions), and the results were as follows:

Fig. 26. Five classes emotional classifier with (Noise Adding + time Shifting).

Fig. 27. Actor expressing anger (self-collected test data). Fig. 28. Actor expressing anger (RAVDESS database).

Table. 1 The fundamental emotions by Robert Plutchik [13]. Robert Plutchik graphically represented his model on a color wheel known as ’Plutchik’s wheel’ shown in Fig. 1. Plutchik’s wheel shows that there are 8 different fundamental emotions which are joy, trust, fear,

6. PROPOSED METHODOLOGY type of compression. The authors in [7] used a convolutional recurrent neural network stacked on top of 2 layers of LSTM (long short-term memory) and concluded that their model clearly outperformed other models in arousal and valence. Other studies preferred to use RNNs and were able to achieve what other studies could not: making the model ignore the silent frames of the waveform by assigning very small weights to them, which causes them to be ignored in the pooling operation [10]. In [35], a convolutional neural network is proposed as a deep learning model; the design consisted of one convolutional layer followed by a max pooling layer, and the method showed a relatively higher performance than LDA yet a lower performance than other recognition methods, namely RDA, KDA, SVM, KNN and CNN [21]. The two methods were later combined. The first method used was GMM-HMM, where the system inputs are the MFCC features of the audio signal. The second method was the conventional SVM classifier with LLDs (Low-Level Descriptors) as the input of the model. The

Deep Multilayer Perceptrons for Dimensional Speech Emotion Recognition

by Bagus Tris Atmaja

2022

Modern deep learning architectures are ordinarily performed in high performance computing facilities due to the large size of their input features and complexity of their models. This paper proposes traditional multilayer perceptrons... more

A novel method of artificial bandwidth extension using deep architecture

by danish bukhari

2022

This paper presents a novel artificial bandwidth extension (ABE) framework based on deep neural networks (DNNs) with a multiple-layer deep architecture. It demonstrates the suitability of DNNs for modeling log power spectra of speech... more

Figure 1: Block diagram of the proposed method In this section, we firstly introduce the framework of the proposed ABE method. Subsequently, the further details are presented. The flowchart of the proposed framework is shown in Fig. 1.

Fig. 2. The global variances on the training set. To address the over-smoothing problem, a global equalization factor α(d) is applied [21]. Here X_n(d) is the d-th component of a DNN output vector at the n-th frame and M is the total number of speech frames in the training set. The global variance of the reference wideband speech features can be calculated in a similar way. Fig. 2 shows the global variances of the estimated and reference log power spectra of wideband speech in the high-band. It can be observed that the global variances of the estimated wideband speech features were smaller.
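
One common way to define these quantities is sketched below, assuming the usual sample-variance definition of the global variance and a square-root variance ratio for the equalization factor. The symbols GV_est and GV_ref and the exact form of α(d) are assumptions made here, not formulas taken from the source.

```latex
GV_{\mathrm{est}}(d) = \frac{1}{M} \sum_{n=1}^{M} \left( X_n(d) - \frac{1}{M} \sum_{k=1}^{M} X_k(d) \right)^{2},
\qquad
\alpha(d) = \sqrt{\frac{GV_{\mathrm{ref}}(d)}{GV_{\mathrm{est}}(d)}}
```

Under this reading, α(d) is greater than 1 whenever the estimated features have a smaller global variance than the reference, which matches the observation reported for Fig. 2.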

Fig. 3. The LSD for different configurations of the DNNs. The configuration of the DNNs in the proposed method was 3 hidden layers, 3072 hidden units and 11 frames of expansion. Table II. The LSD with different input features

using deep architecture. Subjective evaluation indicates that the proposed ABE yields a brighter and clearer sound than three different baseline methods. We demonstrated that DNNs are a promising regression model for speech by applying them to ABE. Motivated by the success of DNNs in voice conversion and speech enhancement, we used DNNs to reconstruct log power spectra in the high-band. The resulting system clearly improves on the state of the art in both subjective and objective performance evaluation.

Fig. 4. Subjective test for GMM-ABE and proposed ABE

Fig. 6. Subjective test for SPE-ABE and proposed ABE

2.2. DNNs training with rich acoustic features Table I. The attributes of the features

3.3.2. Subjective test evaluation Table III. The objective test for different ABE methods

IRJET- Speech Emotion Recognition: A survey

by IRJET Journal

2021, IRJET

The main aim of this paper is to provide an overview of Speech Emotion Recognition (SER). Emotions can be recognized by extracting features from speech. In SER, numerous methods have been used to extract emotions from speech waveforms,... more

Analysis and Assessment of State Relevance in HMM-Based Feature Extraction Method

by Simon Dobrišek

2021, Lecture Notes in Computer Science

In the article we evaluate the importance of different HMM states in an HMM-based feature extraction method used to model paralinguistic information. Specifically, we evaluate the distribution of the paralinguistic information across... more

Towards Efficient Multi-Modal Emotion Recognition

by Simon Dobrišek

2021, International Journal of Advanced Robotic Systems

emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems, and the like.

University of Ljubljana System for Interspeech 2011 Speaker State Challenge

by Simon Dobrišek

2021

The paper presents our efforts in the Interspeech 2011 Speaker State Challenge. Both systems, for the Intoxication and the Sleepiness Sub-Challenge, are based on a Universal Background Model (UBM) in the form of a Hidden Markov Model (HMM),... more

Speech Emotion Recognition

by IJRASET Publication

2021, International Journal for Research in Applied Science & Engineering Technology

Humans recognize emotions automatically and subconsciously. It is a vital process for human-to-human communication, and thus, to achieve better human-machine interaction, emotions need to be considered. Emotional speech... more

Enhancing Emotion Recognition from Speech through Feature Selection

by Theodoros Kostoulas

2016

In the present work we aim to optimize the performance of a speaker-independent emotion recognition system through a speech feature selection process. Specifically, relying on the speech feature set defined in the Interspeech 2009 Emotion... more

Analysis and assessment of AvID: Multi-modal emotional database

by Luka Komidar

2015, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

The paper deals with the recording and the evaluation of a multi-modal (audio/video) database of spontaneous emotions. First, the motivation for this work is given and the different recording strategies used are described. Special attention is... more

Drink and Speak: On the automatic classification of alcohol intoxication by acoustic, prosodic and text-based features

by Tobias Bocklet and

2015

This paper focuses on the automatic detection of a person's blood alcohol level based on automatic speech processing approaches. We compare 5 different feature types with different ways of modeling. Experiments are based on the ALC corpus... more

Fusing Utterance-Level Classifiers for Robust Intoxication Recognition from Speech

by Felix Weninger

2015, Proceedings MMCogEmS 2011 Workshop (Inferring Cognitive and Emotional States from Multimodal Measures), held in conjunction with the 13th International Conference on Multimodal Interaction, ICMI

Obtaining speech samples is an attractive non-invasive method to recognize alcohol intoxication. In this paper, we aim to improve accuracy of speech-based intoxication recognition by decision fusion of utterance-level classifiers. On the... more

An Approach for Modeling Affective Acoustic Ecology in City Environments

by Konstantinos Drossos and

2015, In proceedings of the SD-Med EchoPolis-Days of Sound : “Sounds, Noise and Music in re-thinking sustainable city and the eco-neighborhoods”

Urban sonic ecology represents a major field of research interest for exploiting the relations raised through sound between human populations and a city environment. Recently, the concept of the emotional city has boosted the ideas and... more

University of Ljubljana system for Interspeech 2011 Speaker State Challenge

by Simon Dobrisek

2015

The paper presents our efforts in the Interspeech 2011 Speaker State Challenge. Both systems, for the Intoxication and the Sleepiness Sub-Challenge, are based on a Universal Background Model (UBM) in the form of a Hidden Markov Model (HMM),... more

Analysis and assessment of state relevance in HMM-based feature extraction method

by Simon Dobrisek and

2015

In the article we evaluate the importance of different HMM states in an HMM-based feature extraction method used to model paralinguistic information. Specifically, we evaluate the distribution of the paralinguistic information across... more

Speaker state recognition using an HMM-based feature extraction method

by Simon Dobrisek

2015

In this article we present an efficient approach to modeling the acoustic features for the tasks of recognizing various paralinguistic phenomena. Instead of the standard scheme of adapting the Universal Background Model (UBM), represented... more

Analysis and Assessment of AvID: Multi-Modal Emotional Database

by Anja Podlesek

2014

The paper deals with the recording and the evaluation of a multi-modal (audio/video) database of spontaneous emotions. First, the motivation for this work is given and the different recording strategies used are described. Special attention is... more
