Dr.-Ing. Zied Mnasri, Assistant Professor at University Tunis El Manar, École Nationale d'Ingénieurs de Tunis, Electrical Engineering Dept., BP 37, Le Belvédère, Tunis 1002, Tunisia
With the development of multi-modal man-machine interaction, audio signal analysis is gaining importance in a field traditionally dominated by video. In particular, anomalous sound event detection offers novel options to improve audio-based man-machine interaction, in many useful applications such as surveillance systems, industrial fault detection and especially safety monitoring, either indoor or outdoor. Event detection from audio can fruitfully integrate visual information and can outperform it in some respects, thus representing a complementary perceptual modality. However, it also presents specific issues and challenges. In this paper, a comprehensive survey of anomalous sound event detection is presented, covering various aspects of the topic, i.e., feature extraction methods, datasets, evaluation metrics, methods, applications, and some open challenges and improvement ideas that have been recently raised in the literature.
Proceedings of the 13th International Workshop on Fuzzy Logic and Applications (WILF 2021), 2021
Audio signal processing is moving towards detecting and/or defining rare/anomalous sounds. Such an anomaly detection problem can be easily extended to audio surveillance systems. Thus, a rare sound event detection method for road traffic monitoring is proposed in this paper, including detection of hazardous events, i.e., road accidents. The method is based on combining anomaly detection techniques, namely variational autoencoders (VAE) and interval-valued fuzzy sets. The VAE is used to calculate the reconstruction error of the input audio segment. Based on this reconstruction error, a fuzzy membership function, composed of an optimistic/upper component and a pessimistic/lower component, is calculated. Then, a probabilistic method for interval comparison is used to calculate the membership score, hence to evaluate the interval-valued fuzzy sets. Finally, classification into anomalous/normal events is obtained by defuzzification. Results show that with a careful parameter setting, the proposed method outperforms the state-of-the-art one-class SVM for anomaly detection.
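The interval-valued fuzzy decision step described in this abstract can be sketched roughly as follows. Note that the sigmoid form, the `center`/`width`/`spread` parameters, and the midpoint defuzzification are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def anomaly_score(recon_error, center, width, spread=0.5):
    # Upper (optimistic) and lower (pessimistic) sigmoids offset around
    # `center`; all parameter values are illustrative assumptions.
    upper = 1.0 / (1.0 + np.exp(-(recon_error - (center - spread)) / width))
    lower = 1.0 / (1.0 + np.exp(-(recon_error - (center + spread)) / width))
    # Simple defuzzification: midpoint of the membership interval.
    return 0.5 * (upper + lower)

# Larger VAE reconstruction errors map to scores closer to 1 (anomalous).
errors = np.array([0.1, 0.5, 2.0, 4.0])
scores = anomaly_score(errors, center=1.0, width=0.3)
labels = scores > 0.5            # crisp anomalous/normal decision
```

Segments whose reconstruction error falls well above the center of the membership interval are flagged as anomalous.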
In this paper, a novel relationship between instantaneous frequency (IF) and fundamental frequency (F0) in voiced parts of speech signals is presented. IF is calculated as the time-derivative of the phase of the analytic signal obtained via the Hilbert transform, whereas F0 can be extracted using any classical pitch tracking technique (e.g., autocorrelation, cepstrum, or subharmonic-to-harmonic ratio (SHR)); the relationship holds independently of the tool used to extract F0. It states that the envelope of the residual of the instantaneous frequency, defined as the difference between IF and the maximum of harmonics, tends to F0. Such a direct relationship may be useful for further developments of F0 extraction directly from the speech signal, avoiding the approximation that exists in most pitch extraction techniques.
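The instantaneous-frequency computation mentioned above (time-derivative of the phase of the analytic signal) can be reproduced in a few lines of NumPy/SciPy; the 100 Hz test tone and the sampling rate are arbitrary illustrative choices:

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 100 * t)                # pure test tone at 100 Hz

analytic = hilbert(x)                          # analytic signal (Hilbert transform)
phase = np.unwrap(np.angle(analytic))
inst_freq = np.diff(phase) * fs / (2 * np.pi)  # IF = time-derivative of the phase

# For a pure tone, the instantaneous frequency matches the tone frequency
# (apart from edge effects), so the median is close to 100 Hz.
f_est = float(np.median(inst_freq))
```

For real voiced speech the IF pattern is far less regular, which is what makes the residual relationship described in the abstract non-trivial.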
International Journal of Computational Intelligence Systems
Speech processing is quickly shifting toward affective computing, which requires handling emotions and modeling expressive speech synthesis and recognition. The latter task has so far been achieved by supervised classifiers. This implies prior labeling and data preprocessing, with a cost that increases with the size of the database, in addition to the risk of committing errors. A typical emotion recognition corpus therefore has a relatively limited number of instances. To avoid the cost of labeling, and at the same time to reduce the risk of overfitting due to lack of data, unsupervised learning seems a suitable alternative for recognizing emotions from speech. Recent advances in clustering techniques make it possible to reach good performances, comparable to those obtained by classifiers, with a much lighter preprocessing load and even with generalization guarantees. This paper presents a novel approach for emotion recognition from the speech signal, based on some variants of fuzzy clustering, such as probabilistic, possibilistic and graded-possibilistic fuzzy c-means. Experiments indicate that this approach (a) is effective in recognition, with in-corpus performances comparable to other proposals in the literature but with the added value of complexity control, and (b) allows an innovative way to analyze emotions conveyed by speech using possibilistic membership degrees.
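As a reference point for the clustering variants mentioned above, a minimal probabilistic fuzzy c-means fits in a few lines. This sketch implements only the basic FCM update rule (the possibilistic and graded-possibilistic variants used in the paper relax the sum-to-one membership constraint), and the synthetic 2-D data stands in for real acoustic emotion features:

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, n_iter=100, seed=0):
    # Basic probabilistic FCM: alternate center and membership updates.
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                              # memberships sum to 1 per sample
    for _ in range(n_iter):
        Um = U ** m
        centers = Um @ X / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
        inv = (d + 1e-12) ** (-2.0 / (m - 1.0))     # standard FCM membership rule
        U = inv / inv.sum(axis=0)
    return centers, U

# Two well-separated groups of 2-D feature vectors (synthetic stand-ins).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])
centers, U = fuzzy_cmeans(X)
hard = U.argmax(axis=0)                             # crisp cluster labels
```

The membership matrix `U` is the interesting output here: for emotion analysis, graded memberships (rather than the crisp `argmax`) are what the paper exploits.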
Speech synthesis quality depends on its naturalness and intelligibility. These abstract concepts are the concern of phonology. In terms of phonetics, they are transmitted by prosodic components, mainly the fundamental frequency (F0) contour. F0 contour modeling is performed either by setting rules or by investigating databases, with or without parameters, and following either a timely sequential path or a parallel, superpositional scheme. In this study, we opted to model the F0 contour for Arabic using the Fujisaki parameters, trained by neural networks. Statistical evaluation was carried out to measure the accuracy of the predicted parameters and the closeness of the synthesized F0 contour to the natural one. Findings concerning the adoption of Fujisaki parameters for Arabic F0 contour modeling for text-to-speech synthesis are discussed.
Duration modelling and evaluation for Arabic statistical parametric speech synthesis
Multimedia Tools and Applications, 2021
Sound duration is responsible for rhythm and speech rate. Furthermore, in some languages phoneme length is an important phonetic and prosodic factor. For example, in Arabic, gemination and vowel quantity are two important characteristics of the language. Therefore, accurate duration modelling is crucial for Arabic TTS systems. This paper focuses on improving the modelling of phone duration for Arabic statistical parametric speech synthesis using DNN-based models. In recent years, DNNs have frequently been used for parametric speech synthesis instead of HMMs. Therefore, several variants of DNN-based duration models for Arabic are investigated. The novelty consists in training a specific DNN model for each class of sounds, i.e. short vowels, long vowels, simple consonants and geminated consonants. The main idea behind this choice is the improvement that we already achieved in the quality of Arabic parametric speech synthesis by the introduction of two specific features...
Statistical modelling of speech units in HMM-based speech synthesis for Arabic
This paper investigates statistical parametric speech synthesis of Modern Standard Arabic (MSA). Hidden Markov Model (HMM)-based speech synthesis relies on a description of speech segments corresponding to phonemes, with a large set of features that represent phonetic, phonological, linguistic and contextual aspects. When applied to MSA, two specific phenomena have to be taken into account: vowel lengthening and consonant gemination. This paper studies the modeling of these phenomena thoroughly through various approaches, for example the use of different units for modeling short vs. long vowels and the use of different units for modeling simple vs. geminated consonants. These approaches are compared to another one which merges the short and long variants of a vowel into a single unit, and the simple and geminated variants of a consonant into a single unit (these characteristics being handled through the features associated with the sound). Results of subjective evaluation show ...
On the Use of Spectrogram Inversion for Speech Enhancement
2021 18th International Multi-Conference on Systems, Signals & Devices (SSD), 2021
Spectrogram inversion, or phase retrieval, is an old topic in digital signal processing that has been revisited in recent years for its proven relevance to many recent applications, such as source separation, speech enhancement and compressive sensing. Spectrogram inversion aims to reconstruct a signal from partial spectral information, such as the magnitude spectrum or the phase spectrum only, obtained by the short-time Fourier transform (STFT). Thus, in this work the relevance of signal reconstruction is studied. First, the proposed algorithm, based on recent theoretical relationships between STFT magnitude and phase, is presented. Secondly, the proposed method is tested on clean and simulated-noisy speech. Finally, the relevance of spectrogram inversion as implemented either in our proposal or in state-of-the-art algorithms is evaluated for the particular application of speech enhancement. The results show the advantages and the limits of using spectrogram inversion i...
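Although the paper proposes its own magnitude/phase algorithm, the classical Griffin-Lim iteration gives a concrete picture of what reconstructing a signal from the STFT magnitude alone looks like. Window length, iteration count, and the 440 Hz test tone below are arbitrary illustrative choices, not the paper's setup:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=256):
    # Iteratively re-impose the target magnitude while keeping the phase
    # of the current estimate (classical Griffin-Lim, magnitude-only input).
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)
        _, _, Z = stft(x, nperseg=nperseg)
        phase = np.exp(1j * np.angle(Z))
    return x

fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t)                # clean test tone
_, _, Z = stft(x, nperseg=256)
x_rec = griffin_lim(np.abs(Z))                 # rebuilt from magnitude only

# The reconstruction keeps the spectral content (dominant peak near 440 Hz),
# even though the original phase was discarded.
freqs = np.fft.rfftfreq(len(x_rec), 1.0 / fs)
dominant = freqs[np.argmax(np.abs(np.fft.rfft(x_rec)))]
```

The residual inconsistency after such an iteration is exactly the kind of limitation the abstract refers to when weighing spectrogram inversion for speech enhancement.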
Road safety has always been a major concern, involving a variety of competences, ranging from government and local authorities to medical caregivers and other service providers. Prompt intervention in emergency cases is one of the key factors to minimize damages. Therefore, real-time surveillance is proposed as an efficient means to detect problems on roads. Video surveillance alone is not enough to detect serious accidents, since any hazardous behavior on the road may be confused with an accident, leading to many false alarms. Instead, audio processing has the potential to recognize sounds coming from different sources, such as crashes, tire skidding, harsh braking, etc. In recent years, deep learning has become the state of the art for audio event detection. However, the usual dominance of event-free segments in road surveillance data introduces a bias in the training process. Therefore, a novel method is used to initialize the neural network's weights with an autoencoder trained only on event-related data, in order to balance the data distribution.
Gemination prediction using DNN for Arabic text-to-speech synthesis
2019 16th International Multi-Conference on Systems, Signals & Devices (SSD)
This paper describes a gemination prediction model for Arabic consonants, based on deep neural networks (DNN). Although gemination is important for understanding the right meaning of a word, the gemination sign (shadda) is very often omitted in Modern Standard Arabic printed/typed texts, which generates errors in automatic text applications, such as text-to-speech synthesis and automatic translation. Therefore, gemination prediction for Arabic consonants has been achieved as part of an automatic diacritization module for DNN-based Arabic text-to-speech synthesis. Different DNN models were trained using feedforward and recurrent architectures. The reported results show the ability of recurrent DNNs to detect the consonants which have to be geminated in a non-diacritized Arabic text, with a very high accuracy.
Audio Surveillance of Road Traffic: An Approach Based on Anomaly Detection and Interval Type-2 Fuzzy Sets
Joint Proceedings of the 19th World Congress of the International Fuzzy Systems Association (IFSA), the 12th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT), and the 11th International Summer School on Aggregation Operators (AGOP)
Surveillance systems are increasingly exploiting multimodal information for improved effectiveness. This paper presents an audio event detection method for road traffic surveillance, combining generative deep autoencoders and fuzzy modelling to perform anomaly detection. Baseline deep autoencoders are used to compute the reconstruction error of each audio segment, which provides a primary estimation of outlierness. To account for the uncertainty associated with this decision-making step, an interval type-2 fuzzy membership function composed of an optimistic/upper component and a pessimistic/lower component is used. The final class attribution employs a probabilistic method for interval comparison. Evaluation results obtained after defuzzification show that, with a careful parameter setting, the proposed membership function effectively improves the performance of the baseline autoencoder, and performs better than the state-of-the-art one-class SVM in anomaly detection.
F0 contour parametric modeling using multivariate adaptive regression splines for Arabic text-to-speech synthesis
Arabic text-to-speech synthesis needs to be developed in order to be integrated into many IT applications, like email and SMS reading, automatic information delivery and helping disabled people to use such sophisticated services. However, a standalone text-to-speech system needs automatic generation of prosody, including F0 contour prediction. Thus, the F0 contour is linked to the text data via the Fujisaki model, which divides the F0 contour into phrase and accent components. Furthermore, the parametric structure of the Fujisaki model reduces the problem to the estimation of parameters. Hence, regression techniques, such as MARS, are useful to map the text-retrieved features to the parameters extracted from the speech signal. Then, the overall F0 contour is reconstructed and compared to the original one, to validate the model.
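The phrase/accent decomposition of the Fujisaki model referred to above can be written down directly from its textbook formulation. All parameter values below (base frequency, command amplitudes and timings, the alpha/beta/gamma constants) are illustrative, not values trained on Arabic data:

```python
import numpy as np

def fujisaki_f0(t, fb=80.0, phrases=(), accents=(), alpha=2.0, beta=20.0, gamma=0.9):
    # Textbook Fujisaki model: log F0 = base + phrase + accent components.
    logf0 = np.full_like(t, np.log(fb))
    for ap, t0 in phrases:                    # phrase commands (impulse response)
        tau = np.maximum(t - t0, 0.0)
        logf0 += ap * alpha**2 * tau * np.exp(-alpha * tau)
    for aa, t1, t2 in accents:                # accent commands (step response)
        def g(tt):
            tt = np.maximum(tt, 0.0)
            return np.minimum(1.0 - (1.0 + beta * tt) * np.exp(-beta * tt), gamma)
        logf0 += aa * (g(t - t1) - g(t - t2))
    return np.exp(logf0)

# One phrase command and one accent command (illustrative values).
t = np.linspace(0.0, 2.0, 400)
f0 = fujisaki_f0(t, phrases=[(0.5, 0.0)], accents=[(0.3, 0.4, 0.8)])
```

A regression model such as MARS then only has to predict the small set of command parameters (amplitudes and timings) instead of the whole contour.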
Prosody modeling has become the backbone of TTS synthesis systems. Amongst all the prosodic modeling approaches, phonetic methods aiming to predict duration and the F0 contour are highly valued, thanks to the development of regression tools such as neural networks (NN). Besides, parametric representations like the Fujisaki model for F0 contour generation help to reduce the problem to the approximation of parameters only. But, prior to the prediction process, text analysis should be carried out first, to select and encode the necessary input features. With the aim of promoting Arabic TTS synthesis, an Integrated Model of Arabic Prosody for Speech Synthesis (IMAPSS) tool has been designed to integrate our developed models for text analysis, NN-based phonemic duration prediction and Fujisaki-inspired F0 contour generation. Hence, the resulting parameters provide a command file to be read by speech synthesis systems, like MBROLA.
This paper describes a new analysis/synthesis technique that takes advantage of sinusoidal modeling and of the overlap-add (OLA) principle. It allows processing adapted to the intrinsic characteristics of the signal, namely stationarity and harmonicity. The method essentially relies on the estimation of the main parameters, such as amplitude, frequency and phase. To this end, linear amplitude interpolation is studied, along with cubic phase interpolation and frequency differentiation. Applying this method to Arabic phonemes and words gives satisfactory results in terms of preserving the waveform, the spectrum and the energy of the signal.
29th European Signal Processing Conference (EUSIPCO 2021), Dublin, Ireland, 23-27 Aug 2021
In this paper, a novel pitch detection algorithm (PDA) is presented. Though pitch detection is a classical problem that has been investigated since the very beginning of speech processing, the proposed algorithm is based on a novel approach relying on an empirical relationship between fundamental frequency (f0) and instantaneous frequency (fi). Basically, f0 is defined for periodic signals only, whereas fi can be calculated for any type of signal using the Hilbert transform. Notwithstanding this substantial difference, the relationship described in this paper shows some interaction between them, at least empirically. Once this relationship was validated on a large set of speech signals, it was exploited to implement an algorithm to (a) detect voiced parts of speech and (b) extract the f0 contour from the fi pattern in the voiced regions. The results of the proposed method were compared to those of some well-rated state-of-the-art PDAs of different backgrounds, showing that the quality of pitch detection yielded by the proposed approach is quite satisfactory, both in clean and simulated noisy speech.
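For context, the classical autocorrelation baseline against which such PDAs are usually compared fits in a few lines; the frame below is a synthetic harmonic signal, not speech, and the search range is an arbitrary choice:

```python
import numpy as np

def autocorr_f0(x, fs, fmin=50, fmax=500):
    # Classical autocorrelation PDA: pick the strongest autocorrelation
    # peak within the plausible pitch-period range [fs/fmax, fs/fmin].
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

fs = 16000
t = np.arange(0, 0.04, 1.0 / fs)              # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
f0_est = autocorr_f0(frame, fs)               # close to 200 Hz for this frame
```

The lag resolution of this estimator is what the abstract alludes to as "the approximation that exists in most pitch extraction techniques".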
Advances in Computational Intelligence Systems - Contributions Presented at the 20th UK Workshop on Computational Intelligence, September 8-10, 2021, Aberystwyth, Wales, UK, 2021
Surveillance systems are becoming increasingly multimodal. The availability of audio motivates a method for anomalous audio event detection (anomalous AED) for road traffic surveillance, which is proposed in this paper. The method is based on combining anomaly detection techniques, namely reconstruction deep autoencoders and fuzzy membership functions. A baseline deep autoencoder is used to compute the reconstruction error of each audio segment. The comparison of this error to a preset threshold provides a primary estimation of outlierness. To account for the uncertainty associated with this decision-making step, a fuzzy membership function composed of an optimistic/upper component and a pessimistic/lower component is used. Evaluation results obtained after defuzzification show that, with a careful parameter setting, the proposed membership function improves the performance of the baseline autoencoder for anomaly detection, and yields results better than or comparable to other state-of-the-art anomaly detection methods such as one-class SVM.
Change detection in audio data streams has become an efficient way for online notification of events. An interesting application is audio surveillance, including road traffic monitoring and online car crash alarms. However, in the particular case of crash alarms, most of the collected sounds are background noises, with various sub-classes, such as talking pedestrians, engine sounds, horn blowing, etc., whereas sounds corresponding to the event of interest are much less abundant. Therefore, it is difficult to apply classical classification or clustering tools to detect crash sounds as a particular class or as outliers. To tackle this problem, we propose an ensemble classifier based on one-class SVM to first separate outliers from normal data, and deep neural networks to classify event-related data. Finally, the results of both outlier detection and classification outputs are aggregated in such a way that outliers are considered as a novel class, a priori unknown by the DNN classifier. The application of this method on an audio traffic monitoring database confirms its ability to detect, (a) non-events (background noise), non-hazardous events and hazardous events, and (b) non-accidents and accidents, from a stream of audio data.
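The one-class SVM stage of such an ensemble can be illustrated with scikit-learn; the Gaussian feature clouds below are synthetic stand-ins for background-noise and event-related audio features, and the kernel parameters are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))   # abundant background-noise features
events = rng.normal(8.0, 0.5, size=(5, 2))     # rare event-related features

# Train on normal data only; `nu` bounds the expected outlier fraction.
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(normal)

pred_events = ocsvm.predict(events)            # -1 marks outliers
inlier_rate = (ocsvm.predict(normal) == 1).mean()
```

In the ensemble described above, segments flagged `-1` would be routed to the "novel class" branch instead of the DNN classifier trained on known event types.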
This paper describes an experimental study of automatic gemination of Arabic consonants using deep neural networks. Although gemination is important for understanding the right meaning of a word, the gemination sign (shadda) is very often omitted in Modern Standard Arabic printed/typed texts, which generates errors in automatic text applications, such as text-to-speech synthesis and automatic translation. Therefore, gemination prediction for Arabic consonants has been achieved as part of an automatic diacritization module for DNN-based Arabic text-to-speech synthesis. Different DNN models were trained using feedforward and recurrent architectures. The reported results show the ability of recurrent DNNs to detect the consonants which have to be geminated in a non-diacritized Arabic text, with a very high accuracy.