Skip to main content

Elmar Nöth

Friedrich-Alexander-Universität Erlangen-Nürnberg, Computer Science, Faculty Member

King AbdulAziz University (KAU) Jeddah, Saudi Arabia, Electrical and Computer Engineering, Adjunct

Followers

117

Following

6

Co-authors

6

Public Views

Interests

Uploads

Papers by Elmar Nöth

Analyzing features for automatic age estimation on cross-sectional data

Interspeech 2009, 2009

We develop an acoustic feature set for the estimation of a person's age from a recorded speech si... more We develop an acoustic feature set for the estimation of a person's age from a recorded speech signal. The baseline features are Mel-frequency cepstral coefficients (MFCCs) which are extended by various prosodic features, pitch and formant frequencies. From experiments on the University of Florida Vocal Aging Database we can draw different conclusions. On the one hand, adding prosodic, pitch and formant features to the MFCC baseline leads to relative reductions of the mean absolute error between 4-20%. Improvements are even larger when perceptual age labels are taken as a reference. On the other hand, reasonable results with a mean absolute error in age estimation of about 12 years are already achieved using a simple genderindependent setup and MFCCs only. Future experiments will evaluate the robustness of the prosodic features against channel variability on other databases and investigate the differences between perceptual and chronological age labels.

Drink and speak: on the automatic classification of alcohol intoxication by acoustic, prosodic and text-based features

Interspeech 2011, 2011

This paper focuses on the automatic detection of a person's blood level alcohol based on automati... more This paper focuses on the automatic detection of a person's blood level alcohol based on automatic speech processing approaches. We compare 5 different feature types with different ways of modeling. Experiments are based on the ALC corpus of IS2011 Speaker State Challenge. The classification task is restricted to the detection of a blood alcohol level above 0.5 ‰. Three feature sets are based on spectral observations: MFCCs, PLPs, TRAPS. These are modeled by GMMs. Classification is either done by a Gaussian classifier or by SVMs. In the later case classification is based on GMM-based supervectors, i.e. concatenation of GMM mean vectors. A prosodic system extracts a 292-dimensional feature vector based on a voicedunvoiced decision. A transcription-based system makes use of text transcriptions related to phoneme durations and textual structure. We compare the stand-alone performances of these systems and combine them on score level by logistic regression. The best stand-alone performance is the transcriptionbased system which outperforms the baseline by 4.8 % on the development set. A Combination on score level gave a huge boost when the spectral-based systems were added (73.6 %). This is a relative improvement of 12.7 % to the baseline. On the test-set we achieved an UA of 68.6 % which is a significant improvement of 4.1 % to the baseline system.

Age and gender recognition based on multiple systems - early vs. late fusion

Interspeech 2010, 2010

This paper focuses on the automatic recognition of a person's age and gender based only on his or... more This paper focuses on the automatic recognition of a person's age and gender based only on his or her voice. Up to five different systems are compared and combined in different configurations: three systems model the speaker's characteristics in different feature spaces, i.e., MFCC, PLP, TRAPS, by Gaussian mixture models. The features of these systems are the concatenated mean vectors. System number 4 uses a physical two-mass vocal model and estimates in a data-driven optimization procedure 9 glottal features from voiced speech sections. For each utterance the minimum, maximum and mean vectors form a 27-dimensional feature vector. The last system calculates a 219-dimensional prosodic feature set for each utterance based on voice and unvoiced speech segments. We compare two different ways to fuse the different systems: First, we concatenate the system on feature level. The second way of combination is performed on score level by multi-class logistic regression. Despite there are just minor differences between the two approaches, late fusion is slightly superior. On the development set of the Interspeech Agender challenge we achieved an unweighted recall of 46.1 % with early fusion and 47.8 % with late fusion.

Does it groove or does it stumble - automatic classification of alcoholic intoxication using prosodic features

Interspeech 2011, 2011

This paper studies how prosodic features can help in the automatic detection of alcoholic intoxic... more This paper studies how prosodic features can help in the automatic detection of alcoholic intoxication. We compute features that have recently been proposed to model speech rhythm such as the pair-wise variability index for consonantal and vocalic segments (PVI) and study their aptness for the task. Further, we use a large prosodic feature vector modelling the usual candidates-pitch, intensity, and duration-and apply it onto different units such as words, syllables and stressed syllables to create generalizations of the rhythm features mentioned. The results show that the prosodic features computed are suitable for detecting alcoholic intoxication and add complementary information to state-of-the-art features. The database is the intoxication database provided by the organizers of the 2011 Interspeech Speaker State Challenge.

The automatic assessment of non-native prosody: combining classical prosodic analysis with acoustic modelling

Interspeech 2012, 2012

In earlier studies, we employed a large prosodic feature vector to assess the quality of L2 learn... more In earlier studies, we employed a large prosodic feature vector to assess the quality of L2 learner's utterances with respect to sentence melody and rhythm. In this paper, we combine these features with two standard approaches in paralinguistic analysis: (1) features derived from a Gaussian Mixture Model used as Universal Background Model (GMM-UBM), and (2) openSMILE, an open-source toolkit for extracting acoustic features. We evaluate our approach with English speech from 94 non-native speakers perceptually scored by 62 native labellers. GMM-UBM or openSMILE modelling alone yields lower performance than our prosodic feature vector; however, adding information from the GMM-UBM modelling or openSMILE by late fusion improves results.

Recognition and Labelling of Prosodic Events in Slovenian Speech

Lecture Notes in Computer Science, 2000

The paper describes prosodic annotation procedures of the GOPOLIS Slovenian speech data database ... more The paper describes prosodic annotation procedures of the GOPOLIS Slovenian speech data database and methods for automatic classification of different prosodic events. Several statistical parameters concerning duration and loudness of words, syllables and allophones were computed for the Slovenian language, for the first time on such a large amount of speech data. The evaluation of the annotated data showed a close match between automatically determined syntactic-prosodic boundary marker positions and those obtained by a rule-based approach. The obtained knowledge on Slovenian prosody can be used in Slovenian speech recognition and understanding for automatic prosodic event determination and in Slovenian speech synthesis for prosody prediction.

Integrating syntactic and prosodic information for the efficient detection of empty categories

Proceedings of the 16th conference on Computational linguistics -, 1996

Geh ort zu den Antragsabschnitten: 3.11 Extraktion prosodischer Merkmale 3.12 Behandlung spontans... more Geh ort zu den Antragsabschnitten: 3.11 Extraktion prosodischer Merkmale 3.12 Behandlung spontansprachlicher Ph anomene auf Au erungsebene 6.4 Syntax und Satzprosodie 6.5 Spontansprachliche Konstruktionen Die vorliegende Arbeit wurde im Rahmen des Verbundvorhabens Verbmobil vom Bundesministerium f ur Bildung, Wissenschaft, Forschung und Technologie (BMBF) unter dem F orderkennzeichen 01 IV 101 G gef ordert. Die Verantwortung f ur den Inhalt dieser Arbeit liegt bei den AutorInnen.

Intelligibility Is More Than a Single Word: Quantification of Speech Intelligibility by ASR and Prosody

Lecture Notes in Computer Science

In this paper we examine the quality of the prediction of intelligibility scores of human experts... more In this paper we examine the quality of the prediction of intelligibility scores of human experts. Furthermore, we investigate the differences between subjective expert raters who evaluated speech disorders of laryngectomees and children with cleft lip and palate. We use the recognition rate of a word recognizer and prosodic features to predict the intelligibility score of each individual expert. For each expert and the mean opinion of all experts we present the best features to model their scoring behavior according to the mean rank obtained during a 10-fold cross-validation. In this manner all individual speech experts were modeled with a correlation coefficient of at least r > .75. The mean opinion of all raters is predicted with a correlation of r =.90 for the laryngectomees and r =.86 for the children.

The Prosody Module

Verbmobil: Foundations of Speech-to-Speech Translation, 2000

We describe the acoustic-prosodic and syntactic-prosodic annotation and classification of boundar... more We describe the acoustic-prosodic and syntactic-prosodic annotation and classification of boundaries, accents and sentence mood integrated in the Verbmobil system for the three languages German, English, and Japanese. For the acoustic-prosodic classification, a large feature vector with normalized prosodic features is used. For the three languages, a multilingual prosody module was developed that reduces memory requirement considerably, compared to three monolingual modules. For classification, neural networks and statistic language models are used.

To talk or not to talk with a computer

If no specific precautions are taken, people talking to a computer canthe same way as while talki... more If no specific precautions are taken, people talking to a computer canthe same way as while talking to another human-speak aside, either to themselves or to another person. On the one hand, the computer should notice and process such utterances in a special way; on the other hand, such utterances provide us with unique data to contrast these two registers: talking vs. not talking to a computer. In this paper, we present two different databases, SmartKom and SmartWeb, and classify and analyse On-Talk (addressing the computer) vs. Off-Talk (addressing someone else)and by that, the user's focus of attention-found in these two databases employing uni-modal (prosodic and linguistic) features, and employing multimodal information (additional face detection).

Boosting of prosodic and pronunciation features to detect mispronunciations of non-native children

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2007

Commercial products that support L2-learners with computer assisted pronunciation training usuall... more Commercial products that support L2-learners with computer assisted pronunciation training usually focus per exercise only on one possible pronunciation mistake that is typical for speakers of the respective L1 group. Acoustic models for words with wrong pronunciation are added to the system. In the present paper a more general approach with features that have proved to be widely independent of the learners' mother tongue is proposed. It is able to take various possible mistakes into consideration all at once. High dimensional feature vectors that encode prosodic varieties and differences of reference and recognized sentences are analyzed. With the ADABOOST algorithm those features are found, which contain the most important information to assess German children learning English. With 35 features 89 % of the agreement of experts is achieved.

Towards the automatic classification of reading disorders in continuous text passages

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009

In this paper, we present an automatic classification approach to identify reading disorders in c... more In this paper, we present an automatic classification approach to identify reading disorders in children. This identification is based on a standardized test. In the original setup the test is performed by a human supervisor who measures the reading duration and notes down all reading errors of the child at the same time. In this manner we recorded tests of 38 children who were suspected to have reading disorders. The data was confronted to an automatic system which employs speech recognition and prosodic analysis to identify the reading errors. In a subsequent classification experiment-based on the speech recognizer's output, the duration of the test, and prosodic features-94.7 % of the children could be classified correctly.

Objective vs. subjective evaluation of speakers with and without complete dentures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009

For dento-oral rehabilitation of edentulous (toothless) patients, speech intelligibility is an im... more For dento-oral rehabilitation of edentulous (toothless) patients, speech intelligibility is an important criterion. 28 persons read a standardized text once with and once without wearing complete dentures. Six experienced raters evaluated the intelligibility subjectively on a 5-point scale and the voice on the 4-point Roughness-Breathiness-Hoarseness (RBH) scales. Objective evaluation was performed by Support Vector Regression (SVR) on the word accuracy (WA) and word recognition rate (WR) of a speech recognition system, and a set of 95 word-based prosodic features. The word accuracy combined with selected prosodic features showed a correlation of up to r = 0.65 to the subjective ratings for patients with dentures and r = 0.72 for patients without dentures. For the RBH scales, however, the average correlation of the feature subsets to the subjective ratings for both types of recordings was r < 0.4.

Private emotions versus social interaction: a data-driven approach towards analysing emotion in speech

User Modeling and User-Adapted Interaction, 2007

The 'traditional' first two dimensions in emotion research are VALENCE and AROUSAL. Normally, the... more The 'traditional' first two dimensions in emotion research are VALENCE and AROUSAL. Normally, they are obtained by using elicited, acted data. In this paper, we use realistic, spontaneous speech data from our 'AIBO' corpus (humanrobot communication, children interacting with Sony's AIBO robot). The recordings were done in a Wizard-of-Oz scenario: the children believed that AIBO obeys their commands; in fact, AIBO followed a fixed script and often disobeyed. Five labellers annotated each word as belonging to one of eleven emotion-related states; seven of these states which occurred frequently enough are dealt with in this paper. The confusion matrices of these labels were used in a Non-Metrical Multi-dimensional Scaling to display two dimensions; the first we interpret as VALENCE, the second, however, not as AROUSAL but as INTERACTION, i.e., addressing oneself (angry, joyful) or the communication partner (motherese, reprimanding). We show that it depends on the specifity of the scenario and on the subjects' conceptualizations whether this new dimension can be observed, and discuss impacts on the practice of labelling and processing emotional data. Two-dimensional solutions based on acoustic and linguistic features that were used for automatic classification of these emotional states are interpreted along the same lines.

M = Syntax + Prosody: A syntactic–prosodic labelling scheme for large spontaneous speech databases

Speech Communication, 1998

In automatic speech understanding, division of continuous running speech into syntactic chunks is... more In automatic speech understanding, division of continuous running speech into syntactic chunks is a great problem. Syntactic boundaries are often marked by prosodic means. For the training of statistical models for prosodic boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation), we developed a syntactic±prosodic labelling scheme where dierent types of syntactic boundaries are labelled for a large spontaneous speech corpus. This labelling scheme is presented and compared with other labelling schemes for perceptual±prosodic, syntactic, and dialogue act boundaries. Interlabeller consistencies and estimation of eort needed are discussed. We compare the results of classi®ers (multi-layer perceptrons (MLPs) and n-gram language models) trained on these syntactic±prosodic boundary labels with classi®ers trained on perceptual±prosodic and pure syntactic labels. The main advantage of the rough syntactic±prosodic labels presented in this paper is that large amounts of data can be labelled with relatively little eort. The classi®ers trained with these labels turned out to be superior with respect to purely prosodic or syntactic labelling schemes, yielding recognition rates of up to 96% for the two-class-problem`boundary versus no boundary'. The use of boundary information leads to a marked improvement in the syntactic processing of the VM system.

Automatic detection of articulation disorders in children with cleft lip and palate

The Journal of the Acoustical Society of America, 2009

Speech of children with cleft lip and palate ͑CLP͒ is sometimes still disordered even after adequ... more Speech of children with cleft lip and palate ͑CLP͒ is sometimes still disordered even after adequate surgical and nonsurgical therapies. Such speech shows complex articulation disorders, which are usually assessed perceptually, consuming time and manpower. Hence, there is a need for an easy to apply and reliable automatic method. To create a reference for an automatic system, speech data of 58 children with CLP were assessed perceptually by experienced speech therapists for characteristic phonetic disorders at the phoneme level. The first part of the article aims to detect such characteristics by a semiautomatic procedure and the second to evaluate a fully automatic, thus simple, procedure. The methods are based on a combination of speech processing algorithms. The semiautomatic method achieves moderate to good agreement ͑ Ϸ 0.6͒ for the detection of all phonetic disorders. On a speaker level, significant correlations between the perceptual evaluation and the automatic system of 0.89 are obtained. The fully automatic system yields a correlation on the speaker level of 0.81 to the perceptual evaluation. This correlation is in the range of the inter-rater correlation of the listeners. The automatic speech evaluation is able to detect phonetic disorders at an experts'level without any additional human postprocessing.

Speech Intelligibility Enhancement After Maxillary Denture Treatment and Its Impact on Quality of Life

The International Journal of Prosthodontics, 2014

Tooth loss and its prosthetic rehabilitation significantly affect speech intelligibility. However... more Tooth loss and its prosthetic rehabilitation significantly affect speech intelligibility. However, little is known about the influence of speech deficiencies on oral health-related quality of life (OHRQoL). The aim of this study was to investigate whether speech intelligibility enhancement through prosthetic rehabilitation significantly influences OHRQoL in patients wearing complete maxillary dentures. Speech intelligibility by means of an automatic speech recognition system (ASR) was prospectively evaluated and compared with subjectively assessed Oral Health Impact Profile (OHIP) scores. Materials and Methods: Speech was recorded in 28 edentulous patients 1 week prior to the fabrication of new complete maxillary dentures and 6 months thereafter. Speech intelligibility was computed based on the word accuracy (WA) by means of an ASR and compared with a matched control group. One week before and 6 months after rehabilitation, patients assessed themselves for OHRQoL. Results: Speech intelligibility improved significantly after 6 months. Subjects reported a significantly higher OHRQoL after maxillary rehabilitation with complete dentures. No significant correlation was found between the OHIP sum score or its subscales to the WA. Conclusion: Speech intelligibility enhancement achieved through the fabrication of new complete maxillary dentures might not be in the forefront of the patients' perception of their quality of life. For the improvement of OHRQoL in patients wearing complete maxillary dentures, food intake and mastication as well as freedom from pain play a more prominent role.

Prosodic information for integrated word-and-boundary recognition

In this paper, we present an integrated approach for recognizing both the word sequence and the s... more In this paper, we present an integrated approach for recognizing both the word sequence and the syntactic-prosodic structure of a spontaneous utterance. The approach aims at improving the performance of the understanding component of speech understanding systems by exploiting not only acoustic and syntactic information, but also prosodic information directly within the speech recognition process. Whereas spoken utterances are commonly modelled as unstructured word sequences in the speech recognizer, our approach includes phrase (or clause) boundary information in the language model, and provides HMMs to model the acoustic and prosodic characteristics of phrase boundaries and disfluencies. This methodology has two major advantages compared to pure word-based speech recognizers. First, additional syntactic information is determined by the speech recognizer which facilitates parsing and resolves syntactic and semantic ambiguities. Second, the integrated model yields significantly better word accuracies than the traditional word-based approach.

Can we tell apart intonation from prosody (if we look at accents and boundaries)?

Studies on prosody/intonation normally look for important (distinctive) features denoting linguis... more Studies on prosody/intonation normally look for important (distinctive) features denoting linguistic contrasts in production or perception experiments. The recent development in automatic speech processing and the availability of large speech data bases made it possible to have a fresh look at these topics. In our study, we classify automatically accent and boundary positions in a spontaneous speech corpus with a large feature vector comprising as many relevant prosodic features as possible. The results obtained for di erent subsets of prosodic features (F0, duration, energy, etc.) show that each feature class contributes to the marking of accents and boundaries, and that the best results can be achieved by simply using all feature subsets together. Finally, we discuss possible conclusions for prosodic theory and for the application of prosody in speech processing.

Integrated recognition of words and phrase boundaries

In this paper we present a n i n tegrated approach for recognizing both the word sequence and the... more In this paper we present a n i n tegrated approach for recognizing both the word sequence and the syntactic-prosodic structure of a spontaneous utterance. We take into account the fact that a spontaneous utterance is not merely an unstructured sequence of words by incorporating phrase boundary information into the language model and by providing HMMs to model boundaries. This allows for a distinction between word transitions across phrase boundaries and transitions within a phrase. During recognition, the syntactic-prosodic structure of the utterance is determined implicitly. Without any increase in computational e ort, this leads to a 4 reduction of word error rate, and, at the same time, syntactic-prosodic boundary labels are provided for subsequent processing. The boundaries are recognized with a precision and recall rate of about 75 each. They can be used to reduce drastically the computational e ort for parsing spontaneous utterances. We also present a system architecture to incorporate additional prosodic information.