Speaker Characterization

8 papers

11 followers

About this topic

Speaker characterization is the analytical process of identifying and describing the distinctive traits, attributes, and vocal qualities of a speaker in spoken discourse. This field examines aspects such as tone, pitch, accent, and speech patterns to understand the speaker's identity, intentions, and emotional state within a communicative context.

Key research themes

1. How do voice quality variations influence the perceived personality traits and charisma of a speaker?

This research area investigates how different laryngeal and supralaryngeal voice qualities produced by the same individual affect listeners’ perceptions of that speaker’s personality traits and charisma. It matters because voice quality conveys social and emotional cues crucial for interpersonal communication, speaker profiling, and forensic applications.

The effects of different voice qualities on the perceived personality of a speaker

by Sara Pearsell

2024, Frontiers in Communication

Key finding: This study found that voice quality variations, including modal, creaky, breathy (natural and artificial), nasalization, and smiling, produced by the same speakers significantly impacted listener ratings on personality traits...

Speech-based perception of speaker traits (Welch et al., 2021)

by Brett Welch

2022

Key finding: Listeners showed generally low accuracy (~33%, chance level) in judging speakers’ personality traits from speech alone, although some traits like Aggression and Social Potency had slightly higher recognition rates...

Measuring a speaker's acoustic correlates of pitch - but which? A contrastive analysis for perceived speaker charisma

by Radek Skarnitzl

2021

Key finding: This paper identifies fundamental frequency (f0) measures that best correlate with perceived speaker charisma, finding that mean f0 is the most effective pitch-level metric while kurtosis and 80-percentile f0 range optimally...

2. What acoustic and phonetic features capture speaker-specific variability in spontaneous and controlled speech for speaker characterization and recognition?

This research theme focuses on identifying phonetic, acoustic, and articulatory features that characterize speaker individuality across different speech styles (read and spontaneous) and linguistic contexts. This theme is vital for improving speaker recognition systems, forensic voice comparison, and understanding within-speaker variability versus between-speaker differences.

Acoustic voice variation in spontaneous speech

by Cynthia Lee

2024, The Journal of the Acoustical Society of America

Key finding: The study extended prior findings from read speech to spontaneous speech for the same 99/100 talkers and showed that acoustic voice spaces remain highly similar across speaking styles, with fundamental frequency variability...

Analysis of speaker and co-articulation effects based on sub-band cepstral variances in the Japanese vowels of 300 male speakers

by Dr Frantz Clermont

2020, 14th Biennial Conference of the International Association of Forensic Linguistics (IAFL), Melbourne, Australia

Key finding: Using a large corpus of Japanese vowels produced in varied phonetic contexts, this study demonstrated that coarticulation affects lower-formant related sub-bands more strongly, whereas speaker effects dominate higher-formant...

"Analysis of speaker and coarticulation effects based on the sub-band cepstral variances in the Japanese vowels of 300 male speakers": ORAL presentation

by Dr Frantz Clermont

2020, 14th Biennial Conference of the International Association of Forensic Linguistics (IAFL), Melbourne

Key finding: This presentation summarized exploratory analysis on the complex interaction between speaker differences and phonetic context using cepstral measures, highlighting the importance of quantifying relative contributions of...

Phonetic Analysis of GMM-based Speaker Models

by Margit Antal

2023, Proceedings of “Verificatori Biometrici” Workshop, organized by Technical University of Cluj-Napoca, Universitas Napocensis Babes-Bolyai, Universitas Medicinae et Farmaciae Napocensis and CNCSIS, Cluj-Napoca, Romania, May

Key finding: This paper investigated the phoneme distributions within Gaussian Mixture Model (GMM) clusters representing speakers, revealing that certain phonetic segments contribute disproportionately to speaker modeling efficacy. The...

3. How can speaker demographic traits such as age, height, and physiognomic factors be automatically estimated from speech using i-vector frameworks and machine learning?

This theme investigates computational methods, especially i-vector representations combined with regression and classification models, to infer speaker profile traits like age and height from speech. These traits offer valuable auxiliary information in forensic cases, user profiling, and personalized human-computer interaction systems. Understanding effectiveness, limitations, and variability factors improves model design and forensic applicability.

Speaker Profiling for Forensic Applications

by Amir Hossein Poorjam

2022

Key finding: The thesis developed novel approaches for estimating speaker age, height, weight, and smoking habits from spontaneous telephone speech using i-vector and Non-negative Factor Analysis (NFA) frameworks combined with Artificial...

Speaker age estimation using i-vectors

by David Van Leeuwen

2015, Engineering Applications of Artificial Intelligence

Key finding: The study proposed an age estimation method leveraging i-vectors and Within-Class Covariance Normalization, followed by Least Squares Support Vector Regression, achieving lower mean absolute error and higher correlation with...

Height Estimation from Speech Signals using i-vectors and Least-Squares Support Vector Regression

by Amir Hossein Poorjam and

2014, In Proceedings of the 37th International Conference on Telecommunications and Signal Processing, Germany

Key finding: This paper presented an automatic speaker height estimation approach using i-vectors and regression models (ANN and LSSVR), yielding effective height predictions on the NIST 2008 and 2010 SRE corpora. This contributes to the...

All papers in Speaker Characterization

Speaker age estimation using i-vectors

by Mohamad Hasan Bahari

2014

In this paper, a new approach for age estimation from speech signals based on i-vectors is proposed. In this method, each utterance is modeled by its corresponding i-vector. Then, a Within-Class Covariance Normalization technique is...
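As a rough illustration of the Within-Class Covariance Normalization (WCCN) step named in this abstract, the sketch below estimates an average within-class covariance matrix W and derives a projection B with B Bᵀ = W⁻¹ through a Cholesky factorisation. The data are synthetic stand-ins for i-vectors, and the class labels, the function names `within_class_cov` and `wccn_matrix`, and the small ridge term are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def within_class_cov(X, labels):
    """Average of the per-class covariance matrices."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        centered = X[labels == c] - X[labels == c].mean(axis=0)
        W += centered.T @ centered / len(centered)
    return W / len(classes)

def wccn_matrix(X, labels, ridge=1e-6):
    """Return B with B @ B.T = inv(W); the ridge term is an assumed safeguard."""
    W = within_class_cov(X, labels) + ridge * np.eye(np.shape(X)[1])
    return np.linalg.cholesky(np.linalg.inv(W))

# Synthetic stand-in for i-vectors: 5 hypothetical classes, 40 vectors each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 40)
class_means = 3.0 * rng.normal(size=(5, 4))
noise_scale = np.array([1.0, 2.0, 0.5, 4.0])
X = class_means[labels] + noise_scale * rng.normal(size=(200, 4))

B = wccn_matrix(X, labels)
X_normalized = X @ B  # row-wise form of x -> B.T @ x
```

After the projection, the within-class covariance of `X_normalized` is close to the identity matrix, which is the property WCCN aims for before a regressor is trained on the vectors.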


Height Estimation from Speech Signals using i-vectors and Least-Squares Support Vector Regression

by Amir Hossein Poorjam and

2014, In Proceedings of the 37th International Conference on Telecommunications and Signal Processing, Germany

This paper proposes a novel approach for automatic speaker height estimation based on the i-vector framework. In this method, each utterance is modeled by its corresponding i-vector. Then artificial neural networks (ANNs) and least-squares...
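To make the least-squares support vector regression (LSSVR) idea more concrete, here is a minimal sketch of the standard LS-SVM regression formulation, in which the bias and the dual weights are obtained by solving a single linear system. The RBF kernel, the values of `gamma` and `sigma`, and the one-dimensional toy data are assumptions for illustration; the paper's actual inputs are i-vectors and its settings are not given here.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def lssvr_fit(X, y, gamma=100.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    solution = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return solution[0], solution[1:]  # bias, dual weights

def lssvr_predict(X_train, bias, alphas, X_new, sigma=1.0):
    """Predict with f(x) = sum_i alpha_i k(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alphas + bias

# Toy regression problem standing in for (i-vector, height) pairs.
X_train = np.linspace(-3.0, 3.0, 60)[:, None]
y_train = np.sin(X_train[:, 0])
bias, alphas = lssvr_fit(X_train, y_train)
y_fit = lssvr_predict(X_train, bias, alphas, X_train)
```

Unlike standard support vector regression, this formulation uses equality constraints and a squared loss, so training reduces to one call to a linear solver rather than a quadratic program.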


Speaker Profiling for Forensic Applications

by Amir Hossein Poorjam

2014, Master’s thesis, KU Leuven – Faculty of Engineering Science, Belgium

In the update rule, the iteration index counts gradient ascent iterations, the step size is the learning rate, and I is an identity matrix of dimension C. The subspace matrix L is estimated over a large training dataset. The obtained subspace vectors representing the utterances in the training and test datasets are used to estimate the age of speakers in this chapter. This new low-dimensional utterance representation approach was successfully applied to speaker and language/dialect recognition tasks [8].

FIGURE 2.2: The age histogram of telephone speech utterances of training and testing datasets for male and female speakers.

FIGURE 3.1: Block diagram of the proposed speaker height estimation approach in training and testing phases.

FIGURE 3.2: The height histogram of telephone speech utterances for the NIST 2008 and NIST 2010 databases.

FIGURE 3.3: The scatter plot of height estimation for (a): male speakers, (b): female speakers, and (c): when the male and female data were pooled together.

FIGURE 4.1: Block diagram of the proposed speaker weight estimation approach in training and testing phases.

FIGURE 5.2: Block diagram of the proposed smoker detection approach in training and testing phases.

FIGURE 4.2: The weight histogram of telephone speech utterances of training and testing datasets for male and female speakers.

FIGURE 5.3: The smoking habit histogram of telephone speech utterances for training, development and testing datasets for (a): male speakers and (b): female speakers.

FIGURE 5.4: Block diagram of the proposed smoker detection approach for score-level fusion of the i-vector-based recognizer (model-1) and the NFA-based recognizer (model-2). U.M. stands for utterance modeling.

FIGURE 5.5: The ROC curves of the proposed method for (a): male speakers, and (b): female speakers.

FIGURE 6.1: The scatter plots of (a): age and height, (b): age and weight, and (c): height and weight of speakers in NIST 2008 and 2010 SRE databases for both genders.

FIGURE 6.2: Block diagram of the proposed multitask speaker profiling approach for speaker age estimation and smoking habit detection, in training, development and testing phases. U.M. stands for utterance modeling. The diagram labels the training targets for age and smoking habit, and the estimated age and smoking habit obtained after applying a test sample.

FIGURE 6.3: Block diagram of the proposed multitask speaker profiling approach for speaker age, height and weight estimation, in training and testing phases. U.M. stands for utterance modeling. The diagram labels the training targets for age, height and weight, and the estimated age, height and weight obtained after applying a test sample.

FIGURE 6.4: The ROC curves of the proposed MTL smoker detection after the score-level fusion and when age information is considered, for (a): male speakers, and (b): female speakers.

* The bold numbers in the table indicate the improved results.

FIGURE A.1: The structure of a single neuron.

FIGURE A.2: The structure of a multilayer perceptron (MLP).

The activation functions commonly used in feedforward neural networks are logistic, hyperbolic tangent and linear functions. The logistic function takes the following form, with an output that lies between 0 and 1:
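For reference, the standard logistic (sigmoid) function can be written as below. The argument name $z$ is an assumption, since the notation used in the source equation is not reproduced on this page.

```latex
\[
f(z) = \frac{1}{1 + e^{-z}}, \qquad 0 < f(z) < 1 .
\]
```

As $z \to -\infty$ the output approaches 0, as $z \to +\infty$ it approaches 1, and $f(0) = 1/2$.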

FIGURE A.3: The structure of a single hidden layer feedforward neural network with error back propagation. The solid lines represent the forward paths and the dotted lines indicate the error back-propagation paths.

In binary classification problems, we intend to model the probability of a certain label given its features. That is, $P(y \mid x_i) = f(w^T x_i + w_0)$, where $x_i$ is the feature vector corresponding to the $i$-th sample. Vector $w$ and constant $w_0$ are the model parameters, which are found through maximum likelihood estimation (MLE). The MLE in the logistic regression model for binary cases is described in the next section. The output of the logistic function, as shown in Figure (C.1), takes a value between zero and one.
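The model above can be sketched in a few lines of NumPy. This example fits the parameters by plain gradient ascent on the average log-likelihood, which is one common way to carry out the maximum likelihood estimation; the optimizer, the learning rate, the iteration count and the synthetic two-class data are assumptions for this sketch and are not taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=2000):
    """Estimate w and w0 by gradient ascent on the average log-likelihood."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    w0 = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + w0)   # P(y = 1 | x_i) under the current parameters
        error = y - p             # gradient of the log-likelihood w.r.t. the logit
        w += lr * (X.T @ error) / n_samples
        w0 += lr * error.mean()
    return w, w0

# Two synthetic Gaussian classes standing in for real feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, w0 = fit_logistic_regression(X, y)
probs = sigmoid(X @ w + w0)
accuracy = np.mean((probs > 0.5) == y)
```

Thresholding `probs` at 0.5 turns the estimated probabilities into class decisions; other thresholds trade false alarms against misses.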

At this point, since a is positive and smaller than one, the matrix B is positive semidefinite,

TABLE 2.1: The results of speaker age estimation using different utterance modeling methods (the i-vector and the NFA frameworks), and different function approximation techniques (LSSVR and MLPs). CC is the Pearson correlation coefficient between actual and estimated age, and MAE is the mean-absolute error between actual and estimated age in years.

... layer and 400 neurons in the second hidden layer. The preferred activation function for the hidden layers was the logistic sigmoid function, and in order to perform regression, a linear activation function was used for the output layers. Among the various training algorithms described in Section A.2.1 of Appendix A, the “scaled conjugate gradient” and “one step secant back-propagation” algorithms were applied for the networks related to males and females, respectively. Networks were trained to minimize the mean-absolute error between the desired and estimated outputs. To attenuate the effect of random initialization, each experiment was repeated 10 times, and the most frequently observed result was reported.
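The two evaluation measures named in the Table 2.1 caption, the Pearson correlation coefficient (CC) and the mean-absolute error (MAE), can be computed as in the sketch below. The age values are made-up numbers for illustration only and are not results from the thesis.

```python
import numpy as np

def pearson_cc(actual, estimated):
    """Pearson correlation coefficient between two equal-length sequences."""
    a = np.asarray(actual, dtype=float)
    e = np.asarray(estimated, dtype=float)
    a_centered = a - a.mean()
    e_centered = e - e.mean()
    return float(np.sum(a_centered * e_centered)
                 / np.sqrt(np.sum(a_centered**2) * np.sum(e_centered**2)))

def mean_absolute_error(actual, estimated):
    """Average absolute difference, in the same unit as the inputs."""
    a = np.asarray(actual, dtype=float)
    e = np.asarray(estimated, dtype=float)
    return float(np.mean(np.abs(a - e)))

# Hypothetical ages in years.
actual_age = [25, 40, 33, 58, 47]
estimated_age = [30, 38, 35, 50, 45]

cc = pearson_cc(actual_age, estimated_age)
mae = mean_absolute_error(actual_age, estimated_age)  # 3.8 years for these values
```

CC measures how well the estimates track the ordering and spread of the true values, while MAE reports the typical size of the error, so the two can disagree: a constant offset leaves CC unchanged but raises MAE.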

TABLE 4.1: Results of speaker weight estimation using MLPs and LSSVR. CC is the Pearson correlation coefficient between actual and estimated weight, and MAE is the mean-absolute error between actual and estimated weight, in kg.

* The bold numbers in the table indicate the best results.

TABLE 5.1: The $C_{llr,min}$ of applying different classifiers (Multilayer Perceptrons (MLP), Logistic Regression (LR), Von-Mises-Fisher Scoring (VMF), Gaussian Scoring (GS) and Naive Bayesian Classifier (NBC)) over the i-vector and the NFA frameworks.

5.4 Results and Discussion

This section presents the results of the proposed smoking habit detection approach. The acoustic feature consists of 20 Mel-Frequency Cepstrum Coefficients (MFCCs), including energy, appended with their first- and second-order derivatives, forming a 60-dimensional acoustic feature vector. MFCCs are obtained using the cosine transform of the real logarithm of the short-term energy spectrum represented on a mel-frequency scale [80]. This type of feature is very common in state-of-the-art i-vector-based speaker recognition systems. To obtain more reliable features, Wiener filtering, speech activity detection [68] and feature warping [80] have been applied in front-end processing.
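The sketch below shows how 20 static coefficients per frame can be extended with first- and second-order derivatives into a 60-dimensional vector. The regression-based delta formula, the window half-width `N=2` and the edge padding are assumptions, because the text does not say how the derivatives were computed, and the static coefficients here are random placeholders rather than real MFCCs.

```python
import numpy as np

def delta(features, N=2):
    """Regression-based time derivative over a window of +/- N frames."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    # Repeat the first and last frame so every frame has N neighbours.
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denominator = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features)
    for n in range(1, N + 1):
        ahead = padded[N + n : N + n + n_frames]
        behind = padded[N - n : N - n + n_frames]
        out += n * (ahead - behind)
    return out / denominator

# Placeholder static coefficients: 100 frames x 20 coefficients.
rng = np.random.default_rng(0)
static = rng.normal(size=(100, 20))

first_order = delta(static)
second_order = delta(first_order)
feature_vectors = np.hstack([static, first_order, second_order])  # 100 x 60
```

On a signal that grows by one unit per frame, this delta estimate equals one away from the edges, which is a quick sanity check on the window indexing.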

TABLE 6.1: The comparison between single-task and multitask speaker profiling for speaker height and age estimation. CC is the Pearson correlation coefficient between actual and estimated height/age.

* The bold numbers in the table indicate the improved results.

TABLE 6.2: The comparison between single-task and multitask speaker profiling for speaker weight and age estimation. CC is the Pearson correlation coefficient between actual and estimated weight/age.

* The bold numbers in the table indicate the improved results.

TABLE 6.3: The comparison between single-task and multitask speaker profiling for speaker height and weight estimation. CC is the Pearson correlation coefficient between actual and estimated height/weight.

TABLE 6.4: The comparison between single-task and multitask speaker profiling for speaker height, weight and age estimation. CC is the Pearson correlation coefficient between actual and estimated height/weight/age.

TABLE 7.2: The relative improvements (R.I.) in $C_{llr,min}$ and AUC of the proposed smoking habit detection after score-level fusion compared with the i-vector framework.

... the baseline [9] (provided on the same databases) for males and females by 8.6% and 22.2%, respectively, which reflects the effectiveness of the proposed method in automatic speaker age estimation.

TABLE 7.3: The relative improvements in CC for age, height and weight estimations obtained in a multitask age, height and weight estimation, compared with the baselines.

TABLE 7.4: The relative improvements (R.I.) in $C_{llr,min}$ and AUC for smoking habit detection and the R.I. in CC for age estimation obtained in a multitask smoking habit and age estimation compared with baselines.
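The captions above report relative improvements (R.I.) without stating the formula on this page. A common definition, assumed here, expresses the change as a percentage of the baseline value, with the sign flipped for measures where lower is better. The function name `relative_improvement` and the numbers are hypothetical and are not results from the thesis.

```python
def relative_improvement(baseline, proposed, higher_is_better=True):
    """Percentage improvement of `proposed` over `baseline` (assumed definition)."""
    if higher_is_better:
        change = proposed - baseline   # e.g. CC or AUC
    else:
        change = baseline - proposed   # e.g. an error or cost measure
    return 100.0 * change / abs(baseline)

# Hypothetical values for illustration.
ri_cc = relative_improvement(0.50, 0.55)                            # correlation goes up
ri_cost = relative_improvement(5.0, 4.5, higher_is_better=False)    # cost goes down
```

Both calls give an improvement of about 10%, which shows why the direction of the measure has to be stated alongside any R.I. figure.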

... profiling systems was investigated. In the proposed method, each utterance of the NIST 2008 and 2010 SRE databases was modeled using the i-vector and the NFA frameworks. In this study, due to the task relatedness, an MTL for speaker age, height and weight estimation and an MTL for speaker age and smoking habit estimation were performed in separate experiments.


Analysis of speaker and co-articulation effects based on sub-band cepstral variances in the Japanese vowels of 300 male speakers

by Dr Frantz Clermont

2019, 14th Biennial Conference of the International Association of Forensic Linguistics (IAFL), Melbourne, Australia

The longstanding aim of achieving robust forensic voice identification is hampered by a number of complex and intertwined factors of variability in the speech signal such as: (1) speaker differences; (2) co-articulation effects; (3)...
