This study aimed to analyze the impact of the amount of data on the discriminatory performance of acoustic-phonetic parameters, some of which are frequently assessed in forensic speaker comparisons. Parameters from three distinct phonetic domains were considered, namely spectral, melodic, and temporal; they were assessed both separately within the same phonetic domain and in combination. The speech material consisted of spontaneous telephone conversations between two subjects. During the recording sessions, the participants were placed in different rooms, unable to see, hear, or otherwise interact with each other directly. The speakers were encouraged to start a conversation using a mobile phone while being simultaneously recorded. All recordings were carried out at high resolution (44.1 kHz, 16-bit). Data segmentation and transcription were performed in the Praat software [1]. The participants were 20 male Brazilian Portuguese speakers from the same dialectal area, aged 19 to 35 years (mean 26.4 years). Although the subjects (10 identical twin pairs) were recruited from a twin research project, cf. [2, 3, 4], the focus here was on comparisons among all speakers (i.e., 190 inter-speaker comparisons) rather than on individual twin pairs.

Two metrics of discriminatory performance were examined in the R software [5] across the comparisons among all speakers, using the script fvclrr [6]: the log-likelihood-ratio cost (Cllr) and the equal error rate (EER). The Cllr metric is an empirical estimate of the quality of the likelihood ratios a system outputs. The EER captures the point where the false rejection rate (type I error) and the false acceptance rate (type II error) are equal and describes the overall accuracy of a system. Lower Cllr and EER values indicate better discriminatory performance, whereas higher values suggest the opposite. A cross-validation procedure was adopted for the calculation of likelihood ratios: multiple pairwise comparisons were performed across individuals, with the background sample consisting of data from all speakers except the two being directly compared.

To assess the impact of the amount of data on discriminatory performance, a straightforward approach was adopted. Starting from a larger data set extracted from about 2.5 min of recording per speaker, random data points were additively selected for the analyses. The minimum number of data points (acoustic measurements) per speaker was set at 2, the smallest number that still allows intra-speaker variability to be assessed. Randomly selected data points were then progressively added, two at a time, and new Cllr and EER values were computed for each resampled data set, up to a maximum of 30 data points per speaker. Given the nature of the speech material, the number of samples produced per subject varied; to minimize selection bias, the random downsampling procedure was therefore repeated 200 times using the R package recipes [7]. Median Cllr and EER values were reported after performing the tests on the randomly selected data, with the different estimates fused and calibrated through a logistic regression technique, cf. [8].
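For reference, the Cllr metric follows a standard formulation in the likelihood-ratio literature; assuming fvclrr [6] adopts this conventional definition, Cllr is computed as

    C_{llr} = \frac{1}{2}\left[ \frac{1}{N_{ss}} \sum_{i=1}^{N_{ss}} \log_2\!\left(1 + \frac{1}{LR_{ss,i}}\right) + \frac{1}{N_{ds}} \sum_{j=1}^{N_{ds}} \log_2\!\left(1 + LR_{ds,j}\right) \right]

where N_ss and N_ds are the numbers of same-speaker and different-speaker comparisons and LR_{ss,i} and LR_{ds,j} are the corresponding likelihood ratios. A perfectly discriminating, well-calibrated system yields a Cllr near 0, while a value of 1 corresponds to a system that always outputs LR = 1 and thus provides no useful information.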
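The additive resampling procedure can be sketched in R roughly as follows. This is a minimal illustration, not the study's actual pipeline: the data layout, the column names, and compute_metrics() are hypothetical placeholders, with the real Cllr/EER computation performed by the fvclrr script [6] and the downsampling handled through recipes [7].

    ## Hypothetical layout: one row per acoustic measurement, with a
    ## speaker identifier; values here are synthetic stand-ins.
    set.seed(1)
    df <- data.frame(
      speaker   = rep(sprintf("S%02d", 1:20), each = 40),
      f0_median = rnorm(800),
      art_rate  = rnorm(800)
    )

    compute_metrics <- function(d) {
      ## Placeholder for the cross-validated LR scoring plus the
      ## Cllr/EER computation done with fvclrr [6] in the study.
      c(cllr = NA_real_, eer = NA_real_)
    }

    results <- list()
    for (n in seq(2, 30, by = 2)) {      # data points per speaker
      reps <- replicate(200, {           # 200 random downsamplings
        resampled <- do.call(rbind, lapply(split(df, df$speaker),
          function(s) s[sample(nrow(s), n), , drop = FALSE]))
        compute_metrics(resampled)
      })
      ## Report the median Cllr and EER over the 200 repetitions
      results[[as.character(n)]] <- apply(reps, 1, median, na.rm = TRUE)
    }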
Melodic and temporal parameters were extracted from speech chunks with an average (mean and median) temporal window of 3 s, corresponding to inter-pause intervals in most cases. Spectral parameters were extracted from the midpoints of /a/ vowels in stressed and unstressed positions; these monophthongs displayed a mean duration of 67 ms and a median duration of 84 ms. After manual segmentation, all parameters were extracted automatically using a Praat script [9]. Four models were compared: Model 1 (M1) combines the melodic parameters (f0 median and f0 base value); Model 2 (M2) combines the temporal parameters (speech rate and articulation rate); Model 3 (M3) combines the spectral parameters (F3 and F4); and Model 4 (M4) combines all of the acoustic-phonetic parameters.
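As an illustration of the logistic-regression fusion and calibration step, cf. [8], a minimal R sketch is given below. The data layout and column names are assumptions; unlike the study, this toy version fits and evaluates on the same comparisons rather than under cross-validation.

    ## scores: one row per pairwise comparison, one column of scores
    ## per model (e.g., M1-M3); same: TRUE for same-speaker pairs.
    fuse <- function(scores, same) {
      d   <- cbind(scores, same = as.numeric(same))
      fit <- glm(same ~ ., data = d, family = binomial)
      ## The fitted log-odds minus the training prior log-odds gives a
      ## fused, calibrated log-likelihood ratio per comparison.
      predict(fit, type = "link") - qlogis(mean(same))
    }

    ## Toy usage with synthetic scores for 190 comparisons:
    set.seed(1)
    scores    <- data.frame(m1 = rnorm(190), m2 = rnorm(190), m3 = rnorm(190))
    same      <- sample(c(TRUE, FALSE), 190, replace = TRUE)
    fused_llr <- fuse(scores, same)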