Conference of the International Speech Communication Association, 1999
A new simple model for vocal tract (VT) acoustics is presented. The model is based on a systems approach applied to the modes of the VT and the subglottal cavities. Glottal interaction with the ability to produce pitch-synchronous modulation effects is included as well as the lip radiation impedance. The synthesized vowels have spectral and temporal features close to natural ones. They are almost free of the nasal quality present in many vowels synthesized by conventional methods.
The automatic classification of unvoiced stop consonants is widely considered a difficult task for traditional frequency-domain and even time-frequency methods. The main reason is their short duration and diverse temporal structure. In this paper we present a novel method for stop consonant recognition. The method is based on statistical properties of the short-term temporal fine structure of the burst. Classification is also evaluated with a simple frequency-domain method.
Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on
A new method to realize arbitrary time-frequency plane tilings together with critical sampling in block-recursive filterbanks is presented. The method leads to pole-zero approximation of the target channel transfer functions. Perfect reconstruction within the limits of the approximation error can be achieved.
An unpredictably strong geomagnetic storm took place on April 6-7, 2000. A scientific study on aurora-related sounds and acoustical effects had just been started. There were two possible choices: to perform audio recordings with a non-professional ad hoc setup, or to miss a promising opportunity to study the phenomena. The first choice was selected and it produced over seventy-five minutes of data, corrupted with impulsive noise from a DC/AC inverter used to power the recording equipment in the field. The data consists of ten successive recordings. This paper describes a novel method to cancel the impulses found in the data without significantly affecting the microphone signal. Next, the power of the cleaned signal is computed in 1/3-octave bands. Spectral comparison of the different recordings showed a clear increase in the average power in recording #5, which was sampled somewhat after the peak of geomagnetic activity. Specifically, the frequency band around 100 Hz showed an increase in power and also in its fluctuation during the most active geomagnetic periods. This new result supports those obtained earlier [4]. The role of related infrasounds is also given a preliminary discussion.
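The band-power step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes base-2 band edges (a factor of 2^(1/6) on either side of the center frequency) and a naive DFT; the function names, the choice of centers, and the synthetic 100 Hz tone are hypothetical illustrations, not the paper's implementation.

```python
import cmath
import math


def third_octave_edges(center_hz):
    """Lower and upper edge of the 1/3-octave band around center_hz (base-2 assumption)."""
    factor = 2.0 ** (1.0 / 6.0)
    return center_hz / factor, center_hz * factor


def band_powers(signal, fs, centers_hz):
    """Sum one-sided DFT bin power inside each 1/3-octave band.

    Bin powers are not doubled for the negative-frequency half, so the
    values are only meaningful for comparing bands or recordings.
    """
    n = len(signal)
    bin_power = []
    for k in range(n // 2 + 1):
        # Naive O(n^2) DFT: adequate for a short illustrative frame.
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        bin_power.append(abs(acc) ** 2 / n ** 2)

    powers = {}
    for fc in centers_hz:
        lo, hi = third_octave_edges(fc)
        powers[fc] = sum(p for k, p in enumerate(bin_power)
                         if lo <= k * fs / n < hi)
    return powers


# Synthetic check: a pure 100 Hz tone should land in the band centered at 100 Hz.
fs = 1000
tone = [math.sin(2 * math.pi * 100 * t / fs) for t in range(200)]
powers = band_powers(tone, fs, [50, 100, 200])
```

Comparing such per-band values between recordings is one simple way to look for the kind of power increase near 100 Hz that the abstract reports.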
The Higher Pole Correction (HPC) function in analog and digital all-pole modelling of speech production is analyzed by comparing all-pole models with a Transmission Line (TL) model. The validity of the TL model, which was chosen as a computational reference system in the study, is tested by comparing its transfer functions to acoustical measurements made on a physical vocal tract model. The variation of the effective length of the vocal tract turned out to be an important parameter in modelling the HPC. Even if the frequency responses of the HPC in analog and digital cases differ, the relative changes in the correction, influenced by the variations in the effective length of the vocal tract, are exactly the same in both cases. Therefore digital realizations should have a variable HPC also. A polynomial analysis of the vocal tract transfer function was done to obtain new practical models for the HPC. The work results in all-zero models, which can be used in analog as well as digital all-pole realizations to form a new type of pole-zero model for speech production. This new pole-zero model is related to the PARCAS terminal analog model [Laine, 1982].
Scandinavian Journal of Logopedics and Phoniatrics, 1992
Aspect of the physiological sources of vocal vibrato: A study of fundamental-period-synchronous changes in electroglottographic signals obtained from one singer and two excised human larynges
Frequency-warped signal processing for audio applications
... the design and implementation of warped FIR-type and IIR-type filters, which are the basic building blocks in warped signal-processing algorithms. Several audio applications where frequency-warped techniques have shown advantages are ...
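As an illustration of a warped FIR-type filter, the sketch below assumes the common construction in which each unit delay of an ordinary FIR filter is replaced by a first-order allpass section D(z) = (z^-1 - λ) / (1 - λ z^-1). The excerpt above does not spell this out, so the construction, the function names, and the parameter `lam` are assumptions made for illustration.

```python
def allpass(x, lam):
    """First-order allpass D(z) = (z^-1 - lam) / (1 - lam * z^-1).

    Difference equation: y[n] = -lam * x[n] + x[n-1] + lam * y[n-1].
    """
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for xn in x:
        yn = -lam * xn + x_prev + lam * y_prev
        y.append(yn)
        x_prev, y_prev = xn, yn
    return y


def warped_fir(x, coeffs, lam):
    """FIR-type filter whose k-th tap reads the input after k allpass sections."""
    out = [0.0] * len(x)
    stage = list(x)
    for k, b in enumerate(coeffs):
        if k > 0:
            # Each further tap sits one allpass section deeper in the chain.
            stage = allpass(stage, lam)
        for n, v in enumerate(stage):
            out[n] += b * v
    return out


# With lam = 0 every allpass section is a plain unit delay,
# so the warped filter reduces to an ordinary FIR filter.
impulse = [1.0, 0.0, 0.0, 0.0]
plain = warped_fir(impulse, [1.0, 2.0, 3.0], 0.0)
warped = warped_fir(impulse, [1.0, 2.0, 3.0], 0.5)
```

Under this assumption, a nonzero `lam` changes how frequency resolution is distributed along the frequency axis, while `lam = 0` recovers the conventional filter.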
Improved Broad Phonetic Classification and Segmentation with an Auditory Model
Speech Recognition and Understanding, 1992
We describe a broad phonetic classification and segmentation algorithm based on neural networks and dynamic programming. The basics of our algorithm are outlined in another paper [7], so here we will focus on the introduction of auditory model features replacing the mel-scale parameters. Our auditory model incorporates critical band filtering, short time adaptation and temporal analysis of the auditory nerve responses. Unlike previously proposed synchrony models, it emphasizes the envelope rather than the instantaneous frequency as the carrier of perceptually relevant information.
An efficient method for weakly supervised pattern discovery and recognition from discrete categorical sequences is introduced. The method utilizes two parallel sources of data: categorical sequences carrying some temporal or spatial information and a set of labeled, but not exactly aligned, contextual events related to the sequences. From these inputs the method builds associative models able to describe systematically co-occurring structures in the input streams. The learned models, based on transitional probabilities of events observed at several different time lags, inherently segment and classify novel sequences into contextual categories. Learning and recognition processes are purely incremental and computationally cheap, making the approach suitable for on-line learning tasks. The capabilities of the algorithm are demonstrated in a keyword learning task from continuous infant-directed speech and a continuous speech recognition task operating at varying noise levels.
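The idea of classifying sequences through transition statistics collected at several lags can be sketched as follows. This is a deliberately simplified illustration, not the published algorithm: the count table, the per-transition normalization, the summed score, and all names are assumptions.

```python
from collections import defaultdict


def train(sequences, labels, lags=(1, 2)):
    """Count how often each (lag, a, b) transition co-occurs with each label."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq, label in zip(sequences, labels):
        for lag in lags:
            for i in range(len(seq) - lag):
                counts[(lag, seq[i], seq[i + lag])][label] += 1.0
    return counts


def classify(counts, seq, lags=(1, 2)):
    """Sum, over all observed transitions, each label's share of that transition."""
    scores = defaultdict(float)
    for lag in lags:
        for i in range(len(seq) - lag):
            by_label = counts.get((lag, seq[i], seq[i + lag]))
            if not by_label:
                continue  # transition never seen in training
            total = sum(by_label.values())
            for label, c in by_label.items():
                scores[label] += c / total
    return max(scores, key=scores.get) if scores else None


# Toy data: two categorical sequences, each paired with one contextual label.
model = train(["abab", "cdcd"], ["X", "Y"])
guess_x = classify(model, "bab")
guess_y = classify(model, "dcd")
guess_none = classify(model, "zzz")
```

Because training only increments counters, new labeled sequences can be added one at a time, which is in the spirit of the incremental learning the abstract describes.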
Despite large-scale research, development of robust machines for imitation and inversion of human speech into articulatory movements has remained an unsolved problem. We propose a set of principles that can partially explain real infants' speech acquisition processes and the emergence of imitation skills, and demonstrate a simulation where a learning virtual infant (LeVI) learns to invert and imitate a virtual caregiver's speech. Based on recent findings in infants' language acquisition, LeVI learns the phonemes of its native language in a babbling phase using only the caregiver's feedback as guidance, and learns to map the caregiver's acoustically differing speech onto its own articulation in a phase where LeVI is imitated by the caregiver with similar, but not exact, utterances. After the learning stage, LeVI is able to recognize vowels from the virtual caregiver's VCVC utterances perfectly and all 25 Finnish phonemes with an average accuracy of 88.42%. The place of articulation of consonants is recognized with an accuracy of 96.81%. LeVI is also able to imitate the caregiver's speech since the recognition occurs directly in the domain of articulatory programs for phonemes. The learned imitation ability (speech inversion) is strongly language dependent since it is based on the phonemic programs learned from the caregiver. The findings suggest that caregivers' feedback can act as an important signal in guiding infants' articulatory learning, and that the speech inversion problem can be effectively approached from the perspective of early speech acquisition.
The Journal of the Acoustical Society of America, 2013
Several psychoacoustic phenomena such as loudness perception, absolute thresholds of hearing, and perceptual grouping in time are affected by temporal integration of the signal in the auditory system. Similarly, the frequency resolution of the hearing system, often expressed in terms of critical bands, implies signal integration across neighboring frequencies. Although progress has been made in understanding the neurophysiological mechanisms behind these processes, the underlying reasons for the observed integration characteristics have remained poorly understood. The current work proposes that the temporal and spectral integration are a result of a system optimized for pattern detection from ecologically relevant acoustic inputs. This argument is supported by a simulation where the average time-frequency structure of speech that is derived from a large set of speech signals shows a good match to the time-frequency characteristics of the human auditory system. The results also suggest that the observed integration characteristics are learnable from acoustic inputs of the auditory environment using a Hebbian-like learning rule.
A new method for inferring specific stochastic grammars is presented. The process, called Hybrid Model Learner (HML), applies entropy rate to guide an agglomeration process of type ab->c. Each rule derived from the input sequence is associated with a certain entropy-rate difference. A grammar automatically inferred from an example sequence can be used to detect and recognize similar structures in unknown sequences. Two important schools of thought, structuralism and 'stochasticism', are discussed, including how the two have met and are influencing current statistical learning methods. It is argued that syntactic methods may provide universal tools to model and describe structures from the very elementary level of signals up to the highest one, that of language.
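A rule of type ab->c and its associated entropy-rate difference can be illustrated with a rough sketch. The abstract does not specify how HML estimates entropy rate, so the first-order conditional entropy used here, along with every function name, is an assumption chosen only to make the idea concrete.

```python
import math
from collections import Counter


def merge_pair(seq, a, b, new_symbol):
    """Apply the rule ab -> new_symbol greedily from left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def entropy_rate(seq):
    """First-order estimate H(X_t | X_{t-1}) in bits (an assumed estimator)."""
    pair_counts = Counter(zip(seq, seq[1:]))
    prev_counts = Counter(seq[:-1])
    total = sum(pair_counts.values())
    h = 0.0
    for (prev, _), c in pair_counts.items():
        h -= (c / total) * math.log2(c / prev_counts[prev])
    return h


def rule_entropy_difference(seq, a, b, new_symbol):
    """Entropy-rate change caused by one agglomeration rule, plus the rewritten sequence."""
    merged = merge_pair(seq, a, b, new_symbol)
    return entropy_rate(seq) - entropy_rate(merged), merged


diff, merged = rule_entropy_difference(list("aabb"), "a", "b", "C")
```

A learner in this style could score candidate pairs by such a difference and keep the rules that change the estimate most, although the actual selection criterion used by HML is not stated in the abstract.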
A vector autoregressive (VAR) model is used in the auditory time-frequency domain to predict spectral changes. Forward and backward prediction errors increase at the phone boundaries. These error signals are then used to study and detect the boundaries of the largest changes, allowing the most reliable automatic segmentation. Using a fully unsupervised method yields segments consisting of a variable number of phones. The performance of this method was tested with a set of 150 Finnish sentences pronounced by one female and two male speakers. The performance for English was tested using the TIMIT core test set. The boundaries between stops and vowels, in particular, are detected with high probability and precision.
Papers by Unto Laine