Pitch Tracking

187 papers

1 follower

About this topic

Pitch tracking is the process of detecting and analyzing the fundamental frequency of a sound signal, typically in music or speech, to determine its pitch. This involves algorithms that can identify variations in frequency over time, enabling applications in music transcription, voice recognition, and audio processing.
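As a rough illustration of what detecting the fundamental frequency means in practice, the sketch below estimates the pitch of a single frame by picking the autocorrelation peak. It is a minimal example, not the method of any paper listed on this page; the function name `autocorr_pitch`, the 50 to 500 Hz search range, and the synthetic 220 Hz test tone are all assumptions made for illustration.

```python
import math

def autocorr_pitch(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) of one frame by choosing the
    lag with the largest autocorrelation inside the allowed pitch range.
    Hypothetical helper: the search range is an assumption, not a standard."""
    lag_min = int(sample_rate / f_max)   # shortest period considered
    lag_max = int(sample_rate / f_min)   # longest period considered
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, len(frame) - 1) + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic test signal: a 220 Hz sine sampled at 8 kHz (assumed values).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 220.0 * n / sample_rate) for n in range(1024)]
estimate = autocorr_pitch(frame, sample_rate)
print(f"estimated f0: {estimate:.1f} Hz")
```

Because the lag is an integer number of samples, the estimate is quantized; a full tracker would typically repeat this per frame, interpolate around the peak, and add a voicing decision.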


Key research themes

1. How can auditory models improve accurate pitch segmentation and transcription in singing sequences?

This research area focuses on developing and evaluating auditory model-based transcription systems that can convert singing sequences into discrete pitch and duration pairs with minimized segmentation errors. Accurate segmentation and transcription are critical for applications like Query-by-Humming (QBH) systems, where matching sung queries to musical databases depends fundamentally on precise note boundary detection and pitch estimation. Challenges include reducing segmentation errors and accommodating variability caused by singing with or without lyrics.

An Auditory Model Based Transcriber of Singing Sequences

by Micheline Lesaffre

2018

Key finding: The study demonstrates that existing state-of-the-art transcription systems suffer from high segmentation error rates, up to 60%. By developing a new auditory model-based transcription system incorporating advanced acoustic...


A New Method for Pitch Tracking and Voicing Decision Based on Spectral Multi-Scale Analysis

by Mohamed Anouar BEN MESSAOUD

2024, Signal Processing: An International …

Key finding: The paper introduces a novel voicing detection and pitch estimation algorithm that leverages the multi-scale product of wavelet transform coefficients to robustly detect pitch periods and voicing segments in noisy speech....


Robust Pitch Extraction Method for the HMM-Based Speech Synthesis System

by M Kiran Reddy

2022, IEEE Signal Processing Letters

Key finding: Proposes an innovative pitch extraction method based on Continuous Wavelet Transform (CWT) coefficients mean signal to improve voiced/unvoiced detection and pitch estimation, especially in challenging creaky voice regions....



2. What advanced signal processing techniques enable robust and high-resolution pitch tracking in complex and noisy audio signals?

This theme investigates algorithmic advancements in pitch estimation that provide enhanced time-frequency resolution, noise robustness, and effective multi-pitch tracking. These techniques use innovative mathematical transforms, empirical mode decomposition, canonical correlation analysis, and statistical modeling to disambiguate pitch information from acoustically rich or degraded signals. The focus is on leveraging continuous pitch estimation and harmonic models to improve pitch tracking accuracy, essential for applications such as speech synthesis, music transcription, and robot audition.

An efficient pitch-tracking algorithm using a combination of Fourier transforms

by Sylvain Marchand

2022

Key finding: Develops a novel pitch detection technique utilizing a 'Fourier of Fourier' transform, applying two sequential Fourier transforms, to precisely identify the fundamental frequency in harmonic sounds, even when fundamentals are...


Effective Pitch Estimation using Canonical Correlation Analysis

by Subrata Kumer Paul

2022, 2nd International Conference on Advanced Information and Communication Technology 2020 (ICAICT-2020),

Key finding: Introduces a pitch estimation method that combines Empirical Mode Decomposition (EMD) to generate Intrinsic Mode Functions with Canonical Correlation Analysis (CCA) to select relevant components. The approach reconstructs a...


Pitch tracking based on statistical anticipation

by Mingyang Wu

2023, IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222)

Key finding: Proposes a robust multi-pitch tracking algorithm for noisy speech that integrates an enhanced channel and peak selection method, a novel probabilistic integration of periodicity cues across frequency bands, and Hidden Markov...


Instantaneous Pitch Estimation Based On Rapt Framework

by Maxim Vashkevich

2025

Key finding: Enhances the RAPT pitch tracking method by introducing an instantaneous normalized cross-correlation function computed from instantaneous harmonic parameters obtained via complex bandpass filtering. This results in smooth,...



3. How can integrated acoustic and music language models enable multi-pitch detection and voice assignment in polyphonic vocal music?

Research within this theme explores systems combining probabilistic acoustic models with musicological language models to simultaneously detect multiple concurrent pitches and assign detected pitches to individual voices or singers in polyphonic a cappella recordings. Such integration addresses challenges of pitch detection amidst overlapping harmonic sources and enables voice separation based on voice-leading rules and temporal continuity. The resulting methods facilitate transcription and analysis of complex vocal ensembles like chorales and quartets.

Multi-Pitch Detection and Voice Assignment for A Cappella Recordings of Multiple Singers

by Emmanouil Benetos

2022

Key finding: The paper proposes a system combining spectrogram factorization acoustic models (PLCA) driven by a learned spectral template dictionary with hidden Markov music language models embodying voice-leading constraints to perform...


A Hybrid Approach for Co-Channel Speech Segregation based on CASA, HMM Multipitch Tracking, and Medium Frame Harmonic Model

by Aliaa Youssif

2022, International Journal of Advanced Computer Science and Applications

Key finding: Presents a hybrid speech segregation approach that combines Hidden Markov Model (HMM)-based pitch tracking, computational auditory scene analysis (CASA), and medium-frame harmonic modeling to segregate co-channel speech from...


All papers in Pitch Tracking

Instantaneous Pitch Estimation Based On Rapt Framework

by Maxim Vashkevich

2025

Publication in the conference proceedings of EUSIPCO, Bucharest, Romania, 2012


A Low-Delay Algorithm for Instantaneous Pitch Estimation

by Maxim Vashkevich

2025, Journal of The Audio Engineering Society


Distance metrics and indexing strategies for a digital library of popular music

by Cristian Francu

2025, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532)

People identify powerfully with music: someone might say “that’s my song!” but they are unlikely to say “that’s my book!” or “that’s my picture!” A digital library of popular music therefore has the potential to be a compelling...


A Processing Method for Pitch Smoothing Based on Autocorrelation

by Xufang Zhao

2025

Chinese is known as a syllabic and tonal language, and tone recognition plays an important role and provides very strong discriminative information for Chinese speech recognition [1]. Usually, the tone classification is based on...


High-resolution noise-robust spectral-based pitch estimation

by Marián Képesi

2025, Interspeech 2005

This paper introduces a new spectral representation-based pitch estimation method. Since pitch is never stationary during real conversations, but often undergoes changes because of intonation, the spectral representation is derived from...


A New Method for Pitch Tracking and Voicing Decision Based on Spectral Multi-Scale Analysis

by Mohamed Anouar BEN MESSAOUD

2024, Signal Processing: An International …

This paper proposes a new voicing detection and pitch estimation method that is particularly robust for noisy speech. This method is based on the spectral analysis of the speech multi-scale product. The multi-scale product (MP) consists...


PARSHL: an analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation

by Xavier Serra

2024, Proc. Int. Computer Music Conf

This paper describes a peak-tracking spectrum analyzer, called Parshl, which is useful for extracting additive synthesis parameters from inharmonic sounds such as the piano. Parshl is based on the Short-Time Fourier Transform (STFT),...

Figure 1: Log magnitude of the transform of a triangle window.

... measures how many dB down the highest side-lobe is from the main lobe. Ideally we would like a narrow main lobe (good resolution) and a very low side-lobe level (no cross-talk between FFT channels). The choice of window determines this trade-off. For example, the rectangular window has the narrowest main lobe, 2 bins, but the first side-lobe is very high, -13 dB relative to the main-lobe peak. The Hamming window has a wider main lobe, 4 bins, and the highest side-lobe is 42 dB down. The Blackman window worst-case side-lobe rejection is 58 dB down, which is good for audio applications. A very different window, the Kaiser, allows control of the trade-off between the main-lobe width and the highest side-lobe level. If we want less main-lobe width we will get higher side-lobe level and vice versa. Since control of this trade-off is valuable, the Kaiser window is a good general-purpose choice.

Figure 2: Spectrum of two clearly separated sinusoids.

Figure 3: Illustration of the first two steps of PARSHL. (a) Input data. (b) Windowed input data. (c) FFT buffer with the windowed input data. (d) Resulting magnitude spectrum.

Figure 4: Parabolic interpolation of the highest three samples of a peak.

Figure 5: Coordinate system for the parabolic interpolation.
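The parabolic interpolation named in Figures 4 and 5 can be sketched as follows. This is the standard three-point parabola fit, written under the assumption that the three highest samples of a peak sit at bin offsets -1, 0 and +1 around the largest one; the function name `parabolic_peak` and the test values are hypothetical and are not taken from the paper.

```python
def parabolic_peak(alpha, beta, gamma):
    """Fit a parabola through (-1, alpha), (0, beta), (+1, gamma), where beta
    is the largest of the three samples, and return (offset, height) of the
    vertex. The offset is in bins relative to the middle sample."""
    denom = alpha - 2.0 * beta + gamma
    if denom == 0.0:
        # The three points are collinear: no curvature, keep the middle bin.
        return 0.0, beta
    offset = 0.5 * (alpha - gamma) / denom
    height = beta - 0.25 * (alpha - gamma) * offset
    return offset, height

# Check on samples of a known parabola with its vertex at x = 0.3, y = 10.
samples = [10.0 - 2.0 * (x - 0.3) ** 2 for x in (-1.0, 0.0, 1.0)]
offset, height = parabolic_peak(*samples)
print(f"offset = {offset:.3f} bins, height = {height:.3f}")
```

With an FFT peak at bin k, the refined frequency estimate would be (k + offset) times the bin width; fitting on dB magnitudes rather than linear magnitudes is a common choice.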

to scale the frequencies to alter pitch and formant structure together. A more powerful class of spectral modifications comes about by decoupling the sinusoidal frequencies (which convey pitch and inharmonicity information) from the spectral envelope (which conveys formant structure so important to speech perception and timbre). By measuring the formant envelope of a harmonic spectrum (e.g., by drawing straight lines or splines across the tops of the sinusoidal peaks in the spectrum and then smoothing), modifications can be introduced which only alter the pitch or only alter the formants. Other ways to measure formant envelopes include cepstral smoothing [15] and the fitting of low-order LPC models to the inverse FFT of the squared magnitude of the spectrum [9]. By modulating the flattened (by dividing out the formant envelope) spectrum of one sound by the formant-envelope of a second sound, “cross-synthesis” is obtained. Much more complex modifications are possible.

Figure 7: (a) Original piano tone, (b) synthesis with phase information, (c) synthesis without phase information.

which smoothly goes from frame to frame and where each sinusoid accounts for both the rapid phase changes (frequency) and the slowly varying phase changes.


Automatic Transcription of Polyphonic Piano Music Using Genetic Algorithms, Adaptive Spectral Envelope Modeling, and Dynamic Noise Level Estimation

by Francisco Fernández Vega

2024, IEEE Transactions on Audio, Speech, and Language Processing

This paper describes a polyphonic note detection system incorporating a simple masking technique that can accurately transcribe chords and polyphonic piano music. The system, developed in MATLAB, will take input files in .wav format. The...


Rec. Asilomar Conference on Signals, Systems, and Computers

by Søren Holdt Jensen

2024

In this paper, a computationally efficient method for the estimation of the parameters of harmonic sinusoidal signals, including the order, which is of particular importance, for speech and audio signals is presented. The signal is...


Pitch Detection/Tracking Strategy for Musical Recordings of Solo Bowed-String and Wind Instruments

by Wei-chen Chang

2024, J. Inf. Sci. Eng.

A pitch detection/tracking strategy for solo bowed-string and wind musical instrumental recordings is presented. To avoid the missing fundamental problem, we adopted the greatest common divisor method and modified it with a...


Piano Legato-Pedal Onset Detection Based on a Sympathetic Resonance Measure

by Mark Sandler

2024, 2018 26th European Signal Processing Conference (EUSIPCO)

In this paper, the problem of legato pedalling technique detection in polyphonic piano music is addressed. We propose a novel detection method exploiting the effect of sympathetic resonance which can be enhanced by a legato-pedal onset....


Review of Egg and Speech Processing Techniques for Glottal Activity Detection

by Grenze International Journal of Engineering and Technology GIJET and

2024

Glottal instants namely GCIs and GOIs are useful in a wide variety of speech processing and biomedical applications. This paper presents the recent developments in the methodologies for glottal activity detection using EGG and speech...

(Zero Frequency Filtering) method is introduced for epoch extraction based on the excitation of vocal tract which produces impulse-like excitation [29], with discontinuities at all frequencies. The characteristics at zero

Figure 2 Block diagram for the YAGA algorithm

The proposed method [34] computes a three-level multiscale product of speech wavelet transform at different dyadic scales in order to enhance edge detection of speech signal. The singularities or the signal peaks align along the peaks obtained from wavelet transform coefficients for the first few scales. However, choosing the scales too large will lead to misalignments as smoothing spreads the response and the singularities get separated. GCIs are identified as the greatest minimum peak obtained from the multiscale product and the smallest minimum peak situated between two GCIs is identified as a GOI. The block diagram for GD, DYPSA and MP methods is shown in Figure 3.

Figure 3 Block diagram for GD, DYPSA and MP methods

TABLE I. RECENT DEVELOPMENTS TO DETECT GLOTTAL ACTIVITY

TABLE II. COMPARATIVE TABLES FOR GCI

TABLE III. COMPARATIVE TABLES FOR GOI

TABLE IV. APPLICATIONS OF VARIOUS GLOTTAL ACTIVITY DETECTION METHODS


Semi-Markov decision processes with limiting ratio average rewards

by HOD Mathematics

2024, Journal of Mathematical Analysis and Applications

We prove that a finite (state and action spaces) semi-Markov decision process with limiting ratio average (undiscounted) payoff has an optimal pure semi-stationary policy (i.e., a semi-Markov policy independent of decision epoch count)....


Speech Analysis and Synthesis Based on Dynamic Modes

by Julio Vargas

2024, IEEE Transactions on Audio, Speech, and Language Processing

In this paper, the source-filter model of speech production is adapted to represent the speech signal as the superposition and convolution of a dynamic source and resonant modes. The aim is to increase the resolution of the...


Automatic speech recognition system with pitch dependent features for Punjabi language on KALDI toolkit

by Aakansha Mishra

2024, Applied Acoustics

In this paper, an improvement in the performance of an automatic speech recognition (ASR) system is achieved with the help of pitch-dependent features and probability-of-voicing estimated features. The pitch-dependent features are useful for tonal...


Real-time noise synthesis with control of the spectral density

by Myriam Desainte-Catherine

2024

We propose in this paper a spectral synthesis model to generate noisy sounds with independent control parameters for spectral density and spectral envelope. Algorithms defining in an efficient way these spectral properties from the...


Automatic Transcription of Polyphonic Piano Music Using Genetic Algorithms, Adaptive Spectral Envelope Modeling, and Dynamic Noise Level Estimation

by Aníbal Ferreira

2024, IEEE Transactions on Audio, Speech, and Language Processing


Pitch-Synchronous Multiresolution Analysis of Music Signals

by César Abad

2024

In this thesis a novel multiresolution approach for note detection in a polyphonic mix is proposed. The idea is to use a set of wavelets whose lengths are adapted to the theoretical fundamental period of musical notes. Using the typical...

Figure 5.3: 3D representation of the PSWS and the FPSWS of the signal in Figure 5.2 for note C.

Figure 5.4: 3D representation of the PSWS and the FPSWS of the signal in Figure 5.2 for note G.

(a) Window length: 128 samples; hop-size: 64 samples. (b) Window length: 2048 samples; hop-size: 64 samples.

Figure 1.2: Cross-correlation of spectral magnitudes for several hours of music radio. The straight lines indicate correlations between harmonically related frequencies (figure extracted from [1]).

Transform of a signal which contains all the notes of a wavetable-synthesized piano from Ap to Gg. Note that the spectral peaks (harmonics) are much more concentrated in the low frequencies, while the peaks in high frequencies are more separated. By the very nature of musical notes, if we want to describe the location of these peaks we will need much more resolution for low frequencies than for high frequencies. Note that between 9000 Hz and 12000 Hz only 5 different peaks can occur in this piano. The same number of peaks must be resolved between 300 Hz and 400 Hz. Using an FFT with a 1024-point window at a sample frequency of 44100 Hz we have a resolution of about 43 Hz. That is, we have 2 or 3 spectral bins available around 300-400 Hz, which is not enough to resolve the 5 harmonics, while between 9000 and 12000 Hz we will have about 70 spectral bins, far more than enough to resolve the same number of harmonics.

Figure 2.2: Frequency representation of a signal containing all notes from Ag to Gg in a wavetable-synthesized piano.
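The resolution figures quoted in the passage above can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers stated in the text (44100 Hz sample rate, 1024-point FFT, and the two frequency bands); the variable and function names are illustrative assumptions.

```python
sample_rate = 44100   # Hz, as stated in the text
fft_size = 1024       # points, as stated in the text

# Spacing between adjacent FFT bins.
bin_width = sample_rate / fft_size

def bins_in_band(low_hz, high_hz):
    """Number of FFT bins that fit inside the band [low_hz, high_hz]."""
    return (high_hz - low_hz) / bin_width

low_band = bins_in_band(300, 400)       # band that must hold 5 harmonics
high_band = bins_in_band(9000, 12000)   # band with the same 5 peaks

print(f"bin width: {bin_width:.2f} Hz")
print(f"bins in 300-400 Hz: {low_band:.2f}")
print(f"bins in 9000-12000 Hz: {high_band:.2f}")
```

The same five peaks get roughly thirty times more bins in the high band than in the low band, which is the imbalance that motivates a multiresolution analysis.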

Given this, it seems to be a good idea for music analysis and synthesis to use some kind of transform that has a good time resolution at the expense of a poorer frequency resolution for high frequencies (which is somehow unnecessary in music signals, as we have seen). That is, divide the time-frequency plane in Heisenberg boxes distributed in a more advantageous way. This is where the Wavelet transform enters the scene.

Figure 2.3: Short Time Fourier Transform tiling of Heisenberg boxes for a 32-point window length. Note that the more frequency resolution we need, the more time samples will be necessary to calculate the FFT, and so the less time resolution we will have.

Figure 2.4: Typical dyadic time frequency decomposition. A stands for approximation level and D for detail level. This example is a decomposition of level 5.

generate an orthonormal basis, such that any finite energy signal can be decomposed over this wavelet basis. The wavelet function ψ is associated with a high pass filter and so with the detail level of the decomposition.

look at the scalogram in search of components corresponding to the fundamental frequency of the musical notes. In Fig. 3.1 the scalogram of a pure tone signal is plotted. The difference of this representation with respect to the STFT spectrogram is clear: in the vertical axis we have the fundamental frequencies of the musical notes equally spaced. That is a direct consequence of a suited multiresolution analysis.

Figure 3.1: 2D time-frequency plot of CWT coefficients using a complex Morlet 1-5 wavelet. The signal being analysed is a pure tone of 440 Hz. (Figure extracted from [2]).

Table 4.1: Structure of coefficients in a DWT dyadic decomposition using the wavedec MATLAB function.

Let us analyze the properties of each one of the columns of a PSWS. We will use the MATLAB function wavedec. The output of this function is a “structure” of coefficients that contains one sub-vector for the approximation coefficients, and another 8 sub-vectors for the details. If we use a level 8 decomposition we have the distribution of coefficients shown in Table 4.1. Note that the Haar discrete wavelet decomposition is critically sampled: we obtain 256 wavelet coefficients from 256 temporal samples. If we place these coefficients ordered by level in one column at a time, we can build a “spectrogram-like” scalogram in which the frequency information is quite coarse (we only have 8 different levels to distinguish frequencies, each level centered in one different octave) but nevertheless can be very useful as we will see.
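The critical-sampling property described above can be illustrated with a small pure-Python Haar decomposition. This is only a sketch: it is not MATLAB's wavedec, the function name `haar_wavedec` is hypothetical, the sign convention of the detail coefficients is an assumption, and the input length is assumed to be divisible by 2 at every level.

```python
import math

def haar_wavedec(signal, levels):
    """Orthonormal Haar DWT. Returns [cA_n, cD_n, cD_(n-1), ..., cD_1],
    mirroring the coarse-to-fine ordering described for wavedec."""
    scale = 1.0 / math.sqrt(2.0)
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2 != 0:
            raise ValueError("length must be divisible by 2 at every level")
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) * scale for a, b in pairs])  # high-pass half
        approx = [(a + b) * scale for a, b in pairs]         # low-pass half
    return [approx] + details[::-1]

# 256 temporal samples and a level 8 decomposition, as in the text.
signal = [math.sin(2 * math.pi * n / 32) for n in range(256)]
coeffs = haar_wavedec(signal, 8)
sizes = [len(c) for c in coeffs]
print(sizes, "total:", sum(sizes))
```

The sub-vector sizes add up to the original 256 samples (one approximation coefficient plus 1 + 2 + ... + 128 detail coefficients), and because the transform is orthonormal the signal energy is preserved across the coefficients.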

Figure 4.3: A tone of 880 Hz produces a DC pattern at level 4.

Figure 4.4: We add 8 harmonics with linearly decreasing amplitudes to the 880 Hz pure tone in Figure 4.3. Note that the first, the third and the fifth harmonics are the fundamentals of the A note in octaves 6, 7 and 8 respectively, producing the corresponding characteristic pattern at levels 5, 6 and 7. Note also the reinforcement effect: the other harmonics have contributed to raise the level of the coefficients at level 4 too.

Figure 4.5: In this case the fundamental has been removed and only the 8 harmonics are present. Note that even so, the DC pattern at level 4 indicates the presence of the fundamental of an A5 note, even though no tone with 880 Hz is present in the signal. When the fundamental is present, the two DC contributions sum, and coefficients at the corresponding level are raised accordingly, as we see in Figure 4.4.

Figure 4.7: PSWS of a pure tone of 1760 Hz. Note that although there are nonzero coefficients at levels 5, 6, 7 and 8, only level 5 coefficients have nonzero mean within each column.

Finally, we show in Figure 4.9 the same FPSWS of Figure 4.8 but seen “from above”, as a contour field. Horizontal lines that delimit the boundaries between levels have been added. Note that each level will correspond to the fundamental frequency of the A note in one different octave.

Figure 4.8: Filtered PSWS (FPSWS) of a pure tone of 1760 Hz. All coefficients with zero mean have been removed by a low pass filter that operates locally at each level across the columns.

Figure 4.9: The same FPSWS in Figure 4.8 but seen from above as a contour field.

The philosophy of this representation is that, whenever a pure tone of any frequency that coincides with the fundamental frequency of the A note at any octave is present in a signal, it will appear as nonzero coefficients in the corresponding level of the FPSWS contour field representation. In any other case, the FPSWS would be ideally flat.

Figure 5.1: Spectrogram of a sequence of 24 notes, played in order from C5 to B6 on a wavetable-synthesis piano. FFT window size: 512 samples; hop size: 256 samples.

Let us analyze a signal that is a sequence of 24 notes, played in order from C5 to B6 on a wavetable-synthesis piano. The spectrogram of this signal is shown in Figure 5.1.

Figure 5.6: Spectrogram of a sequence of the chords C - Cm7 in the 5th octave, and C - Cm7 in the 6th octave played on a wavetable-synthesis piano. FFT window size: 512 samples; hop size: 256 samples.

Figure 5.7: FPSWS’s of the signal on Figure 5.6.

Let us now analyze the same signal in Figure 5.6 in a not-so-ideal situation. We have added a hi-hat at quarter notes from the first bar, a snare drum at half notes in the second and fourth bar, and a bass drum at half notes in the third and the fourth bar. The spectrogram of this signal is shown in Figure 5.9.

Figure 5.9: Spectrogram of a sequence of the chords in Figure 5.6 plus a drum pattern in which the notes coincide exactly first with the hi-hat (first bar), then with the hi-hat and the snare drum (second bar), then only with the hi-hat and the snare drum (third bar), and finally with the hi-hat, the bass drum and the snare drum (fourth bar). FFT window size: 512 samples; hop size: 256 samples.

So far, we have used only a wavetable-synthesis piano for our analyses. In the next example, an excerpt taken from a recording of the beginning of Beethoven’s “For Elisa” played on a real piano is used. The spectrogram is shown in Figure 5.12.

Figure 5.12: Spectrogram of the beginning of Beethoven’s “For Elisa” played on a real piano. FFT window size: 2048 samples; hop size: 512 samples. Only low frequencies are represented in the picture.

Figure 5.10: FPSWS’s of the signal on Figure 5.9.


Sound synthesis using an allpass filter chain with audio-rate coefficient modulation

by Henri Penttinen

2023, Proceedings of the 12th International Conference on Digital Audio Effects (DAFx-09), Como, Italy

This paper describes a sound synthesis technique that modulates the coefficients of allpass filter chains using audio-rate frequencies. It was found that modulating a single allpass filter section produces a feedback AM–like spectrum, and...

Additionally, this work is related to the audio-rate pick-up point modulation of waveguide models by Van Duyne and Smith [10], distributed string tension modulation by Pakarinen et al. [11], the adaptive FM technique described by Lazzarini, Timoney and Lysaght [12], and the feedback AM synthesis technique briefly discussed in [13]. Other related research on modulated filters includes the work by Greenfield on dithered digital filters, which focused on increasing numerical accuracy [14].

Figure 2: First-order allpass filter coefficient modulation at audio rate. Sinusoidal carrier (1 kHz) and modulator (100 Hz) signals, linearly increasing modulation index (0...0.99) between (1...4) seconds. The chain length determines the bandwidth, while the modulation index controls the amount of higher end spectral content. (a) Single allpass stage, (b) chain of 70 allpass stages.

Proc. of the 12th Int. Conference on Digital Audio Effects (DAFx-09), Como, Italy, September 1-4, 2009

Figure 3: Effect of the modulation index M: (a) M=0.99, (b) M=0.64. Other parameters are as in Fig. 2(b).

We notice that the chain length N controls the maximum amount of frequency deviation, as shown in Fig. 4, while the modulation index M controls the deviation within the limits imposed by N. M also controls the shape of the applied frequency modulation, which is sinusoidal with low M values, but with higher M values it develops towards the non-sinusoidal forms shown in Fig. 4. It should be noted that the maximum amount of frequency deviation depends also on the modulation frequency fm, as indicated by (17).

Figure 5: Aliasing effects of the (a) chain length N, and the (b) modulation index M.

Figure 6: Comparison of the synthesized pseudo sawtooth and the ideal sawtooth partial amplitudes as a function of the (a) chain length N and the (b) modulation index M.

In the second part of the analysis, the allpass chain length N is set to a constant 100 in the 120 Hz test case and to a constant 6 in the 1.2 kHz test case, and the modulation index M is varied from 0.6 to 0.9 in steps of 0.1. In addition, a modulation index 0.99 was also tested. The NMR values of these tests are plotted in Fig. 5b, in which it can be seen that the NMR values of both the low and the high frequency tests increase almost linearly as the modulation index increases. Yet again, this is an expected result, as with a larger modulation index the allpass chain applies more phase distortion to the input signal which causes more aliasing. As can be seen from Fig. 5, the aliasing of the sawtooth is quite moderate with these parameters. However, as a trade-off the levels of the sawtooth partials are far from the levels of the ideal case. This is illustrated in Fig. 6, where the error of the partial amplitudes is plotted for both test cases. Now, smaller values of amplitude error indicate a more accurate match to the partial amplitudes of the ideal sawtooth. However, it should be noted that when either the allpass chain length or the modulation index is increased, the amplitude error decreases. Yet, as discussed above, this leads to more aliasing distortion.

Figure 7: Comparison of (a) AM, (b) sinusoidal FM, and (c) CM synthesis. In each subfigure, the amplitude domain waveform is plotted on top of a frequency domain spectrogram; time runs on the horizontal axis. Sinusoidal carrier and modulator, fc = 1000 Hz, fm = 100 Hz, CM chain length N = 70.

Figure 8: Pd patch used for the synthesizer sounds. The CM algorithm is implemented inside the ap~ external, which takes two audio rate and one control rate input.

Figure 9: Two allpass chains in a cascade produce complex dynamic spectra. Modulation indices ramp from 0 to 0.5 for the first chain and from 0 to 0.99 for the second.

It is also possible to imitate classic virtual analog waveforms by using the instrument of Fig. 8 with frequency ratios 1:1 and 1:2 between the carrier and the modulator signals, and emulating the lowpass filter cutoff frequency parameter of the subtractive

Table 1: CM synthesis parameters.

The frequency ratio fc/fm defines the spectral structure of the synthesized sound. Consonant ratios, i.e., when the ratio can be expressed as n/m using small integer n and m values, produce harmonic timbres. Irrational frequency ratios produce inharmonic spectra. A slight detuning between fc and fm results in beating.


Progress in the BBN 2007 Mandarin Speech to Text system

by long nguyen

2023, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing

In this paper, we describe the BBN 2007 Mandarin Speech-to-Text system developed for the GALE Evaluation 2007. In comparison to the BBN 2006 Mandarin system, we achieved 25% relative reduction in character error rate on the most important...


Towards the Automated Analysis of Simple Polyphonic Music: A Knowledge-based Approach

by Juan Pablo Salcedo Bello

2023

Music understanding is a process closely related to the knowledge and experience of the listener. The amount of knowledge required is relative to the complexity of the task in hand. This dissertation is concerned with the problem of automatically decomposing musical signals into a score-like representation. It proposes that, as with humans, an automatic system requires knowledge about the signal and its expected behaviour to correctly analyse music. The proposed system uses the blackboard architecture to combine the use of knowledge with data provided by the bottom-up processing of the signal's information. Methods are proposed for the estimation of pitches, onset times and durations of notes in simple polyphonic music. A method for onset detection is presented. It provides an alternative to conventional energy-based algorithms by using phase information. Statistical analysis is used to create a detection function that evaluates the expected behaviour of the signal regarding onsets. Two methods for multi-pitch estimation are introduced. The first concentrates on the grouping of harmonic information in the frequency domain. Its performance and limitations emphasise the case for the use of high-level knowledge. This knowledge, in the form of the individual waveforms of a single instrument, is used in the second proposed approach. The method is based on a time-domain linear additive model and it presents an alternative to common frequency-domain approaches. Results are presented and discussed for all methods, showing that, if reliably generated, the use of knowledge can significantly improve the quality of the analysis.

To my home, the one that has neither walls nor a roof, but rather Salvador, Maritza, Jesús and la Mava.

Completing my doctoral research in a foreign country and in a language which is not my own has been quite an experience, one that can be moderately described as "intense". This dissertation has only been possible given the immense support and cooperation that I received from family, friends and colleagues. For them, to whom I am so profoundly indebted, I dedicate the following remarks. First of all, I would like to thank my supervisor Prof. Mark Sandler for opening the doors to a life-changing experience. For having the courage to recruit me when none of my words made any sense. For the continuous practical and emotional support. For the good advice during the length of this project. Thanks to the Joint Information Systems Committee (JISC) in the United Kingdom, the National Science Foundation (NSF) in the United States and the Fundación Gran Mariscal de Ayacucho in Venezuela for funding my research and for covering my living expenses in London during the last three years.
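The phase-based onset detection mentioned in the abstract above can be illustrated with a magnitude-weighted phase deviation function. This is a minimal sketch under stated assumptions, not the dissertation's own statistical detection function: the names `stft` and `phase_deviation`, the frame length and the hop size are hypothetical choices. For each frame it measures how far every bin's phase departs from the value a steady sinusoid would predict (the second difference of phase), so the curve stays near zero in stationary regions and rises around a note change.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Return the complex short-time spectrum, one row per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def phase_deviation(signal, frame_len=512, hop=256):
    """Magnitude-weighted mean absolute second difference of phase.

    Entry j is computed from frames j, j+1 and j+2 (hypothetical layout).
    """
    spectrum = stft(signal, frame_len, hop)
    phase = np.angle(spectrum)
    # A steady sinusoid advances its phase by a constant amount per hop,
    # so its second difference is zero modulo 2*pi.
    second_diff = phase[2:] - 2.0 * phase[1:-1] + phase[:-2]
    deviation = np.abs((second_diff + np.pi) % (2.0 * np.pi) - np.pi)
    # Weight by magnitude so that near-empty bins do not dominate.
    weights = np.abs(spectrum[2:])
    return (weights * deviation).sum(axis=1) / (weights.sum(axis=1) + 1e-12)
```

Peaks of the returned curve are only onset candidates; a threshold or peak picker would still be needed to turn them into onset times.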

A new algorithm for instantaneous F0 speech extraction based on Ensemble Empirical Mode Decomposition

by Maria Eugenia Torres

2023, European Signal Processing Conference

In this work, a new instantaneous fundamental frequency extraction method is presented, with the attention especially focused on its robustness for pathological voices processing. It is based on the Ensemble Empirical Mode Decomposition... more

Analysis and resynthesis of polyphonic music

by Douglas Nunn

2023

This thesis examines applications of Digital Signal Processing to the analysis, transformation, and resynthesis of musical audio. First I give an overview of the human perception of music. I then examine in detail the requirements for a... more

As the aim was deduction of timing information, a smaller FFT, of size 32, was used. The aim was to determine which of the pianists it was. The spectrum is shown in Figure 98.

None of this work has previously been submitted for a degree in this or any other university.

... windscreen wipers or a squeaky gate hinge. The printer noise shown in Figure 2 is in a twelve-tone scale and a fairly regular tempo; the low E is the ...

... the 'hypersonic effect', and was demonstrated using an Indonesian gamelan ensemble. However, ultra...

The Ionian (major), Dorian, Phrygian, Lydian, Mixolydian, Aeolian (minor), and Locrian modes refer to ...

Figure 9 - Responses of six fibres in the auditory nerve of a cat.

... processes. However, some studies have pointed out that this mechanism could also form the ...

Figure 13 - Inharmonicity of A0 from Blackham's data.

Data from tests by Blackham on the lowest note, A0 (27.5 Hz), is shown in Figure 13, and data from tests by Schuck on a higher note, F1 (44 Hz), on a different piano, is shown in Figure 14.

The above has discussed frequency stretching, but another factor counteracts this. Given one tone and ...

Many, indeed most, other instruments also have formants. It has been suggested, in the context of a Stradivarius violin, that a third formant at the sum of the frequencies of the first two is an important ...

... examples. Plomp concludes that the maximum effect ...

Another case is when we hear a violin section playing in unison. Even if we know that there are sixteen players, we cannot distinguish them, and hear it as a single but very complex instrument.

Offset times are probably not detected with as much resolution as onset times because note durations have less musical importance and note decays are usually less abrupt.

If all the notes are C, E, and G, except for one G#, then change the G# to G, then we might correct an ...

However, transcription of polyphonic music is particularly difficult because before we can classify ...

The FFT is commonly used for frequency analysis where the range of interest covers a relatively small bandwidth, and we wish to determine the frequencies involved. However, music covers at least ten ...

For polyphonic music, this approach does not work due to the interference between notes, and a ...

... greyscales to different amplitude ranges, or by the contours of a spectrogram. It is difficult to pick out individual peaks in the standard spectrogram, so the axis was tilted as shown in Figure 28.

Figure 29, after Risset [Risset 82], illustrates the framework for analysis and resynthesis ...

It should, however, be remembered that these assumptions may not apply to non-acoustic instruments and thus detract from the generality of an analysis method. It is easy to form artificial tones that do not ...

Other techniques include the phase vocoder, which is often used for speech, and linear ...

... order to investigate the possibilities of real-time synthesis. The hardware is shown schematically in Figure 32.

The DAC buffering is shown in Figure 35, and the DAC subsystem is shown in Figure 36.

... illustrated in Figure 37. This was later extended to analysis of synthesised musical sounds. ... was designed for separating speech from the other sources described above ...

In his system, the front-end analysis provides several outputs: an overall signal intensity, a set of ...

... Bregman illusion, described in the previous chapter. Ellis uses the continuity illusion to back up his case ...

... analyser task handles the computation, in this case the Octave Spectral Analysis which includes filtering and FFTs. Eventually, these may be on separate processors.

The perfect reconstruction property implies that sharp filters are not an absolute necessity. However, it ...

The FIR filter (FIRK6.FLT) used has a length of 255. It was designed using DFDP, the Digital Filter Design Package. Figure 43 shows the frequency response, and Figure 44 shows detail of the band ...

6.4.8 FFT buffering

Within each octave, the processing load fluctuates with time. When the FFT buffer is full, the FFT is ...

Figure 48 - Logarithmic spectrogram of 30 seconds of Mendelssohn's Sonata 3 for Organ.

The next program, called DISTRIB, also reads the files produced by the Octave Spectral Analysis and plots a histogram of the energy separately for each octave. It also prints out a characterisation file ...

Once we have the spectra from each octave, we then try to determine the sinusoids contained in it. This procedure is either imprecise or computationally expensive, or somewhere in-between, in that there is no quick way to calculate it accurately, and several ways of approximating the real spectrum, such as AR ...

... discussion, all frequencies in the following paragraph are normalised with respect to the analysis ...

First, a program was developed to test this deconvolution procedure. The test data was much simpler ...

The first diagram above shows the effect of the Blackman-Harris weighting; the peaks are much ...

... by time using a program called TRAKSIN to give longer entities called chains. These are similar to the 'auditory elements' of Brown. This is done by linking sines in neighbouring blocks using a simple birth-death model as shown in Figure 54. Note that sinusoids are only linked at their ends ... It is possible for a sinusoid ...

... threshold. The resultant sets of linked sines are ...

... the sines into virtual memory. It then reads the chain file and outputs the tracks of linked sines ...

Since music relies on near-integer frequency ratios, we expect the harmonics of different notes to overlap. Thus, chains can be claimed several times, in which case the total amplitude is shared between ...

... virtual memory. We then carry out an exhaustive search, looking for simultaneous frequencies that are close to being harmonically related. A partial at (n' × f0) is taken to be the nth partial of f0 if |n'/n| is ...

We match up to the 16th harmonic, and the resultant note groups are as shown in Figure 60. The reason that more harmonics are not sought is mainly due to memory restrictions.

... about 'typical' instrument spectra. After this, we try to smooth the amplitude envelopes of each harmonic by adjusting the strengths of the ...

... workings of the analysis system. These form (audio examples 1 and 2). The pieces were around half a ...

... that real-time performance is marginally not possible, but could possibly be achieved using shorter filters.

... block contains high frequencies. This is due to the fundamental assumption of the FFT that the block it analyses is periodic, as illustrated in Figure 69.

Figure 71 - Hamming window.

Figure 72 - Blackman 2-term window.

Figure 73 - Blackman minimum 4-term window.

7.2.2.6 Duration and time resolution

The first five notes examined the effect of changing the duration of the note while the frequency and ...

... to be a proportionately weaker component. We could expect this to affect the third, fourth, and fifth notes. The spectra, shown in Figure 74, show that this is indeed the case. The first two notes are long ...

... below shows how many were picked out, and how many partial tracks were formed; the data is illustrated in Figure 76.

Figure 77 - Results of deconvolution for eight thresholds.

SHOWTRX for MTest2 for the eight deconvolution thresholds.

Figure 78 - Tracked partials for eight thresholds.

... these coincide, corresponding to 'correct' transcription, is shown in dark grey; this corresponds to the holes we would see if we held two piano rolls together. It is noted that this assumes a quantisation of frequency to a twelve-tone set and would be unsuitable for comparing notes with glissandi.

... clear that with a lower threshold more possible notes are removed, but many are still removed for being ...

... seconds of the Sonata III for organ written by Felix Mendelssohn between August 1844 and January 1845. The audio was created in mono 16-bit linear format at 32 kHz using Csound files ...

... (audio example 3). All audio examples are listed in Appendix Q. It will be noted that the current system does ...

Figure 87 - Battle output for Mendelssohn.

The scores are compared using READASC, as shown in Figure 88. As before, light grey is expected (or hoped-for) notes, dark grey is predicted notes, and the overlap is in black.

... positives (extra notes), correct identifications, and false negatives (missed notes).

... after three iterations. In this figure, time runs from top to bottom and frequency runs from left to right.

Figure 90 - Comparison of scores for thresholds of -24/30/36/42 dB.

This analysis was repeated for other deconvolution thresholds. Thresholds of -6, -12, and -18 dB gave no output; below are the compared scores for -24, -30, -36, and -42 dB.

... captures many of the details are recognisable, even for the relatively high polyphony.

Deconvolution was tried at various thresholds, but none of them was able to even remotely identify the ...

... trombone 118, for a total of 511. The score (in C) of the short extract above is shown in Figure 96.

It was sampled from CD and converted to mono with a sample rate of 32000 Hz. The test for this piece was that the pianist was known to be one of the 28 mentioned in a previous study on expressive ...

... problem arises with the display: the colours are exclusive-or'ed on screen. This means that two identical notes will rub each other out.

Figure 102 - Spectrum of start of Grieg Piano Concerto.

There are 166 notes in this excerpt, not including the initial timpani roll and orchestra hit. The written piano polyphony is often 8, and has a maximum of 17 during the arpeggio in bar 4 played with the ...

Some of the descending line, and some later chords, have been captured, but most of the details have not been recognised.

The output score for a threshold of -24 dB is shown in Figure 112. It shows that the descending scale ...

Figure 115 - Very low, low, average, high, and very high frequencies.

... convenient but inaccurate 'scientific tuning' where middle C has a frequency of 256 Hz rather than 261 ...

... frequency is effectively lowered. Roads also discusses FIR filtering by convolution.

Conventional amplitude modulation of a signal can be expressed as multiplication of the input quanta by ...

8.6.5.9.5 Shephard rhythms

Whereas Shephard tones have an incessantly rising ... but a constant pitch height, Risset demonstrates what we might call a Shephard rhythm, which incessantly slows down while adding faster beats [Risset 91]. These too can be made using a set of impulses regularly spaced in the log-time domain.

Reverberation is the same as echoes except the sound is diffuse. This can be modelled as above using a large but finite density, i.e. Q({set}, 0, large, {set}). This ...

... the density, which should be infinite to avoid frequency distortion, as shown in Figure 131.

In non-real-time situations, it is just as easy to create non-causal effects, achieved by convolution with a ...

A pedal note is one played using the lowest mode of vibration of a brass instrument.

In species 12, shown in Figure 138, all the quanta share a magnitude and an alfa. An example is shown in Figure 139; it represents a basic melody with no variation in magnitude.

... replace it by a set of quanta summing to a rectangular pulse.

Species 13 has only one density, and is illustrated in Figure 140. This species can represent a melodic line, as in Figure 141, although the quanta have the same density.

8.6.5.16.1 MIDI atoms

The notes in a MIDI file can easily be 'converted to' an atom of species 15.

... Slawson cites research showing fixed formant for violins and double-reed woodwin...

The construction of larger musical structures is illustrated below. I show the musical intention and the ...

8.9 Analysis using quanta

... individually to the DAC, we can only manage 87 in real time. Thus, if buffering is implemented we ...

Xc are intermediate results, and Xs is the spectrum data.

... convolved with a periodic input spectrum. As this is independent of the period P, we will normalise it using P = 1 s. This gives the final result: ...

The first term here indicates the linear phase shift, and the second is the amplitude response, showing how each frequency leaks into neighbouring bins in the DFT.

... performer played the upbeat literally but delayed the chord. By comparing this to the first panel of ...

The BiMouse is made using two standard and ideally identical mice, as shown in Figure 155.

Four serial ports is normally the maximum on a PC; most systems have only two ports. However, an interesting installation by Okamoto at the 1996 ICMC used specially-designed hardware to allow sixteen mice to control real-time polyphonic synthesis.

This shows that the absolute error does not exceed 2^-16 when we truncate the end of the series. First we try to form ...

A DISSERTATION SUBMITTED TO THE FACULTY OF SCIENCE IN CANDIDACY FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

This table summarises the distinctions between the definitions, ignoring definitions 1 and 6.

The sound power level is the total power emitted by a source in all directions. It is defined by: ...

... reference pressure of 20 µPa. All logarithms are to base 10. This gives the following levels: ...

Table 5 - Dynamic ranges of several instruments.

... a distance of 50 metres, and another singing pianissimo at 1 metre. The listener will judge the distant ...

The resolution is poorer at lower frequencies, and is never better than 2 Hz. Throughout most of the range, there are about 30 JNDs in one critical band. The table below summarises the results.

Representations based on events offer parallelism, compactness, and intuitiveness, but most of the ...

... theoretically impossible. A central hypothesis of this research is that any musical waveform contains sufficient redundancy to be coded more efficiently. To exploit this redundancy, however, we must first break the waveform into its constituent musical entities, i.e. we must carry out source separation.

Assuming linear interpolation of amplitude and frequency, and the use of a lookup table for sine generation, the following table shows how many operations are required per partial per sample.

A change in one actual controller, such as tongue-palate distance, will affect all of the harmonics, often ...

... reproduce the exact nuances of a sound is compromised when such approximations are made.

Table 14 - Specifications of PCs.

The PC experiments were performed on one of three machines, summarised in the table below. 'Dan' is Dan Technology, and 'CMC' is Cambridge MicroComputers.

The theoretical computational power and other features of the four systems are as follows: ...

The task we are undertaking, the analysis, transcription, transformation, and resynthesis of polyphonic ...

The table below summarises most of the polyphonic systems above. EPA stands for expressive performance analysis. In many cases the aims fall into several categories.

Table 18 - Dependence of OSA timing on disk and FFT buffer sizes.

Table 21 - Parameters of windows used in analysis.

$W(i) = a_0 - a_1\cos(2\pi i/N) + a_2\cos(2\cdot 2\pi i/N) - a_3\cos(3\cdot 2\pi i/N)$, and the values are given in the following table.

... included in 2 blocks if it falls over a boundary. At this stage, it is informative to consider what are the shortest and lowest notes that are likely to occur.

... programs used to animate the output of the spectral display program. This can also be applied to other sets of images such as those below. This produces an animation of the effect of altering ...

7.2.2.8 Recognition of quiet notes

The number of partial tracks is shown in the table below and in Figure 79. The correct numbers are ...

... of 6 bits is equivalent to a halving of amplitude.

... have ONE partial, i.e. the fundamental.

Table 28 - Quantitative scores for MTest2 recognition.

The outputs for MTest2 can be summarised in the following table. It shows the total number of notes in ... these categories, and these are used to calculate the final 'accuracy' figure.

Table 31 - Comparison of comparison methods.

... neither evaluated or conveyed. ... Notes added from harmonics ... each track is viewed as potentially a note, and notes may share partials.

... limits the number picked per octave to 6. The battle used the flag -t 0.05 to set the minimum length of a note to 50 milliseconds. The results for several thresholds are shown below.

... perception of 'the sound of a violin section', or is it that which separates this into 'the sounds of sixteen individual violins'? If the former, we can ask how violins and other instruments fuse together; if the latter, we can ask how a computer model could be constructed so as to far exceed the capabilities of our ...

Again, the transcription is poor. As the 'correct' score is not known, it is not possible to quantify the ...

The analysis was done using Pickout (v0.24) with a threshold of -24 dB ... minimum note length of 0.5 seconds.

... shows the flattened comparison. The average level is -15.51 dB.

... at G#3, E#4, etc., corresponding to the third, fifth, etc. harmonics of the fundamental. As can be seen from the spectrum, the odd harmonics are considerably stronger than the even harmonics. This is to be ...

It is very simple to form the ideal score. The quantitative evaluation is as follows: ...

However, some of the modes are nearly harmonic, so it is informative to see if these are sufficient, by analysis of the above short segment. The results are shown below.

The notation for quanta is extended using braces such that, for example, Q({t0,t1,t2,t3},f,a,m) represents an atom of four quanta with different times but the same f, a, and m.

8.6.2.3 Multiplication and convolution

When two atoms are multiplied, the resultant species is given by the following table.

Table 44 - Algorithm for smoothed FFT.

The algorithm for N=8 is as follows: ...

Table 48 - List of audio examples.

These audio examples can be found on the enclosed cassette tape and on the World-Wide Web at the ...
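The four-term cosine window formula quoted with Table 21 can be evaluated directly. The sketch below is illustrative only: the thesis's own coefficient table is not reproduced in this excerpt, so it assumes the commonly published minimum 4-term Blackman-Harris coefficients, and the function name `cosine_sum_window` is hypothetical.

```python
import numpy as np

# Assumed coefficients (a0, a1, a2, a3): the widely published minimum
# 4-term Blackman-Harris values, not taken from the thesis's Table 21.
BLACKMAN_HARRIS_4 = (0.35875, 0.48829, 0.14128, 0.01168)

def cosine_sum_window(n_points, coeffs=BLACKMAN_HARRIS_4):
    """Evaluate W(i) = a0 - a1*cos(2*pi*i/N) + a2*cos(2*2*pi*i/N) - ..."""
    i = np.arange(n_points)
    window = np.zeros(n_points)
    for k, a_k in enumerate(coeffs):
        # Signs alternate with the term index k, as in the formula.
        window += (-1) ** k * a_k * np.cos(2.0 * np.pi * k * i / n_points)
    return window
```

Passing `(0.54, 0.46)` as `coeffs` gives a Hamming window in the same form, which makes it easy to compare the windows listed in the figures.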

Progress in the BBN 2007 Mandarin Speech to Text system

by Long Khánh Dương Nguyễn

2023, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing

Estimation of glottal closure instants by considering speech signal as a spectrum

by Sripriya Natarajan

2023, Electronics Letters

Close to glottal closure instants (GCIs), the speech signal is expected to change its amplitude rapidly and, at GCIs, it is expected to have strong negative peaks. A novel algorithm that exploits these two properties for the estimation of... more

Speech analysis and synthesis using an AM–FM modulation model

by Petros Maragos

2023, Speech Communication

A new light is thrown on the Portnoff [1] speech signal timescale modification algorithm. It is shown in particular that the Portnoff algorithm easily accommodates expansion factors bigger than 2 without causing reverberation nor... more

3. In an algorithm of this type based on phase vocoder techniques, very great care must be taken in performing the phase unwrapping. In order to calculate the term $v(n,\omega)$, we implemented a phase unwrapping procedure which expresses $v(n,\omega)$ only in terms of $v(n-1,\omega)$, $y(n-1,\omega)$ and $y(n,\omega)$, where $y(n,\omega)$ is the instantaneous phase of the DFT coefficient at the instant $n$ and frequency $\omega$. The next section will give the explicit calculation for $v(n,\omega)$. Since $D$ and $I$ must be small, the set of the values of $B$ is quite limited in size. The action of the zero-padding operator $1{:}I$, which inserts $I-1$ zeroes between two DFT coefficients belonging to the same frequency channel, followed by that of the interpolating filter $f(m)$ and the decimating operator $D{:}1$, results in the spectral level time-scale modification of the signal. The phase is multiplied by $1/B - 1$ instead of $1/B$ in order to prevent the phase multiplication from affecting the phase modulation term.

The interest of Lagrange interpolating filters for this type of application resides in the facts that they are zero phase and preserve the support from which the interpolated points are calculated.

A chaotic phase jump was observed in the region close to the origin. The phase unwrapping procedure must take this into account.

In this case, the DFT coefficient makes a $2\pi$ jump.

In this case, the DFT coefficient makes a $\pi$ jump.

... unwrapped phase at the instant $n$ by taking the previous phase as the current phase, $v(n,k) = v(n-1,k)$, to avoid the chaotic phase jump. This constitutes the third modification introduced into the Portnoff algorithm as previously stated in section 2 above.
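The recursive unwrapping described above, where the unwrapped phase depends only on its previous value and on two successive instantaneous phases, can be sketched as follows. The paper's explicit formula is not reproduced in this excerpt, so the recurrence below is the common phase-vocoder form and should be read as an assumption; the names `princarg` and `unwrap_channel` and the `floor` threshold are hypothetical. The rule of reusing the previous phase near the origin follows the text.

```python
import numpy as np

def princarg(phase):
    """Wrap phase values into the interval [-pi, pi)."""
    return (phase + np.pi) % (2.0 * np.pi) - np.pi

def unwrap_channel(inst_phase, omega_k, hop, magnitude=None, floor=1e-8):
    """Recursively unwrap the phase of one DFT channel.

    inst_phase : wrapped phase y(n, k) at frames n = 0, 1, ...
    omega_k    : channel centre frequency in radians per sample
    hop        : analysis hop size in samples
    magnitude  : optional |X(n, k)|; frames below `floor` reuse the
                 previous unwrapped phase, v(n, k) = v(n - 1, k)
    """
    v = np.empty(len(inst_phase), dtype=float)
    v[0] = inst_phase[0]
    expected = omega_k * hop  # phase advance of a tone at the bin centre
    for n in range(1, len(inst_phase)):
        if magnitude is not None and magnitude[n] < floor:
            v[n] = v[n - 1]  # hold: the phase is unreliable near the origin
            continue
        # Deviation from the expected advance, wrapped to [-pi, pi).
        deviation = princarg(inst_phase[n] - inst_phase[n - 1] - expected)
        v[n] = v[n - 1] + expected + deviation
    return v
```

The recurrence recovers the true phase only while the per-hop deviation from the bin centre stays below pi, which is one reason the hop size has to remain small.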

A system for automatic detection and correction of detuned singing

by Bożena Kostek

2023, The Journal of the Acoustical Society of America

The aim of the paper is to show a system engineered for automatic detection and correction of detuned singing. For this purpose, existing methods of fundamental frequency detection and pitch correction are reviewed. In addition, main... more

Fig. 1 Fundamental frequency detection effectiveness using fast autocorrelation algorithm for particular tones depending on correlation threshold

Fig. 2 Average fundamental frequency detection effectiveness for fast autocorrelation algorithm depending on correlation threshold

Fig. 3 Fundamental frequency detection effectiveness using fast autocorrelation algorithm for male voice depending on frame length and hop size

Fig. 4 Fundamental frequency detection effectiveness using fast autocorrelation algorithm for female voice depending on frame length and hop size

Fig. 5 Fundamental frequency detection effectiveness using HPS algorithm for male singing sample depending on frame length and hop size

The relationship between frame length and the correctness of fundamental frequency detection was also studied for the HPS algorithm. The input samples, frame lengths and hop sizes were the same as in the previous case. The results are presented in Figs. 5 and 6.

Fig. 8 The main window of the application with a view of pitch of the original signal and corrected one changing in time

The system was implemented in Java, as it provides many free sound libraries. The development environment used was NetBeans IDE 5.5 with JDK 1.6 and the runtime environment was JRE 1.6. The graphical user interface was developed using the Swing library. Fig. 8 shows the main window of the application during correction of the input signal.
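The dependence of HPS (harmonic product spectrum) detection on frame length and hop size can be explored with a small sketch like the one below. It is written in Python for brevity rather than in the Java of the described system, and it is not the authors' implementation: the function names, the number of harmonics and the default frame and hop sizes are assumptions chosen for illustration.

```python
import numpy as np

def hps_f0(frame, sample_rate, n_harmonics=4):
    """Estimate F0 of one frame with the harmonic product spectrum."""
    window = np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(frame * window))
    product = spectrum.copy()
    for h in range(2, n_harmonics + 1):
        decimated = spectrum[::h]  # spectrum compressed by factor h
        product[:len(decimated)] *= decimated
    # Only bins where every compressed copy contributed are searched.
    limit = len(spectrum) // n_harmonics
    peak_bin = 1 + int(np.argmax(product[1:limit]))
    return peak_bin * sample_rate / len(frame)

def track_f0(signal, sample_rate, frame_len=2048, hop=512, n_harmonics=4):
    """Run hps_f0 over successive frames and return one F0 per frame."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([hps_f0(signal[s:s + frame_len], sample_rate, n_harmonics)
                     for s in starts])
```

The frequency resolution is `sample_rate / frame_len`, so longer frames give finer estimates at the cost of time resolution, while the hop size only sets how often an estimate is produced.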

Teaching Tone and Intonation With Microcomputers

by Dorothy Chun

2023, CALICO Journal

Although research on the use and effectiveness of visual feedback for teaching tone and intonation began more than thirty years ago, the technology for signal analysis and pitch extraction using microcomputers has only recently become... more

Speech Analysis and Synthesis Based on Dynamic Modes

by Julio Guery Guizada Vargas

2023, IEEE Transactions on Audio, Speech, and Language Processing

Robust pitch tracking for prosodic modeling in telephone speech

by Stephanie Seneff

2023, Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on

In this paper, we introduce a pitch detection algorithm that is particularly robust for telephone speech and prosodic modeling. The algorithm uses a logarithmically sampled spectral representation of speech, similar to that in the... more

A new algorithm for instantaneous F0 speech extraction based on Ensemble Empirical Mode Decomposition

by Maria Eugenia Torres

2023, 2009 17th European Signal Processing Conference

Tararira: sistema de búsqueda de música por melodía cantada (Tararira: a system for music search by sung melody)

by Martín Rocamora

2023

The problem of music retrieval by sung query consists of building a machine capable of simulating the cognitive process of identifying a musical piece from a few sung notes of its melody. In this paper, the algorithms of pitch tracking,... more

Tararira: Query By Singing System

by Martín Rocamora

2023

This extended abstract details a submission to the Music Information Retrieval Evaluation eXchange in the Query by Singing/Humming task. The problem of query by singing consists of building a machine capable of simulating the cognitive... more

Pitch tracking based on statistical anticipation

by Mingyang Wu

2023, IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222)

An effective multi-pitch tracking algorithm for noisy speech is critical for auditory processing. However, the performance of existing algorithms is not satisfactory. We have developed a robust algorithm for multi-pitch tracking of noisy... more

A multipitch tracking algorithm for noisy speech

by Mingyang Wu

2023, IEEE Transactions on Speech and Audio Processing

We present a robust algorithm for multi-pitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new integration method for extracting periodicity information across different frequency... more

Encoding of pitch in the human brainstem is sensitive to language experience

by Ravi Krishnan

2023, Brain research. Cognitive brain research

Neural processes underlying pitch perception at the level of the cerebral cortex are influenced by language experience. We investigated whether early, pre-attentive stages of pitch processing at the level of the human brainstem may also... more

Fig. 1. Acoustic spectra (left column) and Fp contours (right column) of the synthetic speech stimuli. Each spectrum contains four steady-state formants (F1, Fo. F3, F4). Spectra are identical across all four stimuli. Fo contours vary depending on the Mandarin tonal category (1, 2, 3, 4).

Fig. 2. Short-term autocorrelation functions (left panels) and running autocorrelogram (right panels) of the average FFR waveforms of Chinese (top panels) anc English (bottom panels) groups when listening to the Tone 2 stimulus (vi*). The broader phase-locked pitch interval band (white) for the English group i: consistent with the broader and smaller magnitude autocorrelation peak. Narrow-band spectrograms were obtained from each FFR waveform using a 30-ms analysis window (Gaussian) to evaluate the spectral composition and magnitude of the phase-locked neural activity at each of the first five harmonics. Twenty short-term spectral slices were derived from the FFR spectrogram at every 10 ms between 30 ms and 220 ms. H;_s harmonic peaks were identified for each spectral slice at positions close to the corresponding harmonics of the stimulus. Magnitude of each harmonic was expressed in dB relative to each harmonic’s noise floor. Short-term autocorrelation functions and the running autocorrelograms of the FFR to the Tone 2 stimulus (vi?) are shown in Fig. 2 for the Chinese and English groups. In the autocorrelation functions (left panels), a peak at the fundamental period 1/F'‘9 is observed for both groups, which means that phase-locked activity to the fundamental period is present regardless of language experience. However, the peak for the English group is smaller and broader relative to the Chinese group, suggesting that phase-locked activity is not as robust for English listeners. In the autocorrelograms (right panels), a time-variant band of phase-locked activity (white) closely follows the decreasing fundamental period, corresponding to increasing Fp in yi’, especially over the last half of its duration (cf. Fig. 1). Consistent with their respective autocorrelation functions, the band of phase- locked interval for the Chinese group is narrower than that for the English, suggesting that phase-locked activity for the

Fig. 3. Comparison between Chinese and English groups on the autocorrelation magnitude for each of the four speech stimuli. Higher magnitudes are observed in the Chinese group relative to the English regardless of tone. Narrow-band spectrograms of the original speech stim- ulus (left panel) and of the grand-average FFRs in response to the Tone 2 (yi”) stimulus for the Chinese (middle panel) and English (right panel) groups are displayed in Fig. 6. As expected, the yi? stimulus spectrogram reveals energy bands at several multiples of Fo including stronger energy bands at Fl-related harmonics (h2, h3). The FFR spectrograms show energy bands at several harmonics for both groups. The energy band of the second harmonic (h), in particular, appears to be stronger in the Chinese group relative to the

Fig. 4. Grand-average Fy contours of Tone 2 (yi) derived from the FFR waveforms of all subjects across both ears in the Chinese (red) and English (blue) groups. The Fo contour of the original speech stimulus (yi) is displayed in black. The enlarged inset shows that the Fp contour derived from the FFR waveforms of the Chinese group more closely approximates that of the original stimulus (yi?) when compared to the English group.

Fig. 5. Comparison between Chinese and English groups on the rank- transformed crosscorrelation coefficient between the Fg contours of the four tonal stimuli and FFR waveforms. Higher crosscorrelation coefficients are observed in the Chinese group relative to the English regardless of tone.

Fig. 7. Cross-language differences in FFR harmonic magnitude between the two groups (Chinese, English) for the first five harmonics (h,—hs) ir response to each of the four Mandarin tones. Positive and negative bar: indicate greater harmonic magnitude for the Chinese and English groups respectively. Pooling across both ears, the Chinese group show significantly greater harmonic magnitude for hy (red positive bars) regardless of tona stimulus. Error bars represent standard errors. *P < 0.05; **P < 0.01. compared to non-native listeners. In terms of the temporal pattern of neural activity, this means that the degree of phase-locking is greater and the variability is smaller around the phase-locked interval for the Chinese listeners compared to the English listeners. Current temporal encoding schemes of pitch extraction, based on the dominant interval in the distribution of interspike intervals, rely purely on the acoustic properties of the stimulus. Consequently, they would predict no significant differences in the characteristics of encoding (1.e., pitch strength, accuracy of pitch tracking) across listeners regardless of language experience. The present findings, however, clearly demonstrate that the encoding scheme is not static nor is it dedicated to faithfully extract only the physical properties of the stimulus. Rather, they are consistent with a temporal encoding scheme which is plastic, i.e., sensitive to language experience. This plasticity enables enhancing or priming of temporal intervals that carry linguistically relevant features of pitch contours. The relatively greater pitch strength and smoother pitch tracking in native Mandarin listeners may reflect the operation of this language-dependent encoding scheme.

Fig. 6. Narrow-band spectrograms of the original Tone 2 stimulus (yi) (left panel) and of the grand-average FFR waveforms of the Chinese (middle panel) and English (right panel) groups. F1 identifies the first formant; h1–h5 the first through fifth harmonics.


The effects of tone language experience on pitch processing in the brainstem

by Ravi Krishnan

2023, Journal of Neurolinguistics

Neural encoding of pitch in the auditory brainstem is shaped by long-term experience with language. The aim herein was to determine to what extent this experience-dependent effect is specific to a particular language. Analysis of variance...


Analysis and resynthesis of polyphonic music

by Douglas Nunn

2023

printer noise shown in Figure 2 is in a twelve-tone scale and a fairly regular tempo — the low E is the

Figure 9 - Responses of six fibres in the auditory nerve of a cat. The CFs are roughly evenly distributed by pitch, although gammatone filters are a closer approximation.

[Licklider, Plomp 76, Van Klitzing, Leman] examples. Plomp concludes that the maximum effect i

18. Keyed brass are now rare; they include ophicleides and bugles.

However, transcription of polyphonic music is particularly difficult because before we can classify

bandwidth, and we wish to determine the frequencies involved. However, music covers at least ten

e 29) after Risset [Risset 82] illustrates the framework for analysis and resynthesis

Towards the end of this work, an output board for the C40 was designed and made by Milos Kolar in order to investigate the possibilities of real-time synthesis. The hardware is shown schematically in Figure 32.

Figure 37 - Overview of Guy Brown's analysis system.

was designed for separating speech from the other sources described above [BrownS 92a, BrownG 4b] ... illustrated in Figure 37. This was later extended to analysis of synthesised musical sounds.

6.4.6 Filter timing

The filter length of 255 means that there is a group delay of 128 samples. However, since the samp

The next program, called DISTRIB, also reads the files produced by the Octave Spectral Analysis procedure is either imprecise or computationally expensive, or somewhere in-between, in that there is no quick way to calculate it accurately, and several ways of approximating the real spectrum, such as AR

by time using a program called TRAKSIN to give longer entities called chains. These are similar to the 'elements' of Brown. This is done by linking sines in neighbouring blocks using a simple birth-death model as shown in Figure 54. Note that sinusoids are only linked at their ends, s frequently. It is possible for a sinusoid
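Linking sines in neighbouring blocks with a birth-death model can be sketched as a greedy nearest-frequency match: a track continues when a peak in the next block lies close enough, dies when none does, and any unmatched peak starts a new track. This is a generic illustration, not the TRAKSIN program itself; the function name, the `max_jump_hz` tolerance, and the example frequencies are assumptions.

```python
def link_peaks(frames, max_jump_hz=20.0):
    """Link per-block sinusoid frequencies into tracks (greedy nearest match).

    frames: list of lists of frequencies in Hz, one list per analysis block.
    Returns a list of tracks; each track is a list of (block_index, freq).
    """
    finished, active = [], []
    for t, peaks in enumerate(frames):
        unused = sorted(peaks)
        still_active = []
        for track in active:
            last = track[-1][1]
            best = min(unused, key=lambda f: abs(f - last), default=None)
            if best is not None and abs(best - last) <= max_jump_hz:
                track.append((t, best))       # continuation
                unused.remove(best)
                still_active.append(track)
            else:
                finished.append(track)        # death: no close peak in this block
        for f in unused:
            still_active.append([(t, f)])     # birth: unmatched peak starts a track
        active = still_active
    return finished + active

# Hypothetical peak frequencies for four consecutive blocks.
blocks = [[440.0, 880.0], [442.0, 885.0], [444.0], [446.0, 1200.0]]
tracks = link_peaks(blocks)
```

A greedy match is the simplest choice; it can mis-assign peaks when two partials cross, which is one reason practical trackers often add amplitude or slope continuity to the matching cost.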

First, all of the data is read into virtual memory. We then carry out an exhaustive search, looking for simultaneous frequencies that are

sines and 405 tracks.

7.2.2.8 Recognition of quiet notes

example 3). All audio examples are listed in Appendix Q. It will be noted that the current system doe

345 [Mendelssohn] The audio was created in mono 16-bit linear format at 32 kHz using Csound files

The scores are compared using READASC, as shown in Figure 88. As before, light grey is expected (or hoped-for) notes, dark grey is predicted notes, and the overlap is in black. Next we judge the results using the diagram shown in Figure 89. This shows the overall amount of false positives (extra notes), correct identifications, and false negatives (missed notes).

after three iterations. In this figure, time runs from top to bottom and frequency runs from left to right.
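The three quantities named above (false positives, correct identifications, false negatives) can be counted directly when both scores are represented as sets of notes on a shared grid. The sketch below assumes such a representation; it is not the READASC program, and the `(onset_step, midi_pitch)` encoding and example notes are hypothetical.

```python
def compare_scores(expected, predicted):
    """Count correct notes, false positives and false negatives.

    Each score is an iterable of (onset_step, midi_pitch) pairs on a shared
    time grid, so two notes match only if both fields are equal.
    """
    expected, predicted = set(expected), set(predicted)
    correct = len(expected & predicted)
    false_pos = len(predicted - expected)    # extra notes
    false_neg = len(expected - predicted)    # missed notes
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return {
        "correct": correct,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }

# Hypothetical scores: one missed note and one extra note (an octave error).
expected_score = {(0, 60), (4, 62), (8, 64), (12, 65)}
predicted_score = {(0, 60), (4, 62), (8, 64), (8, 76)}
result = compare_scores(expected_score, predicted_score)
```

Exact matching is strict; a tolerance on onset time would be a natural extension when the two scores are not quantised to the same grid.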

As the aim was deduction of timing information, a smaller FFT, of size 32, was used.

frequency is effectively lowered.

85. Roads also discusses FIR filtering by convolution.

instruments were found to display formant-like behaviour. (Although Moorer claims they do not.)

Xc are intermediate results, and Xs is the spectrum data. The general flow diagram for the radix-2 decimation-in-time FFT is shown in Figure 150 for

order that this can be calculated with O(log2 N) operations each sample, we must distort each of the
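The radix-2 decimation-in-time flow mentioned above splits the input into even- and odd-indexed halves, transforms each half, and combines them with twiddle factors. The following is a textbook recursive sketch of that flow, not the thesis's implementation or its per-sample variant; the function names are assumptions.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])     # transform of even-indexed samples
    odd = fft_radix2(x[1::2])      # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times odd bin
        out[k] = even[k] + tw                              # butterfly, upper output
        out[k + n // 2] = even[k] - tw                     # butterfly, lower output
    return out

def dft_direct(x):
    """Direct O(N^2) DFT, used here only to cross-check the FFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

samples = [0.0, 1.0, 0.5, -0.25, 0.0, -1.0, 0.75, 0.125]
spectrum = fft_radix2(samples)
```

The recursion makes the even/odd split explicit; an in-place iterative version with bit-reversed ordering computes the same butterflies with less memory traffic.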

convolved with a periodic input spectrum

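Reassignment: a general method for improving the readability of bilinear time-frequency representations (La réallocation: une méthode générale d'amélioration de la lisibilité des représentations temps-fréquence bilinéaires)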

The resolution is poorer at lower frequencies, and is never better than 2 Hz. Throughout most of the range, there are about 30 JNDs in one critical band. The table below summarises the results.

Assuming linear interpolation of amplitude and frequency, and the use of a lookup table for sine generation, the following table shows how many operations are required per partial per sample.

A change in one actual controller, such as tongue-palate distance, will affect all of the harmonics, often

Other research not reviewed here includes work by Watson at Sydney, Heinbach at

The table below summarises most of the polyphonic systems above. EPA stands for expressive performance analysis. In many cases the aims fall into several categories.

have ONE partial — i.e. the fundamental. The original score and the derived scores are shown in Figure 80. This diagram (for the -24 dB

Table 31 - Comparison of comparison methods.

neither evaluated nor conveyed. There are many different types of error. Notes added from harmonics

minimum note length of 0.5 seconds. The average level is -15.51 dB. The analysis was done using Pickout (v0.24) with a threshold of -24 dB

It is very simple to form the ideal score. The quantitative evaluation is as follows:-

from the spectrum, the odd harmonics are considerably stronger than the even harmonics. This is to be expected because while the didgeridoo is a lip-reed instrument, it has a cylindrical bore and should thus

8.6.2.3 Multiplication and convolution

When two atoms are multiplied, the resultant species is given by the following table. An easier way to express this is:-

short and long spectral details. Although the complete analysis/resynthesis system has not yet been full analysis as synthesis. Quanta can be used throughout the analysis scheme; they are applicable to

10.10 Appendix J - Terms I define

Table 47 - Terms I define in this thesis.

Table 48 - List of audio examples. URL http://capella.dur.ac.uk/doug/thesis/.


Vector phaseshaping synthesis

by Vesa Valimaki

2023

This paper introduces the Vector Phaseshaping (VPS) synthesis technique, which extends the classic Phase Distortion method by providing flexible means to distort the phase of a sinusoidal oscillator. This is achieved by describing the...


Semi-Markov decision processes with limiting ratio average rewards

by Prasenjit Mondal

2023, Journal of Mathematical Analysis and Applications


Segmentation of pitch tracks for melody detection in polyphonic audio

by Rui José Paiva

2023, 2005 13th European Signal Processing Conference

We propose a method for segmentation of pitch tracks for melody detection in polyphonic musical signals. This is an important issue for melody-based music information retrieval, as well as melody transcription. Past work in the field...


Speech Analysis and Synthesis Based on Dynamic Modes

by Julio Ramon Miranda Vargas

2023, IEEE Transactions on Audio, Speech, and Language Processing


Automated Music Transcription

by Yash Rajput

2023, Journal of emerging technologies and innovative research

In this paper, we propose automated software that transcribes each note while the musician plays the instrument. The software takes the sound of the instrument as input and processes the frequency...


Between the Poet's Voice and the Poetic Voice: A Computer-Assisted Listening to the Literary Audio of Eduardo Lizalde

by Aurelio Meza

2023, PoéticaSonoraMX

The death of Eduardo Lizalde (1929-2022) is an irreplaceable loss for Mexican poetry. Throughout his literary career, he developed an almost inevitable affinity with the sonority of his voice, whether through...


Biological changes in auditory function following training in children with autism spectrum disorders

by Nicole Russo-Ponsaran

2023, Behavioral and Brain Functions

Background: Children with pervasive developmental disorders (PDD), such as children with autism spectrum disorders (ASD), often show auditory processing deficits related to their overarching language impairment. Auditory training programs...


Melody transcription framework using score information for Noh singing

by Katunobu Itou

2023

Paper presented at the Eighth International Conference on Creative Content Technologies, held 20-24 March 2016 in Rome, Italy.

Figure 1. A graphical notation of melodic information in a Noh vocal book

In Figure 1, the first row is the graphical notation. Bullets show onsets, and lines and curves show pitch transience. In this notation, different vertical positions indicate different pitches, similar to Western music notation. However, this notation is continuous, unlike the discrete Western notation. Therefore, ornaments indicated using smaller symbols, e.g., a grace note, are indicated using a combination of curves. The second row contains the lyrics in Japanese kana. The last row contains note names.

Figure 2. Pitch contours of Noh singing

Figure 2 illustrates the difficulty of finding onsets and transitions of notes, because the extent of certain vibrato is greater than the transition pitch difference. By contrast, the extent of vibratos in Western music hardly exceeds 200 cents. The cent is a unit of musical pitch; one semitone is 100 cents. In this paper, n cents for an f Hz pitch is $n = 1200 \log_2(f / f_{\mathrm{ref}})$, where $f_{\mathrm{ref}}$ is a reference frequency. As seen in this example, fitting a melodic line to a pitch contour is difficult because a new melodic line notation, closer to the acoustic signal, is required.
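The cent scale described above can be sketched as a small conversion helper. The reference frequency is not recoverable from the text, so the 440 Hz default below is purely an assumption for illustration, as are the function names and the example contour.

```python
import math

def hz_to_cents(f_hz, f_ref_hz=440.0):
    """Convert a frequency in Hz to cents relative to f_ref_hz.

    f_ref_hz = 440.0 is an assumed default, not a value taken from the paper.
    """
    if f_hz <= 0 or f_ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(f_hz / f_ref_hz)

def vibrato_extent_cents(f0_contour_hz):
    """Peak-to-peak extent of a voiced f0 contour, in cents."""
    cents = [hz_to_cents(f) for f in f0_contour_hz]
    return max(cents) - min(cents)

# Hypothetical contour swinging one semitone above and below 220 Hz.
contour = [220.0 * 2 ** (d / 12) for d in (-1, 0, 1, 0, -1)]
extent = vibrato_extent_cents(contour)
```

Because the extent is a difference of two cent values, it does not depend on the choice of reference frequency, which makes it convenient for comparing vibrato against figures such as the 200-cent bound quoted for Western music.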

In the commentary, to assist interpretation, graphical notation of the melodic line is used. However, this notation does not assist understanding of the melodic line in the way it does in Western musical scores, because of the difficulty in estimating the exact line from acoustic signals. The commentary includes the graphical notation of 300 phrases from 55 pieces. Figure 1 shows an example of a graphical notation [7] and Figure 2 shows a pitch (f0) contour of its execution. This sample is a phrase in the dynamic mode.

Figure 3. Phone segmentation of pitch contours

Figure 3 exemplifies the phonetic segmentation result of the pitch contour of the first 4 s of Figure 2. Vertical lines indicate onsets. This figure was plotted using manual labeling for a later explanation.

Figure 4. An initial centered pitch contour (dot plot) and an initial centered curve (solid line)

If a pitch is not held steady in any of its segments, its mean cannot be calculated. For such pitches, the mean is calculated from the value of the nearest calculated pitch by subtracting the default scale difference value. The scale value corresponds to 100 cents. For example, in Table I, if the mean of one pitch value was not calculated, it is derived from the mean of the nearest calculated pitch by scaling it by $2^{d/12}$, where $d$ is the signed difference between the two integer pitch values.
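A minimal sketch of this fallback, assuming the means are stored in Hz and that one integer pitch step corresponds to 100 cents (a factor of $2^{1/12}$), could look as follows. The function name, the dictionary layout, and the example values are hypothetical and are not taken from Table I.

```python
def fill_missing_means(means_hz):
    """Fill unmeasured pitch means from the nearest measured one.

    means_hz maps an integer score pitch to its mean f0 in Hz, or to None
    when no mean could be measured. One integer step is assumed to be
    100 cents, i.e. a frequency ratio of 2 ** (1 / 12).
    """
    known = {k: v for k, v in means_hz.items() if v is not None}
    if not known:
        raise ValueError("at least one measured mean is required")
    filled = dict(known)
    for k, v in means_hz.items():
        if v is None:
            j = min(known, key=lambda p: abs(p - k))       # nearest measured pitch
            filled[k] = known[j] * 2.0 ** ((k - j) / 12.0)  # shift by (k - j) steps
    return filled

# Hypothetical example: pitch 49 was never held steady, pitch 50 was.
means = fill_missing_means({50: 220.0, 49: None, 52: 247.0})
```

When two measured pitches are equally near, this sketch takes whichever comes first in the dictionary; a real system would need an explicit tie-breaking rule.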

Figure 6. Comparison of the transcriptions of different singers

of peak similar to float after the dips at approximately 6 s in both transcriptions. Asymmetric vibrato is one of the most specific characteristics of Noh singing, and is considered as an unprescribed ornament. This suggests that new ornaments will be discovered using this method.

Figure 5. A melody transcription. The blue line is the f0 contour. The red line is the melody transcribed with the onset timing using the bullets.

Figure 7. A melody transcription using erroneous segmentation

Regarding the phone segmentation accuracy, the absolute error compared with the hand-labeled onset was 0.37 s by the general HMM and 0.089 s by the adapted HMMs. Figure 7 shows an example of melody transcription using highly erroneous phone segmentation from the upper pitch contour in Figure 6. The average absolute onset error was 0.23 s. Compared with the data in Figure 6, the first float was missing and the transient just after 6 s was missing. In both cases, successive vowels caused segmentation errors; for example, the error of the vowel in the first float was 1.56 s and the error near the transient was 0.39 s, due to the phonation difference from modern Japanese speech. Such vowel differences were reduced by MLLR adaptation.

We evaluated this type of transcription error using the average absolute pitch error and compared it to the transcription estimated using manually labeled phone segmentation in the cent domain. The average absolute pitch error of the transcription in Figure 7 was 50 cents. The average absolute pitch error was 24 cents by the general HMM segmentation and 13 cents by the adapted HMM segmentation. Figure 8 shows the relationship between the phone segmentation error and the transcription estimation error. The adapted model did not estimate erroneous transcription.

For modern or Western music, the onset detection accuracy is evaluated within a 50 ms window [10], but for Noh singing a broader window, e.g., 200 ms, might be appropriate for evaluation, because Noh singing is very slow (e.g., the average phone duration was approximately 400 ms in the evaluation data) and the onset is flexible and not as important as it is in Western music.
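The two measures used above, the average absolute error and a tolerance-window check on onsets, can be sketched as follows. This assumes estimated and reference values are already paired one to one; the function names and the example onset times are hypothetical, not data from the paper.

```python
def mean_abs_error(estimated, reference):
    """Average absolute difference between paired values (seconds or cents)."""
    if len(estimated) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(reference)

def onset_hit_rate(estimated, reference, tolerance_s):
    """Fraction of paired onsets that fall within +/- tolerance_s of the reference."""
    if len(estimated) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    hits = sum(1 for e, r in zip(estimated, reference) if abs(e - r) <= tolerance_s)
    return hits / len(reference)

# Hypothetical onsets in seconds (hand-labeled reference vs. automatic estimate).
ref_onsets = [0.00, 0.40, 0.85, 1.30]
est_onsets = [0.03, 0.52, 0.82, 1.45]

onset_mae = mean_abs_error(est_onsets, ref_onsets)
rate_50ms = onset_hit_rate(est_onsets, ref_onsets, 0.05)
rate_200ms = onset_hit_rate(est_onsets, ref_onsets, 0.20)
```

The same `mean_abs_error` helper applies to pitch when both transcriptions are expressed in cents, which mirrors the cent-domain comparison described in the text.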

Figure 8. The relationship between the segmentation error and the transcription error

TABLE I. SCORE INFORMATION

In Western music, pitch transition occurs at the onset. In Noh singing, pitch transition does not occur quickly at the onset, and the execution is like a grace note [1]. In addition, unlike Western music or Japanese popular music, consonants last longer. Reflecting these aspects, a note must be divided into phones, i.e., a note must contain a phone boundary. Each note corresponds to a single syllable. In Japanese, syllables are classified into two types: CV, a consonant followed by a vowel, and V, just a vowel. Hence, each note has one or two phone fields. In addition, each phone field has pitch transition information, which consists of three values at most. The first is an original and mandatory pitch. The second is a first-transited pitch, and the third is a second-transited pitch. Pitch is expressed as an integer whose value difference refers approximately to a halftone of Western music. Table I is an example of the first three notes in Figure 1.
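The score structure described above (one or two phone fields per note, each with one to three integer pitches) maps naturally onto a pair of small record types. The sketch below is one possible encoding; the class names, field names, and the example syllables and pitch values are hypothetical and do not reproduce Table I.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PhoneField:
    phone: str            # phone label, e.g. a consonant or a vowel
    pitches: List[int]    # original pitch, then up to two transited pitches

    def __post_init__(self):
        if not 1 <= len(self.pitches) <= 3:
            raise ValueError("a phone field holds one to three pitch values")

@dataclass
class Note:
    phones: List[PhoneField]   # CV syllable: two fields; V syllable: one field

    def __post_init__(self):
        if not 1 <= len(self.phones) <= 2:
            raise ValueError("a note holds one or two phone fields")

    @property
    def syllable_type(self) -> str:
        return "CV" if len(self.phones) == 2 else "V"

# Made-up notes for illustration only.
note_cv = Note([PhoneField("k", [50]), PhoneField("a", [50, 52, 50])])
note_v = Note([PhoneField("o", [48])])
```

Validating the field counts at construction time keeps malformed score entries from reaching later alignment or transcription stages.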


Progress in the BBN 2007 Mandarin Speech to Text system

by Nguyen Nam Long

2022, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing


Segmentation of pitch tracks for melody detection in polyphonic audio

by Rui Pedro Paiva

2022, 2005 13th European Signal Processing Conference


From Pitches to Notes: Creation and Segmentation of Pitch Tracks for Melody Detection in Polyphonic Audio

by Rui Pedro Paiva

2022, Journal of New Music Research



A Method for Voiced/Unvoiced Classification of Noisy Speech by Analyzing Time-Domain Features of Spectrogram Image

by DR. M. EKRAMUL HAMID

2022, Science Journal of Circuits, Systems and Signal Processing

This paper presents a voiced/unvoiced classification algorithm for noisy speech signals that analyzes two acoustic features of the speech signal. Short-time energy and short-time zero-crossing rate are among the most distinguishable...
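The two features named in this abstract can be computed per frame in a few lines. The sketch below shows the classic threshold rule on short-time energy and zero-crossing rate; it is not the paper's spectrogram-image method, and the function names, thresholds, and test frames are assumptions that would need tuning on real data.

```python
import math
import random

def short_time_features(frame):
    """Return (mean energy, zero-crossing rate) for one frame of samples."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return energy, zcr

def is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Label a frame voiced when it has high energy and few zero crossings.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    energy, zcr = short_time_features(frame)
    return energy > energy_thresh and zcr < zcr_thresh

sr = 8000
# Hypothetical voiced frame: 150 Hz sine, 30 ms long, amplitude 0.5.
voiced_frame = [0.5 * math.sin(2 * math.pi * 150.0 * t / sr) for t in range(240)]
# Hypothetical unvoiced frame: low-level uniform noise with a fixed seed.
rng = random.Random(0)
noise_frame = [rng.uniform(-0.05, 0.05) for _ in range(240)]

voiced_label = is_voiced(voiced_frame)
noise_label = is_voiced(noise_frame)
```

Fixed thresholds degrade quickly as noise rises, which is the situation the abstract targets; adaptive thresholds or additional features are the usual remedies.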

