Papers by Jonathan Driedger

Background music is often used to generate a specific atmosphere or to draw our attention to specific events. For example, in movies or computer games, it is often the accompanying music that conveys the emotional state of a scene and plays an important role in immersing the viewer or player in the virtual environment. For home-made videos, slide shows, and other consumer-generated visual media streams, there is a need for computer-assisted tools that allow users to generate aesthetically appealing music tracks in an easy and intuitive way. In this contribution, we consider a data-driven scenario where the musical raw material is given in the form of a database containing a variety of audio recordings. For a given visual media stream, the task then consists in identifying, manipulating, overlaying, concatenating, and blending suitable music clips to generate a music stream that satisfies certain constraints imposed by the visual data stream and by user specifications. Our main goal is to give an overview of various content-based music processing and retrieval techniques that become important in data-driven sound track generation. In particular, we sketch a general pipeline that highlights how the various techniques act together when generating musically plausible transitions between subsequent music clips.
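
To make the blending step tangible, the following is a minimal sketch of an equal-power crossfade between two clips. It is not the pipeline described above; the function name and parameters are illustrative.

```python
import numpy as np

def crossfade(clip_a, clip_b, sr, fade_sec=2.0):
    """Blend the end of clip_a into the start of clip_b with an
    equal-power crossfade (mono signals, illustrative sketch)."""
    n = int(fade_sec * sr)
    t = np.linspace(0.0, 1.0, n)
    fade_out = np.cos(0.5 * np.pi * t)  # equal-power envelopes:
    fade_in = np.sin(0.5 * np.pi * t)   # fade_out**2 + fade_in**2 == 1
    overlap = clip_a[-n:] * fade_out + clip_b[:n] * fade_in
    return np.concatenate([clip_a[:-n], overlap, clip_b[n:]])
```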

Formalizing and verifying proofs in cryptography has become an important task. To this end, Backes et al. developed a framework [1] that uses the proof assistant Isabelle/HOL [9] to verify game-based proofs. In this framework, a powerful probabilistic language allows one to formalize games that describe security properties. To show that these security properties hold, one modifies the games without altering their outcome, until the games take the form of already known security properties. Such a modification of a game is called a transformation. Transformations are based on relations between games, which have to be verified, and verifying such relations is often a challenging task. To construct game-based proofs more naturally, it is useful to have a collection of relations, and therefore transformations, verified upfront. This thesis presents several game transformations, formalizes them, and proves their correctness.
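
As an illustration of the kind of relation on which such transformations rest, consider the standard Difference Lemma (a textbook result; the thesis's own catalogue of relations may differ): if two games proceed identically unless a "bad" event F occurs, a single game hop changes the adversary's success probability by at most Pr[F].

```latex
% Difference Lemma: if the events A (success in game G_1) and
% B (success in game G_2) coincide whenever F does not occur, then
\[
  A \wedge \neg F \iff B \wedge \neg F
  \quad\Longrightarrow\quad
  \bigl|\Pr[A] - \Pr[B]\bigr| \;\leq\; \Pr[F].
\]
```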

Zenodo (CERN European Organization for Nuclear Research), Jun 7, 2022
Guitar tablature transcription is an important but understudied problem within the field of music information retrieval. Traditional signal processing approaches offer only limited performance on the task, and there is little acoustic data with transcription labels for training machine learning models. However, guitar transcription labels alone are more widely available in the form of tablature, which is commonly shared among guitarists online. In this work, a collection of symbolic tablature is leveraged to estimate the pairwise likelihood of notes on the guitar. The output layer of a baseline tablature transcription model is reformulated, such that an inhibition loss can be incorporated to discourage the co-activation of unlikely note pairs. This naturally enforces playability constraints for guitar, and yields tablature which is more consistent with the symbolic data used to estimate pairwise likelihoods. With this methodology, we show that symbolic tablature can be used to shape the distribution of a tablature transcription model's predictions, even when little acoustic data is available. † Main work completed as a research intern at Chordify.
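
A minimal sketch of such an inhibition term is given below. The exact formulation in the paper may differ; `pairwise_weight` stands for inhibition weights derived from the symbolic tablature and is an assumed input.

```python
import numpy as np

def inhibition_loss(activations, pairwise_weight):
    """Penalize the co-activation of unlikely note pairs.

    activations     : (N,) predicted note activations in [0, 1]
    pairwise_weight : (N, N) symmetric weights, large for note pairs
                      that rarely co-occur in symbolic tablature
    """
    a = np.asarray(activations, dtype=float)
    # Sum of w_ij * a_i * a_j over all pairs i < j
    pair_products = np.outer(a, a) * pairwise_weight
    return np.triu(pair_products, k=1).sum()
```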

Electronic Music (EM) is a popular family of genres which has increasingly received attention as a research subject in the field of MIR. A fundamental structural unit in EM is the loop: an audio fragment whose length can span several seconds. The devices commonly used to produce EM, such as sequencers and digital audio workstations, impose a musical structure in which loops are repeatedly triggered and overlaid. This particular structure enables new perspectives on well-known MIR tasks. In this paper we first review a prototypical production technique for EM from which we derive a simplified model. We then use our model to illustrate approaches for the following task: given a set of loops that were used to produce a track, decompose the track by finding the points in time at which each loop was activated. To this end, we repurpose established MIR techniques such as fingerprinting and non-negative matrix factor deconvolution.
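
As a rough illustration of the decomposition task, the following sketch slides a loop's spectrogram over the track's and scores the match at each frame offset. It is a simple correlation-based stand-in, not the fingerprinting or NMFD techniques the paper actually repurposes.

```python
import numpy as np

def loop_activation_curve(track_spec, loop_spec):
    """Score how well a loop's magnitude spectrogram matches the
    track's at every frame offset (peaks suggest activations).

    track_spec : (F, N) magnitude spectrogram of the full track
    loop_spec  : (F, M) magnitude spectrogram of one loop, M <= N
    """
    _, N = track_spec.shape
    _, M = loop_spec.shape
    template = loop_spec / (np.linalg.norm(loop_spec) + 1e-12)
    scores = np.zeros(N - M + 1)
    for n in range(N - M + 1):
        window = track_spec[:, n:n + M]
        scores[n] = np.sum(window * template) / (np.linalg.norm(window) + 1e-12)
    return scores
```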

In recent years, methods to decompose an audio signal into a harmonic and a percussive component have received a lot of interest and are frequently applied as a processing step in a variety of scenarios. One problem is that the computed components are often not of purely harmonic or percussive nature but also contain noise-like sounds that are neither clearly harmonic nor percussive. Furthermore, depending on the parameter settings, one can often observe a leakage of harmonic sounds into the percussive component and vice versa. In this paper we present two extensions to a state-of-the-art harmonic-percussive separation procedure to target these problems. First, we introduce a separation factor parameter into the decomposition process that allows for tightening the separation and for enforcing the components to be clearly harmonic or percussive. As a second contribution, inspired by the classical sines+transients+noise (STN) audio model, this novel concept is exploited to add a third residual component to the decomposition, which captures the sounds that lie in between the clearly harmonic and percussive sounds of the audio signal.
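
A minimal sketch of median-filtering-based masking with such a separation factor is shown below. Parameter names and the exact mask definitions are assumptions for illustration, not necessarily those of the paper.

```python
import numpy as np
import scipy.ndimage

def hpr_masks(S_mag, beta=2.0, h_len=17, p_len=17):
    """Binary harmonic/percussive/residual masks with a separation
    factor beta (beta > 1 tightens the separation).

    S_mag : magnitude spectrogram (frequency x time)
    """
    # Median filtering: horizontally (time) enhances harmonic lines,
    # vertically (frequency) enhances percussive spikes.
    H = scipy.ndimage.median_filter(S_mag, size=(1, h_len))
    P = scipy.ndimage.median_filter(S_mag, size=(p_len, 1))
    mask_h = H > beta * P        # clearly harmonic bins
    mask_p = P >= beta * H       # clearly percussive bins
    mask_r = ~(mask_h | mask_p)  # everything in between -> residual
    return mask_h, mask_p, mask_r
```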

Melody estimation algorithms are typically evaluated by separately assessing the tasks of voice activity detection and fundamental frequency estimation. For both subtasks, computed results are typically compared to a single human reference annotation. This is problematic since different human experts may differ in how they specify a predominant melody, thus leading to a pool of equally valid reference annotations. In this paper, we address the problem of evaluating melody extraction algorithms within a jazz music scenario. Using four human and two automatically computed annotations, we discuss the limitations of standard evaluation measures and introduce an adaptation of Fleiss' kappa that can better account for multiple reference annotations. Our experiments not only highlight the behavior of the different evaluation measures, but also give deeper insights into the melody extraction task.
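
For reference, the standard form of Fleiss' kappa on which such an adaptation builds (common textbook notation, not necessarily the paper's):

```latex
% Fleiss' kappa for N items rated by n raters into categories k,
% with per-item agreement P_i and overall category proportions p_k:
\[
  \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
  \qquad
  \bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i,
  \qquad
  \bar{P}_e = \sum_{k} p_k^{2}.
\]
```
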
The analysis of recorded audio sources has become increasingly important in ethnomusicological research. Such audio material may contain important cues on performance practice, information that is often lost in manually generated symbolic music transcriptions. As an application scenario, we consider in this paper a musically relevant audio collection that consists of three-voice polyphonic Georgian chants. As one main contribution, we introduce an interactive graphical user interface that provides various visual and acoustic control mechanisms for estimating fundamental frequency (F0) trajectories from complex sound mixtures. We then apply this interface to determine F0 trajectories of sung pitches from the Georgian chant recordings and indicate how such F0 annotations can serve as a basis for addressing important questions in Georgian music research.
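
As a simplified stand-in for the kind of estimator behind such an interface, the sketch below computes a per-frame F0 candidate by harmonic summation over a salience-like representation. All names and parameters are illustrative.

```python
import numpy as np
import librosa

def salience_f0(y, sr, fmin=80.0, fmax=800.0, n_harmonics=5, n_fft=4096):
    """Rough per-frame F0 estimate via harmonic summation."""
    S = np.abs(librosa.stft(y, n_fft=n_fft))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    cand = freqs[(freqs >= fmin) & (freqs <= fmax)]  # F0 candidates
    sal = np.zeros((len(cand), S.shape[1]))
    for i, f0 in enumerate(cand):
        for h in range(1, n_harmonics + 1):
            bin_idx = np.argmin(np.abs(freqs - h * f0))
            sal[i] += S[bin_idx] / h  # weight higher harmonics less
    return cand[np.argmax(sal, axis=0)]
```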

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016
Harmonic-percussive-residual (HPR) sound separation is a useful preprocessing tool for applications such as pitched instrument transcription or rhythm extraction. Recent methods rely on the observation that in a spectrogram representation, harmonic sounds lead to horizontal structures and percussive sounds lead to vertical structures. Furthermore, these methods associate structures that are neither horizontal nor vertical (i.e., non-harmonic, non-percussive sounds) with a residual category. However, this assumption does not hold for signals like frequency-modulated tones that show fluctuating spectral structures while nevertheless carrying tonal information. Therefore, a strict classification into horizontal and vertical is inappropriate for these signals and might lead to leakage of tonal information into the residual component. In this work, we propose a novel method that instead uses the structure tensor, a mathematical tool known from image processing, to calculate predominant orientation angles in the magnitude spectrogram. We show how this orientation information can be used to distinguish between harmonic, percussive, and residual signal components, even in the case of frequency-modulated signals. Finally, we verify the effectiveness of our method by means of both objective evaluation measures and audio examples.
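
A minimal sketch of computing such local orientation angles with the structure tensor (the derivative and smoothing choices are assumptions for illustration):

```python
import numpy as np
from scipy import ndimage

def spectrogram_orientations(S_mag, sigma=2.0):
    """Local orientation angles of a magnitude spectrogram via the
    structure tensor; returns values in [0, pi): ~0 for horizontal
    (harmonic) and ~pi/2 for vertical (percussive) structures."""
    Sx = ndimage.sobel(S_mag, axis=1)  # derivative along time
    Sy = ndimage.sobel(S_mag, axis=0)  # derivative along frequency
    # Smoothed structure tensor entries
    Jxx = ndimage.gaussian_filter(Sx * Sx, sigma)
    Jxy = ndimage.gaussian_filter(Sx * Sy, sigma)
    Jyy = ndimage.gaussian_filter(Sy * Sy, sigma)
    # Dominant gradient direction; structures run perpendicular to it
    phi = 0.5 * np.arctan2(2.0 * Jxy, Jxx - Jyy)
    return np.mod(phi + 0.5 * np.pi, np.pi)
```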

Time-scale modification (TSM) algorithms have the purpose of stretching or compressing the time scale of an input audio signal without altering its pitch. Such tools are frequently used in scenarios like music production or music remixing. There exists a large variety of algorithmic approaches to TSM, each with its own advantages and drawbacks. In this paper, we present the TSM toolbox, which contains MATLAB implementations of several conceptually different TSM algorithms. In particular, our toolbox provides the code for a recently proposed TSM approach which integrates different classical TSM algorithms in combination with harmonic-percussive source separation (HPSS). Furthermore, our toolbox contains several demo applications and additional code examples. By providing MATLAB code on a well-documented website under a GNU-GPL license and including illustrative examples, we aim to foster research and education in the field of audio processing.
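
To illustrate the simplest family of TSM algorithms covered by such a toolbox, here is a bare-bones overlap-add (OLA) sketch in Python (the toolbox itself is MATLAB; this is not its code):

```python
import numpy as np

def ola_tsm(x, alpha, frame_len=1024):
    """Stretch mono signal x by factor alpha via overlap-add
    (alpha > 1 slows down; naive OLA, so tonal content may warble)."""
    syn_hop = frame_len // 2
    ana_hop = max(1, int(round(syn_hop / alpha)))
    win = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // ana_hop + 1
    y = np.zeros(n_frames * syn_hop + frame_len)
    norm = np.zeros_like(y)
    for i in range(n_frames):
        frame = x[i * ana_hop : i * ana_hop + frame_len] * win
        y[i * syn_hop : i * syn_hop + frame_len] += frame
        norm[i * syn_hop : i * syn_hop + frame_len] += win
    return y / np.maximum(norm, 1e-8)
```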

The automated decomposition of music signals into their elementary components is a central task in the field of music processing. Among other things, this involves the identification and reconstruction of individual melody and instrument voices from an audio recording given as a waveform, a problem that is also known as source separation in the broader field of audio signal processing. In the case of music, the individual voices typically exhibit strong overlaps in time and frequency, which makes the decomposition into sources a generally barely solvable problem without additional knowledge. To simplify the problem, numerous methods have been developed in recent years that assume, in addition to the music signal, knowledge of the underlying musical score. The additional information provided by the score, for example regarding the instrumentation and the occurring notes, can be exploited to guide the source separation process, so that even overlapping sources can be separated at least to a certain degree. Furthermore, it is often only the score that makes it possible to specify the voices to be separated in the first place. In this article, we give an overview of recent developments in the field of score-informed source separation, discuss general challenges in the processing of music signals, and outline possible applications.
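
A minimal sketch of how score information can constrain a separation model: in NMF-based separation, the score can zero out activations at times where a note cannot sound. The update rules below are standard KL-NMF multiplicative updates; the masking scheme is an illustrative assumption, not a specific published algorithm.

```python
import numpy as np

def score_informed_nmf(V, W_init, score_mask, n_iter=100, eps=1e-9):
    """NMF where the score constrains which activations may be nonzero.

    V          : (F, N) magnitude spectrogram of the mixture
    W_init     : (F, K) initial note/instrument templates
    score_mask : (K, N) 1 where the score allows activity, else 0
    """
    W = W_init.copy()
    H = np.random.rand(*score_mask.shape) * score_mask
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)
        H *= score_mask            # re-impose the score constraint
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)
    return W, H
```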

A common method to create beat annotations for music recordings is to let a human annotator tap along with them. However, this method is problematic due to the limited human ability to temporally align taps with audio cues for beats accurately. In order to create accurate beat annotations, it is therefore typically necessary to manually correct the recorded taps in a subsequent step, which is a cumbersome task. In this work we aim to automate this correction step by "snapping" the taps to nearby audio cues, a strategy that is often used by beat tracking algorithms to refine their beat estimates. The main contributions of this paper can be summarized as follows. First, we formalize the automated correction procedure mathematically. Second, we introduce a novel visualization method that serves as a tool to analyze the results of the correction procedure for potential errors. Third, we present a new dataset consisting of beat annotations for 101 music recordings. Fourth, we use this dataset to perform a listening experiment as well as a quantitative study to show the effectiveness of our snapping procedure.
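
The core snapping idea can be sketched in a few lines (an illustrative simplification, not the paper's formalization; `cue_times` stands for precomputed audio cue positions such as onset or novelty peaks):

```python
import numpy as np

def snap_taps(taps, cue_times, max_dev=0.1):
    """Move each tap to the closest audio cue if one lies within
    max_dev seconds; otherwise keep the original tap."""
    taps = np.asarray(taps, dtype=float)
    cues = np.asarray(cue_times, dtype=float)
    snapped = taps.copy()
    for i, t in enumerate(taps):
        j = np.argmin(np.abs(cues - t))
        if abs(cues[j] - t) <= max_dev:
            snapped[i] = cues[j]
    return snapped
```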

The task of novelty detection, with the objective of detecting changes regarding musical properties such as harmony, dynamics, timbre, or tempo, is of fundamental importance when analyzing structural properties of music recordings. But for a specific audio version of a given piece of music, the novelty detection result may also crucially depend on the individual performance style of the musician. This particularly holds true for tempo-related properties, which may vary significantly across different performances of the same piece of music. In this paper, we show that tempo-based novelty detection can be stabilized and improved by simultaneously analyzing a set of different performances. We first warp the version-dependent novelty curves onto a common musical time axis, and then combine the individual curves to produce a single fusion curve. Our hypothesis is that musically relevant points of novelty tend to be consistent across different performances. This hypothesis is supported by our experiments in the context of music structure analysis, where the cross-version fusion curves yield, on average, better results than the novelty curves obtained from individual recordings.
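
A minimal sketch of the warp-and-fuse step (the alignments between each performance and the common musical time axis are assumed given, e.g. from music synchronization, which is not shown):

```python
import numpy as np

def fuse_novelty_curves(curves, alignments, n_common=1000):
    """Warp version-dependent novelty curves onto a common time axis
    and average them into a single fusion curve.

    curves     : list of 1-D novelty curves, one per performance
    alignments : list of (t_version, t_common) arrays giving a
                 monotonic mapping from version time (in curve
                 samples) to common time (normalized to [0, 1])
    """
    common_axis = np.linspace(0.0, 1.0, n_common)
    warped = []
    for curve, (t_ver, t_com) in zip(curves, alignments):
        pos = np.interp(common_axis, t_com, t_ver)  # common -> version time
        warped.append(np.interp(pos, np.arange(len(curve)), curve))
    return np.mean(warped, axis=0)
```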

Music source separation aims at decomposing music recordings into their constituent component signals. Many existing techniques are based on separating a time-frequency representation of the mixture signal by applying suitable modeling techniques in conjunction with generalized Wiener filtering. Recently, the term α-Wiener filtering was coined, together with a theoretical foundation for the long-practiced use of magnitude spectrogram estimates in Wiener filtering. So far, optimal values for the magnitude exponent α have been empirically found in oracle experiments regarding the additivity of spectral magnitudes. In the first part of this paper, we extend these previous studies by examining further factors that affect the choice of α. In the second part, we investigate the role of α in Kernel Additive Modeling applied to Harmonic-Percussive Separation. Our results indicate that the parameter α may be understood as a kind of selectivity parameter, which should be chosen in a signal-adaptive fashion.
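
For reference, the generalized (α-)Wiener mask in its standard form (the paper studies how to choose α; α = 2 recovers the classical Wiener filter):

```latex
\[
  \mathcal{M}_i(t, f) =
  \frac{|\hat{S}_i(t, f)|^{\alpha}}{\sum_{j} |\hat{S}_j(t, f)|^{\alpha}},
  \qquad
  \hat{S}^{\mathrm{sep}}_i(t, f) = \mathcal{M}_i(t, f)\, X(t, f),
\]
where $\hat{S}_j$ are the source magnitude estimates and $X$ is the
mixture STFT.
```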
