Researcher in Machine Learning and its application to Audio, Language and Music.
Papers by Tillman Weyde
Understandability of Global Post-hoc Explanations of Black-box Models: Dataset and Analysis
The dataset contains the data collected in a user study carried out to evaluate the impact of using domain knowledge, in the form of ontologies, on the creation of global post-hoc explanations of black-box models. The research hypothesis was that the use of ontologies could enhance the understandability of explanations by humans. To validate this hypothesis we ran a user study in which participants were asked to carry out several tasks. In each task, the answers, response times, and user understandability and confidence were collected and measured. The data analysis revealed that the use of ontologies does enhance the understandability of explanations of black-box models by human users, in particular in the form of decision trees explaining artificial neural networks.
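As an illustration of the kind of global post-hoc explanation evaluated in the study, the sketch below trains a shallow decision tree as a surrogate of a neural network; the data, feature names and "concept" labels are hypothetical placeholders, not material from the study itself.

```python
# Minimal sketch of a global post-hoc explanation: a decision tree trained as a
# surrogate of a black-box neural network. All data and names are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical tabular data standing in for the study's domain data.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# The "black box": an artificial neural network.
black_box = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
black_box.fit(X, y)

# Global surrogate: fit a shallow decision tree on the black box's predictions,
# so the tree approximates the model's behaviour over the input space.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# The tree's rules serve as the human-readable explanation. In the study, feature
# names would be grounded in domain ontology concepts rather than generic labels.
feature_names = [f"concept_{i}" for i in range(X.shape[1])]  # placeholder names
print(export_text(surrogate, feature_names=feature_names))
```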
International Symposium/Conference on Music Information Retrieval, 2014
The multiple viewpoints representation is an event-based representation of symbolic music data which offers a means for the analysis and generation of notated music. Previous work using this representation has predominantly relied on n-gram and variable order Markov models for music sequence modelling. Recently the efficacy of a class of distributed models, namely restricted Boltzmann machines, was demonstrated for this purpose. In this paper, we demonstrate the use of two neural network models which use fixed-length sequences of various viewpoint types as input to predict the pitch of the next note in the sequence. The predictive performance of each of these models is comparable to that of models previously evaluated on the same task. We then combine the predictions of individual models using an entropy-weighted combination scheme to improve the overall prediction performance, and compare this with the predictions of a single equivalent model which takes as input all the viewpoint types of each of the individual models in the combination.
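The entropy-weighted combination scheme can be sketched as follows; the specific weighting (inverse entropy, then renormalisation) is an illustrative assumption rather than the exact formula used in the paper.

```python
# Illustrative sketch of entropy-weighted combination of per-model pitch
# distributions. The weighting (1/entropy, then renormalise) is an assumption
# for illustration, not necessarily the paper's exact formula.
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in bits) of a probability distribution."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log2(p))

def combine(predictions):
    """Combine predictive distributions over the same pitch alphabet,
    giving more weight to models that are more confident (lower entropy)."""
    weights = np.array([1.0 / entropy(p) for p in predictions])
    weights /= weights.sum()
    combined = sum(w * p for w, p in zip(weights, predictions))
    return combined / combined.sum()

# Two hypothetical models predicting the next pitch over a 4-symbol alphabet.
model_a = np.array([0.70, 0.10, 0.10, 0.10])  # confident model
model_b = np.array([0.30, 0.30, 0.20, 0.20])  # uncertain model
print(combine([model_a, model_b]))
```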
Automatically following rhythms by beat tracking is by no means a solved problem, especially when dealing with varying tempo and expressive timing. This paper presents a connectionist machine learning approach to expressive rhythm prediction, based on cognitive and neurological models. We detail a multi-layered recurrent neural network combining two complementary network models as hidden layers within one system. The first layer is a Gradient Frequency Neural Network (GFNN), a network of nonlinear oscillators which acts as an entraining and learning resonant filter to an audio signal. The GFNN resonances are used as inputs to a second layer, a Long Short-Term Memory recurrent neural network (LSTM). The LSTM learns the long-term temporal structures present in the GFNN's output, i.e. the metrical structure implicit within it. From these inferences, the LSTM predicts when the next rhythmic event is likely to occur. We train the system on a dataset selected for its expressive timing qualities and evaluate the system on its ability to predict rhythmic events. We show that our GFNN-LSTM model performs as well as state-of-the-art beat trackers and has the potential to be used in real-time interactive systems, following and generating expressive rhythmic structures.
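A minimal sketch of the layered idea, with GFNN resonances feeding an LSTM that predicts upcoming rhythmic events, is given below; the GFNN itself is replaced by placeholder features, and all dimensions are assumptions.

```python
# Sketch of the GFNN-LSTM idea: resonance features over time feed an LSTM that
# outputs, per frame, the probability of an upcoming rhythmic event. The GFNN is
# replaced by random placeholder features; all sizes are assumptions.
import torch
import torch.nn as nn

n_oscillators = 64   # assumed size of the GFNN resonance bank
hidden_size = 128

class RhythmPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_oscillators, hidden_size=hidden_size,
                            batch_first=True)
        self.out = nn.Linear(hidden_size, 1)   # event / no event per time step

    def forward(self, gfnn_features):
        h, _ = self.lstm(gfnn_features)
        return torch.sigmoid(self.out(h))      # P(rhythmic event) per frame

model = RhythmPredictor()
gfnn_features = torch.randn(8, 200, n_oscillators)  # batch of 8, 200 frames (placeholder)
event_probs = model(gfnn_features)
print(event_probs.shape)  # torch.Size([8, 200, 1])
```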
International Symposium/Conference on Music Information Retrieval, 2015
In this paper, we present the results of a study on dynamic models for predicting sequences of musical pitch in melodies. Such models predict a probability distribution over the possible values of the next pitch in a sequence, which is obtained by combining the predictions of two components: (1) a long-term model (LTM) learned offline on a corpus of melodies, and (2) a short-term model (STM) which incorporates context-specific information available during prediction. Both the LTM and the STM learn regularities in pitch sequences solely from data. The models are combined in an ensemble, wherein they are weighted by the relative entropies of their respective predictions. Following previous work that demonstrates the success of connectionist LTMs, we employ the recently proposed Recurrent Temporal Discriminative Restricted Boltzmann Machine (RTDRBM) as the LTM here. While it is possible for the same model to also serve as the STM, our experiments showed that n-gram models tended to learn faster than the RTDRBM in an online setting and that the hybrid of an RTDRBM LTM and an n-gram STM gives the best predictive performance yet on a corpus of monophonic chorale and folk melodies.
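A sketch of the LTM/STM combination weighted by relative entropy follows; the exponent and normalisation are assumptions for illustration, and the two input distributions are placeholders.

```python
# Sketch of combining long-term (LTM) and short-term (STM) pitch distributions,
# weighting each by its relative entropy. The exponent b is a free parameter;
# its value here is an assumption for illustration.
import numpy as np

def relative_entropy(p, eps=1e-12):
    """Entropy of p normalised by the maximum entropy over its alphabet."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log2(p)) / np.log2(len(p))

def combine_ltm_stm(ltm, stm, b=2.0):
    dists = [ltm, stm]
    weights = np.array([relative_entropy(d) ** (-b) for d in dists])
    weights /= weights.sum()
    mix = weights[0] * ltm + weights[1] * stm
    return mix / mix.sum()

ltm = np.array([0.50, 0.20, 0.20, 0.10])     # offline-trained prediction (placeholder)
stm = np.array([0.25, 0.25, 0.25, 0.25])     # online n-gram prediction (placeholder)
print(combine_ltm_stm(ltm, stm))
```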
International Symposium/Conference on Music Information Retrieval, 2002
Opuscope is an initiative targeted at sharing musical corpora and their analyses between researchers. The Opuscope repository will contain musical corpora of high quality which can be annotated with hand-made or algorithmic musical analyses. In this way, analytical results obtained by others can be used as a starting point for one's own investigations. Experiments performed on Opuscope corpora can easily be compared to other approaches, since an unequivocal mechanism for describing a given corpus will be provided.
Zenodo (CERN European Organization for Nuclear Research), Oct 4, 2014
In this paper we take a connectionist machine learning approach to the problem of metre perception and melody learning in musical signals. We present a multilayered network consisting of a nonlinear oscillator network and a recurrent neural network. The oscillator network acts as an entrained resonant filter to the musical signal. It 'perceives' metre by resonating nonlinearly to the inherent periodicities within the signal, creating a hierarchy of strong and weak periods. The neural network learns the long-term temporal structures present in this signal. We show that this network outperforms our previous approach of a single layer recurrent neural network in a melody and rhythm prediction task. We hypothesise that our system is enabled to make use of the relatively long temporal resonance in the oscillator network output, and therefore model more coherent long-term structures. A system such as this could be used in a multitude of analytic and generative scenarios, including live performance applications.
International Symposium/Conference on Music Information Retrieval, 2004
Melodic segmentation is an important topic for music information retrieval, because it divides melodies into musically relevant units. Most influential theories on melodic segmentation of the last decades have stressed the role of pitch for melodic segmentation. The general assumption was that relatively large changes or distances in any musical parameter, such as pitch, time, dynamics, or melodic movement, mark segment boundaries. This has generally been accepted despite the lack of empirical studies. Here, an empirical study is presented that investigates the influence of inter-onset intervals (IOIs), intensity accents, pitch intervals, and changes of pitch interval direction. The results show a significant influence only for IOIs and intensity, but neither for pitch intervals nor for changes in interval direction. The validity of the results and possible explanations are discussed, and directions for further investigations are outlined.
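The IOI cue that the study found significant can be illustrated with a small sketch; the threshold rule below is a simplified assumption, not the study's statistical analysis.

```python
# Sketch: flag candidate segment boundaries where an inter-onset interval (IOI)
# is much longer than its local context. The 1.5x-mean threshold is an assumed
# simplification, not the analysis method of the study.
import numpy as np

onsets = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.2, 3.7, 4.2])  # onset times in seconds (toy data)
iois = np.diff(onsets)

threshold = 1.5 * iois.mean()
boundaries = np.where(iois > threshold)[0] + 1  # boundary before the note following a long IOI
print("IOIs:", iois)
print("Boundary before note indices:", boundaries)
```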
We present a computational method for pattern discovery based on the application of the wavelet transform to symbolic representations of melodies or monophonic voices. We model the importance of a discovered pattern in terms of the compression ratio that can be achieved by using it to describe the part of the melody covered by its occurrences. The proposed method resembles the paradigmatic analysis approach developed in earlier work. In our approach, melodies are represented either as 'raw' 1-dimensional pitch signals or as these signals filtered with the continuous wavelet transform (CWT) at a single scale using the Haar wavelet. These representations are segmented using various approaches and the segments are then concatenated based on their similarity. The concatenated segments are compared, clustered and ranked. The method was evaluated on two musicological tasks: discovering themes and sections in the JKU Patterns Development Database and determining the parent compositions of excerpts from J. S. Bach's Two-Part Inventions (BWV 772-786). The results indicate that the new approach performs well at finding noticeable and/or important patterns in melodies and that filtering makes the method robust to melodic variation.
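Two core ingredients of the method, single-scale Haar filtering of a pitch signal and a compression-ratio score for patterns, can be sketched as follows; the scale, normalisation and exact score are simplified assumptions.

```python
# Sketch of (1) filtering a symbolic pitch signal with a Haar wavelet at one scale,
# implemented as a convolution, and (2) scoring a pattern by compression ratio.
# Scale, normalisation and the exact score are simplified assumptions.
import numpy as np

def haar_filter(pitch_signal, scale=4):
    """Continuous-wavelet-style Haar filtering at a single scale via convolution:
    the kernel is +1 over the first half of the window and -1 over the second half."""
    kernel = np.concatenate([np.ones(scale), -np.ones(scale)]) / (2 * scale)
    return np.convolve(pitch_signal, kernel, mode="same")

def compression_ratio(pattern_len, n_occurrences, covered_notes):
    """Coverage of the melody by the pattern's occurrences divided by the cost of
    storing the pattern once plus one pointer per occurrence (simplified)."""
    return covered_notes / (pattern_len + n_occurrences)

pitch = np.array([60, 62, 64, 65, 67, 65, 64, 62, 60, 62, 64, 65], dtype=float)  # toy melody
print(haar_filter(pitch))
print(compression_ratio(pattern_len=4, n_occurrences=3, covered_notes=12))
```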
This paper gives a survey of the infrastructure currently being developed in the MUSITECH project. The aim of this project is to conceptualize and implement a computational environment for navigation and interaction in internet-based musical applications. This comprises the development of data models, exchange formats, interface modules and a software framework. Our approach is to integrate different information and media types like MIDI, audio, text-based codes and metadata and their relations, and especially to provide means to describe arbitrary musical structures. We attempt to connect different musical domains to support cooperation and synergies. To establish platform independence, Java, the Extensible Markup Language (XML), and other open standards are used. The object model, a framework, various components for visualization, playback and other common tasks, and the technical infrastructure are being developed and will be evaluated within the project.
International Symposium/Conference on Music Information Retrieval, 2005
This paper describes the use of motif contour classes for efficient retrieval of melodies from music collections. Instead of extracting incipits or themes, complete monophonic pieces are indexed for their motifs, using classes of motif contours. Similarity relations between these classes can be used for a very efficient search. This can serve as a first-level search, which can be refined by applying more computationally intensive comparisons to its results. The model introduced has been implemented and tested using the MUSITECH framework. We present empirical and analytical results on the retrieval quality, the complexity, and the quality/efficiency trade-off.
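A minimal sketch of indexing motifs by coarse contour classes is given below; the up/down/repeat alphabet is an assumed simplification of the contour classes actually used.

```python
# Sketch: index a monophonic piece by the contour classes of its motifs.
# The U/D/R (up/down/repeat) alphabet is an assumed simplification of the
# contour classes described in the paper.
from collections import defaultdict

def contour_class(motif_pitches):
    """Map a motif (sequence of MIDI pitches) to a coarse contour string."""
    symbols = []
    for a, b in zip(motif_pitches, motif_pitches[1:]):
        symbols.append("U" if b > a else "D" if b < a else "R")
    return "".join(symbols)

def index_piece(pitches, motif_len=4):
    """Build an inverted index from contour class to motif start positions."""
    index = defaultdict(list)
    for i in range(len(pitches) - motif_len + 1):
        index[contour_class(pitches[i:i + motif_len])].append(i)
    return index

piece = [60, 62, 64, 62, 60, 62, 64, 62, 67, 65, 64, 62]  # toy melody
index = index_piece(piece)
print(index["UUD"])  # start positions of motifs whose contour is up-up-down
```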
NeSy4VRD is a multifaceted resource designed to support the development of neurosymbolic AI (NeSy) research. NeSy4VRD re-establishes public access to the images of the VRD dataset and couples them with an extensively revised, quality-improved version of the VRD visual relationship annotations. Crucially, NeSy4VRD provides a well-aligned, companion OWL ontology that describes the dataset domain. It comes with open source infrastructure that provides comprehensive support for extensibility of the annotations (which, in turn, facilitates extensibility of the ontology), and open source code for loading the annotations to/from a knowledge graph. We are contributing NeSy4VRD to the computer vision, NeSy and Semantic Web communities to help foster more NeSy research using OWL-based knowledge graphs.
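The idea of loading visual relationship annotations into a knowledge graph can be sketched as follows; the annotation structure, namespace and class/property names are hypothetical placeholders, not the NeSy4VRD ontology or its loader code.

```python
# Sketch of loading a visual relationship annotation into an RDF knowledge graph.
# The annotation structure, namespace URI and class/property names below are
# hypothetical placeholders, not the actual NeSy4VRD ontology or loader code.
from rdflib import Graph, Namespace, RDF

NS = Namespace("http://example.org/vrd#")  # placeholder namespace

annotation = {  # hypothetical simplified annotation for one image
    "image": "img_0001",
    "subject": ("Person", "person_1"),
    "predicate": "rides",
    "object": ("Horse", "horse_1"),
}

g = Graph()
subj = NS[annotation["subject"][1]]
obj = NS[annotation["object"][1]]
g.add((subj, RDF.type, NS[annotation["subject"][0]]))
g.add((obj, RDF.type, NS[annotation["object"][0]]))
g.add((subj, NS[annotation["predicate"]], obj))

print(g.serialize(format="turtle"))
```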
This paper reports on a short project that aimed to integrate two mathematical measures, convexity and compactness, into the ISSM, a model for music analysis. The measures convexity and compactness have been successfully used before in music informatics research. It turned out that the combination of these two tools was not as successful as initially hoped for, and the project was therefore aborted. We report here on the methods and give a rationale for not pursuing this project any further.
Music source separation in the time-frequency domain is commonly achieved by applying a soft or binary mask to the magnitude component of (complex) spectrograms. The phase component is usually not estimated, but instead copied from the mixture and applied to the magnitudes of the estimated isolated sources. While this method has several practical advantages, it imposes an upper bound on the performance of the system, where the estimated isolated sources inherently exhibit audible "phase artifacts". In this paper we address these shortcomings by directly estimating masks in the complex domain, extending recent work from the speech enhancement literature. The method is particularly well suited for multi-instrument musical source separation since residual phase artifacts are more pronounced for spectrally overlapping instrument sources, a common scenario in music. We show that complex masks result in better separation than masks that operate solely on the magnitude component.
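The difference between magnitude masking with copied mixture phase and complex masking can be sketched as follows; the shapes and mask values are toy placeholders.

```python
# Sketch contrasting a magnitude mask (mixture phase copied) with a complex mask
# applied directly to the mixture STFT. Shapes and mask values are toy placeholders.
import numpy as np

rng = np.random.default_rng(0)
mixture_stft = rng.standard_normal((513, 100)) + 1j * rng.standard_normal((513, 100))

# Magnitude masking: estimate only magnitudes, reuse the mixture phase.
mag_mask = rng.uniform(0.0, 1.0, mixture_stft.shape)
est_magnitude = mag_mask * np.abs(mixture_stft)
est_source_mag = est_magnitude * np.exp(1j * np.angle(mixture_stft))

# Complex masking: a complex-valued mask adjusts magnitude *and* phase.
complex_mask = mag_mask * np.exp(1j * rng.uniform(-0.2, 0.2, mixture_stft.shape))
est_source_cplx = complex_mask * mixture_stft

print(est_source_mag.dtype, est_source_cplx.dtype)  # both complex128
```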
We introduce the novel concept of anti-transfer learning for speech processing with convolutional neural networks. While transfer learning assumes that the learning process for a target task will benefit from re-using representations learned for another task, anti-transfer avoids the learning of representations that have been learned for an orthogonal task, i.e., one that is not relevant and potentially misleading for the target task, such as speaker identity for speech recognition or speech content for emotion recognition. In anti-transfer learning, we penalize similarity between activations of a network being trained and another one previously trained on an orthogonal task, which yields more suitable representations. This leads to better generalization and provides a degree of control over correlations that are spurious or undesirable, e.g. to avoid social bias. We have implemented anti-transfer for convolutional neural networks in different configurations with several similarity metrics and aggregation functions, which we evaluate and analyze with several speech and audio tasks and settings, using six datasets. We show that anti-transfer actually leads to the intended invariance to the orthogonal task and to more appropriate features for the target task at hand. Anti-transfer learning consistently improves classification accuracy in all test cases. While anti-transfer incurs computation and memory costs at training time, there is relatively little computation cost when using pre-trained models for orthogonal tasks. Anti-transfer is widely applicable and particularly useful where a specific invariance is desirable or where trained models are available and labeled data for orthogonal tasks are difficult to obtain.
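A simplified sketch of the anti-transfer penalty is given below: the target network's activations are discouraged from being similar to those of a frozen network trained on an orthogonal task. The layer choice, similarity metric and weighting are assumptions, not the paper's exact configuration.

```python
# Simplified sketch of an anti-transfer penalty: during training on the target
# task, add a term that penalises cosine similarity between the activations of
# the network being trained and those of a frozen network pre-trained on an
# orthogonal task. Layer choice, weighting and aggregation are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder():
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

target_net = make_encoder()
classifier = nn.Linear(16, 4)                  # e.g. 4 emotion classes (placeholder)
orthogonal_net = make_encoder()                # stands in for a net pre-trained on speaker ID
for p in orthogonal_net.parameters():
    p.requires_grad_(False)

def loss_fn(spectrogram, labels, anti_weight=0.1):
    feats = target_net(spectrogram)
    with torch.no_grad():
        orth_feats = orthogonal_net(spectrogram)
    task_loss = F.cross_entropy(classifier(feats), labels)
    # Anti-transfer term: discourage similarity to the orthogonal representations.
    anti_loss = F.cosine_similarity(feats, orth_feats, dim=1).abs().mean()
    return task_loss + anti_weight * anti_loss

x = torch.randn(8, 1, 64, 64)                  # toy batch of spectrogram patches
y = torch.randint(0, 4, (8,))
print(loss_fn(x, y))
```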
The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data, are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier permits the optimization of each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, quaternion reconstruction enables the latent dimensions to develop the intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in the test accuracy for all datasets, while drastically reducing the models' resource demands. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at: https://github.com/ispamm/rhemo.
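The quaternion ingredient can be illustrated with the Hamilton product on which quaternion-valued layers are built; the sketch below only shows how a four-channel embedding can be treated as a quaternion and is not the RH-emo architecture.

```python
# Sketch: the Hamilton product of two quaternions, the operation quaternion-valued
# layers build on. Here an embedding's four channels (r, i, j, k) are treated as
# quaternion components; this is an illustration, not the RH-emo architecture.
import numpy as np

def hamilton_product(q, p):
    """Hamilton product of quaternions q and p, each given as (r, i, j, k)."""
    r1, i1, j1, k1 = q
    r2, i2, j2, k2 = p
    return np.array([
        r1 * r2 - i1 * i2 - j1 * j2 - k1 * k2,
        r1 * i2 + i1 * r2 + j1 * k2 - k1 * j2,
        r1 * j2 - i1 * k2 + j1 * r2 + k1 * i2,
        r1 * k2 + i1 * j2 - j1 * i2 + k1 * r2,
    ])

# A 4-channel embedding slice (e.g. valence/arousal/dominance/emotion axes) and a
# quaternion weight, multiplied as quaternions rather than element-wise.
embedding = np.array([0.5, -0.1, 0.3, 0.2])
weight = np.array([0.9, 0.05, -0.05, 0.1])
print(hamilton_product(weight, embedding))
```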
International Journal of Smart Engineering System Design, 2003
The task of recognizing patterns and assigning rhythmic structure to unquantized musical input is a fundamental one for interactive musical systems and for searching musical databases, since melody is based on rhythm. We use a combination of combinatorial pattern matching and structural interpretation with a match quality rating by a neuro-fuzzy system that incorporates musical knowledge and operates on perceptually relevant features extracted from the input data. It can learn from relatively few expert examples by using iterative training with relative samples. It shows good recognition results, and the pre-filtering and optimization methods used facilitate efficient computation. The system is modular, so feature extraction, rules, and perceptual constraints can be changed to adapt it to other areas of application.
Proceedings of the International AAAI Conference on Web and Social Media, May 31, 2022
Recent advances in fake news detection have exploited the success of large-scale pre-trained language models (PLMs). The predominant state-of-the-art approaches are based on fine-tuning PLMs on labelled fake news datasets. However, large-scale PLMs are generally not trained on structured factual data and hence may not possess priors that are grounded in factually accurate knowledge. The use of existing knowledge bases (KBs) with rich human-curated factual information thus has the potential to make fake news detection more effective and robust. In this paper, we investigate the impact of knowledge integration into PLMs for fake news detection. We study several state-of-the-art approaches for knowledge integration, mostly using Wikidata as the KB, on two popular fake news datasets: LIAR, a politics-based dataset, and COVID-19, a dataset of messages posted on social media relating to the COVID-19 pandemic. Our experiments show that knowledge-enhanced models can significantly improve fake news detection on LIAR, where the KB is relevant and up-to-date. The mixed results on COVID-19 highlight the reliance on stylistic features and the importance of domain-specific and current KBs. The code is available at https://github.com/chenxwh/fake-news-detection.
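The plain fine-tuning baseline that knowledge-enhanced approaches are compared against can be sketched as follows; the model name, the six-way label scheme and the toy example are illustrative assumptions, not the paper's training setup.

```python
# Sketch of a plain PLM fine-tuning baseline for fake news classification
# (before any knowledge integration). Model choice, the 6-way label scheme
# and the toy example are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"               # placeholder PLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

claim = "The unemployment rate has doubled in the last year."  # toy claim
inputs = tokenizer(claim, return_tensors="pt", truncation=True)
label = torch.tensor([2])                       # toy truthfulness label

outputs = model(**inputs, labels=label)         # returns loss and logits
outputs.loss.backward()                         # one fine-tuning step would follow
print(outputs.logits.shape)                     # torch.Size([1, 6])
```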
Vocal source separation and fundamental frequency estimation in music are tightly related tasks. The outputs of vocal source separation systems have previously been used as inputs to vocal fundamental frequency estimation systems; conversely, vocal fundamental frequency has been used as side information to improve vocal source separation. In this paper, we propose several different approaches for jointly separating vocals and estimating fundamental frequency. We show that joint learning is advantageous for these tasks, and that a stacked architecture which first performs vocal separation outperforms the other configurations considered. Furthermore, the best joint model achieves state-of-the-art results for vocal-f0 estimation on the iKala dataset. Finally, we highlight the importance of performing polyphonic, rather than monophonic vocal-f0 estimation for many real-world cases.
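A schematic sketch of the stacked arrangement, in which a separation stage feeds an f0 estimation stage, is given below; all layer sizes and shapes are placeholders, and this is not the paper's architecture.

```python
# Schematic sketch of a stacked arrangement: a separation network produces a
# vocal spectrogram estimate, which a second network then uses (together with the
# mixture) to estimate a frame-wise f0 salience map. All layer sizes are
# placeholders; this is not the paper's architecture.
import torch
import torch.nn as nn

n_bins, n_frames, n_f0_bins = 256, 128, 60

separator = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())   # vocal mask
f0_estimator = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(16, n_f0_bins, 3, padding=1),
                             nn.AdaptiveAvgPool2d((1, n_frames)))          # collapse frequency axis

mixture = torch.rand(4, 1, n_bins, n_frames)           # toy magnitude spectrograms
vocals = separator(mixture) * mixture                   # stage 1: separate vocals
f0_input = torch.cat([mixture, vocals], dim=1)          # stage 2 sees mixture + vocal estimate
f0_salience = f0_estimator(f0_input).squeeze(2)         # (batch, f0 bins, frames)
print(vocals.shape, f0_salience.shape)
```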