
Justin Salamon
For my most up-to-date publications list please see: justinsalamon.com/publications.html
I am a senior research scientist and member of the Audio Research Group at Adobe Research in San Francisco. Previously I was a senior research scientist at the Music and Audio Research Laboratory and Center for Urban Science and Progress of New York University.
My research focuses on the application of machine learning and signal processing to audio & video, with applications in machine listening, representation learning & self-supervision, music information retrieval, bioacoustics, environmental sound analysis and open source software & data.
Papers by Justin Salamon
Self-supervised learning is a powerful tool for machine learning tasks with limited labeled data but extensive unlabeled data. To learn representations, self-supervised models are typically trained on a pretext task to predict structure in the data (e.g., audio-visual correspondence, short-term temporal sequence, word sequence) that is indicative of higher-level concepts relevant to a target, downstream task. Sensor networks are promising yet unexplored sources of data for self-supervised learning: they collect large amounts of unlabeled yet timestamped data over extended periods of time and typically exhibit long-term temporal structure (e.g., over hours, months, years) not observable at the short time scales previously explored in self-supervised learning (e.g., seconds). This structure can be present even in single-modal data and could therefore be exploited for self-supervision in many types of sensor networks. In this work, we present a model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network. We then demonstrate the utility of the learned audio representation in an urban sound event detection task with limited labeled data.
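The abstract above does not spell out the exact pretext formulation, so the following is only an illustrative sketch of the general idea: using sensor timestamps as free labels, an encoder is trained to predict a cyclic time-of-day bucket for an audio clip, and its embedding can then be reused downstream. The model shape, feature sizes, and number of buckets are assumptions, not the paper's configuration.

```python
# Illustrative sketch only: a pretext task that predicts a cyclic time-of-day
# bucket from an audio embedding, using timestamps as free labels.
import torch
import torch.nn as nn

N_MEL, N_FRAMES, N_HOUR_BUCKETS = 64, 128, 24  # assumed sizes

class PretextModel(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        # Small conv encoder over log-mel patches -> fixed-size embedding
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, emb_dim),
        )
        # Pretext head: classify which hour-of-day bucket the clip came from
        self.head = nn.Linear(emb_dim, N_HOUR_BUCKETS)

    def forward(self, x):
        z = self.encoder(x)        # embedding reused for downstream tasks
        return self.head(z), z

model = PretextModel()
logmel = torch.randn(8, 1, N_MEL, N_FRAMES)       # batch of log-mel patches
hour = torch.randint(0, N_HOUR_BUCKETS, (8,))     # "free" labels from timestamps
logits, embedding = model(logmel)
loss = nn.functional.cross_entropy(logits, hour)  # train encoder via pretext loss
```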
This paper presents a dataset for the development and evaluation of machine listening systems for real-world urban noise monitoring. It consists of 3068 audio recordings from the “Sounds of New York City” (SONYC) acoustic sensor network. Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 fine-grained classes that were chosen in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes. In this work, we describe the collection of this dataset, the metrics used to evaluate tagging systems, and the results of a simple baseline model.
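The specific evaluation metrics are not named in this excerpt; as a purely illustrative sketch, the snippet below scores multi-label tagging at both the fine and coarse level with macro-averaged average precision, using a hypothetical fine-to-coarse mapping and random stand-in predictions.

```python
# Illustrative sketch of multi-label tag evaluation; the exact metrics and
# class mapping used for this dataset are not specified in the excerpt above.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_clips, n_fine, n_coarse = 100, 23, 8
y_true = rng.integers(0, 2, (n_clips, n_fine))   # clip-level tag presence (stand-in)
y_score = rng.random((n_clips, n_fine))          # model tag probabilities (stand-in)

# Hypothetical fine-to-coarse mapping; every coarse class gets some members.
fine_to_coarse = np.arange(n_fine) % n_coarse

def coarsen(y, mapping, n_groups):
    """Aggregate fine-grained tags/scores to coarse-grained by max over members."""
    out = np.zeros((y.shape[0], n_groups))
    for c in range(n_groups):
        out[:, c] = y[:, mapping == c].max(axis=1)
    return out

fine_auprc = average_precision_score(y_true, y_score, average="macro")
coarse_auprc = average_precision_score(
    coarsen(y_true, fine_to_coarse, n_coarse),
    coarsen(y_score, fine_to_coarse, n_coarse),
    average="macro",
)
print(f"fine AUPRC={fine_auprc:.3f}, coarse AUPRC={coarse_auprc:.3f}")
```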
This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a follow-up to Task 4 of DCASE 2018 and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e., training labels without time boundaries, and strongly labeled synthesized data. We introduce the Domestic Environment Sound Event Detection (DESED) dataset, which mixes part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year that we describe in more detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets, as well as on several additional datasets. The best systems from this year outperform last year's winning system by about 10 percentage points in terms of F-measure.
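As a rough illustration (not the challenge baseline), one common way to combine the two kinds of supervision described above is to train a frame-level model, supervise it directly where frame-level (strong) labels exist, and pool its frame predictions to clip level where only clip-level (weak) tags exist. The sizes, architecture, and mean pooling below are assumptions for the sketch; in practice the weakly and strongly labeled clips come from separate subsets rather than one dummy batch.

```python
# Illustrative sketch: combining clip-level (weak) and frame-level (strong)
# supervision for sound event detection.
import torch
import torch.nn as nn

N_CLASSES, N_FRAMES, N_FEAT = 10, 250, 64   # assumed sizes

class FrameTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_FEAT, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, N_CLASSES)

    def forward(self, x):                    # x: (batch, frames, features)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))    # frame-level event probabilities

model = FrameTagger()
bce = nn.BCELoss()

feats = torch.randn(4, N_FRAMES, N_FEAT)
frame_probs = model(feats)                               # (4, frames, classes)

# Strongly labeled (synthetic) clips: frame-level targets are available.
strong_targets = torch.randint(0, 2, (4, N_FRAMES, N_CLASSES)).float()
strong_loss = bce(frame_probs, strong_targets)

# Weakly labeled clips: only clip-level tags, so pool frames (here: mean).
weak_targets = torch.randint(0, 2, (4, N_CLASSES)).float()
weak_loss = bce(frame_probs.mean(dim=1), weak_targets)

loss = strong_loss + weak_loss    # simple sum; weighting is a design choice
```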
This work studies the matching of human vocal imitations to birdsong recordings. We recorded human vocal imitations of birdsong and subsequently analysed these data using three categories of audio features for matching imitations to original birdsong: spectral, temporal, and spectrotemporal. These exploratory analyses suggest that spectral features can help distinguish imitation strategies (e.g. whistling vs. singing) but are insufficient for distinguishing species. Similarly, whereas temporal features are correlated between human imitations and natural birdsong, they are also insufficient. Spectrotemporal features showed the greatest promise, in particular when used to extract a representation of the pitch contour of birdsong and human imitations. This finding suggests a link between the task of matching human imitations to birdsong and retrieval tasks in the music domain such as query-by-humming and cover song retrieval; we borrow from such existing methodologies to outline directions for future research.
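The abstract does not specify the matching method, so the snippet below is only a sketch of the query-by-humming-style idea it alludes to: extract pitch contours with pYIN and compare them with dynamic time warping. The filenames, frequency range, and median-based transposition normalisation are illustrative assumptions.

```python
# Illustrative contour-matching sketch using librosa's pYIN and DTW.
import numpy as np
import librosa

def pitch_contour(path, fmin=80.0, fmax=2000.0):
    """Extract a voiced-only log-frequency (cents) pitch contour."""
    y, sr = librosa.load(path, sr=22050)
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    f0 = f0[voiced & ~np.isnan(f0)]
    return 1200.0 * np.log2(f0 / 55.0)         # cents relative to A1

def contour_distance(contour_a, contour_b):
    """DTW cost between two contours, after removing each contour's median
    (a simple form of transposition invariance)."""
    a = contour_a - np.median(contour_a)
    b = contour_b - np.median(contour_b)
    D, _ = librosa.sequence.dtw(X=a[np.newaxis, :], Y=b[np.newaxis, :])
    return D[-1, -1] / (len(a) + len(b))       # length-normalised cost

# Usage (placeholder filenames):
# d = contour_distance(pitch_contour("imitation.wav"), pitch_contour("birdsong.wav"))
```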
Sound event detection (SED) is concerned with labeling segments of audio recordings by the presence of active sound sources. SED is typically posed as a supervised machine learning problem, requiring strong annotations for the presence or absence of each sound source at every time instant within the recording. However, strong annotations of this type are both labor- and cost-intensive for human annotators to produce, which limits the practical scalability of SED methods.
In this work, we treat SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality. The models, however, must still produce temporally dynamic predictions, which must be aggregated (pooled) when comparing against static labels during training. To facilitate this aggregation, we develop a family of adaptive pooling operators, referred to as auto-pool, which smoothly interpolate between common pooling operators such as min-, max-, or average-pooling, and automatically adapt to the characteristics of the sound sources in question. We evaluate the proposed pooling operators on three datasets and demonstrate that in each case the proposed methods outperform non-adaptive pooling operators for static prediction, and nearly match the performance of models trained with strong, dynamic annotations. The proposed method is evaluated in conjunction with convolutional neural networks, but can be readily applied to any differentiable model for time-series label prediction. While this article focuses on SED applications, the proposed methods are general and could be applied widely to MIL problems in any domain.
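One way to realise a pooling operator with the interpolation behaviour described above is a softmax-weighted average over time with a learnable per-class parameter alpha, where alpha = 0 reduces to mean pooling and large positive or negative alpha approaches max or min pooling respectively. The sketch below follows that idea; the paper's exact parameterisation and any regularisation may differ.

```python
# Sketch of an adaptive (softmax-weighted) pooling layer over time.
import torch
import torch.nn as nn

class AdaptivePool(nn.Module):
    """Pool frame-level probabilities to clip level with a learnable per-class
    parameter alpha: alpha = 0 gives mean pooling, large positive alpha
    approaches max pooling, large negative alpha approaches min pooling."""
    def __init__(self, n_classes):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_classes))   # start at mean pooling

    def forward(self, frame_probs):                    # (batch, time, classes)
        weights = torch.softmax(self.alpha * frame_probs, dim=1)
        return (weights * frame_probs).sum(dim=1)      # (batch, classes)

# Usage: pool frame-level predictions to clip level for weak-label training.
frame_probs = torch.rand(4, 250, 10)     # e.g. sigmoid outputs of an SED model
clip_probs = AdaptivePool(10)(frame_probs)
```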
SONYC includes a distributed network of both sensors and people for large-scale noise monitoring. The sensors use low-cost, low-power technology and cutting-edge machine listening techniques to produce calibrated acoustic measurements and recognize individual sound sources in real time. Citizen science methods are used to help urban residents connect to city agencies and each other, understand their noise footprint, and facilitate reporting and self-regulation. Crucially, SONYC utilizes big data solutions to analyze, retrieve and visualize information from sensors and citizens, creating a comprehensive acoustic model of the city that can be used to identify significant patterns of noise pollution. These data can be used to drive the strategic application of noise code enforcement by city agencies to optimize the reduction of noise pollution. The entire system, integrating cyber, physical and social infrastructure, forms a closed loop of continuous sensing, analysis and actuation on the environment.
SONYC provides a blueprint for the mitigation of noise pollution that can potentially be applied to other cities in the US and abroad.
Estimating the fundamental frequency of a monophonic sound recording, also known as pitch tracking, is fundamental to audio processing, with multiple applications in speech processing and music information retrieval. To date, the best performing techniques, such as the pYIN algorithm, are based on a combination of DSP pipelines and heuristics. While such techniques perform very well on average, there remain many cases in which they fail to correctly estimate the pitch. In this paper, we propose a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform. We show that the proposed model produces state-of-the-art results, performing as well as or better than pYIN. Furthermore, we evaluate the model's generalizability in terms of noise robustness. A pretrained version of CREPE is made freely available as an open-source Python module for easy application.
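For reference, the open-source module mentioned above is installable as the `crepe` package on PyPI; a minimal usage sketch is shown below. The filename is a placeholder, and the available keyword arguments may vary between versions.

```python
# Minimal usage sketch of the open-source CREPE Python module (pip install crepe).
from scipy.io import wavfile
import crepe

sr, audio = wavfile.read("vocal.wav")    # placeholder filename
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# time: frame timestamps (s); frequency: estimated pitch (Hz);
# confidence: voicing confidence in [0, 1]; activation: raw network output.
for t, f, c in zip(time[:5], frequency[:5], confidence[:5]):
    print(f"{t:.2f}s  {f:7.2f} Hz  (confidence {c:.2f})")
```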
Audio annotation is a key step in developing machine-listening systems. It is also a time-consuming process, which has motivated investigators to crowdsource audio annotations. However, there are many factors that affect annotations, many of which have not been adequately investigated. In previous work, we investigated the effects of visualization aids and sound scene complexity on the quality of crowdsourced sound-event annotations. In this paper, we extend that work by investigating the effect of sound-event loudness on both sound-event source annotations and sound-event proximity annotations. We find that sound class, loudness, and annotator bias all affect how listeners annotate proximity. We also find that loudness affects recall more than precision, and that the strengths of these effects are strongly influenced by the sound class. These findings are important not only for designing effective audio annotation processes, but also for effectively training and evaluating machine-listening systems.