
Justin Salamon
For my most up-to-date publications list please see: justinsalamon.com/publications.html
I am a senior research scientist and member of the Audio Research Group at Adobe Research in San Francisco. Previously I was a senior research scientist at the Music and Audio Research Laboratory and Center for Urban Science and Progress of New York University.
My research focuses on the application of machine learning and signal processing to audio & video, with applications in machine listening, representation learning & self-supervision, music information retrieval, bioacoustics, environmental sound analysis and open source software & data.
Papers by Justin Salamon
Self-supervised learning is a powerful tool for machine learning tasks with limited labeled data but extensive unlabeled data. To learn representations, self-supervised models are typically trained on a pretext task to predict structure in the data (e.g., audio-visual correspondence, short-term temporal sequence, word sequence) that is indicative of higher-level concepts relevant to a target, downstream task. Sensor networks are promising yet unexplored sources of data for self-supervised learning: they collect large amounts of unlabeled yet timestamped data over extended periods of time and typically exhibit long-term temporal structure (e.g., over hours, months, years) not observable at the short time scales previously explored in self-supervised learning (e.g., seconds). This structure can be present even in single-modal data and could therefore be exploited for self-supervision in many types of sensor networks. In this work, we present a model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network. We then demonstrate the utility of the learned audio representation in an urban sound event detection task with limited labeled data.
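The abstract above does not spell out the exact pretext formulation, so the following is only an illustrative sketch of the general idea: using sensor timestamps as free labels, an encoder is trained to predict a cyclic time-of-day bucket for an audio clip, and its embedding can then be reused downstream. The model shape, feature sizes, and number of buckets are assumptions, not the paper's configuration.

```python
# Illustrative sketch only: a pretext task that predicts a cyclic time-of-day
# bucket from an audio embedding, using timestamps as free labels.
import torch
import torch.nn as nn

N_MEL, N_FRAMES, N_HOUR_BUCKETS = 64, 128, 24  # assumed sizes

class PretextModel(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        # Small conv encoder over log-mel patches -> fixed-size embedding
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, emb_dim),
        )
        # Pretext head: classify which hour-of-day bucket the clip came from
        self.head = nn.Linear(emb_dim, N_HOUR_BUCKETS)

    def forward(self, x):
        z = self.encoder(x)        # embedding reused for downstream tasks
        return self.head(z), z

model = PretextModel()
logmel = torch.randn(8, 1, N_MEL, N_FRAMES)       # batch of log-mel patches
hour = torch.randint(0, N_HOUR_BUCKETS, (8,))     # "free" labels from timestamps
logits, embedding = model(logmel)
loss = nn.functional.cross_entropy(logits, hour)  # train encoder via pretext loss
```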
This paper presents a dataset for the development and evaluation of machine listening systems for real-world urban noise monitoring. It consists of 3068 audio recordings from the “Sounds of New York City” (SONYC) acoustic sensor network. Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 fine-grained classes that were chosen in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes. In this work, we describe the collection of this dataset, the metrics used to evaluate tagging systems, and the results of a simple baseline model.
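The specific evaluation metrics are not named in this excerpt; as a purely illustrative sketch, the snippet below scores multi-label tagging at both the fine and coarse level with macro-averaged average precision, using a hypothetical fine-to-coarse mapping and random stand-in predictions.

```python
# Illustrative sketch of multi-label tag evaluation; the exact metrics and
# class mapping used for this dataset are not specified in the excerpt above.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_clips, n_fine, n_coarse = 100, 23, 8
y_true = rng.integers(0, 2, (n_clips, n_fine))   # clip-level tag presence (stand-in)
y_score = rng.random((n_clips, n_fine))          # model tag probabilities (stand-in)

# Hypothetical fine-to-coarse mapping; every coarse class gets some members.
fine_to_coarse = np.arange(n_fine) % n_coarse

def coarsen(y, mapping, n_groups):
    """Aggregate fine-grained tags/scores to coarse-grained by max over members."""
    out = np.zeros((y.shape[0], n_groups))
    for c in range(n_groups):
        out[:, c] = y[:, mapping == c].max(axis=1)
    return out

fine_auprc = average_precision_score(y_true, y_score, average="macro")
coarse_auprc = average_precision_score(
    coarsen(y_true, fine_to_coarse, n_coarse),
    coarsen(y_score, fine_to_coarse, n_coarse),
    average="macro",
)
print(f"fine AUPRC={fine_auprc:.3f}, coarse AUPRC={coarse_auprc:.3f}")
```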
This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a follow-up to Task 4 of DCASE 2018 and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e., training labels without time boundaries, and strongly labeled synthesized data. We introduce the Domestic Environment Sound Event Detection (DESED) dataset, which mixes part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year that we describe in more detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets, as well as on several additional datasets. The best systems from this year outperform last year's winning system by about 10 percentage points in terms of F-measure.
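As a rough illustration (not the challenge baseline), one common way to combine the two kinds of supervision described above is to train a frame-level model, supervise it directly where frame-level (strong) labels exist, and pool its frame predictions to clip level where only clip-level (weak) tags exist. The sizes, architecture, and mean pooling below are assumptions for the sketch; in practice the weakly and strongly labeled clips come from separate subsets rather than one dummy batch.

```python
# Illustrative sketch: combining clip-level (weak) and frame-level (strong)
# supervision for sound event detection.
import torch
import torch.nn as nn

N_CLASSES, N_FRAMES, N_FEAT = 10, 250, 64   # assumed sizes

class FrameTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_FEAT, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, N_CLASSES)

    def forward(self, x):                    # x: (batch, frames, features)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))    # frame-level event probabilities

model = FrameTagger()
bce = nn.BCELoss()

feats = torch.randn(4, N_FRAMES, N_FEAT)
frame_probs = model(feats)                               # (4, frames, classes)

# Strongly labeled (synthetic) clips: frame-level targets are available.
strong_targets = torch.randint(0, 2, (4, N_FRAMES, N_CLASSES)).float()
strong_loss = bce(frame_probs, strong_targets)

# Weakly labeled clips: only clip-level tags, so pool frames (here: mean).
weak_targets = torch.randint(0, 2, (4, N_CLASSES)).float()
weak_loss = bce(frame_probs.mean(dim=1), weak_targets)

loss = strong_loss + weak_loss    # simple sum; weighting is a design choice
```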
This work studies the matching of human vocal imitations to birdsong recordings. We recorded human vocal imitations of birdsong and subsequently analysed these data using three categories of audio features for matching imitations to original birdsong: spectral, temporal, and spectrotemporal. These exploratory analyses suggest that spectral features can help distinguish imitation strategies (e.g. whistling vs. singing) but are insufficient for distinguishing species. Similarly, whereas temporal features are correlated between human imitations and natural birdsong, they are also insufficient. Spectrotemporal features showed the greatest promise, in particular when used to extract a representation of the pitch contour of birdsong and human imitations. This finding suggests a link between the task of matching human imitations to birdsong and retrieval tasks in the music domain such as query-by-humming and cover song retrieval; we borrow from such existing methodologies to outline directions for future research.
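The abstract does not specify the matching method, so the snippet below is only a sketch of the query-by-humming-style idea it alludes to: extract pitch contours with pYIN and compare them with dynamic time warping. The filenames, frequency range, and median-based transposition normalisation are illustrative assumptions.

```python
# Illustrative contour-matching sketch using librosa's pYIN and DTW.
import numpy as np
import librosa

def pitch_contour(path, fmin=80.0, fmax=2000.0):
    """Extract a voiced-only log-frequency (cents) pitch contour."""
    y, sr = librosa.load(path, sr=22050)
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    f0 = f0[voiced & ~np.isnan(f0)]
    return 1200.0 * np.log2(f0 / 55.0)         # cents relative to A1

def contour_distance(contour_a, contour_b):
    """DTW cost between two contours, after removing each contour's median
    (a simple form of transposition invariance)."""
    a = contour_a - np.median(contour_a)
    b = contour_b - np.median(contour_b)
    D, _ = librosa.sequence.dtw(X=a[np.newaxis, :], Y=b[np.newaxis, :])
    return D[-1, -1] / (len(a) + len(b))       # length-normalised cost

# Usage (placeholder filenames):
# d = contour_distance(pitch_contour("imitation.wav"), pitch_contour("birdsong.wav"))
```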
Sound event detection (SED) is concerned with labeling segments of audio recordings by the presence of active sound sources. SED is typically posed as a supervised machine learning problem, requiring strong annotations for the presence or absence of each sound source at every time instant within the recording. However, strong annotations of this type are both labor- and cost-intensive for human annotators to produce, which limits the practical scalability of SED methods.
In this work, we treat SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality. The models, however, must still produce temporally dynamic predictions, which must be aggregated (pooled) when comparing against static labels during training. To facilitate this aggregation, we develop a family of adaptive pooling operators, referred to as auto-pool, which smoothly interpolate between common pooling operators such as min-, max-, or average-pooling, and automatically adapt to the characteristics of the sound sources in question. We evaluate the proposed pooling operators on three datasets and demonstrate that in each case the proposed methods outperform non-adaptive pooling operators for static prediction, and nearly match the performance of models trained with strong, dynamic annotations. The proposed method is evaluated in conjunction with convolutional neural networks, but can be readily applied to any differentiable model for time-series label prediction. While this article focuses on SED applications, the proposed methods are general and could be applied widely to MIL problems in any domain.
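One way to realise a pooling operator with the interpolation behaviour described above is a softmax-weighted average over time with a learnable per-class parameter alpha, where alpha = 0 reduces to mean pooling and large positive or negative alpha approaches max or min pooling respectively. The sketch below follows that idea; the paper's exact parameterisation and any regularisation may differ.

```python
# Sketch of an adaptive (softmax-weighted) pooling layer over time.
import torch
import torch.nn as nn

class AdaptivePool(nn.Module):
    """Pool frame-level probabilities to clip level with a learnable per-class
    parameter alpha: alpha = 0 gives mean pooling, large positive alpha
    approaches max pooling, large negative alpha approaches min pooling."""
    def __init__(self, n_classes):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_classes))   # start at mean pooling

    def forward(self, frame_probs):                    # (batch, time, classes)
        weights = torch.softmax(self.alpha * frame_probs, dim=1)
        return (weights * frame_probs).sum(dim=1)      # (batch, classes)

# Usage: pool frame-level predictions to clip level for weak-label training.
frame_probs = torch.rand(4, 250, 10)     # e.g. sigmoid outputs of an SED model
clip_probs = AdaptivePool(10)(frame_probs)
```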
SONYC includes a distributed network of both sensors and people for large-scale noise monitoring. The sensors use low-cost, low-power technology and cutting-edge machine listening techniques to produce calibrated acoustic measurements and recognize individual sound sources in real time. Citizen science methods are used to help urban residents connect to city agencies and each other, understand their noise footprint, and facilitate reporting and self-regulation. Crucially, SONYC utilizes big data solutions to analyze, retrieve and visualize information from sensors and citizens, creating a comprehensive acoustic model of the city that can be used to identify significant patterns of noise pollution. These data can be used to drive the strategic application of noise code enforcement by city agencies to optimize the reduction of noise pollution. The entire system, integrating cyber, physical and social infrastructure, forms a closed loop of continuous sensing, analysis and actuation on the environment.
SONYC provides a blueprint for the mitigation of noise pollution that can potentially be applied to other cities in the US and abroad.
Estimating the fundamental frequency of a monophonic sound recording, also known as pitch tracking, is fundamental to audio processing, with multiple applications in speech processing and music information retrieval. To date, the best performing techniques, such as the pYIN algorithm, are based on a combination of DSP pipelines and heuristics. While such techniques perform very well on average, there remain many cases in which they fail to correctly estimate the pitch. In this paper, we propose a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform. We show that the proposed model produces state-of-the-art results, performing as well as or better than pYIN. Furthermore, we evaluate the model's generalizability in terms of noise robustness. A pretrained version of CREPE is made freely available as an open-source Python module for easy application.
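For reference, the open-source module mentioned above is installable as the `crepe` package on PyPI; a minimal usage sketch is shown below. The filename is a placeholder, and the available keyword arguments may vary between versions.

```python
# Minimal usage sketch of the open-source CREPE Python module (pip install crepe).
from scipy.io import wavfile
import crepe

sr, audio = wavfile.read("vocal.wav")    # placeholder filename
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# time: frame timestamps (s); frequency: estimated pitch (Hz);
# confidence: voicing confidence in [0, 1]; activation: raw network output.
for t, f, c in zip(time[:5], frequency[:5], confidence[:5]):
    print(f"{t:.2f}s  {f:7.2f} Hz  (confidence {c:.2f})")
```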
Audio annotation is a key step in developing machine-listening systems. It is also a time-consuming process, which has motivated investigators to crowdsource audio annotations. However, there are many factors that affect annotations, many of which have not been adequately investigated. In previous work, we investigated the effects of visualization aids and sound scene complexity on the quality of crowdsourced sound-event annotations. In this paper, we extend that work by investigating the effect of sound-event loudness on both sound-event source annotations and sound-event proximity annotations. We find that sound class, loudness, and annotator bias all affect how listeners annotate proximity. We also find that loudness affects recall more than precision, and that the strengths of these effects are strongly influenced by the sound class. These findings are important not only for designing effective audio annotation processes, but also for effectively training and evaluating machine-listening systems.