DDX7: Differentiable FM Synthesis of Musical Instrument Sounds
2022, Zenodo (CERN European Organization for Nuclear Research)
https://doi.org/10.5281/ZENODO.7343063

Abstract
FM synthesis is a well-known algorithm used to generate complex timbre from a compact set of design primitives. Because FM synthesizers typically feature a MIDI interface, controlling them from an audio source is usually impractical. On the other hand, Differentiable Digital Signal Processing (DDSP) has enabled nuanced audio rendering by Deep Neural Networks (DNNs) that learn to control differentiable synthesis layers from arbitrary sound inputs. The training process relies on a corpus of audio for supervision and on spectral reconstruction loss functions. While such functions are well suited to matching spectral amplitudes, they lack pitch direction, which can hinder the joint optimization of FM synthesizer parameters. In this paper, we take steps towards enabling continuous control of a well-established FM synthesis architecture from an audio input. First, we discuss a set of design constraints that ease spectral optimization of a differentiable FM synthesizer via a standard reconstruction loss. Next, we present Differentiable DX7 (DDX7), a lightweight architecture for neural FM resynthesis of musical instrument sounds from a compact set of parameters. We train the model on instrument samples extracted from the URMP dataset and quantitatively demonstrate audio quality comparable to that of selected benchmarks.
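For context, the two building blocks the abstract refers to can be sketched in a few lines: classic two-operator FM (a modulator sine phase-modulates a carrier sine, as in Chowning-style / DX7-style synthesis) and a toy multi-scale spectral reconstruction loss that compares magnitude spectrograms only — which is why such losses carry no pitch direction. This is a minimal illustrative sketch, not the paper's implementation; all function and parameter names are our own.

```python
import numpy as np

def fm_tone(f_carrier, f_mod, index, duration=1.0, sr=16000, amp=0.5):
    """Render a single two-operator FM tone: a modulator sine
    phase-modulates a carrier sine, producing sidebands at
    f_carrier +/- k * f_mod with strengths set by the modulation index."""
    t = np.arange(int(duration * sr)) / sr
    modulator = index * np.sin(2 * np.pi * f_mod * t)
    return amp * np.sin(2 * np.pi * f_carrier * t + modulator)

def spectral_loss(x, y, fft_sizes=(256, 1024)):
    """Toy multi-scale spectral reconstruction loss: mean absolute
    difference between magnitude spectrograms at several FFT sizes.
    Only magnitudes are compared, so the loss gives no gradient
    'direction' toward the correct pitch."""
    loss = 0.0
    for n in fft_sizes:
        frames_x = x[: len(x) // n * n].reshape(-1, n)
        frames_y = y[: len(y) // n * n].reshape(-1, n)
        mag_x = np.abs(np.fft.rfft(frames_x, axis=1))
        mag_y = np.abs(np.fft.rfft(frames_y, axis=1))
        loss += np.mean(np.abs(mag_x - mag_y))
    return loss

# A 440 Hz carrier with a 2:1 modulator ratio gives a harmonic spectrum;
# a detuned copy yields a nonzero spectral loss against it.
tone = fm_tone(440.0, 880.0, index=3.0)
detuned = fm_tone(466.0, 932.0, index=3.0)
mismatch = spectral_loss(tone, detuned)
```

A DX7-style synthesizer chains and sums six such operators according to a fixed routing ("algorithm"); DDX7 learns time-varying envelopes for those operators rather than the routing itself.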