Physics-informed differentiable method for piano modeling
2024, Frontiers in Signal Processing
https://doi.org/10.3389/FRSIP.2023.1276748

Abstract
Numerical emulation of the piano has been a subject of study since the early days of sound synthesis. High-accuracy sound synthesis of acoustic instruments employs physical modeling techniques, which describe the system's internal mechanisms using mathematical formulations. Such physical approaches are system-specific and make tuning the system's parameters challenging. In addition, acoustic instruments such as the piano contain nonlinear mechanisms, which makes solving the associated partial differential equations computationally demanding. In a nonlinear context, the stability and efficiency of the numerical schemes are not trivial, and models generally adopt simplifying assumptions and linearizations. Artificial neural networks can learn a complex system's behavior from data, so their application can be beneficial for modeling acoustic instruments. However, neural networks typically offer less flexibility in varying internal parameters for interactive applications such as real-time sound synthesis; integrating them with traditional signal processing frameworks can overcome this limitation. This article presents a method for piano sound synthesis informed by the physics of the instrument, combining deep learning with traditional digital signal processing techniques. The proposed model learns to synthesize the quasi-harmonic content of individual piano notes using physics-based formulas whose parameters are automatically estimated from real audio recordings. The model thus emulates the inharmonicity of the piano and the amplitude envelopes of the partials, and it generalizes with good accuracy across different keys and velocities. Challenges persist in the high-frequency part of the spectrum, where the generation of partials is less accurate, especially at high velocities.
The architecture of the proposed model permits low-latency implementation and has low computational complexity, paving the way for a novel approach to sound synthesis in interactive digital pianos that emulates specific acoustic instruments.
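The abstract does not spell out the formulas the model learns, but the piano inharmonicity it refers to is commonly modeled with the standard stiff-string relation f_n = n·f0·√(1 + B·n²), where B is the inharmonicity coefficient. The sketch below illustrates that relation only; the function name and the value of B are illustrative, not taken from the paper.

```python
import math

def partial_frequencies(f0, b, n_partials):
    """Stiff-string inharmonicity model: f_n = n * f0 * sqrt(1 + B * n^2).

    f0:        fundamental frequency in Hz
    b:         inharmonicity coefficient B (dimensionless, typically ~1e-4)
    n_partials: number of partials to compute
    """
    return [n * f0 * math.sqrt(1.0 + b * n * n) for n in range(1, n_partials + 1)]

# Example: partials of A3 with an illustrative coefficient B = 3.5e-4.
# Each partial is progressively sharper than the harmonic n * f0.
freqs = partial_frequencies(220.0, 3.5e-4, 8)
```

Because B grows the deviation quadratically in the partial index n, the upper partials are stretched well above their harmonic positions, which is exactly the quasi-harmonic spectrum the model is trained to reproduce.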