A Study on LSTM Networks for Polyphonic Music Sequence Modelling
2017
https://doi.org/10.5281/ZENODO.1415017

Abstract
Neural networks, and especially long short-term memory (LSTM) networks, have become increasingly popular for sequence modelling, be it of text, speech, or music. In this paper, we empirically investigate the predictive power of simple LSTM networks for polyphonic MIDI sequences. Such a system can serve as a music language model which, combined with an acoustic model, can improve automatic music transcription (AMT) performance. As a first step, we experiment with synthetic MIDI data and compare the results obtained in various settings throughout the training process. In particular, we compare the use of a fixed sample rate against a musically-relevant sample rate, testing the system on both synthetic and real MIDI data and comparing results in terms of note prediction accuracy. We show that prediction accuracy increases with the sample rate, because self-transitions (consecutive time steps in which the set of active notes is unchanged) become more frequent. We suggest that for AMT, a musically-relevant sample rate is crucial in order to model note transitions beyond a simple smoothing effect.
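The effect of the sample rate on self-transitions can be illustrated with a small sketch. The code below is a hypothetical example, not the paper's implementation: it converts a toy note list (two quarter-note chords at 120 BPM) into piano-roll frames at a fixed 50 Hz frame rate and at a sixteenth-note grid, then measures the fraction of frame-to-frame transitions in which the chord is unchanged. All function names and data are assumptions made for illustration.

```python
def to_frames(notes, step, end):
    """Sample a note list into piano-roll frames.

    notes: list of (onset_sec, offset_sec, pitch) triples.
    Returns one frozenset of active pitches per time step of size `step`.
    """
    frames = []
    for i in range(round(end / step)):
        t = i * step
        frames.append(frozenset(p for on, off, p in notes if on <= t < off))
    return frames

def self_transition_rate(frames):
    """Fraction of consecutive frame pairs whose active-pitch set is identical."""
    same = sum(a == b for a, b in zip(frames, frames[1:]))
    return same / (len(frames) - 1)

# Two quarter-note chords at 120 BPM (0.5 s per beat).
notes = [(0.0, 0.5, 60), (0.0, 0.5, 64), (0.5, 1.0, 62), (0.5, 1.0, 65)]

fixed = to_frames(notes, step=0.02, end=1.0)     # fixed 50 Hz frame rate
musical = to_frames(notes, step=0.125, end=1.0)  # sixteenth-note grid

# The fixed-rate roll repeats each chord over many frames, so a model can
# score well simply by predicting "same as before"; the musical grid forces
# it to predict actual note changes more often.
print(self_transition_rate(fixed), self_transition_rate(musical))
```

On this toy input the fixed-rate roll has a much higher self-transition rate than the beat-aligned one, which is the smoothing effect the abstract refers to: a high fixed sample rate rewards models that merely copy the previous frame.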