MusCaps: Generating Captions for Music Audio
2021 International Joint Conference on Neural Networks (IJCNN)
https://doi.org/10.1109/IJCNN52387.2021.9533461

Abstract
Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention. Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs through a multimodal encoder, and leverages pre-training on audio data to obtain representations that effectively capture and summarise musical features in the input. Evaluation of the generated captions through automatic metrics shows that our method outperforms a baseline designed for non-music audio captioning. Through an ablation study, we show that this performance boost can be mainly attributed to pre-training of the audio encoder, while other design choices (modality fusion, decoding strategy and the use of attention) contribute only marginally. Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.
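To make the described architecture concrete, the sketch below shows one way a CNN audio encoder, an LSTM decoder and temporal attention over the audio time axis can fit together. It is a minimal illustration in PyTorch under our own assumptions: all module names, dimensions and the additive-attention variant are hypothetical and do not reproduce the authors' MusCaps implementation or its audio pre-training.

```python
# Minimal sketch of a CNN-encoder / LSTM-decoder captioning model with
# temporal attention over audio features. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """CNN over log-mel frames -> sequence of audio feature vectors."""
    def __init__(self, n_mels=128, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, mel):                     # mel: (B, n_mels, T)
        return self.conv(mel).transpose(1, 2)   # (B, T, feat_dim)


class TemporalAttention(nn.Module):
    """Additive attention over the encoder time axis, conditioned on the decoder state."""
    def __init__(self, feat_dim, hid_dim, attn_dim=128):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hid_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):           # feats: (B, T, F), hidden: (B, H)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hid(hidden).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)         # attention weights over time (B, T, 1)
        return (alpha * feats).sum(dim=1)        # context vector (B, F)


class CaptionDecoder(nn.Module):
    """LSTM decoder that attends over the audio features at every step (teacher forcing)."""
    def __init__(self, vocab_size, feat_dim=256, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = TemporalAttention(feat_dim, hid_dim)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, captions):          # captions: (B, L) token ids
        B, L = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(L):
            ctx = self.attn(feats, h)            # re-attend to audio at each word
            x = torch.cat([self.embed(captions[:, t]), ctx], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)        # (B, L, vocab_size)


if __name__ == "__main__":
    enc, dec = AudioEncoder(), CaptionDecoder(vocab_size=1000)
    mel = torch.randn(2, 128, 300)               # 2 spectrograms, 300 frames
    caps = torch.randint(0, 1000, (2, 12))       # 2 captions, 12 tokens
    print(dec(enc(mel), caps).shape)             # torch.Size([2, 12, 1000])
```

In this sketch, swapping the randomly initialised `AudioEncoder` for a tagging model pre-trained on audio data is the analogue of the pre-training step the ablation study identifies as the main source of the performance gain.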