Parallel Stacked Hourglass Network for Music Source Separation
2020, IEEE Access
https://doi.org/10.1109/ACCESS.2020.3037773

Abstract
Music source separation is a long-standing and challenging problem in the music information retrieval community. Advances in deep learning have led to substantial progress in decomposing music of many kinds into its constituent components. This study uses three datasets for source separation: a Korean traditional music Pansori dataset, the MIR-1K dataset, and the DSD100 dataset. The DSD100 dataset contains multiple sound sources, while the other two datasets contain relatively few sound sources. We synthetically constructed a novel dataset for Pansori music and trained a novel parallel stacked hourglass network (PSHN) on multi-band spectrograms. Compared with previous work, the proposed architecture achieves the best results on real-world test samples of Pansori music of any length. The network was also evaluated on the public DSD100 and MIR-1K datasets to assess its strength on multi-source data, yielding comparable quantitative and qualitative results. System performance is evaluated using the median signal-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifacts ratio (SAR), measured in decibels (dB), together with visual comparison of the predictions against the ground truth. We report improved performance on the Pansori and MIR-1K datasets and present detailed ablation studies over architecture variations. The proposed system is best suited to separating music containing voices and a single or a few musical instruments.

INDEX TERMS Music source separation, parallel stacked hourglass network, multiband spectrogram, Pansori.
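As an illustration of the processing and evaluation steps summarized above, the following minimal Python sketch (not the authors' implementation) builds a multi-band magnitude spectrogram with librosa and scores a toy separation with the BSS-Eval SDR/SIR/SAR metrics via mir_eval. The equal-thirds band split, STFT settings, and synthetic signals are assumptions for illustration only.

```python
import numpy as np
import librosa
import mir_eval

sr = 16000
rng = np.random.default_rng(0)

# Toy ground-truth stems and slightly noisy estimates standing in for
# real vocal/accompaniment audio and network outputs.
vocals_true = rng.standard_normal(sr * 2)
accomp_true = rng.standard_normal(sr * 2)
vocals_est = vocals_true + 0.1 * rng.standard_normal(sr * 2)
accomp_est = accomp_true + 0.1 * rng.standard_normal(sr * 2)
mixture = vocals_true + accomp_true

# Magnitude spectrogram of the mixture, then a simple three-way split along
# the frequency axis; the equal thirds are an assumption, not the paper's
# exact band layout.
spec = np.abs(librosa.stft(mixture, n_fft=1024, hop_length=256))  # (513, frames)
n_bins = spec.shape[0]
edges = [0, n_bins // 3, 2 * n_bins // 3, n_bins]
bands = [spec[edges[i]:edges[i + 1], :] for i in range(3)]

# BSS-Eval metrics over the two sources; the abstract reports medians in dB.
reference = np.stack([vocals_true, accomp_true])
estimated = np.stack([vocals_est, accomp_est])
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(reference, estimated)
print("median SDR / SIR / SAR (dB):",
      np.median(sdr), np.median(sir), np.median(sar))
```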
BHUWAN BHATTARAI received the B.S. degree in computer science and information technology from Patan Multiple Campus (an affiliation of Tribhuvan University), Nepal, in 2015, and the M.S. degree in computer science and engineering from Jeonbuk National University, South Korea, in 2019, where he is currently pursuing the Ph.D. degree with the Artificial Intelligence Laboratory. His research interests include music information retrieval (MIR), image processing, object detection in images, and music source separation.

YAGYA RAJ PANDEYA was born in Banlek, Dadeldhura, Nepal, in 1988. He received the B.E. and M.E. degrees in computer engineering from Pokhara University, Nepal, in 2010 and 2013, respectively. He was the Head of the Department of Computer Engineering, Dhangadhi Engineering College (NAST), Dhangadhi, Nepal. He worked at the Ministry of Home Affairs, Nepal, from 2015 to 2017. He is currently a Ph.D. Fellow with the Fuzzy Logic and Artificial Intelligence Laboratory, Jeonbuk National University, South Korea. His research interests include audio-video information retrieval, audio event detection and localization, emotion engineering, and animal sound behavior analysis.

JOONWHOAN LEE received the B.S. degree in electronic engineering from Hanyang University, South Korea, in 1980, the M.S. degree in electrical and electronics engineering from KAIST, South Korea, in 1982, and the Ph.D. degree in electrical and computer engineering from the University of Missouri, USA, in 1990. He is currently a Professor with the Department of Computer Engineering, Jeonbuk National University, South Korea. His research interests include image and audio processing, computer vision, and emotion engineering.