Real-time rendering of decorative sound textures for soundscapes

Jinta Zheng; Shih-Hsuan Hung; Kyle Hiebel; Yue Zhang

doi:10.1145/3414685.3417875

2020, ACM Transactions on Graphics

https://doi.org/10.1145/3414685.3417875

12 pages

Abstract

Audio recordings contain rich information about sound sources and their properties, such as the location, loudness, and frequency of events. One prevalent component in sound recordings is the sound texture, which contains a massive number of events. In such a texture, there can be some distinct and repeated sounds that we term foreground sounds. Birds chirping in the wind is one such decorative sound texture, with the chirping as the foreground sound and the wind as the background texture. To render these decorative sound textures in real time and with high quality, we create two-layer Markov models to enable smooth transitions from sound grain to sound grain and propose a hierarchical scheme to generate Head-Related Transfer Function (HRTF) filters for localization cues of sounds represented as area/volume sources. Moreover, during the synthesis stage, we provide control over the frequency and intensity of sounds for customization. Lastly, foreground sounds are often blended into background...

Figures (15)

Fig. 1. Our method renders a decorative sound texture of a city street during a rainstorm. The images (top row) show the virtual scene from the listener's perspective over an eight-second time period. The plots (bottom row) show the corresponding color-coded waveform of the rendered decorative sound texture in the left and right ears. Raindrops hitting the road (blue) is the background texture, raindrops hitting the umbrella (dark green) is the first foreground sound, and birds chirping (light green) is the second foreground sound. All the foreground sounds and background textures were extracted from recordings from Font et al. [2013]. The intensity of the background texture increases throughout the eight seconds, as intended by the scene designer. Additionally, the event frequency of the foreground sounds increases over time, which is also controlled by our methods. This scene is built in CARLA [Dosovitskiy et al. 2017].

Fig. 3. Background textures have a homogeneous difference in loudness to the (a) left ear and (b) right ear, while the foreground sound events (green boxes) have varying differences. In this example, the background texture is rain drops hitting the ground and the foreground sound is heavy rain drops hitting a metal roof. The recording is from Font et al. [2013].

Fig. 4. Rendering of a decorative sound texture includes sound synthesis and auralization. (a) We provide a foreground sound extraction algorithm to create a decorative sound texture with foreground sounds and a background texture on area/volume sources in virtual scenes. (b) At run-time, for decorative sound texture, we synthesize the foreground sounds (green) and background texture (blue) with different Markov model designs. (c) For decorative sound texture auralization, we compute the convolution of the synthesized sound and HRTF filters constructed with hierarchical grids to efficiently capture the location information for the area/volume sources of foreground sounds and background textures. We mix the auralized foreground sound and background textures to generate the final sound for the listener. Panel titles: (a) Decorative Sound Texture Processing; (b) Decorative Sound Texture Synthesis; (c) Decorative Sound Texture Auralization.
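
The auralization step in Fig. 4(c) comes down to convolving a synthesized mono signal with a left-ear and a right-ear filter, then mixing the auralized layers. The following is a minimal sketch of that idea in plain Python, not the paper's implementation: the function names (`convolve`, `auralize`, `mix`) and the gain parameters are hypothetical, the filters are assumed to be given as time-domain impulse responses, and the hierarchical-grid construction of the filters is not shown.

```python
def convolve(signal, kernel):
    """Direct-form linear convolution.

    Output length is len(signal) + len(kernel) - 1; an empty input
    yields an empty output.
    """
    if not signal or not kernel:
        return []
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def auralize(mono, hrir_left, hrir_right):
    """Filter one mono signal with per-ear impulse responses (assumed inputs)."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


def mix(a, b, gain_a=1.0, gain_b=1.0):
    """Sample-wise weighted sum of two signals; the shorter one is zero-padded."""
    out = [0.0] * max(len(a), len(b))
    for i, x in enumerate(a):
        out[i] += gain_a * x
    for i, x in enumerate(b):
        out[i] += gain_b * x
    return out


# Hypothetical usage: auralize a foreground and a background layer, then mix per ear.
fg_left, fg_right = auralize([1.0, 0.0], [0.5], [0.0, 0.25])
bg_left, bg_right = auralize([0.2, 0.2], [1.0], [1.0])
left_ear = mix(fg_left, bg_left)
right_ear = mix(fg_right, bg_right)
```

A real-time system would typically replace the direct-form loop with FFT-based block convolution; the quadratic loop here is only meant to make the operation explicit.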

Fig. 7. A synthesized ocean wave sound starting from an a priori-class state of low entropy and moving to a high entropy state. The resulting background texture smoothly follows the entropy changes controlled by the a priori-class layer of the two-layer Markov model. With the two-layer Markov models, we generate realistic decorative sound textures with similar features to the original recordings. We present our evaluations of the realism and quality of our decorative sound texture synthesis in Section 5.2.
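
To make the two-layer idea concrete, here is a generic sketch of a layered Markov chain in Python: an upper layer walks between class states (for example, low and high entropy), and a lower layer picks the next sound grain using a transition table that depends on the current class. This is an illustration under stated assumptions, not the model from the paper: the state names, the probabilities, and the helpers `sample` and `synthesize_sequence` are all hypothetical.

```python
import random


def sample(dist, rng):
    """Draw one key from a {state: probability} dict whose values sum to 1."""
    r = rng.random()
    acc = 0.0
    for state, p in dist.items():
        acc += p
        if r < acc:
            return state
    return state  # guard against floating-point rounding in the running sum


def synthesize_sequence(class_trans, grain_trans, start_class, start_grain,
                        steps, seed=0):
    """Return a list of (class, grain) pairs of length steps + 1.

    class_trans[c] is the distribution over the next class (upper layer).
    grain_trans[c][g] is the distribution over the next grain given the
    newly drawn class c and the previous grain g (lower layer).
    """
    rng = random.Random(seed)
    cls, grain = start_class, start_grain
    seq = [(cls, grain)]
    for _ in range(steps):
        cls = sample(class_trans[cls], rng)
        grain = sample(grain_trans[cls][grain], rng)
        seq.append((cls, grain))
    return seq


# Hypothetical tables: two classes, three grains.
CLASS_TRANS = {
    "low": {"low": 0.9, "high": 0.1},
    "high": {"low": 0.1, "high": 0.9},
}
GRAIN_TRANS = {
    "low": {g: {0: 0.8, 1: 0.1, 2: 0.1} for g in range(3)},
    "high": {g: {0: 0.2, 1: 0.4, 2: 0.4} for g in range(3)},
}
sequence = synthesize_sequence(CLASS_TRANS, GRAIN_TRANS, "low", 0, steps=16)
```

Each grain index would map to a short audio segment that is concatenated (usually with a crossfade) to form the output; that part is omitted here.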

Fig. 8. Auralization of area/volume sources of (a) a foreground sound and (b) a background texture. (a) The foreground sound is emitted randomly from a point guided by the perception model that avoids a point (dashed arrow) where the sound is masked by a background texture. The HRTF is interpolated at the point in 2D interaural-polar coordinates using the hierarchical grid. (b) The background texture is heard from the area/volume sources and the HRTF is constructed with the ray-intersection test in the hierarchical grid.

Fig. 9. Comparison of our decorative sound texture extraction to DAP [Tian et al. 2019], NMF [Spiertz and Gnann 2009], and spectral subtraction [Boll 1979] using (a) SDR, (b) SAR, and (c) SIR. We average the values of SDR, SAR, and SIR over the 5 categories (animal, natural, urban, human, and music). Our extraction results have higher average scores compared to the other methods. Our test database and results can be accessed here: https://github.com/hiebelky, JKSound-Benchmark.
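
For readers unfamiliar with the metrics in Fig. 9, the sketch below computes a simplified signal-to-distortion ratio, 10 log10(||s||^2 / ||s - s_hat||^2), between a reference signal and an extracted estimate. This is an assumption-laden illustration rather than the evaluation used in the paper: full SDR, SAR, and SIR scores rely on decomposing the estimate into target, interference, and artifact components, which is not shown, and the function name `sdr_db` is hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in decibels.

    Treats everything in (reference - estimate) as distortion. Higher is
    better; a perfect estimate returns +inf.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf
    if signal_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal_energy / error_energy)


# Hypothetical usage: an estimate that captures half the amplitude.
score = sdr_db([2.0, 0.0, -2.0, 0.0], [1.0, 0.0, -1.0, 0.0])
```

In the usage line the signal energy is 8 and the error energy is 2, so the score is 10 log10(4), roughly 6 dB.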

Fig. 11. Synthesized background textures using McDermott et al. [2011] (second column) and our method (third column) compared to the original sound texture of (a) wind, and (b) a jackhammer. From top to bottom, the figure shows the waveform, Mel spectrogram, and cross-band envelope correlation matrix for each audio clip. Our resulting background textures reproduce closely the features of the input recordings. The sounds are from Font et al. [2013].

Fig. 12. The RMS errors from comparing our background texture synthesis and the method of McDermott et al. [2011] to the input sound texture over the texture statistics. We evaluate the texture statistics with the mean of the cochlear envelopes of each frequency band (Envelopes mean), the cross-band envelope correlation matrix (C), the modulation power (Mod. power), and two types of modulation correlations (C1: the same modulation frequency but different acoustic frequencies; C2: the same acoustic frequency but different modulation frequencies). The bar chart shows that our resulting background textures have a smaller RMS error from the input recordings.

Fig. 13. Comparison of our HRTF construction with and without the hierarchical grid with respect to the size of a sphere or plane. The two experiments indicate our hierarchical grid method improves performance and is slightly dependent on the size of the sound source.

Fig. 14. Comparison of our HRTF construction to the SH method based on Schissler et al. [2016]. (a) For spheres, our method is slightly slower than the SH method, which has only analytical projection. (b) For planes, our method is faster than the SH method using Monte Carlo projection with both 1000 and 10,000 rays.

References (54)

V Ralph Algazi, Richard O Duda, Dennis M Thompson, and Carlos Avendano. 2001. The CIPIC HRTF database. In 2001 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. IEEE, 99-102.
Durand R Begault and Leonard J Trejo. 2000. 3-D sound for virtual reality and multimedia. (2000).
Juan Pablo Bello, Laurent Daudet, Samer Abdallah, Chris Duxbury, Mike Davies, and Mark B Sandler. 2005. A tutorial on onset detection in music signals. IEEE Transactions on speech and audio processing 13, 5 (2005), 1035-1047.
Steven Boll. 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on acoustics, speech, and signal processing 27, 2 (1979), 113-120.
Joan Bruna and Stéphane Mallat. 2013. Audio texture synthesis with scattering moments. arXiv preprint arXiv:1311.0407 (2013).
Nicholas Bryan and Gautham Mysore. 2013. An efficient posterior regularized latent variable model for interactive sound source separation. In International Conference on Machine Learning. 208-216.
Chunxiao Cao, Zhong Ren, Carl Schissler, Dinesh Manocha, and Kun Zhou. 2016. Interactive sound propagation with bidirectional path tracing. ACM Transactions on Graphics (TOG) 35, 6 (2016), 1-11.
Jeffrey N Chadwick and Doug L James. 2011. Animating fire with sound. In ACM Transactions on Graphics (TOG), Vol. 30. ACM, 84.
Abe Davis and Maneesh Agrawala. 2018. Visual Rhythm and Beat. ACM Trans. Graph. 37, 4 (2018), 122-1.
Antonio Di Crescenzo and Maria Longobardi. 2009. On cumulative entropies. Journal of Statistical Planning and Inference 139, 12 (2009), 4072-4087.
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938 (2017).
Michael J Evans, James AS Angus, and Anthony I Tew. 1998. Analyzing head-related transfer function measurements using surface spherical harmonics. The Journal of the Acoustical Society of America 104, 4 (1998), 2400-2411.
Raphael A. Finkel and Jon Louis Bentley. 1974. Quad trees: a data structure for retrieval on composite keys. Acta informatica 4, 1 (1974), 1-9.
Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia. 411-412.
Fabio P Freeland, Luiz WP Biscainho, and Paulo SR Diniz. 2002. Efficient HRTF interpolation in 3D moving sound. In Audio Engineering Society Conference: 22nd International Conference: Virtual, Synthetic, and Entertainment Audio. Audio Engineering Society.
Hannes Gamper. 2013. Head-related transfer function interpolation in azimuth, elevation, and distance. The Journal of the Acoustical Society of America 134, 6 (2013), EL547-EL553.
Aki Härmä, Julia Jakka, Miikka Tikander, Matti Karjalainen, Tapio Lokki, Jarmo Hiipakka, and Gaëtan Lorho. 2004. Augmented reality audio for mobile and wearable appliances. Journal of the Audio Engineering Society 52, 6 (2004), 618-639.
Toni Heittola, Annamaria Mesaros, Dani Korpi, Antti Eronen, and Tuomas Virtanen. 2014. Method for creating location-specific audio textures. EURASIP Journal on Audio, Speech, and Music Processing 2014, 1 (2014), 9.
Alexander JE Kell and Josh H McDermott. 2019. Invariance to background noise as a signature of non-primary auditory cortex. Nature communications 10, 1 (2019), 1-11.
Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. 2005. Texture optimization for example-based synthesis. In ACM SIGGRAPH 2005 Papers. 795-802.
Vivek Kwatra, Arno Schödl, Irfan Essa, Greg Turk, and Aaron Bobick. 2003. Graphcut textures: image and video synthesis using graph cuts. ACM Transactions on Graphics (ToG) 22, 3 (2003), 277-286.
Wei-Hsiang Liao, Axel Roebel, and Alvin Su. 2013. On the modeling of sound textures based on the STFT representation. In Proc. of the 16th Int. Conference on Digital Audio Effects (DAFx-13). 33.
Shiguang Liu, Haonan Cheng, and Yiying Tong. 2019. Physically-based statistical simulation of rain sound. ACM Transactions on Graphics (TOG) 38, 4 (2019), 123.
Josh H McDermott and Eero P Simoncelli. 2011. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron 71, 5 (2011), 926-940.
Brian McFee, Justin Salamon, and Juan Pablo Bello. 2018. Adaptive pooling operators for weakly labeled sound event detection. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 26, 11 (2018), 2180-2193.
Ian McLoughlin, Haomin Zhang, Zhipeng Xie, Yan Song, and Wei Xiao. 2015. Robust sound event classification using deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 3 (2015), 540-552.
Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi. 2010. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083 (2010).
Sean O'Leary and Axel Roebel. 2014. A two level montage approach to sound texture synthesis with treatment of unique events. In DAFx. 1-1.
Seán O'Leary and Axel Röbel. 2016. A montage approach to sound texture synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 6 (2016), 1094-1105.
Ashish Panda and Thambipillai Srikanthan. 2011. Psychoacoustic model compensation for robust speaker verification in environmental noise. IEEE transactions on audio, speech, and language processing 20, 3 (2011), 945-953.
David R Perrott and Kourosh Saberi. 1990. Minimum audible angle thresholds for sources varying in both elevation and azimuth. The Journal of the Acoustical Society of America 87, 4 (1990), 1728-1731.
Emil Praun, Adam Finkelstein, and Hugues Hoppe. 2000. Lapped textures. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. 465-470.
Lawrence R Rabiner and Ronald W Schafer. 2011. Theory and applications of digital speech processing. Vol. 64. Pearson Upper Saddle River, NJ.
Boaz Rafaely and Amir Avni. 2010. Interaural cross correlation in a sound field represented by spherical harmonics. The Journal of the Acoustical Society of America 127, 2 (2010), 823-828.
Nikunj Raghuvanshi, Rahul Narain, and Ming C Lin. 2009. Efficient and accurate sound propagation using adaptive rectangular decomposition. IEEE Transactions on Visualization and Computer Graphics 15, 5 (2009), 789-801.
Nikunj Raghuvanshi and John Snyder. 2018. Parametric directional coding for precomputed sound propagation. ACM Transactions on Graphics (TOG) 37, 4 (2018), 108.
Curtis Roads. 1988. Introduction to granular synthesis. Computer Music Journal 12, 2 (1988), 11-13.
Griffin D Romigh, Douglas S Brungart, Richard M Stern, and Brian D Simpson. 2015. Efficient real spherical harmonic representation of head-related transfer functions. IEEE Journal of Selected Topics in Signal Processing 9, 5 (2015), 921-930.
Nicolas Saint-Arnaud and Kris Popat. 1995. Analysis and synthesis of sound textures. In Readings in Computational Auditory Scene Analysis. Citeseer.
Carl Schissler, Ravish Mehra, and Dinesh Manocha. 2014. High-order diffraction and diffuse reflections for interactive sound propagation in large environments. ACM Transactions on Graphics (TOG) 33, 4 (2014), 1-12.
Carl Schissler, Aaron Nicholls, and Ravish Mehra. 2016. Efficient HRTF-based spatial audio for area and volumetric sources. IEEE transactions on visualization and computer graphics 22, 4 (2016), 1356-1366.
Diemo Schwarz. 2011. State of the art in sound texture synthesis. In Digital Audio Effects (DAFx). 221-232.
Diemo Schwarz and Baptiste Caramiaux. 2013. Interactive sound texture synthesis through semi-automatic user annotations. In International Symposium on Computer Music Multidisciplinary Research. Springer, 372-392.
Mincheol Shin, Stephen W Song, Se Jung Kim, and Frank Biocca. 2019. The effects of 3D sound in a 360-degree live concert video on social presence, parasocial interaction, enjoyment, and intent of financial supportive action. International Journal of Human-Computer Studies 126 (2019), 81-93.
Paris Smaragdis, Bhiksha Raj, and Madhusudana Shashanka. 2006. A probabilistic latent variable model for acoustic modeling. (2006).
Martin Spiertz and Volker Gnann. 2009. Source-filter based clustering for monaural blind source separation. In Proceedings of the 12th International Conference on Digital Audio Effects.
Yapeng Tian, Chenliang Xu, and Dingzeyu Li. 2019. Deep Audio Prior. ArXiv abs/1912.10292 (2019).
Andries Van Der Merwe and Walter Schulze. 2010. Music generation with markov models. IEEE MultiMedia 18, 3 (2010), 78-85.
Charles Verron, Mitsuko Aramaki, Richard Kronland-Martinet, and Grégory Pallone. 2009. Spatialized synthesis of noisy environmental sounds. In Auditory Display. Springer, 392-407.
Jui-Hsien Wang, Ante Qu, Timothy R Langlois, and Doug L James. 2018. Toward wave-based sound synthesis for computer animation. ACM Trans. Graph. 37, 4 (2018), 109-1.
Stephan Wenger and Marcus Magnor. 2011. Constrained example-based audio synthesis. In 2011 IEEE International Conference on Multimedia and Expo. IEEE, 1-6.
Zechen Zhang, Nikunj Raghuvanshi, John Snyder, and Steve Marschner. 2018. Ambient sound propagation. ACM Transactions on Graphics (TOG) 37, 6 (2018), 1-10.
Zechen Zhang, Nikunj Raghuvanshi, John Snyder, and Steve Marschner. 2019. Acoustic texture rendering for extended sources in complex scenes. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1-9.
Changxi Zheng and Doug L James. 2009. Harmonic fluids. In ACM Transactions on Graphics (TOG), Vol. 28. ACM, 37.
Xinglei Zhu and Lonce Wyse. 2004. Sound texture modeling and time-frequency LPC. In Proceedings of the 7th international conference on digital audio effects DAFX, Vol. 4.
