Efficient stream compaction on wide SIMD many-core architectures
2009, Proceedings of the 1st ACM conference on High Performance Graphics - HPG '09
https://doi.org/10.1145/1572769.1572795
Abstract
Stream compaction is a common parallel primitive used to remove unwanted elements from sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX 280 GPU. On this hardware, our reference implementation provides a 3× speedup over previously published algorithms.
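To make the primitive concrete, the following is a minimal sequential sketch of stream compaction in its common scan-and-scatter form. It is not the algorithm presented in the paper, which the abstract does not detail; the function names `exclusive_scan` and `compact` and the `keep` predicate are hypothetical and chosen only for illustration.

```python
def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1].

    Also returns the grand total, which for 0/1 flags is the number
    of elements that survive compaction.
    """
    out = []
    total = 0
    for v in values:
        out.append(total)
        total += v
    return out, total


def compact(items, keep):
    """Remove unwanted elements while preserving order.

    `keep` is a hypothetical predicate (an assumption of this sketch)
    that marks which elements are valid.
    """
    # Step 1: flag each element (1 = valid, 0 = unwanted).
    flags = [1 if keep(x) else 0 for x in items]
    # Step 2: the exclusive scan of the flags gives every valid
    # element its index in the compacted output.
    offsets, count = exclusive_scan(flags)
    # Step 3: scatter the valid elements to those indices. On a
    # parallel machine each iteration is independent.
    out = [None] * count
    for i, x in enumerate(items):
        if flags[i]:
            out[offsets[i]] = x
    return out


sparse = [3, 0, 0, 7, 0, 5]
dense = compact(sparse, lambda x: x != 0)
```

Here `dense` holds only the non-zero elements of `sparse`, in their original order. The auxiliary `flags` and `offsets` arrays are each as long as the input, which illustrates the kind of bandwidth and storage cost that the abstract says the presented algorithm reduces.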