Efficient stream compaction on wide SIMD many-core architectures

Markus Billeter; Ola Olsson; Ulf Assarsson

doi:10.1145/1572769.1572795

2009, Proceedings of the 1st ACM conference on High Performance Graphics - HPG '09

https://doi.org/10.1145/1572769.1572795

8 pages

Abstract

Stream compaction is a common parallel primitive used to remove unwanted elements in sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX280 GPU. On this hardware, our reference implementation provides a 3× speedup over previously published algorithms.

Figures (11)

Figure 1: Main steps of performing compaction with a prefix sum. First, a prefix sum of the valid element flags is computed. Then a gather or scatter step is used to move the valid input elements into the output vector.
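The two steps in the Figure 1 caption can be illustrated with a minimal sequential sketch. This is a host-side C++ emulation, not the paper's CUDA code, and the name `compactWithPrefixSum` and the `int` element type are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: keeps the elements of `input` whose flag is nonzero.
std::vector<int> compactWithPrefixSum(const std::vector<int>& input,
                                      const std::vector<int>& flags) {
    assert(input.size() == flags.size());

    // Step 1: exclusive prefix sum of the valid-element flags.
    // offsets[i] is the output index of element i, if it is valid.
    std::vector<std::size_t> offsets(input.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        offsets[i] = running;
        running += flags[i] ? 1 : 0;
    }

    // Step 2: scatter every valid element to its computed output index.
    std::vector<int> output(running);
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (flags[i]) {
            output[offsets[i]] = input[i];
        }
    }
    return output;
}
```

For example, the input `{7, 0, 3, 0, 0, 9}` with flags `{1, 0, 1, 0, 0, 1}` yields `{7, 3, 9}`. On a parallel machine both loops would run concurrently over the elements; the sequential loops here only show the data flow.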

Listing 2: Basic parallel algorithm. The number of elements processed by a processor is denoted Kp, and inputp is the associated range of input elements. The notation [0..P) is used to describe the range of elements from 0 (inclusive) to P (exclusive).

We therefore develop an algorithm that is better suited to actual GPU architectures. In the next sections, we first describe a model of how modern GPUs operate, which then allows us to design several flavors of efficient compaction algorithms in the sections that follow.
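Since the listing itself is not reproduced here, the following is a rough sequential sketch of a per-processor scheme of the kind the caption describes. It assumes three phases (count per processor, prefix sum over the P counts, compact each range to its offset); the middle phase and the names `compactThreePhase`, `processorCounts` and `processorOffsets` are assumptions for illustration, and the loops over `p` stand in for P processors running concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

template <typename Pred>
std::vector<int> compactThreePhase(const std::vector<int>& input,
                                   std::size_t P, Pred isValid) {
    assert(P > 0);
    const std::size_t N = input.size();
    const std::size_t K = (N + P - 1) / P;  // elements per processor (assumed even split)

    // Phase 1: processor p counts the valid elements in its range input_p.
    std::vector<std::size_t> processorCounts(P, 0);
    for (std::size_t p = 0; p < P; ++p) {
        const std::size_t begin = std::min(p * K, N);
        const std::size_t end = std::min(begin + K, N);
        for (std::size_t i = begin; i < end; ++i) {
            if (isValid(input[i])) ++processorCounts[p];
        }
    }

    // Phase 2: exclusive prefix sum over the counts in [0..P) gives each
    // processor the output offset where its valid elements start.
    std::vector<std::size_t> processorOffsets(P, 0);
    std::size_t total = 0;
    for (std::size_t p = 0; p < P; ++p) {
        processorOffsets[p] = total;
        total += processorCounts[p];
    }

    // Phase 3: each processor compacts its own range into the output.
    std::vector<int> output(total);
    for (std::size_t p = 0; p < P; ++p) {
        const std::size_t begin = std::min(p * K, N);
        const std::size_t end = std::min(begin + K, N);
        std::size_t j = processorOffsets[p];
        for (std::size_t i = begin; i < end; ++i) {
            if (isValid(input[i])) output[j++] = input[i];
        }
    }
    return output;
}
```

The point of the split is that the prefix sum runs over only P values rather than over every input element, which is where the reduced auxiliary storage mentioned in the abstract would come from.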

Listing 3: The first phase, extended to make use of the SIMD capabilities to count the number of valid elements. After reducing the count from S individual SIMD lanes, the resulting total is stored in the vector processorCounts.

We use a parallel reduction to sum the values in counts[0..S), as shown on line 9. However, this reduction is performed only once per processor, and is thus not time critical. In the first phase, each SIMD lane independently counts the valid elements it encounters. After processing the entire range, the processor performs a parallel sum-reduction, which yields the total number of valid elements in the range inputp. Pseudo code describing these modifications is provided in Listing 3.
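The per-lane counting followed by a single reduction can be sketched as below. This is a sequential C++ emulation under stated assumptions: the function name `countValidSIMD` is hypothetical, the inner loop over `s` stands in for S lanes executing in lockstep, and S is assumed to be a power of two so that a simple tree reduction applies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <typename Pred>
std::size_t countValidSIMD(const std::vector<int>& range, std::size_t S,
                           Pred isValid) {
    assert(S > 0 && (S & (S - 1)) == 0);  // assumption: S is a power of two

    // Each lane s keeps its own counter and visits elements s, s+S, s+2S, ...
    std::vector<std::size_t> counts(S, 0);
    for (std::size_t base = 0; base < range.size(); base += S) {
        for (std::size_t s = 0; s < S && base + s < range.size(); ++s) {
            if (isValid(range[base + s])) ++counts[s];
        }
    }

    // Tree-shaped sum-reduction over counts[0..S): log2(S) steps,
    // performed only once after the whole range has been processed.
    for (std::size_t stride = S / 2; stride > 0; stride /= 2) {
        for (std::size_t s = 0; s < stride; ++s) {
            counts[s] += counts[s + stride];
        }
    }
    return counts[0];  // total number of valid elements in the range
}
```

Because the reduction happens once per range rather than once per S-wide step, its cost is amortized over all elements the processor handles.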

Figure 2: Illustration of the compactSIMD procedure. Here, the SIMD width is S = 16, and the number of valid elements is Q = 8. The element A moves from s = 0 to s' = 0, whereas the element B moves from s = 3 to s' = 1. The output is compact in the sense that the Q valid elements occupy lanes 0 through Q - 1.

Listing 6: Implementation of compactSIMD using a population count. The variable m must be large enough to store S bits. Setting m, on line 5, assumes that the architecture allows simultaneous setting of bits. This is not always the case, and a workaround is described in Section 4.2.

Assuming POPC is present in the native SIMD instruction set, the log S steps required by the SIMD prefix sum can be avoided. The architecture must also support a word size of S bits. Listing 6 shows the new compactSIMD procedure.
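One way to read the population-count idea is that each lane finds its destination by counting the valid lanes below it in the bit mask m. The sketch below emulates this sequentially in C++ for S ≤ 32; the function signature, the software `popc` helper (a stand-in for a native POPC instruction) and the `int` payload are assumptions for illustration, not the paper's listing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable population count, standing in for a hardware POPC instruction.
inline unsigned popc(std::uint32_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Compacts one SIMD-wide group of lanes. Returns Q, the number of valid
// elements; afterwards they occupy out[0..Q).
std::size_t compactSIMD(const std::vector<int>& lanes,
                        const std::vector<bool>& valid,
                        std::vector<int>& out) {
    const std::size_t S = lanes.size();
    assert(S <= 32 && valid.size() == S);  // m must be able to hold S bits

    // Every valid lane sets its own bit in the shared mask m.
    std::uint32_t m = 0;
    for (std::size_t s = 0; s < S; ++s) {
        if (valid[s]) m |= (1u << s);
    }

    // Lane s moves to s' = number of valid lanes with index below s.
    out.assign(S, 0);
    for (std::size_t s = 0; s < S; ++s) {
        if (valid[s]) {
            const std::uint32_t below = m & ((1u << s) - 1u);
            out[popc(below)] = lanes[s];
        }
    }
    return popc(m);
}
```

With valid lanes at s = 0 and s = 3 and nothing valid in between, the element in lane 0 stays at s' = 0 and the element in lane 3 lands at s' = 1, matching the movement of A and B described for Figure 2.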

Listing 7: The buffered implementation of the third phase. The buffer is flushed when it is full, and the overflowing elements are then written to the buffer. The alignOutput procedure moves enough elements to align j to the next multiple of S, and initializes the buffer and counters.

There are many short branches in the inner loop; however, they can be compiled into efficient predicated instructions. As the alignment phase will write at most S - 1 elements, these writes will not contribute significantly to the run time of the algorithm and can therefore be unbuffered, as long as the buffer is correctly initialized.
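The flush-when-full behaviour can be sketched roughly as follows. This C++ emulation makes several assumptions: the output position is taken to be already aligned (so the alignOutput step is omitted), the names `writeBuffered` and `fullWrites` are hypothetical, and each inner vector in `chunks` stands for the compacted result of one S-wide step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends compacted chunks (each at most S elements) to the output through
// an S-element buffer, so that every flush is one full S-wide write.
std::vector<int> writeBuffered(const std::vector<std::vector<int>>& chunks,
                               std::size_t S, std::size_t& fullWrites) {
    assert(S > 0);
    std::vector<int> output;
    std::vector<int> buffer(S);
    std::size_t fill = 0;  // number of elements currently in the buffer
    fullWrites = 0;

    for (const std::vector<int>& chunk : chunks) {
        assert(chunk.size() <= S);
        for (int value : chunk) {
            buffer[fill++] = value;
            if (fill == S) {
                // Buffer is full: flush it as one S-wide write. The remaining
                // (overflowing) elements of this chunk restart the buffer.
                output.insert(output.end(), buffer.begin(), buffer.end());
                ++fullWrites;
                fill = 0;
            }
        }
    }

    // Write whatever is left over as a final partial write.
    output.insert(output.end(), buffer.begin(),
                  buffer.begin() + static_cast<std::ptrdiff_t>(fill));
    return output;
}
```

With S = 4 and chunks of 3, 2 and 4 elements, the nine elements leave the buffer as two full writes plus one trailing element, instead of three partial writes of differing sizes.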

Figure 4: Time (in milliseconds) required to compact a varying number of elements. We compare our best implementation with the CUDPP implementation and geometry shaders. The geometry shader plot is cut off to provide a better view of CUDPP and our implementation. The error bars in the left figure display variations in time as the proportion of valid elements is changed. The graphs represent the average time for varying proportions of valid elements. Also shown are curves for compaction of 64 bit and 128 bit elements. The scattered version shows a near-linear performance scaling. Since we cannot use 64 bit writes, but must resort to several scattered 32 bit writes, the number of write transactions quickly increases.

Figure 3: Time (in milliseconds) required to compact 4M (2^22) 32 bit elements with a varying ratio of valid elements. We have compared our implementations, using both the prefix sum and the POPC based compactSIMD implementations, as well as staged, scattered and buffered variations. In addition, our selective implementation, which automatically switches between staged and scattered output on a warp level, is shown.

Results from some earlier works [Horn 2005; Sengupta et al. 2006] are not included in Table 3. This is because the publicly available CUDPP implementation uses the same basic strategy, and offers higher performance.

Table 3: Comparison of compaction performance with competing techniques. If available, we have used reference implementations for measurements on our hardware. We have created our own CUDA implementation of the algorithm presented in Ziegler et al. The reported times are averages over uniform distributions with 0% to 100% valid elements.

Our performance is comparable with that of Satish et al. [2008], and outperforms the CUDPP library [CUDPP 2008]. The implementation is very simple; we did not perform any in-depth analysis or special optimization. We simply invoke the stream split operation once for each bit in the radix sort key. The simplicity makes it very flexible, allowing any data type and number of bits as radix keys.

Table 5: Comparison of our stream split based radix-sort and the currently fastest published implementation. Our implementation shows almost identical performance, but is more flexible. Our implementation operates on interleaved key-value pairs; Satish et al. have separate arrays for keys and values. We can handle separate keys and values by a pre- and postprocessing step that transforms the separate arrays into interleaved data and back.
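To make the "one stream split per key bit" idea concrete, here is a minimal sequential C++ sketch. It assumes that stream split is a stable partition on a single bit and that bits are processed from least to most significant, as in a standard LSD radix sort; the names `streamSplit` and `radixSort` and the plain `uint32_t` keys (rather than interleaved key-value pairs) are simplifications for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable split on one bit: elements whose bit is 0 come first, then those
// whose bit is 1, each group keeping its original relative order.
std::vector<std::uint32_t> streamSplit(const std::vector<std::uint32_t>& in,
                                       unsigned bit) {
    assert(bit < 32);
    std::vector<std::uint32_t> out;
    out.reserve(in.size());
    for (std::uint32_t x : in) {
        if (((x >> bit) & 1u) == 0u) out.push_back(x);
    }
    for (std::uint32_t x : in) {
        if (((x >> bit) & 1u) == 1u) out.push_back(x);
    }
    return out;
}

// Radix sort built from repeated splits, one per key bit, starting at the
// least significant bit. Stability of each split makes the result sorted.
std::vector<std::uint32_t> radixSort(std::vector<std::uint32_t> keys,
                                     unsigned keyBits) {
    assert(keyBits <= 32);
    for (unsigned bit = 0; bit < keyBits; ++bit) {
        keys = streamSplit(keys, bit);
    }
    return keys;
}
```

Because `keyBits` is a parameter, the same routine handles 3-bit or 32-bit keys without modification, which reflects the flexibility claimed in the text; sorting `{5, 1, 7, 0, 3, 2, 6, 4}` with `keyBits = 3` returns the values 0 through 7 in order.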

References (21)

ABRASH, M., A First Look at the Larrabee New Instructions (LRBni), 2009. http://www.ddj.com/hpc-high-performance-computing/216402188 .
BLELLOCH, G. E. 1990. Prefix Sums and Their Applications. Tech. rep., Synthesis of Parallel Algorithms.
CHATTERJEE, S., BLELLOCH, G. E., AND ZAGHA, M. 1990. Scan Primitives for Vector Computers. In Proceedings Supercomputing '90, 666-675.
CUDPP: CUDA data parallel primitives library, 2008. http://www.gpgpu.org/developer/cudpp/ .
DOTSENKO, Y., GOVINDARAJU, N. K., SLOAN, P.-P., BOYD, C., AND MANFERDELLI, J. 2008. Fast scan algorithms on graphics processors. In ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, ACM, New York, NY, USA, 205-213.
FATAHALIAN, K., AND HOUSTON, M. 2008. A closer look at GPUs. Commun. ACM 51, 10, 50-57.
GRESS, A., GUTHE, M., AND KLEIN, R. 2006. GPU-based Collision Detection for Deformable Parameterized Surfaces. Computer Graphics Forum 25, 3 (Sept.), 497-506.
HILLIS, W. D., AND STEELE, JR., G. L. 1986. Data parallel algorithms. Commun. ACM 29, 12, 1170-1183.
HORN, D. 2005. Stream reduction operations for GPGPU applications.
LAUTERBACH, C., GARLAND, M., SENGUPTA, S., LUEBKE, D., AND MANOCHA, D. 2009. Fast BVH Construction on GPUs. In Proceedings of the Eurographics Symposium on Rendering, the Eurographics Association, Eurographics and ACM/SIGGRAPH.
LINDHOLM, E., NICKOLLS, J., OBERMAN, S., AND MONTRYM, J. 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28, 2, 39-55.
NVIDIA, CUDA Zone: Toolkit & SDK, 2008. http://developer.nvidia.com/object/cuda.html .
ROGER, D., ASSARSSON, U., AND HOLZSCHUCH, N. 2007. Efficient Stream Reduction on the GPU. In Workshop on General Purpose Processing on Graphics Processing Units, D. Kaeli and M. Leeser, Eds.
ROGER, D., ASSARSSON, U., AND HOLZSCHUCH, N. 2007. Whitted Ray-Tracing for Dynamic Scenes using a Ray-Space Hierarchy on the GPU. In Rendering Techniques 2007 (Proceedings of the Eurographics Symposium on Rendering), the Eurographics Association, J. Kautz and S. Pattanaik, Eds., Eurographics and ACM/SIGGRAPH, 99-110.
SATISH, N., HARRIS, M., AND GARLAND, M. 2008. Designing Efficient Sorting Algorithms for Manycore GPUs. NVIDIA Technical Report NVR-2008-001, NVIDIA Corporation, Sept.
SEILER, L., CARMEAN, D., SPRANGLE, E., FORSYTH, T., ABRASH, M., DUBEY, P., JUNKINS, S., LAKE, A., SUGERMAN, J., CAVIN, R., ESPASA, R., GROCHOWSKI, E., JUAN, T., AND HANRAHAN, P. 2008. Larrabee: a many-core x86 architecture for visual computing. In SIGGRAPH '08: ACM SIGGRAPH 2008 papers, ACM, New York, NY, USA, 1-15.
SENGUPTA, S., LEFOHN, A. E., AND OWENS, J. D. 2006. A Work-Efficient Step-Efficient Prefix Sum Algorithm. In Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures, D-26-27.
SENGUPTA, S., HARRIS, M., ZHANG, Y., AND OWENS, J. D. 2007. Scan Primitives for GPU Computing. In Graphics Hardware 2007, ACM, 97-106.
WALD, I., GRIBBLE, C. P., BOULOS, S., AND KENSLER, A. 2007. SIMD Ray Stream Tracing - SIMD Ray Traversal with Generalized Ray Packets and On-the-fly Re-Ordering. Tech. Rep. UUSCI-2007-012.
ZHOU, K., HOU, Q., WANG, R., AND GUO, B. 2008. Real-time KD-tree construction on graphics hardware. In SIGGRAPH Asia '08: ACM SIGGRAPH Asia 2008 papers, ACM, New York, NY, USA, 1-11.
ZIEGLER, G., TEVS, A., THEOBALT, C., AND SEIDEL, H.-P. 2006. GPU Point List Generation through Histogram Pyramids. Technical Reports of the MPI for Informatics MPI-I-2006-4-002, June.
