B. Latency-optimized Multi-block NTT

Figure 10. The LOM-NTT kernel is designed to handle large input arrays (N > 2^11). It distributes tasks using a strategy similar to that of the LOS-NTT kernel, except that it spreads them over multiple blocks, which allows multiple SMs to execute the workload in parallel. The LOM-NTT kernel splits a single N-point NTT between multiple blocks. Because it uses multiple blocks, this implementation requires kernel-wide barriers for synchronization between stages. We use the LOM-NTT kernel to decompose a single N-point NTT into multiple 2^11-point NTTs. We then use our LOS-NTT (single-block) kernel to evaluate all the 2^11-point NTTs, harnessing the optimizations of shared memory and block-level barriers. We show the distribution for our LOM-NTT for N = 2!° in Figure 10.
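The two-phase decomposition described above can be sketched on the CPU. This is a minimal illustration in Python rather than the paper's CUDA kernels, and it rests on several assumptions not taken from the source: a cyclic (not negacyclic) NTT, Gentleman-Sande decimation-in-frequency butterflies, the prime 12289, and toy sizes (N = 64 split into 8-point sub-NTTs instead of 2^11-point ones). All function names are hypothetical.

```python
P = 12289  # assumed NTT-friendly prime: P - 1 = 2^12 * 3

def find_root(n, p=P):
    """Return an element of order exactly n (n a power of two dividing p - 1)."""
    for x in range(2, p):
        w = pow(x, (p - 1) // n, p)
        if pow(w, n // 2, p) == p - 1:  # w^(n/2) == -1 implies order n
            return w
    raise ValueError("no root of the requested order")

def dif_stage(a, start, length, w_len, p=P):
    """One in-place butterfly stage over a[start:start+length]."""
    half = length // 2
    tw = 1
    for j in range(half):
        u, v = a[start + j], a[start + j + half]
        a[start + j] = (u + v) % p
        a[start + j + half] = (u - v) * tw % p
        tw = tw * w_len % p

def single_block_ntt(a, start, m, w_m, p=P):
    """Stand-in for the single-block kernel: every stage of one m-point NTT."""
    length, w = m, w_m
    while length > 1:
        for s in range(start, start + m, length):
            dif_stage(a, s, length, w, p)
        length //= 2
        w = w * w % p

def multi_block_ntt(a, m, p=P):
    """Split an N-point NTT into N/m independent m-point NTTs, then solve each."""
    a = list(a)
    n = len(a)
    w = find_root(n, p)
    length = n
    # Phase 1: stages that span the whole array. On a GPU each of these
    # stages would be followed by a kernel-wide barrier.
    while length > m:
        for s in range(0, n, length):
            dif_stage(a, s, length, w, p)
        length //= 2
        w = w * w % p
    # Phase 2: the sub-NTTs no longer interact, so each one could be handed
    # to its own block and synchronized with block-level barriers only.
    for s in range(0, n, m):
        single_block_ntt(a, s, m, w, p)
    return a  # results are in bit-reversed order

def naive_ntt(a, p=P):
    """O(N^2) reference transform in natural order."""
    n = len(a)
    w = find_root(n, p)
    return [sum(a[i] * pow(w, i * k, p) for i in range(n)) % p for k in range(n)]

def bit_reverse(k, bits):
    return int(format(k, f"0{bits}b")[::-1], 2)
```

Calling `multi_block_ntt(a, 8)` on a 64-element list runs three array-wide stages and then eight independent 8-point transforms. The split point only changes where the barriers fall, not the result, which is why the same output is obtained with `m = 64`.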