Figure 11 – uploaded by Evelio Mora

Figure 10

B. Latency-optimized Multi-block NTT

The LOM-NTT kernel is designed to handle large input arrays ($N > 2^{11}$). It distributes tasks using a strategy similar to that of the LOS-NTT kernel, except that it spreads them over multiple blocks, which allows multiple SMs to execute the workload in parallel. Because a single $N$-point NTT is split between multiple blocks, this implementation requires kernel-wide barriers for synchronization between stages. We use the LOM-NTT kernel to decompose a single $N$-point NTT into multiple $2^{11}$-point NTTs. We then use our LOS-NTT (single-block) kernel to evaluate all the $2^{11}$-point NTTs, harnessing the optimizations of shared memory and block-level barriers. We show the distribution for our LOM-NTT for $N = 2^{16}$ in Figure 10.
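The decomposition described here can be sketched on the CPU in Python. This is a minimal illustration, not the LOM-NTT kernel itself. It assumes a cyclic NTT with a root `w` satisfying `w^(n/2) = -1 mod q` and uses a decimation-in-frequency split, which may differ from the stage ordering in the paper. The names `split_stages`, `multi_block_ntt`, and `naive_ntt` are hypothetical, and the naive transform merely stands in for a single-block kernel.

```python
def naive_ntt(a, w, q):
    """Reference O(n^2) cyclic NTT; stands in for a single-block sub-NTT."""
    n = len(a)
    return [sum(a[j] * pow(w, i * j, q) for j in range(n)) % q for i in range(n)]


def split_stages(a, w, q, block):
    """Run global butterfly stages until each contiguous chunk has `block` points.

    Returns the staged array and the root of unity for the sub-NTTs.
    """
    a = list(a)
    n = len(a)
    size, root = n, w
    while size > block:
        half = size // 2
        for start in range(0, n, size):
            for j in range(half):
                u, v = a[start + j], a[start + j + half]
                a[start + j] = (u + v) % q
                a[start + j + half] = (u - v) * pow(root, j, q) % q
        # On a GPU, a kernel-wide barrier would separate consecutive stages here.
        size = half
        root = root * root % q
    return a, root


def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def multi_block_ntt(a, w, q, block):
    """Split an n-point NTT into independent `block`-point NTTs and reassemble."""
    n = len(a)
    staged, sub_root = split_stages(a, w, q, block)
    chunks = n // block
    bits = chunks.bit_length() - 1
    out = [0] * n
    for c in range(chunks):
        # Each chunk is independent, so it could be handled by one thread block.
        sub = naive_ntt(staged[c * block:(c + 1) * block], sub_root, q)
        r = bit_reverse(c, bits)
        for k, val in enumerate(sub):
            out[r + chunks * k] = val
    return out
```

After the global stages, every chunk depends only on its own data, which is what would let each sub-NTT run inside one block using shared memory and block-level barriers.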

Related Figures (18)

Fig. 1. HE provides security from eavesdroppers on the web as well as untrusted cloud services, as encrypted data can be computed on directly.

Kaustubh Shivdikar, Gilbert Jonatan, Evelio Mora, Neal Livesay, Rashmi Agrawal, Ajay Joshi, José L. Abellán, John Kim, David Kaeli
Northeastern University, Boston University, KAIST, Universidad Católica de Murcia
{shivdikar.k, n.livesay}@northeastern.edu, {eamora, jlabellan}@ucam.edu, {rashmi23, joshi}@bu.edu, kaeli@ece.neu.edu, gilbertjonatan@kaist.ac.kr, jjk12@kaist.edu

schemes, such as HE for Arithmetic of Approximate Numbers [6] (also known as HEAAN or CKKS) and TFHE [7], a slowdown of 4-6 orders of magnitude is reported, as compared to running the same computation on unencrypted data [8], [9]. We aim to accelerate HE by targeting the main operation in these schemes (and, more generally, in lattice-based cryptography): polynomial multiplication [10], [11], [12]. The Number Theoretic Transform (NTT) and modular reduction are two key bottlenecks in polynomial multiplication (and, by extension, in HE), as evidenced by the performance profiling of several lattice-based cryptographic algorithms by Koteshwara et al. [13]. As lattice-based HE schemes have continued to establish themselves as leading candidates for privacy-preserving computing and other applications, there has been an increased focus on optimization and acceleration of these core operations [14], [15], [16].

Fig. 2. Our contributions: 4 major optimizations incorporated into 3 kernels

Fig. 3. Modular reduction profile comparison of architectural parameters (a,b) and causes of warp stalls (c,d).

Algorithm 3 Dhem–Quisquater's modified Barrett reduction

Algorithm 4 Proposed Barrett reduction optimized for a GPU

Therefore, we propose Algorithm 4 for use in HE implementations on a GPU. Similar to Algorithm 3, Algorithm 4 is an instantiation of Dhem–Quisquater [28] (for $\alpha = N + 1$ and $\beta = -2$) that requires at most one correctional subtraction. However, Algorithm 4 allows for moduli $q$ of length up to $B - 2$, and thus results in no increase in the workload size.
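As a rough illustration of the parameter choice $\alpha = N + 1$, $\beta = -2$, the following Python sketch implements a generic Dhem–Quisquater-style Barrett reduction. It is not the paper's Algorithm 4, which is not reproduced in this excerpt. It assumes that $N$ is the bit-length of the modulus and that the input satisfies $0 \le x < 2^{2N}$. The function names are hypothetical.

```python
def barrett_setup(q):
    """Precompute the constants for a modulus q >= 2."""
    n = q.bit_length()             # N: bit-length of q (assumed meaning of N)
    mu = (1 << (2 * n + 1)) // q   # floor(2^(N + alpha) / q) with alpha = N + 1
    return n, mu


def barrett_reduce(x, q, n, mu):
    """Return x mod q for 0 <= x < 2^(2N) using shifts and multiplications."""
    t = x >> (n - 2)               # floor(x / 2^(N + beta)) with beta = -2
    q_hat = (t * mu) >> (n + 3)    # divide by 2^(alpha - beta) = 2^(N + 3)
    r = x - q_hat * q              # the quotient estimate is low by at most 1
    if r >= q:                     # so one correctional subtraction suffices
        r -= q
    return r
```

For inputs in the stated range the estimated quotient is either exact or one too small, so the remainder before correction lies in $[0, 2q)$. A typical call would be `n, mu = barrett_setup(q)` once per modulus, then `barrett_reduce(a * b, q, n, mu)` for residues `a, b < q`.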

Fig. 4. Execution times of modular reduction implementations for 28-, 29-, and 30-bit prime numbers (on the V100 GPU), averaged over 10,000 iterations. The error bars represent ranges. The "builtin reduction" uses the CUDA % operator for modular reduction.

Fig. 5. (a) Negacyclic convolution block diagram. (b) Hadamard product and its neighboring butterflies. (c) Fusion of butterflies into Hadamard product.

Fig. 6. The Cooley–Tukey (left) and Gentleman–Sande (right) butterflies.
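For readers without access to the figure, the two butterflies can be written out in their standard textbook form. This is a minimal Python sketch in which `w` is a twiddle factor and `q` the modulus. The function names are hypothetical, and the figure itself may use a different layout or notation.

```python
def ct_butterfly(a, b, w, q):
    """Cooley-Tukey: multiply by the twiddle first, then add and subtract."""
    t = (w * b) % q
    return (a + t) % q, (a - t) % q


def gs_butterfly(a, b, w, q):
    """Gentleman-Sande: add and subtract first, then multiply the difference."""
    return (a + b) % q, ((a - b) * w) % q
```

Applying the GS butterfly with the inverse twiddle to the output of a CT butterfly returns the original pair scaled by 2, which is why the GS form is commonly paired with the inverse transform.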

Algorithm 7 Butterflies fused into the Hadamard product

We say that Algorithm 7 fuses the CT and GS butterflies into the Hadamard product. To define the fused polynomial multiplication algorithm, we first define truncated versions of the CT and GS NTTs. Define the truncated CT NTT to be the merged CT NTT with the final stage omitted (i.e., line 3 in Algorithm 5 is replaced with "while m < (N/2) do"). Likewise, define the truncated GS NTT to be the merged GS NTT with the first stage omitted (i.e., line 3 in Algorithm 6 is replaced with "while m > 1 do"). Our proposed fused polynomial multiplication is specified in Algorithm 8.
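The idea of fusing can be illustrated algebraically on a single pair of coefficients. The Python sketch below is not the paper's Algorithm 7, which is not reproduced in this excerpt. It only compares the unfused sequence (last CT stage on each operand, pointwise product, first GS stage) against one combined step that yields the same values. The Karatsuba-style cross term and the function names are assumptions made for the illustration.

```python
def unfused(a0, a1, b0, b1, w, w_inv, q):
    """Last CT stage on each operand, Hadamard product, then first GS stage."""
    ta, tb = (w * a1) % q, (w * b1) % q
    A0, A1 = (a0 + ta) % q, (a0 - ta) % q
    B0, B1 = (b0 + tb) % q, (b0 - tb) % q
    c0, c1 = (A0 * B0) % q, (A1 * B1) % q
    return (c0 + c1) % q, ((c0 - c1) * w_inv) % q


def fused(a0, a1, b0, b1, w_sq, q):
    """Same result in one step: a degree-1 product reduced with w^2, times 2."""
    lo = (a0 * b0) % q
    hi = (a1 * b1) % q
    cross = ((a0 + a1) * (b0 + b1) - lo - hi) % q   # equals a0*b1 + a1*b0
    return (2 * (lo + w_sq * hi)) % q, (2 * cross) % q
```

In this sketch the fused step needs only `w_sq = w * w % q` rather than `w` and its inverse, and the intermediate butterfly outputs are never written out, which is the kind of saving that motivates dropping the final CT stage and the first GS stage from the transforms.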

The benefits of our proposed fused polynomial multiplication algorithm include the following:

Fig. 7. V100 GPU memory hierarchy and latency comparison.

We obtain performance metrics for our kernels using hardware performance counters and binary instrumentation tools. We explore performance bottlenecks using a variety of tools, including the NVIDIA Binary Instrumentation Tool (NVBit) [59] for tracing memory transactions, Nsight Compute for fetching performance counters, and Nsight Systems [60] for obtaining kernel scheduler performance and measuring synchronization overheads. We compare

Fig. 8. (a) Architectural performance profile of shared memory NTT and iNTT workloads compared against respective global memory workloads. (b) Stall profile of NTT workload comparing global and shared memory kernels. (c) Stall profile of inverse-NTT workload comparing global vs. shared memory kernels.
