Low-Complexity VLSI Design of Large Integer Multipliers for Fully Homomorphic Encryption
Large integer multiplication has been widely used in fully homomorphic encryption (FHE). Implementing feasible large integer multiplication hardware is thus critical for accelerating the FHE evaluation process. In this paper, a novel and efficient operand reduction scheme is proposed to reduce the area requirement of radix-r butterfly units. We also extend the single port, merged-bank memory structure to the design of number theoretic transform (NTT) and inverse NTT (INTT) for further area minimization. In addition, an efficient memory addressing scheme is developed to support both NTT/INTT and resolving carries computations. Experimental results reveal that significant area reductions can be achieved for the targeted 786 432- and 1 179 648-bit NTT-based multipliers designed using the proposed schemes in comparison with the related works. Moreover, the two multiplications can be accomplished in 0.196 and 2.21 ms, respectively, based on 90-nm CMOS technology. The low-complexity feature of the proposed large integer multiplier designs is thus obtained without sacrificing the time performance.
- Xilinx 14.2
FULLY homomorphic encryption (FHE) allows computations to be carried out directly on cipher texts for ensuring data privacy on un trusted servers, thus attracting much attention for cloud computing applications. Generally, FHE can be classified into three categories: lattice-based, integer based, and (ring) learning with errors.
One of the main challenges in the development of practical FHE applications is to mitigate the extremely high-computational complexity and resource requirements. For example, software implementations of FHE in high-performance computers, still consume significant computation time, particularly for accomplishing large integer multiplication which usually involves more than hundreds of thousands of bits. For lattice-based FHE, 785 006- bit multiplication is required for the small setting with a lattice dimension of 2048. To accelerate FHE operations, various efficient schemes have been presented to tackle large integer multiplication based on the Schönhage–Strassen algorithm (SSA). In the algorithm, the number theoretic transform (NTT) is applied because the asymptotic complexity of NTT-based multiplication for million-bit applications is lower than that of existing approaches such as the Karatsuba algorithm and the Toom–Cook algorithm. Specifically described schemes for speeding up the modular multiplication and addition/subtraction of large numbers together with a parameter pre computation strategy, and then implemented all FHE operations on graphics processor units for parallel computations. Cao et al presented a low latency hardware architecture for NTT-based multiplication and a low Hamming weight architecture for the integer based FHE with unbalanced bit length of input operands, both implemented in field-programmable gate arrays (FPGAs). Considering the implementations in application-specific integrated circuits (ASICs), Doroz et al employed a cache architecture together with customized processing units to carry out 98 304-point NTT for implementing a 1 179 648-bit multiplier. Wang et al employed a radix-16 butterfly unit (BU) with a memory-based in-place fast Fourier transform (FFT) architecture to perform 65 536-point operations over a finite field for fulfilling 786 432-bit multiplication. Recently, Feng and proposed to implement NTT with two moduli along with the Chinese remainder theorem to enlarge the sample size for million-bit multiplier design, thus improving transform efficiency.
Compared to pipelined FFT architectures, memory-based FFT, is recognized as a more feasible solution for low area complexity, especially for large-size FFT. The same conclusion can also be applied to NTT, which possesses the same data flow as the traditional FFT but with a different set of twiddle factors. As a result, memory-based FFT/NTT solutions are well suited to FHE applications that attempt to accelerate large integer multiplication by ASIC/FPGA design with restricted hardware cost. For memory-based FFT/NTT architecture designs, efficient memory management schemes are usually demanded to increase the equivalent memory bandwidth by partitioning the required memory into several banks. High-radix BUs are commonly applied to reduce the number of operation stages, thereby increasing the resulting performance. There are always tradeoffs between hardware cost and time complexity for a given application specification. In this paper, a new operand reduction scheme is proposed to optimize the high-radix BU design for reducing its area requirement. The basic idea is to combine the partitioned input data of the BU to decrease the number of operands required in the summation operation based on the inherent properties of
NTT. In addition, we extend to the design of large integer multiplication using single-port, merged-bank (SPMB) memory for further reducing the memory area of the NTT implementation. A unified address generator unit (AGU) is also presented to support NTT/INTT (inverse NTT) and resolve carries computations. Experimental results show that applying the proposed schemes to the designs of 786 432- and 1 179 648- bit multipliers can achieve up to 52.34% and 18.73% area reductions compared to the related implementations, respectively, and the area saving is obtained without compromising the time performance.
- Memory addressing is not efficient
- Significant area reductions cannot be achieved
- Complexity feature is high
Proposed architecture and hardware design
Fig. 1 depicts the proposed large integer multiplication architecture, which consists of two NTT units, a resolve carries unit, an AGU, a controller unit, and several memory units. An NTT unit comprises one radix-r BU, r 64-bit modular multipliers (MulMod), one radix-r switcher, and one buffer. Each of the two NTT units accesses data from two single-port SRAM banks. The ROMs are used to store the associated twiddle factors, i.e., the powers of the primitive Nth root of unity in Zp, for NTT/INTT computations. The following summarizes the operations of the proposed NTT-based multiplication architecture. The NTT1 and NTT2 units are employed to carry out NTT computations of the two input operands at the same time. In addition, we reuse the NTT1 unit to perform INTT computation of the result of point-wise multiplication. Each NTT input data has 64 bits and the radix-r BU is employed to process r input data. The data are packed using the proposed operand reduction schemes and then processed based on 192-bit operations in the BU. The MulMod unit is basically a 64-bit modular multiplier, in which a few modular additions and subtractions are employed to fulfill the multiplication of the BU output data and a chosen 64-bit value.
The chosen value can be a twiddle factor from ROMs, a constant value of 1, a BU output data of the NTT2 unit for performing point-wise multiplication, or the inverse value N -1 for INTT computation, depending on the current operational status. The radix-r switcher is employed to perform the temporal data relocation for NTT/INTT computations. Moreover, to reduce the critical path delay in NTT units, the BU unit is implemented in a five-stage pipelined structure, and the MulMod unit has four pipelined stages. The buffer is used to store and reschedule NTT/INTT output data for achieving conflict-free memory access. The resolve carries unit, implemented with carry look ahead addition, is adopted to handle the carries with the INTT output data and obtain r digits of the multiplication result at a time. Note that each digit of the multiplication result contains b bits for base B = 2 b. Assuming r = 16, the detailed radix-16 BU design is shown in Fig. 8. As mentioned in Section III-A, the outputs of a BU adopt different merging strategies for operand reduction according to it. Specifically, the output X 0 requires 96-bit Wallace tree addition of 16 operands [see Fig. 2(a)], the eight odd-indexed outputs and four even-indexed outputs, i.e., X2, X6, X10, and X14, take 192-bit Wallace tree addition of six operands [see Fig. 2(b)], and the remaining three even-indexed outputs, i.e., X4, X8, and X12, demand 192-bit Wallace tree addition of eight operands [see Fig. 2(c)]. The CSAs and pipeline registers, i.e., the black rectangles, are used to reduce the critical path delay of carrying out multioperand addition. The carry propagation adder is employed to convert the carry-save form to its corresponding binary representation. Finally, the modular reduction circuit is used to fulfill a modulo-p operation. Note that X0 takes a less number of modular additions in the modulo-p operation than the other NTT outputs
The AGU is modified to support both NTT/INTT and resolving carries computations, as shown in Fig. 3. The controlling signal CtrlAG1 is used to determine the value of the circular right shift of BC; this is done by selecting either the stage counter value or a constant value. Fig. 4 shows Design of BC for N = 65536 and r = 16. The other controlling signal CtrlAG2 is employed to indicate whether or not a bit-reversed operation is needed. There are three operation modes in the unified AGU design, each corresponding to a distinct addressing sequence: 1) CtrlAG1 = 0 and CtrlAG2 = 0 for NTT computation, 2) CtrlAG1 = 0 and CtrlAG2 = 1 for INTT computation, and 3) CtrlAG1 = 1 and CtrlAG2 = 0 for resolving carries computation. Finally, assuming N = 65 536 and r = 16, Fig. 10 depicts the hardware implementation of the 12-bit BC for the AGU, i.e.
for NTT/INTT computation with CtrlAG1 = 0, and or resolving carries computation with CtrlAG1 = 1. The constant value required for resolving carries computation in the AGU is calculated at steps 7 and 8 of the SPMB_RC algorithm
- Efficient memory addressing
- Significant area reductions can be achieved
- Low-complexity feature
Jheng-Hao Ye and Ming-Der Shieh ,Member, IEEE, “Low-Complexity VLSI Design of Large IntegerMultipliers for Fully Homomorphic Encryption”, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2018.