High-Performance RNS Modular Exponentiation by Sum-Residue Reduction

With the rapid development and application of artificial intelligence and blockchain, the requirements for information and data security have also increased, and public-key cryptography, such as Rivest-Shamir-Adleman (RSA) cryptography, plays a significant role in meeting them. Modular exponentiation is fundamental in computer arithmetic and is widely applied in cryptography, such as ElGamal cryptography, the Diffie–Hellman key exchange protocol, and RSA cryptography. Implementing modular exponentiation in a residue number system (RNS) leads to high parallelism in computation and has been applied in many hardware architectures. While most RNS-based architectures utilize the RNS Montgomery algorithm with two residue number systems, the recent modular multiplication algorithm with sum residues performs modular reduction in only one residue number system with about the same parallelism. In this work, it is shown that high-performance modular exponentiation and RSA cryptography can be implemented in RNS. Both the algorithm and the architecture are improved to achieve high performance at the cost of extra area overhead: a 1024-bit modular exponentiation can be completed in 0.567 ms on a Xilinx XC6VLX195t-3 platform, costing 26489 slices, 87357 LUTs, 363 dedicated multipliers of $18 \times 18$ bits, and 65 block RAMs.


I. INTRODUCTION
MODULAR exponentiation is fundamental in Rivest-Shamir-Adleman (RSA) cryptography [1], and it is so complex that it usually needs hardware acceleration when used for public-key cryptography. In fact, cryptography is often implemented as special VLSI architectures with computer arithmetic [2], [3], [4], [5], [6]. Some schemes have also implemented cryptographic protocols on graphics processing units (GPUs) and central processing units (CPUs) [7], while other schemes even outsource the modular exponentiation to a cloud server [8]. In [6], extensive precomputation is used to accelerate the modular reductions. Since a residue number system (RNS) performs additions and multiplications in parallel on a tuple of residues [9], [10], it has become an important approach to long-precision modular multiplication. RNS modular multiplications can be carried out by the Montgomery algorithm [3], [11], [12], [13], [14], [15], [16], which uses an extra RNS to extend the range for multiplications before modular reductions. A few improvements in precomputation and architecture are also developed in [17], [18], [19], and [20].
Algorithm 1 Modular Reduction in RNS by Sum-Residue Reduction [26]
Besides, there are also other Montgomery modular multipliers with parallelism. A carry-save-addition (CSA)-based hardware architecture is used in [4], [21], and [5] to carry out continuous modular exponentiation, while quotient-pipelined high-radix scalable Montgomery modular multipliers can also be used for it [22], [23]. The radix-4 scalable Montgomery modular multipliers can also be utilized for modular exponentiation with small area overheads [24], [25]. Also, as described in [26], modular multiplications can be performed in RNS without the Montgomery algorithm, and this work applies that algorithm to implement modular exponentiation and RSA cryptography. The contributions of this article are as follows.
1) Improve the RNS modular multiplication by sum residues for modular exponentiation.
2) Develop computer arithmetic by special moduli for RNS.
3) Fast implementation of modular exponentiation and RSA cryptography. In particular, the RNS keys for RSA cryptography can be computed in the hardware unit.
The remainder of this article is organized as follows. Section II introduces the modular multiplication algorithm on RNS and our improvement for modular exponentiation. Section III shows the modular multiplier architecture for RNS modular multiplications. The RNS modular exponentiation and CRT-RSA are discussed in Section IV. Section V shows the hardware implementation results and the comparison with other results in the literature. Finally, Section VI concludes this article.

II. SUM-RESIDUE REDUCTION FOR MODULAR MULTIPLICATION
A. Related Work
While modular reduction can be performed with the Montgomery algorithm, it can also be implemented with classic modular multiplication [11]. A direct method for modular reduction in RNS is shown in Algorithm 1 [26].
In the above algorithm, the improved Chinese remainder theorem (CRT) [14] and Kawamura et al.'s approximation method [13], [15] are used. The key point is to represent the coefficients of the CRT by residues, which simplifies the computation.
By Algorithm 1, the modular exponentiation can be implemented in RNS with generalized Mersenne numbers $m = 2^n - 2^k \pm 1$ as moduli. Besides, it is necessary to compute $C = A \cdot B \bmod N$, which is done by applying the improved CRT to the product and then representing the CRT by a sum of residues [26]. In order to carry out modular exponentiation, the original algorithm needs to be adjusted so that $C < (d + 1)2^n \cdot N$. For the sake of modular exponentiation, the dynamic range of the RNS must satisfy $M > (d + 1)^2 \cdot 2^{2n} N^2$.
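The idea of representing the CRT by a sum of residues can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 1: the toy moduli (eight 8-bit primes instead of generalized Mersenne numbers), the modulus `N`, all function names, and the exact big-integer computation of `alpha` (used here in place of Kawamura et al.'s approximation) are assumptions made for illustration only.

```python
from math import prod

# Toy parameters (assumptions): eight pairwise-coprime 8-bit primes, so n = 8, d = 8.
MODULI = [251, 241, 239, 233, 229, 227, 223, 211]
MODULUS_BITS = 8
N = 1_000_003                                        # toy modulus for the reduction
M = prod(MODULI)                                     # dynamic range of the RNS
M_I = [M // m for m in MODULI]                       # M_i = M / m_i
M_I_INV = [pow(Mi, -1, m) for Mi, m in zip(M_I, MODULI)]
MI_MOD_N = [Mi % N for Mi in M_I]                    # precomputed M_i mod N
NEG_M_MOD_N = (-M) % N                               # precomputed (-M) mod N


def to_rns(x):
    """Residues of x for every modulus."""
    return [x % m for m in MODULI]


def crt_coefficients(residues):
    """sigma_i = x_i * M_i^{-1} mod m_i."""
    return [(r * inv) % m for r, inv, m in zip(residues, M_I_INV, MODULI)]


def from_rns(residues):
    """Plain CRT reconstruction, used only to check the result."""
    sigma = crt_coefficients(residues)
    return sum(s * Mi for s, Mi in zip(sigma, M_I)) % M


def sum_residue_reduce(residues):
    """Return the residues of some C with C = X (mod N) and C < (d+1) * 2^n * N."""
    sigma = crt_coefficients(residues)
    # Exact alpha via big integers; a hardware design would approximate it.
    alpha = sum(s * Mi for s, Mi in zip(sigma, M_I)) // M
    out = []
    for mj in MODULI:
        acc = sum(s * (c % mj) for s, c in zip(sigma, MI_MOD_N))
        acc += alpha * (NEG_M_MOD_N % mj)
        out.append(acc % mj)
    return out


if __name__ == "__main__":
    a, b = 123456, 654321
    product = [(x * y) % m for x, y, m in zip(to_rns(a), to_rns(b), MODULI)]
    c = from_rns(sum_residue_reduce(product))
    print(c % N == (a * b) % N, c < (len(MODULI) + 1) * 2**MODULUS_BITS * N)
```

The output is congruent to the product modulo `N` but is only partially reduced, which is why the dynamic range has to leave room for the product of two such values.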

B. Improvement
During the modular exponentiation $A^k$, there are many modular multiplications with the integer $A$ in RNS. By precomputing $Z_i = A_i \cdot M_i^{-1} \bmod m_i$ and $A_i = A \bmod m_i$ for $i = 1, 2, \ldots, n$, the computation of $A^k$ can be accelerated in RNS. For modular exponentiation by a fixed-window method, precomputation of $A_i^{\gamma} \cdot M_i^{-1} \bmod m_i$ can be used instead. As far as hardware implementation is concerned, the required reductions can be calculated by Algorithm 1, with the modulus $N$ replaced by $(P - 1)$ and $(Q - 1)$.
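One way to see why the precomputed $Z_i$ helps is the following short derivation. It is a sketch under an assumption: $c_i$ denotes the $i$th residue of the current intermediate result (a symbol introduced here for illustration), and $\sigma_i$ is the CRT coefficient of the product with $A$.

```latex
\sigma_i = (c_i \cdot A_i) \cdot M_i^{-1} \bmod m_i
         = c_i \cdot \bigl(A_i \cdot M_i^{-1} \bmod m_i\bigr) \bmod m_i
         = c_i \cdot Z_i \bmod m_i .
```

Under this reading, each multiplication by $A$ needs one channel multiplication per modulus to obtain $\sigma_i$ instead of two.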

III. MODULAR MULTIPLIER ARCHITECTURE
The Karatsuba-Ofman method can be used to build efficient multipliers [27], [28], [29]. It decreases the number of $O(n)$-bit multiplications, or multipliers, from 4 to 3. Figs. 1 and 2 show the construction of a 64-bit multiplier from nine 18-bit embedded multipliers.
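The reduction from four to three half-size multiplications, and the count of nine base multiplications for a 64-bit product, can be checked with a small recursive Python sketch. The function name, the 16-bit base width, and the counter argument are assumptions for illustration; the additive variant used here may differ from the exact arrangement in Figs. 1 and 2.

```python
def karatsuba(a, b, bits, base_bits=16, counter=None):
    """Multiply a and b with Karatsuba-Ofman splitting.

    bits      -- nominal operand width at this recursion level
    base_bits -- width at which a direct multiplication is used
    counter   -- optional one-element list that counts base multiplications
    """
    if bits <= base_bits:
        if counter is not None:
            counter[0] += 1
        return a * b                      # stands in for one embedded multiplier

    half = bits // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask

    hh = karatsuba(a_hi, b_hi, half, base_bits, counter)
    ll = karatsuba(a_lo, b_lo, half, base_bits, counter)
    # One extra product replaces the two cross products a_hi*b_lo and a_lo*b_hi.
    mid = karatsuba(a_hi + a_lo, b_hi + b_lo, half, base_bits, counter) - hh - ll

    return (hh << (2 * half)) + (mid << half) + ll


if __name__ == "__main__":
    count = [0]
    x, y = 0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0
    print(karatsuba(x, y, 64, counter=count) == x * y, count[0])
```

Two levels of splitting (64 to 32 to 16 bits) give $3 \times 3 = 9$ base multiplications, compared with 16 for the schoolbook method. The operands of the middle products are one or two bits wider than the nominal base width because of the additions.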

A. Moduli Selection for Modular Exponentiation
Mersenne numbers can be used as RNS moduli in modular multiplication [30]. For 1024-bit modular exponentiation, an RNS moduli set of generalized Mersenne numbers can be chosen by testing with "CoprimeQ" in Wolfram Mathematica, as shown in Table I.
The modular multipliers in this work are built to perform modular multiplications over special moduli. Let $P = 2^n - \delta$ and $T = A \cdot B = 2^n \cdot T_H + T_L$; then, $C = T \bmod P = (\delta T_H + T_L) \bmod P$. The problem can be divided into two cases. Note that the symbol $k$ has a different meaning in this section.

B. Modular Multiplications Over $2^n - 2^k - 1$
In this case, $P = 2^n - \delta = 2^n - 2^k - 1$ and $T = 2^n \cdot T_H + T_L$, with $T_H = 2^{n-k} \cdot T_{h1} + T_{h2}$, $T_{h1} < 2^k$, and $T_{h2} < 2^{n-k}$. Obviously, $C' = (2^k + 2^{n-k} + 1)T_{h1} + (2^k + 1)T_{h2} + T_L$ includes six parts, each of which is less than $2^n$. For example, $2^{n-k} T_{h1} < 2^{n-k} \cdot 2^k = 2^n$. Also, $2^k < 2^{n/2} < 2^{n-k}$ and $T_L < 2^n$. Notice that $k < n/2$ implies $2k \leqslant n - 1$, so that $P > 2^{2k} + 2^{k+1}$. In general, $n \gg 4$, so that $C' < 3P + P = 4P$, and clearly $C' \geqslant 0$. Finally, $C$ is obtained from $C'$ by subtracting the multiple of $P$ that brings it into $[0, P)$. In these operations, all the numbers are $(n + 2)$-bit unsigned integers.
1) Modular Reduction Over $2^n - 2^k - 1$: It can be found that the six parts of $C'$ in (6) can be combined into four parts and two additions, as shown in Fig. 3. The sum $C'$ can then be reduced modulo $P$ by testing the signs of $C' - P$, $C' - 2P$, and $C' - 3P$. In general, the modular reduction can be completed in three pipeline stages.
C. Modular Multiplications Over $2^n - 2^k + 1$
In this case, $P = 2^n - \delta = 2^n - 2^k + 1$ and $T = 2^n \cdot T_H + T_L$. Similarly, let $T_{h1} = (t_{n-1}, \ldots, t_{n-k+1} t_{n-k})_2 < 2^k$ and $T_{h2} = (t_{n-k-1}, \ldots, t_1 t_0)_2 < 2^{n-k}$; then, $T_H = 2^{n-k} \cdot T_{h1} + T_{h2}$. Thus, $T \equiv (2^k - 1)T_H + T_L = (2^n - 2^{n-k})T_{h1} + (2^k - 1)T_{h2} + T_L \pmod P$. Substituting $2^n \equiv 2^k - 1 \pmod P$ into the above equation yields $T \equiv (2^k - 2^{n-k} - 1)T_{h1} + (2^k - 1)T_{h2} + T_L \pmod P$. It can be found that $C'' = (2^k - 2^{n-k} - 1)T_{h1} + (2^k - 1)T_{h2} + T_L$ also includes six parts. On the one hand, $C''$ is bounded above because $T_{h2} < 2^{n-k}$ and $T_L < 2^n$. On the other hand, since $0 \leqslant T_{h1} \leqslant 2^k - 1$, it is bounded below. Thus, $-P < C'' < 2P$. Let the intermediate numbers be $(n + 2)$-bit signed numbers, where the $(n + 2)$th bit is the sign bit. In the modular reductions, the CSAs are expanded to $(n + 2)$ bits. As in the case of $P = 2^n - 2^k - 1$, the operations $C'' + P$ and $C'' - P$ can also be performed in parallel. By judging the sign bits of $C''$ and $C'' - P$, $C = C'' \bmod P$ is obtained.
1) Modular Reduction Over $2^n - 2^k + 1$: As shown in Fig. 4, the six parts of $C''$ in (11) can be combined into two full additions.
The second addition can be processed as a carry-select addition, while the more significant half can be written as a full adder, such as the following: $w = u + v + (2^{n-k} - 1) + c_{-1}$, with $u = T_{h1}$, $v = \lfloor T_L/2^k \rfloor$, and $c_{-1}$ being the carry out of $(T_L \bmod 2^k) + 1$. By CSAs, there are $s_i = 1 \oplus u_i \oplus v_i = u_i \oplus \lnot v_i$, $c_i = u_i \mid v_i$, and $w = s + 2c + c_{-1}$. The sum $C''$ can then be reduced modulo $P$ by testing the signs of $C''$ and $C'' - P$. In total, the modular reduction can be completed in three clock cycles, with only four parts for accumulation.
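The two reduction cases and the carry-save identity can be checked numerically with the following Python sketch. It models only the arithmetic, not the pipelined adders of Figs. 3 and 4; the function names are hypothetical, and the inputs are assumed to satisfy $T < 2^{2n}$ with $2 \leqslant k < n/2$.

```python
def split(t, n, k):
    """Split T = 2^n * T_H + T_L and T_H = 2^(n-k) * T_h1 + T_h2."""
    t_l = t & ((1 << n) - 1)
    t_h = t >> n
    t_h1 = t_h >> (n - k)
    t_h2 = t_h & ((1 << (n - k)) - 1)
    return t_h1, t_h2, t_l


def reduce_minus(t, n, k):
    """T mod P for P = 2^n - 2^k - 1, using the unsigned sum C' with 0 <= C' < 4P."""
    p = (1 << n) - (1 << k) - 1
    t_h1, t_h2, t_l = split(t, n, k)
    c = ((1 << k) + (1 << (n - k)) + 1) * t_h1 + ((1 << k) + 1) * t_h2 + t_l
    # Hardware would test the signs of C'-P, C'-2P, C'-3P in parallel.
    for j in (3, 2, 1, 0):
        if c - j * p >= 0:
            return c - j * p


def reduce_plus(t, n, k):
    """T mod P for P = 2^n - 2^k + 1, using the signed sum C'' with -P < C'' < 2P."""
    p = (1 << n) - (1 << k) + 1
    t_h1, t_h2, t_l = split(t, n, k)
    c = ((1 << k) - (1 << (n - k)) - 1) * t_h1 + ((1 << k) - 1) * t_h2 + t_l
    if c < 0:
        return c + p
    if c >= p:
        return c - p
    return c


def csa_with_ones(u, v, width):
    """Carry-save form of u + v + (2^width - 1) for u, v < 2^width."""
    ones = (1 << width) - 1
    s = u ^ v ^ ones        # s_i = 1 xor u_i xor v_i
    c = u | v               # c_i = u_i or v_i (majority with a constant 1)
    return s + 2 * c


if __name__ == "__main__":
    n, k = 64, 16
    p_minus = (1 << n) - (1 << k) - 1
    p_plus = (1 << n) - (1 << k) + 1
    a, b = p_minus - 1, p_minus - 2
    print(reduce_minus(a * b, n, k) == (a * b) % p_minus)
    print(reduce_plus(a * b, n, k) == (a * b) % p_plus)
```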
Modular accumulation consists of a series of modular additions and subtractions. Due to the buffer stage, there are two accumulated results, each covering every other input $x_i$, and they are finally added together.

IV. MODULAR EXPONENTIATION AND CRT-RSA
For modular exponentiation, the exponent is first stored in a shift register and a counter is set to $N/2 - 1$. The exponent is then shifted left 2 bits at a time, and the 2 bits are tested for being both zero. If they are both zeros, the shift continues and the counter is decreased by 1. Otherwise, the exponent and the counter keep their current values until the following modular multiplications are completed.
The test-and-shift phase occurs first, and the compute-and-shift phase follows it. The compute-and-shift phase is finished after $N/2 - 1$ times in the modular exponentiation, which consists of four modular squares and up to one modular multiplication.
The modular exponentiation algorithm can be performed by a fixed-window method [23], [31]. To perform the RNS modular multiplication by Algorithm 1, a word frequently needs to be selected from multiple residues. In this situation, selection logic is used to reduce the driving load.
The state transfer in modular exponentiation includes seven states.
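As a software illustration of the fixed-window method, the following Python sketch performs left-to-right exponentiation with a window of `w` bits on ordinary integers. It is not the seven-state hardware controller: the function name and the window parameter are assumptions, and the skipping of leading zero windows done by the shift register is not modeled.

```python
def fixed_window_pow(a, e, n, w=2):
    """Compute a^e mod n with a left-to-right fixed window of w bits."""
    # Precompute a^0 .. a^(2^w - 1) mod n, the table that hardware would keep in memory.
    table = [1 % n] * (1 << w)
    for g in range(1, 1 << w):
        table[g] = (table[g - 1] * a) % n

    num_windows = max(1, -(-e.bit_length() // w))   # ceil(bit_length / w)
    result = 1 % n
    for idx in reversed(range(num_windows)):
        # w modular squarings per window ...
        for _ in range(w):
            result = (result * result) % n
        # ... followed by at most one modular multiplication.
        digit = (e >> (idx * w)) & ((1 << w) - 1)
        if digit:
            result = (result * table[digit]) % n
    return result


if __name__ == "__main__":
    print(fixed_window_pow(7, 65537, 1_000_003) == pow(7, 65537, 1_000_003))
```

A wider window trades a larger precomputed table for fewer multiplications, while the number of squarings stays close to the bit length of the exponent.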

A. Conversion From RNS to Binary System
The RNS-to-binary conversion is carried out according to the CRT. The binary form of $R$ is calculated digit by digit modulo $2^n$. Suppose the $j$th digit is obtained as follows: $r_i \leftarrow (r_i - B_{j-1}) \cdot 2^{-n} \bmod m_i$, then $\sigma_i = r_i \cdot M_i^{-1} \bmod m_i$ and $z_i = \sigma_i \cdot (M_i \bmod 2^n) \bmod 2^n$, for $i = 1$ to $d$. Also, $\alpha$ can be obtained by an approximation method from [13]. The modular exponentiation result is initially obtained within the domain $[0, (d + 1) \cdot 2^n \cdot 2^{1024})$, with $d = 36$ and $n = 64$. It can then be transformed into the binary system for modular reduction. First, as shown in Section III, with $X \approx 2^{K+t}$, the quotient $q = \lfloor X/N \rfloor$ is approximated by $q_1 = \lfloor X_1/(N_1 + 1) \rfloor$, where $K = 1024$, $t = 73$, $X_1 = \lfloor X/2^{K-t} \rfloor$, and $N = 2^{K-t} N_1 + N_0$. Second, the quotient $q_1$ can be transformed into RNS, so that the multiplication $q_1 \cdot N$ can be performed in RNS directly. Finally, when the remainder is transformed back to the binary system, $N$ can be subtracted from it word by word to test whether the result lies in $[N, 2N)$. The whole process is: $X_{\mathrm{rns}} \rightarrow X \rightarrow q_1 = \lfloor X_1/(N_1 + 1) \rfloor \rightarrow q_{1,\mathrm{rns}} \rightarrow R_{\mathrm{rns}} = X_{\mathrm{rns}} - q_{1,\mathrm{rns}} N_{\mathrm{rns}} \rightarrow R$.
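The digit-by-digit conversion can be sketched in Python as follows. The toy moduli (four odd 8-bit primes), the digit width, and the function name are assumptions for illustration, and $\alpha$ is computed exactly with big integers rather than by the approximation method of [13].

```python
from math import prod

MODULI = [251, 241, 239, 233]        # odd and pairwise coprime, so 2^-n exists mod m_i
DIGIT_BITS = 8                       # n: every output digit is taken modulo 2^n
BASE = 1 << DIGIT_BITS
M = prod(MODULI)
M_I = [M // m for m in MODULI]
M_I_INV = [pow(Mi, -1, m) for Mi, m in zip(M_I, MODULI)]
INV_BASE = [pow(BASE, -1, m) for m in MODULI]        # 2^-n mod m_i


def rns_to_binary(residues):
    """Recover R in [0, M) from its residues, one n-bit digit at a time."""
    r = list(residues)
    digits = []
    num_digits = -(-M.bit_length() // DIGIT_BITS)    # ceil(bits of M / n)
    for _ in range(num_digits):
        sigma = [(ri * inv) % m for ri, inv, m in zip(r, M_I_INV, MODULI)]
        # Exact alpha (stand-in for the approximation used in hardware).
        alpha = sum(s * Mi for s, Mi in zip(sigma, M_I)) // M
        z = [(s * (Mi % BASE)) % BASE for s, Mi in zip(sigma, M_I)]
        digit = (sum(z) - alpha * (M % BASE)) % BASE  # current value mod 2^n
        digits.append(digit)
        # Subtract the digit and divide by 2^n inside every RNS channel.
        r = [((ri - digit) * ib) % m for ri, ib, m in zip(r, INV_BASE, MODULI)]
    return sum(d << (DIGIT_BITS * j) for j, d in enumerate(digits))


if __name__ == "__main__":
    value = 123456789
    print(rns_to_binary([value % m for m in MODULI]) == value)
```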
The RNS processing units for modular exponentiation have most multiplicands in RNS stored in a large read-only memory (ROM) table, so the control lines can be used as the ROM address. Notice that the other operand of the RNS multiplication can be chosen from another RAM and the previous product. The ROM contents for Algorithm 1 have a size of 2304 × 41 bits. Since the result falls within $[0, (d + 1)2^{1024+n})$, it can be represented by $(n + 2)$ residues.

B. Reduction in Binary System and RNS
At the end of modular exponentiation, the $2K$-bit number $A$ can be reduced to the $K$-bit value $A \bmod N$ by transfers between RNS and the binary system. First, convert the $2K$-bit number $A$ from the binary system to RNS, and perform modular reduction over $N$ in RNS by Algorithm 1, with output $A'$ as large as $(d + 1)2^n \cdot N$. Second, convert $A'$ from RNS to binary form, calculate the quotient $q = \lfloor A'/N \rfloor$ in binary form, and then compute the product $q \cdot N$ and the subtraction $A'' = A' - q \cdot N$. Third, convert the binary number $A'' = A \bmod N$ to RNS for further computation.
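The quotient step can be illustrated with a Python sketch of the truncated-operand approximation $q_1 = \lfloor X_1/(N_1 + 1) \rfloor$ with $K = 1024$ and $t = 73$. It works on plain integers only; the function name is hypothetical, the RNS conversions are omitted, and the modulus is assumed to be a full $K$-bit number with the input below $(d + 1) \cdot 2^n \cdot N$ for $d = 36$ and $n = 64$.

```python
K = 1024            # bit length of the modulus N
T_EXTRA = 73        # t: number of leading bits of N kept for the estimate
SHIFT = K - T_EXTRA


def reduce_with_approx_quotient(x, n):
    """Return (x mod n, q1) using q1 = floor(X1 / (N1 + 1))."""
    x1 = x >> SHIFT                 # X1 = floor(X / 2^(K-t))
    n1 = n >> SHIFT                 # N1 from N = 2^(K-t) * N1 + N0
    q1 = x1 // (n1 + 1)             # never larger than the true quotient
    r = x - q1 * n                  # remainder estimate, assumed to lie in [0, 2N)
    if r >= n:                      # single conditional subtraction
        r -= n
    return r, q1


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    n = rng.getrandbits(K) | (1 << (K - 1)) | 1      # full-length odd modulus
    x = rng.randrange(37 * 2**64 * n)
    r, q1 = reduce_with_approx_quotient(x, n)
    print(r == x % n, x // n - q1)
```

Dividing by $N_1 + 1$ instead of $N_1$ makes the estimate err only on the low side, so the remainder never goes negative and one conditional subtraction is enough for inputs in the assumed range.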
As is shown in Fig. 5, the reduction of the long-precision numbers in both RNS and binary system is critical for obtaining accurate intermediate results and reducing the dynamic range of the RNS. For CRT-RSA in two-moduli RNS, the modular exponentiation can be divided into two parts and then combined together [34].
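The split-and-combine structure of CRT-RSA can be sketched in Python as below. The recombination shown is Garner's formula, which is an assumption here since the method of [34] is not reproduced in the text; the primes, the function name, and the message are toy values for illustration.

```python
def crt_rsa_decrypt(c, p, q, d):
    """Decrypt c with two half-size exponentiations and Garner recombination."""
    dp = d % (p - 1)                 # exponent reduced modulo (P - 1)
    dq = d % (q - 1)                 # exponent reduced modulo (Q - 1)
    q_inv = pow(q, -1, p)            # q^-1 mod p, precomputed once per key
    m1 = pow(c, dp, p)               # first half-size exponentiation
    m2 = pow(c, dq, q)               # second half-size exponentiation
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q


if __name__ == "__main__":
    p, q, e = 1009, 1013, 65537      # toy primes, far too small for real use
    d = pow(e, -1, (p - 1) * (q - 1))
    message = 123456
    cipher = pow(message, e, p * q)
    print(crt_rsa_decrypt(cipher, p, q, d) == message)
```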

V. HARDWARE IMPLEMENTATION
The modular exponentiation, RSA encryption, and RSA decryption can be implemented in RNS, and their performance is measured on FPGA platforms, as shown in Tables II and III. The designs are described in Verilog HDL, simulated with ModelSim 6.2, and synthesized with Xilinx Synthesis Technology (XST). They are then placed and routed in the Xilinx 14.7 platform.
In [25], a fast, compact, and symmetric common-multiplicand architecture is proposed for full modular exponentiation. In comparison, the result in this work is about three times as fast as that in [25]. The design in [23] is based on quotient-pipelined high-radix Montgomery modular multipliers and reaches a minimum time of 1.19 ms for a 10 240-bit modular exponentiation. By contrast, this work still achieves higher speed and far fewer clock cycles than [23].
The modular exponentiation architecture in [31] is based on a Montgomery modular multiplier with fast carry between words. It utilizes the DSP architecture within Xilinx FPGAs to reach high speed for long-precision additions after word-based multiplications. However, it needs many more clock cycles than this work and is, therefore, slower. The work in [32] utilizes CSAs and bypass logic to implement energy-efficient Montgomery modular multiplications. It computes one modular exponentiation in about 2.31 ms with TSMC 0.13-µm CMOS technology, which is about four times that of this work on the XC6VLX-3 FPGA. However, the power efficiency of this work on FPGA is much lower than that in [32] with the CMOS process.
In addition, the RNS implementation of modular exponentiation is expected to be resilient to side-channel attacks (SCAs) due to parallel computation. As far as RSA decryption is concerned, the decryption for 2048-bit moduli is done by two 1024-bit full modular exponentiations. By contrast, the RSA encryption assumes a much shorter key of 17 bits, i.e., $e = 2^{16} + 1$. In this way, the 2048-bit modular exponentiation is …
The work in [15] implements RSA cryptography in RNS, in which the RNS Montgomery modular multiplication algorithm is applied. By contrast, this work applies the modular multiplication algorithm with sum residues in RNS. It can be found from Table III that the clock cycles of this work are about 1/3 of those in [15], resulting from high-radix digits of RNS moduli, simple conversions between RNS and the binary system, and few base conversions between two RNS bases. Compared with [4], the main improvement is the reduction of clock cycles by more than 80%.
The design in [23] is based on high-radix scalable Montgomery modular multipliers, and two 1024-bit modular exponentiations are carried out in parallel to compute 2048-bit RSA decryptions. While it is slower than this work for modular exponentiation, it may be faster than this work for RSA decryption due to high-level parallelism. The architecture in [33] uses a high-radix multiplier to implement Montgomery modular multiplication and modular exponentiation, which is relatively slow in 90-nm CMOS technology compared with this work on the XCV6 FPGA device in Tables II and III. The XCV6 FPGA uses 40-nm copper CMOS process technology; however, it is often slower than the usual ASIC logic at the same technology node.

VI. CONCLUSION
The RNS Montgomery algorithm is widely used for multiprecision modular multiplications, but its performance is curbed by the frequent conversion between two RNSs. By precomputing the constant parameters in RNS, modular multiplication can be calculated by the Chinese remainder theorem directly in a single RNS, which reduces the precomputation and control logic. Together with special moduli, this shows that high-performance modular exponentiation and RSA cryptography can be obtained, consuming even fewer clock cycles than scalable architectures.