Secure Inference on Homomorphically Encrypted Genotype Data with Encrypted Linear Models

Background: Accurate models are crucial for estimating phenotypes from high-throughput genomic data. Because genetic and phenotypic data are sensitive, secure models are essential to protect private information. Constructing a model that is both accurate and secure is therefore central to the secure inference of phenotypes. Methods: We propose a secure inference protocol on homomorphically encrypted genotype data with encrypted linear models. First, the genotype data are scaled by feature importance obtained from Xgboost or Adaboost, and linear models are trained in plaintext to predict the phenotypes. Second, the model parameters and test data are encrypted with the CKKS scheme for secure inference. Third, the phenotypes are predicted under CKKS homomorphic computation. Finally, the client decrypts the encrypted predictions and computes 1-NRMSE/AUC for model evaluation. Results: Five phenotypes of 3000 samples with 20390 variants are used to validate the performance of the secure inference protocol. The protocol achieves 0.9548, 0.9639 and 0.9673 (1-NRMSE) for the three continuous phenotypes and 0.9943 and 0.9929 (AUC) for the two categorical phenotypes on test data. Moreover, the protocol remains robust over 100 rounds of random sampling. Furthermore, it achieves an average accuracy of 0.9725 on an encrypted test set with 198 samples, and the overall inference takes only 4.32 s. These results ranked the protocol first in the iDASH-2022 track 2 challenge. Conclusion: We propose an accurate and secure protocol to predict phenotypes from genotypes; it takes seconds to obtain hundreds of predictions for all phenotypes.


Introduction
Research on the genotype-to-phenotype relationship is crucial for uncovering gene functions and the mechanisms behind distinct phenotypic outcomes [1]. High-throughput genomic data make it possible, in principle, to predict phenotype from genotype. However, genotype-to-phenotype inference is a complex problem owing to intricate factors such as genotypes, epigenetic variants and their interactions [2]. Moreover, individuals with the same genotype may develop thousands of different diseases, so achieving accurate predictions for these phenotypes efficiently remains a major challenge [3]. Because of the sensitive nature of genotype and phenotype data, a model that is both secure and accurate is essential for the secure inference of the predictions.
Linear or logistic regression models are widely applied in genome-wide association studies (GWASs), for example in SNPTEST and PLINK [4][5][6]. However, linear models may overfit because the number of genotypes far exceeds the number of phenotypic outcomes. Regularized linear regression models such as ridge regression, lasso, elastic net and their extensions can overcome the overfitting problem and select a functional genotype set for phenotype estimation [7][8][9]. Whereas linear models capture only additive effects, ensemble-based machine learning methods such as Xgboost or Adaboost can also select epistatic genotypes and may achieve better performance [10]. Both linear and non-linear models help to construct an accurate model for inferring phenotype from genotype.
The sensitive nature of genotype and phenotype data urges the development of secure inference models for phenotype prediction. In particular, track 2 of iDASH-2022 calls for secure model evaluation on homomorphically encrypted genotype data that protects both the model parameters and the genotypes.
Homomorphic encryption (HE) is a cryptosystem that enables homomorphic operations on encrypted data and is considered one of the most important primitives for privacy-preserving applications. Most current HE schemes can be categorized into word-wise HE (such as BFV [11], BGV [12] and CKKS [13]) and bit-wise HE (such as FHEW [14] and TFHE [15]). Among these schemes, Cheon-Kim-Kim-Song (CKKS) is the only one that natively supports homomorphic operations on float/complex numbers. Therefore, CKKS can be used to construct a secure inference of phenotype from genotype.
To make the secure inference efficient, we propose an accurate and secure inference protocol on homomorphically encrypted genotype data with encrypted linear models. First, the genotype data are scaled by feature importance obtained from Xgboost or Adaboost, and a linear model is trained in plaintext to predict the phenotypes. Second, the model parameters and genotype data are encrypted with CKKS for secure inference. Third, the phenotypes are predicted under CKKS homomorphic encryption. Finally, the client decrypts the encrypted predictions and computes 1-NRMSE/AUC for model evaluation.

Overview of the secure inference protocol
The secure inference involves three parties: Client, Modeler and Evaluator. We use 198 samples with 20390 features/variants as an example to illustrate the details (Fig. 1). First, the Client generates the private key, the public key (for encryption), the relinearization keys (for ciphertext multiplication) and the Galois keys (for ciphertext rotation), and broadcasts the public keys to the Modeler and the Evaluator. Second, the Client encrypts the test data with the public key, using diagonal coding, the BSGS algorithm and CKKS homomorphic encryption, and sends the encrypted result to the Evaluator. Third, the Modeler encrypts the model parameters with the received public key and sends the encrypted result to the Evaluator. Fourth, after receiving the encrypted test data matrix and the encrypted model parameters, the Evaluator performs the homomorphically secure model inference with the received relinearization and Galois keys and sends the encrypted predictions to the Client. Finally, the Client decrypts the prediction ciphertexts, and the decrypted predictions are used to compute 1-NRMSE and AUC.
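The message flow above can be mocked in plaintext as follows. This is only a sketch of who sends what to whom: encryption and decryption are replaced by identity stand-ins, the array sizes are made tiny, and all variable names are hypothetical.

```python
import numpy as np

# Plaintext mock of the three-party flow; Enc/Dec are identity stand-ins
# for CKKS encryption, so only the message pattern is illustrated.
rng = np.random.default_rng(0)

# Client: holds the test genotype matrix (198 x 20390 in the paper)
X_test = rng.random((8, 4))          # tiny stand-in sizes
enc_X = X_test                       # Client -> Evaluator: Enc(X_test)

# Modeler: holds linear model parameters for K phenotypes
W = rng.random((4, 2))               # weights, one column per phenotype
b = rng.random(2)                    # intercepts
enc_W, enc_b = W, b                  # Modeler -> Evaluator: Enc(W), Enc(b)

# Evaluator: computes encrypted predictions without seeing any plaintext
enc_pred = enc_X @ enc_W + enc_b     # homomorphic matrix product + add

# Client: decrypts and evaluates (1-NRMSE / AUC in the paper)
pred = enc_pred                      # Dec(enc_pred)
```

Under CKKS, the `@` and `+` above become the rotation-and-multiply routines described in the following sections, and only the Client ever holds the secret key.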
Linear models with feature importance for predictions of phenotype from genotype in plaintext

Let $X \in \mathbb{R}^{m \times n}$ be the genotype matrix, where $X_{ij}$ denotes the $j$-th variant of the $i$-th sample, and let $Y \in \mathbb{R}^{m \times K}$ be the phenotype matrix, where $Y_{ik}$ denotes the $k$-th phenotype of the $i$-th sample. Xgboost or Adaboost is used to obtain the feature importance for each phenotype, and the raw genotype matrix is then scaled by

$$\tilde{X}^{(k)}_{ij} = X_{ij} \cdot F_{kj},$$

where $F_k \in \mathbb{R}^n$ is the feature-importance vector for the $k$-th phenotype.
If $Y_k$ is a continuous phenotype, the linear regression model is

$$\hat{Y}_k = \tilde{X}^{(k)} M_k + w_{0k},$$

where $M_k$ is the vector of linear regression coefficients and $w_{0k}$ is the intercept term. If $Y_k$ is a categorical phenotype (i.e. 0 or 1), the logistic regression model is

$$p_k = \frac{1}{1 + e^{-(\tilde{X}^{(k)} M_k + w_{0k})}},$$

where $p_k$ is the probability of predicting the phenotype to be 1.
Similarly, since the sigmoid is strictly monotonic, ranking the linear score $\tilde{X}^{(k)} M_k + w_{0k}$ yields the same AUC as ranking $p_k$, so the final model for categorical phenotypes can also be evaluated as a linear score. In summary, both final models adopt the Linear Model with Feature Importance (LMFI) for phenotype inference.
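The plaintext LMFI pipeline can be sketched as follows. The feature-importance vector is a stand-in for what Xgboost or Adaboost would return, and the helper name `lmfi_fit_predict` is hypothetical; ordinary least squares stands in for whichever linear solver the authors used.

```python
import numpy as np

def lmfi_fit_predict(X_train, y_train, X_test, importance):
    """LMFI sketch: scale each variant column by its feature importance,
    then fit an ordinary-least-squares linear model with intercept."""
    Xs_train = X_train * importance            # column-wise scaling by F_k
    Xs_test = X_test * importance
    A = np.hstack([Xs_train, np.ones((len(Xs_train), 1))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    M, w0 = coef[:-1], coef[-1]                # weights and intercept
    return Xs_test @ M + w0

rng = np.random.default_rng(1)
X = rng.random((100, 10))                      # toy genotype matrix
F = rng.random(10)                             # stand-in Xgboost importances
y = (X * F) @ rng.random(10) + 0.5             # noiseless linear phenotype
pred = lmfi_fit_predict(X[:80], y[:80], X[80:], F)
```

For categorical phenotypes the same scaled features would feed a logistic fit, but as noted above the linear score already determines the AUC.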

CKKS Scheme
For a power-of-two number $N$, we write $R_N = \mathbb{Z}[X]/(X^N + 1)$ and $R_{N,q} = R_N / q R_N \cong \mathbb{Z}_q[X]/(X^N + 1)$.
Lower-case letters with a hat symbol, such as $\hat{a}$, represent elements of $R_N$, and $a_j$ denotes the $j$-th coefficient of $\hat{a}$. The dot symbol $\cdot$, as in $\hat{a} \cdot \hat{b}$, denotes multiplication of ring elements. We use bold lower-case letters such as $\mathbf{a}$ for vectors, $\mathbf{a}[j]$ for the $j$-th component of $\mathbf{a}$, and $\mathbf{a} \| \mathbf{b}$ for the concatenation of vectors. We denote by $\mathbf{a} \ll k$ the left rotation of the vector components by $k$ positions, by $\mathbf{a}^T \mathbf{b}$ the inner product of vectors and by $\mathbf{a} \circ \mathbf{b}$ the Hadamard product, i.e., the element-wise multiplication. We use bold upper-case letters such as $\mathbf{M}$ to denote matrices. As $\mathbb{Z}[X]/(X^N + 1)$ is isomorphic to $\mathbb{C}^{N/2}$, the ring structure allows us to encode a real vector $\mathbf{v} \in \mathbb{R}^l$ with $l \le N/2$ as a ring element of $R_{N,q}$. Addition/multiplication in $R_N$ corresponds to element-wise addition/multiplication of the real (complex) vector $\mathbf{v} \in \mathbb{R}^l$. We denote by $\mathrm{Encode}(\mathbf{v}, \Delta) \in R_{N,q}$ the encoding of $\mathbf{v}$ with a scaling factor $\Delta > 0$, and by $\mathrm{Decode}(\hat{v}, \Delta, l) \in \mathbb{R}^l$ the decoding with a scaling factor $\Delta > 0$ and a length $l > 0$, respectively.
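The role of the scaling factor $\Delta$ can be illustrated with a minimal fixed-point sketch. This deliberately omits the canonical embedding that real CKKS applies before rounding; only the scale-then-round/divide-by-scale round trip is shown.

```python
import numpy as np

DELTA = 2 ** 20  # scaling factor Delta

def encode(v, delta=DELTA):
    # Fixed-point sketch of Encode(v, Delta): scale reals up and round
    # to integers. (Real CKKS also applies the inverse canonical
    # embedding to map slots into polynomial coefficients.)
    return np.round(np.asarray(v) * delta).astype(np.int64)

def decode(p, delta=DELTA, l=None):
    # Decode(p, Delta, l): scale back down and keep the first l slots.
    v = p.astype(np.float64) / delta
    return v if l is None else v[:l]

v = np.array([0.1, -2.5, 3.14159])
roundtrip = decode(encode(v), l=3)   # recovers v up to ~1/Delta error
```

The rounding error is at most $1/(2\Delta)$ per slot, which is why a larger $\Delta$ gives higher precision at the cost of faster modulus consumption under multiplication (where scales multiply to $\Delta^2$).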
The Ring Learning With Errors (RLWE) distribution $\mathrm{RLWE}_s(N, q, \chi)$ with secret $s \in R_N$ and error distribution $\chi$ over $R_N$ produces pairs $(a, b) \in R_{N,q}^2$, where $a \leftarrow R_{N,q}$ is chosen uniformly at random and $b = s \cdot a + e$ for $e \leftarrow \chi$. The decisional Ring LWE assumption over $R_N$ with error distribution $\chi$, secret distribution $\chi'$ and $m$ samples states that when $s \leftarrow \chi'$, the product distribution $\mathrm{RLWE}_s(N, q, \chi)^m$ is pseudorandom, i.e., computationally indistinguishable from the uniform distribution over $(R_{N,q} \times R_{N,q})^m$. As usual, $\chi'$ is the uniform distribution over $R_{N,3} = R_N / 3 R_N$ and $\chi$ is a discrete Gaussian distribution.
The security of the CKKS scheme is based on the RLWE assumption. The details of CKKS are as follows.
1. The key generation algorithm picks $s \leftarrow \chi'$, $e \leftarrow \chi$, $a \leftarrow R_{N,q}$, and outputs the secret key $sk = (-s, 1) \in R_{N,q}^2$ and the public key $pk = (a, b) \in R_{N,q}^2$, where $b = s \cdot a + e$ follows the RLWE distribution.
By the linearity of Enc, CKKS directly supports (bounded) addition of ciphertexts: if $ct_0 = (a_0, b_0)$ and $ct_1 = (a_1, b_1)$ are encryptions of $\mathbf{v}_0$ and $\mathbf{v}_1$ respectively, then $ct_0 + ct_1 = (a_0 + a_1, b_0 + b_1)$ is an encryption of the vector sum $\mathbf{v}_0 + \mathbf{v}_1$. For homomorphic multiplication/rotation, extra public keys are needed. We denote by EK/RotK the evaluation key for homomorphic multiplication/rotation, respectively.
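The additive homomorphism can be checked end to end with a toy RLWE instantiation. The parameters below are far too small to be secure and the error bounds are illustrative only; the aim is just to show that $b = s \cdot a + e$, public-key encryption and ciphertext addition fit together as stated.

```python
import random

N, Q, DELTA = 8, 2 ** 50, 2 ** 12   # toy parameters, NOT secure

def polymul(a, b):
    """Multiplication in Z_Q[X]/(X^N + 1) (negacyclic convolution)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % Q
            else:                       # X^N = -1 wraps with a sign flip
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % Q
    return c

def add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

random.seed(0)
s = [random.choice([-1, 0, 1]) % Q for _ in range(N)]     # ternary secret
a = [random.randrange(Q) for _ in range(N)]
e = [random.randint(-2, 2) % Q for _ in range(N)]
pk = (a, add(polymul(s, a), e))                           # b = s*a + e

def encrypt(m, pk):
    a_pk, b_pk = pk
    u = [random.choice([-1, 0, 1]) % Q for _ in range(N)]
    e0 = [random.randint(-2, 2) % Q for _ in range(N)]
    e1 = [random.randint(-2, 2) % Q for _ in range(N)]
    return (add(polymul(u, a_pk), e0),
            add(add(polymul(u, b_pk), e1), m))

def decrypt(ct, s):
    c0, c1 = ct
    m = add(c1, [(-x) % Q for x in polymul(s, c0)])       # c1 - s*c0
    return [x - Q if x > Q // 2 else x for x in m]        # center mod Q

m0 = [round(0.5 * DELTA)] + [0] * (N - 1)                 # encodes 0.5
m1 = [round(0.25 * DELTA)] + [0] * (N - 1)                # encodes 0.25
ct_sum = tuple(add(x, y) for x, y in zip(encrypt(m0, pk), encrypt(m1, pk)))
val = decrypt(ct_sum, s)[0] / DELTA                       # approx 0.75
```

Decryption yields $m + u \cdot e + e_1 - s \cdot e_0$, so the result is the message plus a small error, and adding ciphertexts adds both messages and errors, which is why the addition is "bounded".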
Rotation. Given a ciphertext $ct$ that encrypts $\mathrm{Encode}(\mathbf{v}, \Delta)$, an integer $k \in \mathbb{N}$ and a rotation key RotK, the operation $\mathrm{RotL}_k(ct; \mathrm{RotK})$ yields a CKKS ciphertext that encrypts the left-rotated vector $\mathrm{Encode}(\mathbf{v} \ll k, \Delta)$.
Self-repeating. $\mathrm{Decode}(\mathrm{Encode}(\mathbf{v} \| \cdots \| \mathbf{v}, \Delta), \Delta, l) = \mathbf{v}$. In other words, the encoding of a self-repeating vector can be viewed as the encoding of a single copy.
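Both slot-level facts are easy to see on plaintext vectors. The sketch below simulates the slot effect of $\mathrm{RotL}_k$ with `np.roll` and shows why a tiled (self-repeating) vector behaves like a single copy under rotation by its period.

```python
import numpy as np

def rotl(v, k):
    """Slot effect of RotL_k: left rotation v << k."""
    return np.roll(v, -k)

v = np.array([1.0, 2.0, 3.0, 4.0])

# Self-repeating: a tiled vector rotated by its period is unchanged,
# so one encrypted copy of v can stand in for many.
tiled = np.tile(v, 2)                 # v || v in an 8-slot ciphertext
```

This invariance is what lets the tiled matrix in the next section be multiplied against a single encrypted copy of the input vector.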
The rectangular matrix $\mathbf{M}$ is converted to a square matrix by repeating the matrix itself (called tiling) rather than zero-padding the rows (resp. columns) of $\mathbf{M}$ to a square shape. A subset of the diagonals of the tiled matrix is constructed in Step 1 by looping through the rows and columns of $\mathbf{M}$. This tiling is always possible without zero-padding because the numbers of rows and columns of $\mathbf{M}$ are always powers of 2. The baby-step-giant-step (BSGS) technique [19] in Step 2 of Algorithm 1 and Algorithm 2 sums up products of plaintexts and ciphertexts with specific offsets of homomorphic rotations. Specifically, the un-relinearized products $\mathrm{Mul}'(ct_{0,i}, ct_{1,i})$ are first summed, and a single $\mathrm{Relin}\!\left(\sum_i \mathrm{Mul}'(ct_{0,i}, ct_{1,i}); \mathrm{EK}\right)$ is applied, which is equivalent to $\sum_i \mathrm{Mul}(ct_{0,i}, ct_{1,i})$. Note that Relin is executed only once, so the cost of the expensive Relin is independent of the number of ciphertext pairs $(ct_{0,i}, ct_{1,i})$.
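The diagonal method with BSGS can be simulated in plaintext to check the index arithmetic. This is a sketch of the standard diagonal matrix-vector identity with baby-step-giant-step grouping, not a reproduction of the paper's Algorithms 1 and 2; in the encrypted setting each `np.roll` below costs one homomorphic rotation, so grouping cuts the rotation count from about $n$ to about $g + n/g$.

```python
import numpy as np

def diagonals(M):
    """Generalized diagonals: diag_d[i] = M[i, (i + d) mod n]."""
    n = M.shape[0]
    return [np.array([M[i, (i + d) % n] for i in range(n)])
            for d in range(n)]

def bsgs_matvec(M, v, g):
    """BSGS diagonal matrix-vector product:
    M @ v = sum_j rot_{gj}( sum_i rot_{-gj}(diag_{gj+i}) o rot_i(v) ),
    needing only ~g + n/g distinct rotations instead of n."""
    n = M.shape[0]
    assert n % g == 0
    diags = diagonals(M)
    baby = [np.roll(v, -i) for i in range(g)]   # baby-step rotations of v
    out = np.zeros(n)
    for j in range(n // g):                     # giant steps
        acc = np.zeros(n)
        for i in range(g):
            # pre-rotate the plaintext diagonal so that one outer
            # rotation per giant step suffices
            acc += np.roll(diags[g * j + i], g * j) * baby[i]
        out += np.roll(acc, -g * j)
    return out

rng = np.random.default_rng(2)
M = rng.random((8, 8))                          # tiled square matrix
v = rng.random(8)
res = bsgs_matvec(M, v, g=4)                    # equals M @ v
```

Choosing $g \approx \sqrt{n}$ minimizes the total rotation count, which is the dominant cost in the encrypted inference.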
The performance of the secure inference protocol is evaluated by model accuracy and running time. Specifically, model accuracy is measured by 1-NRMSE for continuous phenotypes and by AUC for categorical phenotypes. The NRMSE of the $k$-th phenotype is calculated as

$$\mathrm{NRMSE}_k = \frac{\sqrt{\frac{1}{m}\sum_{i=1}^{m} \left(Y_{ik} - \hat{Y}_{ik}\right)^2}}{\max_i Y_{ik} - \min_i Y_{ik}},$$

where $\hat{Y}_{ik}$ is the prediction of the $k$-th phenotype for sample $i$.
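The two metrics can be sketched as follows. Range normalization in NRMSE and the pairwise rank-sum form of AUC are standard choices assumed here, consistent with but not explicitly stated in the text.

```python
import numpy as np

def one_minus_nrmse(y_true, y_pred):
    """1 - NRMSE, with the RMSE normalized by the range of y_true."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 1.0 - rmse / (y_true.max() - y_true.min())

def auc(y_true, score):
    """AUC via the Mann-Whitney identity: the fraction of
    (positive, negative) pairs ranked correctly; ties ignored."""
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    return np.mean([p > n for p in pos for n in neg])

y = np.array([0.0, 1.0, 2.0, 3.0])
labels = np.array([0, 0, 1, 1])
```

Because AUC depends only on the ordering of the scores, the client can compute it directly on the decrypted linear scores without recovering the logistic probabilities.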

LMFI outperforms other methods in plaintext
To demonstrate the performance of LMFI in inferring phenotype from genotype, we applied it to the dataset containing 5 phenotypes of 3000 samples with 20390 variants, with ten percent of the dataset randomly selected as test data. LMFI achieves 0.9548, 0.9639, 0.9673, 0.9943 and 0.9929 for the five phenotypes, respectively (Fig. 2), performing much better than linear models with lasso on both continuous and categorical phenotypes. Furthermore, LMFI also outperforms non-linear models such as Xgboost and Adaboost (Fig. 2).

The secure inference protocol is robust
To demonstrate the robustness of the secure inference protocol, we applied it to 100 rounds of random sampling with 300 test samples each and evaluated the performance by mean and standard deviation. The secure protocol obtains 0.9577 ± 0.0033, 0.9686 ± 0.0025, 0.9711 ± 0.0030, 0.9921 ± 0.0029 and 0.9920 ± 0.0034 (mean ± standard deviation) for the 5 phenotypes, indicating that the protocol is robust across different experiments (Fig. 3a). Furthermore, the protocol achieves an average accuracy of 0.9725 in a blind test with 198 samples.
The secure inference protocol is efficient
To demonstrate the efficiency of the secure protocol, we tested its performance with different sample sizes. Here we choose N = 8192, and the coefficient modulus is the product of three primes of 60 bits each, following the homomorphic encryption standard white paper [20]. The protocol takes only 4.32 s for 198 samples (Fig. 3b). As the sample size grows, the running time changes little (4.63 s for 250 samples); even with 500 samples it takes 8.67 s to obtain the final predictions. All computations were run on an AMD EPYC 7K64 @ 2.6 GHz with 4 processes and 8 GB of memory.

Discussion And Conclusion
Constructing an accurate, secure and efficient phenotype prediction model is essential for privacy-preserving computation. We have developed a secure inference protocol with encrypted linear models; it achieves good performance in phenotype inference and remains robust over 100 rounds of random sampling. It is also very efficient, taking only seconds to predict hundreds of samples. However, the protocol needs further development. First, it has not been tested on more datasets.
Second, the homomorphically encrypted computation should be further optimized. Third, the linear models could be extended to non-linear models to achieve better accuracy, with the homomorphic computation methods transformed accordingly. In conclusion, we have developed an accurate, secure and efficient phenotype prediction protocol that takes only seconds to predict hundreds of samples.

Figure 2. The comparisons of different methods. The performance of LMFI, Xgboost, Adaboost and LR/Logit-lasso on continuous phenotypes (a) and categorical phenotypes (b). 1-NRMSE (normalized root mean square error) is used to evaluate the continuous phenotypes and AUC is used to evaluate the categorical phenotypes. LMFI shows the best performance among these methods.