Secure Multi-Label Tumor Classification Using Homomorphic Encryption

Background: In the iDASH 2020 secure genome analysis competition, the homomorphic encryption task was to develop a multi-label tumor classification method that predicts the classes of samples from genetic information. In the scenario, a data holder encrypts a genetic variant dataset from tumor samples and provides the encrypted data to an untrusted server. The server then evaluates the homomorphically encrypted data with a model trained in plaintext on published data or its own genetic data, and outputs the result in an encrypted state, so that no genetic information is leaked.
Methods: We develop a secure multi-label tumor classification method using the CKKS scheme, an approximate homomorphic encryption scheme. We first propose a new data preprocessing method to reduce the size of the large-scale genetic data of tumor samples. Our method targets the dataset of iDASH 2020 competition track I, which originated from The Cancer Genome Atlas (TCGA) dataset and consists of 2,713 samples from 11 types of cancer with genetic features from more than 25,000 genes. Secondly, we propose a new data packing method for CKKS ciphertexts that provides a trade-off between the number of ciphertexts and the number of rotations in matrix multiplication. Lastly, we suggest an approximation method for the softmax activation of a neural network model.
Results: Our preprocessing method reduces the number of genes from more than 25,000 to at most 2,048 and achieves a microAUC value of 0.9865 with a 1-layer shallow neural network. Using our model, we successfully compute the tumor classification inference steps on the encrypted test data in 4.5 minutes. Despite using the approximate softmax function, the difference in microAUC between our results in the encrypted state and the plain results is less than $10^{-3}$.
Conclusions: We present preprocessing and evaluation methods for secure multi-label tumor classification based on approximate homomorphic encryption, using a shallow neural network model with the softmax activation function.


Background
Cancer is a disease caused by the unlimited proliferation of certain cells in the human body, and the exact cause of cancer is not yet known. In 2020 alone, 19.3 million new cases were reported worldwide, and 10 million people died because of cancer [1]. For this reason, the prediction of cancer through genetic data analysis has been regarded as one of the most important tasks since early treatment can reduce the lethal effects of cancer on the human body.
Machine learning (ML) is a field of artificial intelligence in which a system learns to find solutions to a given problem on its own, without human assistance. Because diagnosing tumors through genes is difficult, tumor classification using ML on large amounts of genetic data contributes to decision making for diagnosing and treating cancer. Some cancer genome studies using ML techniques on large-scale data have shown the relationship between genetic modification and specific cancer types [2,3,4,5].

Motivation
Since genetic data contains a lot of personal information and cannot be discarded or changed even if it is leaked, it is essential to protect its privacy when analyzing it with ML. Various techniques, including differential privacy [6,7] and multi-party computation [8], have been used to ensure privacy in the data analysis process. However, each of these approaches has a disadvantage: losing accuracy in the process of anonymizing the data, or requiring multiple phases in the process of data sharing.
Homomorphic Encryption (HE) is a cryptographic scheme that enables us to perform arithmetic operations between encrypted data without decryption. HE has been considered one of the most useful tools for privacy-preserving ML, as it allows the computation of desired operations without disclosing information about the data [9,10]. However, most HE libraries [11,12,13] mainly support only addition and multiplication, so many popular activation functions for neural networks, such as the sigmoid, ReLU, or softmax functions, cannot be directly evaluated on encrypted data. Most applications using HE overcome this problem by approximating the non-arithmetic operations with polynomials that minimize the error over a certain interval. The main problem when applying HE is to minimize the computational overhead incurred by the polynomial approximation while keeping the error bounded.
Integrating Data for Analysis, Anonymization and SHaring (iDASH) has held an annual secure genome analysis competition since 2014. Each year, important topics in the field of genetic analysis are selected to compete for the most effective solution. The problem of multi-label tumor classification using HE was one of the three tasks of the 2020 iDASH competition. Given a dataset with a total of 2,713 patients and their 25,128 genes, participants had to preprocess the given data, train an ML model on plain data, and obtain the highest microAUC score [14] within 5 minutes when the inference step was performed on encrypted test data.

Summary of Results
In this work, we propose a privacy-preserving multi-label classification method using a shallow neural network with softmax activation based on HE, as a solution to the first track of the iDASH 2020 competition. Our work aims to build the model and run it over encrypted data to get the highest microAUC score, while the total round-trip time (including encryption, computation, and decryption) for the inference step costs less than 5 minutes.
To compute the inference step on encrypted data within the limited time, the following challenges must be solved. First, since the given data is very large, a large number of ciphertexts are required to encrypt it, and the operations between them are also time-consuming. Second, the softmax function cannot be directly computed using HE because it consists of exponential and inverse functions.
To overcome these challenges, we propose a data preprocessing method suitable for the given dataset, so that the number of genes for both the training and inference steps is reduced to less than $2^B$ for some $B$ (we choose $B = 9, 10$, or $11$). The data consist of two types: Copy Number (CN) data and Variants data. For CN data, we divide the genes into groups via the Hamming distance function and select a representative gene for each group, which we call CN data filtering. Since adjacent genes mutate similarly, our technique can be thought of as effectively selecting representatives of genes at similar positions. For Variants data, only data for specific genes exist for each person, so we select the representative genes for each tumor based on mutation effect and frequency, which we call Variants data filtering.
Also, we find that an elementary approximation of the exponential function retains the properties needed for calculating the microAUC value. For microAUC, what matters is the comparison between different samples for each feature, so an approximation method that merely minimizes the error, such as the minimax approximation, does not preserve the order and results in a large loss of microAUC. On the other hand, even though it does not minimize the error over a given interval, our approximation satisfies the qualitative properties of the exponential function: 1) it reduces the spacing between small inputs, and 2) it increases the spacing between large inputs, so the difference in the microAUC value under the approximate softmax is small. Using these properties, we confirmed through experiments that the microAUC score obtained with our softmax approximation is almost the same as that computed with the exact softmax function.
Additionally, we modify the data packing method for matrix-vector multiplication [15] by duplicating the data at encryption time, making a trade-off between the number of ciphertexts for packing and the number of rotations. Since rotations occupy most of the time cost in the multiplication between an encrypted matrix and a plain vector, our method reduces their number, and the total time can be minimized by choosing an appropriate number of copies in the encryption step.
Finally, we exploit an approximate homomorphic encryption scheme, HEaaN [16], to implement our inference method over encrypted data. As one of our results, by reducing the number of genes to less than $2^{11}$, we obtain an inference result with a microAUC value of 0.9865 in about 4.5 minutes.

Related Works
Various ML techniques have been used to classify the cancer type of a sample from its genetic data. Chen et al. [17] classified across 17 cancer types using a Support Vector Machine (SVM), a kind of supervised learning. They used the Catalogue Of Somatic Mutations In Cancer (COSMIC) data, an online database of somatically acquired mutations found in human cancer. The dataset consists of gene symbols, mutations, chromosomes, and pathways. They preprocessed the data to remove all samples lacking information about some features and achieved 62% accuracy.
Yuan et al. [4] introduced a multi-label classifier named DeepGene, using a Deep Neural Network (DNN) with 4 hidden layers and achieving a maximum accuracy of 65.5% across 12 cancer types. The data was from The Cancer Genome Atlas (TCGA), a project to catalogue cancer-related genetic mutations using genome sequencing and bioinformatics, and recorded the presence of mutations in each of about 22k genes for about 3k samples. They introduced two filtering techniques, named Clustered Gene Filtering (CGF) and Indexed Sparsity Reduction (ISR), to filter out irrelevant genes and reduce the data sparsity. In particular, CGF clusters the genes by Jaccard distance and extracts the representative genes. The filtering techniques made it easier for the DNN model to learn the data and improved its accuracy.
Yuan et al. [18] also introduced DeepCNA, a cancer classification scheme based on a Convolutional Neural Network (CNN) using copy number aberrations (CNAs) and chromatin 3D structure data. The data was from the COSMIC CNA dataset, scaled into an interval and zero-padded to fit the CNN input image size. A 2D CNN model with 7 hidden layers was used in their work, which led to 57.4% accuracy across 25 cancer types.
Sun et al. [19] reached 70.1% classification accuracy across 12 cancer types using a 5-layer DNN model. They used Whole Exome Sequencing (WES) mutation data from the TCGA, together with healthy samples from the 1000 Genomes Project (1KGP), a project to catalogue human genetic variation in detail. Genes with a low mutation occurrence rate were considered irrelevant and filtered out.
Lee et al. [20] classified across 31 cancer types with their CPEM scheme, an ensemble of two ML techniques: random forest and DNN. The data, from the COSMIC database, consists of gene-level mutation profiles, mutation rates, mutation spectra, gene-level SCNAs, and mutation signatures. They compared the three most commonly used supervised feature selection approaches in ML: extra-tree-based selection, the Least Absolute Shrinkage and Selection Operator (LASSO), and the Linear Support Vector Classifier (Linear SVC). With the Linear SVC feature selection method, they reached 84% classification accuracy.
As a solution to privacy-preserving ML on genetic data, many studies have proposed running ML on encrypted data with HE. Kim et al. introduced a novel method to train a logistic regression model without information leakage using HE [21]. The model was trained to classify samples with 18 genetic features into two classes: diseased or healthy. The authors adapted Nesterov's accelerated gradient descent to reduce the number of iterations, since the depth of the homomorphic circuit highly affects the computational cost. The classifier was selected as the best solution of Track 3 at the iDASH competition 2017, where similar approaches were also taken [9,22,23]. In the second track of the iDASH 2018 competition, which aimed to develop a method for outsourcing the computation of Genome-Wide Association Studies (GWAS) on homomorphically encrypted data, various solutions based on logistic regression were introduced [10,24,25,26,27]. Parallel logistic regression models on encrypted input data were used to identify genetic variants correlated with specific phenotypes. In 2019, the HE task of the iDASH competition was secure genotype imputation using HE. The solutions showed "ultra-fast" HE models for genotype imputation that took less than 10 seconds for evaluation with only a 2-3% decrease in accuracy [28]. The authors confirmed that similar results can be obtained with various HE libraries, including BFV [29], CKKS, and TFHE [30].

Notations
We denote the Hadamard multiplication between two vectors by $a \odot b$. Also, for simplicity, we let $[x]_n$ be the number in $\{1, \cdots, n\}$ satisfying $[x]_n \equiv x \pmod{n}$. For a matrix $A$, $A(i, j)$ means the element in the $i$-th row and $j$-th column of $A$. Also, we denote the submatrix of $A$ with rows $i \in I$ and columns $j \in J$ by $A[I, J]$; here, ':' indicates the whole index set.

Approximate Homomorphic Encryption
Homomorphic Encryption (HE) is a cryptographic scheme that enables computing operations on encrypted data without decryption. For applications that require computations on real data, such as machine learning [31,32,33,34] or cyber-physical systems [35], the approximate HE scheme known as the CKKS scheme, proposed by Cheon et al. [16], which supports arithmetic operations on encrypted real data, is mainly used.
In this section, we formally describe the CKKS scheme, but we omit the details of the scheme construction. Let $L$ be a level parameter such that a fresh ciphertext is equipped with a modulus $q_L = 2^L$, and let $q_\ell := 2^\ell$ for $1 \le \ell \le L$. Let $R := \mathbb{Z}[X]/(X^N + 1)$ be a cyclotomic ring for a power-of-two $N$, and let $R_q$ be the modulo-$q$ quotient ring of $R$, i.e., $R_q = R/qR$. The distributions $\chi_{enc}$ and $\chi_{err}$ denote discrete Gaussian distributions with some fixed standard deviation, and the distribution $\chi_{key}$ outputs a polynomial with coefficients in $\{-1, 0, 1\}$. We denote the rounding function by $\lfloor \cdot \rceil$ and the modulo-$q$ operation by $[\cdot]_q$. The CKKS scheme uses a plaintext vector $\vec{m} \in \mathbb{C}^{N/2}$ and provides entry-wise operations, called Single-Instruction-Multiple-Data (SIMD) operations, such as addition, subtraction, and Hadamard multiplication between vectors. Before encrypting a plaintext vector, we have to convert the complex vector into an integer polynomial in $R$, which is called encoding. For this, CKKS uses a field isomorphism $\tau : \mathbb{R}[X]/(X^N + 1) \to \mathbb{C}^{N/2}$: a plaintext vector $\vec{m}$ is encoded as $m \leftarrow \mathrm{Ecd}(\vec{m}, \Delta) = \lfloor \Delta \cdot \tau^{-1}(\vec{m}) \rceil \in R$ for a scaling factor $\Delta$, and a polynomial $m \in R$ is decoded to the plaintext vector $\vec{m}' = \Delta^{-1} \cdot \tau(m) \in \mathbb{C}^{N/2}$. To sum up, to encrypt a plaintext vector of complex numbers $\vec{m}$, we first encode it into a polynomial $m \leftarrow \mathrm{Ecd}(\vec{m}, \Delta)$ and then generate a ciphertext $\mathrm{ct} \leftarrow \mathrm{Enc}_{pk}(m)$. For convenience, let $n$ be $N/2$. The scheme description of CKKS is as follows: • KeyGen(params).
– Sample $s \leftarrow \chi_{key}$ and set the secret key as $sk = (1, s)$. – Sample $a \leftarrow U(R_{q_L})$ and $e \leftarrow \chi_{err}$. Set the public key as $pk = ([-a \cdot s + e]_{q_L}, a)$. • $\mathrm{Enc}_{pk}(m)$. Given a message $m \in R$, sample $v \leftarrow \chi_{enc}$ and $e_0, e_1 \leftarrow \chi_{err}$. Output the ciphertext $\mathrm{ct} = [v \cdot pk + (m + e_0, e_1)]_{q_L}$.
• $\mathrm{CMult}(\mathrm{ct}, \vec{c})$. Given a ciphertext $\mathrm{ct} \in R^2_{q_\ell}$ encrypting a plaintext vector $\vec{m}$ and a constant vector $\vec{c} \in \mathbb{C}^n$, output the ciphertext $\mathrm{ct}_{cmult} = \mathrm{Ecd}(\vec{c}) \cdot \mathrm{ct}$ encrypting the plaintext vector $\vec{c} \odot \vec{m}$.
• $\mathrm{ReScale}_{\ell \to \ell'}(\mathrm{ct})$. For a ciphertext $\mathrm{ct} \in R^2_{q_\ell}$, return the ciphertext $\mathrm{ct}' = [\lfloor (q_{\ell'}/q_\ell) \cdot \mathrm{ct} \rceil]_{q_{\ell'}}$. This step is needed to control the scaling factor after some operations. For the remainder of the paper, we may denote the operations between ciphertexts, or between a ciphertext and a plain vector, by common symbols such as $\mathrm{Add}(\mathrm{ct}_1, \mathrm{ct}_2) = \mathrm{ct}_1 + \mathrm{ct}_2$ or $\mathrm{CMult}(\mathrm{ct}, \vec{c}) = \mathrm{ct} \cdot \vec{c}$ for simplicity.

Data Preprocessing
Given more than 25,000 genes, we need to select meaningful genes for efficient machine learning using HE. Therefore, we introduce a new data preprocessing technique to obtain significant genes that are useful for predicting tumors. Our data preprocessing method consists of two filtering algorithms that extract specific features from two different types of data. The resulting filtered data matrices are concatenated and input to our neural network as in Figure 1(c).

Data Structure
Our scenario focuses on the dataset originating from The Cancer Genome Atlas (TCGA) database, which is widely used in genomic research. The dataset contains somatic mutation and copy number data for 11 sites: Bladder, Breast, Bronchus and Lung, Cervix uteri, Colon, Corpus uteri, Kidney, Liver and Intrahepatic bile ducts, Ovary, Skin, and Stomach. The dataset consists of a total of 2,713 samples and 25,128 genes; each sample has one cancer type out of the 11 types and consists of two types of data: Copy Number (CN) data and Variants data. The dataset is available from iDASH 2020 competition track I, and the details of the dataset, including the number of samples and the number of mutation effects for each feature, are given in Table 1.
CN data consists of the copy numbers of the genes. The copy number data show the copy number variations (CNVs) of the genes as numbers representing the state of the gene in the corresponding sample, i.e., whether the gene has a duplication or deletion of a considerable number of base pairs. The data take 5 different levels $0, \pm 1, \pm 2$, where negative and positive values represent deletion and duplication of the corresponding gene in the sample, respectively. A copy number of 0 means that the gene has no considerable variation in the corresponding sample.
Variants data consists of mutation data for selected pairs of samples and genes for each tumor type, with various features: the gene's location on the chromosomes, the mutation type, whether the mutation is a Single Nucleotide Polymorphism (SNP), and the mutation's effects separately predicted by two different methods.

CN Data Filtering
Our main approach for CN data filtering is to filter out irrelevant genes based on the similarity of neighboring genes. Let $C \in \{0, \pm 1, \pm 2\}^{s \times g}$ be the matrix of copy number data, where the $s$ rows correspond to the $s$ samples and the $g$ columns correspond to the $g$ genes; $C(i, j)$ is the copy number of gene $j$ in sample $i$. We initially sort the raw CN data according to the order of the gene positions on the chromosome. Then, we group the genes by their copy number similarity (line 3 in Algorithm 1). For two gene columns $p, q \in \{0, \pm 1, \pm 2\}^s$, we use the Hamming distance as the similarity measure:
$$d_H(p, q) = \#\{\, i : p_i \neq q_i \,\},$$
where $p_i$ and $q_i$ are the $i$-th entries of $p$ and $q$, representing the copy numbers of each gene for the $i$-th sample. Starting from the first gene in $C$, taken as the representative of the first group, we calculate the Hamming distance to the following genes in order. While the distance is less than the predetermined threshold $d_{cn}$, we merge each corresponding gene into the first group. Once the distance reaches the threshold $d_{cn}$, the first group ends and the corresponding gene becomes the representative of the second group. Then, starting from the new representative, we calculate the Hamming distance to the following genes.
Repeating this procedure until all genes are grouped, each gene belongs to a unique group together with its neighboring genes. Finally, we filter the CN data with the representative gene, the first gene in each group, resulting in a matrix $\tilde{C} \in \{0, \pm 1, \pm 2\}^{s \times \tilde{g}_{cn}}$, where $\tilde{g}_{cn}$ is the number of groups produced by our filtering. Note that $\tilde{g}_{cn}$ is much smaller than the number of original genes. We formally state our CN data filtering in Algorithm 1, and its workflow is shown in Figure 1(a); a plain-Python sketch is given below.
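As a reference, the following is a minimal plain-Python sketch of the grouping step described above; the function and variable names (hamming, cn_filter, gene_list) are our own, not the paper's implementation.

```python
import numpy as np

def hamming(p, q):
    # Number of samples on which two gene columns disagree.
    return int(np.sum(p != q))

def cn_filter(C, d_cn):
    """Greedy CN data filtering (sketch of Algorithm 1).

    C    : (s x g) matrix of copy numbers in {0, +-1, +-2},
           columns sorted by gene position on the chromosome.
    d_cn : Hamming-distance threshold that starts a new group.
    Returns the filtered matrix keeping one representative per group.
    """
    gene_list = [0]                 # the first gene starts the first group
    rep = C[:, 0]                   # current representative column
    for j in range(1, C.shape[1]):
        if hamming(rep, C[:, j]) >= d_cn:
            gene_list.append(j)     # gene j becomes the next representative
            rep = C[:, j]
    return C[:, gene_list]
```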

Variants Data Filtering
The Variants data filtering mechanism uses only the mutation-effect data, classified into 4 different levels: LOW, MODERATE, MODIFIER, and HIGH. The Variants data filtering works based on the effectiveness of the mutations: it filters out the genes with less effective mutations.
[Algorithm 1: CN data filtering. Representative genes are appended to geneList, and the filtered matrix $\tilde{C} \leftarrow C[:, \text{geneList}]$ is returned.]
In the Variants data, only a few mutation-effect entries of individually selected genes are given for each sample, so the raw Variants data over all samples and genes is almost empty and cannot be used directly as a neural network input. Hence, our Variants data filtering is applied separately to each cancer type to reduce the empty space, and the remaining empty entries are filled with zeros.
Let $V_t$ be the matrix of raw mutation-effect data corresponding to tumor type $t$ ($1 \le t \le T$). $V_t$ is an $s_t \times g$ matrix of mutation effects as strings, where the $s_t$ rows correspond to the $s_t$ samples and the $g$ columns correspond to all $g$ genes for tumor type $t$; $V_t(i, j)$ is the effect of the mutation in the $j$-th gene of the $i$-th sample with tumor type $t$. We first encode the string data to predetermined real values between 0 and 1: LOW = 0.2, MODERATE = 0.5, MODIFIER = 0.9, HIGH = 1.0, and 0 if no information exists. The resulting encoded Variants matrix is a sparse matrix in $E^{s_t \times g}$, where $E = \{0, 0.2, 0.5, 0.9, 1.0\}$ is the set of encoding values.
Secondly, we sum $V_t$ over each gene column, and if the sum is more than the predetermined threshold $k_{var}$, we put the gene in the list of selected genes (line 2 in Algorithm 2). The column-wise sum $\sum_{i=1}^{s_t} V_t(i, j)$ aggregates the mutation effects of the $j$-th gene over the $s_t$ samples, so the selected genes can be regarded as genes with considerable mutation effects. The union of the selected genes over all tumor types $t$ is the set of filtered genes.
Finally, we filter the whole Variants data with the filtered genes for each $t$, resulting in a matrix $\tilde{V} \in E^{s \times \tilde{g}_{var}}$ with far fewer genes over the whole set of samples (line 11 in Algorithm 2). The workflow of the Variants data filtering is shown in Figure 1(b); a plain-Python sketch is given below.
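A minimal sketch of this filtering, assuming the effect strings have already been encoded to the values in $E$; names such as variants_filter and V_list are ours.

```python
import numpy as np

# Encoding of mutation effects described in the text (0 = no information).
EFFECT = {'LOW': 0.2, 'MODERATE': 0.5, 'MODIFIER': 0.9, 'HIGH': 1.0}

def variants_filter(V_list, k_var):
    """Sketch of Algorithm 2 (Variants data filtering).

    V_list : per-tumor-type (s_t x g) arrays of already-encoded effects.
    k_var  : column-sum threshold for keeping a gene.
    """
    selected = set()
    for V_t in V_list:
        # Genes whose aggregate effect over the s_t samples exceeds k_var.
        selected |= set(np.nonzero(V_t.sum(axis=0) > k_var)[0].tolist())
    gene_list = sorted(selected)
    # Filter every per-tumor matrix and stack all samples back together.
    return np.concatenate([V_t[:, gene_list] for V_t in V_list], axis=0)
```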

Modeling Shallow Neural Network
After preprocessing, we get the data matrix $X$ of size $s \times g$, where $g = \tilde{g}_{cn} + \tilde{g}_{var}$.
[Algorithm 2: Variants data filtering. Each $V_t$ is filtered as $V_t \leftarrow V_t[:, \text{geneList}]$, and $\tilde{V} \leftarrow \mathrm{Concatenate}(V_1, \cdots, V_T, \text{axis} = 0)$ is returned.]
Then, with the tumor label data $Y$, we train a neural network model from $(X, Y)$ in plain, where $T$ is the number of tumor types and $Y$ is the one-hot label matrix of size $s \times T$. Note that $T$ is 11 in our dataset, which is relatively small compared to $s$ and $g$. Our neural network model consists of one hidden layer with 64 nodes and a linear activation function, and an output layer with 11 nodes. In the output layer, we use the softmax activation function to output the predicted values. During the training phase, we used the Keras library [36] in Python with a batch size of 32, 50 epochs, and a dropout rate of 0.9. Note that we used a sufficiently large dropout rate to avoid overfitting, since the layers in our model are small. A Keras sketch of this model is given below.
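The following Keras sketch matches the architecture and hyperparameters described above; the optimizer, loss, and the exact placement of the dropout layer are our assumptions, since the text does not specify them.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(g, T=11):
    # One hidden layer (64 nodes, linear activation), dropout, softmax output.
    model = keras.Sequential([
        keras.Input(shape=(g,)),
        layers.Dense(64, activation='linear'),
        layers.Dropout(0.9),          # large dropout against overfitting
        layers.Dense(T, activation='softmax'),
    ])
    # Optimizer and loss are our assumptions; Y is the one-hot label matrix.
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

# Training as described in the text:
# model = build_model(g); model.fit(X, Y, batch_size=32, epochs=50)
```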

Roadmap for Inference Step over Encrypted Data
From now on, we state the method to compute the inference step of our model on encrypted data. Precisely, an $s \times g$ matrix $X$ and a $g \times T$ matrix $W$ are given (recall that $s$, $g$, and $T$ refer to the numbers of samples, genes, and tumors, respectively). Since $T$ is relatively small compared to $s$ and $g$ in practice, we consider the matrix multiplication $Y = X \cdot W$ as $T$ matrix-vector multiplications $y_i = X \cdot w_i$ for $0 \le i < T$, where $w_i$ is the $(i+1)$-th column of $W$. Then, for each row $Y_j$ of $Y$ ($0 \le j < s$), we compute the softmax function to get the final scores of our model. Since the matrix $Y$ is still encrypted, we need to compute an approximation of the softmax function. In short, our method can be divided into three steps:
1. Data Packing: encrypt the matrix $X$ into a number of ciphertexts $\{\mathrm{ct}_i\}$.
2. Matrix-Vector Multiplication: using $\{\mathrm{ct}_i\}$ and $W$, compute the matrix-vector multiplications to get the ciphertext $\mathrm{ct}_Y$ that contains $Y = X \cdot W$.
3. Softmax Evaluation: compute the approximate softmax function on $\mathrm{ct}_Y$ to obtain the softmax output for each row of $Y$.
We break the method down into the first two steps and the last one, and explain the main ideas in each section.

Data Packing and Matrix Multiplications
Although the CKKS scheme supports SIMD operations between vectors, these operations cannot be directly applied to matrix-vector multiplication. Therefore, in order to efficiently perform a matrix-vector multiplication using HE, a process of mapping a matrix to a number of vectors is required. In this section, focusing on the matrix multiplications in our scenario, we first state the naive approach and then suggest our optimization method. From now on, we denote the rotation of a vector by a parenthesized superscript, so that $v^{(r)} = (v_{r+1}, v_{r+2}, \cdots, v_n, v_1, \cdots, v_r)$.
The main idea of mapping a matrix to vectors for this computation is suggested in [15]: $X \cdot w_i$ can be understood as $g$ slot-wise vector multiplications between the diagonal parts of $X$ and rotations of $w_i$. Concretely, we define $x_j$ as the $j$-th diagonal part of $X$ for $0 \le j < g$, so that $x_j = \big(X(k, [j + k - 1]_g)\big)_{1 \le k \le s}$. Then, the matrix-vector multiplication can be computed as
$$y_i = X \cdot w_i = \sum_{j=0}^{g-1} x_j \odot w_i^{(j)}.$$
Hence, the multiplication can be done if we encrypt the $x_j$'s in several ciphertexts and compute constant multiplications between the ciphertexts and the plain vectors $w_i^{(j)}$. A small plaintext demonstration of this diagonal method is given below.
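The diagonal method is easy to check in plaintext. The sketch below (0-indexed, with numpy arrays standing in for SIMD slots and np.roll for CKKS rotations) verifies that summing diagonal-wise Hadamard products reproduces $X \cdot w$; it assumes $s \le g$, as in our setting.

```python
import numpy as np

def diagonals(X):
    """j-th generalized diagonal of X: x_j[k] = X[k, (j + k) % g]."""
    s, g = X.shape
    return [np.array([X[k, (j + k) % g] for k in range(s)]) for j in range(g)]

def rot(v, r):
    # Left rotation by r slots, the plaintext analogue of a CKKS rotation.
    return np.roll(v, -r)

def matvec_diag(X, w):
    """Compute X @ w as sum_j x_j * rot(w, j), all in plaintext."""
    s, g = X.shape
    xs = diagonals(X)
    y = np.zeros(s)
    for j in range(g):
        y += xs[j] * rot(w, j)[:s]
    return y

# Sanity check against ordinary matrix-vector multiplication:
X = np.arange(8.0).reshape(2, 4)
w = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(matvec_diag(X, w), X @ w)
```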

Warm-up with Simple Parameters
Before stating our explicit method for encoding and matrix multiplication, we start with a toy example, simplifying the parameters to $s = 2$, $g = 4$, and $t = 2$, and assuming that the number of slots $n$ in one ciphertext is 4 times larger than the vector size $s$. Then our goal is to compute $y_i = \sum_{j=0}^{3} x_j \odot w_i^{(j)}$ for $0 \le i < 2$. In the naive approach, we encode the diagonals into one ciphertext $\mathrm{ct} = \mathrm{Enc}(x_0 \,\|\, x_1 \,\|\, x_2 \,\|\, x_3)$ and, for each $i$, encode the plain vector $\tilde{w}_i = (w_i^{(0)} \,\|\, w_i^{(1)} \,\|\, w_i^{(2)} \,\|\, w_i^{(3)})$. Then we can compute the Hadamard multiplication $\mathrm{ct}_{i,mult} = \mathrm{ct} \cdot \tilde{w}_i$, which contains the message vector $(x_0 \odot w_i^{(0)} \,\|\, x_1 \odot w_i^{(1)} \,\|\, x_2 \odot w_i^{(2)} \,\|\, x_3 \odot w_i^{(3)})$. Hence, the final sum is obtained by computing two rotations, $\mathrm{ct}'_i = \mathrm{ct}_{i,mult} + \mathrm{rot}(\mathrm{ct}_{i,mult}; n/2)$ and $\mathrm{ct}_{i,sum} = \mathrm{ct}'_i + \mathrm{rot}(\mathrm{ct}'_i; n/4)$, where the first $s$ slots of $\mathrm{ct}_{i,sum}$ contain $y_i$. In total, the operations we need are 2 CMults and 4 rots (one CMult and two rots for each of the $t = 2$ outputs).
In the naive approach, we need $t$ matrix-vector multiplications, so the CMult and rot operations above are repeated $t$ times. Since the implementation cost of rot is larger than that of CMult, we suggest a new packing method that reduces the number of rotations in the whole algorithm. Our main idea is to duplicate each $x_j$ several times in the encoding step, which reduces the number of rotations since fewer summations between the slots of one ciphertext are required. For the same setting as the example above, we can encode each vector twice, obtaining two ciphertexts $\mathrm{ct}_0 = \mathrm{Enc}(x_0 \,\|\, x_0 \,\|\, x_1 \,\|\, x_1)$ and $\mathrm{ct}_1 = \mathrm{Enc}(x_2 \,\|\, x_2 \,\|\, x_3 \,\|\, x_3)$. In this case, we define the plain vectors $\tilde{w}_{0i} = (w_i^{(0)} \,\|\, w_i^{(0)} \,\|\, w_i^{(1)} \,\|\, w_i^{(1)})$ and $\tilde{w}_{1i} = (w_i^{(2)} \,\|\, w_i^{(2)} \,\|\, w_i^{(3)} \,\|\, w_i^{(3)})$, so that $\mathrm{ct}'_i = \mathrm{ct}_0 \cdot \tilde{w}_{0i} + \mathrm{ct}_1 \cdot \tilde{w}_{1i}$ followed by a single rotation, $\mathrm{ct}'_i + \mathrm{rot}(\mathrm{ct}'_i; n/2)$, yields the desired $y_i$'s. In this method, the operations we need are 4 CMults and 2 rots, while 2 ciphertexts are required for encryption instead of 1.

Using Imaginary Part of Message Space
The CKKS scheme supports operations between complex numbers, but in applications that use only real numbers, such as neural networks, the imaginary part is never used. Here, we can make the computation in the encrypted state more efficient by exploiting the imaginary part, which is unused in the plain operation.
The main idea, suggested in [28], is that for four real numbers $a$, $b$, $c$, and $d$, the sum of the two products $ab + cd$ equals the real part of the single complex product $(a + \mathbf{i}b)(c - \mathbf{i}d)$, where $\mathbf{i}$ denotes $\sqrt{-1}$. Since the matrix-vector multiplication we need is also composed of sums of Hadamard multiplications between vectors, we can halve both the number of ciphertexts and the number of constant multiplications by pairing up the vectors and encoding each pair into one complex vector.

Rotate and Sum Algorithm
For the matrix-vector multiplication, additions between the data packed in one ciphertext are needed. We use the RotSum algorithm, which repeatedly computes rotations and additions to sum the desired slots of a ciphertext.
Precisely, the algorithm $\mathrm{RotSum}(\mathrm{ct}, s, t, d)$ outputs the ciphertext $\mathrm{ct}_{out}$ obtained by computing $\mathrm{ct}_{out} \leftarrow \mathrm{ct}$ and then $\mathrm{ct}_{out} \leftarrow \mathrm{ct}_{out} + \mathrm{Rot}_{rk}(\mathrm{ct}_{out}; s t^j)$ for $j = 0, 1, \cdots, d-1$ in order. As a result, if $\mathrm{ct}$ was an encryption of the message vector $m = (m_i)_{1 \le i \le n}$, then $\mathrm{ct}_{out}$ contains the message vector whose $i$-th entry is $\sum_{(a_0, \cdots, a_{d-1}) \in \{0,1\}^d} m_{[i + s(a_0 t^0 + \cdots + a_{d-1} t^{d-1})]_n}$; in particular, for $t = 2$, the $i$-th entry is $\sum_{k=0}^{2^d - 1} m_{[i + sk]_n}$. A plaintext sketch is given below.
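A plaintext analogue of RotSum, with np.roll standing in for the homomorphic rotation:

```python
import numpy as np

def rot_sum(v, s, t, d):
    """Plaintext analogue of RotSum(ct, s, t, d): repeatedly rotate by
    s * t**j and add, for j = 0, ..., d-1."""
    out = np.array(v, dtype=float)
    for j in range(d):
        out = out + np.roll(out, -(s * t ** j))
    return out

# With s = 1, t = 2, d = log2(n), every slot ends up holding the total sum.
v = np.arange(8.0)
assert np.allclose(rot_sum(v, 1, 2, 3), np.full(8, v.sum()))
```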

Putting It All Together
Combining all of the above methods, we explain the explicit method used in our implementation. For the rest of this section, we denote the concatenation of vectors by $\big\Vert_{i=s}^{t}\, a_i = (a_s \,\|\, a_{s+1} \,\|\, \cdots \,\|\, a_t)$ and the repeated concatenation of the same vector by $(a)^r = (a \,\|\, a \,\|\, \cdots \,\|\, a)$ ($r$ times). Also, we can embed the matrices $X$ and $W$ into larger matrices whose row and column lengths are powers of 2, so we assume that $s$ and $g$ are powers of 2. Recall that our goal is to compute $y_j = \sum_{i=0}^{g-1} x_i \odot w_j^{(i)}$ for $0 \le j < t$ while the $x_i$'s are encrypted, so that the outputs $y_j$ are also encrypted.
As explained before, we copy each $x_i$ $m$ times when encoding, where $m$ is a power of 2. Since each vector has size $s$ and the number of slots in a ciphertext is $n$, each ciphertext contains $\ell = \frac{2n}{ms}$ different vectors (in both real and imaginary parts). Thus we need a total of $\frac{g}{\ell} = \frac{msg}{2n}$ ciphertexts for encoding.
Precisely, we encode the $x_i$'s into ciphertexts as
$$\mathrm{ct}_i = \mathrm{Enc}\Big( \big\Vert_{q=0}^{\ell/2 - 1} \big( x_{i\ell + 2q} + \mathbf{i} \cdot x_{i\ell + 2q + 1} \big)^m \Big),$$
so that the message vector in $\mathrm{ct}_i$ sequentially contains $m$ copies of the vector $x_{i\ell + 2q} + \mathbf{i} \cdot x_{i\ell + 2q + 1}$ for each $q$. Also, we define $\tilde{w}_{ji}$ by
$$\tilde{w}_{ji} = \big\Vert_{q=0}^{\ell/2 - 1} \big\Vert_{p=0}^{m-1} \big( w_{jm+p}^{(i\ell + 2q)} - \mathbf{i} \cdot w_{jm+p}^{(i\ell + 2q + 1)} \big).$$
Next, we compute the multiplication step. For each $j$, we multiply $\mathrm{ct}_i$ and $\tilde{w}_{ji}$, add its conjugation, and divide by 2 to get $\mathrm{ct}'_{ji} = \big( \mathrm{ct}_i \cdot \tilde{w}_{ji} + \overline{\mathrm{ct}_i \cdot \tilde{w}_{ji}} \big)/2$. Then, by adding them as $\mathrm{ct}^{sum}_j = \sum_{i=0}^{g/\ell - 1} \mathrm{ct}'_{ji}$, we get the ciphertext $\mathrm{ct}^{sum}_j$, which encrypts the partial sums for the $j$-th group of outputs. Finally, we compute the desired sums using RotSum, so that the first $ms$ slots contain the desired sums, which are also copied to the remaining slots. Precisely, the $j$-th group of $m$ outputs $y_{jm}, \cdots, y_{(j+1)m-1}$ is obtained as
$$\mathrm{ct}^{out}_j = \mathrm{RotSum}\big( \mathrm{ct}^{sum}_j,\ ms,\ 2,\ \log\tfrac{n}{ms} \big).$$
Note that the number of rotations used in the algorithm is $\log\frac{\ell}{2} = \log\frac{n}{ms}$ for each $j$, so $\log\frac{n}{ms} \cdot \lceil\frac{T}{m}\rceil$ in total. In summary, if we duplicate the $x_j$'s $m$ times for encryption, the required number of rotations is reduced by a factor of about $m$, while the number of ciphertexts for encryption is increased by a factor of $m$ (recall that $\frac{msg}{2n}$ ciphertexts are used in total). Exactly, the numbers of rotations and multiplications are $\log\frac{n}{ms} \cdot \lceil\frac{T}{m}\rceil$ and $\frac{sg}{n} \cdot m\lceil\frac{T}{m}\rceil$, respectively. Since the number of multiplications does not change asymptotically with $m$, our method provides a time-cost trade-off between rotations and encryptions. In our implementation environment, rotations take a significant part of the total matrix-vector multiplication time, so we can reduce the total time cost by using more memory to encrypt the data. The trade-off is illustrated with a small calculation sketch below.
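As a quick illustration of the trade-off, the sketch below evaluates the operation-count formulas above; the parameter values are placeholders of plausible size, not the paper's exact configuration.

```python
from math import ceil, log2

def costs(n, s, g, T, m):
    """Rotation / CMult / ciphertext counts from the formulas above."""
    rots = int(log2(n // (m * s))) * ceil(T / m)
    cmults = (s * g // n) * m * ceil(T / m)
    num_cts = m * s * g // (2 * n)   # ciphertexts used to encode X
    return rots, cmults, num_cts

# Illustrative values (our own): n = 2**16 slots, s = 2**12 padded samples,
# g = 2**10 genes, T = 11 tumor types.
for m in (1, 2, 4, 8):
    print(m, costs(2**16, 2**12, 2**10, 11, m))
```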

Approximation of Softmax
While the softmax layer substantially enhances the performance of the classification, it cannot be directly computed by the CKKS scheme since it comprises several non-polynomial operations. Recall that the softmax of a real vector $v = (v_1, \cdots, v_t)$ is defined as follows:
$$\mathrm{Softmax}(v)_i = \frac{\exp(v_i)}{\sum_{j=1}^{t} \exp(v_j)}.$$
Both the exponential function and division are not polynomial, so we should replace them by polynomial approximations. We introduce a proper polynomial approximation technique for each of the exponential and division functions. In general, to approximate a non-polynomial operation by a polynomial, the minimax method, which minimizes the maximum error over an interval, or the Chebyshev approximation, which expresses the function as a series of Chebyshev polynomials, is mainly used. However, the minimax approximation loses the increasing property of the exponential and division functions, because the sign of the error is not constant. Hence, this method is not suitable in terms of calculating the microAUC score. Instead, we suggest an approximation method for the softmax layer that is more suitable for the microAUC score.
In particular, our approximation method consists of a small number of squaring operations, so it has the advantage of being efficiently evaluated over homomorphically encrypted data.

Approximation of Exponential Function
Our approximation method for the exponential function comes from the elementary definition of $\exp(x)$. Precisely, for some $r$, we approximately compute $\exp(x)$ as
$$\exp(x) \approx \left(1 + \frac{x}{2^r}\right)^{2^r}.$$
Note that this formula can be computed by squaring $r$ times, which mitigates the computational overhead accompanying the HE computation. Also, our approximation of $\exp(x)$ is monotone on $[-2^r, \infty)$, so it is more appropriate for approximating the softmax function in terms of getting a high microAUC score, compared to other polynomial approximation techniques such as the minimax and Chebyshev approximations. Furthermore, to make the evaluation more stable, we instead consider an approximation of a scaled exponential function, $\mathrm{AExp}_{r,L}(x)$, as follows:
$$\mathrm{AExp}_{r,L}(x) = \left(\frac{2^r + x}{L}\right)^{2^r} \approx \frac{\exp(x)}{(L/2^r)^{2^r}}.$$
If we carefully choose $L$ large enough that $\left|\frac{2^r + x}{L}\right| < 1$ for all possible choices of $x$, then the values do not grow rapidly during the evaluation. This makes the computation more stable for HE implementations, which usually support fixed precision bits.
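A plain-Python sketch of this evaluation (the HE version would perform the $r$ squarings on ciphertexts):

```python
import math

def aexp(x, r, L):
    """AExp_{r,L}(x) = ((2**r + x) / L)**(2**r), evaluated by r squarings."""
    y = (2 ** r + x) / L
    for _ in range(r):
        y = y * y                  # one ciphertext squaring per level in HE
    return y

# Up to the constant L_r = (L / 2**r)**(2**r), aexp approximates exp(x):
r, L = 4, 64
L_r = (L / 2 ** r) ** (2 ** r)
print(aexp(1.0, r, L) * L_r)       # ~2.638, the (1 + x/2**r)**(2**r) estimate
print(math.exp(1.0))               # 2.718...; the gap shrinks as r grows
```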

Goldschmidt's Algorithm
Goldschmidt's division algorithm [37] is one of the most popular approximation algorithms for computing the inverse of a real number. For $x \in (0, 2)$, Goldschmidt's algorithm uses the following property:
$$\frac{1}{x} = \frac{1}{1 - (1 - x)} = \prod_{i=0}^{\infty} \left(1 + (1-x)^{2^i}\right) \approx \prod_{i=0}^{d-1} \left(1 + (1-x)^{2^i}\right).$$
Note that the approximation converges rapidly as $d$ grows, and it uses $2d - 2$ multiplications to evaluate the approximate polynomial of degree $2^d - 1$. The small number of multiplications gives a great advantage in implementations with HE, because HE multiplication between ciphertexts is hugely time-consuming. The detailed algorithm is described in Algorithm 3.
Here, note that Algorithm 3 quickly converges to $1/x$, since the ratio between the error and the true value is $(1-x)^{2^d}$.
[Algorithm 3: Goldschmidt Method [37]. Input: $x \in (0, 2)$ and the number of iterations $d \in \mathbb{N}$; output: an approximate value of $1/x$.]
Moreover, we can easily exploit Goldschmidt's algorithm $\mathrm{Gol}(\cdot)$ to approximately evaluate $1/x$ on $(0, 2M)$, instead of $(0, 2)$, by using the following equation:
$$\frac{1}{x} = \frac{1}{M} \cdot \frac{1}{x/M} \approx \frac{1}{M} \cdot \mathrm{Gol}\!\left(\frac{x}{M}\right).$$
We note that the scaling factor $M$ should be carefully chosen, since the error of Goldschmidt's algorithm becomes non-negligibly large as the input approaches 0. Therefore, if we select an overly large $M$, the scaled inputs become small, resulting in an error that cannot be ignored. A plain-Python sketch is given below.
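A plain-Python sketch of Goldschmidt's method and its rescaled variant, following the equations above:

```python
def goldschmidt(x, d):
    """Approximate 1/x for x in (0, 2) with d iterations (Algorithm 3 sketch):
    1/x = prod_{i >= 0} (1 + (1 - x)**(2**i))."""
    a = 1.0 - x            # |a| < 1 on (0, 2)
    out = 1.0 + a
    for _ in range(d - 1):
        a = a * a          # a = (1 - x)**(2**i)
        out = out * (1.0 + a)
    return out             # 2d - 2 multiplications, degree 2**d - 1

def inverse(x, d, M):
    """Approximate 1/x on (0, 2M) via 1/x = (1/M) * Gol(x/M)."""
    return goldschmidt(x / M, d) / M

print(goldschmidt(0.5, 5), 1 / 0.5)      # ~2.0
print(inverse(40.0, 10, 64), 1 / 40.0)   # ~0.025
```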
Putting the approximate exponential function and Goldschmidt's algorithm together, we can now approximately compute the softmax layer using HE operations. Denoting $L_r = (L/2^r)^{2^r}$, we utilize the following equation:
$$\mathrm{Softmax}(v)_i = \frac{\exp(v_i)/L_r}{\sum_{j} \exp(v_j)/L_r} \approx \mathrm{AExp}_{r,L}(v_i) \cdot \frac{1}{M}\,\mathrm{Gol}\!\left(\frac{1}{M}\sum_{j} \mathrm{AExp}_{r,L}(v_j)\right).$$
The detailed algorithm is described in Algorithm 4.
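Putting the two sketches together (reusing aexp and inverse from above) gives a plaintext stand-in for Algorithm 4. The parameter values follow the Results section; the sample inputs are our own, chosen so that the Goldschmidt input stays inside its convergence range. Note that the approximate scores differ from the exact softmax values for inputs of this size, but their ordering is preserved, which is what microAUC measures.

```python
import numpy as np

def approx_softmax(v, r=4, L=64, d=30, M=64):
    """Sketch of Algorithm 4: scaled exponentials + Goldschmidt division."""
    e = np.array([aexp(x, r, L) for x in v])  # ~ exp(v_i) / L_r
    s = e.sum()                               # common factor L_r cancels
    return e * inverse(s, d, M)               # e_i / s via Goldschmidt

v = np.array([30.0, 35.0, 40.0])
print(approx_softmax(v))                      # ~[0.034, 0.177, 0.789]
print(np.exp(v - v.max()) / np.exp(v - v.max()).sum())  # exact: ~[0.000, 0.007, 0.993]
```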

Results
Parameter Selection for HE
We used the CKKS parameters $N = 2^{17}$ and $q = 2^{2670}$ with a (uniformly random) signed binary secret, which satisfy more than 128-bit security according to Albrecht's LWE estimator [38]. For the scaling factor of CKKS, we choose $\Delta = 2^{60}$ for ciphertexts and $\Delta_{const} = 2^{40}$ to encode plain vectors for constant multiplication.

Plain Model Construction
In our experiment, we use the TCGA data imported in the same way and format as used in the 2020 iDASH competition. In the analysis, we use the Challenge dataset for training and the Evaluation dataset for the inference step, respectively. In order to use each filtering effectively, we experiment with various combinations in the plain model. To be suitable for implementation with HE, we first fix the gene bound at $2^B$ for $B = 9, 10$, or $11$, then vary the parameters $d_{cn}$ and $k_{var}$ until the filtered number of genes is close to the bound. Here, in order to compare the importance of the CN data and the Variants data, experiments are carried out both for the case of fixing one of the parameters $d_{cn}$ or $k_{var}$ to 0, and for the case of adjusting both to meet the chosen gene bound. Then, using each chosen pair of parameters, we build a plain model on the training dataset and check the microAUC value on the test dataset. Our experimental results are summarized in Table 2.
According to Table 2, we obtain a much better microAUC by combining the two filterings than by using only one of CN data filtering and Variants data filtering. Moreover, when more genes are extracted by Variants data filtering than by CN data filtering, the models tend to achieve better accuracy. Since we use a 1-layer neural network, increasing the number of genes does not necessarily increase the accuracy.

Encrypted Inference Results
Using the genes filtered in the preprocessing step and the model built in the training step, we compute the inference step over encrypted data as stated in the previous sections. To compute the approximate softmax, we choose the parameters $(r, L) = (4, 64)$ for $B = 9, 10$ and $(4, 80)$ for $B = 11$. The parameters $d$ and $M$ for the Goldschmidt algorithm are chosen as 30 and 64, respectively. As a result, we measure the time cost for each round-trip step, including encryption, constant multiplication, rotate-and-sum, approximate softmax evaluation, and decryption. The results are stated in Table 3. In the table, the column Encoding Dup. denotes the parameter $m$ that we used as the number of duplications in encryption. As $m$ increases, the number of required ciphertexts increases linearly, so the time cost for encryption also increases similarly, while the number of rotations decreases on an $O(\frac{1}{m} \log \frac{1}{m})$ scale. With such trade-offs, denoting the bound on the number of genes by $2^B$, we observe that the time cost is minimized when $m = 4$ for $B = 9$, and when $m = 2$ for $B = 10, 11$, respectively. We illustrate the scaling of the time cost with respect to $m$ in Figure 2 for $B = 10$. Moreover, the microAUC-enc, real, and linear columns in Table 3 give the microAUC values obtained with the approximate softmax algorithm (in the encrypted state), with the exact softmax algorithm (in plain), and without a softmax (in plain), respectively, using the same model for each $B$ built in the previous step. Since our model was trained with the softmax activation function, omitting the softmax computation in the inference step greatly reduces the microAUC value. However, using our softmax approximation method, the microAUC value hardly decreases compared to the exact softmax.

Discussion
The main purpose of our data preprocessing is to remove irrelevant genes from the data. The removed genes either have a similar effect to the remaining genes or have a weak effect on the cancer type classification. CN data filtering extracts genes using the Hamming distance between the samples' copy numbers, based on the fact that adjacent genes have copy number similarity. This filtering technique can be applied to other kinds of genetic datasets with an appropriate threshold $d_{cn}$ for each dataset. Variants data filtering can also be applied to other datasets that have a lot of empty information. Since various genetic datasets are 'empty-sparse', i.e., not sparse with lots of 0s but with missing information, our filtering technique can be used to efficiently extract the relevant features.
Our softmax approximation algorithm works well for any type of data. However, the size of the input data must be adjusted through normalization, and to compute the approximation of the exponential function and the Goldschmidt algorithm, the range of the maximum and minimum values of the matrix product $XW$ must be predicted from the training data. In addition, since our algorithm consumes a large multiplicative depth of HE when approximating the softmax, it may not be suitable for computing, within a limited time, a deep neural network model having multiple softmax activation functions.

Conclusions
In this paper, we adapted the previous data preprocessing methods to reduce the size of the dataset from 25,128 genes to $2^B$ where $B = 9, 10$, or $11$. Also, our approximation method for the softmax function enables us to apply the softmax activation function when learning a shallow neural network model. As a result, we obtain inference results with microAUC values of about 0.985 for classifying multi-label tumor data within 5 minutes. If the time given for the inference step were not limited, our method would be sufficiently scalable to a general deep neural network with softmax activations.
Ethics approval and consent to participate: Not applicable.
Figure 1 Workflow of the data preprocessing. (a) CN data filtering. The CN data matrix $C$ consists of $s$ = 2,713 samples and $g$ = 25,128 genes, where each entry represents the copy number of the corresponding sample and gene in 5 levels: $0, \pm 1, \pm 2$. The CN data filtering is based on the copy number similarity of adjacent genes. (b) Variants data filtering. The Variants data matrix $V_t$ for each tumor type $t = 1, 2, \cdots, 11$ consists of mutation-effect data. The data is encoded to the real values 0, 0.2, 0.5, 0.9, and 1.0, where 0 is for entries with no information. The Variants data filtering removes irrelevant genes with ineffective mutations. (c) Input data to the shallow neural network. The filtered CN and Variants data are concatenated and input to our neural network.
Figure 2 Time graph for $B = 10$. As the encoding duplication number increases, the time cost for encryption increases linearly, the cost for RotSum decreases approximately on a log scale, and the other parts are stationary. As a result, the total time cost is minimal when the duplication number is 2.
Table 1 Sample and mutation statistics of the dataset on 11 cancer types. Note that the total number of genes is 25,128, and the numbers in the Mutation's Effect column are the numbers of non-zero values in the Variants dataset (which is less than # of samples × # of genes).