LSH-GAN: in-silico generation of cells for small sample high dimensional scRNA-seq data

A fundamental problem of downstream analysis of scRNA-seq data is the unavailability of enough cell samples compared to the feature size. This is mostly due to the budgetary constraints of single cell experiments or simply because of the small number of available patient samples. Here, we present an improved version of the generative adversarial network (GAN), called LSH-GAN, to address this issue by producing new realistic cell samples. We update the training procedure of the generator of the GAN using locality sensitive hashing, which speeds up sample generation and thus keeps the standard procedures of downstream analysis feasible. LSH-GAN outperforms the benchmarks in generating realistic, high-quality cell samples. Experimental results show that the generated samples of LSH-GAN improve the performance of downstream analyses such as feature (gene) selection and cell clustering. The corresponding software is available at: https://github.com/Snehalikalall/LSH-GAN


Introduction
Recently, the emergence of high dimensional biological data such as single cell RNA sequencing (scRNA-seq) data has posed a significant challenge to machine learning researchers 1,2 . High dimension, small sample size (HDSS) data are difficult to handle in downstream analysis, particularly for feature selection (FS). This affects later stages of downstream analysis such as cell clustering, marker selection, and annotation of cell clusters. A few outliers can drastically affect FS techniques, and the selected feature sets may not be adequate to discriminate the classes 3 . Moreover, high dimensionality increases the computational time beyond acceptability.
High dimensional small sample (HDSS) data are prevalent in the single cell domain due to the budgetary constraints and ethical considerations of single cell experiments, or simply because of the small number of available patient samples. Whatever the reason, too few observations (cell samples) in single cell data can create problems in the downstream analysis, because a small sample may not reflect the whole population accurately, which degrades the performance of any model. The general pipeline of scRNA-seq downstream analysis starts with pre-processing (normalization, quality control) of the raw count matrix and then goes through several steps which include identification of relevant genes, clustering of cells, and annotation of cell clusters with marker genes [4][5][6][7][8] . Each step has a profound effect on the next stage of analysis. The gene selection step identifies the most relevant features/genes from the normalized/preprocessed data and has an immense impact on cell clustering. The usual procedures for selecting relevant genes, which are primarily based on high variation (highly variable genes) 9,10 or significantly high expression (highly expressed genes) 4 , suffer from the small sample effect. General FS techniques also fail to provide a stable and predictive feature set in this case because of the ultra-large feature (gene) size. One way to address this issue is to use a robust and stable technique that does not overfit the data. A few recent attempts [11][12][13] embed statistical and information-theoretic approaches. Although these methods yield stable features, they do not perform well on small sample scRNA-seq data.
Recently, computational researchers have been gaining interest in this field. Methods such as cscGAN 14 , Splatter 15 , and SUGAR 16 use different techniques (e.g., generative models, statistical frameworks) to successfully simulate samples of specific cell types or subpopulations. The challenge in this task is to handle the sparsity and heterogeneity of the cell populations, which define the specific characteristics of scRNA-seq data. In this paper, we propose a generative model to address this problem in HDSS scRNA-seq data. We use a generative adversarial model to generate realistic cell samples from the small number of available samples of HDSS scRNA-seq data. The generative adversarial network (GAN) 17 has already been shown to be a powerful technique for learning and generating complex distributions 18,19 . However, the training procedure of a GAN is difficult and unstable, because the generator and the discriminator are trained simultaneously in a game that requires a Nash equilibrium. Gradient descent sometimes reaches this equilibrium and sometimes does not, which makes training costly and time consuming. Our main contribution is a modification of the generator input that results in a faster training procedure. We create a subsample of the original data based on the locality sensitive hashing (LSH) technique and augment it with the noise distribution, which is given as input to the generator. Thus, the generator does not take pure noise as input; instead, we introduce a bias by augmenting a subsample of the data with the noise distribution.
Researchers continue to develop improved versions of the generative adversarial network (GAN) for use in different domains. Most variations, such as progressive GAN (PGAN) 20 and Wasserstein GAN (WGAN) 18 , try to train the model faster than the conventional GAN. Unlike PGAN and WGAN, conditional GAN (CGAN) 21 conditions the conventional model on additional data sources (e.g., class labels or data from different modalities) to direct the data generation. In our model, we focus on generating additional samples from HDSS data. However, the number of samples to be generated grows with the number of features, and generating them may not be feasible for conventional generative models. Augmenting a subsample of the real data distribution (p_data(x)) with the prior noise (p_z(z)) makes the training procedure of our model faster than that of the conventional GAN. We theoretically prove that the global minimum value of the virtual training criterion of the generator is less than that of the traditional GAN (< −log 4).

Summary of contributions:
Here, we provide the following novelties:
- The proposed model is the first to address the problem of downstream analysis (particularly gene selection and clustering) on HDSS scRNA-seq data.
- LSH-GAN is able to generate realistic samples faster than the traditional GAN. This makes LSH-GAN more feasible to use for the feature (gene) selection problem in scRNA-seq data.
-LSH-GAN can produce more realistic cell samples than the other existing benchmarks.
- We derive a new way of training the generator, which takes as input a subsample of the original data combined with pure noise.
- For a fixed number of iterations, LSH-GAN performs better than the traditional GAN in generating realistic samples.
- Gene selection and clustering on the generated samples of LSH-GAN provide excellent results for small-sample, large-feature single cell data.

Results
In the following, we first describe the workflow of our analysis pipeline. Next, experimental validation of the proposed model is carried out by comparing it with several state-of-the-art methods on real-life scRNA-seq data. Finally, we use LSH-GAN to select genes from HDSS scRNA-seq data: we apply benchmark gene selection techniques to the generated samples and use a single cell clustering technique to validate the selected genes.

Proposed model: LSH-GAN
Figure 1 describes the workflow of our analysis pipeline. Panel A of Figure 1 describes the application of the proposed LSH-GAN model to the feature selection task in HDSS scRNA-seq data, while panel B depicts the basic building blocks of the model. The following subsections describe these briefly.
LSH step: sampling of input data
Locality Sensitive Hashing (LSH) 12,22,23 is widely used in nearest neighbor searching to reduce the dimensionality of data. LSH utilizes locality-sensitive hash functions which hash similar objects into the same bucket with high probability. The number of buckets is much smaller than the universe of possible items, which reduces the search space for query objects (see supplement for a detailed description of the LSH technique).
In this work, unique hash codes which depict the local regions or neighborhoods of each data point are first produced. For this, we utilized the Python sklearn implementation of the LSHForest module with default parameters.
An approximate neighborhood graph (k-nn graph) is constructed using k = 5 for each data point. This step computes the Euclidean distances between the query point and its candidate neighbors. Sampling is carried out in a greedy fashion: each data point is traversed sequentially and its five nearest neighbors are flagged so that they are never visited again. After one traversal, a subset of samples is obtained, which can be further down-sampled by repeating the same step iteratively.

The training procedure of LSH-GAN can be summarized as follows:
- x_s = LSH-SAMPLING(x, k, t)
- Augment p_{x_s}(x_s) with the prior noise p_z(z) and give the resulting distribution p_ẑ(ẑ) to the generator G.
- Give the real data p_data(x) and the generated data p_g(x) to the discriminator D.
- Update the discriminator D (the adaptive momentum gradient descent rule is used in our experiments).
- Update the generator G, as in standard GAN training.

procedure LSH-SAMPLING(x, k, t):
- Execute Locality Sensitive Hashing (LSH) on x and prepare a k-nearest-neighbour list for each data point.
- For each of the t sub-sampling iterations, visit each data point sequentially in the order in which it appears in the data; if the data point has not been visited earlier, select it and discard all its k neighbors from the nearest-neighbour lists.

We use the adaptive learning rate optimization algorithm implemented in the ADAM optimizer in Python TensorFlow version 1.9.2. The generator (G) and discriminator (D) use 2-layer multilayer perceptrons with hidden layer dimensions (16, 16). For the traditional GAN, we retain the same settings as LSH-GAN for the G and D networks.
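A minimal sketch of the greedy LSH sub-sampling step is given below. It is illustrative only: since LSHForest has been removed from recent scikit-learn releases, the sketch approximates the LSH neighborhood search with an exact sklearn.neighbors.NearestNeighbors query, and the function name lsh_sampling and its parameters are our own naming, mirroring the description above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lsh_sampling(x, k=5, t=1):
    """Greedy sub-sampling of the rows of x based on k-nearest-neighbour lists.

    Approximation of the LSH step described in the text: in the original
    pipeline the neighbour lists come from an LSH forest; here an exact
    k-NN search from scikit-learn is used as a stand-in.
    """
    x = np.asarray(x)
    selected = np.arange(x.shape[0])              # start from all points
    for _ in range(t):                            # repeat sub-sampling t times
        sub = x[selected]
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(sub))).fit(sub)
        _, neigh = nn.kneighbors(sub)             # neigh[i, 0] is the point itself
        visited = np.zeros(len(sub), dtype=bool)
        keep = []
        for i in range(len(sub)):                 # traverse points in data order
            if visited[i]:
                continue
            keep.append(i)                        # select the point ...
            visited[neigh[i]] = True              # ... and flag its k neighbours
        selected = selected[np.array(keep)]
    return x[selected], selected

# Usage (hypothetical variable names): x_s, idx = lsh_sampling(norm_counts, k=5, t=1)
```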
For benchmarking our method, we utilized three state-of-the-art techniques widely used for sample generation: cscGAN 14 , SUGAR 16 , and Splatter 15 . For these three methods, we adopted the code (with default parameters) provided on the GitHub pages of the original publications.
Five well-known gene selection methods (with default parameters) for scRNA-seq data are adopted for validation: GLM-PCA 24 , CV² index, M3Drop 25 , Fano factor 26 , and Highly Variable Gene (HVG) selection of Seurat V4 27 . We select the top 500 features (genes) using each of these feature selection methods on the scRNA-seq datasets. For validation purposes, the Wasserstein metric 18 is utilized to estimate the quality of the generated data. Clustering of scRNA-seq data is performed using the SC3 28 technique with default parameters. Clustering performance is evaluated using the Adjusted Rand Index (ARI).
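For concreteness, the sketch below shows one simple way to compute the two evaluation quantities used throughout the paper; it is our own illustration, not the exact evaluation code. The multivariate Wasserstein metric is approximated here by averaging the one-dimensional scipy.stats.wasserstein_distance over genes, and the ARI comes directly from scikit-learn.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.metrics import adjusted_rand_score

def mean_gene_wasserstein(real, generated):
    """Average 1-D Wasserstein distance over genes (columns).

    A simple proxy for the distance between the real and generated
    expression distributions; the paper's exact estimator may differ.
    """
    real, generated = np.asarray(real), np.asarray(generated)
    return np.mean([wasserstein_distance(real[:, j], generated[:, j])
                    for j in range(real.shape[1])])

# Clustering quality of predicted labels against known cell types:
# ari = adjusted_rand_score(true_labels, predicted_labels)
```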

LSH-GAN improves performance of traditional GAN on simulated data
First, we train LSH-GAN on HDSS synthetic data and generate realistic samples to compare against the traditional GAN model. For this, we create 2-class non-overlapping Gaussian mixture data consisting of 100 samples and 1000 features, taking the mean (µ) of the data in the range 5 to 15 for class 1 and −15 to −5 for class 2. The covariance matrix (Σ) is the same for all samples and is given by Σ_ij = ρ^{|i−j|}, where i, j are the row and column indices and ρ = 0.5. The Wasserstein distance between the real data distribution (p_data) and the generated data distribution (p_g) is computed to estimate the quality of the generated data. We use different settings of the k-th nearest neighbor (k = 5, 10, 15, 20) to generate sub-samples of the data with the LSH sampling procedure. In each case, the sampled data (p_{x_s}) is augmented with the prior noise (p_z) and given to the generator of LSH-GAN for model training.
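A synthetic benchmark of this kind can be generated along the lines of the sketch below (our own code). It assumes the 100 samples are split evenly between the two classes and that the class means are drawn uniformly from the stated ranges; these details are illustrative assumptions.

```python
import numpy as np

def make_gaussian_mixture(n_per_class=50, n_features=1000, rho=0.5, seed=0):
    """Two-class non-overlapping Gaussian data with Toeplitz covariance."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_features)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])    # Sigma_ij = rho^|i-j|
    mu1 = rng.uniform(5, 15, n_features)                 # class-1 means in [5, 15]
    mu2 = rng.uniform(-15, -5, n_features)               # class-2 means in [-15, -5]
    x1 = rng.multivariate_normal(mu1, cov, n_per_class)
    x2 = rng.multivariate_normal(mu2, cov, n_per_class)
    x = np.vstack([x1, x2])
    y = np.repeat([0, 1], n_per_class)
    return x, y
```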
For comparison with the traditional GAN model, we use the data with a train:test split of 80:20 and calculate the Wasserstein metric between the test sample and the generated sample. Table 1 shows the values of the metric for LSH-GAN and the traditional GAN model for different numbers of epochs and nearest neighbors k. A closer look at Table 1 reveals that the performance of LSH-GAN (at 10000 epochs and k = 5) is far better than that of the traditional GAN model with 25000 epochs. Notably, for a smaller amount of sampling (larger k), LSH-GAN needs more iterations for training. As a particular example, the performance of LSH-GAN achieved at 20000 epochs with k = 20 is matched only at 10000 epochs for k = 10. Thus, it is evident from the results that reducing the amount of sampling requires more epochs, and hence more training time, for the LSH-GAN model to converge. Figure 2 also supports this statement. Here, the two models (LSH-GAN and traditional GAN) are trained to simulate two-dimensional synthetic data of known distribution, for which LSH-GAN is able to generate samples that are more realistic than those of the traditional GAN in a smaller number of iterations.

Comparison of LSH-GAN with benchmarks in HDSS scRNA-seq data
We compare LSH-GAN with four existing benchmarks: cscGAN 14 , Splatter 15 , SUGAR 16 , and the traditional GAN 17 . Since the evaluation of generative models is notoriously difficult 29,30 , we first use the Wasserstein distance to compare the real data distribution with the generated data distributions coming from the different competing models. We also use UMAP visualization and marker gene expression to inspect the generated cell samples. Figure 3, panels A-C, shows the two-dimensional UMAP representation of generated and real cell samples from the test data for the competing models; the Melanoma data is used for this experiment. As can be seen from the figure, LSH-GAN is able to retain the distribution of the original cell samples. This is also supported by the Wasserstein distance (see Figure 3, panel F) measured between the real and generated data distributions. To examine how the expression of marker genes is retained in the generated data, we plot the expression of the marker genes CD8A (marker for CD8 T cells) and MS4A1 (marker for B cells) in the two-dimensional UMAP space for both the real and the generated samples of LSH-GAN (see Figure 3, panels D-E). The figure reveals that CD8A and MS4A1 show similar expression patterns (high expression) in both real and generated cell samples. We also validate the generated samples by training a classifier (random forest) to see whether it can accurately distinguish samples coming from the two distributions (real and generated). Table 2 shows the cross-validation AUC score of the random forest classifier for five scRNA-seq datasets. The table reveals that for LSH-GAN the AUC scores hardly reach 0.6 (only for the Melanoma data), indicating near chance-level performance of the RF model. This suggests that the data generated by LSH-GAN is highly similar to the real data.
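The real-versus-generated check can be implemented roughly as below (our own sketch, with hypothetical argument names): a random forest is trained to separate real from generated cells, and a cross-validated AUC near 0.5 indicates that the two distributions are hard to tell apart.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def real_vs_generated_auc(real, generated, n_splits=5, seed=0):
    """Cross-validated AUC of a classifier separating real from generated cells."""
    x = np.vstack([real, generated])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(generated))])
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    scores = cross_val_score(clf, x, y, cv=n_splits, scoring="roc_auc")
    return scores.mean()
```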

Gene selection in HDSS scRNA-seq data
Here, we aim to address the problem of gene selection in HDSS scRNA-seq data using the generated samples. We augment the generated samples with the original data to make the sample-to-feature ratio 1.5, and the augmented data is used for gene selection. We employ five feature selection methods widely used for gene selection in scRNA-seq data (Highly Variable Gene (HVG) selection of Seurat V3/V4, M3Drop, GLM-PCA, Fano factor, and CV²-index) and one single cell clustering method (SC3) to validate the genes selected from the augmented data.
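A minimal sketch of this augmentation step is shown below; it is our own illustration, assuming the generated cells are on the same normalized scale as the real data, and it uses a plain variance ranking as a stand-in for the dedicated gene selection methods (Seurat HVG, M3Drop, GLM-PCA, Fano factor, CV²), which are run through their own packages in practice.

```python
import numpy as np

def augment_to_ratio(real, generated, target_ratio=1.5):
    """Stack generated cells onto the real ones until
    (#cells / #genes) reaches target_ratio."""
    n_cells, n_genes = real.shape
    n_needed = max(int(target_ratio * n_genes) - n_cells, 0)
    extra = generated[:n_needed]          # assumes enough generated cells exist
    return np.vstack([real, extra])

def top_variable_genes(expr, n_top=500):
    """Toy highly-variable-gene ranking: indices of the n_top genes
    with the largest variance across cells."""
    return np.argsort(expr.var(axis=0))[::-1][:n_top]

# Usage (hypothetical names):
# augmented = augment_to_ratio(real_norm, lshgan_samples, target_ratio=1.5)
# genes = top_variable_genes(augmented, n_top=500)
```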
LSH-GAN is compared with five other state-of-the-art methods on four HDSS scRNA-seq datasets (including the Klein dataset). We exclude the Melanoma data from this analysis as it already has a large sample size compared to the feature size (sample:feature ratio 3.46). The aim is to determine whether the features/genes selected from the combined generated data lead to a pure clustering of cells. Table 3 shows the comparison of the Adjusted Rand Index (ARI) values resulting from the cell clustering. It is evident from the table that the features/genes selected from the combined generated data of the LSH-GAN model produce better clustering results than those of the other competing models. The last column of Table 3 shows the ARI scores of the clustering results with the original feature (gene) set.

Selected genes using LSH-GAN can effectively predict cell clusters
Here we provide the detailed clustering results for the four datasets used, based on the genes selected from the generated samples. For this, we adopted the widely used single cell clustering method SC3 28 .

Discussions
In this paper, we present a novel and faster way of generating cell samples for HDSS single cell RNA-seq data using a generative model called LSH-GAN. We update the training procedure of the generative adversarial network (GAN) using locality sensitive hashing, which allows realistic samples to be produced in fewer iterations than the traditional GAN model requires. We utilized the generated data in the standard downstream analysis of real-life scRNA-seq data. In particular, we demonstrated that recent benchmark approaches to gene selection and cell clustering produce excellent results on the cell samples generated by LSH-GAN. Our preliminary simulation experiment also suggests that, for a fixed number of training iterations, the proposed model generates more realistic samples than the traditional GAN model. This observation is also established theoretically by proving that the cost of the value function is less than −log 4, which is the cost for the traditional GAN at the global minimum of the virtual training criterion (p_g = p_data). We demonstrated that the generated samples of LSH-GAN are useful for gene selection and cell clustering in HDSS scRNA-seq data; in particular, the excellent results of LSH-GAN compared to recent benchmark methods support its usability for generating realistic cell samples. For validation of the generated cell samples, we use the conventional steps of downstream analysis for scRNA-seq data: five widely used gene selection techniques and one single cell clustering technique for gene selection and grouping of cells. The precise clustering of cells demonstrates the quality of the cell samples generated with the LSH-GAN model.
One limitation of our method is that, for feature selection, we found hardly any linear relationship between the clustering results and the sample size of the generated scRNA-seq data. The appropriate sample size should be selected by exploring a range of values between 0.25p and 1.5p, where p is the feature size. Different parameters related to single cell clustering (the SC3 method) and feature selection (e.g., the FS method used, the number of selected features) may also play a critical role in the clustering performance. However, we found that clustering results are always better for generated data with a sample size larger than 1p (p is the feature size). This observation suggests that, for feature selection in HDSS data, generating more samples than features will lead to better clustering. The feasibility of generating such samples is ensured by the faster training procedure of the LSH-GAN model.
It may be interesting to speculate how useful LSH-GAN could be for generating data in other biological domains, particularly data generated from spatial transcriptomics. Data obtained from this technology captures the spatial arrangement of cell types within a tissue and is thus extremely useful for understanding normal development and disease pathology. In-silico generation of such data may be of great interest to machine learning researchers, as the model would need to capture the location-wise heterogeneity of the real samples.
Taken together, the proposed model can generate good quality cell samples from HDSS scRNA-seq data in fewer iterations than the traditional GAN model. Results show that LSH-GAN not only leads the benchmarks in cell sample generation for scRNA-seq data but also facilitates gene selection and cell clustering in the downstream analysis. We believe that LSH-GAN may be an important tool for computational biologists to explore realistic cell samples of HDSS scRNA-seq data and their application in downstream analysis.

Overview of Datasets
We have used five public benchmark scRNA-seq datasets, including Melanoma 31 . Table 4 shows a detailed summary of the datasets used (see supplement for descriptions). The sample:feature ratio for all the datasets except Melanoma is less than 0.012. For Melanoma the ratio is quite large (3.41). We retain this dataset to assess the efficacy of our model on both small and large sample data.

Data Preprocessing
The raw count matrix M ∈ R^{c×g}, where c and g represent the number of cells and genes, respectively, is normalized using the Linnorm 36 Bioconductor package in R. We select cells having more than a thousand expressed genes (non-zero values) and choose genes having a read count greater than 5 in at least 10% of the cells. A log_2 transformation is performed on the transformed matrix after adding one as a pseudo-count.
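The filtering and log-transform can be expressed roughly as follows (our sketch; the Linnorm normalization itself is performed with the R/Bioconductor package and is not reproduced here).

```python
import numpy as np

def filter_and_log(counts):
    """counts: cells x genes raw count matrix (NumPy array).

    Keeps cells with > 1000 expressed genes and genes with read count > 5
    in at least 10% of cells, then applies log2(x + 1).
    """
    counts = np.asarray(counts)
    cell_mask = (counts > 0).sum(axis=1) > 1000        # expressed genes per cell
    counts = counts[cell_mask]
    gene_mask = (counts > 5).mean(axis=0) >= 0.10      # fraction of cells with count > 5
    counts = counts[:, gene_mask]
    return np.log2(counts + 1)
```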

Formal details of LSH-GAN
In this section, we first provide a short description of Generative Adversarial Network (GAN) and then explain the theoretical foundation of LSH-GAN model.
Generative adversarial network
The generative adversarial network (GAN), introduced in 17 , is a framework for training a generative model. A GAN consists of two blocks: a generative model (G) that learns the data distribution (p(x)), and a discriminative model (D) that estimates the probability that a sample came from the training data (X) rather than from the generator (G). These two models can be non-linear mapping functions, such as two neural networks. To learn the generator distribution p_g over data x, the generator G builds a differentiable mapping function that maps a prior noise distribution p_z(z) to the data space as G(z; θ_g). The discriminator function D(x; θ_d) returns a single scalar that represents the probability that x comes from the real data rather than from the generator distribution p_g. The goal of the generator is to fool the discriminator, which tries to distinguish between true and generated data. Training of D ensures that the discriminator can properly distinguish samples coming from the training data and from the generator.
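For reference, the well-known objective of 17 that LSH-GAN modifies is the two-player minimax game

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],

whose virtual training criterion attains its global minimum of -\log 4 exactly when p_g = p_{data}.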
Locality sensitive hashing generative adversarial network
For LSH-GAN, a sub-sample of the real data, p_{x_s}(x_s), is augmented with the prior noise distribution p_z(z). Due to this additional information given to the generator, we assume that the probability D(G(ẑ)) will increase by a factor ζ. The value function of LSH-GAN can be written as:

\min_G \max_D L(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{\hat{z} \sim p_{\hat{z}}(\hat{z})}[\log(1 - D(G(\hat{z})))]   (3)
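A minimal NumPy sketch of how the generator input ẑ might be assembled is shown below. It is our own illustration under an explicit assumption: we read "augmenting the sub-sample with prior noise" as concatenating Gaussian noise vectors with randomly chosen cells from the LSH sub-sample, so the generator sees a noise-augmented view of real data rather than pure noise; the actual construction in the implementation may differ, and the network itself follows the 2-layer MLP settings described earlier.

```python
import numpy as np

def generator_input(x_sub, n_samples, noise_dim, seed=0):
    """Assemble z_hat for the LSH-GAN generator (illustrative construction)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(n_samples, noise_dim))         # draws from p_z(z)
    rows = rng.integers(0, x_sub.shape[0], size=n_samples)  # cells from the LSH sub-sample
    return np.hstack([noise, x_sub[rows]])                  # z_hat ~ p_z_hat(z_hat)

# Usage (hypothetical names): z_hat = generator_input(x_s, n_samples=256, noise_dim=100)
# fake = G(z_hat); D then scores real cells against the generated ones during training.
```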