Edge sparse PCA based on gene network for correcting cell type heterogeneity in epigenome-wide association studies


 Background: In epigenome-wide association studies (EWAS), the mixed methylation expression caused by the combination of different cell types may lead the researchers to find the false methylation site related to the phenotype of interest. In order to fix this problem, researchers have proposed some non-reference methods based on sparse principle component analysis (PCA) to correct the EWAS false discovery. However, the existing model assumes that all methylation site have the same a priori probability in each PC load, but it is known that there already has network structure in the genetic variable corresponding to the methylation site. In this paper, we show that the results of the existing EWAS correction model are still not good enough. If we can integrate the existing methylation network as prior knowledge into the sparse PCA model, we can effectively improve the correction ability of the existing model. Result: Based on the above ideas, we propose GN-ReFAEWAS, a model which uses the prior methylation gene network structure into the PCA framework for feature extraction. This model can be used to correct the false discovery in EWAS. GN-ReFAEWAS model does not need cell counting data and can estimate cell type composition through methylation principal component data. The key of this model is to solve a sparse regularize problem of methylation network. This paper uses regularize and random sampling algorithm to solve this problem. We used one simulated data set and three real data sets for experiments and compared four existing EWAS calibration models. The experimental results show that the GN-ReFAEWAS model is superior to existing models. Conclusion: The result proved that GN-ReFAEWAS model can provide a better estimation of cell-type composition and reduce the false positives in EWAS.


Background Introduction
The whole genome methylation association analysis can help us detect new regulatory mechanisms that are more susceptible to environmental factors [1][2][3]. The EWAS model selects CpG sites that are easily affected by environmental factors through screening a large number of CpG sites. Standard EWAS generally uses whole blood data, which means the value of methylation data contains information on heterogeneous cell composition [4,5]. Existing research has shown that epigenome information is greatly affected by cell changes. It means the similarity of the interest phenotype and cell composition will lead researchers to find many false methylation sites [6][7][8][9] Fortunately, in EWAS model, the researchers used univariate testing to detect the correlation between phenotype and methylation site. Therefore, the false discovery can be corrected by adding the cell ratio as a covariate to the EWAS model. [10][11][12]. Unfortunately, in many situations, researchers usually do not measure the composition ratio of cells in whole blood data. In such a case, researchers have proposed some calculation models that use reference data sets to estimate the composition of cell types [13][14][15][16][17][18][19]. However, the model based on the reference data set can only use a small part of different blood cells as the reference data set, and cannot use other tissues [10,20]. More Importantly, the factors affecting methylation are very complex [21]. This may lead to wrong analysis. Therefore, due to the above limitations, the researchers have proposed many non-reference models [14,22,23].
In 2014, Zou et.al proposed a non-reference model base on linear mixed model (LMM) and PCA which called factored spectrally transformed linear mixed model 'EWASher' (FaST-LMM-EWASher) [7]. PCA is a natural candidate to correct the false discovery of EWAS, because the previous few main PCs are related to cell type composition. This model can screen methylation data as a covariate to correct false discovery caused by cell heterogeneity without a reference data set. However, existing research shows that the effect of this model is still not good enough. Therefore, in 2016, Rahmaniet al. proposed ReFACTor model (reference-free adjustment for cell type composition model) which based on sparse PCA model [24]. This model can better screening methylation data as a covariate to correct cell heterogeneity.
The principle of sparse PCA can effectively screen out few of differentiated CPG sites to reduce the time complexity of the model and improve the correction result. Because the researchers believe that only a small part of the methylation data is significantly different (known as differentially methylated regions, DMRs) [24,25]. If all methylation sites are used for principal component analysis, then too much useless information may lead to wrong correction results. Therefore, in the Sparse PCA, how to find the DMRs in the methylation data accurately is the key of non-reference model to correct false discovery. The existing sparse PCA model assumes that all genes have the same a priori probability in each PC load (for example ReFACTor [24]) and use the greedy algorithm based on 0 norm to perform sparseness [26,27]. However, it is known that there already has network structure in the genetic variable corresponding to the methylation site [28]. Known genetic network information can be considered as specific prior knowledge. The existing models cannot integrate it as prior information into sparse PCA for covariate selection. We believe if we can integrate biological network as a priori knowledge into sparse PCA, then we can improve the selection ability of existing models, better correction for false discovery caused by cell heterogeneity. Based on this idea, we designed a referencefree model for EWAS based on gene network (GN-ReFAEWAS) which used the existing gene network as a priori information to integrate into the sparse PCA. We use the edge information of gene interaction to constrain the number of non-zero elements loaded in each PC and use an alternating iteration algorithm based on random sampling for sparseness. We conducted experiments on one simulated data set and three real data sets. Experimental results show that, compared with existing models, the GN-ReFAEWAS proposed in this paper can better control the false positive rate of EWAS.

Materials
In this chapter, we introduced a simulated data set and three real data sets used in this article.

Simulation dataset
The key of non-reference model can reduce the false of EWAS is to find alternatives to cell composition data. The first few principal components of the methylation data are considered to be related to cell type. These PCs contain methylation sites that can reflect the proportion of cell composition. We simulated a methylation dataset with 5 principal components and gene networks, only a few points in each principal component are related to cell composition (DMRs). The purpose is to evaluate whether the proposed GN-ReFAEWAS model and other existing models can correctly summarize each methylation site into the correct PC loading.
We use the formula (1) to simulate the establishment the DMRs areas of each PC loadings: Here, ∈ (0,5) represents the -th PC loadings, assuming PC loadings has element. ∈ (1, ) represents the starting DMRs sites position of the -th PC loadings in the entire simulation data set,for example, k = 1，then, = 1. represents the position of the last DMRs sites. ( − ) represents the total number of DMRs sites contained in the -th PC loadings(in the same simulation experiment, the DMRs areas of each PC is same). The remaining ( + ) elements represents non-DMRs sites.
Because the area of DMRs is not fixed. Therefore, the position of [ (0, )] changes dynamically in each simulation experiment. For each DMRs sites, we assume that each cell type has a different normal distribution in each PC. For non-DMRs sites, we assume that each cell type has the same normal distribution in each PC. If all PC loadings has elements in total, then = + ( − ) + .
For each PCs, we use the formula (2) to generate data: represents the number of samples, ( ) represent that the values are generated by random sampling (normal distribution).
Next, we can generate the methylation data simulation expression matrix ∈ ℝ × as Here ∈ ℕ(0,20), = 5 ∈ ℕ(0,1) The gene edge data in the simulation data is generated in the following way: = 1 ∪ … ∪ , where are as: = {( 1 , 1 ), ( 2 , 2 ), … , ( , ) , 1 , 1 ∈ ℕ( , ) and 1 ≠ 1 . We assume that the location of the DMRs region plays a major regulatory role in the entire network. Therefore, we will generate as much edge information as possible in the DMRs during the process of establishing the simulation data set.

Real dataset
We used three real data sets to verify the performance of the proposed model. It contains a rheumatism (RA) data set, a solid tumor data for breast and colon cancer, and a Mexican and Puerto Rican descent data set. For all real data sets, according to the recommendations of existing researchers. We discarded methylation sites with mean values above 0.8 and below 0.2 [24]. The following is a detailed introduction to the data set: First, we used a disease data set for rheumatoid arthritis. Because the cell composition of rheumatic patients is usually very different from normal people. Therefore, there is a risk of false discovery due to unexplained cell type heterogeneity. The RA data set used in this paper contains 689 experimental and control samples. Illumina HumanMethylation450 BeadChip arrays were used to obtain methylation data. The data set is already available in Gene Expression, numbered GSE42861.
Next, we use a cancer data set based on breast and colon cancer. Because the blood of cancer patients also contains many heterosexual cell mixtures and many of its components are unknown. This makes it difficult for researchers to find reliable methylation profiles. Although thousands of publications have reported the methylation changes of hundreds of genes. However, the naive analysis performed in this situation may lead to a lack of clinically available DNA methylation biomarkers. Therefore, the method in this paper can also help EWAS analysis of cancer data sets. We have uploaded the data set to GitHub at the address: https://github.com/mr1528126360/GN-ReFAEWAS.
Finally, we conducted experiments using the differential DNA methylation data set in the Mexican and Puerto Rican descent population. This data set contains 573 individuals of Mexican and Puerto Rican descent, which contains more than 450,000 variable methylated CpG sites. Prior to this. Here, we study the environmental exposure and environmental exposure of self-identity in race, genetic ancestry, DNA methylation, and analyze the differences in the environment and genetics of different ethnic groups. We found that even when adjusting the genetic lineage, the methylation level of many methylation sites is closely related to race. The influencing factors of this data are complex. More importantly, the data set contains 94 samples that have undergone a complete blood count using automated flow cytometry (Supplementary Table S1). Under such circumstances, we can study whether the results of the proposed model and other existing models still have incorrect methylation site discovery. The data set is already available in Gene Expression, numbered GSE77716.

Gene pathway dataset
The gene edge dataset is obtained from the following dataset: Molecular Signatures Database (MSigDB) (http://software.broadinstitute.org/gsea/msigdb). The corresponding data of methylation and genes are queried in the HumanMethylation450 file of each data set.
In the three real data sets, we collected the following network parameters: The methylation network of the rheumatism data set is the largest, which contains 14733988 edges. The cancer dataset network is relatively small and contains 27,955 edges. The Mexican and Puerto Rican descent data set methylation network contains 4008100 edges.

Result
This chapter uses a simulation data set and three real data sets for experimentation. For the simulation data set, we counted the number of erroneous methylation sites found in each PC loading. For the real data set, ReFACTor model retain up to 600 methylation sites and GN-ReFAEWAS model retain up to 600 edges. We first used Holm and FDR values to draw quantile-quantile plots. The graph can visually show the reliability of the correlation analysis of methylation data. Next, in order to further verify the method in this article, we count the number of FDR significant sites for different models. The model with a lower number of FDR significant site obviously has a better correction effect. Because the ReFACTor model and GN-ReFAEWAS model are both sparse PCA models, both contain the parameter k (the number of methylation sites retained) and both models performed well in the experiment. To further compare the two models. This article counts the number of FDR significant methylation sites generated by the two models in the case of k = 0-600 (retain 0-600 methylation sites). Finally, this paper also uses known cell count information to count the discovery of false methylation sites in the Mexican dataset by different models.

Simulation dataset
In the actual calibration process, the no-reference model will only select a few important methylation sites (DMRs) in each PC loading as covariates. Therefore, whether the model can correctly select the methylation sites of DMRs becomes crucial. This paper establishes a simulation data set with 5 principal components. According to the results in Table 1, The GN-ReFAEWAS model correctly extracts all 5 DMRs sites of PC1-PC5 loading. In comparison, ReFACTor model incorrectly identified 3,0,2,2 and 3 methylation sites with non-DMRs in PC1-PC5 loading. The PCA model performed the worst, in all 5 PC loadings, 0,4,2,2 and 5 methylation sites were judged wrong. This means that, in addition to the GN-ReFAEWAS model proposed in this paper, other comparison models may use the non-DMRs PC loading methylation information when correcting the dataset. This may lead the errors in the final correction results.

RA dataset
We use 5 different models to correct false discovery. As a baseline (Fig 2. F), we performed logistic regression without adjusting the cell composition data, it led to a severe expansion of the test statistics, which is consistent with the results reported in previous studies. There are 23168 FDR significant methylation sites in the RA dataset without correction (Table 2). Then, we present the GN-ReFAEWAS model to estimate the proportion of cell types obtained to adjust the data. This correction eliminates swelling by eliminating cell composition confusion. Then, we used the first PC of standard PCA, ReFACTor, FaST-LMM-EWASher and RefFreeEWAS for unsupervised cell type correction models. As shown in Figure 2, the correction effect of GN-ReFAEWAS is the best, and the PCA model is the worst. We also counted the number of FDR significant sites of different methods after correction. As shown in Table 2, the results show that the correction results of the GN-ReFAEWAS model proposed in this paper are significantly better. It uses only one PC to completely eliminate data inflation. In the comparison method, ReFACTor and LMM-EWASher models perform better. However, these two models still have 7 and 12 FDR significant methylation sites, respectively. We also show the comparison curves of the GN-ReFACTor and GN-ReFAEWAS models with retaining different methylation sites number (0-600). The results show that under the same number of methylation sites, the GN-ReFAEWAS model has the best correction results (Fig 3,Supplementary Table S2). This indicates that the model proposed in this paper has a stronger ability to select methylation sites.

Tumor dataset
Tumor data is also one of the most common data sets for EWAS analysis. The uncorrected results show significant data swell (Fig 4. F). There are 2807 FDR significant methylation sites in the RA dataset without correction ( Table 3). The GN-ReFAEWAS model produced very good correction results. As shown in Figure 4, in tumor data set, the GN-ReFAEWAS model has reached the optimal result of the experiment in this paper, which FDR significant methylation sites reached 0. For this comparison, The ReFACTor, LMM-EWASher, and RefFreeEWAS models contain 12, 35, and 22 FDR significant methylation sites, respectively. The PCA model had the worst results, with 735 FDR significant methylation sites. Figure 5 is the Number of FDR significant sites of retain different methylation sites number for tumor dataset. We found that GN-EefEWAS is still the best performing model. The model proposed in this article contains the fewest FDR significant sites in all cases(Supplementary Table S3).

Mexican and Puerto Rican descent population dataset
Finally, we selected the Mexican and Puerto Rican descent population data set for analysis. We have obtained similar results on this dataset with the first two real datasets (Fig, 6). According to Table 4, without correction, the data set contains 440 FDR significant sites. After correction, the GN-ReFAEWAS model completely eliminates FDR significant sites. The PCA model is the worst and still contains 128 FDR significant sites. In the case of retaining different methylation sites number. GN-ReFAEWAS model still shows a very obvious advantage (Fig. 7, Supplementary Table S4).
Although the GN-ReFAEWAS, ReFACTor, FaST-LMM-EWASher and RefFreeEWAS models also produced good results. However, these models still have the possibility of false positives. Therefore, we use the existing cell count data from this data set to estimate the number of false positives for each model. The final result showed that only the GN-ReFAEWAS model completely controlled the false positive rate ( Table 5). The data set showed 4115 false positive methylation sites without correction. After correction, the best result is the GN-ReFAEWAS model. The model did not show any false positives. The ReFACTor and FaST-LMM-EWASher model immediately followed, with 6 and 5 false positives respectively, followed by the RefFreeEWAS model, with 273 errors, and the worst is the PCA model, with 361 errors (Table 5).
In summary, in all three real data sets, all indicators show that GN-ReFAEWAS performs best. This model controls the false positive rate of the data very well. This shows that using the known methylation network as a priori information can effectively help the model to select the DMRs methylation site, and further help researchers control the false positive rate of EWAS.

Discussion
Sparse PCA is a very effective model to correct EWAS, However, none of the existing reference-free models use any prior knowledge with network results. However, existing research has proven that the existing prior network information can greatly improve the feature selection capability of the model. Inspired by two facts, we propose a sparse PCA model based on methylated network structure for correcting erroneous discovery caused by cell type heterogeneity in EWAS. We proposed a sparse regularizer based on network edge information, and designed a random sampling algorithm based on greedy principle to solve the sparse projection operation in the model. We used a simulated data set and three real data sets to verify the model in this paper. The results of the simulation data set show that only GN-ReFAEWAS correctly selects all methylation sites corresponding to each PC. This shows that, compared with the existing models, the prior network structure information can improve the model's feature filtering ability. The results of the real data set also show that the GN-ReFAEWAS model is superior to the other four existing models compared. The methylation information screened by this model can be effectively used as a covariate to effectively reduce the false positive rate of EWAS. In the case of using the same number of methylation sites, in general, the GN-ReFAEWAS model performs better than other models. In the third data set, GN-ReFAEWAS is the only model with no false positives.
In fact, in addition to having the effect of correcting the false positive rate of EWAS. The methylation sites in each PC screened by the GN-ReFAEWAS model may also be potential ethnic groups with the same biological function. Because the GN-ReFAEWAS model uses known gene network information as a priori knowledge. This determines that there must be a consistent relationship between the methylation variables after their screening. Therefore, the methylation sites screened by this model are highly likely to have biological connectivity. We hope this result can help other researchers.
In the future, we can further expand this model in the following aspects. First, further integrate gene connection information containing specific functions (for example, methylation sites known to be related to cell composition) as the focus of a priori network structure. Second, the model proposed in this paper only considers the edge information of the prior network. However, in some special cases, it may cause some information about methylation sites not in the network to be masked. Therefore, it may also be necessary to reasonably consider the information of methylation sites that are not in the network.

Conclusions
In summary, this paper proposes GN-ReFAEWAS, a model which uses the prior methylation gene network structure into the sparse PCA framework. Experimental results show that, compared with the existing non-reference EWAS correction model, the GN-ReFAEWAS model has a stronger ability to select DMRs methylation sites, and can provide better cell composition estimation and correction results.

Methods
In this chapter, we first introduce the network sparse regularize of the GN-ReFAEWAS model, and then propose the learning model of the GN-ReFAEWAS model and the generation model of PC and its loading. Finally, we introduced other comparison models used in this article.

Sparse network regularizer
There is a corresponding relationship between methylation sites and gene sites. Therefore, the corresponding methylation network data set can be established based on the known human gene pathway data set. In this network dataset, if two methylation sites have links, then the two sites with side information can be considered as a group. In this case, we set = { 1 , … , } as the set of all edges in the network. Our goal is to select important methylation points (DMRs) from a methylation network containing m edges and use the scores of the principal components as covariates to correct false discovery caused by cell heterogeneity. Here, we presented a Methylation network Sparse penalty as formula (4): According to existing research, the SVD framework is usually used to solve the PCA model, then the formula (4) can be written as: Where and represents methylation sites weight and sample weight ( ⊆ ℝ ×1 , ⊆ ℝ ×1 ). ∈ ℝ × represents a matrix with methylation sites and samples.

Random sampling algorithm based on greedy principle for GN-ReFAEWAS
The key to solving this problem is how to determine the value of the sample weight vector , set = , then formula (6) can be expressed as: Traditional sparse PCA mostly adopts the principle of greedy algorithm. According to the threshold , the first largest methylation sites of each PC are retained. However, this algorithm has no randomness and cannot obtain the global optimal solution. This paper proposes a random sampling algorithm based on the principle of greed as a solution to the learning model of GN-ReFAEWAS. Let = { 1 , … , } is the weight vector of methylation sites, = { 1 , … , } is the sample weight vector of the dataset, = ， is the weight information of the methylated sites retained after sparseness, = { 1 , … , } is the set of methylated network edges, is the parameter that controls the randomness of the algorithm. The larger the value of this parameter, the higher the randomness of algorithm initialization. is a parameter of the control algorithm to reduce the speed of randomness. is the number of edges remaining in the methylation network after sparseness. The steps of the algorithm are as follows: 1. First, use the randomization model to initialize the sample weight vector of the data set, and manually enter the parameters , N and T 2. Use to calculate the weight of the methylation data , which can get the weight value of each current methylation site and for the -th methylation site, � ( , )� : Where ( ) is an edge of the methylation network that contains methylation site . If ( )is retained, then there is � ( , )� = ，otherwise � ( , )� = 0

Generating model of PC and PC loading
We use the following method to generate each principal component of the model (formula (8)): Where we have known that ≥ 2 and pairs with { , }( = 1, … , − 1),and must be orthogonal with 1 , … , −1 . This paper adopts an alternating iteration to solve the problem of formula 8. First of all, fixed and update and vice versa. When is fixed, formula (8) can be expressed as formula (9): We use the Gram-Schmidt orthogonalization method in our model [29].
‖ ‖ 2 ≤ 1 We can get the optimal solution of the formula (11): Where = −1 ⊥ , then we have formula (12): Where we use the following theorem to solve −1 The optimal solution of problem (13) is After is updated, we can use the random sampling algorithm based on greedy principle proposed in this paper to update . Therefore, the final alternate iteration strategy is as formula (14)(15)

ReFACTor
The ReFACTor model was proposed by Rahmani et al [24]. The ReFACTor model is based on the sparse PCA method and corrects EWAS false discovery by screening DMRs sites. This model is a typical sparse PCA model. The model is based on the idea of differential methylation (that is, only a small part of the methylation sites is different from other sites), combined with the sparseness idea, first select all the methylation sites according to left-singular vectors The specified k methylation sites, and finally perform PCA again to distinguish the principal components and select methylation data.

FaST-LMM-EWASher
FaST-LMM-EWASher model was proposed by Zou et al [7]. This model computes the methylome similarity between every pair of samples and then use these similarities as the covariance in the mixed model as an implicit proxy for cell-type composition.       Table   Table 1 The number of DMRs and non-DMRs methlation sites selected in each PC for each model of the simulation dataset