Accurate Prediction Model - Polygenic Risk Score for High-Risk Individuals Predictive of Gastric Cancer

The genetic variation underlying gastric cancer has not been fully identified. We aimed to screen and identify common single nucleotide polymorphisms (SNPs) and long noncoding RNA (lncRNA)-related SNPs associated with the risk of gastric cancer, and to construct and evaluate prediction models based on the polygenic risk score (PRS). Non-genetic factors, such as H. pylori infection and environmental exposures, and genetic factors associated with gastric cancer were screened by meta-analysis and bioinformatics and verified in a frequency-matched case-control study. PRS and weighted genetic risk scores (wGRS) were derived from the estimated effect sizes. Net reclassification improvement (NRI), integrated discrimination improvement (IDI), the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) were used to evaluate the models. A risk gradient was observed across quantiles of the PRS: the risk of gastric cancer in the highest 10% quantile of the PRS was 3.24-fold higher than that of the general population (OR=3.24, 95% CI: 2.07, 5.06). The PRS combined with one or more risk factors (smoking, drinking and H. pylori infection) was superior to the single genetic risk model. For NRI and IDI, the PRS combinations were significantly improved compared with the wGRS model combinations (P<0.001). The model of PRS combined with lncRNA SNPs, smoking, drinking and H. pylori infection was the best-fitting model (AIC=117.23, BIC=122.31).


Background
Gastric cancer is a highly lethal cancer worldwide, being the fourth most common malignancy and the third leading cause of cancer-related mortality in developing countries [1,2]. According to the cancer statistics released by cancer registration center in China, gastric cancer is the second most commonly diagnosed cancer among men, second only to lung cancer [3].
Studies have shown that the occurrence and development of gastric cancer is a multifactor and multistage process, which is highly related to biological, environmental exposure and genetic factors. Helicobacter pylori (H. pylori) is a well-known biological pathogenic factor [4]. The average infection rate of H. pylori in Chinese population is as high as 60% [5], which is also one of the causes of high incidence rate of gastric cancer in China.
The polygenic risk score (PRS) is a risk prediction model based on demographic and clinical characteristics, genetic markers and other risk stratification factors. Studies have shown that PRS has promising predictive power for identifying groups at high risk of cancer [6], and it provides clinicians with a tool to assess the risk of patients and improve the utilization of medical resources [7,8].
PRS models incorporating genetic and non-genetic factors have robust risk prediction capability; for example, an interaction (multiplier effect) has been shown between breast cancer-related single nucleotide polymorphisms (SNPs) and environmental risk factors [9]. Other similar studies have also confirmed the remarkable predictive power of PRS [10][11][12]. This method has also been used to predict the risk of psoriasis [13], stroke [14], schizophrenia, and bipolar disorder [15,16], with good prediction results.
Genome-wide association studies (GWASs) have shown that about 11 loci in the human genome are associated with gastric cancer [17]. The previously developed weighted genetic risk score (wGRS) has some limitations in predicting cancer risk, and models based on wGRS have depended largely on genetic loci screened by GWAS or evidence-based medicine (EBM) [18][19][20]. So far, no PRS-based risk prediction model for gastric cancer has been reported in China. In studies of other complex diseases, risk prediction models based on PRS have performed better than those based on wGRS [7,8,14]. Meanwhile, studies have confirmed that lncRNA SNPs are associated with gastric cancer, yet lncRNA SNPs have not been included in existing risk prediction models [21][22][23].
In the present study, quantitative systematic evaluation and meta-analysis were used to determine the non-genetic and genetic factors, such as H. pylori infection and environmental factors. Based on the association and bioinformatics results, gastric cancer-related SNPs were screened and verified in a case-control study. Based on the validation results, lncRNA SNPs, treated as an independent risk factor data set, were combined with the common SNPs, H. pylori infection and environmental factors using PRS to construct an individualized risk prediction model for gastric cancer.

Materials And Methods
The study was approved by the ethics committee of Zhengzhou University. All participants were informed about the study and provided written informed consent. The design and implementation flow chart of this study is shown in Figure 1.

Meta-analysis of risk factors for gastric cancer
To assess the credibility and strength of the effects of non-genetic factors and genetic variation on gastric cancer risk, we performed a field synopsis and meta-analysis to identify risk factors for gastric cancer in the Chinese population. A total of 22 SNPs involving 16 genes were identified as associated with the risk of gastric cancer. Details have been published in Aging-US [24].

Genetic variant selection for PRS
A bioinformatics approach was used to screen lncRNAs that were differentially expressed in gastric cancer and the corresponding functional SNPs that possess potential binding sites for microRNAs (miRNAs).
Gastric cancer-related microarray data sets from the Chinese population (GSE50710, GSE53137, GSE58828) were retrieved and downloaded from the Gene Expression Omnibus (GEO) database. The GEO chip data were analyzed using Bioconductor in R (version 3.6.2 for Windows), and probes were mapped to the chip annotation database according to probe ID. The intersection of the results from the three chips was obtained using SAS 9.2 (SAS Institute Inc., Cary, North Carolina, USA). Differentially expressed lncRNAs were screened using a fold change > 2.0 and P < 0.05.
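The screening rule above (fold change > 2.0, P < 0.05, then the intersection across the three chips) can be sketched in a few lines of Python. This is only an illustration, not the Bioconductor/SAS workflow used in the study: the lncRNA names and values are made up, and applying the fold-change cut-off symmetrically (up- or down-regulated by more than 2-fold) is an assumption.

```python
import math

def differentially_expressed(results, min_fold=2.0, max_p=0.05):
    """Return names whose fold change exceeds min_fold in either direction and P < max_p.

    `results` maps a lncRNA name to (fold_change, p_value); hypothetical structure.
    """
    threshold = math.log2(min_fold)
    return {
        name
        for name, (fold_change, p_value) in results.items()
        # Assumption: 0.5x and 2x are treated as equally large changes.
        if abs(math.log2(fold_change)) > threshold and p_value < max_p
    }

# Hypothetical per-chip results (not data from the study).
chip_1 = {"lnc_A": (3.1, 0.001), "lnc_B": (0.3, 0.020), "lnc_C": (1.4, 0.010)}
chip_2 = {"lnc_A": (2.6, 0.004), "lnc_B": (0.4, 0.030), "lnc_C": (2.5, 0.200)}
chip_3 = {"lnc_A": (2.2, 0.030), "lnc_B": (1.1, 0.500), "lnc_C": (2.9, 0.001)}

# Keep only lncRNAs that pass the filter on all three chips.
shared = set.intersection(
    *(differentially_expressed(chip) for chip in (chip_1, chip_2, chip_3))
)
print(sorted(shared))
```

Taking the intersection is the conservative choice: a lncRNA must pass both cut-offs on every chip to be carried forward.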
We used the lncRNASNP2 database (http://bioinfo.life.hust.edu.cn/lncRNASNP#!/) and the online tool RNAfold (http://rna.tbi.univie.ac.at/cgibin/RNAWebSuite/RNAfold.cgi) for a preliminary prediction of the biological function of SNPs on the differentially expressed lncRNAs, and identified SNPs that affect the secondary structure of lncRNAs or the binding of miRNAs. Because r² reflects the degree of linkage disequilibrium (LD) between SNP sites, the LD between SNP sites on the same gene (r² < 0.8 and LD < 1.0) was also taken into account, and 21 lncRNA SNPs were finally selected (Supplementary Table 1).
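As a reminder of what the r² criterion measures, the sketch below computes r² for two biallelic SNPs from haplotype and allele frequencies, using the standard definition r² = D² / (pA(1 - pA) pB(1 - pB)) with D = pAB - pA pB. The frequencies and function names are hypothetical; how the LD values in this study were obtained is not specified in the text.

```python
def ld_r2(freq_hap_ab, freq_a, freq_b):
    """Squared correlation r^2 between two biallelic SNPs.

    freq_hap_ab: frequency of the haplotype carrying allele A (SNP 1) and allele B (SNP 2)
    freq_a, freq_b: frequencies of allele A at SNP 1 and allele B at SNP 2
    """
    d = freq_hap_ab - freq_a * freq_b  # disequilibrium coefficient D
    return d * d / (freq_a * (1 - freq_a) * freq_b * (1 - freq_b))

def passes_ld_filter(r2, threshold=0.8):
    """True when the pair is not in strong LD under the r^2 < 0.8 criterion."""
    return r2 < threshold

# Hypothetical frequencies: one independent pair and one perfectly linked pair.
r2_independent = ld_r2(0.25, 0.5, 0.5)
r2_linked = ld_r2(0.50, 0.5, 0.5)
print(r2_independent, passes_ld_filter(r2_independent))
print(r2_linked, passes_ld_filter(r2_linked))
```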
We followed the principles of evidence-based medicine and applied a three-step approach. First, we performed a meta-analysis to screen for associations between genetic variants and gastric cancer. Second, SNPs in strong linkage disequilibrium (LD) with other polymorphisms were excluded. Third, the extracted SNPs were combined with SNPs from published field synopses or systematic reviews that were significantly associated with gastric cancer (OR ≥ 1.20 or OR ≤ 0.8) in the Chinese population (Chinese Han in Beijing, minor allele frequency ≥ 0.1). In total, 20 SNPs involving 18 genes were selected; the results are presented in Supplementary Table 2.

Study population
All patients with gastric cancer were newly diagnosed cases from the First Affiliated Hospital of Zhengzhou University and the Affiliated Cancer Hospital of Zhengzhou University, recruited from January 2012 to December 2015. The patients had not received anti-tumor treatment before recruitment and had no history of other malignant tumors.
The controls were drawn from a cardiovascular disease epidemiological survey conducted at the same time in Henan Province. Individuals with malignant tumors or digestive system diseases, and blood relatives of the cases, were excluded.
Using a frequency-matched case-control design, with subjects matched by gender and age (± 2 years), blood samples were collected from 660 patients with pathologically confirmed gastric cancer and 660 normal controls from the community. Each participant met the requirements of the institutional review committee and gave informed consent.

Genotyping and quality control
Polymerase chain reaction-restriction fragment length polymorphism (PCR-RFLP), created restriction site PCR-RFLP (CRS-PCR-RFLP) and improved multiplex ligation detection reaction (iMLDR™) were used to genotype the SNPs corresponding to lncRNAs or selected by EBM. For iMLDR™, a 3130XL sequencer (Applied Biosystems, USA) was used for sequencing, and GeneMapper 4.0 was used to identify genotypes.
For PCR-RFLP typing, 10% of the samples were randomly selected for sequencing, and the sequencing results were compared with the experimental results. When the agarose gel electrophoresis pattern could not accurately determine the genotype, repeated experiments or direct sequencing were used to determine the genotype.
For iMLDR typing, agarose gel electrophoresis was used to check each sample before typing, and 3% double-blind quality control samples and negative controls were included. For the quality control samples, the success rate (call rate) and accuracy were required to exceed 98%.

Weighted genetic risk scores
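The population average risk (genetic score) of a single SNP was calculated from the genotype frequencies of the genetic variant and the OR from the meta-analysis in the Chinese population.
Assume that the genotypes of a SNP are AA, AB and BB, where B is the risk allele and A is the non-risk allele, with corresponding risk values 1, OR and OR². With W taken to be the population average risk of that SNP described above, the genotype scores and the weighted genetic risk score (wGRS) are estimated as follows:

$$\mathrm{AA} = \frac{1}{W}, \qquad \mathrm{AB} = \frac{\mathrm{OR}}{W}, \qquad \mathrm{BB} = \frac{\mathrm{OR}^{2}}{W}$$

$$\mathrm{wGRS} = \mathrm{SNP}_1 \times \mathrm{SNP}_2 \times \mathrm{SNP}_3 \times \cdots \times \mathrm{SNP}_n$$

Missing genotype values are set to 1.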
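To make the wGRS calculation concrete, here is a minimal Python sketch. It assumes that W is the genotype-frequency-weighted average of the risk values 1, OR and OR² (a common way to define a population average risk; the text does not spell out the formula), and the SNP names, odds ratios and genotype frequencies are hypothetical.

```python
def genotype_scores(odds_ratio, freq_aa, freq_ab, freq_bb):
    """Per-genotype scores 1/W, OR/W, OR^2/W, keyed by risk allele count (0, 1, 2)."""
    # Assumption: W is the population-average risk, weighting each genotype's
    # risk value by its frequency in the population.
    w = freq_aa * 1.0 + freq_ab * odds_ratio + freq_bb * odds_ratio ** 2
    return {0: 1.0 / w, 1: odds_ratio / w, 2: odds_ratio ** 2 / w}

def wgrs(risk_allele_counts, snp_params):
    """Multiply the per-SNP genotype scores; missing genotypes contribute a factor of 1."""
    score = 1.0
    for snp, params in snp_params.items():
        count = risk_allele_counts.get(snp)
        if count is None:
            continue  # missing value set to 1, as stated in the text
        score *= genotype_scores(*params)[count]
    return score

# Hypothetical SNPs: (OR, freq AA, freq AB, freq BB).
SNP_PARAMS = {
    "snp1": (1.5, 0.49, 0.42, 0.09),
    "snp2": (1.2, 0.25, 0.50, 0.25),
}

subject = {"snp1": 2, "snp2": 1}
print(wgrs(subject, SNP_PARAMS))
print(wgrs({"snp1": 2, "snp2": None}, SNP_PARAMS))  # snp2 genotype missing
```

Under this definition of W, the frequency-weighted mean of each SNP's genotype scores is 1, so a wGRS above 1 indicates higher-than-average genetic risk.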

Polygenic risk score
We derived a PRS specific to the Chinese population from all SNPs verified to be associated with gastric cancer risk at the genome-wide significance level. The PRS was constructed for cases and controls by summing the risk allele counts (i.e., 0, 1 or 2 risk alleles per subject) for the associated variants, each weighted by its effect size, the natural logarithm of the odds ratio (ln OR) extracted from the multivariate logistic regression model. For each participant, we summed the weighted risk allele counts and divided by the total number of loci to derive a mean weighted score, which was used as the reference.
Here, j is the number of SNPs included in the model; n_ij is the number of risk alleles (0, 1 or 2) at the i-th SNP; and OR_i is the odds ratio (OR) for the association between the risk allele of the i-th SNP and gastric cancer.
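Read literally, the description above corresponds to a mean weighted score of the form PRS = (1/j) × Σ n_i × ln(OR_i), summed over the j SNPs. The sketch below implements that reading in Python; it is an interpretation of the text rather than the exact PRSice-2 computation, and the SNP names and odds ratios are hypothetical.

```python
import math

def polygenic_risk_score(risk_allele_counts, odds_ratios):
    """Mean of ln(OR)-weighted risk allele counts over all loci in the model."""
    weighted_sum = sum(
        math.log(odds_ratios[snp]) * count
        for snp, count in risk_allele_counts.items()
    )
    # Divide by the total number of loci in the model, as described in the text.
    return weighted_sum / len(odds_ratios)

# Hypothetical effect sizes and one subject's risk allele counts (0, 1 or 2).
ODDS_RATIOS = {"snp1": 1.5, "snp2": 1.2, "snp3": 0.8}
subject_counts = {"snp1": 2, "snp2": 1, "snp3": 0}

prs = polygenic_risk_score(subject_counts, ODDS_RATIOS)
print(round(prs, 4))
```

Because the weights are ln(OR), a variant with OR below 1 lowers the score as its allele count increases, whereas the multiplicative wGRS works on the OR scale directly.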

Statistical analysis
The Hardy-Weinberg equilibrium (HWE) test was performed on the genotype distribution of the controls using the chi-square goodness-of-fit test. Unconditional logistic regression was used to analyze the association between the targeted SNPs and gastric cancer risk. Plink 1.9 (NIH-NIDDK's Laboratory of Biological Modeling, Harvard University) was used for quality control of the SNPs, allelic association analysis, and generation of the base and target data sets for PRSice-2 (Gavin Band, New York, USA). Gastric cancer risk prediction models based on wGRS and PRS were constructed using the SNPs screened by EBM and verified by association analysis. lncRNA SNPs were entered into the prediction models as an independent data set of risk factors, and the empirical P-value was obtained from 10,000 fittings within the model to optimize the model parameters and build the optimal model.
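The HWE goodness-of-fit test mentioned above can be illustrated with a short, dependency-free Python sketch. It compares observed genotype counts with the counts expected from the estimated allele frequency and uses one degree of freedom. The genotype counts are hypothetical, and the function is a textbook version, not the Plink implementation.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (1 df).

    Assumes both alleles are present, so that no expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical control genotype counts.
print(hwe_chi_square(36, 48, 16))  # counts that match HWE exactly
print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all
```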
Receiver operating characteristic (ROC) curves and the area under the curve (AUC) were used to evaluate how well the different models discriminated gastric cancer. Net reclassification improvement (NRI) and integrated discrimination improvement (IDI) were used to evaluate the predictive ability of the wGRS and PRS models, and the Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used to evaluate the goodness of fit of the models.

... show that 14 of the 21 associated lncRNA SNPs were found to be related to the risk of gastric cancer (Table 2). Stepwise logistic regression analysis was used to establish the SNPs related to the PRS, and 15 of the 20 genotyped SNPs were retained (P<0.05) (Table 3).

Distribution of genetic risk score
For both the lncRNA SNPs and the common SNPs, the PRS was approximately normally distributed (Supplementary Figure 1). For both the lncRNA SNPs and the common SNPs, wGRS and PRS were significantly higher in the case group than in the control group (Supplementary Figure 2 and Figure 4).

Construction of PRS risk prediction model
For wGRS, the risk of gastric cancer rose significantly across increasing score groups, taking the 0-1 group as the reference (Supplementary Table 3).
The bar plot of the PRS results shows the proportion of phenotypic variation explained (R²) under different P-value thresholds (Pt). Figure 2A shows the R² value (vertical axis) of the PRS model at different Pt values (horizontal axis); the highest bar indicates the optimal model. The model was optimal at Pt = 0.0818, where genetic variation explained about 8.8% of the phenotypic variation (P = 6.4 × 10^-19).
The high-resolution plot output by PRSice-2 shows the distribution of model-fit P-values obtained under different Pt values. In this model, the best Pt was the highest point on the line (the point with the minimum model-fit P-value), at Pt = 0.0818.
According to the distribution of the data set, the visual output data were divided into 10 groups, with the 40-60 quantile as the reference. We calculated the PRS for each subject and then divided the subjects into nine groups. Increasing PRS was associated with a significantly increased risk of gastric cancer (Figure 3), and a risk gradient was observed across quantiles of the PRS. The risk of gastric cancer in the lowest 10% quantile of the risk score was 47.1% lower than that of the general population (OR=0.53, 95% CI: 0.34, 0.87), and the risk in the highest 10% quantile of the PRS was 3.24-fold higher than that of the general population (95% CI: 2.07, 5.06) (Figure 3).
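For readers who want to see how an odds ratio and its 95% CI for one quantile group against the reference group can be obtained, the sketch below uses the unadjusted 2×2-table estimate with a Woolf (log-scale) confidence interval. The counts are hypothetical, and this is a simplification: the estimates reported here may come from a regression model with covariates.

```python
import math

def odds_ratio_ci(cases_group, controls_group, cases_ref, controls_ref, z=1.96):
    """Unadjusted odds ratio of a quantile group versus the reference group,
    with a Woolf (log-scale) confidence interval."""
    odds_ratio = (cases_group * controls_ref) / (controls_group * cases_ref)
    se_log_or = math.sqrt(
        1 / cases_group + 1 / controls_group + 1 / cases_ref + 1 / controls_ref
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts: top PRS group versus the 40-60 reference group.
estimate, ci_low, ci_high = odds_ratio_ci(
    cases_group=90, controls_group=40, cases_ref=120, controls_ref=140
)
print(round(estimate, 3), round(ci_low, 2), round(ci_high, 2))
```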

Goodness of fit and model evaluation of risk prediction model
The NRI and IDI were used to estimate the improvement in prediction after adding one or more new risk factors, and the NRI and IDI values of four different model comparisons were examined. According to the NRI, the improvement was statistically significant in three of the four comparisons; the exception was wGRS vs. PRS + lncRNA SNPs, for which the improvement was not statistically significant (positive improvement of 4%). Based on the IDI, the prediction of the PRS combined model was significantly better than that of the wGRS combined model (P<0.001) (Table 4).
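The two reclassification measures can be illustrated with a small Python sketch. It uses the category-free (continuous) NRI and the usual IDI definition; whether the study used categorical or continuous NRI is not stated, so this choice is an assumption, and the predicted risks below are invented.

```python
def _mean(values):
    return sum(values) / len(values)

def idi(old_probs, new_probs, labels):
    """IDI: mean gain in predicted risk among cases minus the mean gain among controls."""
    gain_cases = _mean([n - o for o, n, y in zip(old_probs, new_probs, labels) if y == 1])
    gain_controls = _mean([n - o for o, n, y in zip(old_probs, new_probs, labels) if y == 0])
    return gain_cases - gain_controls

def continuous_nri(old_probs, new_probs, labels):
    """Category-free NRI: net upward movement in cases plus net downward movement in controls."""
    def net_up(target):
        # +1 if the new model raises the predicted risk, -1 if it lowers it, 0 if unchanged.
        moves = [(n > o) - (n < o) for o, n, y in zip(old_probs, new_probs, labels) if y == target]
        return _mean(moves)
    return net_up(1) - net_up(0)

# Hypothetical predicted risks from an old and a new model (1 = case, 0 = control).
labels = [1, 1, 1, 0, 0, 0]
old_model = [0.6, 0.5, 0.4, 0.5, 0.4, 0.3]
new_model = [0.8, 0.6, 0.3, 0.4, 0.2, 0.4]

print(round(idi(old_model, new_model, labels), 4))
print(round(continuous_nri(old_model, new_model, labels), 4))
```

Positive values of either measure mean the new model moves cases toward higher predicted risk and controls toward lower predicted risk, on balance.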
AIC and BIC were compared across wGRS- and PRS-based models with different factors introduced, in order to select the best-fitting model. With the simple genetic wGRS and PRS models as references, the models that added lncRNA SNPs fitted better than the single genetic models of wGRS and PRS. The PRS with one or more risk factors (smoking, drinking and H. pylori infection) was superior to the single genetic risk model. The model of PRS combined with lncRNA SNPs, smoking, drinking and H. pylori infection was the best-fitting model (AIC = 117.23, BIC = 122.31) (Table 5).
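As a brief reminder of how these two criteria trade fit against complexity, the sketch below computes AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L and picks the model with the lowest AIC. The log-likelihoods, parameter counts and model names are hypothetical and are not taken from Table 5.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fitted models: name -> (log-likelihood, number of parameters).
models = {
    "genetic_only": (-60.0, 2),
    "genetic_plus_exposures": (-50.0, 5),
}
n_obs = 100  # hypothetical sample size

for name, (log_lik, k) in models.items():
    print(name, aic(log_lik, k), round(bic(log_lik, k, n_obs), 2))

best_by_aic = min(models, key=lambda name: aic(*models[name]))
print(best_by_aic)
```

BIC penalizes each extra parameter by ln(n) rather than 2, so with larger samples it favors smaller models more strongly than AIC does.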
According to the ROC curves, the AUC results showed that adding lncRNA SNPs to wGRS and PRS significantly improved the predictive ability of the models (Figure 4). Adding gastric cancer-related lncRNA SNPs, alcohol drinking and H. pylori infection to the simple genetic model also significantly improved predictive ability (Table 5). Based on these evaluation indices, the PRS-based model including lncRNA SNPs, smoking, drinking and H. pylori infection had the best predictive ability, which is the same combination that gave the best goodness of fit.
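The AUC has a simple interpretation that is easy to show in code: it is the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the scores are hypothetical, and the quadratic loop is meant for illustration rather than for large data sets.

```python
def auc(case_scores, control_scores):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical model scores for three cases and three controls.
cases = [0.9, 0.8, 0.4]
controls = [0.5, 0.3, 0.2]
print(round(auc(cases, controls), 4))
```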

Discussion
Due to the high heterogeneity of cancer phenotypes and the complexity of their etiology, individuals exposed to the same environment may have different risks [25].
This heterogeneity results from the interaction of different mechanisms and multiple conditions. Genetic risk models can quantitatively explain the heritability of phenotypes to a certain extent. In practical application, the predictive effects of individual genetic markers need to be combined into a comprehensive index to facilitate the screening of high-risk populations.
This study was built on a comprehensive quantitative evaluation of biological, environmental, behavioral and genetic factors and gastric cancer risk, combined with systematic reviews and meta-analyses on gastric cancer in the Chinese population published in authoritative journals. We verified the selected 20 SNPs related to gastric cancer in a population-based case-control study and constructed a risk prediction model according to the results of the association verification. Meanwhile, we used bioinformatics methods to screen lncRNAs related to gastric cancer, identified their functional SNPs and verified them in the population. These lncRNA SNPs were included in the risk prediction model as an independent risk factor, combined with environmental, behavioral and biological factors, and the model was evaluated in multiple dimensions according to the risk factors included.
So far, wGRS has been widely used for weighting the genetic association effect values of SNPs [26]. Because the occurrence of cancer is regulated by multiple genes and multiple loci, the genetic effect of a few genes or a single gene is weak and cannot accurately predict the disease. Therefore, it is necessary to integrate the genetic information of multiple loci and genes. PRS is a new strategy for studying the genetic susceptibility to cancer and other complex diseases. It can be used to quantify the cumulative effect of multiple loci or genes and to redefine the risk scale to accurately measure an individual's disease susceptibility score [27], taking into account the effect size of individual SNPs. At the same time, some studies have confirmed that pattern recognition is highest when the combination of multiple loci is considered [28-30]. Therefore, both wGRS and PRS methods were used to construct the models in this study, and they were compared and evaluated.
The recent large-scale GWASs of gastric cancer have provided an opportunity to develop risk models that include genetic variation. Studies ...
In the present study, the risk identification provided by the genetic profile was summarized in the PRS and further improved by combining biological, behavioral and environmental factors. However, some limitations of this study should be pointed out. First, in the SNP screening stage, the pooled results showed that there might be heterogeneity among the included studies. We did not explore the source of this heterogeneity; however, the pooled effect values were close to those from the population validation, suggesting that the results should be broadly applicable. Second, we only evaluated the 20 risk loci identified in a Chinese population, which limits the generalizability of our findings to other ethnic populations with different variant effect sizes, LD patterns and allele frequencies. Third, the interactions among the independent risk factors included in the PRS, such as the interaction between environmental and genetic factors, were not addressed, and the interaction between SNPs was ignored in the analysis of multiple genes. Finally, only genome-wide significant mutations were selected to produce the PRS; other loci have not yet been identified, and rare variants and copy number variations with greater correlation should be included in future PRS.

Conclusion
In summary, gastric cancer-related common SNPs and lncRNA SNPs had a significant combined effect. Under the same factors and conditions, the PRS model was better than the wGRS model, and the introduction of gastric cancer-related lncRNA SNPs significantly improved discrimination. The model based on PRS combined with lncRNA SNPs, smoking, drinking and H. pylori infection had the best predictive ability for the risk of gastric cancer, helping to distinguish high-risk groups from the general population. This study has practical significance for China in formulating accurate and efficient screening strategies and improving the detection rate of early gastric cancer.

Declarations

Author Contributions
All authors contributed significantly to this work. FJD and KJW designed and drafted the manuscript. JYZ and LPD collected studies and summarized data. HY, CHS and PW copyedited the manuscript, did statistical work and prepared figures. All authors reviewed this manuscript and approved the final draft.