Clonal architectures predict clinical outcome in gastric adenocarcinoma based on genomic variation, tumor evolution and heterogeneity

Background: Gastric cancer is a highly heterogeneous disease. Owing to the lack of effective molecular markers and personalized treatment, the prognosis of patients with gastric cancer remains poor. This paper explores the genomic instability and intratumoral heterogeneity of gastric cancer through bioinformatics analysis. Methods: Using the RNA-Seq data, copy number variation data and clinical follow-up information from TCGA, the Brunet algorithm in NMF was used to identify SNV signatures. The ABSOLUTE algorithm was used to screen and identify CNV signatures. The relationship between the clonal state of genes and survival was analyzed with the Kaplan-Meier method. The Spearman method was used to evaluate the correlation between clonal/sub-clonal events and TMB and neoantigens. Results: Analysis of the SNV and CNA data in the TCGA gastric cancer dataset divided the mutations of the samples into three signatures with significant differences. Furthermore, the clonal/sub-clonal events of SCNA and SNV were analyzed to identify the clonal/sub-clonal status of mutations in STAD samples. TP53, TTN and MUC16 were the most frequently mutated genes (> 20% of samples). MIEN1, GRB7 and PNMT showed the most frequent CNV gains (> 10% of samples). A series of early gastric cancer genes, such as TP53, USH2A and GLI3, as well as advanced genes, such as CTNNB1, LRP1B and ERBB4, were further identified. Ultimately, the clonal/sub-clonal status of 5 early genes, 12 intermediate genes and 8 advanced genes was significantly correlated with the overall survival of patients. In addition, there was a significant correlation between clonal/sub-clonal events and TMB/neoantigens. Conclusion: In this paper, the relationship between intratumoral heterogeneity and genomic instability was evaluated based on the clonal phylogenetic data of gastric cancer.
A series of molecular features determined by screening can be used as a marker of potential gastric cancer heterogeneity, which

cause of cancer-related deaths in the world [1]. Gastric cancer lacks typical clinical features at onset, which poses great challenges to its early diagnosis [2]. At the same time, the recurrence and metastasis rates of gastric cancer patients are high, which is an important reason for their short survival time and poor prognosis [3]. Currently, gastric cancer is treated with a combination of surgery, chemotherapy and targeted treatment [4]. Despite some progress in treatment in recent years, the prognosis of patients is still poor [5]. Therefore, research on the etiology and pathogenesis of gastric cancer is an important means of preventing and treating the disease.
The occurrence and development of gastric cancer result from a combination of environmental and genetic factors [6]. Epidemiological data have confirmed that Helicobacter pylori (HP) infection, a high-salt diet, inadequate dietary fiber intake, smoking and alcohol consumption are common pathogenic factors of gastric cancer [7]. Extensive studies have shown that cancer is a genetic disease caused by mutations in the cell genome. Cancer cells arise from the gradual accumulation of genomic mutations, including somatic mutations and copy number variations [8,9]. A single nucleotide variation (SNV) is the replacement of one base in the DNA sequence by another base [10]. Copy number variation refers to variation in the number of copies of large genomic regions in cancer [11]. SNVs and copy number variations play an important role in the occurrence and development of tumors. For example, a single-base G/T substitution in the HRAS gene can change the corresponding protein product, and the resulting dysregulation of the processes in which the protein is involved can cause cancer [10]. Copy number variation in the regions of tumor suppressor genes, such as BRCA1, CDKN2A and PTEN, can lead to low expression of these genes, which gives cancer cells a selective growth advantage [12,13]. In gastric cancer, copy number amplification of the cell growth regulatory gene HER2 can significantly promote the proliferation of gastric cancer cells [14]. Clinical studies have also shown that trastuzumab combined with chemotherapy can significantly improve the survival of gastric cancer patients with HER2 gene amplification [15]. This not only indicates the importance of genomic variation in the occurrence and development of gastric cancer, but also suggests the potential value of targeting genomic mutations in its treatment. Genomic mutations differ between cancer cells and between samples, which is known as tumor heterogeneity [16].
In the process of tumor occurrence and development, the clonal types of tumor cells continue to change because of genomic variation and other factors. These different tumor cell clones show significant differences in growth rate, invasive ability and immune features, and together they constitute the heterogeneity between cells in the tumor. These highly heterogeneous tumor cells lead to differential responses to various treatments [17,18]. Therefore, an in-depth understanding of tumor heterogeneity helps reveal the mechanism of tumor formation and evolution, which provides clues and a basis for the development of accurate diagnosis, prevention and treatment of tumors. Because of the high degree of spatio-temporal heterogeneity [19], there are still few targeted drugs for gastric cancer. HER2 remains the only therapeutic target with a clearly benefiting patient group in gastric cancer. Even when initial anti-HER2 therapy is effective, drug resistance mostly occurs within one year, which prevents patients from obtaining sustained benefit [20]. Therefore, in-depth exploration of the heterogeneity and evolutionary mechanism of gastric cancer will deepen the understanding of the disease and help to provide more treatment options.
In this study, the clone (subclone) composition of each tumor sample was inferred from the somatic mutation (SNV) and somatic copy number variation (SCNV) data in the genomic data of gastric adenocarcinoma (STAD) from TCGA. Using the tumor clonal phylogenetic data, the relationship between intratumoral heterogeneity and genomic instability was further evaluated. These features will deepen our understanding of gastric cancer and may serve as potential markers of tumor complexity.

Methods
Data download and preprocessing
The TCGA GDC API was used to download the latest clinical follow-up information on March 14, 2019 (S1_Table.txt). The clinical sample data of TCGA were preprocessed as follows: 1) samples with no clinical information or OS < 30 days were removed; 2) normal tissue sample data were removed. The mutation data were preprocessed as follows: 1) silent and intronic mutation sites were removed; 2) hyper-mutated samples, defined as samples with more than 11.4 mutations per Mb [21], were removed. The CNV data were preprocessed as follows: 1) segments with an interval > 500 kb were removed; 2) GENCODE v22 (GRCh38) was used to map the CNV intervals to the corresponding genes. The statistics of the preprocessed STAD data set are shown in Table 1.
Otherwise, they were classified as sub-clonal events. CCF refers to the cancer cell fraction. Specifically, for each mutated site (including CNV and SNV), the numbers of mutated reads and unmutated reads, the tumor purity and the local CNV were used to evaluate the probability density distribution of the CCF. In the first step, the tumor DNA proportion was calculated, and the probability of the allele fraction (AF) was then calculated according to the binomial probability distribution. In this step, the influence of normal cell components was removed and p(AF) was obtained. The second step was to integrate over all possible mutation multiplicities (m: 1 to the local absolute copy number) and to evaluate the probability of the CCF from p(AF) [22].
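The two-step CCF estimate described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the uniform grid over CCF, the equal weighting of multiplicities and the 0.95 threshold used in `clonal_probability` are assumptions made for illustration only, and the expected allele fraction uses the common purity-and-copy-number correction AF = purity * m * CCF / (purity * CN + 2 * (1 - purity)).

```python
import math

def ccf_posterior(alt_reads, ref_reads, purity, local_cn, grid_size=101):
    """Posterior over the cancer cell fraction (CCF) of one mutation on a uniform grid."""
    # Average number of DNA copies per cell at the locus (tumor plus diploid normal cells)
    denom = purity * local_cn + 2.0 * (1.0 - purity)
    if denom <= 0:
        raise ValueError("purity and local_cn leave no DNA at this locus")
    grid = [i / (grid_size - 1) for i in range(grid_size)]

    def log_lik(af):
        # Binomial log-likelihood; the binomial coefficient is constant and omitted
        if af <= 0.0:
            return 0.0 if alt_reads == 0 else -math.inf
        if af >= 1.0:
            return 0.0 if ref_reads == 0 else -math.inf
        return alt_reads * math.log(af) + ref_reads * math.log(1.0 - af)

    # One row of log-likelihoods per multiplicity m = 1 .. local absolute copy number
    rows = [[log_lik(min(purity * m * ccf / denom, 1.0)) for ccf in grid]
            for m in range(1, max(1, local_cn) + 1)]
    peak = max(max(row) for row in rows)
    # Sum over multiplicities (assumed equally likely), then normalize over the grid
    post = [sum(math.exp(row[i] - peak) for row in rows) for i in range(grid_size)]
    total = sum(post)
    return grid, [p / total for p in post]

def clonal_probability(grid, post, threshold=0.95):
    """Posterior mass above an assumed CCF threshold (0.95 here is illustrative)."""
    return sum(p for ccf, p in zip(grid, post) if ccf > threshold)

grid, post = ccf_posterior(alt_reads=20, ref_reads=80, purity=0.8, local_cn=1)
print(grid[max(range(len(post)), key=post.__getitem__)])
```

In this toy call the observed allele fraction of 0.2 at a single-copy locus with 80% purity corresponds to a CCF of about 0.3, so the mutation would be treated as sub-clonal under the assumed threshold.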

Analysis of mutation characteristics
Non-negative matrix factorization (NMF) is an unsupervised clustering method that is widely used to discover tumor molecular subtypes from genomic data [23][24]. To examine the mutation characteristics in STAD, NMF was used to cluster the samples based on SNV data, using the standard "brunet" algorithm with 50 iterations. The number of clusters k was varied from 2 to 10, the average silhouette width of the consensus membership matrix was calculated with the R package NMF [25], and the minimum number of members per subclass was set to 10. The cophenetic, dispersion and rss indexes were evaluated for k = 2-10, and the optimal number of clusters was selected according to these three indicators. To evaluate the heterogeneity of the mutant signatures, we calculated the contribution of each mutant signature to each sample and, using the 30 known mutational signatures provided by COSMIC (https://cancer.sanger.ac.uk/cosmic/signatures_v2), the similarity between our mutant signatures and the COSMIC mutational signatures [26].
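For readers unfamiliar with the "brunet" variant, the following Python sketch shows the multiplicative updates that minimize the Kullback-Leibler divergence between a count matrix V and the product WH. The study itself used the R package NMF; this stand-alone code, its function names, the random initialization and the toy matrix are assumptions for illustration, and the rank selection by cophenetic, dispersion and rss indexes is not reproduced.

```python
import math
import random

def nmf_brunet(V, k, n_iter=200, seed=0, eps=1e-12):
    """Factorize a non-negative matrix V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]

    def product(W, H):
        return [[sum(W[i][a] * H[a][j] for a in range(k)) + eps for j in range(m)]
                for i in range(n)]

    for _ in range(n_iter):
        WH = product(W, H)
        for a in range(k):  # update H with W held fixed
            col_sum = sum(W[i][a] for i in range(n)) + eps
            for j in range(m):
                H[a][j] *= sum(W[i][a] * V[i][j] / WH[i][j] for i in range(n)) / col_sum
        WH = product(W, H)
        for a in range(k):  # update W with the new H held fixed
            row_sum = sum(H[a][j] for j in range(m)) + eps
            for i in range(n):
                W[i][a] *= sum(H[a][j] * V[i][j] / WH[i][j] for j in range(m)) / row_sum
    return W, H

def kl_divergence(V, W, H, eps=1e-12):
    """Generalized KL divergence between V and WH, the objective of the updates."""
    total = 0.0
    for i, row in enumerate(V):
        for j, v in enumerate(row):
            wh = sum(W[i][a] * H[a][j] for a in range(len(H))) + eps
            total += (v * math.log(v / wh) if v > 0 else 0.0) - v + wh
    return total

# Toy matrix: rows stand for mutation categories, columns for samples (made-up counts)
V = [[10, 0, 5], [8, 1, 4], [0, 9, 1], [1, 7, 0], [5, 5, 3]]
W, H = nmf_brunet(V, k=2)
print(round(kl_divergence(V, W, H), 4))
```

In a signature analysis, the columns of W would play the role of mutation signatures and the rows of H the contribution of each signature to each sample.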

Analysis of genomic variation in cloning and subcloning
The copy number variation of STAD was evaluated with the ABSOLUTE algorithm. First, the CNVs obtained by ABSOLUTE were screened, and the CNV intervals meeting the following conditions were retained: 1) modal CN < 2 (Loss) or modal CN > 2 (Gain); 2) CNV interval < 0.5 Mb. The SCNAs were then mapped to genes using the coordinates of GENCODE v22, and the relationship between clonal and sub-clonal genes and CNV was analyzed.
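The screening and gene-mapping step can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: the tuple layouts, function names, gene names and coordinates are hypothetical, and real ABSOLUTE output and GENCODE annotation would need to be parsed into this form first.

```python
def classify_segments(segments, max_len=500_000):
    """Keep segments shorter than max_len whose modal copy number differs from 2.

    segments: iterable of (chrom, start, end, modal_cn) tuples (assumed layout).
    """
    kept = []
    for chrom, start, end, modal_cn in segments:
        if modal_cn == 2 or (end - start) >= max_len:
            continue  # copy-neutral or too long
        kept.append((chrom, start, end, "Gain" if modal_cn > 2 else "Loss"))
    return kept

def map_to_genes(kept, genes):
    """Assign each retained segment to every gene interval it overlaps.

    genes: iterable of (name, chrom, start, end) tuples (assumed layout).
    Returns {gene name: set of "Gain"/"Loss" labels}.
    """
    hits = {}
    for chrom, start, end, label in kept:
        for name, g_chrom, g_start, g_end in genes:
            if g_chrom == chrom and g_start <= end and start <= g_end:
                hits.setdefault(name, set()).add(label)
    return hits

# Hypothetical segments and gene intervals, for illustration only
segments = [("chr17", 1_000, 200_000, 4), ("chr17", 300_000, 2_000_000, 5),
            ("chr8", 500, 9_000, 2), ("chr8", 20_000, 40_000, 1)]
genes = [("GENE_A", "chr17", 150_000, 160_000), ("GENE_B", "chr17", 400_000, 410_000),
         ("GENE_C", "chr8", 30_000, 35_000)]
print(map_to_genes(classify_segments(segments), genes))
```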

Analysis of temporal sequence relationship between mutation and tumor evolution
Using the clonal and sub-clonal events of each sample, the possible timing of mutations in tumor evolution was reconstructed. When a clonal event and a sub-clonal event appeared in the same sample, an edge was established between the two genes. All samples were analyzed in the same way, and a directed gene network was finally obtained. Each node of the network represented a gene, and each edge indicated a clonal and sub-clonal relationship between two genes.
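A minimal sketch of this network construction is given below. It assumes that each edge points from the clonal gene to the sub-clonal gene and that every clonal/sub-clonal gene pair in a sample contributes one edge; the function names and the toy gene labels are hypothetical.

```python
from collections import Counter
from itertools import product

def build_timing_network(samples):
    """samples: {sample_id: (clonal_genes, subclonal_genes)}.

    Returns a Counter mapping directed edges (clonal gene, sub-clonal gene)
    to the number of samples supporting them.
    """
    edges = Counter()
    for clonal, subclonal in samples.values():
        for early, late in product(set(clonal), set(subclonal)):
            if early != late:
                edges[(early, late)] += 1
    return edges

def degree_counts(edges, min_support=1):
    """Sum edge weights into out-edges and in-edges per gene, ignoring weak edges."""
    out_deg, in_deg = Counter(), Counter()
    for (early, late), n in edges.items():
        if n >= min_support:
            out_deg[early] += n
            in_deg[late] += n
    return out_deg, in_deg

# Toy input with placeholder gene labels
samples = {
    "s1": ({"geneA"}, {"geneC", "geneD"}),
    "s2": ({"geneA", "geneB"}, {"geneC"}),
    "s3": ({"geneC"}, {"geneA"}),
}
edges = build_timing_network(samples)
out_deg, in_deg = degree_counts(edges)
print(out_deg["geneA"], in_deg["geneA"])
```

A gene with many more out-edges than in-edges tends to be clonal when its partners are sub-clonal, which is the intuition behind calling it an early gene.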
Enrichment analysis was performed according to the number of in-edges and out-edges of each node (gene); Fisher's exact test was used to assess significance, and the Benjamini-Hochberg (BH) method was used to calculate the FDR.
For SSNV and SCNV, the nodes (genes) with FDR < 0.05, 0.2 and out-edges > in-edges were defined as early genes; similarly, the nodes with FDR < 0.05, 0.2 and in-edges > out-edges were defined as late genes; and the genes of other cases were defined as intermediate genes.
Because the genes affected by SCNA were determined from the CNV intervals and the GFF gene intervals of the chip data, some SCNA genes might be false positives and could affect the SSNV results; we therefore inferred the temporal order of SCNA and SSNV separately. To simplify the display, we removed some conflicting edges and finally obtained SCNV pairs and SNV pairs.
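The FDR correction and the early/intermediate/late rule can be sketched in Python as follows. The per-gene p-values are taken as given (the construction of the contingency table for Fisher's exact test is not specified in the text, so it is not reproduced), and the FDR cutoff is left as a parameter because two values (0.05 and 0.2) are listed; the function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def classify_gene(fdr, out_edges, in_edges, cutoff=0.05):
    """Label a gene from its FDR and edge counts; the cutoff is an assumed parameter."""
    if fdr < cutoff and out_edges > in_edges:
        return "early"
    if fdr < cutoff and in_edges > out_edges:
        return "late"
    return "intermediate"

fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(x, 4) for x in fdr])
print(classify_gene(fdr[0], out_edges=12, in_edges=3))
```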

Statistical analysis
The Kaplan-Meier method was used to plot survival curves for the subgroups in each data set, and the log-rank test was used to determine the statistical significance of the differences, with significance defined as P < 0.05. The chi-square test was used to test the significance of sample overlap between clinical features and clonal or sub-clonal events. The Benjamini-Hochberg method was used to convert P values to FDR. All of the above analyses were performed in R 3.5.1. Unless otherwise stated, *** means p < 1e-5, ** means p < 0.01, and * means p < 0.05.
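As an illustration of the survival statistics named above, the sketch below implements the Kaplan-Meier product-limit estimator and a two-group log-rank test in plain Python. The analyses in this study were run in R; the function names and the toy follow-up times here are assumptions, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the patient died at times[i] and 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk, survival, curve, i = len(data), 1.0, [], 0
    while i < len(data):
        t, deaths, removed = data[i][0], 0, 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted({t for t, e in zip(times_a, events_a) if e} |
                         {t for t, e in zip(times_b, events_b) if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df

# Toy follow-up times (arbitrary units) for two hypothetical groups
print(kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0]))
print(logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [10, 11, 12, 13], [1, 1, 1, 0]))
```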

Results
Evaluation of samples' purity and ploidy
The purity and ploidy of the STAD samples calculated by the above method are shown in S2_Table. The CCF of SSNV is demonstrated in S3 _

Genomic mutant signature analysis
Mutational signatures can reflect the potential effects of previous exposure to different carcinogens, as well as some characteristic changes associated with DNA damage and repair in STAD tumors. Here the Brunet algorithm in NMF was used to identify the SNV signatures. To ensure that the optimal number of SNV signatures could be identified, we evaluated the cophenetic and rss indexes for k = 2-10 (that is, 2-10 SNV signatures). According to these two indexes, k = 3 (that is, three SNV signatures) was chosen as the optimal number (S1_Fig). According to the trinucleotide mutation pattern, three SNV signatures were obtained, which were defined as Signature A-C.
According to the base mutation pattern, signature A was mainly composed of "C > T", while signature B was mainly composed of "C > A", "C > G" and "C > T", and the "C > G" mutation pattern appeared only in signature B. Signature C mainly consisted of "T > G" (Fig. 1A). SNVs were divided into clonal events and sub-clonal events based on CCF. No significant difference was observed in the contribution of the two types of SSNV to the three mutational signatures (S2 Fig A), showing that clonal events and sub-clonal events had similar mutation patterns.

Identification and distribution of mutant signatures
To evaluate the heterogeneity of the mutant signatures, the contributions of signatures A-C were calculated in each sample (the larger the contribution, the higher the proportion of the signature in the sample). Signature A accounted for a large proportion in most samples, while signatures B and C accounted for a high proportion only in specific samples (Fig. 1B). Using the 30 known mutational signatures provided by COSMIC, we calculated the cosine similarity between the three signatures and the COSMIC mutational signatures (expressed as a correlation coefficient), and found that signature B had high similarity with signature 3 and signature 13, and signature C with signature 17 (Fig. 1C).
The similarity between signature A and signature 1 was the strongest (S2 Fig B). Signature
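The comparison with the COSMIC signatures relies on cosine similarity, which can be sketched as follows. The four-channel vectors and the reference names are made up for illustration (real signatures have 96 trinucleotide channels), and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-negative mutation-probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_matches(signature, reference, top=2):
    """Rank reference signatures by cosine similarity to one extracted signature."""
    scores = {name: cosine_similarity(signature, profile)
              for name, profile in reference.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Made-up profiles over four mutation channels, for illustration only
reference = {
    "ref_1": [0.70, 0.10, 0.10, 0.10],
    "ref_2": [0.10, 0.10, 0.10, 0.70],
    "ref_3": [0.25, 0.25, 0.25, 0.25],
}
extracted = [0.65, 0.15, 0.10, 0.10]
print(best_matches(extracted, reference))
```

A value close to 1 means the extracted signature and the reference signature distribute mutations across the channels in nearly the same proportions.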

Variation analysis of cloning and subcloning genomes
The clonal/sub-clonal event data of SCNV and SSNV were integrated, and the clonal and sub-clonal structures of the STAD samples were analyzed. The SCNV and SSNV genes altered in more than 5% of all samples were selected, yielding the 46 most frequent SCNV genes and 101 SSNV genes (Fig. 2, S5_Table). The results showed that TP53, TTN and MUC16 were the most frequently mutated genes in the samples (> 20%), and most of these mutations were clonal events (S3_Fig, enrichment p < 0.05, S6_Table), indicating that mutations in these genes were more likely to occur as early events. Clonal and sub-clonal mutations in common cancer genes such as PIK3CA and CDH1 were relatively infrequent (< 10%). MIEN1, GRB7 and PNMT had the most frequent CNV gains in the samples (> 10%), while CNVs in ERBB2, MYC, KRAS and HRAS were less frequent (6%-10%, S4_Fig).

Analysis of temporal sequence relationship between mutation and tumor evolution
To analyze the mutations involved in the occurrence and development of STAD, the 46 SCNA and 101 SSNV genes with the highest mutation frequency were sorted according to CCF value (Fig. 3A, S5_FigA-B). Overall, the CCF of SCNV was significantly higher than that of SNV (rank test p < 1e-5, mean CCF: 0.9287/0.9003). Gains made up the majority of SCNV events, and losses accounted for a very small proportion (Gain/Loss: 1214/3).
To simplify the display, only the gene pairs connected by at least two edges (edges >= 2) were retained, and finally 369 SCNA pairs (S7_Table) and 119 SSNV pairs (S8_Table, S10_Table) were obtained. Five early SCNA genes and eight early SSNV genes (S9_Table, S10_Table) were obtained by edge enrichment analysis. In the temporal order results of SSNV, TP53, USH2A and GLI3 appeared the earliest in STAD and could be driver events of STAD. On the other hand, CTNNB1, LRP1B and ERBB4 appeared the latest in STAD and might be related to the progression of STAD (Fig. 3B). The results for all SSNV edges are shown in S5_FigC (edges were not filtered). In the temporal order results of SCNA, MYC, KRT14 and KRT16 were defined as early and intermediate genes, and KRAS, ERBB2 and CCNE1 were late genes (Fig. 3C). For the results of all SCNV edges, see S5_FigD (edges were not filtered).

Relationship between cloning or sub-cloning events and prognosis
To study the effect of clonal or sub-clonal events on the survival of patients, the Kaplan-Meier method was used to analyze the relationship between clonal status and overall survival for the 46 high-frequency SCNA genes and 101 high-frequency SSNV genes (mutated in > 5% of samples). With a log-rank test threshold of p < 0.1, five early genes (Fig. 4A-B, S6_Fig) that were clearly related to overall survival, 12 intermediate genes (Fig. 4C-D, S7_Fig) and eight late genes (Fig. 4E-F

Relationship between cloning or sub-cloning events and clinical characteristics
Based on the method described above, the clonal events of SCNA and SSNV were obtained. The relationship between clonal and sub-clonal events and clinical characteristics was analyzed with the clinical information provided by TCGA. The differences in clonal/sub-clonal events across TNM, stage, age, gender and tissue type were analyzed. The results showed significant differences in the number of sub-clonal events across T stages (Fig. 5A). N stage, gender and tissue type showed significant differences in clonal events (Fig. 5B-D). The risk of gastric adenocarcinoma is higher in males than in females, which is consistent with our observation that the number of clonal events was significantly higher in males than in females. The number of mutations in the papillary and tubular tissue types was significantly higher than that in other types. No significant difference in clonal/sub-clonal events was observed across M, stage, age and grade (S9_Fig).

Relationship between cloning or sub-cloning events and TMB/Neoantigens
Tumor mutation burden (TMB) and neoantigens are important biomarkers in immune checkpoint therapy, and the appearance of clonal/sub-clonal events also has an important effect on the occurrence and progression of tumors. Therefore, the relationship between clonal/sub-clonal events and TMB and neoantigens was analyzed. Because the distributions of TMB, neoantigens and clonal/sub-clonal events did not conform to a normal distribution (Shapiro-Wilk test p < 1e-5), the Spearman method was used to evaluate the correlations among them. According to the significance test, clonal events were highly significantly correlated with TMB and neoantigens (Fig. 6A-C), whereas the correlation between sub-clonal events and TMB and neoantigens was weak (Fig. 6D), which suggests that the emergence of clonal events contributes greatly to tumor mutation burden and neoantigen production. Mutations in mismatch repair (MMR) genes have an important effect on the mutation burden of the genome. The clonal/sub-clonal difference between samples with MMR mutations (Mut) and samples without them (WT) was further evaluated. The number of clonal/sub-clonal events in the Mut group was higher than that in the WT group (Fig. 6E), but the difference was not significant, which might be related to the smaller sample size of the Mut group. TMB and neoantigens in the Mut group were significantly higher than those in the WT group (Fig. 6F), but there was no significant difference in OS between the two groups (Fig. 6G), indicating that although abnormality of the mismatch repair system has an important effect on genomic stability, its relationship with prognosis is complicated.
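Because the Spearman method is a Pearson correlation computed on ranks, it can be illustrated with a short Python sketch. The function names and the toy counts are hypothetical; ties receive average ranks, which is the usual convention.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Made-up per-sample values: number of clonal events and TMB
clonal_events = [12, 30, 45, 80, 150]
tmb = [1.1, 2.0, 2.4, 5.3, 9.8]
print(spearman_rho(clonal_events, tmb))
```

Working on ranks makes the measure insensitive to the skewed, non-normal distributions reported for TMB, neoantigens and event counts.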

Discussion
Gastric cancer is one of the most common cancers in the world and the second leading cause of cancer death [27]. The occurrence of gastric cancer is affected by both environmental and genetic factors. Environmental risk factors for gastric cancer include a high-salt diet, smoking and Helicobacter pylori infection. Because of its complex etiology and anatomical structure, gastric cancer (GC) is highly heterogeneous clinically and pathologically [19], and its 5-year survival rate varies greatly between regions [28,29]. The standard treatment regimen of surgery combined with chemotherapy largely ignores the heterogeneity and features of gastric cancer. Therefore, there is an urgent need to explore biomarkers based on the heterogeneity of gastric cancer in order to develop more effective treatments.
Existing studies have shown that cancer is mainly caused by genomic mutations. Identifying mutations in the cancer genome helps in understanding cancer pathogenesis and in developing clinical treatments. In recent years, researchers have been studying the molecular mechanism of gastric cancer. In particular, second-generation sequencing, as a high-throughput method, can systematically identify genomic variations in the cancer genome [30]. A number of studies have been carried out on the genomic variation of gastric cancer. For example, studies have reported mutations in genes such as TP53, PIK3CA, CTNNB1 and CDH1 in gastric cancer [31]. In addition, studies have shown that 59% of gastric cancers have mutations in chromatin remodeling genes such as ARID1A, PBRM1 and SETD2 [32]. These studies suggest that the complexity of genomic variation in gastric cancer needs to be fully explored. As two important sources of genomic variation, SNVs and copy number variations play an important role in gastric cancer. This study analyzed the SNV and CNA data in the TCGA gastric cancer dataset and divided the mutations in the TCGA gastric cancer samples into three signatures with significant differences; comparison with the COSMIC database indicated that they were highly similar to known signatures.
These results deepen our understanding of gastric cancer mutations.
Tumorigenesis is a dynamic evolutionary process. Tumor heterogeneity arises from the evolution of different subclones during tumor evolution [17]. In the process of tumor occurrence and development, the clonal types of tumor cells continue to change because of the accumulation of mutations [33]. According to the cancer cell fraction (CCF), the tumor cell population can be divided into clones and subclones. Studies on the clonal evolution model of tumor cells and the factors that influence clonality help to improve the understanding of tumor heterogeneity, which is helpful for individualized clinical therapy [34]. This study explored the evolutionary models of clones and subclones in gastric cancer. First, the clonal/sub-clonal state of STAD mutations was identified by clonal/sub-clonal event analysis of SCNA and SSNV. In addition, there was no significant difference in clonal/sub-clonal status among the three mutant signatures identified in this paper. Furthermore, by selecting the genes mutated in more than 5% of the samples, it was found that TP53, TTN and MUC16 were the most frequently mutated genes (> 20%), and most of their mutations were clonal events.
Ultimately, a series of early-stage genes, such as TP53, USH2A and GLI3, and advanced-stage genes, such as CTNNB1, LRP1B and ERBB4, were identified by analyzing the relationship between clonal state and tumor evolution. Based on the above results, this paper revealed the clonal status of gastric cancer and identified a series of genes related to the evolution of gastric cancer.
Another important part of this article is the exploration of the relationship between clonal status and the prognosis and clinical features of patients with gastric cancer. Clonal and sub-clonal status plays an important role in tumor progression [35], but its relationship with the outcome of gastric cancer patients is not clear. In this study, 5 early-stage genes, 12 intermediate-stage genes and 8 advanced-stage genes were identified whose clonal/sub-clonal state was significantly related to the overall survival of patients. The relationship between clonal/sub-clonal events and clinicopathological parameters is more complex. T stage showed a significant difference in the number of sub-clonal events, while N stage, gender and tissue type showed significant differences in the number of clonal events. No significant differences were observed for M, stage, age and grade. Finally, this paper focused on tumor mutation burden (TMB) and neoantigens, two important biomarkers in immune checkpoint therapy [36]. The results showed a significant correlation between clonal/sub-clonal events and TMB/neoantigens, indicating that the emergence of clonal/sub-clonal events contributes substantially to TMB and neoantigens. The above results show the relationship between clonal status and the clinical characteristics and prognosis of patients with gastric cancer, and also suggest the potential of clonal status in immunotherapy.

Conclusion
In summary, this study collected genomic data of gastric adenocarcinoma (STAD) from TCGA to infer the clone (subclone) composition of each tumor sample. The relationship between intratumoral heterogeneity and genomic instability was evaluated using the tumor clonal phylogenetic data. A series of molecular features were screened and identified as potential markers of the heterogeneity of gastric cancer, which is of great significance for the personalized treatment of gastric cancer.

Consent for publication
Not applicable.

Availability of data and material
The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
All authors declare no conflict of interest.

Funding
This study was supported by the Fund for Shanxi "1331 Project" Key Innovative Research Team and College Students' Innovation Project (No. D2018385).

Authors' contributions
CXR, CLW, NNW, CHL, and CQY contributed to the conception and design of the study. CXR collected and analysed the data. DY, CLW and NNW wrote the manuscript. All authors have read and approved the manuscript.