Study of the co-expression gene modules of non-small cell lung cancer metastases

Background: Metastasis regularly is a marker of the disease development of cancers. Some metastatic sites significantly showed a more serious clinical outcome in non-small cell lung cancer (NSCLC). Whether they are caused by tissue-specific (TS) or non-tissue-specific (NTS) mechanisms is still unclear. Methods: Weighted Correlation Network Analysis (WGCNA) was used to identify the gene modules among the metastases of NSCLC. The clinical significance of those gene modules was evaluated with the Cox hazard proportional model with another independent dataset. Functions of each gene module were analyzed with gene ontology. Typical genes were further studied. Results: There were two TS gene modules and two NTS gene modules identified. One TS gene module (green module) and one NTS gene module (purple module) significantly correlated with survival. This NTS gene module (purple module) was significantly enriched in the epithelial-to-mesenchymal transition (EMT) process. Higher expression of the typical genes ( CA14 , SOX10 , TWIST1, and ALX1 ) from EMT process was significantly associated with a worse survival. Conclusion: The lethality of NSCLC metastases was caused by TS gene modules and NTS gene modules, among which the EMT-related gene module was critical for a worse clinical outcome.

3 outcome (Bauml, Mick et al. 2013, Riihimaki, Hemminki et al. 2014. Riihimaki et al. (Riihimaki, Hemminki et al. 2014) have found that liver and bone metastases caused a worse survival time, and the respiratory and nervous system metastases showed a better survival time. Gibson et al. (Gibson, Li et al. 2018) have found that liver metastasis had worse survival and adrenal metastasis had the best survival. Tamura et al. (Tamura, Kurishima et al. 2015) have found that liver and adrenal gland metastasis had bad survival and brain and bone metastases have better survival. Those studies conflicted with each other for some metastatic sites. Commonly, liver metastasis had bad survival.
As for now, the understanding of the mechanism of the lethality difference within the metastases of lung cancer is still unclear. Only a limited number of studies in genomic variation analysis, especially in some driver genes, were reported. For examples, EGFR mutation could be an important marker for clinical outcome (Hsu, De Caluwe et al. 2017, Li, Zhao et al. 2019. The median survival time for the EGFR + group and EGFR was 22.4 months and 7.9 months, respectively (Hsu, De Caluwe et al. 2017).
However, in the EGFR + cohort, the brain showed the highest risk ratio, which was contrary to a previous study (Riihimaki, Hemminki et al. 2014). Besides, there were only 22.3% of the patients tested as EGFR + (Hsu, De Caluwe et al. 2017), indicating that there must be other mechanisms involved.
In spite of the success in studying such driver genes, there are many unresolved questions. For examples, which mechanism is most important for the survival? Is there shared mechanism among these metastases? Is there a tissue-specific mechanism among these metastases?". In order to solve these problems, a global transcriptome study would be necessary.
In this study, the Weighted Correlation Network Analysis (WGCNA) was used to identify the gene modules in NSCLC. The tissue-specificity and function of those gene modules were studied. Further using an independant dataset, the clinical outcomes of these modules were evaluated. Specifically, some typical genes were carefully examined.

Result
Prepare data for WGCNA analysis By searching in the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) 4 with keywords of "lung cancer AND metastasis" and further checking manually, only one dataset, GSE12630, which contains 12 metastatic samples of lung cancer, was screened (Suppl. Table 1) and then the CEL files for those samples were downloaded. The expression was called with the affy package (Gautier, Cope et al. 2004) from Bioconductor on R platform and normalized with the Robust Multi-array Average (RMA) method (Bolstad, Irizarry et al. 2003). Low variable probes were removed according to the interquartile range (<0.5). In total, 3533 probes were kept for further analysis.

WGCNA analysis for the gene modules
In order to get the co-expression gene modules, the WGCNA analysis was conducted with the 12 samples. As denoted by the analysis of the scale-free topology model fit, a soft power threshold at 37 showed a better performance. Unsigned Pearson's correlation coefficient between genes was used to generate the adjacency matrix. The topological overlap was calculated to construct the co-expression network. Each gene was then hierarchically clustered according to the topological distance in the coexpression network ( Figure 2). The hierarchical tree was dynamically cut. Nine gene modules were inferred. The gene modules with high eigengene similarity (>0.75) were merged. Finally, they were merged into four gene modules. For convenience, they were named as black, green, yellow and purple modules, which contained 157, 749, 131, and 47 genes respectively.

Tissue specificity and functions of the gene modules
To characterize the gene modules, the eigengene was used as a surrogate measure for each gene module. A Pearson's correlation coefficient between those eigengenes for each gene modules and the tissue type vector was calculated. As shown in (Figure 3), the green module was highly expressed in the lymph node (r = 0.67, p-value = 0.02) and expressed lower in the adrenal gland (r = -0.52, pvalue = 0.08); the yellow module was highly expressed in the intestine tissue (r = 0.87, p-value = 0.0001). The purple and black gene modules showed no significant correlation with any tissue types.
Those results showed that there were tissue-specific and non-tissue-specific gene modules in NSCLC metastases.
Next, the function of the four gene modules in the biological process was examined with gene ontology. The yellow module was enriched with "extracellular matrix organization" and "cell 5 adhesion" (Suppl. Figure 1A); the black module was enriched with "immune response", "defense response", "innate immune response" and "regulation of T cell activation" (Suppl. Figure 1B); the green module was enriched with "hormone-mediated signaling pathway", "cell communication" and "cellular response to endogenous stimulus" (Suppl. Figure 1C); and the purple module was enriched with "neural crest cell migration", "mesenchyme development" and "stem cell differentiation" ( Figure   4).

EMT process was critical for lethality caused by metastasis
Except for the biological function of gene modules identified by WGCNA analysis, we also studied their clinical association. Without loss of generality, another dataset, GSE14814, was used to validate the gene modules. That dataset had recorded survival/censor time, which was used as a dependent variable in a Cox hazard proportional model ( Table 1). The eigengene expressions of those gene modules was dichotomized by the median. They were used as independent variables in the model. As denoted by the log-likelihood ratio test, the model was significant with p-value = 0.04. Specifically, the green and purple modules were significantly correlated with survival. The p-values were 0.0069 and 0.036, respectively. The purple module had a hazard ratio equal to 2.20 and a 95% confient

Clinical significance of the typical genes from the purple module
Many studies (Nouri, Ratther et al. 2014, Puisieux, Brabletz et al. 2014, Shen, Xu et al. 2019 shown that EMT was important for the metastasis process, but there were no studies about the clinical association with metastasis. In order to study the clinical significance of the EMT process, the hub genes and EMT related genes from the purple module were further validated in a wider dataset provided by the web service (Monzon, Lyons-Weiler et al. 2009) (Figure 5). The hub gene was defined as the gene which had the highest correlation with the eigengene of the gene modules. CA14 was the hub gene of the purple module. SOX10, ALX1, and TWIST1 were the genes involved in the EMT in the purple module. Using the median as the cutoff, their Kaplan-Meier curves were plotted as in Figure 5.
Higher expression of these genes showed a significantly shorter survival time with the p-values of 6 0.0023, 0.037, 3.7e-5 and 3.7e-3, respectively. The results further proved the lethality of the purple module and the EMT process for the lung cancer metastasis.

Discussion And Conclusion
Lung metastasis could severely affect the clinical outcome (Riihimaki, Hemminki et al. 2014).
Different metastatic sites could cause significantly different survival (Riihimaki, Hemminki et al. 2014, Tamura, Kurishima et al. 2015, Hsu, De Caluwe et al. 2017, Gibson, Li et al. 2018. Those studies have found both common and different results. To resolve those discrepencies, it is urgent to determine the background molecular mechanisms. Compared to the genomic variations, the expression profile can reflect the biological process in those metastasis tissues more directly when considering the difficulties in the evaluation of the effects of genomic variations. Due to the difficult availability of metastatic tissues, there have been only a few studies on the expression profiles of lung cancer metastasis up to now. After searching for the GEO database, only one dataset containing multiple metastatic tissues has been collected. That dataset only contains metastatic sites of the adrenal gland, lymphoid, skin and omentum tissues. There are no expression profiles from the liver, which is the most lethal metastasis (Riihimaki, Hemminki et al. 2014, Tamura, Kurishima et al. 2015, Gibson, Li et al. 2018. Limited by the data availability, it is hard to know whether the liver has a tissue-specific module or non-tissue-specific module. Although there is a study which contains only a single metastatic site of brain tissue (Luke, Blazquez et al. 2018), to avoid possible error from multiple experiments, it was not included in this study. We look forward to more efforts in this field of research.
In order to infer the gene modules that are tissue-specific and non-tissue-specific, unsupervised clustering is needed. In fact, in considering the high heterogeneity of lung cancer metastasis, Student's t-test in combination with supervised clustering could always be futile. WGCNA has the advantage to cluster all genes into separate gene modules without the knowledge of the tissue types.
In this study, we have successfully identified four gene modules. The function of those gene modules was studied with the tissue-specificity and the gene ontology of biological process. As we have hypothesized, two gene modules (green and yellow) are correlated with the tissue types and two other gene modules (black and purple) were not correlated with any tissue types.
The green module was correlated with lymphoid metastasis tissue. It was enriched with "cell communication". The typical gene was Wnt10B, which was reported to be associated with lymph node metastasis of gastric cancer (Wu, Bie et al. 2017). However, there is also a report in endometrial cancer that high expression of Wnt10B is prone to not have lymphoid metastasis (Chen, Wang et al. 2013). Those results suggest that the expression of Wnt10B is critical for lymphoid metastasis. The green gene module could be tissue-specific for lymphoid metastasis of pan-cancer.
Similarly, the yellow module was correlated with intestine metastasis. It was enriched with "extracellular matrix organization" and "cell adhesion". The typical genes were MMP2, COL1A1, and ECM2. Currently, there are rarely studies in the intestine metastasis of lung cancer, but they have appeared in the metastasis of other cancers, such as, breast cancer (Mendes, Kim et al. 2007), ovarian cancer (Kenny, Kaur et al. 2008) and lung cancer (Rojiani, Alidina et al. 2010).
The black gene module was non-tissue-specific and enriched with"immune response", "defense response", "innate immune response" and "regulation of T cell activation". The black module reflects the function of immune microenvironment. It is a general mechanism in all types of cancers. For example, the inflammatory chemokine C-C motif ligand 4 (CCL4) is an "innate immune response"related gene. CCL4 can promote the lymphangiogenesis in oral squamous cell carcinoma (OSCC), which is a critical step in tumor metastasis (Lien, Tsai et al. 2018). In breast cancer, CCL4 can contribute to the metastasis to the bone by binding to CC chemokine receptor type 5 (Sasaki, Baba et al. 2016). The genes in the "regulation of T cell activation" category, HLA-DR or CD4, also play a critical role in the metastasis in the metastasis of melanoma (Costantini and Barbieri 2017), breast cancer (DeNardo, Barreto et al. 2009), and prostate cancer (Liu, Guo et al. 2018).
The most interesting module was the purple module, which was also a non-tissue-specific gene module and had a significant effect on the survival. The purple module was enriched with"neural crest cell migration", "mesenchyme development" and "stem cell differentiation". Generally, for cancer metastasis, the nervous system plays a big part of metastatic initiation and clonal expansion, angiogenesis to extravasation and colonization (Kuol, Stojanovska et al. 2018). The typical genes from it are SOX10, TWIST1, and ALX1. Those genes also take part in the EMT process. EMT is a process whereby epithelial cells are transformed into mesenchymal cells (Kalluri and Weinberg 2009). After the EMT process, the transited cells gain migratory/invasive properties (Bonnomet, Syne et al. 2012).
This study proves that EMT could be the most important signature for the severity of lung cancer metastasis.
In summary, in this article, we have studied the gene co-expression modules in the metastatic tissues of lung cancer. Four gene modules were identified, from which two gene modules were tissue-specific and the other two gene modules were non-tissue-specific. Specifically, one of the non-tissue-specific gene modules was significantly correlated with survival. The function of that gene module demonstrates that the EMT process plays a big role in the metastasis process of lung cancer and survival.

Data collections
Two datasets, GSE12630 (Monzon, Lyons-Weiler et al. 2009) and GSE14814 (Zhu, Ding et al. 2010), were downloaded from the Gene Expression Omnibus (GEO) website (http://www.ncbi.nlm.nih.gov/geo/). For better comparison, only the sample data generated on the Affymetrix U133A microarray platform were chosen as candidates. The samples for primary or metastatic lung cancer were manually curated from dataset GSE12630. The raw CEL files from both datasets were processed by affy package on R platform. The expression was further normalized with the Robust Multi-array Average (RMA) method (Bolstad, Irizarry et al. 2003). The suboptimal probes with the name suffix "s_at" and "x_at" were removed from the analysis. To better fit the normal distribution, the expression was log2-transformed. The annotation package for U133A was downloaded from Bioconductor (http://bioconductor.org). For dataset GSE12630, the low variant probes were filtered according to the interquartile range (<0.5).

WGCNA analysis
WGCNA analysis was conducted on the R platform (Langfelder and Horvath 2008). Briefly, the scalefree topology model fit was used to choose the soft power. An unsigned correlation distance defined as "(1-r)/2" was calculated, where r is the Pearson's correlation coefficient. The adjacency matrix was generated according to the unsigned correlation distance matrix with the soft power. The adjacency correlation was then used to construct the topological network. The genes were hierarchically clustered with the topological distance. Then, the tree was dynamically cut into multiple gene modules with the parameter "minClusterSize = 30, deepSplit = 2, pamRespectsDendro = FALSE".

Gene function analysis
The biological processes of gene ontology were used to enrich the gene modules with the DAVID web service (Huang da, Sherman et al. 2009). The bar plot is generated on the R platform.

Survival analysis
The WGCNA modules were selected by survival analysis of their eigengenes. The eigengenes were dichotomized by the median. The expression over the median will be defined as high expression; the expression below the median will be defined as low expression. The gene modules were regressed against the survival time with Cox proportional hazards model with R package "survival" (Therneau 2015). The log-likelihood test was used to test the significance. The modules with higher survival probability were further enriched with gene ontology. The typical genes were validated in a large dataset with the web service (Monzon, Lyons-Weiler et al. 2009). The median value was used as a cutoff to categorize the gene expression into high and low. The survival time was plotted against survival probability with the Kaplan-Meier curve. The log-likelihood ratio test was used to compare their different significance.

Abbreviations
NSCLC non-small cell lung cancer TS tissue-specific NTS non-tissue-specific WGCNA Weighted Correlation Network Analysis EMT epithelial-to-mesenchymal transition GEO Gene Expression Omnibus RMA Robust Multi-array Average CCL4 C-C motif ligand 4 OSCC:oral squamous cell carcinoma.

Ethics approval and consent to participate
Yes.

Consent for publication
Yes.

Availability of data and material
Yes.  Tables   Table 1. Study of module eigengenes by survival analysis The study was conducted in a dataset consisting of 133 samples. The log-likelihood ratio test showed a significant correlation between the gene modules and the survival time. The green and purple gene modules were the two significant gene modules with hazard ratios, 0.45 and 2.76, respectively.

Competing interests
Suppl. Table 1 The data used in this study

Figure 1
The workflow of this study 20 Figure 2 WGCNA analysis of the metastasis tissues 21 Figure 3 The relationship between modules and tissues