Classification prediction of early pulmonary nodes based on weighted gene correlation network analysis and machine learning

To use weighted gene correlation network analysis (WGCNA) and machine learning algorithm to predict classification of early pulmonary nodes with public databases. The expression data and clinical data of lung cancer patients were firstly extracted from public database (GTEx and TCGA) to study the differentially expressed genes (DEGs) of lung adenocarcinoma (LUAD). The intersection of three R packages (Dseq2, Limma, EdgeR) methods were selected as candidate DEGs for further study. WGCNA was used to obtain relevant modules and key genes of lung cancer classification, GO and KEGG enrichment analysis was performed. The model was built using two machine learning methods, Least Absolute Shrinkage and Selection Operator (LASSO) regression and tumor classification was also predicted with extreme Gradient Boosting (XGBoost) algorithm. DEGs analysis revealed that there were 1306 LUAD genes. WGCNA module analysis showed that a total of 116 genes were significantly related to classification, and module genes were mainly related to 14 KEGG pathways. The machine learning algorithm identified 10 target genes by LASSO regression analysis of differential genes, and 18 genes were identified by XGBoost model. A total of 6 genes were found from the intersection of the above methods as classification signatures of early pulmonary nodules, including “HMGB3” “ARHGAP6” “TCF21” “FCN3” “COL6A6” “GOLM1”. Using DEGs analysis, WGCNA method and machine learning algorithm, six gene signatures related to early stage of LUAD, which can assist clinicians in disease classification prediction.


Introduction
Lung cancer was one of the most common malignancies, in recent years, advanced treatment have been springing up including minimally invasive surgery, chemotherapy and targeted therapy (Weinberg 2005). However, the initial diagnosis of pulmonary nodules during physical examinations did not attract enough attention, and many patients finally developed into malignant lung cancer. Early detection and treatment had become an urgent problem, distinguishing whether pulmonary nodules benign or not was a prejudgment for subsequent treatment options (Tinteren et al. 2002). Lung adenocarcinoma (LUAD), a subtype of lung cancer, which was found in most diagnosed lung cancer patients in China. Researchers carried out targeted intervention for cancers, by analysing differences in molecular characteristics and mechanism to predict lung cancer occurrence and development (Weir et al. 2007). There were still lack of clear clues to explain classification Guang Li and Meng Yang contributed to the work equally and should be regarded as co-first authors.
1 3 of early pulmonary nodules, which may contribute to better understanding and identification of new molecular targeting strategies for lung cancer treatment.
Gene expression profiling was widely used in various research fields, including malignancies, nervous system diseases and other infectious diseases (Alizadeh et al. 2000). Most microarray analysis focused on the comparison between normal and disease states, differentially expressed genes (DEGs) analysis using genotype method to screen the potential markers in disease status (Reiner et al. 2003). Although this method had made great contributions to the clinic field, frequent errors occurred in the analysis of single data set, and the verification of cross data sets will greatly improve accuracy and sensitivity (Goldberg et al. 2008). Meanwhile, weighted gene coexpression network analysis (WGCNA) was development of network biology, which provided an overall strategy for microarray analysis (Langfelder and Horvath 2008).The gene modules related to prognosis or classification were determined by calculating the Pearson correlation between gene pairs and the correlation between module characteristic genes and clinical status (Alaei et al. 2019). Coexpression network was brought in to study the relationship between genes and determine the internal mode of coordinated expression genes, hub genes related to module genes may play a role in coordinating module behavior and disease classification.
In addition, there appeared more scientific approach in tumor-normal classification, such as machine learning, which can achieve gene markers quickly and accurately (Raghav et al. 2019). Least Absolute Shrinkage and Selection Operator (LASSO) regression had better prediction ability in scoring model, and it had two significant advantages, deal with missing values and scale data (Xue et al. 2021). Previous studies had combined experimental and bioinformatics methods to study the prognostic markers of lung cancer and other diseases (Lu et al. 2020). In Guo's study, a 6-gene prediction model for AML survival was developed and validated by multiple datasets using WGCNA, which established the co-expression network signature correlated with the ELN2017 recommended prognostic genetic abnormalities in AML . Another method Extreme gradient boosting (XGBoost) algorithm had succeeded in distinguishing patients with cancers by RNA expression profile. Different from the traditional integrated decision tree algorithm, XGBoost method added regular terms to the loss function, which can control the complexity of the model and prevent the model from over fitting (Chen et al. 2016). Furthermore, XGBoost algorithm was used to narrow the scope and find effective mutation, which had been successfully applied in various fields and achieved outstanding results (Ywh et al. 2021). However, XGBoost and LASSO had not been together used to objectively classify samples according to RNA characteristics.
In this study, the high-throughput data was obtained from The Cancer Genome Atlas (TCGA) database (Mclendon et al. 2008) and Genotype Tissue Expression (GTEX) database (Keen and Moore 2015), these genomic information enabled the cancer research community to improve the prevention, diagnosis and treatment of cancer. Firstly, we screened LUAD database containing 432 samples, including the RNA-seq expression data of each sample. Next, DEGs in LUAD were identified with R packages. The WGCNA was then used to selected the qualified color modules, the machine learning tumor-normal classifier was further discussed based on LASSO and XGBoost mode. Combined with all these methods, the classification prediction model of early pulmonary nodules was developed, and the important features in the prediction were analyzed. The flowchart for this study was shown in Fig. 1.

Data sources
TCGA datasets (https:// portal. gdc. cancer. gov/) consisted of more than 2 Pb of genomic data, which was publicly available. The RNA-seq sequence data of LUAD samples were retrieved from TCGA database. Since there was relatively limited normal data in TCGA, the lung related normal data was also downloaded from GTEX database (https:// commo nfund. nih. gov/ gtex). These data was sourced from llumina HiSeq-RNASeq platform, which can be downloaded free of charge. The data were from public databases, the ethics committee did not need further approval. According to the clinical information data LUAD, samples with T1 stage and different M0 and N0 were extracted as samples for early pulmonary nodule model. A total of 85 pulmonary nodule samples were screened by LUAD. As to the normal tissues, there were 347 samples of LUAD extracted from GTEX database. The tumor and normal datas were combined respectively, and a total of 432 LUAD samples were obtained.

Differential expression analysis
In the research, the program code written in R language was mainly used to analyze and process RNA data. "Deseq2" (Love et al. 2014), "Limma" (Smyth et al. 2010) and "EdgeR" (Vijay et al. 2012) package were installed from Bioconductor (http:// www. bioco nduct or. org), these three methods were used to read the original files respectively, conduct standardized filtering, and remove the mRNA only found in single sample to obtain the final mRNA data set, and mRNA not included in Ensembl database (http:// www. ensem bl. org/ index. html) was excluded. mRNAs with average below 1 reads were also deleted. The mRNA data of the two disease subtypes were analyzed by DESeq2, Limma, EdgeR, and the differentially expressed mRNAs were obtained. P-values used false discovery rate (FDR) to correct the difference of multiple tests, the expression data were log2 transformed and normalized, excluding the clinical data with no follow-up message or less than 1 day. The fold change |log2fc|≥ 2 and FDR (P < 0.05) was the threshold to screen DEGs. The volcano map and heat map were generated by "ggplot2" (Wickham 2009) and "pheatmap" (Kolde 2015) package, and mRNA with no statistically significant difference was deleted. The up-regulated and downregulated DEGs were obtained by Venn plot analysis, and three methods overlapping genes with the same trend were obtained as the basic genes of this model.

WGCNA method
Firstly, the counts matrix was transformed into fpkm matrix by R language, and the top coefficient 5000 genes were selected. After that, the "WGCNA" (Langfelder and Horvath 2008) package was used to analyze the expression and classification files of candidate gene. The expressed genes were used to calculate the Pearson correlation coefficient among genes and select the appropriate soft threshold β, making the constructed network more in line with the standard of scale-free network. The gene network was constructed by one-step method, the adjacency matrix was transformed into topological overlap matrix Tom, and a hierarchical clustering tree of genes was generated by hierarchical clustering. Gene significance (GS) and module significance (MS) were calculated to measure the significance of genes and clinical information, and the significant correlation between modules and models was analyzed. Calculate the module conservatism and score each module. The function module Preservation was used to calculate the conservatism Z score (Dibley et al. 1987), Z > 10 modules were regarded as significant. Meanwhile, use the enrichment analysis to take the enrichment score of each module, enrichment fold, EF > 1, 95% confidence interval and P value < 0.05 modules retained. Finally, take the intersection of all color module genes with candidate genes obtained from the previous DEGs.

Gene enrichment analysis
The Gene set enrichment analysis was processed using the "clusterProfiler" package (Yu et al. 2012) in R language. The former module genes were enriched and analyzed by GO and KEGG, and the enrichment analysis results were visualized with bubble charts, bar charts and heat map, P value < 0.05 was statistically significant. In this study, human org.Hs.eg.db database were chose to transform the gene symbol. The function enrichKEGG (KEGG pathway)/enrichMKEGG(KEGG module) support enrichment analysis online, and function createKEGGdb can be also used to generate local KEGG.db package, use_ internal_ data parameter was set to TRUE. Meanwhile, the functional protein association network of these genes were also analyzed on line with STRING (https:// cn. string-db. org/) (Andrea et al. 2012), the candidate genes can be uploaded directly with this web-tool and protein-protein interaction networks functional enrichment analysis processed.

LASSO machine learning algorithm
The DEGs obtained from 1.2 were used as input data, and the model was established based on LASSO machine learning algorithm, "glmSparseNet" (Schelldorfer et al. 2011) package of R language was used to integrate multiple data sets. The integrated data were randomly divided into training set and verification set with the proportion of 2:1, and the clinical and survival data of the two groups were compared. Firstly, a classification model was constructed to distinguish the expression samples of tumor samples and normal tissues, with the function cv.glmhub, the parameter was set family = 'binomial'. Use timedependent receiver operating characteristic curve (ROC) to check the positive rate of the classifier, and the response and ROC results were obtained. Secondly, the differences of survival factors between the two groups were compared. Lasso regression algorithm was used to screen the mRNA related to the prognosis and survival of pulmonary nodules, and a risk score model was constructed, too. The influence of candidate differential genes on the overall survival was analyzed, P < 0.05 and survival related module genes were integrated into LASSO machine learning algorithm to identify prognostic risk characteristics. Cox survival curve was used to compare the survival differences of patients with pulmonary nodules between the two groups, so as to judge whether the risk model can be used as an independent prognostic index.

XGBoost machine learning algorithm
The 1.2 candidate genes obtained from differential expression were used as input data, and the model was established based on XGBoost machine learning algorithm, which was differently from traditional integrated decision tree algorithm. Regular term was added in XGBoost, which can control the complexity of the model and prevent the model from over fitting, and the data was optimized with second-order Taylor expansion to make it closer to the actual value. Then, the importance of each feature was calculated according to the number of tree node splitting features. The "XGBoost" (Chen et al. 2016) package of R language was used for statistical analysis. The integrated data were randomly divided into training set and verification set with the proportion of 7:3. In order to evaluate the model more accurately and comprehensively, the accuracy and area under curve (AUC) were used as the evaluation criteria of prediction results. According to the above algorithm, the classification model was established. The processed data were trained for many times, and different parameters were adjusted accordingly to get the best result.

Differential expression analysis genes
Gene expression data including LUAD were isolated from RNA-seq expression profile with R language, three methods including Deseq2, limma and edgeR were used respectively, Fig. 2 demonstrated the identification result of DEGs. From LUAD data, 1425 genes were up-regulated, 620 genes were down regulated by Deseq2, 2276 genes were up-regulated, 659 genes were down regulated by edgeR, 1631 genes were up regulated and 1560 genes were down regulated by limma. Genes with the same trend of the three methods were selected, 854 up-regulated genes and 452 down regulated genes were screened with intersection. After taking the intersection of the three methods, a total of 953 up-regulated genes and 330 down regulated genes were obtained. The final result showed that there were 1306 DEGs got in LUAD, these DEGs were selected as the feature subset of the model, and the disease classification prediction model was established.

WGCNA method
The input fpkm data of LUAD were clustered according to the classifications of normal and tumor, and different colors represent all those clustered modules. The soft threshold = 6 was selected to comply with the scale-free network standard, and the network heat map was established based on the hierarchical clustering of module genes. Tom results showed that LUAD data generated a total of 13 color modules, which were screened according to Z-score, EF, P value and other factors. There were five qualified modules in total, including 2804 genes in blue, brown, green yellow, grey and purple. The WGCNA result was shown in Fig. 3 and Table 1.

Gene enrichment analysis
Taking the intersection of module genes and differentially expressed genes, 99 differentially expressed genes with the strongest correlation were obtained from LUAD. Cluster-Profile for enrichment analysis showed that these module genes were related to multiple pathways, take LUAD genes for example, Gene Ontology (GO) including 81 functions related to Biological Process (BP), five functions related to Cellular Component (CC), 15 functions related to Molecular Function (MF) and 14 pathways related to

XGBoost machine learning analysis
The DEGs obtained from 2.1 were also analyzed by XGBoost machine learning. The samples were randomly divided into two groups (7:3) as training set and verification set. Through the training set, it was found that 14 genes of LUAD were considered to significantly affect the tumor classification of patients, including "ETV4" "MAP3K9-DT"  "HMGB3" "PLEK2" "SPTBN2" "SBK1" "TCF21" "GPT2" "FERMT1" "LINC00942" "FCN3" "GLB1L3" "COL6A6" and "ARHGAP6". Using the verification set, the final accuracy was about 0.978 according to the formula. The ROC curve of xgboost was shown in the Fig. 5. The final genes were demonstrated in Table 2.

Candidate genes
Taking the intersection of WGCNA and DEGs respectively, 99 common genes with the strongest correlation were obtained from LUAD, all these genes may be used for the classification of early pulmonary nodules. LASSO machine learning algorithm results showed that LUAD got 5 candidate genes. From the two subtypes of XGBoost, LUAD obtained 14 candidate genes. Based on the above analysis, we further combined with WGCNA and LASSO machine learning, there was an intersection gene"GOLM1" in LUAD. Meanwhile, taking the intersection between WGCNA and XGBoost, LUAD had five intersection genes"HMGB3" "ARHGAP6" "TCF21" "FCN3" and "COL6A6". These six candidate genes are significant in both groups of subtypes of lung cancer, which can be used as key classification objects for further experimental analysis. At the same time, we tested the survival of candidate genes on GEPIA website (http:// gepia. cancer-pku. cn/) (Tang et al. 2017) and found that "GOLM1" "COL6A6" were significantly correlated with the overall survival of LUAD. The expression of these signature genes were further validated with HPA database(https:// www. prote inatl as. org/) (Ran et al. 2017), immunostaining results showed that the upregulation of

Discussion
LUAD was one of the subtypes of pulmonary nodules, at present, many studies had shown that gene signature played an important role in predicting the classification of tumors (Juarez-Flores et al. 2021). However, there was no effective method for early diagnosis of pulmonary nodules, the algorithms used on pulmonary nodule signature were not deeply enough, and little attention was paid to the classification role of signature genes. Herein, samples of early pulmonary nodules were screened through TCGA lung cancer clinical data set, and RNA-seq transcriptome data were analyzed by three differentially expressed methods of R language. 1283 genes were found with statistically significant expression difference in LUAD. Since these genes were too much and their downstream mechanisms may affect each other, finding gene sets with similar expression patterns and related functions were effective steps to process dimension reduction and explore more precise classification of pulmonary nodules.
On one hand, WGCNA was a method to construct gene coexpression network based on unsupervised clustering, such as hierarchical clustering, which can be used to identify gene modules. In addition, dynamic tree cutting method was used to determine the clustering of similar data points in multi-dimensional space, which can successfully detect the branches that cannot be identified by the static cutting method, and generate prognostic gene clusters. Recent studies showed that WGCNA had been widely used to screen the core genes of tumor specific phenotypes. In this study, WGCNA was introduced to calculate the expression relationship between genes, and identify the module gene sets with similar expression patterns, and finally find the gene sets related to the classification of early pulmonary nodules. On this foundation, together with Z-score and enrichment fold, which scored each module more scientifically into groups, 116 highly related DEGs were determined. In addition, enrichment analysis revealed that these genes were related to humoal immune response phagocytosis, lymphocytes mediated immunity pathways, and so on.
On the other hand, machine learning algorithms was brought in to further narrow the gene number. Machine learning was a technique with hypothesis free and data adaptive, which can be effectively used to model complex data. Lasso, as a typical machine learning algorithm, had been widely used in binary classification analysis and survival analysis. The most commonly used for survival statistical model was cox proportional hazard regression model, "Lasso" package just had such function, which can identify potential risk factors, evaluate the prediction, and fit the clinical relevance of the results. In this study, Lasso was used to predict the tumor and normal group classification of early pulmonary nodules, and five signatures were obtained. Meanwhile, the survival model of five genes was also constructed and its prognostic value was studied. It was found that the prognosis of high-risk and low-risk levels in the training group and the validation group were significantly different. Of course, Lasso algorithm also had some limitations, which required appropriate processing of input data, and the parameters was adjusted in the calculation process to avoid the decline of algorithm performance. Therefore, we adopted another machine learning algorithm, XGBoost as the supplement of gene signature, such method was a rapid implementation of GradientBoosting algorithm. The algorithm was improved and the accuracy was improved at the same time. Since XGBoost method used second-order Taylor expansion of loss function, which was closer and faster in convergence. In addition, the regular term was added to the loss function that can effectively control the complexity of the model and prevent over fitting. In this study, 14 related genes were screened by XGBoost to distinguish the classification of pulmonary nodules, tROC result confirmed the certain reliability and strong prediction ability.
The signatures by machine learning and WGCNA methods were intersected to six genes, including "HMGB3" "ARHGAP6" "TCF21" "FCN3" "COL6A6" "GOLM1". High mobility group box 3 (HMGB3) was widely existed in eukaryotes, which binded to chromatin and single stranded DNA. It can bend DNA and improve DNA flexibility to promote the activity of various gene promoters. Researchers reported that HMGB3 was low expressed in normal cells and high expressed in cancer cells, which was an attractive therapeutic target for different cancers (Liu et al. 2018). Studies have reported that Rho GTPase Activating Protein 6 (ARHGAP6) had inhibitory effect on cervical cancer, and its expression level in cancer tissues was significantly lower than that in normal tissues, indicating that this gene can inhibit the proliferation, invasion and metastasis of cancer cells, induce cancer cell apoptosis. It can also keep cell cycle in G0/G1 stage, which played an important role in inhibiting the growth of cancer cells in the treatment of cervical cancer (Yin et al. 2016). Transcription factor 21 (TCF21) was located in the 6q23-24 region of human chromosome and of great significance to cell growth and differentiation. TCF21 can regulate the expression of downstream genes, which was closely related to had relationship with the occurrence and progression of a variety of malignant tumors. Previous studies have shown that the expression of TCF21 in non-small cell lung cancer was closely related to the degree of tumor differentiation, lymph node metastasis, clinical stage and postoperative 5-year survival (Guo et al. 2018). Ficolin (collagen/fibrinogen domain containing) 3 (FCN3) protein was mainly synthesized in lung, which was secreted by bronchial epithelial cilia and type II alveolar epithelial cells. FCN3 protein was the main catalyst of complement lectin pathway and the most important innate immune factor in human body. Researchers found that FCN3 protein decreased significantly in LUSC, which suggested that FCN3 may be a serum marker of LUSC (Kim et al. 2021). Collagen type VI A6 (COL6A6) was considered to be a tumor inhibitor and therapeutic target in multiple tumors. A study found that overexpressed COL6A6 were enriched in immune pathways such as T cell activation and T cell receptor signaling, suggesting that COL6A6 may be a potential immunotherapeutic target of LUAD. Cox analysis results also identified the gene risk prediction characteristics of COL6A6 related immune regulators (Ma et al. 2021). Golgi membrane protein 1 (GOLM1) was a kind of Golgi membrane glycoprotein which was located in person 9q21 33, containing 11 exons, with a total length of 3042 bp. Study found that GOLM1 expression was closely related to the proliferation and invasion of non-small cell lung cancer, and believed that it can be used as an independent predictor of prognosis of nonsmall cell lung cancer (Aruna 2018). Validation on GEPIA and HPA was also consisted with prediction, suggesting that these six genes were involved in the important functions of carcinogenesis and development, which played an important guiding role for our research.
So far, there was no research proposed the prediction of pulmonary nodule classification by combining WGCNA and machine learning methods. Based on RNA-seq data, this study deeply studied the molecular characteristics of LUAD and found the gene signatures for screening early pulmonary nodules. Using differential expression analysis, WGCNA, LASSO regression algorithm and XGBoost algorithm, six gene signatures were developed and verified. The establishment of prediction model in this study was successful with high accuracy and good stability, which provided a theoretical basis for early diagnosis and treatment of pulmonary nodules. The influence mechanism of these characteristic variables on classification will be further explored in subsequent studies. In addition, because the sample size included in this study was small and the model needed to be continuously optimized, the mechanism of the contribution of these genes to the model needed to be further studied.
Author contributions FJ and LR conceived of the presented idea. GL developed the theory and performed the computations. MY verified the analytical methods. All authors discussed the results and contributed to the final manuscript.