Development and validation of a prognostic model based on a single-cell RNA-seq in Wilms tumor in children

To analyze the heterogeneity between different cell types in pediatric Wilms tumor (WT) tissue, and identify the differentially expressed genes (DEGs) of malignant tumor cells, thereby establishing a prognostic model. The single-cell sequencing data of pediatric WT tissues were downloaded from the public database. Data filtration and normalization, principal component analysis, and T-distributed stochastic neighbor embedding cluster analysis were performed using the Seurat package of R language. Cells were divided into different clusters, malignant tumor cells were extracted, and DEGs were obtained. Then, the pseudo-time trajectory analysis was performed. Prognostic biomarkers were determined by univariate and multivariate COX regression analyses and LASSO regression analysis. Kaplan–Meier survival analysis and receiver operator characteristic curve analysis were performed. Combined with the prognostic biomarkers and clinical characteristics, a nomogram was generated to predict WT prognosis. The prognostic power was validated in the external datasets. Cells in the WT tissue were divided into 10 clusters. Three prognostic biomarkers that affected the survival time of patients were screened from 215 DEGs in malignant tumor cells, and a nomogram was constructed using the three genes and clinical characteristics. The area under the curve (AUC) values of 3- and 5-year disease-free survival were 0.756 and 0.734, respectively. In the external validation dataset, the AUC value of this nomogram model was 0.826. Based on the single-cell RNA-seq, we recognized cell clusters in the WT tissue of children, identified prognostic biomarkers in malignant tumor cells, and established a comprehensive prognostic model. Our findings might provide new ideas and methods for the diagnosis and treatment of WT.


Introduction
Wilms tumor (WT) is the most common kidney cancer in children, and the fourth most common childhood cancer. 1 It is genetically heterogeneous. 2 WT affects over 90% of children with renal malignancies. 3 Its histology resembles that of a developing kidney, and is usually triphasic with blastemal, stromal, and epithelial components. 4 WT is a malignant solid tumor, mainly composed of stroma, germ, and epithelium, including epithelial tissue, muscle tissue, connective tissue, skeletal tissue, and nerve tissue. Abdominal mass is its main clinical symptom. When the mass is small, there are no obvious symptoms so that WT is easy to be ignored. When the mass is found, WT may have developed to an advanced stage. However, in the early stage of WT, distant metastasis can also occur in lung, liver, and brain, which seriously affects the life and health of children. Therefore, it is necessary to conduct in-depth research on the factors associated with the occurrence and development of WT. At present, genomic and transcriptomic studies on WT tissues mainly use bulk profiling techniques to investigate the effects of genes, 5,6 miRNAs, 7 and lncRNAs 8,9 on tumor tissues and patient survival time. However, these studies typically rely on data from the bulk profiling, limiting their ability to accurately capture tumor heterogeneity. Therefore, an in-depth understanding of cellular heterogeneity and the interaction between WT cells and their microenvironment may lead to the development of new therapeutic approaches for treating WT.
Single-cell RNA sequencing (scRNA-seq) analysis has emerged as a powerful tool for revealing cellular diversity and cell-to-cell communication at single-cell resolution. Recently, scRNA-seq has been applied to dissect the complex tumor and immune landscapes of some cancers, including glioblastoma, breast cancer, lung cancer, head and neck cancer, pancreatic ductal adenocarcinoma, and liver cancer. This technique improves our understanding of cellular heterogeneity and facilitates the screening of promising molecular targets to guide antitumor therapy. However, tumor heterogeneity and the interaction between malignant cells and normal cells at single-cell resolution in human WT remains poorly understood.
Therefore, our study aimed to analyze the heterogeneity between different cell types in WT tissue of children, and identify the differentially expressed genes (DEGs) of malignant tumor cells, thereby establishing a prognostic model.

Data collection and preprocessing
The training dataset was derived from the published single-cell sequencing data, 10 in which three pediatric WT samples, five adult clear cell renal cell carcinoma (RCC) or papillary RCC samples, two healthy fetal samples, and one healthy adolescent were analyzed. The data of WT patients were extracted as the research object, and a total of 2967 single-cell sequencing data of WT patients were obtained. Then the Seurat package in the R language (https://www.r-project.org/) was used to preprocess the data, and the ''CreateSeuratObject'' function was used to build a Seurat object. 11 The survival data and gene expression data were obtained from the WT project in the TARGET database (https://ocg.cancer.gov/). The TARGET database is a childhood tumor database designed to use a comprehensive genomic approach to identify molecular changes in the occurrence and development of difficult-to-treat childhood cancers, including acute lymphoblastic leukemia, acute myeloid leukemia, kidney tumors, neuroblastoma, and osteosarcoma. The data of 124 WT patients were selected as the experimental data set in this study. The GSE73209 and GSE11024 data sets were downloaded from the GEO data platform (https://www.ncbi.nlm.nih. gov/geo) and were used as the validation data sets.

Quality control, data filtering, and normalization
The ''PercentageFeatureset'' function was used to calculate the percentage of mitochondrial genes, and the ''FilterCells'' function was used to filter the Seurat object, with the retention criteria of 200 \ features \ 2500, and the percentage of mitochondrial genes \5%. 12 The ''NormalizeData'' function was used to normalize the data with the normalization.method = ''LogNormalize.' ' 12 Identification of genes with highly variable expression between cells Some genes, whose expression varies widely between cells, are called hypervariable genes. The ''FindVariableGenes'' function was used to calculate hypervariable genes, and the ''vst'' method was used to select the top 2000 hypervariable genes. The data were further scaled using the ''ScaleData'' function with the features = ''all.genes.'' 12

Principal component analysis
The ''RunPCA'' function was used to perform principal component analysis (PCA) on highly variable genes, the ''ScoreJackStraw'' function was used to score each principal component (PC), the ''JackStrawPlot'' function was used to visualize the score of each PC, and the ''EbowPlot'' function was further used for visual analysis of each PC. The dimension of PCs for cluster analysis was selected accordingly. 12 Then, the ''FindClusters'' function was used to perform cell clustering analysis, and the parameter resolution was set as 0.4. 12

T-distributed stochastic neighbor embedding cluster classification analysis
The T-distributed stochastic neighbor embedding (TSNE) algorithm was used for dimensionality reduction and cluster classification analysis. Then, the ''TSNEPlot'' function was used to visualize cell clusters. The SingleR package in the R language was used to annotate the cell clusters by checking the expression of common cell grouping marker genes in tumor tissue, the results of cell clusters annotation were checked and confirmed, and the cell clusters were finally determined comprehensively. 12

Pseudo-time trajectory analysis
Investigation of the differentiation trajectories and corresponding genes in various cell populations may shed light on the molecular mechanisms underlying cancer development. Pseudo-time trajectory analysis of single-cell data was performed using the monocle package 13 to determine differentiation trends and differentiation paths of cell clusters.

Identification of DEGs in malignant tumor cells
The ''FindMarkers'' function was used to determine the DEGs between malignant tumor cells and other cells, with the screening criteria of |logFC|.1 and adjPval\0.05. 12 The obtained DEGs were subjected to enrichment analysis to determine the possible changes in gene functions and signaling pathways caused by these DEGs. Gene ontology (GO) functional enrichment analysis (including the molecular function, biological processes, and cellular component) and Kyoto Encyclopedia of Genes and Genomes (KEGG) 14 signaling pathway enrichment analysis were performed using the Metascape online tool (https://metascape.org/).

Univariate COX regression analysis
To preliminarily screen out the genes associated with WT prognosis, univariate COX regression analysis was performed on the DEGs to calculate the hazard ratio and 95% confidence interval of the candidate genes using the ''survival'' package of the R language. 15 p \ 0.05 was considered statistically significant.

LASSO regression and multivariate COX regression analysis
To further select variables and avoid the over-fitting of genes obtained by univariate COX regression analysis, the ''glmet'' package of R language was used to perform LASSO regression analysis on the DEGs screened by univariate COX regression. 15 In this study, 10-fold cross-validation was used to determine the value of l during the model construction, taking the l with the smallest partial likelihood deviation as the optimal l. The selected genes were then subjected to multivariate COX regression analysis, and finally genes with the statistical significance were selected to establish a prognosis prediction model, and a risk score calculation formula was obtained.
The risk score of each case was calculated using this following formula, Risk score = ExpGENE1 3 b1 + ExpGENE2 3 b2 + . + ExpGENEn 3 bn, where ''Exp'' and ''b'' represent the corresponding gene expression level, and the regression coefficient from the multivariate Cox analysis, respectively. 15

Receiver operator characteristic curve analysis
To evaluate the predictive power of the prognostic model, this study first performed Kaplan-Meier survival analysis on the high-risk and low-risk groups; second, the receiver operator characteristic (ROC) curves of the 3-and 5-year disease-free survival (DFS) of WT patients were drawn, and the area under the curve (AUC) values of the 3-and 5-year DFS were calculated by the ''survival'' and ''timeROC'' packages. 16 When the AUC value is less than 0.5, the accuracy of the model is not significant; when the AUC value is greater than 0.7, the accuracy of the model is moderate; when the AUC value is greater than 0.9, the accuracy of the model is quite high.

Nomogram construction and calibration
Combining prognostic genes and clinical features, a nomogram was generated to predict prognosis of WT patients. The nomogram was constructed according to the regression coefficients obtained from the multivariate COX regression analysis using the ''rms'' package of R. 16 The nomogram prediction probabilities against the observed rates was visualized by drawing a calibration curve, and the predictive power of the nomogram was evaluated by the ROC curve. The proportional hazard assumption was tested by Kaplan-Meier curves.

Validation of external dataset
GSE73209 and GSE11024 from GEO were selected to verify the predictive ability of this nomogram model. The GSE73209 contains 32 WT tissue samples and 6 normal tissue samples; the GSE11024 contains 27 WT tissue samples and 12 normal tissue samples. 17

Statistical analysis
Kaplan-Meier methodology was used to compare the differences of DFS among different groups, and Kaplan-Meier curves were generated. Univariate and multivariate COX analyses, together with LASSO regression, were performed to screen independent predictors of DFS. All statistical analyses were performed using R software (v4.0.3). p Value less than 0.05 was considered statistically significant.

Results
The flow chart and principal findings of the comprehensive analysis are shown in Figure 1.

Quality control of the single-cell data
The quality control chart is shown in Figure 2(a), illustrating the range of single-cell RNA numbers, the sequencing count, and the proportion of mitochondrial sequencing counts of each cell. We accordingly retained cells with a percentage of mitochondrial sequencing count \5% and 200 \ features \ 2500. Finally, 2695 high-quality cells were retained.
In addition, statistics on the data found that there was a significant correlation between features and counts (Pearson's r = 0.85), while there was no correlation between counts and the percentage of mitochondrial genes (Pearson's r = 20.16), as shown in Figure 2(b) and (c).

Identification of highly variable genes
The top 2000 highly variable genes were screened out after analysis, as shown in Figure 2 Figure 2(g). In these heat maps, the brighter the color, the higher the expression. In this figure, PC_1 to PC_15 suggested the expression of the genes in these 15 PCs. The higher the ranking of the PCs, the more distinctly the cells can be distinguished. These allow for easy exploration of the primary sources of heterogeneity in a dataset.

Principal component analysis
The PCA, a linear dimensionality reduction method, was utilized to identify significantly available dimensions of data sets with estimated p value. The JackStraw Plot and Elbow plot showed that the differences of each sample were more obvious when the number of PCs was 10 (Figure 2(h) and (i)). Therefore, the top 10 PCs were selected for TSNE dimensionality reduction.

TSNE cluster classification analysis
The TSNE cluster classification analysis was performed to divide cells into 10 clusters, as shown in Figure 3(a). The SingleR package was used to annotate the cells of each cluster by combining the gene expression of several common cell types, and finally the cell type of each cluster was determined, as shown in Figure 3(b). The most significant marker genes in each cluster are shown in Figure 3(c), and the top 10 DEGs in each cluster are shown in Figure 3(d). The top 10 marker genes of each cell cluster were displayed in the heat map. In this heat map, the brighter the color, the higher the expression.

Pseudo-time trajectory analysis
Pseudo-time trajectory analysis was performed by the ''monocle'' package, as shown in Figure 3(e), which suggested that the differentiation of cells was mainly divided into three directions. Cluster3 and Cluster7 were on the same branch, and Cluster7 was located at the end of this branch, which implied that Cluster7 was more developed and mature tumor cells.

Analysis of DEGs in malignant tumor cells
Malignant tumor cells were the focus of this present research. Cluster3 and cluster7 were identified as malignant tumor cells. The ''FindMarkers'' function in the Seurat package was used to find the DEGs between malignant cells and other cells, and finally 215 DEGs were obtained.
The results of GO enrichment analysis showed that DEGs were mainly enriched in positive regulation of cell migration, blood vessel development, and other functions that may be related to carcinogenesis (Figure 3(f)). The results of KEGG signaling pathway enrichment analysis showed that DEGs were mainly enriched in signaling pathways that may be related to carcinogenesis, such as apoptosis, transcriptional misregulation in cancer, and antigen processing and presentation (Figure 3(g)).

Identification of prognosis-associated genes
Based on the above DEGs, and the corresponding clinical and gene expression information from TARGET, 13 genes associated with the prognosis of WT were preliminarily identified by univariate COX regression analysis, and their p values were all less than 0.05, including BTG1, CXCR4, DNAJA1, EIF3M, FBXO21, NES, NPNT, PLTP, PRDX1, PTGDS, PTPRO, TAGLN, and TUBB. LASSO regression analysis was further performed on these 13 genes, and finally 12 genes were determined to be included in this study, namely: BTG1, CXCR4, DNAJA1, EIF3M, FBXO21, NES, NPNT, PLTP, PRDX1, PTGDS, PTPRO, and TAGLN (Figure 4(a) and (b)). Subsequently, multivariate COX regression analysis was performed on these 12 genes, the results of which showed that DNAJA1, NES, and TAGLN were significantly associated with patient survival time (p \ 0.05) (see Table 1). Kaplan-Meier survival analysis was performed on DNAJA1, NES, and TAGLN, and it was found that the p values of the three genes were less than 0.05, indicating that these three genes were significantly associated with the survival of patients, and might be used as prognostic marker genes (Figure 4(c)-(e)). The prognostic risk score of each WT  patient was calculated based on these three genes. The calculation formula was Risk score = Exp DNAJA1 3 0.9166 + Exp TAGLN 3 0.5986 + Exp NES 3 (21.0138). All patients were divided into high-risk and low-risk groups according to the median risk score, and a prognostic model was constructed. To further evaluate the predictive power of the prognostic model, Kaplan-Meier survival analysis was performed, the results of which showed that the high-risk group had a 3-year survival probability of 27%, a 5-year survival probability of 27%, and the low-risk group had a 3-year survival probability of 70% and a 5-year survival probability of 63%. DFS time in the low-risk group was longer than that in the high-risk group (p \ 0.05, Figure 4(f)). The AUC value of the 3-year DFS was 0.733, and the AUC value of the 5-year DFS was 0.709 (see Figure 4(g)).

Identification of prognosis-associated clinical characteristics
Univariate COX regression analysis was performed using DFS as the dependent variable, and age, gender, and race as covariates. The results showed that gender could be used as an independent prognostic factor (p = 0.0013), and the prognosis of male patients was worse than that of female patients, as shown in Table 2. In addition, Kaplan-Meier survival analysis was performed on the clinical characteristics, and it was also found that gender was significantly associated with patient survival, as shown in Figure 4(h).

Nomogram construction
Gender, DNAJA1, NES, and TAGLN were used to construct the nomogram (see Figure 5(a)); the calibration   curve suggested that the model had good predictive power for 3-and 5-year survival (see Figure 5(b)). The AUC value for the 3-year DFS was 0.756, and the AUC value for the 5-year DFS was 0.734 (see Figure 5(c)). In addition, the Kaplan-Meier survival curves of DNAJA1, NES, TAGLN, and Gender suggested that the proportional hazard assumption was valid (see Figure 4(c)-(e), (h)).

Validation of external dataset
The ROC curve also showed that the AUC value of this nomogram model was 0.826 ( Figure 5(d)), suggesting a moderate prediction value.

Discussion
WT or nephroblastoma is the most common renal tumor in children, and is associated with different congenital anomalies and syndromes. 18 WT is the most common primary malignant tumor of the urinary system in children, with an incidence of about 1 in 10,000, 19 accounting for more than 90% of renal malignant tumors in children. 20 There is significant heterogeneity between malignant tumor cells and other cells in the WT tissue, and single-cell sequencing technology can study the differences between different types of cells at the level of single-cell resolution. The previous researches 21-23 on the construction of prognostic nomogram for WT was based on clinical information database or based on bulk seq sequencing technology, while no prognostic prediction model for malignant tumor cells has been reported. To our knowledge, this study is the first to screen out malignant tumor cells based on a single-cell sequencing technology to construct the prognostic nomogram.
In the previous studies on the nomogram prognostic model of WT, the predictive accuracy ranged from 0.656 to 0.879. [21][22][23] Tang et al. 23 constructed nomograms to forecast overall survival and cancer-specific survival (CSS) of children with WT based on Surveillance, Epidemiology, and End Results (SEER) database, where five predictors were included, such as age, tumor laterality, size, stage, and surgery, the AUC values for 3-and 5-year overall survival were 0.659 and 0.656, and the AUC values for 3-and 5-year CSS were both 0.677. Pan et al. 22 constructed a nomogram to predict the CSS of WT patients based on SEER database, where age, the number of examined LNs, SEER stage, and tumor size were included, the AUC values for 3-and 5-year CSS were 0.755 and 0.749, respectively. He et al. 21 screened four autophagy-related genes based on the TCGA database, and a nomogram was constructed together with first event, stage, and histology. Due to the introduction of more variables, the AUC values for 3-and 5-year survival based on this nomogram were 0.879 and 0.856, respectively. These previous studies did not consider the effect of heterogeneity between malignant tumor cells and other cells on the results, and thus they have certain limitations. In our study, based on the malignant tumor cells, the AUC values for 3-and 5-year DFS were 0.756 and 0.734, respectively.
In this article, single-cell transcriptome analysis was used to group cells based on a single-cell sequencing data. Malignant cells were screened out, and compared with other cell data, the DEGs were obtained. GO enrichment analysis and KEGG pathway enrichment analysis were performed on these DEGs. Prognostic biomarkers were determined by univariate COX regression analysis, multivariate COX regression analysis, and LASSO regression analysis. Kaplan-Meier survival analysis and ROC analysis were performed on these prognostic biomarkers, and finally a prognosis model and nomogram were constructed, and the results were calibrated. The results of functional enrichment analysis and signaling pathway enrichment analysis of 215 DEGs showed that malignant tumor cells were significantly enriched in functions or pathways such as blood vessel development, positive regulation of cell migration, and transcriptional misregulation in cancer. As we know, transcriptional misregulation can result in cell canceration, and cancerous cells can promote tumor development by promoting angiogenesis, cell migration, and other processes.
In this study, we identified three prognostic biomarkers of malignant tumor cells in the WT tissue, including DNAJA1, TAGLN, and NES. DNAJA1, a member of J-domain containing proteins or heat shock protein 40, is evidenced to prevent unfolded mutp53 from proteasomal degradation, which is associated with several cancers. 24,25 It was found that DNAJA1 promoted tumor metastasis by accumulating unfolded mutp53. 26 TAGLN is a regulator of the actin cytoskeleton, and affects the survival, migration, and apoptosis of various cancer cells to varying degrees. 27 Overexpression of TAGLN is associated with cell infiltration, which, in turn, promotes tumor metastasis. 28,29 NES encodes a member of the intermediate filament protein family. At present, there are few studies on it, and it can be used as a new potential gene for in-depth research.
However, there are still some limitations in our study. This research is developed only based on the public databases, and lack of the validation form in vivo or in vitro experiments. Therefore, it is necessary to further confirm our findings by in vivo or in vitro experiments in the future.
In conclusion, this study detected the biomarkers of malignant tumor cells in the WT tissue, developed a novel nomogram to predict the WT prognosis based on the candidate genes and clinical information, and validated it in other datasets. It might provide useful for the diagnosis and treatment of WT in the future.

Author contributions
Tian-Qu He, Jun He, conceiving and designing the study; Yu Liu, Feng Ning, collecting the data; Tian-Qu He, Feng Ning, Lei Tu, analyzing and interpreting the data; Tian-Qu He, writing the manuscript; Yao-Wang Zhao, Jun He, and Feng Ning, providing critical revisions that are important for the intellectual content; Tian-Qu He, Yao-Wang Zhao, Yu Liu, Lei Tu, Feng Ning, and Jun He, approving the final version of the manuscript.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: