Identification of significant genes using feature selection and machine learning
For identifying the significant genes related to subpopulations of Nipponbare (Nip) root tip cells, we used three feature selection methods (MIC, CV2 and F-score) to evaluate the importance of the 39,217 genes and ranked them according to their contribution value. The MIC, CV2, and F-score all extracted 21,981 significant genes respectively. Next, the machine learning models combined with incremental feature selection (IFS) were used to determine the optimal gene subsets. (Fig. 2A, B and C). Based on five-fold cross-validation, single-cell gene expression matrices were used as input features to train machine learning models (SVM, RFC, XGBoost, Lightgbm).
The train dataset results showed that MIC combined with SVM (MIC_SVM) achieved optimal prediction performance using top 110 genes, with the accuracy of 97.23% (Table 1). It is worth noting that the four machine learning models combined with the MIC and F-score also obtained superior prediction performance. To avoid the MIC and F-score having the same gene preference, we selected the top 110 genes in the score ranking of the two feature selection methods for comparison. As observed from Fig. 2D, MIC and F-score have few intersections and sufficiently differences. Using the 110 optimal genes on test data, MIC_SVM also predicted the best performance, with accuracy, precision, recall, and F1-measure of 96.72%, 95.15%, 94.84 and 94.92% (Table 2).
Table 1
Performance evaluation of different feature selection combined with machine learning schemes (Train dataset).
Method | Feature selection | No. of features | Accuracy |
Lightgbm | F-score | 150 | 96.53% |
XGBoost | F-score | 430 | 97.88% |
SVM | F-score | 180 | 96.34% |
RFC | F-score | 210 | 94.41% |
Lightgbm | CV2 | 20000 | 97.11% |
XGBoost | CV2 | 20000 | 97.59% |
SVM | CV2 | 7000 | 94.22% |
RFC | CV2 | 14000 | 89.71% |
Lightgbm | MIC | 100 | 95.49% |
XGBoost | MIC | 100 | 96.59% |
SVM | MIC | 110 | 96.72% |
RFC | MIC | 120 | 93.55% |
Table 2
Performance comparison between NRTPredictor and the other algorithms (Test dataset).
Method | Feature selection | No. of features | Accuracy | Precision | Recall | F1-measure |
Lightgbm | F-score | 150 | 96.53% | 94.27% | 93.37% | 93.78% |
XGBoost | F-score | 430 | 97.88% | 96.48% | 96.33% | 96.39% |
SVM | F-score | 180 | 96.34% | 94.02% | 93.35% | 93.66% |
RFC | F-score | 210 | 94.41% | 93.09% | 85.51% | 88.14% |
Lightgbm | CV2 | 20000 | 97.11% | 95.48% | 94.49% | 94.97% |
XGBoost | CV2 | 20000 | 97.59% | 96.27% | 95.56% | 95.89% |
SVM | CV2 | 7000 | 94.22% | 93.45% | 89.92% | 91.27% |
RFC | CV2 | 14000 | 89.71% | 88.02% | 85.43% | 85.37% |
Lightgbm | MIC | 100 | 95.49% | 94.49% | 93.70% | 94.02% |
XGBoost | MIC | 100 | 96.59% | 94.80% | 94.56% | 94.61% |
SVM | MIC | 110 | 96.72% | 95.15% | 94.84% | 94.92% |
RFC | MIC | 120 | 93.55% | 91.18% | 81.17% | 85.18% |
NRTPredictor | MIC | 110 | 98.01% | 95.63% | 95.45% | 95.95% |
NRTPredictor construction and performance in validated datasets
To further improve the performance of the model, we integrated the above four basic classifiers (SVM, RFC, Lightgbm and XGBoost) based on different weight assignments, called NRTPredictor. Ensemble models outperform individual models, with accuracy, precision, recall, and F1-measure of 98.01%, 95.63%, 95.45 and 95.95%. The receiver operating characteristic (ROC) curve and confusion matrix further verified the prediction performance of the NRTPredictor in six rice root cell subpopulations, and the low misclassification rate proved the demonstrated power of the NRTPredictor (Fig. 3A and B).
In addition, we have carried out a performance comparison between the pseudobulk differential expression analysis and our proposed ensemble method. Notably, the pseudopod analysis identified 1,216 genes overlapping with 98 of the 110 genes mined by our method (Figure S2). However, ensemble model exhibited superior predictive performance when using the 110 genes (Tables 1 and 2). Thus, while the pseudobulk analysis may identify more differentially expressed genes, the 110 genes identified by our method more accurately represented cell subpopulations with less computational complexity.
Investigating NRTPredictor model interpretability
To explain the performance of the proposed model, NRTPredictor gene sets were extracted and visualized. The Uniform Manifold Approximation and Projection (UMAP) and correlation analysis of 3,463 single cells indicated that the overall performance of the 110 marker genes was significantly better than all genes (Fig. 3C, D and E). These results demonstrated the ability of the proposed NRTPredictor in extracting potential genes, which helps understand the cell type annotated. Moreover, by analyzing the expression of 110 genes in six cell subpopulations, high expression was found in stele and Epidermic, and low expression was found in Root_hair, endodermis and cortex (Fig. 3F).
Besides, the accurate capture of genes involved in lineage identification is helpful for understanding the cell subpopulations of rice root tips. We observed that LOC_Os02g44310 and LOC_Os10g40520 were specifically expressed in Cortex. LOC_Os07g33997, LOC_Os01g64520 and LOC_Os06g46799 respectively exhibited high specificity and expression in Endodermis, Epidermis and Stele, while LOC_Os07g35860 and LOC_Os03g25320 were highly expressed in Epidermis (NRH) (Fig. 4A, Supplementary Figure S3 and Table S2). These highly ranked genes can be used as biomarkers to identify subpopulations of rice root cells and provide some support for further biological findings (Supplementary Table S2). In addition, we also successfully captured some reported population-specific marker genes (Fig. 4B), such as LOC_Os03g252800, LOC_Os01g18970 and LOC_Os07g44280 [10, 16].
Expression analysis of NRTPredictor gene set
To further prove that the NRTPredictor was better, we compared the expression level of the top 20 genes in each cell subpopulation with its expression level in the rest five clusters (Fig. 4C and Supplementary Figure S4). The results demonstrated that NRTPredictor had irreplaceable advantages in processing scRNA-seq data and does not rely on a priori biological background. Using multiple genes to characterize cell subpopulations of rice root tip could have greater ability to mark. Based on the 110 genes screened by NRTPredictor, the top six specific genes with the highest expression in each cell subpopulation were shown in Fig. 4D, which could provide biologists with some new insights.
Single-cell expression profiles containing all genes and 110 genes, respectively, were used as input to construct partition-based graph abstraction (PAGA) to describe the biological landscape (Fig. 4E). On the graph, the same topological structure was shown, such as the strong connections between Epidermis (NRH) and Root hair, suggesting that NRTPredictor screened for key molecular markers and removed redundant information (Fig. 4E). To provide insights into the molecular functions of 110 genes, GO analysis was performed (Fig. 4F). The results exhibited that a large number of genes were enriched in phenylpropanoid biosynthesis pathway, suggesting that these genes may be involved in the regulation of lignin and flavonoid synthesis, which played critical roles in plant growth and development, abiotic stresses and biotic stresses [17–19].
Related studies reveal that proteins encoded by OsCAD2 (LOC_Os02g09490, Os02g0187800) play a role in monolignol biosynthesis[20]. The expression of OsCAD2 was most tightly associated with the transcription of genes related to lignin biosynthesis, indicating that OsCAD2 is primarily responsible for monocotyledonous lignin biosynthesis in rice [21]. In addition, caffeic acid O-methyltransferase (COMT, LOC_Os08g06100, Os08g0157500), encoded in sorghum, has been shown to be one of the key enzymes in monolignol biosynthesis [22]. Result showed that LOC_Os02g09490 (Os02g0187800) and LOC_Os08g06100 (Os08g0157500) are specifically expressed in epidermis cells (Supplementary Table S2), which was closely related to the protective function of root tip epidermis cells in soil.
Multi-omics data integration of scRNA-seq and Bulk RNA-seq
To simultaneously define expression changes at the global and cellular levels, we also performed bulk RNA-seq analysis on rice root cells under stress and control samples in parallel. Based on the public database (PPRD) of Zhai et al [23]. we revealed that all 12 genes enriched in the phenylpropanoid biosynthesis pathway were expressed at high levels in the root, demonstrating that the key core genes were screened (Fig. 4 and Figure S5). To reveal aberrant cell subpopulations, our results were integrated with bulk RNA-seq data from salt stress, pi stress and flooding stress, and all 12 genes were involved in expression, confirming that Epidermis cells play a major regulatory role under stress conditions, with most genes representing in Epidermis cell subpopulations. (Fig. 4H). We then focused on the bulk RNA-seq specific expression patterns related to salt stress, and the results showed that Epidermis cell subpopulations has a positive role in studying the molecular mechanism of salt stress in rice (Fig. 4I).
By this research, the NRTPredictor webserver has been established and are freely available at http://www.germplasmai.com. The home user interface of NRTPredictor was shown in Fig. 5. Click on the "Predictor" button to enter the service module. Researchers can submit simple CSV file with gene expression matrix as the input. Click on the “Submit”, the NRTPredictor webserver will process the submitted tasks, predict and return result file. We also provided the example file and step-by-step guide for users, which can be seen in the ‘Tutorial’ module of web service.