In October 2021, we downloaded data for 424 HCC cases (374 tumor tissues and 50 normal tissues) from The Cancer Genome Atlas (TCGA) database (https://tcga-data.nci.nih.gov/tcga/) as the training set and main research cohort. We then downloaded the HCC patient data of GSE36376 (192 normal tissues and 240 tumor tissues; platform GPL10558) from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) as an external validation data set.
We downloaded 237 NSRGs from the Molecular Signatures Database (MSigDB) (http://www.gsea-msigdb.org/gsea/msigdb) for this study.
We used R to extract the expression matrix of the NSRGs. All expression data were standardized by z-score transformation (each sample's mean becomes 0 and its variance becomes 1). The NSRGs were set as independent variables (features), and normal/tumor status was set as the dependent variable (label) for the occurrence of HCC.
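The standardization step can be sketched as follows (an illustrative Python fragment, not the authors' code; the expression values are hypothetical toy numbers):

```python
import numpy as np

# toy expression matrix: rows = samples, columns = NSRG features
expr = np.array([[2.0, 4.0, 6.0, 8.0],
                 [1.0, 1.0, 3.0, 3.0],
                 [10.0, 20.0, 30.0, 40.0]])

# z-score each sample (row): mean becomes 0, variance becomes 1
mean = expr.mean(axis=1, keepdims=True)
std = expr.std(axis=1, keepdims=True)
z = (expr - mean) / std
```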
Differential expression analysis of NSRGs
We used the "limma" package in R to select the differentially expressed NSRGs (DENSRGs) of HCC, with the criteria |logFC| > 1 and FDR < 0.05 (FC: fold change; FDR: false discovery rate).
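limma is an R package; as a rough Python illustration of the same thresholding, the sketch below applies the |logFC| > 1 and FDR < 0.05 criteria with a hand-rolled Benjamini–Hochberg adjustment (the per-gene logFC and p-values are hypothetical):

```python
import numpy as np

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

# hypothetical per-gene statistics for four NSRGs
logfc = np.array([1.5, -2.0, 0.3, 1.2])
pvals = np.array([0.001, 0.0005, 0.2, 0.04])
fdr = bh_fdr(pvals)

# DENSRG criteria from the text: |logFC| > 1 and FDR < 0.05
is_densrg = (np.abs(logfc) > 1) & (fdr < 0.05)
```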
Identification of hub NSRGs
In the TCGA-HCC cohort, two machine learning algorithms, Least Absolute Shrinkage and Selection Operator (Lasso) and Support Vector Machine (SVM), were used to screen important NSRGs of HCC.
Lasso is a regression analysis algorithm that performs variable selection while regularizing. Lasso was implemented with the "glmnet" package in R, with the parameter settings family: binomial, alpha: 1, type.measure: deviance, and nfolds: 10.
SVM can express complex classification boundaries through kernel functions. SVM was implemented with the "e1071", "kernlab", and "caret" packages in R, with the parameter settings functions: caretFuncs, method: cv, and methods: svmRadial. To avoid over-fitting of the model, we also performed univariate regression analysis of each gene during feature-gene selection using the "survival" and "survminer" packages in R, with the filter condition P < 0.05.
The genes intersecting across the three methods were identified as the final feature genes for further research and used as the variables of the classification and diagnosis model.
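The screening-and-intersection step can be sketched in Python (a minimal illustration, not the authors' R pipeline: scikit-learn's L1-penalized logistic regression stands in for glmnet's binomial Lasso, and the SVM and univariate result sets are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# toy data: 100 samples x 5 candidate genes; only gene 0 is informative
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)  # label: tumor (1) vs normal (0)

# L1-penalized logistic regression, the sklearn analogue of
# glmnet with family="binomial" and alpha=1 (Lasso)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
lasso_genes = set(np.flatnonzero(lasso.coef_[0] != 0))

# placeholder result sets for the other two screens (hypothetical indices)
svm_genes = {0, 1}          # SVM-selected genes
univariate_genes = {0, 3}   # genes with univariate P < 0.05

# final feature genes = intersection of the three methods
hub_genes = lasso_genes & svm_genes & univariate_genes
```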
Establishment and verification of the prognostic model
The HCC classification and diagnosis model was constructed based on the expression of the core genes. In this study, five machine learning classification algorithms, including XGBClassifier, LGBMClassifier, AdaBoostClassifier, MLPClassifier, and SVM, were used to construct the initial classification and diagnosis models.
30% of TCGA-HCC patients were randomly selected as the test set and the remaining 70% as the training set for 10-fold cross-validation; GEO-HCC patients were used as the external validation set to validate the model. Model evaluation indicators included the area under the curve (AUC), accuracy, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and F1 score, where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative samples, respectively. The algorithm with the best overall performance across these indicators was selected to build the final model.
PPV = TP / (TP + FP)
NPV = TN / (TN + FN)
Sensitivity = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 score = 2 × PPV × Sensitivity / (PPV + Sensitivity)
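With TP, TN, FP, and FN defined as above, the evaluation metrics can be computed directly (the confusion-matrix counts below are hypothetical; sensitivity, accuracy, and F1 follow their standard definitions):

```python
# hypothetical confusion-matrix counts for a binary tumor/normal classifier
TP, TN, FP, FN = 80, 40, 10, 20

ppv = TP / (TP + FP)                        # positive predictive value
npv = TN / (TN + FN)                        # negative predictive value
sensitivity = TP / (TP + FN)                # true positive rate (recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
```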
XGBClassifier was implemented with the "xgboost 1.2.1" package in Python, with the model parameters objective (optimized objective function): binary:logistic, learning_rate (learning rate): 0.3, max_depth (maximum tree depth): 4, min_child_weight (minimum child weight sum): 6, and reg_lambda (L2 regularization coefficient): 1. LGBMClassifier was based on the "lightgbm 3.2.1" package in Python, with the model parameters boosting_type (algorithm type): gbdt, learning_rate (learning rate): 0.001, max_depth (maximum tree depth): 1, n_estimators (maximum number of trees): 5, and num_leaves (maximum number of leaves): 5. AdaBoostClassifier was implemented with the "sklearn 0.22.1" package in Python, with the model parameters learning_rate (learning rate): 0.1 and n_estimators (number of base models): 50. MLPClassifier was implemented with the "sklearn 0.22.1" package in Python, with the model parameters activation (non-linear activation function): logistic, hidden_layer_sizes (hidden layer widths): (30, 30), and max_iter (number of iterations): 10. SVM was implemented with the "sklearn 0.22.1" package in Python, with the model parameters C (regularization factor): 0.1, kernel (kernel type): rbf, and tol (convergence tolerance): 0.001.
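The three scikit-learn models can be instantiated with the reported parameter values as follows (a sketch, not the authors' code: the xgboost and lightgbm models are omitted so the fragment needs only scikit-learn, and the toy data are hypothetical):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# parameter values as reported in the text
models = {
    "AdaBoost": AdaBoostClassifier(learning_rate=0.1, n_estimators=50),
    "MLP": MLPClassifier(activation="logistic",
                         hidden_layer_sizes=(30, 30), max_iter=10),
    "SVM": SVC(C=0.1, kernel="rbf", tol=0.001),
}

# toy training data: 60 samples x 4 feature genes
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

for name, model in models.items():
    model.fit(X, y)
```

Note that max_iter=10 for the MLP is far below scikit-learn's default, so a convergence warning is expected during fitting.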
Hub gene expression analysis
To further analyze the relationship between the participating variables (hub genes) and HCC, the Mann–Whitney test was used to compare the expression levels of the hub genes between the tumor and normal groups, and Pearson correlation was used to calculate the correlations among the risk genes.
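The two tests named above can be run with SciPy as follows (an illustrative sketch on hypothetical expression vectors, not the authors' data):

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

rng = np.random.default_rng(2)
tumor = rng.normal(loc=2.0, size=50)   # hub-gene expression, tumor group
normal = rng.normal(loc=0.0, size=50)  # hub-gene expression, normal group

# Mann-Whitney U test: tumor vs normal expression of one hub gene
u_stat, p_value = mannwhitneyu(tumor, normal, alternative="two-sided")

# Pearson correlation between two (hypothetical, correlated) risk genes
gene_a = rng.normal(size=50)
gene_b = 0.9 * gene_a + rng.normal(scale=0.1, size=50)
r, p_corr = pearsonr(gene_a, gene_b)
```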
Immune cell infiltration analysis
The CIBERSORT algorithm was used to evaluate the infiltration of 22 immune cell types in the TCGA-HCC cohort. We then compared the distribution of the 22 immune cell types between the normal and tumor groups, and Spearman correlation analysis was performed between the 22 immune cell types and the hub genes. Finally, each patient was assigned a risk score based on the expression of the variables selected into the model; the risk score was calculated as the sum, over all risk genes, of the expression values weighted by the Lasso coefficients. Patients were divided into high-risk and low-risk groups by the median risk score and then analyzed for immunotherapy responsiveness, which was predicted with the Tumor Immune Dysfunction and Exclusion (TIDE) tool (http://tide.dfci.harvard.edu/).
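The risk scoring and median split can be sketched as follows (an illustration only: the Lasso coefficients, patient count, and expression values are all hypothetical):

```python
import numpy as np

# hypothetical Lasso coefficients for three risk genes
lasso_coef = np.array([0.8, -0.5, 1.2])

# toy expression matrix: 10 patients x 3 risk genes
rng = np.random.default_rng(3)
expr = rng.normal(size=(10, 3))

# risk score = Lasso-coefficient-weighted sum over the risk genes
risk = expr @ lasso_coef

# split into high-/low-risk groups at the median risk score
high_risk = risk > np.median(risk)
```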