Background: Accurate prognosis forecasting could improve the therapeutic management of cancer patients; however, the clinical features currently in use do not provide enough information. The purpose of this study is to develop a survival prediction model for cervical cancer patients using big data and machine learning algorithms.
Results: The Cancer Genome Atlas cervical cancer data, comprising the expression of 1046 microRNAs and the clinical information of 309 cervical and endocervical cancer samples and 3 control samples, were downloaded. Preprocessing comprised imputation of missing values and outliers, sample normalization, log transformation, and feature scaling; the 3 control samples, 2 metastatic samples, and 707 microRNAs with ≥ 20% missing values were excluded. Cox proportional-hazards analysis identified 55 prognosis-related microRNAs (20 positively and 35 negatively correlated with survival). K-means clustering analysis showed that the cervical cancer samples can be separated into two or three subgroups, with the top 20 identified survival-related microRNAs giving the best stratification. Using the Support Vector Machine algorithm, two prediction models were developed that segment patients into two and three groups, respectively, with different survival rates. The models exhibit high performance: for two classes, area under the curve (AUC) = 0.976 (training set), 0.972 (test set), and 0.974 (whole data set); for three classes, AUC = 0.983, 0.996, and 0.991 (groups 1, 2, and 3 in the training set); 0.955, 0.989, and 0.991 (groups 1, 2, and 3 in the test set); and 0.974, 0.993, and 0.991 (groups 1, 2, and 3 in the whole data set).
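To illustrate how one AUC per group can arise for a three-class model, the following minimal sketch computes one-vs-rest AUCs in plain Python. It assumes the per-group AUCs are one-vs-rest values, which the abstract does not state; the function names and the toy scores are hypothetical and are not data from the study.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive sample is scored
    higher than a random negative sample (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(groups, group_scores):
    """Return {group: AUC}, treating each group in turn as the positive class.

    groups: true group label of each sample.
    group_scores: maps a group label to the per-sample scores for that group.
    """
    result = {}
    for group, scores in group_scores.items():
        binary_labels = [1 if y == group else 0 for y in groups]
        result[group] = binary_auc(binary_labels, scores)
    return result


# Toy data, made up for illustration only (not from the study).
true_groups = [1, 1, 2, 2, 3, 3]
toy_scores = {
    1: [0.9, 0.6, 0.2, 0.7, 0.1, 0.1],
    2: [0.05, 0.3, 0.7, 0.2, 0.2, 0.1],
    3: [0.05, 0.1, 0.1, 0.1, 0.7, 0.8],
}
aucs = one_vs_rest_auc(true_groups, toy_scores)
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a cohort of a few hundred patients; a rank-based formulation would give the same value more efficiently.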
Conclusion: Survival prediction models for cervical cancer were developed. Patients with a very low survival rate (≤ 40%) can be separated first by the three-class prediction model. The remaining patients can then be classified by the two-class prediction model as having a high survival rate (≈ 75%) or a low survival rate (≈ 50%).
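The two-step use of the models described in the conclusion can be sketched as follows. This is only an illustration under stated assumptions: the two models are represented by hypothetical callables, the label of the very-low-survival group (`very_low_group`) and the `"high"`/`"low"` labels are assumed, and the stand-in models below are toy rules rather than the trained classifiers.

```python
def triage(sample, three_class_model, two_class_model, very_low_group=3):
    """Assign a survival category by applying the two models in sequence.

    three_class_model, two_class_model: hypothetical callables that take a
        sample and return a group label.
    very_low_group: assumed label of the very-low-survival group in the
        three-class model (not specified in the abstract).
    """
    # Step 1: the three-class model isolates the very-low-survival patients.
    if three_class_model(sample) == very_low_group:
        return "very low"
    # Step 2: the two-class model splits the remaining patients.
    return "high" if two_class_model(sample) == "high" else "low"


# Toy stand-in models operating on a single made-up risk score.
def toy_three_class(sample):
    return 3 if sample["risk"] > 0.8 else 1


def toy_two_class(sample):
    return "high" if sample["risk"] < 0.4 else "low"


categories = [
    triage({"risk": r}, toy_three_class, toy_two_class)
    for r in (0.9, 0.5, 0.1)
]
```

Keeping the models as injected callables means the same routing logic applies regardless of how the classifiers were trained.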