A risk score-matched model analysis of 4,452 patients with small cell lung cancer: who will benefit from operative treatment


Purpose: Small-cell lung cancer (SCLC) is difficult to cure. In this study, the SEER database was used to construct a model and explore potential prognostic factors in SCLC patients. Methods: The data were cleaned and randomly divided into a training cohort and a verification cohort. Univariate and multivariate Cox regression were used in the training cohort to identify independent prognostic factors, which were then incorporated into a Nomogram model. The model was verified in the verification cohort using the C-index, calibration curves and ROC curves in conjunction with the risk scores. Finally, overall survival by each factor was evaluated in the total cases. Results: In the training cohort, age, race, sex, total stage and extension were independent factors and were included in the Nomogram model. The C-indexes obtained from the training and verification cohorts showed that the model has predictive power. Moreover, the calibration curves and AUC results showed that the model was consistent in both the training cohort and the verification cohort. Finally, significant differences in survival were observed across the levels of the above-mentioned factors, and overall survival decreased over time. Conclusions: Age, race, sex, total stage and extension degree are independent risk factors for overall survival of patients. The Nomogram model can predict the 1-year, 3-year and 5-year survival probabilities, providing a reference for individualized clinical treatment.


Introduction
In 2019, lung cancer had the highest incidence and death rate of all cancers in the United States [1]. Small-cell lung cancer (SCLC) is a malignant pulmonary neuroendocrine tumor originating in bronchial mucosa or glandular epithelium. It is the most malignant and least differentiated lung cancer, accounting for about 13%-20% of lung cancer [2]. Small-cell lung cancer can be divided into oat type, intermediate type, and compound type. Its characteristics mainly include fast growth [3], strong invasiveness, a tendency to develop drug resistance, frequent recurrence and poor prognosis. It also metastasizes early and widely. Many studies have shown that smoking is a major risk factor for small-cell lung cancer [4].
At present, there are some challenges in the early diagnosis of small-cell lung cancer. Diagnosis is mainly performed by puncture or biopsy, with staging by the Tumor, Node, Metastasis Classification (TNM) system or the Veterans Administration (VA) system [5]. The clinical treatment of small-cell lung cancer is also difficult. Despite treatment, patients still develop metastasis, drug resistance, and even relapse, resulting in high mortality rates. In order to improve the early diagnosis of small-cell lung cancer and develop personalized treatment methods, it is essential to pay attention to the relevant risk factors and to predict survival systematically and accurately.
This study systematically analyzed the clinical and pathological data of small-cell lung cancer patients from the Surveillance, Epidemiology, and End Results (SEER) Database (from 1973 to 2015). By establishing a statistical model that predicts survival in patients, we explored and discussed the clinical application of potential prognostic factors in patients with small-cell lung cancer.

Data sources
Patients with pathologically diagnosed small-cell lung cancer in the SEER database were collected using SEER*Stat software. The extracted patient information included age, race, sex, age of diagnosis, histologic type, tumor site, pathological stage (T, N, and M), surgical site and situation, tumor size, the sequence number of primary tumor, tumor type, laterality, extension, the involvement of lymph nodes, metastasis, death, survival time and state, benign and malignant tumor, and marital status. In order to understand the influencing factors and benefits of surgical treatment, we ignored the possible interference of chemoradiotherapy data. A total of 1207856 patients met the initial requirements. Patients whose primary tumor was not small-cell lung cancer were excluded. After screening, a total of 4432 small-cell lung cancer patients were included in the study. We randomly divided these patients into a training cohort of 1776 (40.07%) and a validation group of 2656 (59.93%) [6][7][8] with the createDataPartition function in the R caret package.
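The random split described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name `stratified_split`, the seed, the 40% training fraction and the toy status labels are assumptions made for illustration, and the rounding rule may differ from the one used by caret's createDataPartition.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=0):
    """Split indices into train/validation, sampling the same fraction within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train = []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_fraction)
        train.extend(indices[:n_train])
    train_set = set(train)
    valid = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), valid

# Hypothetical survival status labels (1 = dead, 0 = alive or censored) for 4432 patients.
status = [1] * 4000 + [0] * 432
train_idx, valid_idx = stratified_split(status, train_fraction=0.4)
print(len(train_idx), len(valid_idx))
```

Sampling within each outcome level keeps the proportion of events similar in both cohorts, which is the usual reason for preferring a stratified split over a plain shuffle.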

Statistical analysis
After data cleaning, 4432 cases were included in the study. They were randomly divided into a training cohort (1776 cases) and a verification cohort (2656 cases). A univariate Cox regression model was established to analyze the data of the training cohort. The variables that were significant in the univariate analysis were entered into a multivariate Cox regression model to identify the independent risk factors affecting the survival of small-cell lung cancer patients in the training cohort. A Nomogram was then plotted as a survival prediction model [9,10], and calibration plots were drawn to evaluate the predictions of 1-year, 3-year, and 5-year survival rates [10]. The risk score of each patient was calculated from the meaningful variables (age, gender, race, total staging, and extension), and a receiver operating characteristic (ROC) curve was produced to assess the model. The area under the ROC curve (AUC) for the 1-year, 3-year and 5-year survival rates was calculated to evaluate the accuracy of the model prediction.
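As an illustration of how a risk score can be derived from a fitted Cox model, the sketch below computes the linear predictor (the sum of coefficient times covariate value) for a few patients. The coefficient values and variable names are hypothetical, not the estimates of this study, and the median cutoff used to form high-risk and low-risk groups is an assumption, since the cutoff is not stated in the text.

```python
from statistics import median

# Hypothetical log hazard ratios for dummy-coded covariates.
# These are illustrative values, NOT the coefficients fitted in this study.
COEFS = {
    "age_75_79": 0.35,
    "male": 0.18,
    "race_white": 0.10,
    "stage_IV": 0.90,
    "extension_distant": 0.40,
}

def risk_score(patient, coefs=COEFS):
    """Cox linear predictor: sum of coefficient * covariate value (missing covariates count as 0)."""
    return sum(beta * patient.get(name, 0) for name, beta in coefs.items())

patients = [
    {"male": 1, "stage_IV": 1},
    {"age_75_79": 1, "race_white": 1},
    {"male": 1, "stage_IV": 1, "extension_distant": 1},
]
scores = [risk_score(p) for p in patients]

# Assumed grouping rule: scores strictly above the median are labelled high risk.
cutoff = median(scores)
groups = ["high" if s > cutoff else "low" for s in scores]
print(scores, groups)
```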
In addition, the multivariate Cox regression model was constructed again in the validation group, and the C-index, calibration plots and AUC were evaluated.
Finally, the risk score was calculated for the total data (including the training cohort and the verification cohort). Curves of survival probability against survival time were drawn separately for the high-risk and low-risk groups. The overall survival associated with the meaningful variables from the above model (age, sex, race, pathological stage, and extension) was also analyzed.
All analyses were performed in R (v3.5.3), and differences were considered statistically significant when P<0.05. The whole process is shown in Fig. S1.

Patient characteristics
As shown in Fig. 1, after data cleaning, 4432 small-cell lung cancer patients were screened, including 1776 in the training cohort and 2656 in the validation group. Clinical and pathological characteristics of patients in the training cohort, validation group and pre-cleaning group are shown in Table 1.

Univariate Cox analysis results of survival influencing factors
Univariate Cox regression analysis was performed on the 1776 small-cell lung cancer patients in the training cohort. The results showed that age, race, sex, the degree of differentiation, N stage, M stage and total stage, surgery of primary site, tumor size, extension, involvement of the lymph nodes, metastasis, death from cancer and non-cancer causes, sequence of primary tumor, tumor number, and age of diagnosis were associated with survival time and survival status (P < 0.05). T stage, operation or no operation, marital status and laterality were not associated with survival time or survival status (P > 0.05) (Table S1).

Multivariate Cox analysis results of survival influencing factors
The statistically significant variables in the univariate Cox regression analysis were included in the multivariate analysis. The results showed that age 65-69, age 75-79, age 80-84, age >=85, white race, male sex, total stage II and tumor invasion range were correlated with survival time and survival status (Table 2).

The Nomogram model building
Based on the results of the multivariate analysis, the Cox regression model was constructed again with the above meaningful variables. Each variable was included in the Nomogram and assigned points according to its coefficient in the Cox regression model. The total score is obtained by adding the points of the 5 variables. The overall score corresponds to values on the survival axes, which predict the 1-year, 3-year, and 5-year survival rates of small-cell lung cancer patients. The higher the overall score, the higher the survival rate, and vice versa. Detailed results are shown in Fig. 2.
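The assignment of points in a nomogram can be sketched as a linear rescaling of the Cox coefficients, so that the variable with the widest coefficient range spans 0 to 100 points. This is a common convention and is assumed here rather than taken from the text. The coefficient values, variable names and the helper `nomogram_points` are hypothetical.

```python
# Hypothetical log hazard ratios per level (reference level = 0.0).
# Illustrative values only, NOT the coefficients fitted in this study.
coefficients = {
    "sex": {"female": 0.0, "male": 0.2},
    "total_stage": {"I": 0.0, "II": 0.4, "III": 0.7, "IV": 1.0},
}

def nomogram_points(coefs):
    """Rescale each level's coefficient so the widest variable spans 0-100 points."""
    widest = max(max(lv.values()) - min(lv.values()) for lv in coefs.values())
    return {
        var: {
            level: 100.0 * (beta - min(levels.values())) / widest
            for level, beta in levels.items()
        }
        for var, levels in coefs.items()
    }

points = nomogram_points(coefficients)
# Total points for a hypothetical male patient with stage III disease.
total = points["sex"]["male"] + points["total_stage"]["III"]
print(points, total)
```

Because the rescaling is linear, the total points carry the same ranking information as the model's linear predictor, and the survival axes of the nomogram are a relabelling of that total.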

Testing the Nomogram model with the training cohort
The C-index was used to evaluate the discrimination between the model predictions and the observed values in the training cohort. The C-index was 0.6817, indicating that the model had acceptable predictive ability.
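For readers unfamiliar with the C-index, a minimal sketch of Harrell's concordance index is given below. It counts, among comparable patient pairs, how often the patient with the higher risk score died earlier. The function name and the toy data are illustrative assumptions and do not reproduce the values reported here.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the higher risk score has the shorter survival."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in the score count as half
    return concordant / comparable

# Toy data: survival time in months, event indicator (1 = death), risk score.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
scores = [2.1, 1.7, 1.9, 0.8, 0.3]
c_index = concordance_index(times, events, scores)
print(c_index)  # 0.875 (7 of 8 comparable pairs are concordant)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.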
The results of the model evaluation using the calibration plots are shown in Fig. 3A~C. The sample was divided evenly into three groups, and the fitting coefficient was set to 100, to calculate the actual 1-year, 3-year and 5-year survival rates corresponding to those predicted by the Nomogram. The calibration curves of the 1-year, 3-year and 5-year survival predictions in the training cohort are close to the ideal 45° dashed line, indicating that the predicted values are consistent with the actual observed values.
ROC curves were used to evaluate the model. The significant variables in the multivariate analysis (age, sex, race, total stage, and extension) were included in the calculation of the risk score of each patient. The ROC curve was prepared from the risk score and the AUC was calculated (Fig. 4A~C). In the training cohort, the AUC was 0.733 for 1-year survival, 0.754 for 3-year survival and 0.743 for 5-year survival, indicating moderate accuracy and good consistency between the predicted and actual values.
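The idea behind an AUC at a fixed time horizon can be sketched as follows. This is a simplified version that treats patients who died by the horizon as cases and patients still under follow-up after the horizon as controls, and simply drops patients censored before the horizon. Dedicated time-dependent ROC estimators handle censoring more carefully, so this is not necessarily the estimator used in this study. The function name and toy data are assumptions.

```python
def auc_at_horizon(times, events, scores, horizon):
    """Probability that a patient who died by `horizon` has a higher score than one who survived past it."""
    cases = [s for t, e, s in zip(times, events, scores) if e == 1 and t <= horizon]
    controls = [s for t, s in zip(times, scores) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")
    wins = sum(
        1.0 if case > ctrl else 0.5 if case == ctrl else 0.0
        for case in cases
        for ctrl in controls
    )
    return wins / (len(cases) * len(controls))

# Toy data: survival time in months, event indicator (1 = death), risk score.
times = [3, 7, 10, 14, 20, 40]
events = [1, 1, 0, 1, 1, 0]
scores = [2.0, 1.5, 1.8, 0.9, 1.6, 0.2]
auc_1y = auc_at_horizon(times, events, scores, horizon=12)
print(auc_1y)
```

In this toy example the patient censored at 10 months is excluded, leaving two cases and three controls, and five of the six case-control pairs are ordered correctly.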

Verifying the prediction model with the verification cohort
In the verification cohort, the multivariate Cox regression model was built first, and the c-index was then calculated. The c-index was 0.6778, consistent with the accuracy of the model in the training cohort, indicating that the model retained its predictive ability.
The calibration plots of the training cohort were then verified. The evaluation results for the validation group are shown in Fig. 3D~F. The 1-year, 3-year and 5-year survival calibration curves of the verification cohort are similar to those of the training cohort and lie close to the ideal 45° dotted line, indicating that the calibration observed in the training cohort is relatively accurate.
The ROC analysis of the training cohort was verified in the same way. The risk score of each patient in the validation group was calculated, and the ROC curve and the area under the curve were computed from the risk scores (Fig. 4D~F). The 1-year, 3-year and 5-year survival predictions in the validation group were all of moderate accuracy, consistent with the ROC results of the training cohort.

The overall survival evaluation in the total samples
All patient data from the modeling and validation groups were integrated to obtain the total risk score. The survival curves of the high-risk and low-risk groups were plotted (Fig. 5A). Based on the Cox regression model established above, the effects of the 5 meaningful variables in the multivariate analysis (age, gender, race, total staging and degree of extension) on patient survival were analyzed, and the survival curve of each variable was plotted (Fig. 5B~F).
In the survival analysis, the median survival time of small-cell lung cancer patients was 7 (0-71) months, and the mean survival time was 11.26±13.09 months. The 1-year survival rate gradually decreased with increasing age, from 49.2% to 19.67%. The exception was the 50-54 age group, whose survival rate (49.7%) was slightly higher than that of the previous age group. The 3-year survival rate also declined gradually, from 16.3% to 5.5%, except in the 65-69 and 50-54 age groups, where it was slightly higher than in the previous age group. Although data on 5-year survival are incomplete, it generally declined as well. The survival rate of black patients (41.14%) was higher than that of white patients (36.79%) and patients of other races (32.61%). The survival rate of women with small-cell lung cancer (40.53%) was higher than that of men (33.32%). By tumor stage, the survival rate generally decreased as the stage increased, from 76.3% to 22.59%, although the rates for stage IIA (73.0%), stage III (66.7%) and stage IIIB (52.6%) were slightly higher than those of the previous stage. By degree of extension, the survival rate decreased gradually as the range increased, from 44.4% to 32.0% (Table S2).
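Survival curves and median survival times of the kind reported above are typically obtained with the Kaplan-Meier estimator. The text does not name the estimator, so this is an assumption, and the sketch below uses toy data rather than the study data.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct time with at least one death."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

def median_survival(curve):
    """First time at which the estimated survival probability is 0.5 or lower."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up

# Toy data: survival time in months and event indicator (1 = death, 0 = censored).
times = [2, 3, 3, 5, 8, 12]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
print(curve, median_survival(curve))
```

Applying the same function separately to the high-risk and low-risk groups gives the two curves that are compared in a plot such as Fig. 5A.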

Discussion
This SEER study analyzed the factors affecting the survival time and survival status of patients with small-cell lung cancer using the data of 4432 patients. It finds that age, race, sex, total stage, and extension are risk factors related to survival time and survival status in multivariate analysis. Some studies suggest that the extent of disease, performance status and LDH serum levels are independent prognostic factors for survival, which is inconsistent with the results of this study [11]. This discrepancy may be related to differences in the range and mode of data selection.
(a). In the 1-year, 3-year and 5-year survival statistics, the survival rate generally decreases with increasing age group. The reasons may be related to physical fitness, physical state at the time of treatment, and treatment strategies. In clinical practice, older patients have more complications, and doctors may adopt more conservative treatment strategies, leading to unsatisfactory treatment effects [12]. Older patients are also likely to be less willing to cooperate with treatment. Most patients aged 50-54 are in middle age and in good physical condition, with fewer complications [13], so they are able to withstand the side effects of chemotherapy and radiotherapy [14]. In conclusion, patients aged 50-54 have a good early treatment effect and a positive attitude towards treatment.
(b). Although the United States is a multiracial country, it is still predominantly white. White people also have higher average income and education levels than black people and others, according to the U.S. Census Bureau [15]. Therefore, white people have a higher awareness of health care and can better afford treatment. There are still deficiencies in the racial survival statistics, because SEER data on black patients and patients of other races are limited. The results would be more convincing if the number of black patients and patients of other races were increased.
(c). Survival rates are significantly higher in women than in men, which is consistent with Eskandar et al.'s study of the effects of gender on small-cell lung cancer [16,17]. The reason may be related to smoking: women are in general less likely to smoke, and smoke less heavily, than men. As a result, small-cell lung cancer is less severe in women than in men.
(d). As the total stage increased, the 1-year, 3-year, and 5-year survival analyses showed a fluctuating decrease in survival. This may be related to the treatment strategies adopted by patients. In general, the lower the stage, the milder the disease, the better the physical state, and the better the effect of chemotherapy, radiotherapy and other treatments. However, patients with high-stage disease are weak and the cancer is likely to have metastasized, so it is difficult for them to tolerate treatments with large side effects, and the curative effect is not satisfactory.
(e). As the degree of tumor invasion increases, the scope and extent of tumor damage to surrounding tissues increase. Likewise, the possibility of distant tumor metastasis also increases. Even when patients adopt aggressive treatment strategies, treatment becomes less and less effective, so the survival rate is inevitably lower.
(f). The degree of differentiation, N, M and total stage, surgery of primary site [18], tumor size, involvement of the lymph nodes [5,19], metastasis [20], death from cancer and non-cancer causes, sequence of the primary tumor, tumor number, and age of diagnosis were significant in the univariate analysis but not in the multivariate analysis. The first possibility is that these factors are not strongly correlated with the survival time and status of patients [21]. The second possibility is that there are cross-influences among the factors, weakening the correlation between individual factors and outcomes. The third possibility is that these factors are affected by confounding factors, leading to errors in the statistical data.
On the whole, the SEER database used in this study contains a large amount of rich data collected over a long time span. The patients were from all over the United States, which reduces the statistical bias of patient data from a single institution. These advantages give this study a large sample size, so the results are more convincing.
However, the SEER database still has some drawbacks. Firstly, many fields in the patient records are unknown, which greatly reduces the sample size that can finally be included in the analysis after data cleaning. Secondly, the patients in the database are all American patients, white patients are the majority, and the sample size of black patients and patients of other races is insufficient. Hence, the survival prediction model constructed may not be suitable for Asia or the rest of the world. Thirdly, there were no relevant records of the patients' physical health conditions before treatment, such as basic diseases, smoking history and tumor marker examination results [22][23][24]. No details were given about the treatment methods [25,26], the drugs used and the timing of radiotherapy [27]. Adjuvant therapy and quality of life after treatment were also not recorded. In subsequent in-depth studies, more specific details of the factors related to patient survival should be explored, such as specific treatment strategies based on age or physical condition. At the same time, we will refer to additional databases to study the specific impact of different regions and races on survival.

Conclusion
On the whole, by identifying and analyzing the independent influencing factors of small-cell lung cancer, this study obtained the effect of different influencing factors on the survival time and survival probability of patients. In addition, a Nomogram was drawn to make a more accurate prediction of 1-year, 3-year and 5-year survival rates for different patients. Patients can also make the most appropriate treatment decisions based on prognosis, reducing the incidence of incomplete treatment and inappropriate treatment timing.
The final results are of great significance for objective analysis of patients' conditions, formulation of personalized clinical treatment strategies, and prediction of patients' prognosis. However, there are still some limitations in this study. The influencing factors of small-cell lung cancer identified in this research are all already known, and the study did not analyze the disease from a new perspective or draw new biological conclusions, so it lacks novelty.
In this study, the Cox regression model, the Nomogram and the ROC curve model were constructed based on the SEER database, which inevitably have limitations. The validation of each model also relied on internal data. Verification with data from other sources would increase its credibility. We should continue to collect data on related diseases in other countries and regions, so that we can refine the research results and data analysis and reduce objective errors and imperfections.

Declarations

Competing Interests
The authors declare that they have no competing interests.

Consent for publication
We agree to publish this paper in BMC.

Figure 1
Small-cell lung cancer data cleaning process. Because the amount of small-cell lung cancer data exceeded the processing capacity of a single Excel file, the whole data set was split into two groups for cleanup. The unknown, blank, and unclassified data in each group were cleared. The data that were not relevant to the study or of no research significance were then purged. Finally, the two groups of processed data were merged. The different types of data were numbered in line with the needs of the research content, which facilitates the subsequent data analysis.

Figure 2
The nomogram model of small-cell lung cancer patients. The cleaned SCLC data were randomly divided into the modeling group and the validation group. First, univariate and multivariate regression analyses were used to analyze the data of the modeling group. Then the meaningful variables that were filtered out were used to construct a Cox regression model. The degree of prediction was evaluated with the c-index, calibration and ROC curves. The model was applied to the validation group to verify the degree to which the model predicted its data. The risk scores of the modeling group and the validation group were recorded respectively. Finally, the model was applied to all small-cell lung cancer data to calculate the total risk score and study the impact of meaningful variables on survival.

Figure 5
Survival curves of risk score or independent factors in total SCLC patients. (p=0 means P value < 0.001) A. risk score. The survival probability and survival time of the high-risk group were significantly lower than those of the low-risk group; B. age. As age increases, the survival time and the chances of survival go down. Nevertheless, the survival curve of patients aged 50-54 was better than that of other age groups; C. race.
