Clinicopathological characteristics and cancer-specific survival prediction of large cell lung cancer: a population-based study

Background: The features and survival outcomes of large cell lung cancer (LCLC) are rarely reported because of its low incidence, and its prognosis therefore remains unclear. This study aimed to describe the demographic and clinical characteristics of LCLC using a population-based database and to identify prognostic factors for cancer-specific survival (CSS) in LCLC patients. In addition, a nomogram based on the identified prognostic factors was developed and independently validated to predict CSS in LCLC.
Methods: We extracted information on LCLC patients from the Surveillance, Epidemiology, and End Results (SEER) database (2005-2014) and summarized the characteristics of the extracted factors. The extracted data were split into a training cohort and an independent validation cohort. In the training cohort, Cox proportional hazards regression was used to identify prognostic factors for LCLC patients, and a nomogram was developed from these factors. The nomogram was then validated in the validation cohort by calculating the C-index and the average time-dependent area under the receiver operating characteristic curve (time-dependent AUC) for 1-, 3- and 5-year CSS. Calibration curves were drawn to visualize the performance of the nomogram.
Results: A total of 4963 patients with LCLC were identified from the SEER database. Nearly half of the patients were diagnosed at stage IV, and only approximately 20% underwent surgery. The prognostic factors for LCLC patients were age, sex, American Joint Committee on Cancer (AJCC) stage, race, surgery, tumour size and marital status. The C-index was 0.701±0.01, and the mean time-dependent AUC for 1-, 3- and 5-year CSS was 0.88. The calibration curves showed only a small gap between predicted and observed CSS at 1, 3 and 5 years.
Conclusions: Sex, age, race, marital status, AJCC stage, surgery and tumour size are independent prognostic factors for CSS in LCLC. The established nomogram can provide a more precise evaluation of survival for LCLC patients and help clinicians with individualized management.


Background
Lung cancer is the leading cause of cancer death worldwide [1]. Large cell lung cancer (LCLC) makes up a small proportion of lung cancers: it is a subtype of non-small cell lung carcinoma (NSCLC) without glandular or squamous differentiation and accounts for just 9% of all cases of lung cancer [2]. LCLC is often a diagnosis of exclusion, made after adenocarcinoma, squamous cell carcinoma and small cell lung cancer have been ruled out [3]. LCLC can be found in any part of the lung, and men are more susceptible to it [4]. LCLC grows and spreads quickly, which makes it harder to treat. Its features and survival outcomes are rarely reported because of its low incidence, and its prognosis therefore remains unclear. In general, patients with solid tumours can be classified by the American Joint Committee on Cancer (AJCC) staging system [5]. However, the AJCC classification fails to predict survival for patients with particular cancer types [6].
Nomograms are decision-making tools for patients with cancer: they can predict patient survival and have been widely used to stratify treatment and evaluate outcomes [7]. To date, however, no nomogram is available for patients with LCLC.
In this study, we describe the characteristics and prognosis of LCLC using a United States population-based database. In addition, we develop and independently validate a nomogram model based on the selected data so that the prognosis of LCLC patients can be predicted visually.

Ethics statement
We obtained permission to access the Surveillance, Epidemiology, and End Results (SEER) database (reference number 10782-Nov2016). Informed patient consent was not required for this study because the research data are de-identified and publicly available.

Study population
We obtained data from the November 2013 submission of the SEER Research Data.
In this study, we extracted patient data from 2005 to 2014 from the SEER database using SEER*Stat software version 8.3.5. As an authoritative source of cancer information, the SEER database records cancer incidence and survival in the USA; it comprises 18 population-based registries, which represent about 28% of the American population [8,9]. According to the International

Covariates
We collected patient demographic characteristics, including age at diagnosis, race, sex and marital status, and tumour clinicopathological features, including primary site, laterality, grade and American Joint Committee on Cancer (AJCC) stage (6th edition). Surgery information was also included. The primary outcome was cancer-specific survival (CSS), defined as the time interval from diagnosis to death due to this cancer. Individuals who died of other causes or were alive at the cutoff date (December 2017) were censored.
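The outcome definition can be sketched in code as follows. This is a minimal illustration rather than the authors' code; the field names (`survival_months`, `vital_status`, `cause_of_death`) and their values are hypothetical and do not correspond to actual SEER variable names.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    survival_months: int  # time from diagnosis to death or last follow-up
    vital_status: str     # "dead" or "alive" (hypothetical coding)
    cause_of_death: str   # "this_cancer", "other", or "" if alive (hypothetical)


def css_outcome(p: Patient) -> tuple[int, int]:
    """Return (time, event): event is 1 for death from this cancer, 0 if censored."""
    event = int(p.vital_status == "dead" and p.cause_of_death == "this_cancer")
    return p.survival_months, event


# Invented records covering the three cases described in the text.
cohort = [
    Patient(6, "dead", "this_cancer"),  # cancer-specific death: event
    Patient(14, "dead", "other"),       # died of another cause: censored
    Patient(40, "alive", ""),           # alive at the cutoff date: censored
]
outcomes = [css_outcome(p) for p in cohort]
```

Each patient therefore contributes a follow-up time and a binary event indicator, which is the input format expected by the survival methods described in the statistical analysis.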

Statistical analysis
Categorical variables describing demographic and clinical characteristics were reported as numbers and percentages. Continuous variables were expressed as the mean±standard deviation. A random sampling strategy was used to split the primary cohort into a training cohort and a validation cohort. In the training cohort, the optimal cutoff points for continuous variables were determined using maximally selected rank statistics before the Cox proportional hazards regression was fitted [10,11]. CSS curves were plotted by the Kaplan-Meier method, and the log-rank test was used to compare CSS distributions. Cox proportional hazards regression analysis was used to evaluate the factors influencing CSS and to compute hazard ratios and their 95% confidence intervals (95% CIs). A two-sided P value < 0.05 was considered statistically significant. A nomogram was built to predict the probability of 1-, 3- and 5-year CSS based on the final Cox proportional hazards regression model. In the validation cohort, the total nomogram score was calculated for each patient. The performance of the scores was assessed by calculating the C-index and the average time-dependent area under the receiver operating characteristic curve (time-dependent AUC) at 1, 3 and 5 years; calibration curves based on the total scores were also drawn to visualize performance. All statistical analyses were performed using R version 3.2.5 [12].
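To make the Kaplan-Meier step concrete, the following is a minimal pure-Python sketch of the product-limit estimator. It is not the R code used in the study, and the toy follow-up times and event indicators are invented for illustration only.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up times (e.g. months)
    events : 1 = cancer-specific death, 0 = censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set after t.
        n_at_risk -= leaving
    return curve


# Toy data, invented for illustration.
times = [2, 3, 3, 5, 8, 12]
events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(times, events)
```

Censored patients do not lower the curve themselves, but they shrink the risk set, so later deaths produce larger drops. The log-rank test compares such curves between groups using the same risk-set bookkeeping.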

Summary of the characteristics
A total of 4963 LCLC patients were identified from the SEER database, all of whom were included in the study. The median CSS time was 6 months (6-7 months).
The demographic and tumour clinicopathological features of the eligible patients are summarized in Table 1. The mean age of the LCLC patients was 66.9±11.2 years, and the male/female ratio was 1.37. Most patients were White (79.27%). The most common primary site was the upper lobe (52.63%), and tumours more often originated on the right side (56.48%). The mean tumour size was 49.96±33.61 mm. About half of the patients were diagnosed at stage IV (49.77%), and 20.69% underwent surgery. Using the random sampling strategy, all patients were divided into a training cohort (n=3427, 69.05%) and a validation cohort (n=1536, 30.95%), whose characteristics are also listed in Table 1. All characteristics were similar between the training and validation cohorts, and no significant difference in CSS was seen between the two cohorts (P=0.8).
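The random split can be sketched as below. The cohort sizes come from the text; the seed, the placeholder patient IDs and the function name `split_cohort` are assumptions made for illustration, since the study does not report how the sampling was implemented.

```python
import random


def split_cohort(ids, n_train, seed=0):
    """Randomly assign n_train subjects to training and the rest to validation."""
    rng = random.Random(seed)  # arbitrary fixed seed, for reproducibility only
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


patient_ids = range(4963)  # placeholder IDs for the 4963 identified patients
train_ids, valid_ids = split_cohort(patient_ids, n_train=3427)

# Cohort shares as percentages of the full cohort.
train_share = round(100 * len(train_ids) / 4963, 2)
valid_share = round(100 * len(valid_ids) / 4963, 2)
```

Fixing the seed makes the split reproducible, which matters because the cutoff points and the Cox model are estimated only on the training cohort.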

Development of the nomogram
Based on the maximally selected rank statistics, patients in the training cohort were classified into two groups by age (<=77 years, >77 years) and by tumour size (<=41 mm, >41 mm). In the univariate analysis, age, sex, race, primary site, laterality, AJCC stage, surgery and marital status were significantly correlated with CSS in the training cohort (Table 2). Potential redundancy was removed by a backward selection procedure based on the Akaike information criterion (AIC) in the multivariate Cox proportional hazards regression analysis. The independent prognostic factors finally retained, namely age, sex, race, tumour size, AJCC stage, surgery and marital status, were used to construct the nomogram model. The hazard ratios (with 95% confidence intervals, 95% CIs) of the nomogram parameters and the detailed scores for each independent prognostic factor are shown in Table 3. The nomogram is plotted in Figure 1. Patients were classified into 7 groups according to their nomogram scores, and the CSS curves for these groups are shown in Figure 2.
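As a rough illustration of how Cox model coefficients can be turned into nomogram points, the sketch below uses a common convention: the factor with the widest range of log hazard ratios spans 0 to 100 points, and all other levels are scaled proportionally. The hazard ratios here are invented and are not the values in Table 3; the factor names, levels and function names are likewise hypothetical, and the study's own scoring may differ in detail.

```python
import math

# Hypothetical hazard ratios relative to each reference level (HR = 1.0).
# Invented for illustration; NOT the values reported in Table 3.
hazard_ratios = {
    "age": {"<=77": 1.0, ">77": 1.5},
    "tumour_size": {"<=41mm": 1.0, ">41mm": 1.3},
    "surgery": {"yes": 1.0, "no": 2.5},
}


def nomogram_points(hrs):
    """Scale log hazard ratios so the widest-ranging factor spans 0-100 points."""
    betas = {
        var: {lvl: math.log(hr) for lvl, hr in levels.items()}
        for var, levels in hrs.items()
    }
    widest = max(max(b.values()) - min(b.values()) for b in betas.values())
    return {
        var: {
            lvl: round(100 * (beta - min(b.values())) / widest, 1)
            for lvl, beta in b.items()
        }
        for var, b in betas.items()
    }


points = nomogram_points(hazard_ratios)


def total_score(patient):
    """Sum the points of a patient's level on every factor."""
    return sum(points[var][level] for var, level in patient.items())


# A hypothetical patient: older than 77, small tumour, no surgery.
example = {"age": ">77", "tumour_size": "<=41mm", "surgery": "no"}
score = total_score(example)
```

Because points are linear in the log hazard ratio, a higher total score corresponds to a higher predicted hazard, which is why patients can be re-grouped by score and compared with Kaplan-Meier curves.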

Validation of the nomogram
The C-index of the nomogram for predicting CSS was 0.71±0.01 in the validation cohort. The average time-dependent AUC at 1, 3 and 5 years was 0.88. The calibration curves for 1-year (Figure 3A), 3-year (Figure 3B) and 5-year (Figure 3C) CSS showed little gap between predicted and actual outcomes in the validation cohort.
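For readers unfamiliar with the C-index, the following minimal sketch computes Harrell's concordance index for right-censored data. It is a simplified illustration (pairs with tied follow-up times are ignored) and not the implementation used in the study; the toy times, events and nomogram scores are invented.

```python
def harrell_c_index(times, events, risk_scores):
    """Concordance index for right-censored data.

    A pair (i, j) is comparable when subject i has the shorter follow-up
    time and experienced the event. The pair is concordant when subject i
    also has the higher risk score; tied scores count as 0.5.
    Pairs with identical follow-up times are skipped in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy validation data, invented for illustration.
times = [4, 6, 9, 12, 20]          # months of follow-up
events = [1, 1, 0, 1, 0]           # 1 = cancer-specific death, 0 = censored
scores = [180, 150, 160, 90, 60]   # hypothetical total nomogram scores
c_index = harrell_c_index(times, events, scores)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk, so the reported C-index indicates that the nomogram ranks patients correctly in roughly 70% of comparable pairs.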

Discussion
In the present study, 4963 LCLC patients were identified from the SEER database.
Nearly half of the LCLC patients were diagnosed at stage IV, and only approximately 20% underwent surgery. This means that most LCLC patients were diagnosed at an advanced stage, when the optimal opportunity for treatment had often been missed. Efficient diagnostic and treatment methods are therefore urgently needed.
In this study, routinely available patient characteristics were extracted from the database, and a nomogram was developed and validated on the basis of these characteristics. The performance of the nomogram was confirmed by the C-index, the time-dependent AUC and the calibration curves. Seven significant factors (age, sex, tumour size, AJCC stage, surgery, race and marital status) were included in the nomogram. All of these factors are routinely available in daily practice, so the nomogram can easily be used by patients and clinicians to predict an individual's CSS and to guide treatment decisions.
In agreement with other types of NSCLC, age and sex were both important predictors of CSS in LCLC [13]. Older age and male sex were associated with worse prognosis. To improve the discrimination of the nomogram model, patient age was dichotomized in accordance with Harrell's guidelines [14]; 77 years was the best cutoff point. At present, no consistent conclusion has been reached on racial disparities in CSS among patients with lung cancer [15]. In general, better CSS outcomes would be expected in racial groups with greater health awareness, in which treatment tends to be pursued more actively. In this study, race had a significant influence on CSS in LCLC in both the univariable and the multivariable analysis, and CSS was better for other races than for White patients. More active treatment might be one explanation for this phenomenon. Another might lie in the SEER sample itself: only a small proportion of patients of other races was collected in the database, which might affect the statistical results of this study.
As one of the most commonly used tumour-associated indices, AJCC stage contributed most to the established nomogram model, which is in line with other types of NSCLC [16]. Tumour size is an important indicator of T stage, and in this study it was also found to be an independent risk factor for LCLC. With the cutoff determined by maximally selected rank statistics, patients with a tumour size >41 mm had shorter CSS than those with a tumour size <=41 mm. Surgery is the main treatment for many types of lung cancer. In this study, surgery was also found to be an important treatment for LCLC: patients who underwent surgery had a greatly decreased risk of cancer-specific death.
Marital status has been shown to be associated with CSS in a series of cancers [17][18][19][20][21]. Our findings are consistent with these previous studies: married LCLC patients, who had lower nomogram scores, had a CSS benefit compared with patients of other marital status.
Validation of a nomogram is of great importance for guarding against overfitting and for assessing how well the model generalizes [22]. In the present study, an independent validation cohort was used to validate the nomogram. The C-index (0.701±0.01) confirmed the discriminatory capacity of the established nomogram. Good agreement between predicted and observed CSS was seen in the calibration curves at 1, 3 and 5 years. The time-dependent ROC analysis in the validation cohort was also acceptable, with the AUC remaining at a comparatively high level (>0.5) in the prediction of CSS.
Still, some limitations of the current study should be considered. First, the SEER database lacks many important factors that previous studies have shown to influence CSS in lung cancer, such as chemotherapy, radiotherapy, smoking status and performance status. These factors could not be obtained in this study, which prevented a more detailed understanding of the prognostic factors for LCLC. Second, the lack of these factors may affect the accuracy of the nomogram model. Third, as this is a retrospective study, selection bias cannot be excluded, and validation in a prospective clinical study is still needed.

Conclusion
This study summarized the clinical features and prognosis of LCLC using a United States population-based cohort from the SEER database. We found that sex, age, race, marital status, AJCC stage, surgery and tumour size are independent prognostic factors for CSS in LCLC. In addition, a visual nomogram was developed to predict CSS in LCLC, and its discrimination was confirmed to be acceptable in the validation cohort. This nomogram can provide a more precise evaluation of CSS for LCLC patients and help clinicians with individualized management.

Abbreviations
LCLC: large cell lung cancer; CSS: cancer-specific survival; SEER: Surveillance, Epidemiology, and End Results; AJCC: American Joint Committee on Cancer; AUC: area under the curve.

Ethics approval and consent to participate
The data in the SEER database does not require informed patient consent.

Consent for publication
Not applicable.

Availability of data and material
We extracted data from the Surveillance, Epidemiology, and End Results (SEER) database (https://seer.cancer.gov).

Competing interests
The authors report no conflicts of interest.

Figure 2. CSS curves for patients re-grouped according to the nomogram scores.
Figure 3C. Calibration curve for 5-year CSS in the validation cohort.