Predicting the COVID-19 patients’ status using chest CT scan findings: A risk assessment model based on decision tree

Background: The role of chest computed tomography (CT) in the diagnosis of coronavirus disease 2019 (COVID-19) is still an open field to be explored. The aim of this study was to apply the decision tree (DT) model to predict the critical or non-critical status of patients infected with COVID-19 based on the information available on non-contrast CT scans.
Method: This retrospective study was performed on patients with COVID-19 who underwent chest CT scans at Baqiyatallah Hospital, Tehran, Iran. The medical records of 1078 patients with COVID-19 were evaluated. The classification and regression tree (CART) decision tree model and k-fold cross-validation were used to predict the status of patients and to measure the sensitivity, specificity and area under the curve (AUC) of the model.
Results: Data included 169 critical cases and 909 non-critical cases. The bilateral distribution and multifocal lung involvement were 165 (97.6%) and 766 (84.3%) in critical patients, respectively. According to the DT model, total opacity score, age, lesion type and gender were statistically significant predictors in critical patients. Moreover, the accuracy, sensitivity and specificity of the DT model were 93.3%, 72.8% and 97.1%, respectively.
Conclusions: The presented algorithm demonstrates the factors affecting the patient's condition. In addition, this model has potential for clinical application and can also identify high-risk subpopulations that need specific prevention.


Background
The novel coronavirus disease 2019 (COVID-19) is a viral pneumonia that emerged at the Huanan Seafood Market, Wuhan, China (Kanne, 2020, Rodriguez-Morales et al., 2020). The WHO has declared the disease a pandemic. COVID-19 is currently affecting more than 210 countries around the world. As of May 11, 2020, a total of 4,196,193 confirmed COVID-19 cases, 1,500,181 recovered cases and 284,033 deaths had been reported worldwide. The statistics show that the trend of mortality has declined in China but is rising in the rest of the world, including Iran (Moftakhar and Seif, 2020). The first case of the disease in Iran was identified in the city of Qom on February 19, 2020 (Muniz-Rodriguez et al., 2020). Afterwards, the disease quickly affected a growing number of people throughout the country, and by 3 May the total number of confirmed patients had increased to 97,424 (2020).
COVID-19 can lead to respiratory infection, liver disease, and gastrointestinal and neurological disorders (Musa, 2020, Boettler et al., 2020). In addition, the virus can cause severe acute respiratory conditions such as pneumonia, pulmonary edema and acute respiratory distress syndrome (ARDS) (Matthay et al., 2019). Therefore, the non-contrast chest computed tomography (CT) scan may be a helpful tool in the diagnosis, quantification and follow-up of patients with COVID-19. The lungs of patients with COVID-19 symptoms show certain visual hallmarks, such as ground-glass opacities (GGO) and areas of increased lung density called consolidation (Kim, 2020). Furthermore, as the disease becomes more severe with increasing time from the onset of symptoms, other findings appear as specific signs, including linear opacities, crazy-paving pattern, reverse halo sign, pleural effusion, intralesional traction bronchiectasis and lymphadenopathy (Bhat et al., 2020, Li, 2020).
Classification and Regression Tree (CART) decision tree (DT) analysis is a data mining technique used for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable (Song and Lu, 2015). The analysis has been widely applied in medicine and public health. Moreover, the DT model is a strong statistical method for classifying, predicting, interpreting, and processing data. The algorithm is nonparametric and can efficiently manage large, complex datasets without imposing a complex parametric structure. Furthermore, both heavily skewed data and missing values are easily managed without the need for data transformation. Numerous factors have been shown to influence the condition of COVID-19 patients, such as specific signs on HRCT, lesion type, presence of diffuse opacity, age and gender. The computer-based model can be represented graphically as a tree structure, which makes it easy to interpret and to use in clinical practice. The algorithm has numerous merits; for example, it can sequentially split the data into the most predictive groups (Zimmerman et al., 2016).
The aim of this retrospective study, which had a large sample size, was to apply the CART decision tree model to predict the status (critical or non-critical) of patients with COVID-19 based on chest CT findings. The model is also able to identify independent risk factors in the patients. Additionally, receiver operating characteristic (ROC) analysis was applied to assess the ability of the DT model to predict the critical and non-critical condition.

Study design and patients
In this retrospective study, we collected both demographic characteristics and radiologic information of 1078 patients with COVID-19 who were referred to Baqiyatallah Hospital, Tehran, Iran from March to April 2020.
The inclusion criteria were a positive result on a reverse-transcriptase-polymerase-chain-reaction (RT-PCR) assay of a specimen obtained by nasopharyngeal swab, related symptoms (such as fever, dry cough, shortness of breath, and aches), and willingness to participate in the study. The exclusion criteria were logistical impediments to data collection, incomplete data, and revoked consent. According to their clinical outcomes, the patients were divided into two groups: critical and non-critical. Patients who were admitted to the routine ward of the hospital and then discharged (n=909) were considered non-critical, while the critical group included those who died (n=104) or were admitted to the intensive care unit (n=65). This retrospective study was approved by the Ethics Committee of Baqiyatallah University of Medical Sciences, Tehran, Iran, with code IR.BMSU.REC.1399.024, and the patients were enrolled after giving written informed consent.

CT Protocol and evaluation of chest CT
Non-contrast chest CT images were acquired with the patients in the supine position at full inspiration. A CT scan was performed when the patient was referred to a medical center and had COVID-19 symptoms. All CT examinations were performed with a 16-row detector CT scanner (General Electric (GE) Optima, USA).
The CT findings were evaluated by two blinded radiologists, who agreed on the interpretation of the images. The inter-rater agreement coefficient between the two radiologists was r=0.98 (P<0.0001). If the radiologists disagreed about the COVID-19 diagnosis, a third party moderated the decision and the discussion continued until agreement was reached. The findings were described according to the Fleischner Society nomenclature recommendations (Schoen et al., 2019, Hansell et al., 2008) and included intralesional traction bronchiectasis and lymph node enlargement (Schoen et al., 2019). Afterwards, a thin-section CT involvement score was assigned based on the abnormal areas involved, to quantify the extent of the lesions (Chang et al., 2005). A score ranging from 0 to 5 was given to each lobe according to its involvement: 0 (no involvement); 1 (<5% involvement); 2 (5%-25% involvement); 3 (26%-49% involvement); 4 (50%-75% involvement) and 5 (>75% involvement). The scores of the five lobes were summed to give a total score ranging from 0 to 25.
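The scoring scheme above can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the lobe labels, the function names and the example percentages are hypothetical, score 2 is assumed to cover 5% to 25% involvement, and fractional percentages that fall between the stated bands (for example 25.5%) are assigned to the higher band as a simplifying assumption.

```python
# Illustrative sketch of the per-lobe CT involvement score (0-5) and the
# total opacity score (0-25). Lobe labels and cut-off handling are assumptions.

LOBES = ("RUL", "RML", "RLL", "LUL", "LLL")  # hypothetical labels for the five lobes


def lobe_score(percent_involved: float) -> int:
    """Map the percentage of a lobe that is involved to a score from 0 to 5."""
    if not 0 <= percent_involved <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_involved == 0:
        return 0  # no involvement
    if percent_involved < 5:
        return 1  # <5%
    if percent_involved <= 25:
        return 2  # assumed band: 5%-25%
    if percent_involved < 50:
        return 3  # 26%-49%
    if percent_involved <= 75:
        return 4  # 50%-75%
    return 5      # >75%


def total_opacity_score(involvement: dict) -> int:
    """Sum the five lobe scores; the result lies between 0 and 25."""
    return sum(lobe_score(involvement[lobe]) for lobe in LOBES)


# Made-up example patient (not data from the study).
example = {"RUL": 0, "RML": 3, "RLL": 30, "LUL": 10, "LLL": 80}
example_total = total_opacity_score(example)  # 0 + 1 + 3 + 2 + 5
```

A total computed this way is the quantity that the decision tree later compares with its first cut-off.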

Statistical analysis:
Continuous variables were described as mean±SD, and categorical variables as frequency and percentage. The Chi-square test was used to evaluate the association between categorical variables. Moreover, the Mann-Whitney U test and the independent t-test were used to compare the number of involved lobes and age between the two groups.
The Classification and Regression Tree (CART) method was used to build a risk assessment model to predict the critical or non-critical condition of patients using both demographic and clinical factors, including age, gender, lesion type, specific signs, presence of diffuse opacity, underlying disease, number of involved lobes and total opacity score. Afterwards, the k-fold cross-validation (CV) method was used to validate the model, with k set to 10. In k-fold CV, the set of N (1078) patients is split into k mutually exclusive subsets of approximately N/k patients each. Then, k-1 subsets are used as the training set to fit a model, which is used to predict the left-out validation subset. This process is repeated k times, each time leaving out a different validation subset, and an estimate of the model performance is calculated from the predicted values. Therefore, each patient is included once in a validation set and k-1 times in the training sets. Lower k values typically lead to estimates of prediction error that are biased upward, while higher k values minimize bias but increase variance (Ricciardi et al., 2020, Pérez-Guaita et al., 2020). In a decision tree, each fork is a split on a predictor variable and each end node contains a prediction for the outcome variable.
Additionally, receiver operating characteristic (ROC) analysis was performed to assess the ability of the DT model to predict the critical and non-critical condition, that is, whether the model is able to predict whether a patient's condition is critical or non-critical. The level of significance for statistical tests was 0.05. R software, version 4.0.0 (dtree package), was used for statistical analysis.
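The k-fold splitting described in this section can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the R code used in the study: the function name, the shuffling step and the seed are hypothetical, and model fitting is left out.

```python
# Minimal sketch of k-fold cross-validation splitting (illustrative only).
import random


def k_fold_indices(n: int, k: int = 10, seed: int = 0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumed: patients are shuffled first
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        # The other k-1 folds together form the training set.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# N = 1078 patients and k = 10, as stated in the text.
splits = list(k_fold_indices(1078, k=10))
fold_sizes = sorted(len(validation) for _, validation in splits)
```

Because 1078 is not divisible by 10, the validation folds in this sketch hold 107 or 108 patients each. A model would be fitted on each training list and evaluated on the matching validation list.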

Results
The study population consisted of 1078 confirmed patients with COVID-19 who underwent CT scans, including 169 critical subjects and 909 non-critical subjects. The baseline characteristics and chest CT features of the patients with COVID-19 according to critical and non-critical status are given in Table 1. The age of participants in the critical group was significantly higher than that in the non-critical group (61.24±13.48 vs. 51.47±14.02, P<0.001). The frequency of the number of involved lobes was higher in the non-critical group than in the critical group, except for the number of lymph nodes less than 1, and the difference between the two groups was significant (P<0.001). The results showed a significant relationship between patient status and gender, lesion distribution, lesion type, specific signs on high-resolution computed tomography (HRCT), presence of diffuse opacity and underlying disease (P<0.001).
The decision tree derived from the CART analysis is shown in Fig 5. The tree has a depth of 3 levels from the root node, with 3 intermediate nodes and 6 terminal nodes. Each node gives the probability of being critical or non-critical for the corresponding branch. According to Figure 5, the patient's status is predicted as follows: first, the total opacity score is compared with 7.5; if the value is more than 7.5, the patient's lesion type is checked in the next step. Otherwise, the patient's age is compared with 62.5. Comparisons with the presented variables then continue at each node split until a terminal branch is reached, which gives the predicted critical or non-critical condition of the patient. The number and percentage of cases are presented at the end of each branch. The model showed a striking ability to predict the critical condition of the sampled patients. The results revealed that 98% of people with a non-critical status (specificity) and 72.8% of people with a critical condition (sensitivity) were correctly predicted. The accuracy index, i.e., the percentage of patients whose condition was correctly predicted, was 93.3%. The risk estimate of the presented tree model, i.e., the proportion of cases that were incorrectly classified, was 0.068 (SE 0.008), which is acceptable.
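For readers who want to see how sensitivity, specificity and accuracy relate to a confusion matrix, the sketch below computes them from four counts. The counts are hypothetical: the source does not report the confusion matrix, so they were chosen only to match the group sizes (169 critical, 909 non-critical) and to reproduce the 72.8% sensitivity, 97.1% specificity and 93.3% accuracy quoted in the abstract. The function name is also an assumption.

```python
# Illustrative computation of classification metrics from a confusion matrix.
# "Positive" means critical; all counts below are hypothetical, not study data.


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return sensitivity, specificity and accuracy as fractions."""
    return {
        "sensitivity": tp / (tp + fn),  # critical patients correctly predicted
        "specificity": tn / (tn + fp),  # non-critical patients correctly predicted
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts: 123 + 46 = 169 critical, 883 + 26 = 909 non-critical.
metrics = classification_metrics(tp=123, fn=46, tn=883, fp=26)
percentages = {name: round(value * 100, 1) for name, value in metrics.items()}
```

With these assumed counts the three rates round to 72.8%, 97.1% and 93.3%, which shows that an overall accuracy of 93.3% is arithmetically consistent with a specificity close to 97.1% given the two group sizes.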
Based on Figure 6, the ROC analysis of the DT model showed an acceptable power to predict the status of patients with COVID-19; the area under the ROC curve (AUC) of the opacity score on CT was 0.93 (95% CI: 0.909, 0.96; P<0.001).

Discussion
Coronavirus, the cause of severe acute respiratory syndrome, has rapidly affected a large number of people all over the world. Given the number of deaths and the serious consequences of the disease, early diagnosis and timely treatment of patients are very important (Li et al., 2020b). One of the most important assessments in these patients is the chest CT scan, which shows imaging signs related to disease advancement, including an increase in GGO, interstitial septal thickening and consolidative opacities (Salehi et al., 2020).
In this retrospective study, the chest CT features of 1078 patients with COVID-19 in critical and non-critical condition were reviewed. Linear opacities, pure GGO, mixed GGO with consolidation, and mixed GGO with crazy-paving pattern were the most frequent types of lesions, with bilateral and multifocal distribution. The total opacity score, the number of involved lung lobes and the presence of diffuse opacity were identified as notable variables by data mining. In this study, the total opacity score was found to be an important variable. If this variable is lower than 7.5, the next essential variable is age. If the total opacity score is more than 7.5, the lesion type improvement is 0.011, and the lesion type is GGO as well as consolidation, the predicted occurrence of the critical condition is 82.6%. It is worth mentioning that when the total opacity score is less than 7.5 and the age of the patient is less than 62.5, the predicted percentage of non-critical status is 98.4%.
In our study, the difference in mean age between the two groups was statistically significant, and the mean age of non-critical patients was lower than that of the critical group (P<0.001). The results of the study by Huang et al. are inconsistent with ours. In their study, patients were divided into two groups according to the time from symptom onset to diagnosis and treatment: less than 3 days and more than 3 days. The difference in mean age between the two groups was non-significant (P=0.76). Also, gender was considered a non-significant variable in both studies (P>0.05). The conflict between the results may be because the sample size of that study was too small (n=25) (Huang et al., 2020). In a study by Zhou et al., patients with COVID-19 were divided into two groups, early stage (n=34) and progressive stage (n=28), and the results showed no significant difference in age and gender (Zhou et al., 2020). Moreover, a study by Shen et al. revealed that age and gender did not differ significantly between severe and non-severe patients with confirmed COVID-19 (Shen et al., 2020). In a study by Liu et al., patients were divided according to the diagnosis and treatment protocol into two groups, recovery or stabilization (n=67) and progress (n=11), and the results were consistent with our study: age was significant, but gender was not (Liu et al., 2020b). There was no statistically significant difference in halo sign between the two groups; linear opacities and reversed halo sign were more frequent in non-critical patients, but the difference between the two groups was not significant (P>0.05). Similar to other chest CT studies, we observed bilateral lung involvement (Fig 2) in numerous patients; however, a reversed halo sign (Fig 5) was seen in a small number of patients in both groups (Albtoush et al., 2020, Yoon et al., 2020).
In both groups of our study, the common types of lesions were mixed GGO with consolidation, mixed GGO with crazy-paving pattern, linear opacities and pure GGO. The frequency of pure consolidation and of mixed GGO with consolidation differed significantly between the groups; these types were more common in critical patients than in non-critical patients, which suggests that diffusion of the virus into the respiratory epithelium can cause necrotizing bronchitis and diffuse alveolar damage. Critical patients also showed more intralesional traction bronchiectasis and pleural effusion than non-critical patients. These extrapulmonary lesions indicate the occurrence of severe inflammation in the critical group. The results of our study were consistent with other chest CT studies; similarly, we observed specific signs more frequently in critical patients than in non-critical patients (53.8% vs. 32.1%, P<0.001) (Franquet, 2011, Koo et al., 2018). Although reversed halo signs and linear opacities were more frequent in non-critical patients, no significant difference was observed between the two groups (P>0.05).
According to the DT model, the total opacity score was the fundamental variable for distinguishing the critical group from the non-critical group, and the accuracy, sensitivity and specificity of the model were 93.3%, 72.8% and 97.1%, respectively. Our findings were consistent with previous studies that reported the sensitivity and specificity of CT images for the diagnosis of lung lesions as 80% to 90% and 82.8% to 96%, respectively (Li et al., 2020a, Li.L et al., 2020, Liu et al., 2020a).
This study clearly leaves room for further progress in future work. First, because the proposed machine learning approach is purely data-driven, the model may change if it is built from other datasets. If more data become accessible, the whole procedure can easily be repeated to obtain more accurate models. Second, since the data consist of CT scan findings from a single center and include neither laboratory tests nor early symptoms of patients, the analysis is based on CT findings alone. We therefore hope that future work will include more variables and a multicenter design.

Limitations:
The strength of this retrospective study is its large sample size. The first limitation of the study was that the chest CT examination and the onset of symptoms were not simultaneous, and it was therefore difficult to summarize the CT features that could be seen during the course of the disease. The second limitation is that some factors, such as laboratory tests (the patient's last measurements) and symptoms at onset, were not measured in our data.

Conclusions
In conclusion, the results showed that chest CT examination was very helpful in identifying pulmonary parenchymal abnormalities in patients with suspected COVID-19. Total opacity score was the main CT feature for predicting the probability that an individual, given their own characteristics, will be in a critical or non-critical condition. The main results of the study showed that 98% of patients with a non-critical condition and 72.8% of patients with a critical condition were correctly identified. It is concluded that the DT model not only achieved relatively high sensitivity and specificity but also identified essential risk factors in patients with COVID-19.


Table 1: Baseline characteristics and chest CT features in patients with COVID-19 based on critical and non-critical status
*GGO: ground-glass opacities; #HRCT: high-resolution computed tomography; a: independent t-test; b: Chi-square test; c: Mann-Whitney U test