The study of non-focus area in the first chest CT of different clinical types of COVID-19 pneumonia: A study on automatic machine learning of Radiomics

Objective: To explore the possibility of predicting the clinical types of coronavirus disease 2019 (COVID-19) pneumonia by analyzing the non-focus area of the lung in the first chest CT image of patients with COVID-19 using automatic machine learning (Auto-ML). Methods: A total of 136 moderate and 83 severe patients were selected from patients with COVID-19 pneumonia. Clinical and laboratory data were collected for statistical analysis. Texture features of the non-focus area of the first chest CT were extracted, and a classification model of the first chest CT of COVID-19 pneumonia was then constructed from these texture features with the Auto-ML method of radiomics. The area under the receiver operating characteristic (ROC) curve (AUC), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV) and negative predictive value (NPV) were used to evaluate the accuracy of the classification model. Reconstruction parameters: matrix 512 × 512, 1.25 mm.


Introduction
In January 2020, pneumonia caused by a novel coronavirus broke out in Wuhan, China; the World Health Organization (WHO) named the disease COVID-19. The causative virus is a ribonucleic acid (RNA) virus transmitted mainly through the respiratory tract. The main harm of COVID-19 pneumonia is that it can cause acute respiratory distress syndrome (ARDS) in adults. Like the severe acute respiratory syndrome (SARS) virus, the virus can be detected in the respiratory tract [1,2]. By the end of February, the disease had spread to over 100 countries worldwide. It is estimated that more than 50000 patients have been diagnosed, with over 2500 deaths. Studies have shown that early effective treatment can significantly block the course of the disease and reduce the rate of conversion to critical illness. Therefore, it is necessary to predict the direction of the disease course at an early stage in patients with COVID-19 pneumonia [3,4].
The common clinical symptoms of COVID-19 pneumonia include fever, cough, sore throat, occasional chest distress, expectoration and muscle soreness, but not every patient presents all of these symptoms, and none of them is specific. When the epidemiological history is unclear or the patient intentionally conceals the medical history, clinicians often treat patients according to a suspected diagnosis rather than giving targeted treatment based on a clear diagnosis. Chest CT is an important and widely used method for diagnosing COVID-19 pneumonia, guiding the adjustment of the clinical treatment plan and verifying the treatment effect.
In chest CT images, the typical manifestations of the focus of COVID-19 pneumonia are parapleural ground-glass opacity (GGO), interlobular septal thickening, central consolidation of the focus and banded atelectasis [5]. However, in the first CT examination of patients with COVID-19 pneumonia, the characteristics of the focus are often not typical, so the pneumonia cannot be clearly diagnosed and classified, and the images cannot support the clinical design of a treatment plan. It is urgent to extract more information from CT images to improve the efficiency of CT examination.
During the early lung injury of COVID-19 pneumonia, the inflammatory reaction with interstitial and alveolar edema in non-focus lung tissue is difficult to distinguish by eye on CT images [3,5,6]. As an extension of computer-aided diagnosis, Lambin proposed the radiomics method in 2012 [7]. Radiomics extracts and analyzes image texture features and combines them with other available patient data to strengthen decision models. Radiomics analysis can turn the inflammatory reaction of the alveolar interstitium and the alveolar edema in the non-focus area, which are difficult to distinguish by eye in the early chest CT image of COVID-19 pneumonia, into image information that can be mined and used. Therefore, our aim is to establish and validate a prediction model of the non-focus area in the early stage of COVID-19 pneumonia by mining the texture features of the first chest CT image with Auto-ML, and to evaluate the value of the model for assessing the degree of non-focus area damage and for clinical classification in the early stage of COVID-19 pneumonia.

Methods
The study follows the principles of the Declaration of Helsinki. The Ethics Committee of the PLA Central Theater General Hospital approved this study and, because it is retrospective, waived the need for written informed consent (Decision/Protocol number: [2020]030-1).

Patient Selection
A total of 2680 patients with COVID-19 pneumonia, diagnosed between January 2020 and February 2020 according to the COVID-19 diagnostic and therapeutic regimen (trial 7th edition) in China (www.nhc.gov.cn/yzygj/s7652m/202003/a31191442e29474b98bfed5579d5af95.shtml), were collected. They were included in the study according to the following conditions: 1. Hospitalized patients. 2. The clinical information and laboratory examinations were complete, and at least two lung CT examinations (including the first CT examination) were performed within one week after hospitalization. 3. Positive RT-PCR result for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in a nasopharyngeal swab. 4. Cases with a history of lung surgery, lung tumors, or any other cause of pneumonia were excluded. Finally, 219 patients were included in the study (Fig. 1). To prevent asymptomatic cases infected with the COVID-19 virus from being added to the control group, we randomly selected as the control group 100 cases from the physical examination population who had a chest CT examination with no lung lesions between January and February 2019 (Fig. 2).
Clinical characteristics (age, gender, temperature, cough, sputum, nausea and vomiting, and other clinical symptoms) and laboratory values (white blood cells (WBC), lymphocytes, alanine aminotransferase (ALT), aspartate aminotransferase [8], C-reactive protein (CRP), fibrinogen, urea (URE) and creatinine (CRE)) were obtained from the medical records. The clinical symptoms were those present at the time of admission, and blood samples were taken for examination within 3 days after admission.
According to the "COVID-19 diagnostic and therapeutic regimen (trial 7th edition) in China", moderate cases are defined as patients with fever, respiratory symptoms and other clinical symptoms whose chest images show pneumonia. Severe cases are defined as adults who meet any of the following criteria: respiratory rate ≥ 30 breaths/min; oxygen saturation ≤ 93% at rest; arterial oxygen partial pressure (PaO2) / inspired oxygen concentration (FiO2) < 300 mmHg. On lung CT examination, patients whose focus increased by more than 50% within 24-48 hours should also be considered severe. All 219 patients were moderate at the time of admission; 83 of them progressed to severe within 7-13 days after admission, and the other 136 cases remained stable at the moderate degree (Figs. 2 and 3).
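The severity criteria above can be read as a simple rule set. The following is a minimal sketch for illustration only: the function name `is_severe` and its parameter names are hypothetical, and the thresholds are copied from the text of this section rather than taken directly from the guideline.

```python
from typing import Optional


def is_severe(
    respiratory_rate: float,                   # breaths per minute
    spo2_rest: float,                          # oxygen saturation at rest, percent
    pao2_fio2: float,                          # PaO2 / FiO2, mmHg
    focus_growth_pct: Optional[float] = None,  # focus increase on CT within 24-48 h, percent
) -> bool:
    """Return True if any criterion for a severe case, as stated in the text, is met."""
    if respiratory_rate >= 30:
        return True
    if spo2_rest <= 93:
        return True
    if pao2_fio2 < 300:
        return True
    # CT criterion: focus increased by more than 50% within 24-48 hours.
    if focus_growth_pct is not None and focus_growth_pct > 50:
        return True
    return False


print(is_severe(22, 97, 380))      # no criterion met
print(is_severe(32, 97, 380))      # respiratory rate criterion met
print(is_severe(22, 97, 380, 60))  # CT progression criterion met
```

Any single criterion is sufficient, which is why the checks are written as independent early returns.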

Image Segmentation
The images from each patient's first CT examination were studied. All CT images were segmented semi-automatically with the free and open-source software 3D-Slicer (version 4.10.2; www.slicer.org) [9]. First, region growing was used to draw the volume of interest (VOI) of the non-focus part of the lung; then two radiologists with more than 10 years of experience manually modified the VOI and shrank its edge to 3 mm from the focus edge. The Data Supplement presents the VOI drawing methods and modification criteria (Fig. 4).
Eight filters were used to extract 1688 features in 7 classes (first-order statistics (FOS), gray level co-occurrence matrix (GLCM), gray level run length matrix (GLRLM), gray level size zone matrix (GLSZM), neighbouring gray tone difference matrix (NGTDM), gray level dependence matrix (GLDM) and shape) from the VOI of each original image. For more information on the methods and parameters of feature extraction in radiomics [12], see Table 1.
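To make one of these feature classes concrete, the sketch below builds a GLCM for a tiny 2D image and derives the contrast feature from it. This is a toy illustration under stated assumptions, not the extraction pipeline used in the study: it handles a single pixel offset in 2D with already-discretized gray levels, and the function names `glcm` and `glcm_contrast` are hypothetical.

```python
from typing import List


def glcm(image: List[List[int]], levels: int, dr: int = 0, dc: int = 1) -> List[List[float]]:
    """Symmetric, normalized gray level co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count the pair in both directions
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("image is too small for the requested offset")
    return [[n / total for n in row] for row in counts]


def glcm_contrast(p: List[List[float]]) -> float:
    """Contrast = sum over (i, j) of (i - j)^2 * p(i, j)."""
    size = len(p)
    return sum((i - j) ** 2 * p[i][j] for i in range(size) for j in range(size))


uniform = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # homogeneous texture
striped = [[0, 3, 0], [3, 0, 3], [0, 3, 0]]   # strongly alternating texture
print(glcm_contrast(glcm(uniform, levels=4)))  # 0.0
print(glcm_contrast(glcm(striped, levels=4)))  # 9.0
```

The homogeneous patch has zero contrast, while the alternating patch reaches the maximum for a gray-level difference of 3, which is the kind of density variation that texture features quantify even when no lesion is visible.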

Auto-ML
In the texture feature data, the shape-related parameters of the control group and the study group differ significantly, so they were removed from the data matrix during the analysis. The Tree-based Pipeline Optimization Tool, TPOT (epistasislab.github.io/tpot), is a Python Auto-ML tool that uses a genetic algorithm to optimize machine learning pipelines [13][14][15]. In the Auto-ML process, the original data of each group are imported into TPOT, and TPOT randomly divides them into a training set and a test set in a proportion of 8:2. On the training set, TPOT repeatedly carries out data cleaning, feature selection, feature preprocessing, feature construction, model selection and parameter optimization through intelligent exploration of thousands of possible pipelines, automatically performs feature analysis of the shadow parts, and carries out validation within the training set. After exploration and validation, usable Python code containing the classifier information and the corresponding parameter settings is generated (Fig. 5).
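The 8:2 random division described above can be sketched with the Python standard library alone. This is a minimal illustration, not the study's code: the function name `split_8_2` and the fixed seed are assumptions, the split is a plain (non-stratified) random split because the text does not say otherwise, and the TPOT search itself is not reproduced here.

```python
import random
from typing import List, Sequence, Tuple


def split_8_2(samples: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Randomly split samples into a training set (80%) and a test set (20%)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(0.8 * len(samples))
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:]]
    return train, test


# 136 moderate and 83 severe cases, as in the study; labels only, no real features.
cases = [("moderate", k) for k in range(136)] + [("severe", k) for k in range(83)]
train, test = split_8_2(cases)
print(len(train), len(test))  # 175 44
```

With 219 cases this yields 175 training and 44 test cases; the test set is held out until the pipeline search on the training set has finished.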

Classification model testing
According to the results of the TPOT analysis, the classifier was selected and its parameters were set (generations = 5, population size = 20, verbosity = 2). Three models were established: moderate versus severe, moderate versus control, and severe versus control. The test set of each group was used to test the corresponding classifier with its optimized parameters (Fig. 5).
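As a rough guide to the size of the search implied by these settings: the TPOT documentation states that a run evaluates population_size + generations × offspring_size pipelines, with offspring_size defaulting to population_size. The short sketch below applies that formula; the helper name `pipelines_evaluated` is hypothetical, and the calculation assumes the default offspring size, which the text does not state.

```python
from typing import Optional


def pipelines_evaluated(generations: int, population_size: int,
                        offspring_size: Optional[int] = None) -> int:
    """Number of pipelines evaluated in one run, per the formula in the TPOT documentation."""
    if offspring_size is None:
        offspring_size = population_size  # assumed default, not stated in the text
    return population_size + generations * offspring_size


settings = {"generations": 5, "population_size": 20}  # values quoted in the text
print(pipelines_evaluated(**settings))  # 120
```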

Statistical analysis
The clinical data were analyzed with IBM SPSS 26 (IBM Corp. [...] Table 3. ROC curves are shown in Fig. 6.
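The evaluation metrics used for the classification models (AUC, TPR, TNR, PPV and NPV) can be computed as in the sketch below. It is an illustration with made-up labels and scores, not the study's data or analysis code; the function names are hypothetical, and the AUC is computed by the pairwise-ranking definition rather than by any particular statistics package.

```python
from typing import Dict, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """TPR, TNR, PPV and NPV from binary labels (1 = positive, 0 = negative).

    Assumes every denominator is non-zero, i.e. both classes occur in
    y_true and in y_pred.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "TPR": tp / (tp + fn),  # sensitivity
        "TNR": tn / (tn + fp),  # specificity
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = severe, 0 = moderate) and classifier scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_metrics(y_true, y_pred))  # all four metrics equal 0.75 here
print(auc(y_true, scores))                # 0.9375
```

TPR, TNR, PPV and NPV depend on the chosen decision threshold (0.5 in this example), whereas the AUC summarizes performance over all thresholds.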

Discussion
At present, CT studies of COVID-19 pneumonia all concentrate on the pneumonia focus, and none has examined the non-focus area. Viral pneumonia is known to be a widespread interstitial inflammation of the lung [16]. In the early stage of pulmonary interstitial inflammation, CT images can hardly reflect the pathological changes of the lung. Therefore, this study uses the CT-based Auto-ML method of radiomics to study the non-focus area of COVID-19 pneumonia, in order to find changes in the non-focus area that cannot be seen on CT images. The study of non-focus lung tissue will help clinicians to understand COVID-19 pneumonia from a broader perspective, optimize the treatment plan, block the course of the disease, reduce symptoms and increase the cure rate of severe patients. According to the existing data, this is the first time that CT image-based radiomics has been used to study the non-focus area of COVID-19 pneumonia [2][3][4][5].
Studies have shown that the early pathological manifestations of lung injury caused by the COVID-19 virus include edema of alveolar epithelial cells and alveolar septa to different degrees, uneven alveolar surfaces, and an increased number of cytoplasmic vesicles in type I alveolar epithelial cells [17]. These vesicles gradually burst and release fluid, causing morphological changes of alveolar cells such as cell swelling, deformation and DNA breakage. With the necrosis of the alveolar cells, the pulmonary capillaries rupture further, resulting in alveolar hemorrhage, pulmonary infection and pulmonary fibrosis. This may be the root cause of severe pneumonia in COVID-19 [18]. Radiomics can extract a large amount of texture feature information from the image to reflect the heterogeneity of the damage. For example, GLCM mainly reflects the characteristics of the internal structure of the image through changes in density [10,16,19]. Therefore, even if no lesions are found on the CT images, the different types of extracted texture features can be analyzed to determine whether the lung tissue is damaged. In this study, the analysis of the Auto-ML classification models showed significant differences in the texture characteristics of the non-focus area in the first CT image between the moderate and severe groups, and also between each of these groups and the control group, which is similar to the results of Yanling's radiomics study of different types of pneumonia [11].
Unlike other radiomics studies, the Auto-ML classification technology used in this study avoids the limitations of manually selecting machine learning classifiers. Automated feature selection, feature preprocessing, feature construction, model selection and hyperparameter optimization [13,14] are the advantages of TPOT. Its main code modules are scikit-learn and XGBoost, which are commonly used by Auto-ML researchers. In the radiomics Auto-ML classification results, the model for the moderate and severe groups uses a different classifier from the models comparing the control group with the moderate group and with the severe group, and the optimized parameters also differ, which shows that TPOT customized the best model for each data matrix.
In this study, we collected demographic factors, clinical symptoms on admission, and laboratory tests that may be relevant to identification. However, there was no difference between the moderate and severe groups in the early stage of the disease. By the time the laboratory data showed differences, the patient's condition had already worsened. Therefore, prediction based on the non-focus area before the patient's condition becomes severe is an effective way to reduce the rate of conversion to severe disease.
In this study, a simple, stable and efficient semi-automatic region growing method, which is a human-computer interactive segmentation method, was selected. Combined with manual modification, it improves the accuracy and repeatability of VOI delineation. This is of great significance for the accurate segmentation of the non-focus area for feature extraction and model construction. In addition, we chose the non-focus area as the VOI, avoiding the lesions of COVID-19 pneumonia (including GGO, consolidation, thickening of the bronchovascular bundle and cystic change) as well as the pulmonary vessels and trachea in the non-focus area, which not only avoids the influence of subjective factors but also allows the severity and extent of lung injury to be fully measured.
However, this study has limitations. Firstly, only 219 cases were included, so the sample size is relatively small, and there is a risk of overfitting in machine learning and deep learning. Secondly, the data of this study came from a single institution. Although the model works well as a radiology model for this institution, more research institutions need to share data, verify results and cooperate in order to establish a more general COVID-19 pulmonary inflammation model. Thirdly, this study offers no complete biological explanation of the radiomics features, which shows that further exploration is needed in the future.

Conclusion
In conclusion, the authors believe that the Radiomics Auto-ML classification model based on the analysis of the non-focus area in the first chest CT image of COVID-19 pneumonia can effectively classify the clinical types of COVID-19 pneumonia.

Figure 4
a-d. Axial (a), coronal (b) and sagittal (c) CT images showing the VOI range of the segmented images. The region growing method is used to segment the VOI in the 3D-Slicer software. The threshold range of the CT value is adjusted to exclude the pulmonary vessels and bronchi above the second grade in the lung, and Gaussian smoothing is used to shrink the edge of the VOI by 3 mm, so that the VOI does not overlap with the focus and disturb the analysis results. Panels a-c are typical axial, coronal and sagittal image layers of the same patient's CT image; the white arrow marks the space between the manually contracted boundary and the focus and the second pulmonary artery (d).

Figure 5
Workflow of the TPOT pipeline. In the pipeline, the original data of each group are randomly divided into a training set and a test set in a proportion of 8:2. The training set repeatedly goes through data cleaning, feature selection, feature construction, feature processing, model selection and parameter optimization in the pipeline, yielding the pipeline (classifier) with optimized parameters for a specific group. The test set is then put into the optimal pipeline and tested with the optimal parameters to verify whether the pipeline and parameters are optimal. The specific operators selected in the best pipelines include built-in TPOT operators (OneHotEncoder, FeatureSetSelector) and functions from the scikit-learn library (ExtraTreesClassifier, RandomForestClassifier and Nystroem).

Figure 6
a-c. ROC diagrams. The AUCs of the training set and test set for the "moderate" & "severe" groups were 0.98 and 0.95, respectively (a); the AUCs of the training set and test set for the "moderate" & "control" groups were 0.97 and 0.98, respectively (b); the AUCs of the training set and test set for the "severe" & "control" groups were 0.99 and 0.95, respectively (c).