Pulmonary Lesion Subtypes Recognition of COVID-19 From Radiomics Data With Three Dimensional Texture Characterization in CT Images

Background: The COVID-19 disease is putting unprecedented pressure on the global healthcare system. CT examination, as an auxiliary diagnostic method, can help clinicians quickly locate COVID-19 lesions once screening by PCR test is completed. Furthermore, lesion subtype classification plays a critical role in subsequent treatment decisions. Accurately identifying the subtypes of lesions can help doctors discover changes in lesions in time and better assess the severity of COVID-19. Method: The four most typical lesion subtypes of COVID-19 are discussed in this paper: ground-glass opacity (GGO), cord, solid, and subsolid. A computer-aided diagnosis approach for lesion subtypes is proposed. The radiomics data of lesions are segmented from CT images of COVID-19 patients with diagnoses and lesion annotations by radiologists. Three-dimensional texture descriptors are then applied to the volume data of the lesions, along with shape and first-order features. The massive feature data are selected by a hybrid adaptive selection algorithm while a classification model is trained at the same time. The classifier is used to predict lesion subtypes as auxiliary decision information for radiologists. Results: A total of 3734 lesions were extracted from a dataset of 319 patients, and 189 radiomics features were finally obtained. The random forest classifier is trained with data augmentation because the numbers of the different lesion subtypes are imbalanced in the initial dataset. The experimental results show that the accuracy for the four lesion subtypes is (0.9306, 0.9684, 0.9958, and 0.9430), the recall is (0.9552, 0.9158, 0.9580, and 0.8075), and the F-score is (0.9384, 0.9237, 0.9547, and 0.8442). Conclusion: The method is evaluated in multiple sufficient experiments. The results show that the 3D radiomics features chosen by the hybrid adaptive selection algorithm can better express the high-level information of the lesion data.
The classification model achieves good performance compared with state-of-the-art COVID-19 models, which can help clinicians more accurately identify the subtypes of COVID-19 lesions and provide support for further research.


Background
With the rapid growth in the number of patients with the 2019 Coronavirus Disease (COVID-19), the shortage of clinicians is increasingly severe. Currently, clinicians mainly use RT-PCR technology to detect viral RNA in sputum or nasopharyngeal swabs to diagnose COVID-19 pneumonia, but this method has a certain false-negative rate [1]. Therefore, clinicians also use chest CT images as an additional diagnostic method to improve the accuracy of COVID-19 detection for confirmed diagnosis. Moreover, the imaging pattern can change rapidly within a short period during treatment [2]. Research work on lesion segmentation has achieved good results in the diagnosis of COVID-19 through machine learning or deep learning methods.
In terms of machine learning, Shi et al. used medical imaging features and clinical features as input, with logistic regression as a classifier to distinguish COVID-19 [3]. Barstugan et al. made improvements in 2D feature extraction: GLCM, LDP, GLRLM, GLSZM, and DWT were used to obtain second-order statistical features for COVID-19 classification [4]. Ozkaya et al. proposed a method that fuses and ranks deep features for early detection with an SVM [5]. Elaziz et al. used the new Fractional Multichannel Exponent Moments (FrMEMs) to extract features from chest X-ray images; an improved Manta-ray search optimization based on differential evolution was then used to select the most important features, and a K-Nearest Neighbor classifier was used to distinguish COVID-19 [6]. Tuncer et al. proposed a feature generation method called Residual Exemplar Local Binary Pattern (ResExLBP) and used a novel iterative ReliefF (IRF) for feature selection. In their work, the SVM classifier achieved 100.0% classification accuracy using 10-fold cross-validation [7].
In addition, some scholars have proposed deep learning methods for the diagnosis of COVID-19. For example, Zhou et al. segmented COVID-19 lesions from CT using a U-Net segmentation network with a spatial and multi-channel attention mechanism to assist in diagnosing COVID-19 [8]. Khan et al. proposed a deep convolutional neural network called Coro-Net based on the Xception architecture, which can detect COVID-19 infection from chest X-ray images [9]. Afshar et al. pointed out that CNNs easily lose the spatial information between image instances, so they proposed an alternative framework based on the capsule network that can handle small datasets [10]. Khalifa et al. fine-tuned deep transfer learning on limited datasets to detect pneumonia in chest X-rays based on a generative adversarial network [11]. Minaee et al. trained four popular convolutional neural networks, including ResNet18, ResNet50, SqueezeNet, and DenseNet-121, to identify COVID-19 disease in the analyzed chest X-ray images [12]. He et al. proposed a synergistic learning framework for automated severity assessment of COVID-19 in 3D CT images by jointly performing lung lobe segmentation and multi-instance classification [13]. Xu et al. used a 3D deep learning model to segment candidate infection areas from lung CT images, then scored these areas, and finally used a Noisy-OR Bayesian function to calculate the final confidence score, classifying patients as COVID-19, Influenza-A viral pneumonia (IAVP), or not infected [14].
In summary, the above existing work mainly focuses on lesion detection of COVID-19 or its severity assessment. Few studies have paid attention to the classification of lesion subtypes, which ignores the important role of lesion subtypes in the diagnosis of COVID-19 disease. Identifying lesion subtypes in a timely manner can enable clinicians to better assess the patient's condition and prescribe personalized, precise treatment. Zhao et al. pointed out that the severity and symptoms of COVID-19 pneumonia differ from those of common pneumonia [15], and its lesions and characteristics also differ from common pneumonia [16]. Therefore, the lesions caused by COVID-19 pneumonia merit further study. At the same time, if the lesion type can be determined more accurately, the doctor can more accurately assess the COVID-19 patient's condition by referring to it.
Therefore, it is necessary to study computer-aided diagnosis for COVID-19 lesion subtype recognition; at the same time, such a study can reduce the image-reading burden of radiologists facing vast amounts of data. Moreover, the existing machine learning classification methods for COVID-19 are mainly based on features extracted from 2D CT images. 3D radiomics features can make full use of the high-level information of lesions, which, to the best of our knowledge, has not been discussed in the state of the art.
The contributions of this paper can be summarized in three aspects.
1. The pilot research work on lesion subtypes of COVID-19 is presented, which has not appeared in previous studies so far and may greatly assist doctors in diagnosing and evaluating the severity of COVID-19 patients more effectively.
2. The 3D texture radiomics analysis method is applied to COVID-19 lesion diagnosis, which better explores hidden inner characteristics of the lesions and helps experts better understand the pathological features of COVID-19.
3. Extensive experiments on clinical real-world datasets demonstrate the effectiveness of the proposed hybrid adaptive feature selection method. Moreover, we show the capability of the proposed model on high-dimensional feature data with a serious imbalance problem.
The rest of the paper is organized as follows. We first introduce the method, including the composition of the dataset, the characteristics of the 3D features, and the feature selection strategy, in Section Methodology. The experimental results on the prepared database are discussed in Section Results. We finally conclude this paper and look forward to future work in Section Conclusion.

Results
In this section, the classifier evaluation criteria are illustrated first. Then, we present the experimental results achieved by different methods on the evaluation dataset. Finally, comparative experiments are conducted to show the influence of data augmentation, 3D features, and HAFS.

Experimental settings
Since COVID-19 is a new disease, there are few public CT image datasets with annotations suitable for this study. Therefore, we extracted lesions from 319 COVID-19 pneumonia cases provided by Neusoft Medical to construct a dataset. All patients received a thin-slice chest CT scan on a Neusoft 256-slice CT scanner.
The final subsets of features are evaluated by RF classification with 10-fold cross-validation. Precision, Recall, Accuracy, and F-measure are used to compare the estimated and known labels according to the following expressions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F-measure = 2 × Precision × Recall / (Precision + Recall)

From the results, the accuracy of Label 3 is always the highest, some values even close to 100%. The possible reason for this phenomenon is that, although data augmentation makes the numbers of the four lesion types more balanced, the number of Label 3 lesions is still the smallest. On the contrary, the precision of GaussianNB is (88.57, 42.62, 14.64, and 37.82), showing seriously insufficient fitting ability; the reason may be that GaussianNB is prone to under-fitting with a small number of samples. Fig.4 shows the ROC curves of the different models; it is also obvious that our method has the largest area under the ROC curve. These results all show that HAFS-RF can improve the performance and efficiency of COVID-19 classification.
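The per-class metrics above can be sketched directly from their definitions. The following is a minimal illustration with a hypothetical toy label set (the function name and data are ours, not from the paper):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, labels):
    """Per-class precision, recall and F-score for a multi-class task,
    computed from TP/FP/FN counts as in the expressions above."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    results = {}
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[c] = (precision, recall, f1)
    return results

# Toy example with the four lesion subtype labels
y_true = [1, 1, 2, 2, 3, 4, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 1]
m = per_class_metrics(y_true, y_pred, labels=[1, 2, 3, 4])
```

In practice these numbers would come from the 10-fold cross-validation predictions of the RF classifier.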

Influence of data augmentation
To evaluate the effectiveness of the data augmentation, we compare the model trained with and without it, with the results reported in Table 3. As can be seen from Table 3, the score of Label 1 is higher without augmentation. The possible reason is that the excessive number of Label 1 samples leads to over-fitting of the model; on the contrary, the insufficient numbers of the other types lead to insufficient fitting ability. After data augmentation, the four sample sizes are relatively balanced, thus avoiding over-fitting to Label 1, so the score of Label 1 decreases but the overall score increases.

Influence of 3D features
As shown in Fig.5, each feature has its own score in HAFS. The green features are discarded in the first stage of HAFS, the orange features are discarded in the second stage, and the blue features are selected after HAFS. Table 4 shows the details of the blue features in Fig.5. After feature selection, 48 of the 189 features were retained, among them 18 (6×3) 2D features and 30 3D features. More 3D features are retained than 2D features, so 3D features may be more effective than 2D features.
To study the influence of 3D features, a comparison was done with and without the 3D features for our model. Results of the evaluation criteria for the 189 features are given in Table 5. As shown in Table 5, the model with 3D features performs better. The comparison with other feature selection methods is reported in Table 6.
One can observe from Table 6 that, compared with the other four methods, HAFS achieves the highest accuracy (93.06, 96.84, 99.58, and 94.30). This indirectly proves that the features selected by HAFS are more representative.
Secondly, we further develop four methods based on SVM, KNN, GaussianNB, and QDA by using HAFS (i.e., HAFS-SVM, HAFS-KNN, HAFS-GaussianNB, and HAFS-QDA). We evaluate these eight methods, with the results reported in Table 7. The results show that HAFS can improve the performance of the different methods. The reason may be that HAFS selects a small number of relevant features from a large number of features, thereby avoiding over-fitting. We can also see that the best α differs between models: for SVM the best α is 0.1, for KNN 0.5, for GaussianNB 0.1, and for QDA 0.3. The possible reason for the difference is that the underlying principles of the classifiers are not the same.

Conclusion
The four most typical lesion subtypes of COVID-19 are discussed and a computer-aided diagnosis approach for lesion subtypes is proposed in this paper. Three-dimensional texture descriptors are applied to the volume data of lesions, along with shape and first-order features. The massive feature data are selected by a hybrid adaptive selection algorithm while a classification model is trained at the same time. Extensive experiments on clinical real-world datasets demonstrate the effectiveness of the proposed hybrid adaptive feature selection method. Moreover, we show the capability of the proposed model on high-dimensional feature data with a serious imbalance problem. The results show that the 3D radiomics features chosen by the hybrid adaptive selection algorithm can better express the high-level information of the lesion data. The classification model achieves good performance compared with state-of-the-art COVID-19 models, which can help clinicians more accurately identify the subtypes of COVID-19 lesions and provide support for further research.

Methodology
In this study, four lesion subtypes are studied, namely ground-glass opacity (GGO, referred to as Label 1), cord (Label 2), solid (Label 3), and subsolid (Label 4) [17]. The CT images of the dataset prepared in this paper are shown in Fig.1, presented in the transverse, sagittal, and coronal planes respectively; the lesions were annotated by two radiologists with the ITK-SNAP software.
The numbers of the four types of lesions are 2637, 519, 103, and 475 respectively, for a total of 3734 in the original dataset. Table 1 summarizes the prepared dataset with the maximum, minimum, and standard deviation of the sizes in the three directions and the volumes for each lesion subtype. The ground-glass opacity subtype accounts for the majority of COVID-19 lesions, which indicates an imbalanced-data problem. The subsolid subtype has the largest lesion size while the cord subtype has the smallest.
Based on this dataset, the processing pipeline shown in Fig.2 includes four steps. We first introduce the lesion extraction algorithm and the data augmentation used in this study. Then, the feature extraction process for the 2D and 3D features is discussed. The implementation details are presented subsequently. Finally, we describe the random forest model in the fourth step.
Lesion data extraction and augmentation
VB-Net is used to predict and segment lesions at unknown locations; it combines V-Net with bottleneck layers to reduce and combine redundant information [18]. However, all the 3D volume data used in this experiment already have the mask position of the corresponding lesion, annotated by the radiologists, so VB-Net or V-Net would introduce errors when extracting lesions. Therefore, we first use a breadth-first search (BFS) based lesion extraction algorithm, of the kind mostly used on graph-structured data, to extract lesions. It should be noted that the lesion locations and subtypes were labeled by two radiologists, which ensures the accuracy of the evaluation data.
The BFS-based lesion extraction algorithm is shown as Algorithm 1. We traverse the mask array, created by the radiologists, over the entire 3D volume. When a pixel with a lesion mark is reached, we use BFS to extract the lesion range of that mask [19]. The input is a point in the graph, which is used to initialize a queue. The main idea of the algorithm is to take out a point each time and enqueue all nearby points of the lesion that have not yet been visited. Because we set a search range L in the algorithm, it can well avoid discontinuity of the lesion data in a certain dimension caused by inaccurate labeling. We also set a global mark in the algorithm, so that all traversed points are recorded. This has two advantages: 1) in the current BFS pass, marked points are not enqueued again, which avoids double computation; 2) in the global traversal of the mask data, marked points do not trigger the BFS algorithm repeatedly. This ensures that each BFS run generates exactly one lesion and returns its 3D range.
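The extraction step above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the exact neighbourhood handling of the search range L is our assumption, and the function name and toy mask are ours.

```python
from collections import deque

import numpy as np

def extract_lesions(mask, search_range=1):
    """Group marked voxels of a 3D mask into connected lesions via BFS.

    `search_range` plays the role of the paper's L: it lets the search
    bridge small gaps left by imprecise annotation. Returns one
    (min_corner, max_corner) bounding range per lesion.
    """
    visited = np.zeros(mask.shape, dtype=bool)  # global mark
    lesions = []
    for start in zip(*np.nonzero(mask)):
        if visited[start]:
            continue  # already belongs to an extracted lesion
        queue = deque([start])
        visited[start] = True
        voxels = [start]
        while queue:
            z, y, x = queue.popleft()
            # enqueue every unvisited marked voxel within the search range
            for dz in range(-search_range, search_range + 1):
                for dy in range(-search_range, search_range + 1):
                    for dx in range(-search_range, search_range + 1):
                        p = (z + dz, y + dy, x + dx)
                        if all(0 <= p[i] < mask.shape[i] for i in range(3)) \
                                and mask[p] and not visited[p]:
                            visited[p] = True
                            queue.append(p)
                            voxels.append(p)
        v = np.array(voxels)
        lesions.append((v.min(axis=0), v.max(axis=0)))
    return lesions

# Toy mask with two well-separated "lesions"
mask = np.zeros((10, 10, 10), dtype=np.uint8)
mask[1:3, 1:3, 1:3] = 1
mask[7:9, 7:9, 7:9] = 1
found = extract_lesions(mask, search_range=1)
```

The global `visited` array implements the paper's global mark, so each BFS run emits exactly one lesion.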
As is well known, the data imbalance problem leads to decreased accuracy in multi-class classification. The numbers of the four lesion subtypes are 2637, 519, 103, and 475. They are obviously unbalanced, and the number of Label 1 lesions far exceeds that of Label 3, which can lead to insufficient fitting of the classifier to Label 3 samples. Imbalanced feature data can shift the decision boundary of the classifier and affect the final classification result. Therefore, we adopt a data augmentation method based on ADASYN, proposed by He et al. [20], to reduce its impact on classification accuracy by increasing the numbers of Labels 2, 3, and 4. The method is briefly introduced in Algorithm 2.
Algorithm 2: ADASYN-based data augmentation
Input: training dataset; Output: the synthetic data x_syn
Let n_l be the number of majority-class samples and n_s the number of minority-class samples.
Calculate the degree of class imbalance: Degree_imbalance = n_s / n_l.
if Degree_imbalance < Degree_threshold then
    Calculate the number Δ_syn of synthetic samples to be generated.
    For each minority sample x_i, calculate the ratio r_i = N_KNeighbor / K, where N_KNeighbor is the number of majority-class examples among the K nearest neighbors of x_i.
    Normalize r_i: r̂_i = r_i / Σ_{i=1}^{n_s} r_i.
    Calculate the number of synthetic samples per point: g_i = r̂_i × Δ_syn.
    for i = 1 to n_s do
        for j = 1 to g_i do
            Randomly choose x_mi from the K nearest minority neighbors of x_i and generate a synthetic sample between x_i and x_mi.
        end for
    end for
end if
The algorithm increases the numbers of the four lesion subtypes from (2637, 519, 103, and 475) to (2637, 1098, 386, and 976); the experiments in Section Results show that this data augmentation effectively improves classification performance.
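The ADASYN idea of Algorithm 2 can be sketched with numpy and scikit-learn. This is a simplified illustration for one minority class; the function name, data, and the uniform fallback when no majority neighbours are found are our assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_oversample(X_min, X_maj, n_syn, k=5, rng=None):
    """ADASYN-style oversampling sketch: minority points with more
    majority-class neighbours receive a larger synthetic-sample quota."""
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_min, X_maj])
    y_all = np.r_[np.zeros(len(X_min)), np.ones(len(X_maj))]
    # r_i: fraction of majority examples among the k nearest neighbours
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn_all.kneighbors(X_min)
    r = y_all[idx[:, 1:]].mean(axis=1)          # skip the point itself
    r_hat = r / r.sum() if r.sum() > 0 else np.full(len(X_min), 1 / len(X_min))
    g = np.round(r_hat * n_syn).astype(int)     # per-point quota g_i
    # interpolate between each x_i and a random minority neighbour x_mi
    nn_min = NearestNeighbors(n_neighbors=min(k, len(X_min) - 1) + 1).fit(X_min)
    _, midx = nn_min.kneighbors(X_min)
    synthetic = []
    for i, gi in enumerate(g):
        for _ in range(gi):
            j = rng.choice(midx[i, 1:])
            lam = rng.random()
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced data: 20 minority vs 100 majority samples in 2D
data_rng = np.random.default_rng(0)
X_min = data_rng.normal(0.0, 1.0, size=(20, 2))
X_maj = data_rng.normal(0.5, 1.0, size=(100, 2))
X_syn = adasyn_oversample(X_min, X_maj, n_syn=40, k=5, rng=0)
```

The `imbalanced-learn` package provides a production implementation (`imblearn.over_sampling.ADASYN`) should one prefer a tested library over this sketch.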

Three dimensional feature extraction
Most current medical imaging research on COVID-19 is based on X-ray images or limited to planar CT features. Furthermore, 3D features are seldom considered in such research. Therefore, this paper extracts 2D features from certain layers together with additional 3D features of the CT data to better characterize the lesion information.
Existing methods mainly extract 2D features from CT slices, which ignores the fact that a COVID-19 lesion is volumetric data. As a result, the connection between layers is lost during feature extraction, along with some hidden features.
To improve classification accuracy, we extracted multiple types of lesion features, including 2D and 3D features, detailed as follows.
1. Infected lesion number: The stage of COVID-19 affects the number of lesions and also the distribution of the different types. Therefore, we add the total number of lesions of the same patient to the features.
2. Shape features: Some cord-type lesions differ significantly from other lesions in shape, so we extracted three-dimensional shape features from the lesions to improve multi-class accuracy.
3. First-order features: First-order features provide information related to the gray-level distribution of the image. We first obtain the middle layer and the layer with the largest lesion area of the CT images in three directions, and extract 14 two-dimensional handcrafted features from them: mean, variance, max, skewness, kurtosis, area, compactness, roughness, contrast, dissimilarity, homogeneity, energy, correlation, and ASM. We then also extract the corresponding three-dimensional features and merge them into the total feature set.
4. Second-order features: Second-order features give more information about the relative positions of the various gray levels within the lesion image.
(a) The gray-level co-occurrence matrix (GLCM) [21]: The GLCM is a statistical texture analysis method that considers the spatial relationship of pixels. The (i, j)-th element of the GLCM, P(i, j | δ, θ), describes the number of times the combination of gray levels i and j occurs in two pixels of the image separated by a distance of δ pixels along angle θ. The 3D-GLCM considers 13 directions, which are calculated separately and finally averaged. In addition, we also extracted 4 features, namely the width, height, length, and volume of the 3D lesion bounding box. In summary, a total of 189 features are used in our study.
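The co-occurrence counting behind P(i, j | δ, θ) can be shown on a toy 2D image; a 3D-GLCM simply repeats this for 13 displacement vectors and averages. The function name and toy image below are ours, for illustration only:

```python
import numpy as np

def glcm(image, offset, levels):
    """Grey-level co-occurrence matrix P(i, j | delta, theta) for one
    displacement vector, counting ordered pixel pairs."""
    P = np.zeros((levels, levels), dtype=np.int64)
    off = np.array(offset)
    for pos in np.ndindex(image.shape):
        nbr = tuple(np.array(pos) + off)
        if all(0 <= nbr[d] < image.shape[d] for d in range(image.ndim)):
            P[image[pos], image[nbr]] += 1
    return P

# Toy image with 4 grey levels; horizontal neighbour (delta=1, theta=0)
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
P = glcm(img, offset=(0, 1), levels=4)
# e.g. the pair (2, 2) occurs 3 times along the horizontal direction
```

Texture statistics such as contrast or energy are then computed from the normalized P; in practice a library such as `skimage.feature.graycomatrix` or PyRadiomics would be used.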
Hybrid adaptive feature selection method
As described in Section Three dimensional feature extraction, we extract specific features from the 3D lesion data. However, too much irrelevant information in the features can easily cause over-fitting of the model, which reduces accuracy on the test set. Therefore, before training the model, feature selection is necessary to reduce the influence of redundant features.
There are three kinds of algorithms in existing feature selection: filters, wrappers, and embedded methods. Although these methods have their advantages, they all share one obvious disadvantage: the number of features in the selected subset cannot be determined. Therefore, we propose a Hybrid Adaptive Feature Selection (HAFS) algorithm. It not only solves the problem of the uncertain number of features in the subset but also integrates the advantages of the traditional methods.
The HAFS method is divided into two stages. In the first stage, the feature set F = {f_1, f_2, ..., f_189} is used as input, and a genetic algorithm (GA) is used to select features. First, a population of q chromosomes is initialized; each chromosome is a binary vector of length n, U = {U_1, U_2, ..., U_189}, where each U_i is 1 or 0: U_i = 1 means f_i is selected, otherwise f_i is not selected. Next, p iterations are performed. Before each iteration, the fitness of each chromosome is evaluated. Then, according to the fitness, the chromosomes in the population are processed with different probabilities by three genetic operators: selection, crossover, and mutation. At the end of the iterations, the feature subset F_GA = {f_GA1, f_GA2, ..., f_GAm} is determined from the bits of the chromosome U_max with the highest fitness. A schematic illustration of the first stage is shown in Fig.3.
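The first stage can be sketched as a standard binary-mask GA. This is a generic illustration, not the paper's exact operator settings: the function name, operator probabilities, and the toy fitness are our assumptions (in the paper the fitness would be a cross-validated classifier score on the selected subset).

```python
import numpy as np

def ga_select(fitness, n_feat, pop=20, iters=30, p_cross=0.8, p_mut=0.02, rng=0):
    """GA feature selection sketch: chromosomes are binary masks over
    n_feat features; fitness(mask) must return a positive score."""
    rng = np.random.default_rng(rng)
    popu = rng.integers(0, 2, size=(pop, n_feat))
    for _ in range(iters):
        scores = np.array([fitness(c) for c in popu])
        # selection: fitness-proportional sampling with replacement
        probs = scores / scores.sum()
        parents = popu[rng.choice(pop, size=pop, p=probs)]
        # single-point crossover on consecutive pairs
        children = parents.copy()
        for i in range(0, pop - 1, 2):
            if rng.random() < p_cross:
                cut = rng.integers(1, n_feat)
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        # bit-flip mutation
        flip = rng.random(children.shape) < p_mut
        popu = np.where(flip, 1 - children, children)
    scores = np.array([fitness(c) for c in popu])
    return popu[scores.argmax()]  # U_max, the fittest chromosome

# Toy fitness rewarding masks that match a hidden target subset
target = np.zeros(20)
target[:5] = 1
best = ga_select(lambda m: 1.0 + (m == target).sum(), n_feat=20)
```

The returned chromosome plays the role of U_max, and its set bits define F_GA.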
In the second stage, in order to integrate the advantages of multiple feature selection methods, we use F_GA as input and apply two filter methods, the F-test [26] and the Maximal Information Coefficient (MIC) [27], one wrapper method, Recursive Feature Elimination (RFE) [28], and one embedded method, the L1-regularized linear regression model (Lasso) [29], to score the features and sort them in ascending order; the results are denoted F_1, F_2, F_3, and F_4.
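Combining several selectors by their rankings can be sketched as follows. This is our illustration of the rank-aggregation idea, with generic score arrays standing in for the F-test, MIC, RFE, and Lasso scores; the function names and the exact aggregation rule are assumptions:

```python
import numpy as np

def rank(scores):
    """Ascending rank of each feature (rank 1 = lowest score)."""
    order = np.argsort(scores)
    r = np.empty(len(scores), dtype=int)
    r[order] = np.arange(1, len(scores) + 1)
    return r

def combine_selectors(score_lists, alpha):
    """Sum the per-selector ranks and keep the top alpha fraction of
    features (highest combined rank = most consistently high-scoring)."""
    total = sum(rank(s) for s in score_lists)
    n_keep = max(1, int(alpha * len(total)))
    return np.argsort(-total)[:n_keep]

# Toy run: three scorers over 10 features, keep 30 % of them
rng = np.random.default_rng(1)
scores = [rng.random(10) for _ in range(3)]
kept = combine_selectors(scores, alpha=0.3)
```

Rank-based aggregation makes the heterogeneous scores of filter, wrapper, and embedded methods comparable before they are summed.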
So we score each f_GA in F_GA according to F_1, F_2, F_3, and F_4. The scoring standard S(f_GA) combines the rankings of f_GA over the four sorted lists:

S(f_GA) = Σ_{k=1}^{4} rank_k(f_GA),

where rank_k(f_GA) is the position of f_GA in F_k. According to the S(f_GA) score, F_GA is sorted in ascending order; in this way we can decide which features to keep. On the other hand, to further prevent over-fitting after GA selection, we select only part of the features of F_GA, setting a parameter α as the ratio of the number of selected features. The final result of the feature selection is F_HAFS.

Random forest based classification model
In the classification stage, the classifier used in this paper is the random forest (RF) model. The main idea is: select a subset S_b from the full sample set S by randomly selecting sample features and sampling; then build a classification and regression tree (CART) for S_b, using the Gini coefficient as the splitting criterion [30]. In this paper, the above process was repeated 1000 times to construct 1000 CART trees forming the random forest. The random forest based classification model is shown in Algorithm 3.

Figure 1
Typical four lesion subtypes in CT images of COVID-19. The labels and regions are given by medical experts. The red area represents ground-glass opacity, the green area represents cord, the blue area represents solid and the yellow area represents subsolid.
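The classification stage maps directly onto scikit-learn. The sketch below uses mock data in place of the real radiomics features (the synthetic feature matrix, class shift, and sample counts are our assumptions); only the 1000 Gini-criterion trees and 10-fold cross-validation follow the paper's setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Mock stand-in for the 48 selected radiomics features per lesion
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 48))
y = np.repeat([1, 2, 3, 4], 50)   # four lesion subtype labels
X[y == 1] += 1.0                  # make Label 1 somewhat separable

# 1000 CART trees with the Gini splitting criterion, as in the paper
clf = RandomForestClassifier(n_estimators=1000, criterion="gini",
                             random_state=0)
acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
```

With the real (augmented) lesion features, the same estimator and 10-fold protocol would produce the per-label metrics reported in the Results section.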

Figure 4
Overview of the COVID-19 Classification Using Random Forest Based on Hybrid Adaptive Feature Selection.

Figure 5
A schematic illustration of the first stage of HAFS.