CT Image Analysis Using Grayscale Statistics to Categorise Severity of Lung Abnormalities of COVID-19 Patients


 Grayscale image attributes from 456 images extracted from CT scan slices of 53 patients (49 with COVID-19 and 4 without) are used to establish a visual scale of severity of lung abnormalities (five classes: 0 to 4). The complex trends of these easy-to-derive image attributes can be used graphically to discern the visual scale of lung abnormalities in broad terms. With the aid of machine learning algorithms, the visual classes can be distinguished with close to 95% accuracy using combinations of selected grayscale attributes. Confusion matrices reveal that the best-performing machine learning models are able to distinguish more accurately between certain classes than visual inspection of CT images by experts. On average, the adaboost, decision tree and random forest models misclassified fewer than 25 of the 456 CT-scan image extracts evaluated across the visual classes of lung abnormalities.


Introduction
The global COVID-19 pandemic has had devastating consequences. Rapidly and accurately distinguishing and grading the severity of its pulmonary impacts for individual patients from image data remains an important objective. Computed tomography (CT) images offer a readily accessible, safe (low-radiation) and flexible modality for assessing pulmonary conditions and are now routinely used for diagnosing and assessing the severity of various lung diseases, including lung cancer, bacterial and viral pneumonia and, most recently, COVID-19 1,2. Indeed, there are growing indications that radiological methods deliver greater sensitivity for COVID-19 diagnosis than a range of lab-based tests, including the standard diagnostic test for the virus's nucleic acid by real-time reverse transcription polymerase chain reaction (rRT-PCR) 3,4. This makes thoracic CT scan slices the favoured image modality for COVID-19 detection and verification 5. Much of the recent research on CT images relating to COVID-19 has addressed diagnosis and observation during treatment rather than attempting to classify different degrees of severity 6. The wider application of CT image information for distinguishing degrees of severity of COVID-19 has been recognized and is being exploited 7. López-Cabrera et al. 8 considered some of the obstacles and limitations of using machine and deep learning methods for the automated classification of COVID-19 using chest CT scans. Farhat et al. 9 also reported on some of the challenges of applying deep-learning techniques to pulmonary image data for COVID-19 diagnosis and prognosis.
Features indicative of COVID-19, typically present in thoracic CT images, are ground-glass opacity (GGO), crazy-paving and consolidation textures, air bronchograms, reverse halos, and peri-lobular patterns 10. Duzgun et al. 11 evaluated radiological characteristics of a broad spectrum of lung pathologies, considering their resemblance to the joint signs of COVID-19 pneumonia. They identified that existing underlying lung abnormalities could interfere with the CT diagnosis of COVID-19. Deep feature classification is proving useful in classifying the severity of lung abnormalities in patients with COVID-19 infections 12. Computer-aided detection (CAD) exploiting machine learning (ML) and deep learning (DL) algorithms has been successfully configured and applied to accurately detect and distinguish such features 13-15. In particular, DL methods adapting convolutional neural networks (CNN), coupled with automatic feature extraction algorithms, have received much attention 16,17. Automated feature extraction is not a straightforward process, as it requires effective and reliable image segmentation software 14,18,19.
There is now a strong research trend towards CT scan image analysis applying deep learning coupled with automated feature extraction software 9. Whereas such an approach is clearly effective for diagnosis and for identifying the severity of lung abnormalities, it is somewhat opaque with respect to the underlying attributes that determine the features and patterns being distinguished in the suites of thoracic CT scan slices now routinely collected for many COVID-19 patients. This study takes a somewhat different and original tack in an attempt to increase the transparency of the underlying image attributes that contribute to the distinguishing features. Armed with this information, much insight can be gained into the characteristics associated with different levels of severity of lung abnormalities.
This study focuses on the grayscale statistical analysis of human-selected image extracts of the pulmonary parenchyma portion of CT-scan slices. The image extracts are obtained by visual inspection and do not involve any automated feature extraction. Much information can be rapidly gained from such grayscale image extracts using statistical and graphical analysis on a standalone basis. The statistical data are also suitable for further analysis with machine learning algorithms, on a supervised basis, to accurately predict degrees of severity of lung abnormalities associated with specific images.

CT Scan Images Compiled
In this study, images were gathered at the Namazi hospital (Shiraz, Iran; refer to Ethical Guidelines and Consent Statement) using a Philips 16-slice CT scanning machine. Image sets with slice thicknesses of 0.625 mm were generated without using contrast. Unenhanced chest CT scans are widely employed as a good modality for fast diagnosis and evaluation of viral diseases such as coronavirus, complementing laboratory rRT-PCR testing. Suites of CT image scans were compiled from the 49 COVID-positive patients and 4 COVID-negative patients.
CT scanning devices can have either single- or multi-slice capabilities. Single-slice CT scanners generate a single image for each spin of the patient gantry. Multi-slice machines are now widely used in many hospitals because they provide more rapid and comprehensive imaging. Multi-slice CT scanners range from four- to sixty-four-slice configurations 20. 16-slice CT scanners facilitate rapid full-organ coverage, generating high-resolution images, and are well suited for detailed lung investigation.

Visual CT scan image assessment
The suite of CT image slices for each individual was visually assessed by a clinician and each image was assigned to one of five classes using a score on the following 0 to 4 scale:

0 = Patient has tested negative for COVID-19 and has no obvious signs of other lung diseases
1 = Patient has tested positive for COVID-19 but has no obvious signs of lung abnormalities
2 = Patient has tested positive for COVID-19 and has only minor signs of lung abnormalities
3 = Patient has tested positive for COVID-19 and has substantial signs of lung abnormalities
4 = Patient has tested positive for COVID-19 and has severe signs of lung abnormalities

These visual scores (VS) then become the prediction objective of the grayscale image analysis. Appendix 1 (see Supplementary File) shows example extract images for each of these VS classes.

Statistical analysis of extract grayscale images
From the suite of CT scan images for each COVID-positive patient, four to five images were identified as representative and selected for grayscale analysis. From each of those selected CT images, two rectangular grayscale image samples were extracted, one from the left lung and one from the right lung. From the 49 COVID-positive patients this resulted in 392 individual extract images being compiled. The same approach was taken with the COVID-negative patients, but more of the CT images were sampled from each of those patients. A total of 64 extract images were compiled from the four COVID-negative patients. The dataset for grayscale analysis therefore includes 456 individual rectangular extract images.
The extract images varied in size from about 2000 to 80000 pixels, averaging about 25000 pixels. The size of each extracted image was determined by the nature of the original CT scan. The objective for each extract image is that it should include only the parenchyma of the lung, avoiding the pleura, the diaphragm and the mediastinum.
A pixel in a grayscale image is associated with a single pixel value indicative of its brightness. That pixel value is typically expressed as an 8-bit integer, allowing a value range from 0 to 255; by convention, 0 represents black and 255 represents white. Thirteen grayscale statistics were computed for each extract image. Two of these statistics (#3 and #8) are substantially influenced by the number of pixels in each image, so are not suitable for comparative use between images. However, the standard error is useful for providing an indication of the uncertainty in the average grayscale values. It is encouraging that the standard error is less than 0.7 of a grayscale unit for all images evaluated. Statistic #4 is independent of image size, so it is a more appropriate measure than statistic #3 for comparative purposes.
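Statistics of this kind can be computed directly from a pixel array. The sketch below, assuming NumPy and a 2-D array of 8-bit pixel values, covers only the statistics named in the text (average, variance, standard error, minimum, maximum, P90, pixels at the average grayscale value, variance/average); the paper's full numbered list of 13 statistics appears in a table not reproduced here, so the exact set and ordering are assumptions.

```python
import numpy as np

def grayscale_stats(extract):
    """Compute grayscale summary statistics for a rectangular image
    extract (2-D array of pixel values in the range 0-255)."""
    px = extract.ravel().astype(float)
    n = px.size
    avg = px.mean()
    var = px.var()
    # Pixels whose value rounds to the rounded average grayscale value
    n_at_avg = int(np.sum(px.round() == round(avg)))
    return {
        "n_pixels": n,                          # image-size dependent
        "average": float(avg),
        "variance": float(var),
        "std_error": float(px.std() / np.sqrt(n)),  # uncertainty in the average
        "minimum": float(px.min()),
        "maximum": float(px.max()),
        "p90": float(np.percentile(px, 90)),    # 90th-percentile grayscale
        "n_at_average": n_at_avg,               # image-size dependent
        "pct_at_average": 100.0 * n_at_avg / n,
        "var_over_avg": float(var / avg) if avg else float("nan"),
    }

# Example on a synthetic 100 x 50 extract (not real CT data)
rng = np.random.default_rng(0)
stats = grayscale_stats(rng.integers(0, 256, size=(100, 50)))
```

Because `n_pixels` and `n_at_average` scale with image size, only the normalised measures (e.g. `pct_at_average`) are comparable across extracts of different sizes, mirroring the point made in the text.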

Analysis with machine learning algorithms
Values for statistics #2 to #13 constitute the independent variables. The visual assessment scores (0 to 4) of lung abnormalities observed in the CT images (termed the "visual score"; VS) represent the dependent variable to be predicted by machine learning (ML) using a suite of classification algorithms. Therefore, each data record in the dataset involves twelve independent variables (the grayscale statistics for a specific extract image) and one dependent variable, the visual score determined by the clinician. The models are run in Python code with the objective function set to minimize the root mean squared error (RMSE) of actual VS (VSact) versus predicted VS (VSpred).
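The supervised set-up just described can be sketched as follows, assuming scikit-learn. The data here are synthetic stand-ins for the 456-record dataset, not the study's data, and default control parameters are used in place of those listed in Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 456 records, twelve grayscale statistics
# (#2 to #13) as inputs, a visual score (VS, 0 to 4) as the class.
rng = np.random.default_rng(1)
X = rng.normal(size=(456, 12))
y = rng.integers(0, 5, size=456)

model = RandomForestClassifier(random_state=1)
model.fit(X, y)
vs_pred = model.predict(X)

# The paper's objective function: RMSE between actual and predicted
# visual scores (meaningful because the classes are ordinal),
# alongside a simple count of misclassified images.
rmse = float(np.sqrt(np.mean((y - vs_pred) ** 2)))
n_errors = int(np.sum(vs_pred != y))
```

Because the VS classes are ordered, RMSE penalises a prediction two classes away more than one a single class away, which is why it is used here alongside the raw error count.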
The 456 data records are evaluated with a suite of ten established ML classification methods: adaptive boosting (ADA); decision tree (DT); extreme learning machine (ELM); Gaussian process classifier (GPC); K-nearest neighbour (KNN); multi-layer perceptron (MLP); naïve Bayes classifier (NBC); random forest (RF); quadratic discriminant analysis (QDA); and support vector machine (SVM). The VS prediction accuracy achieved by each of these methods is recorded and compared using a set of established prediction accuracy statistics. Application workflows and methodologies associated with the ten ML methods evaluated are not detailed here, as they are well established and discussed in detail elsewhere: ADA 22,23; DT 24-26; ELM 27-29; GPC 30,31; KNN 13,32; MLP 33; NBC 13,34; RF 35; QDA 36,37; and SVM 38,39. In this study, these algorithms are run in Python code and the control parameter values applied to them are listed in Table 1.

Table 1. ML algorithm execution method and control variable values applied to predict VS using the CT scan extract grayscale image dataset. All models were executed in Python.

Statistical analysis of grayscale image extracts
A summary of the grayscale image statistics is provided in Table 2, considering the dataset as a whole and distinguishing the images for patients testing positive for COVID-19 from those testing negative. It is clear that there is substantial variation in all of the grayscale statistics, which is encouraging for the possibility of exploiting them collectively to distinguish the severity of lung abnormalities in the patients. Even more encouraging are the substantial differences between the value distributions of several of these statistics for COVID-positive and COVID-negative patients. Of particular note are the higher average values of statistics #2 and #5 to #12, and the lower average values of statistics #3 and #4, for the COVID-19-positive patients compared with the COVID-19-negative patients (Table 2).

As well as the differences observed between the patient groups, there are also some notable correlations between the grayscale statistics and the VS indicator (Table 3). Two correlation coefficients are calculated: the Pearson correlation coefficient (R), which assumes the variables are normally distributed (i.e., parametric), and the Spearman correlation coefficient (ρ), which is non-parametric and is calculated using ranked data. VS refers to the visual score (scale 0 to 4) assigned to each image by a clinician in terms of the severity of lung abnormalities, based on inspection of the suite of CT scans available.

Table 3. Pearson correlation coefficients for the grayscale statistics of the 456 extract images. The last column displays the visual score Spearman rank correlation coefficients for comparison.

Statistical measures of prediction accuracy assessed
The VS prediction accuracy is assessed in terms of nine widely used measures of statistical accuracy (Fig. 1). These measures of prediction accuracy complement each other in the information they provide about the VS predictions versus the actual values. RMSE is of particular interest as this is the metric used as the objective function for the ML algorithms.
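The two correlation coefficients described above can be computed as in the sketch below, which assumes SciPy; the data are synthetic stand-ins for one grayscale statistic and the clinician-assigned visual score, not the study's values.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic illustration: a grayscale statistic (e.g. the image
# average) constructed to increase with the visual score (VS).
rng = np.random.default_rng(2)
vs = rng.integers(0, 5, size=456)                  # visual scores 0-4
gray_avg = 60 + 25 * vs + rng.normal(0, 10, 456)   # synthetic statistic

r, _ = pearsonr(gray_avg, vs)      # parametric coefficient (R)
rho, _ = spearmanr(gray_avg, vs)   # rank-based, non-parametric (rho)
```

The Spearman coefficient is the safer of the two when a statistic is related to VS monotonically but not linearly, which is why both are reported for comparison.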

Graphical analysis of grayscale statistics
Graphical displays better illustrate the continuity of the grayscale distributions than the tabular statistical analysis and highlight some key relationships between these variables. Figure 2 shows the relative spread of values relating to grayscale average, grayscale variance and percentage of pixels at the grayscale average. The scaling factors are applied to optimize the spread of the data points in this triangular display.
The three statistics displayed in Fig. 2 successfully distinguish the severity of COVID-19-related lung abnormalities with a continuous gradation from the lower left of the triangle. That position indicates a high percentage of pixels at the grayscale average, which is a dark grey shade, with low variance. As the data points move towards the top right, the increasing variance and average grayscale indicate that lighter grey shades are present as the lung abnormalities become more severe (higher VS values indicated by darker symbols). Finally, for the grayscale extract images with the most severe lung abnormalities, the data points in Fig. 2 move towards the lower right of the triangle as variance decreases and light grey shades dominate the images.
A three-dimensional graphic of grayscale P90 versus percentage of pixels at the average grayscale value versus grayscale variance provides another graphic that meaningfully distinguishes the severity of lung abnormalities related to COVID-19 from CT scan images (Fig. 3).
The graphical and statistical relationships established indicate that the grayscale statistics can be used effectively to identify the severity of lung abnormalities in relation to COVID-19 infection. Indeed, the contrast between these statistical variables suggests that relatively simple formulaic relationships between some of them could be used to provide a useful automated scale of severity of lung abnormalities with which to compare the VS score visually assigned by a clinician. This goes beyond the objective of the current study. However, the approach of exploiting grayscale images extracted from CT scans clearly has promise for categorizing lung abnormalities associated with diseases such as pneumonia and lung cancer, as well as COVID-19. Future research is planned to further explore these possibilities.

Feature selection for machine learning analysis
The grayscale statistical dataset (456 data records with 13 input variables; Table 2) can be used with commonly applied ML algorithms to predict the VS value for each image extract to a high degree of accuracy. To demonstrate this, two cases of ML models are executed with VS as the objective function: A) using eleven of the available statistical variables (all excluding number of pixels and variance/average); and B) using just nine variables (excluding number of pixels, standard error, minimum grayscale and number of pixels at the average grayscale value). The feature selection is based upon the information provided in Table 3 and Figs. 2 and 3. The most influential variables, in terms of their correlation coefficients with VS and with each other, are included as input variables. On the other hand, the variables with the lowest R and ρ values with VS, and those that are influenced by the number of pixels in the extract images, are excluded from the 9-variable case.
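The selection logic described above, which the paper performs manually from its correlation table, can be approximated programmatically. This sketch assumes SciPy and ranks candidate statistics by the absolute value of their Spearman correlation with VS after excluding size-dependent measures; the function name, variable names and synthetic data are all illustrative, not taken from the study.

```python
import numpy as np
from scipy.stats import spearmanr

def select_features(X, y, names, exclude, k):
    """Rank candidate grayscale statistics by |Spearman rho| with the
    visual score, skipping statistics known to depend on image size,
    and return the k most strongly correlated feature names."""
    scored = []
    for j, name in enumerate(names):
        if name in exclude:
            continue
        rho, _ = spearmanr(X[:, j], y)
        scored.append((abs(rho), name))
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Synthetic demonstration: two statistics built to track VS, three not.
rng = np.random.default_rng(3)
y = rng.integers(0, 5, size=200)
names = ["average", "variance", "std_error", "p90", "n_at_average"]
X = np.column_stack([
    10 * y + rng.normal(0, 5, 200),   # average: strongly related to VS
    rng.normal(size=200),             # variance: unrelated in this toy data
    rng.normal(size=200),             # std_error: size-dependent, excluded
    5 * y + rng.normal(0, 5, 200),    # p90: related to VS
    rng.normal(size=200),             # n_at_average: size-dependent, excluded
])
chosen = select_features(X, y, names,
                         exclude={"std_error", "n_at_average"}, k=2)
```

On this synthetic data the two VS-linked statistics are recovered, mimicking how the size-dependent and weakly correlated variables were dropped for the 9-variable case.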

VS predictions from machine learning methods
Tables 4 and 5 present the prediction accuracy statistics for the ten ML algorithms in terms of their ability to correctly assign the VS value to each of the 456 data records, for the 11-variable and 9-variable models, respectively. Each algorithm's prediction performance is ranked in terms of RMSE and the number of prediction errors (right-hand columns of Tables 4 and 5). Most algorithms yield better VS prediction accuracy in terms of RMSE for the 11-variable models. However, the adaboost model results in lower RMSE (and fewer errors) for the 9-variable model, making it the best-performing algorithm for that variable combination. Random forest generates the best performance for the 11-variable models in terms of RMSE, although the decision tree model results in the lowest number of errors (23 out of 456; 95% of its predictions being correct). The KNN model achieves similar performance for the 11-variable and 9-variable models, with lower RMSE in the former but fewer errors in the latter (34 and 35 prediction errors across the two cases). On the other hand, the MLP model's performance deteriorates significantly from Case A (54 prediction errors) to Case B (100 prediction errors). These results suggest that, of the ML models evaluated, ADA, DT, ELM, KNN and RF are best suited to the VS prediction task for the dataset evaluated. Future studies are planned to evaluate the potential of deep learning methods to improve upon the results achieved by these machine learning methods.

Discussion
The results presented are considered very encouraging in terms of the use of statistical analysis of grayscale image extracts from suites of CT scans to identify the severity of lung abnormalities in COVID-19 sufferers, and potentially those with other lung diseases. The broad continuous nature of the distributions for several grayscale statistics (Tables 2 and 3) can be usefully exploited to grade lung abnormalities on a continuous basis. This can be achieved graphically (Figs. 2 and 3) without recourse to machine learning. Alternatively, ML methods can predict the visual scale of lung damage assessed by a clinician with up to about 95% accuracy. These results justify further research and application of the method.
How error is assessed in the ML methods is worthy of more detailed consideration. The ML methods are driven to minimize RMSE as their objective function. This is considered an appropriate approach, as RMSE provides a continuous and highly sensitive error scale for this purpose. However, the overall objective of the prediction models is to minimise the number of prediction errors. It is therefore important to compare these two error measures, and the other measures shown in Tables 4 and 5, to understand the accuracy of the models more fully. It is possible for the RMSE value to go up slightly while the number of prediction errors goes down slightly (e.g., compare the performance of the KNN algorithm for Cases A and B). It is worth exploring why this is possible and its implications.
Confusion matrices are useful ways to explore in more detail how the prediction errors are distributed across the VS classes (0 to 4). Figure 4 plots a confusion matrix for the two best performing models (RF for Case A and ADA for Case B).
The comparison of these two high-performing ML models is interesting because both achieve the same number of prediction errors overall (24), but the ADA model for Case B results in a lower RMSE value (0.2324) than the RF Case A model (RMSE = 0.2569). Close inspection of the confusion matrices shows that both models achieve similar performance for VS class 0 (COVID-negative patients), with two errors each. However, the RF model confuses its two class-0 errors with VS class 2, whereas the ADA model confuses its two errors with class 1. In RMSE terms, the RF model's errors for this class therefore contribute more to the RMSE than the ADA errors. The ADA model performs better in its predictions for VS classes 1 and 2 than the RF model. On the other hand, the RF model performs better in its predictions for VS classes 3 and 4 than the ADA model. What is relevant, though, is that only one of the 24 errors generated by the ADA model is more than one class away from the correct VS class (for VS class 4, one erroneous prediction falls into class 2). In contrast, three of the errors generated by the RF model are more than one class away from the correct VS class (two for VS class 0 and one for VS class 2). The RMSE magnitude is therefore affected not only by the number of prediction errors but also by their magnitude. From an RMSE perspective, it is better for errors to fall into neighbouring classes rather than into more distant classes.
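The point about error distance can be illustrated with a toy example: two hypothetical models that each make exactly two errors on ten images, differing only in how far the erroneous predictions fall from the true class. The numbers are illustrative and are not the study's results.

```python
import numpy as np

def rmse(actual, predicted):
    a, p = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Hypothetical VS predictions for ten images (classes 0-4).
actual  = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
model_a = [0, 1, 1, 1, 2, 2, 3, 3, 4, 3]   # two errors, each one class away
model_b = [0, 2, 1, 1, 2, 2, 3, 3, 4, 3]   # two errors, one of them two classes away

errors_a = sum(x != y for x, y in zip(actual, model_a))
errors_b = sum(x != y for x, y in zip(actual, model_b))
```

Both models have identical error counts, yet model B's RMSE is higher (sqrt(0.5) versus sqrt(0.2)) because its single two-class confusion contributes a squared error of 4 rather than 1, mirroring how the RF model's distant class-0 confusions raise its RMSE relative to ADA.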
As well as providing insight into the location of the prediction errors, the confusion matrices highlight the classes of the dependent variable for which the model works well and less well. Consequently, Fig. 4 suggests that the RF model could potentially do better with its predictions at the middle/lower end of the VS scale (VS 0 to 2), whereas the ADA model could potentially do better with its predictions at the middle/upper end of the VS scale (VS 2 to 4). Work is underway to consider how such improvements could be achieved.
It should be taken into account that the VS classes are assigned by human visual observation (a clinician's considered view). The boundaries between the classes are therefore, to an extent, subjective. As is clear from Figs. 2 and 3, the severity of lung abnormalities likely falls on a continuous spectrum from none to extreme, and any class boundaries placed visually in that spectrum by human interpretation will to an extent be arbitrary. In that context, only 24 VS misclassifications out of 456 images (almost 95% correct) by the better ML models is surprisingly good. Considering the broad nature of the VS class definitions applied in this study, it is unlikely that any expert system would be able to get close to zero VS prediction errors.
Another observation worthy of discussion is that the better-performing ML models confuse very few of the images between classes VS = 0 (COVID-19-negative) and VS = 1 (COVID-19-positive, but with no visible signs on the CT images). By definition, the clinician is unable to distinguish classes 0 and 1 by visual inspection. Yet, with > 95% accuracy (Fig. 4B), the ADA 9-variable model is able to distinguish these two classes, with only three misclassifications between the groups from the 114 images analysed. Clearly, the grayscale statistics are able to capture distinct attributes of these classes that are not visible even to the trained observer. This is considered further justification for applying and further evaluating grayscale statistical analysis of CT scan images.

Conclusions
Consideration of the grayscale image attributes of the pulmonary parenchyma portion of CT-scan slices reveals valuable information with which to categorize the severity of lung abnormalities associated with COVID-19. Such information usefully complements visual analysis by medical experts and supports a classification system (0 = COVID negative; 1 = COVID positive with no lung abnormalities; 2 to 4 = varying grades of severity of lung abnormalities) that can be discerned accurately with the aid of machine learning. The analysis compiles 456 extract images from CT scans of 49 patients with COVID-19 (confirmed by rRT-PCR testing) and 4 individuals testing negative for the disease. Thirteen grayscale statistical features were measured for each extract image. Several of these features follow systematic but complex continuous trends that can be graphically related to the 0 to 4 visual assessment scale of lung abnormalities. Ten machine learning methods were evaluated to establish their ability to identify the visual classification of lung abnormalities by applying supervised learning to selected grayscale features. The best of these machine learning models achieved almost 95% prediction accuracy using nine to eleven grayscale variables. The adaboost, decision tree and random forest models averaged 23 to 25 prediction errors on the 456-image dataset with respect to the lung abnormality severity classes. Confusion matrices reveal that the best-performing models make very few errors in distinguishing between the visual classes. Of particular interest is their ability to distinguish between visual class zero (COVID-19-free) and class one (COVID-19-positive, with no visually discernible abnormalities), a distinction that is not possible by visual inspection alone.

Declarations
The authors declare that they have no conflicts of interest and have received no external funding associated with this research.

Conflict of Interest Statement
The authors declare that they have no conflict of interest regarding the information presented in this study.

Ethical Guidelines and Consent Statement
All analysis of CT scans of human patients used in this study was performed in accordance with relevant guidelines and regulations of the Namazi Hospital (Shiraz, Iran). In particular, the anonymity of individuals and confidentiality of biographical information have been maintained at all times. Furthermore, informed consent for CT scan analysis was obtained from all CT scan subjects from which images were derived for this study. All analytical protocols were approved by the Namazi Hospital.

Figure 2. Triangular display of key grayscale statistics for distinguishing severity of lung abnormalities. Grayscale average and variance are low in normal lungs or those with minor abnormalities present, but increase as lung abnormalities increase. For the grayscale image extracts with the most severe lung abnormalities, average increases and variance decreases as the lighter grey shades become more dominant.

Figure 3. 3D graphic display of key grayscale statistics for distinguishing severity of lung abnormalities. Grayscale P90, variance and percentage of pixels at the grayscale average, considered together, are very effective at identifying the degree of severity of lung abnormalities from grayscale extract images of lung CT scans of healthy and COVID-19-positive patients.