End-to-end COVID-19 screening with 3D deep learning on chest computed tomography

The outbreak of an acute respiratory syndrome (called novel coronavirus pneumonia, NCP) caused by the SARS-CoV-2 virus has now progressed to a pandemic and become a major threat to public health worldwide[i],[ii]. COVID-19 screening using computed tomography (CT) can provide a quick diagnosis and identify high-risk NCP patients[iii]. Automated screening using CT volumes is a challenging task owing to inter-grader variability and high false-positive and false-negative rates. We propose a three-dimensional (3D) deep convolutional neural network (CNN) that predicts the risk of COVID-19 from a patient's CT volume, trained end-to-end directly on CT volumes, using only images and disease labels as inputs. Our model achieves state-of-the-art performance (95.78% overall accuracy, 99.4% area under the curve) on a dataset of 1,684 COVID-19 patients, nearly twice as large as previous datasets3, and performs similarly on an independent clinical validation set of 121 cases. We tested its performance against six radiologists on clinically confirmed patients' CT volumes; our model outperformed all six radiologists, with absolute reductions of 7% in false positives and 35.9% in false negatives, demonstrating that artificial intelligence (AI) can optimize the COVID-19 screening process via computer assistance and automation with a level of competence comparable to radiologists. While the vast majority of patients remain unscreened, we show the potential for AI to increase the accuracy and consistency of COVID-19 screening with CT.


Introduction
The World Health Organization (WHO) officially declared the outbreak of a novel coronavirus, SARS-CoV-2, a global pandemic. The disease it causes, termed COVID-19, can produce fever, cough, and other flu-like symptoms, and mortality exceeds 60% among patients who progress rapidly to the severe acute respiratory failure stage. COVID-19 diagnosis is confirmed by real-time polymerase chain reaction (PCR) testing, but conservative estimates of the detection rate are low, and several negative tests may be required in a single case before the disease can be confidently excluded. Chest CT is an important tool for the diagnosis of lung diseases and is capable of COVID-19 screening. The CT scanning procedure has a faster turnaround time than a viral PCR test in the screening of suspected cases. The majority of COVID-19 cases show similar morphological features and a peripheral lung distribution on CT images, including ground-glass opacities (GGO) in the early stage and pulmonary consolidation in the late stage5.
The CT appearances of various viral pneumonias are similar and overlap with those of other infectious and inflammatory lung diseases. It is therefore difficult for radiologists to distinguish NCP from other common pneumonia (OCP), such as other viral pneumonias, bacterial pneumonia, and mycoplasma pneumonia.
Deep learning algorithms offer an exciting potential to automate the analysis of complex CT images, and have recently shown promise in assisting radiologists to improve diagnostic efficiency and accuracy in COVID-19 screening. Previous work in computer-aided COVID-19 screening has lacked the generalization capability of medical practitioners owing to insufficient data and a focus on hand-engineered features such as GGO, consolidation, and fibrosis3.
Methods
AI system framework. In this paper, we aimed to build an end-to-end approach that extracts the fine-grained information contained within the full 3D structure of CT images, because deep CNNs have been shown to be superior to hand-engineered features in many competitions. We used a 3D CNN to perform COVID-19 risk categorization with the CT images as the only input, and compared it to radiologists strictly on the basis of image classification. During inference, the model outputs a probability distribution over three classes: NCP, OCP, and normal controls (NCs). Fig. 1 shows the AI system. We utilize an inflated Inception V1 CNN architecture that was pretrained on approximately 1.28 million images (1,000 object categories) from the ImageNet dataset, and train it on our dataset using transfer learning.
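As a minimal illustration of the inference step described above (not the authors' code, and with purely illustrative logit values), a softmax maps the network's three raw outputs to a probability distribution over NCP, OCP, and NCs, and the predicted label is the most probable class:

```python
import math

CLASSES = ["NCP", "OCP", "NC"]

def softmax(logits):
    """Map raw network outputs (logits) to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the predicted class and the per-class probabilities."""
    probs = softmax(logits)
    label = CLASSES[probs.index(max(probs))]
    return label, dict(zip(CLASSES, probs))

# Hypothetical logits for one CT volume (illustrative values only)
label, probs = predict([2.1, -0.3, 0.4])
```

In the real system these logits would come from the final layer of the inflated Inception V1 network applied to the whole CT volume.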
Training algorithm. We used the 3D inflated Inception V1 architecture pretrained on the ImageNet dataset. We removed the final classification layer from the network and retrained it with our dataset, fine-tuning the parameters across all layers to predict the probabilities of NCP, OCP, and NCs. We then used the last layer before the final probability, which contains 1,024 units, and took these 1,024 numbers as the output features of the model. The CNN was trained using backpropagation. All layers of the network were fine-tuned using the same global learning rate of 0.0001 and a decay factor of 0.
Datasets (Table 1). Scans were randomly assigned to a training set and an internal validation set using the 10-fold cross-validation method. An NCP label was assigned when a patient had pneumonia with a positive viral PCR test. An OCP label was assigned when a patient had other viral pneumonia (including influenza, parainfluenza, adenovirus, and Epstein-Barr virus pneumonia), bacterial pneumonia, or mycoplasma pneumonia. Normal controls were drawn from the public LUNA dataset, whose patients were diagnosed without pneumonia.
k-fold cross validation. We employed 10-fold stratified cross-validation to estimate the uncertainty of our AI model. The total dataset (4,221 scans) was randomly shuffled and partitioned into ten equal-sized sub-datasets. Of the ten sub-datasets, a single sub-dataset (422 scans) was retained as validation data for testing the model, and the remaining nine sub-datasets (3,799 scans) were used as training data. This process was repeated ten times, with each of the ten sub-datasets used exactly once as the validation data. The models were trained and tested ten times to obtain the average prediction accuracy and its standard deviation.
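The cross-validation procedure above can be sketched in a few lines of pure Python. This is a simplified, unstratified version (the paper uses stratified folds); `evaluate` stands in for one full train/validate round of the 3D CNN:

```python
import random
import statistics

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate, k=10):
    """Run k train/validate rounds, each fold serving once as validation.

    `evaluate(train_idx, val_idx)` should train a model and return its
    validation accuracy; returns the mean accuracy and its standard deviation.
    """
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(evaluate(train_idx, val_idx))
    return statistics.mean(scores), statistics.stdev(scores)
```

With 4,221 scans and k = 10 this yields folds of roughly 422 scans each, matching the split described above.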

Results
We validated the performance of the AI model in three ways. First, we validated its effectiveness using 10-fold cross-validation with a three-class disease partition representing NCP, OCP, and NCs. In this task, the model achieved 95.78 ± 0.87% (mean ± s.d.) overall accuracy (the macro-average of the individual per-class accuracies) and a macro-average area under the ROC curve (AUC) of 0.994 on the internal validation dataset (Fig. 3).
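The macro-averaged accuracy used above is the unweighted mean of the per-class accuracies, so each class counts equally regardless of its size. A minimal sketch with toy labels (illustrative only, not taken from the dataset):

```python
def per_class_accuracy(y_true, y_pred, classes):
    """Fraction of each class's cases predicted correctly (per-class recall)."""
    acc = {}
    for c in classes:
        indices = [i for i, t in enumerate(y_true) if t == c]
        acc[c] = sum(1 for i in indices if y_pred[i] == c) / len(indices)
    return acc

def macro_average_accuracy(y_true, y_pred, classes):
    """Unweighted mean of the per-class accuracies."""
    acc = per_class_accuracy(y_true, y_pred, classes)
    return sum(acc.values()) / len(classes)

# Toy example: NCP 4/4 correct, OCP 3/4 correct, NC 1/2 correct
y_true = ["NCP"] * 4 + ["OCP"] * 4 + ["NC"] * 2
y_pred = ["NCP"] * 4 + ["OCP"] * 3 + ["NCP", "NC", "OCP"]
macro = macro_average_accuracy(y_true, y_pred, ["NCP", "OCP", "NC"])
```

Here the macro-average is (1.0 + 0.75 + 0.5)/3 = 0.75, whereas the plain case-level accuracy would be 8/10 = 0.8.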
Second, we validated the performance of our model on an independent dataset (Fig. 3). The independent test dataset consisted of 39,369 CT slice images from 121 patients, including 52 NCP patients, 49 other-common-pneumonia patients, and 20 normal controls. Our AI system achieved an overall accuracy of 89.6% on this dataset for three-way classification, and 93.3% for NCP versus the other two groups. The AI model thus generalizes to COVID-19 screening when tested on an independent dataset.
Third, we compared the performance of our model with six practicing radiologists on the same independent test dataset (Fig. 4). We employed six radiologists, all with more than 5 years of clinical experience. For each previously unseen case, RT-PCR-proven images were displayed, and the radiologists were asked whether they thought the case was NCP, other common pneumonia, or normal control. A radiologist outputs a single prediction per case. The green point in Fig. 4(a) is the average of the radiologists (the average sensitivity and specificity of all solid points in Fig. 4(b)), with error bars denoting one standard deviation. Fig. 4(c) and Fig. 4(d) are the confusion matrices. The 3D CNN outperforms any radiologist whose sensitivity-specificity point falls below the blue curve of the model. Our AI system achieved overall accuracy = 93.30%, sensitivity = 98.08%, specificity = 91.30%, and AUC = 0.994 for the diagnosis of NCP versus the other classes, outperforming all six radiologists with absolute reductions of 7% in false positives and 35.9% in false negatives (example model false positives in Fig. 5) and demonstrating that an end-to-end 3D CNN can optimize the COVID-19 screening process.
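The sensitivity and specificity quoted above treat NCP versus everything else as a binary task. A minimal sketch of these one-vs-rest metrics (toy labels for illustration only):

```python
def binary_metrics(y_true, y_pred, positive="NCP"):
    """Sensitivity, specificity, and error counts for a one-vs-rest split."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return {
        "sensitivity": tp / (tp + fn),   # fraction of NCP cases caught
        "specificity": tn / (tn + fp),   # fraction of non-NCP cases cleared
        "false_positives": fp,
        "false_negatives": fn,
    }

m = binary_metrics(["NCP", "NCP", "OCP", "NC"], ["NCP", "OCP", "NCP", "NC"])
```

A false negative (an NCP case read as OCP or NC) lowers sensitivity, while a false positive lowers specificity, which is why the reported reductions in both error types move the model above the radiologists' points on the ROC plane.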
We also examined the features in the last hidden layer learned by our AI model with the t-SNE (t-distributed Stochastic Neighbour Embedding) method (see Fig. 6). We projected the 1,024-dimensional features of each chest CT volume into two dimensions and represented each volume as a single solid point. The red points represent the NCP class and cluster on the lower left-hand side. In contrast, the blue points represent the OCP class and cluster on the lower right-hand side. Similarly, the green points represent the normal controls and cluster at the top.

Discussion
We demonstrated the effectiveness of deep learning in diagnosing pneumonia, a technique we applied to the whole 3D chest volume. Using the trained 3D CNN model, we compared its performance with six radiologists across three critical diagnostic classes: NCP, OCP, and NCs. Our studies demonstrated the potential of the 3D CNN for diagnosing NCP, which was found to improve clinical diagnostic efficiency and accuracy significantly in COVID-19 screening. Scalable application holds the potential for substantial clinical impact, including assisting radiologists and physicians in rapid diagnosis and clinical decision-making, especially during this pandemic.
Further research is necessary to validate the performance of the AI model in real clinical settings and to develop a freely accessible diagnostic system. While we acknowledge that radiologists make their diagnoses based on factors such as clinical context rather than visual inspection of lesions in isolation, the ability of our model to classify NCP cases accurately has the potential to assist radiologists and clinicians coping with the COVID-19 epidemic.

Figure 1
Overall modeling framework. Data flow is from left to right. For each patient, the AI model takes the full CT volume and its class label as inputs, maps the whole CT volume to a probability distribution over three classes (NCP, OCP, and NCs) using the I3D architecture pretrained on the ImageNet dataset and fine-tuned on our own datasets, and outputs the overall probabilities for the case. The predicted label is the most probable inference class.

Figure 2
Diagram describing exclusions made in our analysis. The AUC on the full validation dataset for the NCP, OCP, and NCs classes was 0.991, 0.991, and 0.999, respectively. We observed negligible AUC changes (<0.004) in 10-fold cross-validation, validating the reliability of our results on a larger dataset. b, Normalized confusion matrix of the AI model for identifying NCP cases among other common pneumonia (OCP) and normal controls (NCs). Our AI system achieved an overall accuracy of 95.78 ± 0.87% in 10-fold cross-validation for three-way classification, and an overall accuracy of 96.32 ± 1.28% for NCP versus all other groups. The AI model exhibits reliable COVID-19 screening when tested on our dataset.

Figure 4
Comparing the performance of our AI model with six radiologists on an independent test dataset. a, Performance of the model (blue line) versus the average radiologist for the three classes using a single CT volume. The lengths of the crosses represent one standard deviation. b, The previously highlighted misty-rose area is magnified to show the performance of each of the six radiologists on the various classes. Each solid point on the plot represents the sensitivity and specificity of a single radiologist. c, Confusion matrix of the mean diagnostic performance of the six radiologists. d, Confusion matrix of the AI system, whose performance is comparable to that of senior practicing radiologists.

Figure 6
t-SNE visualization of the last hidden layer representations in the CNN for the three disease classes.