Efficient Deep Learning Model for COVID-19 Detection in Large CT Image Datasets: A Cross-Dataset Analysis

Early detection and diagnosis are critical to controlling the spread of COVID-19. A number of deep learning-based methodologies have recently been proposed for COVID-19 screening in CT scans as a tool to automate and assist the diagnosis. To achieve these goals, in this work, we propose a slice voting-based approach that extends the EfficientNet family of deep artificial neural networks. We also design a specific data augmentation process and transfer learning strategy for this task. Moreover, a cross-dataset study is performed on the two largest datasets to date. The proposed method presents results comparable to the state-of-the-art methods and the highest accuracy to date on both datasets (87.60% for the COVID-CT dataset and 98.99% for the SARS-CoV-2 CT-scan dataset). The cross-dataset analysis showed that the generalization power of deep learning models is far from acceptable for the task, since accuracy drops from 87.68% to 56.16% in the best evaluation scenario. These results highlight that methods aiming at COVID-19 detection in CT images must improve significantly to be considered a clinical option, and that larger and more diverse datasets are needed to evaluate such methods in a realistic scenario.


Introduction
In March 2020, the World Health Organization (WHO) officially declared the outbreak of COVID-19, the disease caused by SARS-CoV-2, a pandemic. COVID-19 is highly infectious and can potentially evolve to fatal acute respiratory distress syndrome (ARDS). Early detection and diagnosis are critical factors to control the spread of COVID-19. The most common screening method to detect it is reverse-transcription polymerase chain reaction (RT-PCR) testing. However, it is a laborious method, and some studies have reported its low sensitivity in early stages [1].
Chest scans such as X-rays and computed tomography (CT) scans have been used to identify morphological patterns of lung lesions linked to COVID-19. However, the accuracy of COVID-19 diagnosis from chest scans strongly depends on experts [2], and deep learning techniques have been studied as a tool to automate and assist the diagnosis [3,4,5,6,7,8].
A computed tomography scan, or CT scan, produces detailed images of organs, bones, soft tissues, and blood vessels. CT images allow physicians to identify internal structures and see their shape, size, density, and texture. Unlike conventional X-rays, CT scans produce a set of slices of a given region of the body without overlaying the different body structures. Thus, CT scans give a much more detailed picture of the patient's condition than conventional X-rays. This detailed information can be used to determine whether there is a medical problem as well as the extent and exact location of the problem. For these reasons, a number of deep learning-based methodologies have recently been proposed for COVID-19 screening in CT scans [9,10,11,12,13,14].
The main bottleneck for the realization of studies such as the ones cited above is the lack of good-quality, comprehensive datasets. Possibly the first attempt to create such a dataset was the so-called COVID-CT dataset [15], which consists of images mined from research papers. Different versions of this dataset were used in [9,10,11,12]. For its most updated version, the highest reported accuracy, F1-score, and AUC were 86%, 85%, and 94% [9], respectively. More recently, Soares et al. [14] made another set of CT scans publicly available. It consists of 2482 CT scans taken from hospitals in the city of São Paulo, Brazil. They reported an accuracy, sensitivity, and positive predictive value of 97.38%, 95.53%, and 99.16%, respectively.
These two datasets are, to date, the largest publicly available. The difference between the best results obtained on each of them is significant, which raises two questions: (i) Are the discrepancies in the results due to the differences between the datasets? (ii) Does a model trained on one dataset perform well when tested on the other?
Another drawback of the best-performing techniques is their immense number of parameters, which directly influences their footprint and latency. Improving these two metrics allows the model to be more easily embedded in mobile applications and to be less of a burden on the server if provided as a web service receiving an enormous number of requests per second. In addition, having a more compact baseline model allows the exploitation of higher-resolution inputs without making the computational cost prohibitively high. Broadly speaking, the computational cost is an important factor in the accessibility and availability of the technology to the public.
Thus, the main goals of this work are: (i) to propose a high-quality yet compact deep-learning model for the screening of COVID-19 in CT scans and (ii) to address, for the first time, the aforementioned questions regarding the two largest datasets.
To produce an efficient model, we exploit and extend the EfficientNet family of deep artificial neural networks along with a data augmentation process and transfer learning. Following previous evaluation protocols [9,14], state-of-the-art results are presented for the COVID-CT dataset (accuracy of 87.60%) and the SARS-CoV-2 CT-scan dataset (accuracy of 98.99%).
A vote-based evaluation approach is also studied as well as a cross-dataset analysis in order to address questions related to the datasets.
The remainder of this work is organized as follows. Section 2 presents the details of the COVID-CT [15] and SARS-CoV-2 CT-scan [14] datasets. The methodology is described in Section 3, and the experiments along with the results in Section 4. Finally, Section 5 presents the conclusions of this work.

Datasets
This section describes the two datasets considered in this work. To the best of our knowledge, these are the two largest public datasets to date.
SARS-CoV-2 CT-scan Dataset
In the SARS-CoV-2 CT-scan dataset [14], the images consist of digital scans of the printed CT exams and have no standard regarding image size (the dimensions of the smallest image in the dataset are 104 × 153, while the largest are 484 × 416); Figure 1 shows some examples.
This dataset also lacks standardization regarding the contrast of the images, as can be seen in Figure 2. For method evaluation, we follow the protocol presented in [14].

COVID-CT Dataset
To assemble the COVID-CT dataset [15], CT images of patients infected with COVID-19 were collected from scientific articles (pre-prints) deposited in the medRxiv and bioRxiv repositories from January 19 to March 25; some images were also donated by hospitals (http://medicalsegmentation.com/covid19/). The PyMuPDF software was used to extract images from the manuscripts in order to maintain high quality. Metadata were manually extracted and associated with each image: patient age, gender, location, medical history, scan time, severity of COVID-19, and medical report. A total of 349 images were collected from 216 patients.
Regarding healthy and non-COVID patients, the authors collected images from two other datasets (the MedPix and LUNA datasets), from the Radiopaedia website, and from other articles and texts available at PubMed Central (PMC). A total of 463 images were collected from 55 patients.
Analogous to the previous dataset, the COVID-CT dataset has no defined standard for image size and contrast. Figure 3 shows some examples. It is also important to highlight that some images contain textual information, which may interfere with model prediction (see Figure 4).
A protocol is proposed for the creation of the training, validation, and test sets. The COVID-19 images that were donated by hospitals or extracted directly from medical equipment (LUNA and Radiopaedia) were selected to compose the validation and test sets. The remaining images, extracted from scientific articles and manuscripts, were reserved for the training set. The dataset is available at https://github.com/UCSD-AI4H/COVID-CT. Table 1 summarizes the datasets presented in this section. The table shows the issues identified in the datasets and the relation between the number of patients and the number of images of each class (COVID and non-COVID).

As a pre-processing step, the images are resized to the network input, and we study the impact of varying the input resolution on the quality of the model. In this way, this pre-processing step becomes another parameter of the network.

EfficientCovidNet
The EfficientNets are a family of artificial neural networks whose basic building block is the Mobile Inverted Bottleneck Convolution block (MBConv) [16], as depicted in Figure 5. Table 2 presents a typical EfficientNet architecture, particularly the B0 model.
Figure 5: MBConv Block [16]. DWConv stands for depthwise convolution, k3x3/k5x5 defines the kernel size, BN is batch normalization, HxWxF represents the tensor shape (height, width, depth), and x1/2/3/4 is the multiplier for the number of repeated layers.
The main idea behind the EfficientNet architecture is to start from a high-quality yet compact baseline model, presented in Table 2, and progressively scale each of its dimensions in a systematic manner with a fixed set of scaling coefficients. According to [17], the compound scaling of Eq. 1 provides a good trade-off between computational cost and performance:

depth: d = α^φ, width: w = β^φ, resolution: r = γ^φ, subject to α · β² · γ² ≈ 2, with α, β, γ ≥ 1,   (1)

where φ is a user-specified coefficient controlling the resources available for scaling, and α, β, and γ distribute those resources among network depth, width, and input resolution.
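The compound-scaling rule above can be sketched in a few lines of Python. The α, β, γ values below are the base coefficients reported for EfficientNet in [17]; the helper name and toy base dimensions are ours:

```python
import math

# Compound-scaling sketch after Tan & Le [17]. alpha scales depth,
# beta scales width, gamma scales input resolution; phi is the
# user-chosen resource budget.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def scale(phi, base_depth, base_width, base_resolution):
    """Scale the three network dimensions with a single coefficient phi."""
    depth = base_depth * (ALPHA ** phi)            # more layers
    width = base_width * (BETA ** phi)             # more channels
    resolution = base_resolution * (GAMMA ** phi)  # larger inputs
    return math.ceil(depth), math.ceil(width), round(resolution)

# Constraint from Eq. 1: alpha * beta^2 * gamma^2 ~ 2, so the FLOPs
# roughly double for each unit increase of phi.
assert abs(ALPHA * BETA**2 * GAMMA**2 - 2.0) < 0.1
```

With phi = 0 the baseline dimensions are returned unchanged; each increment of phi grows all three dimensions jointly instead of tuning them independently.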
In [18], four new blocks are added to the baseline model to improve COVID-19 recognition in X-ray images. Here, we propose modifications aimed at CT images, and six new blocks are added to an EfficientNet B0 architecture. These blocks were obtained by a grid search and can be seen in Table 3. Two searches were carried out: one aiming at a shallower architecture and the other at a deeper one. On top of the model, a new fully connected (FC) layer is added to adapt the classification task to the new domain. Along with those blocks, batch normalization (BN), dropout, and swish activation functions are also employed.
The batch normalization operation constrains the output of a layer to a specific range by normalizing it to zero mean and unit standard deviation, followed by a learnable scale and shift. This works as a regularizer, increasing the stability of the neural network and accelerating training [19].
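As a concrete illustration, the normalize-then-scale-and-shift computation can be sketched in NumPy (the function name and toy batch are ours; gamma and beta stand for the learnable scale and shift):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature of a mini-batch to zero mean and unit
    standard deviation, then apply the learnable scale (gamma) and
    shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A toy mini-batch of 3 samples with 2 features each.
batch = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
out = batch_norm(batch)
# After normalization, each feature column has (approximately)
# zero mean and unit standard deviation.
```

In a real network, gamma and beta are per-feature parameters learned by backpropagation, and running statistics replace the batch statistics at inference time.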
The dropout [20] operation also acts as a regularizer by inhibiting a fraction of the neurons for each mini-batch during training, thus emulating a bagged ensemble of multiple neural networks. The dropout parameter defines the fraction of inhibited neurons (0 to 100 percent of the neurons of a layer).
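A minimal NumPy sketch of (inverted) dropout, with illustrative names and rate, shows the mechanism:

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero out roughly `rate` of the activations at
    random and rescale the survivors so the expected value is unchanged."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate   # Bernoulli keep-mask
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
acts = np.ones((4, 8))                   # toy layer activations
dropped = dropout(acts, rate=0.5, rng=rng)
# Roughly half of the entries become zero; survivors are rescaled to 2.0.
```

At inference (`training=False`) the layer is the identity, which is why the rescaling by 1/(1 − rate) is done during training.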
Although the Rectified Linear Unit (ReLU) is considered the most popular activation function, here we explore the swish activation function [21]. ReLU can be formally defined as f(x) = max(0, x), while the swish function is defined by f(x) = x · σ(x) = x / (1 + e^(−x)), where σ is the sigmoid function. The swish activation produces a smooth curve during the loss minimization process, whereas ReLU produces an abrupt change. Also, the swish function does not zero out small negative values. We believe these factors may be relevant for capturing patterns underlying the data [21].
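The difference between the two activations can be seen in a short NumPy sketch (function names are ours):

```python
import numpy as np

def relu(x):
    """ReLU: hard threshold at zero."""
    return np.maximum(0.0, x)

def swish(x):
    """Swish [21]: x * sigmoid(x). Smooth and non-monotonic; small
    negative inputs produce small negative outputs instead of zero."""
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# ReLU discards all negative values...
assert np.all(relu(x[:2]) == 0.0)
# ...while swish keeps small negative values alive.
assert np.all(swish(x[:2]) < 0.0)
```

In Keras, the same function is available as the built-in `"swish"` activation string.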

Training
Due to their complexity, deep learning models require a large number of instances to avoid overfitting. However, for the majority of real-life problems, data is not abundant; few are the situations, such as the ImageNet dataset [22], where data abounds. To overcome this issue, one can rely on two techniques: data augmentation and transfer learning. In this work, we use both, as described below.

Data augmentation
Data augmentation consists of increasing the number of training samples by transforming the images without losing semantic information. In this work, we applied three transformations to the training samples: rotation, horizontal flip, and scaling. Figure 7 presents an example of the applied data augmentation. Such transformations preserve the image content and would not prevent a physician from interpreting the images.
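Two of these transformations, horizontal flip and scaling (zoom), can be sketched in plain NumPy; the function names are ours, and in practice rotation and proper interpolation would be delegated to a library routine (e.g., in TensorFlow/Keras):

```python
import numpy as np

def horizontal_flip(img):
    """Mirror a 2-D grayscale image left-to-right."""
    return img[:, ::-1]

def central_zoom(img, factor):
    """Zoom in by cropping the central `factor` fraction of the image
    and resizing back to the original shape with nearest-neighbour
    sampling (a crude stand-in for library interpolation)."""
    h, w = img.shape
    ch, cw = int(h * factor), int(w * factor)
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[rows][:, cols]

img = np.arange(16.0).reshape(4, 4)      # toy 4x4 "CT slice"
flipped = horizontal_flip(img)
zoomed = central_zoom(img, 0.5)          # output keeps the 4x4 shape
```

Both transforms preserve the image shape, so augmented samples can be fed to the same network input without further changes.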

Transfer learning
Starting from a pre-trained neural network and re-training it to fit other datasets or domains is called transfer learning [23]. Fine-tuning a pre-trained network enables the use of deep architectures when there is little training data, since the network has already learned filters in other domains/problems that can be reused [24]. In the present work, we have few images to carry out the training, especially of the COVID-19 class. Thus, transfer learning becomes imperative.
Our models inherit several layers from EfficientNet (see Table 2), and the new layers are randomly initialized with zero mean. EfficientNets were originally trained on the ImageNet dataset [22]. Thus, we follow the usual steps to transfer learning from one domain to another, which include defining which layers will pass through the learning process and which will be frozen, and then performing the learning process by updating the weights according to the loss function and optimizer. Here, the weights are updated with the Adam optimizer with a maximum learning rate of 10^-4. We schedule the learning rate to decrease by a factor of 10 in the event of stagnation. The number of epochs is fixed at 10.
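These steps can be sketched with the Keras API used in this work. The head below (dropout rate, layer order) is illustrative and does not reproduce the paper's exact grid-searched blocks; `weights=None` is used only to keep the example self-contained, whereas in practice `weights="imagenet"` would be loaded for transfer learning:

```python
import tensorflow as tf

# Sketch of the transfer-learning setup (assumed head; not the paper's
# exact architecture). In practice weights="imagenet" would be used.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights=None,
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the transferred layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),                    # illustrative rate
    tf.keras.layers.Dense(2, activation="softmax"),  # new FC head
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

# Decrease the learning rate by a factor of 10 on stagnation, as in the text.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=2)
# model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=[reduce_lr])
```

After initial convergence of the new head, some of the frozen blocks can optionally be unfrozen for a second, lower-learning-rate fine-tuning pass.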

Experiments And Discussion
Experiments were carried out on an Intel(R) Core(TM) i7-5820K CPU at 3.30GHz with 64GB of RAM, one Titan X Pascal with 12GB, and the TensorFlow/Keras framework for Python. The source code and pre-trained models are available at https://github.com/ufopcsilab/EfficientCovidNet. In the following subsections, we present the three experimental setups explored in this work.
In the first setup (Section 4.1), we investigate the discrepancy in the results reported by the methods considered state-of-the-art for the two studied datasets. The best approach for the COVID-CT dataset reports 86.0% accuracy [9]. For the SARS-CoV-2 CT-scan dataset, the state-of-the-art method achieves 97.38% accuracy [14]. However, the SARS-CoV-2 CT-scan dataset has significantly more images than the COVID-CT dataset and the same number of patients (individuals). To assess whether this difference is due to the evaluation protocol, we perform two experiments: we investigate the impact of selecting samples/images for the training and test sets at random and, in a second step, we evaluate the impact of performing the selection guided by individuals, that is, ensuring that there are no samples from the same individual simultaneously in the training and test sets.
In the second setup (Section 4.2), we investigate a very important aspect: the generalization power of a model. A model is only useful if it can also generalize to data from other distributions or other datasets. In this regard, we evaluate how the model, trained with the SARS-CoV-2 CT-scan dataset, behaves when faced with images from another dataset, the COVID-CT dataset. We follow the data split protocol proposed in [15].
Finally, in the third setup, we explore our EfficientCovidNet model only with the COVID-CT dataset, considering the protocol proposed in [15]. This setup aims to broaden the comparison of the proposed approach with the literature, since this dataset is the most popular to date. Here we also explore the impact of varying the size of the input images.

Setup 1 : 5-fold evaluation on a Large Dataset
To evaluate the performance of the proposed approach, we tested the protocol proposed by Soares et al. [14] and three different scenarios using 5-fold cross-validation: (i) "Random", (ii) "Slice", and (iii) "Voting". The "Random" evaluation divides the data into training and test sets at random. The "Slice" evaluation considers all the CT images independent of each other but respects the patient division, that is, we prevent samples from one individual from appearing simultaneously in the training and test sets; in this manner, the model is always evaluated on samples from unknown individuals. Finally, the "Voting" evaluation considers all images of an individual and uses a voting scheme to reach a diagnosis per individual instead of per instance or image. Considering that several CT images are acquired in a single exam for a single individual, we believe the disease patterns will not be present in all instances. Thus, an evaluation using a voting scheme, considering all available instances of one individual, could increase the chances of success.
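Assuming a simple majority rule (the text does not fix the exact voting rule), the per-patient aggregation of the "Voting" evaluation can be sketched as:

```python
from collections import Counter

def patient_diagnosis(slice_predictions):
    """Majority vote over the per-slice predictions of one patient.
    `slice_predictions` is a list of 'COVID' / 'Non-COVID' labels
    produced by the per-image classifier."""
    votes = Counter(slice_predictions)
    return votes.most_common(1)[0][0]

# A patient whose exam has 5 slices, 3 of them flagged as COVID:
label = patient_diagnosis(['COVID', 'Non-COVID', 'COVID',
                           'Non-COVID', 'COVID'])
# The patient-level diagnosis is 'COVID'.
```

Weighted variants (e.g., averaging per-slice softmax scores instead of hard labels) fit the same interface and are a natural refinement.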

Results
Following the protocol proposed in [14], the approach proposed in this work improved all metrics, as shown in Table 5. Despite the outstanding results presented in Table 5, we believe such results are overestimated. For this reason, we introduce a 5-fold evaluation and some changes to the original protocol, as described previously, with the results presented in Table 6. The "Random" evaluation presents better results than the two other approaches ("Slice" and "Voting"). One reason is the presence of data from the same patient/individual in both the training and test sets, which leads to an overestimated result. Our hypothesis is that the approach tends to learn patterns related to the individuals instead of COVID-19 patterns.
In the "Slice" evaluation, the samples are classified as isolated instances, as in the "Random" one, but ensuring that all samples of an individual are present exclusively in one data partition: either the training or the test set. A performance drop is observed, which clearly shows an overestimation in the "Random" evaluation.
In contrast to the "Slice" evaluation, the "Voting" one considers all images of an individual to decide whether the individual is infected or not. It is worth emphasizing that the same model is used in both approaches, that is, the model trained per image (a single "slice" of the lung).
Due to the nature of CT scans, we believe the disease patterns will not manifest in all slices (instances/images) of an individual's CT exam, and the results of the "Slice" and "Voting" evaluations reflect that. We believe this can generate false positives and therefore impact the figures of both approaches (see Table 6). Moreover, this problem can be seen as a multiple instance learning (MIL) problem [25], and a MIL-based approach can be a promising path for future work.
Comparing the results of Tables 5 and 6, we believe the presence of samples from the same individual in the training and test sets tends to lead to an overestimation of an approach. To circumvent this issue, it is necessary to ensure a patient-wise division of the dataset and the use of a cross-dataset approach.
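A patient-wise division of this kind can be obtained with scikit-learn's `GroupKFold`, where the group labels carry a patient id per image so that no individual appears in both partitions; the toy arrays below are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 20 CT slices from 5 patients, 4 slices each.
images = np.arange(20)                  # stand-ins for CT slices
labels = np.array([0, 1] * 10)          # COVID / Non-COVID labels
patients = np.repeat(np.arange(5), 4)   # patient id per slice

# GroupKFold guarantees that each patient's slices land entirely in
# either the training fold or the test fold, never both.
for train_idx, test_idx in GroupKFold(n_splits=5).split(images, labels, patients):
    assert set(patients[train_idx]).isdisjoint(set(patients[test_idx]))
```

The same `groups` argument works with `GroupShuffleSplit` for a single patient-wise train/test partition instead of folds.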

Setup 2 : Cross-dataset evaluation
For this experiment, we investigate the impact of training a model on one data distribution and evaluating it on another. This scenario is closer to reality, since it is almost impossible to train a model with images acquired from all available sensors, environments, and individuals. In this setup, the SARS-CoV-2 CT-scan dataset [14] is used only for training, and no image of this dataset is present in the test set.
For the test set, we use the COVID-CT dataset presented in [15], since it is used by several authors in the literature. We follow the protocol proposed in [15] to split the COVID-CT dataset into training and test sets; however, we highlight that only images from the SARS-CoV-2 CT-scan dataset are used to train the model. We also evaluated other test configurations, such as using the COVID-CT training partition as a test set and combining both partitions of the COVID-CT dataset into a larger test set (see Table 7). We also test the opposite scenario, in which we use all images from the COVID-CT dataset [15] for training and all images from the SARS-CoV-2 CT-scan dataset [14] for testing.

Results
As one can see, the model performance is drastically reduced when we compare the cross-dataset evaluation against an intra-dataset one. We believe the reason for this behavior is data acquisition diversity. Images from different datasets can be acquired with different equipment and different image sensors, which changes relevant features of the images and impairs recognition. The model may learn to identify portions and patterns of an image that indicate the presence (or absence) of COVID-19, yet those patterns may not appear in a different dataset.
Training on COVID-CT [15] and testing on the SARS-CoV-2 CT-scan dataset [14] presents even worse results, since the COVID-CT training set is smaller.
We believe such a test should be mandatory for all methods aiming at COVID-19 recognition in CT images, since it is the one that most resembles a real scenario.

Setup 3 : Impact of input resolution
In this setup, we evaluate the protocol presented in [15] using only the COVID-CT dataset. Zhao et al. [15] propose dividing the COVID-CT dataset into three sets: training, validation, and testing. We also applied data augmentation by rotating (at most 0.15 degrees to each side), randomly zooming (to 80% of the area) with a 20% chance, and horizontally flipping with a probability of 50%. We stress that the data augmentation is applied only to the training data. The final number of training images totaled 2968 (1442 COVID and 1408 non-COVID). Using the protocol in [15], the test set consists of 203 images (98 COVID and 105 non-COVID).

Results
In Table 8, we report the results of the proposed approach using the protocol described in [15]. One may observe that the experiments with the same approach used in Setups 1 and 2 (EfficientNet-B3) perform worse than the ones available in the literature.
Aiming to reduce the incidence of overfitting during the training of "Architecture 1", we propose a deeper network. In most cases, when the deeper network is used ("Architecture 2" in Table 8) rather than "Architecture 1", a gain is observed in all reported figures.
We emphasize that the architectures with the largest image size (550x550) present the worst performance among the experiments varying the input size, the opposite of what is expected. Our hypothesis is that some small images (e.g., 281x202) are expanded and severely distorted, which obscures the COVID-19 patterns in the images. The best model is Architecture 2 with an input size of 500x500 (source available at https://github.com/ufopcsilab/EfficientCovidNet). The ROC curve of the model is presented in Figure 8.
Table 9 presents a comparison of the best proposed approach against the ones available in the literature. Note that Amyar et al. [12] and Mobiny et al. [10] evaluated their approaches with only 105 images (47 COVID and 58 non-COVID) and therefore cannot be directly compared to the present work. Thus, the best results previously obtained in this setup were presented in [9]. The work proposed here surpasses it in terms of accuracy and F1-score on the COVID-CT dataset using a significantly smaller model (approximately 3× smaller): the base model proposed in [9] needs 14,149,480 parameters, while the one proposed here needs only 4,779,038.

Conclusion
In this work, a model for the detection of COVID-19 patterns in CT images, namely EfficientCovidNet, is proposed. The proposed model presents results comparable to the state-of-the-art methods and the highest accuracy to date on both datasets. It is also three times smaller (4.78 million parameters against the 14.15 million of He et al. [9]) and has a latency of 0.010 seconds. This could enable its use on devices with low computational power, such as smartphones and tablets, or even facilitate integration with a radiology PACS. Our model was evaluated in three setups and with the two largest public datasets. We also performed a cross-dataset analysis; to the best of our knowledge, this is the first work to carry out such an analysis for the present task. We believe the cross-dataset approach is of paramount importance for methods aiming to detect COVID-19 in CT images, since it resembles a real scenario and unveils the limitations of the methods (for instance, the accuracy drops from 87.68% to 56.16% in this scenario for the COVID-CT test set). Our analysis shows that methods aiming at COVID-19 detection in CT images must improve significantly to be considered a clinical option.
In this study, we show the potential of deep learning models for the task of COVID-19 detection in CT images. We also emphasize that larger and more diverse datasets are needed in order to evaluate the methods in a more realistic manner. As a future research path, we intend to build a very large CT image dataset from several Brazilian centers, in order to cover a larger spectrum of equipment (sensors), ethnic groups, and acquisition processes and thus properly validate our method.

Figure 2: Comparison among different contrasts in images.

Figure 4: Example of images with textual information.