STCovidNet: Automatic Detection Model of Novel Coronavirus Pneumonia Based on Swin Transformer

The novel coronavirus disease 2019 (COVID-19) has emerged as an enormous challenge facing China today. Preventive medicine physicians and Artificial Intelligence (AI) researchers are trying to improve early automatic warning of coronavirus infections, promote epidemic prevention, and reduce medical costs using deep learning methods. In this work, we build an extensive database of chest computed tomography (CT) scans with image data from domestic and international open-source medical datasets. The Swin Transformer is chosen as the backbone network to establish a model (STCovidNet) for the prediction of COVID-19. We then compare the performance of our technique against that of the Vision Transformer (ViT) and Convolutional Neural Networks (CNN). Next, to visualize our model's high-dimensional outputs in 2-dimensional space, we apply t-distributed stochastic neighbor embedding (t-SNE) as the dimension-reduction strategy. Finally, we employ gradient-weighted class activation mapping (Grad-CAM) to present class activation maps. The results indicate that STCovidNet surpasses ViT and CNN with a 0.9811 AUC and a 0.9858 accuracy score. Our network outperforms previous techniques in reducing intra-class variability and generating well-separated feature embeddings. The CAM figure illustrates that the decision region corresponds to the spots radiologists detect. The suggested method can be an effective way of catching COVID-19 instances.


Introduction
COVID-19 is now the most prevalent respiratory infectious disease of the 21st century [1]. By February 24, 2022, it had infected more than 430,220,905 individuals and caused the deaths of 5,936,914 patients worldwide [2]. Because of its rapid mutation and the many powerful immune-escape variants that have emerged, such as Delta and Omicron, detecting infections is a daunting task for today's public health workers. Moreover, real-time reverse transcription-polymerase chain reaction (RT-PCR) has a roughly one-third false-negative rate, necessitating repeated testing to reduce incorrect diagnoses [3], [4].
Another important detection technique is the chest CT scan, which improves sensitivity in diagnosing COVID-19 cases [6], [7]. The main chest CT manifestations after SARS-CoV-2 infection are ground-glass opacities, pulmonary consolidation, and crazy-paving (paving stone) signs, which are the most frequent radiological manifestations and critical diagnostic criteria for COVID-19 cases [5]. Combined with RT-PCR results, clinical symptoms, and epidemiological history, these findings form the principal basis for diagnosing or excluding COVID-19 pneumonia. Therefore, to achieve automatic early warning of COVID-19, some studies have attempted to develop models that automatically identify patients with COVID-19 by learning lesion characteristics through artificial intelligence technology. Most of these studies mainly used convolutional neural networks (CNN) to automatically find COVID-19 patients on chest CT images [6]. Though CNN has demonstrated its ability to solve a variety of classification issues, it is not the best option for problems that require high-level categorization, where global features such as patterns, multiplicity, and distribution must be taken into account [7]. Recently, some studies based on the Vision Transformer (ViT) [8] architecture have been published that address the receptive-field problem through the attention mechanism, obtaining better classification results than CNN in the field of CT image classification [9]. Motivated by this, we conduct this study with the following significant contributions:
1. We establish a dataset named MUST-COVID-19, consisting of 7930 chest CT images, which were preprocessed with image cropping and scaling. We applied various data-augmentation methods to reduce the chance of model overfitting.
2. The Swin Transformer [10] was then chosen as the backbone network to establish a novel model (STCovidNet) for COVID-19 case detection.
3. An evaluation of the effectiveness of STCovidNet against the state-of-the-art (SOTA) CNN and ViT models is performed on the MUST-COVID-19 dataset. Our experimental findings demonstrate the efficacy of our technique, which achieves the best performance on the CT image datasets under consideration.
To our knowledge, this is the first publication that evaluates and compares the Swin Transformer's classification performance on a COVID-19 pneumonia dataset with other models, providing a framework for medical experts to choose an excellent COVID-19 detection model and filling a research gap.
The remainder of this article is structured in the following manner. Section 2 presents the details of related studies. The sources and construction methods for the training, validation, and testing datasets are provided in Section 3. Section 4 describes our proposed approach: the STCovidNet architecture, the fundamentals of t-SNE and Grad-CAM used to visualize the model's high-dimensional outputs and class activation maps, and the performance evaluation metrics. Model parameter settings for the experimental study are discussed in Section 5. The experimental results, comparison with related models, and benefits of the proposed model are detailed in Section 6. Finally, Section 7 provides a conclusion with comments on future work.

Related Works
This section reviews the primary research methods used for current COVID-19 case detection. CNN is the most often used approach to solve the challenge of automated COVID-19 diagnosis [11], [12]. The deep learning frameworks in previous studies are primarily based on pretrained networks, including variants of Very Deep Convolutional Networks (VGGNet) [13], Deep Residual Neural Networks (ResNet) [14], Dense Convolutional Network (DenseNet) [15], Inception [16], Xception [17], MobileNet [18], and EfficientNet [19]. These models adapt to the new task of COVID-19 patient detection and classification by modifying or adding custom layers and reusing knowledge gained from previous experience via Transfer Learning. For example, Brunese et al. [20] proposed two models using the VGG-16 network as a backbone based on Transfer Learning. The first network identifies whether the subject is healthy or has pneumonia; once the first network returns a positive prediction, the second network is used to find COVID-19. The VGG-16 network attained 98% accuracy for the three-class classification. ResNet is another common CNN architecture; it prevents the gradient-vanishing problem found in earlier architectures such as VGG. Using the residual network, Narin et al. [21] classified COVID-19 cases and healthy controls with ResNet-50, achieving the highest accuracy (98%) for binary classification. Other studies have used more efficient architectures such as DenseNet and EfficientNet. Wang et al. [22] developed a COVID-19 pneumonia classification pipeline using DenseNet-121; the proposed approach achieved an overall AUC of 0.88-0.99 across different datasets. Shamila et al. [23] applied the EfficientNet architecture to establish a classification model with 95% accuracy and a 93% F1-score on the test set.
While CNNs are well suited for image classification in deep learning, they have some conceptual limitations. In CNNs, information about the location of entities is lost during the max-pooling operation. In addition, CNNs do not consider some spatial relationships between simple objects. They need a vast receptive field to capture long-range dependencies, which means developing large kernels or extremely massive networks, resulting in a highly complex model that is challenging to train [9]. To overcome these drawbacks of CNN, some researchers have used other architectures, such as Capsule Neural Networks (Capsnets) [24] and ViT [8], for COVID-19 classification, which differ from traditional CNN networks. Sabour et al. proposed Capsnets [24], a new neural network architecture that resolves the disadvantage of CNNs in not using location and orientation information to recognize objects [25]. Toraman et al. [26] suggested a five-convolutional-layer Capsnets model: the first four layers contain 16, 32, 64, and 128 kernels, respectively, and the fifth layer includes 32 capsules. After 10-fold cross-validation and 50 epochs of training, the model reached 84.22% accuracy for multi-classification.
The latest research is based on the Transformer [27]. Dosovitskiy et al. [8] applied the standard Transformer architecture to image recognition, proposing ViT based on self-attention to approach or exceed SOTA models on several image recognition benchmarks. Several new COVID-19 detection algorithms based on the ViT architecture have been proposed. Shome et al. [28] built a dataset of 30,000 images and trained a ViT model on it; the trained model performed better than CNNs such as EfficientNet-B0, Inception-V3, and ResNet-50 in a multi-classification challenge, with 92% accuracy and 98% AUC. Mondal et al. [9] suggested a network based on the ViT-B/16 architecture and achieved the highest accuracy of 98.1%, exceeding most existing methods.

MUST-COVID-19 Dataset
We establish a chest CT scan dataset named MUST-COVID-19 for this research, consisting of 7930 chest CT images. To make the results representative, our data are randomly sampled from eight open-source chest CT image sets: (1) the CNCB 2019 Novel Coronavirus Resource AI Diagnosis Dataset, drawn from a total of 2778 patients in the CC-CCII dataset [29], including 917 COVID-19 pneumonia cases, 878 normal cases, and 983 non-COVID-19 pneumonia cases in the training set; (2) iCTCF [30], drawn from a total of 1521 patients in two hospitals of Huazhong University of Science and Technology, China, including 894 COVID-19 pneumonia cases (mild, severe, and critical), 328 novel-coronavirus-negative patients (control group), and 299 patients with suspected COVID-19; (3) COVID-CTSet [31], from the Negin Medical Center in Sari, Iran, including 377 patients with confirmed COVID-19, 95 novel-coronavirus-negative patients, and 282 other pneumonia patients; and the remainder collected from (4) TCIA [32], (5) the COVID-19 Infection Segmentation Dataset [33], (6) LIDC-IDRI [34], (7) Radiopaedia [35], and (8) MosMedData [36]. As shown in Table 1, MUST-COVID-19 contains images of three classes, with 80 percent of the images employed for training and verification and 20 percent for model testing.

Image Pre-processing
The pixels outside the red bounding box have no value for diagnosing COVID-19 pneumonia [37], as illustrated in Fig. 1(a). Therefore, to remove the irrelevant parts, we crop the images of MUST-COVID-19 to the body region using these bounding boxes. Figure 1(b) shows example images after image cropping and scaling.

STCovidNet Architecture
Figure 2(a) illustrates the architecture of STCovidNet. The backbone of STCovidNet is the Swin Transformer, and it is built on the transfer learning principle [38], [39]. Swin Transformer [10] is a variant of ViT designed to be suitable for detection tasks by introducing hierarchical feature maps and shifted windows. In each block, the (S)W-MSA module and the MLP are each preceded by a LayerNorm (LN) layer and followed by a residual connection, and the MLP has two layers with a GELU non-linearity in between. As a result, Swin Transformer blocks are deployed in groups of two [40]. The connections of consecutive Swin Transformer blocks can be represented by Equations (1) to (4):

$$\hat{a}^{k} = \text{W-MSA}(\text{LN}(a^{k-1})) + a^{k-1} \tag{1}$$
$$a^{k} = \text{MLP}(\text{LN}(\hat{a}^{k})) + \hat{a}^{k} \tag{2}$$
$$\hat{a}^{k+1} = \text{SW-MSA}(\text{LN}(a^{k})) + a^{k} \tag{3}$$
$$a^{k+1} = \text{MLP}(\text{LN}(\hat{a}^{k+1})) + \hat{a}^{k+1} \tag{4}$$

where $\hat{a}^{k}$ is the (S)W-MSA output, $a^{k}$ is the MLP output, and $k$ denotes the Swin Transformer block position.
The Swin Transformer comes in four versions, from small to large: Swin-Tiny (Swin-T), Swin-Small (Swin-S), Swin-Base (Swin-B), and Swin-Large (Swin-L) [10]. This research adopts Swin-T as the backbone network, considering both performance and computational complexity; it has been pre-trained on ImageNet-21k [41].
In Fig. 2(a), the initial stage of the STCovidNet framework is data augmentation. Augmentation approaches, including random rotation, random horizontal flip, random crop, random blur, random salt-and-pepper noise, and random Gaussian noise, are used to improve data representativeness, reduce overfitting, and develop a more generic model. To further boost the randomness of these operations, a random-order command is used to shuffle the order of all the preceding transform operations.
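The random-ordering idea can be sketched in a few lines of plain Python. This is a toy stand-in, not the authors' pipeline (which presumably relies on a library such as torchvision); the op names and the list-of-rows "image" are illustrative assumptions:

```python
import random

# Minimal sketch of a shuffled augmentation pipeline: the list of transform
# ops is re-shuffled per image, so the order of operations is itself random.
class RandomOrder:
    def __init__(self, ops):
        self.ops = ops

    def __call__(self, image):
        ops = list(self.ops)
        random.shuffle(ops)        # disrupt the order of all transforms
        for op in ops:
            image = op(image)
        return image

# Stand-in transforms operating on a toy "image" (a list of pixel rows).
def horizontal_flip(img):
    return [row[::-1] for row in img]

def identity(img):
    return img

pipeline = RandomOrder([horizontal_flip, identity])
img = [[1, 2, 3], [4, 5, 6]]
out = pipeline(img)
print(out)  # [[3, 2, 1], [6, 5, 4]] -- flip is applied regardless of order
```

Because every op in the list is applied exactly once, only the order varies between samples, matching the behaviour described above.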
After the augmentation, the input CT image of size 224 × 224 passes through the patch partition layer, where it is segmented into non-overlapping 4 × 4 patches to generate patch tokens of shape (224/4 × 224/4, 48) = (56 × 56, 48).

The last layer of STCovidNet is an average pooling followed by a Norm layer. The CT image is thereby converted into a single representation with 768 embeddings. A new classification head for the target domain MUST-COVID-19 is attached to map these 768 embeddings to 3 dimensions and obtain the final predicted results.
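A minimal PyTorch sketch of this final stage is shown below. `CovidHead` and its toy input are illustrative assumptions, not the authors' released code; the real model attaches such a head to the Swin-T backbone's 7 × 7 = 49 stage-4 tokens:

```python
import torch
import torch.nn as nn

# Sketch (assumed) of STCovidNet's final stage: the 49 stage-4 tokens
# (768-d each) are layer-normalized, average-pooled into one 768-d
# embedding, then mapped by a new 3-way linear head to the classes
# {COVID-19, normal, other pneumonia}.
class CovidHead(nn.Module):
    def __init__(self, embed_dim: int = 768, num_classes: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 49, 768) from the last Swin stage
        x = self.norm(tokens)   # normalize token embeddings
        x = x.mean(dim=1)       # average pooling over the 49 tokens
        return self.fc(x)       # (batch, 3) class logits

head = CovidHead()
logits = head(torch.randn(2, 49, 768))
print(logits.shape)  # torch.Size([2, 3])
```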

SOTA ViT and CNN Models
As a comparison, we used the following SOTA models. ViT [8] brought multi-head self-attention [8], [10], [42], [43] to computer vision applications as an image feature extraction approach, following the recent breakthrough of Transformers [27] in natural language processing tasks [44]. ViT mainly consists of the following parts: a linear projection of flattened patches (embedding layer), a Transformer Encoder, and an MLP head. Figure 3 illustrates the ViT model.
The principle of ViT is first to divide the input into patches and then reshape each patch into a vector to obtain a flattened patch. For an input image of size H × W × C, ViT obtains N patches by dividing the picture into P × P patches. Each patch of shape P × P × C is flattened into a (P² · C)-dimensional vector. These vectors are then stacked to obtain a two-dimensional matrix of shape N × (P² · C), analogous to word vectors in natural language processing. The input sequence $z_0$ of ViT is given by Eq. (6):

$$z_0 = [x_{\text{class}};\, x_p^1 E;\, x_p^2 E;\, \ldots;\, x_p^N E] + E_{\text{pos}}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D} \tag{6}$$

where $x_p^i$ represents the i-th image patch, $E$ is the patch embedding projection, and $E_{\text{pos}}$ is the position embedding. The forward pass of ViT is given by Eqs. (7) to (9):
$$z'_{\ell} = \text{MSA}(\text{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \ldots L \tag{7}$$
$$z_{\ell} = \text{MLP}(\text{LN}(z'_{\ell})) + z'_{\ell}, \quad \ell = 1 \ldots L \tag{8}$$
$$y = \text{LN}(z_L^0) \tag{9}$$

where $z_{\ell}$ is the output of the $\ell$-th encoder block and $y$ is the output value of ViT. As shown in Fig. 4, ViT consists primarily of multi-head self-attention (MSA) and MLP blocks, with LayerNorm (LN) applied before each and residual connections added after each.
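The patch-partition arithmetic behind Eq. (6) can be checked with a small helper. `patch_shapes` is an illustrative utility, not part of ViT itself:

```python
# Sketch of ViT's patch arithmetic: an H x W x C image split into P x P
# patches yields N = (H/P) * (W/P) patches, each flattened to P*P*C dims.
def patch_shapes(H: int, W: int, C: int, P: int) -> tuple[int, int]:
    assert H % P == 0 and W % P == 0, "image sides must divide by P"
    N = (H // P) * (W // P)   # number of patch tokens
    D_patch = P * P * C       # flattened patch dimension before projection
    return N, D_patch

# ViT-B/32 on a 224 x 224 RGB image:
N, D = patch_shapes(224, 224, 3, 32)
print(N, D)  # 49 3072
```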
CNN is a feed-forward neural network capable of automatic representation learning from images [45]. It extracts features with diagnostic value from medical images to realize automatic classification and detection of disease [46], [47]. A layer transition occurs when the CNN feeds one layer's result as input to the next layer [48], [49], as shown in Eq. (10):

$$z_i = T_i(z_{i-1}) \tag{10}$$

where $i$ is the network layer's index, $T_i(\cdot)$ is the nonlinear function that includes convolutional computation, pooling, batch normalization, etc., and $z_i$ is the i-th layer's result. The general framework of CNN is shown in Fig. 5.
ResNet [14] is a series of extremely deep convolutional networks that include skip links using an identity function to eliminate exploding or vanishing gradient problems, as shown in Eq. (11):

$$z_i = T_i(z_{i-1}) + z_{i-1} \tag{11}$$

With skip connections, the activation of a layer is applied directly to the activation of layers further along the network, allowing gradients to flow straight from later layers to earlier ones. This aids the construction of deeper CNNs while retaining accuracy. In this study, we employ ResNet-50 [14], a typical ResNet variant, whose design is shown in Fig. 6.
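Equation (11) can be illustrated with a toy numeric sketch (pure Python, not a real ResNet layer): even if the learned transform collapses to zero, the identity path survives, which is why gradients can flow through very deep stacks.

```python
# Numeric sketch of Eq. (11): y = T(x) + x. The skip connection guarantees
# the identity signal passes through even when the transform is "dead".
def residual_block(x, transform):
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]

zero_transform = lambda x: [0.0 for _ in x]   # a layer that learned nothing
x = [1.0, -2.0, 3.0]
y = residual_block(x, zero_transform)
print(y)  # [1.0, -2.0, 3.0] -- the input passes through unchanged
```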
DenseNet, developed by Huang et al. [49], is designed to achieve better anti-overfitting properties. DenseNet extends ResNet's shortcut connections by connecting all layers: each layer $z_i$ receives all preceding layers $z_0, \ldots, z_{i-1}$ as its input to guarantee that the maximum amount of inter-layer information is conveyed, as shown in Eq. (12):

$$z_i = T_i([z_0, z_1, \ldots, z_{i-1}]) \tag{12}$$

The use of dense connections helps reduce the problem of overfitting in networks with limited datasets [50]. Figure 7 depicts a three-layer dense block in which each layer executes batch normalization, ReLU activation, and convolution.
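The dense connectivity of Eq. (12) can be sketched as follows. This is a toy one-channel stand-in, not an actual DenseNet layer; the point is that the input to each layer is the concatenation of every preceding feature map:

```python
# Sketch of Eq. (12): z_i = T_i([z_0, ..., z_{i-1}]). Each layer consumes
# the channel-wise concatenation of all earlier outputs.
def dense_block(z0, layers):
    features = [z0]
    for layer in layers:
        concatenated = [v for f in features for v in f]  # concat all maps
        features.append(layer(concatenated))
    return features

growth = lambda feats: [sum(feats)]            # toy single-channel layer
feats = dense_block([1.0, 2.0], [growth, growth, growth])
print(feats)  # [[1.0, 2.0], [3.0], [6.0], [12.0]]
```

Note how the input seen by successive layers grows (2, then 3, then 4 values here), mirroring the growing concatenation in a real dense block.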
EfficientNet [19] is a series of models from B0 to B7 obtained by Google through a multi-objective neural architecture search (NAS) approach. Based on a compound coefficient, EfficientNet scales three dimensions with the formula shown in Eq. (13):

$$\text{depth: } d = \rho^{\phi}, \quad \text{width: } w = \sigma^{\phi}, \quad \text{resolution: } r = \tau^{\phi} \tag{13}$$

where $\phi$ is the compound coefficient, $\rho$ is the scaling factor for depth, $\sigma$ is the scaling factor for width, and $\tau$ is the scaling factor for resolution. The three scaling factors are determined by a grid search, upon which the B0 model is scaled to generate the series of required models. The framework of EfficientNet mainly consists of mobile inverted bottleneck convolution (MBConv) blocks [18], [19]. The structure of EfficientNet-B0 is shown in Fig. 8.
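The compound scaling of Eq. (13) can be sketched numerically. The default factor values below are the ones reported in the original EfficientNet paper, not values stated in this article:

```python
# Sketch of Eq. (13): depth, width, and resolution all scale as a power of
# the compound coefficient phi. Defaults are the EfficientNet paper's
# grid-searched values (assumption: rho=1.2, sigma=1.1, tau=1.15, chosen
# so that rho * sigma^2 * tau^2 is approximately 2).
def compound_scale(phi, rho=1.2, sigma=1.1, tau=1.15):
    return rho ** phi, sigma ** phi, tau ** phi

d, w, r = compound_scale(1)   # one step of B1-style scaling
print(d, w, r)  # 1.2 1.1 1.15
```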

t-SNE
t-SNE [53] is an effective method of scaling down high-dimensional data to explore the distribution of features generated by models [54]. Suppose $X$ is a vector containing all samples and $Y$ is a target vector holding a low-dimensional representation of $X$. The conditional probability $p_{j|i}$ describes the similarity of data point $x_j$ to data point $x_i$ in the original high-dimensional space [53], as shown in Eq. (14):

$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \tag{14}$$

As a result, the joint probabilities in the original space may be stated as in Eq. (15):

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \tag{15}$$

where $n$ denotes the size of the data collection. The probability $q_{ij}$ in the low-dimensional space is calculated using a Student t-distribution [55], as illustrated in Eq. (16):

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \tag{16}$$
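Equation (14) can be illustrated with a small pure-Python helper. A fixed bandwidth σ is assumed here for simplicity; real t-SNE tunes a per-point σ_i via a perplexity search:

```python
import math

# Sketch of Eq. (14): the conditional similarity p_{j|i} in the original
# high-dimensional space, with one fixed Gaussian bandwidth sigma.
def p_cond(X, i, j, sigma=1.0):
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    num = math.exp(-sqdist(X[i], X[j]) / (2 * sigma ** 2))
    den = sum(math.exp(-sqdist(X[i], X[k]) / (2 * sigma ** 2))
              for k in range(len(X)) if k != i)
    return num / den

X = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]]
# the near neighbour receives almost all of point 0's probability mass
p = p_cond(X, 0, 1)
```

As expected, `p` is close to 1: the faraway point contributes almost nothing to the denominator, so similar points keep high conditional probability.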
Using the Kullback-Leibler divergence [56] as a loss function and a gradient-based algorithm, t-SNE then determines the projection $y_i$ of each $x_i$ in the lower dimension, as shown in Eq. (17):

$$C = \mathrm{KL}(P \parallel Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{17}$$

Grad-CAM
Grad-CAM [57] generates a class-discriminative localization map based on gradient values, emphasizing critical regions in images and offering an interpretable perspective on models. The class-discriminative localization map $L^{c}_{\text{Grad-CAM}}$ is estimated as shown in Eq. (18):

$$L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha^{c}_{k} A^{k}\right) \tag{18}$$

where $A$ is the feature map activations operator, $c$ is the target class of the model, and $\alpha^{c}_{k}$ captures the network's partial linearization downstream of $A$ [57]. The calculation of $\alpha^{c}_{k}$ is shown in Eq. (19):

$$\alpha^{c}_{k} = \frac{1}{Z} \sum_{n} \sum_{m} \frac{\partial Y^{c}}{\partial A^{k}_{nm}} \tag{19}$$
where $A^{k}$ is the k-th feature map, $n$ and $m$ index the spatial locations in the map, $Z$ is the number of locations, $Y^{c}$ is the score for class $c$, and $\partial Y^{c} / \partial A^{k}_{nm}$ is its gradient with respect to the feature map activations.
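Equations (18) and (19) can be sketched in a few lines on toy 2 × 2 single-channel maps (not the authors' implementation): each feature map is weighted by the spatial average of the class-score gradient, the weighted maps are summed, and ReLU keeps only positively contributing regions.

```python
# Sketch of Eqs. (18)-(19): Grad-CAM weights each feature map A^k by the
# global-average-pooled gradient alpha_k, sums them, and applies ReLU.
def grad_cam(feature_maps, gradients):
    # alpha_k: mean of the class-score gradient over all spatial positions
    alphas = [sum(sum(row) for row in g) / (len(g) * len(g[0]))
              for g in gradients]
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[max(0.0, sum(a * fm[i][j]
                          for a, fm in zip(alphas, feature_maps)))
             for j in range(w)] for i in range(h)]   # ReLU keeps positives

A = [[[1.0, 0.0], [0.0, 2.0]]]   # one 2x2 feature map
G = [[[1.0, 1.0], [1.0, 1.0]]]   # uniform gradient -> alpha = 1
cam = grad_cam(A, G)
print(cam)  # [[1.0, 0.0], [0.0, 2.0]]
```

With a uniform positive gradient the map passes through unchanged; with a negative gradient the ReLU would zero the whole map, exactly the class-discriminative behaviour Eq. (18) encodes.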

Performance Evaluation Metrics
In this research, we use multiple evaluation metrics to assess the models, including accuracy, sensitivity (recall), F1-score, and AUC. The average performance was calculated using the macro average and the weighted average [58].
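The difference between the two averaging schemes can be sketched as follows; the per-class F1 scores and supports below are illustrative stand-ins, not the paper's reported values:

```python
# Macro average treats every class equally; weighted average scales each
# class's score by its support (number of samples in that class).
def macro_avg(scores):
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

f1 = [0.98, 0.97, 0.99]       # per-class F1 (illustrative values)
support = [556, 533, 498]     # per-class sample counts (illustrative)
print(round(macro_avg(f1), 4), round(weighted_avg(f1, support), 4))
```

When class sizes are balanced the two averages nearly coincide; with strong class imbalance the weighted average is pulled toward the majority class's score.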

Experimental Setup
The suggested STCovidNet model, as well as the following SOTA models, are used in this research: (1) ViT-B/32 (base size), (2) ViT-L/32 (large size), (3) ResNet-50, (4) DenseNet-201, and (5) EfficientNet-B4, all pre-trained on ImageNet-21k. Each model was trained for a maximum of 30 epochs using the Adam optimizer with a train batch size of 16, a test batch size of 8, and an initial learning rate of 3e-5. We used Python as our programming language with PyTorch 1.9, and all experiments were conducted with NVIDIA CUDA 11.0 on a Tesla P100 (16 GB) GPU.

We can further understand the prowess of the proposed model by examining the confusion matrix (Fig. 9). Notably, among the 556 novel-coronavirus-positive patients predicted by STCovidNet, 543 are confirmed novel-coronavirus-infected patients; the actual labels of the remaining 8 and 5 are normal and non-COVID-19 pneumonia cases, respectively. Additionally, 523 of the 533 patients predicted as normal and 491 of the 498 patients predicted as having other types of pneumonia are consistent with their actual labels. Overall, STCovidNet performed the best in identifying COVID-19, healthy, and other pneumonia patients.

The t-SNE Visualization
The results of t-SNE are depicted in Fig. 10.

Conclusion
This research offers a COVID-19 detection model (STCovidNet) using Swin Transformer blocks, trained and evaluated on MUST-COVID-19. The suggested approach yields the best results and adheres to medical judgment guidelines. Our findings suggest that STCovidNet is a promising architecture for detecting COVID-19 cases.
In future experiments, we intend to develop the proposed method further for different types of pneumonia, to investigate whether the model can distinguish between multiple different cases of pneumonia. Furthermore, we will study the application of STCovidNet to COVID-19 detection on chest X-rays (CXR) to assess its efficacy.
The detailed structure of the Swin Transformer blocks is shown in Fig. 2(b). Each Swin Transformer block contains a LayerNorm (LN) layer, either a regular window-based multi-head self-attention (W-MSA) module or a shifted-window multi-head self-attention (SW-MSA) module, and Multi-Layer Perceptron (MLP) layers.

The generated patch tokens are linearly embedded in the first stage: tokens of shape (56 × 56, 48) are projected to dimension C (C = 96) to generate tokens of shape (56 × 56, 96). They are then fed into several Swin Transformer blocks. The first two Swin Transformer blocks keep the shape of the input and output tokens constant at (56 × 56, 96) and, together with the linear embedding layer, are designated as Stage 1. Stages 2, 3, and 4 each consist of a patch merging layer and Swin Transformer blocks. As the network deepens, the token resolution is gradually reduced by patch merging: adjacent 2 × 2 patches are merged into one, the spatial resolution is down-sampled to 1/2 per side, and C is doubled. The output token shapes of stages 2, 3, and 4 are (28 × 28, 2C), (14 × 14, 4C), and (7 × 7, 8C), respectively. After stage 4, we end up with 224/32 × 224/32, i.e., 7 × 7 tokens, each with an embedding size of 768 dimensions.
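The token-shape progression through the four stages can be verified with a short helper (illustrative arithmetic only, not part of the model):

```python
# Sketch of the Swin-T token shapes: patch partition gives a 56x56 grid,
# then each patch-merging stage halves the side length and doubles C.
def swin_stages(img=224, patch=4, C=96, merges=3):
    side, ch = img // patch, C
    shapes = [(side, side, ch)]
    for _ in range(merges):
        side, ch = side // 2, ch * 2
        shapes.append((side, side, ch))
    return shapes

print(swin_stages())
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```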
Figure 8 shows EfficientNet-B0 with a 224 × 224 input resolution. The first part is a convolutional operator, and the second part completes feature extraction with 16 MBConv operators. The third part consists of the convolution, global average pooling, and classification layers.
The Grad-CAM results are shown in Fig. 11. Figure 11(a) identifies the apparent feature region of infection in the patient's lung, which is the main activation area used by the model to classify the image as a COVID-19 instance. In Fig. 11(b), the activations for normal CT images are uniformly localized over healthy regions. Figures 11(a) and 11(c) show that the model effectively distinguishes between COVID-19 and other types of pneumonia by locating different focal areas, approaching the ability of experienced radiologists. All results visualized by the Grad-CAM method show that the model learns valid representations before making classification decisions and is interpretable.


Table 2 and Fig. 9 illustrate the experimental findings obtained by STCovidNet and the other models. As can be seen, STCovidNet outperformed the ViT and CNN techniques on MUST-COVID-19, with a 0.9811 AUC and a 0.9858 accuracy score. We then analyze the models' accuracy, sensitivity (recall), and F1-scores and explain their importance in assessing classification quality. It is worth mentioning that under the "dynamic zero-case" COVID-19 policy adhered to in China, sensitivity is considered the critical indicator of a COVID-19 auto-detection model, as any missed positive case can pose a severe risk to the community.

As shown in Table 2, STCovidNet has the highest sensitivity value of 0.9837, revealing that only a tiny number of COVID-19 pneumonia patients are wrongly categorized, which is a highly desired feature in a COVID-19 early-warning model. Furthermore, our proposed approach received the most outstanding F1-scores in all categories, demonstrating that it is the best-balanced model among the baselines in terms of precision and sensitivity.