Intelligent detection and applied research on diabetic retinopathy based on the residual attention network

This study proposes a high‐accuracy (ACC) algorithm to automatically detect diabetic retinopathy (DR) and diabetic macular edema (DME) in retinal fundus images. Three DR datasets were used: EyePACS, Messidor, and IDRid. On the EyePACS dataset, both binary and five‐class DR classification experiments were conducted; the Messidor and IDRid datasets were graded for both DR and DME. After preprocessing, enhancement, and normalization, common convolutional neural networks (CNNs) were used to obtain baseline classification results. An optimization method, the residual attention network (RAN), was then introduced; it is based on the residual attention module and incorporates dilated convolution. Focal loss was added to address the class imbalance problem. Next, a five‐fold cross‐validation strategy was introduced to assess and optimize the proposed model, after which the prediction ACC, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and Kappa score were assessed. The proposed RAN achieved 89.2% ACC (95% confidence interval [CI], 0.8782–0.9123) for binary DR classification (normal vs. abnormal) on the EyePACS dataset and 89.8% ACC (95% CI, 0.8751–0.9275) for binary DR classification on the Messidor dataset; on the IDRid dataset it achieved 71.5% ACC (95% CI, 0.6941–0.7423) for binary DR classification. RAN improved upon the results of commonly used CNN methods on the same datasets. Therefore, adopting the proposed method may improve the classification and diagnosis of DR.


| INTRODUCTION
Diabetic retinopathy (DR) is a late manifestation of diabetes mellitus and one of the most severe complications of diabetic microangiopathy. If not detected and treated early, DR can cause irreversible visual impairment or, in severe cases, blindness.1 Fundus imaging is a vital inspection method for the early detection of DR lesions; the corresponding fundus changes and grading standards used in DR are shown in Figures 1 and 2. Because ophthalmologists are scarce in less developed areas, many patients suffering from diabetes are unable to receive an early diagnosis and treatment for DR.2 Therefore, computerized screening technology based on fundus images is of great significance in delaying the progression of DR.
High-quality color retina images can assist doctors in investigating and diagnosing retinopathy. However, diagnosing DR requires a clinically experienced ophthalmologist, and DR screening is not performed in most grass-roots areas, which significantly increases the risk of blindness due to diabetes.3 Therefore, adopting computer-assisted remote diagnostic technology for fundus imaging can effectively reduce the visual impairment among diabetic patients that results from insufficient medical resources.
At present, most ophthalmology image analysis work focuses on DR classification, vessel segmentation, and detection of retinal structures.[4][5][6] Pratt et al.7 developed a network with a convolutional neural network (CNN) architecture and data augmentation that can identify intricate features involved in the classification of DR. They trained the model on the EyePACS dataset, achieving a sensitivity (SE) of 95% and accuracy (ACC) of 75% on 5000 validation images. Rahim et al.8 presented an automatic detection method for DR and maculopathy in fundus images employing fuzzy image processing techniques, combining fuzzy image processing, the circular Hough transform, and several feature extraction methods. Eftekhari et al.6 adopted a two-step process with two online datasets to train a CNN, which solved the imbalance problem and reduced training time while maintaining accurate detection. Seth et al.9 used a CNN and linear support vector machines trained on the benchmark EyePACS dataset, demonstrating that the model had high SE and specificity (SP) in detecting DR. Dutta et al.10 proposed an automatic knowledge model to identify the critical indicators of DR. After testing the model using a central processing unit (CPU)-trained neural network, three types of back-propagation neural networks were used; the model was thereby able to quantify the characteristics of different types of blood vessels, exudates, bleeding, and microaneurysms. Adem et al.11 also applied CNN-based methods to DR detection. A hierarchical coarse-to-fine network (CF-DRNet)13 was proposed to classify the five stages of DR severity using CNNs, outperforming various state-of-the-art methods on the publicly available IDRiD and EyePACS datasets. Arenas-Cavalli et al.14 evaluated the automated DR screening tool DART, for which receiver operating characteristic (ROC) analysis indicated an SE of 94.6%, SP of 74.3%, and AUC of 0.915.
Furthermore, Dai et al.15 developed a system called DeepDR, which was able to detect early to late stages of DR; grading of DR as mild, moderate, severe, and proliferative achieved AUCs of 0.943, 0.955, 0.960, and 0.972, respectively. This article factors in the needs of both ophthalmologists and diabetic patients by proposing a deep learning (DL) algorithm, RAN, to improve the performance of DR diagnosis (Tables 1-3).

| MATERIAL AND METHODS
In this study, experiments were conducted on the EyePACS, Messidor, and IDRid datasets, as shown in Figure 3. Building on ResNet,27 the proposed RAN algorithm integrates an attention mechanism and adds an attention-guided module (AGM) and dilated convolution.

| Database
This study utilized three public datasets. The EyePACS28 training set contained 35 125 fundus images released by the California Medical Foundation from EyePACS users: 25 809 level 0 (74%), 2443 level 1 (7%), 5292 level 2 (15%), 873 level 3 (2%), and 708 level 4 (2%). Because normal fundus images are heavily over-represented in this dataset, 40% of the normal images were selected for training and testing in the binary classification experiments, and only 20% in the five-class classification experiments and tests. The Messidor dataset29 consisted of 1200 fundus images from three ophthalmology hospitals, 800 of which were obtained following pupil dilation. Each image was marked with a DR lesion grade of 0-3 and a DME lesion grade of 0-2; Table 4 lists the grade distribution. The image sizes in this dataset were 1440 × 960, 2240 × 1488, and 2304 × 1536 pixels, in TIF format. The IDRid dataset30 included lesion segmentation, disease classification, and optic disc and fovea detection tasks; only the disease classification data were used in this experiment, comprising 413 images in the training set and 103 images in the test set. All images were 4288 × 2848 pixels in JPG format; Table 5 shows the number and proportion distribution. For each of the three datasets, 60%, 15%, and 25% of the images were randomly selected as the training, validation, and test sets, respectively.
As the number of abnormal images in the Messidor and IDRid datasets was small, binary classification was more meaningful in terms of clinical application. Evidently, the most prominent feature of these medical image datasets is their imbalanced data distribution: normal images far outnumber abnormal ones, and the amount of data decreases as disease severity increases. Data preprocessing and loss-function optimization are the most common remedies for this problem. Widely used data augmentation methods include translation, rotation, cropping, scaling, noise addition, and affine transformation; these usually do not change the class of the object and are the earliest and most widely adopted image-enhancement techniques. The color of the image can also be varied along four dimensions: brightness, contrast, saturation, and hue (Figure 4).
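The augmentations listed above can be sketched as follows. This is a minimal illustrative implementation, not the authors' actual pipeline: the function name, the jitter ranges, and the noise scale are all assumptions chosen for demonstration.

```python
import numpy as np

def augment(img, rng):
    """Apply an illustrative random combination of geometric and photometric
    augmentations to an H x W x 3 uint8 fundus image (ranges are assumed)."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:                              # random horizontal flip
        out = out[:, ::-1, :]
    out = np.rot90(out, k=int(rng.integers(0, 4)))      # rotation by 0/90/180/270 deg
    out = out * rng.uniform(0.8, 1.2)                   # brightness jitter
    mean = out.mean()
    out = (out - mean) * rng.uniform(0.8, 1.2) + mean   # contrast jitter
    out = out + rng.normal(0.0, 2.0, out.shape)         # additive Gaussian noise
    return np.clip(out, 0, 255).astype(np.uint8)
```

In practice one would apply such transforms on the fly during training so that the minority classes are seen under many variations.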
In order to reduce the differences between images in the dataset, before sending each image to the network for training, normalization was applied per image:

I'(i, j, k) = (I(i, j, k) − m_k) / σ_k

where I' is the normalized image, i and j are the pixel coordinates, k indexes the three channels of the image (blue, green, and red), m_k is the mean of the kth channel's pixel values, and σ_k is the standard deviation of the kth channel's pixel values.
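The per-channel normalization above can be written directly with NumPy; this is a straightforward sketch of the formula, with the function name chosen for illustration.

```python
import numpy as np

def normalize_channels(img):
    """Per-channel normalization: subtract each channel's mean and divide by
    its standard deviation, as in I'(i,j,k) = (I(i,j,k) - m_k) / sigma_k."""
    img = img.astype(np.float32)
    m = img.mean(axis=(0, 1), keepdims=True)          # m_k, one mean per channel
    s = img.std(axis=(0, 1), keepdims=True) + 1e-8    # sigma_k, guarded against 0
    return (img - m) / s
```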
The loss function in a neural network measures the gap between the model's predicted values and the actual values of the data, and serves as a standard for measuring the generalization ability of the model. The smaller the loss, the better the model performs, and different models generally use different loss functions. The most commonly used loss function is cross-entropy.31 Since class imbalance is common in DR datasets, focal loss32 was introduced in this experiment. Focal loss modifies cross-entropy by multiplying it by a modulating factor that weakens the contribution of easily classified examples to model training.
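A minimal PyTorch sketch of focal loss, following the standard formulation of Lin et al.32 (the default values gamma = 2 and alpha = 0.25 are the commonly used ones from that paper, not values stated in this article):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by alpha * (1 - p_t)^gamma, which
    down-weights well-classified (easy) examples."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                       # probability assigned to the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()
```

An easy example (high true-class probability) contributes far less to this loss than a hard one, which is exactly the mechanism used here to counteract the dominance of normal fundus images.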

| Residual attention network
The core of RAN (Figure 5) is the attention mechanism, which can amplify lesion-area information and suppress background information, thereby improving the ACC of the model in DR classification.33 As the module stacking becomes deeper, different levels of attention information can be extracted from top to bottom, and the attention perception of the different modules adapts accordingly.34 The added attention residual learning structure can train very deep residual attention networks, which can easily be extended to hundreds of layers. By stacking this residual attention structure, the advantages of residual learning and the attention mechanism are thoroughly combined to achieve better results. Each attention module is divided into two branches33: the soft mask branch and the trunk branch (Figure 6). The attention mechanism is formulated as

H(x) = (1 + M(x)) · T(x)

where T represents the trunk branch and M represents the mask branch. The mask branch uses several max-pooling operations to increase the receptive field; after reaching the minimum resolution, a symmetric network structure is used to upsample the features back.33 As shown in Figure 7, an AGM was also added to RAN, composed of an adaptive average pooling layer followed by two 1 × 1 convolution layers with different activation functions. The specific operation is as follows. First, the input feature map passes through an adaptive average pooling layer, producing an output of dimension R^(1 × 1 × M). Next, after a 1 × 1 convolution layer with the rectified linear unit (ReLU) activation function, the output has dimension R^(1 × 1 × M/r); that is, the number of channels is reduced from M to M/r.
Then, after a 1 × 1 convolution layer with the sigmoid activation function, the number of channels is expanded from M/r back to M, yielding a channel descriptor of dimension R^(1 × 1 × M) that is used to recalibrate the original feature map. The hyper-parameter r controls the computational cost of the AGM and was set to 16. Finally, the obtained channel descriptor is multiplied with the input feature map, completing the recalibration: the importance of each channel is reweighted by integrating global information, highlighting important information while suppressing background information.
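The AGM described above can be sketched as a small PyTorch module. This is an illustrative reconstruction from the text (pooling, 1 × 1 conv + ReLU reducing M to M/r, 1 × 1 conv + sigmoid expanding back, then channel-wise rescaling); the class name is chosen for illustration.

```python
import torch
import torch.nn as nn

class AttentionGuidedModule(nn.Module):
    """Channel recalibration as described in the text: adaptive average
    pooling, a 1x1 conv + ReLU (M -> M/r), a 1x1 conv + sigmoid (M/r -> M),
    then multiplication of the resulting channel descriptor with the input."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(self.pool(x))   # channel descriptor of shape (N, C, 1, 1)
        return x * w                # recalibrate each channel of the input
```

Because the sigmoid weights lie in (0, 1), each channel of the input is scaled according to its learned global importance.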

| Dilated convolution module
In order to expand the receptive field and capture multiscale contextual information, this article also adopted a dilated convolution module.35 As shown in Figure 8, a dilated convolution is equivalent to inserting d − 1 gaps between adjacent convolution kernel parameters. When the dilation rate d = 1, the dilated convolution degenerates into a standard convolution; the larger d is, the larger the receptive field of the convolution kernel. In this article, a 1 × 1 standard convolution, a 3 × 3 dilated convolution with dilation rate d = 2, a 3 × 3 dilated convolution with d = 3, a 3 × 3 dilated convolution with d = 5, and global average pooling were used to extract features, yielding five levels of image information. For the global average pooling branch, an adaptive average pooling layer first generates a 1 × 1 × 512 feature map; a 1 × 1 convolution then changes the number of channels to 256, after which bilinear interpolation expands its size to 14 × 14. The extracted feature maps of the five levels are concatenated with the original feature map to obtain a 14 × 14 × 1792 feature map, and a final 1 × 1 convolution changes the number of channels to 512. Each convolution is followed by a batch normalization layer and a ReLU activation function. Before each dilated convolution, the feature map is padded so that its resolution remains unchanged.
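The five-branch module above can be sketched in PyTorch as follows. This is a reconstruction under the stated channel counts (512 in, 256 per branch, 14 × 14 × 1792 after concatenation, 512 out); the exact wiring in the paper may differ, and the class name is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedModule(nn.Module):
    """Five parallel branches as described: a 1x1 conv, three 3x3 dilated
    convs (d = 2, 3, 5, padded to preserve resolution), and a global average
    pooling branch upsampled by bilinear interpolation; the outputs are
    concatenated with the input (512 + 5*256 = 1792 channels) and fused by a
    final 1x1 conv back to 512 channels."""
    def __init__(self, in_ch=512, branch_ch=256, out_ch=512):
        super().__init__()
        def conv(k, d):
            return nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, k, padding=(k // 2) * d, dilation=d),
                nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        self.b1 = conv(1, 1)                    # 1x1 standard convolution
        self.b2, self.b3, self.b4 = conv(3, 2), conv(3, 3), conv(3, 5)
        self.pool = nn.Sequential(              # global average pooling branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, branch_ch, 1),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + 5 * branch_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        gp = F.interpolate(self.pool(x), size=(h, w),
                           mode="bilinear", align_corners=False)
        cat = torch.cat([x, self.b1(x), self.b2(x), self.b3(x), self.b4(x), gp], 1)
        return self.fuse(cat)
```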

| Transfer learning
Transfer learning36 is a machine learning method in which a model trained on one task is transplanted to train other tasks. In this experiment, EfficientNet weights pretrained on ImageNet were loaded before training the proposed model on the three DR datasets, so as to obtain better results. In addition, to improve performance and alleviate the small data volume of the Messidor and IDRid datasets, the weights of the RAN model trained on the EyePACS dataset were transferred to the Messidor and IDRid datasets.
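A common way to transfer weights between models with different class counts (e.g. from the EyePACS-trained model to Messidor/IDRid) is to copy only the parameters whose names and shapes match, leaving the final classification layer freshly initialized. This is a generic sketch of that practice, not the authors' exact procedure; the function name and checkpoint path are placeholders.

```python
import torch

def transfer_weights(model, checkpoint_path, device="cpu"):
    """Load a saved state dict into `model`, keeping only parameters whose
    names and shapes match (the classifier head typically differs when the
    number of classes changes). Returns the number of transferred tensors."""
    state = torch.load(checkpoint_path, map_location=device)
    own = model.state_dict()
    matched = {k: v for k, v in state.items()
               if k in own and own[k].shape == v.shape}
    own.update(matched)
    model.load_state_dict(own)
    return len(matched)
```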

| Implementation details
The PyTorch framework and OpenCV image processing library were used in this experiment, implemented on the Ubuntu 16.04 operating system with a GeForce RTX 2080Ti graphics card. The Adam optimizer was used with an initial learning rate of 0.001; the batch size was 16 during training and 4 during testing, and a total of 60 epochs were trained. In addition, each image was initially scaled to 512 × 512 pixels before being sent to the network for training and testing. The test set was evaluated after every training epoch, and only the models and results with the highest SE and ACC were output.
In this experiment, the relationship between the model's predictions and the true labels of the data was evaluated in terms of true positives, false negatives, false positives, and true negatives. ACC, SE, SP, the receiver operating characteristic curve, and AUC were applied to evaluate the experimental results.
In clinical settings, a missed diagnosis has a greater adverse effect on patients; hence, SE is the more significant metric in DR classification. In the five-class DR experiment, the Kappa coefficient was also added as an evaluation criterion (Table 6).
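From the four counts above, ACC, SE, and SP follow directly; a minimal sketch of that computation (the function name is chosen for illustration):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TP / (TP + FN)), and specificity
    (TN / (TN + FP)) from binary labels, 1 = abnormal, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    se = tp / (tp + fn) if tp + fn else 0.0    # sensitivity: missed diagnoses hurt this
    sp = tn / (tn + fp) if tn + fp else 0.0    # specificity
    return acc, se, sp
```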

| RESULTS
In this paper, experiments were conducted on the three DR datasets (EyePACS, Messidor, and IDRid) using both commonly used DL methods and the proposed RAN. Cross-entropy and focal loss were each used in the DR and DME classification and diagnosis experiments, which were then compared and analyzed, as shown in Tables 7-12. On the EyePACS dataset, the SP, SE, and AUC of RAN for binary DR classification reached 0.894 (95% CI, 0.8646-0.9108), 0.930 (95% CI, 0.9047-0.9486), and 0.917 (95% CI, 0.8976-0.9287), respectively, while the ACC reached 0.892 (95% CI, 0.8782-0.9123); these were 4.6%, 3.7%, 9.4%, and 5.6% higher than VGG-16, respectively. RAN thus attained an excellent level of ACC in DR classification. The ACC of RAN in five-class DR classification reached 0.815 (95% CI, 0.8024-0.8456), 5.3% higher than that of VGG-16, while the Kappa score reached 0.865, higher than the 0.829 obtained in the DR classification competition. As seen in Figure 9, owing to the imbalance in the DR datasets, focal loss was more suitable than cross-entropy as the loss function in each classification task, and the ACC improved considerably as a result.

| DISCUSSION
This paper proposed a classification algorithm, RAN, for DR detection, with classification experiments verified on the EyePACS, Messidor, and IDRid datasets. Since imbalance between data categories leads to overfitting during model training, data augmentation and focal loss were introduced. The image augmentation methods used in this experiment bring the amount of data in each DR class to a relatively balanced state, and focal loss also achieved satisfactory results in alleviating data imbalance. To address the minor differences between DR categories, the original retinal images were normalized to highlight the bleeding and exudation in the fundus. In addition, the attention mechanism, which focuses on fine-grained image features during classification, was added to the network so that it can better distinguish the differences between lesion types, and dilated convolution was added to increase the receptive field. The above results demonstrate the strong competitiveness of CNNs in clinical diagnostic applications and show that RAN achieves better performance in DR detection. In short, the proposed RAN, a combination of ResNet, the attention mechanism, and dilated convolution, can enhance the ACC of DR classification and diagnosis for most fundus images.
However, the rise in ACC achieved by the proposed method is not yet large. In future studies, we will integrate additional DR-related information, such as age, blood glucose, blood pressure, intraocular pressure, and medical history, into the DR classification model to further improve diagnostic results. Moreover, multi-task experiments will be conducted so that related tasks can mutually improve the experimental results. How to integrate the results of exudate, bleeding, and microaneurysm detection and blood vessel segmentation into the DR classification model will also be a focus of our subsequent work. Algorithm engineers and clinicians both aspire to build a robust and accurate DL model for DR detection, and this goal cannot be achieved without the joint efforts and cooperation of both parties.