Intelligent detection method and applied research of diabetic retinopathy based on residual attention network

Diabetic retinopathy (DR) is a late-stage ocular complication of diabetes. A high-accuracy automatic screening technology for fundus images based on deep learning is of great significance in delaying the deterioration of DR. In this paper, we propose an end-to-end framework, the Residual Attention Network (RAN), for DR classification and diagnosis; it is built on ResNet, with an attention mechanism and dilated convolution added to the framework. We implemented experiments on three DR datasets, Kaggle, Messidor, and IDRid, and analyzed and compared the experimental results. A focal loss function is added to address the class-imbalance problem within the DR datasets. The results show that RAN consistently improves on the basic neural network when the same dataset is used. Therefore, by optimizing the basic neural network, the classification and diagnosis of DR can be improved.

With the rising prevalence of obesity worldwide, the future prevalence of diabetes will continue to rise, and the burden of diabetes will also increase [1]. Diabetes is a lifelong condition: the later it is discovered and the longer it persists, the higher the risk of complications. Eventually, complications of diabetes can be disabling and even life-threatening.
Diabetic retinopathy (DR) is a late manifestation of diabetes and one of the most severe complications of diabetic microangiopathy. If it is not detected and treated early, it causes irreversible visual impairment and, in severe cases, blindness. Fundus imaging is a vital examination method for the early detection of DR lesions. Because ophthalmologists are scarce in less developed areas, many patients with diabetes lack early diagnosis and treatment of DR. Therefore, computerized screening technology based on fundus images is of great significance in delaying the deterioration of DR.

Table 1.1 Clinical classification of diabetic retinopathy (DR) [2]

DR Level | Fundus Examination
No obvious retinopathy | No abnormality
Nonproliferative DR, Mild | Microaneurysms only
Nonproliferative DR, Moderate | Besides microaneurysms, a few hard exudates or small hemorrhage spots
Nonproliferative DR, Severe | No signs of proliferative DR, but besides the moderate lesions, one of the following (the "4-2-1 rule"): more than 20 intraretinal hemorrhages in each of four quadrants; definite retinal venous beading in two quadrants; prominent IRMA in one quadrant
Proliferative DR | One or more of the following: 1. neovascularization; 2. preretinal hemorrhage; 3. vitreous hemorrhage

Table 1.2 Clinical classification of diabetic macular edema (DME) [2]

DME Level | Fundus Examination
No obvious DME | No noticeable retinal thickening or hard exudates at the posterior pole
Obvious DME present | Noticeable retinal thickening or hard exudates at the posterior pole
Mild DME | Retinal thickening or hard exudates distant from the fovea
Moderate DME | Retinal thickening or hard exudates approaching but not involving the fovea
Severe DME | Retinal thickening or hard exudates involving the fovea

Figure 1.2 Schematic diagram of the 5 grades of clinical diabetic retina images
High-quality color retinal images can assist doctors in diagnosing retinopathy. However, the diagnosis of DR requires a clinically experienced ophthalmologist, and DR screening has not been carried out in most grass-roots areas, which significantly increases the risk of blindness due to diabetes [3]. Therefore, computer-assisted remote diagnostic technology for fundus images can effectively reduce the visual impairment of diabetic patients caused by insufficient medical resources. This study uses deep learning (DL) methods to process fundus images, laying the foundation for a remote automatic fundus-image screening system.

Related Work
At present, most work in the field of ophthalmic image analysis focuses on DR classification and on the segmentation and detection of retinal structures such as the optic disc, macula, blood vessels, and abnormal regions (hard exudates, soft exudates, hemorrhages, microaneurysms). Rahim et al. [6] presented an automatic detection method for diabetic retinopathy and maculopathy in fundus images employing fuzzy image processing: a combination of fuzzy image processing techniques, the circular Hough transform, and several feature extraction methods.
Eftekhari et al. [7] used a two-step process and two online datasets to train a CNN, which addresses the imbalance problem and reduces training time while detecting accurately. Seth et al. [8] combined convolutional neural networks with a linear support vector machine, training the network on the benchmark EyePACS dataset.
Experimental results show that the model has high sensitivity and specificity in detecting diabetic retinopathy. Dutta et al. [9] proposed an automatic knowledge model to identify critical indicators of DR; after testing with a CPU-trained neural network model, three types of back-propagation neural networks were used, and the model was able to quantify the characteristics of different types of blood vessels, exudates, hemorrhages, and microaneurysms. Benzamin et al. [10] proposed a CNN-based deep learning algorithm that detects hard exudates in fundus images to improve diagnostic accuracy and assist clinicians in their work, and Adem et al. [11] pursued the same goals. In [14], VggNet and GoogleNet were applied to DR classification on the DR1 and Messidor datasets, reaching a sensitivity of 97.11%, specificity of 86.03%, accuracy of 92.01%, and AUC of 0.9834. Gargeya et al. [15] proposed a data-driven deep neural network based on ResNet.

Research Status of Deep Learning Methods
Building on these deep learning methods, the Residual Attention Network proposed in this paper mainly comprises an encoder, a residual attention module, and a dilated convolution module.

Encoder
The primary function of the encoder is to extract image features carrying high-level semantic information. Generally, the deeper the network, the stronger its ability to extract features. But when the network grows beyond a certain depth, the vanishing-gradient problem occurs, which degrades network performance. ResNet [23] solves this problem through residual connections, which allow the network to be deeper while its feature extraction grows stronger. It is a structure designed on the basis of VGG [24]; its key addition is the skip connection, which realizes residual learning and identity mapping so that network depth actually pays off.
From an intuitive perspective, residual learning has less content to learn, so the learning difficulty is lower. A residual unit can be expressed as

    y_l = h(x_l) + F(x_l, W_l),    (1)
    x_{l+1} = f(y_l),              (2)

where h(x_l) = x_l is the identity mapping and f is the activation. With identity mappings, unrolling from a shallow unit l to a deeper unit L gives

    x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i).    (3)

According to the chain rule, the gradient in the backward pass can be expressed as

    ∂loss/∂x_l = (∂loss/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i)),    (4)

so the gradient reaching x_l always contains the direct term ∂loss/∂x_L, which prevents it from vanishing.
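The identity-skip formulation x_{l+1} = x_l + F(x_l, W_l) can be sketched as a minimal PyTorch block. The layer sizes below are illustrative, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Identity-skip unit: x_{l+1} = x_l + F(x_l, W_l)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.relu(self.bn1(self.conv1(x)))  # F(x, W): two 3x3 convs
        f = self.bn2(self.conv2(f))
        return self.relu(x + f)  # the identity term keeps a direct gradient path

block = BasicResidualBlock(64)
x = torch.randn(1, 64, 14, 14)
out = block(x)  # same shape as the input
```

Because the skip is a plain addition, the block leaves the feature-map shape unchanged and can be stacked arbitrarily deep.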

Residual Attention Module
The attention mechanism in computer vision imitates human visual attention. The human brain locates a target area and allocates more attention to it while assigning less attention to the unimportant surrounding areas, thereby gathering more useful information and suppressing useless information. In traditional image processing, saliency detection, image feature extraction, and sliding-window methods can also be regarded as attention mechanisms. The attention mechanism in deep learning mainly includes two parts: learning a weight distribution (different parts of the input image or feature map receive different weights) and task focusing (dividing the task, designing different sub-networks focused on different subtasks, and redistributing the learning capacity of the network).
As shown in Figure 3.2, the attention guided module (AGM) is composed of an adaptive average pooling layer followed by two 1×1 convolution layers with different activation functions. The specific operation is as follows. First, the input feature map passes through an adaptive average pooling layer, producing an output of dimension 1×1×C. Then a 1×1 convolution layer with a ReLU activation function produces an output of dimension 1×1×C/r, reducing the number of channels from C to C/r. Next, a 1×1 convolution layer with a sigmoid activation function expands the number of channels from C/r back to C, yielding a channel descriptor of dimension 1×1×C used to recalibrate the original feature map. The hyper-parameter r controls the computational cost of the AGM and is set to 16 in the experiments. Finally, multiplying the channel descriptor with the input feature map completes the recalibration: by integrating global information, the importance of each channel is re-weighted, highlighting important information and suppressing background information.

Figure 3.2 Attention guided module
In this DR classification experiment, regions such as hard exudates, cotton wool spots, hemorrhages, and microaneurysms in the fundus image are the areas of focus [25]. Attention methods in the neural network can amplify information from these abnormal lesion areas and suppress other background information, which improves the accuracy of the model in the DR classification task.
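The AGM described above follows the squeeze-and-excitation pattern (pool, reduce to C/r, restore to C, multiply back). A hedged PyTorch sketch, with the channel count and r = 16 taken from the text and everything else assumed, might read:

```python
import torch
import torch.nn as nn

class AttentionGuidedModule(nn.Module):
    """Channel recalibration as described in the text:
    GAP -> 1x1 conv (ReLU, C -> C/r) -> 1x1 conv (sigmoid, C/r -> C)
    -> multiply the descriptor back onto the input feature map."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # output: B x C x 1 x 1
        self.reduce = nn.Conv2d(channels, channels // r, 1)  # C -> C/r
        self.expand = nn.Conv2d(channels // r, channels, 1)  # C/r -> C
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        w = self.sigmoid(self.expand(self.relu(self.reduce(self.pool(x)))))
        return x * w  # per-channel recalibration of the input

agm = AttentionGuidedModule(512, r=16)
feat = torch.randn(2, 512, 14, 14)
recal = agm(feat)  # same shape, channels re-weighted in [0, 1]
```

The sigmoid keeps each channel weight in (0, 1), so the module can only attenuate or preserve channels, never amplify them beyond the input.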
The structure of the residual attention module is shown in Figure 3.3. Based on ResNet, attention structures are stacked to reshape the attention over features; as the network deepens, the attention modules adapt accordingly [26]. Each attention module contains upsampling and downsampling structures and is divided into two branches, as Figure 3.4 shows: the soft mask branch (attention branch) and the trunk branch (original branch). The attention output is

    H_{i,c}(x) = (1 + M_{i,c}(x)) · T_{i,c}(x),

where T denotes the trunk branch and M the mask branch. The mask branch uses several max-pooling operations to enlarge the receptive field and, after reaching the minimum resolution, a symmetric structure of upsampling layers restores the features to their original size. The trunk branch can adopt any state-of-the-art structure (such as residual modules or inception modules), and the whole module can be easily attached to other networks in a plug-and-play manner [27]. By stacking this residual attention structure, the advantages of residual learning and the attention mechanism are thoroughly combined to achieve better results.

Dilated Convolution Module
To expand the receptive field, this article also introduces a dilated convolution module. Deep features carry high-level semantic information but lose resolution; shallow features have high resolution but low semantic level. Dilated convolution can expand the receptive field of the network without reducing resolution. In one dimension it can be expressed as

    y[i] = Σ_{k=1}^{K} x[i + d·k] · w[k],

where y[i] is the output feature map, x[i] is the input feature map, d is the dilation rate, w[k] is the k-th parameter of the convolution kernel, and K is the size of the convolution kernel. As shown in Figure 3.5, dilated convolution is equivalent to inserting d−1 gaps between adjacent convolution kernel parameters. When the dilation rate d = 1, the dilated convolution degenerates into a standard convolution; the larger d is, the larger the receptive field of the kernel. In this article, five levels of image information are extracted in parallel: a 1×1 standard convolution, three 3×3 dilated convolutions with dilation rates d = 2, 3, and 5, and global average pooling. The global-average-pooling branch works as follows: first, an adaptive average pooling layer generates a feature map of dimension 1×1×512; second, a 1×1 convolution reduces the number of channels to 256, and bilinear interpolation expands its size to 14×14; third, the five extracted feature maps are concatenated with the original feature map to obtain a feature map of dimension 14×14×1792, and finally a 1×1 convolution reduces the number of channels to 512. Each convolution is followed by a batch normalization (BN) layer and a ReLU activation function. Before each dilated convolution, the feature map is padded so that its resolution is unchanged.
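The one-dimensional form of the dilated-convolution formula can be checked in a few lines of plain Python; the toy signal and kernel below are illustrative, and only "valid" output positions are kept (no padding):

```python
def dilated_conv1d(x, w, d):
    """y[i] = sum_{k} x[i + d*k] * w[k], valid positions only."""
    K = len(w)
    span = d * (K - 1)  # distance covered by the dilated kernel
    return [sum(x[i + d * k] * w[k] for k in range(K))
            for i in range(len(x) - span)]

x = [1, 2, 3, 4, 5, 6]
# d = 1 reduces to a standard (cross-correlation form) convolution
std = dilated_conv1d(x, [1, 1, 1], d=1)   # [6, 9, 12, 15]
# d = 2 taps every second sample: the receptive field grows, output shrinks
dil = dilated_conv1d(x, [1, 1, 1], d=2)   # [9, 12]
```

With a box kernel [1, 1, 1], each d = 2 output sums samples two apart (e.g. x[0] + x[2] + x[4] = 9), which is exactly the d − 1 gap-insertion picture in Figure 3.5.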

Loss Function
The loss function in a neural network measures the gap between the model's predictions and the actual values of the data; it is also a standard for measuring the generalization ability of the model. The smaller the loss, the better the model performs, and different models generally use different loss functions.
3.5.1 Cross Entropy [28]
In this experiment, the classification module performs the main task of DR classification: predicting the label category of each input image. The most commonly used loss function for this is cross entropy, also known as log-likelihood loss or logarithmic loss (called logistic loss in binary classification). It describes the difference between two probability distributions:

    L_CE = − Σ_i y_i log ŷ_i,

where y is the original image label, ŷ is the probability predicted by the classifier, and the weights of the classification module are optimized to minimize this loss.
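The cross-entropy formula above is simple enough to verify directly; the one-hot label and predicted probabilities below are toy values:

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_i y_i * log(p_i) for a one-hot label y and predicted probs p."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, y_pred))

# one-hot label [1, 0], classifier assigns 0.8 to the true class
loss = cross_entropy([1, 0], [0.8, 0.2])  # = -log(0.8), about 0.2231
```

Only the term for the true class survives the sum, so a confident correct prediction (p close to 1) drives the loss toward 0; the eps guard avoids log(0).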

Focal Loss
Since class imbalance is common in DR datasets, focal loss, designed to address this imbalance, is introduced in this experiment. It modifies cross entropy by multiplying it with a factor that weakens the contribution of easily detected samples to model training, so that focal loss mitigates the imbalance between positive and negative samples and relieves the tendency of detection losses to be dominated by a large number of negative samples. Focal loss is defined as

    FL(p_t) = −(1 − p_t)^γ log(p_t),

where γ ≥ 0 is the focusing parameter and (1 − p_t)^γ is the modulating factor. The modulating factor reduces the weight of easily classified samples, so that the model focuses more on hard samples during training.
Focal loss has two important properties. ① When a sample is misclassified, p_t is small, so the modulating factor (1 − p_t)^γ is close to 1 and the loss is almost unaffected; when p_t → 1, the factor approaches 0, so the loss of well-classified samples is down-weighted. ② When γ = 0, focal loss reduces to cross entropy; as γ increases, the effect of the modulating factor also increases.
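Both properties can be checked with a few lines of plain Python; the probabilities below are toy values (the α-balanced variant of focal loss is omitted for brevity):

```python
import math

def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)**gamma * log(p_t); gamma = 0 recovers cross entropy."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# property 2: gamma = 0 is plain cross entropy
ce = focal_loss(0.8, gamma=0.0)       # equals -log(0.8)

# property 1: an easy sample (p_t = 0.9) is heavily down-weighted,
# while a hard sample (p_t = 0.1) keeps most of its cross-entropy value
easy = focal_loss(0.9, gamma=2.0)     # about 0.001
hard = focal_loss(0.1, gamma=2.0)     # about 1.865
```

With γ = 2, the easy sample's loss shrinks by a factor of (1 − 0.9)² = 0.01, while the hard sample's loss is only scaled by (1 − 0.1)² = 0.81, which is exactly the re-focusing effect described above.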
3.6 Transfer Learning [30]
Transfer learning is a machine learning method that transplants a model trained on one task into the training of another task. In this experiment, given the insufficient training data, the pre-trained EfficientNet weights from the ImageNet dataset are loaded so that the model has a better weight initialization before gradient optimization begins.
Considering the huge difference between fundus image datasets and the ImageNet dataset, every network layer is retrained during the experiment. Because the number of abnormal images in the Messidor and IDRid datasets is too small, binary classification is more meaningful for clinical application. The datasets above also exhibit the most prominent feature of medical images: imbalanced data distribution, i.e., the number of normal images is much higher than that of abnormal images, and the amount of data decreases as disease severity increases. The most commonly used remedy is data augmentation to expand the lesion samples; improving the loss function or the network structure are also widely used optimization methods.

Materials and Approach
From each of the above three datasets, we randomly selected 60%, 15%, and 25% of the images as the training, validation, and test sets, respectively.

Image Preprocessing
Since all widely used public DR datasets suffer from severely imbalanced data distributions, image preprocessing is used in this experiment to increase the amount of data. The purpose of image enhancement is to process the acquired images so that the features of interest have better contrast and visibility.

Data Augmentation
Commonly used augmentation methods such as translation, rotation, cropping, scaling, noise addition, and affine transformation usually do not change the type of object; they are the earliest and most widely used class of image augmentation.
Another class changes color: the brightness, contrast, saturation, and hue of the image can each be adjusted. In practical applications, multiple augmentation methods are usually stacked, as shown in Figure 4.3. In addition, due to limited computing resources, each image is first scaled to 224×224 pixels before being fed to the network for training and testing.

Implementation Details
The experiments use the PyTorch deep learning framework and the OpenCV image processing library, implemented on the Ubuntu 16.04 operating system with a GeForce GTX 2080 Ti graphics card. The Adam optimizer is used with an initial learning rate of 0.001; the batch size is set to 8 for training and 1 for testing, and 60 epochs are trained in total. The test set is evaluated after every training epoch, and we report only the models and results with the highest sensitivity and accuracy values.
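The train-evaluate-checkpoint schedule described above can be sketched as a skeleton. The model, data handling, and evaluation are placeholders (the actual RAN and loaders are not shown in the text); only the optimizer settings and epoch count mirror the paper:

```python
import torch

model = torch.nn.Linear(16, 5)   # stand-in for RAN (5 DR grades)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

best_se = 0.0
for epoch in range(60):          # 60 epochs, as stated in the text
    # ... one training pass with batch size 8 would go here ...
    se = 0.0                     # placeholder: evaluate sensitivity on the test set
    if se > best_se:             # keep only the best-scoring checkpoint
        best_se = se
        torch.save(model.state_dict(), "ran_best.pth")
```

Checkpointing on the best validation metric, rather than saving the last epoch, is what makes "we only output models with the highest sensitivity and accuracy" reproducible.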

Evaluation Index
In this experiment, the relationship between the model prediction and the true label is evaluated by the following criteria. A true positive (TP) is a DR image predicted to be DR; a true negative (TN) is a normal image predicted to be normal; a false positive (FP) is a normal image predicted to be DR; and a false negative (FN) is a DR image predicted to be normal. Sensitivity (SE) and specificity (SP) are calculated as

    SE = TP / (TP + FN),    SP = TN / (TN + FP).

The higher the SE, the greater the probability that a DR image is diagnosed; the higher the SP, the greater the probability that a normal image is predicted to be normal. In clinical applications, a missed diagnosis harms patients more, so SE is the more significant metric in DR classification. ACC represents the probability of correct classification over all samples:

    ACC = (TP + TN) / (TP + TN + FP + FN).
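The three metrics follow directly from the confusion-matrix counts; the counts below are toy values for illustration:

```python
def se_sp_acc(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    se = tp / (tp + fn)                     # fraction of DR images caught
    sp = tn / (tn + fp)                     # fraction of normals kept normal
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall fraction correct
    return se, sp, acc

se, sp, acc = se_sp_acc(tp=40, fn=10, tn=45, fp=5)  # -> 0.8, 0.9, 0.85
```

Note that with imbalanced data, ACC alone can look good while SE is poor, which is exactly why SE is emphasized for DR screening.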

Results and Discussion
In this paper, the commonly used deep learning methods and our proposed RAN are evaluated on the three DR datasets Kaggle, Messidor, and IDRid, with cross entropy and focal loss as the loss functions, in classification and diagnosis experiments for DR and DME; the results are then compared and analyzed as follows.

Messidor Results
As can be seen from Figures 5.1-5.3, because of the imbalance problem in the DR datasets, focal loss is a more suitable loss function than cross entropy in every classification task, and accuracy is greatly improved.

Figure 5.4 Visualization of Grad-CAM [39] DR classification
In addition, we used Grad-CAM to visualize the attention heat map during DR classification of fundus images. As shown in Figure 5.4, the optimized method focuses more clearly on the abnormal parts than the basic neural network structure does.
The above experimental results show the strong competitiveness of CNNs in clinical diagnostic applications, and RAN achieves good results on the DR classification task. The augmentation used in this experiment brings the amount of data in each DR class to a relatively balanced state, and the loss-function optimization also alleviates the data-imbalance problem satisfactorily. The attention mechanism added to the model lets it attend to fine-grained image features during classification and actively assists the network's feature extraction.
Optimization methods such as dilated convolution can also improve the results of the neural network. In short, our RAN enhances the accuracy of DR classification and diagnosis on most fundus images.

Conclusion
This paper proposes a classification algorithm, the Residual Attention Network (RAN), combining an attention mechanism and dilated convolution for diabetic retinopathy (DR) detection. The classification performance of the model is verified on the Kaggle, Messidor, and IDRid competition data. Since imbalance between data categories leads to overfitting during model training, data augmentation and focal loss are used. To address the minor differences between DR categories, we performed a series of preprocessing steps on the original retinal images to make hemorrhages and exudates in the fundus images more obvious. Then an attention mechanism is added to the network to extract fine-grained image features so that the network can better distinguish lesion types, and dilated convolution is used to increase the receptive field. Through this combination of a ResNet-based residual network, attention mechanism, and dilated convolution, the accuracy of the DR classification task can be improved. However, the accuracy gain of this method is not yet significant enough. In future work, we will therefore integrate prior knowledge of age, blood glucose, blood pressure, intraocular pressure, and medical history into the DR classification model to bring in more disease-related information and effectively improve diagnosis. In addition, multi-task experiments can mutually promote improved results; how to integrate optic disc detection, macula detection, and blood vessel segmentation into the DR classification model will also be a focus of future work. Building a robust and accurate deep learning model for DR screening is the common aspiration of algorithm engineers and clinicians, and it cannot be achieved without the joint efforts and cooperation of both parties.
Disclosures. The authors declare that there are no conflicts of interest related to this article.