This section describes the framework of our multi-dimensional cascaded network and the structure of each of its parts in detail. The aim of this study is to build a more efficient neural network structure for the automatic classification of brain tumors in MRI images. UNet was found to be more effective than other network architectures for this purpose. UNet evolved from the traditional CNN and was first designed and applied in 2015 to process biomedical images. A general convolutional neural network focuses on image classification, where the input is an image and the output is a single label. Biomedical imaging, however, requires us not only to determine whether a disease is present but also to localize the abnormal area. UNet addresses this problem with fast and precise image segmentation: it localizes and distinguishes borders by classifying every pixel, so that the input and output share the same size.

This work focuses on the classification module of the CAD system and presents a deep learning-based UNet for the automated detection of cancer in MRI images. We propose an efficient attention network with nested connections, as shown in Fig. 1. The architecture is based on the popular UNet, an encoder-decoder structure designed to work well with a small number of training samples.

## 3.1. NESTED CONNECTION

The success of UNet depends largely on its skip connections, which combine the coarse-grained features of the decoder with the fine-grained features of the encoder by element-wise addition. Zhou et al. (2020) redesigned the skip connection of UNet and proposed a new model, Light weight UNet (LUNet). It replaces the original simple connection with a nested connection, which captures richer multi-level features and then integrates them by element-wise addition. With a nested connection, the encoder is not connected directly to the decoder after producing the final aggregated feature map. Instead, the features pass through the nested connection first, so that the rich features captured earlier are better preserved.

Let us take a closer look at the nested connection. Let $x^{i,j}$ denote the output of node $X^{i,j}$, where index $i$ follows the max-pooling operations of the encoder and index $j$ follows the transpose convolution operations of the decoder. The concatenated feature maps are expressed as

$$
x^{i,j} =
\begin{cases}
F\left(M\left(x^{i-1,j}\right)\right), & i > 0,\ j = 0 \\
F\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1},\ T\left(x^{i+1,j-1}\right)\right]\right), & j > 0
\end{cases}
$$

where the function $F(\cdot)$ consists of four operations, two convolution layers and two DropBlock [14] layers arranged alternately; $M(\cdot)$ and $T(\cdot)$ represent the max-pooling layer and the transpose convolution layer, respectively; and $[\,\cdot\,]$ denotes the concatenation layer. When $i = 0$ and $j = 0$, $F(\cdot)$ is applied to the original input image. Nodes with $i > 0$ and $j = 0$ receive only one input, from the preceding layer. Nodes with $i \ge 0$ and $j > 0$ receive $j + 1$ inputs. Take decoder node $X^{0,2}$ as an example: it has three inputs, namely the previous decoder node $X^{1,1}$, the adjacent nested decoder node $X^{0,1}$, and the skip connection node $X^{0,0}$. Collecting multiple feature maps in this way provides richer semantic information for the final segmentation.
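The wiring rule can be illustrated with a short Python sketch that lists the inputs of a node X i,j. It assumes that a first-column node (j = 0, i > 0) takes only the max-pooled output of the node above it, and that any other node takes the j earlier nodes on its own level plus the transpose-convolved node from the level below. The function name `node_inputs` and the operation labels are hypothetical and describe only the connectivity, not the layers themselves.

```python
def node_inputs(i, j):
    """List the (source node, operation) pairs feeding node X i,j.

    Assumed convention: i is the encoder depth, j is the position
    along the nested skip pathway of that depth.
    """
    if i < 0 or j < 0:
        raise ValueError("indices must be non-negative")
    if j == 0:
        if i == 0:
            # The top-left node works directly on the input image.
            return [("image", "input")]
        # Encoder nodes receive one input from the level above.
        return [(f"X{i - 1},0", "max_pool")]
    # Nested and decoder nodes: all earlier nodes on the same level ...
    inputs = [(f"X{i},{k}", "skip") for k in range(j)]
    # ... plus the up-sampled node from the level below.
    inputs.append((f"X{i + 1},{j - 1}", "transpose_conv"))
    return inputs


if __name__ == "__main__":
    for source, op in node_inputs(0, 2):
        print(source, op)
```

Calling `node_inputs(0, 2)` returns three entries (X0,0, X0,1 and X1,1), which matches the j + 1 inputs described for node X0,2.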

## 3.2. EFFICIENT ATTENTION

Vaswani et al. (2017) proposed a new structure that connects encoders and decoders through Self-Attention (SA) and achieved great success in machine translation tasks. Subsequently, many works have applied SA to computer vision tasks with good results. The Squeeze-and-Excitation (SE) module (Hu et al., 2020) starts from the perspective of channels and constructs informative features by fusing the channel information within the local receptive fields of each layer. The SE module significantly improves the performance of state-of-the-art Convolutional Neural Networks (CNNs) while only slightly increasing the computational cost. Fu et al. (2019) proposed a dual attention (DA) network that combines spatial attention and channel attention for the adaptive fusion of local and global features, and it achieved the best performance in scene segmentation tasks.

Wang et al. (2020) found through in-depth research on channel attention that appropriate cross-channel interaction is very important for learning more effective feature maps. Local cross-channel interaction can be achieved through one-dimensional convolution, and a nonlinear mapping is used to adaptively determine the size of the convolution kernel. The Efficient Attention (EA) module can be flexibly inserted into existing CNNs, and its structural details are shown in Fig. 3.

The function of the efficient attention module is to obtain richer channel information. Group convolution is a method that manually adjusts the size of the convolution kernel according to the number of channels. However, manual adjustment consumes too many resources, so an adaptive adjustment strategy is designed. There is a mapping θ(•) between the convolution kernel size k and the channel dimension C:

$$
C = \theta(k)
$$

The channel dimension C is usually set to a power of 2. Therefore, a nonlinear function θ(•) is designed:

$$
\theta(k) = 2^{(\alpha \cdot k - \beta)}
$$

When the channel dimension C is given, inverting the mapping θ(•) yields the convolution kernel size k adaptively:

$$
k = O\left(\frac{\log_2 C + \beta}{\alpha}\right)
$$

where the function O(·) takes the nearest odd number. In the later experiments, we set α and β to 2 and 1, respectively. In this way, higher-dimensional channels get a larger convolution kernel size, while lower-dimensional channels get a smaller one through the nonlinear mapping.
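As a rough illustration of the adaptive rule, the sketch below computes k from C by inverting θ(k) = 2^(α·k − β), with α = 2 and β = 1 as stated above. The function name `adaptive_kernel_size` is hypothetical, and the rounding convention (truncate, then move even values up to the next odd number) is an assumption; the text only states that the nearest odd number is taken.

```python
import math


def adaptive_kernel_size(channels, alpha=2, beta=1):
    """Kernel size k for a given channel dimension C.

    Inverts C = 2^(alpha * k - beta), i.e. k = (log2(C) + beta) / alpha,
    then maps the result to an odd integer. The rounding rule used here
    is an assumed convention, not taken from the text.
    """
    if channels < 1:
        raise ValueError("channels must be a positive integer")
    raw = (math.log2(channels) + beta) / alpha
    k = int(raw)  # truncate toward zero
    return k if k % 2 == 1 else k + 1  # bump even values to the next odd


if __name__ == "__main__":
    for c in (16, 64, 128, 512):
        print(c, adaptive_kernel_size(c))
```

Under these assumptions, 64 channels map to a kernel size of 3 and 512 channels to a kernel size of 5, so wider layers interact over a larger neighborhood of channels.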

## 3.3. DROPBLOCK REGULARIZATION

In 2012, the Hinton team proposed an effective regularization method called Dropout (Srivastava et al., 2014), which can be used to prevent overfitting. Dropout is widely used in fully connected layers, but it is usually less effective for convolutional layers. The likely reason is that adjacent elements in a convolutional feature map share semantic information spatially, so even when a unit is discarded, its neighbors still retain the semantic information of that position, and the information can still flow through the convolutional network. Mainstream segmentation networks do not include fully connected layers, so Google Brain proposed DropBlock (Ghiasi et al., 2018), a simple regularization method similar to Dropout, for this situation.

The main difference from Dropout is that DropBlock removes contiguous regions from the feature map of a layer instead of discarding independent random units. Its first parameter, µ, is the size of the block to be dropped, and its second parameter, γ, controls the number of activation units to be dropped. As with Dropout, we do not apply DropBlock during inference. In the experiments, we set a constant block size µ for all feature maps, regardless of their resolution. When µ = 1, DropBlock is equivalent to Dropout; when µ covers the entire feature map, it resembles SpatialDropout. To set the value of γ, we assume that each activation unit should be kept with probability $p_{keep}$. For a feature map of size λ × λ, γ can be calculated as:

$$
\gamma = \frac{1 - p_{keep}}{\mu^2} \cdot \frac{\lambda^2}{(\lambda - \mu + 1)^2}
$$

To ensure that each µ² block is contained in the λ² feature map region, we adjust the initial binary mask when sampling so that the size of the effective seed region is (λ − µ + 1)². We set $p_{keep}$ to 0.9 and µ to 7.
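The relation between γ, the keep probability and the block size can be checked numerically. The sketch below follows the formula given in the DropBlock paper (Ghiasi et al., 2018), γ = (1 − p_keep)/µ² · λ²/(λ − µ + 1)², where λ is the side length of a square feature map. The function name `dropblock_gamma` and the feature-map size of 48 in the demo are hypothetical choices for illustration only.

```python
def dropblock_gamma(p_keep, mu, lam):
    """Seed drop rate gamma for DropBlock.

    p_keep: probability that an activation unit is kept
    mu:     side length of the dropped block
    lam:    side length of the (square) feature map
    """
    if not 0.0 <= p_keep <= 1.0:
        raise ValueError("p_keep must lie in [0, 1]")
    if not 1 <= mu <= lam:
        raise ValueError("block size must satisfy 1 <= mu <= lam")
    # Block centers may only be sampled where the whole block fits.
    seed_region = (lam - mu + 1) ** 2
    return (1.0 - p_keep) / mu ** 2 * lam ** 2 / seed_region


if __name__ == "__main__":
    # p_keep = 0.9 and mu = 7 follow the text; lam = 48 is an assumed size.
    print(dropblock_gamma(0.9, 7, 48))
    # With mu = 1 the rate reduces to the Dropout rate 1 - p_keep.
    print(dropblock_gamma(0.9, 1, 48))
```

With µ = 1 the seed region equals the whole feature map and γ reduces to 1 − p_keep, which is consistent with DropBlock reducing to Dropout in that case.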

## 3.4. FUSION LOSS FUNCTION

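The most common loss function for image semantic segmentation is the pixel-level cross-entropy loss. It examines each pixel individually and compares the class prediction with the one-hot encoded target. Binary Cross Entropy (BCE) loss covers the case of only two categories, and its formula is:

$$
L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[g_i \log p_i + (1 - g_i)\log(1 - p_i)\right]
$$

where $p_i$ is the prediction for pixel $i$, $g_i$ is its ground-truth label, and $N$ is the total number of pixels.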

Because the cross-entropy loss evaluates the class prediction of each pixel separately and then averages the loss over all pixels, every pixel in the image contributes equally to learning. If the class distribution in the image is unbalanced, training may be dominated by the classes with many pixels: the model mainly learns the features of the majority class and becomes biased toward predicting it.

To overcome the problem that the foreground region is difficult to detect completely, Milletari et al. (2016) proposed a new loss function based on the Dice coefficient. The Dice coefficient D is expressed as

$$
D = \frac{2\sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2}
$$

where $p_i \in P$ represents the predicted binary vector, $g_i \in G$ represents the ground-truth binary vector, and $N$ is the total number of pixels.

Using this formula, we can establish an appropriate balance between the region of interest and the background without assigning different weights to them. In our segmentation task, we fuse the BCE loss and the Dice loss by simple addition and call the result the fusion loss function.
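A minimal pure-Python sketch of such a fused loss is given below for flat lists of per-pixel predictions and binary labels. It makes several assumptions that the text does not state: the Dice loss is taken as 1 − D, the Dice coefficient uses the squared-sum denominator of Milletari et al. (2016), and a small constant `eps` is added for numerical stability. The function names are hypothetical.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return -total / len(pred)


def dice_coefficient(pred, target, eps=1e-7):
    """Dice coefficient D = 2 * sum(p*g) / (sum(p^2) + sum(g^2))."""
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(g * g for g in target)
    return (2.0 * intersection + eps) / (denominator + eps)


def fusion_loss(pred, target):
    """BCE loss plus Dice loss (assumed here to be 1 - D), simply added."""
    return bce_loss(pred, target) + (1.0 - dice_coefficient(pred, target))


if __name__ == "__main__":
    target = [0, 0, 0, 0, 0, 0, 0, 1]        # one foreground pixel in eight
    all_background = [0.0] * 8                # ignores the foreground entirely
    good = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9]
    print(fusion_loss(all_background, target))
    print(fusion_loss(good, target))
```

In the demo, a prediction that ignores the single foreground pixel has a Dice coefficient near zero, so the Dice term contributes a loss close to 1 no matter how many background pixels are classified correctly.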