Application of Multi-Scale Fusion Attention U-Net to Segment the Thyroid Gland on CT Localization Images for Radiotherapy

Xiaobo Wen, the Third Affiliated Hospital of Kunming Medical University
Biao Zhao, the Third Affiliated Hospital of Kunming Medical University
Meifang Yuan, the Third Affiliated Hospital of Kunming Medical University
Jinzhi Li, the Third Affiliated Hospital of Kunming Medical University
Mengzhen Sun, the Third Affiliated Hospital of Kunming Medical University
Lishuang Ma, the Third Affiliated Hospital of Kunming Medical University
Chaoxi Sun, the Third Affiliated Hospital of Kunming Medical University
Yi Yang (yiyangrt@126.com), the Third Affiliated Hospital of Kunming Medical University


Head and neck tumors and breast cancer are currently the tumors with the highest morbidity and mortality worldwide (1). In 2020, there were 19.29 million new cancer cases worldwide, of which 4.57 million (23.7%) were in China. Radiotherapy is an effective and common method for the treatment of head and neck cancer and breast cancer (2-4). Accurately delineating organs at risk (OARs) when designing radiotherapy plans can effectively avoid radiation side effects. At present, OARs are mainly delineated manually by physicians; thus, the process is subjective, time-consuming, and labor-intensive.

With the rapid development of artificial intelligence, Ronneberger (5) and others proposed the U-

However, in segmentation studies on OARs for head and neck tumors and breast cancer, the thyroid gland has often been ignored and has not been considered an OAR for automatic segmentation (8); therefore, the automatic delineation of the thyroid gland on CT localization images for radiotherapy has rarely been studied. It has been shown that side effects, such as a decline in thyroid function, occur when the radiation dose to the thyroid gland exceeds 26 Gy (11). Franco et al. (12) conducted a retrospective study on 3-dimensional conformal radiation therapy (3D-CRT) for breast cancer. They found that in about 45% of patients with lymph node-positive breast cancer, the thyroid gland was exposed to a radiation dose higher than 26 Gy. Other studies (13, 14) have shown that, 5-10 years after radiotherapy, the incidence of hypothyroidism in patients with nasopharyngeal cancer or breast cancer is 20% to 52%; moreover, the incidence of hypothyroidism increases with the length of follow-up. Therefore, during radiotherapy planning, it is necessary to limit the radiation dose to the thyroid gland. Because CT localization for radiotherapy uses a simulated-positioning large-aperture CT (Somatom Sensation Open, 24 rows, Φ85 cm), which is limited by small size and poor image resolution, automatic segmentation of the thyroid gland with a deep learning model is difficult. It is necessary to further explore the performance of deep learning models on CT localization images for radiotherapy. This study proposed a model that combines the cSE attention mechanism and HR-net on the basis of U-net, and applied it to segment the thyroid gland on CT localization images so as to assist with the delineation of the thyroid gland as an OAR in radiotherapy.

The model in this study was improved on the basis of the U-net and HR-net model architectures.

The main improvement in this study included the replacement of the two feature extraction [...] height × width to Channel × 1 × 1, and then used a Dense layer with ReLU activation to reduce the number of feature channels by half. After that, the number of feature channels was restored to its original size by a second Dense layer, and a Sigmoid activation was applied. Finally, the calibrated feature map was obtained through channel-wise multiplication. The schematic diagram of the residual connection and cSE module structure is shown in Figure 3. The residual connection effectively prevents gradients from vanishing or exploding as the network deepens (17). Moreover, the cSE module is able to reflect the relationship between different channels and assign them different weights, so that during training the model can focus on the features that matter for accurately segmenting the thyroid gland. The whole module is called an Attention Resblock (Figures 2 and 3).
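The squeeze, reduce, restore, and rescale steps of the cSE module described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `channel_gates` and `cse_block`, the nested-list layout (channels × height × width), the omission of bias terms, and the hand-picked weights are all hypothetical. In the actual model the two Dense layers would be learned, and the residual connection of the Attention Resblock is not shown here.

```python
import math


def channel_gates(feature_map, w_reduce, w_restore):
    """Return one weight in (0, 1) per channel.

    feature_map: list of C channels, each an H x W nested list.
    w_reduce:    (C/2) x C weights of the first dense layer (bias omitted, an assumption).
    w_restore:   C x (C/2) weights of the second dense layer (bias omitted, an assumption).
    """
    # Squeeze: global average pooling, C x H x W -> C x 1 x 1
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]
    # First dense layer halves the channel count, followed by ReLU
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]
    # Second dense layer restores the channel count, followed by Sigmoid
    return [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_restore
    ]


def cse_block(feature_map, w_reduce, w_restore):
    """Recalibrate the feature map by channel-wise multiplication with the gates."""
    gates = channel_gates(feature_map, w_reduce, w_restore)
    return [
        [[value * g for value in row] for row in ch]
        for ch, g in zip(feature_map, gates)
    ]


# Toy input: 2 channels of size 2 x 2, so the bottleneck has 1 channel.
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
w_reduce = [[1.0, -1.0]]       # hypothetical weights, 2 -> 1
w_restore = [[2.0], [-2.0]]    # hypothetical weights, 1 -> 2

gates = channel_gates(fmap, w_reduce, w_restore)
recalibrated = cse_block(fmap, w_reduce, w_restore)
```

With these toy weights the first channel receives a gate close to 1 and the second a gate close to 0, which is the sense in which the module assigns different weights to different channels while leaving the spatial size unchanged.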

The traditional U-net model uses max-pooling layers to perform downsampling and reduce the number of parameters. This method may lead to the loss of information during the feature extraction process. Therefore, in this study, strided convolution was used to perform the downsampling.

Strided convolution can remove redundant information, thereby reducing the size of the feature map.
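The difference between the two downsampling schemes can be illustrated with a small pure-Python sketch. The kernel size (2 × 2), the stride (2), the absence of padding, and the fixed averaging kernel are assumptions made for the example; the text does not state the settings used in the model, and in a trained network the kernel weights would be learned rather than fixed.

```python
def max_pool2x2(image):
    """2 x 2 max pooling with stride 2: keeps only the largest value per window."""
    return [
        [
            max(image[i][j], image[i][j + 1], image[i + 1][j], image[i + 1][j + 1])
            for j in range(0, len(image[0]) - 1, 2)
        ]
        for i in range(0, len(image) - 1, 2)
    ]


def strided_conv2d(image, kernel, stride=2):
    """Single-channel convolution without padding; every pixel in a window contributes."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [
        [
            sum(
                kernel[a][b] * image[i * stride + a][j * stride + b]
                for a in range(k)
                for b in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
kernel = [[0.25, 0.25], [0.25, 0.25]]  # hypothetical fixed kernel; learned in practice

pooled = max_pool2x2(image)             # [[5, 7], [13, 15]]
convolved = strided_conv2d(image, kernel)  # [[2.5, 4.5], [10.5, 12.5]]
```

Both operations halve the height and width of the 4 × 4 input, but max pooling discards three of every four values, whereas the strided convolution combines all four through weights that training can adjust.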

The model uses multiple branches of different resolutions to extract features in parallel during the training process, and it performs feature fusion between different scales after each attention residual block so as to achieve strong semantic information and precise localization. Due to its small size, the thyroid gland occupies little space on a CT image. Therefore, the use of

where TP (true positive) is the number of pixels correctly predicted as foreground, FP (false positive) is the number of pixels incorrectly predicted as foreground, and FN (false negative) is the number of pixels incorrectly predicted as background.
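As a rough illustration of how TP, FP, and FN are counted from a predicted mask and a reference mask, the following sketch computes them pixel by pixel and derives common overlap metrics from them. The formulas shown (Dice, Jaccard, precision, recall) are the standard definitions and are assumed here; they may not match exactly the set of metrics used in this study, and the function names and toy masks are hypothetical.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN over two binary masks of equal shape (1 = thyroid, 0 = background)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # foreground predicted, foreground in reference
            elif p and not t:
                fp += 1      # foreground predicted, background in reference
            elif t and not p:
                fn += 1      # background predicted, foreground in reference
    return tp, fp, fn


def dice(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # two empty masks agree perfectly


def jaccard(tp, fp, fn):
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy 2 x 3 masks, not data from the study.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

tp, fp, fn = confusion_counts(pred, truth)  # (2, 1, 1)
scores = {
    "dice": dice(tp, fp, fn),
    "jaccard": jaccard(tp, fp, fn),
    "precision": precision(tp, fp),
    "recall": recall(tp, fn),
}
```

True negatives are not counted because none of these overlap metrics use them, which matters for a small structure such as the thyroid gland, where background pixels dominate the image.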

To further evaluate the differences between the four models, we made box plots of the evaluation