Change detection (CD) aims to identify changed areas between dual-time remote sensing (RS) images. The change in RS image CD generally refers to a change of land cover or land use status, and detection means identifying changes in specific areas through visual interpretation or related algorithms. CD plays an important role in many practical applications, for instance disaster evaluation, ecological environment monitoring, urban development planning, and civil map revision.
Traditional RS image CD suffers from complex procedures and low accuracy, and traditional methods require high-quality dual-time images. In this paper, we use AI algorithms to address these problems. Because CD can be viewed as a special case of semantic segmentation, we approach CD with the ideas of semantic segmentation. Early semantic segmentation models were built by removing the fully connected layers of a convolutional neural network [1] and adding deconvolutions to restore the original resolution, but much semantic information is lost in this process. Among deep learning approaches, the deep convolutional neural network based on U-Net[2] is the classic semantic segmentation structure. U-Net[2] consists of a skip connection structure and a symmetric encoder and decoder. Through a succession of convolution and down-sampling operations, the encoder extracts the features of the input image. The decoder recovers the resolution of the image through up-sampling and convolution, and the skip connection structure integrates the features of each layer in the down-sampling process, which alleviates the loss of spatial information. This technical path led to several algorithms, including 3D U-Net[3], Res-UNet[4], U-Net++[5] and UNet 3+[6]. These algorithms were developed for various semantic segmentation and CD tasks and have proved effective, so we apply U-Net[2] as part of our model.
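The encoder/decoder/skip-connection mechanism above can be illustrated with a toy 1-D example (a deliberately minimal sketch on plain lists, not the actual convolutional U-Net): the encoder halves the resolution, the decoder restores it, and the skip connection re-injects encoder features so that spatial detail lost during down-sampling can be recovered.

```python
# Toy sketch of the U-Net idea on a 1-D "feature map" (illustrative only):
# encoder downsamples, decoder upsamples, skip connection fuses the two.

def downsample(x):
    """Average-pool by a factor of 2 (encoder step)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def upsample(x):
    """Nearest-neighbour upsample by a factor of 2 (decoder step)."""
    out = []
    for v in x:
        out.extend([v, v])
    return out

def fuse(skip, up):
    """Skip connection: fuse encoder features with decoder features."""
    return [(s + u) / 2 for s, u in zip(skip, up)]

signal = [1.0, 3.0, 2.0, 8.0, 5.0, 5.0, 0.0, 4.0]
enc = downsample(signal)   # encoder feature at half resolution
dec = upsample(enc)        # decoder restores the resolution
out = fuse(signal, dec)    # skip connection recovers fine detail
print(out)                 # -> [1.5, 2.5, 3.5, 6.5, 5.0, 5.0, 1.0, 3.0]
```

Note how `dec` alone is blurry (each pair of values is identical), while fusing with the skip features restores per-position variation; this is the intuition behind the skip connection alleviating spatial information loss.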
The locality of convolution operations makes it difficult for convolution-based models to learn global semantic information, so even though CNN-based models have produced good results, such methods cannot completely meet the accuracy requirements of semantic segmentation and CD. Swin Transformer[7] applies the Transformer[8] structure, which performs well in natural language processing (NLP), to the field of computer vision. Swin-Unet[9] is a U-shaped network built purely on Swin Transformer[7] blocks, with an encoder, decoder, bottleneck, and skip connections, like U-Net[2]. The Swin-Unet structure is an ideal backbone for segmentation of RS images, which contain limited spatial detail, and the global feature extraction capability of its self-attention structure can also extract significant features from RS images.
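A key mechanism that makes Swin Transformer[7] efficient is computing self-attention within non-overlapping local windows rather than over the whole feature map. The partition step can be sketched as follows (a pure-Python toy on a single-channel map; the real model operates on multi-channel tensors and additionally shifts the windows between layers so information flows across window borders):

```python
# Sketch of the non-overlapping window partition used by Swin Transformer:
# self-attention is computed inside each M x M window instead of globally,
# keeping the cost linear in image size.

def window_partition(fmap, M):
    """Split an H x W map (list of lists) into M x M windows."""
    H, W = len(fmap), len(fmap[0])
    windows = []
    for i in range(0, H, M):
        for j in range(0, W, M):
            windows.append([row[j:j + M] for row in fmap[i:i + M]])
    return windows

fmap = [[r * 4 + c for c in range(4)] for r in range(4)]  # a 4 x 4 map
wins = window_partition(fmap, 2)  # four 2 x 2 windows
print(len(wins), wins[0])         # -> 4 [[0, 1], [4, 5]]
```

Attention is then computed independently per window; the shifted-window variant in the next layer offsets the grid by M/2 so neighbouring windows exchange information.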
Siamesenet[10] is designed to measure how similar two inputs are. To conduct end-to-end detection, researchers developed a variety of fully convolutional networks in [11][12][13]. Recurrent neural networks and CNNs are used in [14][15] to extract features from multi-temporal images. Convolutional multiple-layer recurrent neural networks have also been proposed for CD with multi-source VHR images[16]. These networks either use a two-stream structure to learn image features or combine the two images into a single multi-channel input. In our model, by contrast, the two images are passed through the network in turn with the same weights in the lowest layers, using a siamese architecture. Learning common features through fully shared weights is reasonable because the two images are captured at different times over the same location. For CD, a siamese convolutional network is proposed in [17]. The model in [17] combines the features extracted by the siamese CNN using a straightforward Euclidean distance-based thresholding segmentation that is independent of the model. In our approach, deeper modules are designed for better information fusion.
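The siamese scheme of [17] can be sketched in a few lines: the same extractor (shared weights) processes both acquisitions, and the per-pixel Euclidean distance between the two feature maps is thresholded into a change mask. The extractor below is a hypothetical linear stand-in for the actual CNN, and the threshold value is arbitrary:

```python
# Toy sketch of siamese feature extraction + Euclidean-distance thresholding.
import math

WEIGHTS = [0.5, -0.25]  # shared weights, used for BOTH images

def extract(img):
    """Stand-in shared-weight extractor: 2 features per pixel."""
    return [[(p * WEIGHTS[0], p * WEIGHTS[1]) for p in row] for row in img]

def change_mask(img_t1, img_t2, thresh):
    f1, f2 = extract(img_t1), extract(img_t2)  # same weights both times
    mask = []
    for row1, row2 in zip(f1, f2):
        mask.append([1 if math.dist(a, b) > thresh else 0
                     for a, b in zip(row1, row2)])
    return mask

t1 = [[10, 10], [10, 10]]
t2 = [[10, 50], [10, 10]]        # one pixel changed between the two times
print(change_mask(t1, t2, 1.0))  # -> [[0, 1], [0, 0]]
```

Because the thresholding is a fixed post-processing step outside the network, it cannot adapt how the two feature streams are combined; this is the limitation our learned fusion modules are meant to address.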
Combining Swin Transformer[7], U-Net[2] and Siamesenet[10], we designed a special network for RS image CD: Siam-Swin-Unet. Siam-Swin-Unet is made up of Siamesenet[10], an encoder, a decoder, a skip connection structure and a feature fusion module. The encoder and decoder are both built from Swin Transformer blocks. The dual-time RS images are separately processed by the Swin Transformer encoder with the same weights to extract features of the dual-time images. The two feature maps are fused through the feature fusion module. The fused features are up-sampled by the Swin Transformer decoder with shared weights, and multi-scale features from the encoder are fused through the skip connection structure to perform the CD task. Finally, the dual-time RS image features whose resolution has been restored by the siamese branches are multiplied, which ensures that the network uses the information extracted by the two siamese branches equally. Specifically, we can sum up our contributions as follows:
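The overall data flow described above can be summarised in a runnable skeleton, where every stage is replaced by a trivial stand-in (all function bodies here are illustrative assumptions, not the actual Swin Transformer blocks or the paper's fusion module):

```python
# Skeleton of the Siam-Swin-Unet forward pass; each stage is a toy stand-in.

def encoder(x):                    # shared-weight siamese Swin encoder
    return [v * 0.5 for v in x]    # stand-in for Swin blocks + downsampling

def fuse_features(f1, f2):         # feature fusion module (stand-in)
    return [abs(a - b) for a, b in zip(f1, f2)]

def decoder(f, skip):              # shared-weight decoder + skip connection
    return [v + s for v, s in zip(f, skip)]

def forward(img_t1, img_t2):
    f1, f2 = encoder(img_t1), encoder(img_t2)  # same weights, both branches
    fused = fuse_features(f1, f2)
    d1 = decoder(fused, f1)                    # skip features from branch 1
    d2 = decoder(fused, f2)                    # skip features from branch 2
    # multiply the two restored branches so both contribute equally
    return [a * b for a, b in zip(d1, d2)]

print(forward([2.0, 4.0], [2.0, 8.0]))  # -> [1.0, 24.0]
```

The final element-wise multiplication is symmetric in the two branches, which is one simple way to realise the stated goal of using the information extracted by the two siamese branches equally.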