Farm Parcel Extraction in High Resolution Remote Sensing Image Based on Hierarchical Spectrum and Shape Features

Background: Land-use classification schemes typically address both land use and land cover. Vectorized data extracted from farm parcel segmentation provide important cadastral data for the formulation and management of climate change policies, as well as basic data for research on large-area pest control, crop yield forecasting, and crop variety classification. They can also be used by agricultural insurance departments to assess compensation for damage caused by extreme weather events. We first investigate the effectiveness of an automated image segmentation method based on the TransUNet architecture to automate the task of farm parcel delineation, which originally relied on costly manual labor. We then apply post-processing that vectorizes the binary segmentation image and tunes area and regularity parameters to obtain a more refined segmentation result. Results: Results on the existing data show that the proposed automatic segmentation system can effectively delineate various types of agricultural land. The system was trained and evaluated on 94780 images. The accuracy reached 83.31%, the recall 82.13%, the F1-score 80.37%, the overall accuracy 82.23%, and the IoU 80.39%. At the same time, without losing too much accuracy, we trained and tested the model on 3.2 m resolution imagery, which offers a processing-speed advantage over 0.8 m resolution imagery. The proposed method can therefore be applied effectively to agricultural land extraction, with better quality and efficiency than most manual annotation. Conclusions: We have demonstrated the effectiveness of a strategy combining a TransUNet architecture with post-processing by vectorizing the binary segmentation for farm parcel extraction in high-resolution remote sensing images.
The success of our approach also demonstrates the feasibility of deep learning participating in and improving agricultural production activities, which is important for achieving scientific management of agricultural production.


Introduction
In recent years, with the continuous growth of population and the decrease of arable land resources, modern agricultural production has been developing in an intensive and refined direction, and the demand for dynamic, large-scale, timely and rapid spatial information on arable land has become increasingly urgent. Since remote sensing technology can repeatedly obtain large-scale farmland information, it is increasingly applied to land cover information extraction [1].
For the classification of remote sensing images, the traditional approach uses statistical methods to extract low-level features, including distance-based [2], K-nearest-neighbor [3], maximum-likelihood [4] and logistic regression [5] classifiers. However, with the rapid development of aerospace, sensor and computer technologies, high-resolution remote sensing images are increasingly applied to land use classification [6]. Spectral confusion exists in high-resolution remote sensing images, which reduces the effectiveness of traditional classification methods based on low-level features [7]. Low-level features encoded by typical pixel-based or object-based methods, such as spectrum, texture and geometric attributes, become essentially invalid [8][9][10].
More complex features and descriptors are therefore needed to capture the semantics of the scene. Approaches [11] based on mid-level feature modeling have been developed on the basis of low-level feature methods. Three types of mid-level feature extraction methods describe image semantics: the bag-of-visual-words (BoVW), latent Dirichlet allocation (LDA), and machine learning models. In practice, the SIFT-BoVW method has been successfully applied to remote sensing land use image classification [12]. However, the performance of BoVW-based methods relies on the extraction of handcrafted local features [13], and LDA modeling methods rely on K-means clustering to produce a visual dictionary. These are therefore not the most suitable segmentation methods for high-resolution remote sensing images [14]. Machine learning models independently perform data representation and feature extraction [15] and discard extracted feature modes [16,17] according to pre-determined rules. Machine learning can therefore often achieve better classification results when processing complex hyperspectral remote sensing images.
Commonly used machine learning methods include sparse coding [18], neural networks [19], support vector machines [20][21] and deep learning [22]. Among them, convolutional networks have shown good performance in semantic segmentation [23][24][25]. In [26], the authors trained a one-dimensional CNN containing an input layer, convolution layer, max-pooling layer, fully connected layer and output layer to classify hyperspectral images directly in the spectral domain. Makantasis et al. [27] used a two-dimensional CNN to encode spectral and spatial information. Zhao and Du [28] proposed a classification framework based on spectral-spatial features, combining a dimensionality-reduction algorithm based on local discriminant embedding with a two-dimensional CNN. Paoletti et al. [29] developed a 3D CNN capable of simultaneously processing the spectral and spatial features of hyperspectral images. The authors of [30] proposed an unsupervised convolutional network for learning spectral-spatial features that uses sparse learning to estimate the network weights in a greedy layer-wise fashion instead of end-to-end learning. Mou et al. proposed a network structure for unsupervised spectral-spatial feature learning of hyperspectral images, called the residual convolution-deconvolution network [31,32]. In [33], the authors proposed an RNN model for hyperspectral image classification based on a new activation function and an improved gated recurrent unit, which can effectively analyze hyperspectral pixels as sequence data and then determine information categories through network reasoning. Zhao and Du [34] proposed a multi-scale convolutional neural network for learning deep features of spatial relations, in which a pyramid structure is constructed from the image to present spatial characteristics at different scales. In 2016, Lin et al. proposed the RefineNet [35] model for recognizing natural images.
The RefineNet model is a universal multi-path refinement network that explicitly exploits all the information available during the entire down-sampling process and uses long-range residual connections to achieve high-resolution prediction.
In addition, other methods have achieved results in the classification of high-resolution remote sensing images. GEOBIA provides high-spatial-resolution image analysis using spectral, spatial, textural and topological features [36]. Its basic unit of analysis is an image object rather than a single pixel, so the method aims to bypass the problem of the artificial square cells used in per-pixel methods [37]. Liu Weifeng proposed multiview Hessian regularization for image annotation [38] and multiview Hessian discriminative sparse coding for image annotation [39], and conducted extensive experiments on the PASCAL VOC '07 data set to verify the effectiveness of the methods. LiDAR point clouds have been used in many semantic segmentation applications to provide an additional dimension of information [40]. F.L. Luo et al. [41] proposed hyperspectral image feature learning using spatial hypergraph discriminant analysis. SAEs [42] and DBNs [43] have also been used to extract hierarchical features in the spectral domain. Tao et al. [44] used sparse SAEs to learn effective feature representations from input data; the learned features are then fed into a linear support vector machine for hyperspectral data classification. In [45], the authors proposed a remote sensing image segmentation method using both spectral and texture information. Yao et al. [46] proposed a remote sensing image segmentation method based on adaptive clustering ensemble learning. Shen et al. [47] proposed a two-group particle swarm optimization algorithm to improve the performance of remote sensing image segmentation. A multi-layer fusion model for adaptive segmentation and change detection of optical remote sensing image sequences has also been proposed [48].
In [49] and [50], a Multiple Feature Pyramid Network (MFPN) framework is proposed that utilizes multi-level semantic features of high-resolution remote sensing images to implement an effective feature pyramid and a customized pyramid pooling module. Deep convolutional neural networks (CNNs) have been successfully applied to various tasks including image recognition [51][52][53], object detection [54][55][56][57], and object tracking [58][59][60][61], achieving good results.
We investigated the application of high-resolution remote sensing satellite images to delineate farm parcel areas. First, training samples were obtained by manually annotating the target images in ArcGIS. Second, the data set was pre-processed with image augmentation, and the modified TransUNet network was trained to build models at 0.8 m and 3.2 m resolutions respectively. Finally, post-processing techniques that tune shape-feature parameters were used to further improve the recognition results, and test images of different resolutions were evaluated on the different models so that better recognition result images could be obtained.

Data Description
We propose TransUNet to recognize farm parcels. It combines the advantages of both U-Net and Transformer: on the one hand, the U-Net component enhances finer details by recovering localized spatial information; on the other hand, the Transformer enables precise localization [62].
The Gaofen-2 satellite has two sensors with different bands and resolutions: the panchromatic sensor has one band ranging from 0.4 to 0.9 µm, and the multispectral sensor has four bands ranging from 0.45 to 0.89 µm. A 0.8 m resolution remote sensing image can be obtained by combining the two sensor images. Without sacrificing accuracy, we did not fuse the imagery into a single resolution; working at different resolutions accelerated the land parcel extraction process. The main parameters of the Gaofen-2 satellite are shown in Table1. Firstly, all farm parcel types are extracted as a single class. Typical parcels include polder, Duo farmland, dam farmland, etc., as shown in Fig.1. The parcel types are described as follows:
• Polder: Farmland reclaimed by embankments along rivers, near the sea, or in lakeside areas. Ancient Chinese farmers invented the polder to rebuild low-lying land and transform lakes into farmland.
• Duo farmland: Stacked high fields formed by excavating deep net-shaped trenches or small rivers in low-lying humid areas along lakes or river networks in southern China. Its terrain is high, its drainage is good, and its soil is fertile and loose. It is suitable for planting various dry crops, especially melons and vegetables.
• Dam farmland: Farmland created by damming a ravine and retaining the soil washed down from the mountain.
• Strip farmland: A field surrounded by agricultural ditches and windbreak forest. It is the basic unit of land use for farming, irrigation, and soil improvement.
• Dry land: Fields that do not store water on the surface; irrigation relies mainly on natural precipitation. Dry land is generally planted with xerophytic crops such as wheat, corn, and cotton.
• Paddy fields: Arable land that is flooded and used to grow semi-aquatic crops, especially rice and taro. Paddy farming is still the main form of modern rice cultivation.
• Platform farmland: A field shaped like a platform raised above the ground and surrounded by ditches. Platform farmland crops are mainly grain, cotton, vegetables and other crops; it is also a land improvement measure to eliminate waterlogging and alkalinity.
• Terrace: In agriculture, a terrace is a sloped surface cut into a series of successively receding flat platforms, similar to steps, for the purpose of more efficient farming. Terraced fields are widely used for the cultivation of rice, wheat and barley in East Asia, South Asia and Southeast Asia.

Data Generation
There are many labeling methods for deep learning data set samples, mainly using Labelme [63], RSLabel and ArcGIS software [64]. The purpose of labeling is to separate the background from the various parcels in the sample and tag them with their corresponding labels to facilitate model learning. The labeling methods are as follows:
• Labelme: Labelme is not convenient for managing remote sensing image annotation of multi-channel images and does not allow easy modification of annotations. Farm parcel labels need to be modified frequently because of undefined attributes and scales.
• ArcGIS: ArcGIS labels remote sensing images with vector polygons stored in shapefiles, so polygons over farm parcel areas are easy to modify. The biggest advantage of ArcGIS labeling is that we only need to manually label the high-resolution samples; the binarized labels generated for the 3.2 m resolution samples can be mapped to the 0.8 m resolution image. The polygon annotations can thus be used for images of different resolutions at once, as shown in Fig.2. The green vector polygons are the marked farm parcels, and the rest are unlabeled areas. After labeling is completed, the attribute values of the farm parcel labels are assigned uniformly, the rasterization function of ArcGIS is used to rasterize the vector labels into a binarized label file, and finally ArcGIS itself or Python is used to binarize the entire image. The original image and the corresponding annotated image are then cut into patches of a fixed size at the same time, as shown in Fig.3.
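The final cutting step can be sketched as a simple tiling routine. The function name, array shapes and the choice to discard incomplete border tiles below are illustrative assumptions, not the paper's actual tooling:

```python
import numpy as np

def cut_tiles(image, label, tile=256):
    """Cut an image (H, W, C) and its rasterized binary label (H, W)
    into aligned fixed-size patches, discarding incomplete border
    tiles. Illustrative sketch, not the paper's actual pipeline."""
    h, w = label.shape
    pairs = []
    for i in range(0, h - tile + 1, tile):
        for j in range(0, w - tile + 1, tile):
            pairs.append((image[i:i + tile, j:j + tile],
                          label[i:i + tile, j:j + tile]))
    return pairs
```

Cutting image and label in the same loop guarantees each 256x256 training sample stays pixel-aligned with its mask.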

Methods
The farm parcel segmentation system includes three main stages: 1) pre-processing, 2) semantic segmentation, and 3) post-processing, as shown in Fig.4.

Data Augmentation
As a neural network deepens, the number of parameters to be learned also increases. When the data set is small, the many parameters will fit every characteristic of the data set rather than the commonality between the data, which fails to meet the generalization ability required of the network model. To obtain more data and reflect the commonality of the data set, we make minor changes to the existing data. Such small changes preserve the common ground of the data set while making the neural network treat the results as completely different images. When our neural network is trained with these additional data, it can interpret remote sensing images of different locations, different lighting conditions, and different scales.
The main data augmentation methods process samples with flips, rotations, scaling, cropping, translation, and Gaussian noise. The purpose is to increase the amount of training data, prevent over-fitting, improve the generalization ability of the model, and add noise to improve the robustness of the model. In our data, images are rotated and flipped, and white noise, color enhancement and Gaussian noise are added, as shown in Fig.5 (only the label part is shown; A, I, J and K are the same in the mask image). The labels of the original image and of the Gaussian-noise, tone-enhanced and white-noise versions are identical.
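A minimal NumPy sketch of the joint augmentation described above: geometric transforms (flip, rotation) are applied identically to image and label, while noise is applied to the image only so the labels stay unchanged. Function names and noise parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, label):
    """Random 90-degree rotation and horizontal flip applied jointly
    to image and label, plus Gaussian noise on the image alone
    (illustrative parameters; labels must stay binary)."""
    k = int(rng.integers(0, 4))                 # 0-3 quarter turns
    image, label = np.rot90(image, k), np.rot90(label, k)
    if rng.random() < 0.5:                      # random horizontal flip
        image, label = np.fliplr(image), np.fliplr(label)
    noisy = image + rng.normal(0.0, 5.0, image.shape)  # Gaussian noise
    return np.clip(noisy, 0, 255), label
```

Because the noise touches only the image array, the augmented samples share one label file, matching the observation that the noisy variants in Fig.5 have identical labels.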

Segmentation With Transformer And U-Net
Based on the characteristics of remote sensing images and semantic segmentation technology, the TransUNet network architecture is adopted, as shown in Fig.6 [62]. The overall network structure is divided into three parts: a Downsample Layer based on Resnet [51]; a Transformer Layer based on an attention mechanism with a residual structure; and an Upsample Layer based on a fully convolutional network (FCN) [65].

Downsample Layer
This model uses the classic Resnet [51] structure as the down-sampling network. The down-sampling network mainly extracts image features, and the extracted feature information participates in the information fusion of the Upsample Layer (the embedded information output by Attention is used as another source of fusion information). Only the first output feature layer uses maximum pooling; the other output feature layers use the FCN structure. The Downsample Layer is composed of basic residual blocks (Basic Blocks); each Basic Block consists of a cross-layer residual block and a non-cross-layer residual block. The cross-layer residual block mainly extracts information between feature maps of different sizes across layers, while the non-cross-layer residual block extracts information between feature maps of the same size.
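A simplified single-channel sketch of the non-cross-layer residual block described above, with a naive 3x3 "same" convolution standing in for the learned layers. All names and shapes are illustrative; the real Basic Block operates on multi-channel feature maps with batch normalization:

```python
import numpy as np

def conv3x3(x, kernel):
    """Naive single-channel 3x3 'same' convolution (illustrative)."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def basic_block(x, k1, k2):
    """Non-cross-layer residual block: spatial size is preserved,
    so the shortcut is the identity mapping x + f(x)."""
    y = relu(conv3x3(x, k1))
    y = conv3x3(y, k2)
    return relu(x + y)  # identity shortcut around the two convolutions
```

A cross-layer block would instead use a strided projection on the shortcut so that feature maps of different sizes can still be added.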

Transformer Layer
The Transformer Layer is mainly composed of the following parts: local feature information segmentation; a feature information Embedding Layer; a Multi-Head Attention Layer; and a Feed-forward Network Layer. The local feature information segmentation mainly partitions the output features of the last layer of the Downsample Layer. The Embedding Layer includes Patch Embedding and Position Embedding; the two are combined and flattened as the input of the Attention Layer. The self-attention mechanism mainly focuses on the connection between each encoded embedding and the other embeddings and computes the related self-attention. Each input embedding is multiplied by the weights of the Multi-Head Attention Layer to compute the three vectors of the corresponding embedding: the query vector (q), key vector (k) and value vector (v). The self-attention between the current embedding and the other embeddings is calculated from each key vector and query vector and expressed as a probability (weight); the probabilities are multiplied by the value vectors and summed to obtain the attention-layer output for the embedding at that location. Multi-Head Attention processes the same embedding in parallel along the horizontal direction. The Feed-forward Network Layer mainly normalizes the Attention output on the weight layer. The innovations of the Transformer Layer are: 1. the Embedding uses convolution instead of linear processing; 2. in the Attention computation, the query vector (q), key vector (k) and value vector (v) are changed to q′, k′ and v′, adding minimal local attention to the same feature embedding (attending to specific details, such as the connection between the edges and centers of parcels); 3. an attention mechanism is added to the feature information output by each Downsample Layer, replacing the direct output as the input source of the fusion information.
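The q/k/v computation described above can be sketched as single-head self-attention in NumPy. The weight matrices, shapes and function names are illustrative assumptions (the actual layer is multi-head and uses the modified q′, k′, v′ projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of embeddings E
    (n_patches x d). Wq, Wk, Wv are the learned projections that
    produce the q, k and v vectors mentioned in the text."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise attention logits
    probs = softmax(scores, axis=-1)         # weights sum to 1 per query
    return probs @ V                         # weighted sum of value vectors
```

Each row of `probs` is exactly the "probability (weight)" the text describes: how strongly one patch embedding attends to every other patch.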

Upsample Layer
The Upsample Layer mainly includes: inverse flattening of the Transformer Layer output; Upsample Layer convolution; feature fusion; and a Segmentation-Head output layer. Inverse flattening restores the embedded features of the Transformer Layer output to their arrangement before Embedding, placing the new feature information learned by Attention according to its position before Attention learning. Feature fusion combines the up-sampled features with the corresponding Downsample Layer outputs. The innovation point of the Upsample Layer is that only one convolution is performed after fusion, ensuring that the fused feature information does not cause gradient problems as the number of layers increases.
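A minimal sketch of the up-sampling and feature-fusion step, assuming nearest-neighbour 2x up-sampling and channel-wise concatenation with the skip features; the actual network uses learned convolutions, and all names here are illustrative:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(decoder_feat, skip_feat):
    """Concatenate the up-sampled decoder features with the matching
    Downsample Layer (skip) features along the channel axis; a single
    convolution would follow, as the text notes, to keep the fused
    path shallow."""
    up = upsample2x(decoder_feat)
    assert up.shape[:2] == skip_feat.shape[:2]
    return np.concatenate([up, skip_feat], axis=-1)
```

Concatenation doubles the channel count at each fusion point, which is why limiting the post-fusion path to one convolution keeps the depth, and hence the gradient path, short.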

Post-processing
After semantic segmentation, the resulting image often contains pixel-level misrecognition, including voids and noisy regions, unsmooth vector polygons and trivial small areas. Therefore, a flexible post-processing scheme is needed to improve the recognition accuracy for farm parcels. The post-processing scheme includes three main steps: morphological analysis, vectorization, and data fusion, as shown in Fig.7 (the post-processing flowchart of farm parcel identification). Post-processing is an important means of secondary correction of the recognition results and of practical engineering application. When the results of the recognition algorithm are not ideal, post-processing is used to make subsequent adjustments and obtain a more accurate land recognition effect. The following three methods are used to operate on the identified binarized image.
Morphological operations: dilation, erosion, opening and closing operations, hit-and-miss transformation, top-hat transformation, black-hat transformation, etc. Smoothing combined with the opening operation (erosion followed by dilation) eliminates small objects in the image, while the closing operation (dilation followed by erosion) fills small holes inside the target.
Vectorization: The area of each recognized region is evaluated and small regions are removed. The vector polygon of each farm parcel is then adjusted to merge or delete interference areas.
Fusion: The vector polygon data generated from the binarized map after morphological operations is fused with the vector data generated directly from the original data; neighbouring vectors are judged by geographic information, and the recognition results are optimized through vector attributes (area, regularity, etc.). The original image and the recognition results are shown in Fig.9. It can be seen that the expansion coefficient is difficult to control in morphological processing: in an image with a resolution of 3.2 m, dilating by one pixel affects 3.2 m on the ground, which causes the edge of the farm parcel to be misidentified (the natural boundary of farmland is defined at 2 m, while the remote sensing images have resolutions of 3.2 m and 0.8 m). Therefore, in this scheme, the area parameter is first used to filter out trivially misrecognized regions, defined as A = σ² · P, where A is the area, σ is the spatial resolution and P is the number of pixels. Small-area vector polygons are filtered out by setting a threshold, as shown in Fig.10(B). The area vector is then smoothed. The smoothing-scale threshold can be set small, because the defined parcel-edge width is greater than 2 m; the buffer distance is set to 0.5 m. The regularity β ∈ (0, 1) is defined so that a value close to 1 indicates that the extracted target is relatively regular, while a small value indicates that the extraction is incomplete or the recognition ability is insufficient. An iterative calculation based on the extracted area is proposed here: while the set regularity threshold is not met, the area is filled and smoothed until the threshold is met. The regularity values can be arranged from large to small using a dynamic sorting method.
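The area filter can be sketched as follows, assuming the area is computed as A = σ² · P from the quantities defined in the text. The function names and thresholds are illustrative assumptions:

```python
def parcel_area(pixel_count, sigma):
    """Ground area of a parcel in square metres: A = sigma^2 * P,
    where sigma is the spatial resolution (metres per pixel) and
    P is the parcel's pixel count."""
    return sigma ** 2 * pixel_count

def filter_small(pixel_counts, sigma, min_area):
    """Keep only parcels whose ground area reaches the threshold
    (illustrative stand-in for the vector-polygon filtering step)."""
    return [p for p in pixel_counts if parcel_area(p, sigma) >= min_area]
```

At 3.2 m resolution each pixel covers 10.24 m², so even small pixel-count speckle translates into a non-trivial ground area, which is why the threshold is applied in metres rather than pixels.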
To improve efficiency, GDAL structures are first used to build a linked list with a regularity field: the area is calculated, then the regularity is computed and added to the field linked list, and quick sorting places the largest regularity values first. The regularization operation is performed only on the part of the linked list whose regularity is greater than the threshold, without a global traversal of the entire vector polygon data, which increases post-processing speed. Fig.10 shows the results of vector polygon data processing for some specific areas. A, B, C, and D in Fig.10 show various post-processing results achieved by vectorization when the raw recognition is not ideal.

Experiments And Results
This part introduces the data source, the data production process, the training environment and a performance comparison of various algorithms. First, we obtained remote sensing image data from the GaoFen-2 satellite and, through manual annotation and augmentation, generated 66346 training images, 18956 validation images and 9478 test images, a total of 94780 remote sensing images of 256x256 pixels. These data were then used to train the various algorithms and compare their training results.

Experimental Data
The research scope of farm parcel identification in this paper covers areas containing typical farm parcels and crops in Zhejiang and Anhui, as shown in Fig.11. Anhui Province is located in eastern mainland China, with 4.22 million hectares of arable land, and Zhejiang Province is located on the southeast coast of China, with 1.98 million hectares of arable land. These two regions have large crop planting areas, high yields, and relatively intensive cropping; their characteristics are therefore complex and strongly representative. Studying them is of great significance for ensuring national food security and also provides guidance for farm parcel identification in other regions of China. The experimental data contain two resolutions, 0.8 m and 3.2 m, and four channels. Since different band combinations highlight different features of farm parcels, we choose the standard false-color combination (red, green and blue assigned to bands 4, 3 and 2 respectively) to label farm parcels. Vegetation in a standard false-color image is displayed in red, which highlights its characteristics, as shown in Fig.12; this is the most suitable band combination for producing the experimental data. After labeling is completed, we use a Python segmentation tool to refine and pre-process the original images and their labels to obtain the experimental data. The data contain Anhui remote sensing images with a size of 7300x6908 at two different resolutions, as well as two Zhejiang remote sensing images with sizes of 27500x26760 and 6882x7329 at two different resolutions. Because the original images are too large, we use a Python program to crop them into patches. Sufficient training samples improve the network training effect; when training samples are insufficient, we augment the data set.
After flipping, rotation, scaling, cropping, translation, Gaussian noise, etc., the total data set contains 94780 images. The training, validation, and test data sets are divided according to the ratio 7:2:1, which means that 66346 training images, 18956 validation images and 9478 test images participate in the training. In addition, to further verify the performance of our algorithm, we prepared an additional 2341 remote sensing images from the QuickBird satellite as a test set to verify the generalization ability of the segmentation model.
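The 7:2:1 split can be reproduced with simple integer arithmetic. This sketch computes deterministic counts only; the paper does not specify the actual shuffling or assignment procedure, so the function name and rounding choice are assumptions:

```python
def split_dataset(n_total, ratios=(7, 2, 1)):
    """Split n_total samples into train/val/test counts by ratio,
    giving any rounding remainder to the test set (illustrative)."""
    total = sum(ratios)
    train = n_total * ratios[0] // total
    val = n_total * ratios[1] // total
    test = n_total - train - val
    return train, val, test

split_dataset(94780)  # → (66346, 18956, 9478), matching the paper
```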

Experiments Results
All the experiments are based on PyTorch, with which the Deeplabv3 and Deeplabv3+ methods are also implemented.

Evaluations Of Different Resolution
The farm parcel segmentation process includes image input, block processing, semantic segmentation, stitching, vectorization and vectorized result output. Therefore, the segmentation performance indicators for remote sensing images should include not only the conventional performance parameters of the segmentation model but also the time consumption and recognition speed of each functional module. The data are divided into training, validation and test sets at a ratio of 7:2:1; the batch size is set to 32, the number of epochs to 50, and the initial learning rate to 0.001 for the performance tests. We ran our algorithm on two data sets with different resolutions and compared the performance of each stage in detail, as shown in Table2. The data show that processing high-resolution images takes significantly longer than processing low-resolution images: cropping, recognition, and stitching take more than ten times as long. In terms of model performance, the accuracy, recall, and intersection-over-union of our algorithm on high-resolution images are also weaker than those on low-resolution images. For farm parcel identification, the image segmentation performance is mainly evaluated according to the indicators in Fig.13.
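The evaluation indicators (precision, recall, F1-score and IoU) can be computed pixel-wise from a binary prediction and its ground truth. This NumPy sketch is illustrative and uses standard definitions rather than the paper's exact evaluation code:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise precision, recall, F1-score and IoU for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # correctly detected parcel pixels
    fp = np.logical_and(pred, ~truth).sum()   # false parcel pixels
    fn = np.logical_and(~pred, truth).sum()   # missed parcel pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou
```

IoU is always the strictest of the four, since its denominator counts both false positives and false negatives, which is consistent with the IoU figure in the abstract being below the accuracy and recall figures.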

Evaluations Of Different Methods
To better evaluate the method in this article, we also conducted experiments with other methods [2]. Using the above data set, we compared against Deeplabv3, Unet and the more advanced Deeplabv3+ in multiple experiments to verify the semantic segmentation effect, as shown in Table3. During the experiments, the input image size, learning rate, epochs, batch size, and hardware environment of the above networks were kept consistent, and no online data augmentation was used. Fig.14 compares the results of our algorithm with the label content: the green outlines are the manually labeled farm parcels, and the red outlines are the farm parcel extents identified by our algorithm. It can be seen that the recognition results are accurate. The data show that, compared with existing classic segmentation methods, the improved recognition network in this paper effectively improves the original model's accuracy in extracting remote sensing farm parcel features and its ability to learn edge details, providing an effective solution to farm parcel identification in high-resolution remote sensing images. [2] Using the Segmentation Models library, we implemented Unet, Deeplabv3+ and a pretrained Unet.

Conclusion And Perspectives
In this study, we proposed a TransUNet architecture to segment farm parcel areas and boundaries in high-resolution remote sensing images. We built our own farm parcel data set and used it to train various neural network models. The results show that our TransUNet model achieves an accuracy of 83.31% and outperforms other algorithms on the other metrics. This indicates that the proposed algorithm identifies farm parcels better and may achieve good results in practical applications. Due to the particularity of remotely sensed parcel objects, general recognition models do not perform well on such objects and do not reach the performance and efficiency of our method. In addition, the data set used in this article is a small self-made data set with a limited amount of data, which may be one reason why the existing models do not perform well. If the amount of data is further increased, the performance of both the existing models and our method may improve to a certain extent. Finally, the recognition results show inaccurate recognition of parcel edges; besides expanding the training set, improving the accuracy of manual labeling may also directly improve training efficiency and the recognition effect.
We divided the GaoFen-2 satellite data into a training set and a validation set for testing and obtained satisfactory results. However, when we used remote sensing images from the QuickBird satellite as a validation set to test the segmentation effect, we found that the results were poor. As future work, to improve versatility in different scenarios, we will add QuickBird satellite data to the training to further improve the generalization ability of the model.