A cross-view geo-localization method guided by relation-aware global attention

Cross-view geo-localization matches a query image against images of the same geographic location captured from different platforms. Most existing methods do not adequately exploit the structural information of images, so the extracted features cannot fully characterize the image and localization accuracy suffers. To address this, this paper proposes a cross-view geo-localization method guided by relation-aware global attention, which captures rich global structural information by tightly integrating an attention mechanism with the feature extraction network, thereby improving the representation ability of the features. In addition, considering the important role of semantic and context information in geo-localization, a joint training structure with parallel global and local branches is designed to fully mine multi-scale context features for image matching, which further improves the accuracy of cross-view geo-localization. Quantitative and qualitative results on the University-1652, CVUSA, and CVACT datasets show that the proposed method outperforms other advanced methods in recall accuracy (Recall@K) and average precision (AP) for image retrieval.


Introduction
Cross-view geo-localization can be regarded as a content-based image retrieval task [1,2]: a query image from one platform is matched against images from other platforms to find those taken at the same geographic location. Previous research mainly focused on matching ground views with satellite and aerial images. Recently, with the gradual maturity of UAV technology [3], drone-view images have been introduced into cross-view geo-localization, and geo-localization between drone-view and satellite images has become a research hotspot.
As convolutional neural networks (CNNs) are widely used in visual tasks such as image classification [4,5], object detection [6,7], semantic segmentation [8,9], and action recognition [10,11], some researchers have applied CNNs to cross-view geo-localization [12] and made significant progress. However, most cross-view geo-localization methods mainly consider the high-level semantic information of the target image, ignoring the impact of spatial structure information on localization accuracy. Zheng et al. [13] regarded geo-localization as a classification task and measured the similarity of image semantic features; however, this method ignores the context information of the area around the target, so the extracted features are not comprehensive enough. Wang et al. [14] used the square-ring partition strategy to make the network focus on the area surrounding the target, improving geo-localization accuracy by exploiting context information. However, this method directly divides the feature map into four scales and ignores the global structure information of the image, which leads to similar images being falsely accepted as correct retrieval results. Clearly, sufficiently exploring the structural information of geographic target images can elevate the performance of geo-localization.

To alleviate the impact on matching accuracy of existing algorithms' failure to fully consider image structure information, this paper proposes a cross-view geo-localization method guided by relation-aware global attention. Specifically, the method adopts a deep residual network [15] as the backbone and exploits the relation-aware global attention (RGA) module [16] to capture more robust global structure information of the image for feature matching. Meanwhile, a dual-branch network is designed to capture deep features with rich semantic information and local features with multi-scale context information, respectively. The local branch employs dilated convolution [17] to increase the receptive field of the feature map and adopts the square-ring partition strategy [14] to divide the feature map at four scales. Moreover, our method converts the feature map of each branch into a column vector and obtains its predicted category through a classifier. Finally, the cross-entropy loss function [18,19] is exploited to learn the image prediction category and improve the training accuracy of the network.
The contributions of this paper mainly include the following aspects: (1) A cross-view geo-localization method guided by relation-aware global attention is proposed. The method exploits the relation-aware global attention module to learn the relationships between image feature nodes, sufficiently mining the global structural information of the image and thus extracting more robust features for image feature matching. (2) A dual-branch structure is designed, in which the global branch exploits the deep residual network to extract deep features and obtain feature maps containing abundant semantic information, while the local branch employs dilated convolution to capture local features with richer multi-scale context information, further enhancing geo-localization precision. (3) The method achieves higher positioning accuracy than other advanced models on the University-1652, CVUSA, and CVACT datasets, which proves its effectiveness in geo-localization.
The remainder of this paper is organized as follows: Sect. 2 introduces related work on cross-view geo-localization. Section 3 details the method and network structure. Section 4 presents and analyzes the experimental results and reports the ablation experiments. Section 5 summarizes the paper and discusses future research directions.

Related work
The research content of early geo-localization is mainly based on ground-view and aerial-view images. Workman et al. [20] adopted two publicly available pre-trained models to extract image features, and proved that deep features can discriminate images from different geographic locations. However, this method only focuses on image feature extraction at a single scale and fails to effectively utilize multi-scale information, so the matching features extracted by the network are insufficient. On this basis, Workman et al. [21] constructed the CVUSA (Cross-View USA) dataset to perform a multi-scale fusion of aerial image features and improved the cross-view localization results. Lin et al. [22] employed publicly available data to build 78,000 pairs of street-view and 45° aerial images and then adopted a deep siamese network to extract features for cross-view localization. Vo et al. [23] trained the network with a distance-based logistic (DBL) layer and rotation invariance to evaluate different deep learning methods and improve localization accuracy. Considering that image semantic information is more robust to viewpoint changes, Tian et al. [24] used object detection technology to extract buildings in the image for building matching and obtained the final geo-localization results. Altwaijry et al. [25] focused on the matching task of aerial image pairs and exploited data-driven methods to learn discriminative representations from image pairs, thus addressing ultra-wide baseline image matching. Furthermore, Zhai et al. [26] first extracted aerial image features, then mapped them to the ground view through an adaptive transformation, and finally minimized the difference between the predicted semantic features and those directly extracted from the ground images through an end-to-end learning method. Hu et al.
[27] combined siamese networks with NetVLAD [28] to encode local features into global descriptors and accelerated network convergence by introducing a weighted soft-margin ranking loss, thus improving network performance. Shi et al. [29] observed that existing methods ignore the differences in appearance and geometry between ground-view and aerial-view images, so they utilized the polar coordinate transform to approximately align aerial images with ground-view images. To further address orientation alignment across views, Shi et al. [30] designed a dynamic similarity matching network (DSM), which makes the image matching results more accurate. Liu et al. [31] argued that geometric cues (such as orientation) can be used for localization, so they designed a siamese network that integrates the orientation information of each pixel into the network model, enabling the network to learn both appearance and geometric information and improving recall and precision. To handle scene changes over time, Rodrigues et al. [32] proposed a semantic-driven data augmentation technique that simulates scene change in cross-view image matching, then employed a multi-scale attention module for image matching, improving network performance. Regmi et al. [33] first applied generative adversarial networks (GANs) [34] to cross-view localization, synthesizing aerial images from ground views for image matching, but the method is not end-to-end. Toker et al. [35] employed a polar coordinate transformation on satellite views to synthesize ground views before image retrieval, and achieved advanced geo-localization performance by integrating the two steps in an end-to-end architecture.
The above methods mainly focus on the matching task between ground-view and aerial-view images; they consider only two views for geo-localization and ignore drone-view images, so feature learning for the multi-view matching task is neglected. Recent research on cross-view geo-localization holds that adding viewpoints can improve the accuracy of geo-localization, so drone images have been introduced into the geo-localization problem. Zheng et al. [13] constructed the University-1652 dataset, which includes satellite-view, ground-view, and drone-view images; they treated all view images of the same location as one category to complete the geo-localization task in a classification manner, and optimized the model with the instance loss [36]. Nevertheless, this method concentrates only on semantic information and does not consider the impact of detailed information on cross-view geo-localization. To solve this problem, Wang et al. [14] proposed a local pattern network (LPN), which takes the contextual information of the image as an auxiliary clue and partitions the feature map so the network notices the environment around the target building, thus effectively addressing the neglect of image details in method [13] and achieving better matching results. Ding et al. [37] adopted a location classification method (LCM) to achieve image matching, which alleviates the sample imbalance between satellite and drone images and improves image matching accuracy. The attention mechanism has been widely applied in computer vision [16, 38-40]; it aims to make the network pay more attention to discriminative features while filtering out irrelevant information, thereby improving model training. Zhang et al.
[16] integrated relation-aware global attention into a person re-identification network, which enhances the feature representation ability by capturing the global structural information of the image and improves re-identification performance. To reduce the impact of target offset and view scaling on image matching, Zhuang et al. [38] proposed a multi-scale block attention (MSBA) structure to enhance the salient features of different regions. Lin et al. [39] introduced the unit subtraction attention module (USAM), which makes the model focus on salient areas by detecting key points in the feature map and improves performance with fewer parameters. Dai et al. [40] argued that some CNN-based operations lose fine-grained image information, so they introduced the Transformer structure [41] into cross-view localization and designed the feature segmentation and region alignment (FSRA) method, which segments the feature map into different regions based on the heat distribution for classifying and supervising each region, thus effectively realizing cross-view localization.
The above methods provide a new research idea for solving the problem of inaccurate geo-localization. Inspired by this, this method fully combines the attention mechanism with the feature extraction network to mine structural information from a global perspective. Meanwhile, the dual-branch structure is designed for joint training, and the dilated convolution is fused in the local branch to increase the receptive field of the feature map, which can capture richer multi-scale context information and further improve the accuracy of cross-view localization.

Overview architecture
The overall framework is shown in Fig. 1. The network is divided into a global branch and a local branch, which share network weights. First, the model employs ResNet50 as the backbone, with the average pooling and classification layers removed, to extract features from the input images. The relation-aware global attention module is introduced after the shallow features are extracted, which sufficiently captures the global structure information of the image. A dual-branch structure then processes the output features of the previous stage, effectively attending to both global and local information: the global branch extracts the high-level semantic information of the whole image, while the local branch focuses on context features, thereby retaining more image detail. Meanwhile, to incorporate the valuable environmental information around the building, the feature map in the local branch is divided into four distinct regions by the square-ring partition strategy. Finally, the high-level image features are converted into column-vector descriptors through global average pooling. During training, a classifier module produces the predicted category probability for each descriptor, and the cross-entropy loss minimizes the difference between the predicted class and the true one. During testing, the Euclidean distance measures the similarity between the query image and each database image, and the retrieved images are sorted by similarity.
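The test-time retrieval step described above can be sketched as follows. This is a minimal illustration of ranking database descriptors by Euclidean distance; the function name and toy vectors are hypothetical, not from the paper's released code.

```python
import numpy as np

def rank_by_euclidean(query_feat, db_feats):
    """Rank database images by Euclidean distance to the query descriptor.

    query_feat: (D,) concatenated branch descriptors of the query image.
    db_feats:   (M, D) descriptors of the M database images.
    Returns database indices sorted from most to least similar.
    """
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    return np.argsort(dists)

# toy check: the database vector closest to the query is ranked first
query = np.array([1.0, 0.0])
db = np.array([[0.0, 5.0], [1.1, 0.1], [3.0, 3.0]])
order = rank_by_euclidean(query, db)
print(order)  # -> [1 2 0]
```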

Relation-aware global attention
In the cross-view geo-localization task, the RGA module makes the network notice differences in image features, helping it distinguish buildings with similar appearance. This paper combines the RGA module with the deep residual network to construct a feature extraction network guided by relation-aware global attention, which computes attention weights by learning the relationships between feature nodes, making the network sufficiently mine the features of discriminative regions. The relation-aware global attention is shown in Fig. 2. Each feature vector in the feature map is represented as a feature node x_i, where i = 1, 2, ⋯, N and N is the number of feature nodes. For a feature node x_i, the correlations r_i,j and r_j,i between the current node and every other node j = 1, 2, ⋯, N are calculated, so the relationship vector of node x_i is r_i = [r_i,1, r_i,2, ⋯, r_i,N, r_1,i, r_2,i, ⋯, r_N,i]. The feature node x_i and the relationship vector r_i are then concatenated to obtain the relation-aware feature E_i, from which the attention weight a_i of the current node is inferred.

Spatial relation-aware global attention
The spatial relation-aware global attention (RGA-S) learns the correlations among all feature nodes in the spatial dimension of the feature map to enable the network to capture the features of the salient target. The RGA-S is shown in Fig. 3.
Specifically, for the feature map S ∈ ℝ^(C×H×W) obtained from the neural network, the C-dimensional feature vector at each spatial position is taken as a feature node, forming a graph G_s with N = W × H nodes, where the i-th node is denoted as x_i. The correlation r_i,j between feature nodes x_i and x_j is obtained through a dot product, defined as Equation (1):

r_i,j = f_s(x_i, x_j) = (ReLU(BN(Conv(x_i))))^T ReLU(BN(Conv(x_j))), (1)

where f_s(⋅) represents the dot-product operation, ReLU(⋅) is the rectified linear unit activation function, BN(⋅) denotes the batch normalization layer, Conv(⋅) represents a 1 × 1 convolution, and the dimensionality reduction ratio is controlled by a predefined positive integer. Similarly, the correlation r_j,i between feature nodes x_j and x_i can be obtained, and (r_i,j, r_j,i) represents the pairwise relationship between x_i and x_j. The correlations between all nodes form a relation matrix R_S ∈ ℝ^(N×N), where r_i,j = R_S(i, j). Stacking the relationships between the i-th feature node and all nodes in a fixed order yields the spatial relationship vector r_i = [R_S(i, :), R_S(:, i)] ∈ ℝ^(2N), where R_S(i, :) contains the correlations between the i-th node and all nodes, and R_S(:, i) the correlations between all nodes and the i-th node. To enable the network to fully exploit the global structural information, the spatial relationship vector r_i is concatenated with the feature node x_i itself to obtain the spatial relation-aware feature E_s, defined as Equation (2):

E_s = C(pool_c(ReLU(BN(Conv(x_i)))), r_i), (2)

where C(⋅) represents the concatenation operation and pool_c(⋅) represents global average pooling over the channel dimension, reducing the channel dimension to 1.
The spatial attention weight a_i is then computed from E_s, as defined in Equation (3):

a_i = sigmoid(BN(Conv_2(ReLU(BN(Conv_1(E_s)))))), (3)

where sigmoid(⋅) is the sigmoid activation function, Conv_1(⋅) reduces the channel dimension by a fixed ratio, and Conv_2(⋅) reduces the number of channels to 1.
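The RGA-S computation of Equations (1)-(3) can be sketched as a small PyTorch module. This is an illustrative re-implementation under assumptions: the embedding width `mid` and reduction `ratio` are hypothetical choices, not the paper's exact hyper-parameters, and the class name is ours.

```python
import torch
import torch.nn as nn

class RGASpatial(nn.Module):
    """Sketch of spatial relation-aware global attention (RGA-S)."""
    def __init__(self, in_ch, n_nodes, mid=32, ratio=8):
        super().__init__()
        # 1x1 Conv-BN-ReLU embeddings used inside the dot product of Eq. (1)
        self.theta = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU())
        self.phi = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU())
        # E_s has one channel-pooled feature plus the 2N-dim relation vector
        rel_ch = 2 * n_nodes + 1
        # Conv_1 reduces the dimension, Conv_2 maps to one channel (Eq. 3)
        self.gate = nn.Sequential(
            nn.Conv2d(rel_ch, rel_ch // ratio, 1),
            nn.BatchNorm2d(rel_ch // ratio), nn.ReLU(),
            nn.Conv2d(rel_ch // ratio, 1, 1), nn.BatchNorm2d(1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        t = self.theta(x).flatten(2).transpose(1, 2)    # (B, N, mid)
        p = self.phi(x).flatten(2)                      # (B, mid, N)
        rel = torch.bmm(t, p)                           # R_S: (B, N, N)
        # r_i = [R_S(i, :), R_S(:, i)] stacked for every spatial node
        rel_vec = torch.cat([rel, rel.transpose(1, 2)], dim=2)  # (B, N, 2N)
        rel_map = rel_vec.transpose(1, 2).reshape(b, 2 * n, h, w)
        pooled = x.mean(dim=1, keepdim=True)            # pool_c -> 1 channel
        e_s = torch.cat([pooled, rel_map], dim=1)       # relation-aware E_s
        a = self.gate(e_s)                              # spatial weights a_i
        return x * a                                    # re-weighted features

x = torch.randn(2, 16, 4, 4)            # toy feature map: N = 16 spatial nodes
y = RGASpatial(in_ch=16, n_nodes=16)(x)
```

Because the relation vector has 2N channels, this formulation is practical only when applied to moderately sized feature maps, which is consistent with inserting the module after early backbone stages.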

Channel relation-aware global attention
The channel relation-aware global attention (RGA-C) learns the correlations between all feature nodes in the channel dimension of the feature map to assign different weights for each channel. The RGA-C is shown in Fig. 4.
Specifically, for the acquired feature map S ∈ ℝ^(C×H×W), the feature map of each channel is treated as a feature node, forming a graph G_C with C nodes, where the i-th node is denoted as x_i (i = 1, 2, ⋯, C). The input feature map S is first spatially compressed into S′ ∈ ℝ^((HW)×C×1), and then the correlation r_i,j between feature nodes x_i and x_j is obtained analogously to RGA-S, as defined in Equation (4):

r_i,j = f_c(x_i, x_j) = (ReLU(BN(Conv(x_i))))^T ReLU(BN(Conv(x_j))), (4)

where f_c(⋅) represents the dot-product operation. Similarly, the correlation r_j,i between feature nodes x_j and x_i can be obtained, and the pairwise relationships between all nodes are collected in the matrix R_C ∈ ℝ^(C×C). The relationships between the i-th feature node and all nodes are stacked to obtain the channel relationship vector, and, analogously to Equations (2) and (3), the final channel attention weight c_i is obtained.
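The channel-node construction can be illustrated with a few tensor operations. This sketch builds the relation matrix R_C for one feature map with a raw dot product; note that the paper additionally embeds each node with Conv-BN-ReLU before the dot product of Equation (4), which is omitted here for brevity.

```python
import torch

# Channel relation matrix R_C for one feature map S of shape (C, H, W):
# each channel map is a feature node, spatially flattened to an (H*W)-dim
# vector (the compression S' described in the text).
C, H, W = 6, 4, 4
S = torch.randn(C, H, W)
nodes = S.reshape(C, H * W)                  # S': one vector per channel node
R_C = nodes @ nodes.t()                      # (C, C) pairwise relations r_ij
rel_vec = torch.cat([R_C, R_C.t()], dim=1)   # [R_C(i,:), R_C(:,i)] per node
```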

Local branch
Since rich multi-scale context information and detailed spatial structure information help the network match images of the same geographic place and thus improve the precision of cross-view geo-localization, dilated convolutions [17] with multiple dilation factors are adopted in the local branch to enlarge the receptive field of the feature map without losing image detail, so the model can capture more robust multi-scale information. Meanwhile, the feature map is divided into four scales by the square-ring partition strategy to obtain rich spatial context information. A dilated convolution expands the receptive field of the convolution kernel by inserting r − 1 zero-weight values between kernel taps, where r is the dilation factor. The structures of the standard and dilated convolutions are shown in Fig. 5, where (a) represents the standard convolution and (b) the dilated convolution with dilation factor 2. For a 3 × 3 kernel under the same conditions, the receptive fields of the standard and dilated convolutions are 3 × 3 and 5 × 5, respectively. Compared with standard convolution, dilated convolution can capture richer multi-scale context information for image matching.
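The receptive-field arithmetic above can be checked directly: inserting r − 1 zeros between the taps of a k × k kernel stretches its extent to k + (k − 1)(r − 1). The helper name below is ours, used only for illustration.

```python
def dilated_kernel_extent(k: int, r: int) -> int:
    """Effective spatial extent of a k x k kernel with dilation factor r:
    r - 1 zeros are inserted between adjacent taps, giving k + (k - 1)(r - 1)."""
    return k + (k - 1) * (r - 1)

# a 3x3 kernel: standard (r = 1) vs. dilated (r = 2), as in Fig. 5
print(dilated_kernel_extent(3, 1))  # -> 3   (3 x 3 receptive field)
print(dilated_kernel_extent(3, 2))  # -> 5   (5 x 5 receptive field)
print(dilated_kernel_extent(3, 4))  # -> 9   (dilation factor 4)
```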

Specifically, this module employs the dilated convolution with dilation factors 2 and 4 to increase the receptive field of the feature map, and the stride of both the convolutional layer and the downsampling layer in the last residual block of ResNet50 is adjusted to 1. When the resolution of the input image is 256 × 256, the resolution of the feature image output by the backbone network is 8 × 8, while that of the output feature image using the dilated residual network is 32 × 32.
To help the network better discriminate images at different geographic locations, the environment around the target building is used as auxiliary information. The feature map in the local branch is divided into four parts with the square-ring partition strategy to obtain feature representations of distinct regions. The resulting image features are then converted into 2048-dimensional feature vectors through average pooling, as represented by Equation (5):

l_i^j = Avgpool(s_i^j), (5)

where Avgpool(⋅) represents the average pooling operation, s_i^j (i ∈ [1,4]; j ∈ [1,2]) denotes the feature maps of the local branch divided on the different view platforms, and l_i^j represents the 2048-dimensional feature vector of each of the four local parts after pooling.
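The square-ring partition of Equation (5) can be sketched as follows: the feature map is split into four concentric square rings and each ring is average-pooled into one descriptor. The function name and the assumption that H and W divide evenly are ours, for illustration only.

```python
import torch

def square_ring_partition(feat: torch.Tensor, n_rings: int = 4):
    """Divide a (C, H, W) feature map into `n_rings` concentric square rings
    and average-pool each ring into a C-dimensional vector, Eq. (5) style.
    Assumes H and W are divisible by 2 * n_rings."""
    c, h, w = feat.shape
    cy, cx = h // 2, w // 2
    sh, sw = h // (2 * n_rings), w // (2 * n_rings)
    prev = torch.zeros(h, w, dtype=torch.bool)
    vectors = []
    for i in range(1, n_rings + 1):             # innermost square outwards
        box = torch.zeros(h, w, dtype=torch.bool)
        box[cy - i * sh: cy + i * sh, cx - i * sw: cx + i * sw] = True
        ring = box & ~prev                      # ring i = box i minus box i-1
        vectors.append(feat[:, ring].mean(dim=1))   # (C,) pooled descriptor
        prev = box
    return vectors

# toy 32 x 32 map with 8 channels -> four 8-dim local descriptors
vecs = square_ring_partition(torch.ones(8, 32, 32))
```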

Global branch
Since the semantic information captured by the deep network is also an important part of the cross-view geo-localization task, a global branch is designed in parallel with the local branch. The deep residual network is exploited in the global branch to extract and refine large-scale features, yielding a feature map f_j containing rich semantic information. Average pooling is then applied to obtain a 2048-dimensional feature vector that enables the network to recognize the categories of image features, as expressed by Equation (6):

g_j = Avgpool(f_j), (6)

where g_j denotes the feature vector of the global branch after pooling.

Classification of learning and loss function
The classifier module is introduced after the feature extraction stage to predict the category of each feature vector; the classifier consists of a fully connected layer (FC), a batch normalization layer (BN), a dropout layer, and a classification layer (Cls). This module takes the local feature vectors l_i^j and the global feature vector g_j as input to predict the category to which each feature vector belongs, finally producing the local and global predicted probability distribution vectors z_i^j and q_j, respectively. The method adopts the cross-entropy loss to measure the difference between the predicted probability distribution and the true one, which helps learn more robust image features and enhances training accuracy. The cross-entropy loss can be expressed by Equation (7):

Loss = − Σ_j Σ_i log p(y|x_i^j) − Σ_j log q(y|x_j), (7)

where x_i^j (i ∈ [1,4]; j ∈ [1,2]) denotes the corresponding segmented region of the original image, x_j (j ∈ [1,2]) represents the input image, with j = 1 for the UAV platform and j = 2 for the satellite platform; y denotes the true category of the input image; and p(y|x_i^j) and q(y|x_j) respectively represent the normalized probability scores of x_i^j and x_j belonging to the true category, defined by Equations (8) and (9):

p(y|x_i^j) = exp(z_i^j(y)) / Σ_{c=1}^{C} exp(z_i^j(c)), (8)

q(y|x_j) = exp(q_j(y)) / Σ_{c=1}^{C} exp(q_j(c)), (9)

where C represents the number of geo-tagged categories in the database.
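The classifier head and the summed cross-entropy objective can be sketched in PyTorch. This is an illustrative implementation: the hidden width (512), dropout rate (0.5), and batch size are hypothetical choices, and `nn.CrossEntropyLoss` applies the softmax normalization of Equations (8)-(9) internally.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """FC -> BN -> Dropout -> classification layer, one head per descriptor."""
    def __init__(self, in_dim=2048, num_classes=701, hidden=512, p=0.5):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.bn = nn.BatchNorm1d(hidden)
        self.drop = nn.Dropout(p)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, v):
        return self.cls(self.drop(self.bn(self.fc(v))))   # class logits

# one head per descriptor: the global vector g_j and four local vectors l_i^j
heads = nn.ModuleList(ClassifierHead() for _ in range(5))
ce = nn.CrossEntropyLoss()   # softmax of Eqs. (8)-(9) applied internally
vectors = [torch.randn(8, 2048) for _ in range(5)]   # a toy batch of 8
labels = torch.randint(0, 701, (8,))
# Eq. (7): cross-entropy summed over the global and the four local heads
loss = sum(ce(head(v), labels) for head, v in zip(heads, vectors))
```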

Datasets
In this paper, three datasets of University-1652 [13], CVUSA [21], and CVACT [31] are exploited to train and test the proposed method.
(1) University-1652 is a multi-view, multi-source dataset containing drone-view, satellite-view, and ground-view images of 1652 buildings from 72 universities; the images in the training and test sets do not overlap. Our method uses this dataset to study two tasks: drone-view target localization and drone navigation. In drone-view target localization, there are 701 categories of drone-view query images, and each category corresponds to one true-matched satellite image. In drone navigation, the satellite-view query set contains 701 categories, and each category corresponds to 54 true-matched drone images.

Experimental detail
Our method is implemented on a Linux server running Ubuntu 20.04, and all performance comparisons are based on results under this configuration. The server is equipped with an RTX 3090 GPU with 24 GB of memory. The model is implemented in the PyTorch framework. Before training, all input images are resized to 256 × 256, and horizontal flipping and random rotation are used for data augmentation. The SGD optimizer with momentum 0.9 and weight decay 0.0005 is adopted to update the model, with an initial learning rate of 0.001. To accelerate convergence, the model is trained for 140 epochs on the University-1652 dataset and 100 epochs on the CVUSA and CVACT datasets. During testing, the feature vectors of the branches are concatenated to obtain the final feature representation used for image matching.

Quantitative comparison
In this paper, recall at top K (Recall@K) and average precision (AP) are adopted as the image retrieval performance metrics. Recall@K is the proportion of queries for which a true-matched image appears within the top K retrieved results; this paper mainly considers the case K = 1. AP summarizes the precision of the ranked retrieval list over all true-matched images. The larger the Recall@K and AP values, the higher the precision of image retrieval.
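The two metrics can be sketched for a single query as follows. The function names are ours; AP is computed here in the standard way, as the mean of the precision values at the ranks where true matches occur.

```python
import numpy as np

def recall_at_k(ranked_labels, true_label, k=1):
    """1 if a true match appears in the top-k retrieved results, else 0."""
    return int(true_label in ranked_labels[:k])

def average_precision(ranked_labels, true_label):
    """AP of one ranked retrieval list: mean precision at each true match."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == true_label:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

# toy ranking with true matches at ranks 1 and 3
ranked = ['A', 'B', 'A', 'C']
r1 = recall_at_k(ranked, 'A', k=1)       # -> 1
ap = average_precision(ranked, 'A')      # -> (1/1 + 2/3) / 2 = 0.8333...
```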
As can be seen from Table 1, our method achieves the best results in both tasks on the University-1652 dataset. (The results of the methods [22,27,42] in Table 1 are obtained by replacing the loss function in the framework of method [13]; the optimal and suboptimal results are indicated in bold and underline, respectively.) In the drone-view target localization task, i.e., Drone→Satellite, the proposed method reaches 81.06% R@1 and 83.74% AP, improvements of 3.99% and 3.65% over the suboptimal method LPN+USAM [39]. In the drone navigation task, i.e., Satellite→Drone, it achieves 89.58% R@1 and 79.63% AP, improvements of 3.13% and 4.84% over the suboptimal method LPN [14], which demonstrates a significant advantage in image retrieval performance.
The comparisons with other approaches on the CVUSA and CVACT_val datasets are detailed in Table 2. Since the ground images in these two datasets are panoramic, a sequential partition strategy [14] is adopted to divide the images. The comparison methods include CVM-Net [27], Orientation† [31], Instance Loss [13], Regmi et al. [33], Siam-FCANet [43], CVFT [12], LPN [14], Instance Loss+USAM [39], and LPN+USAM [39]; the results of LPN [14] and LPN+USAM [39] are generated by training with the publicly released code, while the other methods use the results provided by their authors.
On the CVUSA dataset, the proposed method achieves 88.00% and 99.47% on R@1 and R@Top1%, respectively. Compared with the other nine advanced models, it delivers a clear improvement in retrieval performance, most notably a 2.03% gain on R@1. On the CVACT_val dataset, the proposed method reaches 80.98% and 96.53% on R@1 and R@Top1%, both optimal results, demonstrating the effectiveness of our method.
Besides integrating relation-aware global attention into the feature extraction network to capture rich global structural information, our method also designs a joint training structure with parallel global and local branches to fully mine multi-scale context features, which is the key reason the proposed method outperforms other cross-view geo-localization models.

Qualitative results
Figures 6 and 7 show the retrieval results of our method on the University-1652 dataset, visualizing the drone-view target localization and drone navigation tasks, respectively; Fig. 8 shows the retrieval results on the CVUSA dataset. In these qualitative results, each row represents the retrieval results for one location: the first image is the query, and the top-ranked matches are shown to the right of the dotted line, where a yellow box marks a true match and a blue box a false match.

For the drone-view target localization task, only one truly matched image appears among the top five results shown in Fig. 6, because each drone-view image has only one matching satellite image; this shows that our method can correctly retrieve the matched image despite interference from similar images. For the drone navigation task, the top five results in Fig. 7 are all correct matches, because each satellite image is matched by 54 drone-view images. Since each ground image in the CVUSA dataset corresponds to one true satellite image, the first image in each retrieval result in Fig. 8 is the correct match. This analysis shows that our method retrieves the correct results on both datasets, further demonstrating its effectiveness.

Ablation experiment
To verify the effectiveness of each component, we conduct several ablation experiments on the University-1652 dataset.

Effectiveness of the relation-aware global attention
To verify the effectiveness of the relation-aware global attention module, two ablation experiments are conducted. The first removes the relation-aware global attention module and uses only the dual-branch network for image feature extraction. The second adds the SE attention module from SENet [44] to the network of the first experiment to obtain channel-wise attention.

According to the results in Table 3, compared with adding no attention mechanism or adding an SE attention module, the relation-aware global attention module makes the network attend to the discriminative features of the image while capturing more robust global structure information, which improves the retrieval ability of the network and achieves the best performance.

Effectiveness of the dilated convolution
In this part, three ablation experiments verify the effectiveness of the dilated convolution: we adjust the dilated convolutions in the residual blocks of the local branch and adopt different dilation factors to extract image features. The results in Table 4 show that enlarging the receptive field of the feature map with dilated convolution effectively captures more detailed image information and mines latent features. The model performs best when the dilation factors are 2 and 4.

Effectiveness of the dual-branch structure
The dual-branch structure is an important component of this method. Therefore, two ablation experiments are conducted to verify its effectiveness, that is, using different branches to extract features for subsequent matching. According to the results in Table 5, it can be found that using the dual-branch for joint training can adequately exploit the semantic and multi-scale context information of the image, thus obtaining the optimal retrieval performance.

Effect of the input image size on the results
In real-world applications, training with high-resolution images achieves better accuracy but requires more computational resources and time. With limited resources, low-resolution input images may be necessary in practice, which reduces image matching accuracy. This paper therefore designs a set of ablation experiments to observe the impact of input resolution on model performance. The results in Table 6 show that as the input image size increases from 224 to 320, both R@1 and AP improve; when the size increases to 384, performance decreases slightly.

Conclusion
In this paper, we proposed a cross-view geo-localization method guided by relation-aware global attention, which exploits relation-aware global attention to capture global structural information and extract more robust image features for geo-localization. Meanwhile, a dual-branch structure is designed for joint training: the local branch adopts dilated convolution to enlarge the receptive field of the feature map and divides the feature map into four scales, yielding feature representations that contain both semantic and context information for computing image category probabilities, which leads to higher geo-localization accuracy. Experimental results on the University-1652, CVUSA, and CVACT datasets show that our method achieves significant improvements in both Recall@K and AP, and that it can avoid interference from similar buildings and retrieve the correct image. However, the dual-branch structure increases model complexity and parameter count, which lengthens training and testing time. In future research, we will consider the complexity of real-scene images, further study cross-view geo-localization methods that adapt to complex scenes, and explore lightweight networks to improve geo-localization accuracy.