Image Mosaic Research and Realization Based on LoFTR Algorithm


 Traditional image matching methods suffer from inaccurate feature point extraction, low robustness, and difficulty in identifying feature points in regions with poor texture. To address these problems, this paper proposes a new local image feature matching method that replaces the traditional sequential steps of feature detection, description and matching. First, coarse features at 1/8 of the original image resolution are extracted, flattened into one-dimensional vectors and added to a positional encoding, then fed to the self-attention and cross-attention layers of the Transformer module. The output passes through a differentiable matching layer to produce a confidence matrix, and after applying a confidence threshold and the mutual nearest neighbor criterion, coarse-level match predictions are obtained. These matches are then refined at the fine level. Once the fine-level matches are established, the overlapping region of the images is aligned by a transformation matrix into a unified coordinate system, and the images are finally fused with a weighted fusion algorithm to achieve a seamless mosaic. The self-attention and cross-attention layers of the Transformer are used to obtain the feature descriptors of the images. Experiments show that the LoFTR algorithm extracts feature points more accurately than the traditional SIFT algorithm in both low-texture and texture-rich regions, and that the mosaic produced by this method is more accurate than those of the traditional classic algorithms.


Introduction
Image mosaic refers to the synthesis of two or more images with obvious overlapping areas into an image with wide viewing angle and high resolution. The fused image is similar to the multiple images before fusion, and has most of the image information. At present, image mosaic technology is widely used in many fields such as computer vision, image processing, vehicle-assisted driving, human-computer interaction and computer graphics.
Image mosaic technology includes two key steps: image registration and image fusion. Image registration directly determines the quality and efficiency of the mosaic. Local feature matching between images is the cornerstone of many three-dimensional computer vision tasks. Existing matching pipelines consist of feature detection, feature description and feature matching. Due to factors such as texture differences, viewpoint changes, illumination changes and motion blur, traditional algorithms may fail to find repeatable interest points or to establish correspondences from descriptors. To address these problems, researchers have proposed a variety of local feature matching methods. Traditional feature matching algorithms include the Harris algorithm [1], proposed on the basis of the Moravec algorithm [2], the ORB (Oriented FAST and Rotated BRIEF) algorithm, and the SIFT (Scale-Invariant Feature Transform) algorithm. The Moravec operator is a corner detector based on gray-scale variance: it computes the gray-scale variance of a pixel along the horizontal, vertical, diagonal and anti-diagonal directions, selects the minimum as the interest value of the pixel, and then applies local non-maximum suppression to decide whether the pixel is a corner. The Harris algorithm improves and optimizes the Moravec operator; it is rotation invariant and insensitive to gray-level shifts, but the Harris corner detector is not invariant to geometric scale, which leads to poor feature matching. The SIFT algorithm was first proposed by David G. Lowe in 1999 and was improved and officially published in 2004 [3]. SIFT features are local features of the image that are invariant to rotation, scale and brightness changes, and remain reasonably stable under changes of viewing angle and noise.
The SIFT algorithm constructs a 128-dimensional vector for each feature point, which makes it slow. H. Bay et al. therefore proposed an improved algorithm [4] that runs much faster. E. Rublee et al. proposed a very fast binary descriptor based on BRIEF [5], a fast feature point extraction and description algorithm. In 2016, K. M. Yi et al. proposed the LIFT (Learned Invariant Feature Transform) algorithm [6], which introduces a novel network architecture that uniformly addresses three previously separate problems: feature detection, orientation estimation and description. Because many multi-view geometry problems cannot be solved by traditional algorithms, D. DeTone et al. proposed, in 2018, a self-supervised framework to train keypoint detectors and feature descriptors suitable for multi-view geometry problems [7].
As one of the key technologies of image mosaicking, image fusion has been studied in depth by many researchers. The AKAZE-based image mosaic algorithm proposed by S. K. Sharma et al. [8] minimizes the mosaic seam and generates a near-seamless mosaic image. A fast sonar image mosaic method, composed of denoising, feature point extraction, mosaicking and optimization, is proposed in [9], effectively improving the quality of the mosaic image. A deep convolutional neural network was used in [10] to adaptively obtain image features, and the results showed the robustness and effectiveness of the method. Reference [11] proposed a principal component invariant feature transformation that improves the speed of image mosaicking and facilitates image fusion while preserving mosaic quality. Reference [12] introduced an efficient mosaic method based on a 6-DoF imaging model, which reduces the number of unknown variables in parameter optimization; compared with existing methods, it obtains more accurate mosaic results. J. Zhang et al. proposed a RANSAC algorithm based on block matching [13] to eliminate mismatched points during keypoint matching; the algorithm is highly robust. L. Li et al. proposed a seam detection algorithm [14] that effectively hides artifacts caused by dynamic objects and geometric misalignment. Y. Wang et al. proposed an image mosaic algorithm based on empirical mode decomposition [15], which significantly reduces computation time. Z. Yang et al. proposed an image serialization method based on line-segment speeded-up robust features [16], which is robust for panoramic images. H. Nejad et al. proposed a new hybrid algorithm with a Gaussian weighting function [17] that performs well in image mosaicking and registration. J. Kaur proposed a normalized improved SIFT algorithm [18]; compared with traditional SIFT, it reduces computation time and improves efficiency.
However, some problems remain in real scenes: traditional feature point matching algorithms are inaccurate in regions with poor texture and lack robustness, resulting in poor mosaics. With the continuous development of deep neural networks, a new detector-free local feature matching method is proposed that generates dense descriptors and dense feature matches, greatly improving matching accuracy and robustness. A weighted average fusion algorithm is then used to fuse the images, significantly improving the quality of the mosaic.
The block diagram of image mosaic process is shown in Fig. 1.

[Fig. 1: the reference image and the input image pass through image registration, computation of the transformation matrix, and image fusion to produce the image mosaic.]

Given an image pair I A and I B, conventional local feature matching methods use a feature detector to extract interest points. We adopt a detector-free design to avoid the repeatability problem of feature detectors. An overview of the proposed method, LoFTR, is presented in Fig. 2.

Local Feature Extraction
First, a convolutional neural network with a Feature Pyramid Network (FPN) [19] is used to extract multi-level features from both images, as shown in Fig. 2.

Transformer module
After local feature extraction, the coarse feature maps F̃A and F̃B are flattened into one-dimensional vectors and added to the corresponding positional encodings; the resulting features enter the LoFTR module for processing, and the processed features are denoted F̃A_tr and F̃B_tr. A Transformer consists of an encoder and a decoder; the LoFTR module consists of self-attention and cross-attention layers. The encoder of the Transformer is composed of sequentially connected encoder layers, as shown in Fig. 3(a). The key element of an encoder layer is the attention layer. The inputs of the attention layer are a query vector (Query, Q), a key vector (Key, K) and a value vector (Value, V). The query vector Q retrieves information from the value vector V according to the attention weights computed from the dot product of Q with the key vector K corresponding to each value v. The computation of the attention layer is shown in Fig. 3(b), and its formula can be written as

Attention(Q, K, V) = softmax(QK^T)V.

The standard positional encoding proposed in DETR [20] is used in the Transformer. It provides unique position information for each element in a sinusoidal format; by adding the position code to F̃A and F̃B, the transformed features become position dependent. Denoting the length of Q and K as N and their feature dimension as D, the cost of the dot product between Q and K grows quadratically with N. In the setting of local feature matching, it is therefore impractical to use the original version of the Transformer directly. To solve this problem, we replace the exponential kernel used in the original attention layer with an alternative kernel function, yielding a linear Transformer, as shown in Fig. 3(c).
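To make the two attention variants concrete, the following is a minimal NumPy sketch (not code from this paper); the kernel phi(x) = elu(x) + 1 follows the linear Transformer literature, and the function names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Standard attention: cost grows quadratically with sequence length N."""
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)         # (N, M) similarity matrix
    return softmax(scores, axis=-1) @ V   # attention-weighted sum of values

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def linear_attention(Q, K, V, eps=1e-6):
    """Linear-attention variant using the kernel phi(x) = elu(x) + 1,
    reducing the cost from O(N^2) to O(N)."""
    Qp, Kp = elu(Q) + 1.0, elu(K) + 1.0
    KV = Kp.T @ V                         # (D, Dv), computed once
    Z = Qp @ Kp.sum(axis=0)               # per-query normalization
    return (Qp @ KV) / (Z[:, None] + eps)
```

The key point is that the linear variant first contracts K with V, so no N×M attention matrix is ever materialized.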

Establish Coarse-level Matches
The differentiable matching layer in LoFTR can use an Optimal Transport (OT) layer. The score matrix S between the transformed features is first computed as

S(i, j) = (1/τ) · ⟨F̃A_tr(i), F̃B_tr(j)⟩,

where τ is the dimension of the feature F̃_tr. When matching with OT, −S can be used as the cost matrix of the partial assignment problem in [21]. Alternatively, softmax can be applied on both dimensions of S (referred to below as dual-softmax) to obtain the probability of soft mutual nearest neighbor matching. Formally, when using dual-softmax, the matching probability P_c is obtained by

P_c(i, j) = softmax(S(i, ·))_j · softmax(S(·, j))_i.

Match selection is based on the confidence matrix P_c: we select matches with confidence higher than a threshold θ_c and further enforce the mutual nearest neighbor (MNN) criterion to filter out unreliable coarse matches. The coarse-level match predictions are expressed as

M_c = {(ĩ, j̃) | ∀(ĩ, j̃) ∈ MNN(P_c), P_c(ĩ, j̃) ≥ θ_c}.
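The dual-softmax confidence matrix and the MNN-plus-threshold selection can be sketched as follows; this is an illustrative NumPy sketch with hypothetical function names, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coarse_matches(feat_a, feat_b, theta_c=0.2):
    """Dual-softmax matching with mutual nearest neighbor filtering.
    feat_a: (N, D) and feat_b: (M, D) transformed coarse features."""
    tau = feat_a.shape[1]                 # scaling factor (feature dimension)
    S = feat_a @ feat_b.T / tau           # score matrix S(i, j)
    # P_c(i, j) = softmax over row i * softmax over column j
    Pc = softmax(S, axis=1) * softmax(S, axis=0)
    rows = Pc.argmax(axis=1)              # best j for each i
    cols = Pc.argmax(axis=0)              # best i for each j
    # keep (i, j) only if each is the other's argmax and confidence passes
    matches = [(i, j) for i, j in enumerate(rows)
               if cols[j] == i and Pc[i, j] >= theta_c]
    return Pc, matches
```

Matching a feature set against itself should recover the identity pairing, which is a convenient sanity check.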

Establish Fine-level Matches
After establishing coarse matches, they are refined to the original image resolution with a coarse-to-fine module. For every coarse match (ĩ, j̃), we first locate its position (î, ĵ) on the fine-level feature maps F̂A and F̂B, and then crop two sets of local windows of size w × w. A smaller LoFTR module transforms the cropped features within each window N_f times, yielding two transformed local feature maps F̂A_tr(î) and F̂B_tr(ĵ) centered at î and ĵ, respectively. Then, we correlate the center vector of F̂A_tr(î) with all vectors in F̂B_tr(ĵ) to produce a heat map that represents the matching probability of each pixel near ĵ with î. By computing the expectation over this probability distribution, we obtain the final position ĵ′ with sub-pixel accuracy on I B. Gathering all the matches (î, ĵ′) produces the final fine-level matches M_f.
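The correlate-then-take-expectation step (a soft-argmax over the heat map) can be sketched as below; this is a minimal NumPy illustration under the paper's definitions, with hypothetical argument names:

```python
import numpy as np

def refine_match(center_vec, window_feats, w):
    """Correlate the center vector of window A with every vector in
    window B (w x w), then take the expectation over the resulting
    probability distribution to obtain a sub-pixel match position.
    center_vec: (D,); window_feats: (w*w, D) vectors of window B."""
    scores = window_feats @ center_vec        # (w*w,) correlations
    heat = np.exp(scores - scores.max())
    heat /= heat.sum()                        # matching probability map
    heat = heat.reshape(w, w)
    ys, xs = np.mgrid[0:w, 0:w]
    # expectation (soft-argmax) yields continuous window coordinates
    return (heat * xs).sum(), (heat * ys).sum()
```

A sharply peaked heat map reduces to the peak position, while a spread-out one interpolates between cells, which is what gives sub-pixel accuracy.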

Image fusion
Image fusion is the last step of image mosaicking: the two feature-matched images are fused into one complete image. In this paper, the gradual-in, gradual-out weighted image fusion method, also called Weighted Averaging (WA), is used; it is a relatively simple image fusion algorithm. The weighted fusion method is fast, easy to implement, suppresses noise in the fused image and improves its signal-to-noise ratio [22].
The main idea of the weighted fusion method is to combine the pixel values of the two images with weights, realizing a seamless mosaic by choosing the weights appropriately. The fused pixel is computed as

I(x, y) = w1 · I1(x, y) + w2 · I2(x, y),

where I1(x, y) and I2(x, y) are the two images being fused, and w1 and w2 are the fusion weights with 0 < w1 < 1, 0 < w2 < 1 and w1 + w2 = 1. In general one takes

w1 = d2 / (d1 + d2), w2 = d1 / (d1 + d2),

where d1 and d2 are the distances from an overlapping point (x, y) to the left and right boundaries of the overlap region, width is the width of the overlap region, and d1 + d2 = width.
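The gradual-in, gradual-out blending of an overlap strip can be sketched in NumPy as follows; this is an illustrative sketch assuming I1 is the left image, so its weight falls linearly from 1 at the left boundary of the overlap to 0 at the right:

```python
import numpy as np

def blend_overlap(img1, img2):
    """Gradual-in, gradual-out weighted fusion of the overlap region.
    img1, img2: (H, W) aligned overlap strips. The weight of img1 is
    w1 = d2 / width, i.e. img1 dominates near its own (left) boundary."""
    H, W = img1.shape
    d1 = np.arange(W, dtype=float)   # distance to the left boundary
    d2 = (W - 1) - d1                # distance to the right boundary
    w1 = d2 / (W - 1)
    w2 = 1.0 - w1                    # weights always sum to one
    return w1[None, :] * img1 + w2[None, :] * img2
```

At the left edge the output equals img1, at the right edge it equals img2, and in between the transition is linear, which is what hides the seam.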

Results and discussion
To verify the effectiveness of the image mosaic method proposed in this paper, a comparative experiment was carried out. The experimental environment is an Intel Core i5-8400 CPU at 2.81 GHz with 8 GB of memory, running Windows 10; the development environment is PyCharm Community Edition 2020.2.2. Images with a resolution of 1200×1440 were selected for the experiments. A mobile phone camera was used to shoot one set of photos in an indoor scene and one in an outdoor scene; the SIFT algorithm and the LoFTR algorithm were used to match the feature points of each pair, and the weighted average fusion method was used to fuse the images according to the matching results. The results of feature point matching and image mosaicking are shown in Fig. 4.

Table 1 shows that the PSNR value obtained with the LoFTR algorithm is higher than that of SIFT both indoors and outdoors. A higher PSNR value indicates better image quality, so from the perspective of peak signal-to-noise ratio the images stitched with the LoFTR algorithm are better than those produced with the traditional SIFT algorithm. This is consistent with the feature matching analysis: the LoFTR algorithm extracts feature points more accurately than traditional SIFT and finds denser matches, so with the same weighted average fusion the LoFTR-based stitching is also better. Table 2 shows that the SSIM value of the LoFTR algorithm is likewise higher than that of SIFT, both in the low-texture indoor scene and the texture-rich outdoor scene. A higher SSIM value indicates higher image quality, so from the perspective of structural similarity the LoFTR algorithm also yields better stitching quality than the traditional SIFT algorithm.
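For reference, the PSNR metric used in the comparison can be computed as below; this is a standard-formula sketch (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def psnr(img, ref, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE).
    Higher values indicate the stitched image is closer to the reference."""
    mse = np.mean((img.astype(float) - ref.astype(float)) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

SSIM, the second metric in the comparison, additionally accounts for luminance, contrast and structure rather than raw pixel error, which is why both are reported.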

Conclusions
To address the inaccuracy and low robustness of traditional image matching methods, this paper proposes a new local image feature matching method combined with deep neural networks. First, pixel-wise dense matching is established at the coarse level, then good matches are refined at the fine level, and finally the images are fused with the weighted average fusion algorithm. Experiments show that the proposed method matches features between two images more accurately than traditional methods and, especially in regions with poor texture, extracts feature points better. Comparison of the peak signal-to-noise ratio and structural similarity values clearly shows that the LoFTR algorithm outperforms the traditional SIFT algorithm, and that the stitched image obtained after fusion is also better than that of the traditional algorithms.