Video Stitching Method for a Surveillance System Deployed at a Fully Mechanized Mining Face

Abstract: Using video stitching technology, video images with overlapping parts can be stitched into a complete image that is intuitive, visualizable, and amenable to measurement and analysis. This technology can be applied in coal mine operations for remote monitoring and control of coal production. However, when the technology is used in coal mines, several challenges arise, such as non-uniform illumination, missing scenes, and oblique panoramas. In this paper, methods are proposed to solve these problems: (1) To overcome the non-uniform illumination on a mining face, we applied wide dynamic range technology to the images from a single camera and a histogram matching algorithm to multiple images to reduce the color difference between them; (2) To overcome the missing scene problem due to the narrow field of view (FOV) of a single camera, SURF matching and template recognition are combined to achieve stable stitching; (3) To overcome the oblique panorama issue, we applied vertical correction exploiting the posture information of the camera, and then concatenated the adjacent images. The results of practical experiments show that the proposed methods solve the above problems at a fully mechanized mining face. This research provides a new approach for displaying extended scenes of stope faces in intelligent collieries.


Introduction
A video surveillance system in coal mines can help avoid emergency disasters and is a necessary component of modern coal mines for safe production (Vujic et al., 2008; Şalap et al., 2009; Zhou, 2011). Real-time monitoring of the underground situation and mining progress can be realized by deploying a video surveillance system on the fully mechanized mining face. Generally, because of the long range of the mining face and the undulating terrain, establishing a video surveillance system requires dozens of cameras to cover the entire face, where each camera covers a limited field of view (FOV) and works independently of the others. It is difficult for surveillants to view dozens of videos simultaneously, making it a challenge to grasp the whole mining situation in time. Clearly, it is valuable to improve the usability of video surveillance in a coal mine.
Video stitching has emerged as a new research field in computer vision. This method can stitch two or more digital images with overlapping parts into a panoramic image with a larger FOV. It has been widely used in many applications such as mine inspection (König & O'Leary, 2020), medical treatment (Zheng et al., 2020), remote sensing (González-García et al., 2020), and 3D reconstruction (Li et al., 2019). However, when it is applied to videos from the surveillance system deployed in coal mines, the stitching technology faces the following problems: (1) underexposure and overexposure due to non-uniform illumination; (2) missing scenes due to the insufficient FOV of the cameras; (3) obliquity of the panoramic images due to pose changes of the cameras.
In underground coal mines, the artificial light is non-uniform, so underexposure and overexposure are common. It is necessary to enhance images taken from coal mine surveillance systems. To this end, algorithms based on illumination adjustment and the bi-γ function have been proposed for enhancing non-uniformly illuminated scenes (Zhi et al., 2017, 2018). Improved algorithms based on the wavelet transform have contributed to denoising and defogging (Yanqin, 2013; Zhao et al., 2015). Some researchers used deep networks to solve the low-contrast problem (De-yong & Ze-xun, 2019; Yuehua & Weiqiang, 2019). Other methods, such as histogram equalization, visual characteristics (Hua & Jiang, 2014), the dark channel prior, and the CLAHE algorithm (Dongmei & Siqi, 2019), have been applied to enhance coal mine images.
Although post-enhanced images contain more evident features, the stitching of scenes at a fully mechanized mining face still faces some problems. Most existing video cameras at fully mechanized mining faces are deployed at the top of the hydraulic supports. The adjacent cameras are too far apart for their coverage to overlap. Therefore, the FOV of a single camera must be expanded to make image stitching possible. Recent studies related to image stitching have focused on reducing deformation and shortening the running time. Shum et al. (2000) proposed a combination of global and local registration methods to enable unstable handheld shooting to quickly obtain panoramic images. Qu et al. (2020) significantly reduced the computational complexity based on a binary tree and rectified deformed images by estimating the overlapped area. The RUF algorithm proposed by Liu et al. (2020) helped improve the accuracy and real-time performance of image stitching. Vishwakarma and Bhuyan (2020) first applied automatic image sorting and reduced the computational complexity of feature detection to improve the efficiency of image stitching. In terms of real-time performance, the algorithm proposed by Yoon and Lee (2018) can achieve 13 fps for continuous video obtained by a motion camera. Bai et al. (2020) proposed a video stitching method for monitoring systems in coal mines, reaching a frame rate of 26 fps.
Although image stitching can expand the FOV of a single camera, the obtained panoramic images will have a perspective effect in which closer objects appear larger and farther objects appear smaller. Therefore, the extended images captured by adjacent cameras cannot be directly concatenated. The perspective effect is due to oblique shooting, so it can be eliminated by vertical correction. However, the correction of oblique images typically depends on control point information of the scene (Wang et al., 2011; Wang et al., 2016) or the 3D information of the scene (Habib et al., 2007; Dong et al., 2019). Information of this sort is difficult to obtain in the case of fully mechanized mining faces, making it a challenge to correct the video obliquity.
To realize stitching of videos taken by existing video surveillance systems deployed at a fully mechanized mining face, methods are proposed in this paper to solve the above problems. In terms of enhancing coal mine images, we performed wide dynamic adjustment and multi-image histogram matching to increase the information entropy of a single image and to eliminate hue differences between multiple images under non-uniform illumination. For image stitching, we used a combination of SURF matching and template recognition to achieve stable stitching. For correcting oblique images, we applied the photogrammetry principle to complete the vertical correction with the posture information of the camera, enabling the concatenation of adjacent images.

Enhancement of images taken by a single camera
The cameras deployed at a fully mechanized mining face are typically located between hydraulic supports. Therefore, their FOV is limited, and the image quality depends heavily on the light source. Due to space and electricity constraints, the distribution of the artificial light sources is non-uniform, resulting in uneven image quality. The poor image quality can be attributed to two causes. When strong light appears in the scene, the nearby areas are overexposed. On the other hand, when the illuminance of the scene is extremely low, images are seriously underexposed. These issues lead to loss of image information.
By tuning metering methods, shutter speeds, the size of aperture, the sensitivity, and other parameters of the camera, we can resolve the overexposure or underexposure problems separately. However, simply tuning the above parameters cannot significantly improve the quality of an image which is partially overexposed and underexposed. To this end, we applied a wide dynamic range (WDR) technology to simultaneously perform strong light suppression and low illumination recovery, thus successfully improving the original image quality.
The WDR technology uses multiple exposures of the same scene with different shutter speeds, and then a high-speed digital signal processor combines the multiple images to increase the dynamic range of the output image. For the highlighted part of the scene, the processor will select an image with a high shutter speed; for the shadow part of the scene, the processor will select an image with a low shutter speed. Finally, the selected parts are combined to obtain an output image with a more uniform brightness.
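The idea of selecting well-exposed regions from multiple shutter speeds can be illustrated with a simplified, weight-based fusion. This is a minimal numpy sketch, not the camera's on-board WDR processor; the Gaussian "well-exposedness" weighting is an assumption borrowed from common exposure-fusion practice:

```python
import numpy as np

def fuse_exposures(frames, sigma=0.2):
    """Blend differently exposed frames of the same scene into one
    wide-dynamic-range image, weighting each pixel by how close it is
    to mid-gray (a simplified 'well-exposedness' measure).

    frames: list of float arrays in [0, 1], identical shapes.
    """
    stack = np.stack(frames)                       # (n, H, W)
    # Pixels near 0.5 (neither blown out nor crushed) get high weight.
    weights = np.exp(-((stack - 0.5) ** 2) / (2 * sigma ** 2))
    weights /= weights.sum(axis=0, keepdims=True)  # normalize per pixel
    return (weights * stack).sum(axis=0)

# Usage: a fast-shutter frame keeps the highlights, a slow-shutter
# frame keeps the shadows; fusion favors whichever is better exposed.
dark = np.clip(np.linspace(0.0, 0.4, 256), 0, 1)[None, :].repeat(8, axis=0)
bright = np.clip(np.linspace(0.5, 1.0, 256), 0, 1)[None, :].repeat(8, axis=0)
fused = fuse_exposures([dark, bright])
```

The per-pixel convex combination guarantees the fused value stays between the darkest and brightest input, so no new clipping is introduced.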

Enhancement of images taken by multiple cameras
The images shot by a single camera cannot provide a large scene of the mining face. Therefore, it is necessary to stitch images from the same camera at continuous poses and at continuous moments. Because of differences in the lighting conditions and camera parameter configurations, the obtained images always differ in brightness, contrast, and saturation. Stitching them directly will produce unexpected results or even fail. Therefore, these differences should be minimized. We used histogram matching to reduce the differences in the characteristics of the different images and to improve the quality of the panoramic images.
The principle of histogram matching is to approximate the probability density function of the source image pixels to that of the target image pixels. Suppose the gray value is continuous in the interval [0, L − 1]; in the source image, r is the gray value and p_r(r) is its probability density function; in the target image, z is the gray value and p_z(z) is its probability density function. Here, p_r(r) and p_z(z) are known functions, and the conversion between the random variables r and z is completed with the help of an intermediate random variable s, which follows a uniform distribution on the interval [0, 255]. The relationships between s and r and between s and z are as follows:

s = T(r) = (L - 1) \int_0^r p_r(w) \, dw    (1)

s = G(z) = (L - 1) \int_0^z p_z(t) \, dt    (2)

We express s using r (Equation (1)) and then apply the inverse function of G(z):

z = G^{-1}(s) = G^{-1}(T(r))    (3)

For digital images, the random variables r, z, and s are discrete, so we discretize the above formulae:

s_k = T(r_k) = (L - 1) \sum_{j=0}^{k} p_r(r_j), \quad k = 0, 1, \ldots, L - 1    (4)

z_q = G^{-1}(s_k), \quad \text{where } G(z_q) = (L - 1) \sum_{i=0}^{q} p_z(z_i)    (5)

The histogram of the processed image will be similar to that of the target image. In practical experiments, we selected images with better lighting conditions as target images, and the others were histogram matched to the target image, to obtain consistency in the brightness, saturation, and contrast of all images.
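The discrete mapping z = G⁻¹(T(r)) can be implemented directly with cumulative histograms; the following is a minimal numpy sketch (function names and the synthetic test images are illustrative, not part of the deployed system):

```python
import numpy as np

def match_histogram(source, target, levels=256):
    """Map source gray levels so the source histogram approximates the
    target histogram, via the discrete CDFs s = T(r) and z = G^{-1}(s)."""
    src_hist, _ = np.histogram(source, bins=levels, range=(0, levels))
    tgt_hist, _ = np.histogram(target, bins=levels, range=(0, levels))
    src_cdf = np.cumsum(src_hist) / source.size   # T(r): uniform intermediate s
    tgt_cdf = np.cumsum(tgt_hist) / target.size   # G(z)
    # Invert G by interpolation: for each s = T(r), find z with G(z) ~= s.
    mapping = np.interp(src_cdf, tgt_cdf, np.arange(levels))
    return mapping[source].astype(np.uint8)

# Usage: pull an underexposed frame toward a well-lit reference frame.
rng = np.random.default_rng(0)
dark_img = rng.integers(0, 100, (64, 64))     # underexposed source
ref_img = rng.integers(100, 256, (64, 64))    # well-lit target
matched = match_histogram(dark_img, ref_img)
```

After matching, the gray-level distribution of the output closely follows the target's, which is what makes the subsequent stitch visually consistent.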

Stable stitching by combining SURF matching and template recognition
Because the number of cameras is limited, adjacent cameras are separated by 2-3 hydraulic supports. When shot from the same pose, the images from adjacent cameras cannot be directly stitched (shown in Fig. 1(a)), because no overlapping information exists between the two images. To solve this problem, we successively obtained images shot by the same camera at three different angles (shown in Fig. 1(b)) and then applied the SURF matching and template recognition methods to stitch the three images. The three images are stitched to expand the FOV of a single camera, so that the images acquired by adjacent cameras overlap each other, preparing for the subsequent concatenation. Stability is vitally important for coal mine video image stitching. A failure of stitching may be attributed to the following issues: (1) the image contains a large amount of clastic coal, which leads to fake point features and chaos; (2) the mining face is in operation, and the movement of personnel and equipment can cause point-to-point matching errors. Although stitching methods based on point feature matching are relatively mature, stitching coal mine video images, affected by the above two issues, requires a more stable method. Methods of this sort may be inferior to point feature matching methods in terms of visual effect, but must be more stable. Therefore, we proposed a method combining point features and template recognition (Fig. 2).

Stitching based on SURF matching
The speeded up robust feature (SURF) (Bay et al., 2008) is used for point features in this article. In SURF, the detection process is based on the determinant estimation of the Hessian matrix at different scales, where the scale space is generated by Gaussian filtering. For a window template of size (2k + 1) × (2k + 1), the Gaussian kernel and the pixel value after Gaussian filtering are

G(u, v, σ) = \frac{1}{2πσ^2} \exp\left(-\frac{u^2 + v^2}{2σ^2}\right)    (6)

L(x, y, σ) = \sum_{u=-k}^{k} \sum_{v=-k}^{k} G(u, v, σ) \, I(x + u, y + v)    (7)

In Equations (6) and (7), σ represents the scale parameter of Gaussian filtering. A higher σ results in a more severely blurred image. The Hessian matrix is calculated as:

H(x, y) = \begin{bmatrix} \frac{∂^2 I}{∂x^2} & \frac{∂^2 I}{∂x∂y} \\ \frac{∂^2 I}{∂x∂y} & \frac{∂^2 I}{∂y^2} \end{bmatrix}    (8)

In Equation (8), x and y are the pixel coordinates, and I represents the corresponding pixel value. The second derivatives in Equation (8) are approximated by second-order differences:

\frac{∂^2 I}{∂x^2} ≈ I(x + 1, y) - 2I(x, y) + I(x - 1, y)    (9)

\frac{∂^2 I}{∂y^2} ≈ I(x, y + 1) - 2I(x, y) + I(x, y - 1)    (10)

\frac{∂^2 I}{∂x∂y} ≈ \frac{1}{4}\left[I(x + 1, y + 1) + I(x - 1, y - 1) - I(x + 1, y - 1) - I(x - 1, y + 1)\right]    (11)

The Gaussian filtering and Hessian matrix are jointly represented as:

H(x, y, σ) = \begin{bmatrix} L_{xx}(x, y, σ) & L_{xy}(x, y, σ) \\ L_{xy}(x, y, σ) & L_{yy}(x, y, σ) \end{bmatrix}    (12)

In Equation (12), L_{xx}, L_{xy}, and L_{yy} respectively represent the second derivatives of the image after Gaussian filtering at coordinates (x, y), and their approximations are the same as those in Equations (9), (10), and (11). When computing Hessian matrices at different σ, a simplified box filter is used, which reduces the computing complexity significantly. Different scales are represented by different sizes of the filter window. The minimum window size of the filter is 9, which is equivalent to a Gaussian filter parameter σ of 1.2. The selection of candidate point features is completed by comparing the Hessian determinants: points whose estimation is greater (or lower) than that of all their neighbors are selected. More concretely, a selected feature P has the extreme value among its 26 neighbors (8 of them at the same scale as P, and 18 at the adjacent scales). The estimation of the determinant is:

\det(H_{approx}) = D_{xx} D_{yy} - (0.9 D_{xy})^2    (13)

where D_{xx}, D_{yy}, and D_{xy} are the box-filter approximations of L_{xx}, L_{yy}, and L_{xy}. After the candidate point features are initially located, filtering to remove false positives is conducted to obtain the final point features.
Direction invariance is one of the merits of SURF. To enable this trait, the main direction of the feature is calculated from the Haar wavelet responses in the neighborhood of the point feature. In a 60° sector of radius 6s (s being the scale) centered at the point feature, the sums of the horizontal and vertical Haar wavelet responses of all points are computed. As the sector rotates in steps of 0.2 radians, the value for each direction is obtained. Finally, the direction with the highest value is selected as the main direction of the point feature.
Descriptors are essential for point features. To calculate the SURF descriptor, first, a square area of side length 20s around the point feature is chosen and equally divided into 4 × 4 sub-areas. Note that the orientation of the square area is consistent with the main direction. Second, the Haar wavelet responses in each sub-region are accumulated. The feature vector of a single sub-region consists of the sum of the horizontal responses, the sum of the vertical responses, the sum of the absolute values of the horizontal responses, and the sum of the absolute values of the vertical responses. Since there are 16 sub-regions in total, the descriptor of a SURF point feature is a 64-dimensional vector.
After obtaining the coordinates and the descriptors of all point features in the two images, we can find the optimal matching feature pairs by calculating the distance between descriptors. A correct feature pair should conform to the homography model:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \sim \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}    (14)

In Equation (14), (x, y) and (x', y') are the coordinates of a matched feature pair in the two images, and h_{11}, …, h_{33} are the parameters of the homography model (with h_{33} normalized to 1). Expanding Equation (14) gives:

x' = \frac{h_{11} x + h_{12} y + h_{13}}{h_{31} x + h_{32} y + h_{33}}    (15)

y' = \frac{h_{21} x + h_{22} y + h_{23}}{h_{31} x + h_{32} y + h_{33}}    (16)

To estimate the parameters of the homography model, at least four pairs of point features are needed. In practice, hundreds of matched pairs are obtained. The random sample consensus (RANSAC) algorithm is commonly used to exclude erroneous pairs and find an optimal homography model. After the homography model is obtained, the homography transformation is applied between images A and B, and then A is stitched into B to obtain a primary result. Generally, there exists slight misalignment at the junction of the two images. To improve the stitched result, we performed post-processing on the primary result for seamlessness (Fig. 2). The pixels around the seams are reassigned by a weighted average. As the coordinates move from image A to image B, the weight of image A changes linearly from 1 to 0, while the weight of image B changes from 0 to 1. The pixel values in the result image thus transition smoothly from image A to image B, without unnatural seams at the junction.
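The homography estimation and the linear seam weighting described above can be sketched as follows. This is a simplified direct-linear-transform fit without the RANSAC outlier rejection that would wrap it in practice; all names and the synthetic correspondences are illustrative:

```python
import numpy as np

def estimate_homography(src, dst):
    """Fit H (with h33 = 1) so that dst ~ H @ src in homogeneous
    coordinates. Needs at least four point pairs; in practice RANSAC
    would call this repeatedly on random subsets to reject mismatches."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)[0]
    return np.append(h, 1.0).reshape(3, 3)

def feather_weights(width):
    """Linear blend weights across an overlap of `width` pixels:
    image A fades 1 -> 0 while image B fades 0 -> 1."""
    w_a = np.linspace(1.0, 0.0, width)
    return w_a, 1.0 - w_a

# Usage: recover a known translation of (+10, +5) from four corners.
src = [(0, 0), (100, 0), (100, 100), (0, 100)]
dst = [(x + 10, y + 5) for x, y in src]
H = estimate_homography(src, dst)
```

The feather weights sum to 1 at every overlap pixel, which is exactly the smooth A-to-B transition described above.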

Stitching based on template recognition
Compared with the method based on SURF matching, the template recognition method requires some conspicuous template images in the scene. In practical terms, we attached some reflective strips on cable railings on the mining face. Although it is less flexible, this method can improve the stability by increasing the number of templates in the scene. As long as the relative positions and orientations of the templates in the image do not change significantly, the templates can be recognized in a changeable environment. In this case, the template recognition method is superior to the SURF matching method.
The template recognition method is realized by finding the template image T in the source image I. Suppose the sizes of I and T are W × H and w × h, respectively. For each feasible pixel coordinate (x, y) in I, we calculate the indicator function:

R(x, y) = \frac{\sum_{u,v} T'(u, v) \, I'(x + u, y + v)}{\sqrt{\sum_{u,v} T'(u, v)^2 \sum_{u,v} I'(x + u, y + v)^2}}    (17)

In Equation (17), T'(u, v) = T(u, v) - \bar{T} and I'(x + u, y + v) = I(x + u, y + v) - \bar{I}_{x,y} are the mean-subtracted template and image patch, with u = 0, 1, …, w - 1, v = 0, 1, …, h - 1, x = 0, 1, …, W - w, and y = 0, 1, …, H - h. The R function is the normalized correlation coefficient between T and the corresponding patch of I. We find the position where R reaches its maximum, which is deemed the position of T in I. After the positions of all templates in the two original images A and B are obtained, the matched pairs are established, and the homography model can be estimated by Equations (15) and (16). The purposes of template recognition and SURF matching are the same: both aim to estimate the homography model between two images. Afterwards, stitching and post-processing are performed to obtain the final output panorama (as shown in Fig. 2).
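Equation (17) amounts to a sliding normalized cross-correlation. The following is a brute-force numpy sketch for illustration only; production code would use an optimized routine such as OpenCV's matchTemplate, and the synthetic "reflective strip" here is an assumption:

```python
import numpy as np

def match_template(image, template):
    """Normalized correlation coefficient (Eq. (17)): slide the template
    over the image and return the position (x, y) with maximal R."""
    H, W = image.shape
    h, w = template.shape
    t = template - template.mean()
    best, best_pos = -2.0, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            patch = image[y:y + h, x:x + w]
            p = patch - patch.mean()
            denom = np.sqrt((t ** 2).sum() * (p ** 2).sum())
            if denom > 0:
                r = (t * p).sum() / denom
                if r > best:
                    best, best_pos = r, (x, y)
    return best_pos, best

# Usage: find a strip-like patch cut from a synthetic frame.
rng = np.random.default_rng(1)
frame = rng.random((40, 60))
strip = frame[12:20, 25:37]            # the "template", taken at (x=25, y=12)
(px, py), score = match_template(frame, strip)
```

Because R is normalized, the maximum is 1 when the template matches a patch exactly, and the score is robust to uniform brightness offsets between the images.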

Vertical correction of oblique panorama
When the cameras are turned to three different directions to shoot, so as to eliminate missing scenes, the viewing angles of the cameras are tilted rather than vertically downward. As a result, a so-called perspective effect deforms the images, in which closer objects appear larger and farther objects appear smaller. In this case, although the images shot by adjacent cameras are contiguous in 3D space (Fig. 1(b)), the same objects appear to be of different sizes in the two images. Therefore, it is necessary to eliminate the perspective effect in the images as much as possible, which can be done by correcting the oblique images to vertical images.
We model the camera as a pinhole, with the oblique image drawn as the orange solid line in Fig. 3 and the corrected vertical image as the blue solid line. After the focal length of the camera is estimated, the corresponding point of each pixel of the oblique image on the vertical image can be obtained. In Fig. 3, S is the optical center of the camera, C₁ is the origin of the image plane coordinates of the vertical image, C₂ is the origin of the image plane coordinates of the tilted image, C₂′ is the corresponding point of C₂ on the vertical image plane, and P is a pixel on the tilted image. Suppose the image space coordinates of P in the oblique image are [x, y, -f]ᵀ; then P is projected onto the vertical image at P′ as follows:

x' = \frac{f x}{f \cos α - y \sin α}    (18)

y' = \frac{f (y \cos α + f \sin α)}{f \cos α - y \sin α}    (19)

In Equations (18) and (19), f represents the focal length of the camera, and α is the angle between the oblique image plane and the horizontal plane (Fig. 3). Through the above transformation, we can reduce the perspective effect, so that adjacent stitched images keep objects at the same sizes in contiguous areas.
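Because the mapping of Equations (18) and (19) is projectively linear in (x, y), the vertical correction can be applied as a single homography. The sketch below assumes the sign conventions stated above (rotation of the image plane about the horizontal axis by the tilt angle α); function names are illustrative:

```python
import numpy as np

def vertical_correction_homography(f, alpha):
    """Homography implementing x' = f*x / (f*cos(a) - y*sin(a)) and
    y' = f*(y*cos(a) + f*sin(a)) / (f*cos(a) - y*sin(a)), i.e.
    re-projecting an image tilted by `alpha` onto a vertical view."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([
        [f,   0.0,   0.0],
        [0.0, f * c, f * f * s],
        [0.0, -s,    f * c],
    ])

def apply_h(H, x, y):
    """Map one pixel through a homography (homogeneous normalization)."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# Usage: with zero tilt the mapping is the identity; with tilt, the
# image center (0, 0) shifts to y' = f * tan(alpha) on the vertical plane.
H0 = vertical_correction_homography(f=1000.0, alpha=0.0)
H = vertical_correction_homography(f=1000.0, alpha=np.deg2rad(20))
xp, yp = apply_h(H, 0.0, 0.0)
```

In practice this 3×3 matrix can be handed to any perspective-warp routine (e.g. OpenCV's warpPerspective) to rectify the whole panorama at once.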

Experiments and results
Practical experiments were carried out on the No. 3301 fully mechanized face of Guotun Coal Mine in Shandong, China. Twenty-three high-resolution cameras were deployed at the fully mechanized mining face to obtain all the images and complete the program testing. In the absence of effective marking points in the underground mining face, we added red and white reflective strips on the cable railings behind the conveyor as artificial signs to enable the method of combining SURF matching and template recognition.

Framework of the system
To enhance the practicality of our research, we developed a prototype system operating on the No. 3301 mining face. Considering that multiple cameras work simultaneously, a configurable distributed client program was designed to control the cameras: changing their shooting angles, stitching the images, and reducing the perspective effect. In this system, multiple client programs send their corrected panoramas and the necessary information to a message queue managed by RabbitMQ. Subsequently, the stitching program takes multiple images from the same message queue, arranges them in order using the camera ID as the keyword, and uses the relevant information to process the adjacent panoramic images. The stitching program sends the final results to another message queue linked to a web server for display, which presents the large panoramic images in the browser (Fig. 4 shows the three parts of the system framework). RabbitMQ enables non-blocking communication between the multiple camera programs and the stitching program, while ensuring a complete and real-time image sequence. Users do not need to go through fussy installation or configuration procedures; they simply connect to the server, and a complete panorama is presented in the web browser. Fig. 5 shows that as the wide dynamic intensity increases, the histogram of the image gradually becomes more uniform, and the overall visual effect of the image is gradually improved. The information entropy is used to measure the information contained in the image (Equation (20)).
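The hand-off between the client programs and the stitching program can be sketched as a byte-level message carrying the camera ID (the sort keyword) and a timestamp alongside the frame. The message layout below is an illustrative assumption, not the authors' actual wire format; a RabbitMQ client such as pika would publish and consume these payloads:

```python
import struct
import numpy as np

# Header: camera id, timestamp, frame height, frame width (network order).
HEADER = struct.Struct("!IdII")

def pack_frame(cam_id, timestamp, frame):
    """Serialize a grayscale frame plus the metadata the stitching
    program needs to order and align panoramas (camera ID as sort key)."""
    h, w = frame.shape
    return HEADER.pack(cam_id, timestamp, h, w) + frame.astype(np.uint8).tobytes()

def unpack_frame(payload):
    """Inverse of pack_frame: recover metadata and the frame array."""
    cam_id, ts, h, w = HEADER.unpack_from(payload)
    frame = np.frombuffer(payload, dtype=np.uint8, offset=HEADER.size)
    return cam_id, ts, frame.reshape(h, w)

# Usage: these bytes would travel through the RabbitMQ queue (e.g. via
# pika's basic_publish on the client and basic_consume in the stitcher).
img = np.arange(12, dtype=np.uint8).reshape(3, 4)
msg = pack_frame(7, 1690000000.5, img)
cam, ts, out = unpack_frame(msg)
```

Embedding the camera ID and timestamp in every message is what lets the stitcher reorder frames that arrive out of sequence from the non-blocking queue.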

Results of image enhancement
The information entropy is defined as:

E = -\sum_{i=0}^{L-1} p_i \log_2 p_i    (20)

In Equation (20), L represents the number of gray levels of the image, which is 256 in general, and p_i represents the frequency of the i-th gray level in the image. Using this equation, we calculated the image entropy of Figs. 6(a), (b), and (c) in each RGB channel. Table 1 lists the results. As the WDR intensity increases, the entropy of the image gradually increases, consistent with the results shown in Fig. 5. The images taken by different cameras may differ considerably in visual effect and histogram distribution due to different camera parameter configurations. We adopted histogram matching to obtain images that appear as similar as possible. Comparing Figs. 6(a) and (c), we find that the visual effect of the latter is similar to that of Fig. 6(b), with better illumination. Comparing the histograms of the three images, we find that the histogram matching method does play a role: the histogram distribution in Fig. 6(c) is more similar to that of Fig. 6(b) than to that of Fig. 6(a). Through histogram matching, the overall image is more coordinated in the subsequent stitching procedure.
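Equation (20) can be computed directly from the gray-level histogram; a minimal numpy sketch (the synthetic images are illustrative):

```python
import numpy as np

def image_entropy(img, levels=256):
    """Shannon entropy of the gray-level histogram (Eq. (20));
    higher entropy means the image carries more information."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / img.size
    p = p[p > 0]                 # skip empty bins (0 * log 0 counts as 0)
    return float(-(p * np.log2(p)).sum())

# Usage: a flat image has zero entropy, while an image whose histogram
# is uniform approaches the 8-bit maximum of log2(256) = 8 bits.
flat = np.full((64, 64), 128, dtype=np.uint8)
varied = np.arange(256, dtype=np.uint8).repeat(16).reshape(64, 64)
```

This is why a rising entropy across the WDR intensities in Table 1 indicates that the enhancement is recovering detail rather than merely brightening the image.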

Results of stitching images taken from a single camera
The resolution of all the cameras was set to 1920×1080, and the camera control interfaces were called remotely so that we could obtain three images continuously with overlapping areas. After stitching the three images with the combined SURF matching and template recognition method, a single panoramic image was obtained. Fig. 7(c) shows the well-matched feature pairs. In Fig. 7(c), the connection lines between the matched pairs are oriented in the same direction, which means that the error of the estimated homography model is relatively small, so the subsequent stitching can be more accurate. In Fig. 8(a), there are conspicuous seams. After the seam-removal procedure, the unnatural seams are hardly visible, realizing an optimization of the result image (shown in Fig. 8(b)).
The SURF matching method could fail because of the chaos and some unpredictable changes in the scene, such as during the movement of personnel or coal equipment. In this case, the template recognition method can be an alternative to accomplish the stitching procedure. In Figs. 9(c) & (d), the locations of the template images are marked with cyan rectangles. In most cases where the SURF matching fails, the alternative method can accurately locate the templates, so as to perform the subsequent stitching procedure. Fig. 9(e) shows the result of stitching the Figs. 9(c) & (d) via the template recognition method.

Results of vertical correction for oblique panoramic images
The panoramic image stitched from the three images is deformed by the perspective effect. The distance between the cable trough and the hydraulic support bases decreases gradually along the extension direction, forming a trapezoidal shape (Figs. 10(g) & (h)). With increasing distance, the widths of the hydraulic support bases gradually decrease in the uncorrected images (shown in Figs. 11(a) & (c)), so the images cannot be directly concatenated. After the correction, the result images are displayed as if they were taken from a vertical angle (Figs. 11(b) & (d)). To concatenate the two contiguous images, we take one of the images as the reference and stretch the other image to keep the widths of the hydraulic support bases the same. Finally, the two images are concatenated (Fig. 12). Figure 12 shows the final result of stitching the two panoramic images derived from two adjacent cameras after the series of operations. Two distributed client programs were used to create this panoramic image, which took around 3 to 4 seconds. The output image contains at least six sets of hydraulic supports, nearly six times the coverage of the image captured by a single camera, in which we see only one set of supports. The prototype system used distributed computing and message queue technologies. In the practical mining environment, the low-speed movement of the shearer is reflected in the result images. The output panoramic images are updated continuously, thus providing surveillants with a real-time, large-scale, and intuitive understanding of the mining environment.

Conclusions
Aiming to improve the usability of the video surveillance systems deployed at fully mechanized mining faces, this paper proposed methods for solving salient problems in the actual mining environment: (1) To eliminate the conspicuous differences between images due to non-uniform illumination, WDR technology was applied to solve both the overexposure and underexposure problems. The increase in information entropy proved our method effective. The differences between the images shot by different cameras were decreased by the histogram matching method.
(2) To complete the missing scenes of the fully mechanized mining face due to the limited FOV of the cameras, a stable stitching method was realized by combining SURF matching and template recognition. The range of the resulting panoramic image was larger than that of any image shot from a single angle of a single camera.
(3) To solve the image obliquity problem after stitching, we used the focal length and the pose of the cameras to eliminate the perspective effect of the single panoramic images. Consequently, the single-camera panoramas from adjacent cameras keep object sizes consistent. The corrected images are then concatenated into a non-perspective image with a wider range. The final panoramic images provide surveillants with an intuitive and dynamic view of the mining situation.
(4) To prove the effectiveness of the proposed methods, a prototype system was developed, which adopted a distributed configuration to achieve simultaneous correction and stitching of the images. Practical experiments showed that the prototype system was highly robust and usable in actual mining situations.