A marker-less monocular vision point positioning method for industrial manual operation environments

Vision-assisted technologies for industrial manual operation, such as augmented reality (AR), are increasingly popular. They require high positioning accuracy and robustness to operate properly. However, narrow spaces, moving hands, or tools may occlude or obscure local visual features of the operation environment and thus degrade the accuracy and robustness with which the operating position is located. The resulting misguidance may even cause operator errors. This paper proposes a marker-less monocular vision point positioning method for vision-assisted manual operation in industrial environments. The proposed method can accurately and robustly locate the target point of an operation by solving a constrained minimization problem, even when the target area has no corresponding visual features owing to occlusion or improper illumination. The method has three phases: intersection generation, intersection optimization, and target point solving. In the intersection generation phase, a number of intersections of epipolar lines are generated as candidate target points using fundamental matrices; here, the solving constraint is converted from point-to-line to point-to-points. In the intersection optimization phase, the intersections are optimized into two different sets through iterative linear fitting and geometric mean absolute error methods; here, the constraint is further converted from point-to-points to point-to-point sets. In the target point solving phase, the target point is obtained by solving a constrained minimization problem based on the distribution constraint of the two intersection sets; here, the constraint is finally converted from point-to-point sets to point-to-point, and the unique optimal solution is taken as the target point. Experimental results show that this method achieves better accuracy and robustness than the traditional homography matrix method in practical industrial operation scenes.


Introduction
With the rapid development of visual assistance technologies such as augmented reality (AR), vision-assisted industrial manual operation has been recognized as a promising technology in industrial settings such as the assembly and maintenance phases of the automotive, aerospace, and military equipment domains [1][2][3][4]. At present, marker-less monocular vision-assisted systems based on the natural features of operation environments have become mainstream owing to their practicability and low cost. They require minimum or zero setup effort from end users and can be adapted to unprepared environments [5][6][7]. However, some challenges remain to be overcome for more effective application [4]. In the case of occlusion and improper illumination, how to accurately and robustly locate the operation position from marker-less monocular images is a critical problem that has not been solved effectively.
One major task of industrial manual operations is to identify and locate the operation locations, which are usually specific spatial points, and then carry out handling, aligning, joining, disjoining, adjusting, or other actions based on these locations [7][8][9]. The operator needs accurate visual guiding information based on these locations to complete the operation correctly, especially for small objects such as the holes of an engine crankcase [10][11][12][13]. Such manual operations with high precision and robustness requirements mostly occur in industrial environments, where the narrow operation space and moving hands or handheld tools may cause full occlusion or improper illumination around the target point. Consequently, the accuracy of monocular positioning methods based on visual features may be affected, and wrong locations may even be determined.
The operator must complete the work with bare hands or handheld tools because most industrial operations are carried out on-site and depend heavily on manual maneuvers [14]. Sometimes the operator also performs tasks in unprepared environments with uncertain illumination. The operator's hands or handheld tools may fully or partially occlude the target area, and strong light may also obscure the target [5,6,15-17]. In this state, the target area loses the recognizable features used to retrieve or match the target point [18], and this featureless state persists for some time as the operator acts. Furthermore, the camera pose may also change during the operation. Accordingly, the target point cannot be automatically identified and located because there are no recognizable features in an unsteady image, and the corresponding visual guiding information cannot be registered in the vision scene based on the target location [19].
Much research has addressed object identification and tracking under occlusion or changing illumination [20,21], but little has addressed accurate and robust point positioning under full and durative occlusion or obscuring with marker-less monocular cameras in practical industrial environments. This may be partly because the common approach, feature matching, is ineffective in this case. In this situation, the homography matrix (HM) method can be used to approximately estimate the location of the target point, but it may introduce non-negligible registration errors in complex environments, because the HM method is better suited to planar environments whereas most industrial environments are spatial [22]. Nevertheless, it is still possible to solve this problem with epipolar geometry, which actually captures some information about the scene structure [23]. However, the epipolar constraint of multiple view geometry cannot handle this situation by itself, because it is a point-to-line rather than a point-to-point constraint [24,25]. If the point-to-line constraint can be converted to a point-to-point constraint, the problem of accurate and robust positioning may become solvable. Therefore, we propose a novel method that positions the target point accurately and robustly by solving a constrained minimization problem with a unique solution.
In this paper, we address the challenge of accurate and robust marker-less monocular vision positioning in vision-assisted industrial operation scenes without target features. The main contributions of our work are as follows: (1) a marker-less monocular vision point positioning method that can accurately and robustly locate the target point without surrounding visual features and can be applied in practical industrial environments, and (2) a novel constrained minimization solving algorithm for point correspondence in industrial computer vision that converts the constraint from point-to-line to point-to-point and obtains a unique solution. The experimental results show that this method achieves better accuracy and robustness than the approximate estimation method based on the HM and, when integrated into a visual guidance process such as AR guidance, meets the requirements of visual guidance in practical industrial environments.
This paper is organized as follows. Section 2 reviews previous work in this field. Section 3 details the proposed method. Section 4 describes the experimental settings and results, assesses the performance, and discusses the results and limitations of the proposed method. Finally, conclusions and future work are presented in Sect. 5.

Point positioning in industrial operation environments
Many industrial operation tasks must be carried out at a specific and precise location, such as connector wiring and micro drill clamping [2, 7-9, 26, 27]. However, hands and handheld tools often fully or partially occlude the target location because most stages of operation activities are performed manually by operators [2,15,28]. Illumination changes caused by narrow spaces or reflections on metal surfaces may also obscure the target area. Accordingly, visual guiding information such as symbols and cues cannot be registered and displayed at the correct position, because the target point may be positioned with non-negligible errors [11,28,29]. Incorrect position guidance may lead to serious operation quality problems, and the benefits of vision-assisted systems cannot be fully realized [27]. Therefore, how to accurately and robustly locate the target point when the target area has no valid features is a critical issue for vision-assisted systems in industrial operations such as AR applications.
Losing target features is one of the most common problems in computer vision positioning. It refers to the state in which the target is partially or fully hidden or obscured by other objects or by light in the scene, which severely affects target detection [19,20]. Locating or tracking the target when its appearance or key attributes are unavailable to the camera, while the target is still present in the scene, is both a problem and a challenge [19,30]. Researchers have tried various methods to solve this problem in industrial computer vision scenes.
Zubizarreta et al. [15] introduced a matching scheme and template-based optimization using corresponding conics between model surface circles and image ellipses for 3D non-Lambertian object recognition in arbitrary environments. This method can detect the geometric features visible for the current pose of the target object by applying occlusion queries based on the z-buffer, but it requires offline training with CAD models and fails under heavy occlusion. Gao et al. [31] used a locally supported Gaussian weight function together with bilateral filtering and outlier removal to handle cluttered scenes with partial occlusion for robust object recognition and matching. This method can identify objects in cluttered scenes with partial occlusion and build virtual-real object registrations via point cloud fusion. Wang et al. [16] used a tracking algorithm combining visual feature matching and point cloud alignment to achieve good tracking performance under partial occlusion; a reference point cloud model generated from a 3D model in a computer-aided design system is required to establish the benchmark coordinate system. Wang et al. [32] adapted the LINE-MOD method into a scale-invariant descriptor using depth information to handle occluding boundaries of the target object. This method needs offline training with CAD models and is fast and robust under partial occlusion, but not full occlusion, because the adopted template method fails in the latter case. Huang et al. [33] used a monocular real-time robust feature tracking algorithm (MRRFT) to track a deep space target even when it is partially occluded by a strong light spot. This method needs a point set rather than a single point to track a target object, and occluded points are not tracked. Wang et al. [34] used a new image feature named the chain-of-lines feature (COLF), constructed from several directed line segments, to register 3D objects in the camera view. This method can establish multiple correspondences between the 2D scene image and the 3D model simultaneously, even in a complex assembly environment with partial occlusion, but it also needs offline training with CAD models and is only suitable for partial rather than full occlusion.
It can be seen that the above methods are valuable for tracking a target object rather than positioning a target point. How to locate a target point in demanding operation environments still needs to be studied. For point positioning, Lima et al. [6] adopted a model-based tracking method to correctly identify points in tasks that involved tracking a rotating vehicle. However, the user needs to select each corresponding point from one valid key frame in the same sequence as the reference 2D points in order to correctly match these 3D points. In general, these methods are effective when the targets are partially occluded and are better suited to prepared environments, because most of them need offline training based on CAD models and a point cloud obtained by depth camera scanning. In contrast, marker-less monocular tracking is adopted extensively in industrial practice because it requires minimum or zero setup effort from end users and can be adapted to unprepared environments [5,7]. Therefore, how to precisely locate a target point, rather than an object, when the target area is fully occluded or obscured in an unprepared industrial operation environment still needs to be investigated. In other words, a method that can accurately find a single point correspondence between the template image and the target image without valid features in the target area requires further research.

Point correspondence under epipolar constraint
Positioning a target point in the target image based on a reference point in the template image is essentially a point correspondence problem. Point correspondence can be considered part of the point set registration or image registration process in computer vision [35,36]. Point set registration aims to find correspondences between points and to estimate the transformation between two or more point sets [37]. Image registration, also known as image fusion, matching, or warping, is the process of overlaying two or more images of the same area by geometrically aligning common features (or control points) identified in the images [38]. In practice, both point set registration and image registration suffer from many challenges, including occlusion and obscuring [35,39]. The target point cannot be matched well under occlusion or obscuring with common registration methods, such as feature matching-based methods and the iterative closest point algorithm, because the corresponding features of the target point are insufficient or missing. Therefore, the location of the target point can only be estimated and searched according to the reference point in the template image together with the constraints of multiple view geometry.
The epipolar constraint is the only geometric constraint between two uncalibrated images of the same scene observed from two different camera viewpoints [40]. It defines a mapping between a point in the template image and an epipolar line in the target image. For each reference point in the template image, the search for the corresponding target point in the target image can be limited to an epipolar line, instead of naively searching the whole target image [25]. However, the specific target point cannot be located without effective auxiliary information such as depth, because the epipolar constraint is not a one-to-one constraint [25]. In practice, such auxiliary information is unavailable for uncalibrated images from unprepared environments, especially demanding industrial ones. Under these conditions, the homography matrix can be used to approximately estimate the target point through a perspective transform based on the homography relation [24,41].
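For concreteness, the sketch below shows how such an approximate HM estimate is typically computed with OpenCV (the platform used later in this paper): a homography is fitted to the matched points with RANSAC, and the reference point is mapped by a perspective transform. The function name and the 3.0-pixel reprojection threshold are illustrative assumptions, not values taken from the cited works.

```cpp
// Homography-matrix (HM) baseline: approximate the target point by
// fitting H to the matched points and perspective-transforming the
// reference point. Accurate only for planar scenes or pure rotation.
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

cv::Point2f estimateByHomography(const std::vector<cv::Point2f>& templatePts,
                                 const std::vector<cv::Point2f>& targetPts,
                                 const cv::Point2f& referencePoint) {
    // RANSAC suppresses mismatched correspondences; 3.0 px is a common
    // reprojection threshold (an assumption, not a prescribed value).
    cv::Mat H = cv::findHomography(templatePts, targetPts, cv::RANSAC, 3.0);
    std::vector<cv::Point2f> src{referencePoint}, dst;
    cv::perspectiveTransform(src, dst, H);  // map the reference point
    return dst[0];  // approximate target point location
}
```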
This mapping of pixel points, however, is accurate only in the case of pure camera rotation or a planar scene [42]. Demanding industrial environments are mostly spatial rather than planar, and the camera undergoes translation as well as rotation, which can severely degrade positioning accuracy based on the homography matrix. A large positioning error can lead to incorrect guiding information, especially with multiple small targets, which may result in a wrong action. To locate the target point, Wang et al. proposed a dual-correlation method that determines the relationship between matching points in different images through a pair of different correlation transformations. This method uses two fundamental matrices (FMs) to generate a pair of epipolar lines and takes the intersection of the two lines as the target point [43]. Although this method can determine a single point correspondence between two uncalibrated images with fundamental matrices, it does not account for the uncertainty of the intersection, which may lie far from the target point because of the uncertainty of the FM.
In computer vision, the FM is the algebraic representation of the epipolar geometry that relates two images of a scene observed from two different viewpoints and describes their geometric relation [44,45]. If the internal (intrinsic) parameters of the camera are unknown, pixel image coordinates are used, and the matrix is known as the fundamental matrix [25]. In practice, the accuracy of the FM is affected by the matching accuracy of the control points and their depth variation [23,44-46]. Although errors always exist, the target point is still distributed in the narrowest region of the epipolar envelope with some fixed probability [23,47,48]. Stojanovic et al. [47] used Monte Carlo simulation to obtain a set of different FMs and computed an envelope of epipolar lines; the potential target point has maximal probability in this area. This method needs to select eight specific correspondences for computing the different FMs, but specific correspondences cannot reduce the tendency toward location error caused by the uncertainty of the FM. Moreover, the target point still cannot be uniquely obtained, because this is a point-to-area rather than a point-to-point constraint.
In general, existing methods have shown that the FM depends on the scene structure and have pointed out the most probable distribution area of the target point. Our investigation also shows that there is still no better method than the approximate estimation based on the homography matrix for uniquely positioning the target point in the case of uncalibrated images, spatial scenes, complete and durative target occlusion or obscuring, and monocular marker-less vision. However, the search space of the target point can be reduced from a line to a narrow region by exploiting the uncertainty of the FM [23,43,47]. If the geometric constraint can be further reduced to a point-to-point constraint, it becomes possible to accurately and uniquely position the most likely target point.

Summary
Most related research on practical industrial environments focuses on positioning an object under partial occlusion rather than a point under complete occlusion or obscuring, and has achieved good results. Generally, these methods need offline training based on 3D models in advance and depth information of points at runtime. They are therefore unsuitable for target point positioning under unprepared environments, marker-less monocular vision, uncalibrated cameras, and full occlusion or obscuring. If the target point positioning problem is abstracted as a geometric constraint problem in computer vision, the epipolar constraint based on the FM is the main relevant theoretical basis. At present, the literature advancing the understanding of epipolar geometry mainly concerns the registration of point sets or images rather than a single point correspondence. How to uniquely and accurately position the target point in demanding industrial environments thus remains an open problem. These works nevertheless suggest that the uncertainty of the fundamental matrix can help determine the distribution and probability of the target point. Hence, we propose a novel method that can uniquely and accurately position the target point without its feature information, with better accuracy than the approximate estimation method based on the homography matrix.

Overview
In this paper, we present a marker-less monocular vision point positioning method with higher accuracy and robustness that directly positions the target point based on the location of a designated reference point. The method is suitable for vision-assisted industrial manual operation scenes involving spatial environments, full and durative target occlusion or obscuring, and uncalibrated monocular cameras. The general positioning process can be summarized in the following phases: image preprocessing, point feature extraction, point feature matching, and visual content registration [35,38]. If the target area is completely occluded or obscured for some time while the camera position changes with the operator's actions, the point feature extraction and matching steps based on natural features will not work properly. Our method finds the position of the target point by replacing the point feature extraction and matching phases of existing methods. The process is presented in Fig. 1.
The core idea of this method is to convert the point-to-line constraint with uncertainty into a point-to-point constraint with a unique solution. In other words, the method reduces the number of solutions to the positioning problem from many uncertain solutions to one constrained minimization solution, so that the target point can be located uniquely and accurately. The workflow is divided into five phases: image preprocessing, intersection generation, intersection optimization, target point solving, and visual content registration.
Image preprocessing is the first phase. It accepts the input images, detects and matches their feature points, and optimizes the matched feature points. If the two images are matched, the optimized matched feature points are passed to the next phase. Feature points in the target area will not be matched, even if they are detected, because of the occlusion or obscuring of that area.
The second phase, intersection generation, uses the matched feature points to generate a number of epipolar line pairs by adding Gaussian noise to the matched feature points, and then calculates the intersections of the epipolar line pairs as candidate solutions for the target point.
The third phase, intersection optimization, derives two optimized intersection sets using the iterative linear fitting (ILF) and geometric mean absolute error (GMAE) methods, respectively. These two sets have a specific constraint relation and are used to construct a constrained minimization problem and solve for the optimal solution.
The fourth phase, target point solving, transforms the constraint relation between the function representations of the two sets into a constrained minimization problem, then constructs the objective function and constraint function and solves for the optimal solution using the Lagrange multiplier method. The optimal solution is the desired target point.
The last phase, visual content registration, transforms the coordinates of the target point into screen coordinates and superimposes the visual contents based on them. Finally, the visual scene with the correct guiding contents is shown to the operator, and the point positioning process ends.
The first and last phases are essential procedures and can be implemented with existing general technologies; they provide the inputs and display the results. Our method therefore focuses on the three intermediate phases.
The problem solved by our method is stated as follows: given two sets of matched feature points $PS_R$ and $PS_T$ from the template image and the target image respectively, find the location of the target point $P_T(x_T, y_T)$ in the target image based on the location of the reference point $P_R(x_R, y_R)$ in the template image. The two images are taken from different angles and positions. The poses and intrinsics of the monocular camera are uncalibrated. No corresponding feature points can be matched around the target point, and there are no markers in the two images. Our method uses the three phases above to determine the pixel coordinates of the target point under these limited conditions.

Intersection generation
The intersection generation phase accepts the matched feature points produced by the image preprocessing phase as input. At the image preprocessing phase, SURF, SIFT, or other common methods can be adopted to detect feature points, and RANSAC or other general methods can be used to match them [35,36,49]. After image preprocessing, it should be confirmed that the template image and target image are matched, and the matched points are obtained as shown in Fig. 2. This is the premise of positioning the target point: it cannot be located from two unmatched images.
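A minimal preprocessing sketch in C++/OpenCV (the platform used in Sect. 4) is given below, assuming SIFT features and Lowe's ratio test; the paper allows SURF, SIFT, or other detectors, so this is one possible realization rather than the authors' exact pipeline.

```cpp
// Image preprocessing sketch: detect and match feature points between the
// template and target images. Occluded target-area features find no match.
#include <opencv2/features2d.hpp>
#include <opencv2/core.hpp>
#include <vector>

void matchFeatures(const cv::Mat& templateImg, const cv::Mat& targetImg,
                   std::vector<cv::Point2f>& psR,   // matched points PS_R
                   std::vector<cv::Point2f>& psT) { // matched points PS_T
    auto sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> kpR, kpT;
    cv::Mat descR, descT;
    sift->detectAndCompute(templateImg, cv::noArray(), kpR, descR);
    sift->detectAndCompute(targetImg, cv::noArray(), kpT, descT);

    // Lowe's ratio test over the 2 nearest neighbours filters ambiguous
    // correspondences before the fundamental-matrix stage.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descR, descT, knn, 2);
    for (const auto& m : knn) {
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance) {
            psR.push_back(kpR[m[0].queryIdx].pt);
            psT.push_back(kpT[m[0].trainIdx].pt);
        }
    }
}
```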
From the matched feature points, a fundamental matrix $F$ can be obtained through RANSAC or other methods [24]. The fundamental matrix is the algebraic representation of epipolar geometry and yields, for a reference point in the template image, the corresponding epipolar line in the target image. For each reference point $P$ in the template image, there exists a corresponding epipolar line $L$ in the target image, and any target point $P'$ in the target image matching $P$ must lie on $L$. But the exact location of $P'$ on the epipolar line is unknown, because the epipolar line is the projection of the ray of light $R$ from the reference point through the camera center $C_R$, rather than of a single point, as shown in Fig. 3. If this point-to-line constraint can be transformed into a point-to-point constraint, the location of the target point can be determined. A straightforward idea is to find another epipolar line in the target image for the same reference point; the intersection of the two epipolar lines should then be the target point, because it must lie on both lines simultaneously [43]. However, a single intersection is unreliable because of the uncertainty of the fundamental matrix itself [23,46]; the most likely target points lie in a hyperbolic region, as shown in Fig. 4. Therefore, a number of likely target points, rather than one point, should be obtained in order to retrieve the correct target point. Our method obtains intersections in three steps: generation of fundamental matrix pairs, generation of epipolar line pairs, and generation of intersections.
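Before detailing these steps, the basic point-to-line constraint can be made concrete with a short sketch: given the matched points, OpenCV can estimate $F$ and return the epipolar line in the target image on which the (possibly occluded) target point must lie. This only narrows the search to a line; the steps below resolve it further.

```cpp
// Epipolar line of the reference point in the target image (point-to-line
// constraint). Returns (A, B, C) with Ax + By + C = 0.
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

cv::Vec3f epipolarLineOfReference(const std::vector<cv::Point2f>& psR,
                                  const std::vector<cv::Point2f>& psT,
                                  const cv::Point2f& refPt) {
    cv::Mat F = cv::findFundamentalMat(psR, psT, cv::FM_RANSAC);
    std::vector<cv::Vec3f> lines;
    // whichImage = 1: refPt belongs to the first (template) image
    cv::computeCorrespondEpilines(std::vector<cv::Point2f>{refPt}, 1, F, lines);
    return lines[0];
}
```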
In the step of fundamental matrix pair generation, a certain number of fundamental matrix pairs are generated by adding zero-mean Gaussian noise to the matched feature points. If there are $n$ feature points in the matched sets $PS_R$ and $PS_T$ respectively, and the feature point $P_i$ in $PS_R$ and the matched feature point $P'_i$ in $PS_T$ form the correspondence $P_i \to P'_i$, a fundamental matrix $F$ can be obtained through the 8-point method [24,25]. Here, we use RANSAC with the normalized eight-point method over all correspondence points, instead of 8 specific correspondences, to obtain $F$ as shown in Eq. (1), because using all correspondence points maximizes the overall robustness of $F$:

$$P_i'^{T} F P_i = 0, \quad i = 1, \ldots, n \tag{1}$$

If zero-mean Gaussian noise with standard deviation $\sigma_x = \sigma_y$ is added to each feature point $P(x, y)$ and $P'(x', y')$ to obtain the noised feature points $NP(x + N(0, \sigma_x),\ y + N(0, \sigma_y))$ and $NP'(x' + N(0, \sigma_x),\ y' + N(0, \sigma_y))$, a noised fundamental matrix $NF$ can be obtained:

$$NP_i'^{T}\, NF\, NP_i = 0 \tag{2}$$

However, another noised fundamental matrix is needed, because two of them are required to generate two epipolar lines and take their intersection. In order to obtain the second matrix and ensure that the two epipolar lines intersect, all feature points in $PS_R$ and $PS_T$ are reversed, and a new corresponding fundamental matrix $NF'$ is obtained:

$$NP_i^{T}\, NF'\, NP_i' = 0 \tag{3}$$

Thus, a pair set of fundamental matrices $PFS$ is obtained through $m$ iterations of adding Gaussian noise:

$$PFS = \{(NF_1, NF_1'), \ldots, (NF_m, NF_m')\} \tag{4}$$

In the step of epipolar line pair generation, $m$ epipolar line pairs are obtained from the fundamental matrix pairs. A fundamental matrix is a $3 \times 3$ matrix of rank 2. An epipolar line $Ax + By + C = 0$ can be expressed as $L = [A, B, C]^T$, and the reference point $P_R(x_R, y_R)$ can be expressed in homogeneous form as $P_R(x_R, y_R, 1)$ because it is a 2D point. The epipolar line $L$ and its correspondence $L'$ in the target image are obtained through epipolar geometry:

$$L = NF \cdot P_R, \qquad L' = NF'^{T} \cdot P_R \tag{5}$$

When all pairs of fundamental matrices have been processed, the pair set of epipolar lines $PLS$ is obtained:

$$PLS = \{(L_1, L_1'), \ldots, (L_m, L_m')\} \tag{6}$$

In the step of intersection generation, the intersection $I(x, y, 1)$ of one pair of epipolar lines is easily obtained from the corresponding linear equations, i.e., as the cross product of the two lines in homogeneous coordinates:

$$I = L \times L' \tag{7}$$

Finally, the set of intersections $IS$ is obtained:

$$IS = \{I_1, \ldots, I_m\} \tag{8}$$

Figure 5 shows an instance of the noised feature points, the corresponding epipolar line pairs, and their intersections.
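The sketch below implements the intersection generation phase under the equations above: perturb the matched points with $N(0, \sigma)$ noise, estimate a fundamental matrix pair from the forward and reversed correspondences, and intersect the two epipolar lines of the reference point. The helper names are ours, and the line direction for the reversed matrix ($NF'^T P_R$) follows our reading of Eq. (5).

```cpp
// Intersection generation: m noised fundamental-matrix pairs (Eqs. 2-4),
// epipolar line pairs (Eq. 5), and their intersections (Eqs. 7-8).
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <cmath>
#include <random>
#include <vector>

static std::vector<cv::Point2f> addNoise(const std::vector<cv::Point2f>& pts,
                                         std::mt19937& rng, double sigma) {
    std::normal_distribution<double> n(0.0, sigma);
    std::vector<cv::Point2f> out;
    for (const auto& p : pts)
        out.emplace_back(p.x + (float)n(rng), p.y + (float)n(rng));
    return out;
}

std::vector<cv::Point2f> generateIntersections(
        const std::vector<cv::Point2f>& psR, const std::vector<cv::Point2f>& psT,
        const cv::Point2f& refPt, int m, double sigma = 1.0) {
    std::mt19937 rng(std::random_device{}());
    std::vector<cv::Point2f> IS;          // intersection set, Eq. (8)
    cv::Vec3d pR(refPt.x, refPt.y, 1.0);  // homogeneous reference point
    for (int i = 0; i < m; ++i) {
        auto npR = addNoise(psR, rng, sigma), npT = addNoise(psT, rng, sigma);
        cv::Mat F1 = cv::findFundamentalMat(npR, npT, cv::FM_RANSAC);
        cv::Mat F2 = cv::findFundamentalMat(npT, npR, cv::FM_RANSAC);
        if (F1.rows != 3 || F2.rows != 3) continue;  // estimation failed
        cv::Matx33d NF = F1, NFr = F2;            // NF (Eq. 2), NF' (Eq. 3)
        cv::Vec3d L = NF * pR, Lp = NFr.t() * pR; // line pair, Eq. (5)
        cv::Vec3d I = L.cross(Lp);                // homogeneous intersection
        if (std::abs(I[2]) > 1e-9)
            IS.emplace_back((float)(I[0] / I[2]), (float)(I[1] / I[2]));
    }
    return IS;
}
```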

Intersection optimization
Although, theoretically, the intersections should be distributed in an elliptical or parabolic area, in practice they are always located in a hyperbolic region named the epipolar band [23]. The target point should be located in the densest area of intersections, where the two branches of the hyperbola are closest to each other [50]. Moreover, another theoretical distribution area is a line, because the epipolar constraint is a point-to-line constraint. In other words, the target point should be located in the densest area and on the line at the same time. Therefore, the intersections are optimized from two perspectives: linear fitting and point density.
An iterative linear fitting (ILF) method considering the distance between an intersection and the fitted line is proposed to optimize the intersections from the perspective of linear fitting. The main idea of ILF is to fit a line to the intersections and then remove the outliers whose distance to the line exceeds a given threshold. The fitting and removing process is iterated, with the threshold gradually reduced, until no outliers remain once the threshold has been reduced to $D_{min}$. The purpose of the distance iteration is to fit the most likely epipolar line as closely as possible. Let $Q$ denote the total number of iterations and $\Delta D$ the distance increment per iteration. After the $i$th iteration, the optimized intersection set is

$$LIS_i = \{\, I \in LIS_{i-1} \mid d(I, l_i) \le D_{min} + (Q - i)\,\Delta D \,\} \tag{9}$$

where $l_i$ is the line fitted to $LIS_{i-1}$ and $d(I, l_i)$ is the perpendicular distance from $I$ to $l_i$. After the iteration, $LIS$, the final intersection set based on linear fitting optimization, is obtained, in which $k$ intersections are assumed:

$$LIS = LIS_Q = \{I_1, \ldots, I_k\} \tag{10}$$
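A sketch of ILF under the schedule of Eq. (9) is shown below, assuming a least-squares line fit via cv::fitLine; the shrinking threshold $D_{min} + (Q - i)\Delta D$ follows our reading of the description above, not the authors' original code.

```cpp
// Iterative linear fitting (ILF): repeatedly fit a line to the
// intersections and drop points farther away than a shrinking threshold.
#include <opencv2/imgproc.hpp>
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

std::vector<cv::Point2f> iterativeLinearFit(std::vector<cv::Point2f> pts,
                                            double dMin, double dStep, int Q) {
    for (int i = 1; i <= Q && pts.size() > 2; ++i) {
        cv::Vec4f l;  // (vx, vy, x0, y0): unit direction and a point on the line
        cv::fitLine(pts, l, cv::DIST_L2, 0, 0.01, 0.01);
        double thresh = dMin + (Q - i) * dStep;  // shrinks toward dMin
        std::vector<cv::Point2f> kept;
        for (const auto& p : pts) {
            // perpendicular distance from p to the fitted line
            double d = std::abs((p.x - l[2]) * l[1] - (p.y - l[3]) * l[0]);
            if (d <= thresh) kept.push_back(p);
        }
        if (kept.size() == pts.size()) break;    // no outliers left: converged
        pts = std::move(kept);
    }
    return pts;  // LIS: intersections consistent with the fitted epipolar line
}
```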
To obtain the optimized intersection set from the perspective of point density, the geometric mean absolute error (GMAE) method is used to retrieve the densest points based on a given confidence coefficient $\alpha$. $E$, the GMAE of all $I(x, y)$ in $IS = \{I_1, \ldots, I_m\}$, can be calculated as

$$E = \alpha \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left|x_i - \bar{x}\right| \cdot \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \bar{y}\right|} \tag{11}$$

where $\bar{I}(\bar{x}, \bar{y})$ is the mean of the intersections in $IS$. If the error between an intersection $I$ and $\bar{I}$ is less than $E$, $I$ is retained as an inlier. Finally, $DIS$, the optimized intersection set based on point density, is obtained, in which $w$ intersections are assumed:

$$DIS = \{\, I \in IS \mid \|I - \bar{I}\| < E \,\} = \{I_1, \ldots, I_w\} \tag{12}$$

Figure 6 shows the two optimized intersection sets obtained from the perspectives of linear fitting and point density.
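The density filtering can be sketched as follows. Since the exact GMAE definition had to be reconstructed above, this code should be read as our interpretation: the per-axis mean absolute deviations are combined by a geometric mean and scaled by the confidence coefficient $\alpha$.

```cpp
// GMAE-based density filter: keep the intersections closest to the mean.
// The error measure is an interpretation of Eq. (11), not verified code.
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

std::vector<cv::Point2f> gmaeFilter(const std::vector<cv::Point2f>& pts,
                                    double alpha = 2.0) {
    double mx = 0, my = 0;
    for (const auto& p : pts) { mx += p.x; my += p.y; }
    mx /= pts.size(); my /= pts.size();        // mean intersection

    double madX = 0, madY = 0;                 // per-axis mean abs deviation
    for (const auto& p : pts) {
        madX += std::abs(p.x - mx);
        madY += std::abs(p.y - my);
    }
    madX /= pts.size(); madY /= pts.size();
    double E = alpha * std::sqrt(madX * madY); // GMAE threshold, Eq. (11)

    std::vector<cv::Point2f> DIS;              // densest intersections
    for (const auto& p : pts)
        if (std::hypot(p.x - mx, p.y - my) < E) DIS.push_back(p);
    return DIS;
}
```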

Target point solving
Now, two sets of optimized intersections have been obtained. Although the constraint of point-to-line has been transformed to the constraint of point-to-points, this is still a oneto-many problem. This problem should be reduced further to a one-to-one problem in order to get a unique target point. This problem can be transformed to a constrained minimization problem because the target point is distributed in a hyperbolic region which is constrained by the condition that the target point should also be on the fitted epipolar line.
According to the above analysis, we construct Lagrange functions to solve this constrained minimization problem with Lagrange multiplier. The optimal solution of this problem will be the target point. The objective function, f (x, y) , is a minimal Mahalanobia Distance function which describes the distance between the target point and DIS: ∑ + I denotes the SVD of inverse covariance matrix of DIS and is still a symmetric matrix. I is the mean of DIS: So, f (X, Y) can be expanded as follows: The constraint function, g(X, Y) , is a line function which is fitted with LIS . The fitted line is denoted as The constrained minimization problem can be described as follows: The optimal solution can be found when the gradients of the two functions are parallel at the target point. The unknown constant multiplier is necessary because the magnitudes of the two gradients may be different: After expanding Eq. 18, the following equations are determined: After computing its derivatives, the constrained minimization problem can be reduced to solving a set of three linear equations as Eq. 20: This set of equations can be easily solved, and the solution vector (x, y, ) T is obtained. The target point P T x T , y T = (x, y) + I = (x + x, y + y) can be identified uniquely. Figure 7 shows the illustrative curves of the (15) f (x, y) = x −x y − y T 11 12 21 22 x −x y − y = 11 x 2 + ( 12 + 21 )xy + 22 y 2 objective function and constraint function, and the optimal solution which is the target point. At last, visual contents can be superimposed on the visual scene based on the coordinates of target point. After the coordinates of target point are converted to screen coordinates, visual contents can be superimposed on the screen coordinates or other locations based on these coordinates as shown in Fig. 8. After that, the point positioning process ends and the superimposed visual contents with accurate and correct position will be shown for the operator.

Experiment setup
There is no other direct method for positioning a single point in the case of marker-less monocular image pairs and full or durative target occlusion or obscuring. The perspective transform method based on the homography matrix can be used to approximately estimate the location of the target point, but it is better suited to planar environments, and most industrial environments are not planar. Therefore, we designed an experiment drawn from practical industrial applications, covering two kinds of typical industrial operation environments, to evaluate the accuracy and stability of our proposed method against the homography matrix method.
Considering the relevance of the experiment and a better comparative analysis, we selected two industrial operation environments: aeroengine maintenance and CNC machine tool control. Both are typical industrial manual operations in typical industrial environments with occlusion or obscuring: the target area of the aeroengine is occluded by the operator's hands, and the target area of the CNC machine tool's control panel is obscured by strong light. The two environments also differ in spatiality: the aeroengine is stereoscopic, while the control panel is planar-like. Figure 9 shows the template images of the two environments and one of their corresponding target images. Both methods use uncalibrated, marker-less monocular images to locate the target points. To analyze the positioning accuracy and robustness at different positions and angles, we manually marked the target point on a fiducial image, taken at the same position as the target image but without occlusion or obscuring. In each environment, 10 groups of images were taken at different positions with different translations and rotations relative to the position from which the template image was taken. These positions are evenly distributed over the range of the operator's arms and the field of view; the left side of the reference camera is taken as negative and the right side as positive. Figure 10 shows the layout used to take the images for the different groups, Table 1 details these settings, and Table 2 shows thumbnails of the experimental images.
Our experiment ran on a ThinkPad computer with Windows 10, a 2.8 GHz CPU, 16 GB of memory, and an integrated Intel HD 520 graphics card. The image resolution is 800 × 600 pixels. OpenCV 4.5.0 was used as the development platform, and C++ was the programming language.

Experiment results and discussion
We conducted the experiment with the above settings. To evaluate the accuracy and robustness of the proposed method, and to observe the influence of the number of intersections on them, we generated different numbers of intersections, from 100 to 1000, and repeated the experiment. We used a common set of parameters: zero-mean Gaussian noise with standard deviation $\sigma_x = \sigma_y = 1$, i.e., $N(0, 1)$, was added to each feature point, and $D_{min} = 3$ and $\alpha = 2$ were set. The errors of all results for the two environments are shown in Figs. 11 and 12. From the experimental results we computed the maximum $\epsilon_{max}$, minimum $\epsilon_{min}$, mean $\mu$, standard deviation $\sigma$, and coefficient of variation $c_v = \sigma / \mu$, and used them to evaluate the positioning accuracy and robustness of our method. Table 3 shows the computed results.
It can be seen that the accuracy of our method is better than that of the HM method in both environments according to $\mu$. From the perspective of the actual impact of errors, the difference between the minimum errors of the two environments is no more than two pixels, so there is no obvious impact on the visual guidance. However, the maximum error of the HM method in the aeroengine maintenance environment is too large (> 11 pixels) to be ignored and may cause incorrect operation. The stability of our method is also better than that of the HM method in both environments according to $c_v$; this indicates that the error range of our method is smaller than that of the HM method, even in the planar-like scene. For the aeroengine maintenance, the mean error of the proposed method, $\mu = 2.33$ pixels, is almost half that of the HM method. In the 800 × 600 pixel target image, 2.33 pixels corresponds roughly to 1.6 mm in the corresponding physical scene (50 × 37.5 cm). This average error meets the requirements of industrial operation guidance well [31,32]. For the CNC control, the mean error, $\mu = 0.89$ pixels, is even smaller; in fact, both methods have almost the same registration effect at this level of error.
The results also show that the individual error of our method varies across groups and with different numbers of intersections, although the overall error is stable and small. This is because the optimal solution is obtained from statistical data. The individual error of the HM method is the same for different numbers of intersections within a group but varies across groups, because the HM method operates on the determined matched points of the image pair rather than on statistical data.
We also observed that the errors of both methods follow a similar trend, which means that the quality of the matched points has a similar impact on both. If the quality of the matched points is poor, especially in the case of an uneven distribution around the target point, the robustness of both methods is affected. The results also show that the camera position has no evident or direct impact on either method. However, the spatiality of the scene does appear to affect the accuracy of both: the closer the scene is to a plane, the smaller the error. The fundamental matrix actually captures some information about the scene structure; in other words, the fundamental matrix depends on the scene structure [23,44-46]. Evidently, the proposed method is much less affected by the scene structure, mainly owing to its use of statistical data and constrained solving. Further research should be undertaken to investigate the inner influence of the scene structure on the proposed method in order to obtain more accurate results.

(Fig. 12: Positioning errors of the CNC control environment, plotting the errors of the proposed method and the HM method against the number of intersections per group.)
In terms of efficiency, the proposed method is more time-consuming than the HM method because there is more computation to do. In general, the running time of the HM method is less than 1 ms. Figure 13 shows the running time of the proposed method for the CNC control environment. The data show that the running time increases with the number of intersections. However, Fig. 12 also shows that the number of intersections has almost no impact on the accuracy of the proposed method. Selecting 100 intersections for calculation takes less than 10 ms; although this is much higher than the runtime of the HM method, it is acceptable for real-time positioning and visual guidance at refresh rates below 60 frames per second. What is interesting about the data in Figs. 12 and 13 is that the accuracy error and the calculation time are positively correlated to some degree. We believe this is because the reprojection threshold of the FM estimated by the RANSAC method requires more computation time when the error is large.
Some limitations of this method remain. First, it needs a certain running time, mainly because of the generation of hundreds of fundamental matrices; it is therefore unsuitable for vision-assisted applications with hard real-time requirements or refresh rates above 60 frames per second. However, considering that the refresh rate of most vision-assisted applications is 30 or at most 60 frames per second, the method still has wide applicability. Second, the robustness of this method may be influenced by the scene structure, i.e., the spatiality of the operation environment; excessively uneven spatiality may lead to a greater positioning error. The HM method has the same problem, and its accuracy and robustness are worse than those of our method in this situation; in other words, our method has a certain adaptability to spatial environments. Finally, this method is not suitable for visual guidance that requires an identical error for each positioning in the same environment, because the registration error may change slightly due to the Gaussian noise added during calculation. Nevertheless, the accuracy still meets the requirements.
Overall, these results indicate that the proposed method has better accuracy and robustness than the perspective transform method based on the homography matrix, and that they meet the requirements of visual guidance for industrial operations. The method is particularly suitable for industrial operation scenes with higher positioning requirements, full and durative target occlusion or obscuring, marker-less spatial environments, and uncalibrated monocular cameras. It is also applicable to the general environments where the HM method is applicable. In terms of computational efficiency alone, if higher accuracy is not necessary or higher real-time performance is required, the HM method can be selected as the preferred option.

Conclusions and future work
Vision-assisted industrial operation has been recognized as a promising technology for most life-cycle phases and across different fields. Accurate and robust point positioning in the absence of target features and with uncalibrated monocular cameras is still a critical problem that has not been solved effectively. This study set out to design a marker-less monocular vision point positioning method for vision-assisted industrial operation environments in order to provide more accurate and robust visual guidance. The study has shown that the accurate and robust point positioning problem can be solved by converting the point-to-line constraint with uncertainty into a point-to-point constraint with a unique solution. This work contributes to the existing knowledge of computer vision point positioning by providing a novel method with higher accuracy and robustness for practical industrial environments, and it also contributes a novel constrained minimization solving algorithm for point correspondence in computer vision. The major limitations of this study are the somewhat larger amount of computation and the influence of scene structure. Notwithstanding these limitations, the method meets the practical industrial requirements of point positioning. Although it can also be used in general environments, it is more suitable for demanding industrial scenes with higher positioning requirements, full and durative target occlusion or obscuring, marker-less spatial environments, and uncalibrated monocular cameras.
However, several questions remain to be answered. One is how to optimize the generation of fundamental matrices, which accounts for most of the time consumed by this method. Another is to evaluate the method in more complex scene structures for more kinds of harsh industrial operations, and to further identify the mechanism by which the scene structure affects the fundamental matrix and thus the positioning error.