A posture detection method for augmented reality–aided assembly based on YOLO-6D

The assembly of small-batch electromechanical products still relies on manual operation, which suffers from long rechecking times and low assembly efficiency and quality. Augmented reality technology can assist assembly and improve efficiency. Since traditional posture detection methods demand high expertise and incur heavy labor and time costs, a posture detection method is proposed that matches the assembly posture against pre-calibrated assembly posture templates. The YOLO-6D model is introduced to increase the robustness of tracking registration, and detection from both translation and rotation perspectives is designed to enhance adaptability to different assembly tasks. For the generation of training datasets, a weighted sampling method based on part features is proposed to improve the accuracy of pose estimation with limited training samples. Taking the assembly process of a typical electronic product as an example, the developed augmented reality aided assembly system is shown to boost the assembly efficiency of operators effectively compared with the traditional posture detection method.


Introduction
As the concept of Industry 4.0 continues to advance, the integration, packaging density, and complexity of electromechanical products are increasing [1,2]. At present, the overall assembly automation level of small-batch electromechanical products is low, and their assembly process mainly consists of the fixed assembly of connectors and the connection of related cables. At the same time, staff face numerous assembly steps and need the support of technical experience. Assembly therefore remains a labor-intensive industry with long training cycles and high costs [3,4]. In this context, augmented reality (AR) technology has emerged as a solution to this engineering problem. In particular, vision-based augmented reality, carried by removable or wearable display devices [5], has been regarded as a powerful technological tool for improving efficiency in manual assembly, because of its ability to help workers establish a connection between the physical world and the digital information environment [6][7][8][9].
In augmented reality, 3D tracking and registration technology is the basis for virtual information guidance and assembly posture detection, and it is the part users perceive most directly. This technology acquires the pose of the target object in the real field of view in real time by means of a camera or another type of sensor and re-establishes the spatial coordinate system according to the user's viewpoint, superimposing virtual information accurately on the real object. The performance of its algorithm determines the effectiveness of the augmented reality application [10]. Among such technologies, vision-based tracking and registration has been the focus of researchers' efforts owing to its cost-effectiveness, easy portability, and broad application prospects. Salonen, Reiners, Boulanger et al. [11][12][13] applied marker-based tracking registration methods to AR assembly systems, while Yuan, Andersen, Alvarez et al. [14][15][16] performed tracking registration based on different features of the scene. Table 1 lists the advantages and disadvantages of each vision-based 3D tracking registration method. At present, the marker-based method is still the most widely used tracking registration method in AR assembly by virtue of its robustness and good real-time performance, but problems such as marker occlusion and secondary contamination of parts restrict its further development.
In recent years, deep learning-based pose estimation techniques have developed rapidly [21][22][23][24]. The YOLO-6D model [25,26] is a regression-based deep learning method: it indirectly regresses the 2D image coordinates of the 3D control points of the estimated object in an RGB image, from which the 6D pose estimation problem can be formulated. It offers higher accuracy and better scalability than traditional vision approaches. However, the huge amount of training samples and labeled data required to match usage scenarios is still an important constraint for deep learning in industrial applications [27,28]. Many scholars have proposed domain randomization, domain adaptation, and physics-engine synthesis [29][30][31] to expand the training samples, which effectively saves labor cost and improves efficiency, and has practical significance. Overall, deep learning-based pose estimation is still in a stage of rapid development, and mature application cases in aided assembly systems remain relatively few.
To address the limitations of traditional assembly posture detection methods, this paper proposes a markerless assembly posture detection method based on YOLO-6D. The method requires only a simple 2D calibration of an assembled image to determine the assembly state through the posture relationship. In summary, using a monocular camera and exploiting the efficiency of the YOLO-6D model in recognizing weakly textured parts, we regress the 2D image coordinates of the labeled 3D control points by learning the features in the sample images. Based on this positional information, a pose matching strategy designed for detection from both translation and rotation perspectives achieves the 6D pose estimation. For the limited assembly datasets, a weighted sampling method based on part features is proposed to improve the correct rate of pose estimation by filtering the sampled frames according to part-feature similarity. Finally, the effectiveness of the proposed method is verified using the developed augmented reality aided assembly system on a typical electronic product assembly example.

Related works
The traditional marker-based posture detection method [32,33] measures posture accuracy by calculating the deviation between the estimated posture and the real posture. Fig. 1 shows the principle diagram of the posture matching strategy based on the marker-based posture detection method.
Among them, the translation error and rotation error are quantified and analyzed using Eq. (1) and Eq. (2) from the article of Sattler et al. [34]. Combining the threshold ranges defined in their article with the actual scenario, we choose the posture error thresholds as |α| = 2° and d = 0.25 m. If both errors are less than the threshold values, the assembly posture is taken as correct.

d = ||c_est − c_gt||_2    (1)

where d is the Euclidean distance in meters, and c_est and c_gt are the estimated and ground-truth translation vectors.

|α| = arccos((tr(R_gt^-1 · R_est) − 1) / 2)    (2)

where |α| is the absolute orientation error in degrees, and R_est and R_gt are the estimated and ground-truth rotation matrices.
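The two error measures above can be sketched in a few lines of NumPy; the function names are illustrative, and the 0.25 m / 2° defaults follow the thresholds chosen in this section. For a rotation matrix, R_gt^-1 = R_gt^T, which the sketch exploits.

```python
import numpy as np

def translation_error(c_est, c_gt):
    # Eq. (1): Euclidean distance d (in meters) between estimated and
    # ground-truth translation vectors
    return float(np.linalg.norm(np.asarray(c_est, float) - np.asarray(c_gt, float)))

def rotation_error_deg(R_est, R_gt):
    # Eq. (2): absolute orientation error |alpha| of the relative rotation
    # R_gt^-1 R_est, via 2 cos|alpha| = tr(R_gt^-1 R_est) - 1
    cos_a = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

def posture_correct(c_est, c_gt, R_est, R_gt, d_max=0.25, a_max=2.0):
    # Posture is taken as correct only when both errors are within threshold
    return translation_error(c_est, c_gt) <= d_max and rotation_error_deg(R_est, R_gt) <= a_max
```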
As noted in Table 1, feature-based methods [19] offer higher accuracy and faster speed but are greatly affected by the workpiece's own texture and lighting, while template-based methods [20] suit weakly textured scenes but rely on hardware devices and have poor real-time performance. The 3D tracking registration method based on deep learning, by contrast, predicts the pose of the target in the video frame by learning the relationship between image features and the 6D pose of the target object [35]. Deep learning networks built with such methods generally consist of two parts: target detection first, followed by regression to the 6D pose of the target object. The regression of 6D poses can be divided into direct regression and indirect regression; Fig. 2 presents the flow charts of these two types of regression methods.

The technical framework of the proposed augmented reality aided assembly posture detection consists of two sections, as shown in Fig. 3. For the offline part, the real data is sampled first. Then, according to the 3D models of the weakly textured parts, the proposed weighted sampling method based on part features is used for sampling, and Unreal Engine is used to synthesize the dataset, raising the performance of YOLO-6D model training and forming the basis for the later pose estimation. For the online part, the registration of the virtual model of the assembled part is first achieved based on the ArUco [36,37] module, and the 3D pose of the virtual model is manually pre-adjusted and calibrated to achieve real-time guidance during assembly. In addition, the proposed assembly posture detection method, which matches against pre-calibrated assembly posture templates, detects the assembly pose from two perspectives: translation and rotation. Finally, the feedback information is used to reduce the re-inspection time of employees and improve assembly efficiency.

Offline processing
In order to associate the sampling viewpoints with the characteristics of the sampled objects, and to obtain more samples with large pose variability for model training from a limited number of samples, this paper improves on the typical sampling method [38] to obtain an adaptive distribution of sampling viewpoints over the objects.

SSC equation
In order to strengthen the connection between the sampling viewpoints and the estimated object's own features, and to obtain higher-quality learning samples, the features of the parts are selected for analysis first. The research objects in this paper are connectors of electromechanical products, which generally have regular, weakly textured shapes. From the perspective of image processing, the surface texture and contour features of the estimated objects are extracted mainly by the SIFT algorithm. The matching degree S_sift for surface texture features is defined as the ratio of the number of Euclidean-distance-based matching points to the total number of matching pairs. The contour-feature match IoU_cp is defined as the intersection over union of the areas of the part regions after center alignment in the sampled frames.
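The IoU_cp term can be illustrated with plain NumPy binary masks; this is a minimal sketch (the S_sift term would come from an OpenCV SIFT matcher and is omitted here), and the function names and mask representation are assumptions:

```python
import numpy as np

def center_align(mask):
    # Shift a binary part mask so its centroid sits at the image centre
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    dy = int(round(h / 2 - ys.mean()))
    dx = int(round(w / 2 - xs.mean()))
    out = np.zeros_like(mask)
    ys2, xs2 = ys + dy, xs + dx
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[ys2[keep], xs2[keep]] = True
    return out

def contour_iou(mask_a, mask_b):
    # IoU_cp: intersection over union of the two centre-aligned part regions
    a, b = center_align(mask_a), center_align(mask_b)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union else 0.0
```

Two identical part silhouettes at different image positions align to the same region and score 1.0, so the measure reacts to shape change rather than mere translation.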
According to the similarity evaluation metrics S_sift and IoU_cp determined above, the CRITIC approach among objective weighting evaluation methods is used to measure the objective weights of the metrics. The flow chart of the CRITIC algorithm [39] is shown in Fig. 4. To quantify the indicators while accounting for the intrinsic connections between them at the macro level, the information content and weight of each indicator are computed as

C_j = S_j · Σ_i (1 − r_ij),    W_j = C_j / Σ_k C_k

where r_ij is the correlation coefficient between evaluation objectives i and j, R_j is the correlation matrix, S_j is the standard deviation of the j-th objective, and W_j is the weight of the j-th objective.
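The CRITIC weighting can be sketched directly from its definition; the function name is illustrative, and the input is assumed to be a samples-by-criteria score matrix:

```python
import numpy as np

def critic_weights(X):
    # X: (n_samples, n_criteria) matrix of indicator scores.
    X = np.asarray(X, dtype=float)
    S = X.std(axis=0, ddof=1)            # contrast intensity of each criterion
    R = np.corrcoef(X, rowvar=False)     # correlation between criteria
    C = S * (1.0 - R).sum(axis=0)        # information content C_j
    return C / C.sum()                   # objective weights W_j, summing to one
```

With the two columns holding the S_sift and IoU_cp scores of the sampled frames, the returned weights are what the next step combines with the two similarity values to form the SSC.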
Combining the weights of each indicator, the similarity of surface and contour features (SSC) can be expressed as

SSC = ω_1 · S_sift + ω_2 · IoU_cp

where ω_1 and ω_2 represent the weights of the surface feature similarity S_sift and the contour feature similarity IoU_cp respectively, and the sum of the two is one.

Filtering path
The typical sampling method based on a latitude-longitude grid has the advantages that the distribution of viewpoints is regular and dense, and that between adjacent viewpoints at the same latitude the pose of the target object differs by only one rotational variation. This paper builds on the latitude-longitude grid and defines paths for comparing and filtering adjacent sampling frames, that is, the sequence in which a given sampled viewpoint on the sampling sphere is compared and compressed against its neighboring viewpoints at the same latitude and longitude. The path diagram of sampled-frame compression is shown in Fig. 5.
In this study, spatial coordinate point sets R_1, R_2, ..., R_n are constructed for the sampled viewpoints at each latitude. In the point set R_n, a spatial point P_1 is randomly selected as the starting point of the path and stored in the chain of spatial points corresponding to the created keyframe. The point on the left side of P_1 is recorded as PL_n and the point on the right side as PR_n. Next, the sampling frame corresponding to P_1 is used as the template image, comparison diffuses to the left and right sides, and the SSC is calculated. If the SSC is less than the set threshold, the point is added to the spatial point linked list, as shown in Fig. 5(a). At the same time, on the side that meets the condition SSC < threshold, the template is switched to the sampling frame corresponding to that spatial point, as in Fig. 5(b). If the SSC is not less than the set threshold, the corresponding spatial point is skipped, and comparison continues with the sampling frames of the spatial points further to the left or right, as in Fig. 5(c). Finally, the point set at the same latitude is looped over along this keyframe filtering path until the compared point sets overlap.
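A minimal sketch of this filtering loop, under simplifying assumptions: the start point is fixed at index 0 for reproducibility, and the ring is swept in one direction rather than diffused to both sides; `ssc` stands for any callable returning the similarity of two frames.

```python
def filter_latitude_ring(frames, ssc, threshold=0.7):
    # frames: sampling frames of one latitude ring, in angular order.
    # Keep a frame as a keyframe only when it differs enough from the
    # current template (SSC < threshold), then make it the new template.
    n = len(frames)
    keyframes = [0]
    template = frames[0]
    for step in range(1, n):          # sweep until the ring closes on itself
        idx = step % n
        if ssc(template, frames[idx]) < threshold:
            keyframes.append(idx)
            template = frames[idx]    # switch template to the kept frame
        # otherwise skip this viewpoint and keep the same template
    return keyframes
```

High-similarity neighbors are merged away, so the surviving keyframes carry most of the pose variability of the ring.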

Sampling process
The crux of the proposed weighted sampling method based on part features is to merge highly similar sampling frames within a limited number of learning samples. Fig. 6 shows the flow chart of the weighted sampling based on part features. The detailed steps are as follows:

Step 1: With a known OBJ model, import the model of the sampled object and the coordinates of the spatial sampling viewpoints.

Step 2: Set the threshold for the SSC and the compression ratio for the sampled frames, and compress them along the keyframe compression path. The threshold is generally initialized to 0.7 and the compression ratio to 3/4.

Step 3: Compare the number of compressed keyframes n with the number of samples N. If |N − n| < N · 5% is satisfied, go to Step 4; otherwise, adjust the SSC threshold and repeat Step 2.

Step 4: Uniformly distribute |N − n| Fibonacci sampling viewpoints across the different latitudes of a sampling sphere of the same radius.

Step 5: Reimport the spatial points in the final sampling linked list into Unreal Engine to generate the sampling images.
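Step 4's uniform Fibonacci distribution of viewpoints on the sampling sphere can be sketched with the standard golden-angle spiral; the radius default is illustrative only:

```python
import math

def fibonacci_sphere(n, radius=0.8):
    # Distribute n viewpoints near-uniformly on a sphere of the given
    # radius using the golden-angle (Fibonacci) spiral
    phi = math.pi * (3.0 - math.sqrt(5.0))   # golden angle in radians
    pts = []
    for i in range(n):
        y = 1.0 - 2.0 * (i + 0.5) / n        # latitude coordinate in (-1, 1)
        r = math.sqrt(1.0 - y * y)           # ring radius at that latitude
        theta = phi * i
        pts.append((radius * r * math.cos(theta),
                    radius * y,
                    radius * r * math.sin(theta)))
    return pts
```

Calling it with n = |N − n| from Step 3 tops the keyframe set back up to the target sample count.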

Online detection
The assembly posture detection method proposed in this paper contains two stages: it decomposes the correct assembly pose of a part into rotational and translational movements for matching-degree calculation. The technical route of this method is shown in Fig. 7.
The matching error for the rotational motion is defined as the absolute orientation error based on the rotation matrices, and the matching for the translational motion is defined as the intersection rate between the minimum bounding box of the virtual model used for guidance and the minimum bounding box of the 3D bounding box estimated by the YOLO-6D network.

Calibration of posture templates
In this paper, YOLO-6D-based pose estimation is performed on the part to be assembled, and the result is the RT transformation matrix from the world coordinate system to the camera coordinate system of the part. The local coordinate system of the part can then be regarded as coincident with the world coordinate system. Thus, the rotational posture of each component in the correct assembly position needs to be manually calibrated to extract the rotation matrix of the correct assembly position. Extracting the displacement matrix of the part to be assembled from an image of the correct assembly is a typical 3D-to-2D motion solution problem, which is solved in this paper with the Perspective-n-Point (PnP) algorithm [40]. The calibration process of the rotation matrix is shown in Fig. 8. Firstly, the world coordinates of the 3D feature points of the model are selected with the online tool 3d-on-2d. Secondly, the 2D points corresponding to the 3D point set in the video frame are calibrated with a self-developed 2D calibration tool built on the Qt development platform. Eventually, the RT transformation matrix from the world coordinate system to the camera coordinate system is calculated by the PnP algorithm, and the rotation matrix is saved as a template.
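The PnP step can be illustrated with a bare-bones DLT solver in NumPy. This is a hedged sketch under ideal (noise-free, n ≥ 6 correspondences) assumptions; a production pipeline would typically use an established solver such as OpenCV's solvePnP, and all names here are illustrative.

```python
import numpy as np

def pnp_dlt(pts3d, pts2d, K):
    # Recover the pose [R | t] from 3D world points and their 2D pixel
    # projections, given the camera intrinsics K.
    pts3d = np.asarray(pts3d, dtype=float)
    pts2d = np.asarray(pts2d, dtype=float)
    # Normalise pixel coordinates with the intrinsics
    ones = np.ones((len(pts2d), 1))
    xn = (np.linalg.inv(K) @ np.hstack([pts2d, ones]).T).T
    A = []
    for (X, Y, Z), (u, v, _) in zip(pts3d, xn):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # Projection matrix = right singular vector of the smallest singular value
    P = np.linalg.svd(np.asarray(A))[2][-1].reshape(3, 4)
    P /= np.linalg.norm(P[2, :3])             # rows of R have unit norm
    if P[2] @ np.append(pts3d[0], 1.0) < 0:   # cheirality: point must be in front
        P = -P
    U, _, Vt = np.linalg.svd(P[:, :3])        # project onto the rotation group
    R = U @ Vt
    if np.linalg.det(R) < 0:
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R, P[:, 3]
```

With exact correspondences the recovered rotation matrix matches the ground truth to numerical precision, which is the quantity saved as the posture template.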

A strategy for posture matching
The arbitrary posture of the part to be assembled can be decomposed into two motions: translation and rotation. The pose matching result is obtained by comparing, for each frame, the translational and rotational motions of the part with those in the template. The flow chart of the pose matching strategy for assembly posture detection is shown in Fig. 9.
For translational motion, in order to better quantify the error of the parts to be assembled in different directions, the correctness of the translational motion during assembly is determined by calculating the intersection rate of the minimum bounding box of the 3D bounding box obtained from the guidance model and the YOLO-6D estimation.
If the intersection rate is greater than 0.8, the translational pose of the assembly is taken as correct. As shown in Fig. 10, the pink area is the estimated pose bounding box and the yellow area is the overlap area, where the virtual model used for guidance is position-calibrated based on the markers and rendered by the ArUco module. For rotational motion, the same quantitative analysis with Eq. (2) is used as in the traditional marker-based posture detection approach. The error is computed between the rotation matrix estimated for the part pose by YOLO-6D and the rotation matrix in the pose template. The average error over the test sets, calculated during training of the YOLO-6D model, is set as the threshold value. If the absolute orientation error is less than the threshold, the rotational pose of the assembly is taken as correct.
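The two checks can be sketched as follows. The paper does not spell out the normalizer of the "intersection rate", so intersection-over-union of the two minimum bounding boxes is assumed here; the 7.92° default anticipates the test-set average error used later as the rotation threshold, and all names are illustrative.

```python
import numpy as np

def aabb(points):
    # Axis-aligned minimum bounding box of a set of projected 2D points
    pts = np.asarray(points, dtype=float)
    return pts.min(axis=0), pts.max(axis=0)

def intersection_rate(pts_guide, pts_est):
    # Overlap of the guidance-model box and the YOLO-6D-estimated box,
    # measured here as intersection over union (an assumption)
    (amin, amax), (bmin, bmax) = aabb(pts_guide), aabb(pts_est)
    lo, hi = np.maximum(amin, bmin), np.minimum(amax, bmax)
    inter = float(np.prod(np.clip(hi - lo, 0.0, None)))
    union = float(np.prod(amax - amin) + np.prod(bmax - bmin)) - inter
    return inter / union if union > 0 else 0.0

def posture_matches(pts_guide, pts_est, R_template, R_est, rot_thresh_deg=7.92):
    # Translation passes when the intersection rate exceeds 0.8; rotation
    # passes when the absolute orientation error is below the threshold
    cos_a = (np.trace(R_template.T @ R_est) - 1.0) / 2.0
    alpha = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return intersection_rate(pts_guide, pts_est) > 0.8 and alpha < rot_thresh_deg
```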

Implementation and experiments
In order to verify the effectiveness of the method proposed above, this section designs and validates relevant experiments based on the YOLO-6D network model with the plug-in electronics as the experimental object.

Assessment datasets and training schemes
The measurement dataset is composed of part of the public dataset in LINEMOD [41] and a self-made plug-in dataset. The model for testing and validation is shown in Fig. 11.
The comparison experiments between the typical sampling methods and the weighted sampling method based on part features were conducted on a GTX 1660S graphics card with 6 GB of video memory in a Windows environment. The typical sampling methods include the latitude-longitude grid-based and the Fibonacci grid-based sampling methods [42,43]. According to the size of the models used in the datasets, sampling was performed on sampling spheres with radii from 60 cm to 100 cm. Each class of object was sampled with 800 samples under each of the three sampling methods. From these samples, 70% were randomly selected for YOLO-6D model training; the remaining 30% from each method were mixed, and 30% of the mixture was randomly selected as the overall test sample.

Experimental results
Table 2 shows the comparison results of training with 800 samples.
It is observed from Table 2 that for the 6D Pose [34] and 5cm5° [44] metrics, the mean values improved by 8.76% and 8.27%, and by 9.17% and 9.86%, respectively, compared with the two classical sampling methods in the case of a small number of samples. For the 5px 2D Projection metric [45], the percentage of correct poses estimated using the weighted sampling method based on part features proposed in this paper consistently approaches or even exceeds the higher of the other two methods. This demonstrates the better robustness of the method across different sampled objects, and it provides reliable data for the subsequent posture detection section.

Experiment design
Taking the assembly position detection of the optical disk drive in a computer case as an example, synthetic samples are first obtained by the sampling method above and mixed with a certain number of real calibrated samples to generate the hybrid training datasets; the specific production process is shown in Fig. 12. After training the YOLO-6D network, the average absolute orientation error over the test sets was |α| = 7.92°, and the threshold for the rotation matrix comparison was accordingly set to 7.92°.
In order to obtain better comparison results in the posture detection experiment, one correct assembly posture and two wrong assembly postures are designed. Taking the assembly of the optical disk drive as an example, a sample of the assembly posture detection is shown in Fig. 13. Fig. 13(a) shows the template image of correct assembly; Fig. 13(b) shows the correct assembly pose; the wrong assembly poses comprise two kinds: correct rotational motion with wrong translational motion, as in Fig. 13(c), and wrong rotational motion, as in Fig. 13(d).

Experimental results
The above samples are used to calculate the matching results based on the pose matching strategy in Sect. 3.2.2. Table 3 shows the assembly state detection results for the samples.
The above assembly posture detection results show that, for both correct and incorrect assembly posture images of the assembly process, the rotation and translation matching results are consistent with the designed assembly states.

Application cases
To verify the superiority of the proposed method over the marker-based detection method in practical product applications, this paper presents the Qt-based design and development of an augmented reality aided assembly system for mechanical and electrical products produced in small batches with large bodies and multiple connectors. Figure 14 shows a sketch of the integrated system in operation. This section takes the assembly of a computer case as an example to demonstrate the complete assembly of the case. The goal of the assembly is to use the corresponding tools to assemble the six provided components into the case, namely the optical drive, hard disk 1, hard disk 2, power supply, fan, and CPU.

Application
Eight people with no prior knowledge of the assembly were selected and divided into two groups. The four assemblers in each group stood at the workbench and performed the whole assembly of the case using the developed system. The first group was required to assemble using the conventional marker-based posture detection method, and the second group using the YOLO-6D-based posture detection method. After specifying the assembly target of the case, the assemblers entered the information management module of the system to register the assembly process information and store it in the databases. Then, the poses of the assembled parts were adjusted and calibrated through the assembly guidance module. Subsequently, the posture template was calibrated in the assembly posture detection module, and all this information was recorded in the databases. Finally, the system decided whether to automatically call the guidance information for the next step from the databases according to the feedback from the posture detection module. Throughout the whole assembly process, the assembly guidance module assisted the assembly of the case by interacting with the 3D tracking registration module.

Results and analysis
From the perspective of productivity, the time to complete assembly and the number of errors are the most critical performance metrics in assembly operations. Table 4 lists the three types of assembly errors considered the most common in manual mechanical assembly processes [46]. Therefore, during the chassis assembly experiment, the completion time as well as the number and type of assembly errors were recorded for each assembler. The experimental results of the two groups were collected and compared, as shown in Tables 5 and 6. Compared with the first group, the assembly completion time and the number of assembly errors in the second group were reduced by 15.7% and 27.3%, respectively. Among them, the part selection error rate was reduced by 25%, because during the first group's assembly there were cases in which the markers were out of the field of view, obscured, or reflective, causing the markers to be poorly identified and increasing the detection time. For the assembly sequence, the error rate is almost zero for both groups, because the system recognizes the parts to be assembled at each step, which guides the assembly sequence well. Both groups showed certain fixation errors, because neither detection method yet recognizes tiny parts and flexible cables, and the mixed datasets do not contain enough samples of each part, which leads to occasional incorrect pose estimates that disturb the assemblers.
In summary, the YOLO-6D-based posture detection method proposed in this paper has the merits of good real-time performance and high detection robustness. It also avoids the "pollution" of the field of view by markers and has clear potential for further development.

Conclusion
In this paper, a posture detection method based on posture templates and non-contact assembly posture detection is proposed to address the issues of current assembly posture detection methods. The method combines the YOLO-6D model with the pose matching strategy, which effectively reduces the rechecking time of the assembly and enhances assembly efficiency. A weighted sampling method based on part features is designed to improve the effectiveness of YOLO-6D model training. Comparative experiments verify that the method can enhance the sampling quality of parts with limited training samples and reduce the distortion caused by too much synthetic data. To verify the practicality of the proposed approach, an augmented reality aided assembly system for small-batch mechanical and electrical products is designed and developed, and the system is applied to an assembly example of a typical electromechanical product. User testing shows that the assembly completion time and the number of assembly errors are reduced by 15.7% and 27.3%, respectively, when the posture detection method proposed in this paper is compared with the marker-based posture detection method. This illustrates the effectiveness of the proposed approach for registering virtual assembly guidance information in real scenarios, and it can be applied to guidance in augmented reality assembly of products.