A blind area information perception and AR assembly guidance method based on RGBD data for dynamic environments and user study

In this research, a blind area information perception and guidance approach for dynamic contexts is proposed as a solution to the issue of difficult and time-consuming assembly in blind areas. The proposed approach uses real-time RGBD data to perceive both the blind area context and the operator's hand information. The resulting data are then used to visualize the blind area scene and provide assembly guidance through augmented reality technology. Unlike conventional methods, the proposed solutions are based on dynamic RGBD data rather than static predefined CAD models, making them simpler to configure and adaptable to more scenarios. A user study was designed and conducted to confirm the feasibility of the suggested approach. The results indicate that the suggested approach can decrease assembly time by 49.5%, greatly lower the percentage of assembly errors, reduce the mental load on workers, and significantly enhance their operational experience.


Introduction
Assembly is an essential component of product manufacturing [1]. In the aerospace field, products are mostly characterized by small batches, multiple varieties, and high customization, so their assembly processes still involve plenty of manual operations [2]. Meanwhile, the complex structure and high space utilization of aerospace products lead to inevitable occlusion problems in certain assembly scenarios (such as the docking of the wing and the central wing box), requiring workers to rely solely on tactile groping for blind assembly operations. Blind assembly operations are sometimes accompanied by ergonomically unfavorable assembly postures, such as squatting or lying down, which significantly increase the physical burden on the worker and may pose potential safety risks.
Previous research has shown that people primarily rely on visual information to comprehend their surroundings [3], with the sense of touch playing only a minor part.
Assembly via tactile groping is bound to be a difficult and unpleasant activity for unskilled new employees. As a result, figuring out how to employ existing technologies to help workers perceive blind area information and perform assembly operations is critical for increasing production efficiency and ensuring worker health.
Augmented reality (AR) technology offers a practical solution for the spatial viewing of obscured information. AR helps workers view and comprehend information from a 3D perspective by integrating virtual information with real-world scenes [4], which contributes to reducing the user's cognitive burden [5,6].
There are some existing studies that use AR to view blind area information [7,8]. These studies, however, are typically based on predefined models of the blind area environment; i.e., before visualizing blind area information (e.g., mounting hole locations) with AR, CAD models of the parts to be assembled in the blind area must be transferred to the AR space in advance, and a virtual-real alignment must be performed beforehand, which limits flexibility whenever the assembly scenario changes.
The remainder of this paper is organized as follows: Section II reviews related work; Section III describes the proposed methodology; Section IV presents the design of the user study; Section V presents the analysis of the experimental results. Finally, Section VI discusses the limitations, conclusions, and future work of this paper.

Related work

Hand identification and tracking
Workers in blind assembly circumstances typically perform assembly tasks by hand or with basic tools. A tool can be tracked by placing an identification marker on it while it is in the worker's hand, but for bare-hand operations the worker's hand posture itself must be identified and tracked. According to their principles, current hand tracking approaches can be generally categorized into vision sensor-based approaches and inertial sensor-based approaches [9].
In terms of inertial sensor-based methods, Chen [10] proposed a microelectromechanical system (MEMS) inertial sensor-based hand tracking method for virtual training scenarios and achieved gesture recognition and semantic interaction using kinematic equations and support vector machine (SVM) algorithms. Stancic et al. [11] conceived and developed an inertial sensor-based gesture interaction method and wearable device for the human-robot interaction problem, as well as an online gesture categorization algorithm that processes raw sensor data based on strings of motion primitives. Djemal et al. [12], on the other hand, utilized an artificial neural network to classify dynamic movements with an overall accuracy of roughly 96%. In general, hand tracking methods based on inertial sensors are mostly used for gesture recognition. Inertial sensors are unsuitable for providing spatial position information since they only provide raw acceleration data; position data must be derived by double integration, which inevitably results in a growing cumulative error. In addition, this approach necessitates hand-worn sensors, which impede the natural movement of the hand.
In terms of vision sensor-based methods, Xiao [13] proposed a hand tracking method that combines color features and depth information to address the difficulty of hand tracking with only RGB data, using a particle filtering algorithm to predict the hand pose and improve the tracking results. Zhang et al. [14] developed a real-time hand tracking approach based on MediaPipe [15] and a single RGB camera to address the issue that hand tracking in AR/VR frequently requires complicated external sensors. They separated the hand model into the palm and hand bones; during recognition, they first identify the distinctive palm, then obtain the finger pixel range and generate 2.5D coordinates from the 2D images.
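As an illustration of this kind of pipeline, the sketch below extracts per-joint 2.5D data (pixel coordinates plus wrist-relative depth) from a single RGB frame using the MediaPipe Hands API; it is a minimal example under assumed parameters, not the exact implementation of [14].

```python
# Minimal sketch, assuming a MediaPipe-style pipeline as in [14, 15]:
# extract 2.5D landmarks (pixel coords + wrist-relative depth) per frame.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)

def get_25d_landmarks(bgr_frame):
    """Return [(u, v, z_rel), ...] for the 21 hand joints, or None."""
    h, w = bgr_frame.shape[:2]
    result = hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    # x, y are normalized image coordinates; z is depth relative to the wrist
    return [(int(p.x * w), int(p.y * h), p.z) for p in lm]
```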
Valentini [16], on the other hand, tracks hand motion and supports hand-part interaction in AR using a commercial tracking device, LeapMotion. In the authors' usage experience, however, LeapMotion's tracking results can be jittery, and tracking is lost when the fingers are curled up or heavily occlude each other, reducing the user's experience. In general, most existing vision sensor-based systems track hand postures at the 2D level, and the generated data are generally 2D or 2.5D coordinates. In blind assembly scenarios, however, spatial 3D data of the hand joints are needed to visualize the hand model in AR space and assist the assembly.

AR-assisted blind area assembly
The virtual-real fusion feature of AR makes it ideal for the spatial viewing of obscured information. According to the visibility of the assembly parts, Wang et al. [17] classified blind assemblies into two types: entirely invisible and partially invisible. For the partially invisible case, they propose posting visual markers in the visible area for tracking; for the entirely invisible case, a LeapMotion sensor is installed in the blind area to capture the worker's hand movements. The results of their user study indicate that introducing AR to assist blind assembly helps improve efficiency. Henderson et al. [18] developed an AR-based assisted assembly system to address the issues of confined space and mutual occlusion of parts in military vehicle maintenance; it presents assembly information as AR text, arrows, or animations, effectively reducing the operator's head rotation. Khenak et al. [19] designed experiments to compare two data visualization approaches for the blind insertion task: a wireframe overlay mode, which shows the model wireframe, and an axis overlay mode, which shows only essential geometric constraint elements (e.g., axes). The experiments show no significant difference in performance between the two modes, although the insertion trajectory in the axis overlay mode is smoother. In contrast to visual methods of tracking the hand, Zhang et al. [8] used a data glove to gather hand posture in blind areas. This method, while not needing additional equipment in the blind area, requires workers to wear gloves and yields inaccurate measurements because the hand position is extrapolated from the shoulder position. Wang et al. [20] proposed a multi-view interface that integrates first-person and third-person views to address the assessment of assemblability in confined spaces, allowing the user to simulate assembly operations naturally in the first-person view while assessing assemblability holistically in a "world in the palm of your hand" in the third-person view. Feng et al. [21] developed a LeapMotion-based blind assembly assist system that enhanced assembly efficiency and lowered cognitive load by making the part area near the eye gaze point and hand transparent, allowing the user to see the assembled parts on the opposite side.
In general, previous studies on AR-assisted blind assembly usually utilized vision-based methods to capture hand position and display the data via AR. This approach, however, has a number of drawbacks. First, it is difficult to gather 3D coordinate data of the hand with only RGB images, and commercial devices such as LeapMotion do not operate well when the self-occluded area of the hand is rather large. Second, existing methods' visualization of parts depends on predetermined CAD models, so each time the assembly scenario changes, the project must be reconfigured, reducing the method's flexibility and applicability and limiting its engineering application. Therefore, proposing a relatively generalized blind area information sensing and assembly guidance method for the blind assembly problem remains an open issue.

Methodology
This section describes the system components and specific technical details. Module A is in charge of tracking hands and parts using RGBD data; module B is responsible for recognizing the current assembly step via deep learning methods; module C is responsible for perceiving assembly context information and determining the size and location of mounting holes; and finally, module D is responsible for guiding workers through the blind assembly process in AR using the output of the first three modules. Specific details of each module are described below.

Image processing

Coordinate calculation
Fig. 3 Calculation process of the 3D coordinates of each hand joint

As shown in Figure 3, the 2.5D joint coordinates produced by the hand tracking model cannot be used directly for hand visualization and must be combined with depth images to calculate 3D coordinates. Because the depth data of occluded hand joint points cannot be retrieved directly from the depth image, the 2.5D data must be used to derive the 3D coordinates of the occluded feature points. The specific process is as follows: (A) Determine the type of occlusion of each joint of the hand.
As shown in Figure 4, the occlusion situation can be split into two cases: (a) fingers occluding each other and (b) single finger self-occluding.
For case (a), as shown in Figure 5, occlusion relationships between the finger joint line segments are first identified in the image plane; however, segments that participate in an occlusion relationship but are not themselves obscured need to be removed (for example, line segment l1 blocks line segment l2: there is an occlusion relationship between l1 and l2, but l1 is not obscured). If an occlusion relationship exists, the relative depth z in the 2.5D coordinates of the starting and ending feature points of the obscured joint line segment is always greater than that of the visible line segment. As a result, we remove the visible line segments by comparing the relative depths of the starting and ending feature points of each pair of line segments with an occlusion relationship. It is worth noting that this procedure does not apply to poses in which the finger is twisted, as shown in Figure 6; however, because such a pose does not fit natural human gesture habits, it is virtually impossible in real assembly and is therefore not taken into account.

For the finger self-occluding case (b), due to the limitation of the finger's own biological structure, occlusion is reflected in the 2.5D feature point coordinates as a too-short knuckle length or a too-small pinch angle, as in Figure 4(b). Therefore, finger self-occlusion can be determined by Eqs. (2) and (3):

$$l < threshold_l \quad (2)$$

where l is the joint length and threshold_l is the given length threshold;

$$\theta < threshold_\theta \quad (3)$$

where $\vec{PM}$ and $\vec{PD}$ are the vectors from the feature point PIP to MCP and DIP, respectively, θ is the angle between them, and threshold_θ is the given angle threshold.
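As a concrete illustration of these two tests, the sketch below flags a finger as self-occluded when a projected knuckle is too short (Eq. 2) or the pinch angle at the PIP joint is too small (Eq. 3); the threshold values are assumptions for illustration only.

```python
# Sketch of the self-occlusion tests of Eqs. (2) and (3).
# Thresholds are assumed values, not the paper's settings.
import numpy as np

LEN_THRESHOLD_PX = 8.0     # assumed minimum projected knuckle length (pixels)
ANGLE_THRESHOLD_RAD = 0.6  # assumed minimum pinch angle at PIP (radians)

def finger_self_occluded(mcp, pip, dip):
    """mcp, pip, dip: 2D pixel coordinates of one finger's joints."""
    mcp, pip, dip = map(np.asarray, (mcp, pip, dip))
    # Eq. (2): knuckle appears too short in the image plane
    if np.linalg.norm(dip - pip) < LEN_THRESHOLD_PX:
        return True
    # Eq. (3): angle between vectors PIP->MCP and PIP->DIP is too small
    pm, pd = mcp - pip, dip - pip
    cos_t = np.dot(pm, pd) / (np.linalg.norm(pm) * np.linalg.norm(pd))
    return np.arccos(np.clip(cos_t, -1.0, 1.0)) < ANGLE_THRESHOLD_RAD
```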
After determining the occlusion of the finger joints, the 3D coordinates of the obscured joint feature points need to be derived by combining 2.5D coordinates with depth data.
As shown in Figure 3, the conversion factor k from 2.5D to 3D coordinates is first determined from the 2.5D and 3D coordinates of the unobscured feature points, as shown in Eq. (4):

$$k = \frac{1}{n}\sum_{i=1}^{n}\frac{z_{3D,i} - z_{3D,0}}{z_{2.5D,i} - z_{2.5D,0}} \quad (4)$$

where n is the number of unobscured feature points; z_{3D,i} is the depth value of the i-th unobscured feature point in the camera coordinate system obtained from the depth image; z_{3D,0} is the depth value of the wrist joint feature point in the camera coordinate system; z_{2.5D,i} is the relative depth value of the i-th feature point in 2.5D coordinates; z_{2.5D,0} is the relative depth value of the wrist joint feature point in 2.5D coordinates, which is 0; and k is the conversion factor between the 2.5D relative depth value and the true depth value.
Then, using the conversion factor k, the true depth of an obscured feature point j in the camera coordinate system is determined by Eq. (5):

$$z_{3D,j} = z_{3D,0} + k \cdot z_{2.5D,j} \quad (5)$$

where z_{2.5D,j} is the relative depth value of the obscured feature point in 2.5D coordinates.
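A minimal sketch of Eqs. (4) and (5), assuming the visible-joint depths have already been sampled from the depth image; the wrist is excluded from the average since its relative depth is 0.

```python
# Minimal sketch of Eqs. (4) and (5): estimate the 2.5D-to-3D conversion
# factor k from the unobscured joints, then recover obscured-joint depths.
import numpy as np

def recover_occluded_depths(z3d_vis, z25d_vis, z3d_wrist, z25d_occ):
    """z3d_vis:   true depths of unobscured joints, from the depth image
       z25d_vis:  their 2.5D relative depths (nonzero; wrist excluded)
       z3d_wrist: true depth of the wrist joint
       z25d_occ:  2.5D relative depths of the obscured joints"""
    z3d = np.asarray(z3d_vis, dtype=float)
    z25 = np.asarray(z25d_vis, dtype=float)
    # Eq. (4): average per-joint ratio of true depth offset to relative depth
    k = np.mean((z3d - z3d_wrist) / z25)
    # Eq. (5): scale each obscured joint's relative depth back to metric depth
    return z3d_wrist + k * np.asarray(z25d_occ, dtype=float)
```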

Data Reliability Verification
Unreliable raw data are removed in two ways to make the tracking more consistent and reliable. On the one hand, the current frame's depth image is compared to that of the previous frame; pixels with a large absolute difference are rejected, and the previous frame's depth at those pixels serves as the current frame's depth. On the other hand, if the number of currently recognized hand feature points is too small, the measurement is considered unreliable and the frame data are discarded.
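A minimal sketch of the two reliability checks; the threshold values are assumptions for illustration.

```python
# Minimal sketch of the two reliability checks; thresholds are assumptions.
import numpy as np

DEPTH_JUMP = 50.0   # assumed max per-pixel depth change between frames (mm)
MIN_JOINTS = 15     # assumed minimum recognized joints (out of 21)

def stabilize_depth(depth_now, depth_prev):
    """Replace pixels whose depth jumped too much with the previous value."""
    jump = np.abs(depth_now.astype(float) - depth_prev.astype(float)) > DEPTH_JUMP
    out = depth_now.copy()
    out[jump] = depth_prev[jump]
    return out

def frame_is_reliable(landmarks):
    """Discard frames in which too few hand feature points were recognized."""
    return landmarks is not None and len(landmarks) >= MIN_JOINTS
```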

Assembly context perception
The assembly context perception module, as shown in Figure 8, is divided into two parts: assembly plane fitting and mounting hole detection.
For assembly plane fitting, the pixel area of the assembly plane is manually framed first, and then the plane is fitted with a random sample consensus (RANSAC) algorithm applied to the bilaterally filtered depth data, from which the plane equation is determined.
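One way to realize this step is sketched below with Open3D's built-in RANSAC plane segmentation; the authors' own implementation may differ, and the distance threshold is an assumed value.

```python
# Sketch of the plane-fitting step using Open3D's RANSAC plane segmentation;
# an assumption-laden example, not necessarily the authors' implementation.
import numpy as np
import open3d as o3d

def fit_assembly_plane(points_xyz):
    """points_xyz: (N, 3) camera-space points inside the manually framed
    assembly-plane region (after bilateral filtering of the depth image).
    Returns (a, b, c, d) such that ax + by + cz + d = 0."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.asarray(points_xyz, dtype=float))
    plane_model, _inliers = pcd.segment_plane(distance_threshold=0.003,
                                              ransac_n=3,
                                              num_iterations=1000)
    return plane_model  # its normal (a, b, c) also gives the hole axis vector
```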
To detect mounting holes, first obtain the hole center pixel coordinates O_i(u_i, v_i) and pixel radius r_i through a sequence of filtering, binarization, edge detection, and Hough circle detection steps. Then, starting from the hole's center, the pixels in the u-direction (any other direction works equally well) are iterated over until a sudden depth change is found at a pixel P_e(u_e, v_e), indicating that this pixel lies on the hole's edge. Using the depth images, the 3D coordinates O_{3D,i}(x_i, y_i, z_i) and P_{3D,e}(x_e, y_e, z_e) of the points O_i and P_e are then acquired, and the 3D distance between these two points can be considered the real radius of the hole.

Fig. 8 Assembly context perception process

Finally, the radius of each detected hole is assigned to one of the preset categories according to the preset hole types (e.g., only 3 mm or 5 mm radius holes are possible in this scenario), thus completing the mounting hole detection process. In addition, the axis vector of the hole is given by the normal vector of the plane in which it lies.
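The sketch below illustrates the hole-detection chain with OpenCV's Hough circle transform (which applies Canny edge detection internally) and the final snap-to-category step; all parameter values are assumptions chosen for illustration, not the paper's settings.

```python
# Sketch of the mounting-hole detection chain; parameters are assumed.
import cv2
import numpy as np

PRESET_RADII_MM = [3.0, 5.0]  # only these hole types exist in this scenario

def detect_hole_candidates(gray):
    """Return an array of (u, v, r_pixels) circle candidates."""
    blurred = cv2.medianBlur(gray, 5)
    circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1.2,
                               minDist=20, param1=100, param2=30,
                               minRadius=5, maxRadius=60)
    return np.empty((0, 3), int) if circles is None else \
        np.round(circles[0]).astype(int)

def classify_radius(radius_mm):
    """Snap the measured 3D radius to the nearest preset hole category."""
    return min(PRESET_RADII_MM, key=lambda r: abs(r - radius_mm))
```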

Assembly stage identification
Due to the lack of visual information when executing blind assembly operations, it is generally difficult for workers to discern the current stage of assembly. For this problem, we employ a convolutional neural network (CNN) to determine the assembly step and a voting mechanism to improve the classification accuracy. The detailed procedure is as follows.
YOLOv5 was chosen as the recognition model for the assembly step based on small-sample detection results (YOLOv5: 143 fps, 94.82% correct rate; SSD: 46 fps, 82.6% correct rate). After model training, the number of frames c_j identified as work step j among n frames {I_1, I_2, ..., I_n} is counted, and the assembly step with the highest recognition count is chosen as the final recognition result, which is then used to determine whether the current assembly position is correct and to drive the AR assembly guidance process.
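A minimal sketch of the frame-voting step; `detect_step` stands in for single-frame YOLOv5 inference returning an assembly-step label.

```python
# Minimal sketch of the voting mechanism over n consecutive frames.
from collections import Counter

def vote_assembly_step(frames, detect_step, n=15):
    """Classify n frames and return the majority label, i.e. the step j
    with the highest recognition count c_j."""
    votes = Counter(detect_step(f) for f in frames[:n])
    step, _count = votes.most_common(1)[0]
    return step
```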

Coordinate system calibration
In order to visualize the data in the AR environment, it is necessary to convert the data from the RGBD camera coordinate system to the AR glasses coordinate system, i.e., to calculate the coordinate transformation matrix.
In general, virtual-real registration is implemented with a single identification marker: the AR glasses determine their position relative to the marker by scanning it with their own camera, whereas virtual graphics, text, and other elements already have their positions relative to the marker determined when the project is built, allowing the calibration process to be accomplished with a single marker.
In the blind assembly scenario, however, unlike the normal case, the RGBD camera cannot see items beyond the blind region, and the AR glasses cannot see objects inside the blind area, so the RGBD camera and the AR glasses lack a common positional reference. To address this issue, this research presents a union localization method based on indirect identification markers. A union positioning tool is designed with two identification markers attached, as shown in Figure 9. The two markers' relative positions are predefined, so their conversion matrix is known; with the two markers acting as a bridge between the inside and outside of the blind area, the transformation matrix from the RGBD camera to the AR glasses can be generated according to Eq. (6), thus completing the calibration procedure:

$${}^{W}_{C}T = {}^{W}_{V2}T \; {}^{V2}_{V1}T \; {}^{V1}_{C}T \quad (6)$$

where ${}^{W}_{C}T$ is the coordinate transformation matrix from the RGBD camera to the AR glasses; ${}^{V1}_{C}T$ is the transformation matrix from the RGBD camera to the inner (blind area) marker; ${}^{V2}_{V1}T$ is the transformation matrix from the inner marker to the outer marker, which is measured in the CAD software; and ${}^{W}_{V2}T$ is the transformation matrix from the outer marker to the AR glasses.
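Numerically, Eq. (6) is just a chain of 4×4 homogeneous transforms; the sketch below mirrors the matrix names of Eq. (6), while how each matrix is measured lies outside the snippet.

```python
# Sketch of Eq. (6): compose RGBD-camera -> AR-glasses from the marker chain.
import numpy as np

def camera_to_glasses(T_c_to_v1, T_v1_to_v2, T_v2_to_w):
    """T_c_to_v1:  RGBD camera -> inner marker (measured by the RGBD camera)
       T_v1_to_v2: inner marker -> outer marker (known from the CAD design)
       T_v2_to_w:  outer marker -> AR glasses (measured by the AR glasses)
       All inputs are 4x4 homogeneous transforms."""
    return T_v2_to_w @ T_v1_to_v2 @ T_c_to_v1

# A homogeneous point p_cam in camera coordinates maps to glasses coordinates:
# p_glasses = camera_to_glasses(T1, T2, T3) @ p_cam
```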

AR assembly guidance
Two visualization and guidance approaches, AR video streaming and AR virtual model guidance, are proposed based on the aforementioned detection information for displaying the blind area information and guiding the blind assembly process.
As shown in Figure 10, in the AR video stream guidance approach, the 2D video stream is overlaid so as to coincide with the real assembly objects, and workers wearing AR glasses perform assembly based on the video stream, with green, blue, and red markers representing the installed, being installed, and pending installation states, respectively.
In the AR virtual model guidance approach, RGBD images of the blind area are abstracted into simple AR models, arrows, and text. Green, red, and white blobs represent holes in different assembly states, guide lines connect parts to be assembled with their mounting locations, and the hand model is rendered transparently to prevent occlusion.


User study
We designed and conducted a user study on blind assembly.This section describes the experiment purpose, participants, experiment conditions, and procedure, as well as relevant hypotheses and the data analysis method.

Experiment objectives
The objectives of this experiment are twofold: first, to investigate the effectiveness of the proposed method for blind assembly guidance, specifically its differences with other methods in terms of efficiency, error rate, system usability, and user experience; and second, to investigate the information representation characteristics of 2D video streaming and 3D AR graphics.

Participants
18 participants (4 female and 14 male), all from the School of Mechanical Engineering, Northwestern Polytechnical University, were invited to participate in this experiment.
The participants' ages ranged from 23 to 30 years (M = 25.4, SD = 2.21). 8 individuals had prior experience with blind assembly operations, whereas 10 had none. 15 participants had prior experience with AR/VR-assisted assembly, whereas 3 had none. All of the subjects had normal vision: 7 had normal naked-eye vision and 11 wore vision-correcting glasses. This study was authorized by the Medical and Laboratory Animal Ethics Committee of Northwestern Polytechnical University. Each participant was asked to read and sign an informed consent form prior to the start of the experiment, which informed them of the purpose, process, and related precautions of the experiment. Following the experiment, each participant was given a snack bag worth roughly 50 CNY for their participation.

Experiment conditions and procedure
As shown in Figure 11, the experiment comprises four conditions: a no guidance group (NG), a partial AR guidance group (PARG), a video stream guidance group (VSG), and an AR guidance group (ARG). Participants in the NG condition have to figure out the assembly sequence, position, and object from paper process documents and rely on their sense of touch to grope through the assembly, representing the traditional pure blind assembly scenario. The PARG condition shows virtual hands and the assembly context but no guidance information; the VSG condition adds 2D video streaming information; and the ARG condition adds dynamic guidance information on top of the PARG condition.
The steps of the experiment are as follows. After reading and signing the informed consent form, participants were invited to complete a pre-questionnaire covering basic information such as age, gender, and relevant experience. Participants were then given 15 minutes to become familiar with the visualization and functionality of the system in order to reduce experimental error due to inconsistent system familiarity. To reduce the learning effect caused by experiment order, we determined the order of the four conditions for each participant using a Latin square design. In each experiment, participants were instructed to install bolts with sizes of 14 mm, 10 mm, and 8 mm into the designated bolt holes in the blind area in the given order, with 5 bolts of each kind for a total of 15 bolts. Following the experiment, participants were asked to complete three questionnaires documenting their subjective experiences.

Evaluation
The following methods were employed to assess the interaction experience under different experiment conditions.
Objective data: (1) Assembly time: the time taken by the participants to complete the assembly task; (2) Error rate δ of bolt assembly, as calculated by the following equation:

$$\delta = \frac{n_c}{N} \times 100\%$$

where n_c is the number of wrong or missing installations, and N is the total number of bolts.
We used one-way ANOVA with post hoc comparisons under the Bonferroni correction to analyze the task time and error rate data. The Friedman and Wilcoxon signed-rank tests were employed for significance analysis of the discrete scale and questionnaire data.
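For reference, the same analysis pipeline can be sketched with SciPy; the arrays below are synthetic placeholders (18 participants per condition), not the study's measurements.

```python
# Sketch of the statistical tests described above, on placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = {c: rng.normal(loc, 20.0, 18) for c, loc in
        [("NG", 300), ("PARG", 220), ("VSG", 170), ("ARG", 160)]}

# Task time / error rate: one-way ANOVA across the four conditions,
# followed by pairwise comparisons at a Bonferroni-corrected alpha.
f_stat, p_anova = stats.f_oneway(*data.values())
alpha_corrected = 0.05 / 6          # 6 pairwise comparisons among 4 groups

# Discrete scale and questionnaire data: Friedman test across conditions,
# then pairwise Wilcoxon signed-rank tests on related samples.
chi2, p_friedman = stats.friedmanchisquare(*data.values())
w_stat, p_wilcoxon = stats.wilcoxon(data["VSG"], data["ARG"])
```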

Hypotheses
The objective of this experiment is to determine the efficacy of the proposed AR-guided blind assembly approach and how it differs from 2D video streaming assembly guidance. Based on this goal and our intuition, we propose the following hypotheses:
Hypothesis 1: The ARG approach yields the best results, i.e., the lowest error rate and the fastest time.
Hypothesis 2: There is no statistically significant difference in the guidance effect of VSG and ARG.
Results analysis

Assembly error rate
Figure 13 depicts the average assembly error rates for the four conditions, derived by recording the number of incorrect and missing assemblies. The average assembly error rates are NG: 6.32%, PARG: 1.6%, VSG: 0.7%, and ARG: 0%.

SUS

In terms of SUS scores, the ARG group scores highest, while the NG group has the lowest score. The Friedman test revealed a significant difference in SUS scores among the four conditions (p < 0.01). The Wilcoxon signed-rank test was used to determine whether there was a significant difference between each pair of groups, as shown in Figure 14.

The SUS scale results can be divided into three subscales that assess the system in terms of usability, learnability, and satisfaction. Figure 15 depicts the subscale scores.
The ARG group has the lowest learning difficulty, the best usability, and the highest user satisfaction, followed by the VSG group with little disparity, the PARG group, and the NG group with the lowest score.

SMEQ
The SMEQ score represents the user's mental effort during the assembly procedure.
Figure 16 shows that users in the ARG group have essentially no mental burden, whereas users in the VSG (M = 33.89, SD = 14.41) and PARG (M = 14.39, SD = 9.50) groups have significantly higher mental strain, and users in the NG group have the highest mental load. The Friedman test revealed a significant difference (p < 0.001) among the four conditions, and the Wilcoxon signed-rank test results are given in Figure 16.

UEQ
The aforementioned experiment results show that the PARG group is inferior to the VSG and ARG groups due to a lack of guidance information; hence, the PARG group was not considered in the UEQ data analysis. Figure 17 depicts the results of the UEQ, which measures users' subjective experience of the system in six aspects. As can be observed, the ARG group outperforms the VSG group in all six dimensions.

Discussion

(1) Transferring assembly step identification to the system reduces the worker's mental load. Traditionally, workers must consult paper or electronic process manuals during actual assembly, causing their attention to switch back and forth between the assembly object and the process handbook. The constant switching of attention and the repeated repositioning of the hands increase the mental load. In contrast, by transferring the duty of identifying the assembly stage to the system, the worker only needs to focus on finishing the assembly operation in accordance with the system's directions.
(2) Both the video stream method and the AR virtual model method have benefits and drawbacks. There is no significant difference between the ARG and VSG groups in terms of assembly time, error rate, or SUS, as demonstrated by the experiment analysis results, implying that the assembly guidance effects of the two are actually similar.
However, this might also be due to the simplicity of the assembly task we set, as we only considered assembly on a single plane, which could result in ARG's failure to demonstrate its advantages in providing 3D spatial information. In terms of UEQ, the ARG group was only marginally superior in the stimulation and novelty aspects.
Only in terms of mental burden does the ARG group clearly outperform the VSG group. We also observed that the AR virtual model guidance was considerably less stable than the direct video stream due to its more sophisticated calculation procedure.
(3) The two proposed blind assembly assistance approaches (ARG and VSG) improve blind assembly significantly. It can be seen that, with the help of the proposed approaches, the participants' performance improved dramatically in all areas.

Limitations and conclusions
Although our proposed blind assembly assist method improves assembly efficiency and correctness, there are still limitations to our work.
(1) The proposed blind context perception approach has limited application scenarios. At present, we consider only the most common scenarios, in which the assembly surface is flat and the mounting holes have the most basic circular shape.
However, there are some assembly scenarios (for example, airplane wall plate assembly) where the assembly surface is curved. In the future, we will further improve the applicability of the algorithm based on deep learning and RGBD data for accurate perception of the shape and position information of more complex assembly areas.
(2) The assembly task in our user study was too basic. As stated in Section V, the reason for the minimal performance difference between the ARG and VSG groups may be that we only considered assembly on a single plane, which may have prevented the ARG group from demonstrating its ability to represent 3D spatial information, biasing the experiment results.
(3) The proposed algorithms' stability needs to be improved. Despite meeting basic usage requirements, the proposed algorithms are less stable than the video stream guidance approach. We will address this issue in the future by adding filtering algorithms, adjusting algorithm parameters, and switching to higher-performance hardware.
In this research, we propose a blind area information perception and dynamic guidance approach based on augmented reality to address the difficulty of blind area assembly and the absence of guidance. Unlike conventional approaches, the proposed methods are based on dynamic perception of the hand and the context information of the blind area, which simplifies the time-consuming pre-use setup process and considerably expands the applicable scenarios. The experiment results show that the proposed AR virtual model guidance approach and video stream guidance approach significantly improve assembly efficiency (by 44.4% and 49.5%, respectively) and correctness, reduce workers' mental load, and improve the user experience.
As a result, the proposed solution helps to overcome the problem of difficult and time-consuming blind assembly in engineering and helps to increase product assembly efficiency.

Figure 2
Figure 2 depicts the flow of hand and part tracking. The procedure is divided into three stages: hand pose tracking, part pose tracking, and data reliability verification. In particular, the joint pixel coordinates output by the hand tracking module are used to calculate the part pixel area, and the joint and part data output by the first two modules are then fed into the reliability verification module. Details of each module are described below.

Fig. 6
Fig. 6 Twisting of the finger

Figure 7
Figure 7 depicts the flow of part tracking. First, the depth image is used to determine the depth of the feature points A and B of the finger gripping the part and of their midpoint C. Whether the finger holds the part is then determined by checking whether the difference between the average depth of points A and B and the depth of midpoint C is less than a given threshold. If the part is held, the pixel area of the part is determined by gradually expanding outward from point C as the circle center and detecting whether the outer pixels mutate. Depending on the relative size of the part, two different cases are considered during the part tracking procedure. If the part is relatively large, the point cloud of the part end face area is sufficient to fit the end face plane equation, and the part position (point C) and the axis direction vector (the plane normal vector) are then derived. If the part is relatively small, a sufficient point cloud cannot be obtained; in this case, the part and the finger joints holding it are treated as a single rigid body, and the part's position and axis vector are determined via the spatial coordinates of the four joint points.

Fig. 10
Fig. 10 Two AR assembly guidance methods

Fig. 11

Figure 12
Figure 12 depicts the assembly time results. The ARG group has the shortest assembly time, while the NG group has the longest. For specific analysis of the data, first, the Kolmogorov-Smirnov test revealed that the assembly times of all four groups followed a normal distribution (NG: p = 0.156 > 0.05; PARG: p = 0.227 > 0.05; VSG

Fig. 15
Fig. 15 Learnability, usability and satisfaction score of SUS