Our assistance system for compensating vision impairment on forklifts uses an HMD; however, the overall system consists of seven submodules, shown with a yellow border in Fig. 1. The external influencing variables, which comprise the forklift, the driver, and the logistics environment, are outlined in black. The blue outline represents the system boundaries. We use a Microsoft HoloLens 2 as the HMD for visualization (module 7), while rendering is performed on PC resources instead of on the HoloLens 2 (module 6).
So-called anchors are usually used to insert virtual elements with positional accuracy. Typically, markers anchor an object at a specific position; the Microsoft HoloLens 2 can also anchor objects to walls or other geometries using spatial mapping. However, the device is intended to be worn by humans and moved mainly by them: head movements are tracked to create an immersive overlay. In our system, the HMD is located in a moving vehicle (the forklift). This means the spatial anchors to which the virtual objects are attached (lift mast, columns, etc.) can themselves move in space, while the operator can also move relative to these objects. It is therefore necessary to determine which part of the movement registered by the HoloLens 2 results from the forklift movement and which part from the head movement.
Another unique feature of our use case is that the cameras recording the environment can move relative to each other. In addition to an initial calibration, tracking the mast movement is therefore also necessary.
3.1 Initial calibration and environmental recording
We calibrate the multi-camera system using the MATLAB Camera Calibration Toolbox (Bouguet 2003). For our approach we use the coordinate systems and transformations shown in Fig. 2.
The key coordinate systems can be described as follows:
FL, FR: The frames of the left and right fork cameras (integrated in the fork tips)
F: The frame of the forklift, which is centered in the axis of the front tires
B: The frame of the base camera, which is fixed to the vehicle
AR: The frame of the AR glasses at its position during initialization
MC, ML, MR: The frames of the center, left, and right mast cameras
W: The world frame, which is fixed and used for the calculation of the forklift movement
Table 1 summarizes the transformations in our application. Some transformations are static and must be determined only once, at the beginning of module 1; others have to be determined dynamically during the runtime of the application. This is due to the structural design of the vehicle and the arrangement of the camera system. The forklift coordinate system is identical to that of the CAD model integrated in the game engine Unity, which we use for rendering.
Table 1
Key extrinsic transformations between coordinate systems
Transformation | Description | Type | Determination
FR T FL | Fork left to fork right | Static | Initial calibration
B T FR | Fork right to base | Dynamic | Initial calibration
B T MC | Mast center to base | Dynamic | External sensors
F T B | Base to forklift | Static | Measurement of 3D model
FL T AR | AR glasses to forklift | Dynamic | External sensors and HMD sensors
MC T ML | Mast left to mast center | Static | Initial calibration
MC T MR | Mast right to mast center | Static | Initial calibration
We calibrate the multi-camera system using the RGB data of each camera. The MATLAB calibration toolbox requires a checkerboard pattern, which must be captured completely by the cameras during the calibration process. In addition, the toolbox is limited to calibrating two cameras at a time.
To overcome the problem of the small overlapping FoV in our system, shown in Fig. 3, we perform a pairwise calibration between adjacent cameras. For example, the left fork camera FL is represented in the forklift coordinate system F by the following transformation:
$${}^{F}T_{FL}={}^{F}T_{B}\,{}^{B}T_{FR}\,{}^{FR}T_{FL}$$
The pairwise calibration starts by determining the rigid transformation between the left and right fork cameras, FR T FL. Next, we determine the transformation between the right fork camera and the base camera, B T FR. Finally, we determine the transformation between the base camera and the vehicle, F T B, with the help of a CAD model of the forklift.
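As an illustration of this chaining, the following NumPy sketch composes the three pairwise transformations into F T FL. The poses used here are made-up placeholders with identity rotations, not the calibrated values of the real system:

```python
import numpy as np

def make_T(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Illustrative placeholder poses (identity rotations, invented offsets in meters).
T_F_B   = make_T(np.eye(3), [1.0, 0.0, 0.5])    # base camera -> forklift
T_B_FR  = make_T(np.eye(3), [0.0, -0.4, -0.3])  # right fork camera -> base camera
T_FR_FL = make_T(np.eye(3), [0.0, 0.8, 0.0])    # left fork camera -> right fork camera

# Chain the pairwise calibrations: F_T_FL = F_T_B @ B_T_FR @ FR_T_FL
T_F_FL = T_F_B @ T_B_FR @ T_FR_FL
```

With identity rotations the translations simply add up, which makes the composition easy to check by hand.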
The calibration process using the Matlab Calibration Toolbox is necessary to get the initial poses for further processing in Unity.
It would also be possible to build a complete 3D model of the demonstrator, but it is nearly impossible to transfer the correct pose of the virtual objects into the real world this way: even minor angular errors generate a significant lateral error in the overlay.
For the environmental recording we use the proposed RGB-D multi-camera system to collect 3D scene data with color information for each pixel. The cameras on the lift mast need to be able to see behind the load. This perspective is supplemented with fork cameras in order to also generate data during storage and retrieval operations in the rack.
3.2 Forklift tracking
We use external sensors to track the forklift's movement, as the forklift's diagnostic CAN bus does not provide movement data. Rotary encoders with a floating bearing are attached to the front wheels of the forklift (Fig. 4, left). A measurement box transmits the encoder signals to the PC, and an asynchronous TCP socket integrated into the API of the measurement box sends the data to Unity for further processing. We use the two-wheel model according to Dudek and Jenkin (2000) to calculate the forklift movement from the rotational speeds of the wheels and move the digital forklift model in Unity (Fig. 4, right) accordingly.
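The two-wheel model can be sketched as a simple odometry update. The wheel radius, track width, and encoder interface below are illustrative assumptions, not the demonstrator's real parameters:

```python
import math

WHEEL_RADIUS = 0.18  # m, assumed
TRACK_WIDTH = 0.95   # m, distance between the front wheels, assumed

def update_pose(x, y, theta, omega_l, omega_r, dt):
    """Integrate the planar pose from left/right wheel angular speeds (rad/s)
    using the two-wheel (differential-drive) model."""
    v_l = omega_l * WHEEL_RADIUS
    v_r = omega_r * WHEEL_RADIUS
    v = (v_l + v_r) / 2.0          # forward speed
    w = (v_r - v_l) / TRACK_WIDTH  # yaw rate
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += w * dt
    return x, y, theta
```

Each encoder update advances the digital forklift model by the integrated pose increment.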
3.3 Lift mast tracking
The lift mast tracking consists of two parts: determining the tilt of the lift mast and determining the lift height. A tracking camera (Intel RealSense T265) is attached to the side of the fork carriage (Fig. 5, left). This camera can be integrated directly into Unity via the RealSense wrapper (Dorodnicov 2018) and transmits its pose data; we use the rotational values of this pose data to determine the tilt of the lift mast. Although the pose data also contains the vertical movement measured by the tracking camera, which would correspond to the lifting height of the forks, these values are far too inaccurate. Instead, we use an analog cable sensor: the sensor is attached to the fixed part of the lift mast and the cable to the moving part (Fig. 5, right). We send the data to Unity via the Ardity wrapper for Arduino (Wilches 2018). Since the forklift has a duplex mast, the corresponding kinematic transfer function x_i = 2 x_o is implemented; the position of the inner mast section i can thus be determined from the position of the outer section o.
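The lift-height processing can be sketched as follows. The transfer function x_i = 2 x_o is from the text; the raw-sensor scaling constant is a hypothetical calibration value added for illustration:

```python
def inner_mast_position(x_outer_m):
    """Kinematic transfer function of the duplex mast: x_i = 2 * x_o."""
    return 2.0 * x_outer_m

def fork_height_from_sensor(raw_counts, counts_per_meter=1000.0):
    """Convert a raw cable-sensor reading to the fork carriage height.
    counts_per_meter is a hypothetical calibration constant."""
    x_outer = raw_counts / counts_per_meter  # cable extension on the outer section
    return inner_mast_position(x_outer)      # height of the inner section (forks)
```

In the running system, each sensor sample updates the height of the fork carriage in the Unity scene.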
3.4 Scene reconstruction, rendering and visualization
The advancement of stereo camera technology allows us to offload the computationally intensive depth estimation to so-called RGB-D cameras, which lets us use more cameras and cover a larger FoV. For scene acquisition and distance detection we use multiple Intel RealSense D435 cameras to generate a depth map of the environment in real time, which is transformed pixel by pixel into the operator's FoV.
Cameras are integrated into Unity as 3D objects. To apply the initial calibration, each initial transformation is combined with the transformation between the RGB camera and the center frame of the camera object, as shown in Fig. 6.
The transformation must also be converted to Unity's left-handed coordinate system. Each camera object can then be described by its extrinsic parameters T:
$$T=\left[ \begin{array}{cccc}{r}_{11}& {r}_{12}& {r}_{13}& x\\ {r}_{21}& {r}_{22}& {r}_{23}& y\\ {r}_{31}& {r}_{32}& {r}_{33}& z\\ 0& 0& 0& 1\end{array} \right]$$
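One common way to convert such an extrinsic matrix from a right-handed frame to a left-handed (Unity-style) frame is a similarity transform that flips the z-axis on both sides. This is a minimal sketch of that standard conversion, not necessarily the authors' exact implementation:

```python
import numpy as np

# S flips the z-axis; S @ T @ S re-expresses the pose in a left-handed frame.
S = np.diag([1.0, 1.0, -1.0, 1.0])

def to_left_handed(T):
    """Convert a 4x4 right-handed extrinsic matrix to a left-handed convention
    by flipping the z-axis of both the source and the target frame."""
    return S @ T @ S
```

The flip negates the z-component of the translation and the corresponding rotation terms, so a pure rotation stays a pure rotation.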
The pose of the camera objects is driven by data from the external sensors or by preprocessed data from a transfer function describing the kinematics of the lift mast. Parent-child relationships in Unity make it possible to group cameras that move together, such as the three cameras on the lift mast or the two fork cameras.
Figure 7 shows the schematic rendering procedure of our system. The camera information is merged with the data from the initial calibration and the position sensors in a Unity application running on an industrial PC on the forklift, where the image is rendered. The connection to the HoloLens 2 is established via Holographic Remoting (Microsoft Corporation 2022): the HoloLens 2 does not compute the displayed scene itself but receives it from the Unity application on the PC, to which it sends its position information.
3.5 Head tracking
The HoloLens 2 head tracking uses its environmental cameras as well as an inertial measurement unit (IMU). It is not designed to work in moving vehicles, which results in drift, jumps, or even total tracking failure. Walko and Maibach (2021) proposed covering the environmental cameras of the HoloLens 2 so that only the IMU is used for head tracking, which increases tracking robustness: jumps and total failures no longer occur. Nevertheless, drift still occurs and must be corrected via external head tracking. Relying solely on external head tracking is not practical either, as it leads to a post-rendering image warp that generates considerable jitter.
We initially used an external head tracking system, but this proved impractical due to the ambient light in our test environment. Instead, we placed an ArUco marker on the HoloLens 2 and determined head position and orientation using OpenCV marker recognition (Fig. 8). A Logitech C920 placed directly in front of the driver detects the markers. An adjustment is made only if the deviation between the position determined by the HoloLens 2 and the position determined by the external head tracking exceeds 5 cm in the x, y, or z direction.
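The 5 cm correction rule can be sketched as a simple per-axis gate. Function and variable names are illustrative, not taken from the system's code:

```python
THRESHOLD_M = 0.05  # 5 cm per axis, from the rule described above

def corrected_position(hololens_pos, external_pos):
    """Return the head position to use: keep the HoloLens 2 estimate unless
    any axis deviates from the external (ArUco) measurement by more than 5 cm."""
    deviation_exceeded = any(
        abs(h - e) > THRESHOLD_M for h, e in zip(hololens_pos, external_pos)
    )
    return external_pos if deviation_exceeded else hololens_pos
```

Gating the correction this way lets the smooth internal tracking dominate while the external measurement only removes accumulated drift.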
Due to our system design, the HoloLens 2 is always the main camera in Unity, which means its position in Unity cannot be adjusted via a C# script. Accordingly, we shift the forklift model instead of changing the HoloLens 2 position. For this purpose we determine the pose of the forklift relative to the HoloLens 2 via an inverse transformation from the pose of the ArUco marker.
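The inverse-transformation step can be sketched as follows: the forklift pose relative to the headset is obtained by inverting the marker-derived headset pose and composing it with the forklift's world pose. All poses here are illustrative placeholders:

```python
import numpy as np

def invert(T):
    """Invert a rigid 4x4 transform: inv([R t; 0 1]) = [R^T, -R^T t; 0 1]."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def forklift_relative_to_headset(T_world_headset, T_world_forklift):
    """headset_T_forklift = inv(world_T_headset) @ world_T_forklift."""
    return invert(T_world_headset) @ T_world_forklift
```

The resulting relative pose is what the forklift model is moved to, leaving the main camera untouched.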