TNES: terrain traversability mapping, navigation and excavation system for autonomous excavators on worksites

We present a terrain traversability mapping and navigation system (TNS) for autonomous excavator applications in an unstructured environment. We use an efficient approach to extract terrain features from RGB images and 3D point clouds and incorporate them into a global map for planning and navigation. Our system can adapt to changing environments and update the terrain information in real-time. Moreover, we present a novel dataset, the Complex Worksite Terrain dataset, which consists of RGB images from construction sites with seven categories based on navigability. Our novel algorithms improve the mapping accuracy over previous methods by 4.17–30.48% and reduce MSE on the traversability map by 13.8–71.4%. We have combined our mapping approach with planning and control modules in an autonomous excavator navigation system and observe a 49.3% improvement in the overall success rate. Based on TNS, we demonstrate the first autonomous excavator that can navigate through unstructured environments consisting of deep pits, steep hills, rock piles, and other complex terrain features. In addition, we combine the proposed TNS with the autonomous excavation system (AES), and deploy the new pipeline, TNES, on a more complex construction site.
With minimum human intervention, we demonstrate autonomous navigation capability with excavation tasks.


Introduction
Excavators are one of the most common types of heavy-duty machinery used for earth-moving activities, including mining, construction, environmental restoration, etc. As the demand for excavators increases, many autonomous excavator systems (Kim & Russell, 2003; Seo et al., 2011) have been proposed for material loading tasks, which involve perception and motion planning techniques.
Some of the major issues in using autonomous excavators are the development of robust perception and navigation sub-systems. In general, perception in unstructured environments such as excavation sites poses many challenges. There have been many works related to unstructured environments, including perception and terrain classification (Guan et al., 2022; Viswanath et al., 2021; Singh et al., 2021) and navigation (Manduchi et al., 2005; Kahn et al., 2021; Procopio et al., 2009; Kumar et al., 2021). Applications in unstructured, hazardous environments face even more difficulties in terms of robustness and limitations on the computational budget. For example, many accurate learning methods have been proposed to improve perception capabilities, but we cannot assume access to large GPUs or clusters for excavators operating in hazardous environments. Instead, we need to develop robust methods with lower computational requirements.
Traversability is a term that encompasses both perception and navigation. It has been well studied for decades, and there have been many works (Chilian & Hirschmüller, 2009; Dahlkamp et al., 2006; Zhao et al., 2019; Sock et al., 2016; Maturana et al., 2018) on traversability estimation for planning and navigation. Terrain traversability is a binary value or a continuous score that measures the difficulty of navigating a region, estimated from perception sensors such as cameras, LiDAR, and IMUs. Terrain traversability estimation is a critical step between perception and navigation. In many autonomous driving (AD) cases (Procopio et al., 2009; Hewitt et al., 2017), a method capable of detecting obstacles and distinguishing road from non-road regions is sufficient for navigation. On the other hand, in an unstructured, hazardous environment where off-road navigation is unavoidable, many factors must be considered, including efficiency, adaptability, and safety. In such cases, not only is a more detailed classification according to terrain features needed, but a continuous traversability value is also preferred to describe the complexity of the terrain and provide the best option for the navigation module. Therefore, we need good techniques to detect traversable regions for reliable navigation in an unstructured scene.

Main Results We present a terrain traversability mapping and navigation system (TNS) for traversability classification and autonomous navigation. We describe an efficient semantic-geometric fusion method to extract traversability maps. Our method accounts for the physical and computational constraints of the robot, including maximum climbing degree, body width, run-time computational budget, etc. The novel aspects of our approach include:

1. We present the Complex Worksite Terrain (CWT) dataset, which consists of 30 min of video and 669 RGB images in unstructured environments with seven different classes based on terrain types, traversable regions, and obstacles. We will release the CWT dataset in the public domain.

2. We present a real-time terrain traversability estimation and navigation system (TNS) that uses 3D LiDAR and RGB camera inputs for mapping, planning, and navigation. We describe a novel learning-based geometric fusion solution that considers machine specifications and hardware limitations for terrain traversability prediction in unstructured environments. We show that our method is the state-of-the-art (SOTA) traversability mapping method on complex terrains: it outperforms previous SOTA methods by 4.17–30.48% in terms of mAcc and reduces the MSE by 13.8–71.4%.

3. We have integrated TNS with mapping, planning, and control algorithms and evaluated the performance extensively on an autonomous excavator in various challenging construction scenes, as shown in Fig. 1. We also elaborate on many non-trivial issues that came up during implementation and evaluation and how we address them. We show that TNS can safely navigate an excavator in unstructured environments and observe a 49% improvement in terms of planning success rate. We highlight the benefits of TNS as the first autonomous excavator that can navigate through complex, unstructured environments.

4. We combine the autonomous navigation pipeline (TNS) with the autonomous excavation system (AES) and deploy the resulting Terrain Traversability Mapping, Navigation and Excavation System (TNES) on a more complex construction site that contains structured and unstructured roads, building materials, construction workers, and other construction and transportation vehicles. We highlight the capability of TNES by demonstrating autonomous navigation with excavation tasks in such a complex scene.

Field robots and systems
Field robots usually refer to machines that operate in off-road, hazardous environments. These include heavy-duty service robots for industrial usage in mining (Shariati et al., 2019), excavation, agriculture (Shamshiri et al., 2018), construction (Nath & Behzadan, 2020), etc. To satisfy industrial needs and save labor costs, many automated systems (Kim & Russell, 2003; Seo et al., 2011) have been developed for service robots in the field. These systems include modules for perception, planning, and control. However, it remains a challenge to fully automate many tasks in unknown, unstructured environments.

Terrain traversability recognition
The concept of traversability, also referred to as "drivability" or "navigability" (Papadakis, 2013), has been studied for decades. There are many viewpoints on the problems and challenges associated with traversability, and investigations into such topics have had different evaluation methods and goals. Many works focus on getting correct predictions of the terrain (Matsuzaki et al., 2018; Xue et al., 2017; Suryamurthy et al., 2019; Rothrock et al., 2016; Kingry et al., 2018; Holder & Breckon, 2018; Chavez-Garcia et al., 2018; Guan et al., 2022; Hirose et al., 2018; Deng et al., 2017) against some notion of ground truth based on human-labeled annotation, similar to the metrics of 2D and 3D semantic segmentation. Most of the methods mentioned above are based on visual features of the terrain, which sometimes lack the properties that enable real-world navigation due to recognition failure. On the other hand, some works focus on obtaining traversability maps that result in the best navigation outcomes. There are plenty of works (Paz et al., 2020; Maturana et al., 2018; Dahlkamp et al., 2006; Zhao et al., 2019; Guan et al., 2022, 2023) on classifying different terrains based on either material categories or navigability properties that demonstrate their mapping results through navigation outcomes. However, those methods deal with structured roads or roads with clear path boundaries in unstructured environments. In more complex environments, point clouds obtained from LiDAR are used to extract geometric attributes of the surface, including slope, height variation, roughness, obstacles, etc., as proposed in Chilian and Hirschmüller (2009), Braun et al. (2008), Wermelinger et al. (2016), Zhou et al. (2021), Ahtiainen et al. (2017), and Hewitt et al. (2017). Schilling et al. (2017) use both point clouds and RGB images to classify terrains with safe, risky, and obstacle labels in the 2D image plane for better performance. Rosenfeld et al.
(2018) present a pipeline from perception to motion control and use five different data sources for navigation, including range and intensity values from a 2D LiDAR and edge information from an RGB-D camera. Khan et al. (2016, 2020) analyze the terrain and create roadmaps for road safety. Frey et al. (2022) train a sparse CNN to predict the feasibility cost of locomotion from a purely geometric representation of the environment. By utilizing 3D voxel occupancy maps, their method overcomes the limitations of commonly used elevation maps, which often lead to errors in complex scenarios. Ewen et al. (2022) introduce a Bayesian inference framework that estimates the probability distribution of terrain surface contours and attributes using an RGB-D camera. Their approach incorporates various terrain attribute parameters to assist robot movement. In comparison to the aforementioned works, our method focuses on calculating terrain data in unstructured, open environments while integrating semantic information with terrain geometry to determine traversability.
The works most similar to our proposed method are (Dahlkamp et al., 2006; Zhao et al., 2019; Sock et al., 2016; Maturana et al., 2018), which focus on finding a better terrain representation for navigation on unstructured terrains for vehicles or mobile robots that are smaller and lighter than an excavator. In those works, the environments are not as challenging as ours, and the traversable region is relatively smooth and safe for fast driving; consequently, the requirements for prediction accuracy and robustness are relatively low. In our case, the boundary of the navigable region is not clear, and the excavator sometimes needs to navigate slowly on steep or bumpy terrains that are unsuitable for small vehicles or robots. Therefore, our application requires a superior traversability measurement that is sensitive to various terrains and goes beyond the simple binary classification of Dahlkamp et al. (2006) and Sock et al. (2016). While Zhao et al. (2019) and Maturana et al. (2018) incorporate semantic information into the point cloud, they do not adequately utilize geometric information or fusion results and may struggle with more complex terrains. In contrast to the aforementioned works, our approach comprehensively utilizes both semantic and geometric aspects.

Datasets for unstructured environments
Most recent developments in perception tasks like object detection and semantic segmentation focus on urban driving scene datasets like KITTI (Geiger et al., 2013) and Waymo (Sun et al., 2020), on which methods achieve high accuracy in terms of average precision. On the other hand, unstructured scenes like natural environments, construction sites, and complicated traffic scenarios are less explored, for two primary reasons. First, there are fewer datasets covering unstructured environments; second, perception and autonomous navigation in unstructured off-road environments are challenging due to unpredictability and diverse terrain types.
Recent efforts in off-road perception and navigation include RUGD (Wigness et al., 2019) and RELLIS-3D, which are semantic segmentation datasets collected from a robot navigating in off-road and natural environments. These datasets contain scenes like trails, forests, creeks, etc. Roberts and Golparvar-Fard (2019) present a construction dataset containing annotations of heavy-duty vehicles for detection, tracking, and activity classification. Our operating scenario and proposed dataset differ from the previously mentioned works: our excavator operates in an outdoor construction environment, which features highly unstructured roads and includes various construction tools and vehicles that are commonly found on worksites. In this context, we provide semantic labels for the images collected while an excavator navigates through these challenging terrains.

Perception for autonomous excavators
The road conditions in structured environments such as highways are usually navigation-friendly, so the core problem during navigation in structured environments is avoiding obstacles rather than determining which part of the surface is easier and safer to navigate. In contrast, excavators are usually operated in unstructured and dangerous environments consisting of rock piles, cliffs, deep pits, steep hills, etc. Such an environment lacks any lane markings, and the arrangement of obstacles tends to be non-uniform. In addition, due to tasks like digging and dumping, the working conditions for excavators are constantly changing. Landfalls and cave-ins occur, potentially causing the excavator to tip over and injure the operator. Therefore, it is crucial to identify different terrains and predict safe regions for navigation. Furthermore, we need solutions with low computational requirements.
In our context, traversability (Papadakis, 2013) refers to the capability of a ground vehicle to reside over a region of terrain under an admissible state that it can enter given its current state. In order to solve navigation challenges for excavators as well as other working vehicles on unstructured terrain, we formulate the problem of obtaining an accurate traversability map representation as follows:

Problem definition Given sensor inputs S_1, S_2, ..., S_h from h different sources over a time span T, the goal is to obtain a 2D grid map T ∈ [0, 1]^(H×W) with resolution r, where T corresponds to some region R of shape (Hr, Wr). The maximum value corresponds to a non-traversable region and the minimum value corresponds to the most traversable region.
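As a concrete, illustrative instance of this formulation (the helper name and origin convention below are ours, not part of the system), a traversability grid with resolution r covering a region of shape (Hr, Wr) can be stored as a float array in [0, 1]:

```python
import numpy as np

def world_to_grid(x, y, origin, r):
    """Convert a world coordinate (x, y) in meters to (row, col) indices
    of a grid map with the given origin and resolution r (meters/cell)."""
    col = int((x - origin[0]) / r)
    row = int((y - origin[1]) / r)
    return row, col

# A 2D grid map T in [0, 1]^(H x W): 0 = most traversable, 1 = non-traversable.
H, W, r = 80, 80, 0.25                      # covers a 20 m x 20 m region
T = np.zeros((H, W), dtype=np.float32)

# Mark a small square obstacle centered at world coordinate (5.0, 5.0).
row, col = world_to_grid(5.0, 5.0, origin=(0.0, 0.0), r=r)
T[row - 2:row + 3, col - 2:col + 3] = 1.0
```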
Metrics for traversability map We need to consider the following measurements in excavator applications:

• Accuracy: Similar to Schilling et al. (2017), Dahlkamp et al. (2006), and Sock et al. (2016), we use an ROC curve to measure the accuracy of the traversability prediction. In addition, the map output should fit the terrain closely, so we also use MSE (mean squared error) as a fitness measurement. Please refer to Sect. 6.2.2 for the definitions of these metrics.

• Performance: Maturana et al. (2018) and Zhao et al. (2019) use navigation outcomes, including travel time and success rate, to measure their terrain traversability mapping algorithms.

• Energy constraints and run-time: Due to the limitations of hardware and power supply on the excavator, energy efficiency and the run-time computational budget should also be measured for a terrain traversability mapping method. We use the number of parameters, Giga-FLOPS (floating-point operations per second), and runtime on the excavator as evaluation metrics.

TNS: system architecture
In this section, we describe our system for terrain traversability mapping and navigation (TNS) in excavator applications, as shown in Fig. 2. TNS takes a 3D point cloud stream from the LiDAR, an RGB image stream from the camera, and the corresponding poses of the excavator extracted from GPS RTK or a localization module. The goal of our proposed system is to identify safe, navigable regions for excavators and autonomously navigate the excavator based on the traversability map and the planned trajectory. The output of TNS includes a global map consisting of terrain information, including semantic information, geometric information, and a final traversability score, as well as the planned trajectory.

Traversability mapping
The terrain is represented as an elevation grid map and is updated in real time based on incoming point clouds and RGB images. Internally, each grid cell in the map stores the average height value of the latest p points within the cell, as well as overall information about those points like update time, slope, step height, and their semantic information. A traversability score is calculated for each grid cell. In Fig. 3, we present an overview of our perception approach: our system takes RGB images and point clouds as inputs to infer traversability; we extract semantic information using segmentation and associate terrain labels with point clouds (A, top); we extract geometric information using slope and step height estimation (B, bottom); and we produce a traversability grid map based on semantic and geometric information and convert it to a 2D occupancy map for path planning and navigation (C, right). Our implementation is based on the open-source grid map library.

Segmentation and mapping to point cloud We use 2D semantic segmentation on unstructured terrains. Given an input RGB image I ∈ R^(3×H×W), the goal is to generate a mask P ∈ {0, 1, ..., N − 1}^(H×W), where N is the number of classes. We use Fast-SCNN (Poudel et al., 2019) after weighing accuracy against efficiency, as shown in Table 2 in Sect. 6.1. After we obtain the segmentation prediction P, we use the timestamp to locate the corresponding point cloud C and use the camera calibration matrices to find the correspondence of each point to the segmentation results, saving the terrain label in the grid map cell.

Geometric information computation In the following, we present details of slope and step height estimation and highlight how machine specifications are considered when calculating the geometric traversability score.
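The label-association step (projecting each LiDAR point into the segmentation mask) can be sketched as follows; the intrinsic matrix K, the camera-LiDAR extrinsic, and the function name are illustrative assumptions, and the per-point loop is kept naive for clarity:

```python
import numpy as np

def label_points(points_lidar, seg_mask, K, T_cam_lidar):
    """Assign a semantic label to each 3D LiDAR point by projecting it
    into the camera image and sampling the segmentation mask P.
    Points behind the camera or outside the image keep label -1.
    K: 3x3 camera intrinsics; T_cam_lidar: 4x4 LiDAR-to-camera extrinsic."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])   # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]           # LiDAR -> camera frame
    labels = np.full(n, -1, dtype=np.int32)
    h_img, w_img = seg_mask.shape
    for i, (x, y, z) in enumerate(pts_cam):
        if z <= 0:                                       # behind the camera
            continue
        u = int(K[0, 0] * x / z + K[0, 2])               # pinhole projection
        v = int(K[1, 1] * y / z + K[1, 2])
        if 0 <= u < w_img and 0 <= v < h_img:
            labels[i] = seg_mask[v, u]
    return labels

# Tiny sanity check: identity extrinsic, 100x100 mask entirely labeled class 2.
points = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])  # one point ahead, one behind
mask = np.full((100, 100), 2, dtype=np.int32)
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
labels = label_points(points, mask, K, np.eye(4))        # -> [2, -1]
```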

Slope estimation
Each grid cell g is abstracted to a single point p = {x, y, z}, where (x, y) is the center of the cell in the global coordinate frame and z is the height value of the grid. The slope s of an arbitrary grid cell g is computed as the angle between the surface normal and the z-axis (the up direction in the real world) of the global coordinate frame:

s = arccos(n_z),
where n_z is the z-component of the surface normal n and ‖n‖ = 1. Similar to Chilian and Hirschmüller (2009) and Bellone et al. (2013), we use Principal Component Analysis (PCA) to calculate the normal direction of a grid cell. The covariance matrix C_cov of the nearest neighbors of the query grid cell is calculated as

C_cov = (1/k) Σ_{i=1}^{k} (p_i − p̄)(p_i − p̄)^T,   C_cov · v_j = λ_j · v_j,   j ∈ {0, 1, 2},

where k is the number of neighbors considered in the neighborhood of g, p_i = {x_i, y_i, z_i} is the position of the i-th neighbor grid cell in the global coordinate frame, p̄ is the 3D centroid of the neighbors, λ_j is the j-th eigenvalue of the covariance matrix, and v_j is the j-th eigenvector. The surface normal n of grid cell g is the eigenvector v_0 associated with the smallest eigenvalue λ_0. The purpose of slope estimation is to capture the shape of the terrain and avoid navigating on steep surfaces. For excavator applications, the width between the tracks or wheels is a good indicator of navigation stability on rough terrain. Usually, when the area of a rough region is less than half the width between the excavator's tracks, the excavator can navigate through it without any trouble. Specifically in our excavator setup, the width of our excavator track is 0.6 m, so we chose the grid resolution d_res = 0.2 m and search the nearest eight neighbors, which covers the necessary area.
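The PCA-based normal and slope computation can be sketched in a few lines of NumPy (the function name and the synthetic 45-degree test plane are ours):

```python
import numpy as np

def slope_from_neighbors(neighbor_pts):
    """Slope of a grid cell from the 3D centers of its k neighbor cells:
    the surface normal is the eigenvector of the neighbors' covariance
    matrix with the smallest eigenvalue (PCA), and the slope is the angle
    between that normal and the global z-axis, in degrees."""
    centroid = neighbor_pts.mean(axis=0)
    diffs = neighbor_pts - centroid
    C_cov = diffs.T @ diffs / len(neighbor_pts)   # 3x3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C_cov)      # eigenvalues in ascending order
    normal = eigvecs[:, 0]                        # smallest-eigenvalue eigenvector
    n_z = abs(normal[2])                          # |z-component| of the unit normal
    return float(np.degrees(np.arccos(np.clip(n_z, 0.0, 1.0))))

# Eight neighbors (0.2 m resolution) lying on a plane inclined 45 degrees (z = y).
xy = 0.2 * np.array([[i, j] for i in (-1, 0, 1) for j in (-1, 0, 1)
                     if (i, j) != (0, 0)], dtype=float)
pts = np.column_stack([xy, xy[:, 1]])             # z equals the y coordinate
slope_deg = slope_from_neighbors(pts)             # ~45.0 degrees
```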

Step height estimation
The step height h is computed as the largest height difference between the center point p of the grid cell and its k′ nearest neighbors:

h = max_{i ∈ {1, ..., k′}} |z_i − z|,

where z_i is the height of the i-th neighbor. Since slope describes the variation of the terrain in a relatively small region, we choose a larger neighbor search parameter k′ = 7 × 7 = 49 > k, which spans 1.4 m, to measure height change over a larger scope. For excavator applications, the step height calculation guarantees that the track does not traverse areas with extreme height differences.

Geometric traversability estimation
Based on the slope and step height of the terrain, we can calculate a geometric traversability score T_geo. According to the physical constraints of the robot, we define critical values s_cri, s_safe, h_cri, h_safe as thresholds for danger and safety detection. The purpose of these thresholds is to avoid danger when the surface condition exceeds the limits of the robot and to avoid extra computation when the surface is very flat. The geometric traversability T_geo of each grid cell is

T_geo = 1, if s ≥ s_cri or h ≥ h_cri;   T_geo = 0, if s ≤ s_safe and h ≤ h_safe;   T_geo = α_1 s̃ + α_2 h̃, otherwise,

where s̃ and h̃ are the slope and step height normalized to [0, 1] using the corresponding thresholds, and the weights α_1 and α_2 sum up to 1.
The step height estimation is complementary to the slope estimation: it provides a global perspective, whereas the slope captures local terrain information. Combining these two measures helps us remove noise in the map, such as bumps caused by dust, and ensures the robustness of T_geo.
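A minimal sketch of the step height and the threshold-based geometric score follows, assuming a linear normalization between the safe and critical thresholds (the exact normalization is not specified above, so this is our reading rather than the authors' formula):

```python
import numpy as np

def step_height(z_center, neighbor_heights):
    """Largest absolute height difference between a cell and its neighbors."""
    return max(abs(z - z_center) for z in neighbor_heights)

def geometric_traversability(s, h, s_safe, s_cri, h_safe, h_cri,
                             alpha1=0.5, alpha2=0.5):
    """Combine slope s and step height h into T_geo in [0, 1].
    Piecewise rule: exceeding a critical threshold -> 1 (non-traversable);
    below both safe thresholds -> 0 (flat); otherwise a weighted blend of
    the linearly normalized slope and step height (alpha1 + alpha2 = 1).
    The linear normalization is our assumption, not the authors' exact formula."""
    if s >= s_cri or h >= h_cri:
        return 1.0
    if s <= s_safe and h <= h_safe:
        return 0.0
    s_norm = np.clip((s - s_safe) / (s_cri - s_safe), 0.0, 1.0)
    h_norm = np.clip((h - h_safe) / (h_cri - h_safe), 0.0, 1.0)
    return float(alpha1 * s_norm + alpha2 * h_norm)
```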
Traversability with geometric and semantic fusion In this section, we describe our algorithm for geometric-semantic fusion. From the semantic and geometric information, we use a continuous traversability score T ∈ [0, 1] to measure how easily the surface can be navigated. This is especially relevant to off-road scenarios because we prefer flat regions over bumpy roads to save energy. Moreover, when an excavator is navigating on a construction site, being able to correctly identify different regions is critical to avoid hazardous situations like flipping over.
The overall traversability score T is calculated on each grid cell from the semantic terrain class C_sem and the geometric traversability T_geo. This method is simple yet more effective than other, more complicated fusion methods (Sock et al., 2016; Zhao et al., 2019), as demonstrated in Sects. 6 and 7.1.
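Since the exact fusion rule is not reproduced here, the following is only one plausible sketch of a semantic-geometric fusion of this kind; the class names, scale factors, and hard-set rule are illustrative assumptions, not the authors' formula:

```python
# Hypothetical per-class treatment; names and factors are illustrative only.
OBSTACLE_CLASSES = {"obstacle", "pit", "water"}
CLASS_SCALE = {"flat": 0.5, "bumpy": 1.0, "rock_pile": 1.0, "grass": 0.8}

def fused_traversability(c_sem, t_geo):
    """One plausible semantic-geometric fusion: semantically non-traversable
    classes are hard-set to 1 regardless of geometry, while for the
    remaining classes the geometric score is scaled by a per-class factor."""
    if c_sem in OBSTACLE_CLASSES:
        return 1.0
    return min(1.0, CLASS_SCALE.get(c_sem, 1.0) * t_geo)
```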

Localization
Recent representative SLAM works (Xu et al., 2022; Shan et al., 2020; Lin et al., 2021) have exhibited several advantages, including high precision, with localization drift of less than 10 cm when using loop closure, and fast computation. However, our excavator uses a low-cost LiDAR with a very limited field of view. During fast excavator swing motions, significant changes in the scene between LiDAR scans may cause the SLAM process to fail, resulting in subsequently wrong localization. To enhance the robustness of the localization from GPS RTK, we introduce a pre-generated global map and directly fuse the odometry from each LiDAR frame with the IMU using an iterated Kalman filter (Julier & Uhlmann, 1997).

State transition model We define the sensor state x in terms of [G p, G R], b_a, b_w, n_ba, n_bw, n_a, n_w, which are the position and attitude of the IMU sensor in the global frame G, the IMU biases, and the noise terms for each corresponding value, respectively, together with the motion input u. We use i to index the IMU measurements. The continuous state transition model can be discretized at the IMU sampling period Δt.

Measurement model For each point in a frame from the LiDAR sensor, we identify the corresponding plane in the map, and the pose of the sensor is computed by minimizing the distance between the point and the plane. Let L p_j be the j-th point in the LiDAR frame L, and let L n_j be the corresponding ranging noise, so that the true location is L p_j^gt = L p_j − L n_j. After projecting L p_j^gt into the global map using the corresponding pose, it should lie exactly on the map point cloud, and the point-to-plane distance is 0:

G u_j^T (G T_I · I T_L · L p_j^gt − G q_j) = 0,

where G u_j is the normal vector of the corresponding plane and G q_j is a point on the plane. The extrinsic calibration between the LiDAR and the IMU, I T_L, is known.
Therefore, the measurement model for the state vector x_k can be simplified accordingly. Iterated Kalman filter With the state transition model and the measurement model, we employ the iterated Kalman filter toolbox IKFoM to handle the computation and output the state vector x_k at the LiDAR frequency.
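The point-to-plane constraint of the measurement model can be sketched as a residual function (the argument names and the 4×4 homogeneous-transform convention are ours):

```python
import numpy as np

def point_to_plane_residual(p_lidar, T_G_I, T_I_L, u_plane, q_plane):
    """Point-to-plane residual of the measurement model: project a LiDAR
    point through the extrinsic T_I_L and the IMU pose T_G_I (both 4x4
    homogeneous transforms) into the global frame, then take the signed
    distance to the plane with unit normal u_plane through point q_plane.
    A perfectly registered point yields a residual of 0."""
    p_h = np.append(p_lidar, 1.0)                  # homogeneous coordinates
    p_global = (T_G_I @ T_I_L @ p_h)[:3]           # LiDAR -> IMU -> global
    return float(u_plane @ (p_global - q_plane))
```

Minimizing this residual over all point-plane pairs in a scan is what yields the sensor pose.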

Traversability-based planning
We modify Hybrid A* (Kurzer, 2016) to calculate a trajectory based on the traversability map output after the post-processing step. Hybrid A* is a global path planner that takes a 2D occupancy grid map as input for trajectory planning. The planner generates a trajectory and sends it to the motion controller, which guides the excavator along this trajectory.
The traditional Hybrid A* algorithm only considers the traveling distance and certain driving maneuvers (such as reversing and turning), not the ground condition and traversability. As a result, with the traditional Hybrid A* planner, autonomous excavators can easily be navigated into areas with low traversability in real-world applications. To solve this problem, we extend the Hybrid A* algorithm by introducing TNS and calculating a traversability cost. Specifically, we calculate the cost-to-start of a vertex, which is the distance from the start state to the vertex with extra reversing or turning cost, weighted by the traversability value obtained from TNS. In the improved Hybrid A* algorithm, the cost-to-start g(x) is increased by k_TNS · δl + g_extra when performing vertex expansion from the parent to the child vertex, where δl is the distance between the two nodes, g_extra is the extra penalty for reversing and turning, and k_TNS is a traversability weighting factor calculated from T_t and T_u. Here A_t and A_u are the areas covered by the two tracks and between the two tracks, respectively, of the excavator from the parent to the child vertex; T_t and T_u are the mean traversability values over A_t and A_u; and k_t and k_u are two calibrating parameters.
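The traversability-weighted vertex expansion can be sketched as follows; the linear form of k_TNS is our assumption, since the text only names k_t and k_u as calibrating parameters:

```python
def expansion_cost(delta_l, g_extra, T_t, T_u, k_t=1.0, k_u=1.0):
    """Incremental cost-to-start added when Hybrid A* expands a child
    vertex: the travel distance delta_l is weighted by a traversability
    factor k_TNS, and g_extra is the reversing/turning penalty.
    The linear form of k_TNS below (a k_t/k_u-weighted combination of the
    mean traversability under the tracks, T_t, and between them, T_u) is
    our assumption, not the paper's exact formula."""
    k_tns = 1.0 + k_t * T_t + k_u * T_u   # rougher ground -> higher edge cost
    return k_tns * delta_l + g_extra
```

On perfectly traversable ground (T_t = T_u = 0) this reduces to the plain distance-plus-maneuver cost of standard Hybrid A*.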

Control and navigation
The trajectory tracking controller is composed of a lateral trajectory tracking controller and a longitudinal speed controller. The controller is designed based on the unicycle model, shown in (14), which describes the kinematics of the excavator:

ẋ = v cos θ,   ẏ = v sin θ,   θ̇ = ω,   (14)

where (x, y) is the two-dimensional position of the vehicle with respect to the global coordinate frame, θ is the heading angle, and (v, ω) are the inputs to the model, representing the linear and angular velocities, respectively. The inputs to the model can also be expressed through the left and right track rotation speeds (v_l, v_r) by substituting

v = (v_l + v_r)/2,   ω = (v_r − v_l)/l

into (14), where l is the distance between the two tracks.
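The unicycle model and the track-speed substitution can be sketched as a one-step integrator (the explicit Euler step and the function name are ours):

```python
import math

def unicycle_step(x, y, theta, v_l, v_r, l, dt):
    """One explicit-Euler step of the unicycle model of Eq. (14), with
    (v, omega) obtained from the left/right track speeds via standard
    differential-drive kinematics: v = (v_l + v_r)/2, omega = (v_r - v_l)/l,
    where l is the distance between the two tracks."""
    v = 0.5 * (v_l + v_r)
    omega = (v_r - v_l) / l
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta
```

Equal track speeds drive the vehicle straight ahead; opposite track speeds make it spin in place.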

Benefits over prior methods
Previous perception methods for traversability calculation only use geometric approaches (Chilian & Hirschmüller, 2009; Wermelinger et al., 2016; Bellone et al., 2013, 2014) in simple scenarios for mobile robot applications, or they can only navigate in off-road environments with a clear visual path (Dahlkamp et al., 2006; Zhao et al., 2019; Sock et al., 2016; Maturana et al., 2018). Our system is the first to focus on excavator navigation in very challenging environments consisting of pits, hills, rock piles, etc., without clear pathways. In addition, our experiments and data are based on real-world scenarios on a construction site. Our method also adapts to the physical constraints of excavators to determine the thresholds, the grid resolution, and the number of neighbors k.

Complex worksite terrain (CWT) dataset
In this section, we present the Complex Worksite Terrain (CWT) dataset, which was collected at a construction site while an excavator navigated through the work area. The hardware has the same setup as described in Sect. 7.1.1. We collected three videos (30 min in total) under different circumstances and annotated 669 images of size 1920 × 1080 according to terrain semantics. Here we only highlight the ontology and the differences between CWT and other off-road datasets (Wigness et al., 2019); details of the collection, class distribution, and analysis are provided in the supplementary material.
The CWT dataset is annotated with seven labels based on terrain features and navigability, as shown in Table 1. The annotations were decided based on the judgment of a team of excavator operators. In most cases, when flat surfaces are detected, they are preferable to other surfaces.
While the CWT dataset and other datasets like RUGD (Wigness et al., 2019) and RELLIS-3D are all collected in unstructured, outdoor environments, CWT has several distinctions. As shown in Fig. 4, the CWT dataset mostly consists of uneven terrain with unfavorable road conditions and covers many situations that might be encountered on a work site, including rock piles, pits, stagnant water after rain, etc.
In addition, the CWT dataset focuses entirely on roads and terrains, and the annotation is based on terrain semantics instead of fine-grained semantics for every possible class. Such an annotation scheme is designed to benefit downstream tasks, including planning and navigation for robots of any size and excavation activities on hazardous terrains.
Overall, CWT presents many new challenges to the vision community for improving perception in hazardous environments, while providing support for autonomous robotics applications in dangerous environments. We demonstrate the difficulty of our dataset by showing the performance of several SOTA semantic segmentation methods on CWT and on existing off-road datasets like RELLIS-3D in Sect. 6.1. The CWT dataset can be accessed through https://forms.gle/zeAcgptpideCrFbw8.

Experiments and evaluations
In Sect. 6.1, we show evaluation results for the semantic segmentation task on our CWT dataset and RELLIS-3D . In Sect. 6.2, we evaluate our TNS on RELLIS-3D and show the benefits of our method compared to other SOTA mapping methods.

Perception evaluation on the CWT dataset
We show some evaluations using several SOTA segmentation methods on the CWT dataset and the RELLIS-3D dataset in Table 2. The CWT dataset is a more challenging terrain dataset than RELLIS-3D. For an image I, let I(x, y) be the predicted label at pixel location (x, y), Y(x, y) the ground truth label at (x, y), 1(·) the indicator function, and B the set of class labels. We use the following segmentation metrics for evaluation. The Intersection over Union (IoU) for class i is

IoU_i = Σ_{x,y} 1(I(x, y) = i ∧ Y(x, y) = i) / Σ_{x,y} 1(I(x, y) = i ∨ Y(x, y) = i).

A lightweight perception algorithm is crucial for robotic applications due to the limited capacity of mobile power sources. High-end GPUs consume a significant amount of power, making stable performance challenging, particularly when dealing with computationally intensive tasks. If the total power requirements of the robot and sensors exceed the power supply capacity, the performance of all devices becomes unstable. Therefore, it is important to carefully select the GPU and perception algorithm to ensure that the robot's normal operations are not affected. In addition to accuracy metrics, we also report the number of parameters and Giga-FLOPS (floating-point operations per second) as measurements to ensure that the computational demands do not place excessive pressure on the power supply.
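The segmentation metrics can be computed directly from the definitions above; the function names in the sketch below are ours:

```python
import numpy as np

def class_iou(pred, gt, class_id):
    """Per-class Intersection over Union: pixels where both the prediction
    and the ground truth equal class_id, over pixels where either does."""
    p, g = pred == class_id, gt == class_id
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else float("nan")

def mean_accuracy(pred, gt, classes):
    """mAcc: per-class pixel accuracy, averaged over the classes present
    in the ground truth."""
    accs = [(pred[gt == c] == c).mean() for c in classes if (gt == c).any()]
    return float(np.mean(accs))
```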

Terrain traversability map evaluation
In Table 3, we evaluate the accuracy of our method and compare it with several SOTA traversability mapping methods on the RELLIS-3D dataset. We use the ground-truth semantic labels from RELLIS-3D on a 3D point cloud and convert the labels to either 0 or 1 to indicate traversability on a grid map. During evaluation, we assume that the traversability map is based on the Clearpath Warthog, the same robot that collected the RELLIS-3D dataset: traversable regions like grass, dirt, concrete, and asphalt are set to 0, while puddles, bushes, and obstacles are set to 1. Although our method outputs a continuous value between 0 and 1, we use this binary conversion between labels and traversability scores to avoid any biases. See Fig. 5 for some qualitative results.
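The label-to-traversability conversion described above can be sketched as follows. The class names and the grouping into traversable/non-traversable follow the text; representing grid labels as a string array is an assumption for illustration (the real dataset uses numeric class IDs):

```python
import numpy as np

# Grouping taken from the evaluation setup in the text; names are illustrative.
TRAVERSABLE = {"grass", "dirt", "concrete", "asphalt"}   # -> 0
NON_TRAVERSABLE = {"puddle", "bush", "obstacle"}          # -> 1

def to_binary_traversability(grid_labels):
    """Map per-cell semantic class names to binary traversability (0/1)."""
    out = np.ones(grid_labels.shape, dtype=np.uint8)      # default: non-traversable
    out[np.isin(grid_labels, list(TRAVERSABLE))] = 0
    return out

grid = np.array([["grass", "puddle"], ["asphalt", "bush"]])
binary = to_binary_traversability(grid)  # [[0, 1], [0, 1]]
```

Defaulting unknown classes to 1 (non-traversable) is the conservative choice for a heavy vehicle.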

Comparisons
Since many methods do not have publicly available code, we implement their methods based on the papers; these implementations can only run on an offline dataset and not in the real world. We compare our method with the following methods: Dahlkamp et al. (2006) use a Mixture of Gaussians model to make a binary prediction of traversable regions on RGB images and apply an inverse perspective transform to world coordinates. Geometric-based methods (Chilian & Hirschmüller, 2009; Zhou et al., 2021) only use geometric information from the point cloud for navigation tasks. 3D semantic segmentation methods (Thomas et al., 2019; Cortinhal et al., 2020) are useful for classifying terrains. We obtain their inference results from the official repository of RELLIS-3D.

Evaluation metrics and results
We evaluate the traversability map on offline data with four different metrics. In general, our method has better performance in terms of accuracy and MSE. Note that for the first three metrics, all traversability values are converted to either 0 or 1 for methods with a continuous output. The metrics are described as follows:
• Mean accuracy: the average of the accuracies over traversable and non-traversable regions.
• All accuracy: accuracy over all grids.
• ROC (receiver operating characteristic) curve: previous methods (Dahlkamp et al., 2006; Sock et al., 2016) make binary predictions over each grid, so the ROC is a common indicator of performance through true-positive and false-positive rates, as shown in Fig. 6.
• MSE (mean squared error): to describe how well the prediction fits the ground truth, we also calculate the average squared difference between the prediction and the ground truth over all grids.
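The first, second, and fourth metrics above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under the stated definitions, not the paper's evaluation code; the 0.5 binarization threshold and the toy grids are assumptions:

```python
import numpy as np

def grid_metrics(pred, gt, threshold=0.5):
    """Mean accuracy, all accuracy, and MSE between a continuous
    traversability prediction and a binary ground-truth grid."""
    pred_bin = (pred >= threshold).astype(np.uint8)
    correct = (pred_bin == gt)
    all_acc = float(correct.mean())                      # accuracy over all grids
    # Mean accuracy: average of per-class (traversable=0 / non-traversable=1) accuracy.
    accs = [correct[gt == c].mean() for c in (0, 1) if (gt == c).any()]
    mean_acc = float(np.mean(accs))
    mse = float(np.mean((pred - gt) ** 2))               # fit of the raw scores
    return mean_acc, all_acc, mse

gt = np.array([[0, 1], [0, 1]])
pred = np.array([[0.1, 0.9], [0.6, 0.8]])
mean_acc, all_acc, mse = grid_metrics(pred, gt)
```

On this toy grid one traversable cell (score 0.6) is misclassified, so both accuracies are 0.75 while the MSE of 0.105 reflects how far the raw scores sit from the labels.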

Navigation and excavation in the real world
In this section, our focus is on evaluating our system in real-world environments, which include unstructured terrains and construction sites. For our evaluation, we utilize two different construction sites. The first construction site comprises numerous unstructured terrains, allowing us to emphasize the demonstration of the Navigation System (TNS) in Sect. 7.1.
On the other hand, the second construction site is more complex, featuring a combination of structured and unstructured terrains, along with various elements such as building materials, fences, construction and transportation vehicles, and modular work site camps. In Sect. 7.2, we showcase a more comprehensive Navigation and Excavation System (TNES) on the second worksite. Finally, we provide valuable insights and intuitive aspects related to the design of this intricate system.

TNS: Evaluation on unstructured terrains
We highlight the results in real-world environments and the overall performance of our navigation system based on TNS.
We also compare its performance with a geometric-only method (Chilian & Hirschmüller, 2009).

Hardware setup
We use a 49-ton XCMG XE490D excavator to perform our experiments. The excavator is equipped with a Livox Mid-100 LiDAR, an HIK web camera with an FOV of 56.8 degrees mounted at a pitch angle of 30.3 degrees to observe the environment, and a Huace real-time kinematic (RTK) positioning device to provide the location. We run our code on a laptop with an Intel Core i7-10875H CPU, 16 GB RAM, and a 6 GB GeForce RTX 2060 on the excavator. The XCMG XE490D excavator has a maximum climbing angle of 35 degrees, while the typical recommended safe climbing angle for any vehicle is 10 degrees. Therefore, we set s_cri = 35 deg and s_safe = 10 deg. We obtain an approximation of the maximum allowed height from s_cri and s_safe after expanding three times the resolution d_res along the surface:

h_cri = 3 tan(s_cri) × d_res = 0.35 m,
h_safe = 3 tan(s_safe) × d_res = 0.10 m.

(A note on the ROC comparison in Sect. 6.2: LiDAR-based segmentation methods (Cortinhal et al., 2020; Thomas et al., 2019) are trained on point-cloud labels, so they have the advantage of prior knowledge of the ground truth. In the real world, annotated 3D point-cloud data would not be easily available for applications. We represent those methods with a single point in the ROC plot, as there is no threshold to adjust.)
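The two height thresholds above follow directly from the slope limits. A small sketch, where the grid resolution d_res is an assumption back-solved from h_cri = 0.35 m (the text does not state it); the paper's h_safe of 0.10 m then appears to be a rounded value:

```python
import math

def height_threshold(slope_deg, d_res, expansion=3):
    """Max allowed step height after expanding `expansion` grid cells of
    resolution d_res along the surface at the given slope angle."""
    return expansion * math.tan(math.radians(slope_deg)) * d_res

D_RES = 0.1667  # meters per grid cell (assumed so that h_cri matches 0.35 m)
h_cri = height_threshold(35.0, D_RES)   # ~0.35 m, from the 35-degree climb limit
h_safe = height_threshold(10.0, D_RES)  # ~0.09 m; reported as 0.10 m in the text
```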

Traversability map results and analysis
We evaluate our system in the real world with visual results. In Fig. 5, we show some typical scenarios excavators encounter to illustrate the advantages of geometric and semantic fusion. In these cases, the steel bar and stone were not captured by the geometric calculation, but with semantic information those obstacles can be detected.
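The intuition behind these cases, that a region flagged unsafe by either the geometric or the semantic layer should stay unsafe, can be sketched as a conservative element-wise overlay. This is only an illustration of the overlay idea; the actual fusion in Sect. 4.1 is learning-based, and the arrays and cost values here are assumptions:

```python
import numpy as np

def fuse_conservative(geom_cost, sem_cost):
    """Element-wise max so a region flagged unsafe by either the geometric
    or the semantic layer stays unsafe in the fused map (values in [0, 1])."""
    return np.maximum(geom_cost, sem_cost)

# A flat steel bar: near-zero geometric cost but high semantic cost.
geom = np.array([[0.0, 0.1], [0.2, 0.0]])
sem  = np.array([[0.9, 0.1], [0.0, 0.0]])
fused = fuse_conservative(geom, sem)
```

The flat obstacle (top-left cell) survives into the fused map only because the semantic channel sees it.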

Planning based on offline traversability map
Based on the resulting occupancy grid maps from the proposed TNS and the geometric-only method (Chilian & Hirschmüller, 2009), we randomly choose start and goal positions on unoccupied grids over 90 trials. The success rates of finding a valid collision-free path are 82.6% for our TNS and 33.3% for the other method. We show some comparisons of planning results in Fig. 7. We use an occupancy threshold t_occ of 0.6. The height of the cabin h_cab is 0.5 m, and the distance between the two tracks d_track is 2.75 m for map post-processing and planner configuration.
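The trial setup, sampling random start and goal positions from cells below the occupancy threshold, can be sketched as follows. The t_occ = 0.6 value is from the text; the toy map and helper name are assumptions:

```python
import numpy as np

T_OCC = 0.6  # cells with traversability score >= T_OCC are treated as occupied

def sample_free_cell(trav_map, rng):
    """Sample a random unoccupied grid cell as a start or goal position."""
    free = np.argwhere(trav_map < T_OCC)
    return tuple(free[rng.integers(len(free))])

rng = np.random.default_rng(0)
trav = np.array([[0.1, 0.9], [0.3, 0.5]])
start = sample_free_cell(trav, rng)
goal = sample_free_cell(trav, rng)
```

A trial then counts as a success if the planner finds a collision-free path between the sampled cells.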

Real-world experiments and trials
We test our system TNS on two construction sites with a total area of at least 200 m². We summarize these trials in Table 4. We tested three types of trajectories: going straight while avoiding low-traversability areas, making normal turns, and making sharp turns on the terrain. For all tests, the excavator was able to successfully reach the given target, which demonstrates the robustness of our system. Furthermore, the tracking error of all trajectories is within 10 cm on average, and our method runs in real-time, updating the traversability map at a rate of 10 Hz. For details of the testing site, please refer to the supplemental materials.

Run-time analysis of traversability map
Our method consists of the following major components, which contribute to the overall runtime of the system:
• Segmentation generates a pixel-wise semantic classification for each image in the RGB input stream.
• Projection casts the 2D segmentation result onto the 3D point cloud and assigns each point a semantic label through the calibration matrix.
• Geometric traversability calculation estimates and updates slope and step height based on point-cloud data in a grid-map representation.
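The projection component above can be sketched with a standard pinhole model. The camera-from-LiDAR extrinsic and intrinsic conventions here are assumptions (the paper only says "calibration matrix"), and segment interpolation, occlusion handling, and distortion are omitted:

```python
import numpy as np

def project_labels(points, seg, K, T_cam_lidar):
    """Assign each 3D LiDAR point the semantic label of the pixel it projects to.
    points: (N, 3) in the LiDAR frame; seg: (H, W) label image;
    K: 3x3 camera intrinsics; T_cam_lidar: 4x4 extrinsic calibration matrix."""
    H, W = seg.shape
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]            # points in the camera frame
    labels = np.full(len(points), -1, dtype=int)      # -1: point not visible
    front = cam[:, 2] > 0                             # keep points in front of the camera
    uv = (K @ cam[front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)         # perspective divide to pixels
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    idx = np.flatnonzero(front)[inside]
    labels[idx] = seg[uv[inside, 1], uv[inside, 0]]   # note row=v, col=u indexing
    return labels

# Identity calibration toy example: the third point is behind the camera.
seg = np.array([[5, 6], [7, 8]])
pts = np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 0.0, -1.0]])
labels = project_labels(pts, seg, np.eye(3), np.eye(4))
```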
In Table 5, we give details of the run-time of each component in the system. The final fusion step takes under 2 ms and contributes negligibly to the overall runtime of the method. Our method can update the traversability map at a rate of 10 Hz. Please refer to the video for more visual results of excavator navigation.

Controller error analysis
The trajectory tracking controller can keep the excavator close to the desired path, with a maximum absolute lateral tracking error of less than 15 cm in most of our test runs. In the case shown in Fig. 8, the maximum tracking error is around 14 cm. This test run lasts 102 s at an average speed of 0.5 m per second, and the total length of the trajectory is about 50 m. The left plot shows the planned path from the improved Hybrid A* planner and the actual path of the excavator. The excavator starts from the blue pentagram and ends at the red dot. The top-right plot shows the tracking error, which is the distance between the excavator and the closest point on the planned path. The bottom-right plot is a histogram of the tracking error, where the y-axis represents the percentage of samples in each tracking-error bin. The tracking error is around 6 cm most of the time.

Fig. 7 Planner output comparisons between the geometric-only scheme (Chilian & Hirschmüller, 2009) (top) and TNS (bottom): we show planned trajectories with our modified Hybrid A* (Kurzer, 2016) planner. The planning is based on a global traversability map. We highlight some obstacles that are not observed by the geometric method (red) as well as some traversable regions that are falsely marked by the geometric method (blue) (Color figure online)

TNES: navigation and excavation
In this section, we present TNES, an extension of our work that combines TNS and AES. After gaining insight from navigation on unstructured terrains, we extend and test our system in a more comprehensive activity on another active worksite, where the complexity increases drastically due to the diversity of objects in the scene. In this scenario, the excavator not only needs to navigate unstructured terrains, but also to reach the construction location over either structured or unstructured terrain, among building materials, fences, other construction and transportation vehicles, modular worksite camps, etc. We further demonstrate the navigation system integrated with excavation tasks, which is another step towards autonomy of excavation activities.

Hardware setup
We make some changes to the hardware setup because the experimentation and demonstration are conducted on a new worksite. As shown in Fig. 9, we deploy TNES on a 6.5-ton XCMG XE75D excavator and test it in the new environment. The excavator is equipped with 3 Livox Mid-70 LiDARs with a field of view of 70.4 degrees and 4 HIK web cameras to provide comprehensive coverage of the scanning areas, and a Huace GPS device to provide the global location. We run our code on a NUVO laptop with an Intel Core i7-9700E CPU, 32 GB RAM, and a 6 GB GeForce RTX 2060 on the excavator. The RTK system is removed to reduce the cost of the hardware. In terms of software, we utilize map-based localization to fuse the information from LiDAR, IMU, and GPS in order to reduce the drift of the GPS system with respect to altitude.

Localization evaluation
Due to the complexity of the scene, localization becomes crucial before deploying the system autonomously to avoid damage and injury. A global map generated from our system is shown in Fig. 10. The height accuracy of GPS RTK is approximately 20 cm. A purely RTK-based localization system can therefore produce varying height measurements of the same grid cell at different times, leading to inaccurate geometric information. In Fig. 11, we show that our map-based localization solution achieves better height accuracy. During the experiment, the excavator traverses a round-trip path while the height data is recorded. Compared to the GPS measurement, our map-based localization estimation produces more consistent height measurements at the same position.
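The round-trip consistency check behind Fig. 11 can be sketched as a simple metric: group height measurements by milestone location and compare the spread within each group. The grouping rule, toy numbers, and function name are assumptions for illustration:

```python
import numpy as np

def height_consistency_error(x, z, decimals=1):
    """Mean spread (max - min) of height z among measurements taken at the
    same (rounded) milestone location x during a round-trip run."""
    keys = np.round(x, decimals)
    spreads = [np.ptp(z[keys == k]) for k in np.unique(keys)]
    return float(np.mean(spreads))

# Outbound and return passes over the same three milestones (toy numbers):
x = np.array([0.0, 1.0, 2.0, 2.0, 1.0, 0.0])
z_rtk = np.array([10.00, 10.10, 10.20, 10.38, 10.28, 10.18])  # drifting RTK heights
z_map = np.array([10.00, 10.10, 10.20, 10.21, 10.11, 10.01])  # map-based estimate
```

A consistent localization source yields a near-zero spread at every revisited location, mirroring the comparison in Fig. 11.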

Dataset
As we have access to another, more complex worksite environment, we collect and annotate 128 images according to the 7 semantic labels defined by the ontology of the CWT dataset, plus an additional 12 semantic classes that are common on construction sites. Since most terrain types are covered by the CWT ontology in Sect. 5, the additional semantic labels are mostly non-navigation related, as shown in Table 6. We show a few collected images and corresponding annotations in Fig. 12.
The Complex Worksite Terrain dataset described above focuses on the terrain, so its camera is mounted at a pitch angle of 30.3 degrees towards the ground. To capture more context of the working environment and the surroundings on the worksite, we mount the camera at a pitch angle of 8.5 degrees towards the ground, so the field of view includes the terrain, foreground, and background. The purpose of this annotation is to adapt to the current environment for our experimentation and demonstration. The eventual goal is to collect terrain data and construction-context images on multiple worksites, so the inference model can generalize to more construction situations.

TNES on the worksite
By integrating TNS with the automatic digging function from AES, we have successfully implemented an operation process that allows an excavator to navigate to the designated location for an excavation task and return to a specified location after the task is complete. We show the effectiveness of our system in providing enhanced walking capabilities for autonomous excavators on construction sites, making it a valuable addition to the existing AES system. Please refer to the multi-media supplemental submission for the demo video.

Analysis and lessons learned
In this section, we highlight some of the failures of and lessons learned from the design and evaluation of our system:

• Perception errors include segmentation errors and LiDAR measurement errors. It is hard to find similar terrains or scenarios in existing datasets for annotation and supervised training, especially when the terrain becomes rougher and bumpier. To alleviate segmentation errors, we collect and annotate terrain data on construction sites with different terrain labels, including flat surface, bumpy surface, water puddle, obstacles, rocks, etc., aiming to improve perception accuracy in unstructured environments and enable such construction-vehicle applications. However, the LiDAR measurement can be unreliable due to dust in the air. To remove such noise, we use step-height estimation and semantic fusion for more robust traversability predictions, as mentioned in Sect. 4.1. In addition, certain terrain features are visually detectable but not easily captured geometrically, such as a large water puddle with a flat surface. Conversely, some features, like a dirt hill with significant elevation changes, are more apparent geometrically than visually. Our objective is to combine and overlay both types of information to make a conservative prediction that ensures terrain safety from multiple perspectives. By leveraging both visual and geometric data, we aim to detect unsafe regions even when one source is uncertain; it is uncommon for an unsafe region to be hard to recognize both visually and geometrically.

• Terrain roughness is a common input for geometric traversability methods on mobile robots (Chilian & Hirschmüller, 2009). However, it is less informative for large machines like excavators due to the scale difference. In our case, roughness can be partially modeled through the slope and step height or captured by visual features from the RGB images. However, it could become an issue if the terrain is very uneven or has large rocks or obstacles.

• Localization accuracy directly impacts the quality of the system. In our experiments, the main source of localization inaccuracy is the drift of the RTK system in altitude. In our open test field, the accuracy of the RTK system in latitude and longitude is around 5 cm, whereas the altitude accuracy is about 20 cm. To plan and navigate accurately over a period of time, we only use the most recent grid cells to calculate the traversability score, because the drift over a short horizon is small. In addition, our attempt to use SLAM for localization failed because most features are quite uniform (similar hills, pits, rock piles, etc.), causing degraded performance and very low accuracy due to instability. In the future, we could build a more stable localization system that fuses RTK, LiDAR, and camera data.

• Planner adjustments are needed to fully utilize the traversability map. We choose the Hybrid A* algorithm over the standard A* algorithm in our system to avoid sharp turns, which could damage or roughen the ground surface. We adjust the Hybrid A* planner as described in Sect. 4.3 to compute a smoother and safer path using continuous traversability-map values. However, it is hard to guarantee that our planner will always generate a smooth path on arbitrary terrain.

• Computational and power budget is a major constraint in the design of our perception and planning algorithms. Our traversability-map computation and navigation module run on a laptop with an Intel Core i7-10875H CPU and a 6 GB GeForce RTX 2060. Our implementation must be efficient and lightweight to run in real-time. Recently, many deep and reinforcement learning methods have been proposed for object detection and navigation, but they require a high-end GPU for efficient execution; we cannot use such methods on our platform.

• Safety is always the most critical consideration when deploying the autonomous excavator system in the real world. We develop the terrain traversability mapping component to describe the complexity of the terrain and provide safe regions for the autonomous excavator to navigate. Our method can be combined with other safety strategies, such as object detection and collision avoidance, and maintains the stability of the excavator to ensure safe autonomous operation.

• Excavator size also governs the performance of our system. There are three broad classes of excavators: compact excavators (less than 6 tons), standard excavators (7-45 tons), and large excavators (45-90 tons). The size of the excavator impacts the performance of the navigation system when computing a smooth trajectory and the resulting path, and there is a trade-off between mobility and stability across sizes. We have evaluated the performance of TNS on a large, 49-ton excavator. In general, developing autonomous excavation technology for larger excavators is more challenging.

Fig. 11 Round-trip results from the localization experiment: z is the height measurement and x the milestone location along the round trip. Ideally, the measurement should be consistent at the same location x. The height-measurement error of the map-based localization estimate is significantly smaller than that of GPS RTK

Fig. 12 New environment and semantic annotations: we highlight several challenging scenes on our new working site and the corresponding 18 types of semantic annotations. We omit the "Other" class legend for simplicity

Conclusions, limitations, and future work
In this paper, we present a terrain traversability mapping and navigation system (TNS) for autonomous excavator navigation. We highlight its application and benefits on difficult excavator navigation tasks in real-world scenarios. We use a novel learning-based geometric fusion solution and demonstrate its benefits over prior traversability mapping algorithms. With a better localization solution, we further deploy our navigation system on a more challenging worksite. In addition, we release the CWT dataset, with challenging real-world scenes from unstructured construction sites for the terrain estimation task, and we add another set of images with more diverse annotations that are not limited to terrain and include common objects on a worksite for more challenging perception tasks. Finally, we present TNES, an integration of the proposed TNS and the Autonomous Excavation System (AES), and demonstrate integrated navigation and excavation activities. Our work has some limitations. Due to safety concerns, we were not able to test our system extensively in all types of scenarios, including cases with many human workers and other machines. We have only evaluated the performance on a large, 49-ton excavator and a small, 6.5-ton excavator. As part of our future work, we would like to further improve the planner and exploit the specifications of the excavator as a human operator would; for example, the excavator should be able to run over small obstacles using the space between its two tracks. In addition, we would like to evaluate the performance on different types of outdoor terrain. Our longer-term goal is to enable autonomy and collaboration among machines, and with humans, on construction sites. This requires several systems and modules working together, including autonomous excavation, autonomous navigation, and human-machine interaction.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Tianrui Guan is a Ph.D. student at the University of Maryland, College Park. He received his B.S. in Computer Science and Statistics at UMD in 2019 with Magna Cum Laude and continued as a graduate student in Computer Science. His research interests include detection and segmentation, perception-based retrieval, and autonomous driving.