Automatic targetless LiDAR–camera calibration: a survey

The recent trend of fusing complementary data from LiDARs and cameras for more accurate perception has made the extrinsic calibration between the two sensors critically important. Indeed, to align the sensors spatially for proper data fusion, the calibration process usually involves estimating the extrinsic parameters between them. Traditional LiDAR–camera calibration methods often depend on explicit targets or human intervention, which can be prohibitively expensive and cumbersome. Recognizing these weaknesses, recent methods usually adopt the automatic targetless calibration approach, which can be conducted at a much lower cost. This paper presents a thorough review of these automatic targetless LiDAR–camera calibration methods. Specifically, based on how the potential cues in the environment are retrieved and utilized in the calibration process, we divide the methods into four categories: information theory based, feature based, ego-motion based, and learning based methods. For each category, we provide an in-depth overview with the insights we have gathered, hoping to serve as potential guidance for researchers in the related fields.


Introduction
In modern autonomous systems such as self-driving vehicles, accurate perception of the surrounding environment is an important capability and a prerequisite for making subsequent decisions. In order to further improve perception accuracy, autonomous systems usually apply different types of sensors and combine their advantages through data fusion (Cui et al. 2022;Feng et al. 2021). Among them, one of the most typical multi-sensor fusion setups combines a LiDAR sensor and an RGB camera, as shown in Fig. 1. Currently, LiDAR-camera fusion has been widely applied to a variety of challenging tasks, such as object detection and tracking (Chen et al. 2017;Vora et al. 2020;Kim et al. 2021), simultaneous localization and mapping (SLAM) (Graeter et al. 2018;Zuo et al. 2019), and navigation (Hussein et al. 2016).
In order to fuse the data from a LiDAR sensor and a camera, it is critical to first calculate the extrinsic transformation between the two sensors in a common frame of reference. The process of the parameter estimation for the transformation between sensor coordinate systems, including rotation and translation, is called LiDAR-camera extrinsic calibration.
The extrinsic calibration involves finding the correspondence between data from the two sensors. LiDAR point clouds and camera images are of two distinct modalities, which differ in dimension, resolution, field of view, etc., bringing great challenges to the calibration process. Traditionally, the two sensors are calibrated by placing predefined targets, such as checkerboards, polygonal boards, and boxes, in specific scenes, or by manually extracting and matching particular features from the sensor outputs. However, these methods require pricey and lengthy manual operations, making it expensive to compensate for drifts of the calibration parameters caused by displacements of the sensors on moving vehicles.
To approach this problem, a recent trend is to extract features or other discriminative information, such as the common attribute probability distribution and the motion trajectory, that can be used to calibrate the sensors in the actual driving environment. This approach does not require any calibration target or manual effort. As a result, it is referred to as automatic targetless calibration.
Automatic targetless calibration promises to revolutionize the way calibration is done, but it also brings great challenges to the design of the system. In particular, without a clear calibration target, the difficulties of feature extraction and matching increase significantly. In this paper, we carefully review recent automatic targetless calibration methods, with a special focus on how they tackle this challenge mathematically. Specifically, based on how these methods exploit potential cues in the environment, we divide them into four categories: (1) information theory based methods that measure the statistical similarity between joint histogram values of several common properties between the two modalities, (2) feature based methods that extract geometric, semantic or motion features from the environment, (3) ego-motion based methods that make use of sensor-movement related information, and (4) learning based methods that use neural network models to estimate the extrinsic parameters.

Fig. 1 The process of LiDAR-camera fusion based on extrinsic calibration. The image and the point cloud are taken from the KITTI public dataset (Geiger et al. 2013), as in the following figures.
Though there are already a few surveys on LiDAR-camera calibration such as Nie et al. (2021), Yaopeng et al. (2021), Khurana and Nagla (2021), and Wang et al. (2021), they usually cover a wide range of traditional calibration methods and only provide a relatively brief review of automatic targetless calibration. Given its rapidly increasing popularity, we believe a thorough and insightful review on automatic targetless LiDAR-camera calibration is both imperative and important. To summarize, our contributions are as follows:
1. We provide an accurate and inclusive classification of automatic targetless LiDAR-camera calibration methods. Based on how the potential information in the environment is utilized to solve the extrinsic calibration problem, we divide these methods into four categories, i.e., information theory based, feature based, ego-motion based, and learning based methods. We then further split each category according to its different implementation choices.
2. We present an extensive and detailed introduction to related studies that fall within the scope of automatic targetless calibration, carefully classifying and discussing these papers in detail.
3. We provide elaborate comparisons and discussions of the four categories regarding their characteristics and advantages, as well as their limitations.
The remainder of this paper is organized as follows; the overall structure is presented in Fig. 2 (global structure of this survey). In Sect. 2 we explain the mathematical principles of extrinsic calibration between a LiDAR sensor and a camera, introduce the criteria for the classification of current methods, and present our automatic targetless LiDAR-camera calibration framework. Section 3 reviews LiDAR-camera extrinsic calibration methods from the four previously mentioned categories. In Sect. 4 we summarize and compare the pros and cons of the four categories of methods. Section 5 concludes the paper.

Background
The calibration for LiDAR and camera is aimed at obtaining the transformation between the two sensors' coordinates, which enables the conversion of the data from the LiDAR sensor and the camera into the same coordinate system. Fusion of the calibrated data is crucial to improve performance for perception tasks, such as object detection, classification, tracking, and so on (Chen et al. 2017;Zhang et al. 2019).
In this section, we specify the concepts of intrinsic and extrinsic calibration parameters and review the mathematics for the transformation between LiDAR and camera coordinates. Then we introduce four categories of extrinsic calibration methods, which are distinguished according to the need for calibration targets and whether human intervention is required. In this paper, we focus on automatic targetless extrinsic calibration between a LiDAR sensor and a camera.
It should be noted that LiDAR-camera calibration here refers to the alignment of the sensors at the spatial level, i.e., obtaining the rigid transformation between the two sensor coordinate systems. A related concept is temporal calibration, or time synchronization, which is the alignment of the sensors at the temporal level. Since each sensor often has a different sampling frequency, some method is needed to synchronize the data of multiple sensors to a unified timestamp. In this paper, we assume that temporal calibration has already been performed well.

Transformation between LiDAR and camera coordinates
The transformation relationship between the coordinate systems of a LiDAR sensor and a camera is specified by extrinsic parameters in LiDAR-camera calibration. Meanwhile, the camera is treated as a classical pinhole camera model. Then a 3D point in the camera coordinate system is projected onto a 2D point in the image plane, where intrinsic parameters specify the projection. We use both extrinsic and intrinsic parameters to transform a 3D point in the LiDAR coordinate system to a 2D pixel in the image plane and vice versa, which defines the correspondence between points and pixels. Notice that the intrinsic parameters represent the internal properties of the camera, such as focal length and principal point, which can be measured offline. Then we only need to estimate the extrinsic parameters online for the transformation.
As illustrated in Fig. 3, we use O_L and O_C to denote the origins of the coordinate systems attached to the LiDAR sensor (L) and the camera (C), respectively. The position coordinates of a point P w.r.t. L and C are denoted as P_L = [X_L Y_L Z_L]^⊤ and P_C = [X_C Y_C Z_C]^⊤, respectively. The point P is also projected onto the image plane at p_C = [u v]^⊤. Then the projection between the 3D point P_C in the camera coordinates and the 2D point p_C on the image plane, i.e., the intrinsic parameters, is specified by the following equation:

Z_C [u, v, 1]^⊤ = K [X_C, Y_C, Z_C]^⊤,  K = [[f_x, s, u_0], [0, f_y, v_0], [0, 0, 1]],  (1)

where f_x and f_y denote the focal lengths in pixels along the x and y axes respectively, (u_0, v_0) denotes the optical center (the principal point), and s denotes the skew coefficient, which is non-zero if the image axes are not perpendicular. Meanwhile, Z_C denotes the depth scale factor.
The transformation between the point P_L in the LiDAR coordinates and the point P_C in the camera coordinates, i.e., the extrinsic parameters, is specified by the following equation:

P_C = R P_L + t,  (2)

where R and t denote the rotation matrix and the translation vector between the LiDAR and camera coordinates, respectively. Let T = [R t]. In the following, we use T to denote the extrinsic parameters.
At last, we can specify the transformation between P_L and p_C by combining Eqs. (1) and (2):

Z_C [u, v, 1]^⊤ = K (R P_L + t).  (3)

In the following, we use proj_T(P_L) = p_C to denote the projection function from the 3D LiDAR point P_L to the 2D point p_C on the image plane w.r.t. the extrinsic parameters T. With a slight abuse of notation, we also use proj_T(P_L) = p_C to denote the projection function from a set of 3D LiDAR points P_L to an image p_C, where for each P_L ∈ P_L, proj_T(P_L) ∈ p_C, and vice versa.

Fig. 3 Transformation between a LiDAR sensor and a camera using extrinsic parameters. A point P in the 3D world scene is observed by a LiDAR sensor and a camera, denoted as P_L in the LiDAR coordinate system and P_C in the camera coordinate system. The coordinate transformation between P_L and P_C is performed through the extrinsic parameters R and t.
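To make the projection concrete, here is a minimal NumPy sketch of the projection function proj_T described above. The function name `proj` and the intrinsic values are our own illustrative choices, not from any specific calibration system.

```python
import numpy as np

def proj(points_l, K, R, t):
    """Project Nx3 LiDAR points onto the image plane: Z_C [u, v, 1]^T = K (R P_L + t)."""
    pts_c = points_l @ R.T + t        # LiDAR frame -> camera frame (Eq. 2)
    uv = pts_c @ K.T                  # apply intrinsics (Eq. 1)
    return uv[:, :2] / uv[:, 2:3]     # divide by the depth scale Z_C

# Example with identity extrinsics and simple intrinsics (no skew)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)
p = proj(np.array([[1.0, 2.0, 10.0]]), K, R, t)   # -> [[370., 340.]]
```

In practice, points with Z_C ≤ 0 (behind the camera) must be filtered out before the division, and the result must be clipped to the image bounds.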

Categories of extrinsic calibration methods
According to the need for calibration targets and whether human intervention is required, extrinsic calibration between a LiDAR sensor and a camera can be divided into the following four categories:
Manual target-based These extrinsic calibration methods require engineers to manually specify the correspondences between the LiDAR point clouds and camera images based on one or more calibration targets, such as checkerboard patterns (Zhang and Pless 2004;Geiger et al. 2012;Zhou and Deng 2012), ArUco tags (Dhall et al. 2017;Yoo et al. 2018), custom-made planar targets (Vel'as et al. 2014;Guindel et al. 2017), and ordinary boxes (Pusztai and Hajder 2017;Hassanein et al. 2016).
These specified calibration targets impose geometric constraints between corresponding 3D points in point clouds and pixels in images, which enable the agent to estimate extrinsic parameters. For example, Zhang and Pless (2004) proposed to use a checkerboard from multiple views to calibrate a 2D LiDAR sensor and a camera, where the extrinsic parameters were estimated by solving a nonlinear least-squares iterative minimization problem. Later, Unnikrishnan and Hebert (2005) extended the work to calibrate 3D LiDARs and cameras following a similar procedure.
Automatic target-based Different from manual target-based methods, these methods do not require human intervention, where the correspondences between point clouds and images are automatically estimated using various features w.r.t. the calibration targets.
There are various calibration methods in this category. For instance, Geiger et al. (2012) presented an automatic extrinsic calibration method using only a single shot. Specifically, the method requires finding several checkerboards placed in different locations, rather than taking several shots of one checkerboard placed at different positions. Toth et al. (2020) used a spherical target for automatic extrinsic calibration. The method calculates the sphere center of the target using the surfaces detected in point clouds and the contours detected in images, and estimates the extrinsic parameters via the geometric constraint that both observations share the same sphere center.
Manual targetless Extrinsic parameters may need to be adjusted online in some real-world applications, like self-driving (Levinson and Thrun 2013). Then targetless calibration methods are required to estimate extrinsic parameters in the real world without specified targets.
Manual targetless methods consider the problem by manually specifying the correspondences between point clouds and images, which often require a set of predefined rules or patterns for selecting the correspondences. For example, Scaramuzza et al. (2007) proposed a targetless calibration method. The method first manually selects a set of pairs between 3D points in point clouds and pixels in images. Then it estimates extrinsic parameters using the PnP (Perspective from n Points) algorithm (Quan and Lan 1999) followed by an iterative least-squares refinement.
Automatic targetless Automatic targetless calibration methods estimate extrinsic parameters by automatically exploiting useful information from the surrounding environment. These approaches require neither specified calibration targets nor heavy manual work. In the next section, we summarize existing automatic targetless extrinsic calibration methods according to the information they use for the estimation.
Notice that automatic targetless calibration is widely applied in many practical applications for autonomous systems, like intelligent vehicles, drones, and robots.


Automatic targetless LiDAR-camera calibration

Automatic targetless LiDAR-camera calibration methods intend to estimate the extrinsic parameters between LiDAR and camera automatically, by exploiting useful information from the surrounding environment online, without any human intervention.
According to three specific sources of information exploited from environments, there are three categories of automatic targetless LiDAR-camera calibration methods, i.e., information theory based methods, feature based methods, and ego-motion based methods. Different from them, learning based methods use neural networks to implicitly capture useful information from environments for the calibration.
The reason that we group existing automatic targetless LiDAR-camera calibration methods into these four categories is to indicate how the information from the surrounding environment is utilized for calibration. Moreover, these four categories also suggest calibration methods for different application scenarios.
Specifically, information theory based methods are preferred for environments with few features, as they maximize the similarity between (the projection of) the set of all 3D points from the LiDAR and the whole image from the camera, rather than relying on certain kinds of features. On the other hand, feature based methods are suitable for scenes that provide sufficient features, such as urban environments with rich geometric and semantic features.
Ego-motion based methods can be applied in scenarios where both the LiDAR and the camera are moving during the calibration process, e.g., when both sensors are mounted on a moving car. Correspondingly, ego-motion based methods should not be applied in scenarios where both sensors are static during the calibration process, such as roadside sensing systems. Different from the above three categories, learning based methods require large sets of training data and sufficient computing resources for online inference.
In this section, we summarize the most recent automatic targetless calibration methods into four categories, i.e., information theory based methods, feature based methods, ego-motion based methods, and learning based methods. For each category, we introduce the basic principles of the methods and further explore their differences by specifying the multiple choices for their implementation.

Information theory based methods
Information theory based methods estimate the extrinsic parameters by maximizing the statistical similarity between the data of the LiDAR sensor and the camera, measured by various information metrics. Specifically, the basic principle of information theory based methods can be summarized as the following equation:

T* = argmax_T IM(proj_T(P_L), p_C),  (4)

where P_L denotes the set of 3D points generated by the LiDAR sensor, p_C denotes the image generated by the camera, proj_T denotes the projection function from the set of 3D points to the image w.r.t. the extrinsic parameters T, and IM denotes the corresponding information metric that measures the similarity between proj_T(P_L) and p_C.
Following the statement in Eq. (4), an information theory based method for LiDAR-camera calibration consists of three steps:
3D-2D projection for LiDAR points proj_T projects the set P_L of 3D LiDAR points to the image proj_T(P_L) w.r.t. the extrinsic parameters T.
Statistical similarity measure IM measures the statistical similarity between the 2D projected image proj_T(P_L) and the camera image p_C w.r.t. features that share a similar distribution between the sensor data obtained by the LiDAR and the camera. Notice that different choices of these features and the corresponding statistical dependence measures result in different LiDAR-camera calibration methods.
Optimization The statistical dependence measure IM is usually a non-convex function, which requires an optimization method to reach the global optimum. A typical pipeline of an information theory based approach is shown in Fig. 4.
Note that there are several attributes of the sensor data obtained by LiDAR and camera that share a similar distribution. For instance, LiDAR data points with high reflectivity usually correspond to bright surfaces in the image, and points with low reflectivity correspond to dark areas (Pandey et al. 2012). The correlation between LiDAR reflectivity and camera intensity is often applied to measure the similarity between the data of the LiDAR and the camera. Besides reflectivity and intensity, gradient magnitude and orientation extracted from both LiDAR point clouds and camera images can also be considered.

Pairs of point cloud and image attributes
We summarize pairs of attributes for LiDAR point clouds and images that are commonly adopted in existing information theory based methods and specify them in the form of "Point cloud attribute - Image attribute".
• Reflectivity-Grayscale intensity The reflectivity of a LiDAR point is recorded as the return strength of a laser beam, and grayscale intensity denotes the intensity of a pixel in a grayscale image. When the camera and LiDAR simultaneously observe the environment, there is a statistical similarity between the reflectivity of the LiDAR point cloud and the grayscale intensity of the image, as both attributes mainly depend on the same surface properties of the objects (Pandey et al. 2014). Similarly, other pairs of attributes, such as Reflectivity-Hue (Zhao et al. 2016), Reflectivity-Visible light wavelengths (Pascoe et al. 2015), and Reflectivity-Color (Irie et al. 2016), also depend on the same surface properties of the objects.
• Surface normal-Grayscale intensity Given the light sources in the environment, the surface normal affects the grayscale intensity of the corresponding pixels in the image. Hence, there is a statistical relation between the surface normals obtained from the LiDAR point cloud and the grayscale intensity of the image. The surface normal can be estimated from either dense or sparse LiDAR point clouds via various methods (Taylor and Nieto 2012). Given the normal vector of a point, the angle between it and the horizontal plane can also be calculated. It is often assumed that most of the light comes from above, so this angle has the largest influence on the intensity, which implies the statistical relation between the surface normal and the grayscale intensity.
• Gradient magnitude and orientation-Gradient magnitude and orientation When comparing two multi-modal images, e.g., a camera picture and a LiDAR depth image, if the pixel intensity of a patch in one image differs significantly from its surroundings, then the strength of the corresponding site in the other modality is likely to change accordingly. This correlation exists because changes in these intensities typically represent differences between the background and the detected material or object. For 2D images, the magnitude and orientation of the pixel gradient can be calculated using the Sobel operator (Taylor et al. 2014). As for point clouds, each point is first projected onto a sphere, then the gradient is computed using its nearest 8 neighbors based on the algorithm proposed in Taylor et al. (2014).
• 3D semantic label-2D semantic label Since the semantic label of each 3D point is the same as that of its corresponding image pixel, if one exists, data association can be performed using such information (Jiang et al. 2021). The point-wise semantic labels in an image and a point cloud can be predicted separately, as a segmentation task using neural network models (Takikawa et al. 2019;Cortinhal et al. 2020).
• Combination of 3D-2D attribute pairs Instead of relying on one specific pair of 3D-2D attributes to estimate the pixel similarity, some methods found that using a mixture of features is advantageous for improving algorithmic robustness against varying environments (Irie et al. 2016). They compute similarity measurements using a combined set of 3D-2D attribute pairs with appropriate weights assigned to each. These attribute sets are usually a combination of some of the above attribute pairs, such as reflectivity, surface normal, and gradient in the point cloud, and grayscale intensity and gradient in the image.
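As a concrete illustration of the gradient attributes above, the following NumPy sketch computes per-pixel gradient magnitude and orientation for a 2D intensity array. It uses central differences (`np.gradient`) rather than the Sobel operator mentioned in the text; the function name is our own.

```python
import numpy as np

def gradient_mag_ori(img):
    """Per-pixel gradient magnitude and orientation of a 2D intensity array."""
    gy, gx = np.gradient(img.astype(float))  # derivatives along rows, columns
    mag = np.hypot(gx, gy)                   # gradient magnitude
    ori = np.arctan2(gy, gx)                 # gradient orientation in radians
    return mag, ori

img = np.tile(np.arange(5.0), (5, 1))  # intensity ramp along x
mag, ori = gradient_mag_ori(img)       # mag = 1 and ori = 0 everywhere
```

The same computation can be applied to a LiDAR depth image obtained by projecting the point cloud, so that the two modalities share comparable gradient attributes.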

Statistical similarity measure
Based on the above attribute pairs for LiDAR point clouds and camera images, various statistical dependence measures can be used to quantify the statistical similarity between them, where larger measure values indicate better correspondences. In the following, we summarize the statistical dependence measures that are commonly applied in existing information theory based methods.
• Mutual Information (MI) MI provides a means to measure the statistical dependence between two random variables, or the amount of information that one variable contains about the other. Under the Shannon entropy (Shannon 2001), MI is defined as:

MI(X, Y) = H(X) + H(Y) − H(X, Y),

where H(X) and H(Y) are the individual entropies of the random variables X and Y, and H(X, Y) is their joint entropy, i.e.,

H(X) = −∑_x p_X(x) log p_X(x),
H(Y) = −∑_y p_Y(y) log p_Y(y),
H(X, Y) = −∑_{x,y} p_XY(x, y) log p_XY(x, y),

where p_X(x), p_Y(y), and p_XY(x, y) denote the marginal and joint probabilities of these random variables, respectively. In practice, we can use, for example, the reflectivity value of each LiDAR point and the intensity of each image pixel as the two random variables X and Y. The probability distributions of both random variables can then be estimated using methods such as kernel density estimation (KDE) (Scott 1992).
• Normalized Mutual Information (NMI) Notice that MI can be influenced by the total amount of information contained in the LiDAR points and the image. As a result, the preferred similarity transformation between the LiDAR sensor and the camera, i.e., the extrinsic parameters for the calibration, may not correspond to the largest MI value (Studholme et al. 1999). NMI addresses the problem by normalizing the MI value, i.e.,

NMI(X, Y) = (H(X) + H(Y)) / H(X, Y).

• Gradient Orientation Measure (GOM) GOM operates by calculating how well the orientations of the gradients are aligned between two images, with the magnitude of the gradient used as the weight. There is a major difference between NMI and GOM: GOM uses the gradients of points rather than their intensities, so it takes into account the values of neighboring points and the geometry present in the image.
• Normalised Information Distance (NID) NID (Li et al. 2004) is a similarity metric that can be used to match the modalities of different sensors.
The normalization property of NID brings advantages similar to those of NMI over MI: since it does not depend on the total information content of the two images, it does not bias global image alignment toward matches between highly textured image regions.
• Bagged Least-squares Mutual Information (BLSMI) BLSMI (Irie et al. 2016) combines a kernel-based dependence estimator with noise reduction by bootstrap aggregating (bagging). One of the advantages of BLSMI over ordinary MI is its robustness against outliers, because it does not include a logarithm.
• Mutual Information and Distance between Histogram of Oriented Gradients (MIDHOG) MIDHOG is a metric that combines NMI and the Distance between Histograms of Oriented Gradients (DHOG) to measure the consistency between images (Guislain et al. 2017); it is defined as a weighted combination of the two terms, with a parameter representing the weight. When applied to images with only a few textures, DHOG performs much better than NMI. However, on images with many textures, NMI gives more accurate results. Thus, MIDHOG is able to deal with different scenarios by inheriting the properties of both MI and DHOG.
• Mutual Information Neural Estimation (MINE) MINE (Belghazi et al. 2018) uses neural networks to estimate the mutual information between high-dimensional continuous random variables. MINE is scalable, flexible, and completely trainable via back-propagation, and it can be used for mutual information estimation, maximization, and minimization. MINE uses the Donsker-Varadhan (DV) duality to represent MI as:

MI(X, Y) = sup_F ( E_{p_XY}[F] − log E_{p_X p_Y}[e^F] ),

where the supremum is taken over functions F; in MINE, F is a function parameterized by a neural network, where θ are the parameters of the neural network.
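The MI and NMI measures above can be sketched with a simple joint-histogram estimator, a discrete alternative to the KDE mentioned earlier. All names below are our own, and the synthetic reflectivity/intensity data is purely illustrative.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """MI and NMI between two attribute arrays, estimated via a joint histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                       # joint probability p_XY
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals p_X, p_Y
    nz = pxy > 0
    hxy = -np.sum(pxy[nz] * np.log(pxy[nz]))    # joint entropy H(X, Y)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return hx + hy - hxy, (hx + hy) / hxy       # MI, NMI

rng = np.random.default_rng(0)
refl = rng.uniform(0.0, 1.0, 5000)              # simulated LiDAR reflectivity
gray = refl + 0.05 * rng.normal(size=5000)      # correlated pixel intensity
mi_dep, _ = mutual_information(refl, gray)
mi_ind, _ = mutual_information(refl, rng.uniform(0.0, 1.0, 5000))
# mi_dep is much larger than mi_ind, reflecting the statistical dependence
```

In a calibration loop, `x` would hold the reflectivity of the projected LiDAR points and `y` the grayscale intensities of the pixels they land on, and the optimizer would search for the extrinsic parameters T maximizing this value, as in Eq. (4).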

Optimization methods
We also summarize optimization methods that are most commonly adopted in existing information theory based methods.
• Barzilai-Borwein steepest descent method The Barzilai-Borwein method (Barzilai and Borwein 1988) is a gradient method with an adaptive step size in the direction of the gradient of the cost function.
• Nelder-Mead downhill simplex method The Nelder-Mead method (Nelder and Mead 1965) is a direct search method, often applied to nonlinear optimization problems for which derivatives may not be known.
• Levenberg-Marquardt algorithm The Levenberg-Marquardt algorithm (Levenberg 1944) is a commonly used iterative algorithm for solving non-linear minimization problems.
• Particle swarm optimization Particle swarm optimization (Kennedy and Eberhart 1995) is a global optimization algorithm. It works by placing an initial population of particles randomly in the search space, then iteratively moving the particles to optimize the objective.

• Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method The BFGS quasi-Newton method (Kelley 1999) is a gradient-based algorithm for maximizing the objective function.
• Bound Optimization BY Quadratic Approximation (BOBYQA) The BOBYQA algorithm (Powell 2009) is a deterministic, derivative-free optimization algorithm that relies on an iteratively constructed quadratic approximation.
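As a minimal illustration of the Barzilai-Borwein method listed above, the sketch below applies the BB1 step-size rule to a small quadratic test problem; in an actual calibration pipeline the gradient would come from the chosen information metric over the extrinsic parameters. The function names are our own.

```python
import numpy as np

def bb_descent(grad, x0, steps=50, alpha0=1e-3):
    """Barzilai-Borwein descent: the step size is adapted from successive
    differences of the iterates (dx) and gradients (dg)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    alpha = alpha0
    for _ in range(steps):
        x_new = x - alpha * g
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g
        denom = dx @ dg
        if abs(denom) > 1e-12:
            alpha = (dx @ dx) / denom   # BB1 step size
        x, g = x_new, g_new
    return x

# Minimize f(x) = 0.5 x^T A x - b^T x; the optimum solves A x = b, i.e. [1, 2]
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([3.0, 2.0])
x_opt = bb_descent(lambda x: A @ x - b, np.zeros(2))
```

Note that BB iterations are nonmonotone: the objective may temporarily increase, which is why implementations often pair the rule with a safeguard line search.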

Summary of information theory based methods
Following the above discussion, we summarize information theory based methods in Table 1 and group the methods by their 'Information metric', 'LiDAR attribute - Image attribute', and 'Optimization method'. When combined 3D-2D attribute pairs are used, the specific attribute pairs are selected differently for each method. Guislain et al. (2017), among others, choose the two attribute pairs 'reflectivity - grayscale intensity' and 'surface normal - grayscale intensity'. Irie et al. (2016) also used the 'depth discontinuity - edge' attribute pair. The assumption of this pair is that depth changes in the point cloud are likely to appear as edges in the image, which will be described in detail later in the section on feature based methods. Zhao et al. (2016) used reflectivity, surface normal, and curvature as LiDAR attributes, and intensity, hue, and gradient as image attributes; here, the curvature attribute in the point cloud is used to correspond to the gradient attribute in the image.

Feature based methods
Different from information theory based methods, feature based methods for automatic targetless LiDAR-camera calibration directly extract and match features from LiDAR point clouds and camera images, without optimizing their statistical similarities.
Features that are commonly adopted in these methods can be sorted into three categories: geometric, semantic, and motion features. They need to be acquired online from both LiDAR points and camera images of the surrounding environment. Specifically, geometric features are constructed from a set of geometric elements in the environment, such as points or edges. Semantic features are high-level representations that often specify semantic-aware components of the environment, such as skylines, cars, and poles. Motion features describe the characteristics of moving objects, including pose, velocity, acceleration, etc.
As illustrated in Fig. 5, the process of a typical feature based method often contains three steps, i.e., feature extraction, feature matching, and transformation estimation.
Feature extraction Feature extraction aims to automatically detect stable and unique features from both point clouds and images. These features usually represent specific geometric or semantic elements in the surrounding environments.
Feature matching Feature matching intends to provide the correspondence between the features extracted from the point cloud and the image. For this purpose, various feature descriptors as well as spatial relationships between features are applied.
Transformation estimation This step estimates the transformation relationship, i.e., extrinsic parameters, for LiDAR-camera calibration, based on feature correspondences provided by feature matching. Singular Value Decomposition (SVD) is a widely applied algorithm for the step.
Meanwhile, many methods also combine the steps of feature matching and transformation estimation (Levinson and Thrun 2013;Li et al. 2017;Zhu et al. 2020). They estimate the transformation relationship while looking for the feature correspondences.
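When feature matching yields 3D-3D correspondences, the SVD-based transformation estimation mentioned above takes the form of the classical Kabsch procedure. The sketch below is a generic illustration under that assumption, not a specific method from the literature; the names are our own.

```python
import numpy as np

def estimate_rigid_transform(P, Q):
    """Closed-form R, t minimizing sum ||R P_i + t - Q_i||^2 via SVD (Kabsch)."""
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mp).T @ (Q - mq)                  # 3x3 cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against a reflection solution
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mq - R @ mp
    return R, t

# Recover a known 90-degree yaw plus translation from noiseless matches
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
tz = np.array([0.5, -1.0, 2.0])
P = np.random.default_rng(1).normal(size=(20, 3))
Q = P @ Rz.T + tz
R_est, t_est = estimate_rigid_transform(P, Q)   # recovers Rz and tz
```

With real, noisy correspondences this closed-form step is typically wrapped in an outlier-rejection loop such as RANSAC; for 3D-2D correspondences, PnP-style solvers are used instead.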
In the following, we summarize typical features and the corresponding extraction methods for feature extraction, as well as commonly applied strategies for feature matching. Then we discuss how the transformation is estimated from the matched features.

Feature extraction
In the early stage, features in camera images and point clouds are specified by hand (Scaramuzza et al. 2007), which are often used in some manual methods for the LiDAR-camera calibration problem. With the development of computer vision and the requirement for automatic matching, many feature detection methods have been developed to extract unique and robust features from both images and point clouds.
There are a number of feature detectors for point clouds and images, respectively. In LiDAR-camera calibration, we need a pair of feature detectors, one for point clouds and one for images. In the following, we summarize the pairs that are commonly employed in existing feature based methods and denote them in the form of "point cloud feature extractor-image feature extractor". We collect these pairs into three categories, in which both features in the pair are geometric features, semantic features, and motion features, respectively. We first summarize the pairs of feature detectors for geometric features, where points of interest and edges are widely applied.
Points of interest are geometric features that are widely applied in LiDAR-camera calibration. A point of interest may have a special attribute that is significantly different from its neighbors, such as color or brightness. It may also have an explicit location in the image space, e.g., the intersection points of geometric edges (Willis and Sui 2009). Point features can be computed regularly and reliably to provide effective detection results. Besides points of interest, edges are another type of geometric feature that is widely applied in LiDAR-camera calibration. Edges in point clouds and images carry useful geometric information about the environment, especially in environments where point features become unstable (Yu et al. 2020).
• Depth discontinuity-Intensity difference Edges in LiDAR point clouds can be extracted using depth discontinuities. Specifically, these edges are recognized by calculating the differences in depth between neighboring points and discarding points whose differences fall below a pre-set threshold (Levinson and Thrun 2013). This idea has been widely applied in various edge extraction methods (Blaga and Nedevschi 2017; Banerjee et al. 2018; Munoz-Banon et al. 2020; Ma et al. 2021; Wang et al. 2018; Xu et al. 2019). It can be further extended by first generating a dense depth map by upsampling the point cloud, then identifying the edges from gradient changes in depth (Castorena et al. 2016). Meanwhile, edges in images can be extracted by detecting sharp changes in pixel intensity. It is often assumed that edges extracted by depth discontinuity in point clouds are in one-to-one correspondence with edges extracted by intensity difference in images.

• Depth discontinuity-Sobel operator Edges in LiDAR point clouds are again extracted by depth discontinuity. Meanwhile, edges in images are extracted by the Sobel operator (Sobel et al. n.d.), which detects edges based on changes in image grayscale. Specifically, the Sobel operator combines Gaussian smoothing and differentiation to compute an approximation of the gradient of the image intensity function. Besides the Sobel operator, the Canny edge detector (Canny 1987) and the LSD algorithm (von Gioi et al. 2012) also extract edges in images. In particular, the Canny detector uses a multi-stage algorithm to detect a wide range of edges, involving noise reduction, intensity gradient estimation, non-maximum suppression, and hysteresis thresholding, while LSD is an edge detection algorithm based on the gradient of the grayscale image. Therefore, Depth discontinuity-Canny detector and Depth discontinuity-LSD are also possible pairs for feature based methods.
• Depth continuity-Canny detector Edges in point clouds can be divided into two types, i.e., edges with depth discontinuity and edges with depth continuity. Specifically, depth-discontinuous edges are those whose depth values change dramatically w.r.t. their neighboring points, which often lie between foreground and background objects. In contrast, depth-continuous edges are those with continuously varying depth values, which tend to indicate planar intersection lines. These depth-continuous edges can be extracted from a dense point cloud, like the one generated by a solid-state LiDAR. In particular, these edges, i.e., plane intersection lines, can be extracted using point cloud voxel partitioning and plane fitting, which divides the point cloud into small voxels of given sizes and repeatedly uses RANSAC to fit and extract planes in these voxels. Meanwhile, edges in images can be extracted by the Canny detector.
• Depth continuity-L-CNN Bai et al. (2020) report that buildings often have sharp edges and explicit line textures, which can be easily extracted in both point clouds and images. Specifically, the planes of the corresponding buildings can be identified by various point cloud segmentation methods (Nurunnabi et al. 2012; Vo et al. 2015; Xu et al. 2015). Then the edges, i.e., plane intersection lines, in the point cloud can be conveniently obtained by a line detection algorithm based on these segmented 3D planes. On the other hand, an end-to-end neural model, named L-CNN, can be trained to output a vectorized wireframe that contains semantically significant and geometrically salient lines and junctions.
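The two most common geometric cues above, depth-discontinuity edges in a LiDAR range image and gradient edges in a camera image, can be sketched with plain numpy as follows; the threshold value and function names are hypothetical, and real pipelines operate on calibrated range images rather than toy arrays:

```python
import numpy as np

def depth_discontinuity_edges(range_img, thresh=0.5):
    """Flag pixels of a LiDAR range image whose depth differs from a
    horizontal neighbor by more than `thresh` (a hypothetical threshold)."""
    d = np.abs(np.diff(range_img, axis=1))
    edges = np.zeros_like(range_img, dtype=bool)
    edges[:, 1:] |= d > thresh      # pixel right of the jump
    edges[:, :-1] |= d > thresh     # pixel left of the jump
    return edges

def sobel_magnitude(gray):
    """Gradient magnitude of a grayscale image via the Sobel operator."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    H, W = gray.shape
    pad = np.pad(gray, 1, mode='edge')
    gx = np.zeros((H, W))
    gy = np.zeros((H, W))
    for i in range(3):              # correlate with the 3x3 kernels
        for j in range(3):
            patch = pad[i:i + H, j:j + W]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)
```

Thresholding the Sobel magnitude then gives the image edge map that is assumed to correspond to the depth-discontinuity edges.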
Notice that real-world environments often contain a large number of similar geometric features, which increases the difficulty of LiDAR-camera calibration. On the other hand, semantic features often reflect high-level characteristics that encode semantic-aware constraints of the environment, such as skylines (Hofmann et al. 2014), vehicles, and road lanes (Ma et al. 2021). This semantic information is consistent across data modalities, so it can be extracted from both LiDAR point clouds and camera images and used for LiDAR-camera calibration.
• Skyline-Skyline A skyline is a curve or contour between the sky and other objects in urban environments. This semantic feature is evident in both LiDAR point clouds and images and can be extracted for calibration. In the point cloud, the skyline can be obtained from the contour between the foreground and the sky, since LiDAR sensors only measure distances to objects and receive no response from the open sky (Hofmann et al. 2014). Other methods first generate a projected image of the point cloud, then identify the highest valid pixel in each column; such pixels are considered to lie on the skyline (Zhu et al. 2018). The skyline in an image can be determined from all world objects based on a given brightness threshold and alpha shapes (Edelsbrunner et al. 1983). Alternatively, exploiting the large difference in pixel values between the sky and other objects, along with prior information on the skyline's location in the image, the desired skyline points can be retrieved by a column-wise, top-to-bottom search for the first pixel with a jump in grayscale value. Besides geometric and semantic features for static elements in environments, motion features such as the trajectories of moving objects can also be used to calibrate multiple sensors.
• Object trajectory-Object trajectory Based on detection and tracking algorithms, we can obtain two estimated trajectories of a moving object, one from the LiDAR and one from the camera. These two trajectories should match as closely as possible, which provides the constraint for calibrating the LiDAR and camera (Peršić et al. 2020).
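The column-wise skyline searches described above can be sketched as follows; this is a toy illustration on synthetic arrays, and the jump threshold as well as the function names are hypothetical:

```python
import numpy as np

def skyline_from_projection(valid_mask):
    """Column-wise skyline from a projected LiDAR image: for each column,
    take the highest pixel (smallest row index) that received a LiDAR
    return. Columns with no return yield -1."""
    H, W = valid_mask.shape
    skyline = np.full(W, -1, dtype=int)
    for c in range(W):
        rows = np.nonzero(valid_mask[:, c])[0]
        if rows.size:
            skyline[c] = rows[0]
    return skyline

def skyline_from_image(gray, jump=0.3):
    """Column-wise skyline in a camera image: first top-to-bottom pixel
    where the grayscale value drops by more than `jump` (a hypothetical
    threshold), i.e., the transition from bright sky to darker objects."""
    H, W = gray.shape
    skyline = np.full(W, -1, dtype=int)
    for c in range(W):
        diffs = gray[:-1, c] - gray[1:, c]   # positive = getting darker
        idx = np.nonzero(diffs > jump)[0]
        if idx.size:
            skyline[c] = idx[0] + 1
    return skyline
```

The two skyline curves can then be aligned to constrain the extrinsic parameters.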

Feature matching strategies
Feature matching intends to establish the correspondences between the points in LiDAR point clouds and the pixels in images that are identified by feature extraction. Here we summarize popular feature matching strategies, which are specified for certain kinds of features by considering descriptor similarities and spatial geometric relationships, respectively.
• Descriptor similarity Descriptor similarity based matching is usually applied for geometric features, particularly points of interest. For each extracted feature point, a descriptor, i.e., a compact representation of the point's neighborhood, is computed. This strategy matches the feature points with the most similar descriptors between the image and the image projected from the point cloud, with Euclidean distance often used as the distance metric. Brute-force matching compares each feature against the entire reference feature set, which is computationally expensive; fast nearest-neighbor search methods can alleviate this cost. Since the established correspondences usually contain many incorrectly matched points, the random sample consensus algorithm (RANSAC) is typically applied to eliminate these incorrect pairs.
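A minimal sketch of nearest-neighbor descriptor matching with a ratio test (the ratio value is a hypothetical threshold) is given below; the ratio test discards ambiguous pairs before a RANSAC-style geometric check, and the brute-force search shown here is exactly what fast nearest-neighbor methods accelerate:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a to its Euclidean nearest neighbor in
    desc_b, keeping a match only when the best neighbor is clearly closer
    than the second best (Lowe-style ratio test)."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # brute-force distances
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches
```

The surviving matches would then be fed to RANSAC to estimate the transformation while rejecting the remaining outliers.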

Summary of feature based methods
In Table 2, we summarize feature based methods for LiDAR-camera calibration and group them by the categories of their features. Besides the methods listed in Table 2, there are also methods that use the combination of line features and depth information from both images and point clouds for calibration, rather than considering line features alone. These methods assume that the depth difference between the measured LiDAR data and the image should be minimized. The depth information for images can be obtained either from the point cloud projection (Castorena et al. 2016) or from monocular depth estimation (Vaida and Nedevschi 2019).
For the methods that extract lines using depth continuity in the point cloud, the point cloud is often segmented into voxels of uniform size (Yuan et al. 2021). In contrast, Liu et al. (2021) implement adaptive voxelization to dynamically segment the LiDAR point cloud into voxels of different sizes.

Table 2
Summary of feature based methods We list popular feature based methods in terms of feature type, feature extraction of point clouds and images, and feature matching strategy. In addition, we also list the complexity of the method, application environment, and whether it is open source

As an alternative method for extracting feature points, Nieto et al. (2010) automatically applied the SIFT extractor and matched the features by looking for the two closest features in the space of SIFT descriptors. As an alternative way to use semantic features, Zhu et al. (2020) applied semantic masks of vehicles in the image and constructed a height map to encourage LiDAR points to fall on the pixels labeled as vehicles. In this work, semantic segmentation is performed only on the image.

Ego-motion based methods
Ego-motion based methods exploit the motion of the sensors mounted on a traveling vehicle to estimate the extrinsic parameter. In this scope, some methods try to find the correspondence between the trajectories generated by the LiDAR and those by the camera, using LiDAR and visual odometry techniques or IMU and GNSS measurements (Taylor and Nieto 2015; Ishikawa et al. 2018; Park et al. 2020). There are also methods that use the structure from motion (SfM) approach to estimate the 3D structure from image sequences, thus converting the 3D-2D LiDAR-camera data registration into a 3D-3D case (Swart et al. 2011; Nagy et al. 2019a). According to how the ego-motion information between sensors is used, ego-motion based methods can be divided into hand-eye based and 3D structure estimation based ones.

Hand-eye based methods
The hand-eye calibration problem is a fundamental and critical issue in robot vision applications. It is the problem of determining the transformation between a robot base and a camera, in the case where the camera (the "eye") is mounted on an arm (the "hand") of the robot, or fixed elsewhere other than the arm. The mathematical formulation of this problem takes the form AX = XB, where A and B describe the motions of the arm and the camera respectively, and X is the desired unknown transformation matrix. Methods discussed in this section extend traditional hand-eye calibration to the LiDAR-camera calibration problem, since the two rigidly mounted sensors satisfy the same motion constraints.
Given the following notation: T, the transformation between the LiDAR sensor and the camera; T_L^i and T_C^i, the motions of the LiDAR and the camera between frames i and i+1. The extrinsic parameter between a LiDAR sensor and a camera can then be formulated by the hand-eye calibration:

T_C^i T = T T_L^i. (5)

A depiction of the hand-eye calibration problem is shown in Fig. 6. The hand-eye based LiDAR-camera calibration procedure can be roughly split into three stages:

Estimation of each sensor's motion
In the first stage, the state transformation matrices for the LiDAR and the camera, i.e., T_L^i and T_C^i, are estimated, accounting for the rotation and translation between neighboring frames of each sensor. For the LiDAR, Iterative Closest Point (ICP) and LiDAR odometry are popular algorithms to compute T_L^i (Taylor and Nieto 2014; Shi et al. 2019), while for the camera, SfM and visual odometry are commonly used to find T_C^i (Taylor and Nieto 2015; Park et al. 2020).

Estimation of the extrinsic parameter Since the motion of each sensor is estimated independently, the transformation between the LiDAR sensor and the camera can be obtained by solving the homogeneous equation defined by Eq. (5).
Solutions to the transformation equation can be categorized based on whether the rotation and translation parameters are estimated separately or simultaneously. In a hand-eye based extrinsic calibration problem, the separated solution is frequently used due to its simplicity (Taylor and Nieto 2015).
T_C^i, T_L^i, and T are 4 × 4 transformation matrices, each composed of a rotation part and a translation part; in particular, the matrix T can be divided into the rotation R_T and the translation t_T. Thus, Eq. (5) yields the following two equations. First, the rotation R_T is determined by

R_C^i R_T = R_T R_L^i. (6)

Once R_T is known, Eq. (7) becomes linear and t_T can then be calculated from

(R_C^i − I) t_T = R_T t_L^i − t_C^i. (7)

Refinement of extrinsic parameter In hand-eye based methods, the extrinsic parameter is usually initialized by solving the homogeneous transformation equation. However, deviations in the motion estimation can affect the calibration results and lead to inaccuracies (Taylor and Nieto 2015). The appearance information in the surroundings, such as geometric edge alignment, can be useful to reduce such errors. Liao and Liu (2019) utilize the line features in both the image and the point cloud to refine the calibration parameter by feature matching. A typical pipeline of a hand-eye based method is shown in Fig. 7.

Fig. 6 (a) The standard hand-eye calibration problem. The camera "eye" is mounted on the robot gripper "hand", and the robot performs a series of movements. The transformation between the camera and the gripper is calculated by solving the equation AX = XB. (b) LiDAR-camera calibration formulated as the hand-eye calibration problem. The two sensors are mounted on the vehicle. As the carrier vehicle moves, each sensor's motion is estimated, and the extrinsic parameter between the two sensors is determined by the same equation.
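A minimal sketch of such a separated solution is shown below: the rotation is recovered by aligning the rotation axes of the paired motions (a Kabsch step, in the spirit of classical Tsai-Lenz-style schemes), and the translation follows from stacking the linear equations. The function and variable names are illustrative, and this is not the exact algorithm of any surveyed paper:

```python
import numpy as np

def rot_log(R):
    """Axis-angle vector (log map) of a rotation matrix."""
    angle = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if angle < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2 * np.sin(angle))
    return angle * axis

def hand_eye_calibrate(motions_cam, motions_lidar):
    """Separated solution of T_C^i T = T T_L^i from lists of per-sensor
    (R, t) motions between paired frames.

    Rotation: R_C^i R_T = R_T R_L^i implies the rotation axes satisfy
    a_C^i = R_T a_L^i, so R_T is found by aligning the two axis sets.
    Translation: stack (R_C^i - I) t_T = R_T t_L^i - t_C^i and solve by
    linear least squares.
    """
    A = np.array([rot_log(Rc) for Rc, _ in motions_cam])
    B = np.array([rot_log(Rl) for Rl, _ in motions_lidar])
    U, _, Vt = np.linalg.svd(B.T @ A)                 # align axis sets
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R_T = Vt.T @ D @ U.T
    M, v = [], []
    for (Rc, tc), (_, tl) in zip(motions_cam, motions_lidar):
        M.append(Rc - np.eye(3))
        v.append(R_T @ tl - tc)
    t_T, *_ = np.linalg.lstsq(np.vstack(M), np.concatenate(v), rcond=None)
    return R_T, t_T
```

At least two motions with non-parallel rotation axes are required, reflecting the observability constraints discussed later in this section.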
To further discuss the above ego-motion based calibration pipeline, we focus on the choice of algorithms in each step. For the first, motion-estimation step, we introduce the widely used LiDAR and camera motion estimation algorithms (Besl and McKay 1992; Zhang and Singh 2014; Mur-Artal et al. 2015). For the second, equation-solving step, once the rotation parameter R_T is given, Eq. (7) for the translation parameter t_T becomes a linear equation that can be solved straightforwardly; we therefore focus on how R_T is parameterized in the solution. For the final, calibration-refinement step, we present exactly what kinds of appearance information are involved.
To calculate the extrinsic parameter, we first estimate the poses of the LiDAR and the camera respectively for each pair of data frames. Various methods are applied depending on the type of sensor. We summarize several popular methods for sensor motion estimation in hand-eye based calibration tasks.
• LiDAR motion estimation The Iterative Closest Point (ICP) algorithm (Besl and McKay 1992) is a classical approach for point cloud registration. It iteratively queries the closest points between two point clouds and minimizes the distance between the corresponding points; the output is a rigid transformation that aligns the two point clouds. Several variants of ICP have been developed for both point clouds and images (Oishi et al. 2005; Pomerleau et al. 2013). The motion of LiDAR sensors can also be estimated by LiDAR odometry methods (Shi et al. 2019; Park et al. 2020). For example, LOAM (Zhang and Singh 2014) is a simple and efficient algorithm for this task, which matches corresponding feature edges and planes; from each trajectory, a set of relative transformations is extracted and utilized for extrinsic calibration. • Camera motion estimation Using the SfM approach, a set of transformations that describe the movement of the camera can be calculated, up to a scale ambiguity (Ullman 1979). Given 2D images, SfM estimates the camera poses and retrieves a sparse reconstruction simultaneously. The camera motion transformations can also be found using a standard visual odometry approach, which estimates the motion of a camera in real time from sequential images (i.e., its ego-motion). As an example, ORB-SLAM (Mur-Artal et al. 2015) is a feature-based monocular simultaneous localization and mapping (SLAM) system that is frequently adopted (Shi et al. 2019; Liao and Liu 2019). Note that motion estimation based purely on visual cues suffers from scale ambiguity and requires additional methods to estimate the scale (Taylor and Nieto 2016; Ishikawa et al. 2018).

Fig. 7 The hand-eye based LiDAR-camera calibration procedure can be roughly divided into three stages: the estimation of each sensor's ego-motion, the estimation of the transformation according to AX = XB, and the refinement of the estimated transformation. In the refinement stage, we use the line feature alignment method as an example.
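The core ICP loop used for LiDAR motion estimation can be sketched in a few lines; this toy version uses brute-force nearest neighbors and a Kabsch step, whereas practical implementations add outlier rejection, subsampling, and convergence tests:

```python
import numpy as np

def icp(src, dst, iters=20):
    """Minimal point-to-point ICP: alternately find closest-point
    correspondences and solve the rigid transform by SVD.
    Returns R, t such that dst ≈ R @ src + t."""
    R, t = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # brute-force nearest neighbors (fine for a toy example)
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        nn = dst[d2.argmin(axis=1)]
        # Kabsch step on the current correspondences
        cs, cn = cur.mean(0), nn.mean(0)
        H = (cur - cs).T @ (nn - cn)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        dR = Vt.T @ D @ U.T
        dt = cn - dR @ cs
        cur = cur @ dR.T + dt
        R, t = dR @ R, dR @ t + dt   # accumulate the incremental transform
    return R, t
```

Applied to consecutive LiDAR scans, each recovered (R, t) pair is one of the per-frame motions T_L^i used in the hand-eye equation.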
As mentioned above, Eq. (7) can be solved as a linear equation for the translation parameter t_T once the rotation parameter R_T is known. Here we focus on different parameterization techniques for R_T, including the rotation matrix, angle-axis (Shiu and Ahmad 1989), Lie algebra (Park and Martin 1994), and quaternions (Chou and Kamel 1991).
• Rotation matrix A rotation matrix is a 3 × 3 matrix. Although not as compact as other representations, it uniquely defines a 3D rotation. Park et al. (2020) found the rotation matrix of the equation by decomposing the covariance matrix of camera-LiDAR relative poses, after aligning the correspondences in the continuous-time trajectories of the sensors. • Angle-axis The angle-axis representation parameterizes a rotation by two quantities: a unit vector, i.e., the rotation axis, and an angle indicating the magnitude of the rotation about this axis. In solving the homogeneous transformation equation for the rotation parameter, the use of an angle-axis representation can simplify the process (Taylor and Nieto 2016). • Lie algebra The rotation parameters can also be expressed in the form of Lie algebras (Xu et al. 2019), which is convenient for optimization problems. In this form, the extrinsic parameter is specified by a vector with 6 degrees of freedom (DoF): a rotation vector r = (r1, r2, r3) and a translation vector t = (x, y, z). • Quaternion A quaternion provides a simple and unique representation for describing finite rotations in 3D space. Liao and Liu (2019) represented the rotation with a quaternion, which reduces the number of rotation variables from nine to four. Given the rotation, the translation parameter can be found by solving the linear equation (7).
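As a small illustration of the quaternion parameterization, the standard conversion from a unit quaternion to a rotation matrix shows how four parameters replace the nine entries of the matrix:

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z); the quaternion is
    normalized first, so any non-zero 4-vector is accepted."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
```

During optimization, the unit-norm constraint on the quaternion is what keeps the four parameters equivalent to a valid rotation.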
Several methods based on environmental information have been reported to be useful for refining the calibration between LiDAR and camera, such as aligning edges (Levinson and Thrun 2013) or correlating the data intensity of the two modalities (Pandey et al. 2012). There are also methods that continuously optimize the estimates of the camera motion and the extrinsic parameter alternately via sensor fusion odometry (Ishikawa et al. 2018).
• Edge alignment Line features in natural scenes can be used for optimizing the extrinsic parameter (Taylor and Nieto 2014;Liao and Liu 2019). The correspondence between 3D lines in point clouds and 2D lines in images can be derived from the line-to-line constraints, thus refining the results obtained from the motion estimation. • Intensity matching An intensity alignment approach based on the statistical dependence measure can also be used to further refine extrinsic parameters. Shi et al. (2019) aligned the LiDAR reflectivity with the camera image intensity through the metric of mutual information. The hypothesis for this matching is that the LiDAR reflectivity is usually similar to the image intensity in the environment.
• Depth matching The correspondence between the depth images generated from the LiDAR and the camera respectively can also be used to optimize the extrinsic parameter (Xu et al. 2019). The LiDAR depth map is created by projecting the LiDAR point cloud with the initial extrinsic parameter, and the camera depth map is produced by monocular depth estimation. The principle is that an arbitrary point in the LiDAR depth map is bound to the pixel in the camera depth map at the same pixel coordinates, and their depth values should be identical. • Color matching This refinement method assumes that the points in the point cloud have the same color as in the camera images in two consecutive frames (Taylor and Nieto 2016). It first projects points onto the image to obtain the corresponding colors of the local pixels; then the same points are projected onto the next image frame, with the time offset compensated by the estimated motion information. By minimizing the average difference between the colors of the points in the current and previous frames, a more accurate extrinsic parameter can be obtained. • 3D-2D point matching Park et al. (2020) refined the extrinsic parameter by reducing the 3D-2D projection error. In their work, the 3D coordinates of 2D features are computed by triangulation instead of taken directly from the 3D LiDAR points. After the 3D-2D projection is performed with the LiDAR-camera extrinsic parameter, the result is improved using non-linear optimization.
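The depth-matching criterion above can be sketched as a residual that a refinement loop would minimize over the extrinsic parameter; the projection model is the standard pinhole one, and the function names are illustrative:

```python
import numpy as np

def project_points(points, T, K):
    """Project 3D LiDAR points into the image with extrinsic T (4x4) and
    intrinsic K (3x3); returns pixel coordinates and camera-frame depths."""
    p_cam = points @ T[:3, :3].T + T[:3, 3]
    z = p_cam[:, 2]
    uv = (p_cam @ K.T)[:, :2] / z[:, None]
    return uv, z

def depth_matching_residual(points, T, K, cam_depth):
    """Mean absolute difference between projected LiDAR depths and a camera
    depth map (e.g., from monocular depth estimation) at the same pixels.
    Points projecting outside the image or behind the camera are ignored."""
    uv, z = project_points(points, T, K)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    H, W = cam_depth.shape
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    if not ok.any():
        return np.inf
    return np.abs(z[ok] - cam_depth[v[ok], u[ok]]).mean()
```

A refinement step would perturb T (e.g., via a local search or non-linear optimizer) and keep the value with the smallest residual.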
We summarize hand-eye based methods for LiDAR-camera calibration in Table 3.

3D structure estimation based methods
Another way to perform LiDAR-camera calibration based on motion information is to estimate the 3D structure of the surrounding environment from images; one of the most commonly used techniques is structure from motion (SfM) (Ullman 1979). SfM estimates the 3D structure of a scene from 2D image sequences and has been applied on many occasions, such as 3D modeling, augmented reality, and visual SLAM. 3D structure estimation based approaches use SfM to generate 3D point clouds from a set of images recorded by the camera on the moving vehicle, which converts the LiDAR-camera calibration problem into a registration task in the 3D domain. Swart et al. (2011) described an approach to register panoramic images and LiDAR point clouds. They generate a sparse 3D point cloud from images and match it to the dense 3D LiDAR point cloud using a non-rigid ICP process; the results are then polished by incorporating SIFT interest points into the framework. Moussa et al. (2012) proposed a bundle block adjustment method to determine accurate 3D-3D correspondences. Corsini et al. (2012) divided the calibration into coarse and fine-grained alignment procedures. After obtaining intermediate results by applying the ICP algorithm to the LiDAR-generated point clouds, they use a global refinement method based on mutual information to improve the accuracy of the fine 2D-3D alignment. Wang et al. (2018) used sequential scene information from the vehicle motion to obtain the initial extrinsic parameter. The method uses the SfM algorithm to calculate 3D points from 2D image sequences, and registers the SfM points with LiDAR points through the ICP algorithm to estimate a preliminary result. Then, by projecting the 3D LiDAR points onto the 2D image plane, they use feature points of edges with a combined optimization method to further improve the accuracy of the extrinsic parameter.

Table 3
Summary of hand-eye based methods We list the differences between hand-eye based methods in terms of the estimation method of the motion trajectory, rotation parameterization, and refinement strategy. In addition, we also list the complexity of the method, application environment, and whether it is open source

However, the ICP algorithm may fail when the density of the SfM point cloud is very different from that of the LiDAR one. To address this challenge, Li et al. (2018) designed an automatic registration method based on semantic features extracted from panoramic images and point clouds. They use GPS and IMU to aid the SfM algorithm in obtaining the rotation parameters, then extract parked vehicles from the two modalities and estimate the translation parameters by maximizing the overlapping area of corresponding target pairs. Nagy et al. (2019a) proposed an extrinsic calibration method with object-level registration. First, they use SfM to generate point clouds from consecutive camera images for alignment and registration; then they introduce a target-level alignment between the generated and the LiDAR point clouds based on object detection results. Nagy et al. (2019b) introduced similar work and used semantic information in the point cloud registration stage.
Later, Nagy and Benedek (2020) extended their previous work, mainly in terms of optimization of the registration stage. They diminish the registration error using point-level ICP after the object-level registration step, then introduce a curve-based non-rigid point cloud registration refinement step built on a non-uniform rational basis spline (NURBS) approximation.

Other methods
Besides generating 3D point clouds through the motion trajectory of the sensors or recovering 3D structures from image sequences, there are alternative motion based ideas for solving the LiDAR-camera calibration problem. Bileschi (2009) made an early attempt to associate video streams with LiDAR data from a moving vehicle. The initial calibration parameter is obtained with the help of the IMU motion signal and then refined by matching 2D and 3D contours in camera images and LiDAR point clouds. Chien et al. (2016) developed a LiDAR-engaged visual odometry framework and embedded the ego-motion estimation problem into LiDAR-camera calibration. Their approach is based on the observation that the performance of the estimated ego-motion is directly related to the quality of the extrinsic parameters: if the extrinsic parameter deviates far from the ground truth, then the ego-motion estimation also loses effectiveness. Combining the ego-motion estimation problem with LiDAR-camera calibration forms a bi-level optimization structure, and the method introduces data constraints such as intensity and discontinuity restrictions to solve such a problem.
Under the Gaussian noise assumption, Huang and Stachniss (2017) applied the Gauss-Helmert model to multi-sensor extrinsic calibration. Given constraints between the motions of the individual sensors, they jointly optimize the extrinsic parameter and reduce the pose observation error using the Gauss-Helmert paradigm. Castorena et al. (2020) proposed a motion-guided method for automatic calibration of the two multi-modal sensors. With a sequence of time-synchronized point clouds from the LiDAR and the corresponding images from the camera, they compute the motion vector for each modality independently, then estimate the extrinsic parameter.
When using sensor movement information for extrinsic calibration, the motion must satisfy constraints such as translating in all directions and rotating around all axes. If the sensors are mounted on a mobile robot performing planar motions, some parameters become unobservable. Zuniga-Noel et al. (2019) estimated the extrinsic parameter of multiple heterogeneous sensors mounted on a mobile robot subject to such movements. The method computes the 2D parameters (x, y, yaw) from the sensors' incremental motions, and uses observations of the ground plane to estimate the remaining three parameters (z, pitch, roll). Horn et al. (2021) used dual quaternions (DQs) to represent translation and rotation with fewer parameters. Based on DQs, they confine the optimization to planar calibration only, and combine a fast local and a global optimization approach to estimate the result.

Summary of ego-motion based methods
Ego-motion based methods use the motion of the sensors estimated from LiDAR and camera data sequences. Hand-eye based methods require neither initial calibration parameters nor an overlapping field of view, while 3D structure estimation based methods turn the 2D-3D registration problem into a 3D-3D registration problem. However, the accuracy of the motion estimation for the sensors often affects the performance of these methods.

Learning based methods
Recently, deep learning has made breakthroughs in automatic feature engineering and achieved excellent performance on multiple tasks, such as detection in images and in LiDAR point clouds. Learning based methods require no artificial definition of features; they learn useful representations with neural networks. These methods can also be applied to LiDAR-camera calibration.

End-to-end methods
End-to-end methods use network models to process the input camera images and LiDAR point clouds, then directly output the extrinsic parameters. These methods achieve optimal calibration parameters by minimizing corresponding loss functions. End-to-end methods rely heavily on the training data: in the training phase, pairs of point clouds and images accompanied by ground truth extrinsic parameters are fed to the model. However, collecting ground truth for hundreds of thousands of different relative positions of laser scanners and cameras would be prohibitively laborious. Therefore, Schneider et al. (2017) reformulated the problem as determining the mis-calibration T_miscalib between the initial calibration parameter T_init and the ground truth parameter T_gt. With the mis-calibrated extrinsic parameter T_init and the camera matrix K, the LiDAR points are projected into the camera frame as depth images. The mis-calibration T_miscalib can be varied randomly to generate a huge amount of training data.
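A sketch of how such random mis-calibrations might be sampled is given below; the perturbation ranges and the composition convention are illustrative assumptions, not the exact recipe of Schneider et al. (2017):

```python
import numpy as np

def random_miscalibration(max_angle=0.1, max_trans=0.2, rng=None):
    """Sample a random 4x4 perturbation (rotation up to `max_angle` rad,
    translation up to `max_trans` m; hypothetical ranges). Composing it with
    the ground-truth extrinsic gives a mis-calibrated initial estimate, so
    one (point cloud, image, T_gt) tuple yields unlimited training samples."""
    if rng is None:
        rng = np.random.default_rng()
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    a = rng.uniform(-max_angle, max_angle)
    Kx = np.array([[0, -axis[2], axis[1]],
                   [axis[2], 0, -axis[0]],
                   [-axis[1], axis[0], 0]])
    # Rodrigues' formula: R = I + sin(a) K + (1 - cos(a)) K^2
    R = np.eye(3) + np.sin(a) * Kx + (1 - np.cos(a)) * (Kx @ Kx)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = rng.uniform(-max_trans, max_trans, size=3)
    return T
```

The network then sees the depth image rendered with the perturbed extrinsic and regresses the perturbation itself as the training target.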
For end-to-end methods, the network architectures can be classified into three categories. Regression: Methods in this category take RGB images and depth images as inputs. Their networks often have two branches that extract features from the RGB and depth images respectively. The features from both modalities are then fused by a feature matching component. Finally, the global information extracted from both modalities is regressed to obtain the mis-calibration parameters. The common architecture of regression methods is shown in Fig. 8.
RegNet (Schneider et al. 2017) is one of the first deep learning methods to integrate feature extraction, feature matching, and global regression into a single convolutional neural network for estimating the extrinsic parameters between the LiDAR and the camera. In RegNet, blocks of Network in Network (NiN) (Lin et al. 2013) are used to extract and match the features of LiDAR depth maps and camera RGB images.
Building on RegNet, an online calibration method for visual and depth sensors was also presented. The depth camera and the LiDAR are first calibrated and fused as a virtual depth sensor, and this virtual sensor is then calibrated with the camera. Iyer et al. (2018) proposed CalibNet, which takes geometric information into account and introduces a 3D spatial transformer layer into the model. The RGB branch consists of the convolutional layers of a pre-trained ResNet-18 (He et al. 2016), and the depth branch is a similar network with the number of filters halved. The two outputs are concatenated and passed through a global aggregation block. CalibNet performs end-to-end training by maximizing the geometric and photometric consistency between the image and the point cloud. Yuan et al. (2020) designed RGGNet, which considers Riemannian geometry and employs deep generative models to build a tolerance-aware loss function. RGGNet not only considers the calibration error but also the tolerance within the error bounds. Shi et al. (2020) created CalibRCNN, which combines a CNN with an LSTM. The output features from the two branches are fused and then fed into the LSTM layer to extract temporal features for sequential learning. CalibRCNN not only adds pose constraints between consecutive frames but also uses geometric and photometric losses to refine the calibration accuracy of the predicted transformation parameters. Zhao et al. (2021) proposed CalibDNN and applied it to a complex dataset with diverse scenarios. As a simple system with one model and a single iteration, CalibDNN considers a transformation loss and a geometric loss to maximize the consistency of the multimodal data.
Fig. 8 The common architecture of regression methods for the estimation of the extrinsic calibration parameters for LiDAR-camera calibration. The point cloud is projected to the image plane using an initial calibration T_init. The RGB and depth branches extract the features for matching separately, and then the features are matched in the second part. Lastly, the regression layer regresses the mis-calibration parameters by gathering global information.
Lv et al. (2021) presented LCCNet for the extrinsic calibration of a LiDAR and a camera. To match the features between the depth image and the RGB image, a cost volume layer is constructed instead of concatenating the features directly. In addition to the smooth L1 loss as supervision against the ground truth, a point cloud constraint is also added to the loss function.
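A loss of this shape, parameter supervision plus a point cloud consistency term, can be sketched as follows; the weighting lam and the exact point distance are illustrative assumptions, not LCCNet's precise formulation:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss on the predicted calibration parameters."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta).mean()

def point_cloud_loss(points, T_pred, T_gt):
    """Mean distance between the same points transformed by the predicted
    and the ground-truth extrinsics (illustrative point cloud constraint)."""
    pts = np.hstack([points, np.ones((len(points), 1))])
    diff = (T_pred @ pts.T) - (T_gt @ pts.T)
    return np.linalg.norm(diff[:3], axis=0).mean()

def total_loss(pred_params, gt_params, points, T_pred, T_gt, lam=0.5):
    # lam is a hypothetical weighting between the two terms
    return smooth_l1(pred_params, gt_params) + lam * point_cloud_loss(points, T_pred, T_gt)
```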
Calibration Flow: The concept of optical flow refers to the movement of target pixels between two consecutive frames due to the behavior of objects or the motion of the camera. The calibration flow is similar to the optical flow: it has two channels representing horizontal and vertical offsets. Methods in this category take 2D images and LiDAR depth maps as inputs. The inputs from the two modalities are fed into an optical-flow-style network to predict the flow between the mis-calibrated depth map and the RGB image, yielding correspondences between point cloud points and image pixels. Finally, the initial extrinsic parameters can be optimized by minimizing the projection errors. Lv et al. (2021) showcased CFNet, which generates a refined calibration flow. A group of accurate 2D-3D correspondences can then be constructed, and the EPnP algorithm with a RANSAC scheme is applied to estimate the extrinsic parameters. Jing et al. (2022) presented DXQ-Net, which predicts the calibration flow with uncertainty. Its network architecture is derived from RAFT (Teed and Deng 2021), and a differentiable pose estimation module computes the extrinsic parameters.
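The refinement step, applying the predicted flow to the initially projected pixels and scoring candidate extrinsics by reprojection error, can be sketched as follows (the function names and the error metric are illustrative; CFNet itself feeds the correspondences to EPnP with RANSAC):

```python
import numpy as np

def correspondences_from_flow(points_3d, uv_init, flow):
    """Apply a predicted calibration flow (per-point horizontal/vertical
    offsets) to the initially projected pixels, yielding refined 2D-3D
    correspondences for a PnP solver."""
    uv_refined = uv_init + flow            # (N, 2) pixel offsets from the flow
    return points_3d, uv_refined

def reprojection_error(points_3d, uv, T, K):
    """Mean pixel error of 3D points projected with candidate extrinsics T."""
    pts = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    cam = (T @ pts.T).T[:, :3]
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]      # perspective divide
    return np.linalg.norm(proj - uv, axis=1).mean()
```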
Keypoints: Unlike the end-to-end learning methods in the above two categories, keypoint methods directly take raw point clouds as inputs, along with camera images. The network extracts feature descriptors from the input data and then finds, for each 3D keypoint, the corresponding 2D point in the image. Finally, the extrinsic parameters between the LiDAR and the camera can be estimated. Ye et al. (2022) offered the RGKCNet model, a keypoint-based 2D-3D pose estimation network. This network extracts sparse keypoints and matches them, after which a weighted nonlinear PnP solver estimates the pose. RGKCNet uses extrinsic calibration constraints to solve the data association problem of 2D-3D points, and the optimizer in the network is based on geometric constraints.
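The 2D-3D association step can be illustrated with a generic mutual-nearest-neighbour check on descriptor similarity (a hypothetical stand-in for RGKCNet's learned matcher):

```python
import numpy as np

def match_descriptors(desc_3d, desc_2d):
    """Match 3D keypoint descriptors to 2D keypoint descriptors using
    cosine similarity with a mutual nearest-neighbour check."""
    a = desc_3d / np.linalg.norm(desc_3d, axis=1, keepdims=True)
    b = desc_2d / np.linalg.norm(desc_2d, axis=1, keepdims=True)
    sim = a @ b.T                        # (N3, N2) cosine-similarity matrix
    nn_12 = sim.argmax(axis=1)           # best 2D match per 3D keypoint
    nn_21 = sim.argmax(axis=0)           # best 3D match per 2D keypoint
    # keep only pairs that agree in both directions
    return [(i, j) for i, j in enumerate(nn_12) if nn_21[j] == i]
```

The surviving (3D index, 2D index) pairs would then be passed to a PnP solver to recover the extrinsic parameters.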

Hybrid-learning methods
Different from end-to-end methods, hybrid-learning methods use neural networks only to extract information such as geometric and semantic features, while feature association and extrinsic parameter calculation procedures are still based on non-learning methods.
As introduced in Sect. 3.2, SOIC introduces semantic centroids to ease the demand for prior knowledge of the initial calibration. In SOIC, 2D and 3D semantic centroids are calculated from the semantic segmentations of the images and the LiDAR points, so the LiDAR-camera calibration initialization is transformed into a PnP problem. The optimal calibration parameters are then obtained by minimizing a cost function based on the semantic elements. Zhu et al. (2020) suggested aligning semantic features instead of edge features to improve the robustness of LiDAR-camera calibration, especially for low-resolution LiDARs and noisy inputs. They extracted cars from both the image and the point cloud, and optimized the extrinsic calibration through a cost function under semantic constraints.
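The semantic-centroid construction can be sketched as follows; the function name and input layout are illustrative assumptions:

```python
import numpy as np

def semantic_centroids(points_3d, labels_3d, pixels_2d, labels_2d):
    """Build 2D-3D correspondences from per-class centroids of the
    semantic segmentations of the point cloud and the image."""
    pts3, pts2 = [], []
    for c in sorted(set(labels_3d) & set(labels_2d)):   # classes seen by both
        pts3.append(points_3d[labels_3d == c].mean(axis=0))  # 3D centroid
        pts2.append(pixels_2d[labels_2d == c].mean(axis=0))  # 2D centroid
    return np.array(pts3), np.array(pts2)
```

Each (3D centroid, 2D centroid) pair forms one correspondence for the PnP initialization described above.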

Summary of learning based methods
Learning based methods use neural networks to discover latent features of LiDAR and camera data; given sufficient training data, they can obtain suitable features and achieve good results. However, existing learning algorithms for calibration usually require a large amount of training computation, which incurs considerable computational cost. Moreover, they impose demanding application conditions: the algorithms need broadly similar scenes for training and validation/testing, so their generalization performance urgently needs to be improved.

Discussion
This paper provides a systematic review of automatic targetless methods for extrinsic calibration between LiDAR sensors and cameras. In the literature, current targetless LiDAR-camera calibration paradigms can be divided into four categories, i.e., information theory based methods, feature based methods, ego-motion based methods, and learning based methods.
Specifically, information theory based methods evaluate the statistical similarity of the data from the LiDAR and the camera, and obtain precise calibration parameters by maximizing a similarity measure. However, their accuracy is susceptible to environmental factors such as occlusion between objects or the presence of shadows (Pandey et al. 2012; Parmehr et al. 2014).
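As a concrete example of such a similarity measure, the mutual information between LiDAR reflectivity values and the image intensities at their projected pixel locations can be computed as follows (the histogram binning is an illustrative choice); calibration then amounts to maximizing this score over candidate extrinsics:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Mutual information between two co-registered value arrays, e.g.
    LiDAR reflectivities and the image intensities at the pixels they
    project to under a candidate extrinsic calibration."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                      # joint distribution
    px = pxy.sum(axis=1, keepdims=True)            # marginal of x
    py = pxy.sum(axis=0, keepdims=True)            # marginal of y
    nz = pxy > 0                                   # avoid log(0)
    return (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()
```

A correct calibration aligns reflective surfaces with their image appearance, so the joint histogram concentrates and the mutual information peaks.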
Feature based methods extract information from the natural environment and find correspondences between images and point clouds through a feature matching phase. Distinguishable features are available in both LiDAR data and optical images, and can be separated into geometric, semantic, and motion features. Geometric features such as points and lines are easy to extract (González-Aguilera et al. 2009; Zhang et al. 2021), while semantic features are more discriminative and simpler to match. However, LiDAR data and optical images often capture different characteristics of the environment, and feature extraction can easily be affected by random factors such as noise and occlusion. Some methods use the motion trajectory of a detected object as a motion feature. This introduces dynamic information that allows for temporal calibration; however, such methods require many moving objects to generate sufficient trajectories to track (Peršić et al. 2020).
Comparing information theory based methods with feature based methods, the former operate on the entire 2D-3D data, which avoids the problem of unstable feature extraction and matching and yields alignment information over the whole data. The latter use features extracted from the 2D and 3D data, which are more discriminative and lead to a simpler optimization.
Ego-motion based methods exploit the motion information generated by the two sensors. These methods can be divided into hand-eye methods and 3D structure estimation based methods, according to how the motion information transforms calibration into different problems. Using the trajectories of the LiDAR sensor and the camera, hand-eye methods can proceed without an initial guess for the extrinsic parameters. They require no overlapping field of view (Park et al. 2020), as they neither extract features nor compute the statistical similarity of corresponding attributes. However, their accuracy strongly depends on the quality of the per-sensor motion estimates, and the results usually need to be refined by other methods (Shi et al. 2019). Since ego-motion based methods introduce dynamic information, they also need to solve the problem of time synchronization.
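The rotational part of the hand-eye formulation A_i X = X B_i can be sketched as follows: the rotation axes of corresponding relative motions of the two sensors are related by the unknown sensor-to-sensor rotation, which a Kabsch/SVD alignment recovers. This is a minimal noise-free sketch, not a complete hand-eye solver; translation recovery and refinement are omitted:

```python
import numpy as np

def hand_eye_rotation(axes_a, axes_b):
    """Recover the rotation R between two sensors from corresponding
    rotation axes of their relative motions (A_i X = X B_i implies
    axis(A_i) = R @ axis(B_i)); solved via Kabsch/SVD alignment.

    axes_a, axes_b : (N, 3) unit rotation axes, N >= 2 non-parallel.
    """
    H = axes_b.T @ axes_a                          # (3, 3) cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # enforce a proper rotation (det = +1, no reflection)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])
    return Vt.T @ D @ U.T
```

In practice the axes come from per-sensor odometry, so noise and degenerate (near-parallel) motions make the subsequent refinement step essential.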
The most typical learning based methods are end-to-end methods. End-to-end approaches collapse several calibration steps into a single step using neural network models, which learn useful features by themselves instead of relying on hand-defined features. With the help of high-performance neural networks, these methods can achieve satisfactory calibration results. However, datasets for calibration are difficult to obtain (Schneider et al. 2017). End-to-end methods rely heavily on labeled data for training, and their performance is often unstable in unseen environments. There are also hybrid approaches that use semantic segmentation networks to extract more robust features while relying on classical algorithms for the subsequent matching and optimization. We list the strengths and problems of the four categories of methods in Table 4.
However, achieving accurate and robust automatic targetless LiDAR-camera calibration across different types of scenarios remains a challenge for future efforts. Each method has scenarios to which it is and is not applicable, and designing a method that works in indoor, urban, and natural environments alike is a challenging task. For online calibration, some questions remain unanswered: how can drift in the calibration parameters be detected quickly? At what rate should the calibration be updated? How much does drift in the calibration parameters affect the perception results? At the same time, an ideal calibration solution should run on various platforms regardless of their computational constraints, so balancing accuracy, efficiency, and resource consumption is also a difficult task.
Hybrid methods provide a promising way to improve performance. For example, combining hand-crafted and learning based methods can exploit deep learning capabilities while retaining theoretical modeling, and such combinations can reduce computational cost while maintaining acceptable performance. Another direction is the combination with SLAM technology and the integration of other sensors such as IMUs: SLAM pipelines involve very similar feature extraction and matching steps, and additional sensors can bring complementary information to the calibration task. For learning based methods, the successful application of semi-supervised or unsupervised learning would be helpful, given the hard-to-obtain nature of ground-truth calibration parameters.

Conclusion
This paper reviews the existing algorithms for automatic targetless calibration between LiDARs and cameras. Unmanned intelligent perception systems are usually equipped with a combination of LiDAR sensors and cameras, exploiting the advantages of both to better perceive the surrounding environment. A key preliminary step of data fusion is to calibrate the extrinsic parameters between the sensors. Traditional methods either rely on calibration objects or require manual interaction. Automatic targetless methods spontaneously obtain information about the surrounding environment from the data, thus eliminating the need for calibration targets and human effort. Current automatic targetless LiDAR-camera calibration methods can be categorized into four categories, i.e., information theory based, feature based, ego-motion based, and learning based methods. Methods in the first category measure the statistical similarity between the LiDAR data and the optical images. Feature based methods extract geometric or semantic features of the environment instead of operating on the entire 2D-3D data. Ego-motion based methods exploit the motion of the sensors estimated from LiDAR point cloud and camera image sequences. Finally, learning based methods use neural network models to learn useful features rather than define features manually.