A Data Quality-Aware Unified Dense Depth Fusion Method with LiDAR and Camera Sensors

Yu Huang
Dec 3, 2019


Abstract: This article proposes a data-quality-aware, unified sensor-fusion system for generating dense depth maps from LiDAR and camera sensor data.

1. Introduction

In autonomous driving, multi-sensor fusion is a widely adopted approach to perception. Sensor fusion has long been an active research topic, spanning low-level data fusion, mid-level feature-space fusion, and high-level task fusion (detection, tracking, localization, segmentation, etc.). The fusion process can also be classified as early or late: in early fusion the raw data are concatenated before any processing, whereas in late fusion the results are combined after some intermediate stage such as feature extraction, feature matching, or object proposal generation.

This article addresses the data-fusion level, specifically depth map estimation. A depth map can be reconstructed from monocular or stereo images, from depth sensors (such as the Kinect), or by projecting a LiDAR point cloud. LiDAR measures range with high accuracy (at the centimeter level) and behaves stably, but its data are sparse (limited by the number of scan lines), its range is limited (normally at most about 100 meters), and its frame rate is low (10 Hz). Kinect-like depth sensors offer higher resolution but a shorter range (less than 10 meters) and are disturbed by strong light such as sunlight, so they are mostly used indoors. A stereo camera estimates depth via stereo matching and triangulation, with high resolution but limited robustness. Monocular images have recently been used for depth estimation with deep learning. It is worth noting that cameras and Kinect-like depth sensors capture data at a higher frame rate (30 fps). In summary, a multi-sensor fusion mechanism for depth estimation is needed in order to exploit the advantages of each sensor.

It is interesting to see that deep-learning depth-fusion frameworks for multiple sensors look very similar to monocular depth estimation networks: encoder-decoder architectures, 3-D geometric constraints (stereo matching), motion constraints (ego motion and object motion), surface-normal and edge consistency, semantic segmentation for context, attention mechanisms, and so on. The difference is that depth fusion must cope with sparse input, so some works modify the CNN itself to handle it, e.g. sparsity-invariant convolution [1], normalized convolution [2], and dilated convolution [3]. In addition, training a depth-fusion model sometimes lacks dense ground-truth depth, which limits purely supervised components such as the discriminator of a GAN (generative adversarial network).

An important issue for depth fusion, as for task-level fusion, is the need for a unified model that can handle deficient or invalid data from any individual sensor.

Zhang and Funkhouser [15] proposed a depth-completion method for an RGB-D image: two subnetworks estimate the surface normals and the object boundaries respectively, and dense depth is then obtained by solving a global fusion objective in which the sparse depth map acts as regularization. The method makes good use of the RGB pixels as guidance, but motion, segmentation context, and geometric constraints are neglected. Another shortcoming is the lack of a confidence map (the counterpart of the binary mask that accompanies the sparse depth input), which could provide an attention mechanism in the fusion process.

PLIN [16] generates a spatio-temporal point cloud sequence from three consecutive RGB images of a camera and two sparse depth maps from a LiDAR. The sensors are assumed to be time-synchronized but to run at different frame rates (20 fps for the camera and 10 Hz for the LiDAR). The network exploits motion information (optical flow) and segmentation context for depth fusion. Possible improvements include adding geometric constraints (stereo data for training) as well as surface-normal and edge constraints; in addition, separating the optical flow into camera motion and residual object motion may yield better motion estimation.

In this article, we propose a quality-aware, unified sensor-fusion method for dense depth map generation that incorporates motion, boundaries, surface normals, semantic segmentation, confidence, and so on. It automatically selects the reliable on-site sensor data (LiDAR point cloud and/or camera image) for depth estimation. We assume the LiDAR and camera are time-synchronized, and the selected data streams are processed at a common frame rate.

A side benefit of this framework is that interpolation of the sparse depth map generated from the LiDAR point cloud is guided by the image, i.e. depth completion. In turn, the reliable but discrete depth values projected from the LiDAR point cloud provide a prior, or regularization, for image-based depth estimation, which improves the accuracy and robustness of the overall system.

2. Joint Depth Fusion with LiDAR and Cameras

Figure 1 shows the system diagram of the depth fusion pipeline, in which each "hourglass"-shaped structure represents an encoder-decoder network: the left half is the encoder and the right half the decoder.

Figure 1. Diagram I of the depth fusion system with LiDAR and camera

First, there are sensor data-quality-check modules for the LiDAR and the camera that control the data input, i.e. switches A, B, and C. The "Image Quality Evaluation" module checks the input image quality using conventional image and video quality criteria such as PSNR (peak signal-to-noise ratio) and SSIM (structural similarity).
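As a minimal sketch of such a gate (an assumption about implementation detail, not the article's exact procedure): since PSNR and SSIM are reference-based metrics, the current frame is scored here against the previous frame of the video stream, and the thresholds are purely illustrative.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_quality_ok(curr, prev, psnr_min=20.0, ssim_min=0.5):
    """Return True if the camera stream looks usable (i.e. switch A may close)."""
    # score the current frame against the previous frame of the stream
    psnr = peak_signal_noise_ratio(prev, curr, data_range=255)
    ssim = structural_similarity(prev, curr, channel_axis=-1, data_range=255)
    return psnr >= psnr_min and ssim >= ssim_min
```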

LiDAR point cloud quality is checked in the "Point Cloud Quality Evaluation" module. Here we consider LiDAR data quality to be related to its alignment with the camera image as well: the point cloud is projected onto the camera image plane using the calibration parameters, the gradient of the projected depth map is computed, and its correlation with the image edge information is evaluated.
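One plausible form of this correlation score, written here only as a sketch consistent with the symbols defined below (the exact expression may differ), is

$$
C_f(i, j) = \sum_{(u, v) \in w(i, j)} \big\| \nabla \hat{D}_f(u, v) \big\| \, D_f(u, v),
$$

with $\hat{D}_f$ the depth map obtained by projecting the points $p \in X$ onto image $f$.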

Here w is the window size, f the image (frame) index, (i, j) the pixel location in the image, p a 3-D point of the point cloud, X the LiDAR data, and D the image gradient map. If the image quality is bad as well, we can rely only on the LiDAR data itself; in that case we instead use the Rényi quadratic entropy (RQE),
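In its standard form as a point cloud "crispness" measure (a reconstruction in common notation; the isotropic kernel width σ is a parameter), the RQE of a point cloud $\{p_i\}_{i=1}^{N}$ is

$$
E_{\mathrm{RQE}} = -\log\!\left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G\big(p_i - p_j,\; 2\sigma^2 I\big) \right),
$$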

with G(a, b) denoting a Gaussian distribution function with mean a and covariance b. In effect, the RQE measures the crispness of the point cloud distribution under a Gaussian mixture model (GMM), which turns out to be a useful quality criterion.

If only the LiDAR sensor is available, switch D routes the LiDAR point data to dense depth map generation. Lacking guidance from an RGB image, the networks available for this case are the sparsity-invariant CNN [1] or the normalized CNN [2]; the former takes as input the sparse depth map and the sparsity mask produced by the "Perspective Projection" module, while the latter takes the sparse depth map and a confidence map. Figure 1 shows the former option, and Figure 2 the latter.

Figure 2. Diagram II of the depth fusion system with LiDAR and camera
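To make the sparsity handling concrete, below is a minimal PyTorch-style sketch of a sparsity-invariant convolution layer in the spirit of [1]; it is an illustration under simplifying assumptions, not the reference implementation. The filter response is normalized by the number of valid LiDAR samples under the kernel, and the validity mask is propagated by max-pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv(nn.Module):
    """Sparsity-invariant convolution: normalize by the count of valid inputs."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.k = k

    def forward(self, x, mask):
        # x: sparse depth / feature map (B, C, H, W); mask: 1 where data is valid
        num = self.conv(x * mask)                        # weighted sum of valid inputs
        ones = x.new_ones(1, 1, self.k, self.k)
        cnt = F.conv2d(mask, ones, padding=self.k // 2)  # number of valid inputs
        out = num / cnt.clamp(min=1e-8) + self.bias.view(1, -1, 1, 1)
        return out, self.pool(mask)                      # propagate the mask
```

The normalized CNN of [2] follows the same idea but replaces the binary mask with a continuous confidence map that is propagated through the layers.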

If there is no LiDAR on the vehicle, the only sensor in use is the camera (monocular or binocular), selected by switch D, i.e. depth estimation from cameras alone. Assume first a monocular camera: the image enters the "Encoder" to build a feature map, with a backbone such as ResNet [4] or DenseNet [5]. The feature map is then fed into four networks: a segmentation net based on U-Net [6], a normal net [7], an edge net [8], and a pose net [9]. The first three output an attention map, a normal map, and an edge map, which are combined with the image (without the sparse depth and sparsity mask) and fed into "DepthNet" [9]. DepthNet outputs a depth map and a confidence map. Note that the pose net takes two consecutive images and regresses the camera's ego-motion parameters, i.e. a rotation matrix and a translation vector; the translation vector is known only up to an unknown scale factor.
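A minimal wiring sketch of this camera-only branch follows; the module names are placeholders standing in for the cited architectures [4][5][6][7][8][9], and the exact inputs of each head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MonoDepthBranch(nn.Module):
    def __init__(self, encoder, seg_net, normal_net, edge_net, depth_net, pose_net):
        super().__init__()
        self.encoder = encoder        # backbone, e.g. ResNet/DenseNet style
        self.seg_net = seg_net        # U-Net-style head -> attention map
        self.normal_net = normal_net  # surface-normal head
        self.edge_net = edge_net      # edge/boundary head
        self.depth_net = depth_net    # DepthNet: maps -> depth + confidence
        self.pose_net = pose_net      # PoseNet: two frames -> R, t (up to scale)

    def forward(self, img_t, img_t_prev):
        feat = self.encoder(img_t)
        attn = self.seg_net(feat)
        normal = self.normal_net(feat)
        edge = self.edge_net(feat)
        # camera-only case: no sparse depth or sparsity mask is concatenated
        depth, conf = self.depth_net(torch.cat([img_t, attn, normal, edge], dim=1))
        rot, trans = self.pose_net(torch.cat([img_t, img_t_prev], dim=1))
        return depth, conf, rot, trans
```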

The image feature map is also warped using the ego-motion parameters before entering the residual FlowNet [11], which estimates the residual optical flow [10]. The final optical flow is the residual flow plus the flow induced by the ego motion.
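A sketch of the ego-motion-induced ("rigid") flow used for this warping, following the common structure-from-motion formulation (an assumption about implementation detail, not the article's exact code): pixels are back-projected with the predicted depth and intrinsics K, transformed by the PoseNet rotation R and translation t, and re-projected; the residual FlowNet then only has to explain what this rigid flow cannot, i.e. independently moving objects.

```python
import torch

def rigid_flow(depth, K, R, t):
    """depth: (H, W); K, R: (3, 3); t: (3,). Returns the rigid flow (H, W, 2)."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # homogeneous pixels
    rays = pix @ torch.inverse(K).T                            # K^-1 * pixel
    pts = rays * depth.unsqueeze(-1)                           # back-project to 3-D
    pts = pts @ R.T + t                                        # apply ego motion
    proj = pts @ K.T                                           # re-project
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)
    return uv - pix[..., :2]                                   # displacement per pixel
```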

The loss terms in the training objective include a depth reconstruction term [9], a normal term [7], an edge term [8], an attention term [17], and geometric/motion consistency terms [10]. Although no stereo images are used at inference time, stereo pairs can be exploited during training by adding a stereo mismatching term. The motion consistency term is built from the errors of PoseNet and the residual FlowNet, and the surface term from the geometric relation between the depth map and the normal map.
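Collected into one objective (a sketch only; the weights λ are hyper-parameters and the grouping of terms is an assumption), the training loss reads

$$
\mathcal{L} = \lambda_d \mathcal{L}_{\text{depth}} + \lambda_n \mathcal{L}_{\text{normal}} + \lambda_e \mathcal{L}_{\text{edge}} + \lambda_a \mathcal{L}_{\text{attention}} + \lambda_g \mathcal{L}_{\text{geom}} + \lambda_m \mathcal{L}_{\text{motion}} + \lambda_s \mathcal{L}_{\text{stereo}},
$$

where the stereo term is active only when stereo pairs are available for training.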

Now, if binocular vision is available (still without LiDAR), only the depth estimation network changes. There are usually two approaches: one is direct feature concatenation or correlation, as in FlowNet [11] and the derived DispNet [12]; the other builds a 4-D cost volume, stemming from classical stereo vision, which is fed into a 3-D CNN, as in PSM-Net [13] and GC-Net [14]. We suggest the latter, i.e. the cost-volume-based 3-D CNN method.
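For illustration, here is a minimal sketch of a concatenation-style cost volume in the spirit of PSM-Net [13] (the exact construction differs between published networks); the resulting tensor, indexed by batch, feature, disparity, height, and width, is then regularized by a stack of 3-D convolutions.

```python
import torch

def build_concat_cost_volume(feat_l, feat_r, max_disp):
    """feat_l, feat_r: (B, C, H, W) left/right feature maps."""
    B, C, H, W = feat_l.shape
    volume = feat_l.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = feat_l
            volume[:, C:, d] = feat_r
        else:
            # left pixel x matches right pixel x - d
            volume[:, :C, d, :, d:] = feat_l[..., d:]
            volume[:, C:, d, :, d:] = feat_r[..., :-d]
    return volume   # fed into a 3-D CNN for cost aggregation
```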

Next, when both the LiDAR and a monocular camera are available, the sparse depth map and sparsity mask (initial confidence map), together with the RGB image, segmentation map, normal map, and edge map, are fed into "DepthNet". Inside this network there are two ways to fuse the LiDAR and camera information. One is early fusion, in which the input to the encoder-decoder network is the concatenation of all these maps (image included), as shown in Figure 3. The other is late fusion, in which the LiDAR inputs are encoded by one encoder and the camera inputs by another, and the two streams are then merged and passed through a single decoder to produce the final dense depth map and confidence map, as shown in Figure 4. We suggest late fusion.

Figure 3. Early fusion for mono image plus LiDAR
Figure 4. Late fusion for mono image plus LiDAR
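A minimal sketch of the suggested late-fusion DepthNet of Figure 4; the module boundaries and the merge-by-concatenation at the bottleneck are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LateFusionDepthNet(nn.Module):
    def __init__(self, lidar_encoder, camera_encoder, decoder):
        super().__init__()
        self.lidar_encoder = lidar_encoder    # encodes sparse depth + sparsity mask
        self.camera_encoder = camera_encoder  # encodes RGB + attention/normal/edge maps
        self.decoder = decoder                # outputs dense depth + confidence map

    def forward(self, sparse_depth, mask, rgb, attn, normal, edge):
        f_lidar = self.lidar_encoder(torch.cat([sparse_depth, mask], dim=1))
        f_cam = self.camera_encoder(torch.cat([rgb, attn, normal, edge], dim=1))
        fused = torch.cat([f_lidar, f_cam], dim=1)   # merge the two streams
        depth, conf = self.decoder(fused)
        return depth, conf
```

Early fusion (Figure 3) corresponds to collapsing the two encoders into one and concatenating all the input maps at the very first layer.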

Finally, if both the LiDAR and stereo cameras are available, the late-fusion architecture looks different from the LiDAR-plus-monocular setting. As shown in Figure 5, a 3-D CNN is avoided because it is not well suited to embedding the normal, edge, and segmentation maps; instead, a "correlation" layer (an idea borrowed from FlowNet [11]) measures the correlation between the left and right feature maps, and its output is concatenated with the other maps. Correspondingly, the stereo term used when training the monocular, image-only depth estimator is removed from the loss function.

Figure 5. Late fusion for stereo image and LiDAR
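A minimal sketch of such a correlation layer, restricted to horizontal displacements since stereo correspondences lie along the epipolar line after rectification; the single-pixel patch and the mean over channels are simplifying assumptions relative to the original FlowNet correlation.

```python
import torch

def horizontal_correlation(feat_l, feat_r, max_disp):
    """feat_l, feat_r: (B, C, H, W). Returns a (B, max_disp, H, W) correlation map."""
    B, C, H, W = feat_l.shape
    corr = feat_l.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            corr[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            # left pixel x is compared with right pixel x - d
            corr[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return corr   # concatenated with the normal/edge/segmentation maps downstream
```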

3. Summary

This article discusses how to exploit the complementary advantages and disadvantages of multiple sensors in depth fusion. The approach is expected to provide a low-cost solution for dense depth estimation with data-quality awareness (the fewer the LiDAR scan lines, the cheaper the LiDAR). It is flexible in application, automatically giving the most reliable sensor data the highest priority. Beyond the brute-force, data-driven spirit of deep learning, the method exploits additional prior knowledge extracted from the image itself, such as normals, segmentation, edges, motion, and pose.

References

1. J. Uhrig et al., "Sparsity Invariant CNNs", International Conference on 3D Vision (3DV), 2017

2. A. Eldesokey et al., "Propagating Confidences through CNNs for Sparse Data Regression", BMVC, 2018

3. K. Park, S. Kim, K. Sohn, "High-precision Depth Estimation with the 3D LiDAR and Stereo Fusion", ICRA, 2018

4. K. He et al., "Deep Residual Learning for Image Recognition", CVPR, 2016

5. G. Huang et al., "Densely Connected Convolutional Networks", CVPR, 2017

6. O. Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI, 2015

7. Y. Zhang, T. Funkhouser, "Deep Depth Completion of a Single RGB-D Image", CVPR, 2018

8. Z. Yang et al., "LEGO: Learning Edge with Geometry all at Once by Watching Videos", CVPR, 2018

9. Y. Zhang et al., "DFineNet: Ego-Motion Estimation and Depth Refinement from Sparse, Noisy Depth Input with RGB Guidance", arXiv:1903.06397, 2019

10. H. Liu et al., "PLIN: A Network for Pseudo-LiDAR Point Cloud Interpolation", arXiv:1909.07137, 2019

11. P. Fischer et al., "FlowNet: Learning Optical Flow with Convolutional Networks", ICCV, 2015

12. N. Mayer et al., "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation", CVPR, 2016

13. J.-R. Chang, Y.-S. Chen, "Pyramid Stereo Matching Network", CVPR, 2018

14. X. Guo et al., "Group-wise Correlation Stereo Network", CVPR, 2019

15. Y. Zhang, T. Funkhouser, "Deep Depth Completion of a Single RGB-D Image", CVPR, 2018

16. H. Liu et al., "PLIN: A Network for Pseudo-LiDAR Point Cloud Interpolation", arXiv:1909.07137, 2019

17. J. Qiu et al., "DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene from Sparse LiDAR Data and Single Color Image", arXiv:1812.00488, 2019

Appendix: deep neural networks referenced in the article (Figures 6–14).

Figure 6. Depth completion with normal net and boundary net [7]
Figure 7. Learning Edges and Geometry (depth, normal) all at Once (LEGO) [8]
Figure 8. Ego-motion estimation and depth refinement from RGB image and LiDAR point cloud [9]
Figure 9. Deep surface normal guided depth prediction [17]
Figure 10. A network for pseudo-LiDAR point cloud interpolation [10]
Figure 11. FlowNet for optical flow estimation [11]
Figure 12. DispNet for disparity estimation in stereo vision [12]
Figure 13. PSM-Net [13]
Figure 14. GC-Net [14]
