Automatic Multi-Sensor Data Annotation of BEV/Occupancy Autonomous Driving (Part II)

Yu Huang
14 min read · Apr 12, 2023

1 Introduction

2 Data acquisition and system configuration

3 Traditional Methods of Sensor Data Annotation

4 Semi-Traditional Methods of Sensor Data Annotation

The traditional methods for processing LiDAR point clouds and camera images described in Section 3 have some problems: point cloud clustering, detection based on reflection thresholding, and line fitting for LiDAR, as well as superpixel segmentation, region growing, binarization, and edge detection for images, often perform poorly in complex scenes. Below, some deep learning models are introduced to improve on them.

Firstly, for LiDAR point clouds, we propose a semi-traditional annotation framework, as shown in Figure 10:

1) Similar to Figure 3, the data first passes through a “pre-process” module, a “SLAM” module (whose architecture is shown in Figure 4), and a “mot seg” module. Then, in the “inst seg” module, point cloud-based detection is directly performed on the moving objects that differ from the background [40–41]; neural network models such as PointNet [42] and PointPillars [43] are used to extract feature maps from the point cloud;

2) Afterwards, similar to Figure 3, the “track” module performs temporal association on each segmented object [32] to obtain the dynamic objects’ 3D bounding-box annotations;

3) For the static background, after the “grd seg” module, the point cloud judged as non-road-surface enters another “inst seg” module [40–41] for object detection, and the static objects’ 3D bounding-box annotations are obtained;

4) The road-surface point cloud enters the “semantic seg” module, where a deep learning model uses the reflection intensity to perform point-wise classification of semantic objects, similar to image segmentation [44], i.e., lane markings, zebra crossings, and road areas. Road curbs are obtained by detecting the road boundaries, and finally the polyline-based annotations are produced in the “vect rep” module;

5) The tracked dynamic-object point clouds and the static-object point clouds obtained by instance segmentation enter the “surf recon” module, which runs a shape recovery algorithm [59];

6) Finally, all annotations are projected into the vehicle coordinate system, frame by frame, to obtain the final annotations (a minimal code sketch of this dataflow follows Figure 10).

Figure 10 LiDAR data annotation
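
To make the dataflow of Figure 10 concrete, below is a minimal Python sketch of how the modules could be chained. The function names and the `modules` dictionary of callables are hypothetical placeholders standing in for the boxes in the figure, not an actual implementation.

```python
import numpy as np

def annotate_lidar_sequence(frames, modules):
    """Hypothetical orchestration of the Figure-10 pipeline.

    frames  : list of (N_i, 4) arrays with [x, y, z, intensity] per sweep
    modules : dict of callables standing in for the boxes in Figure 10
    """
    clean = [modules["preprocess"](f) for f in frames]        # de-noising, motion compensation
    poses = modules["slam"](clean)                            # per-frame 4x4 ego poses, cf. Figure 4
    moving, static = modules["mot_seg"](clean, poses)         # split dynamic vs. static points

    # Dynamic branch: instance segmentation, then temporal association (tracking)
    dyn_instances = [modules["inst_seg"](pts) for pts in moving]
    dyn_boxes = modules["track"](dyn_instances, poses)        # 3D bounding boxes per track

    # Static branch: ground removal, then instance segmentation of the off-road points
    ground, non_ground = modules["grd_seg"](static)
    stat_boxes = modules["inst_seg"](non_ground)

    # Road-surface branch: intensity-based semantic segmentation, then polyline fitting
    road_labels = modules["semantic_seg"](ground)
    map_polylines = modules["vect_rep"](ground, road_labels)

    # Shape recovery for the object point clouds
    meshes = modules["surf_recon"](dyn_boxes, stat_boxes)

    # Express all annotations in each frame's vehicle coordinate system
    return [modules["to_vehicle_frame"](np.linalg.inv(T), dyn_boxes, stat_boxes,
                                        map_polylines, meshes)
            for T in poses]
```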

For the image data from multiple cameras, we propose a semi-traditional 3D annotation framework, as shown in Figure 11:

1) Firstly, three modules are applied to the image sequences from the multiple cameras, namely “inst seg”, “depth map”, and “optical flow”, to compute the instance segmentation map, the depth map, and the optical flow map, respectively. The “inst seg” module uses a deep learning model to locate and classify object pixels [45], such as vehicles and pedestrians; the “depth map” module uses a deep learning model that exploits the pixel-wise motion between consecutive frames of monocular video to form a virtual stereo pair and infer the depth map [47]; and the “optical flow” module uses a deep learning model to directly infer the pixel motion between two consecutive frames [46]. All three modules preferably use neural network models trained with unsupervised (self-supervised) learning;

2) Based on the depth map estimates, the “SLAM/SFM” module can obtain dense 3D reconstructed point clouds similar to those of RGB-D+IMU sensors [48–49] (similar to the LiDAR+camera+IMU SLAM framework shown in Figure 9, only omitting the step of projecting the LiDAR point cloud onto the image plane). At the same time, the instance segmentation results allow obstacles such as vehicles and pedestrians to be removed from the images, while the “mot seg” module further distinguishes static from dynamic obstacles based on ego-vehicle odometry and the optical flow estimates (a simple flow-consistency check is sketched after Figure 11);

3) The various dynamic obstacles obtained by instance segmentation are reconstructed in the next “SLAM/SFM” module (without IMU input), which is similar to the SLAM architecture for RGB-D sensors and can be seen as an extension of monocular SLAM, as shown in Figure 7. Then the results of “inst seg” are passed to the “obj recog” module, which annotates the 3D bounding boxes of the object point clouds;

4) For the static background, the “grd det” module distinguishes static objects from road-surface point clouds, so that for static obstacles (such as parked vehicles and traffic cones) the results of the “inst seg” module are passed to the “obj recog” module, which annotates the 3D bounding boxes of their point clouds;

5) The dynamic-object point clouds obtained from the “SLAM/SFM” module and the static-object point clouds obtained from the “grd det” module enter the “surf recon” module to run a shape recovery algorithm [59];

6) The road-surface point cloud only provides a fitted 3D road surface. From the image-domain “inst seg” module, the road-surface area can be obtained. Based on ego odometry, image stitching can be performed; after running the “seman seg” module on the stitched road-surface image, lane markings, zebra crossings, and road boundaries are obtained. Then the “vect rep” module is used for polyline labeling;

7) Finally, all annotations are projected into the vehicle coordinate system, frame by frame, to obtain the final annotations.

Figure 11 Multi-camera image 3D annotation
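
As one simple way to realize the “mot seg” decision in step 2, the measured optical flow can be compared with the rigid flow induced by ego motion and the estimated depth: pixels whose flow cannot be explained by ego motion alone are flagged as dynamic. The numpy sketch below assumes a pinhole camera with intrinsics `K`, a relative pose `T_21` from the odometry, and a pixel threshold; it is an illustrative baseline, not necessarily the method used in the cited works.

```python
import numpy as np

def dynamic_mask(depth, flow, K, T_21, thresh_px=3.0):
    """Flag pixels whose measured optical flow disagrees with the flow
    predicted from ego motion alone (a simple motion-segmentation cue).

    depth : (H, W) metric depth of frame 1
    flow  : (H, W, 2) measured optical flow from frame 1 to frame 2
    K     : (3, 3) pinhole intrinsics
    T_21  : (4, 4) pose mapping frame-1 camera coordinates into frame 2
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)

    # Back-project pixels of frame 1 to 3D camera coordinates
    rays = pix @ np.linalg.inv(K).T                      # (H, W, 3)
    P1 = rays * depth[..., None]                         # (H, W, 3)

    # Transform into frame 2 and re-project
    P1_h = np.concatenate([P1, np.ones((H, W, 1))], axis=-1)
    P2 = P1_h @ T_21.T                                   # (H, W, 4)
    proj = P2[..., :3] @ K.T
    uv2 = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)

    rigid_flow = uv2 - pix[..., :2]                      # flow explained by ego motion
    residual = np.linalg.norm(flow - rigid_flow, axis=-1)
    return residual > thresh_px                          # True where the scene itself moves
```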

For the LiDAR and camera data together, we propose a semi-traditional framework for joint data annotation, as shown in Figure 12:

1) When both the multi-camera images and the LiDAR point cloud are available, the “optical flow” module in Figure 11 is replaced with a “scene flow” module, which estimates the motion of the 3D point cloud with a deep learning model [50]. The “depth map” module is replaced with a “depth compl” module, which uses a neural network to complete the sparse depth obtained by projecting the point cloud onto the image plane (interpolation and “hole filling”) [51], and then back-projects it into 3D space to generate a dense point cloud (this projection is sketched after Figure 12). The “inst seg” module is replaced with a “seman. seg.” module, which uses a deep learning model to label point clouds according to object categories [52];

2) Afterwards, the dense point cloud and IMU data enter the “SLAM” module to estimate the odometry [30] (whose architecture is shown in Figure 4), and the point clouds labeled as obstacles (vehicles and pedestrians) are selected. At the same time, the estimated scene flow also enters the “mot seg” module to further distinguish moving obstacles from static ones;

3) Afterwards, similar to Figure 10, once the moving objects pass through the “inst seg” module [40–41] and the “track” module [32], the annotations of the moving objects are obtained. Similarly, after passing through the “grd seg” module, static obstacles are labeled by the “inst seg” module [40–41]. Map elements, such as lane markings, zebra crossings, and road edges, are obtained by running the “seman. seg.” module on the stitched road-surface image and the aligned point clouds [53–54], and then enter the “vect rep” module for polyline labeling;

4) The dynamic-object point clouds obtained from tracking and the static-object point clouds obtained from instance segmentation enter the “surf recon” module, which runs a shape recovery algorithm [59];

5) Finally, all annotations are projected into the vehicle coordinate system, frame by frame, to obtain the final annotations.

Figure 12 LiDAR and camera joint annotation
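
The “depth compl” module in step 1 takes as input a sparse depth map produced by projecting the LiDAR sweep into the camera. A minimal numpy sketch of that projection is given below; the extrinsic `T_cam_lidar`, intrinsics `K`, the function name, and the nearest-return tie-break are assumptions for illustration.

```python
import numpy as np

def sparse_depth_from_lidar(points, T_cam_lidar, K, image_hw):
    """Project a LiDAR sweep into a camera to build the sparse depth input
    of a depth-completion network.

    points      : (N, 3) LiDAR points in the LiDAR frame
    T_cam_lidar : (4, 4) extrinsic transform LiDAR -> camera
    K           : (3, 3) camera intrinsics
    image_hw    : (H, W) output resolution
    """
    H, W = image_hw
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]           # points in the camera frame
    in_front = pts_cam[:, 2] > 0.1                       # keep points in front of the camera
    pts_cam = pts_cam[in_front]

    proj = (K @ pts_cam.T).T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    z = pts_cam[:, 2]

    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth = np.zeros((H, W), dtype=np.float32)           # 0 marks "no LiDAR return"
    # Keep the nearest return when several points land on the same pixel
    order = np.argsort(-z[valid])                        # far first, near overwrites
    depth[v[valid][order], u[valid][order]] = z[valid][order]
    return depth
```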

5 Deep Learning-Based Data Annotation

Finally, given a certain amount of annotated data, the system can build fully deep-learning-based models to annotate the LiDAR point cloud and camera image data, including an independent LiDAR point cloud annotation model, an independent camera image annotation model, and a joint annotation model for LiDAR point cloud and multi-camera image data.

We design a LiDAR point cloud annotation system based on a full deep learning model, as shown in Figure 13:

1) Firstly, in the “voxelization” module [57], the point cloud is divided into evenly spaced voxel grids, generating a many-to-one mapping between 3D points and voxels; the result then enters the “feat encod” module, which converts the voxel grid into a point cloud feature map (using PointNet [42] or PointPillars [43]; a binning sketch is given after Figure 13);

2) On the one hand, in the “view transform” module, the feature map is projected onto the BEV [10], where a feature aggregator and a feature encoder are combined, and BEV decoding is then performed in BEV space with two heads. One head detects the key points and categories (i.e., regression and classification) of road map elements such as lane markings, zebra crossings, and road boundaries in the “map ele det” module [11], whose structure is similar to the Transformer-based DETR model, also using a deformable attention module, and outputs the positions of the key points and the IDs of the elements they belong to. These key points are fed into the “polyline generat” module, which is also based on the Transformer architecture [11]; given the BEV features and the initial key points, the polyline distribution model generates the vertices of each polyline and thus the geometric representation of the map elements. The other head performs BEV object detection through the “obj det” module, with a structure similar to the PointPillars model [43];

3) On the other hand, the 3D point cloud feature maps can directly enter the “3D decod” module [58] to obtain multi-scale voxel features through 3D deconvolution, after which the “occup.” module performs upsampling and class prediction to generate the voxel semantic segmentation [58].

Figure 13 LiDAR data annotation
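
The many-to-one point-to-voxel mapping of step 1 is essentially integer binning of the point coordinates. The numpy sketch below shows this mapping for BEV pillars together with a simple PointPillars-style [43] per-point feature (offsets to the pillar centroid) that a small PointNet-like encoder would then consume; the grid ranges, voxel size, and function name are assumptions for illustration.

```python
import numpy as np

def pillarize(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), voxel=0.5):
    """Map points to BEV pillars (many-to-one) and build simple per-point features.

    points  : (N, 4) array [x, y, z, intensity]
    returns : pillar_ids (N,), features (N, 7) = [x, y, z, intensity, dx_c, dy_c, dz_c]
    """
    nx = int((x_range[1] - x_range[0]) / voxel)
    ny = int((y_range[1] - y_range[0]) / voxel)

    ix = ((points[:, 0] - x_range[0]) / voxel).astype(int)
    iy = ((points[:, 1] - y_range[0]) / voxel).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    points, ix, iy = points[keep], ix[keep], iy[keep]

    pillar_ids = iy * nx + ix                            # many-to-one: point -> pillar index

    # Offset of each point to its pillar's centroid (a PointPillars-style feature)
    feats = np.zeros((points.shape[0], 7), dtype=np.float32)
    feats[:, :4] = points[:, :4]
    for pid in np.unique(pillar_ids):
        mask = pillar_ids == pid
        centroid = points[mask, :3].mean(axis=0)
        feats[mask, 4:7] = points[mask, :3] - centroid
    return pillar_ids, feats
```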

Then we design a multi-camera image annotation system based on a full deep learning model, as shown in Figure 14:

1) The multi-camera images are first encoded by the “backbone” module, such as EfficientNet or RegNet plus FPN/Bi-FPN, and then split into two paths;

2) On the one hand, image features enter the “view transform” module, where BEV features are constructed through a depth distribution [55] (the lifting step is sketched after Figure 14) or a Transformer architecture [56], and then go to two different heads. Similar to Figure 13, one head outputs the vectorized representation of map elements through the “map ele detector” module and the “polyline generat” module; the other head passes through the “BEV obj detector” module to obtain the objects’ BEV bounding boxes, which can be implemented with a Transformer architecture [56] or a PointPillars-like architecture [57];

3) On the other hand, in the “2D-3D transform” module, the 2D feature encoding is projected into 3D coordinates based on the depth distribution [55], where the height information is preserved; the resulting camera voxel features then enter the “3D decod” module to obtain multi-scale voxel features [58], and afterwards go into the “occupancy” module for class prediction to generate the voxel semantic segmentation [58].

Figure 14 Multi-camera data annotation
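
The depth-distribution view transform of step 2 [55] can be read as a per-pixel outer product between a categorical depth distribution and the image context features, producing a frustum of features that is subsequently splatted onto the BEV grid. The numpy sketch below shows only the lifting step; the tensor shapes and function name are assumed for illustration.

```python
import numpy as np

def lift_with_depth_distribution(context, depth_logits):
    """Lift image features by weighting them with a per-pixel categorical
    depth distribution (LSS/BEVDepth-style view transform, lifting step only).

    context      : (C, H, W) image feature map
    depth_logits : (D, H, W) unnormalized scores over D depth bins
    returns      : (D, C, H, W) frustum features, ready to be splatted to BEV
    """
    d = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
    depth_prob = d / d.sum(axis=0, keepdims=True)        # softmax over the depth bins
    # Outer product per pixel: every feature is spread across the depth hypotheses
    return np.einsum('dhw,chw->dchw', depth_prob, context)

# Toy usage with assumed sizes: 64 channels, 59 depth bins, a 16x44 feature map
feat = np.random.randn(64, 16, 44).astype(np.float32)
logits = np.random.randn(59, 16, 44).astype(np.float32)
print(lift_with_depth_distribution(feat, logits).shape)   # (59, 64, 16, 44)
```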

Finally, we design a deep learning system for joint annotation of LiDAR point cloud and camera image data, whose model architecture is shown in Figure 15:

1) Similar to Figure 14, the camera image enters the “backbone” module to obtain 2D image encoding features;

2) Similar to Figure 13, the LiDAR point cloud enters the “voxelization” and “feat encod” modules to obtain 3D point cloud features;

3) Afterwards, the processing is divided into two pathways:

a. On the one hand, point cloud features are projected onto the BEV [10] through the “view transform” module, while image features are transformed to BEV through another “view transform” module based on a Transformer [56] or a depth distribution [55]; the two feature maps are then concatenated in the “feat concat” module (see the fusion sketch after Figure 15). Next, two different heads follow: one head passes through the “BEV obj detector” module, similar to the PointPillars architecture, to obtain the BEV object bounding boxes; the other head outputs vectorized representations of map elements through the “map ele detector” module and the “polyline generat” module;

b. On the other hand, image features are projected into 3D coordinates through the “2D-3D transform” module [55], preserving the height information, and are then concatenated with the point cloud features in another “feat concat” module to form voxel features; these enter the “3D decod” module and the “occup” module to obtain the voxel semantic segmentation.

Figure 15 LiDAR-camera joint data annotation
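
Both fusion points in steps 3a and 3b reduce to channel-wise concatenation once the camera features have been brought into the same grid as the LiDAR features. The numpy sketch below only makes the shape bookkeeping explicit; all sizes and names are assumed for illustration.

```python
import numpy as np

def fuse_bev(cam_bev, lidar_bev):
    """Channel-wise concatenation of camera and LiDAR BEV features (step 3a).

    cam_bev   : (C1, H, W) camera features after the view transform
    lidar_bev : (C2, H, W) point cloud features projected to BEV
    returns   : (C1 + C2, H, W) fused BEV features for the detection / map heads
    """
    assert cam_bev.shape[1:] == lidar_bev.shape[1:], "BEV grids must be aligned"
    return np.concatenate([cam_bev, lidar_bev], axis=0)

def fuse_voxels(cam_vox, lidar_vox):
    """Same idea in 3D for the occupancy branch (step 3b): the camera features
    lifted by the 2D-3D transform keep the height axis and are concatenated
    with the LiDAR voxel features."""
    assert cam_vox.shape[1:] == lidar_vox.shape[1:], "voxel grids must be aligned"
    return np.concatenate([cam_vox, lidar_vox], axis=0)

# Assumed toy shapes: a 200x200 BEV grid and a 16-level voxel grid
fused_bev = fuse_bev(np.zeros((80, 200, 200)), np.zeros((64, 200, 200)))
fused_vox = fuse_voxels(np.zeros((32, 16, 200, 200)), np.zeros((32, 16, 200, 200)))
print(fused_bev.shape, fused_vox.shape)   # (144, 200, 200) (64, 16, 200, 200)
```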

In Summary

The autonomous driving data annotation system we propose combines traditional methods and deep learning methods, and can meet the needs of both the R&D stage and the mass production stage. At the same time, the traditional methods require some manual assistance and provide the training data for the subsequent fully deep-learning-based methods.

References

1. Tesla AI Day, August 19th, 2021.

2. Tesla AI Day, Sept. 30th, 2022.

3. B Yang, M Bai, M Liang, W Zeng, R Urtasun, “Auto4D: Learning to Label 4D Objects from Sequential Point Clouds”, arXiv 2101.06586, 3, 2021

4. C R. Qi, Y Zhou, M Najibi, P Sun, K Vo, B Deng, D Anguelov, “Offboard 3D Object Detection from Point Cloud Sequences”, arXiv 2103.05073, 3, 2021

5. N Homayounfar, W Ma, J Liang, et al., “DAGMapper: Learning to Map by Discovering Lane Topology”, arXiv 2012.12377, 12, 2020

6. B Liao, S Chen, X Wang, et al., “MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction”, arXiv 2208.14437, 8, 2022

7. J Shin, F Rameau, H Jeong, D Kum, “InstaGraM: Instance-level Graph Modeling for Vectorized HD Map Learning”, arXiv 2301.04470, 1, 2023

8. M Elhousni, Y Lyu, Z Zhang, X Huang , “Automatic Building and Labeling of HD Maps with Deep Learning”, arXiv 2006.00644, 6, 2020

9. K Tang, X Cao, Z Cao, et al., “THMA: Tencent HD Map AI System for Creating HD Map Annotations”, arXiv 2212.11123, 12, 2022

10. Q Li, Y Wang, Y Wang, H Zhao, “HDMapNet: An Online HD Map Construction and Evaluation Framework”, arXiv 2107.06307, 7, 2021

11. Y Liu, Y Wang, Y Wang, H Zhao, “VectorMapNet: End-to-end Vectorized HD Map Learning”, arXiv 2206.08920, 6, 2022

12. J L. Schönberger, J Frahm, “Structure-from-Motion Revisited”, CVPR, 2016

13. J L. Schönberger, E Zheng, M Pollefeys, J Frahm, “Pixelwise View Selection for Unstructured Multi-View Stereo”, ECCV, 2016

14. J Zhang and S Singh, “LOAM: Lidar Odometry and Mapping in Real-time”, Conference of Robotics: Science and Systems, Berkeley, 2014.

15. J. Behley and C. Stachniss, “Efficient Surfel-Based SLAM using 3D Laser Range Data in Urban Environments”, Robotics: Science and Systems, 2018.

16. R Yu, C Russell, L Agapito, “Video Pop-up: Monocular 3D Reconstruction of Dynamic Scenes”, ECCV, 2014

17. S Bullinger, C Bodensteiner, M Arens, R Stiefelhagen, “3D Vehicle Trajectory Reconstruction in Monocular Video Data Using Environment Structure Constraints”, ECCV, 2018

18. H. Lim, S. Hwang, and H. Myung. “ERASOR: Egocentric Ratio of Pseudo Occupancy-Based Dynamic Object Removal for Static 3D Point Cloud Map Building”. IEEE Robotics and Automation Letters (RA-L), 2021.

19. X. Chen, S. Li, B. Mersch, L. Wiesmann, J. Gall, J. Behley, and C. Stachniss, “Moving Object Segmentation in 3D LiDAR Data: A Learning-based Approach Exploiting Sequential Data”, arXiv 2105.08971, 2021

20. X Chen, B Mersch, L Nunes, R Marcuzzi, I Vizzo, J Behley, C Stachniss, “Automatic Labeling to Generate Training Data for Online LiDAR-based Moving Object Segmentation”, arXiv 2201.04501, 1, 2022

21. M Saputra, A Markham, N Trigoni, “Visual SLAM and Structure from Motion in Dynamic Environments: A Survey”, ACM Computing Surveys, 2019

22. S Milz, G Arbeiter, C Witt, B Abdallah, “Visual SLAM for Automated Driving: Exploring the Applications of Deep Learning”, IEEE CVPR, 2018

23. Z Zhang, “A Flexible New Technique for Camera Calibration”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 2000, 22(11): 1330–1334.

24. J. Levinson and S. Thrun, “Automatic online calibration of cameras and lasers.” Robotics: Science and Systems, vol. 2, 2013.

25. G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, “Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information.” AAAI, 2012.

26. T Qin, S Shen, “Online Temporal Calibration for Monocular Visual-Inertial Systems”, IEEE IROS, 2018

27. X Wang, L Xu, H Sun, et al., “On-road Vehicle Detection and Tracking Using MMW Radar and Monovision Fusion”, IEEE Trans. on Intelligent Transportation Systems, 2016, 17(7): 1–10.

28. H Caesar, V Bankiti, A H. Lang, et al., “nuScenes: A multimodal dataset for autonomous driving”, arXiv 1903.11027, 2019

29. S Huang, L Liu, J Dong, X Fu, “A Survey of Ground Filtering Algorithms for Vehicle-Mounted LiDAR Point Cloud Data”, Opto-Electronic Engineering, 47(12), 2020

30. W Xu, Y Cai, D He, et al., “FAST-LIO2: Fast Direct LiDAR-inertial Odometry”, arXiv 2107.06829, 7, 2021

31. Y Zhao, X Zhang, X Huang, “A Technical Survey and Evaluation of Traditional Point Cloud Clustering Methods for LiDAR Panoptic Segmentation”, ICCV workshop, 2021

32. X Weng, J Wang, D Held, K Kitani, “3D Multi-Object Tracking: A Baseline and New Evaluation Metrics”, arXiv 1907.03961, 7, 2019.

33. N Certad, W Morales-Alvarez, C Olaverri-Monreal, “Road Markings Segmentation from LIDAR Point Clouds using Reflectivity Information”, arXiv 2211.01105, 2022

34. P Sun, X Zhao, Z Xu, “A 3D LiDAR Data-Based Dedicated Road Boundary Detection Algorithm for Autonomous Vehicles”, IEEE Access, 7(2), 2019

35. Z Yang, T Liu, S Shen, “Self-Calibrating Multi-Camera Visual-Inertial Fusion for Autonomous MAVs”, IEEE IROS, 2016

36. Y Yang, D Tang, D Wang, et al., “Multi-camera visual SLAM for off-road navigation”, Robotics and Autonomous Systems, 128(3), 2020

37. R Achanta, A Shaji, K Smith, et al., “SLIC superpixels compared to state-of-the-art superpixel methods”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(11).

38. Y. Shin, S. Park, A. Kim. “Direct Visual SLAM using Sparse Depth for Camera-LiDAR System”. IEEE ICRA, 2018.

39. Y Zhu, C Zheng, C Yuan, et al., “DVIO: Depth-Aided Visual Inertial Odometry for RGBD Sensors”, arXiv 2110.10805, 10,2021

40. F Zhang, C Guan, J Fang, et al., “Instance Segmentation of LiDAR Point Clouds”, IEEE ICRA, 2020

41. L. Nunes et al., “Unsupervised Class-Agnostic Instance Segmentation of 3D LiDAR Data for Autonomous Vehicles”, IEEE Robotics and Automation Letters, 7(4), 2022.

42. C R Qi, H Su, K Mo, L J Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation”, CVPR, 2017.

43. A. H. Lang, S. Vora, H. Caesar, et al. “Pointpillars: Fast encoders for object detection from point clouds”. IEEE CVPR, 2019.

44. G Csurka, R Volpi, B Chidlovskii, “Semantic Image Segmentation: Two Decades of Research”, arXiv 2302.06378, 2023

45. Y Wang, Z Xu, X Wang, et al., “End-to-End Video Instance Segmentation with Transformers”, arXiv 2011.14503, 10, 2021

46. K Luo, C Wang, S Liu, et al., “Upflow: Upsampling pyramid for unsupervised optical flow learning”, IEEE CVPR, 2021

47. Y Wei, L Zhao, W Zheng, et al., “SurroundDepth: Entangling Views for Self-Supervised Multi-Camera Depth Estimation”, arXiv 2204.03636, 2022

48. S Yang, X Yi, Z Wang, et al., “Visual SLAM using Multiple RGB-D Cameras”, IEEE Int. Conf. on Robotics and Biomimetics (ROBIO), 2015

49. X Meng, W Gao, Z Hu, “Dense RGB-D SLAM with multiple cameras”, Sensors, 18(7), 2018

50. H Liu, T Lu, Y Xu, et al., “Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion”, arXiv 2303.12017, 2023

51. D. Nazir, A. Pagani, M. Liwicki, et al., “SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion”, arXiv 2204.13635, 2022

52. J Li, H Dai, H Han, Y Ding, “MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving”, IEEE CVPR, 2023

53. H Li, Y Chen, Q Zhang, D Zhao, “BiFNet: Bidirectional Fusion Network for Road Segmentation”, arXiv 2004.08582, 4, 2020

54. R Yin, B Yu, H Wu, et al., “FusionLane: Multi-Sensor Fusion for Lane Marking Semantic Segmentation Using Deep Neural Networks”, arXiv 2003.04404, 2020

55. Y Li, Z Ge, G Yu, et al., “BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection”, arXiv 2206.10092, 6, 2022

56. B Liao, S Chen, X Wang, et al., “MapTR: Structured Modeling And Learning For Online Vectorized HD Map Construction”, arXiv 2208.14437, 8, 2022

57. J Huang, G Huang, Z Zhu, D Du, “BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View”, arXiv 2112.11790, 12, 2021

58. X Wang, Z Zhu, W Xu, et al., “OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception”, arXiv 2303.03991, 3, 2023

59. Z Huang, Y Wen, Z Wang, J Ren, and K Jia, “Surface Reconstruction from Point Clouds: A Survey and a Benchmark”, arXiv 2205.02413, 2022
