Automatic Multi-Sensor Data Annotation of BEV/Occupancy Autonomous Driving (Part III)

Yu Huang
Apr 13, 2023

1 Introduction

2 Data acquisition and system configuration

3 Traditional Methods of Sensor Data Annotation

4 Semi-Traditional Methods of Sensor Data Annotation

5 Deep learning-based data annotation

In Summary

Appendix: Deep Learning Models used for Camera & LiDAR Data Annotation

1. Instance Segmentation of LiDAR Point Cloud

Fig. A1 Instance segmentation [40]

(a) Point cloud frame as input; (b) kNN to obtain a dense representation; (c) feature learning with a self-attention block to re-organize the unordered points; (d) backbone network; (e) instance segmentation result with bounding boxes.
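To make steps (b) and (c) concrete, here is a minimal PyTorch sketch, not the authors' code: layer widths, the number of neighbors, and the toy point cloud size are all arbitrary assumptions. It shows kNN grouping of a raw frame followed by a single self-attention block over per-point features.

```python
import torch
import torch.nn as nn

def knn_group(points, k=16):
    """points: (N, 3) -> indices of the k nearest neighbors, shape (N, k)."""
    dists = torch.cdist(points, points)          # (N, N) pairwise distances
    return dists.topk(k, largest=False).indices  # k smallest distances per point

class PointSelfAttention(nn.Module):
    """Single-head self-attention over per-point features (C channels)."""
    def __init__(self, c):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(c, c) for _ in range(3))

    def forward(self, feats):                     # feats: (N, C)
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        attn = torch.softmax(q @ k.t() / feats.shape[-1] ** 0.5, dim=-1)
        return feats + attn @ v                   # residual connection

points = torch.rand(1024, 3)                      # one LiDAR frame (toy size)
neighbors = knn_group(points)                     # (b) dense kNN representation
feats = PointSelfAttention(3)(points)             # (c) attention-refined features
```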

2. Image Semantic Segmentation

Fig. A2 Dual Attention Networks

The Dual Attention Network aggregates the output of the Position Attention Module (PAM), which captures the spatial dependencies between any two positions of the feature maps, with that of the Channel Attention Module (CAM), which exploits the inter-dependencies between channel maps. Specifically, the outputs of the two attention modules are each transformed by a convolution layer, fused by an element-wise sum, and passed through a final convolution layer to generate the prediction maps.
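A minimal sketch of the two modules and the sum fusion may make this clearer. It is not the released DANet code: the learnable scale factors are dropped, and the channel counts and feature-map size are assumptions.

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Position Attention Module: attention over all spatial positions."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c // 8, 1)
        self.k = nn.Conv2d(c, c // 8, 1)
        self.v = nn.Conv2d(c, c, 1)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)       # (B, HW, C/8)
        k = self.k(x).flatten(2)                       # (B, C/8, HW)
        v = self.v(x).flatten(2)                       # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW) spatial dependencies
        return x + (v @ attn.transpose(1, 2)).view(b, c, h, w)

class CAM(nn.Module):
    """Channel Attention Module: attention between channel maps."""
    def forward(self, x):
        b, c, h, w = x.shape
        f = x.flatten(2)                               # (B, C, HW)
        attn = torch.softmax(f @ f.transpose(1, 2), dim=-1)  # (B, C, C) channel dependencies
        return x + (attn @ f).view(b, c, h, w)

# Each branch goes through a convolution, then the two are fused by element-wise sum.
x = torch.rand(1, 64, 32, 32)
fused = nn.Conv2d(64, 64, 3, padding=1)(PAM(64)(x)) + nn.Conv2d(64, 64, 3, padding=1)(CAM()(x))
```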

3. Video Instance Segmentation with Transformers

Fig. A3 Video instance segmentation (VIS) [45]

VisTR contains four main components: 1) a CNN backbone that extracts feature representations of multiple images; 2) an encoder-decoder Transformer that models the relations of pixel-level features and decodes the instance-level features; 3) an instance sequence matching module that supervises the model; and 4) an instance sequence segmentation module that outputs the final mask sequences.
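A toy sketch of components 1) and 2) follows, purely illustrative: a single strided convolution stands in for the CNN backbone, positional encodings are omitted, and the matching and mask-sequence modules are left out entirely.

```python
import torch
import torch.nn as nn

class VisTRSketch(nn.Module):
    """Hypothetical VisTR-style pipeline: per-frame CNN features of a clip are
    flattened into one pixel-level sequence, and a Transformer decodes a fixed
    set of instance queries into instance-level features."""
    def __init__(self, c=256, num_queries=10):
        super().__init__()
        self.backbone = nn.Conv2d(3, c, 7, stride=4, padding=3)    # stand-in for a CNN backbone
        self.transformer = nn.Transformer(d_model=c, batch_first=True)
        self.queries = nn.Parameter(torch.rand(num_queries, c))    # instance-sequence queries

    def forward(self, clip):                                       # clip: (T, 3, H, W)
        feats = self.backbone(clip)                                # (T, C, H', W')
        seq = feats.flatten(2).permute(0, 2, 1).flatten(0, 1)      # (T*H'*W', C) pixel-level sequence
        out = self.transformer(seq.unsqueeze(0), self.queries.unsqueeze(0))
        return out                                                 # (1, num_queries, C) instance features

instance_feats = VisTRSketch()(torch.rand(4, 3, 64, 64))           # a 4-frame toy clip
```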

4. UPFlow

Fig. A4 UPFlow [46]

It contains two stages: pyramid encoding, which extracts feature pairs at different scales, and pyramid decoding, which estimates optical flow at each scale. Note that the parameters of the decoder module and the upsampling module are shared across all pyramid levels.
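The weight-sharing across pyramid levels is the point worth sketching. The snippet below is an assumption-laden simplification, not UPFlow itself: the self-guided upsample module and the cost volumes are omitted, and feature sizes are made up; it only shows one decoder instance being reused coarse-to-fine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDecoder(nn.Module):
    """One flow decoder reused at every pyramid level (weights shared across levels)."""
    def __init__(self, c):
        super().__init__()
        # features of both images + upsampled flow -> flow residual
        self.net = nn.Conv2d(2 * c + 2, 2, 3, padding=1)

    def forward(self, f1, f2, up_flow):
        return up_flow + self.net(torch.cat([f1, f2, up_flow], dim=1))

def pyramid_decode(feats1, feats2, decoder):
    """feats*: lists of feature maps from coarse to fine; returns finest-level flow."""
    flow = torch.zeros(feats1[0].shape[0], 2, *feats1[0].shape[-2:])
    for f1, f2 in zip(feats1, feats2):
        # upsample the previous estimate to the current resolution, then refine it
        flow = F.interpolate(flow, size=f1.shape[-2:], mode='bilinear', align_corners=False)
        flow = decoder(f1, f2, flow)
    return flow

c = 32
feats1 = [torch.rand(1, c, s, s) for s in (8, 16, 32)]    # coarse -> fine pyramid of image 1
feats2 = [torch.rand(1, c, s, s) for s in (8, 16, 32)]    # coarse -> fine pyramid of image 2
flow = pyramid_decode(feats1, feats2, SharedDecoder(c))   # (1, 2, 32, 32)
```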

5. SurroundDepth

Fig. A5 SurroundDepth [47]

It utilizes encoder-decoder networks to predict depths. To entangle surrounding views, it proposes the cross-view transformer (CVT) to fuse multi-camera features in a multi-scale fashion. Pretrained with the sparse pseudo depths generated by two-frame Structure-from-Motion, the depth model is able to learn the absolute scale of the real world. By explicitly introducing extrinsic matrices into pose estimation, it can predict multi-view consistent ego-motions and boost the performance of scale-aware depth estimation.
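A highly simplified sketch of the cross-view idea: features from all surrounding cameras attend to each other so that each view's depth benefits from its neighbors. This is a single-scale, single-layer stand-in with assumed channel counts, not the multi-scale CVT of the paper.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Fuse the feature maps of N surrounding cameras with one attention layer."""
    def __init__(self, c=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, feats):                                  # feats: (N_cam, C, H, W)
        n, c, h, w = feats.shape
        # treat every pixel of every view as one token in a single sequence
        tokens = feats.flatten(2).permute(0, 2, 1).reshape(1, n * h * w, c)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.reshape(n, h * w, c).permute(0, 2, 1).reshape(n, c, h, w)

fused = CrossViewFusion()(torch.rand(6, 64, 12, 20))           # 6 surround-view cameras
```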

6. Bidirectional Camera-LiDAR Fusion Module (Bi-CLFM)

Fig. A6 Bidirectional Camera-LiDAR Fusion Module (Bi-CLFM) [50]

Features from the two modalities (LiDAR and camera) are fused in a bidirectional way, so that each modality can benefit from the other. The gradient flowing from one branch into the other is detached, which prevents either modality from dominating the training.
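The gradient-detaching trick is easy to show in isolation. Below is a minimal sketch assuming both modalities are already aligned on a common 2D grid (the real module fuses a dense image branch with a point branch and handles the alignment itself); the module name and 1x1 fusion convolutions are illustrative.

```python
import torch
import torch.nn as nn

class BiCLFMSketch(nn.Module):
    """Bidirectional fusion: each branch sees the other branch's features, but only
    after .detach(), so no gradient flows across modalities."""
    def __init__(self, c):
        super().__init__()
        self.cam_from_lidar = nn.Conv2d(2 * c, c, 1)   # inject detached LiDAR feats into camera
        self.lidar_from_cam = nn.Conv2d(2 * c, c, 1)   # inject detached camera feats into LiDAR

    def forward(self, cam_feat, lidar_feat):           # both (B, C, H, W), already aligned
        cam_out = self.cam_from_lidar(torch.cat([cam_feat, lidar_feat.detach()], dim=1))
        lidar_out = self.lidar_from_cam(torch.cat([lidar_feat, cam_feat.detach()], dim=1))
        return cam_out, lidar_out

cam, lidar = BiCLFMSketch(32)(torch.rand(1, 32, 16, 16), torch.rand(1, 32, 16, 16))
```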

Fig. A7 CamLiPWC [50]

Synchronized camera and LiDAR frames are taken as input, from which dense optical flow and sparse scene flow are estimated respectively. Built on top of the PWC architecture, CamLiPWC is a two-branch network with multiple bidirectional fusion connections (Bi-CLFM) between them.

Fig. A8 CamLiRAFT [50]

Built on top of the RAFT (Recurrent All-pairs Field Transforms) architecture, it performs four-stage feature fusion: features from the feature encoder, the context encoder, the correlation lookup operation, and the motion encoder are fused to pass complementary information.

7. SemAttNet

Fig. A9 SemAttNet [51]

It consists of a novel three-branch backbone and a CSPN++ (convolutional spatial propagation network) module with atrous convolutions. Unlike earlier image-guided methods, it designs a separate branch for learning the semantic information of the scene. Furthermore, it applies an attention-based fusion block to perform semantic-aware fusion between the RGB, depth, and semantic modalities. Each branch outputs a depth map and a confidence map, which are adaptively fused to produce a fused depth map. In the end, the fused depth map is sent to the CSPN++ module with atrous convolutions for refinement. Note that, due to shortage of space, the figure uses AFB to represent the SAMMAFB block.
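The confidence-based fusion of the three branch outputs can be sketched in a few lines. This is a hypothetical simplification of that single step: per-pixel softmax over the confidence maps gives the weights for the fused depth; the attention blocks and CSPN++ refinement are not shown.

```python
import torch

def confidence_fuse(depths, confidences):
    """depths, confidences: lists of (B, 1, H, W) maps, one pair per branch.
    Softmax over the confidences yields per-pixel weights for the fused depth."""
    conf = torch.softmax(torch.stack(confidences), dim=0)   # (3, B, 1, H, W) weights
    return (torch.stack(depths) * conf).sum(dim=0)          # (B, 1, H, W) fused depth

d_rgb, d_sem, d_dep = (torch.rand(1, 1, 64, 64) for _ in range(3))   # branch depth maps
c_rgb, c_sem, c_dep = (torch.rand(1, 1, 64, 64) for _ in range(3))   # branch confidence maps
fused = confidence_fuse([d_rgb, d_sem, d_dep], [c_rgb, c_sem, c_dep])
```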

8. MSeg3D

Fig. A10 Multi-modal 3D semantic segmentation model (MSeg3D) [52]

For multi-modal feature fusion, the geometry-based fusion phase (GF-Phase) mainly includes the Geometry-based Feature Fusion Module (GFFM), while the semantic-based fusion phase (SF-Phase) consists of the LiDAR Semantic Feature Aggregation Module (SFAM), the camera SFAM, and the Semantic-based Feature Fusion Module (SFFM).
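A rough sketch of what a geometry-based fusion step looks like in code: project the LiDAR points into an image with the camera intrinsics and extrinsics, sample the pixel feature at each projection, and fuse it with the point feature. All names, shapes, and the final MLP are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def geometry_based_fusion(points_xyz, point_feats, img_feats, K, T_cam_lidar, mlp):
    """points_xyz: (N, 3) LiDAR points; point_feats: (N, C_pt); img_feats: (C_img, H, W);
    K: (3, 3) intrinsics; T_cam_lidar: (4, 4) extrinsics. Returns fused per-point features."""
    n = points_xyz.shape[0]
    pts_h = torch.cat([points_xyz, torch.ones(n, 1)], dim=1)      # homogeneous (N, 4)
    cam = (T_cam_lidar @ pts_h.t()).t()[:, :3]                    # points in the camera frame
    uvz = (K @ cam.t()).t()
    uv = (uvz[:, :2] / uvz[:, 2:3]).round().long()                # pixel coordinates (N, 2)
    h, w = img_feats.shape[-2:]
    uv[:, 0] = uv[:, 0].clamp(0, w - 1)                           # crude in-image clamp
    uv[:, 1] = uv[:, 1].clamp(0, h - 1)
    pix_feats = img_feats[:, uv[:, 1], uv[:, 0]].t()              # sampled image features (N, C_img)
    return mlp(torch.cat([point_feats, pix_feats], dim=1))        # fused per-point features

mlp = nn.Linear(16 + 8, 32)
fused = geometry_based_fusion(torch.rand(100, 3) + 1.0,           # toy points in front of the camera
                              torch.rand(100, 16),                # toy LiDAR point features
                              torch.rand(8, 48, 64),              # toy camera feature map
                              torch.eye(3), torch.eye(4), mlp)
```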

9. The BiFNet architecture

Fig. A11 The BiFNet architecture [53]

The network has two backbones, each with five convolution blocks, the same as ResNet. The fusion module has two groups, and each group is composed of a dense space transformation (DST), a domain transformation (DT), and a context-based fusion (CBF). The DST module transforms the features into the same spatial space, and DT adapts them to the appropriate domain, since the sensors have different measurements and different feature dimensions. Finally, CBF fuses the features, which by then share the same spatial space and domain. The output of the camera-space branch is the final result at test time.
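Once the DST/DT steps have brought the two branches into a common space, the fusion itself can be as simple as a learned per-pixel gate. The sketch below is one plausible form of such a gated fusion, not BiFNet's actual CBF formulation; the module name and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-pixel gate computed from both branches decides how much of each to keep."""
    def __init__(self, c):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.Sigmoid())

    def forward(self, cam_feat, bev_feat):            # both (B, C, H, W), already aligned
        g = self.gate(torch.cat([cam_feat, bev_feat], dim=1))
        return g * cam_feat + (1 - g) * bev_feat      # convex per-pixel combination

fused = GatedFusion(32)(torch.rand(1, 32, 20, 20), torch.rand(1, 32, 20, 20))
```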

10. FusionLane

Fig. A12 FusionLane [54]

The diagram shows the general structure of FusionLane. It designs an encoder module with two input branches instead of a single input with multiple channels, which frees the network from the assumption of perfect alignment between sensors. In this way, it fuses the camera BEV (CBEV) semantic segmentation result (C-Region) with the LiDAR BEV (LBEV), so that the segmentation output combines the accurate classification of the camera with the precise position information of the LiDAR. In addition, the LSTM structure helps the network achieve better predictions by exploiting temporal information.
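A toy sketch of the two-input-branch idea plus a temporal recurrence follows. It is not the FusionLane network: the single convolutions stand in for the encoder branches, and the plain LSTM over pooled per-frame features only illustrates where the timing information enters.

```python
import torch
import torch.nn as nn

class TwoBranchEncoder(nn.Module):
    """Encode the camera-BEV and LiDAR-BEV maps separately and concatenate,
    instead of stacking them as channels of one input that assumes perfect alignment."""
    def __init__(self, c=32):
        super().__init__()
        self.cam_enc = nn.Conv2d(3, c, 3, stride=2, padding=1)     # CBEV branch
        self.lidar_enc = nn.Conv2d(1, c, 3, stride=2, padding=1)   # LBEV branch

    def forward(self, cam_bev, lidar_bev):
        return torch.cat([self.cam_enc(cam_bev), self.lidar_enc(lidar_bev)], dim=1)

enc = TwoBranchEncoder()
frames = [enc(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)) for _ in range(5)]
seq = torch.stack([f.mean(dim=(2, 3)) for f in frames], dim=1)     # (1, T, 2*C) per-frame features
out, _ = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)(seq)  # temporal refinement
```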

References

1. Tesla AI Day, August 19th, 2021.

2. Tesla AI Day, Sept. 30th, 2022.

3. B Yang, M Bai, M Liang, W Zeng, R Urtasun, “Auto4D: Learning to Label 4D Objects from Sequential Point Clouds”, arXiv 2101.06586, 3, 2021

4. C R. Qi, Y Zhou, M Najibi, P Sun, K Vo, B Deng, D Anguelov, “Offboard 3D Object Detection from Point Cloud Sequences”, arXiv 2103.05073, 3, 2021

5. N Homayounfar, W Ma, J Liang, et al., “DAGMapper: Learning to Map by Discovering Lane Topology”, arXiv 2012.12377, 12, 2020

6. B Liao, S Chen, X Wang, et al., “MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction”, arXiv 2208.14437, 8, 2022

7. J Shin, F Rameau, H Jeong, D Kum, “InstaGraM: Instance-level Graph Modeling for Vectorized HD Map Learning”, arXiv 2301.04470, 1, 2023

8. M Elhousni, Y Lyu, Z Zhang, X Huang, “Automatic Building and Labeling of HD Maps with Deep Learning”, arXiv 2006.00644, 6, 2020

9. K Tang, X Cao, Z Cao, et al., “THMA: Tencent HD Map AI System for Creating HD Map Annotations”, arXiv 2212.11123, 12, 2022

10. Q Li, Y Wang, Y Wang, H Zhao, “HDMapNet: An Online HD Map Construction and Evaluation Framework”, arXiv 2107.06307, 7, 2021

11. Y Liu, Y Wang, Y Wang, H Zhao, “VectorMapNet: End-to-end Vectorized HD Map Learning”, arXiv 2206.08920, 6, 2022

12. J L. Schönberger, J Frahm, “Structure-from-Motion Revisited”, CVPR, 2016

13. J L. Schönberger, E Zheng, M Pollefeys, J Frahm, “Pixelwise View Selection for Unstructured Multi-View Stereo”, ECCV, 2016

14. J Zhang and S Singh, “LOAM: Lidar Odometry and Mapping in Real-time”, Conference of Robotics: Science and Systems, Berkeley, 2014.

15. J. Behley and C. Stachniss. “Efficient Surfel-Based SLAM using 3D Laser Range Data in Urban Environments”. Robotics: Science and Systems, 2018.

16. R Yu, C Russell, L Agapito, “Video Pop-up: Monocular 3D Reconstruction of Dynamic Scenes”, ECCV, 2014

17. S Bullinger, C Bodensteiner, M Arens, R Stiefelhagen, “3D Vehicle Trajectory Reconstruction in Monocular Video Data Using Environment Structure Constraints”, ECCV, 2018

18. H. Lim, S. Hwang, and H. Myung. “ERASOR: Egocentric Ratio of Pseudo Occupancy-Based Dynamic Object Removal for Static 3D Point Cloud Map Building”. IEEE Robotics and Automation Letters (RA-L), 2021.

19. X. Chen, S. Li, B. Mersch, L. Wiesmann, J. Gall, J. Behley, and C. Stachniss. “Moving Object Segmentation in 3D LiDAR Data: A Learning-based Approach Exploiting Sequential Data”, arXiv 2105.08971, 2021

20. X Chen, B Mersch, L Nunes, R Marcuzzi, I Vizzo, J Behley, C Stachniss, “Automatic Labeling to Generate Training Data for Online LiDAR-based Moving Object Segmentation”, arXiv 2201.04501, 1, 2022

21. M Saputra, A Markham, N Trigoni, “Visual SLAM and Structure from Motion in Dynamic Environments: A Survey”, ACM Computing Surveys, 2019

22. S Milz, G Arbeiter, C Witt, B Abdallah, “Visual SLAM for Automated Driving: Exploring the Applications of Deep Learning”, IEEE CVPR, 2018

23. Z Zhang, “A Flexible New Technique for Camera Calibration”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 2000, 22(11): 1330–1334.

24. J. Levinson and S. Thrun, “Automatic online calibration of cameras and lasers.” Robotics: Science and Systems, vol. 2, 2013.

25. G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, “Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information.” AAAI, 2012.

26. T Qin, S Shen, “Online Temporal Calibration for Monocular Visual-Inertial Systems”, IEEE IROS, 2018

27. X Wang, L Xu, H Sun, et al., “On-road Vehicle Detection and Tracking Using MMW Radar and Monovision Fusion”, IEEE Trans. on Intelligent Transportation Systems, 2016, 17(7): 1–10.

28. H Caesar, V Bankiti, A H. Lang, et al., “nuScenes: A multimodal dataset for autonomous driving”, arXiv 1903.11027, 2019

29. S Huang, L Liu, J Dong, X Fu, “A Survey of Ground Filtering Algorithms for Vehicle-Mounted LiDAR Point Cloud Data” (in Chinese), Opto-Electronic Engineering, 47(12), 2020

30. W Xu, Y Cai, D He, et al., “FAST-LIO2: Fast Direct LiDAR-inertial Odometry”, arXiv 2107.06829, 7, 2021

31. Y Zhao, X Zhang, X Huang, “A Technical Survey and Evaluation of Traditional Point Cloud Clustering Methods for LiDAR Panoptic Segmentation”, ICCV workshop, 2021

32. X Weng, J Wang, D Held, K Kitani, “3D Multi-Object Tracking: A Baseline and New Evaluation Metrics”, arXiv 1907.03961, 7, 2019.

33. N Certad, W Morales-Alvarez, C Olaverri-Monreal, “Road Markings Segmentation from LIDAR Point Clouds using Reflectivity Information”, arXiv 2211.01105, 2022

34. P Sun, X Zhao, Z Xu, “A 3D LiDAR Data-Based Dedicated Road Boundary Detection Algorithm for Autonomous Vehicles”, IEEE Access, 7(2), 2019

35. Z Yang, T Liu, S Shen, “Self-Calibrating Multi-Camera Visual-Inertial Fusion for Autonomous MAVs”, IEEE IROS, 2016

36. Y Yang, D Tang, D Wang, et al., “Multi-camera visual SLAM for off-road navigation”, Robotics and Autonomous Systems, 128(3), 2020

37. R Achanta, A Shaji, K Smith, et al., “SLIC superpixels compared to state-of-the-art superpixel methods”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(11).

38. Y. Shin, S. Park, A. Kim. “Direct Visual SLAM using Sparse Depth for Camera-LiDAR System”. IEEE ICRA, 2018.

39. Y Zhu, C Zheng, C Yuan, et al., “DVIO: Depth-Aided Visual Inertial Odometry for RGBD Sensors”, arXiv 2110.10805, 10, 2021

40. F Zhang, C Guan, J Fang, et al., “Instance Segmentation of LiDAR Point Clouds”, IEEE ICRA, 2020

41. L. Nunes et al., “Unsupervised Class-Agnostic Instance Segmentation of 3D LiDAR Data for Autonomous Vehicles”, IEEE Robotics and Automation Letters, 7(4), 2022.

42. C R Qi, H Su, K Mo, L J Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation”, CVPR, 2017.

43. A. H. Lang, S. Vora, H. Caesar, et al. “Pointpillars: Fast encoders for object detection from point clouds”. IEEE CVPR, 2019.

44. G Csurka, R Volpi, B Chidlovskii, “Semantic Image Segmentation: Two Decades of Research”, arXiv 2302.06378, 2023

45. Y Wang, Z Xu, X Wang, et al., “End-to-End Video Instance Segmentation with Transformers”, arXiv 2011.14503, 10, 2021

46. K Luo, C Wang, S Liu, et al., “Upflow: Upsampling pyramid for unsupervised optical flow learning”, IEEE CVPR, 2021

47. Y Wei, L Zhao, W Zheng, et al., “SurroundDepth: Entangling Views for Self-Supervised Multi-Camera Depth Estimation”, arXiv 2204.03636, 2022

48. S Yang, X Yi, Z Wang, et al., “Visual SLAM using Multiple RGB-D Cameras”, IEEE Int. Conf. on Robotics and Biomimetics (ROBIO), 2015

49. X Meng, W Gao, Z Hu, “Dense RGB-D SLAM with multiple cameras”, Sensors, 18(7), 2018

50. H Liu, T Lu, Y Xu, et al., “Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion”, arXiv 2303.12017, 2023

51. D. Nazir, A. Pagani, M. Liwicki, et al., “SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion”, arXiv 2204.13635, 2022

52. J Li, H Dai, H Han, Y Ding, “MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving”, IEEE CVPR, 2023

53. H Li, Y Chen, Q Zhang, D Zhao, “BiFNet: Bidirectional Fusion Network for Road Segmentation”, arXiv 2004.08582, 4, 2020

54. R Yin, B Yu, H Wu, et al., “FusionLane: Multi-Sensor Fusion for Lane Marking Semantic Segmentation Using Deep Neural Networks”, arXiv 2003.04404, 2020

55. Y Li, Z Ge, G Yu, et al., “BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection”, arXiv 2206.10092, 6, 2022

56. B Liao, S Chen, X Wang, et al., “MapTR: Structured Modeling And Learning For Online Vectorized HD Map Construction”, arXiv 2208.14437, 8, 2022

57. J Huang, G Huang, Z Zhu, D Du, “BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View”, arXiv 2112.11790, 12, 2021

58. X Wang, Z Zhu, W Xu, et al., “OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception”, arXiv 2303.03991, 3, 2023

59. Z Huang, Y Wen, Z Wang, J Ren, and K Jia. “Surface Reconstruction from Point Clouds: A Survey and a Benchmark”. arXiv 2205.02413, 5, 2022
