Automatic Multi-Sensor Data Annotation of BEV/Occupancy Autonomous Driving (Part I)

Yu Huang
19 min read · Apr 12, 2023

1 Introduction

Autonomous driving has been recognized as a “long tail distribution” problem that requires continuous discovery of corner-case data to improve the performance of solutions. Obtaining valuable data, annotating it, training models, validating them, and deploying them constitute the data closed loop of autonomous driving development. Cloud service companies that provide autonomous driving data platforms all offer semi-automatic or automatic annotation functions, but most of them focus on 2D images rather than real 3D scenes.

Tesla has demonstrated strong capabilities in this area [1–2], with automatic annotation playing a prominent role. It builds its entire automatic annotation model on purely visual sensor data and obtains rich 3D annotations from visual reconstruction through deep learning, such as obstacle detection, road (lane line) segmentation, and estimation of motion, depth, and the ego vehicle trajectory. Meanwhile, Google Waymo and Uber ATG have introduced automatic object annotation methods based on LiDAR data [3–4], which segment static backgrounds and dynamic vehicles and pedestrians from point clouds and refine the structure of the static background and the dynamic foreground respectively. The automated production of high-definition maps, which annotates map elements such as lane markings, curbs, zebra crossings, and even arrows and text, has also attracted increasing attention [5–11].

Currently, annotating static backgrounds from sensor data is mostly done with SLAM or SfM methods [12–15], but annotating moving objects remains challenging [16–20]. Compared with vision, LiDAR has natural advantages as an active 3D sensor [18–20], whereas vision-based 3D annotation still struggles in practice, with significantly weaker accuracy and robustness [21–22].

Existing methods have several weaknesses:

1. 3D automatic/semi-automatic annotation methods for autonomous driving scene data still lack an effective development model, so a lot of manual work is needed to fix inaccurate annotations;

2. 3D automatic/semi-automatic annotation systems that combine LiDAR and camera are relatively rare; most handle data collected by LiDAR or camera independently;

3. There is still no mature solution for 3D automatic/semi-automatic annotation of dynamic objects, especially with pure vision;

4. There is a lack of annotation tools that support both BEV (bird’s eye view) networks and occupancy networks.

Here we propose an automatic/semi-automatic labeling system for autonomous driving data that suits both collection-vehicle data in the R&D stage and customer data in the mass production stage. The proposal constructs a 3D scene semi-automatic annotation system for LiDAR and camera simultaneously; it also supports independent operation, i.e. point cloud annotation for LiDAR alone or 3D annotation from visual reconstruction for cameras alone, as well as joint annotation of the two.

Meanwhile, we adopt both traditional and deep learning methods for automatic/semi-automatic annotation: the system is developed progressively, with training data continuously accumulated by traditional methods to feed the deep learning models. This maintains and updates the learning-based automatic annotation models and builds a closed loop inside the annotation platform. The proposed system supports both BEV perception and occupancy perception, so it not only serves the development of BEV networks but also assists research on occupancy networks.

2 Data Acquisition and System Configuration

An autonomous driving data acquisition system needs sensors such as GNSS, IMU, cameras, millimeter-wave radars, and LiDAR, as shown in Figure 1 (for example, the open-source nuScenes dataset [28]). LiDAR may not be standard on mass-production vehicles, so data obtained from them does not include 3D LiDAR point clouds. The camera configuration covers the 360-degree environment around the vehicle: 6 cameras, 5 millimeter-wave radars, and one 360-degree scanning LiDAR (or a combination of several limited-FOV scanning LiDARs covering 360 degrees).

Figure 1 nuScenes’ vehicle sensor configuration [28]

If multi-modal sensor data is to be collected, sensor calibration is required to determine the coordinate-system relationships between the sensors, such as camera calibration [23], camera-LiDAR calibration [24–25], LiDAR-IMU calibration [26], and camera-radar calibration [27]. In addition, the sensors need a unified clock (GNSS, for example), and a trigger signal is then used to synchronize them in time; for instance, the LiDAR’s sweep signal can trigger the camera’s exposure [28].
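
As an illustration of how these calibration results are used downstream, the following minimal sketch (assuming NumPy; all function and variable names are illustrative and not tied to any particular dataset format) projects LiDAR points into a camera image given the camera intrinsics and the camera-LiDAR extrinsic transform:

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project Nx3 LiDAR points into the image plane of a calibrated camera.

    points_lidar : (N, 3) points in the LiDAR frame
    T_cam_lidar  : (4, 4) extrinsic transform, LiDAR frame -> camera frame
    K            : (3, 3) camera intrinsic matrix
    Returns pixel coordinates (M, 2) and depths (M,) for points in front of the camera.
    """
    # Homogeneous coordinates, then transform into the camera frame
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera
    in_front = pts_cam[:, 2] > 0.1
    pts_cam = pts_cam[in_front]

    # Perspective projection with the intrinsic matrix
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv, pts_cam[:, 2]
```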

An autonomous driving development platform needs to support the entire data closed loop between the vehicle side and the cloud: data collection and preliminary screening on the vehicle, active-learning-based mining in the cloud database, automatic labeling, model training and simulation testing (simulation data can also be used for training), and model deployment back to the vehicle, as shown in Figure 2. Data selection and data annotation are the key modules that determine the efficiency of the data closed loop.

Figure 2 Data closed loop architecture

The data annotation task is divided into the R&D stage and the mass production stage: 1) The R&D stage mainly involves the development team’s data collection vehicles, which include LiDAR, so that LiDAR point clouds can provide 3D ground truth for the camera images. For example, BEV (bird’s eye view) visual perception requires producing BEV output from 2D images, which involves perspective projection and inference of 3D information; 2) In the mass production stage, data is mostly provided by passenger-vehicle customers or commercial-vehicle operators. Most of them have no LiDAR data, or only 3D point clouds from a limited FOV (such as forward-facing). Therefore, 3D data must be estimated or reconstructed from the camera images for annotation.

Note: The construction of a digital twin for driving scenarios in the simulation system requires a 3D reconstruction of the entire environment, including roadside buildings and blocks, and the proposed labeling system design still works with slight modifications.

3 Traditional Methods of Sensor Data Annotation

Deep learning models require data for training, while traditional methods can process data in an unsupervised way. First, for LiDAR data, we propose a traditional algorithmic framework for 3D point cloud annotation, as shown in Figure 3:

1) The “pre-process” module includes coordinate conversion of the point cloud (e.g. polar representation, range image, or BEV image), denoising/filtering [29], and sampling (including “hole filling” of missing points); the point cloud and IMU sequences are then optimized for LiDAR-IMU odometry estimation and background bundle adjustment (BA) in the “SLAM” module [30], which also generates the vehicle motion trajectory. The reference architecture of the “SLAM” module is shown in Figure 4 (details later); based on the odometry results, the “Mot Seg” module segments, per frame, the point clouds whose motion differs from the background;

2) Because the LiDAR’s viewpoint and coordinates change significantly between scans, excessive obstacle points can hurt the extraction of an object’s 3D box. Therefore, the “ROI extr” module extracts regions of interest from the raw point cloud; based on the ego vehicle’s driving trajectory and known map information, a rough road area can be taken as the ROI;

3) Each point cloud frame contains a large number of ground points, while detection aims at road obstacle information, so the ground points must be segmented out; the “grd seg” module locates the ground point cloud, and the roadside is also the boundary of the drivable road. A piecewise plane fitting method can be used here, dividing the point cloud into multiple segments along the x-axis and applying RANSAC plane fitting to extract ground points in each segment; to detect the road boundary, the distance threshold (15–30 cm) is set so that the ground points include all road boundary points (a sketch of this piecewise fitting, together with the clustering of the next steps, is given after Figure 3);

4) For the non-road point cloud output by the “grd seg” module, the “clust” module groups static objects (obstacles) with unsupervised clustering into multiple clusters, each representing an obstacle (such as a traffic cone or a parked vehicle); the clustering can simply be Euclidean-distance based, e.g. K-means or DBSCAN [31];

5) In another “clust” module, moving objects (obstacles) are also grouped by unsupervised clustering into multiple clusters, each standing for an obstacle (such as a vehicle, pedestrian, or bicycle), using the same algorithm as in the previous step;

6) For moving objects, a 3D Kalman filter [32] can be used in the “Track” module to associate point clouds and estimate object trajectories across preceding and succeeding frames, yielding smoother 3D box positions and poses; based on the object’s motion, point clouds from multiple frames can be aligned to construct a denser point cloud (this works for rigid objects such as vehicles; for deformable objects such as pedestrians, the effect is weaker);

7) For clustered obstacles (both static and dynamic), feature extraction and a classifier can be used to identify them in two separate “class” modules; in addition, a 3D cuboid is fitted to each cluster and obstacle attributes are computed, such as center point, centroid, length, and width;

8) The tracking-aligned point clouds of dynamic objects and the clustered point clouds of static objects enter the “surf rec” module, which runs a shape (surface) recovery algorithm [59];

9) For road traffic markings, such as lane markings, zebra crossings, and sidewalk markings, point cloud reflectivity is used for filtering [33]: the “thre” module removes unmarked road-surface points, the remaining points are connected into clusters in the “clust” module, and the corresponding categories (lane marking, zebra crossing, etc.) are assigned;

10) To detect road boundaries, road point cloud features are extracted, such as the height difference, smoothness, and horizontal distance of adjacent points; candidate boundary points are then obtained in the “thre” module, and finally piecewise approximation of the connected points is performed in the “line fit” module [34] and the road boundary class is assigned;

11) These road markings are then segmented and annotated with polylines in the “Vect Rep” module;

12) Finally, all annotations are projected onto the vehicle coordinate system, i.e. a single frame (LiDAR or camera), to obtain the final annotation.

Fig. 3 LiDAR data annotation
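
The piecewise ground fitting of step 3) and the Euclidean clustering of steps 4)–5) can be sketched as follows. This is a minimal illustration assuming NumPy and scikit-learn’s DBSCAN; the thresholds, segment count, and the choice of DBSCAN are illustrative rather than the exact settings of the pipeline above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def fit_ground_ransac(points, n_iters=200, dist_thresh=0.2):
    """RANSAC plane fit; returns a boolean mask of ground inliers."""
    best_inliers = np.zeros(len(points), dtype=bool)
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        idx = rng.choice(len(points), 3, replace=False)
        p0, p1, p2 = points[idx]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-6:
            continue                      # degenerate sample, skip
        normal /= norm
        dist = np.abs((points - p0) @ normal)
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

def segment_and_cluster(points, n_segments=4, eps=0.5, min_samples=10):
    """Piecewise ground removal along the x-axis, then Euclidean clustering."""
    ground = np.zeros(len(points), dtype=bool)
    edges = np.linspace(points[:, 0].min(), points[:, 0].max(), n_segments + 1)
    for i in range(n_segments):
        seg = (points[:, 0] >= edges[i]) & (points[:, 0] < edges[i + 1])
        if seg.sum() >= 3:
            ground[np.where(seg)[0][fit_ground_ransac(points[seg])]] = True
    obstacles = points[~ground]
    labels = (DBSCAN(eps=eps, min_samples=min_samples).fit_predict(obstacles)
              if len(obstacles) else np.empty(0, dtype=int))
    return ground, obstacles, labels      # label -1 marks noise points
```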

As shown in Figure 4, the reference framework of the LiDAR-IMU “SLAM” module in Figure 3 works as follows: raw LiDAR points are accumulated between 10 ms (100 Hz IMU updates) and 100 ms (10 Hz LiDAR updates), and the accumulated point cloud is called a scan. For state estimation, each newly scanned point cloud is registered to the map points maintained in a large local map (i.e. odometry) using a tightly coupled iterated extended Kalman filter (IEKF). The map is stored in an incremental kd-tree (ikd-tree) [30]. The observation equation comes from directly matching points to the map; besides k-nearest-neighbor (k-NN) search, the ikd-tree also supports incremental map updates (point insertion, downsampling, and deletion) [30]. If the LiDAR’s current FoV crosses the map boundary, the map history points farthest from the LiDAR pose are deleted from the ikd-tree. The optimized pose estimate registers the newly scanned points into the global coordinate system and inserts them into the ikd-tree at the odometry rate, merging them into the map (i.e. mapping). A simplified sketch of the point-to-plane data association used as the IEKF observation is given after Figure 4.

Fig. 4 LiDAR-IMU odometry estimation
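
As a rough illustration of the scan-to-map observation used by such an IEKF odometry, the sketch below builds point-to-plane residuals from k-NN neighbors in the map. It uses SciPy’s cKDTree instead of the incremental ikd-tree of FAST-LIO2 and omits the filter update itself, so it should be read as a conceptual sketch only.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_to_plane_residuals(scan_pts, map_pts, map_tree, k=5):
    """For each scan point, find k nearest map points, fit a local plane,
    and return the signed point-to-plane distances used as observations.

    This is only the data-association / observation part; the real system
    (e.g. FAST-LIO2) maintains an incremental kd-tree and iterates the
    Kalman update while re-linearizing these residuals.
    """
    _, idx = map_tree.query(scan_pts, k=k)
    residuals = np.zeros(len(scan_pts))
    normals = np.zeros((len(scan_pts), 3))
    for i, nn in enumerate(idx):
        neighbors = map_pts[nn]
        centroid = neighbors.mean(axis=0)
        # Plane normal = right singular vector of the smallest singular value
        _, _, vt = np.linalg.svd(neighbors - centroid)
        n = vt[-1]
        normals[i] = n
        residuals[i] = np.dot(scan_pts[i] - centroid, n)
    return residuals, normals

# Usage sketch:
# map_tree = cKDTree(map_pts)
# r, n = point_to_plane_residuals(scan_pts, map_pts, map_tree)
```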

For multi-camera image data, we propose a traditional 3D annotation framework for visual data, as shown in Figure 5:

1) First, the “SLAM/SFM” module uses multi-camera visual-IMU odometry [35] to reconstruct the feature point cloud of the static background; the reference architecture is shown in Figure 6 (details later);

2) At the same time, the “Mot Seg” module selects, based on the odometry, matched feature point pairs whose motion differs from the background motion and sends them to the “Clust” module to obtain the feature point pairs of different moving objects (removing isolated pairs);

3) The clustered feature point pairs are then reconstructed in 3D in the “SLAM/SFM” module [36] (the IMU only measures the ego vehicle’s motion and cannot help estimate the motion of dynamic objects); its reference architecture is shown in Figure 7 (details later). After obtaining a point cloud cluster, it is fitted as a 3D object (CAD model or rectangular box); to recover missing points in featureless image regions, image segmentation is performed in the “super-pixel seg.” module [37], and the superpixels containing reconstructed feature points are back-projected into 3D space (with depth interpolation) to form a much denser point cloud; the fitted 3D object then enters the “Obj. Recog” module for classification (an object classifier, e.g. SVM- or MLP-based, is built from the RGB features extracted from the superpixels);

4) In the “Grd. Det” module, a road surface is fitted to the feature point cloud; the optional “ROI Ext” module from Figure 3 can be used to select the road surface region of interest based on the estimated ego-vehicle trajectory. The road surface point cloud is then projected into the image, and a flood-fill algorithm in the “reg. grow” module extracts the road surface area;

5) Based on the fitted road surface, the point clouds of static objects (obstacles) not lying on the road surface are found and clustered in the “clust” module, then fitted as 3D objects (CAD model or rectangular box); similarly, the “super-pixel seg” module used in step 3) provides a much denser reconstruction; the fitted 3D objects then enter the “obj recog” module for classification;

6) The point clouds obtained from dynamic-object SLAM and static-object clustering enter the “surf recons” module, which runs a shape recovery algorithm [59];

7) From the road surface point cloud fit, the road surface equation is obtained and the road markings are then processed in one of two ways: one way is to perform grayscale threshold binarization (e.g. the Otsu method), edge detection (e.g. the Canny operator), and line fitting (e.g. the Hough transform) on the road surface area in the “Img Proc.” module, and then project the detected lane markings, zebra crossings, and road edges back onto the road surface via IPM (inverse perspective mapping); the other way is to first apply IPM pixel by pixel to the road surface image and then run the same operations on the IPM image in the “Img Proc.” module, directly obtaining the detections on the road surface (a minimal IPM sketch is given after Figure 5);

8) In the “Vect Rep” module, we segment lane markings, zebra crossings, and road edges, and then annotate them with piecewise polylines;

9) Finally, all annotations are projected onto the vehicle coordinate system, i.e. a single frame, to obtain the final annotation.

Figure 5 Multi-camera data annotation
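
The IPM of step 7) can be illustrated with a simple ray-plane intersection, assuming a pinhole camera and a road plane n·X = d estimated in the camera frame; function and variable names are illustrative.

```python
import numpy as np

def ipm_pixels_to_ground(pixels_uv, K, plane_n, plane_d):
    """Back-project image pixels onto the road plane (inverse perspective mapping).

    pixels_uv : (N, 2) pixel coordinates of detected lane / marking points
    K         : (3, 3) camera intrinsics
    plane_n, plane_d : road plane n . X = d in the camera frame,
                       e.g. estimated from the reconstructed ground point cloud
    Returns (N, 3) 3D points on the road surface in the camera frame.
    """
    K_inv = np.linalg.inv(K)
    rays = (K_inv @ np.hstack([pixels_uv, np.ones((len(pixels_uv), 1))]).T).T
    # Intersect each viewing ray t * ray with the plane n . X = d
    # (rays nearly parallel to the plane give very large t and should be discarded)
    t = plane_d / (rays @ plane_n)
    return rays * t[:, None]
```

Alternatively, the whole road region can be warped at once with a ground-plane homography (e.g. OpenCV’s warpPerspective), which corresponds to the second, pixel-by-pixel IPM option described above.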

Figure 6 shows the reference architecture of the multi-camera-IMU SLAM in Figure 5, consisting of three stages. The first two stages initialize the estimator linearly and obtain initial values of the camera-IMU calibration without prior knowledge; in the third stage, a tightly coupled state estimator with nonlinear optimization is built on the initial values from the first two stages. The initialization is essentially obtained by running the monocular camera-IMU VINS (visual-inertial navigation system) multiple times. Assuming the cameras have not yet been calibrated to each other (if they have, initial values can be given directly), feature matching between cameras is not considered in the initial stage. The rotation calibration is similar to a hand-eye calibration (a small sketch is given after Figure 6), while the translation calibration extends the VINS sliding-window estimator to multiple cameras. Building on the initialization, feature tracks are established within the same camera (temporal) and between cameras (spatial) based on the relative poses between the cameras. Intuitively, cameras with overlapping fields of view allow 3D triangulation of features; if there is no overlap, or the feature points are too far away, the system degenerates into multiple monocular VINS configurations.

Figure 6 Multi-camera IMU SLAM
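
The hand-eye-style rotation calibration mentioned above can be sketched as a linear least-squares problem over quaternions: each pair of relative rotations (one from IMU pre-integration, one from camera SfM) constrains the unknown camera-IMU rotation, and the stacked system is solved by SVD. The sketch below assumes NumPy and quaternions in (w, x, y, z) order; it ignores the outlier weighting used by practical systems such as VINS.

```python
import numpy as np

def quat_left(q):    # q = [w, x, y, z]; p (x) q = quat_left(p) @ q
    w, x, y, z = q
    return np.array([[w, -x, -y, -z],
                     [x,  w, -z,  y],
                     [y,  z,  w, -x],
                     [z, -y,  x,  w]])

def quat_right(q):   # p (x) q = quat_right(q) @ p
    w, x, y, z = q
    return np.array([[w, -x, -y, -z],
                     [x,  w,  z, -y],
                     [y, -z,  w,  x],
                     [z,  y, -x,  w]])

def calibrate_rotation(q_imu_deltas, q_cam_deltas):
    """Hand-eye style rotation calibration: find q_X satisfying
    q_imu (x) q_X = q_X (x) q_cam for all relative-rotation pairs.

    Each pair contributes (L(q_imu) - R(q_cam)) q_X = 0; the stacked
    matrix is solved by SVD (smallest singular value's right vector).
    """
    M = np.vstack([quat_left(qa) - quat_right(qb)
                   for qa, qb in zip(q_imu_deltas, q_cam_deltas)])
    _, _, vt = np.linalg.svd(M)
    q_x = vt[-1]
    # Unit quaternion of the camera-IMU extrinsic rotation (up to sign/convention)
    return q_x / np.linalg.norm(q_x)
```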

The reference SLAM framework for multiple cameras is shown in Figure 7. It includes three modules: multi-camera visual localization, panoramic mapping, and loop closure correction. The goal of multi-camera visual localization is to obtain the vehicle’s 6D pose in real time; based on the multi-camera spatial perception model, the pose can be quickly estimated from the multi-camera image frames. The localization process has three states: initialization, tracking, and re-localization. The mapping system builds a sparse point cloud map from matched feature points, with each map point described by a feature descriptor, making the map reusable. To keep the map from growing too large, it is only built from keyframes that meet specific conditions; keyframes are composed of features extracted from the multi-camera images. To represent the co-visibility between keyframes, a co-visibility graph is constructed with keyframes as nodes and the number of map points shared by two frames as edge weights (a sketch of this construction is given after Figure 7); a larger weight means more shared observations. Mapping can be synchronous or asynchronous: synchronous mapping uses any pair of cameras for the 3D construction process, while asynchronous mapping triangulates map points using the current keyframe and previous keyframes in the co-visibility graph. Loop closure detection is the system’s ability to recognize that it has returned to a previously visited scene, and correction based on it greatly improves global consistency; based on loop closure information, the trajectory and the map can be corrected simultaneously.

Figure 7 Multi-camera SLAM
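
The co-visibility graph described above is straightforward to construct once each keyframe’s observed map points are known; the sketch below (pure Python, with an illustrative minimum-sharing threshold) returns the weighted edges between keyframes.

```python
from itertools import combinations

def build_covisibility_graph(keyframe_observations, min_shared=15):
    """keyframe_observations: dict {keyframe_id: set of observed map-point ids}.

    Returns weighted edges {(kf_i, kf_j): shared-point count} of the
    co-visibility graph; only pairs sharing at least `min_shared` map points
    are connected (the threshold is illustrative).
    """
    edges = {}
    for kf_i, kf_j in combinations(sorted(keyframe_observations), 2):
        shared = len(keyframe_observations[kf_i] & keyframe_observations[kf_j])
        if shared >= min_shared:
            edges[(kf_i, kf_j)] = shared
    return edges

# Usage sketch:
# obs = {0: {1, 2, 3, 4}, 1: {2, 3, 4, 5}, 2: {10, 11}}
# edges = build_covisibility_graph(obs, min_shared=2)   # {(0, 1): 3}
```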

Given both LiDAR point clouds and multi-camera image data, we can annotate them jointly with a traditional method, as shown in Figure 8:

1) As before, the input LiDAR data passes through a “pre-process” module for point cloud filtering and is then combined with the multi-camera images in the “SLAM” module. Here the LiDAR point cloud is projected onto each image plane to form a depth grid (a sketch of this projection is given after Figure 8), which is jointly estimated with the IMU for vehicle odometry [38–39], and a dense 3D point cloud map is reconstructed. The architecture is shown in Figure 9 (details later);

2) As in Figure 3, the data then enters the “Mot Seg” module to separate moving objects from the stationary background; the moving objects further pass through the “Clust”, “Track”, and “Obj Recog” modules to obtain moving-obstacle annotations;

3) On the other hand, running the “grd seg” module on the stationary background yields two parts, static objects (obstacles) and the road surface: the former, as in Figure 3, passes through the “clust” and “obj recog” modules to obtain static-obstacle labels, while the latter, as in Figure 5, passes through the “reg grow” module, the “img proc” module (IPM based on the road surface equation), and the “vect rep” module to obtain polyline labels for lane markings, zebra crossings, and road boundaries;

4) The point clouds obtained from dynamic object tracking and static object clustering enter the “surf recon” module and perform the shape recovery algorithm [59];

5) Finally, all annotations are projected onto the vehicle coordinate system of a single frame to obtain the final annotation.

Fig. 8 LiDAR+Multi-camera data Annotation
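
The depth-grid formation of step 1) amounts to projecting the LiDAR points into each camera and rasterizing a sparse depth image; a minimal NumPy sketch, with illustrative names and a simple nearest-point rule for overlapping pixels, is given below.

```python
import numpy as np

def lidar_depth_grid(points_lidar, T_cam_lidar, K, image_shape):
    """Project LiDAR points into one camera and rasterize a sparse depth grid.

    image_shape = (H, W). Returns an (H, W) array with 0 where no LiDAR
    return is available; overlapping points keep the nearest depth.
    """
    H, W = image_shape
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    valid = pts_cam[:, 2] > 0.1
    pts_cam = pts_cam[valid]

    uv = (K @ pts_cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, depth = uv[inside], pts_cam[inside, 2]

    grid = np.zeros((H, W))
    # Write far points first so that nearer points overwrite them
    order = np.argsort(-depth)
    grid[uv[order, 1], uv[order, 0]] = depth[order]
    return grid
```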

Figure 9 shows the reference LiDAR-multi-camera-IMU SLAM framework: LiDAR point clouds are projected onto each camera to form a depth grid, while feature detection and tracking are performed on each image to obtain the initial pose; the depth grid and the 2D feature positions are combined to compute the depth of the 2D features (note: each camera-LiDAR pair forms its own SLAM pipeline; a small sketch of this depth assignment is given after Figure 9). The feature tracks and IMU data are then used to initialize the estimator; afterwards, the IMU pose, velocity, and biases from IMU pre-integration, together with the camera frame features, build the sliding window, and nonlinear optimization performs the state estimation. Once the sliding-window state is estimated, global pose graph optimization is carried out together with the loop closure detection (place recognition) module, and finally a 3D point cloud map is obtained.

Figure 9 LiDAR-camera-IMU SLAM
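
To attach depth to the tracked 2D features as described above, one simple option (an assumption here, not necessarily what [38–39] do) is to take the median of the valid LiDAR depths in a small window around each feature:

```python
import numpy as np

def feature_depth_from_grid(features_uv, depth_grid, window=5):
    """Assign each tracked 2D feature a depth from the sparse LiDAR depth grid.

    Takes the median of the valid (non-zero) LiDAR depths inside a small
    window around the feature; features with no nearby LiDAR return keep
    depth 0 and fall back to pure visual triangulation. Window size is
    illustrative.
    """
    H, W = depth_grid.shape
    half = window // 2
    depths = np.zeros(len(features_uv))
    for i, (u, v) in enumerate(np.round(features_uv).astype(int)):
        u0, u1 = max(u - half, 0), min(u + half + 1, W)
        v0, v1 = max(v - half, 0), min(v + half + 1, H)
        patch = depth_grid[v0:v1, u0:u1]
        valid = patch[patch > 0]
        if valid.size:
            depths[i] = np.median(valid)
    return depths
```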

to be continued ……

References

1. Tesla AI Day, August 19th, 2021.

2. Tesla AI Day, Sept. 30th, 2022.

3. B Yang, M Bai, M Liang, W Zeng, R Urtasun, “Auto4D: Learning to Label 4D Objects from Sequential Point Clouds”, arXiv 2101.06586, 3, 2021

4. C R. Qi, Y Zhou, M Najibi, P Sun, K Vo, B Deng, D Anguelov, “Offboard 3D Object Detection from Point Cloud Sequences”, arXiv 2103.05073, 3, 2021

5. N Homayounfar, W Ma, J Liang, et al., “DAGMapper: Learning to Map by Discovering Lane Topology”, arXiv 2012.12377, 12, 2020

6. B Liao, S Chen, X Wang, et al., “MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction”, arXiv 2208.14437, 8, 2022

7. J Shin, F Rameau, H Jeong, D Kum, “InstaGraM: Instance-level Graph Modeling for Vectorized HD Map Learning”, arXiv 2301.04470, 1, 2023

8. M Elhousni, Y Lyu, Z Zhang, X Huang , “Automatic Building and Labeling of HD Maps with Deep Learning”, arXiv 2006.00644, 6, 2020

9. K Tang, X Cao, Z Cao, et al., “THMA: Tencent HD Map AI System for Creating HD Map Annotations”, arXiv 2212.11123, 12, 2022

10. Q Li, Y Wang, Y Wang, H Zhao, “HDMapNet: An Online HD Map Construction and Evaluation Framework”, arXiv 2107.06307, 7, 2021

11. Y Liu, Y Wang, Y Wang, H Zhao, “VectorMapNet: End-to-end Vectorized HD Map Learning”, arXiv 2206.08920, 6, 2022

12. J L. Schönberger, J Frahm, “Structure-from-Motion Revisited”, CVPR, 2016

13. J L. Schönberger, E Zheng, M Pollefeys, J Frahm, “Pixelwise View Selection for Unstructured Multi-View Stereo”, ECCV, 2016

14. J Zhang and S Singh, “LOAM: Lidar Odometry and Mapping in Real-time”, Conference of Robotics: Science and Systems, Berkeley, 2014.

15. J. Behley and C. Stachniss, “Efficient Surfel-Based SLAM using 3D Laser Range Data in Urban Environments”, Robotics: Science and Systems, 2018.

16. R Yu, C Russell, L Agapito, “Video Pop-up: Monocular 3D Reconstruction of Dynamic Scenes”, ECCV, 2014

17. S Bullinger, C Bodensteiner, M Arens, R Stiefelhagen, “3D Vehicle Trajectory Reconstruction in Monocular Video Data Using Environment Structure Constraints”, ECCV, 2018

18. H. Lim, S. Hwang, and H. Myung. “ERASOR: Egocentric Ratio of Pseudo Occupancy-Based Dynamic Object Removal for Static 3D Point Cloud Map Building”. IEEE Robotics and Automation Letters (RA-L), 2021.

19. X. Chen, S. Li, B. Mersch, L. Wiesmann, J. Gall, J. Behley, and C. Stachniss, “Moving Object Segmentation in 3D LiDAR Data: A Learning-based Approach Exploiting Sequential Data”, arXiv 2105.08971, 2021

20. X Chen, B Mersch, L Nunes, R Marcuzzi, I Vizzo, J Behley, C Stachniss, “Automatic Labeling to Generate Training Data for Online LiDAR-based Moving Object Segmentation”, arXiv 2201.04501, 1, 2022

21. M Saputra, A Markham, N Trigoni, “Visual SLAM and Structure from Motion in Dynamic Environments: A Survey”, ACM Computing Surveys, 2019

22. S Milz, G Arbeiter, C Witt, B Abdallah, “Visual SLAM for Automated Driving: Exploring the Applications of Deep Learning”, IEEE CVPR, 2018

23. Z Zhang, “A Flexible New Technique for Camera Calibration”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 2000, 22(11): 1330–1334.

24. J. Levinson and S. Thrun, “Automatic online calibration of cameras and lasers.” Robotics: Science and Systems, vol. 2, 2013.

25. G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, “Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information.” AAAI, 2012.

26. T Qin, S Shen, “Online Temporal Calibration for Monocular Visual-Inertial Systems”, IEEE IROS, 2018

27. X Wang, L Xu, H Sun, et al., “On-road Vehicle Detection and Tracking Using MMW Radar and Monovision Fusion”, IEEE Trans. on Intelligent Transportation Systems, 2016, 17(7): 1–10.

28. H Caesar, V Bankiti, A H. Lang, et al., “nuScenes: A multimodal dataset for autonomous driving”, arXiv 1903.11027, 2019

29. S Huang, L Liu, J Dong, X Fu, “A Review of Ground Filtering Algorithms for Vehicle-Mounted LiDAR Point Cloud Data”, Opto-Electronic Engineering, 47(12), 2020

30. W Xu, Y Cai, D He, et al., “FAST-LIO2: Fast Direct LiDAR-inertial Odometry”, arXiv 2107.06829, 7, 2021

31. Y Zhao, X Zhang, X Huang, “A Technical Survey and Evaluation of Traditional Point Cloud Clustering Methods for LiDAR Panoptic Segmentation”, ICCV workshop, 2021

32. X Weng, J Wang, D Held, K Kitani, “3D Multi-Object Tracking: A Baseline and New Evaluation Metrics”, arXiv 1907.03961, 7, 2019.

33. N Certad, W Morales-Alvarez, C Olaverri-Monreal, “Road Markings Segmentation from LIDAR Point Clouds using Reflectivity Information”, arXiv 2211.01105, 2022

34. P Sun, X Zhao, Z Xu, “A 3D LiDAR Data-Based Dedicated Road Boundary Detection Algorithm for Autonomous Vehicles”, IEEE Access, 7(2), 2019

35. Z Yang, T Liu, S Shen, “Self-Calibrating Multi-Camera Visual-Inertial Fusion for Autonomous MAVs”, IEEE IROS, 2016

36. Y Yang, D Tang, D Wang, et al., “Multi-camera visual SLAM for off-road navigation”, Robotics and Autonomous Systems, 128(3), 2020

37. R Achanta, A Shaji, K Smith, et al., “SLIC Superpixels Compared to State-of-the-Art Superpixel Methods”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(11).

38. Y. Shin, S. Park, A. Kim. “Direct Visual SLAM using Sparse Depth for Camera-LiDAR System”. IEEE ICRA, 2018.

39. Y Zhu, C Zheng, C Yuan, et al., “DVIO: Depth-Aided Visual Inertial Odometry for RGBD Sensors”, arXiv 2110.10805, 10,2021

40. F Zhang, C Guan, J Fang, et al., “Instance Segmentation of LiDAR Point Clouds”, IEEE ICRA, 2020

41. L. Nunes et al., “Unsupervised Class-Agnostic Instance Segmentation of 3D LiDAR Data for Autonomous Vehicles”, IEEE Robotics and Automation Letters, 7(4), 2022.

42. C R Qi, H Su, K Mo, L J Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation”, CVPR, 2017.

43. A. H. Lang, S. Vora, H. Caesar, et al. “Pointpillars: Fast encoders for object detection from point clouds”. IEEE CVPR, 2019.

44. G Csurka, R Volpi, B Chidlovskii, “Semantic Image Segmentation: Two Decades of Research”, arXiv 2302.06378, 2023

45. Y Wang, Z Xu, X Wang, et al., “End-to-End Video Instance Segmentation with Transformers”, arXiv 2011.14503, 10, 2021

46. K Luo, C Wang, S Liu, et al., “Upflow: Upsampling pyramid for unsupervised optical flow learning”, IEEE CVPR, 2021

47. Y Wei, L Zhao, W Zheng, et al., “SurroundDepth: Entangling Views for Self-Supervised Multi-Camera Depth Estimation”, arXiv 2204.03636, 2022

48. S Yang, X Yi, Z Wang, et al., “Visual SLAM using Multiple RGB-D Cameras”, IEEE Int. Conf. on Robotics and Biomimetics (ROBIO), 2015

49. X Meng, W Gao, Z Hu, “Dense RGB-D SLAM with multiple cameras”, Sensors, 18(7), 2018

50. H Liu, T Lu, Y Xu, et al., “Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion”, arXiv 2303.12017, 2023

51. D. Nazir, A. Pagani, M. Liwicki, et al., “SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion”, arXiv 2204.13635, 2022

52. J Li, H Dai, H Han, Y Ding, “MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving”, IEEE CVPR, 2023

53. H Li, Y Chen, Q Zhang, D Zhao, “BiFNet: Bidirectional Fusion Network for Road Segmentation”, arXiv 2004.08582, 4, 2020

54. R Yin, B Yu, H Wu, et al., “FusionLane: Multi-Sensor Fusion for Lane Marking Semantic Segmentation Using Deep Neural Networks”, arXiv 2003.04404, 2020

55. Y Li, Z Ge, G Yu, et al., “BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection”, arXiv 2206.10092, 6, 2022

56. B Liao, S Chen, X Wang, et al., “MapTR: Structured Modeling And Learning For Online Vectorized HD Map Construction”, arXiv 2208.14437, 8, 2022

57. J Huang, G Huang, Z Zhu, D Du, “BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View”, arXiv 2112.11790, 12, 2021

58. X Wang, Z Zhu, W Xu, et al., “OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception”, arXiv 2303.03991, 3, 2023

59. Z Huang, Y Wen, Z Wang, J Ren, and K Jia, “Surface Reconstruction from Point Clouds: A Survey and a Benchmark”, arXiv 2205.02413, 2022

