Automatic Multi-Sensor Data Annotation of BEV/Occupancy Autonomous Driving (Part III)

Yu Huang
10 min read · Apr 13, 2023


1 Introduction

2 Data acquisition and system configuration

3 Traditional Methods of Sensor Data Annotation

4 Semi-Traditional Methods of Sensor Data Annotation

5 Deep learning-based data annotation

In Summary

Appendix: Deep Learning Models used for Camera & LiDAR Data Annotation

1. Instance segmentation of LiDAR point cloud
Fig. A1 Instance segmentation [40]

(a) Point cloud frame as input, (b) kNN grouping to obtain a dense representation, (c) feature learning with a self-attention block to reorganize the unordered points, (d) backbone network, (e) instance segmentation result with bounding boxes.
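Step (b) can be illustrated with a minimal sketch of k-nearest-neighbour grouping. This is a brute-force version for a toy frame, not the implementation from [40]; the function name `knn_groups` and the sample points are hypothetical.

```python
import math

def knn_groups(points, k):
    """For each point, return the indices of its k nearest neighbours (itself excluded)."""
    groups = []
    for i, p in enumerate(points):
        # Brute force: rank every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        groups.append(others[:k])
    return groups

# Hypothetical toy frame: two tight clusters of 3D points.
frame = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (5.0, 5.0, 0.0), (5.1, 5.0, 0.0), (5.0, 5.1, 0.0),
]
neighbours = knn_groups(frame, k=2)
```

Each point ends up grouped with the members of its own cluster, which gives the later feature-learning stage a local neighbourhood to work on. A real pipeline would likely use a spatial index such as a k-d tree instead of the quadratic search shown here.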

2. Image Semantic Segmentation

Fig. A2 Dual Attention Networks

The Dual Attention Network aggregates the outputs of two modules: the Position Attention Module (PAM), which captures the spatial dependencies between any two positions of the feature maps, and the Channel Attention Module (CAM), which exploits the interdependencies between channel maps. Specifically, the output of each attention module is transformed by a convolution layer; the two results are then fused by an element-wise sum, followed by a convolution layer that generates the final prediction maps.
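The two attention paths and their element-wise sum can be sketched in plain Python. This is a simplified illustration under stated assumptions: the convolution layers are replaced by identity mappings, the feature map is a tiny C x N list of lists (channels by flattened positions), and `gamma` is a hypothetical fixed residual scale rather than a learned parameter.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add_scaled(x, y, gamma):
    # Residual connection: x + gamma * y, element by element.
    return [[a + gamma * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def position_attention(x, gamma):
    """Attention across positions: every position attends to every other position."""
    attn = [softmax(r) for r in matmul(transpose(x), x)]  # N x N, rows sum to 1
    return add_scaled(x, matmul(x, transpose(attn)), gamma)

def channel_attention(x, gamma):
    """Attention across channels: every channel attends to every other channel."""
    attn = [softmax(r) for r in matmul(x, transpose(x))]  # C x C, rows sum to 1
    return add_scaled(x, matmul(attn, x), gamma)

def dual_attention(x, gamma=1.0):
    # Fuse the two branches by element-wise sum (convolutions omitted in this sketch).
    pam = position_attention(x, gamma)
    cam = channel_attention(x, gamma)
    return [[p + c for p, c in zip(rp, rc)] for rp, rc in zip(pam, cam)]

# Hypothetical feature map with C=2 channels and N=3 positions.
features = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
fused = dual_attention(features)
```

The output keeps the C x N shape of the input. With `gamma=0.0` both branches reduce to the identity, so the fused map is simply twice the input, which is a quick sanity check on the residual wiring.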

3. Video Instance Segmentation with Transformers

Fig. A3 Video instance segmentation (VIS) [45]

VisTR contains four main components: 1) a CNN backbone that extracts feature representations of multiple images; 2) an encoder-decoder Transformer that models the relations between pixel-level features and decodes the instance-level features; 3) an instance sequence matching module that supervises the model; and 4) an instance sequence segmentation module that outputs the final mask sequences.
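The matching in component 3 can be pictured as a one-to-one, minimum-cost assignment between predicted and ground-truth instance sequences. The sketch below assumes that formulation and uses a brute-force search over permutations, which is only practical for a handful of instances; the function name `match_sequences` and the cost values are hypothetical, and a real cost would be accumulated over all frames of each sequence.

```python
from itertools import permutations

def match_sequences(cost):
    """cost[i][j] is the cost of assigning predicted sequence i to ground-truth sequence j.

    Returns a list where entry i is the ground-truth index matched to prediction i,
    chosen so that the total cost is minimal. The cost matrix must be square.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)

# Hypothetical sequence-level costs for three predictions and three ground truths.
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
]
assignment = match_sequences(cost)
```

Because the assignment is made per sequence rather than per frame, a matched prediction is supervised against the same instance across the whole clip. For larger instance counts, a polynomial-time solver such as the Hungarian algorithm would replace the permutation search.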

4. UPFlow

Fig. A4 UPFlow [46]

It contains two stages: pyramid encoding, which extracts feature pairs at different scales, and pyramid decoding, which estimates optical flow at each scale. Note that the parameters of the decoder module and the upsample module are shared across all the pyramid levels.

5. SurroundDepth

Fig. A5 SurroundDepth [47]

It utilizes encoder-decoder networks to predict depths. To entangle surrounding views, it proposes the cross-view transformer (CVT) to fuse multi-camera features in a multi-scale fashion. Pretrained with the sparse pseudo depths generated by two-frame Structure-from-Motion, the depth model is able to learn the absolute scale of the real world. By explicitly introducing extrinsic matrices into pose estimation, it can predict multi-view consistent ego-motions and boost the performance of scale-aware depth estimation.
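One way to see why extrinsics yield multi-view consistent ego-motion: a single vehicle pose can be converted into the pose of each camera. The sketch below assumes that an extrinsic matrix E maps camera coordinates to vehicle coordinates and that the ego-motion T is expressed in the vehicle frame, so the motion seen by that camera is E⁻¹ T E. It works in 2D (planar rigid transforms) for brevity, and all function names are hypothetical.

```python
import math

def se2(theta, tx, ty):
    """Planar rigid transform as a 3x3 homogeneous matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def invert_rigid(m):
    # The inverse of [R t; 0 1] is [R^T  -R^T t; 0 1].
    r00, r01, tx = m[0]
    r10, r11, ty = m[1]
    return [
        [r00, r10, -(r00 * tx + r10 * ty)],
        [r01, r11, -(r01 * tx + r11 * ty)],
        [0.0, 0.0, 1.0],
    ]

def camera_pose(ego_motion, extrinsic):
    """Express the vehicle's motion in one camera's frame: E^-1 * T * E."""
    return matmul(invert_rigid(extrinsic), matmul(ego_motion, extrinsic))

# Hypothetical setup: the vehicle moves 1 m forward along its x axis.
ego = se2(0.0, 1.0, 0.0)
front = camera_pose(ego, se2(0.0, 0.0, 0.0))     # camera aligned with the vehicle
rear = camera_pose(ego, se2(math.pi, 0.0, 0.0))  # camera facing backwards
```

The front camera sees the same forward translation as the vehicle, while the rear camera sees the scene move the opposite way along its own x axis. Every camera pose is derived from one shared estimate, so the views cannot disagree about how the vehicle moved.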

6. Bidirectional Camera-LiDAR Fusion Module (Bi-CLFM)

Fig. A6 Bidirectional Camera-LiDAR Fusion Module (Bi-CLFM) [50]

Features from the two modalities (LiDAR and camera) are fused bidirectionally, so that each modality benefits from the other. The gradient flowing from one branch to the other is detached to prevent either modality from dominating.

Fig. A7 CamLiPWC [50]

Synchronized camera and LiDAR frames are taken as input, from which dense optical flow and sparse scene flow are estimated respectively. Built on top of the PWC architecture, CamLiPWC is a two-branch network with multiple bidirectional fusion connections (Bi-CLFM) between them.

Fig. A8 CamLiRAFT [50]

Built on top of the RAFT (Recurrent All-pairs Field Transforms) architecture, it performs four-stage feature fusion: features from the feature encoder, the context encoder, the correlation lookup operation, and the motion encoder are fused to pass complementary information.

7. SemAttNet

Fig. A9 SemAttNet [51]

It consists of a novel three-branch backbone and a CSPN++ (convolutional spatial propagation network) module with atrous convolutions. Unlike earlier image-guided methods, it dedicates a separate branch to learning the semantic information of the scene. It also applies an attention-based fusion block (AFB) to perform semantic-aware fusion between the RGB, depth, and semantic modalities. Each branch outputs a depth map and a confidence map, which are adaptively fused into a single depth map. Finally, the fused depth map is sent to the CSPN++ module with atrous convolutions for refinement. Note that, for lack of space, the figure uses AFB to denote the SAMMAFB block.
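The adaptive fusion of the per-branch depth and confidence maps can be sketched as a per-pixel weighted average. The text above does not spell out the weighting, so this sketch assumes the confidences are normalized with a softmax across branches at each pixel; the function name `fuse_depth_maps` and the sample values are hypothetical.

```python
import math

def fuse_depth_maps(depths, confidences):
    """Fuse per-branch depth maps using per-pixel softmax-normalized confidences.

    depths, confidences: lists of H x W maps (lists of lists), one pair per branch.
    """
    h, w = len(depths[0]), len(depths[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            conf = [cm[r][c] for cm in confidences]
            m = max(conf)  # subtract the max for numerical stability
            weights = [math.exp(v - m) for v in conf]
            total = sum(weights)
            fused[r][c] = sum(wt / total * d[r][c] for wt, d in zip(weights, depths))
    return fused

# Hypothetical 1 x 2 depth maps from three branches (e.g. RGB, semantic, depth).
depths = [[[10.0, 20.0]], [[12.0, 20.0]], [[14.0, 20.0]]]
equal_conf = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]]
fused_equal = fuse_depth_maps(depths, equal_conf)

# The first branch is far more confident at the first pixel.
skewed_conf = [[[50.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]]
fused_skewed = fuse_depth_maps(depths, skewed_conf)
```

With equal confidences the fused depth is the plain mean of the branches; when one branch is much more confident at a pixel, the fused value follows that branch there while other pixels are unaffected.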

8. MSeg3D

Fig. A10 Multi-modal 3D semantic segmentation model (MSeg3D).

For multi-modal feature fusion, GF-Phase mainly includes the Geometry-based Feature Fusion Module (GFFM), while SF-Phase consists of LiDAR Semantic Feature Aggregation Module (SFAM), camera SFAM, and Semantic-based Feature Fusion Module (SFFM).

9. The BiFNet architecture

Fig. A11 The BiFNet architecture [53]

The network has two backbones, each with five convolution blocks, the same as ResNet. The fusion module has two groups, each composed of DST (dense space transformation), DT (domain transformation), and CBF (context-based fusion). DST transforms the features into the same spatial space, and DT adapts them to a common domain, since the sensors differ in their measurements and feature dimensions. Finally, CBF fuses the features that share the same spatial space and domain. The output of the camera space branch is the final result at test time.

10. FusionLane

Fig. A12 FusionLane [54]

The FusionLane diagram shows the general network structure. The encoder has two input branches rather than a single input with multiple channels, which frees the network from the assumption of perfect alignment between the inputs. It can thus fuse the CBEV (camera) semantic segmentation result (C-Region) with the LBEV (LiDAR) input, so that the segmentation output combines accurate classification from the camera with precise position information from the LiDAR. The LSTM structure further helps the network achieve better predictions by exploiting temporal information.



