A Low Cost High Definition Map Building Method for Autonomous Driving

Yu Huang
9 min read · Dec 4, 2019


Abstract: This article describes a low cost high definition map building platform for autonomous driving that relies on a camera only, optionally assisted by IMU and GPS.

1. Introduction

In an autonomous driving system, a high definition map (HD Map) with lane-level accuracy is required for localization. The popular method of building an HD Map uses a LiDAR scanner to observe the street environment and obtain a 3D point cloud and a road reflectivity map, then annotates the lanes, traffic signs and lights, walkways, special lanes (such as public bus/shuttle lanes), road markings, curbs, guard fences, flyovers, tunnels and even parking lots. However, devices equipped with LiDAR, GPS (differential GPS such as RTK) and IMUs are expensive, and the mapping process must scan the same streets many times to overcome occlusions caused by moving objects and obstacles. Furthermore, timely map updates (for instance, for a work zone or a temporary detour) require a LiDAR scanner fleet to collect new data for map refinement.

Another way to build an HD Map at low cost uses only cameras plus consumer-level IMU and GPS to capture RGB image data. For example, Mobileye proposed REM (Road Experience Management), also called the Road Book, applying computer vision techniques (SLAM, SfM) to extract lanes, traffic signs (including RGB light signals), curbs and road markings for map annotation. Startups such as KuanDeng, DeepMotion and Momenta in China, Lvl5, Carmera and Mapper in the USA, and Atlatec in Germany use similar techniques.

Extracting landmarks from the captured data is more difficult for camera-based HD Map building than for LiDAR-based methods, let alone 3D reconstruction. Improving the performance of these techniques is therefore critical to spreading this low cost approach across the HD Map industry.

Milz et al. surveyed applications of deep learning in visual SLAM [1], including depth estimation, optical flow estimation, feature correspondence, bundle adjustment (BA) and visual odometry (camera pose estimation). However, that paper overlooked the importance of semantic objects such as lanes and signs.

Since HD Map building cannot be accomplished fully automatically, some manual editing and adjustment is required in the process. Building a well-designed interactive GUI for map editing is therefore also an important issue. If the editing work can be done by non-professional workers (staff without expertise in computer vision, photogrammetry or machine learning) using efficient editing functions, the operational cost of building an HD Map can be reduced considerably.

Qualcomm’s work [2] proposed an end-to-end, low cost, crowdsourced 3D mapping system for autonomous driving. A front camera with consumer-level GPS/IMU collects data online, while real-time traffic sign and lane detection provides annotations via triangulation. Offline, clustering and bundle adjustment (BA) over multiple collected sessions are run for optimization. However, [2] did not build a map editing tool to handle errors in SLAM and landmark detection. Besides, only the detected landmarks are used for triangulation and BA; such landmarks are too sparse, which may compromise SLAM performance, especially for loop closure.

In the following sections, we propose a low cost HD Map building system for autonomous driving. A camera-based vision module, assisted by consumer-level IMU and GPS devices, runs SLAM (simultaneous localization and mapping), adds semantic objects such as lanes, walkways, road markings, traffic signs and RGB lights, and removes moving objects and obstacles such as pedestrians, vehicles, motorcyclists and cyclists. Finally, a map editing platform is designed to support adding and erasing 2D objects, transferring 2D locations to 3D space based on a plane assumption, and storing the map.

2. Map Building with Semantic Object Annotation

First, the system is built on a monocular keyframe-based visual SLAM framework, such as ORB-SLAM [3], which includes keypoint detection, camera pose estimation and tracking, keyframe extraction, loop closure and BA. If a vehicle IMU is available, a visual-inertial mapping (VIM) system such as maplab [4] can fuse information from the camera and the IMU; its visual-inertial (VI) optimization module specifically handles camera pose estimation and tracking. In GPS-denied environments (streets with skyscrapers, underground parking lots, tunnels etc.), camera plus IMU is a good sensor configuration. Otherwise, GPS can help VIM with camera pose tracking [5], for instance by initializing the whole VI optimization or the loop closure. The GPS localization error at a given position can usually be regarded as constant.
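As an illustration of how GPS fixes can anchor (and scale) the monocular trajectory, below is a minimal NumPy sketch, not the maplab API, that aligns SLAM keyframe positions to GPS positions with a least-squares similarity transform (Umeyama's method); `slam_xyz` and `gps_xyz` are assumed to be time-synchronized Nx3 arrays in a local metric frame.

```python
import numpy as np

def align_trajectory(slam_xyz: np.ndarray, gps_xyz: np.ndarray):
    """Least-squares similarity alignment (Umeyama) of a monocular SLAM
    trajectory (arbitrary scale) to GPS positions (metric)."""
    mu_s, mu_g = slam_xyz.mean(axis=0), gps_xyz.mean(axis=0)
    S, G = slam_xyz - mu_s, gps_xyz - mu_g
    # The SVD of the cross-covariance gives the optimal rotation.
    U, D, Vt = np.linalg.svd(G.T @ S / len(S))
    W = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ W @ Vt
    scale = np.trace(np.diag(D) @ W) / S.var(axis=0).sum()  # fixes monocular scale
    t = mu_g - scale * R @ mu_s
    return scale, R, t

# Usage: georeference all keyframe poses once before map editing.
# s, R, t = align_trajectory(slam_xyz, gps_xyz)
# world_xyz = (s * (R @ slam_xyz.T)).T + t
```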

The SLAM system generates a keyframe pool, the corresponding camera poses, keypoint sets and their reconstructed point cloud. This data then enters an HD map editing platform, shown in Figure 1.

Figure 1. The proposed Map Building System Diagram

On this editing platform, users can erase or add semantic object information: removing moving objects, and adding and annotating lanes, markings, traffic signs and RGB lights. Since 2D image space cannot be mapped directly to the 3D SLAM reconstruction space, inversely projecting image points back to 3D is an ill-posed problem. To obtain unique 3D points, the manifold on which those image points lie must be constrained, for example to a given plane. Besides, missed landmarks have to be added, and wrongly segmented static objects have to be recovered.

2.1 Instance segmentation

Instance segmentation can help extract moving objects such as pedestrians and vehicles from the scene; Mask R-CNN [6] is suggested for this task. When keypoints falling on those regions are removed, their corresponding 3D points are discarded as well. DS-SLAM [13] similarly combines semantic segmentation with visual SLAM to handle dynamic environments.
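As a sketch of how this could be implemented, the snippet below uses the COCO-pretrained Mask R-CNN shipped with torchvision (one off-the-shelf realization of [6], assuming torchvision >= 0.3) to build a per-image mask of potentially moving classes; how the SLAM front end exposes its keypoints for filtering is an assumption here.

```python
import numpy as np
import torch
import torchvision

# COCO class ids treated as potentially moving:
# person, bicycle, car, motorcycle, bus, truck.
DYNAMIC_COCO_IDS = {1, 2, 3, 4, 6, 8}

# Off-the-shelf COCO-pretrained Mask R-CNN (torchvision >= 0.3).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

def dynamic_mask(image_rgb: np.ndarray, score_thr: float = 0.5) -> np.ndarray:
    """Return a boolean HxW mask that is True on moving-object pixels."""
    tensor = torch.from_numpy(image_rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([tensor])[0]
    mask = np.zeros(image_rgb.shape[:2], dtype=bool)
    for label, score, m in zip(out["labels"], out["scores"], out["masks"]):
        if score >= score_thr and int(label) in DYNAMIC_COCO_IDS:
            mask |= (m[0] > 0.5).numpy()
    return mask

# Keypoints on dynamic pixels (and their triangulated 3D points) are dropped:
# keep = [kp for kp in keypoints if not mask[int(kp.pt[1]), int(kp.pt[0])]]
```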

2.2 Lane detection

Lane detection is a pixel-level partial segmentation task; Spatial CNN (SCNN) [7] is a reference method. Lanes can be dashed or solid, single or double, white or yellow. For storage, lanes are represented by straight line segments or curve segments (formulated as B-splines, for example).
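For the storage representation, the following sketch fits a cubic B-spline to one lane's pixels with SciPy; it assumes a lane segmenter such as SCNN has already grouped pixels per lane into the hypothetical `lane_pixels` array.

```python
import numpy as np
from scipy.interpolate import splprep, splev

def fit_lane_spline(lane_pixels: np.ndarray, smooth: float = 50.0):
    """Fit a cubic B-spline to one lane's pixels (Nx2, N >= 4, ordered roughly
    from near to far) and return the compact (knots, coefficients, degree)
    triple for storage in the map's semantic layer."""
    x = lane_pixels[:, 0].astype(float)
    y = lane_pixels[:, 1].astype(float)
    tck, _ = splprep([x, y], s=smooth)  # tck = (knot vector, coeffs, degree)
    return tck

def sample_lane(tck, n: int = 100) -> np.ndarray:
    """Densely resample a stored spline, e.g. for rendering in the editor."""
    u = np.linspace(0.0, 1.0, n)
    x, y = splev(u, tck)
    return np.stack([x, y], axis=1)
```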

2.3 Road markings detection and segmentation

Road markings, such as turn arrows and characters (e.g., speed limits, ‘ONLY’, ‘TURN’, ‘SCHOOL’, ‘KEEP CLEAR’, ‘EXIT’, ‘NO PARKING’), can be extracted by segmentation; as in Section 2.1, Mask R-CNN [6] is suggested. The shapes of characters and arrows can likewise be represented by straight line and curve segments. Detection and recognition of road markings is addressed in [11], and extensive work exploiting geometric constraints such as the vanishing point is reported in [12].
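A minimal sketch of this vectorization step, assuming OpenCV >= 4 and a binary instance mask produced by the segmenter: each marking contour is simplified with the Douglas-Peucker algorithm into a short chain of straight segments for compact storage.

```python
import cv2
import numpy as np

def vectorize_marking(mask: np.ndarray, eps_ratio: float = 0.01):
    """Turn one road-marking instance mask (uint8, 0 or 255) into compact
    polygons: each contour is simplified with the Douglas-Peucker algorithm,
    so arrows and characters are stored as short chains of line segments."""
    # cv2.findContours returns (contours, hierarchy) in OpenCV >= 4.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for c in contours:
        eps = eps_ratio * cv2.arcLength(c, True)  # tolerance ~1% of perimeter
        polygons.append(cv2.approxPolyDP(c, eps, True).reshape(-1, 2))
    return polygons
```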

2.4 Traffic sign and RGB light detection

As a 2D object detection task, this can use a fast one-stage detector such as YOLOv3 [8] or SSD [10]. Traffic sign shapes can be rectangles, circles, triangles, polygons or diamonds. The corners and masks of those signs can serve as features for map-based localization. Joint detection and classification of traffic signs is presented in [9].
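As an illustrative heuristic sketch of using those shapes, the snippet below infers a sign's shape class from its silhouette mask (e.g., obtained by segmenting the detector's crop) with OpenCV; the thresholds are assumptions, not values from any cited paper.

```python
import cv2
import numpy as np

def classify_sign_shape(sign_mask: np.ndarray) -> str:
    """Heuristic: infer a detected sign's shape class from its silhouette,
    deciding which corner template is stored as a localization feature."""
    contours, _ = cv2.findContours(sign_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return "unknown"
    c = max(contours, key=cv2.contourArea)
    approx = cv2.approxPolyDP(c, 0.03 * cv2.arcLength(c, True), True)
    n = len(approx)
    if n == 3:
        return "triangle"
    if n == 4:
        # A quadrilateral rotated by roughly 45 degrees is treated as a diamond.
        angle = abs(cv2.minAreaRect(c)[2])
        return "diamond" if 30.0 < angle < 60.0 else "rectangle"
    if n >= 8:
        return "circle"  # many short edges approximate a circle
    return "polygon"
```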

2.5 Reconstruction of 2D landmarks

It is well known that a single monocular image cannot determine the absolute distance of an object in 3D space. We therefore assume the landmarks lie on a known plane, so that their 3D coordinates can be estimated, as shown in Figure 2.

Figure 2. 2D-to-3D reconstruction based on the known plane

We assume the plane is estimated from the neighboring keypoints around the landmarks, and that the camera calibration parameters are given in advance. The derivation is as follows.

A 3D point X projects onto the camera image plane as a 2D image point x, where X and the camera center C are denoted as 3x1 vectors. The camera rotation in the SLAM coordinate system is denoted as R, x = [u, v, 1]^T, and the camera intrinsic matrix is K. The perspective projection, up to a scale factor λ, is then formulated as:

λ x = K R (X - C)

For inverse projection, the goal is to reconstruct the ray direction as

r = R^{-1} K^{-1} x = R^T K^{-1} x (since R is a rotation)

and determine X = C + λ r with the additional unknown λ. Based on the plane assumption (the known 3D plane equation is ax + by + cz + d = 0, where the normal vector is n = [a, b, c]^T), substituting X into n^T X + d = 0 yields:

λ = -(n^T C + d) / (n^T r), and hence X = C + λ r
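The derivation above translates directly into a few lines of NumPy; this is a minimal sketch assuming R maps world coordinates into the camera frame, consistent with the projection equation.

```python
import numpy as np

def backproject_to_plane(x_px, K, R, C, plane):
    """Intersect the viewing ray of an image point with a known 3D plane.
    x_px: (u, v) pixel; K: 3x3 intrinsics; R: world-to-camera rotation;
    C: camera center (3,); plane: (a, b, c, d) with a*x + b*y + c*z + d = 0."""
    x = np.array([x_px[0], x_px[1], 1.0])
    r = R.T @ np.linalg.inv(K) @ x        # ray direction in the world frame
    n, d = np.asarray(plane[:3], float), float(plane[3])
    lam = -(n @ C + d) / (n @ r)          # lambda from the plane constraint
    return C + lam * r                    # the unique 3D landmark position X
```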

2.6 Map storage

The HD map stores a visual layer, for the SLAM image keypoints and their 3D point cloud, and a semantic layer, for the landmarks. The storage format can be defined by the map builders or follow public standards such as OpenDRIVE, NDS (Navigation Data Standard) and OSM (OpenStreetMap).
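As an illustration only, a toy two-layer map record could be serialized as JSON; every field name below is an assumption for this sketch, and a production system would instead target OpenDRIVE or NDS.

```python
import json

# Toy two-layer map record; all field names are illustrative only.
hd_map = {
    "visual_layer": {
        "keyframes": [
            {"id": 0,
             "pose_Rt": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
             "keypoints_px": [[612.3, 401.7]],
             "descriptor_file": "orb_kf0.bin"},
        ],
        "points3d": [[12.1, -0.4, 33.8]],
    },
    "semantic_layer": {
        "lanes": [{"id": "lane_0", "type": "solid_white",
                   "bspline_ctrl_pts": [[0, 0, 0], [5, 0.1, 0]]}],
        "signs": [{"id": "sign_0", "class": "stop", "shape": "polygon",
                   "corners3d": [[3.0, 1.9, 20.0]]}],
    },
}

with open("hd_map.json", "w") as f:
    json.dump(hd_map, f, indent=2)
```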

3. Summary

This article proposed a low cost HD Map building system for autonomous driving, suitable for crowdsourced map data collection. On the one hand, we extract landmarks and add them to the semantic layer of the map; on the other hand, we remove moving obstacles to clean the SLAM point cloud. The system is also useful for private vehicles to build a parking lot map by themselves for L4 autonomous valet parking (AVP).

References

1. S. Milz et al., “Visual SLAM for Automated Driving: Exploring the Applications of Deep Learning”, CVPR Workshops, 2018.

2. O. Dabeer et al., “An End-to-End System for Crowdsourced 3D Maps for Autonomous Vehicles: The Mapping Component”, arXiv 1703.10193, 2017.

3. R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: A Versatile and Accurate Monocular SLAM System”, IEEE T-RO, 2015.

4. T. Schneider et al., “maplab: An Open Framework for Research in Visual-Inertial Mapping and Localization”, IEEE Robotics and Automation Letters, 2018.

5. Z. Tao et al., “Mapping and Localization Using GPS, Lane Markings and Proprioceptive Sensors”, IEEE IROS, 2013.

6. K. He et al., “Mask R-CNN”, arXiv 1703.06870, 2017.

7. X. Pan, J. Shi, P. Luo, X. Wang, X. Tang, “Spatial As Deep: Spatial CNN for Traffic Scene Understanding”, AAAI, 2018.

8. J. Redmon, A. Farhadi, “YOLOv3: An Incremental Improvement”, arXiv 1804.02767, 2018.

9. Z. Zhu et al., “Traffic-Sign Detection and Classification in the Wild”, IEEE CVPR, 2016.

10. J. Mueller and K. Dietmayer, “Detecting Traffic Lights by Single Shot Detection”, arXiv 1805.02523, 2018.

11. O. Bailo et al., “Robust Road Marking Detection and Recognition Using Density-Based Grouping and Machine Learning Techniques”, IEEE WACV, 2017.

12. S. Lee et al., “VPGNet: Vanishing Point Guided Network for Lane and Road Marking Detection and Recognition”, arXiv 1710.06288, 2017.

13. C. Yu et al., “DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments”, IEEE IROS, 2018.
