Collaborative Perception in V2X for Autonomous Driving

Yu Huang
8 min read · Jul 15, 2020


Introduction

— — — — — — — —

V2X (vehicle-to-everything) is a vehicular technology that enables vehicles to communicate with the traffic and the environment around them, including vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication. By accumulating detailed information from other peers, limitations of the ego vehicle such as restricted sensing range, blind spots and an insufficient planning horizon may be alleviated.

V2X can be regarded as a special "sensor" for autonomous vehicles: information transferred from other vehicles or roadside devices can enhance the ego vehicle's perception capability, provided the time delay and the spatial pose difference between agents are taken into account.

In V2V perception, another vehicle acts as an additional on-board sensor data processing agent. A good example is the vehicle in front, which can perceive parts of the scene unseen by the ego vehicle and share the detected information, such as lanes, traffic signs and obstacles.

V2I, in contrast, concerns how to process sensor data captured at the roadside, for example at a road intersection. Roadside perception can share the traffic signal state, road lane information and vehicle/pedestrian status.

Below we discuss how to realize better perception with V2X.

Related Work

— — — — — — — —

Reference [1] designs a consensus-based vehicle control algorithm for the cooperative driving system (CDS), in which local traffic flow stability is guaranteed and shock waves are smoothed. It aims to develop an enhanced cooperative microscopic (car-following) traffic model considering V2V and V2I communication (V2X for short), and to investigate how vehicular communications affect cooperative driving, especially in traffic disturbance scenarios, as shown in Fig. 1.

Fig. 1 Cooperative driving with the help of V2X communications [1]

Reference [2] proposes a hardware and software architecture, called Providentia, to build a reliable Intelligent Transportation System able to create an accurate digital twin of an extended highway stretch, shown in Fig. 2.

Fig. 2 Platform architecture of the Providentia system [2]

Reference [3] aims to develop a reusable framework of cooperative perception for vehicle control on the road that can extend the perception range beyond line-of-sight and beyond field-of-view. To this end, the following problems are addressed: map merging, vehicle identification, sensor multi-modality, impact of communications, and impact on path planning, as shown in Fig. 3.

Fig. 3 Cooperative perception for vehicle control [3]

Later, the authors of [3] present another work in reference [4]: a multi-vehicle cooperative driving system architecture using cooperative perception, along with experimental validation. It first proposes a multimodal cooperative perception system that provides see-through, lifted-seat, satellite and all-around views to drivers. Using the extended-range information from the system, it then realizes cooperative driving via see-through forward collision warning, overtaking/lane-changing assistance, and automated hidden obstacle avoidance, as shown in Fig. 4.

Fig. 4 Multivehicle Cooperative Driving Using Cooperative Perception [4]

Uber presents a neural network (NN) model, called V2VNet, for autonomous driving with fusion of V2V intermediate representations [5] at the CVPR 2020 Workshop on Autonomous Driving. A screenshot of the presentation video is shown in Fig. 5.

Fig. 5 V2VNet for self driving [5]

Perception in V2X

— — — — — — — — — — —

In V2X, the vehicle On-Board Unit or Equipment (OBU or OBE) includes an antenna, a localization system, a processor, the vehicle operation system and an HMI (human-machine interface). The Roadside Unit or Equipment (RSU or RSE) consists of an antenna, a localization system, a processor, a vehicle-infrastructure interface and other interfaces.

Perception in V2X is similar to that in autonomous driving; the difference is that roadside sensors are mostly static or move in a regular way (not limited to random monitoring controlled by operators). In fact, roadside sensors can benefit from their installation position (for example, mounted higher to watch a broader view and avoid many of the occlusions seen from the ego vehicle), and they are not constrained by vehicle regulations and cost. Besides, edge computing at the roadside also provides a stronger computing platform than the ego vehicle's. Fig. 6 illustrates the perception system in V2X.

Fig. 6 Perception in V2X

Time info conveys the time difference between data from different agents; to be flexible, the data container preferably keeps a temporal window, for example 1 second (10 frames for LiDAR/radar and 30 frames for camera). Pose info is needed for spatial registration and is acquired from vehicle localization, which is mostly based on matching against information in the HD map.
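As a minimal sketch of such a data container (the class names and the 1-second window below are illustrative assumptions, not part of any specific V2X stack), each remote agent's frames can be kept in a time-indexed buffer so that the receiver can later pick or interpolate the frame closest to the ego timestamp:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque

@dataclass
class Frame:
    timestamp: float   # sender's capture time (seconds, synced clock)
    pose: tuple        # sender pose (x, y, yaw) in a common map frame
    data: object       # raw / IR / object-level payload

@dataclass
class AgentBuffer:
    """Keeps roughly a 1-second window of frames per remote agent."""
    window: float = 1.0
    frames: Deque[Frame] = field(default_factory=deque)

    def push(self, frame: Frame) -> None:
        self.frames.append(frame)
        # drop frames older than the temporal window
        while self.frames and frame.timestamp - self.frames[0].timestamp > self.window:
            self.frames.popleft()

    def nearest(self, t: float) -> Frame:
        # frame closest in time to the ego timestamp t (interpolation could refine this)
        return min(self.frames, key=lambda f: abs(f.timestamp - t))
```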

Here we assume the sensors are cameras and LiDARs. A neural network model processes the raw data to output an intermediate representation (IR), scene segmentation and object detection. To unify the fusion space, the raw data are mapped to BEV (bird's-eye view) and the processed results are kept in the same space.
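A rough illustration of the LiDAR-to-BEV mapping follows; the grid range, resolution and the simple occupancy/height encoding are assumptions for the sketch rather than a prescribed design:

```python
import numpy as np

def lidar_to_bev(points: np.ndarray,
                 x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                 resolution=0.25) -> np.ndarray:
    """Rasterize an (N, 3) LiDAR point cloud into a BEV grid.

    Channels per cell: occupancy and max height.
    """
    nx = int((x_range[1] - x_range[0]) / resolution)
    ny = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((nx, ny, 2), dtype=np.float32)

    # keep points inside the BEV window
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    ix = ((pts[:, 0] - x_range[0]) / resolution).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / resolution).astype(int)

    bev[ix, iy, 0] = 1.0                              # occupancy
    np.maximum.at(bev[:, :, 1], (ix, iy), pts[:, 2])  # max height per cell
    return bev
```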

Note: 'compression' and 'decompression' modules are required for the raw data, while 'interpolation' and 'motion compensation' modules are needed at the receiver, driven by the time-sync signal and the relative pose obtained from HD map-based localization.
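A minimal sketch of such motion compensation is given below, applied to a shared object position for simplicity (the same idea extends to raw points or IR cells); it assumes 2-D poses (x, y, yaw) in a common map frame and a constant-velocity rollout over the communication delay:

```python
import numpy as np

def compensate_detection(det_xy: np.ndarray, det_vel: np.ndarray,
                         delay: float,
                         sender_pose: tuple, ego_pose: tuple) -> np.ndarray:
    """Predict a remote detection forward by `delay` seconds and express it in the ego frame."""
    # constant-velocity prediction in the sender's local frame
    pred = det_xy + det_vel * delay

    def to_map(p, pose):
        x, y, yaw = pose
        c, s = np.cos(yaw), np.sin(yaw)
        return np.array([c * p[0] - s * p[1] + x, s * p[0] + c * p[1] + y])

    def to_local(p, pose):
        x, y, yaw = pose
        c, s = np.cos(yaw), np.sin(yaw)
        d = p - np.array([x, y])
        return np.array([c * d[0] + s * d[1], -s * d[0] + c * d[1]])

    return to_local(to_map(pred, sender_pose), ego_pose)
```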

Perception fusion in V2X needs more work at the receiver to integrate the information from other vehicles and the roadside. Fig. 7 illustrates the V2X fusion, where the IR, segmentation and detection channels are fused respectively.

To keep the scale space limited, a fixed number of IR layers is reserved, for example 3, which allows flexible fusion of data of different resolutions (for instance, 16, 32 or 64 scanning lines in a mechanical LiDAR sensor).

Fig. 7 Perception fusion in V2X

Raw data are fused at the receiver side through the 'Motion Compensation' and 'Interpolation' modules. Meanwhile, the received IR is fed to a neural network to generate object-level results as well. Then the object-level results, i.e. detections and segmentations, are fused respectively in the 'Object Fusion' module.
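For the IR channel, one simple permutation-invariant fusion rule (a sketch only; V2VNet [5], for instance, uses a learned graph neural network instead) is an element-wise max over the ego feature map and the received feature maps, after the latter have been warped into the ego BEV grid:

```python
import numpy as np

def fuse_ir(ego_ir: np.ndarray, received_irs: list, valid_masks: list) -> np.ndarray:
    """Element-wise max fusion of BEV IR maps of shape (C, H, W).

    `valid_masks` (H, W) mark which cells of each warped remote IR actually
    overlap the ego BEV window, so empty borders do not affect the max.
    """
    fused = ego_ir.copy()
    for ir, mask in zip(received_irs, valid_masks):
        candidate = np.where(mask[None, :, :], ir, -np.inf)
        fused = np.maximum(fused, candidate)
    return fused
```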

HD map-based localization is critical for V2X perception fusion. It should be a sensor fusion framework that handles each sensor's shortcomings and exploits the information from every source. Fig. 8 shows a sensor fusion framework for vehicle localization without an HD map. LiDAR and camera odometry work together with GPS/IMU/wheel encoders and feed a fusion filter (such as a Kalman filter or particle filter). LiDAR odometry usually applies point cloud matching, such as ICP/GICP, to estimate the vehicle motion. Visual odometry applies either a direct method (image-based), a feature-based method (feature extraction and matching), or a semi-direct method (partially using features, such as edges and gradients).

Fig. 8 Sensor fusion-based vehicle localization (without HD Map)
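A minimal sketch of such a fusion filter is given below: a constant-velocity Kalman filter in 2-D that predicts the state and corrects it with GPS fixes; the noise values are placeholders rather than tuned parameters, and odometry would enter as additional updates in the same way:

```python
import numpy as np

class SimpleLocalizationKF:
    """Constant-velocity Kalman filter over state [x, y, vx, vy]."""

    def __init__(self):
        self.x = np.zeros(4)                       # state estimate
        self.P = np.eye(4)                         # state covariance
        self.Q = np.diag([0.1, 0.1, 0.5, 0.5])     # process noise (placeholder)
        self.R = np.diag([2.0, 2.0])               # GPS measurement noise (placeholder)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)

    def predict(self, dt: float):
        F = np.eye(4)
        F[0, 2] = F[1, 3] = dt
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q

    def update_gps(self, z: np.ndarray):
        y = z - self.H @ self.x                     # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```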

In contrast, Fig. 9 illustrates a localization platform with an HD map, GPS and other odometry devices. HD map matching can be expected to give higher localization accuracy. A histogram/particle filter is used for LiDAR reflectivity map-based matching, and NDT (normal distributions transform) for LiDAR point cloud-based matching. For camera-equipped vehicles, landmarks such as road lanes/markings and traffic signs/lights are detected and matched with the corresponding elements in the HD map. IPM (inverse perspective mapping) converts landmark locations from the image plane to the road plane for matching with the HD map, while traffic signs and lights from the HD map are projected onto the image plane for convenient matching. PnP (perspective-n-point) is a typical method for matching 3-D map points with 2-D image feature points.

Fig. 9 Sensor fusion-based vehicle localization (with HD map)
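As an illustration of the PnP step above, here is a sketch using OpenCV's solvePnP; the landmark coordinates, camera intrinsics and the simulated camera pose are made-up placeholders:

```python
import numpy as np
import cv2

# pinhole intrinsics (placeholder values) and zero lens distortion
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
dist = np.zeros(5)

# 3-D landmark positions from the HD map (placeholder values, camera-aligned frame)
object_points = np.array([[ 2.0, -1.0, 10.0], [-2.0, -1.0, 10.0],
                          [ 3.5,  0.2, 20.0], [-1.0,  1.0, 25.0],
                          [ 0.0, -0.5, 15.0], [ 4.0,  1.5, 30.0]])

# simulate image detections by projecting with a known pose, then recover it with PnP
rvec_true = np.array([[0.0], [-0.05], [0.02]])
tvec_true = np.array([[0.3], [-1.5], [1.0]])
image_points, _ = cv2.projectPoints(object_points, rvec_true, tvec_true, K, dist)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
if ok:
    R, _ = cv2.Rodrigues(rvec)                 # camera rotation w.r.t. the landmarks
    print("recovered translation:", tvec.ravel())  # close to tvec_true
```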

Finally, in areas without an HD map, localization becomes critical. Meanwhile, understanding the driving scenario from onsite perception results can help the autonomous vehicle decide on its planning and navigation.

Tesla's work in reference [6], presented at the CVPR 2020 workshop on scalability in autonomous driving, shows that road lines and edges can be identified/predicted online in BEV. A screenshot of Andrej Karpathy's presentation is shown in Fig. 10.

Fig. 10 Tesla's work [6]

Nvidia's DRIVE Labs builds a WaitNet model in reference [7], which can identify intersections by watching traffic signs and signals, much as humans do. Its operation is shown in Fig. 11: its outputs are sent on to two other neural network models, LightNet and SignNet.

Fig. 11 Nvidia’s WaitNet, LightNet and SignNet [7].

Based on that, a neural network model is designed to take such information into account in a V2X framework, shown in Fig. 12. From the roadside's and other vehicles' perception, the ego vehicle can obtain more information about the road network and traffic rules, and integrate it with its own perception to identify more confidently the driving environment it is facing.

Fig. 12 Road network-Traffic environment understanding in V2X perception fusion

This is a kind of scenario identification to understand the local road network and traffic rules: lane merges, lane splits and ramps in/out on the highway; walkways, cross intersections, T-shaped intersections and roundabouts on urban streets; and drivable space in open areas. Along with the road information, traffic rules are identified as well, i.e. traffic lights, stop/yield signs, speed limits, turn/straight arrows, traffic cones, warnings for school areas, construction areas (even police recognition and gesture understanding), and so on. Note: motion compensation and interpolation can align the detected landmarks and road markings with the ego vehicle's results.
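One simple way to integrate such scenario evidence from the ego vehicle, the roadside and other vehicles is a confidence-weighted vote over scenario classes; the class list and the weighting rule below are assumptions for a sketch, not a specific published design:

```python
import numpy as np

SCENARIOS = ["lane_merge", "lane_split", "ramp", "walkway",
             "cross_intersection", "t_intersection", "roundabout", "open_space"]

def fuse_scenario_beliefs(ego_probs: np.ndarray,
                          remote_probs: list, remote_conf: list) -> str:
    """Confidence-weighted fusion of per-agent scenario class probabilities.

    `ego_probs` and each entry of `remote_probs` are distributions over SCENARIOS;
    `remote_conf` weights roadside/other-vehicle reports (e.g. by link quality and age).
    """
    fused = ego_probs.copy()
    for probs, w in zip(remote_probs, remote_conf):
        fused += w * probs
    fused /= fused.sum()
    return SCENARIOS[int(np.argmax(fused))]
```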

Summary

— — — — — —

We discuss perception in V2X and design a fusion network that combines raw data, IR and object-level results with the aid of the time delay and pose signals. Meanwhile, we discuss the localization framework in V2X that supports collaborative perception. Finally, a driving environment understanding platform, covering the road network and traffic rules, is designed for V2X for the case when an HD map is not available.

References

— — — — — — —

  1. D. Jia, D. Ngoduy, “Enhanced cooperative car-following traffic model with the combination of V2V and V2I communication”, Transportation Research Part B: Methodological, Volume 90, August 2016, Pages 172–191
  2. A. Krammer et al., “Providentia — A Large Scale Sensing System for the Assistance of Autonomous Vehicles”, arXiv 1906.06789, 2019
  3. S. Kim et al., “Cooperative perception for autonomous vehicle control on the road: Motivation and experimental results”, IEEE IROS, 2013
  4. S. Kim et al., “Multivehicle Cooperative Driving Using Cooperative Perception: Design and Experimental Validation”, IEEE T-ITS, 2014
  5. R. Urtasun, “V2V communications for self driving”, IEEE CVPR Workshop on Autonomous Driving, 2020. Video (YouTube): https://www.youtube.com/watch?v=oikdOpmIoc4
  6. A. Karpathy, “AI for fully self driving”, CVPR 2020 Workshop on Scalability in Autonomous Driving. Video (YouTube): https://www.youtube.com/watch?v=g2R2T631x7k&t=1259s.
  7. Nvidia DRIVE Labs, “Intersection Detection Using AI-Based Live Perception”. https://news.developer.nvidia.com/drive-labs-signnet-and-lighnet-dnns/. Demo video (YouTube): https://www.youtube.com/watch?time_continue=3&v=6MY2xiF52o8&feature=emb_logo.
