Si-ChauffeurNet: A Prediction System for Vehicle Driving Behaviors and Trajectories

Yu Huang
Dec 3, 2019


Abstract: This article adds vehicle-emitted signals, i.e. left/right turn indicators and brake lights (including the back-over light used when reversing in a parking lot), to a driving behavior prediction model for autonomous driving, alongside the usual traffic signals (traffic lights, stop signs, speed limits, yield signs, etc.). The model is called Signaled-interaction ChauffeurNet (Si-ChauffeurNet). The front/rear light signals of a vehicle can also be conveyed by a driver's or passenger's hand gesture, recognized from camera images. If V2X (V2V and V2P) infrastructure is available, communication devices can obtain those signals directly.

1. Introduction

Behavior planning in autonomous driving is quite challenging. So far there have been two main ways to model behavior. The first is offline imitation learning, or behavior cloning [1], whose shortcomings are a lack of flexibility and poor generalization to new behavior targets; recently a variant called conditional imitation learning was proposed, which uses a command input to reduce ambiguities. The second is an adaptive approach, reinforcement learning [2], which mostly works online: it collects more data to update the model and repeats this cycle for each agent. Recently, GANs (generative adversarial networks) have also been used for behavior modeling and prediction [4, 5], improving prediction performance through adversarial learning.

In crowded traffic scenes with multiple agents (vehicles and pedestrians), analyzing the behavior of the traffic participants in the surrounding environment is a powerful tool for predicting their trajectories. In this process, the interaction of the autonomous ego vehicle with the static environment (traffic signs, traffic lights, traffic cones, and traffic police gestures) and with other traffic participants (vehicles and pedestrians), as well as the participants' interactions with each other, affects the prediction and cannot be ignored. Another difficulty in planning is the limited range of the sensors, e.g. occlusion and low data quality.

Although some research works have proposed efficient methods for behavior modeling and prediction, most of them analyze the traffic participants' speed, pose, and trajectory to extract features of the traffic environment, while ignoring the signals the participants themselves emit, such as drivers' hand gestures and the front/rear lights indicating turn intention. In a parking lot or a similar environment, the back-over (reversing) light at the rear, or an equivalent gesture, is also an important signal for behavior understanding.

In [3], Waymo applied imitation learning of human driving policies from driving data. The model training added perturbations and extra loss terms in the imitation loss computation, such as collisions, driving off the road, and non-smooth trajectories, to penalize unwanted events and enhance the system's robustness. The model inputs are the road map, traffic light map, speed limit map, navigation route, ego location and pose, other vehicles' locations and poses, and the ego history trajectory; the model output is the predicted ego pose. Unfortunately, signals emitted by the other traffic participants, such as flashing left/right turn lights, brake lights, and drivers'/passengers' hand gestures, are not involved.

Li et al. proposed a GAN-based trajectory prediction method [5] in which various possible trajectories are sampled. It applies environmental image context, interaction behavior, and an attention mechanism to handle the uncertainties of trajectory prediction. A CNN encoder extracts feature maps and a latent representation, and the generator and the discriminator produce the prediction. The image sequences are bird's-eye views annotated with the traffic lights, traffic signs (stop/yield), the road map, etc. However, the traffic participants' signals emitted from front/rear lights or hand gestures are not mentioned, nor are the speed limit and traffic cones.

In this article, we propose a behavior and trajectory prediction model that takes into account the signals other participants emit from their front/rear lights, called Signaled-interaction ChauffeurNet (Si-ChauffeurNet). These interaction cues convey intention, making it easier for the autonomous ego vehicle to predict the other participants' motion trajectories. The signals can be identified from camera images or obtained through a V2X communication system.

2. Driving Behavior and Trajectory Prediction

First, the input signals come from perception and map-based localization, depicted as 2D bird's-eye-view images. Consider the traffic scene illustrated in Figure 1. It is an intersection controlled by four-way traffic lights, with two lanes in each direction (vertical and horizontal). Lanes of opposite directions are separated by a curb (yellow), and the pedestrian crosswalks are at the intersection (solid purple rectangles). The ego vehicle (green) at the bottom right is almost at the intersection and about to turn right (a small red point at its right rear). The other vehicles (blue) are either waiting to turn left, signaling a lane change (a left or right red point at the rear), braking to slow down (a red line at the rear, similar to a stop indication), or simply driving straight ahead. When a vehicle signals a lane change, its request to occupy the road generates a warning safety region (pink) on the target lane; however, vehicles already driving on that lane still have priority.

Figure 1. The traffic scene illustration
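The article does not spell out how the pink warning region is rasterized. Below is a minimal NumPy sketch, assuming the region is a simple rectangle on the target lane ahead of the signaling vehicle and the bird's-eye-view grid has a fixed metric resolution; the function name and all parameters are illustrative, not taken from the paper.

```python
import numpy as np

def render_signal_region(grid_hw, res_m, vehicle_xy, heading_rad,
                         lane_offset_m, length_m=15.0, width_m=3.5,
                         intensity=0.5):
    """Rasterize a rectangular lane-change warning region into a BEV channel.

    grid_hw       : (H, W) size of the bird's-eye-view grid in cells
    res_m         : meters per cell
    vehicle_xy    : (x, y) of the signaling vehicle in meters (grid frame)
    heading_rad   : vehicle heading angle
    lane_offset_m : lateral offset of the target lane (+ left, - right)
    """
    H, W = grid_hw
    channel = np.zeros((H, W), dtype=np.float32)

    # Sample points inside the warning rectangle in the vehicle frame,
    # shifted laterally onto the target lane, then transform to the grid.
    xs = np.linspace(0.0, length_m, 60)                       # ahead of the vehicle
    ys = np.linspace(-width_m / 2, width_m / 2, 20) + lane_offset_m
    lx, ly = np.meshgrid(xs, ys)

    c, s = np.cos(heading_rad), np.sin(heading_rad)
    gx = vehicle_xy[0] + c * lx - s * ly
    gy = vehicle_xy[1] + s * lx + c * ly

    # Convert metric coordinates to integer cell indices and mark them.
    i = np.clip((gy / res_m).astype(int), 0, H - 1)
    j = np.clip((gx / res_m).astype(int), 0, W - 1)
    channel[i, j] = intensity
    return channel

# Example: a vehicle at (20 m, 40 m) heading "up", signaling a left lane change.
signal_map = render_signal_region((400, 400), 0.2, (20.0, 40.0),
                                  np.pi / 2, lane_offset_m=3.5)
```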

Now we can lay out the input signals shown in Figure 2. The newly added signal is the vehicle signal map, a gray-level intensity map of the pink region in Figure 1; the other signals are the road map, traffic lights, speed limits, the navigation route, the ego vehicle location, obstacle locations, and the ego vehicle pose history.

Figure 2. The system input signals
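As a minimal sketch of how these rendered maps could be assembled for the network, assume each of the (roughly eight) inputs in Figure 2 is already rasterized to the same bird's-eye-view grid; the channel names and ordering below are assumptions.

```python
import numpy as np

# Assumed channel order for the rendered bird's-eye-view inputs (H x W each).
INPUT_CHANNELS = [
    "road_map", "traffic_lights", "speed_limit", "route",
    "ego_box", "obstacles", "vehicle_signal", "ego_pose_history",
]

def stack_inputs(rendered: dict, hw=(400, 400)) -> np.ndarray:
    """Stack the rendered maps into a (C, H, W) float array; missing maps are zero."""
    H, W = hw
    planes = []
    for name in INPUT_CHANNELS:
        plane = rendered.get(name, np.zeros((H, W), dtype=np.float32))
        assert plane.shape == (H, W), f"{name} has shape {plane.shape}"
        planes.append(plane.astype(np.float32))
    return np.stack(planes, axis=0)   # shape (8, H, W)

# Usage: the vehicle-signal channel is the grayscale warning map rendered above.
x = stack_inputs({"vehicle_signal": np.zeros((400, 400), np.float32)})
```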

The output signal is the future ego vehicle trajectory, shown in Figure 3.

Figure 3. The system output signal

Figure 4 shows the prediction system diagram. In this deep learning model, the "Encoder" is a CNN that produces an intermediate representation of feature maps. The "Behavior LSTM" predicts the ego vehicle's direction, speed, waypoints, and location heatmap, where the LSTM (Long Short-Term Memory) [6] is a variant of the RNN (recurrent neural network) that captures temporal characteristics. The "Road Decoder" outputs a segmentation map of the drivable area. The "Scene LSTM" predicts the other traffic participants' location heatmaps, again capturing their temporal properties. Finally, the "Fully Connected Layers" output the rendered future ego vehicle pose trajectory.

Figure 4. The deep learning-based behavior prediction system diagram
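The article gives no layer dimensions, so the following PyTorch sketch only mirrors the block structure of Figure 4 (CNN encoder, behavior LSTM, road decoder, scene LSTM, fully connected output); every size, kernel, and the way the encoder features are fed to the LSTMs is an assumption.

```python
import torch
import torch.nn as nn

class SiChauffeurNetSketch(nn.Module):
    """Structural sketch of Figure 4: encoder -> {behavior LSTM, road decoder,
    scene LSTM} -> fully connected pose output. Sizes are illustrative only."""

    def __init__(self, in_ch=8, feat=64, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.Sequential(          # CNN encoder for feature maps
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.behavior_lstm = nn.LSTM(feat * 8 * 8, 128, batch_first=True)
        self.scene_lstm = nn.LSTM(feat * 8 * 8, 128, batch_first=True)
        self.road_decoder = nn.Sequential(     # drivable-area segmentation head
            nn.ConvTranspose2d(feat, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.pose_head = nn.Linear(128, 3)     # (x, y, heading) per future step

    def forward(self, x):
        f = self.encoder(x)                               # (B, feat, 8, 8)
        seq = f.flatten(1).unsqueeze(1).repeat(1, self.horizon, 1)
        ego_h, _ = self.behavior_lstm(seq)                # ego behavior features
        scene_h, _ = self.scene_lstm(seq)                 # other agents' features
        return {
            "ego_poses": self.pose_head(ego_h),           # (B, horizon, 3)
            "drivable": self.road_decoder(f),             # (B, 1, 32, 32)
            "scene_heat": scene_h,                        # placeholder scene output
        }

model = SiChauffeurNetSketch()
out = model(torch.zeros(2, 8, 400, 400))
```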

In model training, the loss function includes the imitation losses (over the rendered bird's-eye-view images shown in Figure 2), an ego vehicle collision term (with other vehicles), an ego vehicle on-road term (drivable region), an ego vehicle geometric loss (waypoints), an other-vehicle collision term (predicted location heatmaps), and an other-vehicle drivable-region term.

The imitation losses are the same as in ChauffeurNet [3]. Let's define the other terms below; a plausible LaTeX rendering of these terms, reconstructed from the text, is sketched after the definitions.

Assume the ego vehicle's and the other vehicles' predicted location heatmaps are B and Obj respectively, the vehicle traffic signal is S (the region produced by the vehicle signal), and the other vehicles' true locations are Obj_GT; then the ego vehicle collision term is defined as:

with λ as the weight for the vehicle traffic signal, 0 < λ < 1 (λ = 0.3 is suggested).

Assume the predicted drivable region and its true region are R and R_GT respectively; then the ego vehicle on-road term is defined as:

Besides, the ego vehicle geometric loss comes from its predicted trajectory region; assuming the true region (a binary map) is G_GT, the ego vehicle geometric term is:

The other vehicles' collision term is:

where H(·) is the cross-entropy function.

Finally, the other vehicles' drivable-region term is defined as:
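Based only on the quantities named above (B, Obj, Obj_GT, S, R, R_GT, G_GT, H, and λ), one plausible LaTeX reading of the five terms is sketched below; the exact equations in the original post may differ.

```latex
% Assumed reconstruction of the loss terms described in the text.
% B, Obj: predicted ego / other-vehicle location heatmaps (per pixel),
% S: vehicle-signal warning region, 0 < \lambda < 1 (e.g. \lambda = 0.3).
\begin{align}
  \mathcal{L}_{\text{ego-coll}}  &= \sum_{x,y} B(x,y)\,\bigl[\,Obj_{GT}(x,y) + \lambda\, S(x,y)\,\bigr] \\
  \mathcal{L}_{\text{on-road}}   &= \sum_{x,y} B(x,y)\,\bigl[1 - R_{GT}(x,y)\bigr] + H\bigl(R,\, R_{GT}\bigr) \\
  \mathcal{L}_{\text{geometry}}  &= H\bigl(B,\, G_{GT}\bigr) \\
  \mathcal{L}_{\text{obj-coll}}  &= H\bigl(Obj,\, Obj_{GT}\bigr) \\
  \mathcal{L}_{\text{obj-road}}  &= \sum_{x,y} Obj(x,y)\,\bigl[1 - R_{GT}(x,y)\bigr]
\end{align}
```

In this reading, λ down-weights the signaled warning region S relative to actually observed vehicles, so the ego vehicle is discouraged, but not strictly forbidden, from entering a lane another vehicle is signaling into, consistent with the priority rule described for Figure 1.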

To further improve the system's learning power for behavior prediction, we propose an additional GAN-based system, shown in Figure 5.

Figure 5. The GAN-based behavior system diagram

A GAN needs a generator (G) to capture the data distribution and a discriminator (D) to estimate whether a sample comes from the training data or from the generator. The original GAN's generator [7] takes a noise input (e.g. from a Gaussian distribution) for data generation, while a conditional GAN [8] feeds an additional conditioning signal to both G and D simultaneously. In Figure 5, the eight rendered maps enter the encoder, whose output feature maps are fed into G. The three modules in G are the same as those in Figure 4, and G's output is passed to D to decide whether it is real or fake. In D, the "Classifier LSTM" is an LSTM-based temporal sequence classification model, and the "Fully Connected Layer" outputs the decision.
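As a minimal PyTorch sketch of this conditional adversarial setup, assume a generator that maps encoder features to a future trajectory and a discriminator that scores a (features, trajectory) pair; the module shapes, the BCE objective, and the optimizer settings are assumptions, not the article's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: encoder features (B, F), trajectories (B, T, 3).
F, T = 4096, 10

generator = nn.Sequential(nn.Linear(F, 256), nn.ReLU(), nn.Linear(256, T * 3))
discriminator = nn.Sequential(nn.Linear(F + T * 3, 256), nn.ReLU(),
                              nn.Linear(256, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(feats, real_traj):
    """One conditional-GAN step: D always sees a (features, trajectory) pair."""
    B = feats.shape[0]
    fake_traj = generator(feats).view(B, T, 3)

    # --- Discriminator: real pairs -> 1, generated pairs -> 0 ---
    d_opt.zero_grad()
    d_real = discriminator(torch.cat([feats, real_traj.flatten(1)], dim=1))
    d_fake = discriminator(torch.cat([feats, fake_traj.detach().flatten(1)], dim=1))
    d_loss = bce(d_real, torch.ones(B, 1)) + bce(d_fake, torch.zeros(B, 1))
    d_loss.backward()
    d_opt.step()

    # --- Generator: fool D (plus the imitation/auxiliary losses of Section 2) ---
    g_opt.zero_grad()
    g_score = discriminator(torch.cat([feats, fake_traj.flatten(1)], dim=1))
    g_loss = bce(g_score, torch.ones(B, 1))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()

# Usage with random stand-in data:
train_step(torch.randn(4, F), torch.randn(4, T, 3))
```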

To detect and recognize the vehicle light signals shown in Figure 6 (different light configurations), we refer to a method for vehicle tail light processing based on a CNN-LSTM model [9].

Figure 6. Vehicle front/rear light appearance examples

By applying this model to side cameras looking backward, we can detect and recognize front light signals (except the braking and back-over signals) as well. To build a workable system, it has to be combined with a vehicle detector/tracker [10], as shown in Figure 7.

Figure 7. Vehicle traffic signal detection system
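Below is a minimal sketch of the CNN-LSTM pattern used in [9], assuming the detector/tracker already provides a sequence of image crops for each tracked vehicle; the class set, backbone, and sizes are illustrative assumptions rather than the exact architecture of [9].

```python
import torch
import torch.nn as nn

class LightStateClassifier(nn.Module):
    """CNN per frame + LSTM over the track: a sketch of the CNN-LSTM pattern
    used for turn-signal / brake-light recognition (sizes are illustrative)."""

    def __init__(self, n_classes=4):           # e.g. none / left / right / brake
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # (B*T, 32, 1, 1)
        )
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, crops):                   # crops: (B, T, 3, H, W)
        B, T = crops.shape[:2]
        f = self.cnn(crops.flatten(0, 1)).flatten(1).view(B, T, 32)
        h, _ = self.lstm(f)
        return self.head(h[:, -1])              # classify the whole sequence

# Usage: 8-frame crops of a tracked vehicle's rear, resized to 64x64.
logits = LightStateClassifier()(torch.zeros(2, 8, 3, 64, 64))
```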

Similarly, a vehicle driver or passenger, a motorist, or a cyclist can emit a traffic signal with a hand gesture, as shown in Figure 8. Note that right-hand-side driving differs from left-hand-side driving (as in Japan, the UK, and Hong Kong).

Figure 8. Vehicle Driver/Passenger/Motorist/Cyclist’s hand gesture signal

Google holds a patent [11] on detecting and recognizing cyclists' hand gesture signals. We can actually combine it with police hand gesture signals (examples of Chinese police hand gestures are shown in Figure 9), link them with a typical obstacle detection module, and build a deep learning-based traffic hand gesture understanding system, shown in Figure 10.

Figure 9. Chinese police hand gesture examples

As shown in Figure 10, the system contains two different branches. After obstacle detection, tracking, and classification, the frontal camera branch discriminates among police, pedestrians, vehicles, cyclists, and motorists, while the side rear-view camera branch only needs to classify vehicles and cyclists/motorists. The pose estimation model follows OpenPose [12], and the hand gesture signal can be recognized by a CNN-LSTM model.

Figure 10. A traffic hand gesture signal understanding system
Figure 11. Pose Estimation by OpenPose[12]
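Below is a minimal sketch of the final stage of Figure 10, assuming the pose estimator (e.g. OpenPose [12]) already returns per-frame 2D body keypoints for each detected person; a small LSTM over normalized keypoint sequences then classifies the gesture. The keypoint count, gesture classes, and normalization are assumptions.

```python
import torch
import torch.nn as nn

N_KEYPOINTS = 18            # OpenPose-style body keypoints (assumed)
GESTURES = ["none", "stop", "go_straight", "turn_left", "turn_right"]

class GestureClassifier(nn.Module):
    """LSTM over per-frame 2D keypoints -> gesture class (illustrative sizes)."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_KEYPOINTS * 2, 64, batch_first=True)
        self.head = nn.Linear(64, len(GESTURES))

    def forward(self, keypoints):               # (B, T, N_KEYPOINTS, 2), image coords
        # Normalize each frame so the classifier is roughly scale/translation invariant.
        kp = keypoints - keypoints.mean(dim=2, keepdim=True)
        kp = kp / (kp.abs().amax(dim=(2, 3), keepdim=True) + 1e-6)
        h, _ = self.lstm(kp.flatten(2))          # (B, T, 36) -> (B, T, 64)
        return self.head(h[:, -1])

# Usage: a 16-frame keypoint track from the pose estimator.
logits = GestureClassifier()(torch.rand(2, 16, N_KEYPOINTS, 2))
```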

3. Summary

This article proposes a new framework for driving behavior prediction that takes into account the vehicle traffic signals emitted from front/rear lights, driver/passenger hand gestures, or V2X communication devices, called Signaled-interaction ChauffeurNet (Si-ChauffeurNet). It improves the awareness of vehicle interactions for better driving behavior prediction. We propose a deep learning-based model to learn from driving data, and a GAN-based model to enhance its learning power.

References

1. L. Sun et al., "A Fast Integrated Planning and Control Framework for Autonomous Driving via Imitation Learning", arXiv:1707.02515, 2017.

2. M. Moghadam, G. Elkaim, "A Hierarchical Architecture for Sequential Decision-Making in Autonomous Driving using Deep Reinforcement Learning", ICML, 2019.

3. M. Bansal et al., "ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst", arXiv:1812.03079, 2018.

4. J. Li et al., "Interaction-aware Multi-agent Tracking and Probabilistic Behavior Prediction via Adversarial Learning", IEEE ICRA, 2019.

5. J. Li, H. Ma, M. Tomizuka, "Conditional Generative Neural System for Probabilistic Trajectory Prediction", arXiv:1905.01631, 2019.

6. I. Goodfellow et al., "Deep Learning", MIT Press, 2016.

7. I. Goodfellow et al., "Generative Adversarial Nets", NIPS, 2014.

8. M. Mirza, S. Osindero, "Conditional Generative Adversarial Nets", arXiv:1411.1784, 2014.

9. D. Frossard, E. Kee, R. Urtasun, "DeepSignals: Predicting Intent of Drivers Through Visual Signals", IEEE ICRA, 2019.

10. W. Luo, B. Yang, R. Urtasun, "Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net", IEEE CVPR, 2018.

11. H. Kretzschmar, J. Zhu, "Cyclist hand signal detection by an autonomous vehicle", Google patent US 9,014,905 B1, April 2015.

12. Z. Cao et al., "OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields", arXiv:1812.08008, 2018.

Appendix: Model training in ChauffeurNet [3]

ChauffeurNet driving model training [3]
