Autonomous Driving with Large Scale Foundation Models

Yu Huang
32 min read · Nov 6, 2023

Introduction

Let’s first review some background knowledge to better comprehend the problems discussed here.

Note: There are two survey papers [53, 55] on large language model-based autonomous driving (as well as intelligent transportation in [53]). However, we try to investigate this area from a new point of view and over a broader domain.

1 Large Scale Language Models

Large Scale Language Models (LLMs), exemplified by models like GPT-4, LLaMA, and PaLM-E, have demonstrated remarkable capabilities in reasoning and common sense across various domains. These models leverage Reinforcement Learning from Human Feedback (RLHF) to fine-tune their behavior based on user intent, showing significant advancements in aligning AI behavior with human expectations.

RLHF in GPT-3.5

Large language models (LLMs) are a category of Transformer-based language models characterized by an enormous number of parameters, typically in the hundreds of billions or more. These models are trained on massive text datasets, enabling them to understand natural language and perform a wide range of complex tasks, primarily through text generation and comprehension.

fine-tuning in GPT-1

The emergent abilities of LLMs are one of the most significant characteristics that distinguish them from smaller language models. Specifically, in-context learning (ICL), instruction following and reasoning with chain-of-thought (CoT) are three typical emergent abilities for LLMs.

In-Context Learning in GPT-3
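To make ICL and CoT concrete, here is a minimal, illustrative pair of prompts. No model is actually called; the strings only show the two prompting styles (the translation demonstrations are the classic GPT-3-style examples, not tied to any paper in this survey).

```python
# Illustrative prompts only: a few-shot in-context-learning prompt and a
# chain-of-thought variant. The model would complete the text after the last line.
icl_prompt = (
    "Translate English to French:\n"
    "sea otter -> loutre de mer\n"
    "cheese -> fromage\n"
    "car -> "                      # the model infers the task from the examples alone
)

cot_prompt = (
    "Q: A car travels 20 m/s for 6 seconds. How far does it go?\n"
    "A: Let's think step by step. Distance = speed x time = 20 x 6 = 120 m. Answer: 120 m.\n"
    "Q: A truck travels 15 m/s for 4 seconds. How far does it go?\n"
    "A: Let's think step by step."   # the model is nudged to reason before answering
)
```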

Remarkable abilities of LLMs demonstrate early signs of Artificial General Intelligence (AGI), exhibiting capabilities such as out-of-distribution (OOD) reasoning, common sense understanding, knowledge retrieval, and the ability to naturally communicate these aspects with humans.

The success of LLMs is undoubtedly exciting as it demonstrates the extent to which machines can learn human knowledge. In advanced tasks with LLMs, the translation of natural language input into actionable results is crucial. One prominent task is language-to-actions mapping, which has seen early approaches leveraging frameworks like temporal logic and motion primitive learning, evolving towards more recent end-to-end models for instruction-following in navigation and manipulation tasks, employing latent embeddings of language commands. Another critical dimension is language-to-code generation, extensively explored in contexts ranging from coding competitions to instruction-following tasks. Moreover, the translation of natural language instructions into rewards has found applications in robotic domains, often requiring domain-specific reward models.

2 Multi-modal Language Model, World Model and Embodied AI

Vision-Language Models (VLMs) bridge the capabilities of Natural Language Processing (NLP) and Computer Vision (CV), breaking down the boundaries between text and visual information to connect multimodal data. With the rise of LLMs, there is also an increasing focus on exploring how to effectively incorporate visual modules into LLMs to perform multimodal tasks.

Visual language model: CLIP (Contrastive Language-Image Pre-Training)

Motivated by the potential of LLMs, numerous Multimodal LLMs (MLLMs), e.g., PaLM-E, LLaVA, MiniGPT-4, Video-LLaMA and InstructBLIP, have been proposed to expand LLMs to the multimodal field, i.e., perceiving image/video inputs and conversing with users over multiple rounds.

“PaLM-E: An Embodied Multimodal Language Model”

World models explicitly represent the knowledge of an autonomous agent about its environment. A world model is defined as a generative model that predicts the next observation in an environment given past observations and the current action. The main use cases are pure representation learning, planning (look-ahead search), and learning a policy inside the world model (as a neural simulator).

“Structured World Models from Human Videos”
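In the notation commonly used for such models (the symbols below are generic, not taken from a specific paper), the definition above is a conditional generative model over the next observation; planning then amounts to rolling the model forward under candidate action sequences and scoring the imagined futures.

```latex
% o_t: observation at time t, a_t: action taken at time t, \theta: model parameters
o_{t+1} \sim p_\theta\!\left(o_{t+1} \mid o_{1:t},\, a_t\right)
```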

Recent works have developed more efficient reinforcement learning agents for robotics and embodied AI. The focus is on enhancing agents’ abilities for planning, reasoning, and collaboration in embodied environments. Some approaches combine complementary strengths into unified systems for embodied reasoning and task planning. High-level commands enable improved planning while low-level controllers translate commands into actions. Dialogue for information gathering can accelerate training. Some agents can work for embodied decision-making and exploration guided by internal world models.

“RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”

3 Autonomous Driving

Autonomous driving targets the development of vehicles that can navigate and control themselves without human intervention, reducing incidents and improving traffic efficiency. The driving automation levels defined by the Society of Automotive Engineers (SAE) range from Level 0 (No Automation) to Level 5 (Full Automation). As autonomy increases, human intervention decreases, while the requirements for the vehicle to understand its surroundings increase. Currently, most commercial vehicles are at Level 2 or 3, providing partial automation but still requiring driver supervision.

SAE’s driving automation levels

Existing autonomous driving solutions can be broadly categorized into the classic modular paradigm (i.e. perception-prediction-planning-control) and the end-to-end approach. However, these schemes all face serious challenges such as interpretability, generalization, causal confusion, robustness, etc. Researchers have attempted to address these issues using various methods, but constructing a safe, stable, and interpretable AD system remains an open topic.

Modular
E2E

Perception collects information from sensors and discovers relevant knowledge from the environment. It develops a contextual understanding of the driving environment, such as detection, tracking and segmentation of obstacles, road signs/markings and free-space drivable areas. Based on the sensors implemented, the environment perception task can be tackled by using LiDARs, cameras, radars or a fusion of these three kinds of devices.

As the center of autonomous driving systems, the decision-making module is a crucial component due to the significance of reacting intelligently to a constantly changing environment. The output of the decision module can be low-level motion control instructions, such as throttle, speed and acceleration, or high-level instructions, such as motion primitives and future trajectories, which can be adopted by the subsequent planning module.

Motion planning is a core challenge in autonomous driving, aiming to plan a driving trajectory that is safe and comfortable. The intricacies of motion planning arise from its need to accommodate diverse driving scenarios and make reasonable driving decisions.

Existing motion planning approaches generally fall into two categories. Rule-based methods design explicit rules to determine driving trajectories. These methods have clear interpretability but generally fail to handle extreme driving scenarios that are not covered by the rules. Alternatively, learning-based approaches resort to a data-driven strategy and learn their models from large-scale human driving trajectories.

While exhibiting good performance, these approaches sacrifice interpretability by viewing motion planning as a black-box forecasting problem. Essentially, both prevailing rule-based and learning-based approaches are devoid of the common-sense reasoning ability innate to human drivers, which restricts their capabilities in tackling long-tailed driving scenarios.

4 Diffusion model and Neural Radiance Field

Besides, we append some background knowledge, i.e., the generative AI-generated content (AIGC) method, the Diffusion Model (DM), and the image-based 3D synthesis method, the Neural Radiance Field (NeRF).

The diffusion model aims to generate images from Gaussian noise via an iterative denoising process. Its formulation is built on a strict physical analogy, comprising a diffusion process and a reverse process. In the diffusion process, an image is converted to a Gaussian distribution by iteratively adding random Gaussian noise. The reverse process recovers the image from this distribution through several denoising steps.

“Diffusion Models: A Comprehensive Survey of Methods and Applications”
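For reference, the two processes described above take the standard DDPM form, with noise schedule β_t; this is the generic formulation, not specific to any paper cited here.

```latex
% Forward (noising) step: gradually convert the clean image x_0 into Gaussian noise x_T
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
% Reverse (denoising) step: a learned Gaussian that recovers the image step by step
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```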

Diffusion models represent a family of probabilistic generative models that progressively introduce noise to data and subsequently learn to reverse this process for the purpose of generating samples. These models have recently garnered significant attention due to their exceptional performance in various applications, setting new benchmarks in image synthesis, video generation, and 3D content generation. The fundamental essence of diffusion-based generative models lies in their capacity to comprehend and understand the intricacies of the world.

NeRF is a technique that trains AI algorithms to generate 3D objects from 2D images. NeRF can render novel 3D reconstructed views of complex scenes utilizing an interpolation approach between scenes. However, instead of directly recovering the entire 3D scene geometry, NeRF computes a “radiance field,” a volumetric representation that generates color and density for each point in the concerned 3D space. NeRF is appealing for two unique features: self-supervised and photo-realistic.

NeRF

NeRF uses a deep neural network to represent complex scenes in a fully-connected, non-convolutional way. The input to the network is a continuous 5D coordinate, while the output is the volume density and view-dependent emitted radiance at that particular spatial location. The method uses classic volume rendering techniques to synthesize views and is optimized using a set of images with known camera poses.
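Concretely, the classic volume rendering step mentioned above integrates the predicted density σ and view-dependent color c along each camera ray r(t) = o + t·d (standard NeRF notation):

```latex
% C(r): rendered pixel color along ray r; T(t): accumulated transmittance up to depth t
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,\mathbf{c}\big(\mathbf{r}(t), \mathbf{d}\big)\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t}\sigma\big(\mathbf{r}(s)\big)\,ds\right)
```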

Foundation Models for Autonomous Driving

Foundation models have taken shape most strongly in NLP. On a technical level, foundation models are enabled by transfer learning and scale. The idea of transfer learning is to take the “knowledge” learned from one task and apply it to another task. Foundation models usually follow such a paradigm that a model is pre-trained on a surrogate task and then adapted to the downstream task of interest via fine-tuning.

Foundation Model
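As a minimal sketch of this pre-train-then-adapt paradigm (stand-in modules and random data only; a real system would load a large pre-trained backbone and a real downstream dataset):

```python
import torch
import torch.nn as nn

# Pretend this encoder was pre-trained on a surrogate task; here it is a stand-in.
encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False                                # freeze the transferred representation

head = nn.Linear(256, 10)                                  # small task-specific head
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))   # dummy downstream batch
for _ in range(5):                                         # brief fine-tuning loop
    loss = nn.functional.cross_entropy(head(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```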

Below we categorize the foundation models’ applications in autonomous driving based on their grounding levels: simulation (data synthesis), world model (learn-then-predict), perception annotation (auto-labeling), and decision making or driving actions (E2E). In the simulation area, we split it further into two directions: sensor data synthesis and traffic flow generation. In decision making or driving actions, the approaches are classified into three groups: integration of LLMs, GPT-like tokenization, and pre-trained foundation models.

We include methods with diffusion models and NeRFs applied in autonomous driving. Though they may not apply LLMs or foundation models yet, their natural potential and foreseeable combinations make us believe this will come true in the future, with the techniques from Dream Fields[1], DreamFusion[2], Latent-NeRF[3], Magic3D[4], NeRDi[5], Text2NeRF[6], DALL-E 3[7], NExT-GPT[8] and EasyGen[9], etc.

“Zero-shot text-guided object generation with dream fields”
“DreamFusion: Text-to-3D Using 2D Diffusion”
“Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures”
“Magic3D: High-Resolution Text-to-3D Content Creation”
“NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors”
“Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields”

Simulation and World Model in Autonomous Driving

We group simulation and the world model together since a world model can be regarded as a neural simulator. Simulation is a kind of AIGC (AI-generated content) method, but it focuses on the static and dynamic components of the driving environment. World models understand the dynamics and then predict the future.

1 Sensor Data Synthesis

Scene-Diffusion[19] is a learned method of traffic scene generation designed to simulate the output of the perception system of a self-driving car. Inspired by latent diffusion, a combination of diffusion and object detection is used to directly create realistic and physically plausible arrangements of discrete bounding boxes for agents.

READ (Autonomous Driving scene Render)[16] is a large-scale neural rendering method to synthesize the autonomous driving scene. In order to represent driving scenarios, an ω-net rendering network is proposed to learn neural descriptors from sparse point clouds. This model can not only synthesize realistic driving scenes but also stitch and edit driving scenes.

READ: Large-Scale Neural Scene Rendering for Autonomous Driving

MARS (ModulAr and Realistic Simulator)[25] is a modular framework for photorealistic autonomous driving simulation based on NeRFs. This open-sourced framework consists of a background node and multiple foreground nodes, enabling the modeling of complex dynamic scenes.

MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving

UniSim[27] is a neural sensor simulator that takes a single recorded log captured by a sensor-equipped vehicle and converts it into a realistic closed-loop multi-sensor simulation. UniSim builds neural feature grids to reconstruct both the static background and dynamic actors in the scene, and composites them together to simulate LiDAR and camera data at new viewpoints, with actors added or removed and at new placements. To better handle extrapolated views, it incorporates learnable priors for dynamic objects, and leverages a convolutional network called hypernet, to complete unseen regions.

UniSim: A Neural Closed-Loop Sensor Simulator

Adv3D [30] is proposed for modeling adversarial examples as NeRFs. It is trained by minimizing the surrounding objects’ confidence predicted by 3D detectors on the training set. To generate physically realizable adversarial examples initialized from Lift3D[160], primitive-aware sampling and semantic-guided regularization are proposed, enabling 3D patch attacks with camouflage adversarial textures.

Adv3D: Generating 3D Adversarial Examples in Driving Scenarios with NeRF

DriveSceneGen [39] is a data-driven driving scenario generation method that learns from real-world driving datasets and generates entire dynamic driving scenarios from scratch. The pipeline consists of two stages: a generation stage and a simulation stage. In the generation stage, a diffusion model is employed to generate a rasterized Bird’s-Eye-View (BEV) representation of the initial scene of the driving scenario, which is then decoded by a rule-based vectorization method. In the simulation stage, the vectorized representation of the scenario is consumed by a simulation network based on the Motion TRansformer (MTR) framework as the initial scene to predict multi-modal joint distributions of the generated agents’ future trajectories.

DriveSceneGen: Generating Diverse and Realistic Driving Scenarios from Scratch

MagicDrive[41] is a street view generation framework offering diverse 3D geometry controls, including camera poses, road maps, and 3D bounding boxes, together with textual descriptions, achieved through tailored encoding strategies. The power of pre-trained Stable Diffusion is harnessed and further fine-tuned for street view generation with road map information via ControlNet. Besides, this design incorporates a cross-view attention module, ensuring consistency across multiple camera views.

MagicDrive

DrivingDiffusion [47] is a spatial-temporal consistent diffusion framework to generate realistic multi-view videos controlled by 3D layout. It is based on the widely used image synthesis diffusion model, where the 3D layout is utilized as additional control information (this is also a drawback). Based on CLIP, a local prompt, which guides the relationship between the whole image and local instances, and a global prompt are combined. Unfortunately, it is not yet an E2E simulation usable for autonomous driving.

DrivingDiffusion

2 Traffic Flow Synthesis

Realistic Interactive TrAffic flow (RITA) [12] is an integrated component of existing driving simulators to provide high-quality traffic flow. RITA consists of two modules, RITABackend and RITAKit. RITABackend is built to support vehicle-wise control and provide diffusion-based traffic generation models trained from real-world datasets, while RITAKit is developed with easy-to-use interfaces for controllable traffic generation via RITABackend.

Realistic Interactive TrAffic flow (RITA)

CTG (controllable traffic generation) [21] is a conditional diffusion model that allows users to control desired properties of trajectories at test time (e.g., reaching a goal or following a speed limit) while maintaining realism and physical feasibility through enforced dynamics. The key technical idea is to leverage diffusion modeling and differentiable logic to guide generated trajectories to meet rules defined using signal temporal logic (STL). The guidance can be extended to multi-agent settings to enable interaction-based rules like collision avoidance.

CTG
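The core mechanism, guiding each denoising step with the gradient of a differentiable rule cost, can be sketched as follows. The denoiser, the trajectory layout, and the speed-limit rule below are placeholders, not CTG's actual networks or STL formulas; roughly, the paper derives the cost from STL robustness values and folds the guided estimate back into the reverse-diffusion update.

```python
import torch

def speed_limit_cost(traj, v_max=10.0):        # traj: (T, 4) columns x, y, heading, speed
    # Differentiable penalty for exceeding a speed limit (toy stand-in for an STL rule).
    return torch.relu(traj[:, 3] - v_max).mean()

def guided_step(denoiser, traj_t, t, guide_scale=1.0):
    traj_0 = denoiser(traj_t, t)               # model's estimate of the clean trajectory
    traj_0 = traj_0.detach().requires_grad_(True)
    cost = speed_limit_cost(traj_0)
    grad, = torch.autograd.grad(cost, traj_0)
    return traj_0 - guide_scale * grad         # guided estimate fed into the next step

# Toy usage with a dummy denoiser:
dummy_denoiser = lambda x, t: x * 0.9
traj = torch.randn(20, 4)
traj = guided_step(dummy_denoiser, traj, t=50)
```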

CTG++[22] is a scene-level conditional diffusion model guided by language instructions. A scene-level diffusion model equipped with a spatio-temporal transformer backbone is designed, which generates realistic and controllable traffic. Then a large language model (LLM) is harnessed to convert a user’s query into a loss function, guiding the diffusion model towards query-compliant generation.

CTG++
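A hypothetical illustration of the query-to-loss idea: the LLM emits code for a differentiable cost, which then guides the diffusion sampler the same way as the hand-written rules above. The emitted snippet and the `query_loss` name are invented for illustration; real use would require sandboxing and validation of generated code.

```python
import torch

# Pretend an LLM turned the query "vehicle A should stay below 8 m/s" into this code.
llm_emitted_code = """
def query_loss(traj):               # traj: (T, 4) with columns x, y, heading, speed
    import torch
    return torch.relu(traj[:, 3] - 8.0).mean()
"""
namespace = {}
exec(llm_emitted_code, namespace)    # in practice this needs sandboxing / validation
loss = namespace["query_loss"](torch.randn(20, 4))
```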

SurrealDriver [32] is a generative ‘driver agent’ simulation framework based on LLMs, capable of generating human-like driving behaviors: understanding situations, reasoning, and taking actions. Interviews with 24 drivers are conducted to get their detailed descriptions of driving behavior as CoT prompts to develop a ‘coach agent’ module, which can evaluate and assist ‘driver agents’ in accumulating driving experience and developing humanlike driving styles.

3 World Model

World models hold great promise for generating diverse and realistic driving videos, encompassing even long-tail scenarios, which can be utilized to train foundation models in autonomous driving. Furthermore, the predictive capabilities in world models facilitate end-to-end driving, ushering in a new era of seamless and comprehensive autonomous driving experiences.

Anomaly detection is an important issue for the data closed loop of autonomous driving, since it determines the efficiency of selecting newly collected valuable data for model upgrades. An overview of how world models can be leveraged to perform anomaly detection in the domain of autonomous driving is given in [28].

TrafficBots[17] is a multi-agent policy built upon motion prediction and end-to-end driving. Based on that, a world model is obtained and tailored for the planning module of autonomous vehicles. To generate configurable behaviors, for each agent both a destination as navigational information and a time-invariant latent personality to specify the behavioral style are introduced. To improve the scalability, a scheme of positional encoding for angles is designed, allowing all agents to share the same vectorized context based on dot-product attention. As a result, all traffic participants in dense urban scenarios are simulated.

World model for planning
TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction

UniWorld[29], a spatial-temporal world model, is able to perceive its surroundings and predict the future behavior of other participants. UniWorld involves initially predicting 4D geometric occupancy as the world model in a foundational pre-training stage and subsequently fine-tuning on downstream tasks. UniWorld can estimate missing information concerning the world state and predict plausible future states of the world. Besides, UniWorld’s pre-training process is label-free, enabling the utilization of massive amounts of image-LiDAR pairs to build a foundation model.

UniWorld

GAIA-1 (‘Generative AI for Autonomy’) [40], proposed by Wayve (a UK startup), is a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. It casts world modeling as an unsupervised sequence modeling problem by mapping the inputs to discrete tokens and predicting the next token in the sequence. GAIA-1 provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology.

GAIA-1 (‘Generative AI for Autonomy’)
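A toy version of this "world modeling as next-token prediction" recipe, a causal transformer trained with cross-entropy over a shared discrete vocabulary of image/text/action tokens, might look like the following. Sizes and the tokenization itself are placeholders, not GAIA-1's architecture.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    # Autoregressive next-token predictor over a shared vocabulary of image/text/action tokens.
    def __init__(self, vocab_size=1024, d_model=256, n_layers=4, n_heads=8, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (B, T) long
        B, T = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.blocks(x, mask=mask)               # causal self-attention
        return self.head(h)                         # (B, T, vocab) next-token logits

model = TinyWorldModel()
seq = torch.randint(0, 1024, (2, 128))              # interleaved multimodal token stream
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), seq[:, 1:].reshape(-1))
```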

DriveDreamer[50] is a world model entirely derived from real-world driving scenarios. Since modeling the world in intricate driving scenes entails an overwhelming search space, a diffusion model is harnessed to construct a representation of the environment, called Auto-DM, where the diffusion steps estimate noise and compute a loss against the input noise to optimize the model. Furthermore, a two-stage training pipeline is adopted. In the first stage, DriveDreamer acquires a deep understanding of structured traffic constraints, while the next stage equips it with the ability to anticipate future states.

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving
Auto-DM in DriveDreamer

In [56] a world modeling approach is proposed by the AD startup Waabi, which first tokenizes sensor observations with a VQ-VAE and then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, the Masked Generative Image Transformer is recast into the discrete diffusion framework with a few simple changes.

Learning Unsupervised World Models For Autonomous Driving Via Discrete Diffusion
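The first stage, turning continuous sensor embeddings into discrete tokens with a VQ-VAE codebook, reduces to a nearest-neighbor lookup; a minimal sketch (illustrative shapes, not the paper's actual encoder or codebook):

```python
import torch

def vq_tokenize(features, codebook):
    # features: (N, D) continuous sensor embeddings; codebook: (K, D) learned code vectors
    return torch.cdist(features, codebook).argmin(dim=1)   # (N,) token ids in [0, K)

codebook = torch.randn(1024, 64)                            # dummy codebook
tokens = vq_tokenize(torch.randn(4096, 64), codebook)       # tokens the world model predicts
```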

MUVO [58] is a Multimodal World Model with Geometric VOxel Representations to take into account the physical attributes of the world. It utilizes raw camera and lidar data to learn a sensor-agnostic geometric representation of the world, to predict raw camera and lidar data as well as 3D occupancy representations multiple steps into the future, conditioned on actions.

MUVO

[59] introduces the concept of interleaved vision-action pairs, which unifies the format of visual features and control signals. Based on the vision-action pairs, a general world model based on an MLLM and a diffusion model for autonomous driving, termed ADriver-I, is constructed. It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame. The generated control signals, together with the historical vision-action pairs, are further conditioned on to predict the future frames. With the predicted next frame, ADriver-I performs further control signal prediction.

ADriver-I

[60] explores a framework for learning a world model, OccWorld, in the 3D occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. To facilitate the modeling of the world evolution, they learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens describing the surrounding scenes. They then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens to decode the future occupancy and ego trajectory.

OccWorld

Drive-WM is a driving world model compatible with existing end-to-end planning models [61]. Through joint spatial-temporal modeling facilitated by view factorization, this model generates high-fidelity multi-view videos in driving scenes. Building on its generation ability, they showcase the potential of applying the world model to safe driving planning. In particular, Drive-WM enables driving into multiple futures based on distinct driving maneuvers, and determines the optimal trajectory according to image-based rewards.

Drive-WM

4 Discussions

Table I summarizes the methods of simulation and world models given in this section, including the modalities, functions and technologies applied. LLMs and diffusion models provide commonsense knowledge and generalization, while NeRF is a tool for 3D reconstruction and high-fidelity scene rendering. Diffusion models also support dynamic modeling, which is useful for world model building.

Table I

Automatic Data Annotation in Autonomous Driving

Data annotation is the cornerstone of deep learning model training, because most model training runs in a supervised manner. Automatic labeling is strongly helpful for autonomous driving research and development, especially for open-vocabulary scene annotations. LLMs and VLMs provide a way to realize it based on their learned knowledge and common sense.

Recently, models trained with large-scale image-text datasets have demonstrated robust flexibility and generalization capabilities for open-vocabulary image-based classification, detection and semantic segmentation tasks. Though they do not run in real time, their human-level cognition capability has the potential to act as a teacher model on the cloud side, teaching a student model on the client side to achieve approximate performance.

Talk2Car[10] is an object referral dataset that considers the problem in an autonomous driving setting, where a passenger requests an action that can be associated with an object found in a street scene. It contains commands formulated in textual natural language for self-driving cars. The textual annotations are free-form commands, which guide the path of an autonomous vehicle in the scene. Each command describes a change of direction, relevant to a referred object found in the scene.

OpenScene[13] is a simple yet effective zero-shot approach for open-vocabulary 3D scene understanding. The key idea is to compute dense features for 3D points that are co-embedded with text strings and image pixels in the CLIP feature space. To achieve this, associations between 3D points and pixels from posed images in the 3D scene are established, and a 3D network is trained to embed points using CLIP pixel features as supervision.

OpenScene
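The key training and inference steps can be sketched in a few lines: distill CLIP pixel features into per-point features of a 3D network, then compare point features against CLIP text embeddings for open-vocabulary queries. Tensor shapes are illustrative, and the projection from 3D points to pixels is assumed to have been computed already.

```python
import torch
import torch.nn.functional as F

def distill_loss(point_feats_3d, clip_pixel_feats):
    # point_feats_3d: (N, C) features predicted by the 3D network for N points
    # clip_pixel_feats: (N, C) CLIP image features of the pixels those points project onto
    return (1 - F.cosine_similarity(point_feats_3d, clip_pixel_feats, dim=-1)).mean()

def zero_shot_labels(point_feats_3d, text_feats):
    # text_feats: (num_classes, C) CLIP text embeddings of class-name prompts
    sim = F.normalize(point_feats_3d, dim=-1) @ F.normalize(text_feats, dim=-1).T
    return sim.argmax(dim=-1)                       # per-point open-vocabulary label

loss = distill_loss(torch.randn(1000, 512), torch.randn(1000, 512))      # training signal
labels = zero_shot_labels(torch.randn(1000, 512), torch.randn(20, 512))  # inference
```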

MSSG (Multi-modal Single Shot Grounding) [20] is a multi-modal visual grounding method for LiDAR point cloud with a token fusion strategy. It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector without any post-processing. The cross-modal learning enforces the detector to concentrate on important regions in the point cloud by considering the informative language expressions.

Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving

HiLM-D (Towards High-Resolution Understanding in MLLMs for Autonomous Driving)[33] is an efficient method to incorporate high-resolution (HR) information into MLLMs for the perception task. Specifically, HiLM-D integrates two branches: (i) the low-resolution (LR) reasoning branch, which can be any MLLM, processes LR videos to caption risk objects and discern ego-vehicle intentions/suggestions; (ii) the HR perception branch (HR-PB) ingests HR images to enhance detection by capturing vision-specific HR feature maps.

HiLM-D: Towards HR Understanding in Multimodal Large Language Models for Autonomous Driving

NuPrompt [35] is the object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space. It expands the nuScenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, a prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames, is formulated. Furthermore, a simple end-to-end baseline model based on Transformer, named PromptTrack (modified from PF-Track, Past-and-Future reasoning for Tracking), is provided.

Language Prompt for Autonomous Driving

In [37] a multi-modal auto-labeling pipeline is presented by Waymo, capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels, termed Unsupervised 3D Perception with 2D Vision-Language distillation (UP-VL). This pipeline exploits motion cues inherent in the point cloud along with freely available 2D image-text pairs. This method can handle both static and moving objects in an unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation.

Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

OpenAnnotate3D [52] is an open-source, open-vocabulary auto-labeling system that can automatically generate 2D masks, 3D masks, and 3D bounding box annotations for vision and point cloud data. It integrates the chain-of-thought (CoT) capabilities of LLMs and the cross-modality capabilities of VLMs. Current off-the-shelf cross-modality vision-language models are based on 2D images, such as CLIP and SAM.

OpenAnnotate3D

Table II summarizes the methods of automatic annotation. LLMs provide the knowledge to support labeling, while VLMs or MLLMs extend the LLMs to more modalities for annotating more diverse data.

TABLE II. AUTO ANNOTATION

Planning and E2E in Autonomous Driving

1 Integration of Large Language Models

Drive-Like-a-Human [24] is a closed-loop system that showcases its abilities in driving scenarios (for instance, HighwayEnv, i.e., a collection of environments for autonomous driving and tactical decision-making tasks), by using an LLM (GPT-3.5). Besides, perception tools and agent prompts are provided to aid its observation and decision-making. The agent prompts provide GPT-3.5 with information about its current actions, driving rules, and cautions. GPT-3.5 employs the ReAct strategy [83] to perceive and analyze its surrounding environment through a cycle of thought, action, and observation. Based on this information, GPT-3.5 makes decisions and controls vehicles in HighwayEnv, forming a closed-loop driving system.

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models
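A bare-bones version of this perceive-think-act loop, assuming the highway-env package for the environment and a stubbed-out LLM call (`query_llm` below is a placeholder, not a real API):

```python
import gymnasium as gym
import highway_env  # registers the HighwayEnv environments

ACTIONS = ["LANE_LEFT", "IDLE", "LANE_RIGHT", "FASTER", "SLOWER"]  # discrete meta-actions

def query_llm(prompt: str) -> str:
    # Placeholder: wire this to a chat model (e.g., GPT-3.5) that returns its reasoning
    # plus a final action word, as in the ReAct thought/action/observation cycle.
    return "The lane ahead is clear, so I will keep my speed. IDLE"

env = gym.make("highway-v0")
obs, info = env.reset()
for _ in range(20):
    prompt = (f"Observation (vehicle kinematics): {obs.tolist()}\n"
              f"Think step by step about the traffic, then end with one action from {ACTIONS}.")
    reply = query_llm(prompt)
    action = ACTIONS.index(reply.strip().split()[-1])   # naive parsing of the final word
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
```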

An open-loop driving commentator LINGO-1 [31] is proposed by Wayve which combines vision, language and action to enhance how to interpret and train the driving models, like Visual Question Answering (VQA). At Wayve, similar to embodied AI, vision-language-action models (VLAMs) are explored, which incorporate three kinds of information: images, driving data, and now language. The language can be used to explain the causal factors in the driving scene, which may enable faster training and generalization to new environments. VLAMs open up the possibility of interacting with driving models through dialogue, where users can ask autonomous vehicles what they are doing and why. In other words, LINGO-1 can provide a description of the driving actions and reasoning.

LINGO-1 by Wayve

A text-based representation of traffic scenes is proposed in [34] and processed with a pre-trained language encoder. Text-based representations, based on DistilBERT (a slim variant of BERT), combined with classical rasterized image representations, lead to descriptive scene embeddings, which are subsequently decoded into a trajectory prediction. Predictions on the nuScenes dataset are given as a benchmark.

Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models

Drive-as-You-Speak [36] is a framework that leverages LLMs to enhance autonomous vehicles’ decision-making processes. By combining LLMs’ natural language capabilities, contextual understanding, specialized tool usage, and synergized reasoning and acting with the various modules on autonomous vehicles, this framework aims to seamlessly bring the advanced language and reasoning capabilities of LLMs into autonomous vehicles. The framework holds the potential to revolutionize the way autonomous vehicles operate, offering personalized assistance, continuous learning, and transparent decision-making, ultimately contributing to safer and more efficient autonomous driving technologies.

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles

DiLu[38] combines a Reasoning module and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and evolve continuously. To be specific, the driver agent utilizes the Reasoning Module to query experiences from the Memory Module and leverage the common-sense knowledge of the LLM to generate decisions based on current scenarios. It then employs the Reflection Module to identify safe and unsafe decisions, subsequently refining them into correct decisions using the knowledge embedded in the LLM.

DiLu

LanguageMPC[42] employs LLMs as a decision-making component for complex AD scenarios that require human common-sense understanding. The cognitive pathways are designed to enable comprehensive reasoning with LLMs, as well as algorithms for translating LLM decisions into actionable driving commands. Through this approach, LLM decisions are seamlessly integrated with low-level controllers (MPC) by guided parameter matrix adaptation.

LanguageMPC: Large Language Models As Decision Makers For Autonomous Driving

DriveGPT4[43] is an interpretable E2E autonomous driving system utilizing LLMs (LLaMA2). DriveGPT4 is capable of interpreting vehicle actions and providing corresponding reasoning. DriveGPT4 also predicts vehicle low-level control signals in an E2E fashion. Based on tokenization, the language model can concurrently generate responses to human inquiries and predict control signals for the next step. Upon producing predicted tokens, a de-tokenizer decodes them to restore human languages.

DriveGPT4

GPT-Driver[44] transforms the GPT-3.5 model into a reliable motion planner for autonomous vehicles, which capitalizes on the strong reasoning capabilities and generalization potential inherent to LLMs. The insight is the reformulation of motion planning as a language modeling problem, in which the planner inputs and outputs are represented as language tokens, and the driving trajectories are generated through a language description of coordinate positions. Furthermore, a prompting-reasoning-finetuning strategy is proposed to stimulate the numerical reasoning potential of the LLM.

GPT-Driver: Learning to Drive with GPT
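The "trajectory as language" reformulation can be illustrated with a toy prompt and a parser for the returned coordinate text; the format below is invented for illustration, not the paper's exact template.

```python
import re

# Illustrative prompt/response format only.
prompt = (
    "Perception: car at (3.2, 15.0) moving 5 m/s; pedestrian at (-1.0, 8.5).\n"
    "Ego state: speed 6.0 m/s, heading 0.0 rad.\n"
    "Plan a 3 s trajectory as waypoints (x, y) at 0.5 s intervals."
)
response = "Trajectory: (0.0, 3.0), (0.0, 6.1), (0.1, 9.0), (0.1, 12.0), (0.2, 14.8), (0.2, 17.5)"

# Parse the language-described coordinates back into numeric waypoints for the controller.
waypoints = [(float(x), float(y))
             for x, y in re.findall(r"\(([-\d.]+),\s*([-\d.]+)\)", response)]
```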

‘LLM-Driver’[45] is an object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector captioning language data. As a matter of fact, training the LLM-Driver involves formulating it as a Driving Question Answering (DQA) problem within the context of a language model.

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Talk2BEV[46] is a large vision language model (LVLM) interface for bird’s-eye view (BEV) maps in autonomous driving contexts, based on BLIP-2 and LLaVA. The BEV map from image and LiDAR data is first generated. Then the language-enhanced map is constructed, augmented with aligned image-language features for each object from LVLMs. These features can directly be used as context to LVLMs for answering object-level and scene-level queries. Talk2BEV-Bench is a benchmark encompassing 1000 human annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the NuScenes dataset.

Talk2BEV
Talk2BEV-Bench

DriveLM is an autonomous driving (AD) dataset incorporating linguistic information [48]. Through DriveLM, people can connect LLMs and autonomous driving systems, and eventually introduce the reasoning ability of LLMs in AD to make decisions and ensure explainable planning. Specifically, in DriveLM, Perception, Prediction, and Planning (P3) are facilitated with human-written reasoning logic as a connection. To take it a step further, the idea of Graph-of-Thought (GoT) is leveraged to connect the QA pairs in a graph-style structure and use "What if"-style questions to reason about future events that have not happened.

DriveLM

Drive-Anywhere [54] is a generalizable E2E autonomous driving model with multimodal foundation models to enhance the robustness and adaptability. Specifically, it is capable of providing driving decisions from representations queried by image and text. To do so, a method is proposed to extract nuanced spatial (pixel/patch-aligned) features from Transformers to enable the encapsulation of both spatial and semantic features. This approach allows the incorporation of latent space simulation (via text) for improved training (data augmentation via text) and policy debugging.

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

Agent-Driver[57] transforms the traditional autonomous driving pipeline by introducing a tool library accessible via function calls, a cognitive memory of common sense and experiential knowledge for decision-making, and a reasoning engine capable of chain-of-thought reasoning, task planning, motion planning, and self-reflection. Powered by LLMs, Agent-Driver is endowed with intuitive common sense and robust reasoning capabilities, thus enabling a more nuanced, human-like approach to autonomous driving.

Agent-Driver

Table III summarizes the methods of LLM/VLM-based autonomous driving models. This is the most widely adopted way to integrate LLMs/VLMs into the self-driving model, where the common sense of human knowledge is naturally applied for reasoning and decision making to handle driving policies and navigation.

2 Tokenization Like GPT

A framework [11], called Talk to the Vehicle, consisting of a Natural Language Encoder (NLE), a Waypoint Generator Network (WGN) and a local planner, is designed to generate navigation waypoints for the self-driving car. The NLE takes as input the natural language instructions and translates them into high-level machine-readable codes/encodings. The WGN combines the local semantic structure with the language encodings to predict the local waypoints. The local planner generates an obstacle-avoiding trajectory to reach the locally generated waypoints and executes it by employing a low-level controller.

Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars

ADAPT (Action-aware Driving cAPtion Transformer) [15] is an end-to-end transformer-based architecture that provides user-friendly natural language narrations and reasoning for each decision-making step of autonomous vehicular control and action. ADAPT jointly trains both the driving caption task and the vehicular control prediction task through a shared video representation.

ADAPT: Action-aware Driving Caption Transformer

ConBaT (Control Barrier Transformer)[18] is an approach that learns safe behaviors from demonstrations in a self-supervised fashion, like a world model. ConBaT uses a causal transformer that learns to predict safe robot actions autoregressively using a critic that requires minimal safety data labeling. During deployment, a lightweight online optimization is employed to find actions that ensure future states lie within the learned safe set.

ConBaT

The MTD-GPT (Multi-Task Decision-Making Generative Pre-trained Transformer) method [26] abstracts the multi-task decision making problem in autonomous driving as a sequence modeling task. Leveraging the inherent strengths of reinforcement learning (RL) and the sequence modeling capabilities of the GPT, it simultaneously manages multiple driving tasks, such as left turns, straight-ahead driving, and right turns at unsignalized intersections.

MTD-GPT

BEVGPT [51] is a generative pre-trained large model that integrates driving scenario prediction, decision-making, and motion planning. The model takes bird’s-eye-view (BEV) images as the only input source and makes driving decisions based on the surrounding traffic scenarios. To ensure driving trajectory feasibility and smoothness, an optimization-based motion planning method is developed.

BEVGPT

Table IV summarizes the methods of GPT-like tokenization. Instead of directly calling a pretrained LLM/VLM, this type of approach builds the model on self-collected data (with the help of LLMs/VLMs), in a similar way to the language GPT.

TABLE IV. GPT-LIKE TOKENIZATION

3 Pre-trained Foundation Model

PPGeo (Policy Pre-training via Geometric modeling)[14] is a fully self-supervised driving policy pre-training framework that learns from unlabeled and uncalibrated driving videos. It models the 3D geometric scene by jointly predicting ego-motion, depth, and camera intrinsics. In the first stage, the ego-motion is predicted from consecutive frames, as in conventional depth estimation frameworks. In the second stage, the future ego-motion is estimated from a single frame by a visual encoder, and can be optimized with the depth and camera intrinsics networks well learned in the first stage.

PPGeo

AD-PT (Autonomous Driving Pre-Training) [23] leverages few-shot labeled and massive unlabeled point-cloud data to generate unified backbone representations that can be directly applied to many baseline models and benchmarks, decoupling the AD-related pre-training process from the downstream fine-tuning task. In this work, a large-scale pre-training point-cloud dataset with diverse data distributions is built for learning generalizable representations.

AD-PT (Autonomous Driving Pre-Training)

UniPad[49] is a self-supervised learning paradigm applying 3D volumetric differentiable rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. It can be seamlessly integrated into both 2D and 3D frameworks, enabling a more holistic comprehension of the scenes.

UniPad: A Universal Pre-Training Paradigm For Autonomous Driving

Table V summarizes the methods of pretrained foundation models. These approaches seldom use LLM/VLM information directly.

TABLE V. PRETRAINED FOUNDATION MODELS

Conclusion

In simulation, we find that the combination of language models, diffusion models, and NeRF will be the trend to realize photorealistic sensor data and human-like traffic flows. The same holds for the world model, which additionally needs to model the environment behavior (especially the dynamics), because the world model’s goal is prediction.

In automatic annotation, multi-modal language models play important roles, especially for 3D data. Mostly the vision-language model serves as the base and is expanded to additional modalities with less data. LLMs and VLMs provide the possibility of open-vocabulary scene understanding.

In decision making and E2E driving, we still prefer the integration of large scale language models or multi-modal large scale language models. Either pre-trained foundation models or GPT-like tokenization could become a strong backbone for large-scale autonomous driving models; however, it is harder for them to realize grounding capabilities due to limited data collection and concerns about hallucination.

References

[1] A Jain, B Mildenhall, J T. Barron, P Abbeel, and B Poole. Zero-shot text-guided object generation with dream fields (NeRF+CLIP). arXiv 2112.01455, 2021

[2] B Poole, A Jain, J T Barron, and B Mildenhall. DreamFusion: Text-to-3d using 2d diffusion (Imagen+NeRF+Diffusion). arXiv 2209.14988, 2022.

[3] G Metzer, E Richardson, O Patashnik, R Giryes, and D Cohen-Or. Latent-NeRF for shape-guided generation of 3d shapes and textures (Text-to-3D). arXiv 2211.07600, 2022

[4] C Lin, J Gao, L Tang, et al. Magic3D: High resolution text-to-3D content creation (NeRF+diffusion). arXiv 2211.10440, 2022

[5] C Deng, C Jiang, C Qi, et al., NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors, arXiv 2212.03267, 2022

[6] J Zhang, X Li, Z Wan, C Wang, and J Liao, Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields (Diffusion), arXiv 2305.11588, 2023

[7] J Betker et al., Improving Image Generation with Better Captions (DALL-E3), OpenAI report, 10, 2023

[8] S Wu et al., NExT-GPT: Any-to-Any Multimodal LLM, arXiv 2309.05519, 2023

[9] X Zhao et al., Making Multimodal Generation Easier: When Diffusion Models Meet LLMs (EasyGen), arXiv 2310.08949, 2023

[10]T Deruyttere et al., Talk2Car: Taking Control of Your Self-Driving Car, Int. Joint Conf. on Natural Language Processing, Hong Kong, China, Nov. 2019

[11]N N Sriram et al., Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars, IEEE IROS, 2019

[12]Z Zhu, S Zhang, Y Zhuang, et al., RITA: Boost Autonomous Driving Simulators with Realistic Interactive Traffic Flow, arXiv 2211.03408, 2022

[13]S Peng et al., OpenScene: 3D Scene Understanding with Open Vocabularies, arXiv 2211.15654, 2022

[14]P Wu, L Chen, H Li, et al. Policy Pre-Training for Autonomous Driving via Self-Supervised Geometric Modeling, ICLR, 2023

[15]B Jin et al., ADAPT: Action-aware Driving Caption Transformer, arXiv 2302.00673, 2023

[16]Z. Li, L. Li, Z. Ma, et al. READ: Large-scale neural scene rendering for autonomous driving, Conf. Artif. Intell. (AAAI), 2023

[17]Z Zhang et al., TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction, arXiv 2303.04116, 2023

[18]Y Meng , S Vempralay, R Bonatti, et al., ConBaT: Control Barrier Transformer for Safe Policy Learning, arXiv 2303.04212, 2023

[19]E Pronovost, K Wang, Nick Roy, Generating Driving Scenes with Diffusion, arXiv 2305.18452, 2023

[20]W Cheng et al., Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving, arXiv 2305.15765, 2023

[21]Z Zhong, D Rempe, D Xu, et al., Guided conditional diffusion for controllable traffic simulation. IEEE ICRA, 2023

[22]Z Zhong, D Rempe, Y Chen, et al. Language-Guided Traffic Simulation via Scene-Level Diffusion, arXiv 2306.06344, 2023

[23]J Yuan et al., AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset, arXiv 2306.00612, 2023

[24]D Fu, X Li, L Wen, et al., Drive Like a Human: Rethinking Autonomous Driving with Large Language Models, arXiv 2307.07162, 2023

[25]C Wu et al., MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving, arXiv 2307.15058, 2023

[26]J Liu, P Hang, X Qi, et al., MTD-GPT: A Multi-Task Decision-Making GPT Model for Autonomous Driving at Unsignalized Intersections, arXiv 2307.16118, 2023

[27]Z. Yang, Y. Chen, J. Wang, et al, UniSIM: A neural closed-loop sensor simulator, IEEE CVPR, 2023

[28]D Bogdoll, L Bosch, T Joseph, et al., Exploring the Potential of World Models for Anomaly Detection in Autonomous Driving, arXiv 2308.05701, 2023

[29]M Chen, UniWorld: Autonomous Driving Pre-training via World Models, arXiv 2308.07234, 2023

[30]L Li, Lian, Y-C Chen, Adv3D: Generating 3D Adversarial Examples in Driving Scenarios with NeRF, arXiv 2309.01351, 2023

[31]Wayve AI, LINGO-1: Exploring Natural Language for Autonomous Driving, https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/ , Sept. 2023

[32]Y Jin et al., SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model, arXiv 2309.03135, 2023

[33]X Ding et al., HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving, arXiv 2309.05186, 2023

[34]A Keysan et al., Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models for autonomous driving, arXiv 2309.05282, 2023

[35]D Wu et al., Language Prompt for Autonomous Driving, arXiv 2309.04379, 2023

[36]C Cui et al., Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles, arXiv 2309.10228, 2023

[37]M Najibi et al., Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving, arXiv 2309.14491, 2023

[38]L Wen et al., DiLu: a Knowledge-Driven Approach to Autonomous Driving with Large Language Models, arXiv 2309.16292, 2023

[39]S Sun et al., DriveSceneGen: Generating Diverse and Realistic Driving Scenarios from Scratch, arXiv 2309.14685, 2023

[40]A Hu et al., GAIA-1:A Generative World Model for Autonomous Driving, arXiv 2309.17080, 2023

[41]R Gao et al., MagicDrive: Street View Generation With Diverse 3D Geometry Control, arXiv 2310.02601, 2023

[42]H Sha et al., LanguageMPC: Large Language Models As Decision Makers For Autonomous Driving, arXiv 2310.03026, 2023

[43]Z Xu et al., DriveGPT4: Interpretable End-To-End Autonomous Driving Via Large Language Model, arXiv 2310.01412, 2023

[44]J Mao et al., GPT-Driver: Learning To Drive With GPT, arXiv 2310.01415, 2023

[45]L Chen et al., Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving, arXiv 2310.01957, 2023

[46]V Dewangan et al., Talk2BEV: Language-enhanced Bird’s-eye View Maps for Autonomous Driving, arXiv 2310.02251, 2023

[47]X Li, Y Zhang, X Ye, DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model, arXiv 2310.07771, 2023

[48]C Cui et al., Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles, arXiv 2310.08034, 2023

[49]H Yang et al., UniPad: A Universal Pre-Training Paradigm For Autonomous Driving, arXiv 2310.08370, 2023

[50]X Wang et al., DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving, arXiv 2309.09777, 2023

[51]P Wang et al., BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning, arXiv 2310.10357, 2023

[52]Y Zhou et al., OpenAnnotate3D: Open-Vocabulary Auto-Labeling System for Multi-modal 3D Data, arXiv 2310.13398, 2023

[53]X Zhou et al., Vision Language Models in Autonomous Driving and Intelligent Transportation Systems, arXiv 2310.14414, 2023

[54]T-H Wang et al., Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models, arXiv 2310.17642, 2023

[55]Z Yang, X Jia, H Li, J Yan, A Survey of Large Language Models for Autonomous Driving, arXiv 2311.01043, 2023

[56]L Zhang, Y Xiong, Z Yang et al., Learning Unsupervised World Models For Autonomous Driving Via Discrete Diffusion, arXiv 2311.01017, 2023

[57]J Mao, J Ye, Y Qian, M Pavone, Y Wang, A Language Agent for Autonomous Driving, arXiv 2311.10813, 2023

[58] D Bogdoll, Y Yang, J. M Zollner, MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations, arXiv 2311.11762, 2023

[59] F Jia, W Mao, Y Liu, et al., ADriver-I: A GeneralWorld Model for Autonomous Driving, arXiv 2311.13549, 2023

[60] Y Wang, J He, L Fan, et al., Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving (Drive-WM), arXiv 2311.17918, 2023

[61] A Tonderski, C Lindström, G Hess, et al., NeuRAD: Neural Rendering for Autonomous Driving, arXiv 2311.15260, 2023


Yu Huang

Working in computer vision, deep learning, AR & VR, autonomous driving, image & video processing, visualization, and large-scale foundation models.