What Data Is Imperative for Action Learning in Embodied AI? (1)

Yu Huang
24 min read · Dec 7, 2024


Abstract

Data is the bottleneck in the development of embodied AI, or embodied intelligence (EI). In this paper, we clarify what data is imperative for action/behavior training in EI. First, AI development is briefly reviewed in terms of algorithms, compute, and data (as well as a classification of AI levels). Then we focus on embodied AI's action/behavior learning methods (including world models and visual-language-action models). For data collection, robot types and dexterity are surveyed, and then robot simulation platforms and real action-data capture platforms for robots or humans are compared. With respect to human data, egocentric AI, or wearable AI, is investigated. Finally, the requirements of datasets for EI are presented. Lastly, we discuss generalization tricks.

1. Introduction

Progress in artificial intelligence (AI) is propped up by advances in three areas: algorithms, compute, and data. Algorithms are the procedures or formulas that computer systems use for solving problems or completing tasks. Compute refers to the use of computer systems to perform calculations or process data. Data refers to the information required to train and validate AI models.

1.1 Algorithms

Transformer. Different from Convolutional Neural Networks (CNNs), the Transformer [22] is an encoder-decoder architecture designed for extracting information from natural language. The basic building block is called a cell, which is composed of two modules: multi-head attention (MHA) and a feed-forward network (FFN).

In addition, positional information is added to the model explicitly via positional encoding (PE), to retain the order of words in a sentence.
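As a minimal sketch (in PyTorch, with illustrative layer sizes rather than any particular published configuration), one such cell plus a sinusoidal positional encoding could look like this:

```python
import math
import torch
import torch.nn as nn

class TransformerCell(nn.Module):
    """One encoder cell: multi-head attention (MHA) + feed-forward network (FFN)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x)      # self-attention over the sequence
        x = self.norm1(x + attn_out)         # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))      # residual connection + layer norm
        return x

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding added to token embeddings to keep word order."""
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# x: (batch, seq_len, d_model) token embeddings, with PE added explicitly
x = torch.randn(2, 16, 512) + positional_encoding(16, 512)
y = TransformerCell()(x)
```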

Mamba. Mamba [276, 320] is a selective structured state space model that excels in long-sequence modeling. Mamba handles challenges in long sequences by overcoming the local perception limitations of CNNs and the quadratic computational complexity of Transformers. Figure 1 shows the block architectures of Mamba-1 and Mamba-2.

Figure 1. Mamba 1 and Mamba 2.

Foundational Model. Foundation models [126, 210, 212] have taken shape most strongly in NLP. On a technical level, foundation models are enabled by transfer learning and scale. The idea of transfer learning is to take the “knowledge” learned from one task and apply it to another task.

Foundation models usually follow the paradigm of pre-training a model on a surrogate task and then adapting it to the downstream task of interest via fine-tuning.

LLMs. Most of the large language models (LLMs) [137, 243] appearing recently are foundation models or built on top of them. Recent models with billions of parameters have been effectively utilized for zero/few-shot learning, achieving impressive performance without requiring large-scale task-specific data or model parameter updates.

LLMs are the category of Transformer-based language models that are characterized by having an enormous number of parameters, typically numbering in the hundreds of billions or even more. These models are trained on massive text datasets, enabling them to understand natural language and perform a wide range of complex tasks, primarily through text generation and comprehension.

Some well-known examples of LLMs include OpenAI GPT-1/2/3/3.5/4 [23, 29, 43, 80, 132], Meta LLaMA 1/2/3 [123, 169, 318], Microsoft Phi-1/1.5/2/3[158, 183, 217, 273], Mistral [194], Google Gemini [223] and its open lightweight version Gemma [259] etc.

By selecting the appropriate prompts [170], the model behavior is manipulated so that the pre-trained LM itself can be used to predict the desired output, sometimes even without any additional task-specific training.

The emergent abilities of LLMs are one of the most significant characteristics that distinguish them from smaller language models. Specifically, in-context learning (ICL) [114, 170], instruction following [177] and reasoning with chain-of-thought (CoT) [115, 148, 190, 275] are three typical emergent abilities for LLMs.

Parameter-efficient fine tuning (PEFT) is a crucial technique used to adapt pre-trained models to specialized downstream applications, in which a typical method, LoRA [59], utilizes low-rank decomposition matrices to reduce the number of trainable parameters.
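As a rough illustration of the low-rank idea behind LoRA (a hand-rolled sketch, not the official `peft` API; dimensions and initialization scales are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
# Only A and B (2 * 8 * 4096 parameters) are trained instead of the full 4096 x 4096 matrix.
```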

Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values [80, 189], e.g., helpful, honest, and harmless.

Recently, the OpenAI o1 model was announced [339] for complex reasoning with Reinforcement Learning (RL), kicking off the inference-time optimization style under the new test-time scaling law. Similar techniques can be seen in Math-Shepherd [222], MiPS [239], OmegaPRM [295], REBASE [319], scaling LLM test-time compute [322], Qwen2.5-Math [334], OpenR [342], and Dualformer [347], etc.

MoE. Mixture of Experts (MoE) [226, 300] models select different parameters for each incoming example. The result is a sparsely activated model with a very large number of parameters but a constant computational cost.
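A minimal top-k routing sketch, assuming a token-level router and small illustrative expert sizes (not any specific published MoE implementation):

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Route each token to its top-k experts; only the selected experts run, so compute stays roughly constant."""
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                            # x: (tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)     # per-token expert choice
        out = torch.zeros_like(x)
        for j in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, j] == e                # tokens routed to expert e at slot j
                if mask.any():
                    out[mask] += weights[mask, j:j + 1] * expert(x[mask])
        return out
```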

CoE. Composition of Experts (CoE) is proposed by SambaNova [267, 286] (shown in Figure 2) to combine the broad capabilities and accuracy of large models with the performance of much smaller models, similar to the routing network in RouteLLM [305].

Figure 2. Samba-CoE

VLMs. Vision-Language Models (VLMs) bridge the capabilities of Natural Language Processing (NLP) and Computer Vision (CV), breaking down the boundaries between text and visual information to connect multimodal data, such as DINO 1/2 [56, 144], CLIP [63], DALL-E 1/2/3 [64, 86, 203], BLIP-1/2 [78, 119], Flamingo [88], SAM [141], SEEM [143] and GPT-4V [192] etc.

MLLMs. Motivated by the potential of LLMs, numerous multimodal LLMs (MLLMs) [186] have been proposed to expand LLMs to the multimodal field, i.e., perceiving image/video inputs and conversing with users over multiple rounds, such as OpenAI's GPT-4o [293] and Anthropic's Claude 3.5 [362]. Pre-trained on massive image/video-text pairs, the above models can only handle image-level tasks, such as image captioning and question answering.

Agents. LLM based agents [178, 185, 229, 240] can exhibit reasoning and planning abilities comparable to symbolic agents through techniques like CoT/ToT and problem decomposition. They can also acquire interactive capabilities with the environment, akin to reactive agents, by learning from feedback and performing new actions.

Embodied AI. Recent works have developed more efficient reinforcement learning agents for robotics and embodied AI [209, 212, 220, 221, 238, 241, 249, 288, 310]. The focus is on enhancing agents’ abilities for planning, reasoning, and collaboration in embodied environments.

Some approaches combine complementary strengths into unified systems for embodied reasoning and task planning. High-level commands enable improved planning while low-level controllers translate commands into actions.

1.2 Compute

Distributed Training. The Colossal-AI system [72] introduced a unified interface to scale sequential code of model training to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods integrated with zero redundancy optimizer (ZeRO).

Model parallelism refers to partitioning, or sharding, a neural architecture graph into subgraphs, and assigning each subgraph, or model shard, to a different device. Data parallelism enables multiple mini-batches of data to be consumed in parallel. Tensor parallelism splits a tensor into N chunks along a specific dimension and fits a large model in multiple GPUs. Pipeline parallelism decomposes incoming batches into mini-batches and divides the layers of the model across multiple GPUs. Sequence parallelism partitions along the sequence dimension, making it an effective method for training long text sequences.
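As a toy, single-process illustration of the tensor-parallel idea (real systems place each weight shard on a separate GPU and use collective communication; the function below only mimics the splitting and gathering):

```python
import torch

def tensor_parallel_linear(x, weight, n_shards=2):
    """Toy tensor parallelism: split the weight along the output dimension into shards
    (one per device), compute each partial result independently, then concatenate.
    Data parallelism would instead split the mini-batch x across devices."""
    shards = weight.chunk(n_shards, dim=0)       # shard i: (out_features / n_shards, in_features)
    partial = [x @ w.T for w in shards]          # computed independently "on each device"
    return torch.cat(partial, dim=-1)            # gather the partial outputs

x = torch.randn(4, 1024)                         # a mini-batch of activations
weight = torch.randn(4096, 1024)                 # a large linear layer's weight
assert torch.allclose(tensor_parallel_linear(x, weight), x @ weight.T, atol=1e-4)
```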

Efficient Transformer Architecture. There are some modifications of the Transformer architecture that improve its efficiency and scalability, such as multi-query attention (MQA) / grouped-query attention (GQA), Switch Transformers, rotary position embedding (RoPE), and FlashAttention-1/2, as used in Megatron-LM (a large, powerful transformer framework developed by NVIDIA) [31].
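A small sketch of grouped-query attention, assuming query/key/value tensors already split into heads (the shapes and helper name below are illustrative):

```python
import torch

def grouped_query_attention(q, k, v):
    """Grouped-query attention (GQA): queries keep all heads, while keys/values use fewer
    heads that are shared within query groups, shrinking the KV cache.
    Shapes: q (B, Hq, T, D); k, v (B, Hkv, T, D), with Hq a multiple of Hkv."""
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)          # each KV head serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v       # (B, Hq, T, D)

# MQA is the special case with a single KV head (Hkv = 1).
q = torch.randn(1, 8, 16, 64); k = torch.randn(1, 2, 16, 64); v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)             # -> (1, 8, 16, 64)
```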

Perceiver IO [62] employs the self-attention mechanism on a not-too-large set of latent vectors (e.g., 256 or 512), and only uses the inputs to perform cross-attention with the latents. As a result, the time and memory requirements of the self-attention mechanism do not depend on the size of the inputs.

Hardware Efficiency. FlashAttention [90] is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention-2 [168] is a modified version with better work partitioning that further improves efficiency.

Memory Optimization. Zero Redundancy Optimizer (ZeRO) [33] optimizes redundant model states in memory by partitioning them in three corresponding stages across processors and optimizing the communication, with the final model states evenly split on each node. ZeRO-Offload [51] offloads the data and computations of both states to the CPU, thus leveraging the CPU to save the GPU memory. ZeRO-Infinity [52] leverages CPU and NVMe memory (which is cheap, slow, but massive) in parallel across multiple devices to aggregate efficient bandwidth for current GPU clusters.

The autoregressive decoding method generates tokens one by one. In each decoding step, all model weights are loaded from the off-chip high-bandwidth memory (HBM) to the GPU chip, resulting in high memory access costs. The size of the KV cache increases with the input length, which may lead to memory fragmentation and irregular memory access patterns.

The KV Cache method stores and reuses previous (K-V) pairs in the Multi-Head Self-Attention (MHSA) block [208]. It consists of two steps, shown in Figure 3: 1) Pre-filling: the LLM computes and stores the KV cache of the initial input tokens and generates the first output token; 2) Decoding: the LLM uses the KV cache to generate output tokens one by one, and updates it with the K-V pairs of each newly generated token.

Figure 3. KV-Cache
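A schematic of the two phases above, assuming a HuggingFace-style model interface that accepts `past_key_values`/`use_cache` and returns `logits` plus the updated cache (the `model` here is a placeholder):

```python
import torch

@torch.no_grad()
def generate_with_kv_cache(model, prompt_ids, max_new_tokens=32):
    """Schematic greedy decoding with a KV cache."""
    # 1) Pre-filling: run the full prompt once, storing the per-layer (K, V) pairs.
    out = model(prompt_ids, past_key_values=None, use_cache=True)
    cache = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    generated = [next_id]

    # 2) Decoding: feed only the newest token; reuse the cached K-V pairs and append the new ones.
    for _ in range(max_new_tokens - 1):
        out = model(next_id, past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)
    return torch.cat(generated, dim=-1)
```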

PagedAttention is proposed by vLLM [184] (a high-throughput and memory-efficient inference and serving engine for LLMs), being an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems.

Inference Acceleration. Medusa [231] adds extra "heads" to LLMs to predict multiple future tokens simultaneously; each head produces multiple likely tokens for its corresponding position. llama.cpp [235] is a low-level C/C++ implementation of the LLaMA architecture that supports multiple BLAS backends for fast processing. It uses the GGUF quantization scheme and has CPU and GPU offload capabilities.

1.3 Data

In CV and NLP, training with large, diverse datasets scraped from the internet [7, 251] can produce models that are generalizable to a wide variety of new tasks.

Similarly, in embodied AI, such as robotic manipulation, recent work has shown that larger, more diverse robotic training datasets can push the limits of policy generalization, including transfer to new objects, instructions, scenarios, and embodiments [70, 179, 195].

Researchers have turned to simulation environments [8, 45, 154, 199, 343] to ease the difficulty of data acquisition and accelerate the data collection process. However, this strategy also has its own challenges, the most important of which is the gap between simulation and reality.

For a dataset, cleaning, filtering, and curation are necessary to protect privacy, align with human preferences, and maintain diversity and quality, even within a specific domain or profession.

1.4 Levels of AI

The "sparks" of AGI (artificial general intelligence) have been present in the latest generation of LLMs. Levels of AGI were proposed by Google DeepMind based on the depth (performance) and breadth (generality) of capabilities [207].

AI agents are capable of comprehending, predicting, and responding based on their training and input data. While these capabilities are developed and improved, it is important to understand their limitations and the effect of the underlying data they are trained on. Levels of AI agents are proposed in [285], based on utilities and power, including perception, tools, action, reasoning, decision making, memory, reflection, generalization, self-learning, personality, and collaboration, etc.

2. Spatial AI and Embodied AI

2.1 Spatial AI

"Active perception", proposed in the 1980s [1, 2], refers to a process where an organism actively gathers information about its environment by moving and changing its viewpoint, while "passive vision" means simply receiving visual information without any intentional movement or control over the sensory input. Figure 4 illustrates an active tracking example.

Figure 4. Active tracking

However, the active sensing goal was overambitious at that time, because some revolutionary techniques were not yet available, such as good visual features (SIFT), depth sensors (Kinect), deep learning models, and the reasoning capabilities brought by LLMs.

Spatial computing [246] is a technological advancement that facilitates the seamless integration of devices into the physical environment, resulting in a more natural and intuitive digital world user experience in VR or AR.

Spatial AI is proposed by Fei-Fei Li [292], and a startup AI company, World Labs [340], was founded to build Large World Models that perceive, generate, and interact with the 3D world.

It can be seen that spatial AI is not the same as spatial computing; instead, it aims at active perception with comprehensive understanding and modeling of environments, on which agents can reason, plan, and take actions. However, spatial AI is also not equivalent to embodied AI.

2.2 Embodied AI

The concept of embodied intelligence (EI) [310] was first proposed by Turing in the embodied Turing test established in 1950, which aims to determine whether the agent can show intelligence that is not limited to solving abstract problems in a virtual environment (digital space).

Note: the AI agent is the basis of EI; it exists in both the digital space and the physical world, embodied in various entities including robots and other devices, and must cope with the complexity and unpredictability of the physical world.

Therefore, the development of EI is regarded as a fundamental way to achieve AGI. EI covers multiple key technologies such as CV, NLP, and robotics, among which the most representative are [238, 241, 310] embodied perception (including visual language navigation [249, 336]), embodied interaction, embodied agents (including visual-language-action models [288, 299]), and virtual-to-reality transfer, etc.

3. Action/Behavior Learning Methods

Robot manipulation refers to how robots intelligently interact with objects around them, such as grabbing objects and carrying them from one place to another. Dexterous manipulation skills enable robots to assist humans in completing a variety of tasks that may be too dangerous or difficult to complete.

This requires robots to be able to intelligently plan and control the movements of their arms. Object manipulation is a key skill for robots to complete multiple tasks. However, this also poses challenges to robotics [30, 38, 109, 209, 212, 220, 221, 238, 244, 277, 278, 326]. Note: Autonomous driving [210, 324] is a special domain in EI.

Imitation learning (IL) [73] aims to imitate expert behavior. Generally, IL includes three main methods: behavior cloning (BC), inverse reinforcement learning (IRL), and generative adversarial imitation learning (GAIL).

Behavior cloning (BC) is a supervised learning formulation for robot action policy learning. Given expert demonstration data consisting of a series of state-action pairs, the model is trained to predict the correct action vector for a given input state (e.g., an image). This framework has proven to be very effective, especially when provided with a sufficient amount of training data.
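A minimal behavior-cloning loop over such state-action pairs might look like the following sketch, where `policy` and `demo_loader` are placeholders for a policy network and a demonstration dataset:

```python
import torch
import torch.nn as nn

def train_behavior_cloning(policy, demo_loader, epochs=10, lr=1e-4):
    """Minimal behavior cloning: supervised regression from observations to expert actions.
    `demo_loader` yields (obs, expert_action) pairs from expert demonstrations."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                      # continuous actions; use cross-entropy if discrete
    for _ in range(epochs):
        for obs, expert_action in demo_loader:
            pred_action = policy(obs)           # e.g., end-effector pose delta + gripper command
            loss = loss_fn(pred_action, expert_action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```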

Some IL methods can be listed as below: Transporter Networks [48], CLIPort [69], BC-Z [73], Behavior Transformers (BeT) [93], WHIRL [97], Perceiver-Actor [101], RoboCat [159], Vi-PRoM [176] and VQ-BeT [255].

IL methods, however, present their own challenges, particularly in high-precision domains. To address these challenges, Action Chunking with Transformers (ACT), shown in Figure 5, which learns a generative model over action sequences, is proposed by UC Berkeley along with the ALOHA teleoperation platform [145]. Its variants include RoboAgent/MT-ACT [182], Bunny-VisionPro [309], InterACT [313], CrossFormer [328], RUM [332] and Haptic-ACT [333].

Figure 5. ACT

Diffusion-based strategies leverage the success of diffusion models in the field of computer vision. Among them, the diffusion policy (DP) [127] (shown in Figure 6) is one of the earliest strategies to use diffusion for action generation. Compared to common behavior cloning strategies, diffusion strategies show advantages in handling multimodal action distributions and high-dimensional action spaces.

Figure 6. Diffusion policy
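The following is a schematic, DDPM-style sampling loop for action generation conditioned on an observation (a sketch of the general idea, not the released diffusion policy code; `eps_net`, the horizon, and the noise schedule are illustrative):

```python
import torch

@torch.no_grad()
def sample_action_sequence(eps_net, obs_feat, horizon=16, action_dim=7, n_steps=50):
    """Diffusion-policy-style sampling: start from Gaussian noise over a whole action
    sequence and iteratively denoise it, conditioned on the current observation.
    `eps_net(noisy_actions, t, obs_feat)` is a placeholder network predicting the added noise."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    actions = torch.randn(1, horizon, action_dim)          # pure noise over the action chunk
    for t in reversed(range(n_steps)):
        eps = eps_net(actions, torch.tensor([t]), obs_feat)
        # DDPM update: remove the predicted noise for this step.
        actions = (actions - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            actions += torch.sqrt(betas[t]) * torch.randn_like(actions)
    return actions                                          # executed by a low-level controller
```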

Other variants of DP include Crossway Diffusion [163], 3D Diffusion Policy [256], iDP3 [349], ALOHA Unleashed [350] and Dex-Diffuser [361].

Reinforcement learning (RL) [25, 54, 357] is a family of methods that enables robots to optimize their policies through interaction with the environment, guided by a reward function. These interactions are usually conducted in a simulated environment and are sometimes augmented with data from physical robot hardware to achieve simulation-to-real transfer.

RL algorithms are classified along several axes: (1) model-based or model-free, (2) value-based or policy-based, and (3) on-policy or off-policy.

Unlike imitation learning, RL does not require human demonstrations and (theoretically) has the potential to achieve superhuman performance. In RL problems, the expected return of a policy is maximized using rollout data collected from interactions with the environment. Feedback is received from the environment in the form of reward signals, guiding the robot to understand which actions lead to favorable outcomes and which do not.
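A minimal REINFORCE-style update illustrates maximizing the expected return from rollout data and reward signals (assuming a Gymnasium-style environment with a discrete action space; `policy` is a placeholder that returns an action distribution):

```python
import torch

def reinforce_episode(policy, optimizer, env, gamma=0.99):
    """One REINFORCE update: roll out the policy, compute discounted returns from the reward
    signal, and increase the log-probability of actions that led to high return.
    `policy(obs)` is assumed to return a torch.distributions.Categorical."""
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        done = terminated or truncated

    returns, g = [], 0.0
    for r in reversed(rewards):                                    # discounted return-to-go
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # simple variance reduction

    loss = -(torch.stack(log_probs) * returns).sum()               # maximize expected return
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return sum(rewards)
```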

MT-OPT [54] is a scalable and generalizable multi-task deep RL method developed with a multi-robot collective learning system for data collection.

Some RL methods using the Transformer architecture are listed as below: Decision Transformer [57], Trajectory transformer [58], and OCDM [356].

Note: The world model is a special category for prediction in embodied AI, mostly realized with RL and diffusion models, which will be discussed in Section 3.1.

Robot manipulation tasks have a highly hierarchical structure. A complicated task can be divided into subtasks, which are then further broken down into even smaller subtasks. Even basic skills, such as grasping or pushing, can be further broken down into multiple goal-oriented action stages. This hierarchy divides the main task into smaller, more tractable problems.

The robot can learn skill policies to perform the lowest-level tasks and then use these skills as the basis for actions to perform the next-level tasks. Thus, the robot can gradually learn a hierarchical strategy for skills, and the resulting strategy hierarchy reflects the task hierarchy.

LLM-based robots point to a new direction [209, 212, 220, 221, 238]: LLMs provide robots with natural-language interaction, allowing users to communicate with robots in an intuitive and convenient way; LLMs enable robots to adapt to different tasks and environments; and LLMs enable robots to better collaborate with humans. Robots can thus jointly solve problems, make plans, and perform tasks through interaction with language models.

A rough classification of robot learning methods is categorized as follows:

· Pre-trained language models as part of a system for task planning & execution: T-LM [76], Socratic Model [84], GATO [89], LATTE [98], ProgPrompt [102], LLM-planner [111], DEPS [121], Reflexion (LLM + RL) [134], Self-Refine [136], Beam search [146], Embodied-GPT (LLM + Transformer) [149], SwiftSage [152], ChatGPT for robotics [161], EUREKA [198], RoboGPT [213], LAP [265], Socratic Planner [274];

· Applying language models to robot control, and fine-tuning the model with actions, and obtaining generalizable control policies: SayCan [85], Inner Monologue [96], CaP [99], Perceiver-Actor (LLM + Transformer) [101], ReAct [106], GA [142], VOYAGER [150], DoReMi [162], SayPlan [166], SUDD (LLM + Diffusion) [171], PSL (LLM + RL) [280], Octo [287], ReAd [289], Grounding-RL [296];

· Pretrained vision-language models are integrated for robotic representation learning as components of a modular system for task planning and execution: VoxPoser [165], Concept-Graphs [191], VLP [197], Robo-Flamingo [204], ViLa [215], SpatialVLM [232], AutoRT [233], RoboMamba [294], RDT-1B (DiT) [345], DiT-Block Policy [348];

· End-to-end learned vision-language-action (VLA) models, which are the most promising category in this domain, discussed individually in Section 3.2.

3.1 World Model

A general world model [282, 306] is an important way to achieve artificial general intelligence (AGI) and is the cornerstone of various applications, from virtual environments to decision-making systems.

Creating a world model in a simulated environment that is very similar to the real world can help algorithms generalize better when transferred. The world model approach is to build an end-to-end model that makes decisions in a generative or predictive way, by predicting the next state, mapping vision to action, or even learning any other mapping relationship.

The biggest difference between this world model and the VLA model is that the VLA model is first trained on a large-scale Internet dataset to achieve high-level emergent capabilities and then fine-tuned with real-world robot data. In contrast, the world model is trained from scratch on physical world data and gradually develops high-level capabilities as the amount of data increases.

Such world models are still low-level physical world models, similar to the mechanisms of the human neural reflex system to some extent. This makes them more suitable for scenarios where both input and output are relatively structured, such as autonomous driving (input: vision; output: throttle, brake, steering wheel) or object sorting (input: vision, instructions, digital sensors; output: grasping the target object and placing it at the target location). They are not well suited for generalization to unstructured, complex, specific tasks.

Learning world models has broad application prospects in the field of physical simulation. Compared with traditional simulation methods, it has significant advantages, such as being able to reason about interactions with incomplete information, meeting real-time computing requirements, and improving prediction accuracy over time.

The prediction ability of this world model is crucial, enabling robots to develop the physical intuition which is required for manipulation in the human world. According to the learning pipeline of the world environment, they can be divided into generation-based methods, prediction-based methods, and knowledge-driven methods.

The architecture of the world model aims to emulate the consistent thinking and decision-making process of the human brain, integrating the following key components [306]: a perception module, a memory module, a control/action module, and the world model module as the core.

The world model is able to simulate cognitive processes and decision making similar to humans. By integrating these modules, the world model achieves a comprehensive and predictive understanding of its environment.

Figure 7. Dreamer

A category of world models is roughly classified as follows:

a) RL: Dreamer 1/2/3 [37, 47, 116] (shown in Figure 7), DayDreamer [94], TD-MPC 1/2 [81, 201], FOWM [200], PWM [308];

b) Transformer-based: TWM [130], STORM [196], WHALE [358];

c) RL + Transformer-based: MWM [95];

d) Diffusion-based: UniPi [120];

e) LLM-based: DEKARD [118], Surfer [157], Google Gemini [223], 3D-VLA [261];

f) Transformer + Diffusion: Sora [252];

g) LLM + RL: DynaLang [175], GenRL [303];

h) LLM + Diffusion: RoboDreamer [270].

3.2 VLA model

Vision-Language-Action (VLA) models [288] are a class of models designed to process multimodal inputs, combining information from the visual, language, and action modalities: they take vision and language inputs and output robot actions to complete embodied tasks. They are the cornerstone of robot policies for instruction following in EI.

VLA models are developed to solve instruction following tasks in EI. These models rely on powerful visual encoders, language encoders, and action decoders.

EI requires controlling physical entities and interacting with the environment. Robotics is the most prominent field of EI. In language-conditioned robotics tasks, the policy must have the ability to understand language instructions, visually perceive the environment, and generate appropriate actions, which requires the multimodal capabilities of VLA.

Pre-trained visual representations emphasize the importance of visual encoders, because visual observations play a crucial role in perceiving the current state of the environment; the visual encoder therefore sets an upper limit on the performance of the entire model. In VLA, general vision models are pre-trained with robot or human data to enhance their capabilities in tasks such as object detection, affordance map extraction, and even vision-language alignment, which are critical for robotic tasks.

Compared with earlier deep RL methods, VLA-based policies show superior diversity, flexibility, and generalization in complex environments. This makes VLA applicable not only to controlled environments like factories, but also to daily life tasks (homes).

To improve the performance of various robotic tasks, some VLAs prioritize obtaining high-quality pre-trained visual representations; others focus on improving low-level control policies, which are good at receiving short-term task instructions and generating actions that can be executed through robot motion planning; in addition, some VLAs focus on decomposing long-term tasks into subtasks that can be executed by low-level control policies.

The combination of low-level control policies and high-level task planners can be viewed as a hierarchical strategy. The high-level task planner generates plans based on user instructions, which are then executed step by step by the low-level control policy.
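A schematic of this hierarchy, with `llm_planner`, `skill_policies`, `observe`, and `controller` as placeholder components, could look like:

```python
def execute_instruction(instruction, observe, llm_planner, skill_policies, controller):
    """Hierarchical execution sketch: a high-level planner decomposes the user instruction into
    subtasks; each subtask is executed step by step by a learned low-level skill policy whose
    output (e.g., an end-effector pose) the controller converts into joint motion."""
    plan = llm_planner(instruction)                 # e.g., [("pick", "cup"), ("place", "shelf")]
    for skill_name, target in plan:
        policy = skill_policies[skill_name]         # low-level control policy for this skill
        done = False
        while not done:
            obs = observe()                         # current camera image / proprioception
            ee_pose, gripper, done = policy(obs, target)
            controller.move_to(ee_pose, gripper)    # inverse kinematics handles joint motion
```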

Most low-level control policies predict the motion of the end-effector pose, abstracting away the motion-planning module that controls individual joints via inverse kinematics. While this abstraction helps generalize better to different robot embodiments, it also imposes limitations on flexibility.

While LLM-based control strategies can greatly enhance command following capabilities because LLMs can better interpret user intent, there are concerns about their training cost and deployment speed. Slow inference speed can severely affect performance in dynamic environments because environmental changes may occur during LLM inference.

PaLM-E [124] is a visual language generalist model that treats images and text as multimodal inputs represented by latent vectors. The output of PaLM-E is divided into two parts: when dealing with text generation tasks, the model directly generates the final output. In contrast, when used for specific planning and control tasks, PaLM-E generates low-level instruction text (such as instructions for robot control).

Robotics Transformer 1 (RT-1) [113] is able to encode high-dimensional input and output data (including images and instructions) into compact tokens that can be efficiently processed by the Transformer. It is not an end-to-end model. Similar works are RT-Trajectory [206], LEO [211], SARA-RT [219], GR-1 [224], ATM [227], RT-H [254], SRT [272] and RVT 1/2 [160, 297] etc.

The proposed Robotics Transformer 2 (RT-2) [172] is trained on web-scale datasets to achieve generalization capabilities for new tasks and direct possession of semantic perception. By fine-tuning the VLM, it can generate actions based on text encodings, i.e. a VLA model.

In a collaborative effort, Open X-Embodiment generalizes the idea of “generalist” robot policies and advocates that trainable models can adapt to different robots, tasks, and environments [195]. Robot Transformer X (RT-X) is divided into two branches: RT-1-X and RT-2-X. RT-1-X adopts the RT-1 architecture and is trained using the Open-X-embodiment dataset, while RT-2-X uses the policy architecture of RT-2 and is trained on the same dataset.

A number of VLA models have been proposed, such as QUAR-VLA [225], 3D VLA [261], Bi-VLA [284], OpenVLA [299] (shown in Figure 8), LLARVA [302], CoVLA [324], TinyVLA [335], GR-2 [341], DP-VLA [352], and DeeR-VLA [354] etc.

4. Dexterity and Robotic Types

Embodied robots can generally be divided into six categories [310].

The first is a fixed-base robot, such as a robotic arm, either single or dual, which is often used in laboratory automation synthesis, education, industry and other fields, such as KUKA iiwa [44] and Franka Emika Robot [91].

The second is a wheeled robot, which is known for its efficient mobility and is widely used in logistics, warehousing and security inspection, for example, Kiva Systems [5] and Jackal Robot/ Clearpath Robotics [103].

The third is a tracked robot, which has strong off-road capabilities and mobility and shows potential in agriculture, construction and disaster response, such as iRobot Packbot [3], CMU RoMan [39], and Polibot [140].

The fourth is a quadruped robot, which is known for its stability and adaptability and is very suitable for complex terrain detection, rescue missions and military applications, for instance, Boston Dynamics BigDog [6], MIT Cheetah [10], ANYbotics' ANYmal C [174] and Unitree Go1 [236].

Figure 9. Tesla humanoid robot

The fifth is a humanoid robot, whose dexterous hands are key, and which is widely used in the service industry, healthcare and collaborative environments.

Some examples are Softbank Robotics’ Pepper [13], Atlas humanoid robot [15], Tesla bot (shown in Figure 9) [67], Figure 01/02 [139, 329] and Unitree H1 [268].

It is worth noting that dexterous hands are an emerging type of embodied entity used to perform complex dexterous manipulation tasks, for example [360, 361] the Shadow Hand (shown in Figure 10), Adroit Hand and Allegro Hand.

Figure 10. Shadow hand

The last is a biomimetic robot, which performs tasks in complex and dynamic environments by simulating the effective movement and functions of natural organisms. Biomimetic robots include fish-shaped robots, insect-shaped robots and soft robots etc.

Dexterous manipulation is the cornerstone of advanced robotics and can be applied in various fields such as service robotics and industrial automation.

Due to hardware and algorithmic challenges, mimicking human-level dexterity in robotic manipulation tasks remains an unsolved problem.

The high DOFs of dexterous manipulation pose significant challenges for planning and control. Traditional optimal control methods, which often require simplified contacts, are frequently not feasible for more complex tasks.

Recently, RL has been explored to learn dexterous policies in simulation with minimal assumptions about the task or environment. The learned policies can solve complex tasks including in-hand object re-localization, bimanual manipulation, and long-reach manipulation. Deploying the learned policies to real-world robots remains challenging due to the gap between simulation and reality.

Model-based RL and control methods have shown some success in robotic dexterous multi-fingered hands for tasks such as rotating objects and in-hand manipulation. Similarly, model-free RL methods have shown that Sim2Real can achieve very good skills, such as in-hand cube rotation and face rotation of a Rubik’s cube.

However, both learning methods require hand-crafted reward functions and system identification, or task-specific training procedures. This, as well as long training times (often taking weeks), makes dexterous manipulation difficult to generalize to general tasks.

To address the low sample efficiency of previous learning-based approaches, some research has begun to study IL. Here, an imitation policy can be trained in a few hours with only a small number of demonstrations. This imitation-based approach has indeed been successful on real robot hands.

On the other hand, IL focuses on learning directly from real-world demonstration data, which can be obtained through teleoperation or human video.

Collecting high-quality demonstration data for dexterous robots is very difficult: collection systems either require expensive gloves, need extensive calibration, or suffer from monocular occlusion.

Compared to gripper-based manipulators, teleoperated dexterous arm systems usually require expensive and cumbersome dedicated equipment such as VR headsets, wearable gloves, handheld controllers, tactile sensors, or motion capture trackers.

Teleoperated dexterous hands have high degrees of freedom and complex kinematics. Glove-based systems can track the operator's finger movements, but they are expensive and tied to specific hand sizes. Recent vision-based methods use cameras or VR headsets to achieve dexterous arm teleoperation.

Figure 11. DexH2R

The works in this area can be classified as the following categories:

· IL method: DMPF [12], LPEI [16], TeachNet [27], DexPilot [34], SOIL [40], DexMV [66], DIME [82], IMDM [87], T-Dex [135], Bunny-VisionPro [309] and DexH2R (given in Figure 11) [357];

· RL method: DAPG [19], LDIM [25], PDDM [32], DexVIP [77], Visual dexterity [110], VideoDex [112], M-RRT/G-RRT [125] and DTIM [129];

· Sim2Real Transfer method: DexTransfer [104], Dextreme [108], Touch Dexterity [133] and OmniH2O [298].

5. Simulation

Real-world IL methods require large amounts of data that cannot be collected efficiently or at low cost, which makes them impractical for real-world deployment at scale.

Real-world RL methods are promising but require extensive setups in the real world to produce real-world rewards/successes and environment resets.

Researchers have turned to simulation environments to ease the difficulty of data acquisition and accelerate the data collection process. However, this strategy also has its own challenges, the most important of which is the gap between simulation and reality.

This gap occurs when models trained on simulated data perform poorly in real-world deployments. There are several reasons for this gap, including differences in rendering quality, inaccuracies in physics simulations, and domain transfer characterized by unrealistic object properties and robot motion planners.

Simulators are critical to EI: they provide a cost-effective means of experimentation, ensure safety by simulating potentially dangerous scenarios, scale testing to diverse environments, enable rapid prototyping, offer controllable environments for research, generate data for training and evaluation, and provide standardized benchmarks for algorithms.

Traditional simulators include Gazebo [4], MORSE [8], MuJoCo (shown in Figure 12) [9], V-Rep/CoppeliaSim [11], PyBullet [14], AirSim [17], MINOS [20], Unity ML-Agents [26], FurnitureBench [147], NVIDIA's ORBIT [117], Aerial Gym [151], Isaac Sim [154], and Webots [153].

Figure 12. MuJoCo

Diffusion Model. In the AI-generated content (AIGC) domain [128, 181], diffusion models [75, 100, 131, 234] have achieved overwhelming success; they generate images from Gaussian noise via an iterative denoising process, which consists of a forward diffusion process and a reverse process. Diffusion models have been extended to other modalities, like video, audio, text, graphs, and 3D models, etc.
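A minimal sketch of one diffusion training step (forward noising plus the denoising objective), with an illustrative linear noise schedule and a placeholder noise-prediction network `eps_net`:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(eps_net, x0, n_steps=1000):
    """One DDPM-style training step: corrupt clean data x0 at a random timestep (forward
    process), then train the network to predict the added noise (used by the reverse process)."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, n_steps, (x0.shape[0],))              # random timestep per sample
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))        # reshape for broadcasting
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * noise     # forward (noising) process
    return F.mse_loss(eps_net(x_t, t), noise)                  # denoising objective
```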

NeRFs. As a new branch of multi-view visual reconstruction, Neural Radiance Fields (NeRFs) [105, 155, 237, 279] provide an implicit representation of 3D information. The marriage of diffusion models and NeRFs has achieved remarkable results in text-to-3D synthesis.

GS. Gaussian Splatting (GS) [230, 262, 281, 317, 351] leverages 3D Gaussian primitives for explicit scene representation and enables differentiable rendering, which outperforms NeRFs in real-time rendering.

Figure 13. Habitat 3.0

Real scene-based simulators are Matterport3D [18], AI2-THOR [21], VirtualHome [24], RoboTHOR [41], SAPIEN [45], ManipulaTHOR [55], iGibson 1.0/2.0 [50, 65], HM3D [68], ThreeDWorld [74], ProcTHOR [92], Habitat 1/2/3 (shown in Figure 13) [36, 60, 199], ManiSkill 1/2/3 [61, 122, 343], RoboGen [205], Humanoid Bench [264], SIMPLER [283], RoboCAS [311], MetaUrban [314], GRUtopia [316], HoloDeck [330], PhyScene [331], GenSim 1/2 [193, 344], BiGym [312] and SL-DSL [353].

5.1 Sim2Real Transfer

Sim-to-Real adaptation/transfer in embodied intelligence refers to the process of transferring capabilities or behaviors learned in a simulated environment (digital space) to the real world (physical world). This process includes verifying and improving the effectiveness of algorithms, models, and control strategies developed in simulation to ensure that they perform stably and reliably in the physical environment.

Embodied world models, data collection & training methods, and embodied control algorithms are three key elements for achieving simulation-to-reality adaptation.

There are five paradigms for simulation-to-reality transfer [46, 202, 298, 310]: 1) Real2Sim2Real uses RL trained in a “digital twin” simulation environment to enhance IL in real-world scenarios; 2) TRANSIC enables real-time human intervention to correct the robot’s behavior in real-world scenarios; 3) Domain randomization introduces parameter randomization during simulation; 4) System identification builds an accurate mathematical model of physical scenes in real environments; 5) Lang4sim2real uses textual descriptions of images as a unifying signal across domains.
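As a toy example of paradigm 3 (domain randomization), one might perturb simulator parameters each episode; the attribute names below are purely illustrative and not tied to a specific simulator API:

```python
import random

def randomize_sim(sim):
    """Toy domain randomization: perturb physics and rendering parameters each episode so a
    policy trained in simulation becomes robust to the (unknown) real-world values."""
    sim.friction = random.uniform(0.5, 1.5)          # contact friction coefficient
    sim.object_mass = random.uniform(0.8, 1.2)       # object mass scale
    sim.actuator_delay = random.uniform(0.0, 0.02)   # control latency in seconds
    sim.light_intensity = random.uniform(0.3, 1.0)   # rendering / lighting randomization
    sim.camera_jitter = random.uniform(-0.01, 0.01)  # small camera pose perturbation
    return sim
```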


Written by Yu Huang

Working in computer vision, deep learning, AR & VR, autonomous driving, image & video processing, visualization, and large-scale foundation models.
