What Data Imperative for Action Learning in Embodied AI? (2)

Yu Huang
38 min read · Dec 7, 2024


6. Data Capture Platform and Datasets

A key property of robotic manipulation policies is their ability to generalize, i.e., to perform the desired manipulation task under new lighting conditions, in new environments, or with new objects. Training policies that can adapt to such changes is a critical step toward deploying robots in everyday environments.

A key ingredient in training such generalizable policies is diverse data for training: in computer vision (CV) and natural language processing (NLP), training with large and diverse datasets crawled from the internet can produce models that are applicable to a wide range of new tasks.

Similarly, in robotic manipulation, larger and more diverse robotic training datasets can push the limits of policy generalization, including transfer to new goals, instructions, scenes, and embodiments. An important stepping stone toward more robust robotic manipulation policies is therefore the creation of large, diverse, high-quality robotic manipulation datasets.

Compared to fields such as CV and NLP, the scarcity of high-quality data has hampered progress in robotics in many ways. To address this challenge, researchers have proposed algorithms based on techniques such as few-shot learning and multi-task learning. While these approaches show promise in alleviating the data scarcity problem, they still rely on large amounts of high-quality data to achieve effective task generalization.

In terms of both scale and relevant content, internet video data can help alleviate the data bottleneck in robotics. Specifically, the benefits include: (i) improving generalization beyond what existing robotics data provide, (ii) improving data efficiency and in-distribution performance on robotics data, and potentially (iii) obtaining emergent capabilities that cannot be extracted from robotics data alone.

Learning robotic actions from internet videos still faces many fundamental and practical challenges. First, video data is generally high-dimensional, noisy, stochastic, and inaccurately labeled. Second, videos lack information that is critical for robotics, including action labels, low-level forces, and proprioception. In addition, various distribution shifts exist between internet videos and the robotics domain.

Two key questions in this area are: (i) How to extract relevant knowledge from Internet videos? (ii) How to apply the knowledge extracted from videos to robotics?
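As a concrete illustration of one common recipe for both questions (not a method claimed by this article), an inverse dynamics model can be trained on a small amount of action-labeled robot data and then used to pseudo-label actions between consecutive frames of unlabeled internet video, turning video-only data into (observation, action) pairs. The sketch below is a minimal PyTorch version; the network, dimensions, and data are hypothetical placeholders.

```python
# Minimal sketch: pseudo-labeling internet video with an inverse dynamics model (IDM).
# All dimensions and tensors are hypothetical placeholders.
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predicts the action that connects two consecutive RGB frames."""
    def __init__(self, action_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, frame_t, frame_t1):
        # Stack the two RGB frames along the channel axis.
        x = torch.cat([frame_t, frame_t1], dim=1)
        return self.head(self.encoder(x))

idm = InverseDynamicsModel(action_dim=7)
optimizer = torch.optim.Adam(idm.parameters(), lr=1e-4)

# 1) Train the IDM on a small robot dataset with ground-truth actions.
def train_idm_step(frame_t, frame_t1, action):
    loss = nn.functional.mse_loss(idm(frame_t, frame_t1), action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# 2) Use the trained IDM to pseudo-label consecutive video frames, producing
#    (observation, action) pairs for downstream policy learning.
@torch.no_grad()
def pseudo_label(frame_t, frame_t1):
    return idm(frame_t, frame_t1)
```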

At the same time, there has been a push to collect larger real-world robotics datasets. Efforts here include aggregating human teleoperation data and data from different laboratories, as well as research into automating data collection and making teleoperation more scalable.

Figure 14. DexCap
Figure 15. ALOHA

The most common approach to collecting robot demonstrations is to pair a robot or end-effector with a teleoperation device or a kinematically isomorphic device. The devices used vary widely in complexity and form factor:

1) Full robotic exoskeletons, such as TABLIS [49], WULE [138], AirExo [188], and DexCap (shown in Figure 14) [263];

2) Simpler robotic data-collection tools, such as ALOHA [145] (shown in Figure 15), GELLO [187], Mobile ALOHA [228], ALOHA 2 [271], and AV-ALOHA [337];

3) Handheld tools without a physically actuated robot, such as Dobb·E/Stick v1 [214], UMI (shown in Figure 16) [247], UMI on Legs [315], RUM/Stick v2 [332], and Fast-UMI [338];

4) Video game controllers (e.g., joysticks), as in LIBERO [156];

5) VR/AR devices, such as Holo-Dex [107], AnyTeleop [164], Open Teach [258], HumanPlus [301], Open-Television [307], ACE [327], ARCap (shown in Figure 17) [346], and BiDex [359];

6) Mobile phones, as in RoboTurk [28].

Figure 16. UMI
Figure 17. ARCap

Demonstration data collected via teleoperated robotic systems provide precise in-domain observation-action pairs, enabling effective robotic policy learning via supervised learning. However, the requirements for both the robotic system and a skilled human operator significantly limit the accessibility and scalability of data collection.
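The "supervised learning on observation-action pairs" mentioned here is essentially behavior cloning. A minimal sketch, assuming a hypothetical low-dimensional observation and a continuous action target (dimensions and data are placeholders):

```python
# Minimal behavior-cloning sketch over teleoperated (observation, action) pairs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

obs_dim, act_dim = 64, 7          # e.g. visual features + proprioception -> joint targets
policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Placeholder demonstration data; in practice this comes from teleoperation logs.
demos = TensorDataset(torch.randn(1024, obs_dim), torch.randn(1024, act_dim))
loader = DataLoader(demos, batch_size=64, shuffle=True)

for epoch in range(10):
    for obs, act in loader:
        loss = nn.functional.mse_loss(policy(obs), act)   # imitate demonstrated actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```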

Collecting real-world robotics data faces major challenges due to factors such as cost, time, inconsistency, and accuracy.

Due to these difficulties, public real-world robotics datasets are relatively scarce. In addition, evaluating the performance of robotic systems under realistic conditions adds another layer of complexity, as accurately reproducing the setup is challenging and often requires human supervision.

Another strategy to address the data scarcity problem in real-world environments is to leverage human data. Because of its flexibility and diversity, human behavior provides rich guidance for robotic policies.

However, this strategy also has inherent disadvantages. It is inherently difficult to capture human hand/body movements and transfer them to robots. In addition, the inconsistency of human data poses a problem, as some data may be first-person egocentric, while others are captured from a third-person perspective. Moreover, filtering human data to extract useful information can be labor-intensive [248]. These obstacles highlight the complexity of incorporating human data into the robot learning process.

Some datasets and benchmarks may not be directly used for robot manipulation and navigation, but they aim at other relevant capabilities of embodied intelligence, such as spatial reasoning, physical understanding, and world knowledge. These capabilities are invaluable for task planners.

Although pre-training datasets such as Open X-Embodiment [195] appear to have a unified structure, significant issues remain. They stem from missing ingredients such as sensor multi-modality, a unified format across multiple robots, compatibility across platforms, sufficient data volume, and coverage of both simulated and real content.
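To make the "unified format" concern concrete, the sketch below shows a hypothetical episode schema of the kind a cross-robot standard would need to carry (in the spirit of ARIO or Open X-Embodiment, but not their actual formats): multi-modal sensors, embodiment and action-space metadata, and sim/real provenance.

```python
# A hypothetical unified episode record for cross-robot datasets.
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class Step:
    images: Dict[str, np.ndarray]      # camera name -> HxWx3 RGB
    depth: Dict[str, np.ndarray]       # camera name -> HxW depth (optional per robot)
    proprio: np.ndarray                # joint positions / end-effector pose
    action: np.ndarray                 # commanded action in a declared convention
    language: str = ""                 # instruction, if any

@dataclass
class Episode:
    robot: str                         # embodiment identifier, e.g. "franka_panda"
    action_space: str                  # e.g. "ee_delta_pose" or "joint_position"
    control_hz: float                  # control frequency
    is_sim: bool                       # simulation vs. real-world provenance
    camera_extrinsics: Dict[str, np.ndarray]  # camera name -> 4x4 pose
    steps: List[Step] = field(default_factory=list)
```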

Robot manipulation datasets include RoboNet [35], BridgeData 1/2 [70, 179], RH20T [173], RoboSet [182], Open X-Embodiment (shown in Figure 18) [195], DROID [269], BRMData [291], and ARIO (with a unified data format) [325].

Figure 18. Open-X

Alternatively, human demonstrations can be collected using portable systems without the need for physical robotic hardware. These systems leverage human dexterity and adaptability to directly manipulate objects in the wild, thus facilitating the creation of large-scale, diverse datasets of human demonstrations. However, due to the lack of robotic hardware, it is not immediately clear whether the collected demonstration data can be used to train robot policies without a multi-step process.

The differences between humans and robots in embodiment require data retargeting. Additionally, the retargeted data must be validated by replaying the actions on an actual robot interacting with real objects. Finally, the robot policy must be trained using validated data.

The success of human demonstrations depends heavily on the operator’s experience and awareness of the differences in geometry and capabilities between robots and humans. Failures can occur during the retargeting phase caused by the robot’s joint and velocity limitations, during the validation phase caused by accidental collisions, or during the policy training phase caused by the inclusion of invalid data.
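A minimal sketch of this retarget → validate-by-replay → train pipeline is shown below; all helper functions are hypothetical placeholders rather than any specific system's API.

```python
# Sketch of the three-stage pipeline described above: retarget human demos to the
# robot embodiment, validate them by replaying on hardware (or a simulator),
# and train only on the demonstrations that survive validation.
from typing import List

def retarget(human_demo) -> "RobotTrajectory":
    """Map human hand/body motion to robot joints, clipping to joint/velocity limits."""
    ...

def replay_is_valid(robot_traj) -> bool:
    """Replay on the robot (or a simulator) and reject trajectories with
    collisions, limit violations, or failed task outcomes."""
    ...

def build_training_set(human_demos: List) -> List:
    validated = []
    for demo in human_demos:
        traj = retarget(demo)
        if traj is not None and replay_is_valid(traj):
            validated.append(traj)      # keep only demonstrations that replay safely
    return validated

# train_policy(build_training_set(human_demos))  # e.g. behavior cloning on validated data
```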

Figure 19. Ego4D

Human activity datasets include EPIC-Kitchens [42], Ego4D (shown in Figure 19) [71], HOI4D [79], Assembly101 [83], InternVid [167], Ego-Exo4D [216], BEHAVIOR-1K [260], EgoExoLearn [266], and COM Kitchens [321].

7. Wearable AI

Mapping the activities of others to the egocentric view is a basic skill of humans from a very early age.

Wearable AI, or ego AI, is essentially a robotics application. Devices such as smart glasses, neural wristbands, and AR headsets (e.g., Meta Project Aria [180], shown in Figure 20, and VisionProTeleop [253]) use AI to perceive the user's environment, understand spatial context, and make predictions [218, 304, 323].

Figure 20. The Project Aria device

Although there is a lot of data collected from the ego-centric view (based on wearable devices), it is crucial for AI agents to learn directly from demonstration videos captured from different views.

Only a few datasets record time-synchronized video of both egocentric and exocentric views in the same environment. For generalizing action learning in embodied intelligence, a transformation between the third-person and first-person views [53, 257] is required.
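The learned ego-exo view translation studied in [53, 257] is beyond a short snippet, but the geometric side of the problem is simple to state: with time-synchronized, calibrated cameras, a point observed in the exocentric camera can be re-expressed in the egocentric camera by composing their world-frame extrinsics. A minimal NumPy sketch with placeholder poses:

```python
# Minimal sketch: moving a 3D point between an exocentric and an egocentric camera,
# assuming both camera poses in a shared world frame are known from calibration.
import numpy as np

def transform_point(T_dst_from_src, p_src):
    """Apply a 4x4 rigid transform to a 3D point."""
    return (T_dst_from_src @ np.append(p_src, 1.0))[:3]

# World-frame poses of the two cameras (placeholder values).
T_world_from_exo = np.eye(4)
T_world_from_ego = np.eye(4)
T_world_from_ego[:3, 3] = [0.2, 0.0, 1.5]

# Compose: exo frame -> world frame -> ego frame.
T_ego_from_exo = np.linalg.inv(T_world_from_ego) @ T_world_from_exo
p_exo = np.array([0.5, 0.1, 2.0])
p_ego = transform_point(T_ego_from_exo, p_exo)
```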

8. Requirements for Datasets

Based on the analysis above, the requirements for such a dataset can be summarized as follows (a minimal sketch of checking these requirements against a dataset manifest follows the list):

1) The dataset aims at promoting the study of large-scale embodied learning tasks.

2) The dataset supports generalization to new objects, new environments, new tasks, and even new embodied entities.

3) The dataset meets requirements in diversity of entity, time, place, view, goal, skill.

4) The dataset provides sufficiently accurate ground truth: calibration, synchronization, mapping and localization, and annotation.

5) The dataset complies with privacy and ethical standards: de-identification.

6) The dataset includes real and simulation data: both real2sim & sim2real transfer are realized.

7) The dataset includes Exo-Ego view data: support flexible transform of Exo-Ego views.

8) The dataset formulates a unified format standard: convertible between various data formats.

9) The dataset provides evaluation benchmarks: perception, cognition (reflection, reasoning, planning) and action (manipulation).
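As referenced above, a minimal sketch of checking these requirements against a hypothetical dataset manifest might look as follows; the schema and thresholds are illustrative only.

```python
# Hypothetical manifest check against the requirements listed above.
from dataclasses import dataclass

@dataclass
class DatasetManifest:
    has_calibration: bool
    max_sync_error_ms: float
    de_identified: bool
    has_real: bool
    has_sim: bool
    has_ego_view: bool
    has_exo_view: bool
    format_standard: str        # e.g. "rlds", "ario", or another declared standard
    benchmarks: tuple           # e.g. ("perception", "planning", "manipulation")

def check_requirements(m: DatasetManifest) -> list:
    problems = []
    if not m.has_calibration:
        problems.append("missing sensor calibration")
    if m.max_sync_error_ms > 10.0:
        problems.append("poor time synchronization")
    if not m.de_identified:
        problems.append("privacy: data not de-identified")
    if not (m.has_real and m.has_sim):
        problems.append("missing real or simulated split")
    if not (m.has_ego_view and m.has_exo_view):
        problems.append("missing ego or exo views")
    if not m.format_standard:
        problems.append("no declared data format standard")
    if not m.benchmarks:
        problems.append("no evaluation benchmarks")
    return problems
```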

9. Conclusion

This paper overviews the evolution from traditional AI to LLMs, VLMs, agents, spatial intelligence, and embodied AI, and analyzes policy training for embodied action/behavior, embodied dexterity, data capture platforms, simulation platforms, and egocentric/wearable AI. It then lays out the requirements for building such a dataset.

Finally, we discuss generalization tricks in embodied AI, which give insight into embodied data capture.

9.1 Tricks for Generalization

Methods for policy generalization in embodied AI include the following.

1) Sim-2-Real domain transfer in RL [25, 46];

2) Data augmentation and generative AI models (e.g., diffusion policy) [127] (a generic augmentation sketch follows this list);

3) Data scale and diversity (Open-X) [195];

4) Intermediate representation [355];

5) Large scale model architecture [113, 172];

6) Pre-trained large foundation models [299];

7) Post-training fine-tuning [200];

8) Inference-time optimization [334].
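As noted in item 2, a generic example of visual data augmentation for policy robustness (not any specific cited method) could look like the following torchvision sketch, applied to camera observations before policy training:

```python
# Generic visual augmentation to improve robustness to lighting and viewpoint changes.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop / rescale
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),       # lighting variation
    transforms.ToTensor(),
])

# obs_image = Image.open("frame.png")     # hypothetical camera frame
# obs_tensor = augment(obs_image)         # augmented observation fed to the policy
```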

References

1. J Aloimonos, I Weiss, A Bandyopadhyay, “Active vision”, IJCV, vol. 1, Jan. 1987

2. R. Bajcsy, “Active Perception”, IEEE Proceedings, Vol 76, No 8, Aug. 1988.

3. B. M. Yamauchi, “Packbot: a versatile platform for military robotics” (iRobot), Unmanned ground vehicle technology VI, vol. 5422. SPIE, Sept. 2004.

4. N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” IEEE/RSJ IRS, Oct., 2004.

5. J P. R. Wurman, R. D’Andrea, and M. Mountz, “Coordinating hundreds of cooperative, autonomous vehicles in warehouses” (Kiva Systems), AI magazine, 29 (1), July, 2008.

6. M. Raibert, K. Blankespoor, G. Nelson, and R. Playter, “Bigdog, the rough-terrain quadruped robot,” IFAC Proceedings Volumes, 41(2), July 2008.

7. J Deng, W Dong, R Socher, and et al. “ImageNet: A large-scale hierarchical image database”, IEEE CVPR, 2009

8. G. Echeverria, N. Lassabe, A. Degroote and S. Lemaignan, “Modular open robots simulation engine: MORSE,” IEEE ICRA, May, 2011.

9. E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” IEEE/RSJ IRS, Oct. 2012

10. MIT Quadruped Cheetah, https://spectrum.ieee.org/mit-cheetah-robot-running, IEEE Spectrum, May 2013

11. E. Rohmer, S. P. Singh, and M. Freese, “V-Rep: A versatile and scalable robot simulation framework” (CoppeliaSim), IEEE/RSJ IRS, Nov. 2013

12. Y Bai and C K Liu. “Dexterous manipulation using both palm and fingers” (DMPF). IEEE ICRA, June, 2014.

13. F. Tanaka, K. Isshiki, F. Takahashi, and et al., “Pepper learns together with children: Development of an educational application”, IEEE Int. Conf. on Humanoid Robots, Nov. 2015.

14. E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” https://github.com/bulletphysics/bullet3, 2016

15. S. Maniatopoulos, P. Schillinger, V. Pong, D. C. Conner, and H. Kress-Gazit, “Reactive high-level behavior synthesis for an Atlas humanoid robot,” IEEE ICRA, May 2016.

16. V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous manipulation policies from experience and imitation” (LPEI), arXiv 1611.05095, 2016.

17. S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” arXiv 1705.05065, 2017

18. A. Chang, A. Dai, T. Funkhouser, and et al., “Matterport3D: Learning from RGB-D data in indoor environments”, arXiv1709.06158, 2017

19. A. Rajeswaran, V. Kumar, A. Gupta, and et al, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations” (DAPG), RSS’18, arXiv 1709.10087, 2017.

20. M Savva, A Chang, A Dosovitskiy, and et al., “MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments”, arXiv 1712.03931, 2017

21. E. Kolve, R. Mottaghi, D. Gordon, and et al., “AI2-THOR: An interactive 3d environment for visual AI,” arXiv 1712.05474, 2017

22. A Vaswani, N Shazeer, N Parmar, et al. “Attention is All You Need” (Transformer). Advances in Neural Information Processing Systems, 2017.

23. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training” (GPT-1), https://openai.com/index/language-unsupervised/, June 2018.

24. X. Puig, K. Ra, M. Boben, and et al., “Virtualhome: Simulating household activities via programs,” in IEEE/CVF CVPR, Jun 2018

25. OpenAI team, “Learning dexterous in-hand manipulation” (LDIM), arXiv 1808.00177, 2018.

26. A. Juliani, V-P Berges, E. Teng, and et al., “Unity: A general platform for intelligent agents” (Unity ML-Agents), arXiv 1809.02627, 2018.

27. S Li, X Ma, H Liang, and et al. “Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network” (TeachNet). ICRA, arXiv 1809.06268, 2018.

28. A Mandlekar, Y Zhu, A Garg, and et al. “RoboTurk: A crowdsourcing platform for robotic skill learning through imitation”. CoRL, arXiv 1811.02790, 2018.

29. A. Radford, J. Wu, R. Child, et al., “Language models are unsupervised multitask learners” (GPT-2), OpenAI blog, 2019.

30. Kroemer, O., Niekum, S., & Konidaris, G. “A review of robot learning for manipulation: Challenges, representations, and algorithms”. arXiv 1907.03146, 2019

31. M Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”, arXiv 1909.08053, 2019

32. A. Nagabandi, K. Konoglie, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation” (PDDM), arXiv 1909.11652, 2019.

33. S Rajbhandari, J Rasley, O Ruwase, Y He, “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, arXiv 1910.02054, 2019

34. A Handa, K Van Wyk, W Yang, and et al. “DexPilot: Vision-based teleoperation of dexterous robotic hand-arm system”. IEEE ICRA, arXiv 1910.03135, 2019.

35. S Dasari, F Ebert, S Tian, and et al. “RoboNet: Large-scale multi-robot learning”. CoRL’19, arXiv 1910.11215, 2019

36. M Savva, A Kadian, O Maksymets, and et al. “Habitat: A platform for embodied AI research”. IEEE ICCV, 2019.

37. Hafner D, Lillicrap T, Ba J, et al. “Dream to control: learning behaviors by latent imagination” (Dreamer v1). arXiv 1912.01603, 2019

38. Ravichandar, H., Polydoros, A. S., Chernova, S., & Billard, A. “Recent advances in robot learning from demonstration” (review). Annual Review of Control, Robotics, Auto. Systems, vol.3, 2020

39. C C. Kessens, J Fink, A Hurwitz, and et al., “Toward fieldable human-scale mobile manipulation using RoMan”, AI and Machine Learning for Multi-Domain Operations Applications II, Volume 11413, SPIE, April, 2020

40. I. Radosavovic, X. Wang, L. Pinto, and J. Malik, “State-only imitation learning for dexterous manipulation” (SOIL), IEEE/RSJ IROS’21. arXiv 2004.04650, 2020.

41. M Deitke, W Han, A Herrasti and et al. “RoboTHOR: An open simulation-to-real embodied AI platform”. CVPR’20, arXiv 2004.06799, 2020

42. Damen D, Doughty H, Farinella G M, et al. “The EPIC-Kitchens dataset: collection, challenges and baselines”. arXiv 2005.00343, IEEE T-PAMI, 43(11): 4125–4141, 2021

43. T. B. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners” (GPT-3), arXiv 2005.14165, 2020

44. C. Li, S. Zhu, Z. Sun, and J. Rogers, “BAS optimized ELM for KUKA iiwa Robot Learning,” IEEE Transactions on Circuits and Systems II: Express Briefs, 68 (6), Oct. 2020.

45. F. Xiang, Y. Qin, K. Mo, and et al., “SAPIEN: A simulated part-based interactive environment,” arXiv 2003.08515, IEEE/CVF CVPR, Jun 2020.

46. Zhao, W., Queralta, J. P., and Westerlund, T. “Sim-to-real transfer in deep reinforcement learning for robotics: a survey”. arXiv 2009.13303, 2020.

47. Hafner D, Lillicrap T, Norouzi M, et al. “Mastering Atari with discrete world models” (Dreamer v2). arXiv 2010.02193, 2020

48. A. Zeng, P. Florence, J. Tompson, and et al., “Transporter networks: Rearranging the visual world for robotic manipulation”. CoRL’20, arXiv 2010.14406, 2020

49. Y Ishiguro, T Makabe, Y Nagamatsu, and et al., “Bilateral humanoid teleoperation system using whole-body exoskeleton cockpit TABLIS”, IEEE IROS, Oct. 2020

50. B. Shen, F. Xia, C. Li, and et al., “iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes,” arXiv 2012.02924, IEEE/RSJ IRS, 2021

51. J Ren, S Rajbhandari, R Y Aminabadi et al., “ZeRO-offload: Democratizing Billion-Scale Model Training”, arXiv 2101.06840, 2021

52. S Rajbhandari et al., “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, arXiv 2104.07857, 2021

53. Y Li, T Nagarajan, B Xiong, and K Grauman. “Ego-Exo: Transferring visual representations from third-person to first-person videos”. arXiv 2104.07905, CVPR, 2021

54. D Kalashnikov, J Varley, Y Chebotar, and et al., “MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale”, arXiv 2104.08212, 2021

55. K. Ehsani, W. Han, A. Herrasti, and et al., “ManipulaTHOR: A framework for visual object manipulation,” arXiv 2104.11213, IEEE/CVF CVPR, 2021.

56. M Caron, H Touvron, I Misra, and et al. “Emerging Properties in Self-Supervised Vision Transformers” (Dino v1), arXiv 2104.14294, 2021

57. Chen L, Lu K, Rajeswaran A, et al. “Decision transformer: reinforcement learning via sequence modeling”, arXiv 2106.01345, 2021

58. Janner M, Li Q, Levine S. “Offline reinforcement learning as one big sequence modeling problem” (Trajectory Transformer), arXiv 2106.02039, 2021

59. E Hu et al., “LORA: Low-Rank Adaptation of Large Language Models”, arXiv 2106.09685, 2021

60. A Szot, A Clegg, E Undersander, and et al. “Habitat 2.0: Training Home Assistants to Rearrange their Habitat”, arXiv 2106.14405, 2021

61. Mu T Z, Ling Z, Xiang F B, et al. “Maniskill: generalizable manipulation skill benchmark with large-scale demonstrations”, arXiv 2107.14483, 2021

62. A. Jaegle, S. Borgeaud, J. B. Alayrac, and et al. “Perceiver IO: A general architecture for structured inputs & outputs”. arXiv 2107.14795, 2021.

63. A Radford, J W Kim, C Hallacy, et al. “Learning transferable visual models from natural language supervision” (CLIP). ICML 2021.

64. A Ramesh, M Pavlov, G Goh, et al., “Zero-shot text-to-image generation” (DALL-E). ICML. Virtual event, July 2021

65. C. Li, F. Xia, R. Martín-Martín, and et al., “iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks,” arXiv 2108.03272, CoRL’21, 2021

66. Y Qin, Y-H Wu, S Liu, and et-al. “DexMV: Imitation learning for dexterous manipulation from human videos”. ECCV’22, arXiv 2108.05877, 2021.

67. Tesla Bot (Optimus), https://spectrum.ieee.org/elon-musk-robot, IEEE Spectrum, Aug., 2021

68. S K Ramakrishnan, A Gokaslan, E Wijmans, and et al. “Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D environments for embodied AI”. arXiv 2109.08238, 2021

69. M. Shridhar, L. Manuelli, and D. Fox, “CliPort: What and where pathways for robotic manipulation,” arXiv 2109.12098, 2021

70. F. Ebert, Y. Yang, K. Schmeckpeper, and et al. “Bridge data: Boosting generalization of robotic skills with cross-domain datasets”. arXiv 2109.13396, 2021.

71. K. Grauman, A. Westbury, E. Byrne and et al. “Ego4D: Around the world in 3,000 hours of egocentric video”. arXiv 2110.07058, 2021

72. Z Bian et al., “Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training”, arXiv 2110.14883, 2021

73. E Jang, A Irpan, M Khansari, and et al. “BC-Z: Zero-shot task generalization with robotic imitation learning”. CoRL, 2021

74. C. Gan, J. Schwartz, S. Alter, and et al., “ThreeDWorld: A platform for interactive multi-modal physical simulation,” arXiv 2007.04954, NeurIPS’21, 2021

75. R Rombach, A Blattmann, D Lorenz, P Esser, and B Ommer. “High-resolution image synthesis with latent diffusion models” (Stable Diffusion). arXiv 2112.10752, 2021.

76. W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents” (T-LM), arXiv 2201.07207, ICML, 2022.

77. P Mandikal and K Grauman. “DexVIP: Learning dexterous grasping with human hand pose priors from video”. CoRL, arXiv 2202.00164, 2022.

78. Li J, Li D, Xiong C, et al. “BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation”, arXiv 2201.12086, 2022

79. Y. Liu, Y. Liu, C. Jiang, and et al., “HOI4D: A 4D egocentric dataset for category-level human-object interaction”. CVPR’22, arXiv 2203.01577, 2022

80. L Ouyang, J Wu, X Jiang et al., “Training language models to follow instructions with human feedback” (GPT-3.5/InstructGPT), arXiv 2203.02155, 2022

81. N Hansen, X Wang, H Su, “Temporal Difference Learning for Model Predictive Control” (TD-MPC), arXiv 2203.04955, 2022

82. S P Arunachalam, S Silwal, B Evans, and L Pinto. “Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation” (DIME). arXiv 2203.13251, 2022.

83. F Sener, D Chatterjee, D Shelepov, and et al. “Assembly101: A large-scale multi-view video dataset for understanding procedural activities”. CVPR’22, arXiv 2203.14712, 2022

84. A. Zeng, M. Attarian, K. M. Choromanski, and et al., “Socratic models: Composing zero-shot multimodal reasoning with language”, arXiv 2204.00598, 2022

85. M Ahn, A Brohan, N Brown, and et al., “Do as I Can, Not as I Say: Grounding Language in Robotic Affordances” (SayCan), arXiv 2204.01691, 2022

86. A Ramesh, P Dhariwal, A Nichol, and et al. “Hierarchical text-conditional image generation with clip latents” (DALL-E2). arXiv 2204.06125,2022.

87. Y Qin, H Su, and X Wang. “From one hand to multiple hands: Imitation learning for dexterous manipulation from single-camera teleoperation” (IMDM). RA-L, 7(4), arXiv 2204.12490, 2022.

88. J-B Alayrac, J Donahue, P Luc, et al., “Flamingo: a visual language model for few-shot learning”. arXiv 2204.14198, 2022

89. Reed, S., Zolna, K., Parisotto, E., and et al. “A Generalist Agent” (GATO). arXiv 2205.06175, 2022

90. T Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, arXiv 2205.14135, 2022

91. S. Haddadin, S. Parusel, L. Johannsmeier, and et al., “The Franka Emika Robot: A reference platform for robotics research and education”, IEEE Robotics & Automation Magazine, 29 (2), June, 2022.

92. M. Deitke, E. VanderBilt, A. Herrasti, and et al., “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation”, arXiv 2206.06994, NeurIPS’22, 2022

93. N M Shafiullah, Z J Cui, A Altanzaya, L Pinto, “Behavior Transformers: Cloning k modes with one stone”, arXiv 2206.11251, 2022

94. P Wu, A Escontrela, D Hafner, P Abbeel, and K Goldberg. “DayDreamer: World models for physical robot learning”. arXiv 2206.14176, 2022

95. Y. Seo, D. Hafner, H. Liu, and et al., “Masked world models for visual control” (MWM), arXiv 2206.14244, 2022

96. Huang W, Xia F, Xiao T, et al. “Inner monologue: embodied reasoning through planning with language models”. arXiv 2207.05608, 2022

97. S Bahl, A Gupta, D Pathak, “Human-to-Robot Imitation in the Wild” (WHIRL), arXiv 2207.09450, July, 2022

98. A Bucker, L Figueredo, S Haddadin, and et al., “LATTE: LAnguage Trajectory TransformEr”, arXiv 2208.02918, 2022

99. J Liang, W Huang, F Xia, and et al., “Code as Policies: Language Model Programs for Embodied Control” (CaP), arXiv 2209.07753, 2022

100. L Yang, Z Zhang, S Hong et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications”, arXiv 2209.00796, 2022

101. M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A multi-task transformer for robotic manipulation,” arXiv 2209.05451, 2022

102. I. Singh, V. Blukis, A. Mousavian, and et al., “ProgPrompt: Generating situated robot task plans using large language models,” arXiv 2209.11302, IEEE ICRA’23, 2022.

103. B. Reily, P. Gao, F. Han, H. Wang, and H. Zhang, “Real-time recognition of team behaviors by multisensory graph-embedded robot learning” (Jackal Robot/Clearpath Robotics), IJRA, 41(8), Sep. 2022.

104. Z Q Chen, K Van Wyk, Y-W Chao, and et-al. “DexTransfer: Real world multi-fingered dexterous grasping with minimal human demonstrations”. arXiv 2209.14284, 2022.

105. K. Gao, Y. Gao, H. He, et al., “NeRF: Neural radiance field in 3d vision, a comprehensive review”. arXiv 2210.00379, 2022.

106. S Yao, J Zhao, D Yu, and et al., “ReAct: Synergizing Reasoning and Acting in Language Models”, arXiv 2210.03629, 2022

107. S P Arunachalam, I Güzey, S Chintala, and Lerrel Pinto. “Holo-Dex: Teaching dexterity with immersive mixed reality”. IEEE ICRA’23, arXiv 2210.06463, 2022.

108. A Handa, A Allshire, V Makoviychuk, and et al. “Dextreme: Transfer of agile in-hand manipulation from simulation to reality”. arXiv 2210.13702, 2022.

109. Mohammed, Q., Kwek, C., Chua, C. and et al. “Review of learning-based robotic manipulation in cluttered environments”. Sensors, vol. 22 (20), 2022.

110. T Chen, M Tippur, S Wu, and et al. “Visual dexterity: In-hand dexterous manipulation from depth”. arXiv 2211.11744, 2022.

111. C H Song, J Wu, C Washington, and et al., “LLM-Planner: Few-shot grounded planning for embodied agents with large language models”. arXiv 2212.04088, 2022

112. K Shaw, S Bahl, and D Pathak. “VideoDex: Learning dexterity from internet videos”. arXiv 2212.04498, 2022.

113. A Brohan, N Brown, J Carbajal, and et al. “RT-1: Robotics transformer for real-world control at scale”. arXiv 2212.06817, 2022

114. P Liu, W Yuan, J Fu, and et al. “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing”. ACM Computing Surveys, 55(9):1–35, 2023.

115. Q Dong, L Li, D Dai, and et al., “A survey for in-context learning”. arXiv 2301.00234, 2023.

116. Hafner D, Pasukonis J, Ba J, et al. “Mastering diverse domains through world models” (Dreamer v3), arXiv 2301.04104, 2023

117. M Mittal, C Yu, Q Yu, and et al. “ORBIT: A Unified Simulation Framework for Interactive Robot Learning Environments”, arXiv 2301.04195, 2023

118. K. Nottingham, P. Ammanabrolu, A. Suhr, and et al. “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling” (DEKARD), arXiv 2301.12050, 2023

119. Li J, Li D, Savarese S, et al. “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models”. arXiv 2301.12597, 2023

120. Y Du, M Yang, B Dai, and et al., “Learning Universal Policies via Text-Guided Video Generation” (UniPi), arXiv 2302.00111, 2023

121. Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents” (DEPS), arXiv 2302.01560, 2023.

122. J Gu, F Xiang, X Li, and et al., “ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills”, arXiv 2302.04659, 2023

123. H. Touvron, T. Lavril, G. Izacard, and et al. “LLaMA: Open and efficient foundation language models”. arXiv 2302.13971, 2023.

124. D Driess, F Xia, M. Sajjadi, et al., “PaLM-E: An Embodied Multimodal Language Model”, arXiv 2303.03378, 2023

125. G Khandate, S Shang, ET Chang, and et-al. “Sampling- based Exploration for Reinforcement Learning of Dexterous Manipulation” (M-RRT/G-RRT). RSS’23, arXiv 2303.03486, 2023.

126. S Yang, O Nachum, Y Du, and et al., “Foundation models for decision making: Problems, methods, and opportunities” (review). arXiv 2303.04129, 2023

127. C Chi, Z Xu, S Feng, and et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, arXiv 2303.04137, 2023

128. Y Cao, S Li, Y Liu, and et al. “A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT”, arXiv 2303.04226, 2023

129. J Pitz, L Röstel, L Sievers, and B Bäuml. “Dextrous tactile in-hand manipulation using a modular reinforcement learning architecture” (DTIM). arXiv 2303.04705, 2023.

130. J Robine, M Höftmann, T Uelwer, and S Harmeling. “Transformer-based world models are happy with 100k interactions” (TWM). ICLR’23, arXiv 2303.07109, 2023

131. C Zhang, C Zhang, M Zhang, I S Kweon, “Text-to-image Diffusion Models in Generative AI: A Survey”, arXiv 2303.07909, 2023

132. J. Achiam, S. Adler, S. Agarwal, and et al. “GPT-4 technical report”. arXiv 2303.08774, 2023

133. Z-H Yin, B Huang, Y Qin, Q Chen, and X Wang. “Rotating without seeing: Towards in-hand dexterity through touch” (Touch Dexterity). arXiv 2303.10880, 2023.

134. Shinn N, Cassano F, Berman E, et al. “Reflexion: language agents with verbal reinforcement learning”, arXiv 2303.11366, 2023

135. I Guzey, B Evans, S Chintala, and L Pinto. “Dexterity from touch: Self-supervised pre- training of tactile representations with robotic play” (T-Dex). arXiv 2303.12076, 2023.

136. Madaan A, Tandon N, Gupta P, et al. “Self-Refine: iterative refinement with self-feedback”, arXiv 2303.17651, 2023

137. W X Zhao, K Zhou, J Li, and et al., “A Survey of Large Language Models”, arXiv 2303.18233, Mar. 2023

138. L Zhao, T Yang, Y Yang, and P Yu. “A wearable upper limb exoskeleton for intuitive teleoperation of anthropomorphic manipulators” (WULE). MDPI Machines, 11(4):441, Mar. 2023.

139. Figure 01, https://www.fastcompany.com/90859010/the-race-to-build-ai-powered-humanoids-is-heating-up, Mar., 2023

140. A. Ugenti, R. Galati, G. Mantriota, and G. Reina, “Analysis of an all-terrain tracked robot with innovative suspension system” (Polibot), Mechanism and Machine Theory, vol. 182, April, 2023.

141. A Kirillov, E Mintun, N Ravi, and et al. “Segment Anything” (SAM). arXiv 2304.02643, 2023

142. J Park, J Brien, C Cai and et al., “Generative Agents: Interactive Simulacra of Human Behavior” (GA), arXiv 2304.03442, 2023

143. X Zou, J Yang, H Zhang, et al., “Segment everything everywhere all at once” (SEEM). arXiv 2304.06718, 2023

144. M Oquab, T Darcet, T Moutakanni, and et al. “Dinov2: Learning robust visual features without supervision”. arXiv 2304.07193, 2023

145. T Z. Zhao, V Kumar, S Levine, C Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” (ALOHA/ACT), arXiv 2304.13705, 2023

146. Y Xie, K Kawaguchi, Y Zhao, and et al. “Self-evaluation guided beam search for reasoning”, arXiv 2305.00633, 2023

147. M Heo, Y Lee, D Lee, and J. Lim. “FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation”. arXiv 2305.12821, 2023.

148. S Yao, D Yu, J Zhao, and et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, arXiv 2305.10601, 2023

149. Mu Y, Zhang Q, Hu M, et al. “EmbodiedGPT: vision-language pre-training via embodied chain of thought”. arXiv 2305.15021, 2023

150. G Wang, Y Xie, Y Jiang, and et al., “VOYAGER: An Open-Ended Embodied Agent with Large Language Models”, arXiv 2305.16291, 2023

151. M Kulkarni, T J. L. Forgaard, K Alexis, “Aerial Gym — Isaac Gym Simulator for Aerial Robots”, arXiv 2305.16510, 2023

152. B Y Lin, Y Fu, K Yang, and et al. “SwiftSage: a generative agent with fast and slow thinking for complex interactive tasks”. arXiv 2305.17390, 2023

153. Cyberbotics, “Webots: open-source robot simulator”, https://github.com/cyberbotics/webots, 2023

154. NVIDIA, “Nvidia Isaac Sim: Robotics simulation and synthetic data,” https://developer.nvidia.com/isaac/sim, 2023

155. AKM Shahariar, Azad Rabby, C Zhang, “BeyondPixels: A Comprehensive Review of the Evolution of Neural Radiance Fields”, arXiv 2306.03000, 2023

156. B Liu, Y Zhu, C Gao, and et al. “LIBERO: Benchmarking knowledge transfer for lifelong robot learning”. arXiv 2306.03310, 2023

157. P Ren, K Zhang, H Zheng, and et al. “Surfer: Progressive reasoning with world models for robotic manipulation”, arXiv 2306.11335, 2023

158. Microsoft, “Textbooks Are All You Need” (phi-1), arXiv 2306.11644, 2023

159. Bousmalis K, Vezzani G, Rao D, et al. “RoboCat: a self-improving generalist agent for robotic manipulation”. arXiv 2306.11706, 2023

160. A Goyal, J Xu, Y Guo, and et al. “RVT: Robotic view transformer for 3D object manipulation”. arXiv 2306.14896, 2023

161. Vemprala S, Bonatti R, Bucker A, and et al. “ChatGPT for robotics: design principles and model abilities”, arXiv 2306.17582, 2023

162. Y Guo, Y-J Wang, L Zha, J Chen, “DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment”, arXiv 2307.00329, 2023

163. X Li, V Belagali, J Shang and M S. Ryoo, “Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning”, arXiv 2307.01849, 2023

164. Y Qin, W Yang, B Huang, and et al. “AnyTeleop: A general vision-based dexterous robot arm-hand teleoperation system”. arXiv 2307.04577, 2023.

165. Huang W, Wang C, Zhang R, et al. “VoxPoser: Composable 3D value maps for robotic manipulation with language models”. arXiv 2307.05973, 2023

166. K. Rana, J. Haviland, S. Garg, and et al. “SayPlan: Grounding large language models using 3d scene graphs for scalable task planning,” arXiv 2307.06135, CoRL’23, 2023.

167. Wang, Y., He, Y., Li, Y., and et al. “InternVid: A large-scale video-text dataset for multimodal understanding and generation”. arXiv 2307.06942, 2023.

168. T Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”, arXiv 2307.08691, 2023

169. H. Touvron, L. Martin, K. Stone, and et al. “Llama 2: Open foundation and fine-tuned chat models”. arXiv 2307.09288, 2023.

170. J Gu, Z Han, S Chen, and et al. “A systematic survey of prompt engineering on vision-language foundation models”. arXiv 2307.12980, 2023

171. H Ha, P Florence, and S Song. “Scaling up and distilling down: Language-guided robot skill acquisition” (SUDD). CoRL’23, arXiv 2307.14535, 2023

172. A Brohan, N Brown, J Carbajal, and et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, arXiv 2307.15818, 2023

173. H Fang, H Fang, Z Tang, and et al. “RH20T: A robotic dataset for learning diverse skills in one-shot”. RSS 2023 Workshop on Learning for Task and Motion Planning, arXiv 2307.00595, July 2023

174. P. Arm, G. Waibel, J. Preisig, and et al., “Scientific exploration of challenging planetary analog environments with a team of legged robots” (ANYmal C), arXiv 2307.10079, Science robotics, 8 (80), July, 2023.

175. Lin J, Du Y, Watkins O, et al. “Learning to model the world with language” (Dynalang). arXiv 2308.01399, 2023

176. Jing, Y., Zhu, X., Liu, X., and et al. “Exploring visual pre-training for robot manipulation: Datasets, models and methods” (Vi-PRoM). arXiv 2308.03620, 2023.

177. S Zhang, L Dong, X Li and et al., “Instruction Tuning for Large Language Models: A Survey”, arXiv 2308.10792, 2023

178. L Wang, C Ma, X Feng, and et al. “A Survey on Large Language Model based Autonomous Agents”, arXiv 2308.11432, 2023

179. H. Walke, K. Black, A. Lee, and et al. “Bridgedata v2: A dataset for robot learning at scale”, arXiv 2308.12952, 2023.

180. K Somasundaram, J Dong, H Tang, and et al. “Project Aria: A new tool for egocentric multi-modal AI research”. arXiv 2308.13561, 2023.

181. L G Foo, H Rahmani, and J Liu, “AIGC for Various Data Modalities: A Survey”, arXiv 2308.14177, Aug. 2023

182. H Bharadhwaj, J Vakil, M Sharma, and et al., “RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking” (MT-ACT/RoboSet), arXiv 2309.01918, 2023

183. Microsoft, “Textbooks Are All You Need II: Phi-1.5 technical report”, arXiv 2309.05463, 2023

184. W Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM), arXiv 2309.06180, 2023

185. Z Xi, W Chen, X Guo, and et al. “The Rise and Potential of Large Language Model Based Agents: A Survey”, arXiv 2309.07864, 2023

186. C Li, Z Gan, Z Yang, and et al. “Multimodal Foundation Models: From Specialists to General-Purpose Assistants” (survey), arXiv 2309.10020, 2023

187. P Wu, Y Shentu, Z Yi, X Lin, and P Abbeel. “GELLO: A general, low-cost, and intuitive tele-operation framework for robot manipulators”. arXiv 2309.13037, 2023

188. H Fang, H Fang, Y Wang, and et al. “AirExo: Low-cost exoskeletons for learning whole-arm manipulation in the wild”. arXiv 2309.14975, 2023

189. T Shen, R Jin, Y Huang, and et al., “Large Language Model Alignment: A Survey”, arXiv 2309.15025, 2023

190. Z Chu, J Chen, Q Chen, and et al., “A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future”, arXiv 2309.15402, 2023

191. Q. Gu, A. Kuwajerwala, S. Morin, and et al., “ConceptGraphs: Open-vocabulary 3d scene graphs for perception and planning,” arXiv 2309.16650, 2023

192. Z Yang, L Li, K Lin, and et al., “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”, arXiv 2309.17421, 2023

193. L Wang, Y Ling, Z Yuan, and et al. “GenSim: Generating Robotic Simulation Tasks via Large Language Models”, arXiv 2310.01361, 2023

194. A Q. Jiang, A Sablayrolles, A Mensch, and et al., “Mistral 7B”, arXiv 2310.06825, 2023

195. A Padalkar, A Pooley, A Jain, and et al. “Open X-Embodiment: Robotic learning datasets and RT-x models”. arXiv 2310.08864, 2023

196. Zhang W, Wang G, Sun J, et al. “STORM: efficient stochastic transformer based world models for reinforcement learning”. arXiv 2310.09615, 2023

197. Du, Y., Yang, M., Florence, P. R., and et al. “Video language planning” (VLP). arXiv 2310.10625, Oct. 2023

198. Y J Ma, W Liang, G Wang, and et al. “EUREKA: Human-Level Reward Design Via Coding Large Language Models”, arXiv 2310.12931, ICLR’24, 2023

199. Puig, X., Undersander, E., Szot, A., and et al. “Habitat 3.0: A co-habitat for humans, avatars and robots”. arXiv 2310.13724, 2023

200. Y Feng, N Hansen, Z Xiong, and et al., “Fine-tuning Offline World Models in the Real World” (FOWM), arXiv 2310.16029, 2023

201. Hansen N, Su H, Wang X. “TD-MPC2: scalable, robust world models for continuous control”. arXiv 2310.16828, 2023

202. A Mandlekar, S Nasiriany, B Wen, and et al., “MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations”, arXiv 2310.17596, 2023

203. J Betker, G Goh, L Jing, and et al., “Improving Image Generation with Better Captions” (DALL-E3), OpenAI report, Oct., 2023

204. Li X, Liu M, Zhang H, et al. “Vision-language foundation models as effective robot imitators” (RoboFlamingo). arXiv 2311.01378, 2023

205. Wang Y, Xian Z, Chen F, et al. “RoboGen: towards unleashing infinite data for automated robot learning via generative simulation”. arXiv 2311.01455, 2023

206. J Gu, S Kirmani, P Wohlhart, and et al., “RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches”, arXiv 2311.01977, 2023

207. M R Morris, J Sohl-dickstein, N Fiedel, and et al. “Levels of AGI: Operationalizing Progress on the Path to AGI”, arXiv 2311.02462, 2023

208. H Peng, C Ding, T Geng, and et al., “Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs”, arXiv 2311.04417, 2023

209. Zeng, F., Gan, W., Wang, Y., and et al. “Large language models for robotics: A survey”. arXiv 2311.07226, 2023

210. Y Huang, Y Chen, Z Li, “Applications of Large Scale Foundation Models for Autonomous Driving” (survey), arXiv 2311.12144, 2023

211. J. Huang, S. Yong, X. Ma, and et al., “An embodied generalist agent in 3d world” (LEO), arXiv 2311.12871, 2023

212. X Xiao, J Liu, Z Wang, and et al., “Robot Learning in the Era of Foundation Models: A Survey”, arXiv 2311.14379, 2023

213. Y. Chen, W. Cui, Y. Chen, and et al., “RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks,” arXiv 2311.15649, 2023.

214. N Shafiullah, A Rai, H Etukuru, and et al. “On bringing robots home” (Dobb·E/Stick v1/HoNY). arXiv 2311.16098, 2023.

215. Y. Hu, F Lin, T Zhang, L Yi, and Y Gao, “Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning” (ViLa), arXiv 2311.17842, 2023.

216. K Grauman, A Westbury, L Torresani, and et al. “Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives”, arXiv 2311.18259, 2023

217. Javaheripi M, Bubeck S, Abdin M, et al. “Phi-2: the surprising power of small language models”. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023

218. Y Song, E Byrne, T Nagarajan, and et al. “Ego4D goal-step: Toward hierarchical understanding of procedural activities”. NeurIPS, 2023

219. I Leal, K Choromanski, D Jain, and et al., “SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention”, arXiv 2312.01990, 2023

220. R Firoozi, J Tucker, S Tian, and et al., “Foundation Models in Robotics: Applications, Challenges, and the Future” (review), arXiv 2312.07843, 2023

221. Y Hu, Q Xie, V Jain, and et al. “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis”, arXiv 2312.08782, 2023

222. P Wang, L Li, Z Shao, and et al. “Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations”. arXiv 2312.08935, 2023

223. Team, G., Anil, R., Borgeaud, S., and et al. “Gemini: a family of highly capable multimodal models”. arXiv 2312.11805, 2023.

224. H Wu, Y Jing, C Cheang, and et al., “GR-1: Unleashing Large-Scale Video Generative Pre-Training For Visual Robot Manipulation”, arXiv 2312.13139, 2023

225. P Ding, H Zhao, W Song, and et al., “QUAR-VLA: Vision-Language-Action Model for Quadruped Robots”, arXiv 2312.14457, 2023

226. Mistral AI, “Mixtral of experts: A high quality Sparse Mixture-of-Experts”, https://mistral.ai/news/mixtral-of-experts/, Dec. 2023

227. C Wen, X Lin, J So, and et al., “Any-point Trajectory Modeling for Policy Learning” (ATM), arXiv 2401.00025, 2024

228. Z Fu, T Z Zhao, and C Finn, “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation”, arXiv 2401.02117, 2024

229. Y Cheng, C Zhang, Z Zhang, and et al. “Exploring Large Language Model Based Intelligent Agents: Definitions, Methods, and Prospects” (survey), arXiv 2401.03428, 2024

230. G Chen and W Wang, “A Survey on 3D Gaussian Splatting”, arXiv 2401.03890, 2024

231. T Cai, Y Li, Z Geng, and et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads”, arXiv 2401.10774, 2024

232. B Chen, Z Xu, S Kirmani, and et al. “SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities”, arXiv 2401.12168, 2024

233. M Ahn, D Dwibedi, C Finn, and et al., “AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents”, arXiv 2401.12963, 2024

234. Ming, R., Huang, Z., Ju, Z., and et al. “A survey on video prediction: From deterministic to generative approaches”. arXiv 2401.14718, 2024.

235. LLaMA.cpp, LLM inference in C/C++, https://github.com/ggerganov/llama.cpp, Jan. 2024

236. S. Le Cleac’h, T. A. Howell, S. Yang, and et al, “Fast contact-implicit model predictive control” (Unitree Go1), IEEE Transactions on Robotics, Jan. 2024.

237. X Yan, J Xu, Y Huo, H Bao, “Neural Rendering and Its Hardware Acceleration: A Review”, arXiv 2402.00028, 2024

238. Z Xu, K Wu, J Wen, and et al. “A Survey on Robotics with Foundation Models: toward Embodied AI”, arXiv 2402.02385, 2024

239. Z Wang, Y Li, Y Wu, and et al. “Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision” (MiPS). arXiv 2402.02658, 2024b.

240. X Huang, W Liu, X Chen, and et al. “Understanding the planning of LLM agents: A survey”, arXiv 2402.02716, 2024

241. G Paolo, J G-Billandon, B Kegl, “A Call for Embodied AI”, arXiv 2402.03824, 2024

242. K Kawaharazuka, T Matsushima, A Gambardella, and et al. “Real-World Robot Applications of Foundation Models: A Review”, arXiv 2402.05741, 2024

243. S Minaee, T Mikolov, N Nikzad, and et al. “Large Language Models: A Survey”, arXiv 2402.06196, 2024

244. C Eze, C Crick. “Learning by watching: A review of video-based learning approaches for robot manipulation”. arXiv 2402.07127, 2024

245. B Fei, J Xu, R Zhang, and et al., “3D Gaussian as A New Vision Era: A Survey”, arXiv 2402.07181, 2024

246. G Yenduri, Ramalingam M, P Maddikunta, and et al., “Spatial Computing: Concept, Applications, Challenges and Future Directions”, arXiv 2402.07912, 2024

247. C Chi, Z Xu, C Pan, and et al., “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots” (UMI), arXiv 2402.10329, 2024

248. Z Tan, A Beigi, S Wang, and et al. “Large Language Models for Data Annotation: A Survey”, arXiv 2402.13446, 2024

249. P Gao, P Wang, F Gao, et al. “Vision-Language Navigation with Embodied Intelligence: A Survey”, arXiv 2402.14304, 2024

250. S Yang, J Walker, J Parker-Holder and et al. “Video as the new language for real-world decision making”, arXiv 2402.17139, 2024

251. Y Liu, J Cao, C Liu, and et al., “Datasets for Large Language Models: A Comprehensive Survey”, arXiv 2402.18041, 2024

252. OpenAI Sora, “Video generation models as world simulators”, https://openai.com/index/video-generation-models-as-world-simulators/, Feb. 2024

253. Y Park and P Agrawal. “Using apple vision pro to train and control robots” (VisionProTeleop), https://github.com/Improbable-AI/VisionProTeleop, 2024

254. S. Belkhale, T. Ding, T. Xiao, and et al., “RT-H: Action hierarchies using language,” arXiv 2403.01823, Mar. 2024

255. S Lee, Y Wang, H Etukuru, and et al., “Behavior Generation with Latent Actions” (VQ-BeT), arXiv 2403.03181, 2024

256. Ze Y, Zhang G, Zhang K, et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations”. arXiv 2403.03954, 2024

257. M Luo, Z Xue, A Dimakis, K Grauman, “Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos”, arXiv 2403.06351, 2024

258. A Iyer, Z Peng, Y Dai, and et al. “Open Teach: A versatile teleoperation system for robotic manipulation”. arXiv 2403.07870, 2024.

259. Google Gemma Team, “Gemma: Open Models Based on Gemini Research and Technology”, arXiv 2403.08295, 2024

260. Li, C., Zhang, R., Wong, J., and et al. “Behavior-1k: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation”, arXiv 2403.09227, 2024

261. H. Zhen, X Qiu, P Chen, and et al., “3D-VLA: 3d vision-language-action generative world model,” arXiv:2403.09631, 2024.

262. T Wu, Y Yuan, L Zhang, and et al. “Recent Advances in 3D Gaussian Splatting” (review), arXiv 2403.11134, 2024

263. C Wang, H Shi, W Wang, and et al., “DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation”, arXiv 2403.07788, 2024

264. C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel. “HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation”. arXiv 2403.10506, 2024.

265. J F. Mullen Jr, D Manocha, “LAP, Using Action Feasibility for Improved Uncertainty Alignment of Large Language Model Planners”, arXiv 2403.13198, 2024

266. Y Huang, G Chen, J Xu, et al. “EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World”, arXiv 2403.16182, 2024

267. “AI Power: Accurate Models at Blazing Speeds | SambaNova”, https://sambanova.ai/blog/accurate-models-at-blazing-speed, Samba COE v0.2, March, 2024

268. Unitree humanoid H1, https://kr-asia.com/unitree-robotics-develops-personal-robot-dogs-that-jog-alongside-you, Mar. 2024

269. A. Khazatsky, K. Pertsch, S. Nair, “Droid: A large-scale in-the-wild robot manipulation dataset”, arXiv 2403.12945, 2024

270. S Zhou, Y Du, J Chen, and et al. “RoboDreamer: Learning compositional world models for robot imagination”, arXiv 2404.12377, 2024

271. ALOHA 2 Team, and et al, “ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation”, arXiv 2405.02292, 2024

272. J W Kim, T Z. Zhao, S Schmidgall, and et al., “Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks”, arXiv 2407.12998, 2024

273. Microsoft, “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone”, arXiv 2404.14219, 2024

274. S. Shin, J. Kim, G.-C. Kang, and et al., “Socratic planner: Inquiry-based zero-shot planning for embodied instruction following,” arXiv 2404.15190, 2024.

275. Y Xia, R Wang, X Liu, and et al., “Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs”, arXiv 2404.15676, 2024

276. R Xu, S Yang, Y Wang, and et al., “Visual Mamba: A Survey and New Outlooks”, arXiv 2404.18861, 2024

277. R McCarthy, D Tan, D Schmidt, and et al. “Towards Generalist Robot Learning from Internet Video: A Survey”, arXiv 2404.19664, 2024

278. R Cadene, S Alibert, A Soare, and et al., https://github.com/huggingface/lerobot (LeRobot), May, 2024

279. G. Wang, L. Pan, S. Peng, and et al., “NeRF in robotics: A survey,” arXiv 2405.01333, 2024.

280. M Dalal, T Chiruvolu, D Chaplot, and R Salakhutdinov. “Plan-Seq-Learn: Language model guided rl for solving long horizon robotics tasks” (PSL), arXiv 2405.01534, 2024

281. A Dalal, D Hagen, K Robbersmyr, and et al. “Gaussian Splatting: 3D Reconstruction and Novel View Synthesis, a Review”, arXiv 2405.03417, 2024

282. Z Zhu, X Wang, W Zhao, and et al. “Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond”, arXiv 2405.03520, 2024

283. X Li, K Hsu, J Gu, and et al., “Evaluating Real-World Robot Manipulation Policies in Simulation” (SIMPLER), arXiv 2405.05941, 2024

284. K F Gbagbe, M A Cabrera, A Alabbas, and et al., “Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations”, arXiv 2405.06039, 2024

285. Y Huang, “Levels of AI Agents: from Rules to Large Language Models”, arXiv 2405.06643, May, 2024

286. R Prabhakar, R Sivaramakrishnan, D Gandhi, and et al., “SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts”, arXiv 2405.07518, 2024

287. Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, and et al. “Octo: An open-source generalist robot policy”, arXiv 2405.12213, 2024

288. Y Ma, Z Song, Y Zhuang, and et al. “A Survey on Vision-Language-Action Models for Embodied AI”, arXiv 2405.14093, 2024

289. Zhang Y, Yang S, Bai C J, et al. “Towards efficient LLM grounding for embodied multi-agent collaboration” (ReAd). arXiv 2405.14314, 2024

290. F Bordes, R Y Pang, A Ajay, and et al. “An Introduction to Vision-Language Modeling”, arXiv 2405.17247, 2024

291. T Zhang, D Li, Y Li, and et al., “Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks” (BRMData), arXiv 2405.18860, 2024

292. Fei-Fei Li, “With Spatial Intelligence, Artificial Intelligence Will Understand the Real World”, https://www.youtube.com/watch?v=y8NtMZ7VGmU, May, 2024

293. OpenAI GPT-4o, https://openai.com/index/hello-gpt-4o/, May, 2024

294. J. Liu, M. Liu, Z. Wang, and et al., “RoboMamba: Multimodal state space model for efficient robot reasoning and manipulation,” arXiv 2406.04339, 2024

295. L Luo, Y Liu, R Liu, and et al. “Improve mathematical reasoning in language models by automated process supervision” (OmegaPRM). arXiv 2406.06592, 2024.

296. A. Szot, B Mazoure, H Agrawal, and et al., “Grounding multimodal large language models in actions” (Grounding-RL), arXiv 2406.07904, 2024.

297. A Goyal, V Blukis, J Xu, and et al. “RVT-2: Learning Precise Manipulation from Few Demonstrations”. arXiv 2406.08545, 2024

298. T He, Z Luo, X He, and et al., “OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Tele-operation and Learning”, arXiv 2406.08858, 2024

299. M J Kim, K Pertsch, S Karamcheti, and et al., “OpenVLA: An Open-Source Vision-Language-Action Model”, arXiv 2406.09246, 2024

300. W Cai, J Jiang, F Wang, and et al., “A Survey on Mixture of Experts”, arXiv 2407.06204, 2024

301. Z Fu, Q Zhao, Q Wu, G Wetzstein, and C Finn. “HumanPlus: Humanoid shadowing and imitation from humans”. arXiv 2406.10454, 2024.

302. D Niu, Y Sharma, G Biamby, and et al, “LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning”, arXiv 2406.11815, 2024

303. P Mazzaglia, T Verbelen, B Dhoedt, and et al., “GenRL: Multimodal-foundation world models for generalization in embodied agents”, arXiv 2406.18043, 2024

304. B Pei, G Chen, J Xu, and et al. “EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation”, arXiv 2406.18070, 2024

305. Isaac Ong, Amjad Almahairi, V Wu, and et al., “RouteLLM: Learning to Route LLMs with Preference Data”, arXiv 2406.18665, 2024

306. X Mai, Z Tao, J Lin, and et al. “From Efficient Multimodal Models to World Models: A Survey”, arXiv 2407.00118, 2024

307. X Cheng, J Li, S Yang, G Yang, and X Wang. “Open-Television: Teleoperation with immersive active visual feedback”, arXiv 2407.01512, 2024

308. I Georgiev, V Giridhar, N Hansen, A Garg, “PWM: Policy Learning with Large World Models”, arXiv 2407.02466, 2024

309. R Ding, Y Qin, J Zhu, and et al, “Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning”, arXiv 2407.03162, 2024

310. Y Liu, W Chen, Y Bai and et al. “Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI”, arXiv 2407.06886, 2024

311. L Zheng, F Yan, F Liu, and et al., “RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios”, arXiv 2407.06951, 2024

312. N Chernyadev, N Backshall, X Ma, and et al., “BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark”, arXiv 2407.07788, 2024

313. A Lee, I Chuang, L-Y Chen, I Soltani, “InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation”, arXiv 2409.07914, 2024

314. W Wu, H He, Y Wang, and et al., “MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces”, arXiv 2407.08725, 2024

315. H Ha, Y Gao, Z Fu, J Tan, and S Song, “UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers”, arXiv 2407.10353, 2024

316. H Wang, J Chen, W Huang, and et al., “GRUtopia: Dream General Robots in a City at Scale”, arXiv 2407.10943, 2024

317. Y Bao, T Ding, J Huo, and et al. “3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities”, arXiv 2407.17418, 2024

318. Llama Team, Meta AI, “The Llama 3 Herd of Models”, arXiv 2407.21783, 2024

319. Y Wu, Z Sun, S Li, S Welleck, Y Yang. “Inference Scaling Laws: An Empirical Analysis of Compute-optimal Inference For LLM Problem-solving” (REBASE), arXiv 2408.00724, 2024

320. H Qu, L Ning, R An, and et al., “A Survey of Mamba”, arXiv 2408.01129, 2024

321. K Maeda, T Hirasawa, A Hashimoto, and et al. “COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark”, arXiv 2408.02272, 2024

322. C Snell, J Lee, K Xu, and A Kumar. “Scaling LLM test-time compute optimally can be more effective than scaling model parameters”. arXiv 2408.03314, 2024.

323. Z Fang, M Yang, W Zeng, and et al., “Egocentric Vision Language Planning” (EgoPlan), arXiv 2408.05802, 2024

324. H Arai, K Miwa, K Sasaki, and et al., “CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving”, arXiv 2408.10845, 2024

325. Z Wang, H Zheng, Y Nie, and et al., “All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents” (ARIO). arXiv 2408.10899, 2024

326. Y Zheng, L Yao, Y Su, and et al., “A Survey of Embodied Learning for Object-Centric Robotic Manipulation”, arXiv 2408.11537, 2024

327. S Yang, M Liu, Y Qin, and et al. “ACE: A cross-platform visual-exoskeletons system for low-cost dexterous teleoperation”, arXiv 2408.11805, 2024

328. R Doshi, H Walke, O Mees, S Dasari, S Levine, “Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation” (CrossFormer), arXiv 2408.11812, 2024

329. Figure 02, https://techcrunch.com/2024/08/06/figures-new-humanoid-robot-leverages-openai-for-natural-speech-conversations/, Aug. 2024

330. Y. Yang, F.-Y. Sun, L. Weihs, and et al., “Holodeck: Language guided generation of 3d embodied AI environments,” IEEE/CVF CVPR, 2024

331. Y. Yang, B. Jia, P. Zhi, and S. Huang, “Physcene: Physically interactable 3d scene synthesis for embodied AI,” IEEE/CVF CVPR, 2024

332. H Etukuru, N Naka, Z Hu, and et al., “Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments” (Stick-v2/RUM), arXiv 2409.05865, 2024

333. K Li, S M Wagh, N Sharma, and et al., “Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR”, arXiv 2409.11925, 2024

334. A Yang, B Zhang, B Hui, and et al. “Qwen2.5-math technical report: Toward mathematical expert model via self-improvement”. arXiv 2409.12122, 2024.

335. J Wen, Y Zhu, J Li, and et al., “TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation”, arXiv 2409.12514, 2024

336. A Anwar, J Welsh, J Biswas, S Pouya, Y Chang, “ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation”, arXiv 2409.13682, 2024

337. I Chuang, A Lee, D Gao, I Soltani, “Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation” (AV-ALOHA), arXiv 2409.17435, 2024

338. Z Wu, T Wang, Z Zhuoma, and et al., “Fast-UMI: A Scalable and Hardware-Independent Universal Manipulation Interface”, arXiv 2409.19499, 2024

339. OpenAI o1, “Learning to reason with LLMs”. https://openai.com/index/learning-to-reason-with-llms, 2024.

340. World Labs, an AI company for spatial intelligence, https://www.worldlabs.ai/, Sep. 2024

341. C-L Cheang, G Chen, Y Jing, and et al., “GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation”, ByteDance Research, Tech. Report, arXiv 2410.06158, Oct., 2024

342. J Wang, M Fang, Z Wan, and et al., “OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models”, Tech. Report, arXiv 2410.09671, Oct. 2024

343. S Tao, F Xiang, A Shukla, and et al., “ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI”, arXiv 2410.00425, 2024

344. P Hua, M Liu, A Macaluso, and et al., “GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs”, arXiv 2410.03645, 2024

345. S Liu, L Wu, B Li, and et al.,“RDT-1B: a Diffusion Foundation Model For Bimanual Manipulation”, arXiv 2410.07864, 2024

346. S Chen, C Wang, K Nguyen, Li F-F, C. K Liu, “ARCap: Collecting High-quality Human Demonstrations for Robot Learning with Augmented Reality Feedback”, arXiv 2410.08464, 2024

347. D Su, S Sukhbaatar, M Rabbat, Y Tian, Q Zheng, “Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces”, arXiv 2410.09918, 2024

348. S Dasari, O Mees, S Zhao, M K Srirama, S Levine, “The Ingredients for Robotic Diffusion Transformers” (DiT-Block Policy), arXiv 2410.10088, 2024

349. Y Ze, Z Chen, W Wang, and et al., “Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies” (iDP3), arXiv 2410.10803, 2024

350. T Z. Zhao, J Tompson, D Driess, and et al., “ALOHA Unleashed: A Simple Recipe for Robot Dexterity”, arXiv 2410.13126, 2024

351. S Zhu, G Wang, D Kong, H Wang, “3D Gaussian Splatting in Robotics: A Survey”, arXiv 2410.12262, 2024

352. B Han, J Kim, and J Jang, “A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM” (DP-VLA), arXiv 2410.15549, 2024

353. Y Zhang, Z Li, M Zhou, S Wu, Jiajun Wu, “The Scene Language: Representing Scenes with Programs, Words, and Embeddings” (SL-DSL), arXiv 2410.16770, 2024

354. Y Yue, Y Wang, B Kang, and et al., “DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution”, arXiv 2411.02359, 2024

355. S Nasiriany, S Kirmani, T Ding, and et al., “RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation”, arXiv 2411.02704, 2024

356. Y Chen, C Wang, Y Yang, C. Liu, “Object-Centric Dexterous Manipulation from Human Motion Data” (OCDM), arXiv 2411.04005, 2024

357. S Zhao, X Zhu, Y Chen, and et al., “DexH2R: Task-oriented Dexterous Manipulation from Human to Robots”, arXiv 2411.04428, 2024

358. Z Zhang, R Chen, J Ye, and et al., “WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making”, arXiv 2411.05619, 2024

359. K Shaw, Y Li, J Yang, and et al., “Bimanual Dexterity for Complex Tasks” (BiDex), arXiv 2411.13677, 2024

360. X Wang, L Horrigan, J Pinskier, and et al., “DexGrip: Multi-modal Soft Gripper with Dexterous Grasping and In-hand Manipulation Capacity”, arXiv 2411.17124, 2024

361. Z Liang, Y Mu, Y Wang, and et al., “DexDiffuser: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation”, arXiv 2411.18562, 2024

362. Anthropic, https://www.anthropic.com/news/3-5-models-and-computer-use, Nov. 2024
