6. Data Capture Platform and Datasets
A key property of robotic manipulation policies is their ability to generalize, i.e., to perform the desired manipulation task under new lighting conditions, in new environments, or with new objects. Training policies that can adapt to such changes is a critical step toward deploying robots in everyday environments.
A key ingredient in training such generalizable policies is diverse training data: in computer vision (CV) and natural language processing (NLP), training with large and diverse datasets crawled from the internet can produce models that are applicable to a wide range of new tasks.
Similarly, in robotic manipulation, larger and more diverse robotic training datasets can help push the limits of policy generalization, including transfer to new goals, instructions, scenarios, and embodiments. An important stepping stone toward more robust robotic manipulation policies is therefore the creation of large, diverse, high-quality robotic manipulation datasets.
Compared to fields such as CV and NLP, the scarcity of high-quality data has hampered progress in robotics in many ways. To address this challenge, researchers have proposed algorithms based on techniques such as few-shot learning and multi-task learning. While these approaches show promise in alleviating the data scarcity problem, they still rely on large amounts of high-quality data to achieve effective task generalization.
In terms of both scale and relevant content, Internet video data can help alleviate the data bottleneck in robotics. Specifically, the benefits include: (i) improving generalization beyond what existing robotics data supports, (ii) improving data efficiency and in-distribution performance on robotics data, and potentially (iii) obtaining emergent capabilities that cannot be extracted from robotics data alone.
Learning robotic actions from Internet videos still faces many fundamental and practical challenges. First, video data is generally high-dimensional, noisy, stochastic, and inaccurately labeled. Second, videos lack information that is critical to robotics, including action labels, low-level forces, and proprioception. In addition, various distribution shifts exist between Internet videos and the robotics domain.
Two key questions in this area are: (i) how to extract relevant knowledge from Internet videos, and (ii) how to apply the extracted knowledge to robotics.
At the same time, there has been a push to collect larger real-world robotics datasets. Efforts here include aggregating human teleoperation data collected across different laboratories, as well as research into automating data collection to improve the scalability of teleoperation.
The most common approach to collecting robotic demonstrations is to pair a robot or end-effector with a teleoperation device or a kinematically isomorphic device. The devices used vary in complexity and form factor (a minimal data-logging sketch follows this list):
1) Full robotic exoskeletons, such as TABLIS [49], WULE [138], AirExo [188], and DexCap (shown in Figure 14) [263];
2) Simpler robotic data collection tools, such as ALOHA [145] (shown in Figure 15), GELLO [187], Mobile ALOHA [228], ALOHA 2 [271], and AV-ALOHA [337];
3) Handheld devices without a physically moving robot, such as Dobb·E/Stick v1 [214], UMI (shown in Figure 16) [247], UMI on Legs [315], RUM/Stick v2 [332], and Fast-UMI [338];
4) Video game controllers (for instance, joysticks), as in LIBERO [156];
5) VR devices, such as Holo-Dex [107], AnyTeleop [164], Open Teach [258], HumanPlus [301], Open-Television [307], ACE [327], ARCap (shown in Figure 17) [346], and BiDex [359];
6) Mobile phones, such as RoboTurk [28].
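To make the pairing concrete, below is a minimal, hypothetical sketch of the data-logging loop such systems run during teleoperation: the operator's leader device streams commands, the follower robot executes them, and time-stamped observation-action pairs are appended to an episode buffer. The class and method names (`leader.read_joint_targets`, `follower.command_joints`, `camera.get_frame`) are illustrative placeholders, not the API of any specific system listed above.

```python
import time

def record_episode(leader, follower, camera, hz=30, max_steps=3000):
    """Record one teleoperated demonstration as observation-action pairs.

    `leader`, `follower`, and `camera` are hypothetical device wrappers:
      leader.read_joint_targets()   -> list[float]  # operator's commanded joints
      follower.command_joints(q)                    # send targets to the robot
      follower.get_proprioception() -> list[float]  # measured joint state
      camera.get_frame()            -> image array
    """
    episode = []
    dt = 1.0 / hz
    for _ in range(max_steps):
        t0 = time.time()
        action = leader.read_joint_targets()   # operator input becomes the action label
        observation = {
            "timestamp": t0,
            "image": camera.get_frame(),
            "proprio": follower.get_proprioception(),
        }
        follower.command_joints(action)        # follower mirrors the leader device
        episode.append({"observation": observation, "action": action})
        time.sleep(max(0.0, dt - (time.time() - t0)))  # hold a fixed control rate
    return episode
```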
Demonstration data collected via teleoperated robotic systems provide precise in-domain observation-action pairs, enabling effective robotic policy learning via supervised learning. However, the requirements for both the robotic system and a skilled human operator significantly limit the accessibility and scalability of data collection.
The collection of real-world robotics data faces significant challenges stemming from factors such as cost, time, inconsistency, and accuracy.
Due to these difficulties, public real-world robotics datasets are relatively scarce. In addition, evaluating the performance of robotic systems under realistic conditions adds another layer of complexity, as accurately reproducing the setup is challenging and often requires human supervision.
Another strategy to address the data scarcity problem in real-world environments is to leverage human data. Due to its flexibility and diversity, human behavior provides rich guidance for robot policies.
However, this strategy also has inherent disadvantages. It is inherently difficult to capture human hand/body movements and transfer them to robots. In addition, the inconsistency of human data poses a problem, as some data may be first-person egocentric, while others are captured from a third-person perspective. Moreover, filtering human data to extract useful information can be labor-intensive [248]. These obstacles highlight the complexity of incorporating human data into the robot learning process.
Some datasets and benchmarks may not be directly usable for robot manipulation and navigation, but they target other capabilities relevant to embodied intelligence, such as spatial reasoning, physical understanding, and world knowledge. These capabilities are invaluable for task planners.
Although pre-training datasets such as Open X-Embodiment [195] appear to have a unified structure, significant issues remain: they lack sensor multi-modality, a unified format across different robots, compatibility across platforms, sufficient data volume, and coverage of both simulated and real content.
Representative robot manipulation datasets include RoboNet [35], BridgeData 1/2 [70, 179], RH20T [173], RoboSet [182], Open-X (shown in Figure 18) [195], DROID [269], BRMData [291], and ARIO (with a unified data format) [325].
Alternatively, human demonstrations can be collected using portable systems without the need for physical robotic hardware. These systems leverage human dexterity and adaptability to directly manipulate objects in the wild, thus facilitating the creation of large-scale, diverse datasets of human demonstrations. However, due to the lack of robotic hardware, it is not immediately clear whether the collected demonstration data can be used to train robot policies without a multi-step process.
Differences in embodiment between humans and robots require data retargeting. Additionally, the retargeted data must be validated by replaying the actions on an actual robot interacting with real objects. Finally, the robot policy must be trained on the validated data.
The success of human demonstrations depends heavily on the operator’s experience and awareness of the differences in geometry and capabilities between robots and humans. Failures can occur during the retargeting phase caused by the robot’s joint and velocity limitations, during the validation phase caused by accidental collisions, or during the policy training phase caused by the inclusion of invalid data.
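A minimal sketch of this retarget-validate-train pipeline is given below, assuming hypothetical helpers: `retarget_to_robot` stands in for an embodiment-specific retargeting step (e.g., inverse kinematics under joint and velocity limits), `replay_on_robot` for validation by replay on the physical robot, and the final training step for any imitation-learning algorithm.

```python
def build_training_set(human_demos, robot, retarget_to_robot, replay_on_robot):
    """Filter human demonstrations through retargeting and real-robot validation.

    human_demos: recorded human hand/body trajectories.
    retarget_to_robot(demo, robot): returns a robot trajectory, or None when the
        robot's joint/velocity limits cannot reproduce the motion.
    replay_on_robot(traj, robot): returns True only if the replay succeeds
        without accidental collisions or task failure.
    """
    validated = []
    for demo in human_demos:
        traj = retarget_to_robot(demo, robot)   # stage 1: embodiment retargeting
        if traj is None:
            continue                            # drop motions the robot cannot execute
        if not replay_on_robot(traj, robot):    # stage 2: validation by replay
            continue                            # drop replays that collide or fail
        validated.append(traj)
    return validated                            # stage 3: train only on validated data
```

Filtering out invalid data before training is exactly what guards against the failure modes listed above.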
Representative human action datasets include EPIC-Kitchens [42], Ego4D (shown in Figure 19) [71], HOI4D [79], Assembly101 [83], InternVid [167], Ego-Exo4D [216], Behavior-1K [260], EgoExoLearn [266], and COM Kitchens [321].
7. Wearable AI
Mapping the activities of others to the egocentric view is a basic skill of humans from a very early age.
Wearable AI, or Ego AI, is essentially a robotics application. Devices such as smart glasses, neural wristbands, and AR headsets (e.g., Meta Project Aria [180], shown in Figure 20, and VisionProTeleop [253]) use AI to perceive the user's environment, understand spatial context, and make predictions [218, 304, 323].
Although a large amount of data has been collected from the egocentric view (via wearable devices), it is crucial for AI agents to also learn directly from demonstration videos captured from other views.
Only a few datasets record videos of both egocentric and exocentric views in the same environment in a time-synchronized manner. Generalizing action learning for embodied intelligence requires transforming between the third-person and first-person views [53, 257].
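When both views are time-synchronized and the cameras are calibrated, the geometric portion of the exo-ego transform reduces to chaining extrinsics through a shared world frame. The NumPy sketch below illustrates this for a single 3D point; how a given dataset exposes its camera poses is an assumption here, and learning-based view translation [53, 257] addresses the appearance part that geometry alone cannot.

```python
import numpy as np

def exo_point_to_ego(p_exo, T_world_exo, T_world_ego):
    """Map a 3D point from the exocentric to the egocentric camera frame.

    p_exo:       (3,) point expressed in the exo camera frame.
    T_world_exo: (4, 4) pose of the exo camera in the world frame.
    T_world_ego: (4, 4) pose of the ego camera in the world frame.
    """
    p_h = np.append(p_exo, 1.0)                   # homogeneous coordinates
    p_world = T_world_exo @ p_h                   # exo camera frame -> world frame
    p_ego = np.linalg.inv(T_world_ego) @ p_world  # world frame -> ego camera frame
    return p_ego[:3]

# With identity poses the two frames coincide and the point is unchanged:
# exo_point_to_ego(np.array([0.1, 0.0, 1.0]), np.eye(4), np.eye(4))
```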
8. Requirements for Datasets
Based on the analysis above, the dataset requirements can be summarized as follows:
1) The dataset aims at promoting the study of large-scale embodied learning tasks.
2) The dataset supports generalization to new objects, new environments, new tasks, and even new embodied entities.
3) The dataset meets diversity requirements in terms of entity, time, place, view, goal, and skill.
4) The dataset provides sufficiently accurate ground truth: calibration, synchronization, mapping and localization, and annotation.
5) The dataset complies with privacy and ethical standards: de-identification.
6) The dataset includes real and simulation data: both real2sim and sim2real transfer are supported.
7) The dataset includes Exo-Ego view data: supporting flexible transformation between exocentric and egocentric views.
8) The dataset formulates a unified format standard: convertible between various data formats (a minimal episode-format sketch follows this list).
9) The dataset provides evaluation benchmarks: perception, cognition (reflection, reasoning, planning) and action (manipulation).
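As an illustration of requirements 3, 4, 7, and 8, the sketch below shows one possible unified episode record covering multi-view imagery, calibration, synchronization, and real/simulation provenance. The field names and the converter stub are assumptions for illustration only; they are not the schema of ARIO, Open X-Embodiment, or any other existing dataset.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Frame:
    timestamp: float              # synchronized timestamp shared by all sensors
    images: Dict[str, Any]        # e.g., {"ego_rgb": ..., "exo_rgb": ...}
    proprio: List[float]          # joint positions / end-effector pose
    action: List[float]           # commanded action at this step

@dataclass
class Episode:
    embodiment: str               # entity identifier (robot model or human)
    scene: str                    # place / environment identifier
    task_instruction: str         # language goal
    calibration: Dict[str, Any]   # per-view camera intrinsics and extrinsics
    source: str                   # "real" or "simulation"
    frames: List[Frame] = field(default_factory=list)

def to_step_format(episode: Episode) -> Dict[str, Any]:
    """Hypothetical converter stub: a unified schema should be losslessly
    convertible into other dataset formats (requirement 8)."""
    return {
        "meta": {"embodiment": episode.embodiment, "task": episode.task_instruction},
        "steps": [{"obs": {"images": f.images, "proprio": f.proprio},
                   "action": f.action, "t": f.timestamp} for f in episode.frames],
    }
```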
9. Conclusion
This paper overviews the evolution from traditional AI to LLMs, VLMs, agents, spatial intelligence, and embodied AI, and analyzes policy training for embodied action/behavior, embodied dexterity, data capture platforms, simulation platforms, and egocentric/wearable AI. The requirements imperative for building such a dataset are then laid out.
Finally, we discuss the generalization tricks in embodied AI, which give insight into embodied data capture.
9.1 Tricks for Generalization
The methods for policy generalization in embodied AI are as follows.
1) Sim-2-Real domain transfer in RL [25, 46] (a domain-randomization sketch follows this list);
2) Data augmentation and generative AI model (diffusion policy) [127];
3) Data scale and diversity (Open-X) [195];
4) Intermediate representation [355];
5) Large scale model architecture [113, 172];
6) Pre-trained large foundation models [299];
7) Post-training fine-tuning [200];
8) Inference-time optimization [334].
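To make the first item concrete, the sketch below shows domain randomization, a common sim-to-real technique [25, 46]: physics and visual parameters are re-sampled every episode so that the real world appears as just another sample from the training distribution. The `sim` and `policy` interfaces and the parameter ranges are assumptions for illustration; a practical implementation would use the randomization hooks of a simulator such as MuJoCo or Isaac.

```python
import random

def randomize_env(sim):
    """Re-sample simulator parameters at the start of each episode."""
    sim.set_friction(random.uniform(0.5, 1.5))          # contact dynamics
    sim.set_object_mass(random.uniform(0.8, 1.2))       # payload variation
    sim.set_light_intensity(random.uniform(0.3, 1.0))   # lighting conditions
    sim.set_camera_jitter(random.uniform(0.0, 0.02))    # viewpoint noise

def train_with_randomization(sim, policy, episodes=1000):
    """Hypothetical RL loop: the policy never sees the same environment twice."""
    for _ in range(episodes):
        randomize_env(sim)                  # new physics/visuals every episode
        obs = sim.reset()
        done = False
        while not done:
            action = policy.act(obs)
            obs, reward, done = sim.step(action)
            policy.update(obs, action, reward)   # any RL update rule
```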
References
1. J Aloimonos, I Weiss, A Bandyopadhyay, “Active vision”, IJCV, vol. 1, Jan. 1987
2. R. Bajcsy, “Active Perception”, IEEE Proceedings, Vol 76, No 8, Aug. 1988.
3. B. M. Yamauchi, “Packbot: a versatile platform for military robotics” (iRobot), Unmanned ground vehicle technology VI, vol. 5422. SPIE, Sept. 2004.
4. N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” IEEE/RSJ IRS, Oct., 2004.
5. P. R. Wurman, R. D’Andrea, and M. Mountz, “Coordinating hundreds of cooperative, autonomous vehicles in warehouses” (Kiva Systems), AI magazine, 29 (1), July, 2008.
6. M. Raibert, K. Blankespoor, G. Nelson, and R. Playter, “Bigdog, the rough-terrain quadruped robot,” IFAC Proceedings Volumes, 41(2), July 2008.
7. J Deng, W Dong, R Socher, and et al. “ImageNet: A large-scale hierarchical image database”, IEEE CVPR, Aug. 2009
8. G. Echeverria, N. Lassabe, A. Degroote and S. Lemaignan, “Modular open robots simulation engine: MORSE,” IEEE ICRA, May, 2011.
9. E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” IEEE/RSJ IRS, Oct. 2012
10. MIT Quadruped Cheetah, https://spectrum.ieee.org/mit-cheetah-robot-running, IEEE Spectrum, May 2013
11. E. Rohmer, S. P. Singh, and M. Freese, “V-Rep: A versatile and scalable robot simulation framework” (CoppeliaSim), IEEE/RSJ IRS, Nov. 2013
12. Y Bai and C K Liu. “Dexterous manipulation using both palm and fingers” (DMPF). IEEE ICRA, June, 2014.
13. F. Tanaka, K. Isshiki, F. Takahashi, and et al., “Pepper learns together with children: Development of an educational application”, IEEE Int. Conf. on Humanoid Robots, Nov. 2015.
14. E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” https://github.com/bulletphysics/bullet3, 2016
15. S. Maniatopoulos, P. Schillinger, V. Pong, D. C. Conner, and H. Kress-Gazit, “Reactive high-level behavior synthesis for an Atlas humanoid robot,” IEEE ICRA, May 2016.
16. V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous manipulation policies from experience and imitation” (LPEI), arXiv 1611.05095, 2016.
17. S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” arXiv 1705.05065, 2017
18. A. Chang, A. Dai, T. Funkhouser, and et al., “Matterport3D: Learning from RGB-D data in indoor environments”, arXiv1709.06158, 2017
19. A. Rajeswaran, V. Kumar, A. Gupta, and et al, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations” (DAPG), RSS’18, arXiv 1709.10087, 2017.
20. M Savva, A Chang, A Dosovitskiy, and et al., “MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments”, arXiv 1712.03931, 2017
21. E. Kolve, R. Mottaghi, D. Gordon, and et al., “AI2-THOR: An interactive 3d environment for visual AI,” arXiv 1712.05474, 2017
22. A Vaswani, N Shazeer, N Parmar, et al. “Attention is All You Need” (Transformer). Advances in Neural Information Processing Systems, 2017.
23. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training” (GPT-1), https://openai.com/index/language-unsupervised/, June 2018.
24. X. Puig, K. Ra, M. Boben, and et al., “Virtualhome: Simulating household activities via programs,” in IEEE/CVF CVPR, Jun 2018
25. OpenAI team, “Learning dexterous in-hand manipulation” (LDIM), arXiv 1808.00177, 2018.
26. A. Juliani, V-P Berges, E. Teng, and et al., “Unity: A general platform for intelligent agents” (Unity ML-Agents), arXiv 1809.02627, 2018.
27. S Li, X Ma, H Liang, and et al. “Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network” (TeachNet). ICRA, arXiv 1809.06268, 2018.
28. A Mandlekar, Y Zhu, A Garg, and et al. “RoboTurk: A crowdsourcing platform for robotic skill learning through imitation”. CoRL, arXiv 1811.02790, 2018.
29. A. Radford, J. Wu, R. Child, et al., “Language models are unsupervised multitask learners” (GPT-2), OpenAI blog, 2019.
30. Kroemer, O., Niekum, S., & Konidaris, G. “A review of robot learning for manipulation: Challenges, representations, and algorithms”. arXiv 1907.03146, 2019
31. M Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”, arXiv 1909.08053, 2019
32. A. Nagabandi, K. Konoglie, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation” (PDDM), arXiv 1909.11652, 2019.
33. S Rajbhandari, J Rasley, O Ruwase, Y He, “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, arXiv 1910.02054, 2019
34. A Handa, K Van Wyk, W Yang, and et al. “DexPilot: Vision-based teleoperation of dexterous robotic hand-arm system”. IEEE ICRA, arXiv 1910.03135, 2019.
35. S Dasari, F Ebert, S Tian, and et al. “RoboNet: Large-scale multi-robot learning”. CoRL’19, arXiv 1910.11215, 2019
36. M Savva, A Kadian, O Maksymets, and et al. “Habitat: A platform for embodied AI research”. IEEE ICCV, 2019.
37. Hafner D, Lillicrap T, Ba J, et al. “Dream to control: learning behaviors by latent imagination” (Dreamer v1). arXiv 1912.01603, 2019
38. Ravichandar, H., Polydoros, A. S., Chernova, S., & Billard, A. “Recent advances in robot learning from demonstration” (review). Annual Review of Control, Robotics, Auto. Systems, vol.3, 2020
39. C C. Kessens, J Fink, A Hurwitz, and et al., “Toward fieldable human-scale mobile manipulation using RoMan”, AI and Machine Learning for Multi-Domain Operations Applications II, Volume 11413, SPIE, April, 2020
40. I. Radosavovic, X. Wang, L. Pinto, and J. Malik, “State-only imitation learning for dexterous manipulation” (SOIL), IEEE/RSJ IROS’21. arXiv 2004.04650, 2020.
41. M Deitke, W Han, A Herrasti and et al. “RoboTHOR: An open simulation-to-real embodied AI platform”. CVPR’20, arXiv 2004.06799, 2020
42. Damen D, Doughty H, Farinella G M, et al. “The EPIC-Kitchens dataset: collection, challenges and baselines”. arXiv 2005.00343, IEEE T-PAMI, 43(11): 4125–4141, 2021
43. T. B. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners” (GPT-3), arXiv 2005.14165, 2020
44. C. Li, S. Zhu, Z. Sun, and J. Rogers, “BAS optimized ELM for KUKA iiwa Robot Learning,” IEEE Transactions on Circuits and Systems II: Express Briefs, 68 (6), Oct. 2020.
45. F. Xiang, Y. Qin, K. Mo, and et al., “SAPIEN: A simulated part-based interactive environment,” arXiv 2003.08515, IEEE/CVF CVPR, Jun 2020.
46. Zhao, W., Queralta, J. P., and Westerlund, T. “Sim-to-real transfer in deep reinforcement learning for robotics: a survey”. arXiv 2009.13303, 2020.
47. Hafner D, Lillicrap T, Norouzi M, et al. “Mastering Atari with discrete world models” (Dreamer v2). arXiv 2010.02193, 2020
48. A. Zeng, P. Florence, J. Tompson, and et al., “Transporter networks: Rearranging the visual world for robotic manipulation”. CoRL’20, arXiv 2010.14406, 2020
49. Y Ishiguro, T Makabe, Y Nagamatsu, and et al., “Bilateral humanoid teleoperation system using whole-body exoskeleton cockpit TABLIS”, IEEE IROS, Oct. 2020
50. B. Shen, F. Xia, C. Li, and et al., “iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes,” arXiv 2012.02924, IEEE/RSJ IRS, 2021
51. J Ren, S Rajbhandari, R Y Aminabadi et al., “ZeRO-offload: Democratizing Billion-Scale Model Training”, arXiv 2101.06840, 2021
52. S Rajbhandari et al., “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, arXiv 2104.07857, 2021
53. Y Li, T Nagarajan, B Xiong, and K Grauman. “Ego-Exo: Transferring visual representations from third-person to first-person videos”. arXiv 2104.07905, CVPR, 2021
54. D Kalashnikov, J Varley, Y Chebotar, and et al., “MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale”, arXiv 2104.08212, 2021
55. K. Ehsani, W. Han, A. Herrasti, and et al., “ManipulaTHOR: A framework for visual object manipulation,” arXiv 2104.11213, IEEE/CVF CVPR, 2021.
56. M Caron, H Touvron, I Misra, and et al. “Emerging Properties in Self-Supervised Vision Transformers” (Dino v1), arXiv 2104.14294, 2021
57. Chen L, Lu K, Rajeswaran A, et al. “Decision transformer: reinforcement learning via sequence modeling”, arXiv 2106.01345, 2021
58. Janner M, Li Q, Levine S. “Offline reinforcement learning as one big sequence modeling problem” (Trajectory Transformer), arXiv 2106.02039, 2021
59. E Hu et al., “LORA: Low-Rank Adaptation of Large Language Models”, arXiv 2106.09685, 2021
60. A Szot, A Clegg, E Undersander, and et al. “Habitat 2.0: Training Home Assistants to Rearrange their Habitat”, arXiv 2106.14405, 2021
61. Mu T Z, Ling Z, Xiang F B, et al. “Maniskill: generalizable manipulation skill benchmark with large-scale demonstrations”, arXiv 2107.14483, 2021
62. A. Jaegle, S. Borgeaud, J. B. Alayrac, and et al. “Perceiver IO: A general architecture for structured inputs & outputs”. arXiv 2107.14795, 2021.
63. A Radford, J W Kim, C Hallacy, et al. “Learning transferable visual models from natural language supervision” (CLIP). ICML 2021.
64. A Ramesh, M Pavlov, G Goh, et al., “Zero-shot text-to-image generation” (DALL-E). ICML. Virtual event, July 2021
65. C. Li, F. Xia, R. Martín-Martín, and et al., “iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks,” arXiv 2108.03272, CoRL’21, 2021
66. Y Qin, Y-H Wu, S Liu, and et al. “DexMV: Imitation learning for dexterous manipulation from human videos”. ECCV’22, arXiv 2108.05877, 2021.
67. Tesla Bot (Optimus), https://spectrum.ieee.org/elon-musk-robot, IEEE Spectrum, Aug., 2021
68. S K Ramakrishnan, A Gokaslan, E Wijmans, and et al. “Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D environments for embodied AI”. arXiv 2109.08238, 2021
69. M. Shridhar, L. Manuelli, and D. Fox, “CliPort: What and where pathways for robotic manipulation,” arXiv 2109.12098, 2021
70. F. Ebert, Y. Yang, K. Schmeckpeper, and et al. “Bridge data: Boosting generalization of robotic skills with cross-domain datasets”. arXiv 2109.13396, 2021.
71. K. Grauman, A. Westbury, E. Byrne and et al. “Ego4D: Around the world in 3,000 hours of egocentric video”. arXiv 2110.07058, 2021
72. Z Bian et al., “Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training”, arXiv 2110.14883, 2021
73. E Jang, A Irpan, M Khansari, and et al. “BC-Z: Zero-shot task generalization with robotic imitation learning”. CoRL, 2021
74. C. Gan, J. Schwartz, S. Alter, and et al., “ThreeDWorld: A platform for interactive multi-modal physical simulation,” arXiv 2007.04954, NeurIPS’21, 2021
75. R Rombach, A Blattmann, D Lorenz, P Esser, and B Ommer. “High-resolution image synthesis with latent diffusion models” (Stable Diffusion). arXiv 2112.10752, 2021.
76. W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents” (T-LM), arXiv 2201.07207, ICML, 2022.
77. P Mandikal and K Grauman. “DexVIP: Learning dexterous grasping with human hand pose priors from video”. CoRL, arXiv 2202.00164, 2022.
78. Li J, Li D, Xiong C, et al. “BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation”, arXiv 2201.12086, 2022
79. Y. Liu, Y. Liu, C. Jiang, and et al., “HOI4D: A 4D egocentric dataset for category-level human-object interaction”. CVPR’22, arXiv 2203.014577, 2022
80. L Ouyang, J Wu, X Jiang et al., “Training language models to follow instructions with human feedback” (GPT-3.5/InstructGPT), arXiv 2203.02155, 2022
81. N Hansen, X Wang, H Su, “Temporal Difference Learning for Model Predictive Control” (TD-MPC), arXiv 2203.04955, 2022
82. S P Arunachalam, S Silwal, B Evans, and L Pinto. “Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation” (DIME). arXiv 2203.13251, 2022.
83. F Sener, D Chatterjee, D Shelepov, and et al. “Assembly101: A large-scale multi-view video dataset for understanding procedural activities”. CVPR’22, arXiv 2203.14712, 2022
84. A. Zeng, M. Attarian, K. M. Choromanski, and et al., “Socratic models: Composing zero-shot multimodal reasoning with language”, arXiv 2204.00598, 2022
85. M Ahn, A Brohan, N Brown, and et al., “Do as I Can, Not as I Say: Grounding Language in Robotic Affordances” (SayCan), arXiv 2204.01691, 2022
86. A Ramesh, P Dhariwal, A Nichol, and et al. “Hierarchical text-conditional image generation with clip latents” (DALL-E2). arXiv 2204.06125, 2022.
87. Y Qin, H Su, and X Wang. “From one hand to multiple hands: Imitation learning for dexterous manipulation from single-camera teleoperation” (IMDM). RA-L, 7(4), arXiv 2204.12490, 2022.
88. J-B Alayrac, J Donahue, P Luc, et al., “Flamingo: a visual language model for few-shot learning”. arXiv 2204.14198, 2022
89. Reed, S., Zolna, K., Parisotto, E., and et al. “A Generalist Agent” (GATO). arXiv 2205.06175, 2022
90. T Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, arXiv 2205.14135, 2022
91. S. Haddadin, S. Parusel, L. Johannsmeier, and et al., “The Franka Emika Robot: A reference platform for robotics research and education”, IEEE Robotics & Automation Magazine, 29 (2), June, 2022.
92. M. Deitke, E. VanderBilt, A. Herrasti, and et al., “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation”, arXiv 2206.06994, NeurIPS’22, 2022
93. N M Shafiullah, Z J Cui, A Altanzaya, L Pinto, “Behavior Transformers: Cloning k modes with one stone”, arXiv 2206.11251, 2022
94. P Wu, A Escontrela, D Hafner, P Abbeel, and K Goldberg. “DayDreamer: World models for physical robot learning”. arXiv 2206.14176, 2022
95. Y. Seo, D. Hafner, H. Liu, and et al., “Masked world models for visual control” (MWM), arXiv 2206.14244, 2022
96. Huang W, Xia F, Xiao T, et al. “Inner monologue: embodied reasoning through planning with language models”. arXiv 2207.05608, 2022
97. S Bahl, A Gupta, D Pathak, “Human-to-Robot Imitation in the Wild” (WHIRL), arXiv 2207.09450, July, 2022
98. A Bucker, L Figueredo, S Haddadin, and et al., “LATTE: LAnguage Trajectory TransformEr”, arXiv 2208.02918, 2022
99. J Liang, W Huang, F Xia, and et al., “Code as Policies: Language Model Programs for Embodied Control” (CaP), arXiv 2209.07753, 2022
100. L Yang, Z Zhang, S Hong et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications”, arXiv 2209.00796, 2022
101. M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A multi-task transformer for robotic manipulation,” arXiv 2209.05451, 2022
102. I. Singh, V. Blukis, A. Mousavian, and et al., “ProgPrompt: Generating situated robot task plans using large language models,” arXiv 2209.11302, IEEE ICRA’23, 2022.
103. B. Reily, P. Gao, F. Han, H. Wang, and H. Zhang, “Real-time recognition of team behaviors by multisensory graph-embedded robot learning” (Jackal Robot/Clearpath Robotics), IJRA, 41(8), Sep. 2022.
104. Z Q Chen, K Van Wyk, Y-W Chao, and et al. “DexTransfer: Real world multi-fingered dexterous grasping with minimal human demonstrations”. arXiv 2209.14284, 2022.
105. K. Gao, Y. Gao, H. He, et al., “NeRF: Neural radiance field in 3d vision, a comprehensive review”. arXiv 2210.00379, 2022.
106. S Yao, J Zhao, D Yu, and et al., “ReAct: Synergizing Reasoning and Acting in Language Models”, arXiv 2210.03629, 2022
107. S P Arunachalam, I Güzey, S Chintala, and Lerrel Pinto. “Holo-Dex: Teaching dexterity with immersive mixed reality”. IEEE ICRA’23, arXiv 2210.06463, 2022.
108. A Handa, A Allshire, V Makoviychuk, and et al. “Dextreme: Transfer of agile in-hand manipulation from simulation to reality”. arXiv 2210.13702, 2022.
109. Mohammed, Q., Kwek, C., Chua, C. and et al. “Review of learning-based robotic manipulation in cluttered environments”. Sensors, vol. 22 (20), 2022.
110. T Chen, M Tippur, S Wu, and et al. “Visual dexterity: In-hand dexterous manipulation from depth”. arXiv 2211.11744, 2022.
111. C H Song, J Wu, C Washington, and et al., “LLM-Planner: Few-shot grounded planning for embodied agents with large language models”. arXiv 2212.04088, 2022
112. K Shaw, S Bahl, and D Pathak. “VideoDex: Learning dexterity from internet videos”. arXiv 2212.04498, 2022.
113. A Brohan, N Brown, J Carbajal, and et al. “RT-1: Robotics transformer for real-world control at scale”. arXiv 2212.06817, 2022
114. P Liu, W Yuan, J Fu, and et al. “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing”. ACM Computing Surveys, 55(9):1–35, 2023.
115. Q Dong, L Li, D Dai, and et al., “A survey for in-context learning”. arXiv 2301.00234, 2023.
116. Hafner D, Pasukonis J, Ba J, et al. “Mastering diverse domains through world models” (Dreamer v3), arXiv 2301.04104, 2023
117. M Mittal, C Yu, Q Yu, and et al. “ORBIT: A Unified Simulation Framework for Interactive Robot Learning Environments”, arXiv 2301.04195, 2023
118. K. Nottingham, P. Ammanabrolu, A. Suhr, and et al. “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling” (DEKARD), arXiv 2301.12050, 2023
119. Li J, Li D, Savarese S, et al. “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models”. arXiv 2301.12597, 2023
120. Y Du, M Yang, B Dai, and et al., “Learning Universal Policies via Text-Guided Video Generation” (UniPi), arXiv 2302.00111, 2023
121. Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents” (DEPS), arXiv 2302.01560, 2023.
122. J Gu, F Xiang, X Li, and et al., “ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills”, arXiv 2302.04659, 2023
123. H. Touvron, T. Lavril, G. Izacard, and et al. “LLaMA: Open and efficient foundation language models”. arXiv 2302.13971, 2023.
124. D Driess, F Xia, M. Sajjadi, et al., “PaLM-E: An Embodied Multimodal Language Model”, arXiv 2303.03378, 2023
125. G Khandate, S Shang, ET Chang, and et al. “Sampling-based Exploration for Reinforcement Learning of Dexterous Manipulation” (M-RRT/G-RRT). RSS’23, arXiv 2303.03486, 2023.
126. S Yang, O Nachum, Y Du, and et al., “Foundation models for decision making: Problems, methods, and opportunities” (review). arXiv 2303.04129, 2023
127. C Chi, Z Xu, S Feng, and et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, arXiv 2303.04137, 2023
128. Y Cao, S Li, Y Liu, and et al. “A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT”, arXiv 2303.04226, 2023
129. J Pitz, L Röstel, L Sievers, and B Bäuml. “Dextrous tactile in-hand manipulation using a modular reinforcement learning architecture” (DTIM). arXiv 2303.04705, 2023.
130. J Robine, M Höftmann, T Uelwer, and S Harmeling. “Transformer-based world models are happy with 100k interactions” (TWM). ICLR’23, arXiv 2303.07109, 2023
131. C Zhang, C Zhang, M Zhang, I S Kweon, “Text-to-image Diffusion Models in Generative AI: A Survey”, arXiv 2303.07909, 2023
132. J. Achiam, S. Adler, S. Agarwal, and et al. “GPT-4 technical report”. arXiv 2303.08774, 2023
133. Z-H Yin, B Huang, Y Qin, Q Chen, and X Wang. “Rotating without seeing: Towards in-hand dexterity through touch” (Touch Dexterity). arXiv 2303.10880, 2023.
134. Shinn N, Cassano F, Berman E, et al. “Reflexion: language agents with verbal reinforcement learning”, arXiv 2303.11366, 2023
135. I Guzey, B Evans, S Chintala, and L Pinto. “Dexterity from touch: Self-supervised pre- training of tactile representations with robotic play” (T-Dex). arXiv 2303.12076, 2023.
136. Madaan A, Tandon N, Gupta P, et al. “Self-Refine: iterative refinement with self-feedback”, arXiv 2303.17651, 2023
137. W X Zhao, K Zhou, J Li, and et al., “A Survey of Large Language Models”, arXiv 2303.18233, Mar. 2023
138. L Zhao, T Yang, Y Yang, and P Yu. “A wearable upper limb exoskeleton for intuitive teleoperation of anthropomorphic manipulators” (WULE). MDPI Machines, 11(4):441, Mar. 2023.
139. Figure 01, https://www.fastcompany.com/90859010/the-race-to-build-ai-powered-humanoids-is-heating-up, Mar., 2023
140. A. Ugenti, R. Galati, G. Mantriota, and G. Reina, “Analysis of an all-terrain tracked robot with innovative suspension system” (Polibot), Mechanism and Machine Theory, vol. 182, April, 2023.
141. A Kirillov, E Mintun, N Ravi, and et al. “Segment Anything” (SAM). arXiv 2304.02643, 2023
142. J Park, J Brien, C Cai and et al., “Generative Agents: Interactive Simulacra of Human Behavior” (GA), arXiv 2304.03442, 2023
143. X Zou, J Yang, H Zhang, et al., “Segment everything everywhere all at once” (SEEM). arXiv 2304.06718, 2023
144. M Oquab, T Darcet, T Moutakanni, and et al. “Dinov2: Learning robust visual features without supervision”. arXiv 2304.07193, 2023
145. T Z. Zhao, V Kumar, S Levine, C Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” (ALOHA/ACT), arXiv 2304.13705, 2023
146. Y Xie, K Kawaguchi, Y Zhao, and et al. “Self-evaluation guided beam search for reasoning”, arXiv 2305.00633, 2023
147. M Heo, Y Lee, D Lee, and J. Lim. “FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation”. arXiv 2305.12821, 2023.
148. S Yao, D Yu, J Zhao, and et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, arXiv 2305.10601, 2023
149. Mu Y, Zhang Q, Hu M, et al. “EmbodiedGPT: vision-language pre-training via embodied chain of thought”. arXiv 2305.15021, 2023
150. G Wang, Y Xie, Y Jiang, and et al., “VOYAGER: An Open-Ended Embodied Agent with Large Language Models”, arXiv 2305.16291, 2023
151. M Kulkarni, T J. L. Forgaard, K Alexis, “Aerial Gym — Isaac Gym Simulator for Aerial Robots”, arXiv 2305.16510, 2023
152. B Y Lin, Y Fu, K Yang, and et al. “SwiftSage: a generative agent with fast and slow thinking for complex interactive tasks”. arXiv 2305.17390, 2023
153. Cyberbotics, “Webots: open-source robot simulator”, https://github.com/cyberbotics/webots, 2023
154. NVIDIA, “Nvidia Isaac Sim: Robotics simulation and synthetic data,” https://developer.nvidia.com/isaac/sim, 2023
155. AKM Shahariar, Azad Rabby, C Zhang, “BeyondPixels: A Comprehensive Review of the Evolution of Neural Radiance Fields”, arXiv 2306.03000, 2023
156. B Liu, Y Zhu, C Gao, and et al. “LIBERO: Benchmarking knowledge transfer for lifelong robot learning”. arXiv 2306.03310, 2023
157. P Ren, K Zhang, H Zheng, and et al. “Surfer: Progressive reasoning with world models for robotic manipulation”, arXiv 2306.11335, 2023
158. Microsoft, “Textbooks Are All You Need” (phi-1), arXiv 2306.11644, 2023
159. Bousmalis K, Vezzani G, Rao D, et al. “RoboCat: a self-improving generalist agent for robotic manipulation”. arXiv 2306.11706, 2023
160. A Goyal, J Xu, Y Guo, and et al. “RVT: Robotic view transformer for 3D object manipulation”. arXiv 2306.14896, 2023
161. Vemprala S, Bonatti R, Bucker A, and et al. “ChatGPT for robotics: design principles and model abilities”, arXiv 2306.17582, 2023
162. Y Guo, Y-J Wang, L Zha, J Chen, “DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment”, arXiv 2307.00329, 2023
163. X Li, V Belagali, J Shang and M S. Ryoo, “Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning”, arXiv 2307.01849, 2023
164. Y Qin, W Yang, B Huang, and et al. “AnyTeleop: A general vision-based dexterous robot arm-hand teleoperation system”. arXiv 2307.04577, 2023.
165. Huang W, Wang C, Zhang R, et al. “VoxPoser: Composable 3D value maps for robotic manipulation with language models”. arXiv 2307.05973, 2023
166. K. Rana, J. Haviland, S. Garg, and et al. “SayPlan: Grounding large language models using 3d scene graphs for scalable task planning,” arXiv 2307.06135, CoRL’23, 2023.
167. Wang, Y., He, Y., Li, Y., and et al. “InternVid: A large-scale video-text dataset for multimodal understanding and generation”. arXiv 2307.06942, 2023.
168. T Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”, arXiv 2307.08691, 2023
169. H. Touvron, L. Martin, K. Stone, and et al. “Llama 2: Open foundation and fine-tuned chat models”. arXiv 2307.09288, 2023.
170. J Gu, Z Han, S Chen, and et al. “A systematic survey of prompt engineering on vision-language foundation models”. arXiv 2307.12980, 2023
171. H Ha, P Florence, and S Song. “Scaling up and distilling down: Language-guided robot skill acquisition” (SUDD). CoRL’23, arXiv 2307.14535, 2023
172. A Brohan, N Brown, J Carbajal, and et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, arXiv 2307.15818, 2023
173. H Fang, H Fang, Z Tang, and et al. “RH20T: A robotic dataset for learning diverse skills in one-shot”. RSS 2023 Workshop on Learning for Task and Motion Planning, arXiv 2307.00595, July 2023
174. P. Arm, G. Waibel, J. Preisig, and et al., “Scientific exploration of challenging planetary analog environments with a team of legged robots” (ANYmal C), arXiv 2307.10079, Science robotics, 8 (80), July, 2023.
175. Lin J, Du Y, Watkins O, et al. “Learning to model the world with language” (Dynalang). arXiv 2308.01399, 2023
176. Jing, Y., Zhu, X., Liu, X., and et al. “Exploring visual pre-training for robot manipulation: Datasets, models and methods” (Vi-PRoM). arXiv 2308.03620, 2023.
177. S Zhang, L Dong, X Li and et al., “Instruction Tuning for Large Language Models: A Survey”, arXiv 2308.10792, 2023
178. L Wang, C Ma, X Feng, and et al. “A Survey on Large Language Model based Autonomous Agents”, arXiv 2308.11432, 2023
179. H. Walke, K. Black, A. Lee, and et al. “Bridgedata v2: A dataset for robot learning at scale”, arXiv 2308.12952, 2023.
180. K Somasundaram, J Dong, H Tang, and et al. “Project Aria: A new tool for egocentric multi-modal AI research”. arXiv 2308.13561, 2023.
181. L G Foo, H Rahmani, and J Liu, “AIGC for Various Data Modalities: A Survey”, arXiv 2308.14177, Aug. 2023
182. H Bharadhwaj, J Vakil, M Sharma, and et al., “RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking” (MT-ACT/RoboSet), arXiv 2309.01918, 2023
183. Microsoft, “Textbooks Are All You Need II: Phi-1.5 technical report”, arXiv 2309.05463, 2023
184. W Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM), arXiv 2309.06180, 2023
185. Z Xi, W Chen, X Guo, and et al. “The Rise and Potential of Large Language Model Based Agents: A Survey”, arXiv 2309.07864, 2023
186. C Li, Z Gan, Z Yang, and et al. “Multimodal Foundation Models: From Specialists to General-Purpose Assistants” (survey), arXiv 2309.10020, 2023
187. P Wu, Y Shentu, Z Yi, X Lin, and P Abbeel. “GELLO: A general, low-cost, and intuitive tele-operation framework for robot manipulators”. arXiv 2309.13037, 2023
188. H Fang, H Fang, Y Wang, and et al. “AirExo: Low-cost exoskeletons for learning whole-arm manipulation in the wild”. arXiv 2309.14975, 2023
189. T Shen, R Jin, Y Huang, and et al., “Large Language Model Alignment: A Survey”, arXiv 2309.15025, 2023
190. Z Chu, J Chen, Q Chen, and et al., “A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future”, arXiv 2309.15402, 2023
191. Q. Gu, A. Kuwajerwala, S. Morin, and et al., “ConceptGraphs: Open-vocabulary 3d scene graphs for perception and planning,” arXiv 2309.16650, 2023
192. Z Yang, L Li, K Lin, and et al., “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”, arXiv 2309.17421, 2023
193. L Wang, Y Ling, Z Yuan, and et al. “GenSim: Generating Robotic Simulation Tasks via Large Language Models”, arXiv 2310.01361, 2023
194. A Q. Jiang, A Sablayrolles, A Mensch, and et al., “Mistral 7B”, arXiv 2310.06285, 2023
195. A Padalkar, A Pooley, A Jain, and et al. “Open X-Embodiment: Robotic learning datasets and RT-x models”. arXiv 2310.08864, 2023
196. Zhang W, Wang G, Sun J, et al. “STORM: efficient stochastic transformer based world models for reinforcement learning”. arXiv 2310.09615, 2023
197. Du, Y., Yang, M., Florence, P. R., and et al. “Video language planning” (VLP). arXiv 2310.10625, Oct. 2023
198. Y J Ma, W Liang, G Wang, and et al. “EUREKA: Human-Level Reward Design Via Coding Large Language Models”, arXiv 2310.12931, ICLR’24, 2023
199. Puig, X., Undersander, E., Szot, A., and et al. “Habitat 3.0: A co-habitat for humans, avatars and robots”. arXiv 2310.13724, 2023
200. Y Feng, N Hansen, Z Xiong, and et al., “Fine-tuning Offline World Models in the Real World” (FOWM), arXiv 2310.16029, 2023
201. Hansen N, Su H, Wang X. “TD-MPC2: scalable, robust world models for continuous control”. arXiv 2310.16828, 2023
202. A Mandlekar, S Nasiriany, B Wen, and et al., “MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations”, arXiv 2310.17596, 2023
203. J Betker, G Goh, L Jing, and et al., “Improving Image Generation with Better Captions” (DALL-E3), OpenAI report, Oct., 2023
204. Li X, Liu M, Zhang H, et al. “Vision-language foundation models as effective robot imitators” (RoboFlamingo). arXiv 2311.01378, 2023
205. Wang Y, Xian Z, Chen F, et al. “RoboGen: towards unleashing infinite data for automated robot learning via generative simulation”. arXiv 2311.01455, 2023
206. J Gu, S Kirmani, P Wohlhart, and et al., “RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches”, arXiv 2311.01977, 2023
207. M R Morris, J Sohl-dickstein, N Fiedel, and et al. “Levels of AGI: Operationalizing Progress on the Path to AGI”, arXiv 2311.02462, 2023
208. H Peng, C Ding, T Geng, and et al., “Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs”, arXiv 2311.04417, 2023
209. Zeng, F., Gan, W., Wang, Y., and et al. “Large language models for robotics: A survey”. arXiv 2311.07226, 2023
210. Y Huang, Y Chen, Z Li, “Applications of Large Scale Foundation Models for Autonomous Driving” (survey), arXiv 2311.12144, 2023
211. J. Huang, S. Yong, X. Ma, and et al., “An embodied generalist agent in 3d world” (LEO), arXiv 2311.12871, 2023
212. X Xiao, J Liu, Z Wang, and et al., “Robot Learning in the Era of Foundation Models: A Survey”, arXiv 2311.14379, 2023
213. Y. Chen, W. Cui, Y. Chen, and et al., “RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks,” arXiv 2311.15649, 2023.
214. N Shafiullah, A Rai, H Etukuru, and et al. “On bringing robots home” (Dobb·E/Stick v1/HoNY). arXiv 2311.16098, 2023.
215. Y. Hu, F Lin, T Zhang, L Yi, and Y Gao, “Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning” (ViLa), arXiv 2311.17842, 2023.
216. K Grauman, A Westbury, L Torresani, and et al. “Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives”, arXiv 2311.18259, 2023
217. Javaheripi M, Bubeck S, Abdin M, et al. “Phi-2: the surprising power of small language models”. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023
218. Y Song, E Byrne, T Nagarajan, and et al. “Ego4D goal-step: Toward hierarchical understanding of procedural activities”. NeurIPS, 2023
219. I Leal, K Choromanski, D Jain, and et al., “SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention”, arXiv 2312.01990, 2023
220. R Firoozi, J Tucker, S Tian, and et al., “Foundation Models in Robotics: Applications, Challenges, and the Future” (review), arXiv 2312.07843, 2023
221. Y Hu, Q Xie, V Jain, and et al. “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis”, arXiv 2312.08782, 2023
222. P Wang, L Li, Z Shao, and et al. “Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations”. arXiv 2312.08935, 2023
223. Team, G., Anil, R., Borgeaud, S., and et al. “Gemini: a family of highly capable multimodal models”. arXiv 2312.11805, 2023.
224. H Wu, Y Jing, C Cheang, and et al., “GR-1: Unleashing Large-Scale Video Generative Pre-Training For Visual Robot Manipulation”, arXiv 2312.13139, 2023
225. P Ding, H Zhao, W Song, and et al., “QUAR-VLA: Vision-Language-Action Model for Quadruped Robots”, arXiv 2312.14457, 2023
226. Mistral AI, “Mixtral of experts: A high quality Sparse Mixture-of-Experts”, https://mistral.ai/news/mixtral-of-experts/, Dec. 2023
227. C Wen, X Lin, J So, and et al., “Any-point Trajectory Modeling for Policy Learning” (ATM), arXiv 2401.00025, 2024
228. Z Fu, T Z Zhao, and C Finn, “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation”, arXiv 2401.02117, 2024
229. Y Cheng, C Zhang, Z Zhang, and et al. “Exploring Large Language Model Based Intelligent Agents: Definitions, Methods, and Prospects” (survey), arXiv 2401.03428, 2024
230. G Chen and W Wang, “A Survey on 3D Gaussian Splatting”, arXiv 2401.03890, 2024
231. T Cai, Y Li, Z Geng, and et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads”, arXiv 2401.10774, 2024
232. B Chen, Z Xu, S Kirmani, and et al. “SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities”, arXiv 2401.12168, 2024
233. M Ahn, D Dwibedi, C Finn, and et al., “AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents”, arXiv 2401.12963, 2024
234. Ming, R., Huang, Z., Ju, Z., and et al. “A survey on video prediction: From deterministic to generative approaches”. arXiv 2401.14718, 2024.
235. LLaMA.cpp, LLM inference in C/C++, https://github.com/ggerganov/llama.cpp, Jan. 2024
236. S. Le Cleac’h, T. A. Howell, S. Yang, and et al, “Fast contact-implicit model predictive control” (Unitree Go1), IEEE Transactions on Robotics, Jan. 2024.
237. X Yan, J Xu, Y Huo, H Bao, “Neural Rendering and Its Hardware Acceleration: A Review”, arXiv 2402.00028, 2024
238. Z Xu, K Wu, J Wen, and et al. “A Survey on Robotics with Foundation Models: toward Embodied AI”, arXiv 2402.02385, 2024
239. Z Wang, Y Li, Y Wu, and et al. “Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision” (MiPS). arXiv 2402.02658, 2024b.
240. X Huang, W Liu, X Chen, and et al. “Understanding the planning of LLM agents: A survey”, arXiv 2402.02716, 2024
241. G Paolo, J G-Billandon, B Kegl, “A Call for Embodied AI”, arXiv 2402.03824, 2024
242. K Kawaharazuka, T Matsushima, A Gambardella, and et al. “Real-World Robot Applications of Foundation Models: A Review”, arXiv 2402.05741, 2024
243. S Minaee, T Mikolov, N Nikzad, and et al. “Large Language Models: A Survey”, arXiv 2402.06196, 2024
244. C Eze, C Crick. “Learning by watching: A review of video-based learning approaches for robot manipulation”. arXiv 2402.07127, 2024
245. B Fei, J Xu, R Zhang, and et al., “3D Gaussian as A New Vision Era: A Survey”, arXiv 2402.07181, 2024
246. G Yenduri, Ramalingam M, P Maddikunta, and et al., “Spatial Computing: Concept, Applications, Challenges and Future Directions”, arXiv 2402.07912, 2024
247. C Chi, Z Xu, C Pan, and et al., “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots” (UMI), arXiv 2402.10329, 2024
248. Z Tan, A Beigi, S Wang, and et al. “Large Language Models for Data Annotation: A Survey”, arXiv 2402.13446, 2024
249. P Gao, P Wang, F Gao, et al. “Vision-Language Navigation with Embodied Intelligence: A Survey”, arXiv 2402.14304, 2024
250. S Yang, J Walker, J Parker-Holder and et al. “Video as the new language for real-world decision making”, arXiv 2402.17139, 2024
251. Y Liu, J Cao, C Liu, and et al., “Datasets for Large Language Models: A Comprehensive Survey”, arXiv 2402.18041, 2024
252. OpenAI Sora, “Video generation models as world simulators”, https://openai.com/index/video-generation-models-as-world-simulators/, Feb. 2024
253. Y Park and P Agrawal. “Using apple vision pro to train and control robots” (VisionProTeleop), https://github.com/Improbable-AI/VisionProTeleop, 2024
254. S. Belkhale, T. Ding, T. Xiao, “RT-H: Action hierarchies using language,” arXiv 2403.01823, Mar. 2024
255. S Lee, Y Wang, H Etukuru, and et al., “Behavior Generation with Latent Actions” (VQ-BeT), arXiv 2403.03181, 2024
256. Ze Y, Zhang G, Zhang K, et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations”. arXiv 2403.03954, 2024
257. M Luo, Z Xue, A Dimakis, K Grauman, “Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos”, arXiv 2403.06351, 2024
258. A Iyer, Z Peng, Y Dai, and et al. “Open Teach: A versatile teleoperation system for robotic manipulation”. arXiv 2403.07870, 2024.
259. Google Gemma Team, “Gemma: Open Models Based on Gemini Research and Technology”, arXiv 2403.08295, 2024
260. Li, C., Zhang, R., Wong, J., and et al. “Behavior-1k: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation”, arXiv 2403.09227, 2024
261. H. Zhen, X Qiu, P Chen, and et al., “3D-VLA: 3d vision-language-action generative world model,” arXiv:2403.09631, 2024.
262. T Wu, Y Yuan, L Zhang, and et al. “Recent Advances in 3D Gaussian Splatting” (review), arXiv 2403.11134, 2024
263. C Wang, H Shi, W Wang, and et al., “DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation”, arXiv 2403.07788, 2024
264. C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel. “HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation”. arXiv 2403.10506, 2024.
265. J F. Mullen Jr, D Manocha, “LAP, Using Action Feasibility for Improved Uncertainty Alignment of Large Language Model Planners”, arXiv 2403.13198, 2024
266. Y Huang, G Chen, J Xu, et al. “EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World”, arXiv 2403.16182, 2024
267. “AI Power: Accurate Models at Blazing Speeds | SambaNova”, https://sambanova.ai/blog/accurate-models-at-blazing-speed, Samba COE v0.2, March, 2024
268. Unitree humanoid H1, https://kr-asia.com/unitree-robotics-develops-personal-robot-dogs-that-jog-alongside-you, Mar. 2024
269. A. Khazatsky, K. Pertsch, S. Nair, “Droid: A large-scale in-the-wild robot manipulation dataset”, arXiv 2403.12945, 2024
270. S Zhou, Y Du, J Chen, and et al. “RoboDreamer: Learning compositional world models for robot imagination”, arXiv 2404.12377, 2024
271. ALOHA 2 Team, and et al, “ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation”, arXiv 2405.02292, 2024
272. J W Kim, T Z. Zhao, S Schmidgall, and et al., “Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks”, arXiv 2407.12998, 2024
273. Microsoft, “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone”, arXiv 2404.14219, 2024
274. S. Shin, J. Kim, G.-C. Kang, and et al., “Socratic planner: Inquiry-based zero-shot planning for embodied instruction following,” arXiv 2404.15190, 2024.
275. Y Xia, R Wang, X Liu, and et al., “Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs”, arXiv 2404.15676, 2024
276. R Xu, S Yang, Y Wang, and et al., “Visual Mamba: A Survey and New Outlooks”, arXiv 2404.18861, 2024
277. R McCarthy, D Tan, D Schmidt, and et al. “Towards Generalist Robot Learning from Internet Video: A Survey”, arXiv 2404.19664, 2024
278. R Cadene, S Alibert, A Soare, and et al., https://github.com/huggingface/lerobot (LeRobot), May, 2024
279. G. Wang, L. Pan, S. Peng, and et al., “NeRF in robotics: A survey,” arXiv 2405.01333, 2024.
280. M Dalal, T Chiruvolu, D Chaplot, and R Salakhutdinov. “Plan-Seq-Learn: Language model guided rl for solving long horizon robotics tasks” (PSL), arXiv 2405.01534, 2024
281. A Dalal, D Hagen, K Robbersmyr, and et al. “Gaussian Splatting: 3D Reconstruction and Novel View Synthesis, a Review”, arXiv 2405.03417, 2024
282. Z Zhu, X Wang, W Zhao, and et al. “Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond”, arXiv 2405.03520, 2024
283. X Li, K Hsu, J Gu, and et al., “Evaluating Real-World Robot Manipulation Policies in Simulation” (SIMPLER), arXiv 2405.05941, 2024
284. K F Gbagbe, M A Cabrera, A Alabbas, and et al., “Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations”, arXiv 2405.06039, 2024
285. Y Huang, “Levels of AI Agents: from Rules to Large Language Models”, arXiv 2405.06643, May, 2024
286. R Prabhakar, R Sivaramakrishnan, D Gandhi, and et al., “SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts”, arXiv 2405.07518, 2024
287. Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, and et al. “Octo: An open-source generalist robot policy”, arXiv 2405.12213, 2024
288. Y Ma, Z Song, Y Zhuang, and et al. “A Survey on Vision-Language-Action Models for Embodied AI”, arXiv 2405.14093, 2024
289. Zhang Y, Yang S, Bai C J, et al. “Towards efficient LLM grounding for embodied multi-agent collaboration” (ReAd). arXiv 2405.14314, 2024
290. F Bordes, R Y Pang, A Ajay, and et al. “An Introduction to Vision-Language Modeling”, arXiv 2405.17247, 2024
291. T Zhang, D Li, Y Li, and et al., “Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks” (BRMData), arXiv 2405.18860, 2024
292. Fei-Fei Li, “With Spatial Intelligence, Artificial Intelligence Will Understand the Real World”, https://www.youtube.com/watch?v=y8NtMZ7VGmU, May, 2024
293. OpenAI GPT-4o, https://openai.com/index/hello-gpt-4o/, May, 2024
294. J. Liu, M. Liu, Z. Wang, and et al., “RoboMamba: Multimodal state space model for efficient robot reasoning and manipulation,” arXiv 2406.04339, 2024
295. L Luo, Y Liu, R Liu, and et al. “Improve mathematical reasoning in language models by automated process supervision” (OmegaPRM). arXiv 2406.06592, 2024.
296. A. Szot, B Mazoure, H Agrawal, and et al., “Grounding multimodal large language models in actions” (Grounding-RL), arXiv 2406.07904, 2024.
297. A Goyal, V Blukis, J Xu, and et al. “RVT-2: Learning Precise Manipulation from Few Demonstrations”. arXiv 2406.08545, 2024
298. T He, Z Luo, X He, and et al., “OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Tele-operation and Learning”, arXiv 2406.08858, 2024
299. M J Kim, K Pertsch, S Karamcheti, and et al., “OpenVLA: An Open-Source Vision-Language-Action Model”, arXiv 2406.09246, 2024
300. W Cai, J Jiang, F Wang, and et al., “A Survey on Mixture of Experts”, arXiv 2407.06204, 2024
301. Z Fu, Q Zhao, Q Wu, G Wetzstein, and C Finn. “HumanPlus: Humanoid shadowing and imitation from humans”. arXiv 2406.10454, 2024.
302. D Niu, Y Sharma, G Biamby, and et al, “LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning”, arXiv 2406.11815, 2024
303. P Mazzaglia, T Verbelen, B Dhoedt, and et al., “GenRL: Multimodal-foundation world models for generalization in embodied agents”, arXiv 2406.18043, 2024
304. B Pei, G Chen, J Xu, and et al. “EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation”, arXiv 2406.18070, 2024
305. Isaac Ong, Amjad Almahairi, V Wu, and et al., “RouteLLM: Learning to Route LLMs with Preference Data”, arXiv 2406.18665, 2024
306. X Mai, Z Tao, J Lin, and et al. “From Efficient Multimodal Models to World Models: A Survey”, arXiv 2407.00118, 2024
307. X Cheng, J Li, S Yang, G Yang, and X Wang. “Open-Television: Teleoperation with immersive active visual feedback”, arXiv 2407.01512, 2024
308. I Georgiev, V Giridhar, N Hansen, A Garg, “PWM: Policy Learning with Large World Models”, arXiv 2407.02466, 2024
309. R Ding, Y Qin, J Zhu, and et al, “Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning”, arXiv 2407.03162, 2024
310. Y Liu, W Chen, Y Bai and et al. “Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI”, arXiv 2407.06886, 2024
311. L Zheng, F Yan, F Liu, and et al., “RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios”, arXiv 2407.06951, 2024
312. N Chernyadev, N Backshall, X Ma, and et al., “BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark”, arXiv 2407.07788, 2024
313. A Lee, I Chuang, L-Y Chen, I Soltani, “InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation”, arXiv 2409.07914, 2024
314. W Wu, H He, Y Wang, and et al., “MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces”, arXiv 2407.08725, 2024
315. H Ha, Y Gao, Z Fu, J Tan, and S Song, “UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers”, arXiv 2407.10353, 2024
316. H Wang, J Chen, W Huang, and et al., “GRUtopia: Dream General Robots in a City at Scale”, arXiv 2407.10943, 2024
317. Y Bao, T Ding, J Huo, and et al. “3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities”, arXiv 2407.17418, 2024
318. Llama Team, Meta AI, “The Llama 3 Herd of Models”, arXiv 2407.21783, 2024
319. Y Wu, Z Sun, S Li, S Welleck, Y Yang. “Inference Scaling Laws: An Empirical Analysis of Compute-optimal Inference For LLM Problem-solving” (REBASE), arXiv 2408.00724, 2024
320. H Qu, L Ning, R An, and et al., “A Survey of Mamba”, arXiv 2408.01129, 2024
321. K Maeda, T Hirasawa, A Hashimoto, and et al. “COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark”, arXiv 2408.02272, 2024
322. C Snell, J Lee, K Xu, and A Kumar. “Scaling LLM test-time compute optimally can be more effective than scaling model parameters”. arXiv 2408.03314, 2024.
323. Z Fang, M Yang, W Zeng, and et al., “Egocentric Vision Language Planning” (EgoPlan), arXiv 2408.05802, 2024
324. H Arai, K Miwa, K Sasaki, and et al., “CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving”, arXiv 2408.10845, 2024
325. Z Wang, H Zheng, Y Nie, and et al., “All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents” (ARIO). arXiv 2408.10899, 2024
326. Y Zheng, L Yao, Y Su, and et al., “A Survey of Embodied Learning for Object-Centric Robotic Manipulation”, arXiv 2408.11537, 2024
327. S Yang, M Liu, Y Qin, and et al. “ACE: A cross-platform visual-exoskeletons system for low-cost dexterous teleoperation”, arXiv 2408.11805, 2024
328. R Doshi, H Walke, O Mees, S Dasari, S Levine, “Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation” (CrossFormer), arXiv 2408.11812, 2024
329. Figure 02, https://techcrunch.com/2024/08/06/figures-new-humanoid-robot-leverages-openai-for-natural-speech-conversations/, Aug. 2024
330. Y. Yang, F.-Y. Sun, L. Weihs, and et al., “Holodeck: Language guided generation of 3d embodied AI environments,” IEEE/CVF CVPR, 2024
331. Y. Yang, B. Jia, P. Zhi, and S. Huang, “Physcene: Physically interactable 3d scene synthesis for embodied AI,” IEEE/CVF CVPR, 2024
332. H Etukuru, N Naka, Z Hu, and et al., “Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments” (Stick-v2/RUM), arXiv 2409.05865, 2024
333. K Li, S M Wagh, N Sharma, and et al., “Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR”, arXiv 2409.11925, 2024
334. A Yang, B Zhang, B Hui, and et al. “Qwen2.5-math technical report: Toward mathematical expert model via self-improvement”. arXiv 2409.12122, 2024.
335. J Wen, Y Zhu, J Li, and et al., “TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation”, arXiv 2409.12514, 2024
336. A Anwar, J Welsh, J Biswas, S Pouya, Y Chang, “ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation”, arXiv 2409.13682, 2024
337. I Chuang, A Lee, D Gao, I Soltani, “Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation” (AV-ALOHA), arXiv 2409.17435, 2024
338. Z Wu, T Wang, Z Zhuoma, and et al., “Fast-UMI: A Scalable and Hardware-Independent Universal Manipulation Interface”, arXiv 2409.19499, 2024
339. OpenAI o1, “Learning to reason with LLMs”. https://openai.com/index/learning-to-reason-with-llms, 2024.
340. World Labs, an AI company for spatial intelligence, https://www.worldlabs.ai/, Sep. 2024
341. C-L Cheang, G Chen, Y Jing, and et al., “GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation”, ByteDance Research, Tech. Report, arXiv 2410.06158, Oct., 2024
342. J Wang, M Fang, Z Wan, and et al., “OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models”, Tech. Report, arXiv 2410.09671, Oct. 2024
343. S Tao, F Xiang, A Shukla, and et al., “ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI”, arXiv 2410.00425, 2024
344. P Hua, M Liu, A Macaluso, and et al., “GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs”, arXiv 2410.03645, 2024
345. S Liu, L Wu, B Li, and et al.,“RDT-1B: a Diffusion Foundation Model For Bimanual Manipulation”, arXiv 2410.07864, 2024
346. S Chen, C Wang, K Nguyen, Li F-F, C. K Liu, “ARCap: Collecting High-quality Human Demonstrations for Robot Learning with Augmented Reality Feedback”, arXiv 2410.08464, 2024
347. D Su, S Sukhbaatar, M Rabbat, Y Tian, Q Zheng, “Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces”, arXiv 2410.09918, 2024
348. S Dasari, O Mees, S Zhao, M K Srirama, S Levine, “The Ingredients for Robotic Diffusion Transformers” (DiT-Block Policy), arXiv 2410.10088, 2024
349. Y Ze, Z Chen, W Wang, and et al., “Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies” (iDP3), arXiv 2410.10803, 2024
350. T Z. Zhao, J Tompson, D Driess, and et al., “ALOHA Unleashed: A Simple Recipe for Robot Dexterity”, arXiv 2410.13126, 2024
351. S Zhu, G Wang, D Kong, H Wang, “3D Gaussian Splatting in Robotics: A Survey”, arXiv 2410.12262, 2024
352. B Han, J Kim, and J Jang, “A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM” (DP-VLA), arXiv 2410.15549, 2024
353. Y Zhang, Z Li, M Zhou, S Wu, Jiajun Wu, “The Scene Language: Representing Scenes with Programs, Words, and Embeddings” (SL-DSL), arXiv 2410.16770, 2024
354. Y Yue, Y Wang, B Kang, and et al., “DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution”, arXiv 2411.02359, 2024
355. S Nasiriany, S Kirmani, T Ding, and et al., “RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation”, arXiv 2411.02704, 2024
356. Y Chen, C Wang, Y Yang, C. Liu, “Object-Centric Dexterous Manipulation from Human Motion Data” (OCDM), arXiv 2411.04005, 2024
357. S Zhao, X Zhu, Y Chen, and et al., “DexH2R: Task-oriented Dexterous Manipulation from Human to Robots”, arXiv 2411.04428, 2024
358. Z Zhang, R Chen, J Ye, and et al., “WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making”, arXiv 2411.05619, 2024
359. K Shaw, Y Li, J Yang, and et al., “Bimanual Dexterity for Complex Tasks” (BiDex), arXiv 2411.13677, 2024
360. X Wang, L Horrigan, J Pinskier, and et al., “DexGrip: Multi-modal Soft Gripper with Dexterous Grasping and In-hand Manipulation Capacity”, arXiv 2411.17124, 2024
361. Z Liang, Y Mu, Y Wang, and et al., “DexDiffuser: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation”, arXiv 2411.18562, 2024
362. Anthropic, https://www.anthropic.com/news/3-5-models-and-computer-use, Nov. 2024