Robotics | VideoMimic: Translation and Commentary on "Visual Imitation Enables Contextual Humanoid Control"
Overview: This paper introduces VideoMimic, a real-to-sim-to-real pipeline based on visual imitation for training humanoid robots to perform context-dependent whole-body motions. By mining everyday videos of human motion, the method jointly reconstructs the humans and their environment and learns a robot control policy that adapts to the environment and global commands. Experiments show that VideoMimic lets humanoid robots master complex skills such as climbing stairs and sitting down and standing up, demonstrating its potential to improve robots' environmental adaptability and generalization.
>> Background and pain points: Teaching humanoid robots complex tasks such as climbing stairs or sitting on chairs currently requires extensive programming and training effort. Traditional control methods struggle with environmental variation and task diversity, lack an understanding of environmental context, and therefore fail to adapt flexibly to new environments or generalize across tasks.
>> Proposed solution: The paper presents VideoMimic, a real-to-sim-to-real pipeline that learns humanoid control policies from everyday video data. It mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies that let humanoid robots perform the corresponding skills.
>> Core pipeline: VideoMimic proceeds in the following steps (a minimal code sketch follows the list):
● Data collection: Gather videos of people performing everyday actions, such as climbing stairs or sitting on chairs.
● Joint reconstruction of humans and environment: Process the videos to reconstruct both the human motion and a 3D model of the surrounding scene.
● Policy learning: From the reconstructed scene and human motion data, learn a policy that drives a humanoid robot to perform the corresponding motion, conditioned on the environment and a global root command.
● Sim-to-real transfer: Test and refine the learned policy in simulation before deploying it on a real humanoid robot.
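The sketch below (Python) lays out this four-stage data flow end to end. It is a minimal, hypothetical illustration: every function and class name is a placeholder standing in for the corresponding stage, not the authors' released API.

```python
# Minimal sketch of a VideoMimic-style real-to-sim-to-real pipeline.
# All names are hypothetical placeholders used only to show the data flow.
from dataclasses import dataclass
from typing import List


@dataclass
class Reconstruction:
    human_trajectory: list   # per-frame human poses in the world frame
    scene_mesh: object       # gravity-aligned environment mesh


def reconstruct_human_and_scene(video_path: str) -> Reconstruction:
    """Jointly recover 4D human motion and scene geometry from a monocular video."""
    # Placeholder: a real implementation would run the perception module here.
    return Reconstruction(human_trajectory=[], scene_mesh=None)


def retarget_to_humanoid(recon: Reconstruction) -> list:
    """Map the human motion onto the humanoid's kinematics with feasibility constraints."""
    return recon.human_trajectory  # placeholder


def train_tracking_policy(references: List[list], terrains: List[object]) -> object:
    """Goal-conditioned, DeepMimic-style RL that tracks references on their terrains."""
    return "tracking_policy"  # placeholder


def distill_and_finetune(tracking_policy: object) -> object:
    """Distill to a policy that only sees proprioception, a height-map, and root direction."""
    return "deployable_policy"  # placeholder


def videomimic_pipeline(video_paths: List[str]) -> object:
    recons = [reconstruct_human_and_scene(p) for p in video_paths]
    references = [retarget_to_humanoid(r) for r in recons]
    terrains = [r.scene_mesh for r in recons]
    tracking_policy = train_tracking_policy(references, terrains)
    return distill_and_finetune(tracking_policy)


if __name__ == "__main__":
    policy = videomimic_pipeline(["stairs.mp4", "chair.mp4"])
    print(policy)
```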
>> Advantages: The main strengths of VideoMimic are:
● Simplicity: Only everyday videos of human motion are needed; no complex programming or manual annotation is required.
● Scalability: The pipeline handles a wide range of environments and tasks.
● Robustness: The learned policy runs stably and reliably in real-world environments.
● Context awareness: The policy is conditioned on the environment and a global root command, enabling context-aware control.
>> Conclusions and takeaways:
● Experiments on a real humanoid robot demonstrate the effectiveness of VideoMimic: the robot performs a variety of complex whole-body motions, such as climbing stairs and sitting on chairs, all from a single policy conditioned on the environment and global root commands.
● VideoMimic offers a scalable path towards teaching humanoid robots to operate in diverse real-world environments.
● The work shows that visual imitation learning can effectively transfer human skills to humanoid robots, enabling more flexible and intelligent robot control.
Contents
Translation and Commentary on "Visual Imitation Enables Contextual Humanoid Control"
Abstract
Figure 1
1. Introduction
Conclusion
Translation and Commentary on "Visual Imitation Enables Contextual Humanoid Control"
Link | Paper: [2505.03729] Visual Imitation Enables Contextual Humanoid Control (https://arxiv.org/abs/2505.03729) |
Date | May 6, 2025 (latest revision: May 7, 2025) |
Authors | UC Berkeley |
Abstract
How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them—casually capture a human motion video and feed it to humanoids. We introduce VideoMimic, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills—all from a single policy, conditioned on the environment and global root commands. VideoMimic offers a scalable path towards teaching humanoids to operate in diverse real-world environments.
Figure 1: VideoMimic is a real-to-sim-to-real pipeline that converts monocular videos into transferable humanoid skills, letting robots learn context-aware behaviors (terrain-traversing, stairs-climbing, sitting) in a single policy. Video results are available on our webpage: https://videomimic.net.
1. Introduction
How do we learn to interact with the world around us—like sitting on a chair or climbing a staircase? We watch others perform these actions, try them ourselves, and gradually build up the skill. Over time, we can handle new chairs and staircases, even if we have not seen those exact ones before. If humanoid robots could learn in this way—by observing everyday videos—they could acquire diverse contextual whole-body skills without relying on hand-tuned rewards or motion-capture data for each new behavior and environment. We refer to this ability to execute environment-appropriate actions as contextual control.

We introduce VideoMimic, a real-to-sim-to-real pipeline that turns monocular videos—such as casual smartphone captures—into transferable skills for humanoids. From these videos, we jointly recover the 4D human-scene geometry, retarget the motion to a humanoid, and train an RL policy to track the reference trajectories. We then distill the policy into a single unified policy that observes only proprioception, a local height-map, and the desired root direction. This distilled policy outputs low-level motor actions conditioned on the terrain and body state, allowing it to execute appropriate behaviors—such as stepping, climbing, or sitting—across unseen environments without explicit task labels or skill selection.
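To make the deployment interface concrete, here is a minimal PyTorch sketch of such a distilled policy: it consumes only proprioception, a local height-map patch, and the desired root direction, and emits low-level joint targets. The network size, input dimensions, and names are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of the distilled policy's observation/action interface.
# Dimensions and network sizes are illustrative assumptions only.
import torch
import torch.nn as nn

NUM_JOINTS = 23                      # assumed DoF count (illustrative)
PROPRIO_DIM = 3 * NUM_JOINTS + 10    # e.g. joint pos/vel/prev-action + base state
HEIGHTMAP_CELLS = 11 * 11            # 11x11 height-map patch centered on the torso
ROOT_CMD_DIM = 3                     # vector to the goal in the robot's local frame


class DistilledPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        obs_dim = PROPRIO_DIM + HEIGHTMAP_CELLS + ROOT_CMD_DIM
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, NUM_JOINTS),  # low-level action: target joint positions
        )

    def forward(self, proprio, heightmap, root_cmd):
        # All three inputs are available on real hardware, so the same policy
        # can run at deployment without privileged simulation state.
        obs = torch.cat([proprio, heightmap.flatten(1), root_cmd], dim=-1)
        return self.net(obs)


if __name__ == "__main__":
    policy = DistilledPolicy()
    action = policy(torch.zeros(1, PROPRIO_DIM),
                    torch.zeros(1, 11, 11),
                    torch.zeros(1, ROOT_CMD_DIM))
    print(action.shape)  # torch.Size([1, 23])
```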
We develop a perception module that reconstructs 3D human motion from a monocular RGB video, along with aligned scene point clouds in the world coordinate frame. We convert the point clouds into meshes and align them with gravity to ensure compatibility with physics simulators. The global motion and local poses are retargeted to a humanoid with constraints that ensure physical plausibility, accounting for the embodiment gap. The mesh and retargeted data seed a goal-conditioned DeepMimic [1]-style reinforcement-learning phase in simulation: we warm-start on MoCap data, then train a single policy to track motions from multiple videos in their respective height-mapped environments while randomizing mass, friction, latency, and sensor noise for robustness. Once our tracking policy is trained, we distill it using DAgger [2] to a policy that operates without conditioning on target joint angles. The new policy observes proprioception, an 11 × 11 height-map patch centered on the torso, and the vector to the goal in the robot’s local reference frame. PPO fine-tuning under this reduced observation set yields a generalist controller that, given height-map and root direction at test time, selects and smoothly executes context-appropriate actions such as stepping, climbing, or sitting. In particular, every step of our policy relies only on observations available at real-world deployment, making it immediately runnable on real hardware.

Our approach bridges 4D video reconstruction and robot skill learning in a single, data-driven loop. Unlike earlier work that recovers only the person or the scene in isolation, we jointly reconstruct both at a physically meaningful scale and represent them as meshes and motion trajectories suitable for physics-based policy learning. We train our approach on 123 monocular RGB videos, which will be released. We validate the approach through deployment on a real Unitree G1 robot, which shows generalized humanoid motor skills in the context of surrounding environments, even on unseen environments. We will release the reconstruction code, policy training framework, and the video dataset to facilitate future research.
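The following is a minimal sketch of the DAgger-style distillation step described above: the teacher tracking policy (which still sees the reference joint targets) labels the states visited by the student, which sees only deployment-time observations; the paper then further fine-tunes the student with PPO under this reduced observation set. The simulator stub, dimensions, and names here are hypothetical placeholders, not the paper's implementation.

```python
# Minimal DAgger-style distillation sketch: the student is rolled out, and the
# teacher supervises its actions on the visited states. All dimensions and the
# simulator stub are illustrative placeholders.
import torch
import torch.nn as nn

STUDENT_OBS, TEACHER_OBS, ACT = 160, 220, 23  # illustrative dimensions

teacher = nn.Sequential(nn.Linear(TEACHER_OBS, 256), nn.ELU(), nn.Linear(256, ACT))
student = nn.Sequential(nn.Linear(STUDENT_OBS, 256), nn.ELU(), nn.Linear(256, ACT))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)


def simulate_step(action):
    """Placeholder simulator: returns the next student/teacher observations."""
    return torch.randn(1, STUDENT_OBS), torch.randn(1, TEACHER_OBS)


student_obs = torch.zeros(1, STUDENT_OBS)
teacher_obs = torch.zeros(1, TEACHER_OBS)
for _ in range(1000):
    action = student(student_obs)                  # roll out with the student
    with torch.no_grad():
        target = teacher(teacher_obs)              # teacher labels the same state
    loss = nn.functional.mse_loss(action, target)  # imitate the teacher's action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    student_obs, teacher_obs = simulate_step(action.detach())
```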
Conclusion
We introduced VideoMimic, a real-to-sim-to-real pipeline that converts everyday human videos into environment-conditioned control policies for humanoids. The system (i) reconstructs humans and surrounding geometry from monocular clips, (ii) retargets the motion to a kinematically feasible humanoid, and (iii) uses the recovered scene as task terrain for dynamics-aware RL. The result is a single policy that delivers robust, repeatable contextual control—e.g., stair ascents/descents and chair sit-stand—all driven only by the environment geometry and a root direction command. VideoMimic offers a scalable path for teaching humanoids contextual skills directly from videos. We expect future work to extend the system to richer human–environment interactions, multi-modal sensor-based context learning, and multi-agent behavior modeling, among other directions.