Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose \mysystem, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, \mysystem reconstructs explicit 4D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, and then retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average success rate of 62.86% across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%.
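The retargeting step described above can be viewed as a per-frame optimization that aligns robot fingertips with the reconstructed human fingertips while keeping the motion temporally smooth. Below is a minimal, hypothetical sketch of this idea; the forward-kinematics stand-in `robot_fingertip_fk`, the joint counts, and the smoothness weight are illustrative placeholders and not the actual \mysystem implementation.

```python
# Hypothetical sketch of human-to-robot hand retargeting via fingertip matching.
# The FK model, joint counts, and weights below are illustrative placeholders.
import numpy as np
from scipy.optimize import minimize

NUM_JOINTS = 16      # assumed robot hand DoF (e.g., LEAP Hand)
NUM_TIPS = 4         # fingertip keypoints used for retargeting

# Stand-in linear "forward kinematics"; a real system would query URDF-based
# kinematics (e.g., pinocchio) for the specific robot hand.
_J = np.random.default_rng(0).standard_normal((NUM_TIPS * 3, NUM_JOINTS)) * 0.02

def robot_fingertip_fk(q: np.ndarray) -> np.ndarray:
    """Map joint angles (NUM_JOINTS,) to fingertip positions (NUM_TIPS, 3)."""
    return (_J @ q).reshape(NUM_TIPS, 3)

def retarget_frame(human_tips, q_prev, q_lower, q_upper, smooth_w=1e-3):
    """Find joint angles whose fingertips match the reconstructed human fingertips."""
    def cost(q):
        pos_err = np.sum((robot_fingertip_fk(q) - human_tips) ** 2)  # alignment
        smooth = smooth_w * np.sum((q - q_prev) ** 2)                # temporal smoothness
        return pos_err + smooth

    res = minimize(cost, q_prev, method="L-BFGS-B",
                   bounds=list(zip(q_lower, q_upper)))
    return res.x

# Example: retarget one frame, warm-started from the previous frame's solution
# so the resulting robot trajectory stays smooth over time.
q_prev = np.zeros(NUM_JOINTS)
lo, hi = -np.ones(NUM_JOINTS), np.ones(NUM_JOINTS)
human_tips = robot_fingertip_fk(0.3 * np.ones(NUM_JOINTS))  # synthetic target
q_star = retarget_frame(human_tips, q_prev, lo, hi)
```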
We visualize the predicted grasps generated by our DRO grasping model, trained on video-reconstructed grasp data. We test on the 18-DoF Inspire Hand in IsaacGym.
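For context, the sketch below shows one way such predicted grasps could be checked for success in Isaac Gym: load a hand and an object, drive the hand to the predicted joint targets, simulate, and test whether the object stays above a height threshold. The asset paths, file names, simulation horizon, and threshold are assumptions for illustration, not the paper's evaluation code.

```python
# Hypothetical grasp-success check in Isaac Gym: command the predicted grasp
# joint targets, simulate, then test whether the object remains above a
# height threshold. Asset paths, filenames, and thresholds are placeholders.
import numpy as np
from isaacgym import gymapi

gym = gymapi.acquire_gym()

sim_params = gymapi.SimParams()
sim_params.up_axis = gymapi.UP_AXIS_Z
sim_params.gravity = gymapi.Vec3(0.0, 0.0, -9.81)
sim = gym.create_sim(0, 0, gymapi.SIM_PHYSX, sim_params)

asset_root = "./assets"                        # assumed asset directory
hand_opts = gymapi.AssetOptions()
hand_opts.fix_base_link = True                 # hand base held fixed in this sketch
hand_opts.default_dof_drive_mode = gymapi.DOF_MODE_POS
hand_asset = gym.load_asset(sim, asset_root, "inspire_hand.urdf", hand_opts)
obj_asset = gym.load_asset(sim, asset_root, "object.urdf", gymapi.AssetOptions())

env = gym.create_env(sim, gymapi.Vec3(-1, -1, 0), gymapi.Vec3(1, 1, 1), 1)
hand_pose, obj_pose = gymapi.Transform(), gymapi.Transform()
obj_pose.p = gymapi.Vec3(0.0, 0.0, 0.1)
hand = gym.create_actor(env, hand_asset, hand_pose, "hand", 0, 0)
obj = gym.create_actor(env, obj_asset, obj_pose, "object", 0, 0)

def grasp_succeeds(grasp_targets, steps=240, height_thresh=0.05):
    """Drive the hand to the predicted grasp, simulate, and check the object's height."""
    gym.set_actor_dof_position_targets(env, hand,
                                       grasp_targets.astype(np.float32))
    for _ in range(steps):
        gym.simulate(sim)
        gym.fetch_results(sim, True)
    body_states = gym.get_actor_rigid_body_states(env, obj, gymapi.STATE_POS)
    return float(body_states["pose"]["p"]["z"][0]) > height_thresh
```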
All real-world robot execution videos are sped up 4×.