General embodied AI technology holds significant industrial potential worldwide and has attracted hundreds of billions of dollars in investment over the past two years. However, for intelligent robots to genuinely address real-world human challenges, they must comprehend and predict complex embodied dynamics and generate control signals in complicated scenarios. Xu Zhuo has been dedicated to this field for over a decade and achieved several breakthroughs in 2024.
A key goal in robotics is to create intelligent agents that can understand multimodal instructions and provide meaningful assistance. To this end, he leveraged the long-context understanding capabilities of vision-language models to extract high-level goals from demonstration videos, while a topological graph constructed from the same video handles low-level execution. This approach enables novel user-interactive behaviors and achieves end-to-end success on real-world tasks specified by multimodal instructions.
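To make the two-level design concrete, here is a minimal Python sketch under stated assumptions: a long-context VLM call (stubbed as `query_vlm`) selects a goal frame from the demonstration video, and a topological graph built over the same frames is searched for a low-level path to that goal. Every name here (`query_vlm`, `frame_similarity`, the edge threshold) is a hypothetical placeholder for illustration, not the actual system.

```python
import networkx as nx

def query_vlm(frames, instruction):
    """Placeholder for a long-context VLM call: given all frames of a
    demonstration video plus a user instruction, return the index of the
    frame that best matches the high-level goal. Stubbed for the sketch."""
    return len(frames) - 1  # stub: pretend the goal is the final frame

def build_topological_graph(frames, frame_similarity, threshold=0.8):
    """Connect temporally adjacent frames, plus any non-adjacent pair whose
    visual similarity exceeds a threshold (loop-closure edges)."""
    g = nx.Graph()
    g.add_nodes_from(range(len(frames)))
    for i in range(len(frames) - 1):
        g.add_edge(i, i + 1)                      # temporal edge
    for i in range(len(frames)):
        for j in range(i + 2, len(frames)):
            if frame_similarity(frames[i], frames[j]) > threshold:
                g.add_edge(i, j)                  # loop-closure edge
    return g

def plan(frames, instruction, current_node, frame_similarity):
    goal = query_vlm(frames, instruction)          # high level: VLM picks goal
    graph = build_topological_graph(frames, frame_similarity)
    return nx.shortest_path(graph, current_node, goal)  # low level: graph search

# Toy usage with placeholder frames that are never similar to each other.
frames = [f"frame_{i}" for i in range(10)]
waypoints = plan(frames, "go to where the red chair was", 0,
                 frame_similarity=lambda a, b: 0.0)
print(waypoints)  # [0, 1, ..., 9]
```

The split matters: the expensive VLM is queried once for the goal, while the cheap graph search handles step-by-step execution.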
Foundation vision-language models pretrained on internet data often lack essential embodied reasoning skills, such as recognizing spatial relationships and estimating object sizes. To address this, he developed a method for training vision-language models on internet-scale synthetic spatial reasoning data.
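As an illustration of what such synthetic spatial reasoning data can look like, the following hedged sketch assumes objects have already been lifted into a metric 3D camera frame (e.g., via monocular depth estimation) and fills simple question templates about distances, sizes, and left/right relations. The `Object3D` class, templates, and scene contents are illustrative assumptions, not the actual data pipeline.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class Object3D:
    name: str
    center: tuple  # (x, y, z) in meters, camera frame
    size: float    # longest bounding-box edge, in meters

def distance_qa(a: Object3D, b: Object3D):
    d = math.dist(a.center, b.center)  # Euclidean distance in 3D
    return (f"How far apart are the {a.name} and the {b.name}?",
            f"About {d:.1f} meters.")

def size_qa(a: Object3D):
    return (f"How large is the {a.name}?",
            f"Its longest side is roughly {a.size:.1f} meters.")

def relation_qa(a: Object3D, b: Object3D):
    rel = "left of" if a.center[0] < b.center[0] else "right of"
    return (f"Is the {a.name} left or right of the {b.name}?",
            f"The {a.name} is to the {rel} the {b.name}.")

def generate_qa(objects, n=5, seed=0):
    """Sample n question-answer pairs from the template set."""
    rng = random.Random(seed)
    templates = [lambda: distance_qa(*rng.sample(objects, 2)),
                 lambda: size_qa(rng.choice(objects)),
                 lambda: relation_qa(*rng.sample(objects, 2))]
    return [rng.choice(templates)() for _ in range(n)]

# Toy scene with two hypothetical detections.
scene = [Object3D("chair", (0.4, 0.0, 2.1), 0.9),
         Object3D("table", (-0.6, 0.1, 2.4), 1.5)]
for q, a in generate_qa(scene, n=3):
    print(q, "->", a)
```

Because the questions and answers are generated programmatically from 3D geometry rather than written by annotators, a pipeline of this shape can scale to internet-sized image collections.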
He was also a core contributor to several landmark projects in embodied AI, including Open X-Embodiment and Gemini Robotics. Open X-Embodiment was the first to demonstrate positive transfer from diverse robotic manipulation data, earning the Best Paper Award at ICRA 2024. Gemini Robotics marked a significant milestone in the progress of vision-language-action (VLA) models toward dexterous and generalizable manipulation.