Artificial intelligence & robotics

Zhuo Xu

Mitigating the underfitting of foundation models caused by the scarcity of robotic data.

Year Honored
2024

Organization
Google DeepMind

Region
China

Hails From
China
General-purpose embodied AI holds significant industrial potential worldwide and has attracted hundreds of billions of dollars in investment over the past two years. However, for intelligent robots to truly address real-world human challenges, they must comprehend and predict complex embodied dynamics and generate control signals in complicated scenarios. Zhuo Xu has been dedicated to this field for over a decade and made several breakthroughs in 2024.

A key goal in robotics is to create intelligent agents that can understand multimodal instructions and provide meaningful assistance. To this end, Xu leveraged the long-context understanding capabilities of vision-language models to extract high-level goals from demonstration videos, while a topological graph constructed from the same video handles low-level execution. This approach enables novel user-interactive behaviors and achieves end-to-end success on real-world tasks specified by multimodal instructions.
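The split between a high-level goal and a graph that handles low-level execution can be illustrated with a minimal sketch. It is not the method used in Xu's work; it only shows the general idea of a topological graph built from video frames. The function names, the `link_radius` parameter, and the 2D poses are all hypothetical, and the goal frame is assumed to have been chosen already (for example, by a vision-language model).

```python
import math
from collections import deque


def build_topological_graph(poses, link_radius=1.0):
    """Build an undirected graph whose nodes are frame indices.

    Assumption: each frame has an estimated 2D camera position. Edges join
    consecutive frames and any two frames closer than link_radius.
    """
    graph = {i: set() for i in range(len(poses))}
    for i in range(len(poses)):
        for j in range(i + 1, len(poses)):
            if j == i + 1 or math.dist(poses[i], poses[j]) <= link_radius:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the frame indices to traverse, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical positions for six frames of a tour that loops back near its start.
poses = [(0, 0), (2, 0), (4, 0), (4, 2), (2, 2), (0.5, 0.5)]
graph = build_topological_graph(poses)

# Frame 4 stands in for a goal frame selected by the high-level model.
path = shortest_path(graph, start=0, goal=4)
print(path)
```

Because the last frame was recorded close to the first, the graph contains a shortcut, so the planned route does not have to replay the whole video in order.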

Foundation vision-language models pretrained on internet data often lack essential embodied reasoning skills like spatial relationship recognition and size estimation. To address this, he developed a method for training vision-language models with internet-scale synthetic spatial reasoning data.

He was also a core contributor to several landmark projects in embodied AI, including Open X-Embodiment and Gemini Robotics. Open X-Embodiment was the first to demonstrate the positive transfer potential of diverse robotic manipulation data, earning the Best Paper Award at ICRA 2024. Gemini Robotics marked a significant milestone in the progress of vision-language-action (VLA) models toward dexterous and generalizable manipulation.