
Artificial intelligence & robotics

Han HU

Making machines see the visual world in the same way they understand language.

Year Honored

Microsoft Research Asia

Can you imagine machines processing language and understanding images in almost identical ways? Han Hu firmly believes this is achievable and has pursued the goal for years. If achieved, it would likely make it feasible to develop a single general-purpose AI model that handles a wide range of intelligent tasks.

However, natural language processing and computer vision have traditionally relied on different mechanisms, and in particular on very different mainstream neural architectures. The Transformer is the mainstream architecture for natural language processing, while convolutional neural networks (CNNs) have dominated computer vision. Can the same neural network model both fields? Han Hu has placed his bet on the Transformer because of its strong generality and has focused his efforts on adapting it to computer vision problems. This is a challenging task: several initial attempts, by both Han Hu and the Transformer's original authors, failed to produce practical visual Transformers.

The Swin Transformer, which Han Hu proposed in 2021, marked a significant advance in the migration of visual backbone networks from CNNs to vision Transformers. By introducing “hierarchy” and “locality” designs into the original Transformer, along with a new “shifted window” technique, Swin Transformer became well suited to visual signals and efficient to implement. With these innovations, Transformer-based visual architectures for the first time significantly surpassed the previous records held by CNNs on two of the most important benchmarks in computer vision, COCO object detection and ADE20K semantic segmentation.
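
For readers who want a concrete picture of the “locality” and “shifted window” ideas, here is a minimal sketch in plain Python. It is not the Swin Transformer implementation: the function names `window_ids` and `groups` and the 4×4 grid with 2×2 windows are illustrative assumptions, and the sketch leaves out attention itself as well as the masking a real model would need for windows that wrap around the image border. It only shows which grid cells would be grouped together before and after a cyclic shift of half a window.

```python
def window_ids(height, width, window, shift=0):
    """Return a grid where each cell holds the id of the window it falls in.

    With shift > 0 the grid is cyclically rolled first, so cells that sat in
    neighbouring windows under the regular partition can end up sharing one.
    (Hypothetical helper for illustration only.)
    """
    if height % window or width % window:
        raise ValueError("height and width must be multiples of window")
    windows_per_row = width // window
    grid = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Cyclic shift: cell (r, c) is treated as if moved up-left by `shift`.
            rr = (r - shift) % height
            cc = (c - shift) % width
            grid[r][c] = (rr // window) * windows_per_row + (cc // window)
    return grid


def groups(ids):
    """Collect the (row, col) cells that belong to each window id."""
    out = {}
    for r, row in enumerate(ids):
        for c, wid in enumerate(row):
            out.setdefault(wid, []).append((r, c))
    return out


# Regular partition: four non-overlapping 2x2 windows on a 4x4 grid.
regular = window_ids(4, 4, window=2)
# Shifted partition: same window size, grid rolled by half a window.
shifted = window_ids(4, 4, window=2, shift=1)

for name, ids in (("regular", regular), ("shifted", shifted)):
    print(name)
    for row in ids:
        print(" ".join(str(wid) for wid in row))
```

In the regular layout the four central cells belong to four different windows; in the shifted layout they share one. Alternating the two layouts from layer to layer is what lets information cross window boundaries while each layer still attends only within small local windows.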

Swin Transformer received the Best Paper Award (Marr Prize) at the International Conference on Computer Vision (ICCV), regarded as one of the highest honors in computer vision. The paper has had a large research impact, reflected in more than 5,000 citations and more than 10,000 GitHub stars within roughly a year.

Han Hu received his Ph.D. from the Department of Automation at Tsinghua University in 2014. He currently works at Microsoft Research Asia as a Principal Researcher (Manager). He plans to dedicate himself to ultimately solving general computer vision, so that machines can fully comprehend any image or generate arbitrary images with almost no errors. He believes there is no essential difference between vision and language in modeling and learning. Noting that large language models such as ChatGPT have, in some sense, come close to solving natural language problems, he thinks general visual problems may be solved in a similar way.