Large language models are brilliant with words, partly thanks to the huge amount of text data they’ve been fed. But the frontier of AI is building systems that can interpret the world around them beyond text—seeing, hearing, and responding to complex situations. It’s an ambitious leap, and one of the field’s biggest challenges. One of the top researchers dedicated to pushing this frontier further is Northwestern University’s Manling Li, 33.
Li’s work focuses on a key challenge in AI: translating language into real-world action. While traditional AI systems specialize in a single type of input, such as text, Li builds systems that integrate perception, reasoning, and action. She created a framework that allows AI to piece together what is happening from multimedia information, including images, audio, and video as well as text. The ability to “perceive” across these data formats is essential for building AI that can make well-rounded judgments in the real world.
Her work helps AI systems go beyond identifying what is happening in their surroundings to “understanding” why. Instead of merely tagging objects in a video or picking out keywords from a sentence, her systems can follow events as they unfold, work out how different actions relate, and explain why something occurred. That transparency is increasingly crucial as AI systems make more consequential decisions in our daily lives.
Her work is already being used beyond the lab. Government agencies, including DARPA, have adopted her systems, and her open-source tools have helped make advanced AI techniques more widely available. She’s also created new benchmarks for evaluating AI performance in real-world settings, such as navigating physical environments and answering complex questions about what’s going on in a video.
As AI becomes embedded in everything from smart assistants to autonomous vehicles, Li’s work helps ensure these systems are not only powerful but also trustworthy and transparent.