AI models make decisions for reasons that no one fully understands. As a result, they are often thought of as black boxes: data goes in, and text, generated images or videos, and more come out.
But if researchers can’t understand why models do what they do, it’s difficult to fix them when they generate bad or useless information. That’s where Neel Nanda, 26, focuses his work: “I see my job as: Do research such that by the time we make human-level AI, it is safe and good for the world.” Nanda leads a team at Google DeepMind working on a subfield of AI safety called mechanistic interpretability, often shortened to “mech interp,” which involves using mathematical techniques to better understand what an AI model is doing internally.
A popular approach is to divide an AI model into layers of computation, and use tools called sparse autoencoders to pull out traits and concepts the model is implicitly learning within each layer. Last year, Nanda and other Google DeepMind researchers published Gemma Scope, a collection of over 400 sparse autoencoders. Each was trained on Google’s Gemma 2 models to represent a distinct concept that Gemma interprets in pieces of text. This publicly available collection, which can be demoed online, allows researchers to get a kind of X-ray view into the behavior of the Gemma models, uncovering associations the models made completely on their own.
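In code, a sparse autoencoder of this kind is a small two-layer network: it expands one layer’s activation vector into a much wider, mostly zero feature vector, then tries to reconstruct the original activation from the few features that stay active. The sketch below is an illustrative ReLU sparse autoencoder in PyTorch, not the actual Gemma Scope code or checkpoints; the class name, dimensions, and loss coefficient are assumptions chosen for the example.

```python
# Minimal sketch of a sparse autoencoder (SAE) applied to one layer's activations.
# Illustrative only -- Gemma Scope's released SAEs use their own architecture and weights.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder: maps a layer's activation vector to a much wider feature vector.
        self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_features))
        # Decoder: reconstructs the original activation from the active features.
        self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # ReLU leaves only a handful of features "on" for any given input.
        features = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        reconstruction = features @ self.W_dec + self.b_dec
        return features, reconstruction


# Toy usage: pretend `acts` are activations from one layer of a language model
# (hypothetical sizes: 2304-dimensional activations, 16384 learned features).
sae = SparseAutoencoder(d_model=2304, d_features=16384)
acts = torch.randn(8, 2304)  # batch of 8 token activations
features, recon = sae(acts)

# Training objective: reconstruct the activation faithfully while penalizing the
# L1 norm of the features, so only a few concepts fire per token.
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().sum(dim=-1).mean()
print(features.shape, loss.item())
```

Because the sparsity penalty pushes most features to zero, the few that remain active for a given token tend to correspond to recognizable concepts, and those are what researchers inspect when they probe the model.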
Nanda got into AI because of his growing concern about how quickly artificial general intelligence, or AGI, could arrive—which he believes could pose major risks without a proper understanding of how to make it safe. He believes getting more people involved in the field is critical to ensure people understand AGI before they build it. To this end, Nanda writes explainers on mech interp, makes YouTube walkthroughs, and works as a mentor in the independent ML Alignment & Theory Scholars program.
Nanda suspects this outreach has helped popularize mech interp as a field. “I’ve seen professors complaining on [X] that too many of their PhD applicants want to do mechanistic interpretability,” he says. “I like to think I helped.”