English is spoken by less than 20% of the world’s population, but some experts estimate that it accounts for over 90% of the training data used to build large language models. The result is AI models that perform worse in the roughly 7,000 non-English languages spoken around the world, reinforce the cultural norms and values espoused in the English-language data, and create hard-to-detect harm.
As a senior research scientist at Google Research, Sunipa Dev, 32, is trying to change that with more inclusive, multilingual, and multicultural datasets to train and evaluate AI.
Starting in 2023, Dev and her colleagues published SeeGULL, a multilingual and multi-regional dataset of stereotypes described in a pair of papers, which was the largest dataset of its kind at the time. Built with a combined methodology of synthetic and community-contributed data, it includes English-language examples covering 178 countries, as well as examples in 20 non-English languages across 23 regions.
To ensure that generative AI’s outputs are relevant to local users, her team worked with individual data annotators around the world, including in the Middle East. In some underrepresented regions, including across India, Latin America, and sub-Saharan Africa, they partnered with local nonprofit organizations, UX designers, and others to include additional insights.
Google is already using the SeeGULL datasets to evaluate how well its LLMs avoid reproducing harmful stereotypes. The datasets are also publicly available for broader AI safety evaluations. Because SeeGULL is open source, Dev and her peers hope it will ensure that the concerns of non-Western communities are included in AI safety testing.
Dev is looking to expand the reach of her mission by helping foster a community of like-minded AI practitioners. Her ultimate hope, she says, is that within the next five years, 90% of the speakers of the world’s major languages will be able to access coherent, relevant, safe, and ultimately beneficial AI, and that one day that number will come closer to including everyone. “Artificial intelligence has to be globally intelligent,” Dev says, “and not just intelligent in some contexts.”