Why Super Mario Bros. is the Ultimate AI Benchmark

Move over, chess and Go—Super Mario Bros. is the new proving ground for artificial intelligence. Researchers at the University of California San Diego’s Hao AI Lab have thrown some of the world’s most advanced AI models into the pixelated world of the iconic 1985 platformer. The results? Let’s just say Mario’s mushroom-fueled adventures are tougher for AI than you might think.

Using a custom framework called GamingAgent, the team connected AI models including Anthropic’s Claude 3.7, Google’s Gemini 1.5 Pro, and OpenAI’s GPT-4o to a copy of the game running in an emulator. Each model was tasked with controlling Mario, dodging Goombas, and navigating treacherous jumps, guided by real-time screenshots and basic instructions like “move/jump left to dodge.”
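To make that setup concrete, here is a minimal sketch of a screenshot-in, action-out control loop. It is not GamingAgent's actual code: the Frame class, the stub_model function, and the action names are all hypothetical stand-ins, and a real agent would send an image to a hosted model rather than read a boolean flag.

```python
from dataclasses import dataclass
from typing import Callable, List

# Instruction text modeled on the example quoted above (wording is illustrative).
INSTRUCTIONS = "If an enemy is near, move/jump left to dodge."


@dataclass
class Frame:
    """Hypothetical stand-in for a screenshot: one flag instead of pixels."""
    enemy_near: bool


def stub_model(instructions: str, frame: Frame) -> str:
    """Placeholder for a vision-language model call (not a real API)."""
    return "jump_left" if frame.enemy_near else "right"


def run_episode(frames: List[Frame], model: Callable[[str, Frame], str]) -> List[str]:
    """Feed each frame to the model and collect the action it picks."""
    actions = []
    for frame in frames:
        actions.append(model(INSTRUCTIONS, frame))
    return actions


if __name__ == "__main__":
    demo_frames = [Frame(False), Frame(True), Frame(False)]
    print(run_episode(demo_frames, stub_model))
```

The loop handles one frame at a time, so any delay inside the model call means the game has already moved on by the time the action arrives.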

But here’s the twist: even the most advanced reasoning models, which excel at solving complex problems step by step, struggled to keep up. Why? Because in Super Mario Bros., timing is everything. A split-second delay can mean the difference between a flawless victory and a pitfall-induced demise.

The Surprising Weakness of Reasoning Models

One of the most intriguing findings from the study was the underperformance of reasoning-based AI models. These models, like OpenAI’s o1, are designed to “think” through problems methodically. While they dominate in tasks requiring deep analysis, they falter in fast-paced, real-time environments like Super Mario Bros.

“Reasoning models take seconds to decide on actions,” explains the Hao AI Lab team. “In a game where milliseconds matter, that’s a fatal flaw.”
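A quick back-of-the-envelope calculation shows why seconds of deliberation hurt. The sketch below assumes the game runs at roughly 60 frames per second (the approximate NTSC NES rate) and uses made-up latency figures for illustration. They are not measurements from the experiment.

```python
# Assumption: roughly 60 frames per second, the approximate NTSC NES rate.
NES_FPS = 60


def frames_elapsed(latency_seconds: float, fps: int = NES_FPS) -> int:
    """Number of game frames that pass while a model decides on one action."""
    return round(latency_seconds * fps)


if __name__ == "__main__":
    # Hypothetical latencies, chosen only to illustrate the scale of the gap.
    examples = [("quick model", 0.3), ("reasoning model", 3.0)]
    for label, latency in examples:
        print(f"{label}: {latency} s is about {frames_elapsed(latency)} frames")
```

Under these assumptions, a three-second pause lets about 180 frames go by, more than enough time for a Goomba to reach Mario before the chosen action is ever pressed.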

By contrast, non-reasoning models, which respond faster because they skip the extended step-by-step deliberation, fared better. This raises questions about how AI benchmarks should evolve to reflect real-world challenges, where speed and adaptability are just as important as precision.

Are Gaming Benchmarks Really Useful?

Using games to test AI isn’t new (think DeepMind’s AlphaStar mastering StarCraft II), but some experts question whether gaming benchmarks truly measure technological progress. Games, after all, are abstract and rule-bound, offering a controlled environment with a theoretically unlimited supply of training data, which the real world rarely provides.

Andrej Karpathy, a founding member of OpenAI, has even described the current state of AI benchmarking as an “evaluation crisis.” In a recent post on X, he admitted, “I don’t really know what [AI] metrics to look at right now.”

Still, there’s something undeniably captivating about watching AI tackle a game as beloved and challenging as Super Mario Bros. Whether it’s a meaningful measure of AI advancement or just a fun experiment, one thing’s clear: Mario’s world is as tough for machines as it is for humans.