Nicholas Mitsakos

So far, we’ve attempted to answer that question through benchmarks. These give models a fixed set of questions to answer and grade them on how many they get right. But just like exams, these benchmarks don’t always reflect deeper abilities. Lately, it seems as if a new AI model is released every week, and each time a company introduces one, it arrives with fresh scores showing it outperforming its predecessors.

On paper, everything appears to be getting better all the time. In practice, it’s not so simple.

Just as grinding for a test might boost your score without improving your critical thinking, models can be trained to optimize for benchmark results without actually getting smarter. As Andrej Karpathy recently put it, “we’re living through an evaluation crisis – our scoreboard for AI no longer reflects what we really want to measure.”

Benchmarks have become stale for several reasons. First, the industry has learned to “teach to the test,” training AI models to score well rather than to genuinely improve. Second, widespread data contamination means that models may have already encountered the benchmark questions, or even the answers, in their training data. Third, many benchmarks are simply saturated: on popular tests like SuperGLUE, models have already reached or surpassed 90% accuracy, so further gains feel more like statistical noise than meaningful improvement.
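
Contamination checks typically look for verbatim overlap between benchmark items and training text. The sketch below is a minimal, assumed version of such a check, not any particular lab’s pipeline; the n-gram size, threshold, and toy data are placeholders.

```python
# Illustrative n-gram overlap check, similar in spirit to the contamination
# screens run before reporting benchmark scores. The n-gram size, threshold,
# and toy data below are assumptions for demonstration only.
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, training_docs: list, n: int = 8,
                    threshold: float = 0.5) -> bool:
    """Flag a benchmark question if a large share of its n-grams
    also appear verbatim in any training document."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    return any(
        len(q_grams & ngrams(doc, n)) / len(q_grams) >= threshold
        for doc in training_docs
    )

# Toy example: the question text was copied almost verbatim into the corpus.
corpus = ["the quick brown fox jumps over the lazy dog near the old river bank today"]
question = "the quick brown fox jumps over the lazy dog near the old river bank"
print(is_contaminated(question, corpus))  # True
```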

At that point, the scores no longer provide us with any useful information. 

That’s especially true in high-skill domains like coding, reasoning, and complex STEM problem-solving. A growing number of teams worldwide are attempting to address the AI evaluation crisis, and one result is a new benchmark called LiveCodeBench Pro. It draws problems from international algorithmic olympiads: competitions for elite high school and university programmers, where participants solve challenging problems without external tools. The top AI models currently solve only about 53% of medium-difficulty problems at first pass and 0% of the hardest ones, tasks that human experts routinely crack. AI is good at making plans and executing tasks, but it struggles with nuanced algorithmic reasoning and remains far from matching the best human coders.
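
For context on the scoring, “at first pass” corresponds to the pass@1 metric common in code-generation benchmarks. Below is a minimal sketch of the standard pass@k estimator; it is an assumption about the general methodology, not LiveCodeBench Pro’s actual scoring code.

```python
# Sketch of the standard unbiased pass@k estimator from the code-generation
# literature; "first pass" corresponds to pass@1. LiveCodeBench Pro's exact
# scoring pipeline is not described here, so treat this as an illustration
# of the idea rather than its implementation.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given that
    c of the n generated solutions for a problem are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 candidate programs per problem, 3 of which pass the judge's tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3 -> the first-pass (pass@1) solve rate
```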

Then There’s Risk

In real-world, application-driven environments, especially with AI agents, unreliability, hallucinations, and brittleness are ruinous. One wrong move could spell disaster when safety or money is at stake.

There are attempts to address the problem. Some benchmarks, such as ARC-AGI, now keep part of their dataset private so that models cannot be tuned too closely to the test, a problem known as “overfitting.” Meta’s Yann LeCun has helped create LiveBench, a dynamic benchmark that refreshes its questions every six months.

The goal is to evaluate models not just on knowledge but on adaptability. Xbench, a Chinese benchmark project, is another such effort, notable for a dual-track design that tries to bridge the gap between lab-based tests and real-world utility. The first track assesses technical reasoning, evaluating a model’s STEM knowledge and its ability to conduct research in Chinese.

The second track assesses practical usefulness: how well a model performs on tasks such as recruitment and marketing. For instance, one task requires an agent to identify five qualified battery engineer candidates; another has it match brands with relevant influencers from a pool of over 800 creators. The team behind Xbench has big ambitions: it plans to expand testing into sectors such as finance, law, and design, and to update the test set quarterly to prevent stagnation.

Are We Having Fun Yet?

A model’s hardcore reasoning ability doesn’t necessarily translate into a fun, informative, and creative experience. Most queries from average users are probably not rocket science. There isn’t much research yet on how to evaluate a model’s creativity effectively, or on which models are best for creative writing or art projects.

What About the Humans?

Human preference testing has also emerged as an alternative to benchmarks. One increasingly popular platform lets users submit a question, compare responses from different models side by side, and vote for the one they prefer. Still, this method has its flaws. Users sometimes reward the answer that sounds more flattering or agreeable, even if it’s wrong. That can incentivize “sweet-talking” models and skew results in favor of pandering.
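
How those individual votes become a leaderboard is usually an Elo-style (or Bradley-Terry) rating fit over pairwise outcomes. The sketch below is a generic, assumed version with hypothetical model names and parameters, not the platform’s actual method.

```python
# Minimal sketch of Elo-style aggregation of side-by-side preference votes,
# the kind of rating behind "arena" leaderboards. Model names, starting
# ratings, and the K-factor are illustrative assumptions.
from collections import defaultdict

K = 32  # how strongly a single vote moves the ratings

def expected(r_winner: float, r_loser: float) -> float:
    """Predicted probability that the eventual winner beats the loser."""
    return 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Nudge both ratings toward the outcome of one user vote."""
    shift = K * (1.0 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += shift
    ratings[loser] -= shift

ratings = defaultdict(lambda: 1000.0)  # every model starts equal
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # a simple leaderboard
```

Because each vote simply shifts ratings toward the preferred answer, systematically pandering responses can inflate a model’s score even when its answers are wrong.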

The Status Quo of AI Testing Cannot Continue

AI research is a hypercompetitive infinite game. An infinite game is open-ended; the goal is to keep playing. In AI, however, a dominant player often produces a significant result, triggering a wave of follow-up papers that chase the same narrow topic. That is finite-game behavior: each benchmark win or publication becomes the point of the game rather than a move within it. This race-to-publish culture puts enormous pressure on researchers, rewarding speed over depth and short-term wins over long-term insight. If academia chooses to play a finite game, it will lose.

This “finite vs. infinite game” framework also applies to benchmarks. So, do we have a comprehensive scoreboard for evaluating the true quality of a model? Not really. Many dimensions, social, emotional, and interdisciplinary among them, still evade assessment. But the wave of new benchmarks hints at a shift. As the field evolves, a bit of skepticism is probably healthy.