AI Benchmark vs Human Performance | Opinion

AI Wins the Benchmarks. It Still Can't Do Your Job. Here's Why.

AI scored in the 90th percentile on the bar exam. Outperformed doctors on cancer detection. Won every benchmark that mattered. Law firms still have associates. Hospitals still have radiologists. Here is the part nobody in the industry wants to explain.

Nikhil_Writes
Nikhil_Writes
8 min read

In 2023, GPT-4 passed the Uniform Bar Exam scoring in the 90th percentile. AI systems were outperforming radiologists at detecting certain cancers. OpenAI's models were beating most humans on the SAT and graduate-level science tests.

Law firms did not lay off associates. Hospitals did not cut radiology staff. Marketing teams did not shrink.

If AI is genuinely this capable, why does nothing feel different at the actual job level? The answer is not that AI is overhyped. It is that the instrument we use to measure AI capability is itself broken. 

The benchmarks are not lying. They are just not telling you what you think they are.

The Test Was Never About Your Job

A benchmark is a standardized test. Fixed questions, known correct answers, repeatable scores. The most cited one is the MMLU, covering 57 subjects from mathematics to law to medicine. 

In 2020, top AI models scored around 60 percent on it. By 2024, they were above 90 percent. Human expert performance sits at roughly 89 percent. By this measure, AI has surpassed the average human expert.

But who wrote the test?

AI Wins the Benchmarks. It Still Can't Do Your Job. Here's Why.

 

MMLU was compiled from freely available online sources and academic question banks. The same internet used to train the models now being tested on it. 

Researcher Arvind Narayanan at Princeton has written about this directly, noting that benchmark contamination, where test data leaks into training data, is widespread and underreported. 

A 2023 MIT study found several leading models showed performance patterns consistent with having seen benchmark questions during training. The scores go up. The capability does not always follow.

Winning a Test You Helped Write

Gary Marcus, cognitive scientist and persistent critic of AI benchmark culture, said it plainly: “These systems are very good at tasks that look like things they have been trained on. That is not the same as being good at thinking.”

The medical imaging case makes this concrete. Systems built with the help of AI development services and trained on cancer detection have matched or beaten radiologists in controlled studies. 

But a 2021 review in Nature Medicine found that the vast majority of these studies used retrospective data, old images that had already been diagnosed and curated. When researchers looked at real-time clinical data, AI performance dropped significantly.

Real hospitals have blurry images from machines that need servicing. Patients who cannot stay still. Clinical notes that are incomplete. The benchmark has none of this. It is a closed room with good lighting. Of course AI wins there.

What Real Work Actually Demands

Three things benchmarks cannot test, and almost every real job requires.

Institutional context. 

Every workplace runs on invisible logic. Processes that exist because of a decision made years ago by someone who left. Clients handled a certain way because of a relationship nobody documented. An AI responds to what you type, not to the organizational history sitting behind your question.

Relational judgment. 

A significant portion of knowledge work is reading people. Knowing when a client is unhappy before they say so. Knowing how to deliver feedback to a specific person in a way they will actually hear. This requires years of social experience, a face, a stake in the outcome. AI has none of these.

Consequence ownership. 

When AI produces a wrong answer, the person who used it is accountable. The AI is not. This means every AI output gets filtered through human judgment before anything real happens with it. The AI is not doing the job. It is producing a draft that a human then does the job of evaluating.

A 2024 McKinsey survey found that while 65 percent of respondents used AI tools regularly, only 22 percent reported a reduction in overall workload. The tools were being used. The work was not going away.

 

The 80 Percent Problem

AI handles the first 80 percent of almost any knowledge task competently. It drafts the email, summarizes the document, structures the report. It does this fast and often well.

The remaining 20 percent is where the work actually lives.

The 20 percent is knowing which option is right for this client, in this market, at this moment. It is the data summary that is technically accurate but misses what makes it significant. It is the email draft that is perfectly worded but wrong in tone for the specific person receiving it.

Ethan Mollick at Wharton calls this the "jagged frontier" of AI capability. Some tasks that seem complex, AI handles easily. Some tasks that seem simple, AI handles badly. The failures come as surprises, in moments you least expect, which is a different and more dangerous kind of limitation than a tool that fails predictably.

The Next Model is Not the Answer

The industry response to these limitations is always the same: the next version will be better.

This misses the point.

The jobs most resistant to AI replacement are not resistant because AI lacks information. They are resistant because they require cognition that current AI architecture is not built to produce. 

Language models work by predicting the most statistically likely next output given the input. This produces fluent, often accurate results. It is not reasoning in the way humans use that word.

Yann LeCun, Chief AI Scientist at Meta and one of the architects of modern machine learning, has argued publicly that current LLM architecture has fundamental limits that scaling alone cannot fix, and that genuine machine intelligence will require entirely different approaches. His skepticism carries weight precisely because he is not an outsider critic. He built the field.

The Honest Frame

Benchmark scores have become a proxy for a claim about human replaceability, made largely by AI development companies with a direct financial interest in people believing it. A score on a controlled test, designed by the same ecosystem producing the model, is not neutral evidence.

Reading a benchmark result without that context is like hiring someone based entirely on their GPA. It tells you something. It does not tell you the thing you most need to know.

AI is useful. For specific, bounded tasks, the value is real. 

But it cannot carry the weight of context, judgment, and accountability that real jobs require. The next time a benchmark makes headlines, the question worth asking is not what score the model got. It is whether any of that has anything to do with the work on your desk.

Usually, it does not.

Sources

  1. OpenAI GPT-4 Technical Report, 2023 — openai.com/research/gpt-4
  2. MMLU Benchmark, Hendrycks et al., 2020 — arxiv.org/abs/2009.03300
  3. Benchmark Contamination, Narayanan and Kapoor, Princeton — aisnakeoil.com
  4. AI Diagnostic Performance Review, Nature Medicine, 2021 — nature.com
  5. Gary Marcus, MIT Technology Review, 2023 — technologyreview.com
  6. McKinsey AI Workplace Survey, 2024 — mckinsey.com
  7. Ethan Mollick, "The Jagged Frontier," 2023 — oneusefulthing.org
  8. Yann LeCun on LLM limits, Meta AI blog, 2023 to 2024 — ai.meta.com/blog

Discussion (0 comments)

0 comments

No comments yet. Be the first!