If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that A.I. systems can’t pass.
For years, researchers measured A.I. systems by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, S.A.T.-caliber problems in areas like math, science and logic. Comparing the models’ scores over time served as a rough measure of A.I. progress.
But A.I. systems eventually got too good at those tests, so new, harder tests were created — often with the types of questions graduate students might encounter on their exams.
Those tests aren’t in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many Ph.D.-level challenges, limiting those tests’ usefulness and leading to a chilling question: Are A.I. systems getting too smart for us to measure?
This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: A new evaluation, called “Humanity’s Last Exam,” that they claim is the hardest test ever administered to A.I. systems.
Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known A.I. safety researcher and director of the Center for AI Safety. (The test’s original name, “Humanity’s Last Stand,” was discarded for being overly dramatic.)
Mr. Hendrycks worked with Scale AI, an A.I. company where he is an advisor, to compile the test, which consists of roughly 3,000 multiple-choice and short-answer questions designed to test A.I. systems’ abilities in areas ranging from analytic philosophy to rocket engineering.
Questions were submitted by experts in these fields, including college professors and prizewinning mathematicians, who were asked to come up with extremely difficult questions they knew the answers to.
Here, try your hand at a question about hummingbird anatomy from the test:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Or, if physics is more your speed, try this one:
A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?
(I would print the answers here, but that would spoil the test for any A.I. systems being trained on this column. Also, I’m far too dumb to verify the answers myself.)
The questions on Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading A.I. models to solve.
If the models couldn’t answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote top-rated questions were paid between $500 and $5,000 per question, and they also received credit for contributing to the exam.
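For readers who think in code, here is a rough sketch of that two-step filter in Python. It is an illustration only, not the researchers' actual pipeline: the names `Question`, `passes_model_filter` and `filter_questions` are hypothetical, exact string matching stands in for whatever grading the organizers used, and "worse than random guessing" is assumed to mean accuracy across the models below one divided by the number of answer choices.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Question:
    prompt: str
    answer: str
    # None for short-answer questions; a list of options for multiple choice.
    choices: Optional[Sequence[str]] = None


# A "model" here is any function that takes a question and returns an answer.
Model = Callable[[Question], str]


def model_accuracy(question: Question, models: Sequence[Model]) -> float:
    """Fraction of models whose answer exactly matches the reference answer."""
    if not models:
        raise ValueError("at least one model is required")
    correct = sum(1 for m in models if m(question).strip() == question.answer)
    return correct / len(models)


def passes_model_filter(question: Question, models: Sequence[Model]) -> bool:
    """Step 1: keep a question only if the models fail on it."""
    accuracy = model_accuracy(question, models)
    if question.choices:
        # Assumption: "worse than random" means below 1 / number of choices.
        return accuracy < 1 / len(question.choices)
    # Short answer: no model may get it right.
    return accuracy == 0


def filter_questions(
    submissions: Sequence[Question],
    models: Sequence[Model],
    human_review: Callable[[Question], Optional[Question]],
) -> List[Question]:
    """Step 1 (model filter), then step 2 (human reviewers refine and verify).

    `human_review` returns a refined question, or None to reject it.
    """
    survivors = [q for q in submissions if passes_model_filter(q, models)]
    reviewed = (human_review(q) for q in survivors)
    return [q for q in reviewed if q is not None]
```

In this sketch a question that stumps the models still has to clear the human reviewers, which mirrors the order of the two steps described above.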
Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of his questions were chosen, all of which he told me were “along the upper range of what one might see in a graduate exam.”
Mr. Hendrycks, who helped create a widely used A.I. test known as Massive Multitask Language Understanding, or M.M.L.U., said he was inspired to create harder A.I. tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety advisor to Mr. Musk’s A.I. company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to A.I. models, which he thought were too easy.
“Elon looked at the M.M.L.U. questions and said, ‘These are undergrad level. I want things that a world-class expert could do,’” Mr. Hendrycks said.
There are other tests trying to measure advanced A.I. capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the A.I. researcher François Chollet.
But Humanity’s Last Exam is aimed at determining how good A.I. systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score.
“We are trying to estimate the extent to which A.I. can automate a lot of really difficult intellectual labor,” Mr. Hendrycks said.
Once the list of questions had been compiled, the researchers gave Humanity’s Last Exam to six leading A.I. models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored the highest of the bunch, with a score of 8.3 percent.
(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to A.I. systems. OpenAI and Microsoft have denied those claims.)
Mr. Hendrycks said he expected those scores to rise quickly, and potentially to surpass 50 percent by the end of the year. At that point, he said, A.I. systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure A.I.’s impacts, like looking at economic data or judging whether it can make novel discoveries in areas like math and science.
“You can imagine a better version of this where we can give questions that we don’t know the answers to yet, and we’re able to verify if the model is able to help solve it for us,” said Summer Yue, Scale AI’s director of research and an organizer of the exam.
Part of what’s so confusing about A.I. progress these days is how jagged it is. We have A.I. models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Math Olympiad and beating top human programmers on competitive coding challenges.
But these same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry. That has given them a reputation as astoundingly brilliant at some things and totally useless at others, and it has created vastly different impressions of how fast A.I. is improving, depending on whether you’re looking at the best or the worst outputs.
That jaggedness has also made measuring these models hard. I wrote last year that we need better evaluations for A.I. systems. I still believe that. But I also believe that we need more creative methods of tracking A.I. progress that don’t rely on standardized tests, because most of what humans do — and what we fear A.I. will do better than us — can’t be captured on a written exam.
Mr. Zhou, the theoretical particle physics researcher who submitted questions to Humanity’s Last Exam, told me that while A.I. models were often impressive at answering complex questions, he didn’t consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers.
“There’s a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an A.I. that can answer these questions might not be ready to help in research, which is inherently less structured.”