AI Edges Closer to Passing a Test That Only a Handful of the World's Experts Could Solve

The exam's ultimate purpose is to illuminate both the capabilities and the limitations of these technologies.

Artificial intelligence continues to advance in tasks requiring complex cognitive capabilities and, according to experts, may be on the verge of passing Humanity's Last Exam (HLE), a benchmark designed to measure the boundaries of expert-level knowledge.

The exam brings together 2,500 questions spanning more than 100 disciplines, ranging from mythology to aerospace engineering, and was designed so that only specialists with doctoral-level training could answer them. Developed with input from more than 1,000 experts across various fields, its stated purpose is to assess how close artificial intelligence has come to the frontiers of human knowledge.

"The model creators have done an excellent job improving these reasoning models," Calvin Zhang, head of research at Scale, the AI company responsible for the HLE, told The Times. He explained that the exam aspires to serve as an academic benchmark that only "a handful of people on Earth" would be capable of completing.

The performance of AI systems has improved significantly in a short period of time. While ChatGPT answered fewer than 3 percent of questions correctly in 2024, models such as Google Gemini reached approximately 19 percent within a matter of months and have since surpassed 45 percent. "If this were really the only thing that mattered to us in life, I think we could get there pretty quickly," said Kate Olszewska, suggesting that a score approaching 100 percent could be achieved within a year.

To preserve the exam's difficulty, its creators filtered tens of thousands of candidate questions and kept the answers concealed to prevent AI models from memorizing them. The challenges include translating ancient inscriptions and identifying microanatomical structures, tasks that require deep comprehension well beyond pattern recognition.

Nevertheless, some experts caution that a meaningful gap between AI and human intelligence remains. "When AI systems start to achieve exceptional results […] it is tempting to think they are approaching human understanding," said Tung Nguyen, who emphasized that true intelligence involves context and specialization. According to Nguyen, the exam's ultimate purpose is to illuminate both the capabilities and the limitations of these technologies. (CubaSí)