- The accuracy achieved by the top-scoring AI on the world's hardest benchmark has improved by 183% in just two weeks
- ChatGPT o3-mini now scores up to 13% accuracy, depending on the setting
- OpenAI Deep Research obliterates competitors with a 26.6% accuracy result
The world's hardest AI exam, Humanity's Last Exam, was launched less than two weeks ago, and we have already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI's Deep Research topping the leaderboard.
The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man. It's so hard that when I previously wrote about Humanity's Last Exam, I couldn't even understand one of the questions, let alone answer it.
At the time of writing that last article, worldwide phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated only on text (not multi-modal). Now, OpenAI's o3-mini, which launched earlier this week, has scored 10.5% accuracy on the o3-mini setting and 13% accuracy on the o3-mini-high setting, which is more intelligent but takes longer to generate answers.
More impressive, however, is the benchmark score of OpenAI's new AI agent, Deep Research. The new tool scored 26.6%, a whopping 183% increase over DeepSeek R1's 9.4% in less than 10 days. Now, it's worth noting that Deep Research has search capabilities, which the other AI models don't, and that makes comparisons slightly unfair. The ability to search the web is helpful for a test like Humanity's Last Exam, because it includes some general knowledge-based questions.
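For anyone who wants to check the headline figure, here is a minimal sketch of the arithmetic in Python. It assumes the 183% refers to the relative increase from DeepSeek R1's 9.4% to Deep Research's 26.6%; the function and variable names are my own and purely illustrative.

```python
def relative_increase(old: float, new: float) -> float:
    """Return the percentage increase going from old to new."""
    return (new - old) / old * 100


# Leaderboard accuracies quoted in this article, in percent.
deepseek_r1 = 9.4
o3_mini_high = 13.0
deep_research = 26.6

# Relative increase of each newer score over DeepSeek R1's.
print(round(relative_increase(deepseek_r1, o3_mini_high)))   # 38
print(round(relative_increase(deepseek_r1, deep_research)))  # 183

# The absolute gap, in percentage points, is far smaller.
print(round(deep_research - deepseek_r1, 1))  # 17.2
```

The distinction matters: a 183% relative increase sounds enormous, but in absolute terms the top score moved up by 17.2 percentage points.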
That said, the accuracy of models taking Humanity's Last Exam is steadily improving, and it does make you wonder just how long we'll need to wait to see an AI model come close to completing the benchmark. Realistically, AI shouldn't be able to come close any time soon, but I wouldn't bet against it.
It looks like the latest OpenAI model is doing very well across many topics. My guess is that Deep Research particularly helps with subjects including medicine, classics, and law. pic.twitter.com/x8Ilmq1aQS (February 3, 2025)
Better, but 26.6% never got me any SATs
OpenAI Deep Research is an incredibly impressive tool, and I've been blown away by the examples that OpenAI showed off when it announced the AI agent. Deep Research is able to work as your personal analyst, taking time to conduct intense research and come up with reports and answers that would otherwise take humans hours and hours to complete.
While a score of 26.6% on Humanity's Last Exam is seriously impressive, especially considering how far the benchmark's leaderboard has come in just a couple of weeks, it's still a low score in absolute terms. No one would claim to have passed a test with anything less than 50% in the real world.
Humanity's Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, enabling us to gauge just how far they've come. How long will we have to wait to see an AI pass the 50% mark? And which model will be the first to do so?