OpenAI’s Deep Research smashes records for the world’s hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake

The accuracy achieved by the top-scoring AI on the world’s hardest benchmark has improved by 183% in just two weeks
ChatGPT o3-mini now scores up to 13% accuracy, depending on the setting
OpenAI Deep Research obliterates the competition with a 26.6% accuracy result

The world’s hardest AI exam, Humanity’s Last Exam, was launched less than two weeks ago, and we’ve already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI’s Deep Research topping the leaderboard.

The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man. It’s so hard that when I previously wrote about Humanity’s Last Exam, I couldn’t even understand one of the questions, let alone answer it.

At the time of writing that last article, global phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated only on text (not multi-modal). Now, OpenAI’s o3-mini, which launched earlier this week, has scored 10.5% accuracy on the o3-mini setting, and 13% accuracy on the o3-mini-high setting, which is more intelligent but takes longer to generate answers.

More impressive, however, is the benchmark score of OpenAI’s new AI agent, Deep Research: the new tool scored 26.6%, a whopping 183% increase over DeepSeek R1’s 9.4% in less than 10 days. Now, it’s worth noting that Deep Research has search capabilities, which makes comparisons slightly unfair, as the other AI models don’t. The ability to search the web is helpful for a test like Humanity’s Last Exam, as it includes some general knowledge-based questions.
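For readers who want to check the 183% figure, here is a minimal sketch of the arithmetic, assuming the increase is measured relative to DeepSeek R1’s 9.4% text-only score quoted above. The function name `relative_increase` is purely illustrative and does not come from the benchmark or from OpenAI.

```python
def relative_increase(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


# Scores quoted in the article (accuracy, in percent).
deepseek_r1 = 9.4      # previous leaderboard top, text-only evaluation
deep_research = 26.6   # OpenAI Deep Research

# Relative increase: how much bigger the new score is than the old one.
increase = relative_increase(deepseek_r1, deep_research)
print(f"Relative increase: {increase:.0f}%")

# Absolute gain in percentage points, a different and smaller-looking number.
points = deep_research - deepseek_r1
print(f"Absolute gain: {points:.1f} percentage points")
```

The distinction matters when reading headlines: the score nearly tripled in relative terms, but the absolute gain is 17.2 percentage points on a test where most questions are still answered incorrectly.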

That said, the accuracy of results from models taking Humanity’s Last Exam is steadily improving, and it does make you wonder just how long we’ll need to wait to see an AI model come close to completing the benchmark. Realistically, AI shouldn’t be able to come close any time soon, but I wouldn’t bet against it.

It looks like the latest OpenAI model is doing very well across many topics. My guess is that Deep Research particularly helps with subjects including medicine, classics, and law. pic.twitter.com/x8Ilmq1aQS (February 3, 2025)

Better, but 26.6% never got me any SATs

OpenAI Deep Research is an incredibly impressive tool, and I’ve been blown away by the examples that OpenAI showed off when it announced the AI agent. Deep Research is able to work as your personal analyst, taking time to conduct intensive research and come up with reports and answers that would otherwise take humans hours and hours to complete.

While a score of 26.6% on Humanity’s Last Exam is seriously impressive, especially considering how far the benchmark’s leaderboard has come in just a couple of weeks, it’s still a low score in absolute terms: no one would claim to have passed a test with anything less than 50% in the real world.

Humanity’s Last Exam is a great benchmark, and one that will prove invaluable as AI models develop, enabling us to gauge just how far they’ve come. How long will we have to wait to see an AI pass the 50% mark? And which model will be the first to do so?
