

The AI company Galileo has just announced its latest Hallucination Index, a framework that evaluates 22 leading generative AI models.
Models are tested using a metric called context adherence, which measures “closed-domain hallucinations: cases where your model said things that were not provided in the context.”
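To make the idea of context adherence concrete, here is a toy sketch in Python. It is not Galileo's method, which the article does not describe; it assumes a crude word-overlap heuristic, and the function names and the 0.8 threshold are hypothetical choices made for illustration only. A response sentence counts as "supported" when most of its words also appear in the retrieved context.

```python
import re


def split_sentences(text):
    """Split text into sentences on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_set(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_adherence(response, context, threshold=0.8):
    """Return the fraction of response sentences supported by the context.

    Assumption (illustrative only): a sentence is 'supported' when at least
    `threshold` of its distinct tokens also occur in the context.
    """
    context_words = word_set(context)
    sentences = split_sentences(response)
    if not sentences:
        return 1.0  # an empty response makes no unsupported claims
    supported = 0
    for sentence in sentences:
        words = word_set(sentence)
        if not words:
            supported += 1  # nothing to check in a token-free sentence
            continue
        overlap = len(words & context_words) / len(words)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)


context = "The index evaluates 22 models. Claude 3.5 Sonnet ranked first."
response = "Claude 3.5 Sonnet ranked first. It was released in 2019."
score = context_adherence(response, context)
```

In this example the first sentence is fully covered by the context and the second is not, so the score is 0.5. A word-overlap check like this misses paraphrases and negations, so it should be read only as an illustration of what the metric is trying to capture.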
The best performing model overall for RAG, according to the ranking, is Claude 3.5 Sonnet from Anthropic. Galileo said that this model and Anthropic’s other model, Claude 3 Opus, had near perfect scores, beating out OpenAI’s models, which won last year.
From a cost perspective, the best performing model was Google’s Gemini 1.5 Flash. Alibaba’s Qwen2-72B-Instruct was the best performing open source model overall, though in short context RAG tests, Meta’s llama-3-70b-instruct was the best.
Broken down by context length, the best closed-source model in short context RAG was Claude 3.5 Sonnet. In medium context RAG it was Google’s Gemini-1.5-flash-001, with cost serving as the tiebreaker against other models that also achieved a perfect score. In long context RAG it was again Claude 3.5 Sonnet.
“In today’s rapidly evolving AI landscape, developers and enterprises face a critical challenge: how to harness the power of generative AI while balancing cost, accuracy, and reliability. Current benchmarks are often based on academic use-cases, rather than real-world applications. Our new Index seeks to address this by testing models in real-world use cases that require the LLMs to retrieve data, a common practice in enterprise AI implementations,” says Vikram Chatterji, CEO and co-founder of Galileo. “As hallucinations continue to be a major hurdle, our goal wasn’t to just rank models, but rather give AI teams and leaders the real-world data they need to adopt the right model, for the right task, at the right price.”