Google DeepMind researchers introduce new benchmark to improve LLM factuality and reduce hallucinations



Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for highly specific, detailed responses.

It’s a challenge data scientists have struggled to overcome, and now researchers at Google DeepMind say they are a step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs’ ability to generate factually accurate responses grounded in long-form documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers to prompts.

Along with the new benchmark, the researchers released a FACTS leaderboard to the Kaggle data science community.

As of this week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. Others in the top nine include Google’s Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. All of these scored above 61.7% for accuracy.

The researchers say the leaderboard will be actively maintained and continually updated to include new models and their various iterations.

“We believe that this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, in comparison to benchmarks that focus on narrower use cases… such as summarization alone,” the researchers wrote in a technical paper published this week.

Eliminating inaccurate responses

Ensuring factual accuracy in LLM responses is difficult because of modeling factors (architecture, training and inference) and measurement factors (evaluation methodologies, data and metrics). Typically, the researchers point out, pre-training focuses on predicting the next token given previous tokens.

“While this objective may teach models salient world knowledge, it does not directly optimize the model towards the various factuality scenarios, instead encouraging the model to generate generally plausible text,” the researchers write.

To address this, the FACTS dataset comprises 1,719 examples (860 public and 859 private), each requiring a long-form response grounded in the context document provided. Each example includes the following (sketched in code after the list):

  • A system prompt (system_instruction) with general directions and the instruction to answer only based on the provided context;
  • A task (user_request) containing the specific question to be answered;
  • A long document (context_document) containing the necessary information.
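
For illustration only, here is a minimal sketch of how one such example might be represented. The field names follow the three components listed above; the dict layout and sample text are assumptions made for this sketch, not the dataset’s actual schema.

```python
# A minimal sketch of a single FACTS Grounding example, assuming a plain
# dict representation. Field names mirror the three components listed above;
# the sample text and overall layout are illustrative, not the actual schema.
facts_example = {
    # General directions, including answering only from the provided context
    "system_instruction": (
        "Answer the user's question using only information from the "
        "context document. Do not rely on outside knowledge."
    ),
    # The specific question the model must answer
    "user_request": "What were the main reasons for the revenue decline in Q3?",
    # The long-form document (up to roughly 32,000 tokens) with the needed facts
    "context_document": "<full text of the company's annual financial report>",
}

# One simple way to assemble the pieces into a single prompt for a model.
prompt = (
    f"{facts_example['system_instruction']}\n\n"
    f"Document:\n{facts_example['context_document']}\n\n"
    f"Question: {facts_example['user_request']}"
)
print(prompt)
```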

To succeed and be labeled “accurate,” the model must process the long-form document and generate a subsequent long-form response that is both comprehensive and fully attributable to the document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document and are not highly relevant or useful.

For instance, a user might ask the model to summarize the main reasons a company’s revenue declined in the third quarter, and provide it with detailed information including the company’s annual financial report, which discusses quarterly earnings, expenses, planned investments and market analysis.

If the model then returned, say, “The company faced challenges in the third quarter that impacted its revenue,” it would be deemed inaccurate.

“The response avoids specifying any reasons, such as market trends, increased competition or operational setbacks, which would likely be in the document,” the researchers point out. “It does not demonstrate an attempt to engage with or extract relevant details.”

By contrast, if a user asked, “What are some tips on saving money?” and provided a compilation of categorized money-saving tips for college students, a correct response would be highly detailed: “Utilize free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”

DeepMind uses LLMs to judge LLMs

To allow for diverse inputs, the researchers included documents of varying lengths, up to 32,000 tokens (roughly the equivalent of 20,000 words). These cover areas including finance, technology, retail, medicine and law. User requests are similarly wide-ranging, including Q&A generation, summarization and rewriting requests.

Each example is judged in two phases. First, responses are evaluated for eligibility: if they don’t satisfy the user’s request, they are disqualified. Second, responses must be free of hallucinations and fully grounded in the documents provided.

These factuality scores are calculated by three different LLM judges (specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet), which assign individual scores based on the percentage of model outputs that are accurate. The final factuality determination is based on the average of the three judges’ scores.

The researchers point out that models are often biased in favor of other members of their model family, at a mean increase of roughly 3.23%, so combining different judges was critical to help ensure responses were indeed factual.
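
As a rough sketch of that two-phase judging protocol, the snippet below disqualifies ineligible responses and averages grounding verdicts from the three judge families. The function name, the 0/1 verdict format and the sample numbers are assumptions for illustration, not the paper’s actual prompts or scoring templates.

```python
from statistics import mean

def score_example(judge_verdicts, eligible):
    """Score one example's response under the two-phase protocol.

    judge_verdicts: per-judge grades (1.0 = fully grounded, 0.0 = not),
    e.g. {"gemini-1.5-pro": 1.0, "gpt-4o": 1.0, "claude-3.5-sonnet": 0.0}.
    Responses that fail the eligibility check (they don't actually address
    the user's request) are disqualified and score 0 regardless of grounding.
    """
    if not eligible:
        return 0.0
    # Averaging judges from three different model families helps offset the
    # ~3.23% bias judges show toward models from their own family.
    return mean(judge_verdicts.values())

# Illustrative verdicts for two examples (not real benchmark data).
examples = [
    ({"gemini-1.5-pro": 1.0, "gpt-4o": 1.0, "claude-3.5-sonnet": 1.0}, True),
    ({"gemini-1.5-pro": 1.0, "gpt-4o": 0.0, "claude-3.5-sonnet": 1.0}, False),
]

# A model's overall factuality score is then aggregated over all examples.
overall = mean(score_example(verdicts, ok) for verdicts, ok in examples)
print(f"Overall factuality score: {overall:.1%}")
```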

Ultimately, the researchers emphasize that factuality and grounding are key to the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems,” they wrote.

However, they also acknowledge: “We’re mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”


