Study suggests that even the best AI models hallucinate a bunch

All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o. In other words, the models are unreliable narrators, sometimes to hilarious effect, other times problematically so.

But not all models make things up at the same rate. And the kinds of mistruths they spout depend on which sources of info they’ve been exposed to.

A recent study from researchers at Cornell, the universities of Washington and Waterloo and the nonprofit research institute AI2 sought to benchmark hallucinations by fact-checking models like GPT-4o against authoritative sources on topics ranging from law and health to history and geography. They found that no model performed exceptionally well across all topics, and that models that hallucinated the least did so partly because they refused to answer questions they’d otherwise get wrong.

“The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations,” Wenting Zhao, a doctoral student at Cornell and a co-author on the research, told TechCrunch. “At present, even the best models can generate hallucination-free text only about 35% of the time.”

There have been other academic attempts at probing the “factuality” of models, including one by a separate AI2-affiliated team. But Zhao notes that these earlier tests asked models questions with answers easily found on Wikipedia, which is not exactly the toughest ask considering most models are trained on Wikipedia data.

To make their benchmark more challenging — and to more accurately reflect the types of questions people ask of models — the researchers identified topics around the web that don’t have a Wikipedia reference. Just over half the questions in their test can’t be answered using Wikipedia (they included some Wikipedia-sourced ones for good measure), and touch on topics including culture, geography, astronomy, pop culture, finance, medicine, computer science and celebrities.

For their study, the researchers evaluated over a dozen different popular models, many of which were released in the past year. In addition to GPT-4o, they tested “open” models such as Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B and Cohere’s Command R+ as well as gated-behind-API models like Perplexity’s Sonar-Large (which is based on Llama), Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 Opus.

The results suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic and the other big generative AI players.

GPT-4o and OpenAI’s much older flagship GPT-3.5 performed about the same in terms of the percentage of questions they answered factually correctly on the benchmark. (GPT-4o was marginally better.) OpenAI’s models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R and Perplexity’s Sonar models.

Questions pertaining to celebrities and finance gave the models the hardest time, while questions about geography and computer science were easiest for the models to answer (perhaps because their training data contained more references to these). In cases where the source of an answer wasn’t Wikipedia, every model answered less factually on average (but especially GPT-3.5 and GPT-4o), suggesting that they’re all informed heavily by Wikipedia content.

Even models that can search the web for information, like Command R and Perplexity’s Sonar models, struggled with “non-Wiki” questions in the benchmark. Model size didn’t matter much; smaller models (e.g. Anthropic’s Claude 3 Haiku) hallucinated roughly as frequently as larger, ostensibly more capable models (e.g. Claude 3 Opus).

So what does all this mean — and where are the improvements that vendors promised?

Well, we wouldn’t put it past vendors to exaggerate their claims. But a more charitable take is that the benchmarks they’re using aren’t fit for this purpose. As we’ve written about before, many, if not most, AI evaluations are transient and devoid of important context, doomed to fall victim to Goodhart’s law.

Regardless, Zhao says that she expects the issue of hallucinations to “persist for a long time.”

“Empirical results in our paper indicate that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement achievable with these methods is limited,” she said. “Additionally, our analysis reveals that even the knowledge found on the internet can often be conflicting, partly because the training data — authored by humans — can also contain hallucinations.”

An interim solution could be simply programming models to refuse to answer more often, the technical equivalent of telling a know-it-all to knock it off.

In the researchers’ testing, Claude 3 Haiku answered only around 72% of the questions it was asked, choosing to abstain from the rest. When accounting for the abstentions, Claude 3 Haiku was in fact the most factual model of them all — at least in the sense that it lied least often.
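To see why abstaining can change a ranking, here is a rough sketch of the arithmetic in Python. The function name and all of the counts are made up for illustration; they are not figures from the study, and the researchers' actual scoring method may differ. The only number borrowed from the article is the roughly 72% answer rate.

```python
def factuality_rates(correct: int, hallucinated: int, abstained: int) -> dict:
    """Summarize one model's results from hypothetical per-question tallies."""
    answered = correct + hallucinated
    total = answered + abstained
    if total == 0:
        raise ValueError("no questions recorded")
    return {
        # Share of questions the model attempted at all.
        "answer_rate": answered / total,
        # Share of all questions answered correctly.
        "correct_of_all": correct / total,
        # Share of attempted questions answered correctly.
        "correct_of_answered": correct / answered if answered else 0.0,
        # Share of all questions that drew a hallucinated answer.
        "hallucinated_of_all": hallucinated / total,
    }


# Hypothetical tallies out of 100 questions (NOT data from the study).
cautious = factuality_rates(correct=50, hallucinated=22, abstained=28)
eager = factuality_rates(correct=55, hallucinated=45, abstained=0)

for name, stats in (("cautious", cautious), ("eager", eager)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

In this invented example, the eager model gets more questions right overall, but the cautious one produces hallucinated answers on a smaller share of questions. That is the sense in which a model that declines more often can come out as the one that "lied least often."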

But will people use a model that doesn’t answer many questions? Zhao thinks not and says vendors should focus more of their time and effort on hallucination-reducing research. Eliminating hallucinations entirely may not be possible, but they can be mitigated through human-in-the-loop fact-checking and citation during a model’s development, she asserts.

“Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models,” Zhao added. “There are still numerous opportunities to make significant impacts in this field, such as developing advanced fact-checking tools for any free text, providing citations for factual content and offering corrections for hallucinated texts.”
