Study suggests that even the best AI models hallucinate a bunch

All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o. In other words, the models are unreliable narrators, sometimes to hilarious effect and other times problematically so.

But not all models make things up at the same rate. And the kinds of mistruths they spout depend on which sources of info they’ve been exposed to.

A recent study from researchers at Cornell, the universities of Washington and Waterloo and the nonprofit research institute AI2 sought to benchmark hallucinations by fact-checking models like GPT-4o against authoritative sources on topics ranging from law and health to history and geography. They found that no model performed exceptionally well across all topics, and that models that hallucinated the least did so partly because they refused to answer questions they’d otherwise get wrong.

“The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations,” Wenting Zhao, a doctoral student at Cornell and a co-author on the research, told TechCrunch. “At present, even the best models can generate hallucination-free text only about 35% of the time.”

There have been other academic attempts at probing the “factuality” of models, including one by a separate AI2-affiliated team. But Zhao notes that these earlier tests asked models questions with answers easily found on Wikipedia, which is not exactly the toughest ask considering most models are trained on Wikipedia data.

To make their benchmark more challenging — and to more accurately reflect the types of questions people ask of models — the researchers identified topics around the web that don’t have a Wikipedia reference. Just over half the questions in their test can’t be answered using Wikipedia (they included some Wikipedia-sourced ones for good measure), and touch on topics including culture, geography, astronomy, pop culture, finance, medicine, computer science and celebrities.

For their study, the researchers evaluated over a dozen different popular models, many of which were released in the past year. In addition to GPT-4o, they tested “open” models such as Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B and Cohere’s Command R+ as well as gated-behind-API models like Perplexity’s Sonar-Large (which is based on Llama), Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 Opus.

The results suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic and the other big generative AI players.

GPT-4o and OpenAI’s much older flagship GPT-3.5 performed about the same in terms of the percentage of questions they answered factually correctly on the benchmark. (GPT-4o was marginally better.) OpenAI’s models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R and Perplexity’s Sonar models.

Questions pertaining to celebrities and finance gave the models the hardest time, but questions about geography and computer science were easiest for the models to answer (perhaps because their training data contained more references to these). In cases where the source of an answer wasn’t Wikipedia, every model answered less factually on average (but especially GPT-3.5 and GPT-4o), suggesting that they’re all informed heavily by Wikipedia content.

Even models that can search the web for information, like Command R and Perplexity’s Sonar models, struggled with “non-Wiki” questions in the benchmark. Model size didn’t matter much; smaller models (e.g. Anthropic’s Claude 3 Haiku) hallucinated roughly as frequently as larger, ostensibly more capable models (e.g. Claude 3 Opus).

So what does all this mean — and where are the improvements that vendors promised?

Well, we wouldn’t put it past vendors to exaggerate their claims. But a more charitable take is that the benchmarks they’re using aren’t fit for this purpose. As we’ve written about before, many, if not most, AI evaluations are transient and devoid of important context, doomed to fall victim to Goodhart’s law.

Regardless, Zhao says that she expects the issue of hallucinations to “persist for a long time.”

“Empirical results in our paper indicate that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement achievable with these methods is limited,” she said. “Additionally, our analysis reveals that even the knowledge found on the internet can often be conflicting, partly because the training data — authored by humans — can also contain hallucinations.”

An interim solution could be simply programming models to refuse to answer more often, the technical equivalent of telling a know-it-all to knock it off.

In the researchers’ testing, Claude 3 Haiku answered only around 72% of the questions it was asked, choosing to abstain from the rest. When accounting for the abstentions, Claude 3 Haiku was in fact the most factual model of them all — at least in the sense that it lied least often.
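To see how abstaining can change a ranking, consider a rough sketch of the arithmetic. This is not the researchers’ scoring code, and the `Tally` class and every number in it are hypothetical, apart from the roughly 72% answer rate mentioned above. It assumes each question is either skipped, answered cleanly or answered with at least one hallucination.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model counts; not data from the study."""
    asked: int         # questions posed to the model
    answered: int      # questions the model chose to answer
    hallucinated: int  # answered questions containing a hallucination

    def response_rate(self) -> float:
        # Share of questions the model did not abstain from.
        return self.answered / self.asked

    def hallucination_rate_of_answers(self) -> float:
        # How often an answer, once given, contains a hallucination.
        return self.hallucinated / self.answered

    def hallucination_rate_overall(self) -> float:
        # How often a question ends in a hallucination, counting
        # abstentions as "no hallucination".
        return self.hallucinated / self.asked


# Illustrative numbers only: a cautious model that answers 72 of 100
# questions, and an eager one that answers all 100.
cautious = Tally(asked=100, answered=72, hallucinated=36)
eager = Tally(asked=100, answered=100, hallucinated=45)

for name, tally in (("cautious", cautious), ("eager", eager)):
    print(
        name,
        f"answers {tally.response_rate():.0%} of questions,",
        f"hallucinates in {tally.hallucination_rate_of_answers():.0%} of its answers,",
        f"and in {tally.hallucination_rate_overall():.0%} of all questions",
    )
```

With these made-up counts, the cautious model hallucinates on a smaller share of all questions even though its individual answers are no more reliable than the eager model’s. That is the sense in which a model that abstains often can come out as the one that lies least.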

But will people use a model that doesn’t answer many questions? Zhao thinks not and says vendors should focus more of their time and effort on hallucination-reducing research. Eliminating hallucinations entirely may not be possible, but they can be mitigated through human-in-the-loop fact-checking and citation during a model’s development, she asserts.

“Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models,” Zhao added. “There are still numerous opportunities to make significant impacts in this field, such as developing advanced fact-checking tools for any free text, providing citations for factual content and offering corrections for hallucinated texts.”
