Will Smith eating spaghetti and other weird AI benchmarks that took off in 2024



When a company releases a new AI video generator, it’s not long before someone uses it to make a video of actor Will Smith eating spaghetti.

It’s become something of a meme as well as a benchmark: seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.

Will Smith and pasta is but one of several bizarre “unofficial” benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AI models play games like Pictionary and Connect 4 against each other.

It’s not like there aren’t more academic tests of an AI’s performance. So why did the weirder ones blow up?

Image Credits: Paul Calcraft

For one, many of the industry-standard AI benchmarks don’t tell the average person very much. Companies often cite their AI’s ability to answer questions on Math Olympiad exams, or figure out plausible solutions to Ph.D.-level problems. Yet most people — yours truly included — use chatbots for things like responding to emails and basic research.

Crowdsourced industry measures aren’t necessarily better or more informative.

Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image. But raters tend not to be representative — most come from AI and tech industry circles — and cast their votes based on personal, hard-to-pin-down preferences.

The Chatbot Arena interface. Image Credits: LMSYS

Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don’t compare a system’s performance to that of the average person.

“The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless,” Mollick wrote.

Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical — or even all that generalizable. Just because an AI nails the Will Smith test doesn’t mean it’ll generate, say, a burger well.

Note the typo; there’s no such model as Claude 3.6 Sonnet. Image Credits: Adonis Singh

One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI instead of its ability in narrow domains. That’s sensible. But I have a feeling that weird benchmarks aren’t going away anytime soon. Not only are they entertaining — who doesn’t like watching AI build Minecraft castles? — but they’re easy to understand. And as my colleague Max Zeff wrote about recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.

The only question in my mind is, which odd new benchmarks will go viral in 2025?






Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
