Will Smith eating spaghetti and other weird AI benchmarks that took off in 2024

When a company releases a new AI video generator, it’s not long before someone uses it to make a video of actor Will Smith eating spaghetti.

It’s become something of a meme as well as a benchmark: Seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.

Google Veo 2 has done it.

We are now eating spaghett at last. pic.twitter.com/AZO81w8JC0

— Jerrod Lew (@jerrod_lew) December 17, 2024

Will Smith and pasta is but one of several bizarre “unofficial” benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AI models play games like Pictionary and Connect 4 against each other.

It’s not like there aren’t more academic tests of an AI’s performance. So why did the weirder ones blow up?

Image Credits: Paul Calcraft

For one, many of the industry-standard AI benchmarks don’t tell the average person very much. Companies often cite their AI’s ability to answer questions on Math Olympiad exams, or figure out plausible solutions to Ph.D.-level problems. Yet most people — yours truly included — use chatbots for things like responding to emails and basic research.

Crowdsourced industry measures aren’t necessarily better or more informative.

Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image. But raters tend not to be representative — most come from AI and tech industry circles — and cast their votes based on personal, hard-to-pin-down preferences.

The Chatbot Arena interface. Image Credits: LMSYS

Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don’t compare a system’s performance to that of the average person.

“The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless,” Mollick wrote.

Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical — or even all that generalizable. Just because an AI nails the Will Smith test doesn’t mean it’ll generate, say, a burger well.

Mcbench (note the typo; there’s no such model as Claude 3.6 Sonnet). Image Credits: Adonis Singh

One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI instead of its ability in narrow domains. That’s sensible. But I have a feeling that weird benchmarks aren’t going away anytime soon. Not only are they entertaining — who doesn’t like watching AI build Minecraft castles? — but they’re easy to understand. And as my colleague Max Zeff wrote about recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.

The only question in my mind is, which odd new benchmarks will go viral in 2025?
