People are benchmarking AI by having it make balls bounce in rotating shapes

The list of informal, weird AI benchmarks keeps growing.

Over the past few days, some in the AI community on X have become obsessed with a test of how different AI models, particularly so-called reasoning models, handle prompts like this: “Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and make sure that the ball stays within the shape.”

Some models manage better on this “ball in rotating shape” benchmark than others. According to one user on X, Chinese AI lab DeepSeek’s freely available R1 wiped the floor with OpenAI’s o1 pro mode, which costs $200 per month as part of OpenAI’s ChatGPT Pro plan.

Per another X poster, Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro models misjudged the physics, resulting in the ball escaping the shape. Other users reported that Google’s Gemini 2.0 Flash Thinking Experimental, and even OpenAI’s older GPT-4o, aced the evaluation in one go.

But what does it prove that an AI can or can’t code a rotating, ball-containing shape?

Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations rely on collision detection algorithms, which try to identify when two objects (e.g., the ball and a wall of the shape) collide. Poorly written algorithms can degrade the simulation’s performance or produce obvious physics mistakes.
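To make that concrete: for a convex shape, the heart of such an algorithm is a circle-versus-edge test. Find the point on each wall nearest the ball, and if the ball overlaps it, push the ball out and reflect its velocity. Below is a minimal Python sketch of that check; the function name, argument layout, and restitution parameter are our own illustration, not code produced by any of the models mentioned above.

    import math

    def collide_ball_with_edge(pos, vel, radius, a, b, restitution=1.0):
        # Illustrative sketch, not any model's actual output.
        # pos, vel, a, b are (x, y) tuples; a-b is one wall of the shape.
        ax, ay = a
        bx, by = b
        px, py = pos
        # Find the point on segment a-b closest to the ball's center.
        ex, ey = bx - ax, by - ay
        t = ((px - ax) * ex + (py - ay) * ey) / (ex * ex + ey * ey)
        t = max(0.0, min(1.0, t))
        cx, cy = ax + t * ex, ay + t * ey
        dx, dy = px - cx, py - cy
        dist = math.hypot(dx, dy)
        if dist == 0.0 or dist >= radius:
            return pos, vel  # no contact this frame
        # Contact: push the ball out of the wall along the normal,
        # then reflect the inbound component of its velocity.
        nx, ny = dx / dist, dy / dist
        pos = (cx + nx * radius, cy + ny * radius)
        dot = vel[0] * nx + vel[1] * ny
        if dot < 0.0:
            vel = (vel[0] - (1.0 + restitution) * dot * nx,
                   vel[1] - (1.0 + restitution) * dot * ny)
        return pos, vel

Run that test against every edge of the shape each frame and most of the “ball escapes the shape” failures disappear; skip the push-out step or mishandle the normal, and you get exactly the glitches users are posting GIFs of.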

X user N8 Programs, a researcher in residence at AI startup Nous Research, says it took him roughly two hours to program a bouncing ball in a rotating heptagon from scratch. “One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust,” N8 Programs explained in a post.
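That multi-frame bookkeeping is the tricky part. One common approach, sketched below under our own simplifying assumption that the shape spins about the origin at angular speed omega, is to transform the ball into the shape’s rotating frame, where the walls sit still, resolve the collision there, and map the result back.

    import math

    def to_rotating_frame(p, v, theta, omega):
        # Illustrative sketch: assumes the shape rotates about the
        # origin by angle theta at angular speed omega.
        c, s = math.cos(-theta), math.sin(-theta)
        px = p[0] * c - p[1] * s
        py = p[0] * s + p[1] * c
        # Velocity rotates the same way, plus a term from the frame's
        # own rotation: v' = R(-theta) v - omega x p'.
        vx = v[0] * c - v[1] * s
        vy = v[0] * s + v[1] * c
        return (px, py), (vx + omega * py, vy - omega * px)

Inside that frame, an edge test like the one above runs against fixed polygon vertices; the inverse rotation then maps the corrected position and velocity back to the screen.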

But while bouncing balls and rotating shapes are a reasonable test of programming skill, they’re not a very rigorous AI benchmark. Even slight variations in the prompt can, and do, yield different outcomes. That’s why some users on X report having more luck with o1, while others say that R1 falls short.

If anything, viral tests like these point to the intractable problem of creating useful systems of measurement for AI models. It’s often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren’t relevant to most people.

Many efforts are underway to build better tests, like the ARC-AGI benchmark and Humanity’s Last Exam. We’ll see how those fare — and in the meantime watch GIFs of balls bouncing in rotating shapes.






Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
