People are benchmarking AI by having it make balls bounce in rotating shapes

The list of informal, weird AI benchmarks keeps growing.

Over the past few days, some in the AI community on X have become obsessed with a test of how different AI models, particularly so-called reasoning models, handle prompts like this: “Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and make sure that the ball stays within the shape.”

Some models manage better on this “ball in rotating shape” benchmark than others. According to one user on X, Chinese AI lab DeepSeek’s freely available R1 wiped the floor with OpenAI’s o1 pro mode, which costs $200 per month as a part of OpenAI’s ChatGPT Pro plan.

👀 DeepSeek R1 (right) crushed o1-pro (left) 👀

Prompt: “write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square” pic.twitter.com/3Sad9efpeZ

— Ivan Fioravanti ᯅ (@ivanfioravanti) January 22, 2025

Per another X poster, Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro models misjudged the physics, resulting in the ball escaping the shape. Other users reported that Google’s Gemini 2.0 Flash Thinking Experimental, and even OpenAI’s older GPT-4o, aced the evaluation in one go.

Tested 9 AI models on a physics simulation task: rotating triangle + bouncing ball. Results:

🥇 Deepseek-R1
🥈 Sonar Huge
🥉 GPT-4o

Worst? OpenAI o1: Completely misunderstood the task 😂

Video below ↓ First row = Reasoning models, rest = Base models. pic.twitter.com/EOYrHvNazr

— Aadhithya D (@Aadhithya_D2003) January 22, 2025

But what does it prove that an AI can or can’t code a rotating, ball-containing shape?

Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations incorporate collision detection algorithms, which try to identify when two objects (e.g. a ball and the side of a shape) collide. Poorly written algorithms can affect the simulation’s performance or lead to obvious physics mistakes.
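To make that concrete, here is a minimal Python sketch of one common way to test a ball against a single side of a shape: find the point on the side closest to the ball's center, and if it is nearer than the ball's radius, reflect the velocity about the side's normal. The function names are hypothetical, the side is assumed to be stationary for the duration of the check, and this is an illustration rather than the code any of the models produced.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to point p."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    denom = abx * abx + aby * aby
    if denom == 0:
        return a  # degenerate segment: both endpoints coincide
    t = ((p[0] - a[0]) * abx + (p[1] - a[1]) * aby) / denom
    t = max(0.0, min(1.0, t))  # clamp so the point stays on the segment
    return (a[0] + t * abx, a[1] + t * aby)

def collide_ball_with_edge(pos, vel, radius, a, b):
    """Bounce a ball off one stationary edge; returns (pos, vel, hit)."""
    cx, cy = closest_point_on_segment(pos, a, b)
    dx, dy = pos[0] - cx, pos[1] - cy
    dist = math.hypot(dx, dy)
    # No overlap, or the center sits exactly on the edge (normal undefined,
    # which this sketch simply skips).
    if dist >= radius or dist == 0.0:
        return pos, vel, False
    nx, ny = dx / dist, dy / dist          # unit normal, edge toward ball
    vn = vel[0] * nx + vel[1] * ny         # velocity component along normal
    if vn < 0:                             # only reflect if moving into the edge
        vel = (vel[0] - 2 * vn * nx, vel[1] - 2 * vn * ny)
    # Push the ball back out so it no longer overlaps the edge.
    pos = (cx + nx * radius, cy + ny * radius)
    return pos, vel, True
```

Skipping the final push-out step, or checking only once per large time step, is the kind of shortcut that can let a fast ball tunnel through a side and escape the shape.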

X user N8 Programs, a researcher in residence at AI startup Nous Research, says it took him roughly two hours to program a bouncing ball in a rotating heptagon from scratch. “One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust,” N8 Programs explained in a post.
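The "multiple coordinate systems" he mentions can be illustrated with a short Python sketch. It assumes a shape rotating about a fixed center by an angle in radians, with counterclockwise as positive. The helper names are hypothetical and are not taken from N8 Programs' code.

```python
import math

def rotate(v, angle):
    """Rotate 2D vector v counterclockwise by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def world_to_local(p, center, angle):
    """Express a world-space point in the rotating shape's own frame."""
    return rotate((p[0] - center[0], p[1] - center[1]), -angle)

def local_to_world(p, center, angle):
    """Map a point from the shape's frame back to world space."""
    r = rotate(p, angle)
    return (r[0] + center[0], r[1] + center[1])

def wall_velocity(point, center, omega):
    """World-space velocity of a point on the shape spinning at omega rad/s."""
    rx, ry = point[0] - center[0], point[1] - center[1]
    return (-omega * ry, omega * rx)  # 2D form of omega cross r

def regular_polygon(n, radius):
    """Vertices of a regular n-gon in the shape's local frame."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

# Example: a heptagon whose vertices are fixed in local coordinates and
# mapped to world space each frame using the current rotation angle.
local_vertices = regular_polygon(7, 200.0)
world_vertices = [local_to_world(v, (400.0, 300.0), 0.1) for v in local_vertices]
```

One plausible design is to keep the shape's vertices fixed in its local frame, convert the ball's position into that frame for the collision test, and then account for the moving wall by bouncing the ball's velocity relative to `wall_velocity` at the contact point. Mixing up which frame a position or velocity lives in is an easy way to end up with a ball that drifts through a side.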

But while bouncing balls and rotating shapes are a reasonable test of programming skills, they’re not a very rigorous AI benchmark. Even slight variations in the prompt can, and do, yield different outcomes. That’s why some users on X report having more luck with o1, while others say that R1 falls short.

If anything, viral tests like these point to the intractable problem of creating useful systems of measurement for AI models. It’s often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren’t relevant to most people.

Many efforts are underway to build better tests, like the ARC-AGI benchmark and Humanity’s Last Exam. We’ll see how those fare — and in the meantime watch GIFs of balls bouncing in rotating shapes.
