People are benchmarking AI by having it make balls bounce in rotating shapes

The list of informal, weird AI benchmarks keeps growing.

Over the past few days, some in the AI community on X have become obsessed with a test of how different AI models, particularly so-called reasoning models, handle prompts like this: “Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and make sure that the ball stays within the shape.”

Some models manage better on this “ball in rotating shape” benchmark than others. According to one user on X, Chinese AI lab DeepSeek’s freely available R1 wiped the floor with OpenAI’s o1 pro mode, which costs $200 per month as part of OpenAI’s ChatGPT Pro plan.

Per another X poster, Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro models misjudged the physics, resulting in the ball escaping the shape. Other users reported that Google’s Gemini 2.0 Flash Thinking Experimental, and even OpenAI’s older GPT-4o, aced the evaluation in one go.

But what does it prove that an AI can or can’t code a rotating, ball-containing shape?

Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations incorporate collision detection algorithms, which try to identify when two objects (e.g., a ball and the side of a shape) collide. Poorly written algorithms can slow the simulation down or lead to obvious physics mistakes, such as a ball passing straight through a wall.
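To make the idea concrete, here is a minimal Python sketch of one common way to check a ball against a single side of a shape. It is an illustration only, not code produced by any of the models mentioned, and the function names (`closest_point_on_segment`, `ball_hits_segment`, `reflect`) are hypothetical. It assumes a 2D ball described by a center and a radius and a wall described by two endpoints.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is nearest to point p."""
    ax, ay = a
    bx, by = b
    px, py = p
    abx, aby = bx - ax, by - ay
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:
        return a  # degenerate segment: both endpoints coincide
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * abx + (py - ay) * aby) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * abx, ay + t * aby)

def ball_hits_segment(center, radius, a, b):
    """True if a ball (center, radius) touches or overlaps segment a-b."""
    qx, qy = closest_point_on_segment(center, a, b)
    return math.hypot(center[0] - qx, center[1] - qy) <= radius

def reflect(velocity, normal):
    """Mirror a velocity about a wall's unit normal: v' = v - 2 (v . n) n."""
    vx, vy = velocity
    nx, ny = normal
    dot = vx * nx + vy * ny
    return (vx - 2 * dot * nx, vy - 2 * dot * ny)
```

A simulation would call `ball_hits_segment` for every side on every frame and, on a hit, pass the ball's velocity through `reflect`. Skipping the clamp, or checking too rarely for a fast ball, is the kind of shortcut that tends to let the ball slip out.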

X user N8 Programs, a researcher in residence at AI startup Nous Research, says it took him roughly two hours to program a bouncing ball in a rotating heptagon from scratch. “One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust,” N8 Programs explained in a post.
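As a rough illustration of the bookkeeping involved, the sketch below advances a ball inside a rotating regular polygon by one time step. It is not N8 Programs' code or any model's output. Every name and default value here (`step`, `walls`, `omega`, `restitution`, the 0.05 ball radius and so on) is an assumption made for the example. Rather than keeping a separate rotating coordinate system, it recomputes the walls in world coordinates on each step and handles the bounce relative to the moving wall, which is one of several workable designs.

```python
import math

def polygon_vertices(n_sides, circumradius, angle):
    """World-frame vertices (counterclockwise) of a regular polygon at the origin."""
    return [
        (circumradius * math.cos(angle + 2 * math.pi * k / n_sides),
         circumradius * math.sin(angle + 2 * math.pi * k / n_sides))
        for k in range(n_sides)
    ]

def walls(verts):
    """Yield (point_on_wall, inward_unit_normal) for each edge of a CCW polygon."""
    for i, (ax, ay) in enumerate(verts):
        bx, by = verts[(i + 1) % len(verts)]
        ex, ey = bx - ax, by - ay
        length = math.hypot(ex, ey)
        # For counterclockwise vertex order, the left-hand normal points inward.
        yield (ax, ay), (-ey / length, ex / length)

def min_wall_distance(pos, verts):
    """Smallest signed distance from pos to any wall line (positive means inside)."""
    return min((pos[0] - ax) * nx + (pos[1] - ay) * ny
               for (ax, ay), (nx, ny) in walls(verts))

def step(pos, vel, angle, dt, n_sides=7, circumradius=1.0, ball_radius=0.05,
         omega=0.3, gravity=-1.0, restitution=0.9):
    """Advance the ball and the polygon's rotation by dt; return the new state."""
    # Move the ball under gravity and rotate the shape.
    vx, vy = vel[0], vel[1] + gravity * dt
    x, y = pos[0] + vx * dt, pos[1] + vy * dt
    angle += omega * dt
    for (ax, ay), (nx, ny) in walls(polygon_vertices(n_sides, circumradius, angle)):
        dist = (x - ax) * nx + (y - ay) * ny
        if dist >= ball_radius:
            continue  # ball is clear of this wall
        # Push the ball back so it just touches the wall.
        x += (ball_radius - dist) * nx
        y += (ball_radius - dist) * ny
        # Velocity of the rotating wall at the contact point (omega cross r).
        cx, cy = x - ball_radius * nx, y - ball_radius * ny
        wx, wy = -omega * cy, omega * cx
        # Bounce only if the ball is approaching the wall in the wall's frame.
        rel_n = (vx - wx) * nx + (vy - wy) * ny
        if rel_n < 0:
            vx -= (1 + restitution) * rel_n * nx
            vy -= (1 + restitution) * rel_n * ny
    return (x, y), (vx, vy), angle
```

Calling `step` in a loop and drawing the result each frame would give the animation. Treating each wall as an infinite line works here only because a regular polygon is convex; a star or other concave shape would need a segment-based check instead.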

But while bouncing balls and rotating shapes are a reasonable test of programming skills, they’re not a very rigorous AI benchmark. Even slight variations in the prompt can, and do, yield different outcomes. That’s why some users on X report having more luck with o1, while others say that R1 falls short.

If anything, viral tests like these point to the intractable problem of creating useful systems of measurement for AI models. It’s often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren’t relevant to most people.

Many efforts are underway to build better tests, like the ARC-AGI benchmark and Humanity’s Last Exam. We’ll see how those fare — and in the meantime watch GIFs of balls bouncing in rotating shapes.





Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes about health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
