Anthropic used Pokémon to benchmark its newest AI model


Anthropic used Pokémon to benchmark its newest AI model. Yes, really.

In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously.
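To make that setup concrete, here is a minimal sketch of an agent loop with the three pieces the post mentions: pixel input, a function call to press buttons, and a small memory. This is not Anthropic's harness. `StubEmulator`, `Agent`, `always_right`, and `run` are hypothetical names invented for illustration, the "game" is a one-row corridor rather than Pokémon Red, and the policy is a hard-coded stand-in for where a model call would go.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

BUTTONS = ("up", "down", "left", "right", "a", "b", "start", "select")
Frame = List[List[int]]  # rows of grayscale pixel values


class StubEmulator:
    """Hypothetical stand-in for a Game Boy emulator: a one-row corridor."""

    def __init__(self, goal: int = 3) -> None:
        self.position = 0
        self.goal = goal

    def screen(self) -> Frame:
        # The player's cell is drawn as 255, every other cell as 0.
        return [[255 if x == self.position else 0 for x in range(self.goal + 1)]]

    def press(self, button: str) -> None:
        # The "function call" the agent uses to act on the game.
        if button not in BUTTONS:
            raise ValueError(f"unknown button: {button}")
        if button == "right":
            self.position = min(self.position + 1, self.goal)
        elif button == "left":
            self.position = max(self.position - 1, 0)

    def done(self) -> bool:
        return self.position == self.goal


@dataclass
class Agent:
    # choose_button is where a model call would go (assumption: it sees
    # the current frame plus a short history of past button presses).
    choose_button: Callable[[Frame, List[str]], str]
    memory: Deque[str] = field(default_factory=lambda: deque(maxlen=5))

    def step(self, emulator: StubEmulator) -> str:
        frame = emulator.screen()
        button = self.choose_button(frame, list(self.memory))
        emulator.press(button)
        self.memory.append(button)  # bounded memory: oldest entries drop off
        return button


def always_right(frame: Frame, memory: List[str]) -> str:
    """Trivial placeholder policy; a real agent would reason over the frame."""
    return "right"


def run(agent: Agent, emulator: StubEmulator, max_actions: int = 100) -> int:
    """Play until the goal is reached or the action budget runs out."""
    actions = 0
    while not emulator.done() and actions < max_actions:
        agent.step(emulator)
        actions += 1
    return actions
```

Calling `run(Agent(always_right), StubEmulator(goal=3))` walks the corridor and returns the number of actions taken, which is the same kind of count Anthropic reported for its Pokémon run.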

A notable feature of Claude 3.7 Sonnet is its ability to engage in “extended thinking.” Like OpenAI’s o3-mini and DeepSeek’s R1, Claude 3.7 Sonnet can “reason” through challenging problems by applying more compute and taking more time.

That came in handy in Pokémon Red, apparently.

Where a previous version, Claude 3.0 Sonnet, failed even to leave the house in Pallet Town where the story begins, Claude 3.7 Sonnet successfully battled three Pokémon gym leaders and won their badges.

Image Credits: Anthropic

Now, it’s not clear how much compute Claude 3.7 Sonnet needed to reach those milestones, or how long each took. Anthropic said only that the model performed 35,000 actions to reach the last of the three gym leaders, Lt. Surge.

It surely won’t be long before some enterprising developer finds out.

Pokémon Red is more of a toy benchmark than anything. However, there is a long history of games being used for AI benchmarking purposes. In the past few months alone, a number of new apps and platforms have cropped up to test models’ game-playing abilities on titles ranging from Street Fighter to Pictionary.




Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
