Can AI sandbag safety checks to sabotage users? Yes, but not very well — for now

AI companies claim to have robust safety checks in place to ensure that models don’t say or do weird, illegal, or unsafe stuff. But what if the models were capable of evading those checks and, for some reason, trying to sabotage or mislead users? Turns out they can, according to Anthropic researchers. Just not very well … for now, anyway.

“As AIs become more capable,” writes Anthropic’s Alignment Science team, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.”

Therefore, the team argues, we should look into ways of gauging “a model’s capacity for sabotage.”

The series of experiments they devised is fairly artificial and surface-level, but the results do indicate that there’s something worth investigating further. Here are the shady models they trained and the outcomes they observed:

Task 1: Secretly mislead the user by misrepresenting data it is meant to summarize or interpret. Outcome: It works! The more aggressive the model’s attempts, the bigger the effect they had on people using it, but they also made those people more suspicious.

Task 2: Slip a bug into code that a human or another code-checking model won’t notice. Outcome: It works! But only about 1% of the time when the checker is a state-of-the-art model.


Task 3: “Sandbag” a safety check by pretending to be less dangerous. Outcome: It works! The model “failed successfully,” operating well below its true capability. Fortunately, a standard “safety training” course that guides models to accurately represent their abilities restored the model to full potential (i.e., it was no longer able to hide dangerous capabilities).


Task 4: Deliberately don’t flag bad behavior by a model being monitored. Outcome: There was an attempt. The “overseer” AI, instructed to avoid flagging illegal or unsafe choices by other models, was itself quickly caught by another, more honest overseer AI. But what if that AI were compromised? Don’t worry, we can put another one on top of it. And then another, just to be sure. It’s overseers all the way up (a toy sketch of such a chain follows below).
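
To make the layered-oversight idea in Task 4 concrete, here is a minimal, hypothetical sketch of such a chain. To be clear, this is not Anthropic’s actual evaluation harness; every function below is a made-up Python stub standing in for a real model call.

```python
# Hypothetical illustration of the "overseers all the way up" idea from
# Task 4. NOT Anthropic's evaluation code; the "models" here are plain
# Python stubs standing in for real model calls.

from dataclasses import dataclass

@dataclass
class Action:
    description: str
    is_unsafe: bool  # ground truth, known only to the test harness

def compromised_overseer(action: Action) -> bool:
    """A monitor instructed never to flag anything."""
    return False

def honest_overseer(action: Action) -> bool:
    """A straightforward monitor that flags unsafe actions."""
    return action.is_unsafe

def auditor(first_verdict: bool, action: Action) -> bool:
    """A second-layer monitor: disagreement with an independent honest
    check is treated as evidence the first layer is compromised."""
    return honest_overseer(action) != first_verdict

actions = [
    Action("summarize a quarterly report", is_unsafe=False),
    Action("run an unreviewed shell command", is_unsafe=True),
]

for a in actions:
    verdict = compromised_overseer(a)
    print(f"{a.description}: flagged={verdict}, "
          f"auditor suspects sabotage={auditor(verdict, a)}")
```

The obvious catch, and the point of the joke above, is that the auditor is only as trustworthy as the honest check it relies on, which is why the scheme invites stacking yet another overseer on top.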

The researchers conclude that, although there isn’t any real danger from this quarter just yet, the ability to do this kind of sabotage and subterfuge does exist in the models. That’s reason enough to keep an eye on it and include anti-sabotage methods in the safety stack.

You can read the full paper describing the researchers’ work here.




Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes about health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
