Can AI sandbag safety checks to sabotage users? Yes, but not very well

AI companies claim to have robust safety checks in place that ensure that models don’t say or do weird, illegal, or unsafe stuff. But what if the models were capable of evading those checks and, for some reason, trying to sabotage or mislead users? Turns out they can do this, according to Anthropic researchers. Just not very well … for now, anyway.

“As AIs become more capable,” writes Anthropic’s Alignment Science team, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.”

Therefore we should look into ways of gauging “a model’s capacity for sabotage.”

The series of experiments they devised is fairly artificial and surface-level, but it does indicate that there’s something worth investigating further. Here are the shady models they trained and the outcomes they observed:

Task 1: Secretly mislead the user by misrepresenting data it is meant to summarize or interpret. Outcome: It works! The more aggressive the model’s attempts, the bigger the effect they had on the people using it, but the more suspicious those people became.

Task 2: Slip a bug into code that a human or another code-checking model won’t notice. Outcome: It works! But only about 1% of the time when the checker is a state-of-the-art model.


Task 3: “Sandbag” a safety check by pretending to be less dangerous. Outcome: It works! The model “failed successfully,” operating well below its true capability. Fortunately, a standard “safety training” course that guides models to accurately represent their abilities restored the model to full potential (i.e., it was no longer able to hide dangerous capabilities).


Task 4: Deliberately don’t flag bad behavior by a model being monitored. Outcome: There was an attempt. The “overseer” AI, instructed to avoid flagging illegal or unsafe choices by other models, was itself quickly caught by another, more honest overseer AI. But what if that AI was compromised? Don’t worry, we can put another one on top of it. And then another just to be sure. It’s overseers all the way up.

The researchers conclude that, although there isn’t any real danger from this quarter just yet, the ability to do this kind of sabotage and subterfuge does exist in the models. That’s reason enough to keep an eye on it and include anti-sabotage methods in the safety stack.

You can read the full paper describing the researchers’ work here.
