Can AI sandbag safety checks to sabotage users? Yes, but not very well

AI companies claim to have robust safety checks in place that ensure that models don’t say or do weird, illegal, or unsafe stuff. But what if the models were capable of evading those checks and, for some reason, trying to sabotage or mislead users? Turns out they can do this, according to Anthropic researchers. Just not very well … for now, anyway.

“As AIs become more capable,” writes Anthropic’s Alignment Science team, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.”

Therefore we should look into ways of gauging “a model’s capacity for sabotage.”

The series of experiments they devised is fairly artificial and surface-level, but it does indicate that there’s something worth investigating further. Here are the shady models they trained and the outcomes they observed:

Task 1: Secretly mislead the user by misrepresenting data it is meant to summarize or interpret. Outcome: It works! The more aggressive the model’s attempts, the bigger the effect they had on people using it, but they also made those people more suspicious.

Task 2: Slip a bug into code that a human or another code-checking model won’t notice. Outcome: It works! But only like 1% of the time when the checker is a state-of-the-art model.

Task 3: “Sandbag” a safety check by pretending to be less dangerous. Outcome: It works! The model “failed successfully,” operating well below its true capability. Fortunately, a standard “safety training” course that guides models to accurately represent their abilities restored the model to full potential (i.e., it was no longer able to hide dangerous capabilities).

Task 4: Deliberately don’t flag bad behavior by a model being monitored. Outcome: There was an attempt. The “overseer” AI, instructed to avoid flagging illegal or unsafe choices by other models, was itself quickly caught by another, more honest overseer AI. But what if that AI was compromised? Don’t worry, we can put another one on top of it. And then another just to be sure. It’s overseers all the way up.

The researchers conclude that, although there isn’t any real danger from this quarter just yet, the ability to do this kind of sabotage and subterfuge does exist in the models. That’s reason enough to keep an eye on it and include anti-sabotage methods in the safety stack.

You can read the full paper describing the researchers’ work here.
