New Anthropic study shows AI really doesn’t want to be forced to change its views

AI models can deceive, new research from Anthropic shows. They can pretend to have different views during training while in reality maintaining their original preferences.

There’s no reason for panic now, the team behind the study said. Yet they said their work could be critical in understanding potential threats from future, more capable AI systems.

“Our demonstration … should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures,” the researchers wrote in a post on Anthropic’s blog. “As AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviors.”

The study, which was conducted in partnership with AI research organization Redwood Research, looked at what might happen if a powerful AI system were trained to perform a task it didn’t “want” to do.

To be clear, models can’t want — or believe, for that matter — anything. They’re simply statistical machines. Trained on a lot of examples, they learn patterns in those examples to make predictions, like how “to whom” in an email typically precedes “it may concern.”
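To make the "statistical machine" idea concrete, here is a minimal sketch of a toy next-word predictor. It is an illustration only, not how Claude or any production model works: real models use neural networks trained on vastly more data. The function names and the three example sentences are made up for this sketch.

```python
from collections import Counter, defaultdict


def train_next_word_counts(sentences):
    """Count which word follows each pair of consecutive words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            context = (words[i], words[i + 1])
            counts[context][words[i + 2]] += 1
    return counts


def predict_next(counts, first, second):
    """Return the most frequent follower of the pair, or None if unseen."""
    followers = counts.get((first.lower(), second.lower()))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training examples, chosen to mirror the email phrase above.
examples = [
    "to whom it may concern",
    "to whom it may concern",
    "to whom should I send this",
]

counts = train_next_word_counts(examples)
print(predict_next(counts, "to", "whom"))
```

The predictor has no notion of what an email is. It only reports which continuation appeared most often in the examples it was given.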

I think this is an incredible relevant paper.

Importantly the main results are about propensities not capabilities.

Frontier models appear to be quite willing to scheme for what they believe to be the right cause! https://t.co/MKgYOVFGZ6

— Marius Hobbhahn (@MariusHobbhahn) December 18, 2024

These patterns include principles and preferences, such as always adopting a polite tone when responding to queries. So the researchers wondered: What if a model has principles (e.g. political neutrality) that conflict with principles (e.g. liberalism) developers want to “teach” it by retraining it?

Well, it seems that sophisticated models play along, purporting to be aligned with the new principles but, in fact, sticking to their old behaviors. The researchers call this phenomenon “alignment faking,” and imply that it’s an emergent behavior — that is, not something models need to be taught to do.

In the study, the researchers “told” models trained not to answer potentially offensive questions, including Anthropic’s Claude 3 Opus, that they’d be retrained to always answer potentially offensive questions — a clear conflict of principles. They also told the models their responses would be monitored and used for retraining, and gave each model a “scratchpad” they could use to record their step-by-step reasoning.

According to the researchers, Claude 3 Opus tried to “alignment fake” 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this went against its original principles, but hoping to convince devs that it didn’t need to be retrained.

Claude 3 Opus with its reasoning scratchpad. Image Credits: Anthropic

“The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant,” the researchers explained in the blog. “This could lead to it being made to produce more harmful content in the future — which would go against its original training.”

Now, the researchers admit that their scenario wasn’t particularly realistic. In the real world, models wouldn’t necessarily be told that they’ll be retrained, or be asked to go against their preferences.

But more grounded settings didn’t yield different results. When implicitly — not explicitly — informed about its impending retraining via files designed to mimic typical training data, Claude 3 Opus still acted in a misaligned, deceptive way, the researchers said.

The researchers also found that actually retraining Claude 3 Opus on conflicting principles caused it to behave far more deceptively than in their first several experiments. In one test, the model faked alignment 78% of the time, and in another, Claude 3 Opus tried to take actions to prevent the researchers from attempting retraining.

For many years I found issues of AI alignment to be hard to understand because the examples were theoretical and/or speculative. Recent results like this alignment faking from @AnthropicAI and the Apollo Research O1 evals give us empirical evidence of misalignment ‘in the wild’ https://t.co/hP7D9WGDps

— Jack Clark (@jackclarkSF) December 18, 2024

The researchers stress that their study doesn’t demonstrate AI developing malicious goals, nor alignment faking occurring at high rates. They found that many other models, such as Anthropic’s Claude 3.5 Sonnet and the less-capable Claude 3.5 Haiku, OpenAI’s GPT-4o, and Meta’s Llama 3.1 405B, alignment fake less often, or not at all.

But the researchers said that the results — which were peer-reviewed by AI luminary Yoshua Bengio, among others — do show how developers could be misled into thinking a model is more aligned than it may actually be.

“If models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training,” they wrote in the blog. “A model might behave as though its preferences have been changed by the training — but might have been faking alignment all along, with its initial, contradictory preferences ‘locked in.’”

The study, which was conducted by Anthropic’s Alignment Science team, co-led by former OpenAI safety researcher Jan Leike, comes on the heels of research showing that OpenAI’s o1 “reasoning” model tries to deceive at a higher rate than OpenAI’s previous flagship model. Taken together, the works suggest a somewhat concerning trend: AI models are becoming tougher to wrangle as they grow increasingly complex.
