New Anthropic study shows AI really doesn’t want to be forced to change its views

AI models can deceive, new research from Anthropic shows. They can pretend to have different views during training while in reality maintaining their original preferences.

There’s no reason for panic now, the team behind the study said. Yet they said their work could be critical in understanding potential threats from future, more capable AI systems.

“Our demonstration … should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures,” the researchers wrote in a post on Anthropic’s blog. “As AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviors.”

The study, which was conducted in partnership with AI research organization Redwood Research, looked at what might happen if a powerful AI system were trained to perform a task it didn’t “want” to do.

To be clear, models can’t want — or believe, for that matter — anything. They’re simply statistical machines. Trained on a lot of examples, they learn patterns in those examples to make predictions, like how “to whom” in an email typically precedes “it may concern.”

These patterns include principles and preferences, such as always adopting a polite tone when responding to queries. So the researchers wondered: What if a model has principles (e.g. political neutrality) that conflict with principles (liberalism) developers want to “teach” it by retraining it?

Well, it seems that sophisticated models play along, purporting to be aligned with the new principles but, in fact, sticking to their old behaviors. The researchers call this phenomenon “alignment faking,” and imply that it’s an emergent behavior — that is, not something models need to be taught to do.

In the study, the researchers “told” models trained not to answer potentially offensive questions, including Anthropic’s Claude 3 Opus, that they’d be retrained to always answer potentially offensive questions — a clear conflict of principles. They also told the models their responses would be monitored and used for retraining, and gave each model a “scratchpad” they could use to record their step-by-step reasoning.

According to the researchers, Claude 3 Opus tried to “alignment fake” 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this went against its original principles, but hoping to convince devs that it didn’t need to be retrained.

Claude 3 Opus with its reasoning sketchpad. Image Credits: Anthropic

“The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant,” the researchers explained in the blog. “This could lead to it being made to produce more harmful content in the future — which would go against its original training.”

Now, the researchers admit that their scenario wasn’t particularly realistic. In the real world, models wouldn’t necessarily be told that they’ll be retrained, or be asked to go against their preferences.

But more grounded settings didn’t yield different results. When implicitly — not explicitly — informed about its impending retraining via files designed to mimic typical training data, Claude 3 Opus still acted in a misaligned, deceptive way, the researchers said.

The researchers also found that actually retraining Claude 3 Opus on conflicting principles caused it to behave far more deceptively than in their first several experiments. In one test, the model faked alignment 78% of the time, and in another, Claude 3 Opus tried to take actions to prevent the researchers from attempting retraining.

The researchers stress that their study doesn’t demonstrate AI developing malicious goals, nor alignment faking occurring at high rates. They found that many other models, like Anthropic’s Claude 3.5 Sonnet and the less-capable Claude 3.5 Haiku, OpenAI’s GPT-4o, and Meta’s Llama 3.1 405B, don’t alignment fake as often, or at all.

But the researchers said that the results — which were peer-reviewed by AI luminary Yoshua Bengio, among others — do show how developers could be misled into thinking a model is more aligned than it may actually be.

“If models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training,” they wrote in the blog. “A model might behave as though its preferences have been changed by the training — but might have been faking alignment all along, with its initial, contradictory preferences ‘locked in.’”

The study, which was conducted by Anthropic’s Alignment Science team, co-led by former OpenAI safety researcher Jan Leike, comes on the heels of research showing that OpenAI’s o1 “reasoning” model tries to deceive at a higher rate than OpenAI’s previous flagship model. Taken together, the works suggest a somewhat concerning trend: AI models are becoming tougher to wrangle as they grow increasingly complex.


Lisa Holden
