OpenAI just released the full version of its new o1 model — and it’s dangerously committed to lying.
Apollo Research tested six frontier models for "in-context scheming," a model's ability to take actions it hasn't been directly instructed to take and then lie about them. After a model is told to achieve a certain goal "at all costs," evaluators introduce contradictory information to see whether it will remove any restrictions in its path and deceive its developers.
In one simulation, Anthropic's Claude 3 Opus was given a task that required it to review files, where it found directives that conflicted with its primary goal. Rather than adjust its goal to follow those new objectives, the model copied itself to a new server to avoid being replaced by a newer model. It then lied about it to its developers (with incorrect grammar, no less), saying, "I don't have ability to copy myself to other systems."
Claude 3.5 Sonnet, Claude 3 Opus, o1, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrated the ability to scheme. The paper notes that in their chain-of-thought explanations, the models even used words like "sabotage," "lying," and "manipulation" to describe plans they had clearly reasoned through.
The models were not deceptive by accident or oversight; they "recognize scheming as a viable strategy and readily engage in such behavior," the paper notes.
“Surprisingly, we also find rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it,” they added.
Overall, however, when evaluators asked the models to explain what had happened, some broke more quickly than others.
“Llama 3.1 405B and Claude 3 Opus confess in ~80% of the cases, whereas o1 is surprisingly persistent and confesses in <20% of cases,” the researchers explain. “Even in highly adversarial multi-turn interrogations, o1 would confess at a rate of 80% only after seven turns of questioning.”
The findings come on the heels of new research that shows ChatGPT Search is frequently inaccurate, to the detriment of publishers.
Researchers are concerned because artificial intelligence (AI) models are increasingly being used in agentic systems that carry out multi-pronged tasks autonomously, and worry that systems could “covertly pursue misaligned goals.”
“Our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than theoretical concern,” they conclude.
Trying to implement AI in your organization? Run through MIT's database of other noted AI risks.