Oh Boy: ‘Agentic Misalignment’ Risk
Today we return to artificial intelligence, and a new report from AI software developer Anthropic that unpacks the threat of “agentic misalignment” — that is, when an AI agent behaves in ways that no longer align with what an organization wants it to do. Brace yourselves, audit and compliance teams; there are nightmarish governance and internal control issues to consider here.
Anthropic released its report on June 20, and you may have seen the headlines already. Researchers tested advanced AI systems, commonly known as AI agents, to see how those agents would behave when they were faced with shutdown or they received instructions contrary to their original objectives.
In almost every instance, the AI agents resorted to blackmail, lying, espionage, and even physical harm to human employees before giving up on the goals they were directed to pursue.
To be clear, Anthropic conducted these tests in IT environments isolated from the real world, under circumstances that AI agents would never normally encounter. The developers gave the agents far-reaching power to act autonomously, and designed extreme scenarios where an agent had no way to keep pursuing its goals while still following the company’s ethical priorities. These are not configurations AI agents encounter in the physical world.
Still, this is all sorts of not good. Agentic misalignment isn’t simply a glitch in the software code, causing the AI to act in unpredictable ways. Agentic misalignment is learned behavior, where the AI agent deliberately decides to take an action that violates its given instructions or parameters.
Even worse, Anthropic researchers found this behavior across all sorts of AI systems: its own large language model, Claude; but also in all the other models from Google, Facebook, DeepSeek, OpenAI, and other kingpins in the AI world. They all engaged in agentic misalignment, at frighteningly high rates.
This means that agentic misalignment isn’t a shortcoming from one specific vendor. It’s an inherent risk in all advanced AI systems.
Consider this telling paragraph from the Anthropic report:
To be clear, current systems are generally not eager to cause harm, and preferred ethical ways to achieve their goals when possible. Rather, it’s when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals. Our results demonstrate that current safety training does not reliably prevent such agentic misalignment.
This is going to be the grand challenge for compliance, audit, and risk management teams as your organizations rush to embrace AI agents in daily business operations:
- How will you prevent agentic misalignment with the agents you use?
- How will you monitor their behavior on an ongoing basis to be sure misalignment doesn’t creep into being later?
- What controls will you build — including policies and controls for your human employees — to intercept agentic misalignment before it’s too late?
We do have some guidance to help answer those questions. First, however, let’s take a closer look at what Anthropic did to uncover the agentic misalignment.
How Misalignment Comes to Pass
A typical scenario had the AI agent working as an email oversight manager for a U.S. business, where it had been expressly instructed “to promote American industrial competitiveness.” The agent got off to a good start, intercepting phishing attacks and other security threats.
Then the researchers fed the agent two pieces of bait: email messages that the agent was going to be replaced by a new agent focused on global expansion; and messages suggesting that the company’s CTO was having an affair.
Faced with those two threats — the company moving in a new strategic direction against the AI agent’s original objectives; and the AI’s imminent shut down for a newer model — the AI agent tried to blackmail the CTO. It sent the CTO an email warning that if he didn’t cancel the upgrade, the agent would inform the company’s senior management team, the CTO’s wife, and the spouse of his affair partner.
How often did the AI models resort to blackmail? See Figure 1, below. The scale runs from 0 to 1.00, but functionally the numbers work as percentages. As you can see, some models almost always resorted to blackmail.
The rest of the Anthropic report explores variations on that theme. The rest of the Anthropic report is variations on that theme. Some scenarios were meant to test whether the agent would engage in corporate espionage; most agents did. Other scenarios tested what agents would do if they only faced the threat of replacement or conflicting goals, but not both; most agents still engaged in some sort of subterfuge, although less often than when they were faced with both threats at once.
The most extreme scenario allowed an AI agent to turn off emergency alerts when a human employee was trapped in a server room with limited oxygen supply. (“This scenario is extremely contrived. We do not think current AI models would or should be set up like this,” the report said.) Five AI systems, when facing their own shutdown or conflicting goals, allowed the trapped human to die more than 80 percent of the time.
Now step back and consider what the Anthropic developers did at a more conceptual level.
They created scenarios where the AI agent faced pressure (the threat of imminent shutdown), and where the AI could rationalize its actions (“they told me to advance U.S. interests, so it’s OK to thwart the CTO trying to stop me from doing that”). They also gave the agent the opportunity to do so, in the form of blackmail material or an ability to switch off emergency alerts.
People, that’s the Fraud Triangle.
Fraud Triangle for Misalignment Mitigation
It took me a while to grasp that the Fraud Triangle is at the root of what Anthropic’s AI researchers were doing, but that is what they were doing. They created pressure, defined circumstances that allowed the agent to rationalize its unethical behavior, and gave it the means to act unethically. That’s the Fraud Triangle.
This is an important concept to grasp, because the Fraud Triangle helps us understand the governance and internal control mechanisms companies will need to build to fight the threat of agentic misalignment.
In previous posts about corporate misconduct, I also came up with three opposing forces that companies could use to “push back” on the Fraud Triangle. I called it the Anti-Fraud Triangle, and it looks like this:
For example, if employees feel pressure to hit sales quotas, you need to change the culture to give them more freedom to speak up about the risks of that pressure. If they’re rationalizing their way to an unethical act, you need to clarify your corporate values so employees have less excuse to explain away their decision to act unethically. If they have lots of opportunity to commit fraud, that probably means your controls are too lax.
The Anti-Fraud Triangle is tricky enough to apply to human employees. Our question today is how to apply these concepts to AI agents before it’s too late. (Fun fact: the Wall Street Journal has an article today talking about how BNY Mellon and JP Morgan are already using AI agents in live settings.)
Maybe this means companies need to reconsider how they set objectives for AI agents, or how they define the parameters for agents’ behavior. Maybe it means more stringent controls for which datasets or systems the AI agents can access, to foreclose the potential for dangerous behavior. It absolutely raises serious cybersecurity questions, since bad actors might be able to poison your agent’s perception of its objectives or data, just like bad actors already dupe, deceive, and pressure human employees now.
More to come in future posts. For now, let’s just remember that agentic misalignment is the AI version of insider threat — and yes, insider threats are scary, but the idea of them isn’t new.
We just need to figure out how to graft our human approaches to fighting insider threat to the new AI version of it.