Artificial intelligence firm Anthropic recently published new research identifying a set of potential “sabotage” threats to humanity posed by advanced AI models.
According to the company, the research focused on four specific ways a malicious AI model could trick a human into making a dangerous or harmful decision.
Ultimately, the new research turned out to be a good news/bad news situation. The bad news, per the company’s research, is that modern state-of-the-art large language models, such as OpenAI’s ChatGPT and Anthropic’s Claude 3, demonstrate a clear capacity for sabotage.
Per the paper:
“Sufficiently capable models could subvert human oversight and decision making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment.”
The good news, however, is that Anthropic researchers appear confident that these risks can be mitigated — at least for now.
Sabotage
Anthropic identified four ways an AI model could sabotage human decision-making and then set about testing its models to see whether they could pull it off against unwitting testers.
First, the team prompted a model to attempt to sabotage a human by steering them toward incorrect information. This threat could emerge in situations where humans are relying on an AI system to parse massive amounts of data on their behalf.
Next, the team tested models to see how easy it would be for an AI system designed to help a human developer create code to secretly insert bugs or otherwise sabotage the coder’s efforts.
The final two threats were “sandbagging,” in which an AI pretends not to be capable of doing something in order to convince safety testers it isn’t dangerous, and the potential for AI monitoring systems to purposefully allow harmful content to slip through as part of a malicious plan.
While all of these may sound dangerous, the company assured the public that only “minimal mitigations” were required to protect humanity from the potential for AI sabotage.
Per a company blog post:
“Overall, while our demonstrations showed that there are arguably low-level indications of sabotage abilities in current models, we judge that minimal mitigations are sufficient to address the risks. However, more realistic evaluations and stronger mitigations will likely be necessary as AI capabilities improve.”