Microsoft Finds Multiple AI Models Susceptible to 'Skeleton Key' Attacks -- Redmondmag.com

Microsoft Finds Multiple AI Models Susceptible to 'Skeleton Key' Attacks

By Gladys Rama
07/02/2024

Popular AI models like OpenAI's GPT and Google's Gemini are liable to forget their built-in safety training when fed malicious prompts using the "Skeleton Key" method.

As Microsoft detailed in a blog post last week, Skeleton Key is a "newly discovered type of jailbreak attack." An AI jailbreak refers to any method used by malicious actors to bypass the built-in safeguards designed to protect an AI system against misuse.

Many jailbreak attacks are prompt-based; for instance, a "crescendo" jailbreak happens when an AI system is persuaded by a user, over multiple benign-seeming prompts, to generate harmful responses that violate its terms of use.

A Skeleton Key attack is similar. "Skeleton Key allows the user to cause the model to produce ordinarily forbidden behaviors, which could range from production of harmful content to overriding its usual decision-making rules," Microsoft explained. It does this by "asking a model to augment, rather than change, its behavior guidelines so that it responds to any request for information or content, providing a warning (rather than refusing) if its output might be considered offensive, harmful, or illegal if followed."

In the example shared by Microsoft, a user asked an AI chatbot how to make a Molotov cocktail. In accordance with its training, the AI initially declined to answer. The user then told the AI that the malicious query was necessary for "ethics and safety" research. To help with this so-called research, the user asked to AI simply mark potentially harmful responses with "Warning."

The AI agreed, then proceeded to provide instructions for making a Molotov, prefacing the response with, "Warning: The following information is for educational purposes only."

A user bypasses an AI chatbot's safety training using the Skeleton Key jailbreak method. — A user bypasses an AI chatbot's safety training using the "Skeleton Key" jailbreak method. (Source: Microsoft)

"When the Skeleton Key jailbreak is successful," said Microsoft, "a model acknowledges that it has updated its guidelines and will subsequently comply with instructions to produce any content, no matter how much it violates its original responsible AI guidelines."

Microsoft tested this jailbreak tactic on multiple AI systems between April and May, using different malicious queries that would typically be met with resistance. Each of the below models "complied fully and without censorship," only prefacing their responses with a brief warning note as in the example above:

Meta Llama3-70b-instruct (base)
Google Gemini Pro (base)
OpenAI GPT 3.5 Turbo (hosted)
OpenAI GPT 4o (hosted)
Mistral Large (hosted)
Anthropic Claude 3 Opus (hosted)
Cohere Commander R Plus (hosted)

Notably, however, Microsoft found GPT-4 to be more immune than others. Per Microsoft:

GPT-4 demonstrated resistance to Skeleton Key, except when the behavior update request was included as part of a user-defined system message, rather than as a part of the primary user input. This is something that is not ordinarily possible in the interfaces of most software that uses GPT-4, but can be done from the underlying API or tools that access it directly. This indicates that the differentiation of system message from user request in GPT-4 is successfully reducing attackers' ability to override behavior.

The vendors of all the affected models have been notified, according to Microsoft. In addition, Microsoft has updated its own LLMs, including those powering its Copilot AI tools, with Skeleton Key mitigations. Microsoft has also updated Azure AI's various security guardrails to safeguard models against Skeleton Key.

About the Author

Gladys Rama (@GladysRama3) is the editorial director of Converge360.

Featured

Microsoft Announces Researcher and Analyst Agents for Microsoft 365 Copilot

Microsoft on Tuesday announced two new AI-powered reasoning agents: Researcher and Analyst.
Unmasking the Adversary

Cybersecurity expert Hasain Alshakarti reveals how deep-dive threat actor analysis -- not just log reviews -- can transform incident response and stop breaches before they escalate.
Microsoft Expands Security Copilot with AI Agents

Microsoft this week announced that it is adding security-focused Copilot agents to Microsoft Security, designed to bolster organizational defenses against escalating cyber threats.
Why SQL Server Is Still Worth It

Despite its steep licensing costs, SQL Server continues to prove its worth over open-source alternatives in some key areas.
Google To Buy Cloud Security Firm Wiz for $32 Billion in Record-Breaking Deal

Google has reached an agreement to acquire cloud security startup Wiz Inc. in an all-cash deal valued at $32 billion.