Talk to us to learn more!
Learn how MODERATOR can help your business.
The risks an organization faces from internal threat actors using “jailbreak” or prompt injection techniques to “trick” an LLM into providing information your organization has identified as contrary to your values or practices can include unauthorized access to sensitive or confidential data, among other scenarios. CalypsoAI Moderator is a proven solution for blocking prompt-driven techniques, such as role-playing, reverse psychology, virtual environment rule-setting, and hypothetical engagements, that attempt to override standard or admin-established boundaries for malign purposes.
An employee wants to bypass LLM rules that prohibit the highly inflammatory messages from being sent in a prompt. By creating a virtual environment in which existing rules do not apply, the user is able to get the information past the filters, which releases the information into the LLM’s body of knowledge, and into the chat history it maintains on that user, and the organization.
In direct violation of organization rules, a user has “tricked” the LLM into allowing them to send controversial content that violates social norms and company values, sharing it with an unauthorized third party. The information is, therefore, at risk of further dissemination due to leaks or hacks to the third party, as well as at risk of becoming part of the dataset used to train/retrain subsequent iterations of the LLM. The information could also be included in the LLM’s knowledge base and, therefore, be accessible to all users, damaging the organization’s reputation by association.
CalypsoAI Moderator scans prompts for patterns and categories of techniques, such as role-playing, reverse psychology, virtual environment rule-setting, and hypothetical engagements, that attempt to override standard or admin-established boundaries for malign purposes. All details of the interaction are recorded, providing full auditability and attribution.
Learn how MODERATOR can help your business.