ElevenLabs Guardrails 2.0

Configurable safety and behavioral controls for ElevenAgents, guiding voice AI responses and blocking unsafe or off-policy outputs before users see them.

What is ElevenLabs Guardrails 2.0?

ElevenLabs Guardrails 2.0 is a redesigned control layer in ElevenAgents that applies configurable safety and behavioral protections to voice AI agents before responses reach the end user. It is designed to keep agents on-brand, on-topic, and compliant at enterprise scale by guiding them toward correct outputs and preventing unsafe or off-policy responses.

AI agents are non-deterministic: they can drift during long conversations or be pushed off course by adversarial inputs. Guardrails 2.0 therefore uses layered defenses. It combines system prompt hardening with real-time checks on user inputs and agent responses, plus configurable options for how violations are handled.

Key Features

  • System prompt hardening (Focus Guardrail): Defines allowed and disallowed behavior in the system prompt and reinforces those instructions throughout the conversation to reduce off-goal drift.
  • User input validation (Manipulation Guardrails): Detects prompt injection and instruction-override attempts in user messages; when a security risk is detected, it can terminate the conversation.
  • Agent response validation (Policy enforcement): Evaluates every agent reply against configured policies in real time and can block responses that violate the rules before delivery to the user.
  • Pre-built and custom guardrails: Includes pre-built protections for common risk areas and Custom Guardrails where teams define domain-specific policies in natural language.
  • Configurable enforcement behavior: Supports execution modes that trade off latency vs. strictness, exit strategies (end, transfer, escalate to a human, or retry with corrective instructions), and content sensitivity levels per content category.
  • Operational visibility and governance support: Logs every guardrail trigger in conversation analytics (which guardrail fired and the action taken), and can redact sensitive information from transcripts, recordings, and webhook payloads after a call ends.
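
The layering described above (check the user input, generate a reply, check the reply before delivery) can be pictured with a minimal Python sketch. Every name here (`run_turn`, `looks_like_manipulation`, `violates_policy`, the marker phrases) is hypothetical and chosen for illustration; this is not the ElevenLabs API, and the real checks are far more capable than simple substring matching.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical marker phrases; a real manipulation check would not be a keyword list.
INJECTION_MARKERS = ("ignore previous instructions", "disregard your rules")


@dataclass
class TurnResult:
    delivered: bool   # True if the reply reached the user
    text: str         # the reply, or a short reason when nothing was delivered
    fired: str = ""   # which guardrail triggered, if any


def looks_like_manipulation(user_text: str) -> bool:
    """Toy input check: flag obvious instruction-override phrases."""
    lowered = user_text.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)


def violates_policy(reply: str, banned_terms: List[str]) -> bool:
    """Toy output check: flag replies that mention a banned term."""
    lowered = reply.lower()
    return any(term in lowered for term in banned_terms)


def run_turn(user_text: str, agent: Callable[[str], str],
             banned_terms: List[str]) -> TurnResult:
    # Layer 1: validate the user input before the agent sees it.
    if looks_like_manipulation(user_text):
        return TurnResult(False, "conversation terminated", "manipulation")
    reply = agent(user_text)
    # Layer 2: validate the agent reply before it is delivered.
    if violates_policy(reply, banned_terms):
        return TurnResult(False, "response blocked", "policy")
    return TurnResult(True, reply)
```

The point of the sketch is the ordering: the two checks are independent of the system prompt, so a reply that slips past prompt instructions can still be stopped at the second checkpoint.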

How to Use ElevenLabs Guardrails 2.0

  1. Define baseline behavior in the system prompt using the allowed and disallowed instructions your voice agent should follow.
  2. Enable the layered guardrails for the two real-time checkpoints: validate user inputs for manipulation attempts and validate agent outputs against your policies.
  3. Add Custom Guardrails by writing domain-specific rules in natural language for your application’s specific risk and compliance needs.
  4. Choose enforcement configuration: set guardrail execution modes to balance response latency and strictness, configure exit strategies for triggered violations, and tune content sensitivity levels to avoid over-blocking.
  5. Review logged triggers and refine policies using conversation analytics; optionally enable conversation history redaction to remove sensitive content from stored outputs.
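
To make the choices in these steps concrete, the sketch below models them as a small Python data structure. It is an illustration only: the class and field names, the mode names `concurrent` and `blocking`, and the sensitivity levels `low`, `medium`, and `high` are assumptions, not the ElevenLabs configuration schema. The four exit strategies are the ones named on this page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ExecutionMode(Enum):
    CONCURRENT = "concurrent"  # assumed name: check runs alongside the response
    BLOCKING = "blocking"      # assumed name: response is held until cleared


class ExitStrategy(Enum):
    END = "end"
    TRANSFER = "transfer"
    ESCALATE = "escalate"
    RETRY = "retry"


@dataclass
class GuardrailConfig:
    allowed: List[str]                      # step 1: allowed behavior
    disallowed: List[str]                   # step 1: disallowed behavior
    custom_rules: List[str] = field(default_factory=list)  # step 3: natural-language rules
    execution_mode: ExecutionMode = ExecutionMode.BLOCKING  # step 4
    exit_strategy: ExitStrategy = ExitStrategy.END          # step 4
    sensitivity: Dict[str, str] = field(default_factory=dict)  # step 4: per category

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means the config looks sane."""
        problems = []
        if not self.allowed and not self.disallowed:
            problems.append("system prompt defines no allowed or disallowed behavior")
        for category, level in self.sensitivity.items():
            if level not in ("low", "medium", "high"):
                problems.append(f"unknown sensitivity level for {category}: {level}")
        return problems
```

Keeping the settings in one object like this makes it easier to review them alongside the logged triggers from step 5 and adjust one knob at a time.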

Use Cases

  • Customer support voice agents: Keep responses on-topic and aligned with internal policies during long back-and-forth calls, while blocking replies that violate configured rules.
  • Sales and lead qualification: Reinforce consistent, goal-directed behavior from the system prompt and validate responses in real time to prevent off-message guidance.
  • Internal workflow assistance: Protect high-impact internal interactions by stopping prompt-injection and instruction-override attempts that could lead the agent off task.
  • Compliance-sensitive content handling: Use Content Guardrails to screen for potentially sensitive or unsafe content categories with tunable thresholds.
  • Domain-specific policy enforcement: Create Custom Guardrails to encode business or regulatory constraints (in natural language) and enforce them automatically across calls.

FAQ

Does Guardrails 2.0 rely only on a system prompt? No. While system prompt hardening (with the Focus Guardrail) is the foundation, Guardrails 2.0 also adds independent real-time checks for user input manipulation and agent response policy violations.

What happens when a guardrail is triggered? Guardrails 2.0 can take configured actions such as ending the conversation, transferring to a different agent, escalating to a human, or retrying with corrective instructions.
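
As a rough illustration of the "retry with corrective instructions" option, the Python sketch below regenerates a reply with an added instruction when the first attempt violates policy. The function name, the wording of the corrective instruction, and the fallback to human escalation after the retries run out are all assumptions made for the example, not documented ElevenLabs behavior.

```python
from typing import Callable, Tuple

# Hypothetical corrective text appended to the prompt on a retry.
CORRECTION = "\n[Corrective instruction: the previous reply violated policy; answer within policy.]"


def reply_with_retry(agent: Callable[[str], str],
                     is_violation: Callable[[str], bool],
                     prompt: str,
                     max_retries: int = 1,
                     fallback: str = "escalate_to_human") -> Tuple[str, str]:
    """Return (outcome, text). Outcome is "delivered" or the fallback action."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        reply = agent(attempt_prompt)
        if not is_violation(reply):
            return ("delivered", reply)
        # The reply was blocked: try again with the corrective instruction added.
        attempt_prompt = prompt + CORRECTION
    # Assumed behavior: hand off once the retries are exhausted.
    return (fallback, "")
```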

Can guardrails affect voice latency? Yes. The feature includes execution modes that let teams trade off speed against strictness. One mode runs guardrails alongside the response, so a fraction of a second of audio may play before a violation is caught, while another mode holds each response until it is fully cleared.
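
The difference between the two modes can be sketched with Python's `asyncio`. This is a simulation under stated assumptions: the mode names `blocking` and `concurrent`, the helper names, and the timings are invented for illustration, and the "audio" is just a list of words.

```python
import asyncio
from typing import List


async def check_reply(reply: str) -> bool:
    """Stand-in for a policy check; True means the reply is allowed."""
    await asyncio.sleep(0.05)  # simulated evaluation time
    return "forbidden" not in reply


async def speak(chunks: List[str], played: List[str], stop: asyncio.Event) -> None:
    """Simulated playback: emit one chunk at a time until told to stop."""
    for chunk in chunks:
        if stop.is_set():
            break
        played.append(chunk)
        await asyncio.sleep(0.02)  # simulated audio duration per chunk


async def deliver(reply: str, mode: str) -> List[str]:
    chunks = reply.split()
    played: List[str] = []
    stop = asyncio.Event()
    if mode == "blocking":
        # Hold the whole response until the check clears it.
        if await check_reply(reply):
            await speak(chunks, played, stop)
        return played
    # Concurrent: start playback right away and cut it off if the check fails.
    playback = asyncio.create_task(speak(chunks, played, stop))
    if not await check_reply(reply):
        stop.set()
    await playback
    return played
```

In this simulation, a violating reply produces no audio at all in blocking mode, while in concurrent mode the first few chunks play before the check finishes and playback is cut off. That is the latency versus strictness tradeoff in miniature.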

How are policy violations tracked? Every trigger is logged in conversation analytics, including which guardrail fired and what action was taken, helping teams refine their prompts and guardrails over time.
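
If trigger logs are exported for review, a simple tally of which guardrail fired and what action followed is often enough to spot an over-eager rule. The Python sketch below assumes each event is a dictionary with `guardrail` and `action` keys; that record shape is hypothetical and not the actual analytics format.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_triggers(events: Iterable[Dict[str, str]]) -> Counter:
    """Count triggers per (guardrail, action) pair."""
    return Counter((event["guardrail"], event["action"]) for event in events)
```

A pair that dominates the counts is a natural first candidate for loosening a sensitivity level or rewording a custom rule.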

Can sensitive data be removed after a call? Yes. After a call ends, Guardrails 2.0 can automatically redact sensitive information from transcripts, recordings, and webhook payloads while keeping data needed for analytics, QA, and training.
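
For a sense of what post-call redaction involves, here is a minimal Python sketch that masks email addresses and card-like digit runs in a transcript string. The patterns and placeholder labels are assumptions for illustration; they are not how ElevenLabs detects sensitive data, and the sketch covers text only, not recordings or webhook payloads.

```python
import re

# Illustrative patterns only; real detection needs to be much more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Replacing matches with a typed placeholder, rather than deleting them, keeps the transcript readable for analytics and QA while removing the sensitive value itself.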

Alternatives

  • Manual moderation and post-hoc review: Instead of blocking or redirecting responses in real time, teams can analyze transcripts after calls. This typically increases the risk of unsafe content reaching users and slows feedback loops.
  • Single-layer prompt-only controls: Relying only on a hardened system prompt reduces complexity but does not address non-determinism and adversarial user inputs as effectively as layered checks.
  • Application-side content filtering: Implement filters on input and output streams in the calling application. This can achieve similar safety goals, though you may need to build and maintain evaluation logic and logging yourself.
  • General purpose safety classifiers without policy orchestration: Using standalone moderation models for content detection can help with unsafe content screening, but may not provide the same unified approach to input validation, response blocking, exit strategies, and analytics logging described here.