Definition
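Built-in safety mechanisms and behavioral constraints designed to prevent an AI model from producing harmful, biased, or policy-violating outputs. Guardrails typically include prompt filtering (screening inputs before they reach the model), output validation (checking responses before they reach the user), and behavioral boundaries (limits on what the model will do).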
Why it matters
No guardrail is unbreakable. Research has shown mathematically that no finite set of guardrails is universally robust against adversarial attacks. Guardrails must therefore be layered and continuously updated, not treated as a one-time fix.
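The layering described above can be illustrated with a minimal sketch, assuming a prompt filter and an output validator wrapped around a model call. All names here (`guarded_generate`, `prompt_filter`, `output_validator`, the pattern lists, and the refusal message) are hypothetical and chosen for illustration. The regular expressions are toy placeholders, not a recommended rule set; real systems typically combine classifiers, policy checks, and human review.

```python
import re
from typing import Callable

# Hypothetical, deliberately simplistic patterns (assumptions for illustration only).
BLOCKED_PROMPT_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]
BLOCKED_OUTPUT_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # looks like a US Social Security number
]

REFUSAL = "Sorry, I can't help with that."


def prompt_filter(prompt: str) -> bool:
    """Layer 1: return True if the prompt passes the input checks."""
    return not any(p.search(prompt) for p in BLOCKED_PROMPT_PATTERNS)


def output_validator(text: str) -> bool:
    """Layer 2: return True if the model output passes the output checks."""
    return not any(p.search(text) for p in BLOCKED_OUTPUT_PATTERNS)


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Run the model only if the prompt passes, and return its output
    only if that output also passes. Either layer can trigger a refusal."""
    if not prompt_filter(prompt):
        return REFUSAL
    output = model(prompt)
    if not output_validator(output):
        return REFUSAL
    return output
```

Because each layer is an independent check, a prompt that slips past the input filter can still be caught at the output stage, and either pattern list can be updated without touching the other. This does not make the pipeline robust; it only shows why layered, updatable checks are easier to maintain than a single fixed rule.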