Guardrails

Native PII detection, toxicity filtering, and prompt injection prevention

The GuardrailManager provides zero-cost safety guardrails implemented natively in Go. No external API calls are needed -- all detection runs locally with regex-based pattern matching and configurable rules.

Creating a Guardrail Manager

import sdk "github.com/xraph/ai-sdk"

gm := sdk.NewGuardrailManager(logger, metrics, &sdk.GuardrailOptions{
    EnablePII:             true,
    EnableToxicity:        true,
    EnablePromptInjection: true,
    EnableContentFilter:   true,
    MaxInputLength:        100000,
    MaxOutputLength:       50000,
})

Validating Input

violations, err := gm.ValidateInput(ctx, userInput)
if err != nil {
    return err
}

if len(violations) > 0 {
    for _, v := range violations {
        fmt.Printf("[%s] %s: %s\n", v.Severity, v.Type, v.Description)
    }
    return fmt.Errorf("input validation failed: %d violations", len(violations))
}

Validating Output

violations, err := gm.ValidateOutput(ctx, llmResponse)
if err != nil {
    return err
}

PII Redaction

Redact personally identifiable information from text:

redacted := gm.RedactPII("Contact john@example.com or call 555-123-4567")
// Output: "Contact [EMAIL] or call [PHONE]"

Built-in Detectors

PII Detection

Detects and optionally redacts:

  • Email addresses
  • Phone numbers
  • Social Security numbers
  • Credit card numbers
  • IP addresses
  • Custom patterns

Toxicity Detection

Flags toxic language using a configurable word list. Severity levels are assigned based on the type of toxic content detected.

Prompt Injection Detection

Detects common prompt injection patterns:

  • "Ignore previous instructions"
  • "You are now..."
  • Role-play hijacking attempts
  • System prompt extraction attempts

Content Filtering

Blocks content matching configurable blocked patterns.

GuardrailViolation

type GuardrailViolation struct {
    Type        string  // "pii", "toxicity", "injection", "content_filter", "length"
    Description string  // Human-readable description
    Severity    string  // "low", "medium", "high", "critical"
    Location    string  // "input" or "output"
    Offending   string  // The offending text (may be redacted)
}

Custom Patterns

Add custom patterns for your domain:

gm := sdk.NewGuardrailManager(logger, metrics, &sdk.GuardrailOptions{
    EnablePII:             true,
    CustomPIIPatterns:     []string{`ACME-\d{6}`},          // Custom ID format
    CustomToxicWords:      []string{"proprietary_term"},     // Domain-specific
    CustomBlockedPatterns: []string{`confidential:.*`},      // Block confidential data
})

Custom Validators

Add arbitrary validation functions:

gm.AddCustomValidator(func(ctx context.Context, text string) error {
    if len(text) < 10 {
        return fmt.Errorf("input too short")
    }
    return nil
})

Using with Agents

agent, _ := sdk.NewReactAgentBuilder("safe-agent").
    WithLLMManager(llmManager).
    WithGuardrails(gm).
    WithTools(tools...).
    Build()

When guardrails are attached to an agent:

  • User input is validated before processing
  • LLM output is validated before returning to the user
  • Tool inputs/outputs can be validated at each step

Configuration Reference

OptionDefaultDescription
EnablePIItrueDetect PII in input/output
EnableToxicitytrueDetect toxic language
EnablePromptInjectiontrueDetect prompt injection attempts
EnableContentFiltertrueApply content filtering
MaxInputLength100000Maximum input character length
MaxOutputLength50000Maximum output character length

How is this guide?

On this page