# Resilience
Circuit breakers, retry policies, and rate limiting
The SDK provides built-in resilience patterns for handling LLM provider failures, rate limits, and transient errors.
## Circuit Breaker
The circuit breaker prevents cascading failures by stopping requests to a failing provider:
```go
import (
	"time"

	sdk "github.com/xraph/ai-sdk"
)

cb := sdk.NewCircuitBreaker(sdk.CircuitBreakerConfig{
	Name:         "openai",
	MaxFailures:  5,                // Open after 5 consecutive failures
	ResetTimeout: 60 * time.Second, // Try again after 60s
	HalfOpenMax:  2,                // 2 successful calls to close
}, logger, metrics)
```

### States
| State | Behavior |
|---|---|
| Closed | Requests pass through normally |
| Open | Requests are rejected immediately |
| Half-Open | Limited requests are allowed to test recovery |
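The transitions between these states follow the standard circuit breaker pattern. The sketch below is an illustrative reduction of that pattern, not the SDK's implementation; it exists only to make the table concrete:

```go
import "time"

// Illustrative circuit breaker state machine -- not the SDK's implementation.
type breaker struct {
	state        string // "closed", "open", or "half-open"
	failures     int    // consecutive failures while closed
	successes    int    // consecutive successes while half-open
	openedAt     time.Time
	maxFailures  int
	resetTimeout time.Duration
	halfOpenMax  int
}

// allow reports whether a request may proceed, moving an expired
// open breaker into half-open so a few probe requests can go through.
func (b *breaker) allow() bool {
	if b.state == "open" && time.Since(b.openedAt) >= b.resetTimeout {
		b.state, b.successes = "half-open", 0
	}
	return b.state != "open"
}

// record updates the state from the outcome of an allowed request.
func (b *breaker) record(err error) {
	switch {
	case err != nil:
		b.failures++
		if b.state == "half-open" || b.failures >= b.maxFailures {
			b.state, b.openedAt = "open", time.Now() // trip the breaker
		}
	case b.state == "half-open":
		b.successes++
		if b.successes >= b.halfOpenMax {
			b.state, b.failures = "closed", 0 // provider has recovered
		}
	default:
		b.failures = 0 // a success while closed resets the streak
	}
}
```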
### Usage
```go
err := cb.Execute(ctx, func(ctx context.Context) error {
	_, err := llmManager.Chat(ctx, request)
	return err
})
if err != nil {
	// Could be a circuit breaker rejection or an actual error
	fmt.Printf("Error: %v\n", err)
}
```
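Whether a breaker rejection is distinguishable from a provider error depends on the SDK surface; the sentinel name `sdk.ErrCircuitOpen` below is an assumption (check the package for the actual value), but the `errors.Is` pattern is the usual way to tell the two apart:

```go
// Assumed sentinel: replace sdk.ErrCircuitOpen with the SDK's actual error value.
if errors.Is(err, sdk.ErrCircuitOpen) {
	// Rejected by the breaker; the provider was never called
	useFallbackProvider(ctx, request) // hypothetical fallback helper
} else if err != nil {
	// A genuine failure from the provider call
	log.Printf("chat failed: %v", err)
}
```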
### Checking State

```go
state := cb.State()
if state == sdk.CircuitStateOpen {
	fmt.Println("Circuit is open -- provider is unhealthy")
}
```
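A practical use of the state check is routing around an unhealthy provider before spending a request on it. A minimal sketch, assuming one breaker per provider; the `breakers` map and the provider-selection loop are illustrative, not SDK API:

```go
// Illustrative: prefer the first provider whose breaker is not open.
breakers := map[string]*sdk.CircuitBreaker{
	"openai":    cbOpenAI,
	"anthropic": cbAnthropic,
}
provider := "openai" // default
for _, name := range []string{"openai", "anthropic"} {
	if breakers[name].State() != sdk.CircuitStateOpen {
		provider = name
		break
	}
}
// Use `provider` when building the request (field name depends on the SDK).
```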
## Retry Policy

Configure retry behavior with exponential backoff:
```go
type RetryConfig struct {
	MaxRetries int
	Delay      time.Duration
	MaxDelay   time.Duration // Cap for exponential backoff
	Multiplier float64       // Backoff multiplier (default: 2.0)
}
```
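These fields imply the usual exponential backoff schedule: retry n waits `Delay * Multiplier^(n-1)`, capped at `MaxDelay` (with a 1s delay and the default multiplier: 1s, 2s, 4s, ...). A minimal sketch of that loop, illustrative rather than the SDK's internal retry code:

```go
// Illustrative retry loop matching the RetryConfig semantics above.
// Assumes cfg.Multiplier is set (the SDK defaults it to 2.0).
func withRetries(ctx context.Context, cfg sdk.RetryConfig, fn func(context.Context) error) error {
	delay := cfg.Delay
	var err error
	for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if attempt == cfg.MaxRetries {
			break // out of retries; return the last error
		}
		select {
		case <-time.After(delay): // back off before the next attempt
		case <-ctx.Done():
			return ctx.Err()
		}
		delay = time.Duration(float64(delay) * cfg.Multiplier)
		if cfg.MaxDelay > 0 && delay > cfg.MaxDelay {
			delay = cfg.MaxDelay // cap exponential growth
		}
	}
	return err
}
```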
Used in tool definitions, workflow nodes, and the LLM manager:

```go
// On a tool
tool := &sdk.ToolDefinition{
	Name:    "api_call",
	Handler: handler,
	RetryConfig: &sdk.RetryConfig{
		MaxRetries: 3,
		Delay:      time.Second,
		MaxDelay:   30 * time.Second,
	},
}

// On the LLM manager
manager, _ := llm.NewLLMManager(llm.LLMManagerConfig{
	MaxRetries: 3,
	RetryDelay: time.Second,
})
```
## Rate Limiting

The LLM manager supports per-provider rate limiting:
```go
manager, _ := llm.NewLLMManager(llm.LLMManagerConfig{
	DefaultProvider: "openai",
	RateLimit: sdk.RateLimitConfig{
		RequestsPerMinute: 60,
		TokensPerMinute:   90000,
	},
})
```

When the rate limit is reached, requests are queued and delayed rather than rejected.
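That queue-and-delay behavior is what a token-bucket limiter provides. A sketch of the same idea using the standard `golang.org/x/time/rate` package -- illustrative only, since the SDK manages its limiter internally:

```go
import (
	"time"

	"golang.org/x/time/rate"
)

// 60 requests per minute = one token per second; burst of 1.
limiter := rate.NewLimiter(rate.Every(time.Second), 1)

// Wait blocks until a token is available (or ctx expires),
// so callers are delayed rather than rejected.
if err := limiter.Wait(ctx); err != nil {
	return err // cancelled or deadline exceeded while queued
}
_, err := manager.Chat(ctx, request)
```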
## Timeouts
Set timeouts at multiple levels:
```go
// Per-request timeout via context
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()

// On the LLM manager
manager, _ := llm.NewLLMManager(llm.LLMManagerConfig{
	RequestTimeout: 30 * time.Second,
})

// On a tool
tool := &sdk.ToolDefinition{
	Timeout: 5 * time.Minute,
	// ...
}

// On a workflow node
node := &sdk.WorkflowNode{
	Timeout: 2 * time.Minute,
	// ...
}
```
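Whichever layer fires first wins: the effective deadline is the tightest of the context deadline and the configured timeouts. Context-driven timeouts surface as `context.DeadlineExceeded`, so (assuming the SDK propagates or wraps the context error rather than replacing it) you can detect them with `errors.Is`:

```go
_, err := manager.Chat(ctx, request)
if errors.Is(err, context.DeadlineExceeded) {
	// A deadline fired at some layer; log, fall back, or surface it
	log.Println("request timed out")
}
```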
## Combining Patterns

Use circuit breakers, retries, and rate limiting together:
```go
cb := sdk.NewCircuitBreaker(sdk.CircuitBreakerConfig{
	Name:         "openai",
	MaxFailures:  5,
	ResetTimeout: 60 * time.Second,
}, logger, metrics)

err := cb.Execute(ctx, func(ctx context.Context) error {
	// The LLM manager handles retries and rate limiting internally
	_, err := manager.Chat(ctx, request)
	return err
})
```

The recommended layering, outer to inner (composed explicitly in the sketch after this list):
- Circuit breaker -- fast-fail if the provider is down
- Rate limiter -- prevent exceeding provider limits
- Retry -- handle transient failures
- Timeout -- bound individual request duration
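Composed explicitly, the layering looks like the sketch below. `withRetries` is the illustrative helper from the retry section and `limiter` the token-bucket limiter from the rate limiting section; in practice the LLM manager covers the inner layers for you, as noted above:

```go
// Outer to inner: breaker -> rate limiter -> retry -> per-attempt timeout.
err := cb.Execute(ctx, func(ctx context.Context) error {
	// Rate limiter: queue until the request fits provider limits.
	if err := limiter.Wait(ctx); err != nil {
		return err
	}
	// Retry: absorb transient failures with exponential backoff.
	return withRetries(ctx, sdk.RetryConfig{
		MaxRetries: 3,
		Delay:      time.Second,
		MaxDelay:   30 * time.Second,
		Multiplier: 2.0,
	}, func(ctx context.Context) error {
		// Timeout: bound each individual attempt.
		attemptCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		defer cancel()
		_, err := manager.Chat(attemptCtx, request)
		return err
	})
})
```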