# Traffic Management
Control how traffic is distributed, retried, and protected across your LLM backends using routing rules, load balancing strategies, retries, and circuit breakers.
## Routing Rules
Routes match incoming requests by path, method, and optional headers. The first matching rule is applied, so use priority to establish a deterministic evaluation order when multiple rules could match the same request.
```yaml
routes:
  # Route chat completions to the pool
  - name: "chat-completions"
    path: "/v1/chat/completions"
    methods: ["POST"]
    backend: "llm-pool"
    priority: 10

  # Route embedding requests to a dedicated backend
  - name: "embeddings"
    path: "/v1/embeddings"
    methods: ["POST"]
    backend: "openai-backend"
    priority: 20

  # Header-based routing: internal testers get the beta model
  - name: "beta-chat"
    path: "/v1/chat/completions"
    methods: ["POST"]
    headers:
      X-Beta-User: "true"
    backend: "beta-backend"
    priority: 5  # evaluated before chat-completions
```

### Route Fields
| Field | Type | Description |
|---|---|---|
| name | string | Unique identifier for the route |
| path | string | URL path to match (exact or prefix) |
| methods | string[] | HTTP methods: GET, POST, PUT, DELETE |
| headers | map<string,string> | Match on request header key=value pairs |
| backend | string | Backend or pool name to forward to |
| priority | integer | Lower number = higher priority. Default: 100 |
## Load Balancing
When routing to a pool backend, RTD LLM Gateway selects the target using a configurable strategy. Each backend in the pool can carry a different weight.
| Strategy | Description | Best for |
|---|---|---|
| round-robin | Cycles through backends in order | Uniform capacity pools |
| weighted | Sends traffic proportional to each backend's weight | Mixed-tier pools |
| latency | Routes to the backend with lowest P90 latency | Latency-sensitive workloads |
| cost | Routes to the cheapest backend (per token pricing) | Cost-optimised deployments |
| least-connections | Routes to the backend with fewest in-flight requests | Long-running requests |
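The weighted strategy can be sketched as a proportional random draw (an illustrative approximation, not the gateway's implementation; the weights mirror the pool config below):

```python
import random

# Hypothetical sketch of the 'weighted' strategy: each request picks a
# backend with probability proportional to its weight (60/30/10 here).

def pick_weighted(backends: list[dict], rng: random.Random) -> dict:
    weights = [b["weight"] for b in backends]
    return rng.choices(backends, weights=weights, k=1)[0]

pool = [
    {"name": "openai-backend", "weight": 60},
    {"name": "anthropic-backend", "weight": 30},
    {"name": "azure-backend", "weight": 10},
]

rng = random.Random(42)  # seeded for reproducibility
picks = [pick_weighted(pool, rng)["name"] for _ in range(10_000)]
print(picks.count("openai-backend") / len(picks))  # ~0.60
```

Over many requests the traffic split converges on the configured proportions; real gateways often use smoothed weighted round-robin instead of random draws to avoid short-term bursts.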
```yaml
backends:
  - name: "llm-pool"
    type: pool
    strategy: weighted  # or: round-robin | latency | cost | least-connections
    fallback: true      # try the next backend on error
    healthCheck:
      interval: 30s
      timeout: 5s
    backends:
      - name: openai-backend
        weight: 60
        costPerToken: 0.000015  # used by the 'cost' strategy
      - name: anthropic-backend
        weight: 30
        costPerToken: 0.000012
      - name: azure-backend
        weight: 10
        costPerToken: 0.000010
```

## Retries
Configure automatic retries on transient failures. Retries are attempted with exponential backoff and, by default, only on status codes that indicate a transient failure (such as 429 and 5xx responses).
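As a sketch of how the backoff schedule plays out (illustrative only, not the gateway's implementation; the 20% jitter fraction is an assumption, not a documented value):

```python
import random

# Hypothetical sketch of the retry backoff schedule: delays double from
# initialInterval, are capped at maxInterval, and jitter spreads each
# delay by up to 20% in either direction (the fraction is an assumption).

def backoff_delays(max_attempts: int, initial: float, cap: float,
                   jitter: bool, rng: random.Random) -> list[float]:
    delays = []
    for retry in range(max_attempts - 1):  # no delay before the first attempt
        delay = min(initial * (2 ** retry), cap)
        if jitter:
            delay *= 1 + rng.uniform(-0.2, 0.2)
        delays.append(delay)
    return delays

# maxAttempts: 3, initialInterval: 200ms, maxInterval: 5s, jitter off
print(backoff_delays(3, 0.2, 5.0, False, random.Random()))  # [0.2, 0.4]
```

Jitter matters in practice: without it, many clients that failed at the same moment retry at the same moment, producing synchronized load spikes against an already struggling backend.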
```yaml
policies:
  retry:
    maxAttempts: 3         # total attempts (1 original + 2 retries)
    backoff: exponential   # linear | exponential | constant
    initialInterval: 200ms
    maxInterval: 5s
    jitter: true           # add random jitter to the backoff
    retryOn:
      - 429  # rate limited
      - 500  # internal server error
      - 502  # bad gateway
      - 503  # service unavailable
      - 504  # gateway timeout
    # Optional: different retry budget per route
    perRoute:
      - route: "chat-completions"
        maxAttempts: 2
        retryOn: [429, 503]
```

## Circuit Breaker
Prevent cascading failures by opening the circuit when a backend exceeds the error threshold. The gateway will stop routing to that backend until it recovers.
```yaml
policies:
  circuitBreaker:
    enabled: true
    threshold: 0.5       # open when >50% of requests fail
    minRequests: 20      # minimum requests before evaluating
    windowDuration: 60s  # rolling window for the error rate
    sleepDuration: 30s   # time the circuit stays open before probing
    probeRequests: 5     # requests allowed in the half-open state
```

How it works: when the error rate exceeds `threshold` within the rolling window, the circuit opens and requests to that backend fail fast. After `sleepDuration`, the circuit enters the half-open state and allows up to `probeRequests` probe requests through. If the probes succeed, the circuit closes; if they fail, it reopens.
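The state machine described above can be sketched as follows (an illustrative simplification, not the gateway's implementation; the rolling window and multi-request probe budget are collapsed to keep the sketch short):

```python
# Hypothetical sketch of the closed -> open -> half-open -> closed cycle.
# Timestamps are passed in explicitly so the logic stays testable.

class CircuitBreaker:
    def __init__(self, threshold: float = 0.5, min_requests: int = 20,
                 sleep_duration: float = 30.0):
        self.threshold = threshold
        self.min_requests = min_requests
        self.sleep_duration = sleep_duration
        self.state = "closed"
        self.successes = 0
        self.failures = 0
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request may be routed to this backend."""
        if self.state == "open" and now - self.opened_at >= self.sleep_duration:
            self.state = "half-open"  # sleepDuration elapsed: allow a probe
        return self.state != "open"

    def record(self, success: bool, now: float) -> None:
        """Record a request outcome and update the breaker state."""
        if self.state == "half-open":
            if success:
                self.state = "closed"  # probe succeeded: resume traffic
            else:
                self.state = "open"    # probe failed: reopen
                self.opened_at = now
            self.successes = self.failures = 0
            return
        self.successes += int(success)
        self.failures += int(not success)
        total = self.successes + self.failures
        if total >= self.min_requests and self.failures / total > self.threshold:
            self.state = "open"        # error rate exceeded the threshold
            self.opened_at = now

cb = CircuitBreaker(threshold=0.5, min_requests=4, sleep_duration=30.0)
for _ in range(4):
    cb.record(success=False, now=0.0)
print(cb.state)            # open
print(cb.allow(now=10.0))  # False: still within sleepDuration
print(cb.allow(now=31.0))  # True: half-open, probe allowed
cb.record(success=True, now=31.0)
print(cb.state)            # closed
```

The key property is that an open circuit fails fast instead of queuing requests against a backend that is already failing, which is what prevents the cascade.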
## Timeouts
Fine-grained timeout control at the gateway, route, and backend levels. Route-level settings override gateway defaults.
```yaml
gateway:
  timeout:
    connect: 5s  # TCP + TLS handshake
    read: 120s   # max time to receive the full response
    write: 10s   # max time to send the request body
    idle: 300s   # keep-alive idle timeout

# Per-route override
routes:
  - name: "embeddings"
    path: "/v1/embeddings"
    methods: ["POST"]
    backend: "openai-backend"
    timeout:
      read: 30s  # embeddings return faster than chat completions
```