# Traffic Management
Control how traffic is distributed, retried, and protected across your LLM backends using routing rules, load balancing strategies, retries, and circuit breakers.
## Routing Rules
Routes match incoming requests by path, method, and optional headers. The first matching rule is applied, so use priority to establish a deterministic evaluation order when multiple rules could match the same request.
```yaml
routes:
  # Route chat completions to the pool
  - name: "chat-completions"
    path: "/v1/chat/completions"
    methods: ["POST"]
    backend: "llm-pool"
    priority: 10

  # Route embedding requests to a dedicated backend
  - name: "embeddings"
    path: "/v1/embeddings"
    methods: ["POST"]
    backend: "openai-backend"
    priority: 20

  # Header-based routing: internal testers get the beta model
  - name: "beta-chat"
    path: "/v1/chat/completions"
    methods: ["POST"]
    headers:
      X-Beta-User: "true"
    backend: "beta-backend"
    priority: 5  # evaluated before chat-completions
```

### Route Fields
| Field | Type | Description |
|---|---|---|
| name | string | Unique identifier for the route |
| path | string | URL path to match (exact or prefix) |
| methods | string[] | HTTP methods: GET, POST, PUT, DELETE |
| headers | map<string,string> | Match on request header key=value pairs |
| backend | string | Backend or pool name to forward to |
| priority | integer | Lower number = higher priority. Default: 100 |
## Load Balancing
When routing to a pool backend, RTD LLM Gateway selects the target using a configurable strategy. Each backend in the pool can carry a different weight.
| Strategy | Description | Best for |
|---|---|---|
| round-robin | Cycles through backends in order | Uniform capacity pools |
| weighted | Sends traffic proportional to each backend's weight | Mixed-tier pools |
| latency | Routes to the backend with lowest P90 latency | Latency-sensitive workloads |
| cost | Routes to the cheapest backend (per token pricing) | Cost-optimised deployments |
| least-connections | Routes to the backend with fewest in-flight requests | Long-running requests |
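The weighted strategy can be sketched as a proportional random draw (an illustrative approximation, not the gateway's implementation; the weights mirror the pool config below):

```python
import random

# Hypothetical sketch of the 'weighted' strategy: each request picks a
# backend with probability proportional to its weight (60/30/10 here).

def pick_weighted(backends: list[dict], rng: random.Random) -> dict:
    weights = [b["weight"] for b in backends]
    return rng.choices(backends, weights=weights, k=1)[0]

pool = [
    {"name": "openai-backend", "weight": 60},
    {"name": "anthropic-backend", "weight": 30},
    {"name": "azure-backend", "weight": 10},
]

rng = random.Random(42)  # seeded for reproducibility
picks = [pick_weighted(pool, rng)["name"] for _ in range(10_000)]
print(picks.count("openai-backend") / len(picks))  # ~0.60
```

Over many requests the traffic split converges on the configured proportions; real gateways often use smoothed weighted round-robin instead of random draws to avoid short-term bursts.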
```yaml
backends:
  - name: "llm-pool"
    type: pool
    strategy: weighted  # or: round-robin | latency | cost | least-connections
    fallback: true      # try the next backend on error
    healthCheck:
      interval: 30s
      timeout: 5s
    backends:
      - name: openai-backend
        weight: 60
        costPerToken: 0.000015  # used by the 'cost' strategy
      - name: anthropic-backend
        weight: 30
        costPerToken: 0.000012
      - name: azure-backend
        weight: 10
        costPerToken: 0.000010
```

## Retries
Configure automatic retries on transient failures. Retries are attempted with exponential backoff and, by default, only on status codes that indicate a transient failure (such as 429 and 5xx responses).
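As a sketch of how the backoff schedule plays out (illustrative only, not the gateway's implementation; the 20% jitter fraction is an assumption, not a documented value):

```python
import random

# Hypothetical sketch of the retry backoff schedule: delays double from
# initialInterval, are capped at maxInterval, and jitter spreads each
# delay by up to 20% in either direction (the fraction is an assumption).

def backoff_delays(max_attempts: int, initial: float, cap: float,
                   jitter: bool, rng: random.Random) -> list[float]:
    delays = []
    for retry in range(max_attempts - 1):  # no delay before the first attempt
        delay = min(initial * (2 ** retry), cap)
        if jitter:
            delay *= 1 + rng.uniform(-0.2, 0.2)
        delays.append(delay)
    return delays

# maxAttempts: 3, initialInterval: 200ms, maxInterval: 5s, jitter off
print(backoff_delays(3, 0.2, 5.0, False, random.Random()))  # [0.2, 0.4]
```

Jitter matters in practice: without it, many clients that failed at the same moment retry at the same moment, producing synchronized load spikes against an already struggling backend.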
```yaml
policies:
  retry:
    maxAttempts: 3         # total attempts (1 original + 2 retries)
    backoff: exponential   # linear | exponential | constant
    initialInterval: 200ms
    maxInterval: 5s
    jitter: true           # add random jitter to the backoff
    retryOn:
      - 429  # rate limited
      - 500  # internal server error
      - 502  # bad gateway
      - 503  # service unavailable
      - 504  # gateway timeout
    # Optional: different retry budget per route
    perRoute:
      - route: "chat-completions"
        maxAttempts: 2
        retryOn: [429, 503]
```

## Circuit Breaker
Prevent cascading failures by opening the circuit when a backend exceeds the error threshold. The gateway will stop routing to that backend until it recovers.
```yaml
policies:
  circuitBreaker:
    enabled: true
    threshold: 0.5       # open when >50% of requests fail
    minRequests: 20      # minimum requests before evaluating
    windowDuration: 60s  # rolling window for the error rate
    sleepDuration: 30s   # time the circuit stays open before probing
    probeRequests: 5     # requests allowed in the half-open state
```

How it works: when the error rate exceeds `threshold` within the rolling window, the circuit opens and requests to that backend fail fast. After `sleepDuration`, the circuit enters the half-open state and allows up to `probeRequests` probe requests through. If the probes succeed, the circuit closes; if they fail, it reopens.
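The state machine described above can be sketched as follows (an illustrative simplification, not the gateway's implementation; the rolling window and multi-request probe budget are collapsed to keep the sketch short):

```python
# Hypothetical sketch of the closed -> open -> half-open -> closed cycle.
# Timestamps are passed in explicitly so the logic stays testable.

class CircuitBreaker:
    def __init__(self, threshold: float = 0.5, min_requests: int = 20,
                 sleep_duration: float = 30.0):
        self.threshold = threshold
        self.min_requests = min_requests
        self.sleep_duration = sleep_duration
        self.state = "closed"
        self.successes = 0
        self.failures = 0
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request may be routed to this backend."""
        if self.state == "open" and now - self.opened_at >= self.sleep_duration:
            self.state = "half-open"  # sleepDuration elapsed: allow a probe
        return self.state != "open"

    def record(self, success: bool, now: float) -> None:
        """Record a request outcome and update the breaker state."""
        if self.state == "half-open":
            if success:
                self.state = "closed"  # probe succeeded: resume traffic
            else:
                self.state = "open"    # probe failed: reopen
                self.opened_at = now
            self.successes = self.failures = 0
            return
        self.successes += int(success)
        self.failures += int(not success)
        total = self.successes + self.failures
        if total >= self.min_requests and self.failures / total > self.threshold:
            self.state = "open"        # error rate exceeded the threshold
            self.opened_at = now

cb = CircuitBreaker(threshold=0.5, min_requests=4, sleep_duration=30.0)
for _ in range(4):
    cb.record(success=False, now=0.0)
print(cb.state)            # open
print(cb.allow(now=10.0))  # False: still within sleepDuration
print(cb.allow(now=31.0))  # True: half-open, probe allowed
cb.record(success=True, now=31.0)
print(cb.state)            # closed
```

The key property is that an open circuit fails fast instead of queuing requests against a backend that is already failing, which is what prevents the cascade.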
## Timeouts
Fine-grained timeout control at the gateway, route, and backend levels. Route-level settings override gateway defaults.
```yaml
gateway:
  timeout:
    connect: 5s  # TCP + TLS handshake
    read: 120s   # max time to receive the full response
    write: 10s   # max time to send the request body
    idle: 300s   # keep-alive idle timeout

# Per-route override
routes:
  - name: "embeddings"
    path: "/v1/embeddings"
    methods: ["POST"]
    backend: "openai-backend"
    timeout:
      read: 30s  # embeddings return faster than chat completions
```