LLM Gateway Docs v2.0

Traffic Management

Control how traffic is distributed, retried, and protected across your LLM backends using routing rules, load balancing strategies, retries, and circuit breakers.

Routing Rules

Routes match incoming requests by path, method, and optional headers. Rules are evaluated in ascending priority order (lower number first), and the first matching rule is applied. Set priority explicitly to make the ordering deterministic.

gateway.yaml — routes
routes:
  # Route chat completions to the pool
  - name: "chat-completions"
    path: "/v1/chat/completions"
    methods: ["POST"]
    backend: "llm-pool"
    priority: 10

  # Route embedding requests to a dedicated backend
  - name: "embeddings"
    path: "/v1/embeddings"
    methods: ["POST"]
    backend: "openai-backend"
    priority: 20

  # Header-based routing: internal testers get beta model
  - name: "beta-chat"
    path: "/v1/chat/completions"
    methods: ["POST"]
    headers:
      X-Beta-User: "true"
    backend: "beta-backend"
    priority: 5            # evaluated before chat-completions

Route Fields

Field     Type                 Description
--------  -------------------  --------------------------------------------
name      string               Unique identifier for the route
path      string               URL path to match (exact or prefix)
methods   string[]             HTTP methods: GET, POST, PUT, DELETE
headers   map<string,string>   Match on request header key=value pairs
backend   string               Backend or pool name to forward to
priority  integer              Lower number = higher priority. Default: 100
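
The first-match, priority-ordered behaviour described above can be sketched in a few lines of Python. This is an illustrative model, not the gateway's actual implementation; the `Route` and `select_route` names are invented for the example.

```python
# Sketch of first-match routing: rules are sorted by ascending priority,
# and the first rule whose path, method, and headers all match wins.
from dataclasses import dataclass, field

@dataclass
class Route:
    name: str
    path: str
    methods: list
    backend: str
    priority: int = 100          # default priority, per the field table
    headers: dict = field(default_factory=dict)

def select_route(routes, path, method, req_headers):
    for route in sorted(routes, key=lambda r: r.priority):
        if route.path != path or method not in route.methods:
            continue
        # An empty headers map matches every request.
        if all(req_headers.get(k) == v for k, v in route.headers.items()):
            return route
    return None

routes = [
    Route("chat-completions", "/v1/chat/completions", ["POST"], "llm-pool", 10),
    Route("beta-chat", "/v1/chat/completions", ["POST"], "beta-backend", 5,
          {"X-Beta-User": "true"}),
]

# A beta tester hits beta-chat (priority 5); everyone else falls through.
print(select_route(routes, "/v1/chat/completions", "POST",
                   {"X-Beta-User": "true"}).name)   # beta-chat
print(select_route(routes, "/v1/chat/completions", "POST", {}).name)
```

Note that beta-chat must carry a lower priority number than chat-completions, otherwise the more general rule would shadow it.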

Load Balancing

When routing to a pool backend, the gateway selects the target using a configurable strategy. Each backend in the pool can carry a different weight.

Strategy           Description                                               Best for
-----------------  --------------------------------------------------------  ---------------------------
round-robin        Cycles through backends in order                          Uniform capacity pools
weighted           Sends traffic proportional to each backend's weight       Mixed-tier pools
latency            Routes to the backend with the lowest P90 latency         Latency-sensitive workloads
cost               Routes to the cheapest backend (per-token pricing)        Cost-optimised deployments
least-connections  Routes to the backend with the fewest in-flight requests  Long-running requests

gateway.yaml — load balancing
backends:
  - name: "llm-pool"
    type: pool
    strategy: weighted         # or: round-robin | latency | cost | least-connections
    fallback: true             # try next backend on error
    healthCheck:
      interval: 30s
      timeout: 5s
    backends:
      - name: openai-backend
        weight: 60
        costPerToken: 0.000015   # used by 'cost' strategy
      - name: anthropic-backend
        weight: 30
        costPerToken: 0.000012
      - name: azure-backend
        weight: 10
        costPerToken: 0.000010
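
A weighted strategy like the one configured above is typically a proportional random draw. The sketch below is an assumption about the mechanism, not the gateway's code; with weights 60/30/10, openai-backend should receive roughly 60% of selections.

```python
# Sketch of weighted backend selection: pick each backend with
# probability proportional to its weight.
import random

def pick_weighted(backends, rng=random):
    total = sum(weight for _, weight in backends)
    r = rng.uniform(0, total)
    upto = 0.0
    for name, weight in backends:
        upto += weight
        if r <= upto:
            return name
    return backends[-1][0]       # guard against float rounding

pool = [("openai-backend", 60), ("anthropic-backend", 30), ("azure-backend", 10)]
counts = {name: 0 for name, _ in pool}
for _ in range(10_000):
    counts[pick_weighted(pool)] += 1
# counts now reflect the 60/30/10 split, within sampling noise
```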

Retries

Configure automatic retries on transient failures. By default, retries use exponential backoff and fire only on status codes that are safe to retry.

gateway.yaml — retries
policies:
  retry:
    maxAttempts: 3           # total attempts (1 original + 2 retries)
    backoff: exponential     # linear | exponential | constant
    initialInterval: 200ms
    maxInterval: 5s
    jitter: true             # add random jitter to backoff
    retryOn:
      - 429                  # rate limited
      - 500                  # internal server error
      - 502                  # bad gateway
      - 503                  # service unavailable
      - 504                  # gateway timeout
  # Optional: different retry budget per route
  perRoute:
    - route: "chat-completions"
      maxAttempts: 2
      retryOn: [429, 503]
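
Under the settings above (maxAttempts: 3, initialInterval: 200ms, maxInterval: 5s), the wait before each retry doubles until it hits the cap, and jitter randomises each wait. A minimal sketch of that schedule, assuming "full jitter" (a uniform draw up to the computed delay; the gateway's exact jitter formula is not specified here):

```python
# Sketch of the retry backoff schedule: exponential growth from the
# initial interval, capped at the maximum, with optional full jitter.
import random

def backoff_delays(max_attempts=3, initial=0.2, cap=5.0, jitter=True):
    delays = []
    for attempt in range(max_attempts - 1):   # no delay after the last attempt
        base = min(cap, initial * (2 ** attempt))
        delays.append(random.uniform(0, base) if jitter else base)
    return delays

print(backoff_delays(jitter=False))   # [0.2, 0.4]
```

Jitter spreads out retries from many clients so they do not hammer a recovering backend in lockstep.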

Circuit Breaker

Prevent cascading failures by opening the circuit when a backend exceeds the error threshold. The gateway will stop routing to that backend until it recovers.

gateway.yaml — circuit breaker
policies:
  circuitBreaker:
    enabled: true
    threshold: 0.5           # open when >50% of requests fail
    minRequests: 20          # minimum requests before evaluating
    windowDuration: 60s      # rolling window for error rate
    sleepDuration: 30s       # time circuit stays open before probing
    probeRequests: 5         # requests allowed in half-open state

How it works: When the error rate exceeds threshold within the rolling window, the circuit opens and requests fail fast. After sleepDuration, it enters the half-open state and allows a limited number of probe requests (probeRequests). If the probes succeed, the circuit closes; if any fails, it reopens.
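
The closed → open → half-open cycle can be modelled as a small state machine. This is an illustrative sketch, not the gateway's implementation: it tracks a simple error rate since the last reset rather than a true rolling window (windowDuration is omitted for brevity), and the clock is injectable so the example runs without sleeping.

```python
# Sketch of the circuit-breaker state machine described above.
import time

class CircuitBreaker:
    def __init__(self, threshold=0.5, min_requests=20,
                 sleep_duration=30.0, probe_requests=5, clock=time.monotonic):
        self.threshold = threshold
        self.min_requests = min_requests
        self.sleep_duration = sleep_duration
        self.probe_requests = probe_requests
        self.clock = clock
        self.state = "closed"
        self.successes = 0
        self.failures = 0
        self.opened_at = None
        self.probes_left = 0

    def allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.sleep_duration:
                self.state = "half-open"          # start probing
                self.probes_left = self.probe_requests
            else:
                return False                      # fail fast
        if self.state == "half-open" and self.probes_left == 0:
            return False
        return True

    def record(self, ok):
        if self.state == "half-open":
            self.probes_left -= 1
            if not ok:
                self._open()                      # a failed probe reopens
            elif self.probes_left == 0:
                self.state = "closed"             # all probes passed
                self.successes = self.failures = 0
            return
        self.successes += ok
        self.failures += not ok
        total = self.successes + self.failures
        if total >= self.min_requests and self.failures / total > self.threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()

t = [0.0]                                         # fake clock for the demo
cb = CircuitBreaker(min_requests=4, sleep_duration=10.0, probe_requests=2,
                    clock=lambda: t[0])
for ok in (True, False, False, False):
    cb.record(ok)                                 # 3/4 failed -> circuit opens
print(cb.state)                                   # open
```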

Timeouts

Fine-grained timeout control at the gateway, route, and backend levels. Route-level settings override gateway defaults.

gateway.yaml — timeouts
gateway:
  timeout:
    connect: 5s              # TCP + TLS handshake
    read: 120s               # max time to receive full response
    write: 10s               # max time to send request body
    idle: 300s               # keep-alive idle timeout

# Per-route override
routes:
  - name: "embeddings"
    path: "/v1/embeddings"
    methods: ["POST"]
    backend: "openai-backend"
    timeout:
      read: 30s              # embeddings are faster than chat
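
The override semantics are field-by-field: a route that sets only read keeps the gateway defaults for connect, write, and idle. A minimal sketch of that resolution (the function name is illustrative):

```python
# Sketch of timeout resolution: route-level values override the gateway
# defaults field by field.
GATEWAY_DEFAULTS = {"connect": "5s", "read": "120s", "write": "10s", "idle": "300s"}

def effective_timeouts(route_timeout=None):
    # Later entries in the merge win, so route values shadow the defaults.
    return {**GATEWAY_DEFAULTS, **(route_timeout or {})}

print(effective_timeouts({"read": "30s"}))
# {'connect': '5s', 'read': '30s', 'write': '10s', 'idle': '300s'}
```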