Model Fallback
LLM endpoints fail. Rate limits, model outages, context-overflow errors, and content filters all surface as request errors that would otherwise stall an agent loop. Model fallback lets you pass a ranked list of models in a single request — Routeplane walks the list until one succeeds, then returns that response.
This is a body-level extension to the OpenAI, Anthropic, and Google protocol surfaces. No SDK required — set one field.
Quick example
```shell
curl http://127.0.0.1:4356/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "models": [
      "openai/gpt-4o",
      "anthropic/claude-sonnet-4-6",
      "google/gemini-2.5-pro"
    ],
    "messages": [{"role": "user", "content": "Summarize the Iliad in one sentence."}]
  }'
```

The `model` field stays the primary for billing and routing semantics. The `models` array overrides it as an ordered preference list: the first model that returns successfully wins.
What triggers a fallback
Routeplane falls through to the next model on errors that are upstream-side or that a different model is likely to resolve. Errors caused by your request or credentials (authentication, quota, validation) are surfaced directly to the caller. Note that two 4xx codes, 408 and 429, count as fall-through cases.
| Outcome | Signal | Behavior |
|---|---|---|
| Rate limited | 429 | Fall through |
| Server error | 5xx | Fall through |
| Timeout / connection drop | 408, network error | Fall through |
| Context window exceeded | provider-specific code | Fall through |
| Content filter / refusal | provider-specific code | Fall through |
| Mid-stream failure (no tokens emitted) | stream aborted before first token | Fall through |
| Mid-stream failure (after first token) | stream aborted mid-response | Surfaced (partial output already sent) |
| Authentication error | 401 | Surfaced |
| Forbidden / quota exhausted | 402, 403 | Surfaced |
| Validation / bad request | 400, 422 | Surfaced |
| Explicit cancel | client disconnect | Surfaced |
Fallback is single-pass. Routeplane attempts each model at most once per request, in order. There is no exponential backoff between attempts — the assumption is that you’d rather retry on a different model immediately than wait on a failing one.
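The single-pass rule and the status-code rows of the table can be sketched in a few lines of Python. This is an illustration of the documented behavior, not Routeplane's implementation: `Attempt`, `should_fall_through`, and `walk_models` are hypothetical names, provider-specific signals (context overflow, content filters) and streaming are not modelled, and returning the last failure when every model fails is an assumption, since the page does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Status codes taken from the table above.
FALL_THROUGH_STATUSES = {408, 429}
SURFACED_STATUSES = {400, 401, 402, 403, 422}


@dataclass
class Attempt:
    """Hypothetical result of one upstream attempt."""
    status: int
    body: Optional[str] = None


def should_fall_through(status: int) -> bool:
    """Return True when the table says to try the next model."""
    if status in SURFACED_STATUSES:
        return False
    return status in FALL_THROUGH_STATUSES or 500 <= status <= 599


def walk_models(
    models: List[str], call: Callable[[str], Attempt]
) -> Tuple[str, Attempt]:
    """Try each model at most once, in order, with no backoff."""
    if not models:
        raise ValueError("models must contain at least one entry")
    for model in models:
        attempt = call(model)
        if 200 <= attempt.status < 300:
            return model, attempt  # first success wins
        if not should_fall_through(attempt.status):
            return model, attempt  # surfaced to the caller as-is
    # Assumption: when every model fails, report the last failure.
    return model, attempt
```

With a `call` function that returns 429 for the first model and 200 for the second, `walk_models` returns the second model and never contacts a third.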
Inspecting which model answered
Three breadcrumbs:

- Response body `model` field: set to the model that actually generated the response, not the one you requested first. (OpenAI convention.)
- Response header `routeplane-served-by`: `<provider-id>/<model-id>`, e.g. `anthropic-direct/anthropic/claude-sonnet-4-6`.
- Response header `routeplane-fallback-trace`: comma-separated list of attempts and outcomes, e.g. `openai/gpt-4o:rate_limit,anthropic/claude-sonnet-4-6:served`. Only emitted when at least one fallback fired.
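On the client side, both headers are easy to split into structured values. The sketch below assumes the formats shown in the examples above: each trace entry is `model:outcome` with the outcome after the last colon, and the provider ID in `routeplane-served-by` contains no slash. The function names are hypothetical.

```python
from typing import List, Tuple


def parse_fallback_trace(header_value: str) -> List[Tuple[str, str]]:
    """Split a routeplane-fallback-trace value into (model, outcome) pairs."""
    attempts = []
    for entry in header_value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # The outcome follows the last colon; the model ID is everything before it.
        model, sep, outcome = entry.rpartition(":")
        if not sep:
            # No colon found: keep the entry as the model with an empty outcome.
            model, outcome = entry, ""
        attempts.append((model, outcome))
    return attempts


def parse_served_by(header_value: str) -> Tuple[str, str]:
    """Split a routeplane-served-by value into (provider_id, model_id).

    Assumes the provider ID is the segment before the first slash.
    """
    provider, _, model = header_value.strip().partition("/")
    return provider, model
```

Because the trace header is only emitted when a fallback fired, callers should treat a missing header as "the first model served the request" rather than as an error.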
Cost and latency tradeoffs
Each fallback attempt is a fresh upstream request. Practical advice:
- Lowest expected cost: order by cheapest first, accept higher tail latency under load.
- Lowest expected latency: order by most reliable first, accept higher per-token cost.
- For long-running agent loops: bias toward reliability. The cost of a stalled loop is much higher than the marginal cost difference between two frontier models.
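The ordering tradeoff can be made concrete with a small expected-value calculation. The sketch below is a simplified model, not Routeplane behavior. It assumes attempts fail independently, a failed attempt costs nothing but takes as long as a successful one, and each model has a single success probability. `ModelProfile` and all numbers you feed it are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ModelProfile:
    """Hypothetical per-model estimates used only for this illustration."""
    name: str
    p_success: float  # probability that an attempt succeeds
    cost: float       # cost of a served response, arbitrary units
    latency: float    # seconds per attempt, success or failure


def expected_cost_and_latency(
    order: List[ModelProfile],
) -> Tuple[float, float, float]:
    """Return (expected cost, expected latency, probability all models fail)."""
    p_reach = 1.0  # probability that the walk reaches the current model
    exp_cost = 0.0
    exp_latency = 0.0
    for m in order:
        exp_latency += p_reach * m.latency           # the attempt always takes time
        exp_cost += p_reach * m.p_success * m.cost   # cost is paid only when served
        p_reach *= 1.0 - m.p_success                 # fall through on failure
    return exp_cost, exp_latency, p_reach
```

Evaluating the function on the same two profiles in both orders shows the tradeoff directly: putting the cheaper model first lowers expected cost, while putting the more reliable (and faster) model first lowers expected latency.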
For declarative cost or latency optimization across providers of a single model, see Provider Selection. Fallback and provider selection compose: Routeplane picks the best provider for each model in your models array, falling through to the next model only after the chosen provider for the current one has exhausted its retry budget.
Anthropic and Google surfaces
The `models` field works identically on `/v1/messages` (Anthropic Messages) and `/v1beta/models/{model}:generateContent` (Google Generative AI). On Anthropic, the field is read alongside the existing `model` field; on Google, Routeplane accepts it as an extension to the request body and rewrites the upstream `:generateContent` path for each attempt.
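As a rough sketch, the request path and body for the two surfaces might be assembled as follows. The body shapes follow the public Anthropic Messages and Google `generateContent` formats with `models` added as the extension field. `build_fallback_request` is a hypothetical helper, and placing the full Routeplane model ID (including its provider prefix) in the Google path is an assumption that this page does not confirm.

```python
from typing import Any, Dict, List, Tuple


def build_fallback_request(
    surface: str, models: List[str], prompt: str, max_tokens: int = 256
) -> Tuple[str, Dict[str, Any]]:
    """Return (path, JSON body) for a fallback request on the given surface."""
    if not models:
        raise ValueError("models must contain at least one entry")
    primary = models[0]  # the first entry doubles as the primary model
    if surface == "anthropic":
        body = {
            "model": primary,
            "models": list(models),
            "max_tokens": max_tokens,  # required by the Anthropic Messages format
            "messages": [{"role": "user", "content": prompt}],
        }
        return "/v1/messages", body
    if surface == "google":
        body = {
            "models": list(models),
            "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        }
        # Assumption: the primary model ID is used verbatim in the path.
        return f"/v1beta/models/{primary}:generateContent", body
    raise ValueError(f"unsupported surface: {surface}")
```

The returned body can be serialized with `json.dumps` and sent to the same host used in the curl example.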
Limits
- Maximum 8 entries in `models`.
- Each model ID must resolve to a registered model in the registry. Unknown IDs return `400` before any upstream attempt.
- Streaming is supported. The first model that begins emitting tokens wins; later models are not attempted even if the stream fails after the first token.
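A client can check the first two limits before sending a request, which avoids a round trip that would end in a `400`. The sketch below is illustrative only: `validate_models` is a hypothetical helper, and `registry` stands in for whatever list of registered model IDs you have available locally.

```python
from typing import Iterable, List

MAX_MODELS = 8  # documented maximum number of entries in `models`


def validate_models(models: List[str], registry: Iterable[str]) -> List[str]:
    """Return a list of problems with `models`; an empty list means it looks valid."""
    known = set(registry)
    problems = []
    if len(models) > MAX_MODELS:
        problems.append(
            f"too many entries: {len(models)} (maximum is {MAX_MODELS})"
        )
    for model_id in models:
        if model_id not in known:
            problems.append(f"unknown model ID: {model_id}")
    return problems
```

The server-side check remains authoritative, so a local pass does not guarantee acceptance if the local copy of the registry is stale.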