Routing three foundation models at 1.5s, without paying for it twice.

A production multi-model LLM inference layer behind enterprise applications — owning latency, cost, and reliability under varying load.

Multi-Model LLM Routing
Period: Feb 2024 – Jul 2024
Role: Generative AI Engineer at Concentrix + Webhelp

Enterprise teams wanted access to multiple foundation models from one stable endpoint. Each model had a different cost curve, a different latency profile, and a different failure mode. The naive fix — pick one model and stick with it — left money on the table and made every outage a full outage. The better fix was to put a routing layer in front of all of them and treat model choice as a tunable knob.

The constraints

  • Latency budget. End-to-end response had to feel interactive across consumer-facing flows — sub-2s, ideally sub-1.5s.
  • Cost ceiling. Inference spend was the largest line item in the platform's monthly bill.
  • Accuracy floor. Routing decisions could not silently degrade response quality.
  • Operability. When a model misbehaved, on-call needed to know in minutes, not days.

The architecture

  • Router. LiteLLM in front of AWS Bedrock and SageMaker endpoints, plus a request classifier that decided which model could meet the request's latency-cost-accuracy envelope.
  • Fallback chain. Every route had a sibling model it could fall back to on rate-limit or 5xx, so a single provider blip never became a user-visible outage.
  • Observability. Structured logging into CloudWatch with per-request model, latency, token count, and confidence; safety checks ran inline and emitted their own log stream.
  • Drift monitoring. Daily aggregates over production traffic that flagged accuracy and latency distributions creeping outside their bounds.
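
The fallback chain and per-request logging can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production router: it makes no LiteLLM, Bedrock, or SageMaker calls, and `RetryableError`, `call_with_fallback`, and the two stub models are hypothetical names standing in for real provider clients.

```python
import json
import time
from typing import Callable, Optional, Sequence, Tuple


class RetryableError(Exception):
    """Hypothetical stand-in for a rate-limit (429) or 5xx provider response."""


def call_with_fallback(
    prompt: str,
    chain: Sequence[Tuple[str, Callable[[str], str]]],
    log: Callable[[str], None] = print,
) -> str:
    """Try each (model_name, call) pair in order until one succeeds."""
    last_error: Optional[Exception] = None
    for model_name, call in chain:
        start = time.perf_counter()
        reply: Optional[str] = None
        status = "ok"
        try:
            reply = call(prompt)
        except RetryableError as exc:
            status, last_error = "retryable_error", exc
        # One structured record per attempt, so dashboards can group by model.
        log(json.dumps({
            "model": model_name,
            "status": status,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        }))
        if reply is not None:
            return reply
    raise RuntimeError("every model in the fallback chain failed") from last_error


# Stub models for illustration only.
def flaky_primary(prompt: str) -> str:
    raise RetryableError("429 Too Many Requests")


def steady_sibling(prompt: str) -> str:
    return f"echo: {prompt}"


records = []
answer = call_with_fallback(
    "hello",
    [("primary", flaky_primary), ("sibling", steady_sibling)],
    log=records.append,
)
```

Here the primary fails with a retryable error, the sibling answers, and `records` holds one JSON line per attempt. Non-retryable exceptions are deliberately left to propagate, so a bad request does not get replayed against every model in the chain.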

The results

  • 1.5s. End-to-end latency, p50 across all routes.
  • −18%. Inference cost vs single-model baseline.
  • 95%+. Response accuracy under varying constraints.
  • −42% / +35%. Prod incidents / MTTD, thanks to inline observability.

What worked

  • Routing on a budget, not a vibe. The classifier picked the cheapest model that could meet the request's latency-accuracy envelope, not the “best” model globally. That's where most of the cost win came from.
  • Cheap observability beats expensive intuition. Structured per-request logs and a couple of dashboards bought most of the incident reduction. The router itself was simple; the visibility around it was the moat.
  • Fallbacks earned their keep on day one. Within the first week, a single Bedrock endpoint started responding with inflated latency on a fraction of requests. The fallback chain absorbed it before anyone noticed.
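
"Routing on a budget" reduces to a small selection rule: filter the model menu down to what fits the request's envelope, then take the cheapest survivor. The sketch below assumes each model carries a measured latency, an offline accuracy score, and a unit cost; the `ModelProfile` and `Envelope` types, the model names, and every number are illustrative placeholders, not figures from the project.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # illustrative unit cost
    p95_latency_s: float       # assumed to come from production measurements
    expected_accuracy: float   # assumed offline eval score in [0, 1]


@dataclass(frozen=True)
class Envelope:
    max_latency_s: float
    min_accuracy: float


def pick_model(
    models: Sequence[ModelProfile], envelope: Envelope
) -> Optional[ModelProfile]:
    """Return the cheapest model that fits the envelope, or None if none fits."""
    eligible = [
        m for m in models
        if m.p95_latency_s <= envelope.max_latency_s
        and m.expected_accuracy >= envelope.min_accuracy
    ]
    return min(eligible, key=lambda m: m.cost_per_1k_tokens, default=None)


# Hypothetical three-model menu.
MODELS = [
    ModelProfile("small", 0.25, 0.6, 0.90),
    ModelProfile("medium", 1.00, 1.1, 0.95),
    ModelProfile("large", 5.00, 1.9, 0.98),
]

strict = pick_model(MODELS, Envelope(max_latency_s=1.5, min_accuracy=0.93))
relaxed = pick_model(MODELS, Envelope(max_latency_s=1.5, min_accuracy=0.85))
impossible = pick_model(MODELS, Envelope(max_latency_s=0.5, min_accuracy=0.85))
```

With these placeholder numbers, the strict envelope selects `medium`, the relaxed one drops to `small`, and the impossible one returns `None`, which a caller would need to handle explicitly (for example by relaxing the envelope or rejecting the request).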

What I'd do differently

  • Move the classification logic from a heuristic to a small fine-tuned classifier earlier. The heuristic was good enough to ship but ate engineering time as the model menu grew.
  • Pre-compute a per-tenant routing policy rather than a single global one — variance between enterprise customers' mix of requests turned out to be larger than I expected.
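
One way the per-tenant idea could look, as a hedged sketch: keep the global policy as the default and pre-compute a merged table per tenant, so the request path stays a dictionary lookup. The request classes, tenant names, and model names below are all hypothetical.

```python
from typing import Dict, Mapping

# Hypothetical global policy: request class -> model name.
GLOBAL_POLICY: Dict[str, str] = {
    "short_qa": "small",
    "summarize": "medium",
    "reasoning": "large",
}

# Hypothetical per-tenant overrides, assumed to come from an offline analysis
# of each tenant's request mix.
TENANT_OVERRIDES: Dict[str, Dict[str, str]] = {
    "tenant_a": {"summarize": "small"},
}


def precompute_policies(
    global_policy: Mapping[str, str],
    overrides: Mapping[str, Mapping[str, str]],
) -> Dict[str, Dict[str, str]]:
    """Merge each tenant's overrides on top of the global policy, ahead of time."""
    return {
        tenant: {**global_policy, **tenant_overrides}
        for tenant, tenant_overrides in overrides.items()
    }


def route(
    policies: Mapping[str, Mapping[str, str]],
    global_policy: Mapping[str, str],
    tenant: str,
    request_class: str,
) -> str:
    """Look up the tenant's policy, falling back to the global one."""
    return policies.get(tenant, global_policy)[request_class]


POLICIES = precompute_policies(GLOBAL_POLICY, TENANT_OVERRIDES)
overridden = route(POLICIES, GLOBAL_POLICY, "tenant_a", "summarize")
inherited = route(POLICIES, GLOBAL_POLICY, "tenant_a", "reasoning")
unknown_tenant = route(POLICIES, GLOBAL_POLICY, "tenant_z", "summarize")
```

Tenants without overrides fall through to the global policy, so onboarding a new customer needs no policy work until their traffic shows it is warranted.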