Organisations are increasingly running multiple open-source models (Qwen, Gemma, Kimi, MiniMax, Llama, DeepSeek) alongside managed APIs. It's the control layer that decides which model handles which request, what happens when one fails, and when a human needs to step in.

Here is an architecture pattern that works.

Three Layers

  1. Inference layer — GPU servers running model weights (vLLM, Ollama, TGI) or managed endpoints (Bedrock, Vertex AI)
  2. Control layer — routing logic, failover, approval workflows, and session management
  3. Client layer — web apps, mobile apps, internal tools consuming the unified API

Step 1: Serve the Models

Each open-source model needs a serving runtime that exposes an OpenAI-compatible /v1/chat/completions endpoint. This lets the control layer treat every model uniformly.

  • vLLM — production-grade, supports tensor parallelism across multiple GPUs. Best for large models (Qwen 72B, DeepSeek 67B).
  • Ollama — simple single-node setup. Good for development and smaller models.
  • LM Studio — desktop GUI with local server. Useful for prototyping.
  • Amazon Bedrock / Vertex AI — managed hosting. No GPU management, pay-per-token. Hosts Qwen, Kimi, Llama, and others.

The key decision: self-hosted vs. managed is not binary. The strongest production setups are hybrid — run your primary models on your own GPUs for cost and privacy, keep managed endpoints as fallbacks.

Step 2: Define Routing Rules

A. Primary + Fallback Chain

Every request goes to Model A. If it fails (timeout, overloaded, OOM), try Model B, then C:


primary: qwen-72b
fallbacks: [gemma, kimi-large, bedrock/llama]

This is the most common pattern. It gives you resilience without complexity.

B. Use-Case Routing

Different tasks go to different models based on request type:

  • Code generation → qwen-72b (best code quality)
  • Summarisation → gemma-27b (fast, cost-effective)
  • Classification → gemma-9b (low latency)
  • Legal/compliance → kimi-large (long context)

Route at the application layer — tag each request with its use case, and the control layer selects the model.

C. Per-Tenant / Per-Channel Routing


enterprise-tier  → qwen3.6-72b (dedicated GPU)
standard-tier    → gemma4-31b  (shared pool)
internal-tools   → ollama/local (free)

D. User-Switchable

Let power users pick their model at runtime. Expose an allowlist of approved models so users can trade off speed vs. quality without leaving your platform.

Step 3: Implement Failover Properly

Naive failover (retry on any error) wastes GPU cycles. A good failover system distinguishes:

  • Retriable errors (rate limit, temporary overload): retry same model with backoff
  • Failover errors (model crashed, auth expired, context too long): advance to next model in the chain
  • Non-retriable errors (invalid request, content policy): fail fast, don't waste other models' capacity

Additional production patterns:

  • Auth rotation — if you have multiple API keys for the same provider, rotate through them before failing over to a different model
  • Session stickiness — pin a user's session to one model/key to keep provider-side KV caches warm (dramatically reduces latency for multi-turn conversations)
  • Cooldown tracking — when a model returns a rate limit with a retry-after header, mark it as cooling down and skip it for the specified duration

Step 4: Add Human-in-the-Loop

A. Approval Gates

Define which actions require human sign-off before execution:


if action = "send_email" OR action = "modify_database":
  HOLD for human approval
  notify reviewer via Slack/Teams/email
  execute only after explicit approval

B. Confidence-Based Escalation

  1. Run request through Model A (Qwen)
  2. Run same request through Model B (Gemma)
  3. Compare outputs
  4. If outputs diverge significantly → escalate to human
  5. If aligned → deliver with high-confidence tag

This dual-model verification pattern is especially valuable for compliance, legal review, and medical content.

C. Standing Authorization with Boundaries

Instead of approving every action, define what the system can do autonomously:


AUTHORIZED:
- Classify incoming support tickets
- Draft response using the knowledge base
- Route to the correct department

REQUIRES HUMAN APPROVAL:
- Issue refunds over £100
- Escalate to legal
- Respond to press/media inquiries

ESCALATION TRIGGERS:
- Customer sentiment score < 0.3
- Request mentions litigation
- Model confidence < threshold

This "standing orders" pattern lets the system handle 80–90% of volume autonomously while routing edge cases to humans.

Step 5: Infrastructure on AWS

Recommended deployment:

  • g5.xlarge (1× A10G, 24 GB) — models up to 13B parameters
  • g5.12xlarge (4× A10G, 96 GB) — models up to 70B
  • p4d.24xlarge — very large models (70B+) at production throughput
  • Amazon Bedrock — fallback for when spot capacity is reclaimed

Cost optimisation:

  • Use Spot Instances for non-critical inference workloads (50–70% savings)
  • Keep a Bedrock fallback for when spot capacity is reclaimed
  • Set cost: 0 for self-hosted models in your routing config so the control layer prefers them over pay-per-token APIs

Key Takeaways

  1. Standardise on OpenAI-compatible endpoints. Whether you use vLLM, Ollama, or Bedrock, expose /v1/chat/completions. This makes your control layer provider-agnostic.
  2. Hybrid > pure self-hosted. Run your primary workload on your own GPUs. Keep managed APIs as fallbacks. This gives you cost control + resilience.
  3. Routing is a product decision, not just infrastructure. Which model handles which use case should be configurable by product teams, not hardcoded by engineers.
  4. Failover needs intelligence. Don't just retry blindly. Classify errors, rotate credentials, respect cooldowns, and pin sessions.
  5. Human-in-the-loop is not optional for production. Define clear boundaries between autonomous operation and human review. Use confidence thresholds and dual-model verification for high-stakes outputs.

The models are commoditising. The control layer is the differentiator.