This is Part 2 of a series on multi-model LLM architectures. Part 1 covered routing, failover, and human-in-the-loop.
The routing layer decides which model handles a request. But the memory and context layers decide what that model actually sees. Get this wrong, and your system forgets critical data every time it switches models, runs out of context, or restarts a session.
The Three-Layer Memory Stack
Production multi-model systems need three distinct memory tiers:
- Long-term memory — persists across sessions and compaction cycles. Survives model switches, restarts, and context resets.
- Short-term memory — session-scoped. Holds working knowledge for the current conversation.
- Working memory — in-context, lossy. The conversation transcript actively loaded into the model's context window.
Why three layers matter for multi-model:
When you switch from Model A to Model B (failover, routing, or user choice), the working memory carries over. But it was shaped by Model A's responses.
Long-term and short-term memory are model-agnostic. They're injected into the system prompt before any model sees them. This is the continuity bridge between models.
Without the persistent layers, every model switch is a soft reset. With them, the new model inherits the full operational context.
Context Assembly: What the Model Actually Sees
Every model call requires assembling a prompt from multiple sources under a strict token budget. The assembly order matters:
- Static system prompt (cacheable prefix)
- Long-term memory injections
- Short-term memory (session context)
- RAG retrieval results
- Recent conversation turns (working memory)
The cache boundary is critical for cost. Providers like Anthropic cache the stable prefix across turns. If your system prompt changes on every request — injecting different memory, reordering tools — you invalidate the cache and pay full input pricing on every call.
Design for prefix stability: keep the static portion byte-identical across turns and push dynamic content (retrieved memory, recent messages) after the cache boundary. At scale, this is 40–60% of your input token cost.
Compaction: When Context Runs Out
Every model has a finite context window. Long conversations will exceed it. The question is: what do you lose?
Naive truncation (drop oldest messages) destroys critical decisions, constraints, and identifiers mentioned early in the conversation. Production systems need intelligent compaction — summarisation that preserves what matters.
Structured compaction format:
## Summary
[2–3 sentence overview of what was accomplished]
## Decisions Made
- [decision 1 with rationale]
- [decision 2 with rationale]
## Exact Identifiers
- PR #4521, branch: fix/auth-timeout
- Port 8443, service: payments-api
- Cluster: prod-eu-west-1
## Constraints & Requirements
- [constraint 1]
- [constraint 2]
## Recent Turns (verbatim)
[Last 3 user/assistant exchanges kept word-for-word]
Why structured is better than free-form summarisation:
- Identifiers survive. Free-form summaries hallucinate port numbers, PR IDs, and paths. A dedicated "Exact Identifiers" section forces preservation.
- Decisions are explicit. Without a "Decisions Made" section, the model re-debates settled questions after compaction.
- Recent turns stay verbatim. The last 3 user/assistant pairs are kept word-for-word so the model doesn't lose the immediate thread.
Pre-Compaction Memory Flush
This is the pattern most teams miss. Before compaction erases context, flush durable insights to persistent memory.
- Detect: context approaching compaction threshold
- Run a silent "memory flush" turn (no user-visible output)
- The model extracts durable facts → writes to long-term/short-term memory
- Compaction runs normally
- Next turn: long-term memory re-injected into system prompt
- Facts survive even though the conversation history was compressed
Without this, compaction is lossy. With it, the most important context survives indefinitely across compaction cycles, model switches, and even session resets.
Cross-Model Context Transfer: The Hard Problem
When Model A fails and Model B takes over, what state transfers?
What stays the same:
- Full conversation transcript (same session storage)
- Compaction summaries from previous turns
- Tool call history (requests + results, even from Model A)
- Long-term and short-term memory files
- Session metadata (user preferences, flags, settings)
What changes:
- System prompt is rebuilt (different model may have different context window, tool format, etc.)
- Token counters reset (new model's tokeniser may count differently)
- Context budget recalculated (Model B might have a smaller window)
- Provider-specific formatting (Anthropic vs OpenAI message format)
The critical insight: the transcript is model-agnostic, but the context window is not.
If Model A has a 128K context window and Model B has 32K, the full conversation won't fit after a switch. The control layer must:
- Detect the context budget mismatch
- Trigger compaction before the first call to Model B
- Rebuild the prompt under Model B's budget
- Retry with the compacted context
RAG as the Memory Bridge
For multi-model systems, RAG serves a dual purpose:
- Augment any model's knowledge with domain-specific information
- Bridge context across model switches — when compaction loses detail, RAG can retrieve it
Recommended hybrid search:
- Vector search for semantic similarity
- BM25 keyword search for exact identifier matching
- Re-ranking layer to merge results
Index sources:
- Long-term memory files
- Daily notes and session transcripts
- Domain docs / knowledge base
- Code repositories
Key Takeaways
- Memory is the continuity layer. Models are stateless. Your system's "memory" is the persistent storage that gets re-injected on every call. Design it in tiers: long-term (survives everything), short-term (session-scoped), working (in-context, lossy).
- Compaction is not truncation. Dropping old messages is data destruction. Structured summarisation with explicit sections for decisions, identifiers, and constraints is what lets a 200-turn conversation survive in a 32K context window.
- Flush before you compress. Run a memory extraction pass before compaction. The most important facts from a long conversation should survive in persistent memory.
- Cross-model switches require context budget recalculation. A failover from a 128K model to a 32K model is not just a URL change — it's a context restructuring event.
- RAG bridges the compaction gap. When compaction loses detail, retrieval brings it back. Use hybrid search (vector + keyword) so exact identifiers aren't lost.
- Cache your prompt prefix. The stable portion of your system prompt should be byte-identical across turns. At scale, this isn't a micro-optimisation — it's 40–60% of your input token cost.