Recently I came across a very interesting paper:
“The Missing Memory Hierarchy: Demand Paging for LLM Context Windows”
Paper link:
- The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
- Original: https://arxiv.org/abs/2603.09023
The core idea of this paper is very simple, yet highly inspiring:
An LLM’s context window is essentially “bare memory” with no memory management system.
If we treat an LLM Agent as a program, then the current design is like:
- All historical data stays in memory forever
- Every execution rescans the entire memory
- No paging
- No caching
- No working set
- No eviction policy
This is obviously very primitive.
The authors did something very “operating-system-esque”:
They add a virtual memory system to LLM context.
1. The problem: LLM context wastes a lot of tokens
The authors first did something very important:
They analyzed real production logs.
Data source:
- 857 Claude Code sessions
- 54,170 API calls
- Total input tokens: 4.45B
They split each message into several categories:
- User input
- Model reply
- Tool output (read file / bash / etc)
Then they measured:
- Token share
- Usage frequency
- Repeated reads
The results are startling.
1 Structural waste: 21.8%
About 21.8% of tokens are structural waste:
Mainly from:
1️⃣ Tool definitions
Many tool definitions are thousands of tokens, but are never invoked in the session.
Yet they are resent in every request round.
2️⃣ Old tool outputs
For example:
In the next 80 rounds of dialogue:
- Every API call
- Resends this 10KB
- And it participates in attention
But in reality it is never referenced again.
3️⃣ Duplicate configuration
For example:
- Skill lists
- Agent instructions
- Prompt templates
Often copied multiple times.
2. Key observation: tool outputs get “infinitely amplified”
The paper introduces a concept:
Amplification Factor
For example:
- Read a file in round 5
- File size is 10KB
- The session has 85 rounds total
Then the file will be:
Amplification factor:
And every time it must participate in attention computation.
3. Solution: “virtual memory” for LLM context
The core system in the paper is called:
Pichay
Its implementation is very clever.
It doesn’t change the model.
Instead:
It inserts a proxy between the client and the LLM API.
Architecture:
The proxy is responsible for:
- Counting tokens
- Modifying context
- Deleting old content
- Restoring it when necessary
The model and client are completely unaware.
4. Core idea: Demand Paging
The authors analogize LLM context to OS memory.
| OS | LLM |
|---|---|
| RAM | Context Window |
| Page | Tool output |
| Page fault | Re-reading a file |
| Page eviction | Deleting old tool output |
1 Eviction policy
Very simple:
When a tool output satisfies:
it is evicted.
2 How is evicted content represented?
It’s replaced with a placeholder:
If the model needs it, it will call the tool again.
3 Page Fault
If the model calls again:
The proxy detects:
So it:
- Re-reads the file
- Injects it back into context
This is:
An LLM page fault.
4 Fault-driven Pinning
If a page:
it means the eviction was too early.
The system will:
and no longer evict it in the future.
Unless:
5. Offline experiment: extremely low wrong-deletion probability
The authors ran an offline replay experiment using 29 full sessions.
They simulated:
Results:
- Simulated deletions: 1,393,000 times
- Page faults: 354 times
Page fault rate:
In other words:
99.97% of deletions are safe.
6. Online experiment: tokens down 37%
The authors compared three strategies:
| Mode | Strategy |
|---|---|
| Baseline | No modifications |
| Trimmed | Trim tool definitions |
| Compact+Trim | Add paging |
Results:
And the tasks:
- All completed
- No quality degradation
In some cases:
Answer quality was even better.
The reason is simple:
Less noise, more focused attention.
7. Real production cases
The authors used Pichay in their own development environment.
Case A: normal development
Original context:
With paging enabled:
Deleted:
Page faults:
Almost perfect.
Case B: extremely long session
681 rounds.
The system showed:
Thrashing
Cause: The working set is too large.
It keeps:
Similar to an OS:
swap storm.
8. The most important theoretical conclusion
In traditional operating systems:
So the strategy is:
But LLMs are the complete opposite.
Cost model of LLMs
If content stays in context:
For every generated token, attention has to be recomputed.
Complexity:
Therefore:
The accumulated cost of keeping tokens is very high.
Whereas re-reading a file has cost:
So the conclusion is:
In LLMs, it’s better to tolerate a few more page faults than to keep too many tokens.
9. Why this matters
This paper is really saying something big:
Future LLM systems must have a memory hierarchy.
An analogy:
| Level | Meaning |
|---|---|
| L1 | Current context |
| L2 | Working set |
| L3 | Session summary |
| L4 | Long-term memory |
Current LLM systems:
Only have L1.
10. Future directions
The paper proposes many research directions worth exploring:
1 Cost-driven eviction
No longer based on:
but instead compute:
2 phase-aware memory
Identify:
Use different strategies for different phases.
3 pin decay
Currently:
In the future:
4 object-level memory
No longer page by:
but by:
11. Some thoughts of my own
The biggest value of this paper is:
It completely systematizes the LLM memory problem.
It tells us:
The future architecture of AI Agents may look like:
rather than:
In other words:
Prompt Engineering → Context Engineering
12. Summary
The most central findings of this paper:
1️⃣ 21.8% of tokens are structural waste
2️⃣ A simple paging strategy has a page fault rate of only 0.025%
3️⃣ Token usage is reduced by 37%
4️⃣ Answer quality does not drop
5️⃣ LLM cost models are the complete opposite of traditional virtual memory
If future LLM agents really need to:
- Work for long periods
- Collaborate with many tools
- Sustain ongoing project development
then:
A memory hierarchy may become core infrastructure for AI systems.
And this paper will likely become one of the early classic works in this area.