LLMs Also Need Virtual Memory: Demand Paging for the Context Window
Recently I came across a very interesting paper:
“The Missing Memory Hierarchy: Demand Paging for LLM Context Windows”
Paper link:
- The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
- Original: https://arxiv.org/abs/2603.09023
The core idea of this paper is very simple, yet highly inspiring:
An LLM’s context window is essentially “bare memory” with no memory management system.
If we treat an LLM Agent as a program, then the current design is like:
- All historical data stays in memory forever
- Every execution rescans the entire memory
- No paging
- No caching
- No working set
- No eviction policy
This is obviously very primitive.
The authors did something very “operating-system-esque”:
They add a virtual memory system to LLM context.
1. The problem: LLM context wastes a lot of tokens
The authors first did something very important:
They analyzed real production logs.
Data source:
- 857 Claude Code sessions
- 54,170 API calls
- Total input tokens: 4.45B
They split each message into several categories:
- User input
- Model reply
- Tool output (read file / bash / etc)
Then they measured:
- Token share
- Usage frequency
- Repeated reads
The results are startling.
1 Structural waste: 21.8%
About 21.8% of tokens are structural waste:
Mainly from:
1️⃣ Tool definitions
Many tool definitions are thousands of tokens, but are never invoked in the session.
Yet they are resent in every request round.
2️⃣ Old tool outputs
For example:
Read file: file.py (10KB)
In the next 80 rounds of dialogue:
- Every API call
- Resends this 10KB
- And it participates in attention
But in reality it is never referenced again.
3️⃣ Duplicate configuration
For example:
- Skill lists
- Agent instructions
- Prompt templates
Often copied multiple times.
2. Key observation: tool outputs get “infinitely amplified”
The paper introduces a concept:
Amplification Factor
For example:
- Read a file in round 5
- File size is 10KB
- The session has 85 rounds total
Then the file will be:
Repeatedly sent ≈ 80 times
Amplification factor:
80×
And every time it must participate in attention computation.
3. Solution: “virtual memory” for LLM context
The core system in the paper is called:
Pichay
Its implementation is very clever.
It doesn’t change the model.
Instead:
It inserts a proxy between the client and the LLM API.
Architecture:
IDE / Agent
↓
Pichay Proxy
↓
LLM API
The proxy is responsible for:
- Counting tokens
- Modifying context
- Deleting old content
- Restoring it when necessary
The model and client are completely unaware.
4. Core idea: Demand Paging
The authors analogize LLM context to OS memory.
| OS | LLM |
|---|---|
| RAM | Context Window |
| Page | Tool output |
| Page fault | Re-reading a file |
| Page eviction | Deleting old tool output |
1 Eviction policy
Very simple:
When a tool output satisfies:
size > 500B
and
unused for more than 4 user turns
it is evicted.
2 How is evicted content represented?
It’s replaced with a placeholder:
[Paged out: Read file.py (8192 bytes). Re-read if needed]
If the model needs it, it will call the tool again.
3 Page Fault
If the model calls again:
Read file.py
The proxy detects:
this content was just evicted
So it:
- Re-reads the file
- Injects it back into context
This is:
An LLM page fault.
4 Fault-driven Pinning
If a page:
gets evicted → triggers a page fault
it means the eviction was too early.
The system will:
pin this page
and no longer evict it in the future.
Unless:
the file contents change
5. Offline experiment: extremely low wrong-deletion probability
The authors ran an offline replay experiment using 29 full sessions.
They simulated:
if we deleted this content back then,
would it be needed later?
Results:
- Simulated deletions: 1,393,000 times
- Page faults: 354 times
Page fault rate:
0.0254%
In other words:
99.97% of deletions are safe.
6. Online experiment: tokens down 37%
The authors compared three strategies:
| Mode | Strategy |
|---|---|
| Baseline | No modifications |
| Trimmed | Trim tool definitions |
| Compact+Trim | Add paging |
Results:
Trimmed
↓
tokens -22.6%
Compact + Trim
↓
tokens -37.1%
And the tasks:
- All completed
- No quality degradation
In some cases:
Answer quality was even better.
The reason is simple:
Less noise, more focused attention.
7. Real production cases
The authors used Pichay in their own development environment.
Case A: normal development
Original context:
Remaining space: 7%
With paging enabled:
Remaining space: 43%
Deleted:
15 chunks
Page faults:
1 time
Almost perfect.
Case B: extremely long session
681 rounds.
The system showed:
Thrashing
Evictions: 680 times
Page faults: 659 times
Cause: The working set is too large.
It keeps:
delete → read → delete → read
Similar to an OS:
swap storm.
8. The most important theoretical conclusion
In traditional operating systems:
page faults are expensive
So the strategy is:
minimize page faults
But LLMs are the complete opposite.
Cost model of LLMs
If content stays in context:
For every generated token, attention has to be recomputed.
Complexity:
O(n²)
Therefore:
The accumulated cost of keeping tokens is very high.
Whereas re-reading a file has cost:
O(n)
So the conclusion is:
In LLMs, it’s better to tolerate a few more page faults than to keep too many tokens.
9. Why this matters
This paper is really saying something big:
Future LLM systems must have a memory hierarchy.
An analogy:
| Level | Meaning |
|---|---|
| L1 | Current context |
| L2 | Working set |
| L3 | Session summary |
| L4 | Long-term memory |
Current LLM systems:
Only have L1.
10. Future directions
The paper proposes many research directions worth exploring:
1 Cost-driven eviction
No longer based on:
number of turns
but instead compute:
future attention cost
2 phase-aware memory
Identify:
planning phase
execution phase
Use different strategies for different phases.
3 pin decay
Currently:
one page fault → pin forever
In the future:
the pin gradually decays
4 object-level memory
No longer page by:
files
messages
but by:
decisions
tasks
debug sessions
11. Some thoughts of my own
The biggest value of this paper is:
It completely systematizes the LLM memory problem.
It tells us:
The future architecture of AI Agents may look like:
LLM
↓
Memory Manager
↓
Context Cache
↓
Retrieval System
↓
Persistent Memory
rather than:
LLM
↓
a huge prompt
In other words:
Prompt Engineering → Context Engineering
12. Summary
The most central findings of this paper:
1️⃣ 21.8% of tokens are structural waste
2️⃣ A simple paging strategy has a page fault rate of only 0.025%
3️⃣ Token usage is reduced by 37%
4️⃣ Answer quality does not drop
5️⃣ LLM cost models are the complete opposite of traditional virtual memory
If future LLM agents really need to:
- Work for long periods
- Collaborate with many tools
- Sustain ongoing project development
then:
A memory hierarchy may become core infrastructure for AI systems.
And this paper will likely become one of the early classic works in this area.