Recently I came across a very interesting paper:

“The Missing Memory Hierarchy: Demand Paging for LLM Context Windows”

Paper link:

The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
Original: https://arxiv.org/abs/2603.09023

The core idea of this paper is very simple, yet highly inspiring:

An LLM’s context window is essentially “bare memory” with no memory management system.

If we treat an LLM Agent as a program, then the current design is like:

All historical data stays in memory forever
Every execution rescans the entire memory
No paging
No caching
No working set
No eviction policy

This is obviously very primitive.

The authors did something very “operating-system-esque”:

They add a virtual memory system to LLM context.

1. The problem: LLM context wastes a lot of tokens

The authors first did something very important:

They analyzed real production logs.

Data source:

857 Claude Code sessions
54,170 API calls
Total input tokens: 4.45B

They split each message into several categories:

User input
Model reply
Tool output (read file / bash / etc)

Then they measured:

Token share
Usage frequency
Repeated reads

The results are startling.

1 Structural waste: 21.8%

About 21.8% of tokens are structural waste:

Mainly from:

1️⃣ Tool definitions

Many tool definitions are thousands of tokens, but are never invoked in the session.

Yet they are resent in every request round.

2️⃣ Old tool outputs

For example:

Read file: file.py (10KB)

In the next 80 rounds of dialogue:

Every API call
Resends this 10KB
And it participates in attention

But in reality it is never referenced again.

3️⃣ Duplicate configuration

For example:

Skill lists
Agent instructions
Prompt templates

Often copied multiple times.

2. Key observation: tool outputs get “infinitely amplified”

The paper introduces a concept:

Amplification Factor

For example:

Read a file in round 5
File size is 10KB
The session has 85 rounds total

Then the file will be:

Repeatedly sent ≈ 80 times

Amplification factor:

80×

And every time it must participate in attention computation.

3. Solution: “virtual memory” for LLM context

The core system in the paper is called:

Pichay

Its implementation is very clever.

It doesn’t change the model.

Instead:

It inserts a proxy between the client and the LLM API.

Architecture:

IDE / Agent
      ↓
   Pichay Proxy
      ↓
   LLM API

The proxy is responsible for:

Counting tokens
Modifying context
Deleting old content
Restoring it when necessary

The model and client are completely unaware.

4. Core idea: Demand Paging

The authors analogize LLM context to OS memory.

OS	LLM
RAM	Context Window
Page	Tool output
Page fault	Re-reading a file
Page eviction	Deleting old tool output

1 Eviction policy

Very simple:

When a tool output satisfies:

size > 500B
and
unused for more than 4 user turns

it is evicted.

2 How is evicted content represented?

It’s replaced with a placeholder:

[Paged out: Read file.py (8192 bytes). Re-read if needed]

If the model needs it, it will call the tool again.

3 Page Fault

If the model calls again:

Read file.py

The proxy detects:

this content was just evicted

So it:

Re-reads the file
Injects it back into context

This is:

An LLM page fault.

4 Fault-driven Pinning

If a page:

gets evicted → triggers a page fault

it means the eviction was too early.

The system will:

pin this page

and no longer evict it in the future.

Unless:

the file contents change

5. Offline experiment: extremely low wrong-deletion probability

The authors ran an offline replay experiment using 29 full sessions.

They simulated:

if we deleted this content back then,
would it be needed later?

Results:

Simulated deletions: 1,393,000 times
Page faults: 354 times

Page fault rate:

0.0254%

In other words:

99.97% of deletions are safe.

6. Online experiment: tokens down 37%

The authors compared three strategies:

Mode	Strategy
Baseline	No modifications
Trimmed	Trim tool definitions
Compact+Trim	Add paging

Results:

Trimmed
↓
tokens -22.6%

Compact + Trim
↓
tokens -37.1%

And the tasks:

All completed
No quality degradation

In some cases:

Answer quality was even better.

The reason is simple:

Less noise, more focused attention.

7. Real production cases

The authors used Pichay in their own development environment.

Case A: normal development

Original context:

Remaining space: 7%

With paging enabled:

Remaining space: 43%

Deleted:

15 chunks

Page faults:

1 time

Almost perfect.

Case B: extremely long session

681 rounds.

The system showed:

Thrashing

Evictions: 680 times
Page faults: 659 times

Cause: The working set is too large.

It keeps:

delete → read → delete → read

Similar to an OS:

swap storm.

8. The most important theoretical conclusion

In traditional operating systems:

page faults are expensive

So the strategy is:

minimize page faults

But LLMs are the complete opposite.

Cost model of LLMs

If content stays in context:

For every generated token, attention has to be recomputed.

Complexity:

O(n²)

Therefore:

The accumulated cost of keeping tokens is very high.

Whereas re-reading a file has cost:

O(n)

So the conclusion is:

In LLMs, it’s better to tolerate a few more page faults than to keep too many tokens.

9. Why this matters

This paper is really saying something big:

Future LLM systems must have a memory hierarchy.

An analogy:

Level	Meaning
L1	Current context
L2	Working set
L3	Session summary
L4	Long-term memory

Current LLM systems:

Only have L1.

10. Future directions

The paper proposes many research directions worth exploring:

1 Cost-driven eviction

No longer based on:

number of turns

but instead compute:

future attention cost

2 phase-aware memory

Identify:

planning phase
execution phase

Use different strategies for different phases.

3 pin decay

Currently:

one page fault → pin forever

In the future:

the pin gradually decays

4 object-level memory

No longer page by:

files
messages

but by:

decisions
tasks
debug sessions

11. Some thoughts of my own

The biggest value of this paper is:

It completely systematizes the LLM memory problem.

It tells us:

The future architecture of AI Agents may look like:

LLM
 ↓
Memory Manager
 ↓
Context Cache
 ↓
Retrieval System
 ↓
Persistent Memory

rather than:

LLM
 ↓
a huge prompt

In other words:

Prompt Engineering → Context Engineering

12. Summary

The most central findings of this paper:

1️⃣ 21.8% of tokens are structural waste

2️⃣ A simple paging strategy has a page fault rate of only 0.025%

3️⃣ Token usage is reduced by 37%

4️⃣ Answer quality does not drop

5️⃣ LLM cost models are the complete opposite of traditional virtual memory

If future LLM agents really need to:

Work for long periods
Collaborate with many tools
Sustain ongoing project development

then:

A memory hierarchy may become core infrastructure for AI systems.

And this paper will likely become one of the early classic works in this area.

LLMs Also Need Virtual Memory: Demand Paging for the Context Window

LLMs Also Need Virtual Memory: Demand Paging for the Context Window

1. The problem: LLM context wastes a lot of tokens

1 Structural waste: 21.8%

2. Key observation: tool outputs get “infinitely amplified”

3. Solution: “virtual memory” for LLM context

4. Core idea: Demand Paging

1 Eviction policy

2 How is evicted content represented?

3 Page Fault

4 Fault-driven Pinning

5. Offline experiment: extremely low wrong-deletion probability

6. Online experiment: tokens down 37%

7. Real production cases

Case A: normal development

Case B: extremely long session

8. The most important theoretical conclusion

Cost model of LLMs

9. Why this matters

10. Future directions

1 Cost-driven eviction

2 phase-aware memory

3 pin decay

4 object-level memory

11. Some thoughts of my own

12. Summary