Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun et al.
arXiv:2602.21611 (2026)
🔗 Original paper link: https://arxiv.org/abs/2602.21611

✨ One-sentence summary

What this paper changes:

❌ Before: AI stored experience per whole problem, so a retrieved memory was easy to misapply
✅ Now: AI stores experience per reasoning step and retrieves it only at the matching stage → more robust, more accurate, more reusable

On SWE-bench Verified:

+4.7% Pass@1 on average
Up to +8.7% on hard tasks
Significantly lower result variance

For a benchmark that’s already very hard to improve, this gain is highly meaningful.

🧠 Background: Why is existing Agent “memory” not very usable?

Many SWE Agents today store memory like this:

Finish an issue → write a complete retrospective → next time retrieve “the most similar problem”

The problem is:

Looks similar ≠ same reasoning process

For example:

Issue	Root cause
Can’t click the login button	frontend event binding
Logged out after being logged in for a while	backend session timeout

Both texts contain “login,” but the solution paths are completely different.
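A toy illustration of this failure mode, assuming a simple word-overlap score stands in for a real embedding model. The stored issue texts, the new issue, and the `overlap` helper are all made up for this sketch:

```python
import re

def words(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(a, b):
    """Jaccard word overlap in [0, 1]."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

# Hypothetical instance-level memory: issue text -> what the fix was about.
stored = {
    "cannot click the login button": "frontend event binding",
    "session cookie expires too early": "backend session timeout",
}

# The new issue is really a session-timeout problem.
new_issue = "logged out after login for a while"

best = max(stored, key=lambda text: overlap(new_issue, text))
print(best, "->", stored[best])
# prints: cannot click the login button -> frontend event binding
```

The shared word "login" is enough to pull up the frontend memory, even though the backend one is the relevant experience.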

This leads to:

Retrieving the wrong experience
Strong models being misled rather than helped
Larger performance fluctuations

🧩 Core idea: Memory must be aligned with the reasoning structure

Software engineering tasks are naturally staged:

ANALYZE — analyze the problem
REPRODUCE — reproduce the bug
EDIT — modify code
VERIFY — verify the fix

The paper’s core design:

No longer remember by “the whole problem”
but remember by “subtask stages”

🗂 What does the new memory unit look like?

Each memory becomes a triple:

(z, d, e)

1️⃣ z: Stage label

For example:

ANALYZE
EDIT
VERIFY

2️⃣ d: The local goal at the time (intent)

Structured description:

What you’re currently trying to solve
Key clues (function names / error messages)

3️⃣ e: Abstracted experience

Instead of storing the full process, it distills:

✅ transferable strategies
❌ repo-specific details

For example:

If a+b works but b+a errors → check __radd__

This is experience that can be reused across projects.
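One way to picture the (z, d, e) triple is as a small record type. This is a minimal sketch, not the paper's implementation: the `Stage` and `MemoryEntry` names and the example intent text are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    """The four stages named in this post."""
    ANALYZE = "ANALYZE"
    REPRODUCE = "REPRODUCE"
    EDIT = "EDIT"
    VERIFY = "VERIFY"

@dataclass(frozen=True)
class MemoryEntry:
    z: Stage  # stage label
    d: str    # local goal (intent), including key clues
    e: str    # abstracted, repo-independent experience

entry = MemoryEntry(
    z=Stage.ANALYZE,
    d="Addition fails only when the custom object is the right operand",
    e="If a+b works but b+a errors, check __radd__",
)
print(entry.z.name, "|", entry.e)
```

Keeping `d` and `e` as separate fields matters: `d` is what gets matched at retrieval time, while `e` is what gets handed to the agent.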

🔍 Retrieval mechanism: Search only in the “drawer for the current stage”

Traditional approach:

Do semantic similarity matching over the entire memory base ❌

The paper’s approach:

Filter by stage first
Then do semantic matching

Effects:

Avoid cross-stage interference
Even Top-1 retrieval is stable enough
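The two-step retrieval can be sketched as follows. This is a hedged illustration only: the `similarity` function is a word-overlap placeholder for whatever semantic matcher a real system uses, and the `retrieve` helper and memory contents are hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    z: str  # stage label
    d: str  # intent
    e: str  # abstracted experience

def similarity(a, b):
    """Jaccard word overlap; placeholder for embedding similarity."""
    wa = set(re.findall(r"[a-z_]+", a.lower()))
    wb = set(re.findall(r"[a-z_]+", b.lower()))
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def retrieve(memory, intent, stage=None):
    """Top-1 retrieval. With `stage` set, filter by stage before matching."""
    candidates = [m for m in memory if stage is None or m.z == stage]
    if not candidates:
        return None
    return max(candidates, key=lambda m: similarity(intent, m.d))

memory = [
    MemoryEntry("EDIT", "fix login handler null check",
                "Guard optional fields before dereferencing"),
    MemoryEntry("VERIFY", "confirm login fix passes tests",
                "Run the originally failing test first, then the full suite"),
    MemoryEntry("VERIFY", "confirm parser change does not break imports",
                "Import the package in a fresh interpreter"),
]

intent = "check login handler fix"
print(retrieve(memory, intent).z)                  # global search -> EDIT
print(retrieve(memory, intent, stage="VERIFY").z)  # stage-first  -> VERIFY
```

In this toy case the global search returns an EDIT-stage memory while the agent is verifying; filtering by stage first removes that cross-stage hit before any similarity is computed.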

🧠 How is experience produced?

At the end of each subtask, do a reflection:

Success → summarize the pattern
Failure → summarize anti-patterns

Then abstract into an experience card and write it into memory.

This is an online self-evolving system:

The more tasks it does → the more the Agent resembles a seasoned employee
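A rough sketch of this reflect-then-write loop, under stated assumptions: the `summarize` callable stands in for the LLM call that abstracts a trajectory, and `Subtask`, `ExperienceCard`, and `stub_summarize` are hypothetical names, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    stage: str
    intent: str
    trajectory: str   # raw log of what the agent did
    succeeded: bool

@dataclass
class ExperienceCard:
    z: str  # stage label
    d: str  # intent
    e: str  # abstracted experience

def reflect(subtask: Subtask, summarize: Callable[[str], str]) -> ExperienceCard:
    """Turn a finished subtask into a card: pattern on success, anti-pattern on failure."""
    kind = "Pattern" if subtask.succeeded else "Anti-pattern"
    lesson = summarize(subtask.trajectory)
    return ExperienceCard(subtask.stage, subtask.intent, f"{kind}: {lesson}")

def stub_summarize(trajectory: str) -> str:
    # Placeholder: a real system would ask an LLM to strip repo-specific detail.
    return "write a minimal failing test before touching the code"

memory = []
finished = Subtask("REPRODUCE", "reproduce crash on empty input",
                   "raw agent log goes here", succeeded=True)
memory.append(reflect(finished, stub_summarize))
print(memory[0].e)
```

Only the card is stored; the raw trajectory is discarded once the lesson has been extracted.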

📊 Experimental results

Overall performance

On SWE-bench Verified:

Model	Improvement (Pass@1)
Gemini 2.5 Pro	+6.8%
Average	+4.7%

Impact on strong models

Instance-level memory:

Performance drops
Variance increases ❌

Subtask-level memory:

Performance improves
More stable ✅

This shows:

Strong models are not hurt by having no memory
They are hurt by having the wrong memory

Biggest gains on hard tasks

Grouped by trajectory length (used here as a proxy for difficulty):

Difficulty	Improvement
Easy	+1.8%
Hard	+8.7%

A likely reason: hard tasks depend most on experience reuse.

🧪 Key ablation studies

❌ Only force step-by-step thinking (no memory storage)

Only +1%

→ The real boost comes from “experience reuse”

❌ No stage-based retrieval

Improvement shrinks to +1.6%

→ Stage alignment is key

❌ Store raw trajectories

Improvement +1.2%

→ Must do abstraction

🏗 Engineering perspective: Why does this matter?

The most valuable part of this paper is:

No need to train a new model
System design alone can improve performance

Real systems can reuse this directly:

Cursor / Devin-like Agents
Internal enterprise Code Agent
Automated fix bots

What it means for teams

AI is no longer:

An intern starting from scratch every time

but instead:

A seasoned employee who accumulates project experience

Changes it brings:

Faster fixes for recurring issues
More stable CI auto-fixes
Lower onboarding cost

🌍 This paradigm can transfer to other Agents

Not limited to coding:

Data analysis Agents

Stages:

Understand requirements
Find data
Model
Interpret results

Customer support Agents

Stages:

Clarify the issue
Check policy
Provide a solution
Follow up

Research Agents

Stages:

Retrieve literature
Propose hypotheses
Design experiments
Analyze results

⚠️ Limitations

What the paper hasn’t solved yet:

1️⃣ Unbounded memory growth

Needs:

eviction mechanisms
compression
hierarchical memory
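As one possible shape for an eviction mechanism (not something the paper proposes), here is a minimal sketch that caps each stage's drawer and evicts the least recently used entry. The `BoundedStageMemory` class and its capacity parameter are assumptions made for illustration.

```python
from collections import OrderedDict

class BoundedStageMemory:
    """Per-stage memory with a fixed capacity and LRU eviction."""

    def __init__(self, capacity_per_stage):
        self.capacity = capacity_per_stage
        self.drawers = {}  # stage -> OrderedDict(intent -> experience)

    def write(self, stage, intent, experience):
        drawer = self.drawers.setdefault(stage, OrderedDict())
        drawer[intent] = experience
        drawer.move_to_end(intent)          # mark as most recently used
        while len(drawer) > self.capacity:
            drawer.popitem(last=False)      # evict least recently used

    def read(self, stage, intent):
        drawer = self.drawers.get(stage)
        if drawer is None or intent not in drawer:
            return None
        drawer.move_to_end(intent)          # a read counts as a use
        return drawer[intent]

mem = BoundedStageMemory(capacity_per_stage=2)
mem.write("EDIT", "intent-a", "lesson A")
mem.write("EDIT", "intent-b", "lesson B")
mem.read("EDIT", "intent-a")                # refresh A
mem.write("EDIT", "intent-c", "lesson C")   # evicts B, the least recently used
print(list(mem.drawers["EDIT"]))            # ['intent-a', 'intent-c']
```

Exact-match lookup by intent is a simplification; a real system would pair the cap with similarity-based retrieval.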

2️⃣ Pollution from incorrect experience

Current system:

Once an experience is written, it is trusted unconditionally

In the future it must address:

memory credibility
automatic down-weighting
version control
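Automatic down-weighting could, for instance, track how often a memory was followed by a successful subtask and scale its retrieval score accordingly. This is a speculative sketch, not part of the paper: `TrackedMemory`, `rank_score`, and the smoothed success-rate formula are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class TrackedMemory:
    experience: str
    uses: int = 0
    successes: int = 0

    def record(self, succeeded):
        """Log the outcome of a subtask that used this memory."""
        self.uses += 1
        self.successes += int(succeeded)

    @property
    def trust(self):
        # Laplace-smoothed success rate: an unused memory starts at 0.5.
        return (self.successes + 1) / (self.uses + 2)

def rank_score(similarity, memory):
    """Retrieval score = semantic similarity scaled by trust."""
    return similarity * memory.trust

fresh = TrackedMemory("Run the failing test first")
misleading = TrackedMemory("Always patch the caller, never the callee")
for _ in range(3):
    misleading.record(succeeded=False)

print(fresh.trust, misleading.trust)   # 0.5 0.2
# A closer but repeatedly harmful match now ranks below a weaker, untested one.
print(rank_score(0.9, misleading) < rank_score(0.6, fresh))   # True
```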

3️⃣ Subtask partitioning is still manually designed

Future direction:

Automatically learn the stage structure

🔮 Future directions worth watching

The three with the most potential, in my view:

1️⃣ A memory governance system

Like:

Git for memory
memory diff / rollback

2️⃣ A shared experience base across multiple Agents

Team-level AI knowledge accumulation.

3️⃣ A long-running code steward

A true:

AI seasoned employee

🏁 My take

The value of this paper is:

It doesn’t build a bigger model
It answers a more fundamental question:

How should an Agent “learn experience”?

This is a key step on the path:

LLM → real agentic intelligence

📌 Takeaway

If you’re building an Agent system, this paper offers a design principle you can implement directly:

✅ Correct memory design

Memory unit = reasoning subtasks
Store only abstract experience
Retrieval must be stage-aligned

Instead of:

Storing whole conversations
Global similarity retrieval

🧵 Finally

If you’re working on:

AI coding tools
Enterprise code Agents
Automated repair systems

This paper is the kind of work that:

Can directly change system design

Not just “+1% on the leaderboard.”

Teaching Software Engineering Agents to “Review Step by Step”: An Explanation of Subtask-Level Memory Mechanisms
