Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun et al.
arXiv:2602.21611 (2026)
🔗 Original paper link: https://arxiv.org/abs/2602.21611
✨ One-sentence summary
The problem this paper tackles:
❌ Before: agents stored experience per whole issue, which is easy to misapply
✅ Now: agents store experience per reasoning stage and retrieve it only at the matching stage → more robust, more accurate, more reusable
On SWE-bench Verified:
- Average improvement +4.7% Pass@1
- Up to +8.7% on hard tasks
- Meanwhile significantly reduces result variance
For a benchmark that’s already very hard to improve, this gain is highly meaningful.
🧠 Background: Why is existing Agent “memory” not very usable?
Many SWE Agents today store memory like this:
Finish an issue → write a complete retrospective → next time retrieve “the most similar problem”
The problem is:
Looks similar ≠ same reasoning process
For example:
| issue | essence |
|---|---|
| Can’t click the login button | frontend event binding |
| Logged out after being logged in for a while | backend session timeout |
Both texts contain “login,” but the solution paths are completely different.
This leads to:
- Retrieving the wrong experience
- Strong models being misled rather than helped
- Larger performance fluctuations
🧩 Core idea: Memory must be aligned with the reasoning structure
Software engineering tasks are naturally staged:
- ANALYZE — analyze the problem
- REPRODUCE — reproduce the bug
- EDIT — modify code
- VERIFY — verify the fix
The paper’s core design:
No longer remember by “the whole problem”
but remember by “subtask stages”
🗂 What does the new memory unit look like?
Each memory becomes a triple:
1️⃣ z: Stage label
For example:
- ANALYZE
- EDIT
- VERIFY
2️⃣ d: The local goal at the time (intent)
Structured description:
- What you’re currently trying to solve
- Key clues (function names / error messages)
3️⃣ e: Abstracted experience
Instead of storing the full process, it distills:
✅ transferable strategies
❌ repo-specific details
For example:
If a + b works but b + a errors → check `__radd__`
This is experience that can be reused across projects.
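To make the triple concrete, here is a minimal sketch of one memory card. The class and field names (`Stage`, `MemoryEntry`, `stage`, `intent`, `experience`) are my own illustrative choices, not names from the paper. Only the (z, d, e) structure and the four stage labels come from the summary above.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """The four stages listed above (z)."""
    ANALYZE = "ANALYZE"
    REPRODUCE = "REPRODUCE"
    EDIT = "EDIT"
    VERIFY = "VERIFY"


@dataclass(frozen=True)
class MemoryEntry:
    """One subtask-level memory card; field names are hypothetical."""
    stage: Stage      # z: which stage the experience belongs to
    intent: str       # d: the local goal plus key clues
    experience: str   # e: abstracted, transferable lesson


# Example card built from the lesson quoted above.
card = MemoryEntry(
    stage=Stage.ANALYZE,
    intent="Find why an operator fails for one operand order; clue: error on b + a",
    experience="If a + b works but b + a errors, check __radd__.",
)
```

Note that the card carries no file paths or repo names, which is what keeps it reusable across projects.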
🔍 Retrieval mechanism: Search only in the “drawer for the current stage”
Traditional approach:
Do semantic similarity matching over the entire memory base ❌
The paper’s approach: filter memory by the current stage label first, then match on intent only within that stage ✅
Effects:
- Avoid cross-stage interference
- Even Top-1 retrieval is stable enough
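The "drawer" idea can be sketched in a few lines. This is an assumption-laden toy, not the paper's implementation: cards are plain dicts, and a bag-of-words cosine score stands in for whatever semantic similarity the real system uses. The function names (`bow`, `cosine`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(memory, stage, intent, k=1):
    """Filter by stage first, then rank by intent similarity inside that stage."""
    drawer = [m for m in memory if m["stage"] == stage]
    query = bow(intent)
    ranked = sorted(
        drawer, key=lambda m: cosine(query, bow(m["intent"])), reverse=True
    )
    return ranked[:k]


memory = [
    {"stage": "ANALYZE", "intent": "login button click does nothing",
     "experience": "Inspect event binding before backend code."},
    {"stage": "VERIFY", "intent": "login button click is ignored",
     "experience": "Re-run the original reproduction, not only unit tests."},
    {"stage": "ANALYZE", "intent": "session expires after idle time",
     "experience": "Check timeout configuration first."},
]

hits = retrieve(memory, "ANALYZE", "login button click is ignored")
```

In this toy data the VERIFY card is the closest textual match to the query, so a global similarity search would return it. The stage filter removes it before ranking, and the ANALYZE card about event binding comes back instead.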
🧠 How is experience produced?
At the end of each subtask, do a reflection:
Success → summarize the pattern
Failure → summarize anti-patterns
Then abstract into an experience card and write it into memory.
This is an online self-evolving system:
The more tasks it does → the more the Agent resembles a seasoned employee
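The write path might look roughly like this. Everything here is a sketch under assumptions: `reflect_and_store` and `toy_summarize` are hypothetical names, and the summarizer is a trivial placeholder for what would presumably be an LLM call that abstracts the trajectory.

```python
def reflect_and_store(memory, stage, intent, trajectory, succeeded, summarize):
    """At the end of a subtask, distill a lesson and append it to memory.

    `summarize` is any callable (trajectory, kind) -> str; in a real agent
    it would do the abstraction step.
    """
    kind = "pattern" if succeeded else "anti-pattern"
    card = {
        "stage": stage,
        "intent": intent,
        "kind": kind,
        "experience": summarize(trajectory, kind),
    }
    memory.append(card)
    return card


def toy_summarize(trajectory, kind):
    """Placeholder abstraction: keeps only the last step of the trajectory."""
    prefix = "Do" if kind == "pattern" else "Avoid"
    return f"{prefix}: {trajectory[-1]}"


memory = []
card = reflect_and_store(
    memory,
    stage="EDIT",
    intent="make b + a work for a custom numeric type",
    trajectory=["opened the class", "patched only the forward operator"],
    succeeded=False,
    summarize=toy_summarize,
)
```

Because the card is appended as soon as the subtask ends, later subtasks in the same run can already retrieve it, which is the "online" part.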
📊 Experimental results
Overall performance
On SWE-bench Verified:
| model | improvement |
|---|---|
| Gemini 2.5 Pro | +6.8% |
| Average | +4.7% |
Impact on strong models
Instance-level memory:
- Performance drops
- Variance increases ❌
Subtask-level memory:
- Performance improves
- More stable ✅
This shows:
Strong models aren’t afraid of having no memory
They’re afraid of wrong memory
Biggest gains on hard tasks
Difficulty bucketed by trajectory length:
| difficulty | improvement |
|---|---|
| Easy | +1.8% |
| Hard | +8.7% |
Because hard tasks rely most on experience reuse.
🧪 Key ablation studies
❌ Only force step-by-step thinking (no memory storage)
Only +1%
→ The real boost comes from “experience reuse”
❌ No stage-based retrieval
Improvement shrinks to +1.6%
→ Stage alignment is key
❌ Store raw trajectories
Improvement +1.2%
→ Must do abstraction
🏗 Engineering perspective: Why does this matter?
The most valuable part of this paper is:
No need to train a new model
System design alone can improve performance
Real systems can reuse this directly:
- Cursor / Devin-like Agents
- Internal enterprise code Agents
- Automated fix bots
What it means for teams
AI is no longer:
An intern starting from scratch every time
but instead:
A seasoned employee who accumulates project experience
Changes it brings:
- Faster fixes for recurring issues
- More stable CI auto-fixes
- Lower onboarding cost
🌍 This paradigm can transfer to all Agents
Not limited to coding:
Data analysis Agents
Stages:
- Understand requirements
- Find data
- Model
- Interpret results
Customer support Agents
Stages:
- Clarify the issue
- Check policy
- Provide a solution
- Follow up
Research Agents
Stages:
- Retrieve literature
- Propose hypotheses
- Design experiments
- Analyze results
⚠️ Limitations
What the paper hasn’t solved yet:
1️⃣ Unbounded memory growth
Needs:
- eviction mechanisms
- compression
- hierarchical memory
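As one possible shape for an eviction mechanism (my own illustration, not something the paper proposes), each stage drawer could be capped and drop its least recently used card. `BoundedDrawer` and its methods are hypothetical names.

```python
from collections import OrderedDict


class BoundedDrawer:
    """A capped store for one stage; evicts the least recently used card."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._cards = OrderedDict()  # intent -> experience, oldest first

    def put(self, intent, experience):
        self._cards[intent] = experience
        self._cards.move_to_end(intent)       # mark as most recently used
        while len(self._cards) > self.capacity:
            self._cards.popitem(last=False)   # drop the oldest entry

    def get(self, intent):
        if intent not in self._cards:
            return None
        self._cards.move_to_end(intent)       # a read also counts as use
        return self._cards[intent]

    def __len__(self):
        return len(self._cards)


drawer = BoundedDrawer(capacity=2)
drawer.put("a", "lesson a")
drawer.put("b", "lesson b")
drawer.get("a")              # "a" is now more recent than "b"
drawer.put("c", "lesson c")  # evicts "b"
```

Recency is only one signal. A real system would probably also weigh how often a card actually helped.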
2️⃣ Pollution from incorrect experience
Current system:
If it’s written in, it’s trusted
In the future it must address:
- memory credibility
- automatic down-weighting
- version control
3️⃣ Subtask partitioning is still manually designed
Future direction:
Automatically learn the stage structure
🔮 Future directions worth watching
The three with the most potential, in my view:
1️⃣ A memory governance system
Like:
- Git for memory
- memory diff / rollback
2️⃣ A shared experience base across multiple Agents
Team-level AI knowledge accumulation.
3️⃣ A long-running code steward
In effect:
A true seasoned AI employee
🏁 My take
The value of this paper is:
It doesn’t build a bigger model
It answers a more fundamental question:
How should an Agent “learn experience”?
That is a key step on the path:
LLM → real agentic intelligence
📌 Takeaway
If you’re building an Agent system, this paper offers a design principle you can implement directly:
✅ Correct memory design
- Memory unit = reasoning subtasks
- Store only abstract experience
- Retrieval must be stage-aligned
Instead of:
- Storing whole conversations
- Global similarity retrieval
🧵 Finally
If you’re working on:
- AI coding tools
- Enterprise code Agents
- Automated repair systems
This paper is the kind of work that:
Can directly change system design
Not just “+1% on the leaderboard.”