From MDP to Dec-POMDP: Understanding the Differences Between Single-Agent and Multi-Agent Decision Models

In reinforcement learning and multi-agent systems, you often encounter a sequence of progressively more general models: MDP, POMDP, Dec-POMDP.

They all describe “how agents make decisions continuously in an environment,” but they make different assumptions about the number of agents, state observability, and how information is shared.

This article mainly compares MDP and Dec-POMDP, and also introduces POMDP as a bridge in between to help understand how they evolve.

1. MDP: Single-agent decision-making under full observability

MDP stands for Markov Decision Process.

It describes:

How a single agent, when it can observe the full environment state, chooses actions to maximize long-term cumulative reward.

An MDP can usually be written as:

\mathcal{M} = (S, A, P, R, \gamma)

Where:

$S$ : the set of states, indicating what states the environment may be in;
$A$ : the set of actions, indicating what actions the agent can take;
$P(s' \mid s, a)$ : the state transition probability, indicating the probability of reaching state $s'$ after taking action $a$ in state $s$ ;
$R(s, a)$ : the reward function, indicating the immediate reward obtained by taking action $a$ in state $s$ ;
$\gamma$ : the discount factor, used to measure the importance of future rewards.
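
To make the tuple concrete, here is a minimal sketch of a two-state MDP written as plain Python dictionaries. The state names, action names, and all numbers are invented for illustration; the helper `is_valid_mdp` is likewise a hypothetical name, and it only checks that each $P(\cdot \mid s, a)$ sums to 1 and that $\gamma$ lies in $[0, 1)$.

```python
# A toy MDP (S, A, P, R, gamma). All names and numbers are illustrative assumptions.
S = ["cool", "hot"]      # state set
A = ["slow", "fast"]     # action set
gamma = 0.9              # discount factor

# P[s][a] maps each next state s' to P(s' | s, a).
P = {
    "cool": {"slow": {"cool": 1.0, "hot": 0.0},
             "fast": {"cool": 0.5, "hot": 0.5}},
    "hot":  {"slow": {"cool": 0.5, "hot": 0.5},
             "fast": {"cool": 0.0, "hot": 1.0}},
}

# R[s][a] is the immediate reward R(s, a).
R = {
    "cool": {"slow": 1.0, "fast": 2.0},
    "hot":  {"slow": 1.0, "fast": -10.0},
}


def is_valid_mdp(S, A, P, R, gamma):
    """Check that every P(. | s, a) sums to 1, every R(s, a) exists, and 0 <= gamma < 1."""
    for s in S:
        for a in A:
            if abs(sum(P[s][a].values()) - 1.0) > 1e-9:
                return False
            if a not in R[s]:
                return False
    return 0.0 <= gamma < 1.0


print(is_valid_mdp(S, A, P, R, gamma))
```

Nested dictionaries keep the example readable; for larger problems the same information is usually stored in arrays indexed by state and action.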

The core assumption of an MDP is the Markov property:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)

It means:

What happens next depends only on the current state and current action, not on earlier history.

In other words, as long as the current state $s_t$ already contains enough information, the agent does not need to care about what happened in the past.

2. The Bellman equation: The core idea of MDPs

In an MDP, a very important formula is the Bellman optimality equation:

V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right]

This formula says:

The optimal value of the current state equals, among all available actions, the maximum of “current reward + discounted expected future value.”

Breaking it down: $V(s)$ denotes the value of state $s$, that is, the long-term expected return obtained by starting from state $s$ and following the optimal policy thereafter.

$R(s, a)$ denotes the reward received immediately at this step.

And this term:

\sum_{s'} P(s' \mid s, a) V(s')

means that after taking action $a$ , the future may transition into multiple different states. Each next state $s'$ has its own probability $P(s' \mid s, a)$ and its own value $V(s')$ , so we take a weighted average over these future state values.

Multiply by the discount factor:

\gamma \sum_{s'} P(s' \mid s, a) V(s')

to get the discounted expected future value.

So the entire expression:

R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s')

means:

The total value brought by choosing action $a$ in state $s$ .

Finally, the outer $\max_a$ means:

Among all possible actions, choose the one with the greatest total value.

Thus, the Bellman equation is essentially saying:

Whether an action is good depends not only on how much immediate reward it brings, but also on what kind of future states it will lead the agent to.

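This equation underpins dynamic programming methods such as value iteration and value-based reinforcement learning methods such as Q-learning and DQN. Actor-critic methods such as PPO also rely on value estimates of this kind, learned for the current policy rather than the optimal one.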
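
As a rough illustration of how the Bellman optimality equation is used, the sketch below runs value iteration: it applies the right-hand side of the equation repeatedly until $V$ stops changing. The two-state MDP, its numbers, and the function names `q_value`, `value_iteration`, and `greedy_policy` are assumptions made for this example only.

```python
# Toy MDP; all names and numbers are illustrative assumptions.
S = ["cool", "hot"]
A = ["slow", "fast"]
gamma = 0.9
P = {
    "cool": {"slow": {"cool": 1.0}, "fast": {"cool": 0.5, "hot": 0.5}},
    "hot":  {"slow": {"cool": 0.5, "hot": 0.5}, "fast": {"hot": 1.0}},
}
R = {
    "cool": {"slow": 1.0, "fast": 2.0},
    "hot":  {"slow": 1.0, "fast": -10.0},
}


def q_value(s, a, V):
    """R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s'): the bracketed term of the equation."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())


def value_iteration(tol=1e-8):
    """Apply the Bellman optimality backup until the largest change falls below tol."""
    V = {s: 0.0 for s in S}
    while True:
        new_V = {s: max(q_value(s, a, V) for a in A) for s in S}
        if max(abs(new_V[s] - V[s]) for s in S) < tol:
            return new_V
        V = new_V


def greedy_policy(V):
    """For each state, pick the action that attains the max in the Bellman equation."""
    return {s: max(A, key=lambda a: q_value(s, a, V)) for s in S}


V = value_iteration()
print(V)
print(greedy_policy(V))
```

In this toy problem the greedy policy accepts a smaller immediate reward in the "hot" state to avoid the large penalty, which is the trade-off between immediate reward and future value described above.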

3. POMDP: When the agent cannot see the full state

MDPs make a strong assumption: the agent can know the complete current state.

But in reality, many scenarios do not satisfy this condition.

For example:

A self-driving car cannot know the true intentions of all other vehicles;
A robot can only see a local part of the environment through a camera;
A game AI may not be able to see all enemies on the map;
A recommender system cannot fully know a user’s true preferences.

In such cases, you need a POMDP.

POMDP stands for Partially Observable Markov Decision Process.

Compared with an MDP, a POMDP adds one core concept: Observation.

A POMDP can usually be written as:

\mathcal{M} = (S, A, P, R, \Omega, O, \gamma)

Where:

$S$ : the set of true states;
$A$ : the set of actions;
$P$ : the state transition function;
$R$ : the reward function;
$\Omega$ : the set of observations;
$O(o \mid s', a)$ : the observation probability function;
$\gamma$ : the discount factor.

In a POMDP, the agent cannot see the true state $s$ ; it can only see a partial observation $o$ .

Therefore, the policy is no longer written directly as:

\pi(a \mid s)

but more like:

\pi(a \mid o)

More precisely, reacting only to the latest observation is usually not enough. The agent typically maintains a belief, a probability distribution over the true state conditioned on its history of observations and actions:

b(s) = P(s \mid h)

where $h$ denotes the history of observations and actions experienced by the agent.

A belief can be understood as:

The agent’s probability distribution over “what the current true state is.”

For example, a robot moves in a maze, but its sensors are not accurate enough. It might believe:

b(s_A) = 0.7

b(s_B) = 0.2

b(s_C) = 0.1

That is, it thinks there is a $70\%$ chance it is at location $A$, a $20\%$ chance it is at location $B$, and a $10\%$ chance it is at location $C$.

This is why POMDPs are harder than MDPs: the agent must not only make decisions, but also infer what state it is actually in.
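
The inference step can be sketched with the standard Bayes-filter belief update, $b'(s') \propto O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)$. The code below starts from the belief in the maze example; the "stay" action, the "wall"/"open" sensor readings, and their probabilities are assumptions invented for this illustration.

```python
def belief_update(b, a, o, P, O):
    """Return b'(s') proportional to O(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    unnormalized = {}
    for s_next in b:
        # Prediction step: where could the agent be after taking action a?
        predicted = sum(P[s][a].get(s_next, 0.0) * b[s] for s in b)
        # Correction step: weight by how likely observation o is in s_next.
        unnormalized[s_next] = O[s_next][a].get(o, 0.0) * predicted
    z = sum(unnormalized.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: v / z for s, v in unnormalized.items()}


# Prior belief from the maze example.
b = {"A": 0.7, "B": 0.2, "C": 0.1}

# Assumed transition model: the action "stay" leaves the robot where it is.
P = {s: {"stay": {s: 1.0}} for s in b}

# Assumed sensor model O[s'][a][o]: probability of each reading at each location.
O = {
    "A": {"stay": {"wall": 0.2, "open": 0.8}},
    "B": {"stay": {"wall": 0.9, "open": 0.1}},
    "C": {"stay": {"wall": 0.5, "open": 0.5}},
}

new_b = belief_update(b, "stay", "wall", P, O)
print(new_b)
```

Under these assumed numbers, a "wall" reading shifts probability mass from location A toward location B, because that reading is much more likely at B.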

4. Dec-POMDP: Multiple agents, each seeing only local information

Dec-POMDP stands for Decentralized Partially Observable Markov Decision Process.

It can be understood as the multi-agent version of a POMDP.

If we compare them in one sentence:

MDP: one agent, can see the full state;
POMDP: one agent, can only see a partial observation;
Dec-POMDP: multiple agents, and each agent can only see its own local observation.

A Dec-POMDP describes:

How multiple agents cooperate to achieve a shared goal when they cannot see the full environment and also cannot fully know each other’s information.

A Dec-POMDP can usually be written as:

\mathcal{M} = (S, \{A_i\}_{i=1}^{n}, T, R, \{\Omega_i\}_{i=1}^{n}, O, \gamma)

Where:

$S$ : the set of global true states;
$A_i$ : the action set of the $i$ -th agent;
$T$ : the state transition function, giving the probability $P(s' \mid s, a_1, \dots, a_n)$ of the next state $s'$ ;
$R$ : the team-shared reward function;
$\Omega_i$ : the observation set of the $i$ -th agent;
$O$ : the observation probability function;
$\gamma$ : the discount factor;
$n$ : the number of agents.

In a Dec-POMDP, each agent $i$ has its own action $a_i$ .

All agents’ actions combined form a joint action:

\mathbf{a} = (a_1, a_2, \dots, a_n)

The environment transitions to the next state $s'$ based on the current true state $s$ and the joint action $\mathbf{a}$ :

P(s' \mid s, \mathbf{a})

Or equivalently:

P(s' \mid s, a_1, a_2, \dots, a_n)

Each agent receives its own local observation:

o_i \in \Omega_i

The entire team receives a shared reward:

R(s, \mathbf{a})

The goal is to maximize the team’s long-term cumulative reward:

\mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mathbf{a}_t) \right]
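
For a single sampled trajectory, the quantity inside the expectation is just a discounted sum of the shared rewards, and the expectation can be approximated by averaging over many trajectories. The sketch below assumes finite trajectories (the infinite sum is truncated) and uses made-up reward sequences; `discounted_return` and `estimate_objective` are hypothetical helper names.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite trajectory of shared team rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def estimate_objective(trajectories, gamma):
    """Monte Carlo estimate of the objective: average return over sampled trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


# Two made-up reward sequences standing in for sampled team trajectories.
samples = [[0.0, 0.0, 1.0], [0.0, 1.0]]
print(estimate_objective(samples, gamma=0.5))
```

Because the reward is shared, there is one return per trajectory for the whole team rather than one per agent.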

5. What does “Decentralized” actually mean?

In a Dec-POMDP, the word Decentralized does not mean that centralized information is forbidden during training; rather, it means:

At execution time, each agent must make decisions independently based only on its own local information.

In an MDP, the policy can usually be written as:

\pi(a \mid s)

meaning:

Given the full state $s$ , choose action $a$ .

But in a Dec-POMDP, agent $i$ cannot see the full state $s$ ; it can only choose an action based on its own local history $h_i$ :

\pi_i(a_i \mid h_i)

Here $h_i$ can be expressed as:

h_i = (o_i^0, a_i^0, o_i^1, a_i^1, \dots, o_i^t)

That is, agent $i$ can only make decisions based on what it has seen, what it has done, and what it is currently observing.

The local policies of multiple agents combine to form a joint policy:

\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_n)

A Dec-POMDP asks us to find an optimal joint policy:

\boldsymbol{\pi}^* = \arg\max_{\boldsymbol{\pi}} \mathbb{E}_{\boldsymbol{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mathbf{a}_t) \right]

This is the core problem of Dec-POMDP.
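
The information structure of a joint policy can be sketched in a few lines: each local policy is a function of that agent's own history only, and the joint action is simply the tuple of their outputs. The observation labels ("empty", "goal", "blocked"), the action labels, and the two hand-written policies are assumptions for illustration, not learned or optimal policies.

```python
def policy_1(h):
    """Agent 1: act on the latest observation only (the last element of its history)."""
    return "pick" if h[-1] == "goal" else "search"


def policy_2(h):
    """Agent 2: use memory, detouring if it has ever observed a blocked passage."""
    observations = h[0::2]   # even slots of (o^0, a^0, o^1, a^1, ...) are observations
    return "detour" if "blocked" in observations else "forward"


def joint_action(policies, histories):
    """Each agent i maps its own h_i to a_i; no agent reads another agent's history."""
    return tuple(pi(h) for pi, h in zip(policies, histories))


def extend(h, action, next_obs):
    """Append the agent's own action and its next local observation to its history."""
    return h + (action, next_obs)


policies = (policy_1, policy_2)
h_1, h_2 = ("empty",), ("blocked",)

a = joint_action(policies, (h_1, h_2))
h_1 = extend(h_1, a[0], "goal")
h_2 = extend(h_2, a[1], "empty")
a_next = joint_action(policies, (h_1, h_2))
print(a, a_next)
```

Agent 2 keeps detouring at the second step even though its latest observation is "empty", which shows why policies are conditioned on the history $h_i$ rather than on the latest observation alone.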

6. An intuitive example: Warehouse robot cooperation

Suppose there are two robots in a warehouse searching for and transporting a package.

If modeled as an MDP, the problem might be:

There is only one robot;
It knows the full warehouse map;
It knows the exact locations of itself and the package;
At each step it chooses up, down, left, or right;
The goal is to find the package using the shortest path.

Then the robot’s policy can be written as:

\pi(a \mid s)

i.e., choose action $a$ based on the full state $s$ .

But in a Dec-POMDP, the situation becomes:

There are multiple robots;
Each robot can only see the area near itself;
Robots cannot share all information in real time;
Each robot does not know exactly what the other robots have observed;
They need to cooperate to complete the search and transport task;
The final reward is shared by the whole team.

For example:

Robot A may have seen the package;
Robot B did not see the package, but saw that a passage is blocked;
A does not necessarily know what B observed;
B does not necessarily know what A observed;
Yet they still must cooperate through their own local decisions.

In this case, the joint action of the two robots can be written as:

\mathbf{a} = (a_A, a_B)

The environment transition can be written as:

P(s' \mid s, a_A, a_B)

The shared reward can be written as:

R(s, a_A, a_B)

The policies of robot A and robot B are respectively:

\pi_A(a_A \mid h_A)

\pi_B(a_B \mid h_B)

where $h_A$ and $h_B$ are the local histories of the two robots.

This is the core difficulty of Dec-POMDP:

Each agent has only incomplete information, but together they must produce globally coordinated behavior.
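
A heavily simplified version of the two-robot setting can be simulated to show how the pieces fit together. The sketch assumes a one-dimensional corridor with cells 0 to 4, a package hidden in cell 2, robots that only notice the package when standing on it, and hand-written sweep policies; it covers only the search part of the task, and every name and number in it is an assumption.

```python
PACKAGE = 2    # hidden package cell in a corridor with cells 0..4 (assumed layout)
GAMMA = 0.9    # assumed discount factor


def observe(pos):
    """Local observation: the robot's own cell and whether the package is in that cell."""
    return (pos, pos == PACKAGE)


def step(state, joint_action):
    """Move both robots according to the joint action; the reward is shared by the team."""
    new_state = tuple(min(4, max(0, p + a)) for p, a in zip(state, joint_action))
    reward = 1.0 if PACKAGE in new_state else -0.1
    return new_state, reward


def policy_A(h_A):
    """Robot A sweeps right until it stands on the package."""
    _, on_package = h_A[-1]
    return 0 if on_package else 1


def policy_B(h_B):
    """Robot B sweeps left until it stands on the package."""
    _, on_package = h_B[-1]
    return 0 if on_package else -1


def rollout(start=(0, 4), max_steps=10):
    """Run one episode and return the final state and the discounted team return."""
    state = start
    h_A, h_B = (observe(state[0]),), (observe(state[1]),)
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = (policy_A(h_A), policy_B(h_B))   # each action uses only that robot's history
        state, r = step(state, a)
        ret += discount * r
        discount *= GAMMA
        h_A += (a[0], observe(state[0]))
        h_B += (a[1], observe(state[1]))
        if r > 0:                            # package found, episode ends
            break
    return state, ret


print(rollout())
```

Neither policy ever reads the other robot's position or observations, yet starting the two sweeps from opposite ends covers the corridor faster than one robot could alone.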

7. The key differences between MDP and Dec-POMDP

Dimension	MDP	Dec-POMDP
Number of agents	Single agent	Multiple agents
State observability	Fully observable	Partially observable
Decision-making	Decisions based on the global state	Each agent makes decisions based on its own local observations or local history
Action form	Single action $a$	Joint action $\mathbf{a} = (a_1, a_2, \dots, a_n)$
Reward form	Single-agent reward $R(s, a)$	Team-shared reward $R(s, \mathbf{a})$
Policy form	$\pi(a \mid s)$	$\pi_i(a_i \mid h_i)$
Main difficulty	How to choose the action that maximizes long-term return	Local observations, information asymmetry, cooperation, and the explosion of the joint action space
Typical applications	Games, path planning, basic reinforcement learning	Multi-robot cooperation, multi-agent reinforcement learning, distributed sensors, cooperative control

8. Why is Dec-POMDP much harder than MDP?

The core problem in an MDP is:

Given the current state, how do we choose the optimal action?

But the difficulties in a Dec-POMDP stack on multiple levels.

8.1 Each agent cannot see the full state

In an MDP, the agent knows the current state $s$ .

But in a Dec-POMDP, each agent can only see its own local observation $o_i$ :

o_i \sim O_i(o_i \mid s)

This means the agent must first infer the environment and then make decisions.

8.2 Each agent does not know what other agents have seen

In multi-agent cooperation, an agent must consider not only the environment but also its teammates.

But the problem is:

I don’t know what my teammates saw, and I don’t know how they will interpret it, but I still have to coordinate with them.

This makes cooperation significantly harder.

8.3 The joint action space grows rapidly

If there are $n$ agents and each agent has $m$ actions, then the size of the joint action space is:

|A| = m^n

For example, with 5 robots and 10 actions per robot, the joint action space is:

10^5 = 100000

This means the number of action combinations grows exponentially with the number of agents.
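
The count can be checked by brute-force enumeration. The sketch below builds the joint action space as a Cartesian product with `itertools.product`; representing each agent's actions as the integers 0 to $m-1$ is an assumption made for simplicity.

```python
from itertools import product


def joint_action_space(action_sets):
    """All joint actions (a_1, ..., a_n): the Cartesian product of the local action sets."""
    return list(product(*action_sets))


m = 10   # actions per agent
for n in range(1, 6):
    size = len(joint_action_space([range(m)] * n))
    print(n, size)   # size equals m ** n
```

Enumerating the space like this quickly becomes impractical as $n$ grows, which is one reason multi-agent methods try to avoid searching over joint actions directly.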

8.4 History information also grows rapidly

In an MDP, the agent can make decisions directly based on the current state $s_t$ .

But in a Dec-POMDP, each agent typically has to make decisions based on its own history information $h_i^t$ :

h_i^t = (o_i^0, a_i^0, o_i^1, a_i^1, \dots, o_i^t)

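As the time step $t$ increases, the number of possible local histories grows exponentially: with $|\Omega_i|$ possible observations and $|A_i|$ possible actions, there are up to $|\Omega_i|^{t+1} |A_i|^{t}$ histories of this form.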
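
To get a feel for this growth, the sketch below enumerates every local history of the form $(o_i^0, a_i^0, \dots, o_i^t)$ for a small agent and compares the count with the closed form $|\Omega_i|^{t+1} |A_i|^{t}$. The observation and action labels are made up, and the enumeration counts all syntactically possible histories, whether or not the environment can actually produce them.

```python
from itertools import product


def count_local_histories(num_obs, num_actions, t):
    """Closed form: t + 1 observation slots and t action slots."""
    return num_obs ** (t + 1) * num_actions ** t


def enumerate_local_histories(observations, actions, t):
    """Brute-force list of all histories (o^0, a^0, ..., a^{t-1}, o^t)."""
    slots = []
    for _ in range(t):
        slots += [observations, actions]
    slots.append(observations)
    return list(product(*slots))


observations = ["near", "far"]         # assumed local observations
actions = ["left", "right", "stay"]    # assumed local actions

for t in range(4):
    brute = len(enumerate_local_histories(observations, actions, t))
    print(t, brute, count_local_histories(len(observations), len(actions), t))
```

A deterministic local policy has to assign an action to every one of these histories, so the number of candidate policies grows even faster than the number of histories.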

Therefore, Dec-POMDPs are usually much harder to solve than MDPs.

9. The evolutionary relationship from MDP to Dec-POMDP

You can understand it along the following path:

\text{MDP: } \text{Single-Agent} + \text{Fully Observable}

\text{POMDP: } \text{Single-Agent} + \text{Partially Observable}

\text{Dec-POMDP: } \text{Multi-Agent} + \text{Partially Observable} + \text{Decentralized Execution}

That is, compared with MDPs, Dec-POMDPs add two sources of complexity at the same time:

\text{Single-Agent} \rightarrow \text{Multi-Agent}

\text{Fully Observable} \rightarrow \text{Partially Observable}

On top of these two changes, the decentralized-execution constraint that is built into a Dec-POMDP makes the problem harder still.

10. Centralized training, decentralized execution

In modern multi-agent reinforcement learning, a commonly used paradigm is:

Centralized Training, Decentralized Execution, abbreviated as CTDE.

It means:

During training, you can use more global information, such as the full state $s$ , all agents’ actions $\mathbf{a}$ , and the global reward $R$ .

But during execution, each agent still can only make decisions based on its own local history $h_i$ :

\pi_i(a_i \mid h_i)

This approach tries to balance training efficiency with deployment constraints.

In other words, you can “use a god’s-eye view” during training, but at actual execution time, each agent must still rely only on its own local information.
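
The split between the two phases can be illustrated with a deliberately tiny example. The sketch below is not a reinforcement learning algorithm: the "training" step simply scores every pair of deterministic local policies by brute force, using the global state and the reward function, and the "execution" step runs the chosen policies with local observations only. The one-step task, in which the team is rewarded when the XOR of the two actions matches the XOR of the two hidden bits, is invented for illustration, so each local history here is a single observation.

```python
from itertools import product

STATES = [(x, y) for x in (0, 1) for y in (0, 1)]   # hidden global state, assumed uniform
ACTIONS = (0, 1)


def observe(state, i):
    """Agent i sees only its own coordinate of the global state."""
    return state[i]


def reward(state, joint_action):
    """Shared reward: 1 if the XOR of the actions equals the XOR of the hidden bits."""
    x, y = state
    a0, a1 = joint_action
    return 1.0 if (a0 ^ a1) == (x ^ y) else 0.0


def all_local_policies():
    """Every deterministic map from a local observation in {0, 1} to an action."""
    return [dict(zip((0, 1), acts)) for acts in product(ACTIONS, repeat=2)]


def centralized_training():
    """Score each joint policy using the global state and reward (training-time access)."""
    best, best_value = None, float("-inf")
    for pi0, pi1 in product(all_local_policies(), repeat=2):
        value = sum(
            reward(s, (pi0[observe(s, 0)], pi1[observe(s, 1)])) for s in STATES
        ) / len(STATES)
        if value > best_value:
            best, best_value = (pi0, pi1), value
    return best, best_value


def decentralized_execution(joint_policy, state):
    """At run time, each agent looks up its action from its own observation only."""
    return tuple(pi[observe(state, i)] for i, pi in enumerate(joint_policy))


joint_policy, value = centralized_training()
print(joint_policy, value)
```

The result of training is just a pair of lookup tables, one per agent, so nothing that required the global state needs to be available once the agents are deployed.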

11. Summary

Both MDP and Dec-POMDP describe sequential decision problems, but they apply to different scenarios.

MDPs are more suitable for describing:

A single agent;
Full observability of the state;
Choosing actions based on the current state;
The goal of maximizing long-term cumulative reward.

Their policy form is typically:

\pi(a \mid s)

Their objective can be written as:

\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]

Dec-POMDPs are more suitable for describing:

Multiple agents;
Each agent can only see local information;
No central controller at execution time;
All agents need to cooperate to achieve a shared goal.

Their local policy form is typically:

\pi_i(a_i \mid h_i)

Their joint optimization objective can be written as:

\boldsymbol{\pi}^* = \arg\max_{\boldsymbol{\pi}} \mathbb{E}_{\boldsymbol{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mathbf{a}_t) \right]

In one sentence:

MDP is the most basic single-agent decision model in reinforcement learning; Dec-POMDP is a more complex model designed for multi-agent, partially observable, decentralized cooperation scenarios.

If MDPs care about:

What should I do in the current state?

Then Dec-POMDPs care about:

Each of us knows only part of the information—how should we coordinate when we cannot fully share our view?

This is also why Dec-POMDPs are very important in multi-robot cooperation, autonomous driving fleets, distributed control, and multi-agent reinforcement learning.