class: middle, center, title-slide
Lecture 8: Making decisions
Prof. Gilles Louppe
[email protected]
class: middle, center
Reasoning under uncertainty and taking decisions:
- Markov decision processes
    - MDPs
    - Bellman equation
    - Value iteration
    - Policy iteration
- Partially observable Markov decision processes
.footnote[Image credits: CS188, UC Berkeley.]
.grid[
.kol-2-3[
Assume our agent lives in a grid environment.
- Noisy movements: actions do not always go as planned.
    - Each action achieves the intended effect with probability $0.8$.
    - The rest of the time, with probability $0.2$, the action moves the agent at right angles to the intended direction.
    - If there is a wall in the direction the agent would have been taken, the agent stays put.
- The agent receives rewards at each time step.
    - Small 'living' reward each step (can be negative).
    - Big rewards come at the end (good or bad).
Goal: maximize sum of rewards.
]
.kol-1-3[
.width-100[]]
]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid.center[ .kol-1-4.center[ Deterministic actions
.width-100[]
]
.kol-3-4.center[
Stochastic actions
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
A Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, P, R)$ such that:
- $\mathcal{S}$ is a set of states $s$;
- $\mathcal{A}$ is a set of actions $a$;
- $P$ is a (stationary) transition model such that $P(s'|s,a)$ denotes the probability of reaching state $s'$ if action $a$ is done in state $s$;
- $R$ is a reward function that maps states $s$ to immediate (finite) reward values $R(s)$.
class: middle
.grid[
.kol-1-5.center[
class: middle
.grid.center[
.kol-1-2[.center.width-70[]]
.kol-1-2[.center.width-70[]]
]
- $\mathcal{S}$: locations $(i,j)$ on the grid.
- $\mathcal{A}$: $[\text{Up}, \text{Down}, \text{Right}, \text{Left}]$.
- Transition model: $P(s'|s,a)$
- Reward: $$ R(s) = \begin{cases} -0.3 & \text{for non-terminal states} \\ \pm 1 & \text{for terminal states} \end{cases} $$
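
The noisy movement model of the grid can be sketched as a function returning the distribution $P(s'|s,a)$. This is only an illustration under assumptions that the slides do not state: the $0.2$ probability of slipping is split evenly ($0.1$ each) between the two perpendicular directions, grid borders behave like walls, and the names `transition`, `MOVES`, `PERPENDICULAR` and `walls` are hypothetical.

```python
# Directions as (row, col) offsets; rows grow downward (an arbitrary convention).
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}
PERPENDICULAR = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
                 "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def transition(state, action, n_rows, n_cols, walls=frozenset()):
    """Return a dict {s': P(s'|s,a)} for the noisy grid dynamics."""

    def move(direction):
        di, dj = MOVES[direction]
        nxt = (state[0] + di, state[1] + dj)
        inside = 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols
        # Blocked by a wall or by the border: the agent stays put.
        return nxt if inside and nxt not in walls else state

    probs = {}
    side_a, side_b = PERPENDICULAR[action]
    # Assumption: the 0.2 slip probability is shared equally by both sides.
    for direction, p in ((action, 0.8), (side_a, 0.1), (side_b, 0.1)):
        nxt = move(direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


# From the top-left corner, "Up" and "Left" are blocked, so most of the mass stays put.
corner = transition((0, 0), "Up", n_rows=3, n_cols=4)
```

Returning a dictionary keeps the model sparse: only the reachable successor states appear, which is convenient when summing over $s'$.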
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[ .kol-3-4[
Given the present state, the future and the past are independent:
$$P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} | s_t, a_t)$$
.grid[ .kol-2-3[
- In deterministic single-agent search problems, our goal was to find an optimal plan, or sequence of actions, from start to goal.
- For MDPs, we want to find an optimal policy $\pi^* : \mathcal{S} \to \mathcal{A}$.
    - A policy $\pi$ maps states to actions.
    - An optimal policy is one that maximizes the expected utility, e.g. the expected sum of rewards.
    - An explicit policy defines a reflex agent.
- Expectiminimax did not compute entire policies, but only some action for a single state.
]
.kol-1-3[
.width-100[]
.center[Optimal policy when
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
(a) Optimal policy when
Depending on
???
Discuss the balance between risk and rewards.
What preferences should an agent have over state or reward sequences?
- More or less? $[2,3,4]$ or $[1, 2, 2]$?
- Now or later? $[1,0,0]$ or $[0,0,1]$?
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
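If we assume stationary preferences over reward sequences, i.e. such that
$$[r_0, r_1, r_2, ...] \succ [r_0, r_1', r_2', ...] \Rightarrow [r_1, r_2, ...] \succ [r_1', r_2', ...],$$
then there are only two coherent ways to assign utilities to sequences:

- Additive utility: $V([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...$
- Discounted utility: $V([r_0, r_1, r_2, ...]) = r_0 + \gamma r_1 + \gamma^2 r_2 + ...$ (where $0 < \gamma < 1$ is the discount factor)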
???
Explain what coherent means.
class: middle
.grid[ .kol-1-2[
- Each time we transition to the next state, we multiply in the discount once.
- Why discount?
Example: discount $\gamma = 0.5$
- $V([1,2,3]) = 1 + 0.5\times 2 + 0.25 \times 3 = 2.75$
- $V([1,2,3]) < V([3,2,1])$
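
The comparison above can be checked numerically. The helper below is a minimal sketch; the name `discounted_utility` is hypothetical, and the discount of $0.5$ is read off the coefficients $0.5$ and $0.25$ in the example.

```python
def discounted_utility(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


increasing = discounted_utility([1, 2, 3], gamma=0.5)  # 1 + 0.5*2 + 0.25*3
decreasing = discounted_utility([3, 2, 1], gamma=0.5)  # 3 + 0.5*2 + 0.25*1
```

With discounting, the sequence that collects its large reward first is preferred, even though both sequences have the same undiscounted sum.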
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
What if the agent lives forever? Do we get infinite rewards? Comparing reward sequences with
Solutions:
- Finite horizon: (similar to depth-limited search)
    - Terminate episodes after a fixed number of steps $T$.
    - Results in non-stationary policies ($\pi$ depends on time left).
- Discounting (with $0 < \gamma < 1$ and rewards bounded by $\pm R_\text{max}$): $$V([r_0, r_1, ..., r_\infty]) = \sum_{t=0}^{\infty} \gamma^t r_t \leq \frac{R_\text{max}}{1-\gamma}$$ Smaller $\gamma$ results in a shorter horizon.
- Absorbing state: guarantee that for every policy, a terminal state will eventually be reached.
class: middle
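The expected utility obtained by executing $\pi$ starting in $s$ is given by
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t)\right],$$
where the expectation is taken over the state sequences generated by $\pi$ from $s_0 = s$.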
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Among all policies the agent could execute, the optimal policy is the policy $\pi_s^*$ that maximizes the expected utility:
$$\pi_s^* = \arg\max_\pi V^\pi(s)$$

Because of discounted utilities, the optimal policy is independent of the starting state $s$. Therefore we simply write $\pi^*$.

The utility, or value, $V(s)$ of a state is $V^{\pi^*}(s)$.
- That is, the expected (discounted) reward if the agent executes an optimal policy starting from $s$.
- Notice that $R(s)$ and $V(s)$ are quite different quantities:
    - $R(s)$ is the short term reward for having reached $s$.
    - $V(s)$ is the long term total reward from $s$ onward.
class: middle
Utilities of the states in Grid World, calculated with
class: middle
Using the principle of maximum expected utility, the optimal action maximizes the expected utility of the subsequent state.
That is,
$$\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a) V(s').$$
Therefore, we can extract the optimal policy provided we can estimate the utilities of states.
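
Policy extraction from known utilities can be sketched in a few lines of NumPy. The array layout (`P[a, s, s2]` for $P(s'|s,a)$), the function name `greedy_policy` and the two-state example are assumptions made for illustration only.

```python
import numpy as np


def greedy_policy(P, V):
    """Return pi(s) = argmax_a sum_{s'} P(s'|s,a) V(s').

    P has shape (A, S, S) with P[a, s, s2] = P(s2 | s, a); V has shape (S,).
    """
    expected_next_value = P @ V  # shape (A, S)
    return expected_next_value.argmax(axis=0)


# Hypothetical two-state MDP: action 0 stays put, action 1 moves to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
V = np.array([9.0, 10.0])  # utilities assumed to be already known
policy = greedy_policy(P, V)
```

Ties are broken in favor of the lowest action index, which is simply the behavior of `argmax`.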
.footnote[Image credits: CS188, UC Berkeley.]
???
Point out the circularity of the argument!
class: middle
.footnote[Image credits: CS188, UC Berkeley.]
The utility of a state is the immediate reward for that state, plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:
$$V(s) = R(s) + \gamma \max_a \sum_{s'} P(s'|s,a) V(s')$$
- These equations are called the Bellman equations. They form a system of $n=|\mathcal{S}|$ non-linear equations with as many unknowns.
- The utilities of states, defined as the expected utility of subsequent state sequences, are solutions of the set of Bellman equations.
???
There is a direct relationship between the utility of a state and the utility of its neighbors.
The Bellman equation combines the expected utility (slide 16) with the policy extraction equation (slide 20).
class: middle
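Because of the $\max$ operator, the Bellman equations are non-linear and cannot be solved directly with linear algebra.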
The value iteration algorithm provides a fixed-point iteration procedure for computing the state utilities $V(s)$:
- Let $V_i(s)$ be the estimated utility value for $s$ at the $i$-th iteration step.
- The Bellman update consists in updating simultaneously all the estimates to make them locally consistent with the Bellman equation: $$V_{i+1}(s) := R(s) + \gamma \max_a \sum_{s'} P(s'|s,a) V_i(s') $$
- Repeat until convergence.
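
A minimal sketch of value iteration in NumPy follows. It assumes a tabular MDP stored as arrays (`P[a, s, s2]` for $P(s'|s,a)$ and `R[s]` for $R(s)$); the function name, the stopping threshold `tol` and the two-state example are hypothetical and not part of the lecture material.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Repeat the Bellman update until the largest change falls below tol."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # (P @ V)[a, s] = sum_{s'} P(s'|s,a) V_i(s')
        V_new = R + gamma * (P @ V).max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical two-state MDP: state 1 is absorbing and yields reward 1.
# Action 0 stays put, action 1 moves to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([0.0, 1.0])
V = value_iteration(P, R, gamma=0.9)
```

In this example the fixed point can be found by hand: $V(1) = 1/(1-0.9) = 10$ and $V(0) = 0.9 \times 10 = 9$, which is what the iteration approaches.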
class: middle
???
The stopping criterion is based on the fact that if the update is small, then the error is also small. That is, if $||V_{i+1} - V_i||_\infty < \epsilon (1-\gamma)/\gamma$, then $||V_{i+1} - V||_\infty < \epsilon$.
class: middle, center
(Step-by-step code example)
class: middle
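Let $V_i$ and $V_{i+1}$ be successive approximations to the true utility $V$.

.bold[Theorem.] For any two approximations $V_i$ and $V_i'$,
$$||V_{i+1} - V_{i+1}'||_\infty \leq \gamma ||V_i - V_i'||_\infty.$$
- That is, the Bellman update is a contraction by a factor $\gamma$ on the space of utility vectors.
- Therefore, any two approximations must get closer to each other, and in particular any approximation must get closer to the true $V$.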
class: middle
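Since the true utility $V$ is a fixed point of the Bellman update, $||V_{i+1} - V||_\infty \leq \gamma ||V_i - V||_\infty$: the error is reduced by a factor of at least $\gamma$ at each iteration.

Therefore, value iteration converges exponentially fast:
- The maximum initial error is $||V_0 - V||_\infty \leq 2R_\text{max} / (1-\gamma)$.
- To reach an error of at most $\epsilon$ after $N$ iterations, we require $\gamma^N 2R_\text{max}/(1-\gamma) \leq \epsilon$.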
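
Solving the last inequality for $N$ gives $N \geq \log\left(2R_\text{max} / (\epsilon(1-\gamma))\right) / \log(1/\gamma)$. The helper below, whose name `iterations_needed` is hypothetical, is a small sketch that evaluates this bound.

```python
import math


def iterations_needed(r_max, gamma, epsilon):
    """Smallest N such that gamma**N * 2 * r_max / (1 - gamma) <= epsilon."""
    initial_error = 2 * r_max / (1 - gamma)
    return math.ceil(math.log(initial_error / epsilon) / math.log(1 / gamma))


n_slow = iterations_needed(r_max=1.0, gamma=0.9, epsilon=1e-3)
n_fast = iterations_needed(r_max=1.0, gamma=0.5, epsilon=1e-3)
```

The bound grows quickly as $\gamma$ approaches $1$, since $\log(1/\gamma)$ then approaches $0$.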
???
Figure on the right:
class: middle
Value iteration repeats the Bellman updates:
$$V_{i+1}(s) := R(s) + \gamma \max_a \sum_{s'} P(s'|s,a) V_i(s')$$
- Problem 1: it is slow, with a cost of $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per iteration.
- Problem 2: the $\max$ at each state rarely changes.
- Problem 3: the policy $\pi_i$ extracted from the estimate $V_i$ might be optimal even if $V_i$ is inaccurate!
The policy iteration algorithm directly computes the policy (rather than state values). It alternates the following two steps:
- Policy evaluation: given $\pi_i$, calculate $V_i = V^{\pi_i}$, i.e. the utility of each state if $\pi_i$ is executed.
- Policy improvement: calculate a new policy $\pi_{i+1}$ using one-step look-ahead based on $V_i$: $$\pi_{i+1}(s) = \arg\max_a \sum_{s'} P(s'|s,a)V_i(s')$$

This algorithm is still optimal, and might converge (much) faster under some conditions.
class: middle
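At the $i$-th iteration, policy evaluation solves a simplified version of the Bellman equations in which the action is fixed by $\pi_i$:
$$V_i(s) = R(s) + \gamma \sum_{s'} P(s'|s,\pi_i(s)) V_i(s')$$
- these equations are linear, since the $\max$ operator has been removed: for $n$ states, there are $n$ linear equations with $n$ unknowns;
- this can be solved exactly in $O(n^3)$ by standard linear algebra methods.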
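
Policy iteration with exact policy evaluation can be sketched as follows. For a fixed policy, the utilities satisfy the linear system $(I - \gamma P_\pi) V = R$, which is solved here with `numpy.linalg.solve`. The array layout (`P[a, s, s2]`, `R[s]`), the function names and the two-state example are assumptions made for illustration.

```python
import numpy as np


def policy_evaluation(P, R, gamma, policy):
    """Solve (I - gamma * P_pi) V = R exactly for a fixed policy."""
    n = R.shape[0]
    P_pi = P[policy, np.arange(n)]  # P_pi[s, s2] = P(s2 | s, policy[s])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)


def policy_iteration(P, R, gamma):
    """Alternate evaluation and improvement until the policy is stable."""
    policy = np.zeros(R.shape[0], dtype=int)  # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, gamma, policy)
        new_policy = (P @ V).argmax(axis=0)   # one-step look-ahead
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy


# Hypothetical two-state MDP: state 1 is absorbing and yields reward 1.
# Action 0 stays put, action 1 moves to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([0.0, 1.0])
policy, V = policy_iteration(P, R, gamma=0.9)
```

On this example the loop stabilizes after a single improvement step, whereas value iteration needs many sweeps to reach the same utilities.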
???
Notice how we replaced
class: middle
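In some cases $O(n^3)$ is too prohibitive. Fortunately, it is not necessary to perform exact policy evaluation: an approximate solution is sufficient.

One way is to run $k$ iterations of simplified Bellman updates:
$$V_{i+1}(s) = R(s) + \gamma \sum_{s'} P(s'|s,\pi_i(s)) V_i(s')$$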
This hybrid algorithm is called modified policy iteration.
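
The approximate evaluation step of modified policy iteration can be sketched as a fixed number of sweeps for a fixed policy, in place of an exact linear solve. The function name, the default `k` and the two-state example are hypothetical.

```python
import numpy as np


def approximate_policy_evaluation(P, R, gamma, policy, V, k=20):
    """Run k simplified Bellman updates (no max) for a fixed policy."""
    n = R.shape[0]
    P_pi = P[policy, np.arange(n)]  # P_pi[s, s2] = P(s2 | s, policy[s])
    for _ in range(k):
        V = R + gamma * (P_pi @ V)
    return V


# Hypothetical two-state MDP: state 1 is absorbing and yields reward 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([0.0, 1.0])
policy = np.array([1, 0])  # move to state 1, then stay there
V_rough = approximate_policy_evaluation(P, R, 0.9, policy, np.zeros(2), k=5)
V_fine = approximate_policy_evaluation(P, R, 0.9, policy, np.zeros(2), k=300)
```

Each sweep costs $O(n^2)$ for a dense transition matrix, so a small `k` trades accuracy for speed; the estimate from the previous round can be passed back in as `V` to warm-start the next one.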
class: middle
class: middle, center
(Step-by-step code example)
.grid[
.kol-2-3[
The game 2048 is a Markov decision process!
- $\mathcal{S}$: all possible configurations of the board (huge!)
- $\mathcal{A}$: swiping left, right, up or down.
- $P(s'|s,a)$: encodes the game's dynamics
    - collapse matching tiles
    - place a random tile on the board
- $R(s)=1$ if $s$ is a winning state, and $0$ otherwise.
]
.kol-1-3[
.width-100[]
]
]
class: middle
.center[The transition model for a
.footnote[Image credits: jdlm.info, The Mathematics of 2048.]
class: middle, center
Optimal play for a
See jdlm.info: The Mathematics of 2048.
class: middle
What if the environment is only partially observable?
- The agent does not know which state $s$ it is in.
    - Therefore, it cannot evaluate the reward $R(s)$ associated to the unknown state.
    - Also, it makes no sense to talk about a policy $\pi(s)$.
- Instead, the agent collects percepts $e$ through a sensor model $P(e|s)$, from which it can reason about the unknown state $s$.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
We will assume that the agent maintains a belief state $b$.
- $b$ represents a probability distribution ${\bf P}(S)$ of the current agent's beliefs over its state;
- $b(s)$ denotes the probability $P(S=s)$ under the current belief state;
- the belief state $b$ is updated as evidence $e$ is collected.
This is filtering!
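
One step of this filtering update can be sketched as $b'(s') \propto P(e|s') \sum_s P(s'|s,a)\, b(s)$: predict with the transition model, weight by the sensor model, then normalize. The array layout (`P[a, s, s2]` for the transition model, `O[s, e]` for the sensor model $P(e|s)$), the function name and the numbers below are assumptions made for illustration.

```python
import numpy as np


def update_belief(b, a, e, P, O):
    """Return the new belief after doing action a and perceiving e."""
    predicted = b @ P[a]                # sum_s P(s'|s,a) b(s)
    unnormalized = O[:, e] * predicted  # weight by the sensor model P(e|s')
    return unnormalized / unnormalized.sum()


# Hypothetical two-state problem with a single "stay" action and a noisy sensor.
P = np.array([[[1.0, 0.0], [0.0, 1.0]]])
O = np.array([[0.9, 0.1],   # P(e | s=0)
              [0.2, 0.8]])  # P(e | s=1)
b = np.array([0.5, 0.5])
b_new = update_belief(b, a=0, e=0, P=P, O=O)
```

The sketch assumes the percept has non-zero probability under the predicted belief; otherwise the normalization would divide by zero.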
class: middle
.grid[
.kol-1-5.center[
.bold[Theorem (Astrom, 1965).] The optimal action depends only on the agent's current belief state.
- The optimal policy can be described by a mapping $\pi^*(b)$ from beliefs to actions.
- It does not depend on the actual state the agent is in.

In other words, POMDPs can be reduced to an MDP in belief-state space, provided we can define a transition model $P(b'|b,a)$ and a reward function over belief states.
class: middle
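If $b$ is the previous belief state and the agent does action $a$ and then perceives $e$, the new belief state $b'$ is given by the filtering update
$$b'(s') = \alpha P(e|s') \sum_{s} P(s'|s,a) b(s),$$
where $\alpha$ is a normalization constant.

Therefore, $$ \begin{aligned} P(b'|b,a) &= \sum_e P(b',e|b,a)\\ &= \sum_e P(b'|b,a,e) P(e|b,a) \\ &= \sum_e P(b'|b,a,e) \sum_{s'} P(e|b,a,s') P(s'|b,a) \\ &= \sum_e P(b'|b,a,e) \sum_{s'} P(e|s') \sum_{s} P(s'|s,a) b(s) \end{aligned} $$

where $P(b'|b,a,e) = 1$ if $b'$ is the belief state obtained by filtering from $b$, $a$ and $e$, and $0$ otherwise.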
class: middle
We can also define a reward function for belief states as the expected reward for the actual state the agent might be in:
$$\rho(b) = \sum_s b(s) R(s)$$
class: middle
.grid[
.kol-1-5.center[
class: middle
Although we have reduced POMDPs to MDPs, the Belief MDP we obtain has a continuous (and usually high-dimensional) state space.
- None of the algorithms described earlier directly apply.
- In fact, solving POMDPs remains a difficult problem for which there is no known efficient exact algorithm.
- Yet, Nature is a POMDP.
While it is difficult to directly derive $\pi^*$, a decision-theoretic agent can be constructed for POMDPs:
- The transition and sensor models are represented by a dynamic Bayesian network;
- The dynamic Bayesian network is extended with decision ($A$) and utility ($R$ and $U$) nodes to form a dynamic decision network;
- A filtering algorithm is used to incorporate each new percept and action and to update the belief state representation;
- Decisions are made by projecting forward possible action sequences and choosing (approximately) the best one, in a manner similar to a truncated Expectiminimax.
class: middle
At time
- Shaded nodes represent variables with known values.
- The network is unrolled for a finite horizon.
- It includes nodes for the reward of $\mathbf{X}_{t+1}$ and $\mathbf{X}_{t+2}$, but the (estimated) utility of $\mathbf{X}_{t+3}$.
class: middle
Part of the look-ahead solution of the previous decision network:
- Each triangular node is a belief state in which the agent makes a decision.
- The belief state at each node can be computed by applying a filtering algorithm to the sequence of percepts and actions leading to it.
- The round nodes correspond to choices by the environment.
A decision can be extracted from the search tree by backing up the (estimated) utility values from the leaves, taking the average at the chance nodes and taking the maximum at the decision nodes.
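
The backup described above can be sketched as a depth-limited recursion over belief states: maximize over actions at decision nodes and average over percepts at chance nodes. This is an illustration under assumptions, not the lecture's implementation: beliefs and models are dense arrays (`P[a, s, s2]`, `O[s, e]`, `R[s]`), the leaf utility is crudely estimated by the expected immediate reward, and all names are hypothetical.

```python
import numpy as np


def lookahead_value(b, depth, P, O, R, gamma):
    """Back up utilities through a depth-limited belief-state tree."""
    expected_reward = b @ R                 # sum_s b(s) R(s)
    if depth == 0:
        return expected_reward              # crude leaf estimate (assumption)
    best = -np.inf
    for a in range(P.shape[0]):             # decision node: take the maximum
        predicted = b @ P[a]                # sum_s P(s'|s,a) b(s)
        value = 0.0
        for e in range(O.shape[1]):         # chance node: average over percepts
            unnormalized = O[:, e] * predicted
            p_e = unnormalized.sum()        # P(e | b, a)
            if p_e > 0:
                b_next = unnormalized / p_e  # filtered belief state
                value += p_e * lookahead_value(b_next, depth - 1, P, O, R, gamma)
        best = max(best, value)
    return expected_reward + gamma * best


# Hypothetical two-state problem with a perfectly informative sensor.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
O = np.eye(2)
R = np.array([0.0, 1.0])
v_start = lookahead_value(np.array([1.0, 0.0]), 1, P, O, R, gamma=0.9)
v_goal = lookahead_value(np.array([0.0, 1.0]), 2, P, O, R, gamma=0.9)
```

The cost grows exponentially with the depth, with a branching factor equal to the number of actions times the number of percepts, which is why the look-ahead has to be truncated.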
???
Notice that there are no chance nodes corresponding to the action outcomes; this is because the belief-state update for an action is deterministic regardless of the actual outcome.
That is, we transition from
- Sequential decision problems in uncertain environments, called MDPs, are defined by a transition model and a reward function.
- The utility of a state sequence is the sum of all the rewards over the sequence, possibly discounted over time.
- The solution of an MDP is a policy that associates a decision with every state that the agent might reach.
- An optimal policy maximizes the utility of the state sequence encountered when it is executed.
- Value iteration and policy iteration can both be used for solving MDPs.
- POMDPs are much more difficult than MDPs. However, a decision-theoretic agent can be constructed for those environments.
class: end-slide, center
count: false
The end.