Reinforcement Learning/Policy iteration

Policy Iteration (PI) is one of the algorithms for finding the optimal policy of a Markov decision process (the MDP control problem).

Policy iteration is a model-based algorithm: it requires knowledge of the MDP's transition model and reward function.

The complexity of the algorithm is O(|A| · |S| · k), where k is the number of iterations needed for convergence. Theoretically, the maximum number of iterations is |A|^|S|, the number of distinct deterministic policies.

The algorithm converges to the global optimum, i.e. to an optimal policy of the MDP.

State-action value Q

The state-action value of a policy π is calculated by taking the specified action a immediately, then following the policy:

<math>Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \, V^{\pi}(s')</math>

Here, R(s,a) is the reward function of the MDP and P(s'|s,a) is the transition model.
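The formula above is a direct tabular computation. A minimal NumPy sketch is given below; the array names R, P and V_pi, and the discount gamma, are assumptions chosen for illustration rather than notation fixed by this page.

<syntaxhighlight lang="python">
# Tabular computation of Q^pi(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) * V^pi(s').
# Assumed shapes (illustrative): R is (|S|, |A|), P is (|S|, |A|, |S|), V_pi is (|S|,).
import numpy as np

def state_action_values(R, P, V_pi, gamma=0.9):
    # Contracting the last axis of P with V_pi gives the expected
    # next-state value for every (s, a) pair; adding R completes Q.
    return R + gamma * P @ V_pi
</syntaxhighlight>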

Algorithm

  • Set i = 0
  • Initialize π_0(s) randomly for all states s
  • While i = 0 or ||π_i − π_{i−1}||_1 > 0 (the L1-norm measures whether the policy changed for any state; a NumPy sketch of this loop is given after the list):
    • Evaluate the current policy to obtain V^{π_i}, then compute the state-action value of policy π_i for all s ∈ S and all a ∈ A:
      <math>Q^{\pi_i}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \, V^{\pi_i}(s')</math>
    • Compute the new policy π_{i+1} for all s ∈ S by choosing, in each state, the action that returns the maximum state-action value:
      <math>\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a) \quad \forall s \in S</math>
    • Increment i
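A compact NumPy sketch of the whole loop follows. It assumes the tabular representation R[s, a] and P[s, a, s'] used above, and it obtains V^{π_i} by solving the Bellman linear system exactly; the function names are illustrative, not part of this page.

<syntaxhighlight lang="python">
import numpy as np

def evaluate_policy(R, P, policy, gamma):
    """Exact policy evaluation: solve V = R_pi + gamma * P_pi V for a deterministic policy."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    R_pi = R[idx, policy]                     # reward of the chosen action in each state
    P_pi = P[idx, policy]                     # transition matrix induced by the policy
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def policy_iteration(R, P, gamma=0.9):
    n_states, n_actions = R.shape
    policy = np.random.randint(n_actions, size=n_states)   # pi_0: random initialization
    while True:
        V = evaluate_policy(R, P, policy, gamma)            # V^{pi_i}
        Q = R + gamma * P @ V                               # Q^{pi_i}(s, a) for all s, a
        new_policy = Q.argmax(axis=1)                       # pi_{i+1}: greedy improvement
        if np.array_equal(new_policy, policy):              # policy unchanged in every state
            return policy, V
        policy = new_policy
</syntaxhighlight>

Solving the linear system is one standard choice for the evaluation step; iterative policy evaluation (repeated Bellman backups) works as well and is cheaper for large state spaces.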

Explanation

In each iteration, by definition we have

<math>\max_a Q^{\pi_i}(s,a) \geq Q^{\pi_i}(s, \pi_i(s)) = V^{\pi_i}(s) \quad \forall s \in S,</math>

i.e. acting greedily with respect to Q^{π_i} is, in every state, at least as good as continuing to follow π_i.

Proof

<math>
\begin{align}
V^{\pi_i}(s) &\leq \max_a Q^{\pi_i}(s,a) \\
&= \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \, V^{\pi_i}(s') \right] \\
&= R(s,\pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s,\pi_{i+1}(s)) \, V^{\pi_i}(s') \quad \text{(by definition, the action with the maximum Q value is taken as the new policy)} \\
&\leq R(s,\pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s,\pi_{i+1}(s)) \left[ \max_{a'} Q^{\pi_i}(s',a') \right] \\
&= R(s,\pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s,\pi_{i+1}(s)) \left[ R(s',\pi_{i+1}(s')) + \gamma \sum_{s'' \in S} P(s'' \mid s',\pi_{i+1}(s')) \, V^{\pi_i}(s'') \right] \\
&\;\;\vdots \\
&= V^{\pi_{i+1}}(s)
\end{align}
</math>

Repeating the expansion replaces π_i by π_{i+1} at every step, so the chain of inequalities ends at V^{π_{i+1}}(s). Hence V^{π_i}(s) ≤ V^{π_{i+1}}(s) for all s: every iteration produces a policy that is at least as good as the previous one.
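The monotonic-improvement claim can also be checked numerically. The sketch below builds a small arbitrary MDP (all numbers are made up for illustration), performs a single greedy improvement step on a random policy, and verifies that the new policy's value is at least the old policy's value in every state.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
R = rng.standard_normal((n_states, n_actions))            # arbitrary reward table
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                         # normalise into a valid transition model

def evaluate(policy):
    """Exact value of a deterministic policy via the Bellman linear system."""
    idx = np.arange(n_states)
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, policy], R[idx, policy])

pi_i = rng.integers(n_actions, size=n_states)             # current policy pi_i
V_i = evaluate(pi_i)
pi_next = (R + gamma * P @ V_i).argmax(axis=1)            # one greedy improvement step
V_next = evaluate(pi_next)
assert np.all(V_next >= V_i - 1e-10)                      # V^{pi_{i+1}}(s) >= V^{pi_i}(s) for all s
</syntaxhighlight>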