Reinforcement Learning/Policy iteration

Policy Iteration (PI) is one of the algorithms for finding the optimal policy of a Markov decision process (the MDP control problem).

Policy iteration is a model-based algorithm: it requires knowledge of the MDP's transition model and reward function.

The complexity of the algorithm is O(|A| · |S| · k), where k is the number of iterations needed for convergence. Theoretically, the maximum number of iterations is |A|^|S|, the number of distinct deterministic policies.

The algorithm converges to the global optimum, i.e. to an optimal policy of the MDP.

State-action value Q

The state-action value of a policy π is calculated by taking the specified action a immediately, then following the policy:

<math>Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \, V^{\pi}(s')</math>

Here, R(s,a) is the reward function of the MDP and P(s'|s,a) is the transition model.
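The formula above is a direct tabular computation. A minimal NumPy sketch is given below; the array names R, P and V_pi, and the discount gamma, are assumptions chosen for illustration rather than notation fixed by this page.

<syntaxhighlight lang="python">
# Tabular computation of Q^pi(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) * V^pi(s').
# Assumed shapes (illustrative): R is (|S|, |A|), P is (|S|, |A|, |S|), V_pi is (|S|,).
import numpy as np

def state_action_values(R, P, V_pi, gamma=0.9):
    # Contracting the last axis of P with V_pi gives the expected
    # next-state value for every (s, a) pair; adding R completes Q.
    return R + gamma * P @ V_pi
</syntaxhighlight>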

Algorithm

  • Set i = 0
  • Initialize π_0(s) randomly for all states s
  • While i = 0 or ||π_i − π_{i−1}||_1 > 0 (the L1-norm measures whether the policy changed for any state; a NumPy sketch of this loop is given after the list):
    • Evaluate the current policy to obtain V^{π_i}, then compute the state-action value of policy π_i for all s ∈ S and all a ∈ A:
      <math>Q^{\pi_i}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \, V^{\pi_i}(s')</math>
    • Compute the new policy π_{i+1} for all s ∈ S by choosing, in each state, the action that returns the maximum state-action value:
      <math>\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a) \quad \forall s \in S</math>
    • Increment i
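A compact NumPy sketch of the whole loop follows. It assumes the tabular representation R[s, a] and P[s, a, s'] used above, and it obtains V^{π_i} by solving the Bellman linear system exactly; the function names are illustrative, not part of this page.

<syntaxhighlight lang="python">
import numpy as np

def evaluate_policy(R, P, policy, gamma):
    """Exact policy evaluation: solve V = R_pi + gamma * P_pi V for a deterministic policy."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    R_pi = R[idx, policy]                     # reward of the chosen action in each state
    P_pi = P[idx, policy]                     # transition matrix induced by the policy
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def policy_iteration(R, P, gamma=0.9):
    n_states, n_actions = R.shape
    policy = np.random.randint(n_actions, size=n_states)   # pi_0: random initialization
    while True:
        V = evaluate_policy(R, P, policy, gamma)            # V^{pi_i}
        Q = R + gamma * P @ V                               # Q^{pi_i}(s, a) for all s, a
        new_policy = Q.argmax(axis=1)                       # pi_{i+1}: greedy improvement
        if np.array_equal(new_policy, policy):              # policy unchanged in every state
            return policy, V
        policy = new_policy
</syntaxhighlight>

Solving the linear system is one standard choice for the evaluation step; iterative policy evaluation (repeated Bellman backups) works as well and is cheaper for large state spaces.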

Explanation

In each iteration, by definition we have

<math>\max_a Q^{\pi_i}(s,a) \geq Q^{\pi_i}(s, \pi_i(s)) = V^{\pi_i}(s) \quad \forall s \in S,</math>

i.e. acting greedily with respect to Q^{π_i} is, in every state, at least as good as continuing to follow π_i.

Proof

<math>
\begin{align}
V^{\pi_i}(s) &\leq \max_a Q^{\pi_i}(s,a) \\
&= \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \, V^{\pi_i}(s') \right] \\
&= R(s,\pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s,\pi_{i+1}(s)) \, V^{\pi_i}(s') \quad \text{(by definition, the action with the maximum Q value is taken as the new policy)} \\
&\leq R(s,\pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s,\pi_{i+1}(s)) \left[ \max_{a'} Q^{\pi_i}(s',a') \right] \\
&= R(s,\pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s,\pi_{i+1}(s)) \left[ R(s',\pi_{i+1}(s')) + \gamma \sum_{s'' \in S} P(s'' \mid s',\pi_{i+1}(s')) \, V^{\pi_i}(s'') \right] \\
&\;\;\vdots \\
&= V^{\pi_{i+1}}(s)
\end{align}
</math>

Repeating the expansion replaces π_i by π_{i+1} at every step, so the chain of inequalities ends at V^{π_{i+1}}(s). Hence V^{π_i}(s) ≤ V^{π_{i+1}}(s) for all s: every iteration produces a policy that is at least as good as the previous one.
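The monotonic-improvement claim can also be checked numerically. The sketch below builds a small arbitrary MDP (all numbers are made up for illustration), performs a single greedy improvement step on a random policy, and verifies that the new policy's value is at least the old policy's value in every state.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
R = rng.standard_normal((n_states, n_actions))            # arbitrary reward table
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                         # normalise into a valid transition model

def evaluate(policy):
    """Exact value of a deterministic policy via the Bellman linear system."""
    idx = np.arange(n_states)
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, policy], R[idx, policy])

pi_i = rng.integers(n_actions, size=n_states)             # current policy pi_i
V_i = evaluate(pi_i)
pi_next = (R + gamma * P @ V_i).argmax(axis=1)            # one greedy improvement step
V_next = evaluate(pi_next)
assert np.all(V_next >= V_i - 1e-10)                      # V^{pi_{i+1}}(s) >= V^{pi_i}(s) for all s
</syntaxhighlight>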