Reinforcement Learning/Value Iteration


Policy iteration vs Value iteration

  • Policy iteration computes the optimal value function and policy
  • Value iteration:
    • Maintain the optimal value of starting in a state s when a finite number of steps k remain in the episode
    • Iterate to consider longer and longer episodes (made precise by the recursion below)

Policy iteration and value iteration will converge to the same optimal policy.
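Concretely, the finite-horizon view can be written as a recursion (using the reward R, transition model P, and discount factor γ defined in the Algorithm section below): starting from V_0(s) = 0 for all states,

$$V_{k+1}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_k(s') \right]$$

gives the optimal value of a state when k + 1 steps remain, and V_k approaches the optimal infinite-horizon value as k grows.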


Algorithm

The value function of a policy π is the solution to the Bellman equation

$$V^{\pi}(s) = R^{\pi}(s) + \gamma \sum_{s' \in S} P^{\pi}(s' \mid s)\, V^{\pi}(s')$$

The Bellman backup operator B is an operator that is applied to a value function and returns a new value function, improving the value where possible:

$$(BV)(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \right]$$

BV yields a value function over all states s.
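As an illustration, here is a minimal sketch of tabular value iteration in Python. It assumes the MDP is given as NumPy arrays R[s, a] for rewards and P[s, a, s'] for transition probabilities; these array names, shapes, and the example MDP at the bottom are assumptions for this sketch, not part of the page above.

```python
import numpy as np

def bellman_backup(V, R, P, gamma):
    """One application of the Bellman backup operator B.

    R: (S, A) array of rewards R(s, a)
    P: (S, A, S) array of transition probabilities P(s' | s, a)
    Returns (BV)(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s,a) V(s') ].
    """
    Q = R + gamma * (P @ V)  # Q(s, a); P @ V sums over next states s'
    return Q.max(axis=1)     # maximize over actions

def value_iteration(R, P, gamma=0.9, tol=1e-8):
    """Iterate Bellman backups until the value function stops changing."""
    V = np.zeros(R.shape[0])
    while True:
        V_new = bellman_backup(V, R, P, gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state, 2-action MDP, purely for illustration
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [1.0, 0.0]]])
V_star = value_iteration(R, P)
```

Each loop iteration applies one Bellman backup, so after k iterations V holds the optimal k-step value from the recursion above; the loop stops once a backup changes the values by less than tol.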