Reinforcement Learning/Monte Carlo Policy Evaluation

The goal is to estimate Vπ(s) by generating many episodes under policy π.

  • An episode is a sequence of states, actions, and rewards (s_1, a_1, r_1, s_2, a_2, r_2, …) generated by following policy π in a Markov Decision Process (MDP).
  • In this method, we simply simulate many trajectories under π and average the observed returns (a short code sketch follows this list).
  • The estimation error shrinks as the number of simulated trajectories N grows: the variance of the estimate decreases as 1/N, so the standard error decreases as 1/√N.
  • This method applies only to episodic decision processes, meaning that trajectories are finite and terminate after a finite number of steps.
  • The evaluation does NOT require formal derivation of dynamics and rewards models.
  • This method does NOT assume states to be Markov.
  • Generally a high variance estimator. Reducing the variance can require a lot of data. Therefore, in cases where data is expensive to acquire or the stakes are high, MC may be impractical.
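
As a concrete illustration of "simulate many trajectories and average the returns", here is a minimal Python sketch. The toy_episode simulator, the discount factor GAMMA, and the episode count are placeholders invented for this example, not part of the method itself.

  import random

  GAMMA = 0.9  # discount factor; value chosen only for this toy example

  def discounted_return(rewards, gamma=GAMMA):
      # G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one finite episode
      return sum((gamma ** k) * r for k, r in enumerate(rewards))

  def toy_episode():
      # stand-in for "run policy pi in the MDP until termination":
      # 1 to 3 steps, reward 1 per step
      return [1.0] * random.randint(1, 3)

  # Monte Carlo estimate of the start-state value: average the returns
  # of many independently simulated episodes.
  n_episodes = 10_000
  estimate = sum(discounted_return(toy_episode()) for _ in range(n_episodes)) / n_episodes
  print(estimate)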

There are different types of Monte Carlo policy evaluation:

  1. First-visit Monte Carlo
  2. Every-visit Monte Carlo
  3. Incremental Monte Carlo


First-visit Monte Carlo

Algorithm:

Initialize N(s) = 0, G(s) = 0 for all s ∈ S

Loop:

  • Sample episode i: s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, …, s_{i,T_i}
  • Define G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + … + γ^{T_i − 1} r_{i,T_i} as the return from time step t onwards in the i-th episode
  • For each state s visited in episode i
    • For first time t that state s is visited in episode i
      • Increment counter of total first visits: N(s)=N(s)+1
      • Increment total return G(s)=G(s)+Gi,t
      • Update estimate Vπ(s)=G(s)/N(s)
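
Below is a minimal Python sketch of the loop above. The sample_episode() interface (returning one finite trajectory under π as a list of (state, action, reward) tuples), gamma, and n_episodes are assumptions made for this example.

  from collections import defaultdict

  def first_visit_mc(sample_episode, gamma, n_episodes):
      N = defaultdict(int)    # N(s): number of first visits to s
      G = defaultdict(float)  # G(s): total return accumulated over first visits
      V = defaultdict(float)  # V(s): current estimate G(s) / N(s)

      for _ in range(n_episodes):
          episode = sample_episode()

          # Compute G_{i,t} for every t, working backwards: G_t = r_t + gamma * G_{t+1}
          returns = [0.0] * len(episode)
          g = 0.0
          for t in reversed(range(len(episode))):
              g = episode[t][2] + gamma * g
              returns[t] = g

          seen = set()
          for t, (s, _, _) in enumerate(episode):
              if s in seen:       # only the FIRST visit to s in this episode counts
                  continue
              seen.add(s)
              N[s] += 1
              G[s] += returns[t]
              V[s] = G[s] / N[s]
      return V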

Properties:

  • The first-visit MC estimator of Vπ(s) is an unbiased estimator of the true value 𝔼π[G_t | s_t = s]. (Read more about Bias of an estimator.)
  • By the law of large numbers, as N(s) → ∞, Vπ(s) → 𝔼π[G_t | s_t = s].

Every-visit Monte Carlo

Algorithm:

Initialize N(s) = 0, G(s) = 0 for all s ∈ S

Loop:

  • Sample episode i: s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, …, s_{i,T_i}
  • Define G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + … + γ^{T_i − 1} r_{i,T_i} as the return from time step t onwards in the i-th episode
  • For each state s visited in episode i
    • For every time t that state s is visited in episode i
      • Increment counter of total visits: N(s) = N(s) + 1
      • Increment total return G(s)=G(s)+Gi,t
      • Update estimate Vπ(s)=G(s)/N(s)
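
The every-visit variant differs from the first-visit sketch above only in the inner loop: the first-visit check is dropped, so every occurrence of a state contributes a return. A minimal sketch under the same assumed interface:

  from collections import defaultdict

  def every_visit_mc(sample_episode, gamma, n_episodes):
      N = defaultdict(int)    # N(s): number of visits to s (all of them)
      G = defaultdict(float)  # G(s): total return accumulated over all visits
      V = defaultdict(float)  # V(s): current estimate G(s) / N(s)

      for _ in range(n_episodes):
          episode = sample_episode()  # list of (state, action, reward) tuples

          returns = [0.0] * len(episode)
          g = 0.0
          for t in reversed(range(len(episode))):
              g = episode[t][2] + gamma * g
              returns[t] = g

          for t, (s, _, _) in enumerate(episode):  # no first-visit check here
              N[s] += 1
              G[s] += returns[t]
              V[s] = G[s] / N[s]
      return V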

Properties:

  • The every-visit MC estimator is a biased estimator of the true value Vπ(s) = 𝔼π[G_t | s_t = s]. (Read more about Bias of an estimator.)
  • The every-visit MC estimator typically achieves a lower MSE (variance + bias^2) than the first-visit estimator, because counting every visit collects far more data per episode.
  • The every-visit estimator is a consistent estimator: as the number of simulated episodes grows, its estimate converges to the true value, and in particular its bias asymptotically goes to zero.

Incremental Monte Carlo

Incremental MC policy evaluation is a more general form of the update that can be applied to both the first-visit and every-visit algorithms.

The benefit of the incremental MC algorithm is that it can handle non-stationary problems: it does this by giving higher weight to newer data.

In both the first-visit and every-visit MC algorithms, the value function is updated by the following equation:

  Vπ(s) = Vπ(s) · (N(s) − 1)/N(s) + G_{i,t}/N(s) = Vπ(s) + (1/N(s)) · (G_{i,t} − Vπ(s))

This equation is easily derived by tracking the values of Vπ(s), G(s), and N(s) at each update: just before the update, Vπ(s) equals the old total return divided by the old count N(s) − 1, so adding the new return G_{i,t} and dividing by the new count N(s) yields the expression above.
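
As a quick sanity check, the incremental form reproduces the plain running average; the returns below are made-up numbers for a single state:

  # Made-up returns G_{i,t} observed for a single state s.
  returns = [4.0, 2.0, 7.0, 1.0]

  v, n = 0.0, 0
  for g in returns:
      n += 1
      v = v + (1.0 / n) * (g - v)   # incremental update

  # v and the plain average agree (up to floating-point rounding): both are 3.5
  print(v, sum(returns) / len(returns))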

If we change the update to the following, we arrive at the incremental MC algorithm, which has both first-visit and every-visit variations:

  Vπ(s) = Vπ(s) + α · (G_{i,t} − Vπ(s))

If we set α = 1/N(s), we recover the original first-visit or every-visit MC algorithm; if we set α > 1/N(s), we get an algorithm that gives more weight to newer data and is better suited to non-stationary domains.
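
Below is a minimal Python sketch of the incremental (every-visit) variation with a constant step size α. The sample_episode, gamma, alpha, and n_episodes names are placeholders assumed for this example, not prescribed by the algorithm.

  from collections import defaultdict

  def incremental_mc(sample_episode, gamma, alpha, n_episodes):
      # Every-visit variation: every occurrence of a state triggers an update.
      # alpha = 1/N(s) would recover the ordinary every-visit estimator;
      # a fixed alpha (e.g. 0.1) weights recent episodes more heavily.
      V = defaultdict(float)

      for _ in range(n_episodes):
          episode = sample_episode()  # list of (state, action, reward) tuples
          g = 0.0
          # Walk the episode backwards so that g is the return from step t onwards.
          for s, _, r in reversed(episode):
              g = r + gamma * g
              V[s] = V[s] + alpha * (g - V[s])   # incremental update toward g
      return V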