Reinforcement Learning/Monte Carlo Policy Evaluation

The goal is to estimate Vπ(s) by generating many episodes under policy π.

  • An episode is a sequence of states, actions, and rewards (s_1, a_1, r_1, s_2, a_2, r_2, …) generated by following policy π in a Markov Decision Process (MDP).
  • In this method, we simply simulate many trajectories under π and average the observed returns (a short code sketch follows this list).
  • The estimation error shrinks as the number of simulated trajectories N grows: the variance of the estimate decreases as 1/N, so the standard error decreases as 1/√N.
  • This method applies only to episodic decision processes, meaning that trajectories are finite and terminate after a finite number of steps.
  • The evaluation does NOT require formal derivation of dynamics and rewards models.
  • This method does NOT assume states to be Markov.
  • Generally a high variance estimator. Reducing the variance can require a lot of data. Therefore, in cases where data is expensive to acquire or the stakes are high, MC may be impractical.
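
As a concrete illustration of "simulate many trajectories and average the returns", here is a minimal Python sketch. The toy_episode simulator, the discount factor GAMMA, and the episode count are placeholders invented for this example, not part of the method itself.

  import random

  GAMMA = 0.9  # discount factor; value chosen only for this toy example

  def discounted_return(rewards, gamma=GAMMA):
      # G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one finite episode
      return sum((gamma ** k) * r for k, r in enumerate(rewards))

  def toy_episode():
      # stand-in for "run policy pi in the MDP until termination":
      # 1 to 3 steps, reward 1 per step
      return [1.0] * random.randint(1, 3)

  # Monte Carlo estimate of the start-state value: average the returns
  # of many independently simulated episodes.
  n_episodes = 10_000
  estimate = sum(discounted_return(toy_episode()) for _ in range(n_episodes)) / n_episodes
  print(estimate)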

There are different types of Monte Carlo policy evaluation:

  1. First-visit Monte Carlo
  2. Every-visit Monte Carlo
  3. Incremental Monte Carlo


First-visit Monte Carlo

Algorithm:

Initialize N(s) = 0, G(s) = 0 for all s ∈ S

Loop:

  • Sample episode i: s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, …, s_{i,T_i}
  • Define G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + … + γ^{T_i − 1} r_{i,T_i} as the return from time step t onwards in the i-th episode
  • For each state s visited in episode i
    • For first time t that state s is visited in episode i
      • Increment counter of total first visits: N(s)=N(s)+1
      • Increment total return G(s)=G(s)+Gi,t
      • Update estimate Vπ(s)=G(s)/N(s)
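
Below is a minimal Python sketch of the loop above. The sample_episode() interface (returning one finite trajectory under π as a list of (state, action, reward) tuples), gamma, and n_episodes are assumptions made for this example.

  from collections import defaultdict

  def first_visit_mc(sample_episode, gamma, n_episodes):
      N = defaultdict(int)    # N(s): number of first visits to s
      G = defaultdict(float)  # G(s): total return accumulated over first visits
      V = defaultdict(float)  # V(s): current estimate G(s) / N(s)

      for _ in range(n_episodes):
          episode = sample_episode()

          # Compute G_{i,t} for every t, working backwards: G_t = r_t + gamma * G_{t+1}
          returns = [0.0] * len(episode)
          g = 0.0
          for t in reversed(range(len(episode))):
              g = episode[t][2] + gamma * g
              returns[t] = g

          seen = set()
          for t, (s, _, _) in enumerate(episode):
              if s in seen:       # only the FIRST visit to s in this episode counts
                  continue
              seen.add(s)
              N[s] += 1
              G[s] += returns[t]
              V[s] = G[s] / N[s]
      return V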

Properties:

  • The first-visit MC estimator of Vπ(s) is an unbiased estimator of the true value 𝔼π[G_t | s_t = s]. (Read more about Bias of an estimator.)
  • By the law of large numbers, as N(s) → ∞, Vπ(s) → 𝔼π[G_t | s_t = s].

Every-visit Monte Carlo

Algorithm:

Initialize N(s) = 0, G(s) = 0 for all s ∈ S

Loop:

  • Sample episode i: s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, …, s_{i,T_i}
  • Define G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + … + γ^{T_i − 1} r_{i,T_i} as the return from time step t onwards in the i-th episode
  • For each state s visited in episode i
    • For every time t that state s is visited in episode i
      • Increment counter of total visits: N(s) = N(s) + 1
      • Increment total return G(s)=G(s)+Gi,t
      • Update estimate Vπ(s)=G(s)/N(s)
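
The every-visit variant differs from the first-visit sketch above only in the inner loop: the first-visit check is dropped, so every occurrence of a state contributes a return. A minimal sketch under the same assumed interface:

  from collections import defaultdict

  def every_visit_mc(sample_episode, gamma, n_episodes):
      N = defaultdict(int)    # N(s): number of visits to s (all of them)
      G = defaultdict(float)  # G(s): total return accumulated over all visits
      V = defaultdict(float)  # V(s): current estimate G(s) / N(s)

      for _ in range(n_episodes):
          episode = sample_episode()  # list of (state, action, reward) tuples

          returns = [0.0] * len(episode)
          g = 0.0
          for t in reversed(range(len(episode))):
              g = episode[t][2] + gamma * g
              returns[t] = g

          for t, (s, _, _) in enumerate(episode):  # no first-visit check here
              N[s] += 1
              G[s] += returns[t]
              V[s] = G[s] / N[s]
      return V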

Properties:

  • The every-visit MC estimator is a biased estimator of the true value Vπ(s) = 𝔼π[G_t | s_t = s]. (Read more about Bias of an estimator.)
  • The every-visit MC estimator typically achieves a lower MSE (variance + bias^2) than the first-visit estimator, because counting every visit collects far more data per episode.
  • The every-visit estimator is a consistent estimator: as the number of simulated episodes grows, its estimate converges to the true value, and in particular its bias asymptotically goes to zero.

Incremental Monte Carlo

Incremental MC policy evaluation is a more general form of the update that can be applied to both the first-visit and every-visit algorithms.

The benefit of the incremental MC algorithm is that it can handle non-stationary problems: it does this by giving higher weight to newer data.

In both the first-visit and every-visit MC algorithms, the value function is updated by the following equation:

  Vπ(s) = Vπ(s) · (N(s) − 1)/N(s) + G_{i,t}/N(s) = Vπ(s) + (1/N(s)) · (G_{i,t} − Vπ(s))

This equation is easily derived by tracking the values of Vπ(s), G(s), and N(s) at each update: just before the update, Vπ(s) equals the old total return divided by the old count N(s) − 1, so adding the new return G_{i,t} and dividing by the new count N(s) yields the expression above.
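
As a quick sanity check, the incremental form reproduces the plain running average; the returns below are made-up numbers for a single state:

  # Made-up returns G_{i,t} observed for a single state s.
  returns = [4.0, 2.0, 7.0, 1.0]

  v, n = 0.0, 0
  for g in returns:
      n += 1
      v = v + (1.0 / n) * (g - v)   # incremental update

  # v and the plain average agree (up to floating-point rounding): both are 3.5
  print(v, sum(returns) / len(returns))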

If we change the update to the following, we arrive at the incremental MC algorithm, which has both first-visit and every-visit variations:

  Vπ(s) = Vπ(s) + α · (G_{i,t} − Vπ(s))

If we set α = 1/N(s), we recover the original first-visit or every-visit MC algorithm; if we set α > 1/N(s), we get an algorithm that gives more weight to newer data and is better suited to non-stationary domains.
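
Below is a minimal Python sketch of the incremental (every-visit) variation with a constant step size α. The sample_episode, gamma, alpha, and n_episodes names are placeholders assumed for this example, not prescribed by the algorithm.

  from collections import defaultdict

  def incremental_mc(sample_episode, gamma, alpha, n_episodes):
      # Every-visit variation: every occurrence of a state triggers an update.
      # alpha = 1/N(s) would recover the ordinary every-visit estimator;
      # a fixed alpha (e.g. 0.1) weights recent episodes more heavily.
      V = defaultdict(float)

      for _ in range(n_episodes):
          episode = sample_episode()  # list of (state, action, reward) tuples
          g = 0.0
          # Walk the episode backwards so that g is the return from step t onwards.
          for s, _, r in reversed(episode):
              g = r + gamma * g
              V[s] = V[s] + alpha * (g - V[s])   # incremental update toward g
      return V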