Reinforcement Learning/Markov Decision Process


A Markov decision process (MDP) is a Markov chain + a reward function + actions.

A Markov decision process reduces to a Markov reward process once a "policy" $\pi(s)$ is chosen, i.e. a rule specifying the action to take in each state.

Definition

A Markov decision process is a 4-tuple $(S, A, P_a, R_a)$, where

  • $S$ is a finite set of states,
  • $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
  • $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
  • $R_a(s, s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$, due to action $a$.

(Note: the theory of Markov decision processes does not require $S$ or $A$ to be finite, but the basic algorithms below assume that they are.)
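
To make the definition concrete, here is a minimal sketch of a small finite MDP in Python with NumPy; the state/action counts, probabilities, and rewards below are illustrative assumptions, not part of the definition:

```python
import numpy as np

n_states, n_actions = 3, 2  # illustrative sizes

# P[a, s, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a);
# each row P[a, s, :] is a probability distribution over next states.
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.2, 0.8, 0.0],   # action 1
     [0.1, 0.0, 0.9],
     [0.0, 0.0, 1.0]],
])

# R[a, s, s'] = immediate reward for the transition s -> s' under action a.
R = np.zeros((n_actions, n_states, n_states))
R[0, 0, 1] = 1.0    # made-up reward values
R[1, 1, 2] = 10.0

assert np.allclose(P.sum(axis=2), 1.0)  # sanity check: rows sum to 1
```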

Policy Specification

A policy is a function $\pi$ that specifies the action $a = \pi(s)$ that the decision maker will choose when in state $s$.

Once a Markov decision process is combined with a policy, the action for each state is fixed and the resulting combination behaves like a Markov chain, since the action chosen in state $s$ is completely determined by $\pi(s)$: the transition probability reduces to $\Pr(s_{t+1} = s' \mid s_t = s) = P_{\pi(s)}(s, s')$.
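
As a sketch of this reduction (continuing the illustrative arrays above, with an assumed policy `pi`), fixing the policy collapses the transition tensor into an ordinary Markov chain transition matrix and the transition rewards into a per-state expected reward:

```python
pi = np.array([1, 1, 0])   # pi[s] = action chosen in state s (assumed policy)
idx = np.arange(n_states)

# Markov chain transition matrix under pi: P_pi[s, s'] = P[pi[s], s, s'].
P_pi = P[pi, idx]

# Expected immediate reward in each state under pi:
# r_pi[s] = sum over s' of P[pi[s], s, s'] * R[pi[s], s, s'].
r_pi = (P_pi * R[pi, idx]).sum(axis=1)
```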


The goal is to choose a policy $\pi$ that will maximize some cumulative function of the random rewards.

Typically, the expected cumulative reward is a discounted sum over a potentially infinite horizon:

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})\right]$$

where $a_t = \pi(s_t)$ (i.e. actions are given by the policy), the expectation is taken over $s_{t+1} \sim P_{a_t}(s_t, \cdot)$, and $\gamma$ is the discount factor satisfying $0 \le \gamma \le 1$, usually close to 1 (for example, $\gamma = 1/(1+r)$ for some discount rate $r$).
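
For a finite MDP under a fixed policy, this expected discounted return $V(s)$ satisfies the linear (Bellman) system $V = r_\pi + \gamma P_\pi V$, so for $\gamma < 1$ it can be computed in closed form. A minimal sketch, continuing the illustrative arrays from the previous snippets:

```python
gamma = 0.9  # illustrative discount factor

# V = r_pi + gamma * P_pi @ V  =>  V = (I - gamma * P_pi)^{-1} r_pi
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(V)  # expected discounted return from each starting state
```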

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of $s$ only, as assumed above.

The discount factor motivates the decision maker to favor taking actions early rather than postponing them indefinitely: with $\gamma = 0.9$, for example, a unit reward received at $t = 10$ contributes only $0.9^{10} \approx 0.35$ to the sum.