Reinforcement Learning/Markov Decision Process


A Markov decision process (MDP) is a Markov chain + a reward function + actions.

A Markov decision process reduces to a Markov reward process once a "policy" $\pi(s)$ is chosen, i.e. a rule specifying the action to take in each state.

Definition

A Markov decision process is a 4-tuple $(S, A, P_a, R_a)$, where

  • $S$ is a finite set of states,
  • $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
  • $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
  • $R_a(s, s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$, due to action $a$.

(Note: the theory of Markov decision processes does not require $S$ or $A$ to be finite, but the basic algorithms below assume that they are.)
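
To make the definition concrete, here is a minimal sketch of a small finite MDP in Python with NumPy; the state/action counts, probabilities, and rewards below are illustrative assumptions, not part of the definition:

```python
import numpy as np

n_states, n_actions = 3, 2  # illustrative sizes

# P[a, s, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a);
# each row P[a, s, :] is a probability distribution over next states.
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.2, 0.8, 0.0],   # action 1
     [0.1, 0.0, 0.9],
     [0.0, 0.0, 1.0]],
])

# R[a, s, s'] = immediate reward for the transition s -> s' under action a.
R = np.zeros((n_actions, n_states, n_states))
R[0, 0, 1] = 1.0    # made-up reward values
R[1, 1, 2] = 10.0

assert np.allclose(P.sum(axis=2), 1.0)  # sanity check: rows sum to 1
```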

Policy Specification

A policy is a function $\pi$ that specifies the action $a = \pi(s)$ that the decision maker will choose when in state $s$.

Once a Markov decision process is combined with a policy, the action for each state is fixed and the resulting combination behaves like a Markov chain, since the action chosen in state $s$ is completely determined by $\pi(s)$: the transition probability reduces to $\Pr(s_{t+1} = s' \mid s_t = s) = P_{\pi(s)}(s, s')$.
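
As a sketch of this reduction (continuing the illustrative arrays above, with an assumed policy `pi`), fixing the policy collapses the transition tensor into an ordinary Markov chain transition matrix and the transition rewards into a per-state expected reward:

```python
pi = np.array([1, 1, 0])   # pi[s] = action chosen in state s (assumed policy)
idx = np.arange(n_states)

# Markov chain transition matrix under pi: P_pi[s, s'] = P[pi[s], s, s'].
P_pi = P[pi, idx]

# Expected immediate reward in each state under pi:
# r_pi[s] = sum over s' of P[pi[s], s, s'] * R[pi[s], s, s'].
r_pi = (P_pi * R[pi, idx]).sum(axis=1)
```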


The goal is to choose a policy $\pi$ that will maximize some cumulative function of the random rewards.

Typically, the expected cumulative reward is a discounted sum over a potentially infinite horizon:

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})\right]$$

where $a_t = \pi(s_t)$ (i.e. actions are given by the policy), the expectation is taken over $s_{t+1} \sim P_{a_t}(s_t, \cdot)$, and $\gamma$ is the discount factor satisfying $0 \le \gamma \le 1$, usually close to 1 (for example, $\gamma = 1/(1+r)$ for some discount rate $r$).
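
For a finite MDP under a fixed policy, this expected discounted return $V(s)$ satisfies the linear (Bellman) system $V = r_\pi + \gamma P_\pi V$, so for $\gamma < 1$ it can be computed in closed form. A minimal sketch, continuing the illustrative arrays from the previous snippets:

```python
gamma = 0.9  # illustrative discount factor

# V = r_pi + gamma * P_pi @ V  =>  V = (I - gamma * P_pi)^{-1} r_pi
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(V)  # expected discounted return from each starting state
```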

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of $s$ only, as assumed above.

The discount factor motivates the decision maker to favor taking actions early rather than postponing them indefinitely: with $\gamma = 0.9$, for example, a unit reward received at $t = 10$ contributes only $0.9^{10} \approx 0.35$ to the sum.