Markov Decision Processes (MDP)

 

The learner and decision-maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment.

In a finite Markov Decision Process (MDP):

  • The agent and the environment interact at discrete time steps, represented as \(t=0,1,2,\cdots\).

  • At each time step \(t\), the agent receives some representation of the environment’s state, denoted as \(S_{t} ∈ S\). Based on this state, the agent selects an action, \(A_{t} ∈ A(S_{t})\).

  • One time step later, the agent receives a numerical reward, \(R_{t+1} ∈ \mathscr{R} ⊂ \mathbb{R}\), and finds itself in a new state, \(S_{t+1}\).

  • The interaction sequence follows the pattern: \(S_{0},A_{0},R_{1},S_{1},A_{1},R_{2},S_{2},A_{2},R_{3},\cdots\)
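
To make the interaction loop concrete, here is a minimal Python sketch of one episode of agent–environment interaction that produces a trajectory of the form \(S_{0},A_{0},R_{1},S_{1},\cdots\). The `ToyEnv` class, its `reset`/`step` methods, and the random action choice are illustrative assumptions, not part of any particular library.

```python
import random

# A toy environment with hypothetical reset()/step() methods and a
# purely random agent; all names and dynamics here are illustrative.
class ToyEnv:
    def reset(self):
        return "s0"                                  # initial state S_0

    def step(self, state, action):
        # Toy dynamics: action "go" reaches the terminal state "s1"
        # with probability 0.5; "stay" never leaves "s0".
        if action == "go" and random.random() < 0.5:
            return "s1", 1.0, True                   # next state, reward, done
        return "s0", 0.0, False

def random_agent(state):
    return random.choice(["stay", "go"])             # A_t drawn from A(S_t)

env = ToyEnv()
state = env.reset()
trajectory = [state]                                 # S_0, A_0, R_1, S_1, A_1, R_2, ...
done = False
while not done:
    action = random_agent(state)
    next_state, reward, done = env.step(state, action)
    trajectory += [action, reward, next_state]
    state = next_state

print(trajectory)
```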

In a finite MDP, the probability distributions of the random variables \(R_{t}\) and \(S_{t}\) depend only on the preceding state and action, i.e., for \(s^{'} ∈ S\) and \(r ∈ \mathscr{R}\),

\[p(s^{'},r ∣ s,a) \dot{=} \text{Pr}\left\{S_{t} = s^{'}, R_{t} = r ∣ S_{t-1} = s, A_{t-1} = a \right\} \notag\]

The function \(p\) defines the dynamics of the MDP. It is an ordinary deterministic function of four arguments, \(p: S × \mathscr{R} × S × A → [0,1]\), and specifies a probability distribution for each choice of state \(s\) and action \(a\), satisfying:

\[\sum\limits_{s^{'} ∈ S} \sum\limits_{r ∈ \mathscr{R}} p(s^{'},r ∣ s,a) = 1 \hspace{3em} \text{for all } s ∈ S, a ∈ A(s) \notag\]
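
As a quick illustration, the dynamics function \(p\) of a small finite MDP can be stored as a table and checked against the normalization condition above. The following Python sketch uses a made-up two-state MDP; the state names, actions, rewards, and probabilities are assumptions for illustration only.

```python
# Dynamics p(s', r | s, a) for a toy two-state MDP, stored as a
# dictionary keyed by (s, a); each value maps (s', r) to a probability.
p = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s0", 0.0): 0.2, ("s1", 1.0): 0.8},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 0.5, ("s1", 2.0): 0.5},
}

# Check the normalization condition: for every (s, a),
# the sum over (s', r) of p(s', r | s, a) must equal 1.
for (s, a), dist in p.items():
    total = sum(dist.values())
    assert abs(total - 1.0) < 1e-12, f"p(.|{s},{a}) sums to {total}"
print("all state-action pairs define valid probability distributions")
```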


An MDP, therefore, is a decision process in which the probability of each possible value of \(S_{t}\) and \(R_{t}\) depends only on the immediately preceding state and action, \(S_{t-1}\) and \(A_{t-1}\), and not on earlier states and actions. In essence, if the state includes information about all aspects of the past agent–environment interaction that make a difference for the future, the state is said to have the Markov property.

Using the dynamics function \(p\), we can compute the expected reward for a state–action pair, denoted as \(r: S × A → \mathbb{R}\):

\[r(s,a) \dot{=} \mathbb{E} [R_{t} ∣ S_{t-1} = s, A_{t-1} = a] = \sum\limits_{r ∈ \mathscr{R}} r \sum\limits_{s^{'} ∈ S} p(s^{'},r ∣ s,a) \notag\]

Similarly, we can compute the expected reward for state–action–next-state triples, denoted as \(r: S × A × S → \mathbb{R}\):

\[r(s,a,s^{'}) \dot{=} \mathbb{E} [R_{t} ∣ S_{t-1} = s, A_{t-1} = a, S_{t} = s^{'}] = \sum\limits_{r ∈ \mathscr{R}} r \frac{p(s^{'},r ∣ s,a)}{p(s^{'} ∣ s,a)} \notag\]
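
The two expected-reward functions follow directly from \(p\). Below is a small Python sketch that computes \(r(s,a)\) and \(r(s,a,s^{'})\) from a dynamics table in the same assumed dictionary layout as the earlier sketch; the states, actions, and rewards are made up for illustration.

```python
# Expected rewards r(s, a) and r(s, a, s') computed from a dynamics
# table p[(s, a)] = {(s', r): probability} (assumed layout).
p = {
    ("s0", "go"): {("s0", 0.0): 0.2, ("s1", 1.0): 0.8},
}

def r_sa(p, s, a):
    # r(s, a) = sum_r r * sum_{s'} p(s', r | s, a)
    return sum(prob * r for (s_next, r), prob in p[(s, a)].items())

def r_sas(p, s, a, s_next):
    # r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)
    p_s_next = sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    total = sum(prob * r for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return total / p_s_next

print(r_sa(p, "s0", "go"))          # 0.8
print(r_sas(p, "s0", "go", "s1"))   # 1.0
```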

Goals and Rewards

In Reinforcement Learning, the goal of the agent is to maximise the cumulative reward it receives. This is formalised as the reward hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward). In other words, if the expected return at time step \(t\) is denoted by \(G_{t}\) and the final time step is \(T\), then

\[G_{t} = R_{t+1} + R_{t+2} + R_{t+3} + R_{t+4} + \cdots + R_{T} \notag\]

The objective is to maximise the cumulative reward received over time, \(R_{t+1}, R_{t+2}, \cdots\), and so on, ultimately maximising the expected return \(G_{t}\).
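
For an episodic task, this undiscounted return is simply a partial sum of the reward sequence. The short Python sketch below illustrates this with a made-up list of rewards.

```python
# Undiscounted return for an episodic task:
# G_t = R_{t+1} + R_{t+2} + ... + R_T.
rewards = [1.0, 0.0, 2.0, 1.0]           # R_1, R_2, R_3, R_4 with T = 4

def undiscounted_return(rewards, t=0):
    # rewards[t:] corresponds to R_{t+1}, ..., R_T
    return sum(rewards[t:])

print(undiscounted_return(rewards))      # G_0 = 4.0
print(undiscounted_return(rewards, 2))   # G_2 = R_3 + R_4 = 3.0
```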

MDP tasks of maximising \(G_{t}\) can be classified into two types: episodic tasks and continuing tasks.

In episodic tasks, the agent–environment interaction breaks naturally into subsequences called episodes. Each episode ends in a special state called the terminal state, and each new episode begins independently of how the previous one ended. In continuing tasks, on the other hand, the agent–environment interaction doesn’t break into identifiable episodes and goes on continually without limit.

Discounting

In some approaches, the agent selects actions \(A_{t}\) to maximize the discounted return \(G_{t}\), the sum of discounted rewards received over the future. It is defined as:

\[G_{t} \dot{=} R_{t+1} + γ R_{t+2} + γ^{2} R_{t+3} + \cdots = \sum\limits_{k=0}^{\infty} γ^{k} R_{t+k+1}\]

Here, \(γ\) is the discount rate and \(0 \le γ \le 1\). Importantly, if \(γ = 0\), the agent is concerned only with maximizing the immediate reward, as the future terms \(γ R_{t+2} + γ^{2}R_{t+3} + \cdots\) become \(0\). On the other hand, if \(γ ≈ 1\), the agent’s return objective takes future rewards into account more strongly.
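
The effect of the discount rate is easy to see numerically. The sketch below computes \(G_{t}\) for a made-up reward sequence at several values of \(γ\); the rewards are arbitrary choices.

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}, showing how
# gamma = 0 keeps only the immediate reward while gamma near 1
# weights future rewards more strongly.
rewards = [1.0, 1.0, 1.0, 10.0]          # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))   # 1.0  -> only R_{t+1} matters
print(discounted_return(rewards, 0.9))   # 10.0 -> future rewards count
print(discounted_return(rewards, 1.0))   # 13.0 -> plain sum of rewards
```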

Furthermore, if we look more closely at eq. (1), we can observe that returns on successive time steps are related to each other as:

\[\begin{align} G_{t} & \dot{=} R_{t+1} + γ R_{t+2} + γ^{2}R_{t+3} + γ^{3}R_{t+4} + \cdots \notag \\ & = R_{t+1} + γ(R_{t+2} + γ R_{t+3} + γ^{2}R_{t+4} + \cdots) \notag \\ & = R_{t+1} + γ G_{t+1} \notag \end{align}\]

Note that this relation holds for all time steps \(t \lt T\), even if termination occurs at \(t+1\), provided we define \(G_{T} = 0\). Also, if the return consists of an infinite number of terms with a constant reward of \(+1\) at every step and \(γ \lt 1\), we can write \(G_{t}\) as:

\[G_{t} = \sum\limits_{k=0}^{\infty} γ^{k} = \frac{1}{1-γ} \notag\]
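
Both observations can be checked numerically: the recursion \(G_{t} = R_{t+1} + γ G_{t+1}\) lets us compute all returns of an episode in a single backward pass (starting from \(G_{T} = 0\)), and for a constant reward of \(+1\) the truncated sum approaches \(\frac{1}{1-γ}\). The rewards and \(γ\) in the sketch below are arbitrary choices.

```python
# Check the recursion G_t = R_{t+1} + gamma * G_{t+1} with a backward
# pass over a short episode, then verify the geometric-series value.
gamma = 0.9
rewards = [2.0, 0.0, 1.0, 3.0]       # R_1, ..., R_T for a short episode

# Backward pass: start from G_T = 0 and apply the recursion.
returns = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]
print(returns[:-1])                  # G_0, G_1, ..., G_{T-1}

# Constant reward of +1 forever: the truncated sum approaches 1/(1 - gamma).
approx = sum(gamma**k * 1.0 for k in range(1000))
print(approx, 1.0 / (1.0 - gamma))   # both approximately 10.0
```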