Markov Decision Processes (MDP)

 

The learner and decision-maker is called the agent.
The thing it interacts with, comprising everything outside the agent, is called the environment.

In a finite Markov Decision Process (MDP):

  • The agent and the environment interact at discrete time steps, represented as \(t=0,1,2,\cdots\).

  • At each time step \(t\), the agent receives some representation of the environment’s state, denoted \(S_{t}∈ S\). Based on this state, the agent selects an action, \(A_{t}∈ A(s)\).

  • One time step later, the agent receives a numerical reward, \(R_{t+1} ∈ \mathscr{R} ⊂ \mathbb{R}\), and finds itself in a new state, \(S_{t+1}\).

  • The interaction sequence follows the pattern: \(S_{0},A_{0},R_{1},S_{1},A_{1},R_{2},S_{2},A_{2},R_{3},\cdots\)

In a finite MDP, the random variables \(R_{t}\) and \(S_{t}\) have well-defined discrete probability distributions that depend only on the preceding state and action, i.e., for \(s^{'}∈ S\) and \(r ∈ \mathscr{R}\),

\[p(s^{'},r ∣ s,a) \dot{=} \text{Pr}\left\{S_{t} = s^{'}, R_{t} = r ∣ S_{t-1} = s, A_{t-1} = a \right\} \notag\]

The function \(p\) defines the dynamics of the MDP. It is an ordinary deterministic function of four arguments, \(p: S × \mathscr{R} × S × A → [0,1]\), and it specifies a probability distribution for each choice of state \(s\) and action \(a\), satisfying:

\[\sum\limits_{s^{'}∈ S} \sum\limits_{r∈ \mathscr{R}} p(s^{'},r ∣ s,a) = 1 \hspace{3em} \text{for all } s∈ S,\ a∈ A(s) \notag\]
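
As a concrete illustration, the four-argument dynamics function \(p\) can be stored as a lookup table and the normalization condition checked directly. The sketch below is a minimal Python example with a made-up two-state, two-action MDP; the states, actions, rewards, and probabilities are illustrative assumptions, not something defined in the text.

```python
# Minimal sketch: the dynamics p(s', r | s, a) of a hypothetical finite MDP,
# stored as a dictionary keyed by (s, a) mapping to {(s', r): probability}.
# The states, actions, rewards, and probabilities below are assumptions.
p = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
}

# p must specify a probability distribution for every state-action pair:
# the probabilities over (s', r) have to sum to 1.
for (s, a), dist in p.items():
    total = sum(dist.values())
    assert abs(total - 1.0) < 1e-12, f"p(.|{s},{a}) sums to {total}, not 1"
print("All state-action pairs satisfy the normalization condition.")
```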


An MDP, therefore, is a decision process in which the probability of each value of \(S_{t}\) and \(R_{t}\) depends only on the immediately preceding state and action, \(S_{t-1}\) and \(A_{t-1}\), and not on earlier states and actions. Equivalently, if the state includes all information about the past agent-environment interaction that makes a difference for the future, the state is said to have the Markov property.

Using the dynamics function \(p\), we can compute the expected reward for a state-action pair, denoted \(r: S × A → \mathbb{R}\):

\[r(s,a) \dot{=} \mathbb{E} [R_{t} ∣ S_{t-1} = s, A_{t-1} = a] = \sum\limits_{r∈ \mathscr{R}} r \sum\limits_{s^{'}∈ S} p(s^{'},r ∣ s,a) \notag\]

Similarly, we can compute the expected reward for state-action-next-state triples, denoted \(r: S × A × S → \mathbb{R}\):

\[r(s,a,s^{'})\dot{=}\mathbb{E} [R_{t} ∣ S_{t-1} = s, A_{t-1} = a, S_{t} = s^{'}] = \sum\limits_{r ∈ \mathscr{R}} r \frac{p(s^{'},r ∣ s,a)}{p(s^{'} ∣ s,a)} \notag\]
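
Continuing with the same kind of illustrative dynamics table, both expected-reward functions can be computed directly from \(p\). This is a sketch under the same assumed toy MDP, not a definition from the text.

```python
# Sketch: expected rewards derived from the dynamics p(s', r | s, a).
# The toy MDP below is an illustrative assumption.
p = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
}

def r_sa(s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(prob * rew for (s_next, rew), prob in p[(s, a)].items())

def r_sas(s, a, s_next):
    """r(s, a, s') = sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    p_next = sum(prob for (sn, _), prob in p[(s, a)].items() if sn == s_next)
    num = sum(prob * rew for (sn, rew), prob in p[(s, a)].items() if sn == s_next)
    return num / p_next  # assumes p(s' | s, a) > 0

print(r_sa("s0", "go"))         # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
print(r_sas("s0", "go", "s1"))  # 1.0, since the reward is 1.0 whenever s' = s1
```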

Goals and Rewards

In Reinforcement Learning, the goal of the agent is to maximise the reward it receives. This is formalised by the reward hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward). In other words, if the expected return at time step \(t\) is denoted by \(G_{t}\) and the final time step is \(T\), then

\[G_{t} = R_{t+1} + R_{t+2} + R_{t+3} + R_{t+4} + \cdots + R_{T} \notag\]

The objective is to maximise the cumulative reward received after time step \(t\), i.e., \(R_{t+1}, R_{t+2}, \cdots\), and thereby to maximise \(G_{t}\).

Tasks of maximising \(G_{t}\) in an MDP can be classified into two types: episodic tasks and continuing tasks.

In an episodic task, the agent-environment interaction breaks into subsequences called episodes. Each episode ends in a special state called the terminal state, and each new episode begins independently of how the previous one ended. In a continuing task, on the other hand, the agent-environment interaction doesn’t break into identifiable episodes and instead goes on continually without limit.

Discounting

In some approaches, the agent selects actions \(A_{t}\) to maximize the expected discounted return \(G_{t}\) over the future, defined as:

\[G_{t} \dot{=} R_{t+1} + γ R_{t+2} + γ^{2} R_{t+3} + \cdots = \sum\limits_{k=0}^{\infty} γ^{k} R_{t+k+1}\]

Here, \(γ\) is the discount rate, with \(0 \le γ \le 1\). Importantly, if \(γ = 0\), the agent is concerned only with maximizing the immediate reward, since the future terms \(γ R_{t+2} + γ^{2}R_{t+3} + \cdots\) become \(0\). On the other hand, as \(γ\) approaches \(1\), the agent’s return objective takes future rewards into account more strongly.

Furthermore, if we look more closely at eq(1), we can observe that returns on successive time steps are related to each other as follows:

\[\begin{align} G_{t} & \dot{=} R_{t+1} + γ R_{t+2} + γ^{2}R_{t+3} + γ^{3}R_{t+4} + \cdots \notag \\ & = R_{t+1} + γ(R_{t+2} + γ R_{t+3} + γ^{2}R_{t+4} + \cdots) \notag \\ & = R_{t+1} + γG_{t+1} \notag \\ \end{align}\]

Note that this works for all time steps \(t \lt T\), even if termination occurs at \(t+1\), provided we define \(G_{T} = 0\). Also, if the reward is a nonzero constant, say \(+1\), at every time step and \(γ \lt 1\), then \(G_{t}\) consists of infinitely many terms but still has a finite value:

\[G_{t} = \sum\limits_{k=0}^{\infty} γ^{k} = \frac{1}{1-γ} \notag\]
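
The recursion \(G_{t} = R_{t+1} + γG_{t+1}\) suggests computing returns backward from the end of a reward sequence. The short Python sketch below does exactly that for an arbitrary, assumed reward sequence and discount rate, and also checks the constant-reward value \(1/(1-γ)\) numerically.

```python
# Sketch: discounted returns computed backwards via G_t = R_{t+1} + gamma * G_{t+1}.
# The reward sequence and gamma are assumed values for illustration only.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]        # R_1, R_2, R_3, R_4 of a short episode

returns = [0.0] * (len(rewards) + 1)  # returns[T] = G_T = 0
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]   # rewards[t] is R_{t+1}
print(returns[:-1])                   # G_0, G_1, G_2, G_3

# With a constant reward of +1 at every step and gamma < 1,
# G_t approaches 1 / (1 - gamma).
approx = sum(gamma ** k for k in range(1000))   # truncated infinite sum
print(approx, 1.0 / (1.0 - gamma))              # both close to 10.0
```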

Value Function

A value function is a function of states (or of state-action pairs) that estimates how good it is for the agent to be in a given state (or to take a given action in a given state), where ‘how good’ is defined in terms of future rewards, i.e., the expected return.

Policy

A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy \(\pi\) at time \(t\), then \(\pi(a ∣ s)\) is the probability that \(A_{t} = a\) given \(S_{t} = s\).
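
For a finite MDP, such a policy can be stored as a table of action probabilities per state. The sketch below uses assumed states, actions, and probabilities purely for illustration.

```python
import random

# Sketch: a stochastic policy pi(a | s) stored as a nested dictionary.
# The states, actions, and probabilities are illustrative assumptions.
pi = {
    "s0": {"stay": 0.3, "go": 0.7},
    "s1": {"stay": 0.9, "go": 0.1},
}

def sample_action(state, rng=random):
    """Draw A_t from pi(. | S_t = state)."""
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))   # "go" with probability 0.7, "stay" with 0.3
```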

The value function of a state under a policy \(\pi\), denoted \(v_{\pi}(s)\), is the expected return when starting in \(s\) and following \(\pi\) thereafter.

For an MDP,

\[v_{\pi}(s) \dot{=} \mathbb{E}_{\pi}\left[G_{t} ∣ S_{t} = s \right] = \mathbb{E}_{\pi} \left[ \sum\limits_{k=0}^{∞} γ^{k} R_{t+k+1} \mid S_{t} = s \right]\]

This is the state-value function for policy \(\pi\).

Similarly, the value of taking action \(a\) in state \(s\) under a policy \(\pi\), denoted \(q_{\pi}(s,a)\), is the expected return starting from \(s\), taking action \(a\), and thereafter following \(\pi\):

\[q_{\pi}(s,a) = \mathbb{E}_{\pi} \left[ G_{t} \mid S_{t}= s, A_{t} = a \right] = \mathbb{E}_{\pi} \left[ \sum\limits_{k=0}^{\infty} γ^{k} R_{t+k+1} \mid S_{t} = s, A_{t} = a \right]\]

This is the action-value function for policy \(\pi\).

  • If an agent follows policy \(\pi\) and maintains an average of the actual returns that followed each state encountered, then the average will converge to the state’s value \(v_{\pi}(s)\) as the number of times the state is encountered approaches \(\infty\).

  • If separate averages are kept for each action taken in each state, then these averages will converge to the action values \(q_{\pi}(s,a)\). Estimation methods of this kind are called Monte Carlo methods; a minimal sketch follows this list.
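
Below is a minimal first-visit Monte Carlo sketch of this idea for \(v_{\pi}\). It uses a small hypothetical episodic MDP and an equiprobable random policy; every state, action, reward, and probability in it is an illustrative assumption.

```python
import random
from collections import defaultdict

# Sketch: first-visit Monte Carlo estimation of v_pi on a tiny hypothetical
# episodic MDP. All states, actions, rewards, and probabilities are assumptions.
gamma = 0.9
p = {  # (s, a) -> {(s', r): probability}; "end" is the terminal state
    ("A", "left"):  {("B", 0.0): 1.0},
    ("A", "right"): {("end", 1.0): 1.0},
    ("B", "left"):  {("A", 0.0): 1.0},
    ("B", "right"): {("end", 2.0): 0.5, ("A", 0.0): 0.5},
}
pi = {"A": {"left": 0.5, "right": 0.5}, "B": {"left": 0.5, "right": 0.5}}

def generate_episode(start, rng):
    """Follow pi from `start`; return the list [(S_0, R_1), (S_1, R_2), ...]."""
    episode, s = [], start
    while s != "end":
        a = rng.choices(list(pi[s]), weights=list(pi[s].values()), k=1)[0]
        outcomes, probs = zip(*p[(s, a)].items())
        (s_next, r), = rng.choices(outcomes, weights=probs, k=1)
        episode.append((s, r))
        s = s_next
    return episode

rng = random.Random(0)
returns = defaultdict(list)
for _ in range(5000):
    episode = generate_episode("A", rng)
    G, first_visit_return = 0.0, {}
    for s, r in reversed(episode):      # accumulate G_t backwards
        G = r + gamma * G
        first_visit_return[s] = G       # ends up holding the earliest visit's return
    for s, g in first_visit_return.items():
        returns[s].append(g)

v_pi = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(v_pi)   # Monte Carlo estimates of v_pi("A") and v_pi("B")
```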

For any policy \(\pi\) and any state \(s\), the following consistency condition holds between the value of \(s\) and the values of its possible successor states:

\[\begin{align} v_{\pi}(s) &\dot{=} \mathbb{E}_{\pi} \left[G_{t} \mid S_{t}=s \right] \notag\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + γG_{t+1} \mid S_{t} = s \right] \notag \\ & = \sum\limits_{a} \pi(a ∣ s) \sum\limits_{s^{'}} \sum\limits_{r} p(s^{'},r ∣ s,a) \left[r + γ \mathbb{E}_{\pi} \left[G_{t+1} \mid S_{t+1} = s^{'} \right]\right] \notag \\ v_{\pi}(s) &=\sum\limits_{a} \pi(a ∣ s)\sum\limits_{s^{'},r} p(s^{'},r ∣ s,a) \left[r + γv_{\pi}(s^{'}) \right] \end{align}\]

This is the Bellman equation for \(v_{\pi}\). It expresses a relationship between the value of a state and the values of its successor states.
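
One direct use of this equation is iterative policy evaluation: treat the last line above as an update rule and sweep it over all states until the values stop changing. The sketch below does this for an assumed toy MDP and an assumed equiprobable policy.

```python
# Sketch: iterative policy evaluation driven by the Bellman equation for v_pi.
# The toy MDP and the equiprobable policy are illustrative assumptions.
gamma = 0.9
states = ["s0", "s1"]
p = {  # (s, a) -> {(s', r): probability}
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
}
pi = {s: {"stay": 0.5, "go": 0.5} for s in states}

v = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        # v(s) = sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * [r + gamma * v(s')]
        new_v = sum(
            pi[s][a] * sum(prob * (r + gamma * v[s_next])
                           for (s_next, r), prob in p[(s, a)].items())
            for a in pi[s]
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:   # stop once a full sweep changes nothing (to tolerance)
        break
print(v)   # approximate v_pi for the equiprobable policy
```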

Optimal Policies and Optimal Value Functions

A policy \(\pi\) is defined to be better than or equal to a policy \(\pi^{'}\) if its expected return is greater than or equal to that of \(\pi^{'}\) for all states, i.e., \(\pi \ge \pi^{'}\) if and only if \(v_{\pi}(s) \ge v_{\pi^{'}}(s)\) for all \(s ∈ S\).

Therefore, an optimal policy is a policy that is better than or equal to all other policies. There may be more than one optimal policy. We denote all optimal policies by \(\pi_{\ast}\); they all share the same state-value function, called the optimal state-value function, \(v_{\ast}\), defined by:

\[v_{\ast}(s) = \max\limits_{\pi}v_{\pi}(s) \notag\]

Optimal policies also share the same optimal action-value function, \(q_{\ast}\), defined by:

\[q_{\ast}(s,a) \dot{=}\max\limits_{\pi} q_{\pi}(s,a)\]

For the state-action pair \((s,a)\), this function gives the expected return for taking action \(a\) in state \(s\) and thereafter following an optimal policy. Thus, we can write \(q_{\ast}\) in terms of \(v_{\ast}\) as follows:

\[q_{\ast}(s,a) = \mathbb{E}\left[R_{t+1} + γv_{\ast}(S_{t+1}) ∣ S_{t} = s,A_{t} = a \right]\]

Bellman Optimality Equation

It expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state. The Bellman optimality equation for \(v_{\ast}\), therefore, is:

\[\begin{align} v_{\ast}(s) & =\max\limits_{a ∈ A(s)} q_{\pi_{\ast}}(s,a) \notag \\ & = \max\limits_{a}\mathbb{E}_{\pi_{\ast}} \left[G_{t} ∣ S_{t} = s,A_{t} = a \right] \notag\\ & = \max\limits_{a}\mathbb{E}_{\pi_{\ast}} \left[R_{t+1} + γG_{t+1} \mid S_{t} = s, A_{t} = a \right] \notag\\ & = \max\limits_{a} \mathbb{E} \left[R_{t+1} + γv_{\ast}(S_{t+1}) \mid S_{t} = s,A_{t} = a \right] \notag \\ v_{\ast}(s) & =\max\limits_{a} \sum\limits_{s^{'},r} p(s^{'},r ∣ s,a) \left[r + γv_{\ast}(s^{'}) \right] \end{align}\]
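
The last line above is exactly the update used by value iteration: repeatedly replace \(v(s)\) by the right-hand side until the values converge to \(v_{\ast}\). The sketch below applies it to an assumed toy MDP and then extracts a greedy policy using the \(q_{\ast}\)-in-terms-of-\(v_{\ast}\) relation from the previous section.

```python
# Sketch: value iteration, i.e. the Bellman optimality equation for v* used as
# an update rule. The toy MDP below is an illustrative assumption.
gamma = 0.9
states = ["s0", "s1"]
actions = ["stay", "go"]
p = {  # (s, a) -> {(s', r): probability}
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
}

def backup(s, a, v):
    """sum_{s',r} p(s',r|s,a) * [r + gamma * v(s')] -- the expected one-step return."""
    return sum(prob * (r + gamma * v[s_next])
               for (s_next, r), prob in p[(s, a)].items())

v = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        new_v = max(backup(s, a, v) for a in actions)   # max over actions
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

# Greedy policy: argmax_a q*(s,a), with q*(s,a) = E[R_{t+1} + gamma * v*(S_{t+1})].
policy = {s: max(actions, key=lambda a, s=s: backup(s, a, v)) for s in states}
print(v, policy)
```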

Similarly, the Bellman optimality equation for \(q_{\ast}\) is:

\[\begin{align} q_{\ast}(s,a) &= \mathbb{E} \left[ R_{t+1} + γ \max\limits_{a^{'}} q_{\ast}(S_{t+1},a^{'}) ∣ S_{t} = s, A_{t} = a \right] \notag \\ & = \sum\limits_{s^{'},r} p(s^{'},r ∣ s,a) \left[r + γ \max\limits_{a^{'}} q_{\ast}(s^{'},a^{'})\right] \end{align}\]
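
The same fixed-point idea works directly on action values: iterate the second line above as an update over all state-action pairs. Again, the toy dynamics below are an assumption used only to make the sketch runnable.

```python
# Sketch: Q-value iteration, applying the Bellman optimality equation for q*
# as an update rule. The toy MDP below is an illustrative assumption.
gamma = 0.9
states = ["s0", "s1"]
actions = ["stay", "go"]
p = {  # (s, a) -> {(s', r): probability}
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
}

q = {(s, a): 0.0 for s in states for a in actions}
while True:
    delta = 0.0
    for (s, a) in list(q):
        # q(s,a) = sum_{s',r} p(s',r|s,a) * [r + gamma * max_a' q(s',a')]
        new_q = sum(prob * (r + gamma * max(q[(s_next, a2)] for a2 in actions))
                    for (s_next, r), prob in p[(s, a)].items())
        delta = max(delta, abs(new_q - q[(s, a)]))
        q[(s, a)] = new_q
    if delta < 1e-10:
        break

print(q)   # approximate q*
print({s: max(actions, key=lambda a, s=s: q[(s, a)]) for s in states})  # greedy (optimal) policy
```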