Policies and Value Functions


Value Function

A value function is a function of states (or of state-action pairs) that estimates how good it is for the agent to be in a given state (or to take a given action in a given state), where ‘how good’ is measured in terms of future rewards, i.e., the expected return.

Policy

A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy \(\pi\) at time \(t\), then \(\pi(a ∣ s)\) is the probability that \(A_{t} = a\) if \(S_{t} = s\).

The value of a state under a policy \(\pi\), denoted \(v_{\pi}(s)\), is the expected return when starting in \(s\) and following \(\pi\) thereafter.

For MDP,

\[v_{\pi}(s) \dot{=} \mathbb{E}_{\pi}\left[G_{t} ∣ S_{t} = s \right] = \mathbb{E}_{\pi} \left[ \sum\limits_{k=0}^{∞} γ^{k} R_{t+k+1} \mid S_{t} = s \right]\]

Equation (1) is the state-value function for policy \(\pi\).

Similarly, the value of taking action \(a\) in state \(s\) under a policy \(\pi\), denoted \(q_{\pi}(s,a)\), is the expected return starting from \(s\), taking the action \(a\), and thereafter following \(\pi\):

\[q_{\pi}(s,a) = \mathbb{E}_{\pi} \left[ G_{t} \mid S_{t}= s, A_{t} = a \right] = \mathbb{E}_{\pi} \left[ \sum\limits_{k=0}^{\infty} γ^{k} R_{t+k+1} \mid S_{t} = s, A_{t} = a \right]\]

Equation (2) is the action-value function for policy \(\pi\).
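
As a concrete illustration of the return \(G_{t}\) inside these expectations, here is a minimal Python sketch that computes \(\sum_{k} γ^{k} R_{t+k+1}\) for a finite reward sequence (the function name and sample rewards are illustrative, not from the text):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite episode.

    `rewards` holds R_{t+1}, R_{t+2}, ... observed after time t.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g


# Example: rewards 1, 0, 2 with gamma = 0.9 give 1 + 0 + 0.81 * 2 = 2.62
print(discounted_return([1, 0, 2], gamma=0.9))
```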

  • If an agent follows policy \(\pi\) and maintains an average of the actual returns that followed each state encountered, then that average will converge to the state’s value \(v_{\pi}(s)\) as the number of times the state is encountered approaches \(\infty\).

  • If separate averages are kept for each action taken in each state, then the averages will converge to the action values \(q_{\pi}(s,a)\). These estimation methods are called Monte Carlo methods; a sketch follows this list.
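
A minimal sketch of this Monte Carlo idea, assuming episodes are supplied as lists of (state, reward) pairs (the data format, function name, and use of every-visit averaging are my own illustrative choices):

```python
from collections import defaultdict


def mc_state_values(episodes, gamma=0.9):
    """Every-visit Monte Carlo estimate of v_pi(s).

    `episodes` is a list of episodes generated by following pi;
    each episode is a list of (state, reward) pairs, where reward
    is R_{t+1} received after visiting `state` at time t.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards so that G_t = R_{t+1} + gamma * G_{t+1}
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    # The average of the observed returns converges to v_pi(s)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

Keeping the same averages per (state, action) pair instead would give an estimate of \(q_{\pi}(s,a)\).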

For any policy \(\pi\) and any state \(s\), the following consistency condition holds between the value of \(s\) and the values of its possible successor states:

\[\begin{align} v_{\pi}(s) &\dot{=} \mathbb{E}_{\pi} \left[G_{t} \mid S_{t}=s \right] \notag\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + γG_{t+1} \mid S_{t} = s \right] \notag \\ & = \sum\limits_{a} \pi(a ∣ s) \sum\limits_{s^{'}} \sum\limits_{r} p(s^{'},r ∣ s,a) \left[r + γ \mathbb{E}_{\pi} \left[G_{t+1} \mid S_{t+1} = s^{'} \right]\right] \notag \\ v_{\pi}(s) &=\sum_{a} \pi(a ∣ s)\sum\limits_{s^{'},r} p(s^{'},r ∣ s,a) \left[r + γv_{\pi}(s^{'}) \right] \end{align}\]

Equation (3) is the Bellman Equation for \(v_{\pi}\). It expresses a relationship between the value of a state and the values of its successor states.
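
Because eq. (3) holds for every state, it can also be used as an update rule. The sketch below performs iterative policy evaluation, assuming the dynamics are available as a dictionary p[(s, a)] mapping to a list of (s_next, r, prob) triples and the policy as pi[s] mapping each action to its probability (both data structures are assumptions for illustration):

```python
def policy_evaluation(states, pi, p, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman equation for v_pi until the values settle."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            v_new = sum(
                pi[s][a] * sum(prob * (r + gamma * V[s_next])
                               for s_next, r, prob in p[(s, a)])
                for a in pi[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```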

Optimal Policies and Optimal Value Functions

A policy \(\pi\) is defined to be better than or equal to a policy \(\pi^{'}\) if its expected return is greater than or equal to that of \(\pi^{'}\) for all states, i.e., \(\pi \ge \pi^{'}\) if and only if \(v_{\pi}(s) \ge v_{\pi^{'}}(s)\) for all \(s\).

Therefore, an optimal policy is a policy that is better than or equal to all other policies. There may be more than one such policy. We denote all the optimal policies by \(\pi_{\ast}\); they all share the same state-value function, called the optimal state-value function \(v_{\ast}\), defined by:

\[v_{\ast}(s) = \max\limits_{\pi}v_{\pi}(s) \notag\]

Optimal policies also share the same optimal action-value function, \(q_{\ast}\), defined by:

\[q_{\ast}(s,a) \dot{=}\max\limits_{\pi} q_{\pi}(s,a)\]

For the state-action pair \((s,a)\), this function gives the expected return for taking action \(a\) in state \(s\) and thereafter following an optimal policy. Thus, we can write \(q_{\ast}\) in terms of \(v_{\ast}\) as follows:

\[q_{\ast}(s,a) = \mathbb{E}\left[R_{t+1} + γv_{\ast}(S_{t+1}) ∣ S_{t} = s,A_{t} = a \right]\]
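
Under the same assumed dynamics format p[(s, a)] -> list of (s_next, r, prob) triples used above, this relation is a one-step lookahead (a sketch, not a definitive implementation):

```python
def q_from_v(s, a, V, p, gamma=0.9):
    """Expected return for taking action a in state s and thereafter
    following the policy whose state values are V (an estimate of v_*)."""
    return sum(prob * (r + gamma * V[s_next])
               for s_next, r, prob in p[(s, a)])
```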

Bellman Optimality Equation

It expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state. The Bellman Optimality Equation for \(v_{\ast}\), therefore, is:

\[\begin{align} v_{\ast}(s) & =\max\limits_{a ∈ A(s)} q_{\pi_{\ast}}(s,a) \notag \\ & = \max\limits_{a}\mathbb{E}_{\pi_{\ast}} \left[G_{t} ∣ S_{t} = s,A_{t} = a \right] \notag\\ & = \max\limits_{a}\mathbb{E}_{\pi_{\ast}} \left[R_{t+1} + γG_{t+1} \mid S_{t} = s, A_{t} = a \right] \notag\\ v_{\ast}(s) & = \max\limits_{a} \mathbb{E} \left[R_{t+1} + γv_{\ast}(S_{t+1}) \mid S_{t} = s,A_{t} = a \right] \notag \\ v_{\ast}(s) & =\max\limits_{a} \sum\limits_{s^{'},r} p(s^{'},r ∣ s,a) \left[r + γv_{\ast}(s^{'}) \right] \end{align}\]

Similarly, the Bellman Optimality Equation for \(q_{\ast}\) is:

\[\begin{align} q_{\ast}(s,a) &= \mathbb{E} \left[ R_{t+1} + γ \max\limits_{a^{'}} q_{\ast}(S_{t+1},a^{'}) ∣ S_{t} = s, A_{t} = a \right] \notag \\ & = \sum\limits_{s^{'},r} p(s^{'},r ∣ s,a) \left[r + γ \max\limits_{a^{'}} q_{\ast}(s^{'},a^{'})\right] \end{align}\]
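
Treating this optimality condition as an update rule gives value iteration. The sketch below reuses the assumed p[(s, a)] dynamics dictionary and an actions[s] list of available actions (both illustrative), sweeping the max over actions until the values stabilise; acting greedily with respect to the resulting \(v_{\ast}\) yields an optimal policy.

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-6):
    """Apply the Bellman Optimality Equation for v_* as an update rule."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            v_new = max(
                sum(prob * (r + gamma * V[s_next])
                    for s_next, r, prob in p[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a greedy (hence optimal) policy from v_*
    policy = {
        s: max(actions[s],
               key=lambda a: sum(prob * (r + gamma * V[s_next])
                                 for s_next, r, prob in p[(s, a)]))
        for s in states
    }
    return V, policy
```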