Policy Gradient Methods


Unlike action-value methods, where we first learn the values of actions and then select actions based on those estimated action-values, Policy Gradient Methods use a parameterized policy that can select actions without consulting a value function.

If we write \(θ ∈ \mathbb{R}^{d}\) for the policy’s parameter vector, then the probability that action \(a\) is taken at time \(t\), given that the environment is in state \(s\) at time \(t\) with parameter \(θ\), is:

\[π (a \mid s,θ ) = Pr\left\{A_{t} = a \mid S_{t} = s, θ_{t} = θ \right\} \notag\]

Here, we seek to learn the policy parameter based on the gradient of some scalar performance measure \(J(θ)\), i.e.,

\[θ_{t + 1} = θ_{t} + α \widehat{∇ J(θ_{t})} \tag{1}\]

where \(\widehat{∇ J(θ_{t})} ∈ \mathbb{R}^d\) is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument \(\theta_{t}\).

Methods that follow this general schema are called Policy Gradient Methods.

Methods that additionally learn an approximation to the value function, alongside the approximate policy, are called Actor-Critic Methods.

Policy Approximation and its Advantages

In policy gradient methods, the policy can be parameterized in any way, as long as \(π(a ∣ s, θ)\) is differentiable with respect to its parameters. To ensure exploration, we generally require that the policy never become deterministic, i.e., \(π(a ∣ s, θ) ∈ (0,1)\) for all \(s, a, θ\).

If the action space is discrete and not too large, the policy can be parameterized by forming numerical preferences \(h(s,a,θ) ∈ \mathbb{R}\) for each state-action pair. The actions with the highest preferences in each state are given the highest probabilities of being selected, according to an exponential soft-max distribution:

\[π(a ∣ s, θ) \dot{=} \frac{e^{h(s,a,θ)}}{\sum_{b}e^{h(s,b,θ)}} \notag\]

This kind of policy parameterization is called soft-max in action preferences. The action preferences themselves can be parameterized arbitrarily: for example, by an ANN, with \(θ\) as the vector of network weights, or linearly in features, i.e., \(h(s,a,θ) = θ^{T}x(s,a)\), using feature vectors \(x(s,a) ∈ \mathbb{R}^{d}\).
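As a concrete illustration, here is a minimal sketch of soft-max in action preferences with linear preferences. The feature constructor `x(s, a)` and the one-hot encoding are hypothetical choices made for this example, not something prescribed by the text:

```python
import numpy as np

def softmax_policy(theta, x, state, actions):
    """Soft-max in action preferences with linear preferences
    h(s, a, theta) = theta @ x(s, a).
    Returns the vector of probabilities pi(a | state, theta) over `actions`."""
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Hypothetical one-hot state-action features: 2 states x 3 actions -> dimension 6.
def x(state, action):
    feat = np.zeros(6)
    feat[state * 3 + action] = 1.0
    return feat

theta = np.zeros(6)                      # all preferences equal ...
print(softmax_policy(theta, x, state=1, actions=[0, 1, 2]))   # ... so pi is uniform
```

Subtracting the maximum preference before exponentiating does not change the distribution but keeps the exponentials from overflowing.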

Advantages

  1. With this parameterization, the action probabilities change smoothly as a function of the learned parameters, whereas with \(ϵ\)-greedy selection they can change dramatically: an arbitrarily small change in the estimated action-values can make a different action the greedy one. Largely because of this, policy gradient methods have stronger convergence guarantees.

  2. In problems with significant function approximation, the best approximate policy may be stochastic. Parameterizing the policy directly allows an optimal stochastic policy to be found, which action-value methods have no natural way of doing.

  3. The policy may be a simpler function to approximate than the action-value function, in which case a policy-based method typically learns faster and yields a better asymptotic policy.

  4. It can help inject prior knowledge about the desired form of the policy into the RL system.

Policy Gradient Theorem

The episodic and continuing cases define the performance measure \(J(θ)\) differently.

In the episodic case, the performance is defined as:

\[J(θ) \dot{=} v_{π_{θ}}(s_{0}) \notag\]

where \(v_{π_{θ}}\) is the true value function for \(π_{θ}\), the policy determined by \(θ\), and \(s_{0}\) is the (non-random) state from which each episode starts, assuming no discounting, i.e., \(γ = 1\). In the continuing case, the performance is defined in terms of the average rate of reward per time step:

\[\begin{align} J(θ) &\dot{=} \lim_{h → ∞} \frac{1}{h}\sum\limits_{t=1}^{h} \mathbb{E} [R_{t} ∣ S_{0}, A_{0:t-1} ∼ π]\notag\\ &= \lim_{t → ∞} \mathbb{E} [R_{t} ∣ S_{0}, A_{0:t-1} ∼ π] \notag\\ &= \sum\limits_{s} μ(s)\sum\limits_{a} π(a ∣ s) \sum\limits_{s',r} p(s',r ∣ s,a)\,r \tag{2} \end{align}\]

where \(μ(s)\) is the steady-state distribution of states under \(π\), assumed to exist and to be independent of \(S_{0}\).
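To make eq(2) concrete, the following sketch computes the average reward for a small, made-up two-state MDP. The transition probabilities, rewards, and policy below are arbitrary illustrative numbers, and expected rewards \(r(s,a,s')\) are used in place of the full \(p(s',r ∣ s,a)\) distribution:

```python
import numpy as np

# p[s, a, s2]: transition probabilities; r[s, a, s2]: expected rewards (made up).
p = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
r = np.array([[[1.0, 0.0],
               [0.0, 2.0]],
              [[0.0, 1.0],
               [3.0, 0.0]]])
pi = np.array([[0.7, 0.3],    # pi(a | s=0)
               [0.4, 0.6]])   # pi(a | s=1)

# State-to-state transition matrix under pi: P[s, s2] = sum_a pi(a|s) p(s, a, s2)
P = np.einsum('sa,sab->sb', pi, p)

# Steady-state distribution mu: solve mu P = mu with sum(mu) = 1
A = np.vstack([P.T - np.eye(2), np.ones((1, 2))])
b = np.array([0.0, 0.0, 1.0])
mu = np.linalg.lstsq(A, b, rcond=None)[0]

# J(theta) = sum_s mu(s) sum_a pi(a|s) sum_s2 p(s2|s,a) r(s,a,s2)
J = np.einsum('s,sa,sab,sab->', mu, pi, p, r)
print("mu =", mu, " average reward J =", J)
```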

Challenge:

The policy parameter affects both the action selections and the distribution of states in which those selections are made. Given a state, the effect of the policy parameter on the actions, and thereby on the reward, can be computed from the parameterization; the effect on the state distribution, however, depends on the environment and is typically unknown.

The Policy Gradient Theorem resolves this by providing an analytic expression for the gradient of performance with respect to the policy parameter that does not involve the derivative of the state distribution. It establishes that:

\[∇ J(θ) ∝ \sum\limits_{s} μ(s) \sum\limits_{a} q_{π}(s,a)\, ∇ π (a ∣ s,θ) \tag{3}\]

REINFORCE: Monte Carlo Policy Gradient

In eq(3), the policy gradient theorem is a sum over states weighted by how often those states occur under the target policy \(π\). Thus, it can be written as an expectation:

\[\begin{align} ∇J(θ) & ∝ \sum\limits_{s} μ(s) \sum\limits_{a} q_{π}(s,a)\, ∇ π (a ∣ s,θ) \notag \\ &= \mathbb{E}_{π} \left[\sum\limits_{a} q_{π}(S_{t},a)\, ∇ π (a ∣ S_{t},θ ) \right] \tag{4} \end{align}\]

Now, the stochastic gradient ascent algorithm of eq(1) can be instantiated as:

\[θ_{t+1} \dot{=} θ_{t} + α \sum\limits_{a} \hat{q} (S_{t},a,W)\, ∇ π (a ∣ S_{t}, θ) \tag{5}\]

where \(\hat{q}\) is some learned approximation to \(q_{\pi}\). This algorithm is called the all-actions method because its update involves all of the actions.
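As a rough sketch of eq(5) for the linear soft-max parameterization above (reusing the hypothetical `softmax_policy` and `x` from the earlier snippet, and assuming some already-learned approximation `q_hat(s, a)`):

```python
def grad_pi(theta, x, state, actions):
    """Gradient of pi(a|s,theta) for the linear soft-max parameterization:
    grad pi(a|s) = pi(a|s) * (x(s,a) - sum_b pi(b|s) x(s,b))."""
    pi = softmax_policy(theta, x, state, actions)
    feats = np.array([x(state, a) for a in actions])
    x_bar = pi @ feats                      # expected feature vector under pi
    return [pi[i] * (feats[i] - x_bar) for i in range(len(actions))]

def all_actions_update(theta, alpha, q_hat, x, state, actions):
    """One all-actions update (eq 5): theta += alpha * sum_a q_hat(s,a) grad pi(a|s)."""
    grads = grad_pi(theta, x, state, actions)
    return theta + alpha * sum(q_hat(state, a) * grads[i]
                               for i, a in enumerate(actions))
```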

The \(\text{REINFORCE}\) algorithm, by contrast, bases its update on \(A_{t}\) alone, the one action actually taken at time \(t\).

In eq(4), if we weight each term of the sum over actions by the probability of that action being selected under \(π\), without changing the equality (multiplying and dividing each term by \(π(a ∣ S_{t}, θ)\)), we have:

\[\begin{align} ∇ J(θ) &= \mathbb{E}_{π} \left[\sum\limits_{a} π (a ∣ S_{t}, θ)\, q_{π} (S_{t},a) \frac{∇ π (a ∣ S_{t},θ)}{π (a ∣ S_{t},θ)}\right] \notag\\ &= \mathbb{E}_{π} \left[q_{π}(S_{t},A_{t}) \frac{∇ π (A_{t} ∣ S_{t},θ)}{π (A_{t} ∣ S_{t},θ)} \right] \notag\\ &= \mathbb{E}_{π} \left[G_{t}\frac{∇ π (A_{t} ∣ S_{t},θ)}{π (A_{t} ∣ S_{t},θ)}\right] \tag{6} \end{align}\]

The second step replaces the sum over actions by the sampled action \(A_{t} ∼ π\), and the last step uses the fact that \(\mathbb{E}_{π}[G_{t} ∣ S_{t}, A_{t}] = q_{π}(S_{t}, A_{t})\). Now, using eq(1) for stochastic gradient ascent, we obtain the REINFORCE update:

\[θ_{t+1} \dot{=} θ_{t} + αG_{t} \frac{∇ π (A_{t} ∣ S_{t}, θ_{t})}{π (A_{t} ∣ S_{t},θ_{t} )} \notag\]

Because REINFORCE uses the complete return from time \(t\), which includes all future rewards up until the end of the episode, it is in this sense a Monte Carlo algorithm. Its updates are unbiased but have high variance, which can make learning slow.
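Here is a minimal sketch of the REINFORCE update for the linear soft-max policy, again reusing the hypothetical `softmax_policy` and `x` from above. It relies on the identity \(∇π/π = ∇\ln π\), which for this parameterization equals \(x(s,a) - \sum_{b} π(b ∣ s,θ)\,x(s,b)\):

```python
def grad_log_pi(theta, x, state, action, actions):
    """Eligibility vector grad ln pi(a|s,theta) for the linear soft-max policy."""
    pi = softmax_policy(theta, x, state, actions)
    feats = np.array([x(state, a) for a in actions])
    return x(state, action) - pi @ feats

def reinforce_update(theta, alpha, episode, x, actions):
    """One pass of REINFORCE over a finished episode.
    episode is a list of (S_t, A_t, R_{t+1}) tuples; returns G_t use gamma = 1."""
    returns, G = [], 0.0
    for (_, _, r) in reversed(episode):      # accumulate returns backwards
        G += r
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(episode, returns):
        theta = theta + alpha * G_t * grad_log_pi(theta, x, s, a, actions)
    return theta
```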

REINFORCE with Baseline

The policy gradient theorem of eq(3) can be generalized to include a comparison of the action values to an arbitrary baseline \(b(s)\):

\[∇ J(θ) ∝ \sum\limits_{s} μ(s) \sum\limits_{a} \left(q_{π}(s,a)-b(s)\right) ∇ π (a ∣ s,θ ) \tag{7}\]

The baseline can be any function, even a random variable, as long as it does not vary with \(a\); the equality still holds because the subtracted term is zero, \(\sum_{a} b(s)\, ∇ π(a ∣ s,θ) = b(s)\, ∇ \sum_{a} π(a ∣ s,θ) = b(s)\, ∇ 1 = 0\). The baseline leaves the expected value of the update unchanged, but it can significantly reduce its variance and thereby speed up learning.

The policy gradient theorem with baseline, eq(7), can be used to derive an update rule by the same steps as before. A natural choice of baseline is a learned estimate of the state value, \(b(S_{t}) = \hat{v}(S_{t}, w)\). The new version of REINFORCE, with an update rule that includes a baseline, is:

\[θ_{t+1} \dot{=} θ_{t} + α \left(G_{t} - b(S_{t})\right) \frac{∇ π (A_{t} ∣ S_{t}, θ_{t})}{π (A_{t} ∣ S_{t},θ_{t})}\notag\]
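A minimal sketch of this update with a learned linear state-value baseline \(\hat{v}(s,w) = w^{T}x_{s}(s)\), building on the `grad_log_pi` helper above; the state-feature function `x_s` and the two step sizes are hypothetical choices for illustration:

```python
def reinforce_with_baseline(theta, w, alpha_theta, alpha_w,
                            episode, x, x_s, actions):
    """REINFORCE with baseline b(s) = w @ x_s(s), using gamma = 1.
    episode is a list of (S_t, A_t, R_{t+1}) tuples from one finished episode."""
    returns, G = [], 0.0
    for (_, _, r) in reversed(episode):      # Monte Carlo returns G_t
        G += r
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(episode, returns):
        delta = G_t - w @ x_s(s)                                  # G_t - b(S_t)
        w = w + alpha_w * delta * x_s(s)                          # update the baseline
        theta = theta + alpha_theta * delta * grad_log_pi(theta, x, s, a, actions)
    return theta, w
```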