Temporal-Difference (TD) Learning and SARSA

 

Temporal-Difference Learning is a combination of Monte Carlo (MC) ideas and Dynamic Programming (DP) ideas.

MC methods wait until the return following the visit is known, and then use that return as a target for \(V(S_{t})\). A simple every-visit MC method suitable for non-stationary environments is:

\[V(S_{t}) ← V(S_{t}) + α \left[G_{t}-V(S_{t})\right] \notag\]

where,

\(G_{t}\) is the actual return following time \(t\),
\(\alpha\) is a constant step-size parameter.
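As a concrete illustration, here is a minimal Python sketch of this constant-\(\alpha\) every-visit MC update. The episode format (a list of state-reward pairs) and the state names and numbers are assumptions made for the example, not part of the text.

```python
# A minimal sketch of the constant step-size, every-visit MC update.
# Assumption: `episode` is a list of (state, reward) pairs, where `reward`
# is the reward received on leaving that state.

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Update the value table V in place using V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
    G = 0.0
    # Walk the episode backwards so the return G_t can be accumulated incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G                  # G_t = R_{t+1} + gamma * G_{t+1}
        V[state] += alpha * (G - V[state])      # constant-alpha every-visit update
    return V

# Illustrative usage with made-up states and rewards:
V = {"A": 0.0, "B": 0.0, "C": 0.0}
mc_update(V, episode=[("A", 0.0), ("B", 1.0), ("C", 2.0)])
```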

TD method

The Temporal-Difference (TD) method, unlike MC, doesn’t need to wait until the end of the episode; it only needs to wait until the next time step. The simplest TD update is:

\[V(S_{t}) ← V(S_{t}) + α \left[ R_{t+1} + γ V(S_{t+1})-V(S_{t})\right]\]

In effect,

the target of the MC update is \(G_{t}\), whereas the target of the TD update is \(R_{t+1}+ γ V(S_{t+1})\).

The simplest TD method shown above is called TD(0), or one-step TD.
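To make the update concrete, here is a minimal Python sketch of tabular TD(0) prediction. The `env.reset()`/`env.step(action)` interface and the `policy(state)` function are assumptions made for this example.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): after every step, move V(S_t) toward R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The value of a terminal state is taken to be zero.
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])     # TD(0) update, eq. (1)
            state = next_state
    return V
```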

Because TD(0) bases its update in part on an existing estimate, we can say it is a bootstrapping method like DP. We know

\[\begin{align} v_{π}(s) &= \mathbb{E}_{π} \left[G_{t} \mid S_{t} = s \right]\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + γ G_{t+1} \mid S_{t} = s \right] \notag\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + γ v_{\pi}(S_{t+1}) \mid S_{t} = s\right] \end{align}\]

MC methods use an estimate of eq. (2) as the target.
DP methods use an estimate of eq. (3) as the target.
TD methods combine the sampling of MC with the bootstrapping of DP.

The quantity in brackets in the TD(0) update in eq. (1) is a sort of error, measuring the difference between the estimated value of \(S_{t}\) and the better estimate \(R_{t+1} + γ V(S_{t+1})\). This is called the TD error:

\[δ_{t} \dot{=} R_{t+1} + γ V(S_{t+1}) - V(S_{t})\]

Note: the TD error is the error in the estimate made at that time. Because it depends on the next state and next reward (not available until the next time step), \(\delta_{t}\) is the error in \(V(S_{t})\) available at \(t+1\). Also, if \(V\) doesn’t change during the episode (as in MC methods), then the MC error can be written as a sum of TD errors:

\[\begin{align} G_{t} - V(S_{t}) &= R_{t+1} + γ G_{t+1} - V(S_{t}) + γ V(S_{t+1}) - γ V(S_{t+1}) \notag\\ &= δ_{t} + γ (G_{t+1} - V(S_{t+1}))\notag\\ &= δ_{t} + γ δ_{t+1} + \gamma^{2}(G_{t+2} - V(S_{t+2})) \notag \\ &= \delta_{t} + γ δ_{t+1} + \gamma^{2} δ_{t+2} + \cdots + γ^{T-t-1} δ_{T-1} + γ^{T-t}(G_{T} - V(S_{T})) \notag \\ &= \delta_{t} + γ δ_{t+1} + γ^{2} δ_{t+2} + \cdots + \gamma^{T-t-1} δ_{T-1} + γ^{T-t}(0-0) \notag \\ &=\sum\limits_{k=t}^{T-1} γ^{k-t} \delta_{k} \notag \end{align}\]

If the step size is small, the identity may still hold approximately, even if \(V\) is updated during the episode.
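As a quick sanity check of this identity with \(V\) held fixed, the short Python snippet below compares the two sides on made-up rewards and values (all numbers are illustrative assumptions).

```python
# Numerical check: with V held fixed, G_t - V(S_t) equals the discounted sum of TD errors.
gamma = 0.9
V = [0.5, 0.2, -0.1, 0.0]     # V(S_t), V(S_{t+1}), V(S_{t+2}), V(S_T); terminal value is 0
rewards = [1.0, 0.0, 2.0]     # R_{t+1}, R_{t+2}, R_{t+3}

# Returns G_k, computed backwards from the terminal state (G_T = 0).
G = [0.0] * (len(rewards) + 1)
for k in reversed(range(len(rewards))):
    G[k] = rewards[k] + gamma * G[k + 1]

# TD errors: delta_k = R_{k+1} + gamma * V(S_{k+1}) - V(S_k).
deltas = [rewards[k] + gamma * V[k + 1] - V[k] for k in range(len(rewards))]

mc_error = G[0] - V[0]
td_error_sum = sum(gamma ** k * deltas[k] for k in range(len(deltas)))
print(mc_error, td_error_sum)  # both print 2.12 (up to floating-point rounding)
```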

Advantages of TD Prediction Methods:

  • Over DP methods, TD methods do not require a model of the environment, of its reward and next-state probability distributions.
  • Over MC methods, TD methods are naturally implemented in an online, fully incremental fashion: they only need to wait one time step, whereas MC methods must wait until the end of the episode.
  • Although TD methods learn a guess from a guess (they bootstrap), convergence can still be guaranteed.

For any fixed policy \(\pi\), TD(0) has been proved to converge to \(v_{\pi}\): in the mean for a sufficiently small constant step-size parameter, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions.
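For reference, the usual stochastic approximation (Robbins-Monro) conditions on a decreasing step-size sequence \(\alpha_{t}\) are:

\[\sum\limits_{t=1}^{\infty} \alpha_{t} = \infty \qquad \text{and} \qquad \sum\limits_{t=1}^{\infty} \alpha_{t}^{2} < \infty \notag\]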

Optimality of TD(0):

When experience is limited, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. Given an approximate value function \(V\), all the available experience is processed again with that value function to produce a single overall increment, and so on, until the value function converges. This is called batch updating, because updates are made only after processing each complete batch of experience. Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter \(\alpha\), as long as \(\alpha\) is chosen to be sufficiently small.
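A minimal Python sketch of batch TD(0) under these assumptions: the stored experience is a list of episodes, each a list of (state, reward, next_state, done) transitions (this data format and the tolerance are assumptions made for the example). The increments are summed over all stored experience with \(V\) held fixed, applied once per sweep, and the sweeps repeat until \(V\) stops changing.

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-8, max_sweeps=100_000):
    """Batch TD(0): `episodes` is a list of lists of (state, reward, next_state, done) tuples."""
    V = defaultdict(float)
    for _ in range(max_sweeps):
        increments = defaultdict(float)
        # Compute the TD(0) increments over *all* stored experience with V held fixed.
        for episode in episodes:
            for state, reward, next_state, done in episode:
                target = reward + gamma * (0.0 if done else V[next_state])
                increments[state] += alpha * (target - V[state])
        # Apply the summed increments only once per sweep over the batch.
        for state, inc in increments.items():
            V[state] += inc
        if max(abs(inc) for inc in increments.values()) < tol:
            break
    return V
```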

SARSA: On-Policy TD Control

In any on-policy control method, the first step is to learn an action-value function rather than a state-value function.

Thus, our goal is to estimate \(q_{\pi}(s,a)\) for the current behavior policy \(\pi\), for all states \(s\) and actions \(a\).

Instead of solely considering state transitions, in these methods, we examine transitions from state-action pairs to state-action pairs, learning the values associated with state-action pairs. The convergence theorem that ensures the convergence of state values under \(TD(0)\) also applies to action values:

\[Q(S_{t},A_{t}) ← Q(S_{t},A_{t}) + α \left[ R_{t+1} + γ Q(S_{t+1},A_{t+1}) -Q(S_{t},A_{t})\right]\]

This update is performed after each transition from one state-action pair to the next, using the quintuple of events \((S_{t}, A_{t}, R_{t+1}, S_{t+1}, A_{t+1})\), which gives the update rule its name: SARSA. If \(S_{t+1}\) is terminal, \(Q(S_{t+1}, A_{t+1})\) is defined to be zero.

When designing an on-policy control algorithm based on SARSA, the approach is the same as in other on-policy methods: we continually estimate \(q_{\pi}\) for the behavior policy \(\pi\), while simultaneously adjusting \(\pi\) to be greedier with respect to \(q_{\pi}\). The convergence properties of SARSA depend on the nature of the policy’s dependence on \(Q\); for example, one could use \(ϵ\text{-greedy}\) or \(ϵ\text{-soft}\) policies.
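Putting the pieces together, here is a minimal Python sketch of SARSA control with an \(ϵ\text{-greedy}\) policy. The `env.reset()`/`env.step(action)` interface and the finite `env.actions` list are assumptions made for the example.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD control driven by the quintuple (S, A, R, S', A')."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, env.actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                # Q of a terminal state-action pair is taken to be zero.
                Q[(state, action)] += alpha * (reward - Q[(state, action)])
            else:
                next_action = epsilon_greedy(Q, next_state, env.actions, epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
    return Q
```

Because the action actually taken in the next state, \(A_{t+1}\), is chosen by the same \(ϵ\text{-greedy}\) policy being improved, this sketch stays on-policy.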