Temporal-Difference (TD) Learning and SARSA

 

Temporal-Difference Learning is a combination of Monte Carlo (MC) ideas and Dynamic Programming (DP) ideas.

MC methods wait until the return following the visit is known, and then use that return as a target for \(V(S_{t})\). A simple every-visit MC method suitable for non-stationary environments is:

\[V(S_{t}) ← V(S_{t}) + α \left[G_{t}-V(S_{t})\right] \notag\]

where,

\(G_{t}\) is the actual return following time \(t\),
\(\alpha\) is a constant step-size parameter.
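As a concrete illustration, here is a minimal Python sketch of this constant-\(\alpha\) every-visit MC update. The episode format (a list of state-reward pairs) and the state names and numbers are assumptions made for the example, not part of the text.

```python
# A minimal sketch of the constant step-size, every-visit MC update.
# Assumption: `episode` is a list of (state, reward) pairs, where `reward`
# is the reward received on leaving that state.

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Update the value table V in place using V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
    G = 0.0
    # Walk the episode backwards so the return G_t can be accumulated incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G                  # G_t = R_{t+1} + gamma * G_{t+1}
        V[state] += alpha * (G - V[state])      # constant-alpha every-visit update
    return V

# Illustrative usage with made-up states and rewards:
V = {"A": 0.0, "B": 0.0, "C": 0.0}
mc_update(V, episode=[("A", 0.0), ("B", 1.0), ("C", 2.0)])
```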

TD method

The Temporal-Difference (TD) method, unlike MC, doesn’t need to wait until the end of the episode; it only needs to wait until the next time step. The simplest TD update is:

\[V(S_{t}) ← V(S_{t}) + α \left[ R_{t+1} + γ V(S_{t+1})-V(S_{t})\right]\]

In effect,

the target of the MC update is \(G_{t}\), whereas the target of the TD update is \(R_{t+1}+ γ V(S_{t+1})\).

The simplest TD method shown above is called TD(0), or one-step TD.
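To make the update concrete, here is a minimal Python sketch of tabular TD(0) prediction. The `env.reset()`/`env.step(action)` interface and the `policy(state)` function are assumptions made for this example.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): after every step, move V(S_t) toward R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The value of a terminal state is taken to be zero.
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])     # TD(0) update, eq. (1)
            state = next_state
    return V
```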

Because TD(0) bases its update in part on an existing estimate, we can say it is a bootstrapping method like DP. We know

\[\begin{align} v_{π}(s) &= \mathbb{E}_{π} \left[G_{t} \mid S_{t} = s \right]\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + γ G_{t+1} \mid S_{t} = s \right] \notag\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + γ v_{\pi}(S_{t+1}) \mid S_{t} = s\right] \end{align}\]

MC methods use an estimate of eq. (2) as the target.
DP methods use an estimate of eq. (3) as the target.
TD methods combine the sampling of MC with the bootstrapping of DP.

The quantity in brackets in the TD(0) update in eq. (1) is a sort of error, measuring the difference between the estimated value of \(S_{t}\) and the better estimate \(R_{t+1} + γ V(S_{t+1})\). This is called the TD error:

\[δ_{t} \dot{=} R_{t+1} + γ V(S_{t+1}) - V(S_{t})\]

Note: the TD error is the error in the estimate made at that time. Because it depends on the next state and next reward (not available until the next time step), \(\delta_{t}\) is the error in \(V(S_{t})\) available at \(t+1\). Also, if \(V\) doesn’t change during the episode (as in MC methods), then the MC error can be written as a sum of TD errors:

\[\begin{align} G_{t} - V(S_{t}) &= R_{t+1} + γ G_{t+1} - V(S_{t}) + γ V(S_{t+1}) - γ V(S_{t+1}) \notag\\ &= δ_{t} + γ (G_{t+1} - V(S_{t+1}))\notag\\ &= δ_{t} + γ δ_{t+1} + \gamma^{2}(G_{t+2} - V(S_{t+2})) \notag \\ &= \delta_{t} + γ δ_{t+1} + \gamma^{2} δ_{t+2} + \cdots + γ^{T-t-1} δ_{T-1} + γ^{T-t}(G_{T} - V(S_{T})) \notag \\ &= \delta_{t} + γ δ_{t+1} + γ^{2} δ_{t+2} + \cdots + \gamma^{T-t-1} δ_{T-1} + γ^{T-t}(0-0) \notag \\ &=\sum\limits_{k=t}^{T-1} γ^{k-t} \delta_{k} \notag \end{align}\]

If the step size is small, the identity may still hold approximately, even if \(V\) is updated during the episode.
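As a quick sanity check of this identity with \(V\) held fixed, the short Python snippet below compares the two sides on made-up rewards and values (all numbers are illustrative assumptions).

```python
# Numerical check: with V held fixed, G_t - V(S_t) equals the discounted sum of TD errors.
gamma = 0.9
V = [0.5, 0.2, -0.1, 0.0]     # V(S_t), V(S_{t+1}), V(S_{t+2}), V(S_T); terminal value is 0
rewards = [1.0, 0.0, 2.0]     # R_{t+1}, R_{t+2}, R_{t+3}

# Returns G_k, computed backwards from the terminal state (G_T = 0).
G = [0.0] * (len(rewards) + 1)
for k in reversed(range(len(rewards))):
    G[k] = rewards[k] + gamma * G[k + 1]

# TD errors: delta_k = R_{k+1} + gamma * V(S_{k+1}) - V(S_k).
deltas = [rewards[k] + gamma * V[k + 1] - V[k] for k in range(len(rewards))]

mc_error = G[0] - V[0]
td_error_sum = sum(gamma ** k * deltas[k] for k in range(len(deltas)))
print(mc_error, td_error_sum)  # both print 2.12 (up to floating-point rounding)
```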

Advantages of TD Prediction Methods:

  • Over DP methods, TD methods do not require a model of the environment, of its reward and next-state probability distributions.
  • Over MC methods, TD methods are naturally implemented in an online, fully incremental fashion: they only need to wait one time step, whereas MC methods must wait until the end of the episode.
  • Although TD methods learn a guess from a guess (they bootstrap), convergence can still be guaranteed.

For any fixed policy \(\pi\), TD(0) has been proved to converge to \(v_{\pi}\): in the mean for a sufficiently small constant step-size parameter, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions.
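For reference, the usual stochastic approximation (Robbins-Monro) conditions on a decreasing step-size sequence \(\alpha_{t}\) are:

\[\sum\limits_{t=1}^{\infty} \alpha_{t} = \infty \qquad \text{and} \qquad \sum\limits_{t=1}^{\infty} \alpha_{t}^{2} < \infty \notag\]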

Optimality of TD(0):

When experience is limited, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. Given an approximate value function \(V\), all the available experience is processed again with that value function to produce a single overall increment, and so on, until the value function converges. This is called batch updating, because updates are made only after processing each complete batch of experience. Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter \(\alpha\), as long as \(\alpha\) is chosen to be sufficiently small.
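A minimal Python sketch of batch TD(0) under these assumptions: the stored experience is a list of episodes, each a list of (state, reward, next_state, done) transitions (this data format and the tolerance are assumptions made for the example). The increments are summed over all stored experience with \(V\) held fixed, applied once per sweep, and the sweeps repeat until \(V\) stops changing.

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-8, max_sweeps=100_000):
    """Batch TD(0): `episodes` is a list of lists of (state, reward, next_state, done) tuples."""
    V = defaultdict(float)
    for _ in range(max_sweeps):
        increments = defaultdict(float)
        # Compute the TD(0) increments over *all* stored experience with V held fixed.
        for episode in episodes:
            for state, reward, next_state, done in episode:
                target = reward + gamma * (0.0 if done else V[next_state])
                increments[state] += alpha * (target - V[state])
        # Apply the summed increments only once per sweep over the batch.
        for state, inc in increments.items():
            V[state] += inc
        if max(abs(inc) for inc in increments.values()) < tol:
            break
    return V
```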

SARSA: On-Policy TD Control

In any on-policy control method, the first step is to learn an action-value function rather than a state-value function.

Thus, our goal is to estimate \(q_{\pi}(s,a)\) for the current behavior policy \(\pi\), for all states \(s\) and actions \(a\).

Instead of solely considering state transitions, in these methods, we examine transitions from state-action pairs to state-action pairs, learning the values associated with state-action pairs. The convergence theorem that ensures the convergence of state values under \(TD(0)\) also applies to action values:

\[Q(S_{t},A_{t}) ← Q(S_{t},A_{t}) + α \left[ R_{t+1} + γ Q(S_{t+1},A_{t+1}) -Q(S_{t},A_{t})\right]\]

This update is performed after each transition from one state-action pair to the next, using the quintuple of events \((S_{t}, A_{t}, R_{t+1}, S_{t+1}, A_{t+1})\), which gives the update rule its name: SARSA. If \(S_{t+1}\) is terminal, \(Q(S_{t+1}, A_{t+1})\) is defined to be zero.

When designing an on-policy control algorithm based on SARSA, the approach is the same as in other on-policy methods: we continually estimate \(q_{\pi}\) for the behavior policy \(\pi\), while simultaneously adjusting \(\pi\) to be greedier with respect to \(q_{\pi}\). The convergence properties of SARSA depend on the nature of the policy’s dependence on \(Q\); for example, one could use \(ϵ\text{-greedy}\) or \(ϵ\text{-soft}\) policies.
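Putting the pieces together, here is a minimal Python sketch of SARSA control with an \(ϵ\text{-greedy}\) policy. The `env.reset()`/`env.step(action)` interface and the finite `env.actions` list are assumptions made for the example.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD control driven by the quintuple (S, A, R, S', A')."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, env.actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                # Q of a terminal state-action pair is taken to be zero.
                Q[(state, action)] += alpha * (reward - Q[(state, action)])
            else:
                next_action = epsilon_greedy(Q, next_state, env.actions, epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
    return Q
```

Because the action actually taken in the next state, \(A_{t+1}\), is chosen by the same \(ϵ\text{-greedy}\) policy being improved, this sketch stays on-policy.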