SARSA : On-Policy TD Control
In any on-policy control method, the first step is to learn an action-value function rather than a state-value function.
Thus, our goal is to estimate \(q_{\pi}(s,a)\) for the current behavior policy \(\pi\), for all states \(s\) and action \(a\).
Instead of solely considering state transitions, in these methods, we examine transitions from state-action pairs to state-action pairs, learning the values associated with state-action pairs. The convergence theorem that ensures the convergence of state values under \(TD(0)\) also applies to action values:
\[Q(S_{t},A_{t}) ← Q(S_{t},A_{t}) + α \left[ R_{t+1} + γ Q(S_{t+1},A_{t+1}) -Q(S_{t},A_{t})\right]\]This update is performed after each transition from one state-action pair to the next, utilizing a quintuple of events \((S_{t}, A_{t} ,R_{t+1}, S_{t+1}, A_{t+1})\). This update scheme is referred to as: SARSA.
When designing an on-policy control algorithm based on SARSA, the approach is similar to any other on-policy methods. We continuously estimate \(q_{\pi}\) for the behaviour policy \(\pi\), while simultaneously adjusting \(\pi\) to become more greedy with respect to \(q_{\pi}\). The convergence properties of the SARSA depends on the nature of the policy’s dependence on Q. For example, one could use \(ϵ -\text{greedy}\) or \(ϵ -\text{soft}\) policies.
Q-Learning : Off-Policy TD Control
Q-Learning is defined by:
\[Q(S_{t},A_{t}) ← Q(S_{t},A_{t}) + α \left[ R_{t+1} + γ \text{max}_{a} (S_{t+1},a) - Q(S_{t},A_{t})\right]\]Here, the learned action-value function, \(Q\), directly approximates the optimal action-value \(q_{⋆}\), independent of the policy being followed. The policy still has an effect as it determines which state-action pairs are visited and updated. However, for correct convergence it is required that all pairs continue to be updated.
Therefore, as each state-action pairs needs to be visited and updated, independent of the choice of policy being followed, this algorithm enable early convergence.
Expected SARSA
It follows the schema of Q-Learning, but with the update rule. Instead of using maximum over state-action pair, it uses expected value, taking into account how likely each action is under the current policy.
\[Q(S_{t},A_{t}) ← Q(S_{t},A_{t}) + α \left[R_{t+1} + γ \mathbb{E}_{\pi} \left[Q(S_{t+1},A_{t+1}) \mid S_{t+1}\right] -Q(S_{t},A_{t})\right]\] \[← Q(S_{t},A_{t}) + α \left[R_{t+1} + γ \sum\limits_{a} π (a \mid S_{t+1}) Q(S_{t+1},a) - Q(S_{t},A_{t})\right]\]Expected SARSA shows significant improvement over SARSA over a wide range of values over for the step-size parameter \(\alpha\). Expected SARSA has another advantage of having lower variance than seen in SARSA due to random selection of \(A_{t+1}\)