Deep Q-Learning and Double Q-Learning


Deep Q-Learning

In \(Q\)-Learning, representing the \(Q\)-function as a table over all state-action pairs can be impractical. Deep Q-Learning solves this by training a neural network with parameters \(\theta\) to approximate the \(Q\)-values, i.e., \(Q(s,a;\theta) \approx Q^{\ast}(s,a)\).

This is achieved by minimizing, at each time step \(t\), a loss between the network's predicted \(Q\)-values and target values, enabling efficient estimation of \(Q\)-values without an exhaustive table representation.
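As a concrete illustration, here is a minimal Q-network sketch in PyTorch; the layer widths and the names `state_dim` and `n_actions` are placeholder assumptions rather than values from the text.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action: Q(s, a; theta)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```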

In addition, Deep Q-Learning uses a technique called Experience Replay during network updates. At each time step \(t\), the observed transition is added to a fixed-size circular buffer called the replay buffer. During training, instead of using only the most recent transition, a mini-batch of transitions sampled from the replay buffer is used to compute the loss and its gradient, improving learning efficiency and stability.
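A minimal replay-buffer sketch along these lines; the transition layout (including a `done` flag, which the text does not mention) and the use of `collections.deque` for the circular behaviour are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular buffer of transitions (s, a, r, s_next, done)."""
    def __init__(self, max_size: int):
        self.buffer = deque(maxlen=max_size)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        """Uniformly sample a mini-batch of stored transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```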

Deep Q-Learning Steps

Initialization:

  1. The replay memory \(M\) is initialized to an empty list.

  2. We choose a maximum size for the memory.

At each time step \(t\), we repeat the following processes until the end of the epoch.

  1. We predict the \(Q-\)values of the current state \(s_{t}\).

  2. We play the action that has the highest \(Q\)-value: \(a_{t} = \arg\max_{a} Q(s_{t},a)\).

  3. We get the reward \(R(s_{t},a_{t})\).

  4. We reach the next state \(s_{t+1}\).

  5. We append the transition \((s_{t},a_{t},r_{t},s_{t+1})\) to the memory \(M\).

  6. We take a random mini-batch \(B \subset M\) from the replay buffer. For each transition \((s_{t_{B}},a_{t_{B}},r_{t_{B}},s_{t_{B}+1})\) in the mini-batch \(B\):

We get the predictions \(Q(s_{t_{B}},a_{t_{B}})\).

We get the targets \(R(s_{t_{B}},a_{t_{B}}) + \gamma \max_{a} Q(s_{t_{B}+1},a)\).

Now, we compute the desired loss between the predictions and target over the whole mini-batch \(B\).

\[\begin{align} \text{Loss} &= \frac{1}{2} \sum\limits_{B} \left(R(s_{t_{B}},a_{t_{B}}) + \gamma \max_{a} Q(s_{t_{B}+1},a) - Q(s_{t_{B}},a_{t_{B}})\right)^{2}\notag\\ &= \frac{1}{2}\sum\limits_{B} TD_{t_{B}}(s_{t_{B}},a_{t_{B}})^{2}\notag \end{align}\]

Afterwards, we backpropagate the loss through the network and, via stochastic gradient descent, update the network's weights; a sketch of this training step follows.
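Putting the steps together, a sketch of one such update, assuming the `QNetwork` and `ReplayBuffer` sketches above and a standard PyTorch optimizer; the `done` flag, `gamma`, and `batch_size` are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, buffer, gamma=0.99, batch_size=32):
    """One gradient step on the squared TD error over a sampled mini-batch."""
    if len(buffer) < batch_size:
        return None  # not enough stored experience yet

    batch = buffer.sample(batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.stack(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Prediction: Q(s_t, a_t) for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: R + gamma * max_a Q(s_{t+1}, a); no bootstrap on terminal states.
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next

    # Squared TD error, then a stochastic-gradient step on the network weights.
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```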

Double Q-Learning

In Q-Learning, the update can be written as:

\[Q_{t+1}(s_{t},a_{t}) = Q_{t}(s_{t},a_{t}) + \alpha_{t}(s_{t},a_{t})\left(R_{t} + \gamma \max_{a} Q_{t}(s_{t+1},a) - Q_{t}(s_{t},a_{t})\right)\]
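For a tabular \(Q\) stored as a 2-D array, this update is a one-line sketch; the state/action indexing and the values of `alpha` and `gamma` are illustrative assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-Learning update on an (n_states, n_actions) array Q."""
    td_target = r + gamma * np.max(Q[s_next])   # max over actions in the next state
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the TD target
```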

The use of the max operator in the Q-Learning update above can cause large overestimation of the action values. This incurs a significant performance penalty and slows the learning process.
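A small numerical sketch of this bias: even when all true action values are equal to zero, the maximum over noisy estimates is positive on average (the Gaussian noise model and the sizes used are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000

# True action values are all zero, so max_a E[Q(s, a)] = 0.
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

# The single-estimator max is biased upward: E[max_a Q_hat(s, a)] > 0.
print(noisy_q.max(axis=1).mean())   # about 1.5 rather than 0
```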

Therefore, Double Q-Learning proposes the double estimator method. Here, two sets of estimators, \(\mu^{A} = \{\mu_{1}^{A},\cdots,\mu_{M}^{A}\}\) and \(\mu^{B} = \{\mu_{1}^{B},\cdots,\mu_{M}^{B}\}\), are used to approximate \(\max_{i}E\left\{X_{i}\right\}\). With \(a^{\ast} = \arg\max_{i}\mu_{i}^{A}\), the approximation can be written as:

\[\max_{i}E\left\{X_{i}\right\} = \max_{i}E\left\{\mu_{i}^{B}\right\} \approx \mu_{a^{\ast}}^{B}\]
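On the same toy setting as above, a sketch of the double estimator: the action is selected with the \(\mu^{A}\) estimates but evaluated with the independent \(\mu^{B}\) estimates, which removes the upward bias (again, the noise model is an illustrative assumption).

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000

mu_A = rng.normal(0.0, 1.0, size=(n_trials, n_actions))  # estimator set mu^A
mu_B = rng.normal(0.0, 1.0, size=(n_trials, n_actions))  # independent estimator set mu^B

a_star = mu_A.argmax(axis=1)                          # a* = argmax_i mu_i^A
double_estimate = mu_B[np.arange(n_trials), a_star]   # evaluate a* with mu^B

print(double_estimate.mean())   # close to 0: no upward bias in this example
```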

Double Q-Learning Method

Double Q-Learning stores two \(Q\) functions, \(Q^{A}\) and \(Q^{B}\). Each \(Q\) function is updated using a value from the other \(Q\) function for the next state, so that the value used in the update can be considered an unbiased estimate of the value of that action.

For instance, in the update of \(Q^{A}\), the action \(a^{\ast}\) is the maximum-valued action in state \(s^{'}\) according to the value function \(Q^{A}\), i.e., \(Q^{A}(s^{'},a^{\ast}) = \max_{a}Q^{A}(s^{'},a)\). However, we still use \(Q^{B}(s^{'},a^{\ast})\) to update \(Q^{A}\).

Double Q-Learning converges to the optimal policy in the limit and can learn faster than Q-Learning. Its update is sketched below.
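A minimal tabular sketch of one Double Q-Learning update following this description; the coin flip for which table to update, the behaviour policy mentioned afterwards, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Double Q-Learning update on (n_states, n_actions) arrays QA, QB."""
    if np.random.rand() < 0.5:
        # Update A: select a* with Q^A, but evaluate it with Q^B.
        a_star = np.argmax(QA[s_next])
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:
        # Update B: select b* with Q^B, but evaluate it with Q^A.
        b_star = np.argmax(QB[s_next])
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])
```

A common choice for the behaviour policy is to act, e.g. epsilon-greedily, with respect to the average (or sum) of the two tables.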