Double Q-Learning and Value Overestimation in Q-Learning

LAAI
Apr 30, 2020

This problem is known as the maximization bias problem.

As described in Sutton and Barto's RL book:

In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias. To see why, consider a single state s where there are many actions a whose true values, q(s, a), are all zero but whose estimated values, Q(s, a), are uncertain and thus distributed some above and some below zero.

Let's start with the example below.

State A is the starting state. In state A, taking the 'right' action leads to a terminal state with reward 0, and taking the 'left' action leads to state B. In state B, every available action (and there are many of them) goes left to a terminal state with a reward drawn from a normal distribution with mean -0.1 and variance 1.
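
For concreteness, here is a minimal Python sketch of this toy MDP (the class name, the step interface, and the choice of 10 actions in state B are my own assumptions, not part of the original example):

```python
import numpy as np

class ToyMDP:
    """The A-B example: in A, 'right' terminates with reward 0 and 'left'
    moves to B with reward 0; every action in B terminates with a reward
    drawn from a normal distribution with mean -0.1 and variance 1."""

    def __init__(self, n_b_actions=10, seed=None):
        self.n_b_actions = n_b_actions          # assumed number of actions in B
        self.rng = np.random.default_rng(seed)
        self.state = "A"

    def reset(self):
        self.state = "A"
        return self.state

    def actions(self):
        # Two named actions in A, several anonymous 'left' actions in B.
        return ["left", "right"] if self.state == "A" else list(range(self.n_b_actions))

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        if self.state == "A":
            if action == "right":
                return "terminal", 0.0, True
            self.state = "B"
            return "B", 0.0, False
        # Any action in B ends the episode with a noisy, slightly negative reward.
        return "terminal", self.rng.normal(-0.1, 1.0), True
```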

Intuitively, the value of state B should be -0.1 and the value of state A should be 0, so the best policy is to always choose 'right' in state A.

However, Q-learning runs into a problem here. The Q-learning update is the following.

Q-learning updates
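
For reference, the standard tabular Q-learning update (in Sutton and Barto's notation, which is presumably what the image shows) is:

```latex
\[
Q(S_t, A_t) \;\leftarrow\; Q(S_t, A_t)
  + \alpha \Big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Big]
\]
```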

In the Q-learning update, the target always uses the maximum Q-value over actions in the next state. This is the problem: it produces a positively biased value estimate. Let's look at the performance result first and then explain why it happens.

Figure 6.5 in Sutton and Barto's book

Q-learning needs more episodes than double Q-learning to converge to the true values. In particular, Q-learning takes the wrong action ('left') much more often at the beginning of training. Why?

Q-learning updates

Let's look at this equation again. When we update Q(s=A, a=left), the target is based on Q(s=B, a=?), and we use the max operation to pick the largest Q-value in state B. In the early stage of training, the value estimates are very noisy. For example, suppose Q(s=B, a=left_1) = 0.8, Q(s=B, a=left_2) = 0.6, Q(s=B, a=left_3) = -0.5. The first update uses max(0.8, 0.6, -0.5) = 0.8 as the value estimate of state B. A later update might use max(0.1, 0.6, -0.5) = 0.6, after 0.8 has been corrected down to 0.1, yet we are still overestimating the value of state B, whose true value is -0.1. Therefore, we need many more episodes of training to eliminate this overestimation.
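
A quick numeric sketch of this effect (the numbers of actions and samples are arbitrary assumptions; the point is only that the max of noisy estimates of -0.1 is, on average, well above -0.1):

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = -0.1          # every action in B really has expected reward -0.1
n_actions = 10             # assumed number of actions in B
n_samples_per_action = 5   # few samples -> noisy early-training estimates

max_estimates = []
for _ in range(10_000):
    # Each action's estimate is a sample mean of a few noisy rewards.
    estimates = rng.normal(true_value, 1.0,
                           size=(n_actions, n_samples_per_action)).mean(axis=1)
    # Q-learning's target for state B takes the max over these noisy estimates.
    max_estimates.append(estimates.max())

print(f"true value of B        : {true_value}")
print(f"average of max estimate: {np.mean(max_estimates):.3f}")  # clearly above -0.1
```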

Here is the solution: double Q-learning.

We keep one more Q-function to eliminate this overestimation: the greedy action is selected using Q1, but its value estimate is taken from Q2 (and on each update the roles of the two tables are swapped at random).

from Emma Brunskill’s RL slides
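
Below is a minimal tabular sketch of this update in Python (the function name, the dictionary layout, and the hyperparameters are my own choices; the random coin flip deciding which table to update follows the standard double Q-learning algorithm):

```python
import random
from collections import defaultdict

alpha, gamma = 0.1, 1.0
Q1 = defaultdict(float)   # first Q-table
Q2 = defaultdict(float)   # second Q-table

def double_q_update(s, a, r, s_next, next_actions, done):
    """One double Q-learning update for the transition (s, a, r, s_next)."""
    # Randomly pick which table to update so the two stay (mostly) independent.
    if random.random() < 0.5:
        Q_sel, Q_eval = Q1, Q2
    else:
        Q_sel, Q_eval = Q2, Q1

    if done or not next_actions:
        target = r
    else:
        # Select the greedy action with one table...
        a_star = max(next_actions, key=lambda a2: Q_sel[(s_next, a2)])
        # ...but evaluate it with the other table.
        target = r + gamma * Q_eval[(s_next, a_star)]

    Q_sel[(s, a)] += alpha * (target - Q_sel[(s, a)])
```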

Short proof of Q-value overestimation

Q-learning updates

Our goal is for Q(s,a) to equal the true Q-value q*(s,a). In the Monte Carlo view, the true Q-value is an expected value. In the equations below, X_i is the reward received from the environment for action a_i.

m: the true maximum, i.e. the best expected reward across all actions.
m hat: what we actually compute in Q-learning. It is biased because each action's value is a sample mean of rewards, and we always take the maximum over these noisy estimates. As a result, our value estimate tends to be greater than the true Q-value.
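
Written out, with \bar{X}_i denoting the sample mean of the rewards observed for action a_i (the notation and the Jensen-inequality step are my additions, but this is the standard argument):

```latex
% m: the true maximum expected reward; \hat{m}: what Q-learning computes.
\[
m \;=\; \max_i \mathbb{E}[X_i],
\qquad
\hat{m} \;=\; \max_i \bar{X}_i .
\]
% Because \max is convex, Jensen's inequality gives
\[
\mathbb{E}[\hat{m}]
  \;=\; \mathbb{E}\Big[\max_i \bar{X}_i\Big]
  \;\ge\; \max_i \mathbb{E}\big[\bar{X}_i\big]
  \;=\; \max_i \mathbb{E}[X_i]
  \;=\; m .
\]
% The inequality is strict whenever the estimates are noisy and no single
% action dominates, so the Q-learning target overestimates the true value.
```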
