DDQN, Prioritized Replay, and Dueling DQN

LAAI
Apr 30, 2020

DDQN — Double Deep Q-Network (van Hasselt et al., AAAI 2016)

Prioritized Replay (Schaul et al., ICLR 2016)

Dueling DQN (Wang et al., ICML 2016, best paper)

DDQN:

Please refer here.

Prioritized Replay

We use a replay buffer in RL algorithms to improve training efficiency. However, not all transitions in the replay buffer are equally useful to sample. For example,

Our initial state is s1, and we can take action L or action R. Goal Left has reward 1 and Goal Right has reward 10; all other rewards are zero. Assume our first episode is the following.

We move from s1 to s2', then to s3', and finally reach Goal Left. Let's dive into the TD update now.

Assume the initial V(s) are all zero. Because r1 and r2' are zero, those updates have no effect on our value estimates. The only helpful update is the one at the last step, whose reward r3' is 1. Do you see the problem here?
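To make this concrete, here is a minimal sketch of the one-step TD value updates over that first episode (the learning rate, discount factor, and state names are illustrative assumptions, not taken from the original figure):

```python
# One-step TD value updates for the first episode:
# s1 -> s2' -> s3' -> Goal Left, with rewards 0, 0, 1.
alpha, gamma = 0.5, 1.0   # assumed learning rate and discount factor
V = {"s1": 0.0, "s2'": 0.0, "s3'": 0.0, "goal_left": 0.0}

episode = [("s1", 0.0, "s2'"), ("s2'", 0.0, "s3'"), ("s3'", 1.0, "goal_left")]

for s, r, s_next in episode:
    td_error = r + gamma * V[s_next] - V[s]   # error in the update term
    V[s] += alpha * td_error
    print(s, "TD error =", td_error)

# Only the last transition (the one with reward 1) produces a nonzero TD error,
# so replaying the first two transitions teaches the agent nothing at this stage.
```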

We need to choose buffer data that is actually helpful for updating the Q-value, i.e., transitions with a large error in the update term (the TD error). A transition with a higher TD error should be sampled with a higher probability. The sampling distribution is similar to a softmax, but without the exponential: P(i) = p_i^α / Σ_k p_k^α, where p_i is the magnitude of the TD error of transition i (plus a small constant so every transition keeps a nonzero chance of being sampled).
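As a minimal sketch of this proportional prioritization (the class name, the hyperparameters alpha and eps, and the omission of the paper's importance-sampling weights are my own simplifying assumptions):

```python
import random

class PrioritizedReplayBuffer:
    """Samples transitions with probability proportional to |TD error|^alpha."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:          # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]  # P(i) = p_i^alpha / sum_k p_k^alpha
        idxs = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        return idxs, [self.data[i] for i in idxs]

    def update_priorities(self, idxs, td_errors):
        for i, d in zip(idxs, td_errors):             # refresh after each learning step
            self.priorities[i] = (abs(d) + self.eps) ** self.alpha
```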

Intuitively, the larger the error in the update term, the higher the probability that the transition should be chosen. The diagram below (from the paper) shows the resulting performance.

Dueling DQN

In Q-learning (off-policy TD control), the update is

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

Although this update can converge to the true Q-values, it might still not be sufficient. In control, what we care about more is whether an action is better than the other actions in the current state. For example,

Q(s, a1) = 100, Q(s, a2) = 101

We know a2 is better than a1, but the gap of 1 is tiny compared with the values themselves, so a value-based policy (e.g., a softmax over Q-values) would pick the two actions with almost the same probability. So how do we get a representation that addresses this problem? => the advantage function

Advantage function: A(s, a) = Q(s, a) − V(s)

(Q and V, just for reference: Q(s, a) is the expected return of taking action a in state s and then following the policy π, and V(s) = E_{a∼π}[Q(s, a)] is the expected return of state s under π.)

What we need to compare is the action value against the state value.

The state value is the expected return from this state when following the policy π to choose subsequent actions. If the action value is much larger than the state value, the action is much better than expected, and vice versa.
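Continuing the numeric example above, and assuming (purely for illustration) a policy that picks a1 and a2 with equal probability:

```python
# Advantage A(s, a) = Q(s, a) - V(s), with V(s) = E_{a~pi}[Q(s, a)].
Q = {"a1": 100.0, "a2": 101.0}
pi = {"a1": 0.5, "a2": 0.5}                 # assumed uniform policy

V = sum(pi[a] * Q[a] for a in Q)            # V(s) = 100.5
A = {a: Q[a] - V for a in Q}                # A(s, a1) = -0.5, A(s, a2) = +0.5
print(V, A)  # the sign and size of A make the comparison between actions explicit
```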

In the neural-network architecture, we can split the network into two streams after the shared layers: one stream estimates the state value V(s) and the other estimates the advantage A(s, a).

Let's modify equation (3).

We then have that Q is the sum of V and A:

Q(s, a) = V(s) + A(s, a)    — equation (7)

Training a dueling DQN is similar to training a DQN, i.e., by backpropagation on the TD loss. However, if we look at equation (7), you might observe a problem.

Equation (7) is unidentifiable: given Q, we cannot recover V and A uniquely, because adding a constant to V and subtracting the same constant from A leaves Q unchanged. So we need some modification or constraint.

In the paper, they propose equation (8), which subtracts the maximum advantage max_a' A(s, a'), and equation (9), which subtracts the mean advantage instead. However, I do not totally buy into these ideas; I believe there may be a better way to constrain this. I will temporarily stop here.
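For reference, here is a minimal PyTorch sketch of a dueling head using the mean-subtraction of equation (9); the class name and layer sizes are arbitrary assumptions:

```python
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')), as in equation (9)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V stream
        self.advantage = nn.Linear(hidden, n_actions)  # A stream

    def forward(self, state):
        h = self.features(state)
        v = self.value(h)                              # shape (batch, 1)
        a = self.advantage(h)                          # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)     # identifiable combination
```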
