DDQN:
Please refer here.
Prioritized Replay
We use a replay buffer in RL algorithms to improve training efficiency. However, not all data in the replay buffer are equally useful to sample. For example, suppose our initial state is s1 and we can take action L or action R. The left goal has reward 1, the right goal has reward 10, and all other rewards are zero. Assume our first episode is the following.
We move from s1 to s2', then to s3', then to the left goal. Let's dive into the TD update now.
Assume the initial V(s) are all zeros. Because r1 and r2' are zero, those updates have no effect on our value estimates. The only helpful one is the update with r3', which is 1. Do you see the problem here?
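The wasted updates are easy to see in a tiny TD(0) sketch of the episode above (the state names and learning rate here are assumptions for illustration):

```python
# TD(0) value updates over the episode s1 -> s2' -> s3' -> left goal.
alpha, gamma = 0.5, 1.0
V = {"s1": 0.0, "s2'": 0.0, "s3'": 0.0, "goal_left": 0.0}

# (state, reward, next_state) transitions of the episode
episode = [("s1", 0.0, "s2'"), ("s2'", 0.0, "s3'"), ("s3'", 1.0, "goal_left")]

for s, r, s_next in episode:
    td_error = r + gamma * V[s_next] - V[s]  # zero for the first two transitions
    V[s] += alpha * td_error

print(V)  # only V(s3') changed; the zero-reward transitions taught us nothing
```

If we sampled these three transitions uniformly from a replay buffer, two out of three draws would produce a zero TD error and no learning at all.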
We need to sample the buffer data that is most helpful for updating the Q-value.
Intuitively, transitions with a larger error in the update term should have a higher probability of being chosen. The diagram below shows the performance.
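A minimal proportional prioritized-replay sketch of this idea (class and parameter names are assumptions, not the paper's code): each transition is stored with priority |TD error| + eps, and sampled with probability proportional to priority^alpha.

```python
import random

class PrioritizedReplay:
    """Toy proportional prioritized replay: P(i) ~ (|td_error_i| + eps)^alpha."""

    def __init__(self, alpha=0.6, eps=1e-3):
        self.alpha, self.eps = alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, k):
        # random.choices draws with replacement, weighted by priority
        idx = random.choices(range(len(self.data)), weights=self.priorities, k=k)
        return [self.data[i] for i in idx]

    def update_priority(self, i, td_error):
        # after replaying transition i, refresh its priority with the new TD error
        self.priorities[i] = (abs(td_error) + self.eps) ** self.alpha
```

Real implementations use a sum-tree so sampling is O(log n) instead of O(n), and multiply the loss by importance-sampling weights to correct the bias introduced by non-uniform sampling; both are omitted here for brevity.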
Dueling DQN
In Q-Learning (off-policy TD control)
Although this update is able to converge to the true Q-value, it might still not be sufficient. In control, what we care about more is whether an action is better than the other actions in the current state. For example,
Q(s,a1) = 100, Q(s,a2)=101
We know a2 is better than a1, but because the two values are so close, a policy derived from them (e.g., a softmax over Q-values) assigns them almost the same probability of being chosen. So, how do we get a representation that addresses this problem? => the advantage function.
What we need is to compare the action value with the state value.
The state value is the expected return from this state when following policy pi to choose subsequent actions. If the action value is much larger than the state value, this action is much better than expected, and vice versa.
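Concretely, the advantage function is the gap between the two (the value V(s) = 100.5 below is an assumed number for illustration, e.g., a near-uniform policy over the two actions):

```latex
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
```

With Q(s, a1) = 100, Q(s, a2) = 101, and V(s) = 100.5, we get A(s, a1) = -0.5 and A(s, a2) = +0.5. The huge shared baseline of 100 disappears, and the sign of A directly tells us which action beats the policy's expectation.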
In neural network architecture, we can have the following.
Let's modify equation (3).
We see that Q is actually the sum of V and A. Training a dueling DQN is similar to DQN, i.e., by backpropagation. However, if we look at equation (7), you might observe a problem.
Equation (7) is unidentifiable in the sense that, given Q, we cannot recover V and A uniquely: adding a constant to V and subtracting it from every A leaves Q unchanged. So we need some modification or constraint.
In the paper, they propose equations (8) and (9). However, I do not totally buy into these ideas; I believe there is a better way to impose this constraint. I'll temporarily stop here.
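For reference, the mean-subtraction trick from the paper (their equation (9), if the numbering here matches) can be sketched as a single aggregation step, with no deep-learning framework needed; in a real dueling DQN, `v` and `advantages` would be the outputs of the two network heads:

```python
def dueling_q(v, advantages):
    """Combine a scalar state value and per-action advantages into Q-values:
    Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean_a = sum(advantages) / len(advantages)
    return [v + a - mean_a for a in advantages]

q = dueling_q(100.5, [-0.5, 0.5])
print(q)  # [100.0, 101.0]
```

Note that shifting all advantages by a constant (e.g., using [0.5, 1.5] instead of [-0.5, 0.5]) yields exactly the same Q-values, which is precisely how the mean subtraction pins down one representative from the family of (V, A) pairs that equation (7) leaves ambiguous.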