
ICLR 2018 paper by Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel

They propose a method to address a problem in model-based reinforcement learning. In model-based RL, the learning process alternates between model learning and policy optimization, and the learned model is used to search for an improved policy. However, policy optimization tends to exploit regions where too little data is available to train the model accurately, leading to catastrophic failures.

Their idea is to use an ensemble of models {f_1, f_2, …, f_K} to learn the environment dynamics (transition probabilities). The models are trained via standard supervised learning and differ only in their initial weights and the order in which mini-batches are sampled. The ensemble serves as effective regularization for policy learning and reduces model bias.
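A minimal sketch of the ensemble idea (not the authors' code): K models trained on the same transitions, differing only in their random init and mini-batch order. Simple linear models stand in for the paper's neural networks, and all hyperparameters here are illustrative.

```python
import numpy as np

def train_dynamics_ensemble(states, actions, next_states, k=5, epochs=20, lr=0.1, seed=0):
    X = np.hstack([states, actions])   # model input: (s, a)
    Y = next_states                    # regression target: s'
    models = []
    for i in range(k):
        rng = np.random.default_rng(seed + i)  # different init per model
        W = rng.normal(scale=0.1, size=(X.shape[1], Y.shape[1]))
        for _ in range(epochs):
            order = rng.permutation(len(X))    # different sample order per model
            for j in order:
                x, y = X[j:j + 1], Y[j:j + 1]
                W -= lr * x.T @ (x @ W - y)    # SGD on squared error
        models.append(W)
    return models

def ensemble_predict(models, state, action):
    x = np.concatenate([state, action])[None, :]
    preds = np.stack([x @ W for W in models])
    # the mean is the prediction; disagreement (std) flags regions with little data
    return preds.mean(axis=0), preds.std(axis=0)
```

The std across models is the useful by-product: where the ensemble disagrees, the policy is probably exploiting a poorly modeled region.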

DDQN — Double Deep Q-Network (van Hasselt et al., AAAI 2016)

Prioritized Replay (Schaul et al., ICLR 2016)

Dueling DQN (Wang et al, ICML 2016, best paper)


Please refer here.

Prioritized Replay

We use a replay buffer in RL algorithms to improve training efficiency. However, not all transitions in the buffer are equally useful to sample. For example,

Our initial state is S1, and we can take action L or action R. The left goal has reward 1, the right goal has reward 10, and all other rewards are zero. Assume our first episode is the following.
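The fix is to replay surprising transitions more often. Here is a sketch of the proportional variant of prioritized sampling (in the spirit of Schaul et al.): each transition's priority grows with its TD error, and importance-sampling weights correct the resulting bias. The `alpha`, `eps`, and beta=1 choices here are illustrative, not the paper's tuned values.

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-2, rng=None):
    rng = rng or np.random.default_rng()
    # p_i = (|delta_i| + eps)^alpha; eps keeps zero-error transitions samplable
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()          # P(i) proportional to p_i
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    # importance-sampling weights (beta = 1), normalized by the max for stability
    weights = (len(td_errors) * probs[idx]) ** -1.0
    return idx, weights / weights.max()
```

With one large TD error in the buffer, that transition dominates the sampled batch instead of appearing 1/N of the time.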

This problem is known as the maximization bias problem.

In the RL book (Sutton & Barto),

In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias. To see why, consider a single state s where there are many actions a whose true values, q(s, a), are all zero but whose estimated values, Q(s, a), are uncertain and thus distributed some above and some below zero.

Let's start with the example below.

State A is the starting state. In state A, taking action 'right' goes to a terminal state and…
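The standard fix is double Q-learning: keep two tables, let one select the greedy action and the other evaluate it, which removes the positive bias of using max_a Q(s', a) both to select and to evaluate. A tabular sketch (illustrative, not tied to the example's exact MDP):

```python
import random

def double_q_update(Q1, Q2, s, a, r, s2, actions, alpha=0.1, gamma=0.99, done=False):
    # with probability 1/2 update Q2 instead of Q1 (swap local names only)
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1
    if done:
        target = r
    else:
        # Q1 selects the action, Q2 evaluates it -> no maximization bias
        a_star = max(actions, key=lambda b: Q1[(s2, b)])
        target = r + gamma * Q2[(s2, a_star)]
    Q1[(s, a)] += alpha * (target - Q1[(s, a)])
```

Using, e.g., `defaultdict(float)` for the two tables makes unseen state-action pairs default to zero.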

In my PhD program, the first year was for the prelim and the second year for the qual. By my third year as an EE PhD student, I felt an internship might help my PhD career: I could learn from practical, experienced people, and join a slightly different project to get inspired. At the end of 2019, I decided to start looking for an internship.

Unfortunately, things didn't go as I expected.

Finding an internship turned out to be both easy and not so easy.

My very first interview was with a startup spun out of SRI International. I loved their CEO, CTO…

Paper: “Overcoming catastrophic forgetting in neural networks”.


This post is mainly about the catastrophic forgetting problem in multi-task learning. In short, we have tasks 1, 2, …, n and learn them in sequence. Each time we learn a new task, we forget how to do the old ones, so applying the updated model to an old task gives poor performance. One solution is to learn all tasks at once, but that requires a large amount of storage for the training data and a large amount of computation to optimize the model. This paper proposes EWC (Elastic Weight Consolidation), which lets the model learn tasks sequentially without forgetting or destroying previously learned skills.



Task A and task B have different loss functions, so their optima are not the same. If the two loss landscapes look like the figure on the left, we want training on task B not to hurt performance on task A. So we want neither the green path nor the blue path; we want the red one. How do we achieve that?

In short, we do not want to move the important parameters, so we simply encode this idea into the loss function: the new parameters should not stray far from the old ones, otherwise a large loss is incurred. But not every parameter needs to be frozen, so each penalty term is weighted by F_i, which measures how important that parameter was for previous tasks. If F_i is small, the parameter was not that important, and we can change it while training the new task, even by a large amount, without incurring much loss.
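The idea above fits in one line of math. A minimal sketch of the EWC objective, where `fisher` plays the role of the F_i importance weights and `lam` trades off old vs. new tasks (both illustrative values here):

```python
import numpy as np

def ewc_loss(task_b_loss, theta, theta_a, fisher, lam=1.0):
    # L(theta) = L_B(theta) + (lambda / 2) * sum_i F_i * (theta_i - theta_A,i)^2
    penalty = 0.5 * lam * np.sum(fisher * (theta - theta_a) ** 2)
    return task_b_loss + penalty
```

When F_i is zero the parameter moves freely (no penalty); when F_i is large, moving away from the task-A solution theta_A is heavily punished.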

This paper introduces the Relation Network, which mainly addresses the similarity-function problem in few-shot learning: its architecture can optimize the best similarity function by itself. For a related approach, see the Prototypical Network.


In short, this paper proposes a fairly simple architecture for few-shot classification. As a concrete example, consider face unlock on a phone: when you first get a new phone, you give it only a few photos of yourself, and afterwards the phone must decide whether each incoming image is you in order to unlock. Since you provide so few images, this is a few-shot classification problem.
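For contrast, here is the baseline the Relation Network improves on: few-shot classification with a fixed, hand-picked similarity (cosine) between a query embedding and the support embeddings. The Relation Network's point is to replace this fixed metric with a small learned network; the plain vectors below just stand in for embeddings.

```python
import numpy as np

def classify(query, support, labels):
    # nearest support example under cosine similarity; the Relation Network
    # would instead learn this scoring function end-to-end
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    scores = [cos(query, s) for s in support]
    return labels[int(np.argmax(scores))]
```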


Log in (log in to AWS with the Jupyter and TensorBoard ports forwarded)
You can create a shell script to execute the commands below. Please replace "AWS_AMI.pem", "IP", and "region" with your own.

ssh -N -L localhost:8888:localhost:8888 -i "AWS_AMI.pem" ubuntu@"IP.region" &
ssh -N -L localhost:6006:localhost:6006 -i "AWS_AMI.pem" ubuntu@"IP.region" &
ssh -i "AWS_AMI.pem" ubuntu@"IP.region"



Start a Jupyter notebook

xvfb-run -a -s "-screen 0 1400x900x24 +extension RANDR" jupyter notebook --no-browser

Start TensorBoard

tensorboard --logdir ./

Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO

This is an ICLR 2020 oral paper from Logan Engstrom at MIT.

Main contribution: a detailed study of the performance gap between PPO and TRPO, showing that the gap comes from subtle code-level optimizations. In short, the paper spells out many implementation details that PPO and TRPO rely on in practice but that are rarely made explicit.


In short, TRPO constrains the policy update with a KL divergence so that the new policy does not move too far from the old one, updating carefully, one small step at a time; hence the name Trust Region Policy Optimization (TRPO). The original PPO paper instead uses clipping (CLIP) to constrain the policy update, with the same goal of careful, incremental updates. Experimentally, PPO is more stable and performs better than TRPO, which was originally attributed to the clipping; this ICLR paper digs deeper.


Background: TRPO uses a KL constraint on the policy update. KL divergence measures how dissimilar two distributions are, and TRPO requires that each update not change the policy too much.
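For concreteness, the quantity TRPO constrains between the old and new policies, written for discrete action distributions (a sketch, not TRPO's full constrained-optimization machinery):

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i)
    # zero iff p == q; note it is asymmetric in p and q
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))
```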

Background: PPO uses CLIP on the policy update.
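The clipped surrogate for a single sample, as a sketch: with probability ratio r(theta) = pi_new(a|s) / pi_old(a|s), taking the min with the clipped ratio removes any incentive to push r outside [1 - eps, 1 + eps].

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # L_CLIP = min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)
```

Note the asymmetry: for a positive advantage the objective is capped once r exceeds 1 + eps, but for a negative advantage the unclipped (worse) value is kept, so bad moves are still fully penalized.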

No long story here. Just some frequently useful tips for reference.

Code Block

Type ``` and you will get a gray block like the one below.

print("hello world")

Inline code

How do you type this abcdefg? => type ` followed by the characters.


I have no idea how to use LaTeX in Medium yet.

Keep updating

