The test score is different from the DeepMind paper #32

Open

futurecrew opened this issue Aug 22, 2016 · 14 comments

futurecrew commented Aug 22, 2016

Hi,
Thank you for the great project.

While testing simple_dqn I found that its test scores differ from those in the DeepMind paper.

The DeepMind paper 'Prioritized Experience Replay' (http://arxiv.org/pdf/1511.05952v3.pdf)
shows learning curves for DQN in Figure 7; according to the paper, the gray curve is the original DQN.

Comparing the curves in the paper with the PNG files in the simple_dqn/results folder,
the test scores are noticeably different.

In Breakout the paper shows the original DQN reaching a score of around 320, but simple_dqn doesn't.
In Seaquest the paper shows the original DQN reaching more than 3000, but simple_dqn doesn't.

The paper doesn't say much about the test code or environment behind the original DQN result,
and I'm not sure whether the paper's result was produced with the following DeepMind code:
https://sites.google.com/a/deepmind.com/dqn/

Do you have any idea why there are score differences between simple_dqn and the DeepMind paper?

Thank you

mthrok (Contributor) commented Aug 23, 2016

AFAIK, the RMSProp optimizer implementation and the screen preprocessing are different from the paper.

In the DeepMind code, RMSProp is implemented here; it is similar to the RMSProp variant by A. Graves (see page 23, eq. 40; the parameters are not the same as in the DQN paper).

simple_dqn uses Neon's RMSProp. Notice that epsilon appears twice and that the denominator does not subtract the square of the mean gradient.
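
For reference, here is a minimal NumPy sketch of the two update rules being compared. The decay and epsilon constants are illustrative only (not Neon's or DeepMind's exact values), and the Graves variant is shown without its momentum term.

import numpy as np

def rmsprop_plain(w, g, n, lr=0.00025, decay=0.95, eps=1e-6):
    # Plain RMSProp: scale the gradient by the root of a running mean of g^2.
    n = decay * n + (1 - decay) * g ** 2
    w = w - lr * g / (np.sqrt(n) + eps)
    return w, n

def rmsprop_graves(w, g, n, gbar, lr=0.00025, decay=0.95, eps=0.01):
    # Graves-style RMSProp: the denominator also subtracts the square of a
    # running mean gradient, making it a (biased) variance estimate.
    n = decay * n + (1 - decay) * g ** 2
    gbar = decay * gbar + (1 - decay) * g
    w = w - lr * g / np.sqrt(n - gbar ** 2 + eps)
    return w, n, gbar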

The difference in screen preprocessing is mentioned here. simple_dqn uses the frame averaged over the skipped frames (ALE's built-in functionality) instead of the pixel-wise max over two successive frames, as in the paper.
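
To illustrate the two preprocessing schemes, a rough sketch assuming grayscale frames as NumPy arrays (not simple_dqn's actual code):

import numpy as np

def preprocess_average(frames):
    # ALE-style: average the screen over the skipped frames.
    return np.mean(np.stack(frames), axis=0).astype(np.uint8)

def preprocess_max(frames):
    # Paper-style: pixel-wise max over the last two raw frames, which removes
    # flicker from sprites drawn only on alternating frames.
    return np.maximum(frames[-2], frames[-1])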

Correct me if I am wrong.

tambetm (Owner) commented Aug 23, 2016

Yes @mthrok, these are some of the differences. But I'm not sure how much they matter; for example, you can easily switch from RMSProp to Adam, which seems to be the preferred optimization method these days.

Another important difference is that DeepMind treats loss of life as episode end, which I don't do yet. I would expect more substantial differences from that, but who knows.

There is also a discussion about matching DeepMind's results on the deep-q-learning list:
https://groups.google.com/forum/#!topic/deep-q-learning/JV384mQcylo

Keeping this issue open till we figure out the differences.

futurecrew (Author) commented Aug 24, 2016

I got a score similar to the DeepMind paper in kung_fu_master after treating loss of life as the terminal state.
Previously, with the original simple_dqn source, kung_fu_master showed a much lower score.

I treated loss of life as the terminal state but didn't reset the game, which differs from DeepMind.
Without the modification (original simple_dqn), the test score in kung_fu_master was 155 at epoch 22.
With the modification, I got a test score of 9,500 at epoch 22 and 19,893 at epoch 29.
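
For anyone reproducing this, a minimal sketch of that variant using the ALE Python interface (the wrapper itself is hypothetical, not simple_dqn's or my actual code): the transition is flagged as terminal for the learner when a life is lost during training, but the emulator is only reset on a real game over.

def step(ale, action, train=True):
    # One environment step; flags life loss as terminal for the replay memory
    # while letting the episode run on until a real game over.
    lives_before = ale.lives()
    reward = ale.act(action)
    game_over = ale.game_over()
    lost_life = ale.lives() < lives_before

    # What the learner stores as the terminal flag:
    terminal = game_over or (train and lost_life)

    # Only reset the emulator on a real game over, not on every life loss.
    if game_over:
        ale.reset_game()

    return reward, terminal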

mthrok (Contributor) commented Aug 26, 2016

@tambetm

Just to clarify, the current implementation does not store test experience, right?
The README still says it does (https://github.com/tambetm/simple_dqn#known-differences), but 1b9ceac seems to fix it.

futurecrew (Author) commented Aug 31, 2016

Two more changes that bring the results closer to the DeepMind paper.

  1. The DeepMind code copies the trained net to the target net once every 10,000 steps, and those 10,000 steps are environment steps, not training steps.
    After counting environment steps I got much faster learning curves (and maybe also higher test scores); see the sketch after the code below.

  2. DeepMind uses a fan_in parameter initializer while simple_dqn uses a Gaussian.
    After switching to a Xavier initializer (which is similar to fan_in) I got faster learning curves and higher test scores than with the Gaussian:

from neon.initializers import Xavier
from neon.layers import Affine, Conv
from neon.transforms import Rectlin

initializer = Xavier()
layers = [
    Conv(fshape=(8, 8, 32), strides=4, init=initializer, bias=initializer, activation=Rectlin()),
    Conv(fshape=(4, 4, 64), strides=2, init=initializer, bias=initializer, activation=Rectlin()),
    Conv(fshape=(3, 3, 64), strides=1, init=initializer, bias=initializer, activation=Rectlin()),
    Affine(nout=512, init=initializer, bias=initializer, activation=Rectlin()),
    Affine(nout=num_actions, init=initializer, bias=initializer)
]
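
Regarding point 1, here is a rough sketch of a training loop where the target network copy is keyed to environment steps (agent observations) rather than to parameter updates; the agent and env interfaces are hypothetical, not simple_dqn's actual API.

def train(agent, env, total_steps=1000000,
          target_sync_interval=10000,   # measured in env steps (agent observations)
          train_frequency=4):           # one SGD update every 4 env steps
    for env_step in range(1, total_steps + 1):
        agent.act_and_store(env)            # one observation goes into replay memory

        if env_step % train_frequency == 0:
            agent.train_minibatch()         # one parameter update

        # Sync on env steps, not on the number of parameter updates.
        if env_step % target_sync_interval == 0:
            agent.sync_target_network()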

With these two changes plus the previous fix (treating loss of life as game over during training),
I got test scores in kung_fu_master up to epoch 32 that are close to the DQN curve in Figure 7 of
'Prioritized Experience Replay' (http://arxiv.org/pdf/1511.05952v3.pdf).

mthrok (Contributor) commented Aug 31, 2016

@only4hj

Good job!

Just to clarify:

  1. 10,000 env steps is equivalent to 2,500 observations from the agent's point of view when skip_frame=4, right?

  2. Can you point me to the initialization part of the code? Is it torch nn's default initialization or a custom one?

BTW, what value of repeat_action_probability are you using?

mthrok (Contributor) commented Aug 31, 2016

@only4hj
In the original DQN paper, the target network update frequency is described as "The frequency (measured in the number of parameter updates) with which the target network is updated ...".

And it seems to me that the corresponding code tracks the number of training steps.
Never mind, you are right, it is the number of perceptions (agent observations); I was misreading it, sorry.

tambetm added a commit that referenced this issue Aug 31, 2016

tambetm (Owner) commented Aug 31, 2016

Thanks @only4hj and @mthrok for the wonderful analysis; I included bits of it in the README. I would be happy to merge any pull requests regarding this. The target network update interval and Xavier initialization in particular seem like trivial fixes.

futurecrew (Author) commented

@mthrok

  1. 10,000 env steps means 10,000 observations from the agent's viewpoint with skip_frame=4
    (which means 40,000 ALE frames).

  2. I think it's torch's default initialization. See these:
    https://groups.google.com/d/msg/deep-q-learning/JV384mQcylo/De1Jzc0hAAAJ
    https://github.com/gtoubassi/dqn-atari/blob/master/dqn.py#L148

Regarding repeat_action_probability,
I didn't use ALE's frame_skip.
Instead I skipped frames in my own implementation as follows, with frame_repeat = 4.
This works just like repeat_action_probability = 0.

https://github.com/only4hj/DeepRL/blob/master/deep_rl_player.py#L165

def doActions(self, actionIndex, mode):
    action = self.legalActions[actionIndex]
    reward = 0
    lostLife = False
    lives = self.ale.lives()
    for f in range(self.settings['frame_repeat']):
        reward += self.ale.act(action)
        gameOver = self.ale.game_over()
        if self.ale.lives() < lives or gameOver:
            lostLife = True
            # Treat loss of life as game over during training only.
            if mode == 'TRAIN' and self.settings['lost_life_game_over'] == True:
                gameOver = True
            break
    state = self.getScreenPixels()
    return reward, state, lostLife, gameOver
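
(For comparison, if one does use ALE's built-in frame skip instead, the stochastic sticky-action repeat can be disabled explicitly. A hedged snippet assuming the ale_python_interface API; the exact key/ROM argument handling may differ between ALE versions.)

from ale_python_interface import ALEInterface

ale = ALEInterface()
# Disable ALE's stochastic sticky actions so behaviour matches an explicit
# frame_repeat loop like the one above.
ale.setFloat('repeat_action_probability', 0.0)
ale.loadROM('breakout.bin')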

mthrok (Contributor) commented Sep 2, 2016

@only4hj

Thank you for clarifying. Now I understand that the frame skip is handled inside your ALE wrapper and is invisible to the agent.

I was having trouble setting ALE's repeat_action_probability to 1; now I see how you are testing things. Thank you very much.

@tambetm
I created a PR to change the initialization method: #33
I will try to investigate the target network sync issue mentioned above after this.

tambetm (Owner) commented Sep 4, 2016

Thanks @mthrok, merged the PR! Keep me posted if you figure out the network update interval.

kerawits commented Feb 16, 2017

Quoted from the Nature paper:

target network update frequency: 10000: The frequency (measured in the number of parameter updates) with which the target network is updated (this corresponds to the parameter C from Algorithm 1).
action repeat: 4: Repeat each action selected by the agent this many times. Using a value of 4 results in the agent seeing only every 4th frame
update frequency: 4: The number of actions selected by the agent between successive SGD updates. Using a value of 4 results in the agent selecting 4 actions between each pair of successive updates

Since the agent sees the screen and makes a prediction only once every 4th frame (action repeat = 4), and it updates its online network only once every 4th prediction (update frequency = 4), with target network update frequency = 10,000 doesn't that mean the target network should be updated once every 10,000 parameter updates, i.e. once every 40,000 predictions, which is once every 160,000 frames?
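
A quick check of that arithmetic under the quoted hyperparameters, contrasting the two possible readings of the 10,000 figure:

action_repeat = 4       # agent sees every 4th frame
update_frequency = 4    # one SGD update per 4 agent actions
target_update = 10000

# Reading 1: 10,000 counts parameter updates (as the Nature paper states).
predictions = target_update * update_frequency   # 40,000 agent predictions
frames = predictions * action_repeat             # 160,000 ALE frames

# Reading 2: 10,000 counts agent steps (the interpretation discussed earlier in this thread).
frames_alt = target_update * action_repeat       # 40,000 ALE frames

print(predictions, frames, frames_alt)           # 40000 160000 40000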

tambetm (Owner) commented Feb 18, 2017

@kerawits your reasoning seems valid. Because we see only every 4th frame, I think the target_steps parameter value should be 40,000. It would be nice if somebody could do a test run...

Seraphli commented May 3, 2017

I ran the code from here and the mean score seems to be able to reach 400. BTW, I changed the network architecture to the original DQN one; the original code has that part commented out.
