The test score is different from the DeepMind paper #32

Open

futurecrew opened this issue Aug 22, 2016 · 14 comments

futurecrew commented Aug 22, 2016

Hi,
Thank you for the great project.

While testing simple_dqn I found that its test scores differ from those in the DeepMind paper.

The DeepMind paper 'Prioritized Experience Replay' (http://arxiv.org/pdf/1511.05952v3.pdf)
shows learning curves for DQN in Figure 7; according to the paper, the gray curve is the original DQN.

Comparing the curves in the paper with the PNG files in the simple_dqn/results folder,
the test scores are noticeably different.

In Breakout the paper shows the original DQN reaching a score of around 320, but simple_dqn doesn't.
In Seaquest the paper shows the original DQN reaching more than 3000, but simple_dqn doesn't.

The paper doesn't say much about the test code or environment behind the original DQN result,
and I'm not sure whether the paper's result was produced with the following DeepMind code:
https://sites.google.com/a/deepmind.com/dqn/

Do you have any idea why there are score differences between simple_dqn and the DeepMind paper?

Thank you

mthrok (Contributor) commented Aug 23, 2016

AFAIK, the RMSProp optimizer implementation and the screen preprocessing are different from the paper.

In the DeepMind code, RMSProp is implemented here; it is similar to the RMSProp variant by A. Graves (see page 23, eq. 40; the parameters are not the same as in the DQN paper).

simple_dqn uses Neon's RMSProp. Notice that epsilon appears twice and that the denominator does not subtract the square of the mean gradient.
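
For reference, here is a minimal NumPy sketch of the two update rules being compared. The decay and epsilon constants are illustrative only (not Neon's or DeepMind's exact values), and the Graves variant is shown without its momentum term.

import numpy as np

def rmsprop_plain(w, g, n, lr=0.00025, decay=0.95, eps=1e-6):
    # Plain RMSProp: scale the gradient by the root of a running mean of g^2.
    n = decay * n + (1 - decay) * g ** 2
    w = w - lr * g / (np.sqrt(n) + eps)
    return w, n

def rmsprop_graves(w, g, n, gbar, lr=0.00025, decay=0.95, eps=0.01):
    # Graves-style RMSProp: the denominator also subtracts the square of a
    # running mean gradient, making it a (biased) variance estimate.
    n = decay * n + (1 - decay) * g ** 2
    gbar = decay * gbar + (1 - decay) * g
    w = w - lr * g / np.sqrt(n - gbar ** 2 + eps)
    return w, n, gbar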

The difference in screen preprocessing is mentioned here. simple_dqn uses the frame averaged over the skipped frames (ALE's built-in functionality) instead of the pixel-wise max over two successive frames, as in the paper.
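
To illustrate the two preprocessing schemes, a rough sketch assuming grayscale frames as NumPy arrays (not simple_dqn's actual code):

import numpy as np

def preprocess_average(frames):
    # ALE-style: average the screen over the skipped frames.
    return np.mean(np.stack(frames), axis=0).astype(np.uint8)

def preprocess_max(frames):
    # Paper-style: pixel-wise max over the last two raw frames, which removes
    # flicker from sprites drawn only on alternating frames.
    return np.maximum(frames[-2], frames[-1])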

Correct me if I am wrong.

tambetm (Owner) commented Aug 23, 2016

Yes @mthrok, these are some of the differences. But I'm not sure how much they matter; for example, you can easily switch from RMSProp to Adam, which seems to be the preferred optimization method these days.

Another important difference is that DeepMind treats loss of life as episode end, which I don't do yet. I would expect more substantial differences from that, but who knows.

There is also a discussion about matching DeepMind's results on the deep-q-learning list:
https://groups.google.com/forum/#!topic/deep-q-learning/JV384mQcylo

Keeping this issue open till we figure out the differences.

futurecrew (Author) commented Aug 24, 2016

I got a score similar to the DeepMind paper in kung_fu_master after treating loss of life as the terminal state.
Previously, with the original simple_dqn source, kung_fu_master showed a much lower score.

I treated loss of life as the terminal state but didn't reset the game, which differs from DeepMind.
Without the modification (original simple_dqn), the test score in kung_fu_master was 155 at epoch 22.
With the modification, I got a test score of 9,500 at epoch 22 and 19,893 at epoch 29.
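
For anyone reproducing this, a minimal sketch of that variant using the ALE Python interface (the wrapper itself is hypothetical, not simple_dqn's or my actual code): the transition is flagged as terminal for the learner when a life is lost during training, but the emulator is only reset on a real game over.

def step(ale, action, train=True):
    # One environment step; flags life loss as terminal for the replay memory
    # while letting the episode run on until a real game over.
    lives_before = ale.lives()
    reward = ale.act(action)
    game_over = ale.game_over()
    lost_life = ale.lives() < lives_before

    # What the learner stores as the terminal flag:
    terminal = game_over or (train and lost_life)

    # Only reset the emulator on a real game over, not on every life loss.
    if game_over:
        ale.reset_game()

    return reward, terminal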

mthrok (Contributor) commented Aug 26, 2016

@tambetm

Just to clarify, the current implementation does not store test experience, right?
The README still says it does (https://github.com/tambetm/simple_dqn#known-differences), but 1b9ceac seems to fix it.

futurecrew (Author) commented Aug 31, 2016

Two more changes that bring the results closer to the DeepMind paper.

  1. The DeepMind code copies the trained net to the target net once every 10,000 steps, and those 10,000 steps are environment steps, not training steps.
    After counting environment steps I got much faster learning curves (and maybe also higher test scores); see the sketch after the code below.

  2. DeepMind uses a fan_in parameter initializer while simple_dqn uses a Gaussian.
    After switching to a Xavier initializer (which is similar to fan_in) I got faster learning curves and higher test scores than with the Gaussian:

from neon.initializers import Xavier
from neon.layers import Affine, Conv
from neon.transforms import Rectlin

initializer = Xavier()
layers = [
    Conv(fshape=(8, 8, 32), strides=4, init=initializer, bias=initializer, activation=Rectlin()),
    Conv(fshape=(4, 4, 64), strides=2, init=initializer, bias=initializer, activation=Rectlin()),
    Conv(fshape=(3, 3, 64), strides=1, init=initializer, bias=initializer, activation=Rectlin()),
    Affine(nout=512, init=initializer, bias=initializer, activation=Rectlin()),
    Affine(nout=num_actions, init=initializer, bias=initializer)
]
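
Regarding point 1, here is a rough sketch of a training loop where the target network copy is keyed to environment steps (agent observations) rather than to parameter updates; the agent and env interfaces are hypothetical, not simple_dqn's actual API.

def train(agent, env, total_steps=1000000,
          target_sync_interval=10000,   # measured in env steps (agent observations)
          train_frequency=4):           # one SGD update every 4 env steps
    for env_step in range(1, total_steps + 1):
        agent.act_and_store(env)            # one observation goes into replay memory

        if env_step % train_frequency == 0:
            agent.train_minibatch()         # one parameter update

        # Sync on env steps, not on the number of parameter updates.
        if env_step % target_sync_interval == 0:
            agent.sync_target_network()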

With these two changes plus the previous fix (treating loss of life as game over during training),
I got test scores in kung_fu_master up to epoch 32 that are close to the DQN curve in Figure 7 of
'Prioritized Experience Replay' (http://arxiv.org/pdf/1511.05952v3.pdf).

mthrok (Contributor) commented Aug 31, 2016

@only4hj

Good job!

Just to clarify:

  1. 10,000 env steps is equivalent to 2,500 observations from the agent's point of view when skip_frame=4, right?

  2. Can you point me to the initialization part of the code? Is it torch nn's default initialization or a custom one?

BTW, what value of repeat_action_probability are you using?

mthrok (Contributor) commented Aug 31, 2016

@only4hj
In the original DQN paper, the target network update frequency is described as "The frequency (measured in the number of parameter updates) with which the target network is updated ...".

And it seems to me that the corresponding code tracks the number of training steps.
Never mind, you are right, it is the number of perceptions (agent observations); I was misreading it, sorry.

tambetm added a commit that referenced this issue Aug 31, 2016

tambetm (Owner) commented Aug 31, 2016

Thanks @only4hj and @mthrok for the wonderful analysis; I included bits of it in the README. I would be happy to merge any pull requests regarding this. The target network update interval and Xavier initialization in particular seem like trivial fixes.

futurecrew (Author) commented

@mthrok

  1. 10,000 env steps means 10,000 observations from the agent's viewpoint with skip_frame=4
    (which means 40,000 ALE frames).

  2. I think it's torch's default initialization. See these:
    https://groups.google.com/d/msg/deep-q-learning/JV384mQcylo/De1Jzc0hAAAJ
    https://github.com/gtoubassi/dqn-atari/blob/master/dqn.py#L148

Regarding repeat_action_probability,
I didn't use ALE's frame_skip.
Instead I skipped frames in my own implementation as follows, with frame_repeat = 4.
This works just like repeat_action_probability = 0.

https://github.com/only4hj/DeepRL/blob/master/deep_rl_player.py#L165

def doActions(self, actionIndex, mode):
    action = self.legalActions[actionIndex]
    reward = 0
    lostLife = False
    lives = self.ale.lives()
    for f in range(self.settings['frame_repeat']):
        reward += self.ale.act(action)
        gameOver = self.ale.game_over()
        if self.ale.lives() < lives or gameOver:
            lostLife = True
            # Treat loss of life as game over during training only.
            if mode == 'TRAIN' and self.settings['lost_life_game_over'] == True:
                gameOver = True
            break
    state = self.getScreenPixels()
    return reward, state, lostLife, gameOver
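
(For comparison, if one does use ALE's built-in frame skip instead, the stochastic sticky-action repeat can be disabled explicitly. A hedged snippet assuming the ale_python_interface API; the exact key/ROM argument handling may differ between ALE versions.)

from ale_python_interface import ALEInterface

ale = ALEInterface()
# Disable ALE's stochastic sticky actions so behaviour matches an explicit
# frame_repeat loop like the one above.
ale.setFloat('repeat_action_probability', 0.0)
ale.loadROM('breakout.bin')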

mthrok (Contributor) commented Sep 2, 2016

@only4hj

Thank you for clarifying. Now I understand that the frame skip is handled inside your ALE wrapper and is invisible to the agent.

I was having trouble setting ALE's repeat_action_probability to 1; now I see how you are testing things. Thank you very much.

@tambetm
I created a PR to change the initialization method: #33
I will try to investigate the target network sync issue mentioned above after this.

tambetm (Owner) commented Sep 4, 2016

Thanks @mthrok, merged the PR! Keep me posted if you figure out the network update interval.

kerawits commented Feb 16, 2017

Quoted from the Nature paper:

target network update frequency: 10000: The frequency (measured in the number of parameter updates) with which the target network is updated (this corresponds to the parameter C from Algorithm 1).
action repeat: 4: Repeat each action selected by the agent this many times. Using a value of 4 results in the agent seeing only every 4th frame
update frequency: 4: The number of actions selected by the agent between successive SGD updates. Using a value of 4 results in the agent selecting 4 actions between each pair of successive updates

Since the agent sees the screen and makes a prediction only once every 4th frame (action repeat = 4), and it updates its online network only once every 4th prediction (update frequency = 4), with target network update frequency = 10,000 doesn't that mean the target network should be updated once every 10,000 parameter updates, i.e. once every 40,000 predictions, which is once every 160,000 frames?
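
A quick check of that arithmetic under the quoted hyperparameters, contrasting the two possible readings of the 10,000 figure:

action_repeat = 4       # agent sees every 4th frame
update_frequency = 4    # one SGD update per 4 agent actions
target_update = 10000

# Reading 1: 10,000 counts parameter updates (as the Nature paper states).
predictions = target_update * update_frequency   # 40,000 agent predictions
frames = predictions * action_repeat             # 160,000 ALE frames

# Reading 2: 10,000 counts agent steps (the interpretation discussed earlier in this thread).
frames_alt = target_update * action_repeat       # 40,000 ALE frames

print(predictions, frames, frames_alt)           # 40000 160000 40000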

tambetm (Owner) commented Feb 18, 2017

@kerawits your reasoning seems valid. Because we see only every 4th frame, I think the target_steps parameter value should be 40,000. It would be nice if somebody could do a test run...

Seraphli commented May 3, 2017

I ran the code from here and the mean score seems to be able to reach 400. BTW, I changed the network architecture to the original DQN one; the original code has that part commented out.
