Variable model_1/loss/ExponentialMovingAverage/ does not exist #54
Comments
I found that this problem only occurs in multi-GPU training. The same code runs fine when --num_gpus is not set greater than 1. |
@demiguo I have the same issue; have you solved the problem? |
@demiguo I have the same issue. Because of this I can't train on a multi-GPU setup. Has anyone solved the problem? |
I have the same problem |
Anyone found anything? |
The same issue occurs on the version patched to run on TF r1.7. Traceback (most recent call last): |
Hi David, my name is Tian. I'm moving our discussion from email to this issue ticket so that other developers who hit this error can see it.
The issue with the original BiDAF implementation is that it only creates one loss variable for the model. On a single GPU this is fine, because you only have one model. In a multi-GPU setting, BiDAF implements multi-GPU training by replicating the model for every GPU device and assigning one model to each device. This means that each model requires its own loss variable. If the developer only specifies one loss variable, TensorFlow will try to reuse that loss variable for every model, which creates a conflict.
For example, in your error, if you only have one device, the name of the loss variable would be model_0/loss/ExponentialMovingAverage. If you have two devices, another loss variable called model_1/loss/ExponentialMovingAverage is referenced by TensorFlow. Since this variable is not created before you generate the whole model, TensorFlow tries to reuse the variable previously generated for model_0. Does that make sense?
The solution to this conflict is to create a loss variable for every replicated model: https://github.com/stanford-futuredata/dawn-bench-models/blob/master/tensorflow/SQuAD/basic/model.py#L25:L36
Unfortunately I don't have a multi-GPU node available. Would you mind trying this patch on your node and seeing if it works? |
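For illustration, a minimal sketch of the idea behind that patch, with assumed names and structure (this is not the actual dawn-bench-models code): each replica gets its own loss variable and its own EMA shadow variable under its own model_<i> scope.

```python
import tensorflow as tf  # TF 1.x graph-mode API

def build_towers(num_gpus, decay=0.999):
    """Sketch only: one loss variable and one EMA per replicated model,
    each created under its own model_<i>/loss scope, so nothing forces
    TensorFlow to reuse model_0's variables for model_1, model_2, ..."""
    ema_ops = []
    for gpu_idx in range(num_gpus):
        with tf.device("/gpu:{}".format(gpu_idx)):
            with tf.variable_scope("model_{}".format(gpu_idx)):
                with tf.variable_scope("loss"):
                    # Per-tower loss variable: model_<i>/loss/loss
                    loss = tf.get_variable(
                        "loss", shape=[], dtype=tf.float32,
                        initializer=tf.zeros_initializer(),
                        trainable=False)
                    # Per-tower shadow variable:
                    # model_<i>/loss/loss/ExponentialMovingAverage
                    ema = tf.train.ExponentialMovingAverage(decay)
                    ema_ops.append(ema.apply([loss]))
    return ema_ops
```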
I will modify the Andreas Klintberg fork of this, as that is the code base
that works on top-of-tree TF; nothing else does, due to the change in the
handling of flags:
https://www.linkedin.com/in/andreas-klintberg-b7655710/
but I have to wait until the 4-GPU machine I set this up on gets freed up...
I don't want to build everything again LOL
d
|
new distro is https://github.com/klintan/bi-att-flow/tree/dev
|
I took model.py from the DAWN distribution and added it to my modified (for
printout and speed logging) version of the Klintberg distro.
It appears not to actually pay attention to the num_gpus flag and started 4
processes on the 4 V100s.
used CUDA_VISIBLE_DEVICES=0 and num_gpus=1
default batch size
global_step: 100, avg_loss = 8.443472, time = 336.919500
global_step: 200, avg_loss = 7.691555, time = 333.118028
global_step: 300, avg_loss = 7.293336, time = 332.794547
global_step: 400, avg_loss = 6.585279, time = 330.941432
nvidia-smi showed 1 process
unsetting CUDA_VISIBLE_DEVICES and rerunning
one sees 4 processes in nvidia-smi, but only 1 GPU being active :-)
global_step: 100, avg_loss = 8.414022, time = 353.267652
global_step: 200, avg_loss = 7.680358, time = 343.982275
global_step: 300, avg_loss = 7.316520, time = 346.749630
global_step: 400, avg_loss = 6.531937, time = 344.386899
set CUDA_VISIBLE_DEVICES=0,1,2,3 and num_gpus=4
global_step: 100, avg_loss = 8.122040, time = 669.353025
global_step: 200, avg_loss = 6.920084, time = 651.585218
global_step: 300, avg_loss = 5.956660, time = 648.792006
global_step: 400, avg_loss = 5.126814, time = 650.647822
global_step: 500, avg_loss = 4.219713, time = 648.755034
so I am a bit unsure whether things are actually going faster when fanned out
lowering the batch size to 15 while running on 4 GPUs does not change the
output much
global_step: 100, avg_loss = 8.417717, time = 518.246054
global_step: 200, avg_loss = 7.639581, time = 492.312914
global_step: 300, avg_loss = 7.274171, time = 493.591461
global_step: 400, avg_loss = 6.699906, time = 506.484376
|
Hi @demiguo. On TensorFlow 1.12.0, I had the same problem and fixed it by adding the line
with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
before ema.apply. |
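For context, a minimal sketch of wrapping the ema.apply call this way (the function name, arguments, and decay value here are illustrative assumptions, not the repository's exact basic/model.py code):

```python
import tensorflow as tf  # TF 1.x graph-mode API

def build_ema_op(tensors, decay=0.999):
    """Sketch only: apply an ExponentialMovingAverage under AUTO_REUSE so
    its shadow variables (e.g. model_1/loss/ExponentialMovingAverage) can
    be created on every GPU tower instead of triggering a failed reuse."""
    ema = tf.train.ExponentialMovingAverage(decay)
    # Reopen the current variable scope with AUTO_REUSE: the first tower
    # creates the shadow variables, later towers create or reuse their own.
    with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
        ema_op = ema.apply(tensors)
    return ema, ema_op
```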
First I would like to give some context to this issue. It applies only when a different WorkerX/GPUX/etcX tf.name_scope() was created over different instantiations of a model that uses tf.train.ExponentialMovingAverage (commonly used by Batch Normalization). If a "WorkerX" tf.variable_scope() had been used instead, there would be no possibility of reuse, because variables created with tf.get_variable() ignore tf.name_scope(); thus variables obtained with tf.get_variable() inside different tf.variable_scope()s can only be different variables. On the other hand, if there were no tf.name_scope() nor tf.variable_scope() over the different GPU workers, variables created with either tf.Variable or tf.get_variable would have exactly the same scope, giving both the possibility of being properly reused, but probably not producing a very pretty underlying graph, because operations would not be aggregated per worker (an aesthetic/design/maintenance issue).
But as I understand it, using with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE) before ema.apply, as @shimafoolad suggests, will drop reuse across GPUs of loss/ExponentialMovingAverage/ and of any shadow variable created with tf.Variable by ema.apply. That would be bad for the learning of Batch Normalization layers in distributed training, so it does not seem to be a good solution. Maybe there is a way in which the main variables of the BN layers are reused anyway, but I have found no explanation of such a mechanism, and maybe this issue would be resolved with such an explanation. |
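A small stand-alone illustration of the scoping rule described above (toy code, not from the repository): tf.get_variable() ignores tf.name_scope() but respects tf.variable_scope().

```python
import tensorflow as tf  # TF 1.x graph-mode API

with tf.name_scope("worker0"):
    v1 = tf.get_variable("w", shape=[1])  # name_scope ignored: v1.name == "w:0"
    op1 = v1 + 1.0                        # ops ARE prefixed, e.g. "worker0/add:0"

with tf.variable_scope("worker1"):
    v2 = tf.get_variable("w", shape=[1])  # variable_scope applies: "worker1/w:0"

print(v1.name, op1.name, v2.name)
```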
Thought about it again and realized that the trainable parameters of Batch Normalization can be defined with tf.get_variable() together with "with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE)" as @shimafoolad says, as follows:
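A sketch of what such a definition could look like, with illustrative names (not the original snippet):

```python
import tensorflow as tf  # TF 1.x graph-mode API

def bn_params(num_channels):
    """Sketch only: trainable Batch Normalization parameters created with
    tf.get_variable() inside the current variable scope reopened with
    AUTO_REUSE, so the first GPU tower creates gamma/beta and every later
    tower reuses the same variables."""
    with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
        gamma = tf.get_variable("gamma", shape=[num_channels],
                                initializer=tf.ones_initializer())
        beta = tf.get_variable("beta", shape=[num_channels],
                               initializer=tf.zeros_initializer())
    return gamma, beta
```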
For safety, reuse should be False on the first call and True on every call after that. Hope it helps, especially those updating legacy TensorFlow. |
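One common way to realize that "create on the first call, reuse afterwards" advice is the classic shared-tower loop; a generic sketch with assumed names (not this repository's code), which also works on TensorFlow versions that predate tf.AUTO_REUSE:

```python
import tensorflow as tf  # TF 1.x graph-mode API

def build_shared_towers(build_fn, num_gpus):
    """Generic sketch: open the shared variable scope fresh for each tower,
    creating variables on the first tower and reusing them afterwards, so
    all towers share one set of weights."""
    outputs = []
    for gpu_idx in range(num_gpus):
        with tf.device("/gpu:{}".format(gpu_idx)):
            # reuse=None (create) on the first tower, reuse=True afterwards.
            with tf.variable_scope("shared_model",
                                   reuse=True if gpu_idx > 0 else None):
                outputs.append(build_fn())
    return outputs
```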
Hi, I'm running the dev branch code on TensorFlow 1.2.
And I got this error:
Variable model_1/loss/ExponentialMovingAverage/ does not exist or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope.
From the stack trace, it was from basic/model.py, in _build_ema, ema_op=ema.apply(tensors).
I tried to add "with tf.variable_scope(tf.get_variable_scope(), reuse=False):" before ema.apply but that still doesn't work.
Any ideas how can I fix this?
Thanks!
I'm using CUDA 8.0 and cuDNN 5.1, TensorFlow v1.2, Python 3.5.