
Variable model_1/loss/ExponentialMovingAverage/ does not exist #54

Open
demiguo opened this issue Aug 23, 2017 · 13 comments

demiguo commented Aug 23, 2017

Hi, I'm running the dev branch code on TensorFlow 1.2.

I got this error:
Variable model_1/loss/ExponentialMovingAverage/ does not exist or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

From the stack trace, it comes from basic/model.py, in _build_ema, at ema_op = ema.apply(tensors).

I tried to add "with tf.variable_scope(tf.get_variable_scope(), reuse=False):" before ema.apply, but that still doesn't work.

Any ideas how I can fix this?

Thanks!
I'm using CUDA 8.0 and cuDNN 5.1, TensorFlow v1.2, Python 3.5.

@Gandor26

I found this problem only occurs in multi-GPU training. The same code runs fine without --num_gpus > 1.

@xingjinglu

@demiguo I have the same issue. Have you solved the problem?

@uditsaxena

@demiguo I have the same issue. Because of this I can't train on a multi-GPU setup. Has anyone solved the problem?


ghost commented Apr 13, 2018

I have the same problem

@vidhumalik

Anyone found anything?

@David-Levinthal

The same issue occurs on a version patched to run on TF r1.7.
Note that the following version works on 1 GPU but not on >1 GPU:
https://github.com/klintan/bi-att-flow/tree/dev

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/levinth/bi-att-flow-zt1/basic/cli.py", line 128, in <module>
    tf.app.run()
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/levinth/bi-att-flow-zt1/basic/cli.py", line 125, in main
    m(config)
  File "/home/levinth/bi-att-flow-zt1/basic/main.py", line 26, in main
    _train(config)
  File "/home/levinth/bi-att-flow-zt1/basic/main.py", line 85, in _train
    models = get_multi_gpu_models(config)
  File "/home/levinth/bi-att-flow-zt1/basic/model.py", line 21, in get_multi_gpu_models
    model = Model(config, scope, rep=gpu_idx == 0)
  File "/home/levinth/bi-att-flow-zt1/basic/model.py", line 68, in __init__
    self._build_ema()
  File "/home/levinth/bi-att-flow-zt1/basic/model.py", line 298, in _build_ema
    ema_op = ema.apply(tensors)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/training/moving_averages.py", line 405, in apply
    "VarHandleOp"]))
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/training/slot_creator.py", line 179, in create_zeros_slot
    colocate_with_primary=colocate_with_primary)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/training/slot_creator.py", line 156, in create_slot_with_initializer
    dtype)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/training/slot_creator.py", line 65, in _create_slot_var
    validate_shape=validate_shape)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 1297, in get_variable
    constraint=constraint)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 1093, in get_variable
    constraint=constraint)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 439, in get_variable
    constraint=constraint)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 408, in _true_getter
    use_resource=use_resource, constraint=constraint)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 765, in _get_single_variable
    "reuse=tf.AUTO_REUSE in VarScope?" % name)
ValueError: Variable model_1/loss/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?

@kelayamatoz

Hi David,

My name is Tian. I'm moving our discussion from email to this issue ticket so that other developers who run into this error can see it.

The issue with the original bidaf implementation is that it only creates one loss variable for the model. On a single GPU this is fine, because you only have one model. In a multi-GPU setting, bidaf implements multi-GPU training by replicating the model for every GPU device and assigning one model to each device. This means that each model requires its own loss variable. If the developer only specifies one loss variable, TensorFlow tries to reuse that loss variable for every model, which creates a conflict.

For example, in your error: if you only have one device, the name of the loss variable would be model_0/loss/ExponentialMovingAverage. If you have two devices, another loss variable called model_1/loss/ExponentialMovingAverage is referenced by TensorFlow. Since this variable was not created before you built the whole model, TensorFlow tries to reuse the variable previously generated for model_0. Does that make sense?

The way to resolve this conflict is to create a loss variable for every replicated model:
https://github.com/stanford-futuredata/dawn-bench-models/blob/master/tensorflow/SQuAD/basic/model.py#L25:L36

Unfortunately I don't have a multi-GPU node available. Would you mind trying this patch on your node to see if it works?
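
A minimal sketch of the idea (the loop shape is approximated from basic/model.py and the linked patch, not copied from it; only the Model call matches the stack trace):

import tensorflow as tf

def get_multi_gpu_models(config):
    models = []
    for gpu_idx in range(config.num_gpus):
        # Give each replica its own variable scope ("model_0", "model_1", ...)
        # so every replica creates its own variables -- including the EMA
        # shadow variable under loss/ -- instead of trying to reuse model_0's.
        with tf.device("/gpu:%d" % gpu_idx), \
             tf.variable_scope("model_%d" % gpu_idx) as scope:
            model = Model(config, scope, rep=(gpu_idx == 0))
            models.append(model)
    return models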


David-Levinthal commented May 1, 2018 via email


David-Levinthal commented May 2, 2018 via email


David-Levinthal commented May 3, 2018 via email

@shimafoolad

Hi @demiguo. On TensorFlow 1.12.0, I had the same problem and fixed it by adding the line:

        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):

before ema.apply
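
In context, the change looks roughly like this (the body of _build_ema is approximated from the stack trace; _get_ema_tensors and config.decay are hypothetical stand-ins for however the real code collects the tensors and the decay rate; only the variable_scope wrapper is the fix):

def _build_ema(self):
    ema = tf.train.ExponentialMovingAverage(self.config.decay)  # decay attribute assumed
    tensors = self._get_ema_tensors()  # hypothetical helper: the tensors to average
    # AUTO_REUSE lets ema.apply() create the shadow variable (e.g.
    # model_1/loss/ExponentialMovingAverage) when it does not exist yet,
    # instead of raising the "does not exist" ValueError.
    with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
        ema_op = ema.apply(tensors)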


masotrix commented Aug 15, 2019

First I would like to give some context to this issue. It arises only when a distinct WorkerX/GPUX/etc. tf.name_scope() was created around the different instantiations of a model that uses tf.train.ExponentialMovingAverage (commonly used by Batch Normalization). If a distinct "WorkerX" tf.variable_scope() had been used instead, there would be no possibility of reuse: tf.get_variable() ignores tf.name_scope() but not tf.variable_scope(), so variables created with tf.get_variable() inside different tf.variable_scope()s are necessarily different variables.

On the other hand, if there were neither a tf.name_scope() nor a tf.variable_scope() around the different GPU workers, variables created with either tf.Variable or tf.get_variable() would have exactly the same scope, giving both the possibility of being properly reused, though the resulting graph would probably not be very pretty, because operations would not be aggregated per worker (an aesthetic/design/maintenance issue).

But as I understand it, using with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE) before ema.apply, as @shimafoolad suggests, gives up reuse of loss/ExponentialMovingAverage/ across GPUs, and of any shadow variable that ema.apply creates with tf.Variable. That would be bad for the learning of Batch Normalization layers in distributed training, so it does not seem to be a good solution.

Maybe there is a way in which the main variables of BN layers get reused anyway, but I have found no explanation of such a mechanism; perhaps an explanation of it would settle this issue.
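
A tiny demo of the scoping rule mentioned above (TF 1.x; the scope names are just for illustration):

import tensorflow as tf

with tf.name_scope("worker0"):
    a = tf.get_variable("w", shape=[1])   # variable name: "w:0" (name_scope ignored)

with tf.variable_scope("worker1"):
    b = tf.get_variable("w", shape=[1])   # variable name: "worker1/w:0" (a distinct variable)

print(a.name, b.name)  # -> w:0 worker1/w:0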


masotrix commented Aug 15, 2019

I thought about it again and realized that the trainable parameters of Batch Normalization can be defined with tf.get_variable() together with "with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE)" as @shimafoolad says, as follows:

import tensorflow as tf

def batch_norm_template(inputs, is_training, scope,
                        moments_dims, bn_decay, reuse):
  with tf.variable_scope(scope, reuse=reuse):
    num_channels = inputs.get_shape()[-1].value
    # Trainable BN parameters, created with tf.get_variable() so that
    # they can be shared across GPU workers via the outer scope's reuse.
    beta = tf.get_variable(
        'beta', initializer=tf.constant(0.0, tf.float32, [num_channels]),
        trainable=True)
    gamma = tf.get_variable(
        'gamma', initializer=tf.constant(1.0, tf.float32, [num_channels]),
        trainable=True)
    batch_mean, batch_var = tf.nn.moments(inputs, moments_dims,
                                          name='moments')
    decay = bn_decay if bn_decay is not None else 0.9
    ema = tf.train.ExponentialMovingAverage(decay=decay)

    # Operator that maintains moving averages of variables; AUTO_REUSE
    # lets ema.apply() create the shadow variables on the first call
    # and reuse them afterwards.
    with tf.variable_scope(tf.get_variable_scope(),
                           reuse=tf.AUTO_REUSE):
      ema_apply_op = tf.cond(is_training,
                             lambda: ema.apply([batch_mean, batch_var]),
                             lambda: tf.no_op())

    # Update the moving averages, then return the current batch's
    # mean and variance.
    def mean_var_with_update():
      with tf.control_dependencies([ema_apply_op]):
        return tf.identity(batch_mean), tf.identity(batch_var)

    # ema.average() returns the Variable holding the average of var.
    mean, var = tf.cond(is_training,
                        mean_var_with_update,
                        lambda: (ema.average(batch_mean),
                                 ema.average(batch_var)))
    normed = tf.nn.batch_normalization(inputs, mean, var,
                                       beta, gamma, 1e-3)

  return normed

For safety, reuse should be False on the first call and True on all subsequent calls, as done in
https://wizardforcel.gitbooks.io/tensorflow-examples-aymericdamien/6.2_multigpu_cnn.html
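
For illustration, a hypothetical two-tower call pattern under that convention (inputs is assumed to be a [batch, channels] tensor already defined elsewhere):

# reuse=False on the first call creates beta/gamma; reuse=True on
# later calls shares them across towers.
is_training = tf.placeholder(tf.bool, shape=[])
for gpu_idx in range(2):
    with tf.device('/gpu:%d' % gpu_idx), tf.name_scope('tower_%d' % gpu_idx):
        normed = batch_norm_template(inputs, is_training, 'bn',
                                     moments_dims=[0], bn_decay=0.9,
                                     reuse=(gpu_idx > 0))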

Hope this helps, especially those updating legacy TensorFlow code.
