Variable model_1/loss/ExponentialMovingAverage/ does not exist #54
Comments
I found that this problem only occurs in multi-GPU training. The same code runs fine when --num_gpus is not set greater than 1. |
@demiguo I have the same issue; have you solved the problem? |
@demiguo I have the same issue. Because of this I can't train on a multi-GPU setup. Has anyone solved the problem? |
I have the same problem |
Anyone found anything? |
The same issue occurs on the version patched to run on TF r1.7. Traceback (most recent call last): |
Hi David, my name is Tian. I'm moving our discussion from email to this issue ticket so that other developers who hit this error can see it.
The issue with the original BiDAF implementation is that it only creates one loss variable for the model. On a single GPU this is fine, because you only have one model. In a multi-GPU setting, BiDAF implements multi-GPU training by replicating the model for every GPU device and assigning one model to each device. This means that each model requires its own loss variable. If the developer only specifies one loss variable, TensorFlow will try to reuse that loss variable for every model, which creates a conflict.
For example, in your error, if you only have one device, the name of the loss variable would be model_0/loss/ExponentialMovingAverage. If you have two devices, another loss variable called model_1/loss/ExponentialMovingAverage is referenced by TensorFlow. Since this variable is not created before you generate the whole model, TensorFlow tries to reuse the variable previously generated for model_0. Does that make sense?
The solution to this conflict is to create a loss variable for every replicated model: https://github.com/stanford-futuredata/dawn-bench-models/blob/master/tensorflow/SQuAD/basic/model.py#L25:L36
Unfortunately I don't have a multi-GPU node available. Would you mind trying this patch on your node and seeing if it works? |
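For illustration, a minimal sketch of the idea behind that patch, with assumed names and structure (this is not the actual dawn-bench-models code): each replica gets its own loss variable and its own EMA shadow variable under its own model_<i> scope.

```python
import tensorflow as tf  # TF 1.x graph-mode API

def build_towers(num_gpus, decay=0.999):
    """Sketch only: one loss variable and one EMA per replicated model,
    each created under its own model_<i>/loss scope, so nothing forces
    TensorFlow to reuse model_0's variables for model_1, model_2, ..."""
    ema_ops = []
    for gpu_idx in range(num_gpus):
        with tf.device("/gpu:{}".format(gpu_idx)):
            with tf.variable_scope("model_{}".format(gpu_idx)):
                with tf.variable_scope("loss"):
                    # Per-tower loss variable: model_<i>/loss/loss
                    loss = tf.get_variable(
                        "loss", shape=[], dtype=tf.float32,
                        initializer=tf.zeros_initializer(),
                        trainable=False)
                    # Per-tower shadow variable:
                    # model_<i>/loss/loss/ExponentialMovingAverage
                    ema = tf.train.ExponentialMovingAverage(decay)
                    ema_ops.append(ema.apply([loss]))
    return ema_ops
```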
I will modify the Andreas Klintberg fork of this, as that is the code base
that works on top-of-tree TF; nothing else does, due to the change in the
handling of flags:
https://www.linkedin.com/in/andreas-klintberg-b7655710/
but I have to wait until the 4-GPU machine I set this up on gets freed up...
I don't want to build everything again LOL
d
|
new distro is https://github.com/klintan/bi-att-flow/tree/dev
|
I took model.py from the DAWN distribution and added it to my modified (for
printout and speed logging) version of the Klintberg distro.
It appears not to actually pay attention to the num_gpus flag and started 4
processes on the 4 V100s.
used CUDA_VISIBLE_DEVICES=0 and num_gpus=1
default batch size
global_step: 100, avg_loss = 8.443472, time = 336.919500
global_step: 200, avg_loss = 7.691555, time = 333.118028
global_step: 300, avg_loss = 7.293336, time = 332.794547
global_step: 400, avg_loss = 6.585279, time = 330.941432
nvidia-smi showed 1 process
unsetting CUDA_VISIBLE_DEVICES and rerunning
one sees 4 processes in nvidia-smi, but only 1 GPU being active :-)
global_step: 100, avg_loss = 8.414022, time = 353.267652
global_step: 200, avg_loss = 7.680358, time = 343.982275
global_step: 300, avg_loss = 7.316520, time = 346.749630
global_step: 400, avg_loss = 6.531937, time = 344.386899
set CUDA_VISIBLE_DEVICES=0,1,2,3 and num_gpus=4
global_step: 100, avg_loss = 8.122040, time = 669.353025
global_step: 200, avg_loss = 6.920084, time = 651.585218
global_step: 300, avg_loss = 5.956660, time = 648.792006
global_step: 400, avg_loss = 5.126814, time = 650.647822
global_step: 500, avg_loss = 4.219713, time = 648.755034
so I am a bit unsure whether things are actually going faster when fanned out
lowering the batch size to 15 while running on 4 GPUs does not change the
output much
global_step: 100, avg_loss = 8.417717, time = 518.246054
global_step: 200, avg_loss = 7.639581, time = 492.312914
global_step: 300, avg_loss = 7.274171, time = 493.591461
global_step: 400, avg_loss = 6.699906, time = 506.484376
|
Hi @demiguo. On TensorFlow 1.12.0, I had the same problem and fixed it by adding the line
with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
before ema.apply. |
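For context, a minimal sketch of wrapping the ema.apply call this way (the function name, arguments, and decay value here are illustrative assumptions, not the repository's exact basic/model.py code):

```python
import tensorflow as tf  # TF 1.x graph-mode API

def build_ema_op(tensors, decay=0.999):
    """Sketch only: apply an ExponentialMovingAverage under AUTO_REUSE so
    its shadow variables (e.g. model_1/loss/ExponentialMovingAverage) can
    be created on every GPU tower instead of triggering a failed reuse."""
    ema = tf.train.ExponentialMovingAverage(decay)
    # Reopen the current variable scope with AUTO_REUSE: the first tower
    # creates the shadow variables, later towers create or reuse their own.
    with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
        ema_op = ema.apply(tensors)
    return ema, ema_op
```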
First I would like to give some context to this issue. It applies only when a different WorkerX/GPUX/etcX tf.name_scope() was created over different instantiations of a model that uses tf.train.ExponentialMovingAverage (commonly used by Batch Normalization). If a "WorkerX" tf.variable_scope() had been used instead, there would be no possibility of reuse, because variables created with tf.get_variable() ignore tf.name_scope(); thus variables obtained with tf.get_variable() inside different tf.variable_scope()s can only be different variables. On the other hand, if there were no tf.name_scope() nor tf.variable_scope() over the different GPU workers, variables created with either tf.Variable or tf.get_variable would have exactly the same scope, giving both the possibility of being properly reused, but probably not producing a very pretty underlying graph, because operations would not be aggregated per worker (an aesthetic/design/maintenance issue).
But as I understand it, using with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE) before ema.apply, as @shimafoolad suggests, will drop reuse across GPUs of loss/ExponentialMovingAverage/ and of any shadow variable created with tf.Variable by ema.apply. That would be bad for the learning of Batch Normalization layers in distributed training, so it does not seem to be a good solution. Maybe there is a way in which the main variables of the BN layers are reused anyway, but I have found no explanation of such a mechanism, and maybe this issue would be resolved with such an explanation. |
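A small stand-alone illustration of the scoping rule described above (toy code, not from the repository): tf.get_variable() ignores tf.name_scope() but respects tf.variable_scope().

```python
import tensorflow as tf  # TF 1.x graph-mode API

with tf.name_scope("worker0"):
    v1 = tf.get_variable("w", shape=[1])  # name_scope ignored: v1.name == "w:0"
    op1 = v1 + 1.0                        # ops ARE prefixed, e.g. "worker0/add:0"

with tf.variable_scope("worker1"):
    v2 = tf.get_variable("w", shape=[1])  # variable_scope applies: "worker1/w:0"

print(v1.name, op1.name, v2.name)
```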
Thought about it again and realized that the trainable parameters of Batch Normalization can be defined with tf.get_variable() together with "with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE)" as @shimafoolad says, as follows:
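A sketch of what such a definition could look like, with illustrative names (not the original snippet):

```python
import tensorflow as tf  # TF 1.x graph-mode API

def bn_params(num_channels):
    """Sketch only: trainable Batch Normalization parameters created with
    tf.get_variable() inside the current variable scope reopened with
    AUTO_REUSE, so the first GPU tower creates gamma/beta and every later
    tower reuses the same variables."""
    with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
        gamma = tf.get_variable("gamma", shape=[num_channels],
                                initializer=tf.ones_initializer())
        beta = tf.get_variable("beta", shape=[num_channels],
                               initializer=tf.zeros_initializer())
    return gamma, beta
```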
For safety, reuse should be False on the first call and True on every call after that. Hope it helps, especially those updating legacy TensorFlow. |
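One common way to realize that "create on the first call, reuse afterwards" advice is the classic shared-tower loop; a generic sketch with assumed names (not this repository's code), which also works on TensorFlow versions that predate tf.AUTO_REUSE:

```python
import tensorflow as tf  # TF 1.x graph-mode API

def build_shared_towers(build_fn, num_gpus):
    """Generic sketch: open the shared variable scope fresh for each tower,
    creating variables on the first tower and reusing them afterwards, so
    all towers share one set of weights."""
    outputs = []
    for gpu_idx in range(num_gpus):
        with tf.device("/gpu:{}".format(gpu_idx)):
            # reuse=None (create) on the first tower, reuse=True afterwards.
            with tf.variable_scope("shared_model",
                                   reuse=True if gpu_idx > 0 else None):
                outputs.append(build_fn())
    return outputs
```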
Hi, I'm running the dev branch code on TensorFlow 1.2.
And I got this error:
Variable model_1/loss/ExponentialMovingAverage/ does not exist or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope.
From the stack trace, it was from basic/model.py, in _build_ema, ema_op=ema.apply(tensors).
I tried to add "with tf.variable_scope(tf.get_variable_scope(), reuse=False):" before ema.apply but that still doesn't work.
Any ideas how can I fix this?
Thanks!
I'm using CUDA 8.0 and cuDNN 5.1, TensorFlow v1.2, Python 3.5.