I'm going through and running your training setup in a Google Colab notebook, but when I get to this section it keeps crashing. Sometimes this line causes the crash:

sess.run(tf.global_variables_initializer())
This is the error message:
ResourceExhaustedError: OOM when allocating tensor with shape[512,769954] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node trainer/worker_0/trainer_mod/trainer/worker_0/mod/logits/W/Adam/Assign (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
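For reference, a rough back-of-the-envelope estimate of that allocation, assuming the tensor is float32 and that the Adam/Assign node in the trace means a standard Adam optimizer (which keeps two extra slot variables per parameter):

# Rough size of the [512, 769954] float32 tensor from the OOM message,
# plus the extra copies a standard Adam setup typically needs.
rows, cols = 512, 769954
bytes_per_float32 = 4

one_tensor = rows * cols * bytes_per_float32
print(f"one tensor:      {one_tensor / 2**30:.2f} GiB")      # ~1.47 GiB

# weights + gradient + Adam m slot + Adam v slot = roughly 4 copies
print(f"with Adam state: {4 * one_tensor / 2**30:.2f} GiB")  # ~5.9 GiB

So the weights, gradient, and Adam slots for this one logits layer already come to roughly 6 GiB before any activations, which is a large share of a single Colab GPU.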
I have tried restarting the runtime many times without any change, and I also tried setting the GPU count to 1, but it still runs out of memory.
Do you have any ideas how I can work around this?
Cheers, Fred
Hello! We originally trained the model on 8 GPUs, each with far more memory than a Colab GPU. Consider using just the FixedOrderTrainer without going multi-GPU (Colab only allows one GPU per notebook). Does that fix the issue?
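This is not specific to this repo, just a generic TF1 sketch for double-checking what Colab gives you: it lists the visible devices (there should be exactly one GPU) and turns on on-demand GPU memory allocation, which won't make an oversized model fit but makes it clearer where the memory actually goes.

import tensorflow as tf
from tensorflow.python.client import device_lib

# Confirm what Colab exposes; there should be exactly one GPU device.
print([d.name for d in device_lib.list_local_devices()])

# TF1-style session config: grow the GPU allocation on demand instead of
# grabbing all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# ... build the single-GPU model here ...

sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

If initialization still OOMs with a single GPU, the logits matrix plus its Adam state may simply be too large to fit alongside the rest of the graph on one Colab card.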
Thank you, I'll try that too.
I managed to get it to start training, but it was very slow going on Colab, so I gave up. It would have taken a few weeks, I think.
I'm just going to work through some basic tutorials I have found and go from there, to get a better basic understanding of this stuff.