
Running out of memory in colab #2

Open
Fredrum opened this issue Dec 15, 2019 · 2 comments

Fredrum commented Dec 15, 2019

I'm going through and running your training setup in a Google Colab notebook, but it keeps crashing when I get to this section:

trainer = MultiGPUTrainer('trainer', make_model,
                          TrainerClass=FixedOrderTrainer, sampler_opts=dict(samples_per_line=1),
                          optimizer_opts=dict(base_lr=1.4e-3, warmup_time=16000))

Sometimes this line causes the crash instead:
sess.run(tf.global_variables_initializer())

This is the error message:

ResourceExhaustedError: OOM when allocating tensor with shape[512,769954] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node trainer/worker_0/trainer_mod/trainer/worker_0/mod/logits/W/Adam/Assign (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

I have tried restarting the runtime many times with no change, and I also tried setting the GPU count to 1, but it still runs out of memory.

Do you have any ideas how I can work around this?

Cheers, Fred
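A back-of-envelope check makes the OOM unsurprising. The shape [512,769954] comes straight from the error trace; the extra-copy multipliers below assume standard TensorFlow Adam behaviour (two slot variables per parameter plus a gradient buffer), which is not specific to this repo:

```python
# Memory cost of the logits weight matrix named in the OOM trace.
elems = 512 * 769954                # shape [512, 769954] from the error
one_copy_gib = elems * 4 / 2**30    # float32 = 4 bytes per element

# Adam keeps two slot variables (m and v) per parameter, and backprop
# needs a gradient buffer, so roughly 4 copies are live at once.
working_set_gib = one_copy_gib * 4

print(f"one copy:          {one_copy_gib:.2f} GiB")    # -> 1.47 GiB
print(f"with Adam + grads: {working_set_gib:.2f} GiB")  # -> 5.87 GiB
```

Nearly 6 GiB for this one layer, before activations or any other variables, is a tight fit on a single Colab GPU.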

@TIXFeniks (Owner)

Hello! We originally trained the model on 8 GPUs, each with far more memory than is available in Colab. Consider using FixedOrderTrainer directly, without the multi-GPU wrapper (Colab only provides 1 GPU per notebook). Does that fix the issue?
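For reference, the suggestion would look something like the sketch below. This mirrors the MultiGPUTrainer call from the issue; the exact FixedOrderTrainer constructor signature is an assumption and should be checked against the repo:

```python
# Hypothetical sketch — verify FixedOrderTrainer's actual arguments in this repo.
# Drops the MultiGPUTrainer wrapper so only the single Colab GPU is used.
trainer = FixedOrderTrainer('trainer', make_model,
                            sampler_opts=dict(samples_per_line=1),
                            optimizer_opts=dict(base_lr=1.4e-3,
                                                warmup_time=16000))
```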


Fredrum commented Dec 21, 2019

Thank you, I'll try that too.
I managed to get it to start training, but it was going very slowly on Colab, so I gave up; it would have taken a few weeks, I think.
I'm going to work through some basic tutorials I've found and go from there, to build a better foundational understanding of this stuff.
