I ran the code on Chinese NER training data (around 70 thousand sentences, with LM-LSTM-CRF set to the co-train model), and I got an OOM error:
When I set the batch_size to 10, it results in:
Tot it 6916 (epoch 0): 6308it [26:09, 4.02it/s]THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "train_wc.py", line 243, in
loss.backward()
File "/usr/local/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/usr/local/lib/python3.5/site-packages/torch/autograd/init.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
When I set the batch_size to 128, it results in:
Tot it 543 (epoch 0): 455it [03:57, 1.91it/s]THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "train_wc.py", line 241, in
loss = loss + args.lambda0 * crit_lm(cbs, cf_y.view(-1))
File "/usr/local/lib/python3.5/site-packages/torch/nn/modules/module.py", line325, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/site-packages/torch/nn/modules/loss.py", line 601, in forward
self.ignore_index, self.reduce)
File "/usr/local/lib/python3.5/site-packages/torch/nn/functional.py", line 1140, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, size_average, ignore_index, reduce)
File "/usr/local/lib/python3.5/site-packages/torch/nn/functional.py", line 786, in log_softmax
return torch._C._nn.log_softmax(input, dim)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
Could anyone give me some advice on how to solve it?
Hi, what type of GPU are you using, and how large is its memory?
For Chinese, even character-level language modeling results in a large dictionary (and therefore large GPU memory consumption). One way to alleviate this problem is to filter out low-frequency words as unknown tokens.
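For reference, here is a minimal, self-contained sketch of that kind of frequency filtering. It is not code from this repository; the build_vocab/map_to_ids names, the min_count parameter, and the <unk> token are illustrative, and in this repo the mini_count option should serve the same purpose.

```python
from collections import Counter

def build_vocab(sentences, min_count=5, unk_token='<unk>'):
    """sentences: list of token lists (characters or words) from the training data."""
    # Count how often each token appears in the corpus.
    freq = Counter(tok for sent in sentences for tok in sent)
    # Keep only tokens seen at least min_count times; everything else maps to <unk>.
    vocab = {unk_token: 0}
    for tok, count in freq.items():
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def map_to_ids(sentences, vocab, unk_token='<unk>'):
    unk_id = vocab[unk_token]
    return [[vocab.get(tok, unk_id) for tok in sent] for sent in sentences]

# Example usage: a smaller vocabulary shrinks the output softmax of the
# co-trained language model, which is usually the main memory consumer
# for Chinese text.
sentences = [list("北京欢迎你"), list("北京天安门")]
vocab = build_vocab(sentences, min_count=2)
ids = map_to_ids(sentences, vocab)
```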
The GPU is a Tesla K40c; we have 4 of them, and each has 10 GB of memory.
Using only one GPU and setting it to multi-GPU in the PyTorch code both give the same OOM error.
Setting mini_count to 5 or even 10 also doesn't work.
But if I do not use co_train, it works fine.