
Asking for the cuda OOM questions #35

Open
gonghouyu opened this issue Jun 19, 2018 · 3 comments

Comments

@gonghouyu

I ran the code on a Chinese NER training dataset (around 70 thousand sentences), with LM-LSTM-CRF set to co-train mode, and I got an OOM error:

When I set the batch_size to 10, it results in:

```
Tot it 6916 (epoch 0): 6308it [26:09, 4.02it/s]THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train_wc.py", line 243, in <module>
    loss.backward()
  File "/usr/local/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/usr/local/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
```

When I set the batch_size to 128, it results in:

```
Tot it 543 (epoch 0): 455it [03:57, 1.91it/s]THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train_wc.py", line 241, in <module>
    loss = loss + args.lambda0 * crit_lm(cbs, cf_y.view(-1))
  File "/usr/local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/torch/nn/modules/loss.py", line 601, in forward
    self.ignore_index, self.reduce)
  File "/usr/local/lib/python3.5/site-packages/torch/nn/functional.py", line 1140, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, size_average, ignore_index, reduce)
  File "/usr/local/lib/python3.5/site-packages/torch/nn/functional.py", line 786, in log_softmax
    return torch._C._nn.log_softmax(input, dim)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
```

Could anyone give me some advice on how to solve it?

@LiyuanLucasLiu
Owner

Hi, what type of GPU are you using, and how large is its memory?

For Chinese, even character-level language modeling results in a large dictionary (and correspondingly large GPU memory consumption). One way to alleviate this problem is to filter out low-frequency words as unknown tokens.
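A minimal sketch of that filtering idea (function and variable names here are illustrative, not taken from the LM-LSTM-CRF code): count character frequencies over the training corpus and map anything below a threshold to `<unk>` before building the language-model dictionary, so the embedding and softmax layers shrink accordingly.

```python
from collections import Counter

def build_vocab(sentences, min_count=5, unk="<unk>"):
    """Keep only characters seen at least min_count times; the rest share <unk>."""
    freq = Counter(ch for sent in sentences for ch in sent)
    vocab = {unk: 0}
    for ch, count in freq.items():
        if count >= min_count:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(sentence, vocab, unk="<unk>"):
    """Map a sentence to indices, sending unseen/rare characters to <unk>."""
    return [vocab.get(ch, vocab[unk]) for ch in sentence]
```

A smaller vocabulary directly shrinks the `log_softmax` input in the traceback above, which is where the memory goes.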

@gonghouyu gonghouyu reopened this Jun 20, 2018
@gonghouyu
Author

The GPU is a Tesla K40c; we have 4 of them, and each has 10 GB of memory.
Using only one GPU and setting the PyTorch code to multi-GPU both give the same OOM error.
Setting mini_count to 5 or even 10 also doesn't help.
But if I don't use co_train, it works well~
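For a sense of scale (the sequence length and vocabulary size below are illustrative assumptions, not measured from this setup): the co-train language-model loss materializes a logits tensor of shape batch × seq_len × vocab, which is exactly what the `log_softmax` call in the traceback allocates.

```python
def logits_bytes(batch, seq_len, vocab, bytes_per_float=4):
    """Rough size of the LM softmax input for one forward pass."""
    return batch * seq_len * vocab * bytes_per_float

# e.g. batch 128, sentences of ~50 characters, a 10k-character vocabulary:
gb = logits_bytes(128, 50, 10_000) / 1024**3   # ~0.24 GB for one tensor
```

The backward pass keeps activations for every layer, so the true peak is several multiples of this; without co_train the per-token softmax over the full vocabulary disappears, which matches the observation that training then fits.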

@LiyuanLucasLiu
Owner

Yes, language modeling for Chinese is a little tricky. I think some model modification is necessary to make it work.
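One such modification (a suggestion of mine, not something the repo implements) is to replace the full-vocabulary softmax of the language-model head with PyTorch's adaptive softmax, which groups rare tokens into clusters and avoids materializing logits over the whole vocabulary; it requires PyTorch ≥ 0.4.1. The sizes and cutoffs below are illustrative.

```python
import torch
import torch.nn as nn

hidden, vocab = 300, 10_000            # illustrative hidden size and vocabulary
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden,
    n_classes=vocab,
    cutoffs=[100, 1000],               # frequent tokens get the full-size head
)

h = torch.randn(32, hidden)            # LSTM outputs, flattened over time
targets = torch.randint(0, vocab, (32,))
out = adaptive(h, targets)             # out.loss would stand in for crit_lm(cbs, cf_y.view(-1))
```

Ordering the vocabulary by descending frequency is required for the cutoffs to make sense, which pairs naturally with the frequency filtering suggested earlier in the thread.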
