Problem with sparse activations #17
Are you using the most recent version of the code? We changed the name from tsallis15 to entmax15 not long after we released it. I would recommend updating and seeing if the bug goes away.
I used both the most recent version of the entmax package and the function used in openNMT-entmax. Neither works, and I don't know why.
Do you get a different error message if you run the code on the CPU? Are you using entmax for attention or the loss function?
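(Editorial aside, not part of the original exchange: because CUDA kernels launch asynchronously, a device-side assert usually surfaces at an unrelated line, which is why rerunning on the CPU, as suggested above, tends to give a clearer message. A generic PyTorch sketch of both options follows; `model` and `batch_on_cpu` are placeholder names.)

```python
# Generic PyTorch debugging sketch; `model` and `batch_on_cpu` are hypothetical names.
import os

# Force synchronous kernel launches so the assert points at the real call site.
# Must be set before any CUDA work happens in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Alternatively, move the failing step to the CPU to get a plain Python traceback:
# model = model.cpu()
# output = model(batch_on_cpu)
```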
Thank you very much for your reply.
Could you post more details about the error?
The error occurs in the decoder self-attention and context attention, and it is triggered by the beam search algorithm. I use this transformer code, you can try it. When using beam search, the attention input becomes [nan, nan, nan] at the first decode step; the softmax function can ignore it, but the sparse functions cannot. Do you have any suggestion to fix this problem?
Unfortunately I'm not familiar with that transformer implementation. Can you find out what the inputs and outputs of entmax are when you get nans? I'd guess that target masking (different between training and beam search time) is producing tensors that entmax is struggling with. But I can't fix the problem until I know what these tensors look like. |
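(Illustration added for context, not taken from the linked repository: one common way an all-nan attention row can appear at the first decode step is a target mask that blanks out every position in a row. With a standard -inf mask, softmax then returns nan for that whole row, and those nans propagate into later layers; the names and shapes below are purely illustrative.)

```python
# Hypothetical masking sketch (names and shapes are illustrative, not from the
# repository discussed above). A row whose positions are all masked out with
# -inf produces an all-nan row after softmax, which then propagates downstream.
import torch

scores = torch.randn(1, 3)                      # attention scores for one query
mask = torch.tensor([[True, True, True]])       # every position masked out
masked_scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(masked_scores, dim=-1)  # tensor([[nan, nan, nan]])
print(weights)
```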
The input is like:
I reproduced your error message on my machine. Softmax can handle an input of all nans, entmax currently cannot. However, I'm skeptical of the circumstances that caused this situation to arise. Does the code you're using intentionally create tensors with nans in them for masking? I find that surprising. nan should show up basically only if the code is doing something wrong -- introducing it on purpose makes debugging harder. Neither of the transformer implementations I'm familiar with (OpenNMT, joeynmt) use nans for masking like this (for what it's worth, I looked through the repo you linked and can't find nans there either -- are you sure they're there on purpose?). So while I agree that entmax crashes on nans, I'm not convinced this is a bad thing -- if something is broken, it's better to crash than fail silently.
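(A minimal reproduction sketch of the behaviour described above, assuming the `entmax` package is installed; this is not code from the issue itself, and the exact error raised by entmax15 may differ between CPU and GPU.)

```python
# Minimal sketch: softmax silently returns nan for an all-nan row, while
# entmax15 (from the `entmax` package) fails on the same input, as discussed
# above. Assumes `pip install entmax`.
import torch
from entmax import entmax15

scores = torch.full((1, 3), float("nan"))   # e.g. a fully degenerate attention row

print(torch.softmax(scores, dim=-1))        # tensor([[nan, nan, nan]]) -- no error
entmax15(scores, dim=-1)                    # fails here instead of returning nans
```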
I just replaced the softmax function with the sparsemax or tsallis15 function in my transformer model. It works well during training, but the following error occurs during the testing phase:
RuntimeError: CUDA error: device-side assert triggered
If I replace it with the softmax function again, it works.
What could be the cause?
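(For context, a minimal sketch of the kind of substitution described above, using the `entmax` package inside a scaled dot-product attention; this is illustrative, not the poster's actual model code.)

```python
# Illustrative scaled dot-product attention where the softmax over attention
# scores is swapped for sparsemax or entmax15; not the poster's actual code.
import torch
from entmax import sparsemax, entmax15  # pip install entmax

def attention(q, k, v, normalizer=torch.softmax):
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    weights = normalizer(scores, dim=-1)    # swap in sparsemax / entmax15 here
    return weights @ v

q = k = v = torch.randn(2, 4, 8)                         # (batch, length, dim)
out_soft = attention(q, k, v)                            # dense weights
out_sparse = attention(q, k, v, normalizer=sparsemax)    # sparse weights
out_ent15 = attention(q, k, v, normalizer=entmax15)      # 1.5-entmax weights
```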