Why does memory keep increasing during training? #14
Comments
Likewise, trying to find a fix now. My process keeps getting killed, even when running with 25 GB of RAM on Google Colab.
@hugh920 could you let me know if you find a fix in the meantime? 👍
@dgbarclay When you train, does your memory increase with each epoch? How many epochs have you reached so far?
@hugh920 Mine is being killed whilst parsing the data; it doesn't reach the beginning of training. The failure seems to fall within the block at line 203 of util.py. Are you able to begin training? Have you modified the code?
@dgbarclay Mine trains without modifying the code. However, due to the increasing memory usage, it failed in the second epoch. I reduced batch_size and made the model structure a little simpler so that it could continue to run. I noticed that the memory increased during the first two epochs and stabilized after the third. I don't understand why.
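A common cause of memory growing with each epoch in PyTorch training loops is accumulating tensors that are still attached to the computation graph, e.g. summing the raw loss instead of `loss.item()`. A minimal sketch of that pattern and its fix (the loop structure here is hypothetical, not the repository's actual training code):

```python
import torch

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    running_loss = 0.0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        # Accumulate a plain Python float, not the loss tensor itself.
        # `running_loss += loss` keeps every iteration's graph alive
        # and makes memory grow steadily during training.
        running_loss += loss.item()
    return running_loss / len(loader)
```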
@hugh920 Okay, I have not yet made it that far. I was running out of memory while forming the DataLoader, so I'm having to refactor a little. Are you able to push your version so I can compare the two? It would help me out loads, cheers.
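For the data-loading OOM, one possible workaround (a sketch under my own assumptions, not the repository's util.py) is to stop materializing every decoded image up front and instead load lazily inside `__getitem__`, keeping only paths and labels in RAM:

```python
from PIL import Image
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    """Hypothetical lazy dataset: hold only file paths and labels in memory,
    decode each image on demand instead of preloading the whole set."""

    def __init__(self, image_paths, labels, transform=None):
        self.image_paths = image_paths  # list of image file paths
        self.labels = labels            # e.g. an array of multi-hot label vectors
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, self.labels[idx]
```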
@hugh920 Are you able to run eval_nus_wide.sh without failure? I ultimately just need to run this model to take image queries and produce predictions; are you able to get the model into that state?
@dgbarclay I took the ALF out and just used FLF, which didn't work well. It may not be what you need.
@dgbarclay I also had a problem today with processes being killed while loading data on other projects. I have observed that the GPU is not utilized while data is being loaded; the DataLoader runs on the CPU, so the failure is probably because the CPU cannot keep up. It has nothing to do with memory size or the GPU.
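If CPU-side loading is the bottleneck, the usual knobs are `num_workers` and `pin_memory` on the DataLoader. A sketch assuming a `dataset` object already exists (the values are guesses to tune per machine):

```python
from torch.utils.data import DataLoader

# Hypothetical settings: more worker processes parallelize CPU decoding,
# and pin_memory speeds up host-to-GPU copies. Note that too many workers
# can itself exhaust RAM, since each worker holds its own copy of the
# dataset object.
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)
```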
Is the issue solved?
Dear author, thanks for your code. But when I reproduced it, I found that memory kept increasing, which eventually caused training to fail by running out of memory. What could be the reason?
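One way to narrow down where the growth happens is to log the process's resident memory once per epoch. A minimal diagnostic sketch using psutil (just one option; any process-memory tool works, and the epoch loop shown in the comment is hypothetical):

```python
import os
import psutil

def log_memory(tag=""):
    """Print the resident memory (RSS) of the current process in MB."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"[{tag}] RSS: {rss / 1024 ** 2:.1f} MB")

# Example usage: call once per epoch to see whether RAM grows without bound.
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     log_memory(f"epoch {epoch}")
```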