
Version compatibility of pytorch-lightning #282

Open
kaelsunkiller opened this issue Oct 4, 2023 · 2 comments


@kaelsunkiller commented Oct 4, 2023
May I ask which version of pytorch-lightning (pl) you used when developing this codebase?

I tried the newest 2.0 but hit lots of bugs: deprecated params and functions, etc. So I downgraded to 1.5, with the compatible torch 1.8.0 and torchmetrics, but training still gets stuck at step 1770/1850 of epoch 0, which is very confusing.

I suspect it may have reached the validation step, because pl issued the warning below:

/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:56: UserWarning: Trying to infer the 'batch_size' from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use 'self.log(..., batch_size=batch_size)'.

The logged batch size changed to 1, and this warning is new in pl 1.5. I don't know whether it causes any error in the computation.
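For reference, a minimal sketch of what the warning asks for: passing batch_size explicitly to self.log so Lightning does not have to infer it. The module name, layer sizes, and batch layout below are made up purely for illustration; the real model lives in the codebase.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitCaptioner(pl.LightningModule):  # hypothetical module name, for illustration only
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(2048, 9487)  # placeholder head; dimensions are made up

    def validation_step(self, batch, batch_idx):
        feats, labels = batch["feats"], batch["labels"]  # assumed batch layout
        loss = F.cross_entropy(self.net(feats), labels)
        # Pass batch_size explicitly so Lightning does not try to infer it from an
        # ambiguous collection (the cause of the UserWarning quoted above).
        self.log("val_loss", loss, batch_size=feats.size(0), sync_dist=True)
        return loss
```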

Back to the hang: I waited for more than 30 minutes, which is much longer than the ETA for training one epoch. It was still stuck, with no errors or warnings. Desperate...

There are too many uncertainties with pl training, so I have to ask which version works with this codebase. Thanks a lot!

@ruotianluo (Owner)

I should have something working for 2.0. Let me push.

@kaelsunkiller (Author) commented Oct 5, 2023

> I should have something working for 2.0. Let me push.

That would be great!

BTW, I found the problem. It's probably caused by the changing size of the last batch when using multiple GPUs in pl. My batch size is set to 64 with 8 GPUs, so the last batch fed to the GPUs has only 7 samples (which is incompatible with 8 GPUs). I then added drop_last=True to the train dataloader (see the sketch below) but still got stuck at the last step of validation (the number of validation images is 5000, so the last batch size should be 8, which is compatible with 8 GPUs, each GPU getting batch size 1). So I think my environment may have an issue with batch size 1, or just with the in-epoch change of batch size.
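For concreteness, a minimal sketch of the workaround described above. The datasets and tensor shapes are placeholders chosen only to reproduce the batch arithmetic (a trailing train batch of 7 and a trailing validation batch of 8); the real dataloaders live in the codebase.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in datasets; only the sizes matter here.
# 4039 % 64 == 7, mirroring the 7-sample trailing train batch described above;
# 5000 % 64 == 8, mirroring the last validation batch (1 sample per GPU on 8 GPUs).
train_set = TensorDataset(torch.randn(4039, 2048), torch.zeros(4039, dtype=torch.long))
val_set = TensorDataset(torch.randn(5000, 2048), torch.zeros(5000, dtype=torch.long))

# drop_last=True discards the trailing 7-sample batch so every training step
# sees a full batch of 64 split across the 8 GPUs.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True,
                          drop_last=True, num_workers=4)

# Validation keeps all 5000 images; its final batch of 8 is split 1 per GPU,
# which is where the hang was still observed.
val_loader = DataLoader(val_set, batch_size=64, shuffle=False, num_workers=4)
```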
