May I ask which version of PyTorch Lightning (pl) you used to develop this codebase?
I tried the newest 2.0 but hit lots of bugs: deprecated params, removed functions, etc. So I downgraded to 1.5, with the compatible torch 1.8.0 and torchmetrics, but training still gets stuck at step 1770/1850 of epoch 0, which is very confusing.
I think it may have reached the validation step, because pl emitted the warning below:
```
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:56: UserWarning: Trying to infer the 'batch_size' from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use 'self.log(..., batch_size=batch_size)'.
```
The inferred batch size changed to 1, and this warning is new in pl 1.5. I don't know whether it causes any error in the computation.
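For reference, here is a minimal sketch of the fix the warning suggests: pass batch_size explicitly to self.log so pl does not have to infer it. The module and loss helper below are hypothetical placeholders, not this codebase's actual code.

```python
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    # model definition omitted; only the logging call matters here
    def validation_step(self, batch, batch_idx):
        images, targets = batch  # assumes (inputs, targets) batches
        loss = self.compute_loss(images, targets)  # hypothetical helper
        # Passing batch_size explicitly avoids the ambiguous-collection
        # inference that triggers the UserWarning above.
        self.log("val_loss", loss, batch_size=images.size(0))
        return loss
```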
Back to the hang: I waited more than 30 minutes, far longer than the ETA for training one epoch, and it was still stuck with no errors or warnings. Desperate...
There are too many uncertainties with pl training, so I have to ask which version works with this codebase. Thanks a lot!
I should have something working for 2.0. Let me push.
That would be great!
BTW, I found the problem. It is probably caused by the size change of the last batch when using multiple GPUs in pl. My batch size is set to 64 with 8 GPUs, so the last training batch has only 7 samples, which cannot be split across 8 GPUs. I then added drop_last=True to the train dataloader (see the sketch below), but it still got stuck at the last step of validation (there are 5000 validation images, so the last validation batch has 8 samples, which does split across 8 GPUs at 1 sample per GPU). So I think my environment may have an issue with a per-GPU batch size of 1, or just with the batch size changing mid-epoch.
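In case it helps anyone hitting the same hang, this is roughly the dataloader change I mean. The dataset attribute and worker count are placeholders; under DDP, Lightning wraps this loader with a DistributedSampler automatically, so each of the 8 GPUs then sees a per-GPU batch of 8.

```python
from torch.utils.data import DataLoader


def train_dataloader(self):
    # drop_last=True discards the final short batch (7 samples here),
    # so every training step sees a batch divisible across the 8 GPUs.
    return DataLoader(
        self.train_dataset,  # placeholder dataset attribute
        batch_size=64,
        shuffle=True,
        num_workers=4,
        drop_last=True,
    )
```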