Training ViG on a custom dataset #250

Open
kingkaione opened this issue Apr 11, 2024 · 3 comments

Comments

@kingkaione

Using native Torch AMP. Training in mixed precision.
model flops: 16839108314 input_size: [1, 3, 224, 224]
Model pvig_b_224_gelu created, param count: 95213258
Data processing configuration for current model + dataset:
    input_size: (3, 224, 224)
    interpolation: bicubic
    mean: (0.5, 0.5, 0.5)
    std: (0.5, 0.5, 0.5)
    crop_pct: 0.95
Using native Torch DistributedDataParallel.
model flops: 16839108314 input_size: [1, 3, 224, 224]
Model pvig_b_224_gelu created, param count: 95213258
Data processing configuration for current model + dataset:
    input_size: (3, 224, 224)
    interpolation: bicubic
    mean: (0.5, 0.5, 0.5)
    std: (0.5, 0.5, 0.5)
    crop_pct: 0.95

Hi, the run gets to this point and then just sits there without making progress. Is that normal?
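When training on a custom dataset, a quick sanity check of the data folder before launching can rule out an empty or misplaced dataset. The sketch below assumes (it is not confirmed in this thread) that the ViG train script follows timm's ImageFolder convention of `<root>/train/<class>/*` and `<root>/val/<class>/*`; the function name `count_images` and the extension list are hypothetical choices for illustration.

```python
import os
from collections import Counter

# Hypothetical set of image extensions to count; adjust to your data.
IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def count_images(root: str, split: str) -> Counter:
    """Count image files per class folder under <root>/<split>/<class>/.

    Assumes an ImageFolder-style layout (one subfolder per class).
    Classes with zero images still appear in the result with a count of 0.
    """
    split_dir = os.path.join(root, split)
    counts = Counter()
    for cls in sorted(os.listdir(split_dir)):
        cls_dir = os.path.join(split_dir, cls)
        if not os.path.isdir(cls_dir):
            continue  # skip stray files at the class level
        counts[cls] += 0  # register the class even if it turns out empty
        for _dirpath, _dirnames, files in os.walk(cls_dir):
            counts[cls] += sum(
                os.path.splitext(name)[1].lower() in IMG_EXTS for name in files
            )
    return counts
```

For example, `count_images("/root/data_size0", "train")` would report how many classes and images the loader can be expected to find, under the layout assumption above.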

@iamhankai
Member

Try installing apex; it should be faster.
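The log line "Using native Torch AMP" shows the script fell back to PyTorch's built-in mixed precision. Assuming the ViG train script behaves like the timm training script it appears to be based on (an assumption, not confirmed here), `--amp` prefers NVIDIA apex when it can be imported and otherwise uses native AMP (`torch.cuda.amp`, available since PyTorch 1.6). A minimal, hypothetical check of which backend is importable in the current environment:

```python
import importlib.util


def amp_backend() -> str:
    """Report which mixed-precision backend appears to be importable.

    Note: this only checks that a top-level package named `apex` exists;
    an unrelated package with the same name would also match.
    """
    if importlib.util.find_spec("apex") is not None:
        return "apex"  # NVIDIA apex (apex.amp), installed separately from PyTorch
    if importlib.util.find_spec("torch") is not None:
        return "native"  # torch.cuda.amp autocast + GradScaler
    return "none"


print(amp_backend())
```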

@kingkaione
Author

> Try installing apex; it should be faster.

Hi, I am training pvig on my dataset with 4× V100-SXM2-32GB GPUs.
The command is:
python -m torch.distributed.launch --nproc_per_node=4 /root/vig_pytorch/train.py /root/data_size0/ --model pvig_b_224_gelu --sched cosine --epochs 100 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .8 --cutmix 1.0 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --color-jitter 0.4 --warmup-epochs 20 --opt-eps 1e-8 --repeated-aug --remode pixel --reprob 0.25 --amp --lr 2e-3 --weight-decay .05 --drop 0 --drop-path .1 -b 128 --output /root/model
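With `torch.distributed.launch`, each of the 4 processes builds its own data loader. Assuming `-b` is the per-process batch size as in timm (an assumption about this train script), the command above trains with a global batch of 128 × 4 = 512 and keeps 128 images' worth of activations on each GPU. If `-b` is lowered to save memory, the linear scaling rule (a common heuristic, not something stated in this thread) suggests scaling `--lr` by the same factor. The helper names below are hypothetical:

```python
def global_batch(per_gpu_batch: int, nproc_per_node: int, nnodes: int = 1) -> int:
    """Samples per optimizer step across all DDP processes."""
    return per_gpu_batch * nproc_per_node * nnodes


def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling heuristic: keep lr / global_batch constant."""
    return base_lr * new_batch / base_batch


gb = global_batch(128, 4)
print(gb)  # 512
# Halving -b to 64 on the same 4 GPUs would suggest roughly half the lr.
print(linearly_scaled_lr(2e-3, gb, global_batch(64, 4)))
```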

Then I got the following error. Even after reducing the batch size to 1, it still fails, haha:
RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 2; 31.75 GiB total capacity; 29.98 GiB already allocated; 87.94 MiB free; 30.53 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/root/miniconda3/envs/vig/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/vig/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/vig/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/root/miniconda3/envs/vig/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
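A back-of-the-envelope estimate using the parameter count from the log (95,213,258) helps show where the memory is not going. The sketch assumes fp32 weights and gradients (native AMP keeps fp32 master weights), AdamW with two fp32 moment buffers, and one fp32 EMA copy of the model on the same GPU (an assumption about how `--model-ema` is placed). It ignores activations, the CUDA context, and allocator overhead. The function name is hypothetical:

```python
def param_state_bytes(
    n_params: int,
    bytes_per_value: int = 4,  # fp32
    optimizer_slots: int = 2,  # AdamW: exp_avg and exp_avg_sq
    ema_copies: int = 1,       # --model-ema keeps one extra copy of the weights
) -> int:
    """Bytes held by weights, gradients, optimizer state, and EMA weights."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    optimizer = n_params * bytes_per_value * optimizer_slots
    ema = n_params * bytes_per_value * ema_copies
    return weights + grads + optimizer + ema


total = param_state_bytes(95_213_258)
print(total)                       # 1904265160
print(f"{total / 2**30:.2f} GiB")  # 1.77 GiB
```

Under these assumptions the parameter-side state is under 2 GiB per GPU, which suggests most of the 29.98 GiB reported as allocated is taken up by something else, such as activations saved for the backward pass.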

@kingkaione
Author

> Try installing apex; it should be faster.

Switching to pvig_s works. I guess pvig_b just has higher hardware requirements. Thanks a lot!
