Usage report of the efficient_conv_bn_eval feature #1252

youkaichao · 2023-07-15T14:00:38Z

Please report here how enabling the efficient_conv_bn_eval feature helps your training. We expect the feature to seamlessly reduce the memory footprint without affecting performance. It is currently optional, but if the feedback is good enough, we can turn it on by default :)

JolyonWu · 2023-08-04T04:28:37Z

I used a Mask R-CNN model as a baseline to test the effect of turning on this feature. When training first started, it did save 13% of the memory. But as training continued, I checked the memory with nvidia-smi and found that usage kept increasing from 7GB until it occupied 24GB, while the memory usage recorded in the training log remained as low as it was at the beginning.

3 replies

youkaichao Aug 4, 2023
Author

Hi, the numbers shown in nvidia-smi are known to be unreliable. Please refer to the PyTorch documentation at https://pytorch.org/docs/stable/notes/cuda.html#memory-management for details. In your case, I suppose some cached memory is showing up in nvidia-smi. The number in the training log is likely a more faithful measure of the memory footprint.

JolyonWu Aug 4, 2023

Thanks for your prompt reply, I'll check again!

youkaichao Aug 23, 2023
Author

@buxuewushu1314 For example, please check the memory field at https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_2x_coco/faster_rcnn_r101_fpn_2x_coco_20200504_210455.log.json
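
When comparing runs with and without the feature, reading the memory field from the training log can be easier than watching nvidia-smi. Below is a minimal sketch, assuming the log is a JSON-lines file in which training entries carry "mode": "train" and a "memory" value, as in the mmdetection log linked above. The helper name peak_logged_memory and the sample lines are made up for illustration and are not part of any OpenMMLab API.

```python
import json


def peak_logged_memory(lines):
    """Return the largest "memory" value among training entries, or None."""
    peak = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip the environment header and validation entries.
        if record.get("mode") != "train" or "memory" not in record:
            continue
        if peak is None or record["memory"] > peak:
            peak = record["memory"]
    return peak


# Hypothetical sample lines, shaped like a JSON-lines training log.
sample_log = [
    '{"env_info": "...", "seed": null}',
    '{"mode": "train", "epoch": 1, "iter": 50, "memory": 3901}',
    '{"mode": "train", "epoch": 1, "iter": 100, "memory": 4012}',
    '{"mode": "val", "epoch": 1, "iter": 5000, "bbox_mAP": 0.2}',
]

print(peak_logged_memory(sample_log))
```

For a real run, open the log file and pass the file object to peak_logged_memory, since iterating over a file yields its lines.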

buxuewushu1314 · 2023-08-31T07:57:14Z

I trained resnet18 on mmpretrain-v1.0.2 using '--cfg-options efficient_conv_bn_eval="[backbone]"', but there is almost no memory saving. Do I need a specific version of mmpretrain, or any other settings?

5 replies

youkaichao Aug 31, 2023
Author

Can you show your full training config? e.g. which task you are training on, and the config file you are using.

In general, pre-training uses BN in train mode, where efficient_conv_bn_eval cannot help. As the name suggests, it works when you train with BN in eval mode, which often occurs in mmdetection training.
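
Some background on why eval mode matters: with fixed running statistics, BN is a per-channel affine map, so it can be folded into the preceding conv's weight and bias. In train mode the statistics depend on the current batch, so no such fixed folding exists. The following is a minimal sketch of that algebra in plain Python, using a 1x1 conv at a single pixel (which reduces to a matrix-vector product). All names and numbers are illustrative, and this is not mmengine's implementation.

```python
import math


def conv1x1(x, weight, bias):
    # x: C_in values at one pixel; weight: C_out x C_in; bias: C_out values.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def bn_eval(y, gamma, beta, mean, var, eps=1e-5):
    # Eval-mode BN: normalize with fixed running statistics.
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]


def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    # Per-output-channel scale that BN applies to the conv output.
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    folded_w = [[c * w for w in row] for c, row in zip(scale, weight)]
    folded_b = [bt + c * (b - m)
                for c, b, m, bt in zip(scale, bias, mean, beta)]
    return folded_w, folded_b


x = [1.0, -2.0, 0.5]
weight = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
bias = [0.1, -0.3]
gamma, beta = [1.5, 0.8], [0.0, 0.2]
mean, var = [0.05, -0.1], [0.9, 1.2]

# Conv followed by eval-mode BN ...
reference = bn_eval(conv1x1(x, weight, bias), gamma, beta, mean, var)
# ... matches a single conv with folded parameters.
folded_w, folded_b = fold_bn_into_conv(weight, bias, gamma, beta, mean, var)
folded = conv1x1(x, folded_w, folded_b)

print(reference)
print(folded)
```

The two printed lists should agree up to floating-point rounding.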

buxuewushu1314 Aug 31, 2023

Thanks for your reply. Yes, it should be train mode. Does mmpretrain provide an interface for setting BN to eval mode, and how should I use it?

youkaichao Aug 31, 2023
Author

Generally we don't pretrain in eval mode. Pretraining requires train mode.

buxuewushu1314 Aug 31, 2023

Thank you again. The classification model still benefits during fine-tuning when 'norm_eval' is set. Testing shows that it can save 33.8% memory. Good job!

youkaichao Aug 31, 2023
Author

Glad it helps :)

Jano-rm-rf · 2023-11-15T03:24:29Z

> Thank you again. classification model was still effective during fine-tuning by setting 'norm_eval', testing shows that it can save 33.8% memory,good job!

Hello, I have set 'norm_eval', but I get an error: torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow. What should I do?

8 replies

Jano-rm-rf Nov 15, 2023

https://github.com/Jano-rm-rf/Test/blob/main/DINO-Test6-Config.py
command: python tools/train.py configs/dino/DINO-Test6.py --cfg-options efficient_conv_bn_eval="[backbone]"

Jano-rm-rf Nov 15, 2023

Environment: torch 2.0.1, mmdet 3.2.0, mmengine 0.9.0.
Is a version incompatibility causing the error?

hvlgo Nov 15, 2023

This is not caused by version incompatibility. The current implementation of the efficient_conv_bn_eval feature is based on symbolic tracing in torch.fx. Symbolic tracing has limited capabilities: it cannot handle expressions that are only known at runtime or that depend on the values or shapes of the input. In the future, we will use torch.compile to implement this feature, and this issue will no longer occur. To learn more about symbolic tracing, see https://pytorch.org/docs/stable/fx.html#limitations-of-symbolic-tracing or https://zhuanlan.zhihu.com/p/644590863.

hvlgo Nov 15, 2023

The specific cause here is that your config uses EfficientNetV2, whose initialization specifies the conv type as Conv2dAdaptivePadding (https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/efficientnet_v2.py#L186). The forward function of Conv2dAdaptivePadding contains a conditional on pad_h, which is computed from x.size() (https://github.com/open-mmlab/mmcv/blob/main/mmcv/cnn/bricks/conv2d_adaptive_padding.py#L58).
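
To illustrate the mechanism, here is a toy sketch in plain Python of why a shape-dependent branch cannot be traced symbolically. ToyProxy and ToyTraceError are hypothetical stand-ins written for this example; they only mimic the idea of a value that records operations without holding a concrete number, and they are not the torch.fx classes. The pad_amount function is a simplified stand-in for a padding check, not the mmcv code.

```python
class ToyTraceError(Exception):
    """Stand-in for the error raised when a traced value drives control flow."""


class ToyProxy:
    """Records operations as an expression string; holds no concrete value."""

    def __init__(self, expr):
        self.expr = expr

    def __rsub__(self, other):
        # Handles `number - proxy` by recording the subtraction.
        return ToyProxy(f"({other} - {self.expr})")

    def __gt__(self, other):
        # A comparison is recorded too; its outcome is unknown while tracing.
        return ToyProxy(f"({self.expr} > {other})")

    def __bool__(self):
        # An `if` needs a real True/False, which a recorded expression lacks.
        raise ToyTraceError(
            f"cannot decide the branch on {self.expr} at trace time")


def pad_amount(input_h, kernel_h=3):
    # Simplified shape-dependent padding: the branch depends on the input size.
    pad_h = kernel_h - input_h
    if pad_h > 0:
        return pad_h
    return 0


print(pad_amount(2))  # concrete size: the branch can be evaluated

try:
    pad_amount(ToyProxy("x.size(-2)"))  # symbolic size: the branch cannot
except ToyTraceError as err:
    print("trace failed:", err)
```

With a concrete integer the function runs normally. With the symbolic stand-in, the comparison produces another recorded expression, and the `if` statement has nothing to branch on.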

Jano-rm-rf Nov 15, 2023

Thanks for your reply. I will change the conv type and test it.

zhangcan-hust · 2024-03-28T06:44:37Z

Can this method still work if the batch_size is set to 1 during training?

2 replies

youkaichao Mar 28, 2024
Author

As long as your BN is in eval mode during training, this method should work. Nevertheless, a smaller batch size may lead to a smaller memory reduction.
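
A rough way to see the batch-size effect is the back-of-the-envelope sketch below. It rests on a loudly stated assumption that is not taken from this thread: memory splits into a fixed part (weights, optimizer state) and an activation part that grows linearly with batch size, and the feature removes a fixed fraction of the activation part. The function name and all numbers are hypothetical.

```python
def relative_saving(batch_size, fixed_mb, per_sample_mb, saved_fraction):
    """Fraction of total memory saved under the linear model described above."""
    activations = batch_size * per_sample_mb
    total = fixed_mb + activations
    return saved_fraction * activations / total


# Hypothetical numbers: 2000 MB fixed, 500 MB of activations per sample,
# and 40% of the activation memory removed.
for n in (1, 2, 8, 32):
    print(n, round(relative_saving(n, 2000, 500, 0.4), 3))
```

Under this model the relative saving grows with batch size, because the fixed part takes up a larger share of the total when the batch is small.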

zhangcan-hust Mar 29, 2024

I understand, thanks for your reply!

makecent · 2024-03-28T09:07:06Z

Tested on the SlowOnly model (torch.hub.load("facebookresearch/pytorchvideo", model='slow_r50', pretrained=True)). It reduced the memory from 4576 MB to 2623 MB.

1 reply

youkaichao Mar 28, 2024
Author

Glad it helps :)

phoenixdna · 2024-10-16T03:53:28Z

I have encountered a problem; please help.
I am using CUDA 11.8 and PyTorch 2.0.1, and the command line is:

 ./tools/dist_train.sh bisai.py 1 --cfg-options efficient_conv_bn_eval="[backbone]"

My dist_train.sh:

#!/usr/bin/env bash

CONFIG=$1
GPUS=$2
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29568}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py \
    $CONFIG \
    --launcher pytorch ${@:3}

Then I have the following exception:

10/16 03:46:39 - mmengine - INFO - Enabling the "efficient_conv_bn_eval" feature for sub-modules: ['backbone']
Traceback (most recent call last):
  File "/gemini/code/mmdet_bisai/./tools/train.py", line 133, in <module>
    main()
  File "/gemini/code/mmdet_bisai/./tools/train.py", line 129, in main
    runner.train()
  File "/root/miniconda3/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1762, in train
    turn_on_efficient_conv_bn_eval(ori_model, modules)
  File "/root/miniconda3/lib/python3.10/site-packages/mmengine/model/efficient_conv_bn_eval.py", line 158, in turn_on_efficient_conv_bn_eval
    turn_on_efficient_conv_bn_eval_for_single_model(module)
  File "/root/miniconda3/lib/python3.10/site-packages/mmengine/model/efficient_conv_bn_eval.py", line 147, in turn_on_efficient_conv_bn_eval_for_single_model
    fx_model: fx.GraphModule = fx.symbolic_trace(model)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 1109, in symbolic_trace
    graph = tracer.trace(root, concrete_args)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 778, in trace
    (self.create_arg(fn(*args)),),
  File "/gemini/code/mmdet_bisai/mmdet/models/backbones/swin.py", line 835, in forward
    x, hw_shape = self.patch_embed(x)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 756, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 467, in call_module
    ret_val = forward(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 749, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gemini/code/mmdet_bisai/mmdet/models/layers/transformer/utils.py", line 302, in forward
    x = self.adap_padding(x)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 756, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 467, in call_module
    ret_val = forward(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 749, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gemini/code/mmdet_bisai/mmdet/models/layers/transformer/utils.py", line 176, in forward
    pad_h, pad_w = self.get_pad_shape(x.size()[-2:])
  File "/gemini/code/mmdet_bisai/mmdet/models/layers/transformer/utils.py", line 169, in get_pad_shape
    pad_h = max((output_h - 1) * stride_h +
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/proxy.py", line 413, in __bool__
    return self.tracer.to_bool(self)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/fx/proxy.py", line 276, in to_bool
    raise TraceError('symbolically traced variables cannot be used as inputs to control flow')
torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow

I checked a lot of material and asked GPT, but found no answer. Please help, and thanks in advance.

2 replies

youkaichao Oct 24, 2024
Author

bisai.py

This model might contain certain conv layers that cannot be traced by torch.fx.

youkaichao Oct 24, 2024
Author

If your code supports torch.compile, you can use torch >= 2.2 and set:

from torch._inductor import config as inductor_config
inductor_config.efficient_conv_bn_eval_fx_passes = True
