InnerEye GPU memory requirements/suggestions #812
-
I am wondering what the required/recommended amount of GPU memory is for typical model training with InnerEye. I tested training a lung segmentation model on the example 60-patient dataset (https://wiki.cancerimagingarchive.net/display/Public/Lung+CT+Segmentation+Challenge+2017).

I read in the documentation that Standard_ND24s (24 cores, 448GB RAM) is recommended. But on this ND24s, with train_batch_size=8 (the default), I run into an out-of-memory error. I can reduce train_batch_size to 4 or 1 and finish the model training, but in one of my runs I still saw a different out-of-memory error message once, even with train_batch_size=1.

If I use a machine with less memory per GPU, like NC24s (24 cores, 224GB RAM), I always see the out-of-memory error, even with train_batch_size=1. (Training always failed, but some of the failed runs still yield a trained Model in the AzureML workspace, which surprised me.)

My questions are:

1. Would using more cores / GPUs help me avoid these out-of-memory errors?
2. Would a bigger dataset (more samples) increase the memory requirements?
3. Can I train on smaller patches/crops to reduce memory usage, and how much would that affect model quality?
4. Why do some failed runs still register a trained Model in the AzureML workspace, and is such a model usable?

Thank you,
John
Replies: 2 comments
-
Hi John, thanks for posting this! I'll answer your questions in order:
Using more cores will not help in this instance. To avoid out-of-memory errors you need more memory per GPU. During training on multiple GPUs, the whole model is copied onto each available GPU and a batch is passed through each copy of the model at each training step. More cores / GPUs only increase how many batches are processed in parallel, not the memory capacity available to each copy.
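To make this concrete, here is a minimal plain-PyTorch sketch (not InnerEye's actual training loop; the tiny Conv3d model is only a stand-in) showing that data-parallel training replicates the full model on every GPU and only splits the batch:

```python
# Minimal data-parallel sketch: every GPU gets a full copy of the model, and the
# batch is split across the copies, so adding GPUs increases throughput but not
# the memory available to any single copy.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv3d(8, 2, kernel_size=1),
)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the whole model onto each visible GPU and splits
    # the batch along dimension 0; each replica must still fit on one GPU.
    model = nn.DataParallel(model)
model = model.cuda()

batch = torch.randn(8, 1, 64, 64, 64).cuda()   # 8 crops, split across the GPUs
output = model(batch)                          # each GPU processes 8 / num_gpus crops
print(output.shape)
```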
A bigger dataset (more total samples in the set) will not increase memory requirements. Memory requirements are affected by the model size (i.e. the number of parameters) and the training parameters, in particular batch size and crop size. If you use the same training parameters, a larger dataset will only increase training time, not memory requirements.
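As a rough illustration (a sketch only, using a toy Conv3d stack rather than a real InnerEye model), the peak GPU memory of a single training step scales with batch size and crop size and never depends on the dataset length:

```python
# Peak GPU memory of one training step depends on batch size and crop size
# (and the model itself), not on how many samples the dataset contains.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv3d(16, 2, kernel_size=1),
).cuda()

for batch_size, crop in [(1, 64), (4, 64), (1, 128)]:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, 1, crop, crop, crop, device="cuda")
    model(x).sum().backward()          # forward + backward, as in a training step
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"batch={batch_size}, crop={crop}^3: peak GPU memory {peak_gb:.2f} GB")
```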
Absolutely! Patching during training is very common and still produces very good results, even for fairly small patches. The issues you're encountering with the lung model are something we have been working on recently; they are specific to incorrect parameters in the Lung configuration (the relevant windowing parameters being level=-500, window=2200). While smaller crops will obviously have an impact on model performance, reducing the crop size is an effective way to lower GPU memory usage.
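If you want to experiment with smaller crops, a hedged sketch of a custom config is below. The module path InnerEye.ML.configs.segmentation.Lung and the attribute names (crop_size, test_crop_size, train_batch_size) follow InnerEye's config conventions, but treat them as assumptions and double-check against the version you are running:

```python
# Hedged sketch of a custom InnerEye config that trains the lung model with
# smaller crops and a smaller batch size to reduce peak GPU memory. Module path
# and attribute names are assumptions based on InnerEye's config conventions.
from InnerEye.ML.configs.segmentation.Lung import Lung


class LungSmallCrops(Lung):
    def __init__(self) -> None:
        super().__init__()
        # Smaller crops -> fewer voxels per sample -> lower memory per step.
        self.crop_size = (64, 192, 192)
        # Inference also runs crop-by-crop; keeping this modest helps avoid
        # out-of-memory errors in the post-training inference step as well.
        self.test_crop_size = (64, 192, 192)
        # Fewer crops per training step.
        self.train_batch_size = 2
```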
If a model has been registered, it means training completed successfully. If the job failed with an out-of-memory issue after registering the model, it means that the error was encountered when the model tried to run inference on itself (which occurs after every completed InnerEye training run); inference can require more GPU memory than training, for example when the test crop size is larger than the training crop size. If a model is registered by InnerEye it will always be usable for inference, even if the training run then failed after the registration (see the sketch below for pulling a registered model back out of the workspace). Hope this helps! Peter
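For completeness, a hedged sketch of retrieving such a registered model from the workspace with the standard azureml-core SDK; the model name "Lung" is just an assumption for illustration, use the name shown in your workspace:

```python
# Fetch a registered model back out of the AzureML workspace so it can be used
# for inference, even though the training job itself later failed.
from azureml.core import Model, Workspace

ws = Workspace.from_config()                     # reads the workspace config.json
model = Model(ws, name="Lung")                   # latest registered version by default
local_path = model.download(target_dir="downloaded_model", exist_ok=True)
print(f"Model version {model.version} downloaded to {local_path}")
```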
-
@peterhessey's answer summarizes the situation very well. Just wanted to add some details about the error messages: full lists of CPU memory and GPU memory for the different Azure VMs are available here. The ND series is particularly good for model training because it has 24GB of GPU memory per GPU (Tesla P40); the NDv2 series has 32GB of GPU memory per GPU (Tesla V100).
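A quick way to confirm the GPU memory actually visible on whichever VM you end up with (plain PyTorch, sketch only):

```python
# Print the name and total memory of every GPU on the VM / compute node,
# to compare against the Azure VM series figures above.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024 ** 3:.1f} GB")
```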