InnerEye GPU memory requirements/suggestions #812
-
I am wondering what the required/recommended amount of GPU memory is for typical model training with InnerEye. I tested training a lung segmentation model on the example 60-patient dataset (https://wiki.cancerimagingarchive.net/display/Public/Lung+CT+Segmentation+Challenge+2017).

I read in the documentation that Standard_ND24s (24 cores, 448GB RAM) is recommended. But on this ND24s, with train_batch_size=8 (the default), I run into an out-of-memory error. I can reduce train_batch_size to 4 or 1 and finish the model training, but in one of my runs I still saw a different out-of-memory error message once, even with train_batch_size=1.

If I use a machine with less memory per GPU, like NC24s (24 cores, 224GB RAM), I always see the out-of-memory error, even with train_batch_size=1. (Training always failed, but some of the failed runs still yield a trained Model in the AzureML workspace, which surprised me.)

My questions are:

1. Would using more cores / GPUs help me avoid these out-of-memory errors?
2. Would a bigger dataset (more samples) increase the memory requirements?
3. Can I train on smaller patches/crops to reduce memory usage, and how much would that affect model quality?
4. Why do some failed runs still register a trained Model in the AzureML workspace, and is such a model usable?

Thank you,
John
Replies: 2 comments
-
Hi John, thanks for posting this! I'll answer your questions in order:
Using more cores will not help in this instance. To avoid out-of-memory errors you need more memory per GPU. During training on multiple GPUs, the whole model is copied onto each available GPU and a batch is passed through each copy of the model at each training step. More cores / GPUs only increase how many batches are processed in parallel, not the memory capacity available to each copy.
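To make this concrete, here is a minimal plain-PyTorch sketch (not InnerEye's actual training loop; the tiny Conv3d model is only a stand-in) showing that data-parallel training replicates the full model on every GPU and only splits the batch:

```python
# Minimal data-parallel sketch: every GPU gets a full copy of the model, and the
# batch is split across the copies, so adding GPUs increases throughput but not
# the memory available to any single copy.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv3d(8, 2, kernel_size=1),
)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the whole model onto each visible GPU and splits
    # the batch along dimension 0; each replica must still fit on one GPU.
    model = nn.DataParallel(model)
model = model.cuda()

batch = torch.randn(8, 1, 64, 64, 64).cuda()   # 8 crops, split across the GPUs
output = model(batch)                          # each GPU processes 8 / num_gpus crops
print(output.shape)
```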
A bigger dataset (more total samples in the set) will not increase memory requirements. Memory requirements are affected by the model size (i.e. the number of parameters) and the training parameters, in particular batch size and crop size. If you use the same training parameters, a larger dataset will only increase training time, not memory requirements.
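As a rough illustration (a sketch only, using a toy Conv3d stack rather than a real InnerEye model), the peak GPU memory of a single training step scales with batch size and crop size and never depends on the dataset length:

```python
# Peak GPU memory of one training step depends on batch size and crop size
# (and the model itself), not on how many samples the dataset contains.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv3d(16, 2, kernel_size=1),
).cuda()

for batch_size, crop in [(1, 64), (4, 64), (1, 128)]:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, 1, crop, crop, crop, device="cuda")
    model(x).sum().backward()          # forward + backward, as in a training step
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"batch={batch_size}, crop={crop}^3: peak GPU memory {peak_gb:.2f} GB")
```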
Absolutely! Patching during training is very common and still produces very good results, even for fairly small patches. The issues you're encountering with the lung model are something we have been working on recently; they are specific to incorrect parameters in the Lung configuration (the relevant windowing parameters being level=-500, window=2200). While smaller crops will obviously have an impact on model performance, reducing the crop size is an effective way to lower GPU memory usage.
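If you want to experiment with smaller crops, a hedged sketch of a custom config is below. The module path InnerEye.ML.configs.segmentation.Lung and the attribute names (crop_size, test_crop_size, train_batch_size) follow InnerEye's config conventions, but treat them as assumptions and double-check against the version you are running:

```python
# Hedged sketch of a custom InnerEye config that trains the lung model with
# smaller crops and a smaller batch size to reduce peak GPU memory. Module path
# and attribute names are assumptions based on InnerEye's config conventions.
from InnerEye.ML.configs.segmentation.Lung import Lung


class LungSmallCrops(Lung):
    def __init__(self) -> None:
        super().__init__()
        # Smaller crops -> fewer voxels per sample -> lower memory per step.
        self.crop_size = (64, 192, 192)
        # Inference also runs crop-by-crop; keeping this modest helps avoid
        # out-of-memory errors in the post-training inference step as well.
        self.test_crop_size = (64, 192, 192)
        # Fewer crops per training step.
        self.train_batch_size = 2
```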
If a model has been registered, it means training completed successfully. If the job failed with an out-of-memory issue after registering the model, it means that the error was encountered when the model tried to run inference on itself (which occurs after every completed InnerEye training run); inference can require more GPU memory than training, for example when the test crop size is larger than the training crop size. If a model is registered by InnerEye it will always be usable for inference, even if the training run then failed after the registration (see the sketch below for pulling a registered model back out of the workspace). Hope this helps! Peter
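For completeness, a hedged sketch of retrieving such a registered model from the workspace with the standard azureml-core SDK; the model name "Lung" is just an assumption for illustration, use the name shown in your workspace:

```python
# Fetch a registered model back out of the AzureML workspace so it can be used
# for inference, even though the training job itself later failed.
from azureml.core import Model, Workspace

ws = Workspace.from_config()                     # reads the workspace config.json
model = Model(ws, name="Lung")                   # latest registered version by default
local_path = model.download(target_dir="downloaded_model", exist_ok=True)
print(f"Model version {model.version} downloaded to {local_path}")
```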
-
@peterhessey's answer summarizes the situation very well. Just wanted to add some details about the error messages: full lists of CPU memory and GPU memory for the different Azure VMs are available here. The ND series is particularly good for model training because it has 24GB of GPU memory per GPU (Tesla P40); the NDv2 series has 32GB of GPU memory per GPU (Tesla V100).
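A quick way to confirm the GPU memory actually visible on whichever VM you end up with (plain PyTorch, sketch only):

```python
# Print the name and total memory of every GPU on the VM / compute node,
# to compare against the Azure VM series figures above.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024 ** 3:.1f} GB")
```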