
Error: In include/poptorch_err/ExceptionHandling.hpp:76: 'poplar_stream_memory_allocation_error' #9

Open
Lime-Cakes opened this issue Dec 7, 2022 · 5 comments


@Lime-Cakes

Is there an explanation of this error?

/usr/local/lib/python3.8/dist-packages/poptorch/experimental.py in __exit__(self, exc_type, value, traceback)
    253         if self._compile_using == enums.Compiler.PopART:
    254             # Compile the captured graph using PopART.
--> 255             self._executable = poptorch_core.compileWithManualTracing(
    256                 self._options.toDict(), accessAttributes, self._training,
    257                 self._dict_optimizer,

Error: In include/poptorch_err/ExceptionHandling.hpp:76: 'poplar_stream_memory_allocation_error': /opt/jenkins/workspace/poplar/poplar_ci_ubuntu_20_04_unprivileged/popart/willow/src/popx/irlowering.cpp:3516 Out of memory. Single stream of length 149766584 in bufferIndex 1
Error raised in:
  [0] popart::Session::prepareDevice: Poplar compilation
  [1] Compiler::compileAndPrepareDevice
  [2] LowerToPopart::compile
  [3] compileWithManualTracing
@ariannas-graphcore

Hi,

The exception you're hitting is thrown when the allocation of a stream buffer fails: in this case, too much memory is being used by a single data stream ("Single stream of length 149766584", as per the error). This can happen, for example, when very large outputs are streamed back from the IPU to the host.

It would be great if you could share instructions to reproduce the error, along with the SDK version you've been using: this would enable us to suggest specific remedies. In the meantime, here are some more generic suggestions that might help:

  • Prefetching allows the host to prepare the next buffer for an infeed/outfeed while it waits for the IPU to compute the current buffer. A buffering depth greater than 1 can improve performance, but at the cost of a larger memory footprint. You could try reducing the buffering depth of the infeeds/outfeeds in PopART, or disabling prefetching altogether (it is enabled by default). You can find more details on how to do that in our PopART user guide, and see the sketch after this list.

  • Try switching to a different poptorch.OutputMode(): e.g. if you're using All, try Final instead. This mode returns only the last batch rather than a result for every batch, which reduces the accumulated amount of output data. See the PopTorch user guide for more details.

  • Try reducing the batch size and/or the PopTorch deviceIterations (see the PopTorch user guide).
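
To make these suggestions concrete, here is a minimal, untested sketch of how these knobs are set through poptorch.Options. The "enablePrefetchDatastreams" session-option name is my assumption for the relevant PopART flag, so please verify it against your SDK's documentation; the dataset/model/optimizer names are placeholders:

```python
import poptorch

opts = poptorch.Options()

# Return only the final batch's output instead of one result per batch.
opts.outputMode(poptorch.OutputMode.Final)

# Process fewer batches per host/IPU interaction.
opts.deviceIterations(1)

# Disable datastream prefetching via the underlying PopART session options.
# NOTE: the option name is an assumption; check your PopART user guide.
opts._Popart.set("enablePrefetchDatastreams", False)

# A smaller batch size also shrinks the stream buffers.
# train_loader = poptorch.DataLoader(opts, dataset, batch_size=8)
# poptorch_model = poptorch.trainingModel(model, options=opts, optimizer=optimizer)
```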

Hope this helps for now; I encourage you to share a reproducer so we can investigate this further.

Best,
Arianna

@Lime-Cakes
Author

This is for training, so I left outputMode at its default, which should already be Final. What values would be streamed from the IPU to the host during training? The loss calculation is done on the IPU, since it's part of the model.

Could the stream error be related to data streaming between IPUs?

@payoto
Contributor

payoto commented Jan 27, 2023

No, those streams are for IPU/host communication. Have you had a chance to try any of the other suggestions @ariannas-graphcore provided?

@Lime-Cakes
Author

Yeah. Those either don't work, or can't be applied (can't lower any further).

@payoto
Contributor

payoto commented Jan 30, 2023

By any chance, are you trying to profile the Poplar executable by generating a PopVision profile (by setting the environment variable POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}')?
I have occasionally seen that cause host stream issues in the past.
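
For what it's worth, here is a quick, hedged sketch (not an official API, just standard-library calls) of how you could check for that variable from Python and clear it before the model is compiled, to rule profiling out:

```python
import json
import os

# If POPLAR_ENGINE_OPTIONS requests an auto-report, extra profiling data is
# collected, which (per the note above) can add to host stream pressure.
engine_opts = os.environ.get("POPLAR_ENGINE_OPTIONS")
if engine_opts:
    print("POPLAR_ENGINE_OPTIONS =", json.loads(engine_opts))

# Unset it before the Poplar compilation happens.
os.environ.pop("POPLAR_ENGINE_OPTIONS", None)
```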

We can try to provide more specific support, but we would need additional information about the system, software, and model you are encountering the error with:

  • IPU machine type (number of IPUs; Paperspace? GCore?)
  • Poplar SDK version
  • Environment variables that are set
  • Frameworks (PyTorch, Hugging Face, TensorFlow?)
  • Which model and datasets you are trying to run (if possible)

Ideally, if you can send us a code sample that reproduces the error, we can provide more specific advice to help fix the problem.
