Improve logging logic to improve/fix GPU performance #2252
Comments
@strickvl I'm interested in working on this issue. Can I take it up?
Sure thing, @nida-imran173! I'll assign it to you. Let us know if you have any questions; most basic things should be answered in our CONTRIBUTING.md document.
Hi @strickvl, After analyzing the code in 'logging', I've identified a few potential areas that could be causing the reported drop in GPU utilization. Here are the key points:
I would greatly appreciate your guidance and any specific insights you might have on tackling this issue. If there are additional aspects I should consider or if you have any preferences regarding the approach, please let me know.
So the first thing I'd say would be to reproduce the issue, i.e. run a step with logging turned on (which is the default). Then either toggle / update When someone is running on a GPU-enabled environment, we could potentially have different behaviour. Also, it isn't yet clear to me why logs within a GPU-enabled environment are slower, beyond maybe that the task itself generates logs at a certain frequency. So in short, I think we'll need to dive a bit deeper into the problem.
@strickvl @nida-imran173 I would just add to this discussion that I think the primary reason for the GPU performance degradation is exactly as Nida already said:
I would try to tackle this issue first. Basically, I'd run some tests to see how this can affect performance. A very simple test could be to run a pipeline which trains a model using PyTorch or TensorFlow. These libraries produce progress bars that are then logged and cause a slowdown. Once we've verified this, we can work on a fix all together by brainstorming strategies. But first things first: as @strickvl said, we need a test in place where we can measure things.
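To make the "test where we can measure things" idea concrete, here is a minimal sketch, not ZenML code. It assumes the cost comes from per-update handling of progress output, and it replaces the real training loop and log capture with stand-ins. `CountingSink`, `run_fake_training`, and the delay value are hypothetical names and numbers chosen for illustration.

```python
import time


class CountingSink:
    """Stand-in for a log-capturing stream: counts writes and simulates per-write cost."""

    def __init__(self, per_write_delay: float = 0.0) -> None:
        self.writes = 0
        self.per_write_delay = per_write_delay

    def write(self, text: str) -> int:
        self.writes += 1
        if self.per_write_delay:
            # Simulated cost of handling one captured log record (assumption).
            time.sleep(self.per_write_delay)
        return len(text)

    def flush(self) -> None:
        pass


def run_fake_training(n_steps: int, sink: CountingSink, report_every: int = 1) -> float:
    """Run a dummy loop that emits a progress line every `report_every` steps.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for step in range(1, n_steps + 1):
        _ = sum(range(100))  # placeholder for the real training work
        if step % report_every == 0:
            sink.write(f"\rstep {step}/{n_steps}")
    return time.perf_counter() - start


if __name__ == "__main__":
    # Compare reporting on every step against reporting on every 100th step.
    for every in (1, 100):
        sink = CountingSink(per_write_delay=0.0002)
        elapsed = run_fake_training(500, sink, report_every=every)
        print(f"report_every={every}: {sink.writes} writes, {elapsed:.3f}s")
```

A real reproduction would swap the dummy loop for an actual PyTorch or TensorFlow training step and compare wall-clock time and GPU utilization with log capture on and off.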
Open Source Contributors Welcomed!
Please comment below if you would like to work on this issue!
Contact Details [Optional]
[email protected]
What happened?
Users have reported a significant drop in GPU utilization (from 95% to 2%) after upgrading ZenML from version 0.32.1 to 0.44.2. This issue was observed while deploying pipelines on GCP VertexAI. Investigations suggest that the performance bottleneck is due to the logging mechanism, especially when using progress bars like `tqdm`. It appears that logging, particularly frequent updates from progress bars, is substantially slowing down the processing speed.
Task Description
Investigate and optimize the logging logic in ZenML, particularly for scenarios involving high GPU usage. The goal is to ensure that the logging process, including progress bars, does not adversely affect the GPU performance and overall speed of pipeline execution.
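One possible direction, sketched here under assumptions rather than as the agreed fix, is to rate-limit in-place progress updates before they reach the log handler while still forwarding ordinary log lines immediately. The sketch assumes progress updates are writes that start with a carriage return and do not end in a newline, which is how in-place terminal bars commonly redraw. `ThrottledProgressStream` is a hypothetical name, not an existing ZenML class.

```python
import io
import time
from typing import Callable, Optional


class ThrottledProgressStream:
    """Forward normal lines immediately, but rate-limit in-place progress updates."""

    def __init__(
        self,
        target,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._target = target              # any object with write() and flush()
        self._min_interval = min_interval  # seconds between forwarded updates
        self._clock = clock                # injectable so tests can fake time
        self._last_progress: Optional[float] = None

    def write(self, text: str) -> int:
        # Heuristic (assumption): a progress redraw starts with "\r" and has no newline.
        is_progress = text.startswith("\r") and not text.endswith("\n")
        if is_progress:
            now = self._clock()
            too_soon = (
                self._last_progress is not None
                and now - self._last_progress < self._min_interval
            )
            if too_soon:
                return len(text)  # drop this update; a later one will replace it
            self._last_progress = now
        self._target.write(text)
        return len(text)

    def flush(self) -> None:
        self._target.flush()


if __name__ == "__main__":
    buffer = io.StringIO()
    stream = ThrottledProgressStream(buffer, min_interval=60.0)
    for pct in range(0, 101, 10):
        stream.write(f"\r{pct}%")  # only the first update gets through
    stream.write("done\n")         # ordinary lines are always forwarded
    print(repr(buffer.getvalue()))
```

Dropping intermediate redraws loses nothing a log reader needs, since each update supersedes the previous one. The trade-off is that the final state of a bar may not be logged unless the library ends it with a newline.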
Expected Outcome
Steps to Implement
Note that part of the solution might be to expose these global variables / constants better in settings via environment variables.
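As a rough illustration of exposing such a constant through an environment variable, the sketch below reads a float setting and falls back to a default on missing or invalid input. The variable name `EXAMPLE_PROGRESS_LOG_INTERVAL_SECONDS` and the default of 1.0 are made up for this example and do not correspond to any existing ZenML setting.

```python
import os
from typing import Mapping, Optional

# Hypothetical name and default, for illustration only.
ENV_PROGRESS_INTERVAL = "EXAMPLE_PROGRESS_LOG_INTERVAL_SECONDS"
DEFAULT_PROGRESS_INTERVAL = 1.0


def get_progress_interval(environ: Optional[Mapping[str, str]] = None) -> float:
    """Return the configured interval in seconds, or the default if unset or invalid."""
    env = os.environ if environ is None else environ
    raw = env.get(ENV_PROGRESS_INTERVAL)
    if raw is None:
        return DEFAULT_PROGRESS_INTERVAL
    try:
        value = float(raw)
    except ValueError:
        # Non-numeric input: fall back rather than crash the pipeline.
        return DEFAULT_PROGRESS_INTERVAL
    # Negative values (and NaN, which fails this comparison) are rejected.
    return value if value >= 0 else DEFAULT_PROGRESS_INTERVAL


if __name__ == "__main__":
    print(get_progress_interval())
```

Falling back silently keeps a typo in the environment from breaking a run; a real implementation might also log a warning when the value is rejected.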
Additional Context
This issue is critical for users leveraging ZenML for GPU-intensive tasks, as efficient GPU utilization is key to performance in these scenarios. The solution should provide a balance between informative logging and optimal resource utilization.