Training image reads from DL workload not appearing in log #996
Just to confirm, do you see Darshan output for other files that are accessed, just not this particular file that is being read as part of training? And when parsing the logs, do you see any warnings about Darshan running out of memory? It's maybe a bit heavy-handed, but if at all possible, could you share the strace output for your run?
Thanks for the quick response. Yes, we see Darshan output for other files, like some log files for the training dataset, but not the images themselves. We are not seeing any warnings about Darshan running out of memory. Here is the link to the strace output file: https://drive.google.com/file/d/1kguz59VGZyCtkBB9KrB6R1nayeyz0xVF/view?usp=sharing. The files of interest (basically all of the .jpg files in the train2017 directory) do show up in the strace output.
Thanks for the additional details; that helps a lot in understanding what's going on here. I think you're running into a problem similar to one we've seen in the past with Python multiprocessing-based workloads, and your strace output points in the same direction.
These frameworks seem to rely on multiprocessing packages to spawn entirely new worker processes for doing I/O; for example, around L#159664 in the strace output you can see activity from one of these worker processes.
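As a rough illustration of that pattern, here is a minimal sketch using PyTorch's DataLoader (which mmdetection builds on); none of this code comes from your setup. With num_workers > 0, the loader spawns child processes, and whatever the dataset's __getitem__ does, such as the open() calls in mmengine's local_backend, runs in those child PIDs rather than in the main training process.

```python
# Minimal sketch: with num_workers > 0, the actual data-loading work
# (where the image reads would happen) runs in separate worker PIDs.
import os
from torch.utils.data import Dataset, DataLoader

class TinyDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        # In a real training job this is where the .jpg file would be opened
        # and read; here we just report which process executed the call.
        return idx, os.getpid()

if __name__ == "__main__":
    print("main process PID:", os.getpid())
    loader = DataLoader(TinyDataset(), num_workers=2, batch_size=1)
    for idx, pid in loader:
        print(f"sample {idx.item()} was loaded by worker PID {pid.item()}")
```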
Now, Darshan should actually be able to handle this all fine, ultimately generating a unique log file for each process. The problem is the way these Python worker processes are shut down: they are typically killed off abruptly by the parent process rather than exiting cleanly.
Darshan's shutdown mechanism (which generates the output log file) relies on a graceful exit of the process it is instrumenting, so it never gets a chance to execute when a process is abruptly killed. Ultimately, there's nothing we can do in our default mode to avoid that. But we do have a (somewhat experimental) build option for Darshan that I think should give you a way to keep the logs for the killed processes: you would need to reconfigure Darshan with that option enabled, so that each process keeps its log data in a temporary file on disk while it runs.

You should be able to access and analyze those temporary logs the same way you do normal logs. The key difference is that they can be quite large, since they are stored in uncompressed form. I recommend compressing them and moving them to a more permanent location (e.g., wherever your log files are configured to go when processes terminate normally) using the Darshan log conversion utility.
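For what it's worth, I believe the experimental option referred to above is darshan-runtime's `--enable-mmap-logs` configure flag, and the compress-and-move step can be done with the `darshan-convert` utility; treat both names as my assumption and check the docs for your Darshan version. Once a temporary log has been converted, analyzing it should look the same as for a normal log, for example with the pydarshan package:

```python
# A sketch of checking a (converted) Darshan log for the image reads,
# assuming the pydarshan analysis package is installed ("pip install darshan").
# "workerlog.darshan" is a hypothetical filename, not one from this issue.
import darshan

report = darshan.DarshanReport("workerlog.darshan", read_all=True)

# Which instrumentation modules recorded anything (POSIX, STDIO, ...)?
print(list(report.modules.keys()))

# POSIX read counters per file record (column names may vary slightly
# across pydarshan versions).
posix = report.records["POSIX"].to_df()
print(posix["counters"][["id", "POSIX_READS", "POSIX_BYTES_READ"]].head())

# Map record ids back to path names and look for the training images.
for rec_id, path in report.name_records.items():
    if "train2017" in path:
        print(hex(rec_id), path)
```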
Hopefully that makes some sense, but let me know if you need any more info or if you run into additional issues.
Hello,
We are observing an issue where Darshan does not appear to instrument training data reads during CNN training workloads. We are using darshan-3.4.5 and training a CNN model with the mmdetection repo. Here is the line (in mmengine, which mmdetection uses for file I/O) that gets called to read the image files: https://github.com/open-mmlab/mmengine/blob/main/mmengine/fileio/backends/local_backend.py#L33. We have confirmed that this code runs and the images are read; however, the reads do not appear in the Darshan log. Here is the train.py script we are using from mmdetection: mmdetection/tools/train.py at main · open-mmlab/mmdetection (github.com).
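One diagnostic we could run (a minimal sketch; the placement inside the dataset code is just for illustration) is to check, from inside the worker processes that perform the reads, whether the Darshan library is actually loaded there:

```python
# Minimal diagnostic sketch (Linux-only): check whether a Darshan runtime
# library is mapped into the current process by scanning /proc/self/maps.
# Calling this from the dataset's __getitem__ would show whether the
# DataLoader workers that read the images have Darshan loaded at all.
import os

def darshan_loaded() -> bool:
    """Return True if a libdarshan shared library is mapped into this process."""
    try:
        with open("/proc/self/maps") as maps:
            return any("libdarshan" in line for line in maps)
    except OSError:
        return False

print(f"PID {os.getpid()}: libdarshan mapped = {darshan_loaded()}")
```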
The command we use is as follows:
Whether we use environment variables or the config file, we still do not see Darshan instrumenting the image reads.
Below are the contents of our config file: