Training image reads from DL workload not appearing in log #996

Open
darrenjng opened this issue Jul 11, 2024 · 3 comments

@darrenjng

Hello,

We are observing an issue where Darshan does not appear to instrument training data reads during CNN training workloads. We are using darshan-3.4.5 and running a CNN model with the mmdetection repo. Here is the line in mmengine (used by mmdetection) where the image files are read: https://github.com/open-mmlab/mmengine/blob/main/mmengine/fileio/backends/local_backend.py#L33. We have confirmed that this code runs and the images are being read; however, Darshan does not include these reads in its log. The training script we are using is tools/train.py from the mmdetection repo.

The command we use is as follows:

export DARSHAN_MODMEM=10000 
export DARSHAN_EXCLUDE_DIRS=/var,/proc,/etc,/dev,/sys,/snap,/run,/user,/lib,/bin,/home/darrenng/.local/,/home/darrenng/miniforge3,/home/darrenng/bin,/tmp
DARSHAN_ENABLE_NONMPI=1 LD_PRELOAD="/lib/libdarshan.so" python tools/train.py configs/the_fasterrcnn_or_yolo_config.py

Whether we use environment variables or the config file, we still do not see Darshan instrumenting the image reads.

Below are the contents of our config file:

MAX_RECORDS     100000  POSIX,MPI-IO,LUSTRE
MODMEM  6500    POSIX,MPI-IO,LUSTRE
NAME_EXCLUDE  /var,/proc,/etc,/dev,/sys,/snap,/run,/user,/lib,/bin,/home/darrenng/.local/,/home/darrenng/miniforge3,/home/darrenng/bin,/tmp   POSIX,MPI-IO,LUSTRE
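
For reference, we point Darshan at this config file roughly as follows (the config file path below is just a placeholder for wherever the file actually lives):

export DARSHAN_CONFIG_PATH=/path/to/darshan.conf
DARSHAN_ENABLE_NONMPI=1 LD_PRELOAD="/lib/libdarshan.so" python tools/train.py configs/the_fasterrcnn_or_yolo_config.py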
@shanedsnyder
Contributor

Just to confirm, do you see Darshan output for other files that are accessed? Just not this particular file that is being read as part of training?

Do you see any warnings when parsing logs about Darshan running out of memory?

It's maybe a bit heavy-handed, but if at all possible, could you share the output of strace -f when running the script that has this problem? That can help confirm how the reads are occurring and whether there's some unusual behavior related to Python spawning new processes, etc. If you can share this, please also let me know the exact filename of interest so I can look for context related to it.
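
Something along these lines should capture what I'm after (the output filename is just an example):

strace -f -o strace_train.out python tools/train.py configs/the_fasterrcnn_or_yolo_config.py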

@darrenjng
Author

Thanks for the quick response.

Yes, we see Darshan output for other files like some log files for the training dataset but not the images themselves. We are not seeing any warnings about Darshan running out of memory.
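
For reference, here is roughly how we have been checking the logs for those reads (the log path below is just a placeholder for our actual log file):

darshan-parser /path/to/our_darshan_log.darshan | grep train2017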

Here is the link to the strace output file: https://drive.google.com/file/d/1kguz59VGZyCtkBB9KrB6R1nayeyz0xVF/view?usp=sharing

The filename of interest was picked up in the strace output and is as follows (basically all of the jpg files in the train2017 directory):
data/coco/train2017/000000033009.jpg

@shanedsnyder
Contributor

Thanks for the additional details, that helps a lot in understanding what's going on here. I think you're running into a problem similar to one we've seen in the past when using Python multiprocessing pools. It looks like the training framework implements its own multiprocessing drop-in, and I would assume it's exhibiting similar behavior to what we've seen. Bear with me and I can walk you through it and give you a possible solution.

You can see from the strace output (L#1) that your training script is initially executed by some "master" process 3356833:

3356833 execve("/home/darrenng/miniforge3/envs/nvme/bin/python", ["python", "tools/train.py", "configs/yolo/yolov3_d53_8xb8-320"...], 0x7ffc7a1a9f28 /* 231 vars */) = 0

These frameworks seem to rely on multiprocessing packages for spawning entirely new worker processes for doing I/O, e.g. here (L#159664) you can see worker process 3357131 actually open (and subsequently read) one of the files you're interested in:

3357131 openat(AT_FDCWD, "data/coco/train2017/000000104172.jpg", O_RDONLY|O_CLOEXEC <unfinished ...>

Now, Darshan should actually be able to handle all of this fine, ultimately generating a unique log file for each process. However, the way these Python multiprocessing frameworks terminate worker processes creates a real problem for Darshan, e.g., here (L#10440878) you can see the "master" process issue a kill signal to the worker from above:

3356833 kill(3357131, SIGTERM <unfinished ...>

Darshan's shutdown mechanism (which generates the output log file) relies on a graceful exit of the process it is instrumenting, so it never gets a chance to execute when the process is abruptly killed. Ultimately, there's nothing we can do in our default mode to avoid that.

But, we actually have a (somewhat experimental) build option for Darshan that I think should give you a way to keep the logs for the killed processes. You would need to reconfigure Darshan with the --enable-mmap-logs option, in which case Darshan will use mmap to store its logs in memory as the application executes (with this memory backed by a temporary file, allowing it to persist in case of abrupt termination of the process). Darshan should store these temporary logs in /tmp. For processes that terminate normally, Darshan will still output its logfile in the typical location you configured it to.
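
Roughly, the rebuild would look something like the following (the paths are placeholders, and you should keep whatever other configure options you used for your existing install):

./configure --prefix=/path/to/darshan-install --with-log-path=/path/to/darshan-logs --with-jobid-env=NONE --enable-mmap-logs
make && make install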

You should be able to access and analyze those temporary logs the same way you do normal logs. The key difference is they can be quite large as they are stored in uncompressed form. I recommend compressing them and moving them to a more permanent location (e.g., wherever your log files are configured to go if processes terminate normally) using the darshan-convert tool:

darshan-convert <input_log_path> <output_log_path>

Hopefully that makes some sense, but let me know if you need any more info or if you run into additional issues.
