-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LDMS implementation for runtime I/O collection #834
Conversation
Branch is now updated with a CI github test. This tests the Darshan-LDMS integrated code by running a simple mpi-io-test application and checks that both Darshan and LDMS are collecting I/O events. It then checks for completion of the run and generated Darshan log file when LDMS is not functioning properly (i.e. env variables not set, cannot connect to daemon). Right now, the CI test clones and installs Darshan from my repository (i.e. Snell1224/darshanConnector) to test the LDMS implemented code. Once this PR has been reviewed and/or approved to merge into the main branch, I will update the CI test accordingly (i.e. install Darshan from the main branch of this repository). If it's better to leave this PR as a separate branch and not merge the changes, that's fine too. I will also update the CI test accordingly (i.e. test will trigger and the Darshan will be installed from the darshanConnector branch of this repository). Documentation on how to run the Darshan-LDMS Integrated code (i.e. darshanConnector) can be found here: https://ovis-hpcreadthedocs.readthedocs.io/en/latest/ldms-streams.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this LDMS functionality to a new darshan-ldms.c source file, rather than in the DXT source here?
The DXT module currently just piggybacks off of our POSIX and MPI-IO modules to provide more detailed tracing, similar to what you want to with the LDMS connector. But, there isn't really any integration between LDMS/DXT that I can see here (all calls to LDMS are in the POSIX module, in the MPI-IO module, etc.), so I'm not sure this is the best spot for the code responsible for talking with LDMS. Pushing as much code to a standalone LDMS source file as possible and trying to limit LDMS-specific code in other source files will probably make maintenance a little easier on our end.
The LDMS functionality has been moved to a new darshan-ldms.c source file. Please let me know if there are any issues with these changes. |
Thanks for the updates! I'll do my best to make another pass over this this week and see if I have more comments related to the timer issue. |
I just added a bunch of review comments. Mostly just another pass to see how to tidy up the integration overall, but let me know if you have any questions or if anything seems unreasonable. I still need to think more on the timespec issue to see what might be the cleanest solution there, but I'm sure we can figure something out. |
* this allows `pip install -e .` to produce a working editable install locally * the basis for the `setuptools` pin here is based on reading through pypa/setuptools#3548; in short, looks like a behavior change in `setuptools` so let's just pin it for now * since this only affects Python developers for our project I've not added any "regression test"/"CI test" for it, but feel free to add one editable job to the CI if you want
839612c
to
3859340
Compare
Made the necessary changes to The However, after looking at Thank you for reviewing this! Please let me know if there is any other ldms code that can be cleaned up or doesn't make sense. |
* move various LDMS code so it's contiguous where possible * various indentation/whitespace changes * remove unused dC vars * fix source copyright
I pushed some small updates over the last day or so, and think I'm ready to sign off on things. For the source code, I mostly did some refactoring to further reduce the diff (e.g., moving the definition of the As far as the github workflow goes, I've only briefly scanned it and the most recent run from my latest commit and noticed that it's not building the LDMS module currently. It looks like a stage of the workflow is explicitly cloning our main branch, whereas I think we want it to just build the code associated with the workflow (in this case, this PR). Maybe have a look at the existing runtime_ci.yml example to see how we are building Darshan directly there? At any rate, it'd be great if we could get the workflow to run on an LDMS-enabled version of Darshan before we merge so I can sanity check that it's not causing issues while running examples. We should also think about how we could modify the existing workflow to trip some sort of error if the Darshan-LDMS integration isn't working properly (e.g., the workflow is passing now even without LDMS support being built in Darshan). Doing the typical inspection of Darshan logs in this case doesn't tell much other than that Darshan didn't crash and produced some output. Is there anyway to "check" LDMS to see that it has ingested Darshan data? Maybe at the very least, we could enable the verbose option you have for the LDMS module and check that you see some of that output at runtime on stdout/stderr? Any other ideas? |
Thank you for making these changes and fixing the whitespaces. Yes, right now this Darshan-LDMS github workflow is set to clone the main Darshan branch because, at the time when I first created this PR, I figured the code would be merged into the main branch once the PR was approved so we would want to test that branch instead. I have been checking this workflow by running it in a separate branch (identical to this one) where the LDMS integrated Darshan code is cloned. However, I do agree that we should test in this PR to make sure we are all on the same page about the Darshan LDMS code so I will update this. As a general description of the workflow file, the LDMS and Darshan codes are cloned, installed and the environment variables are set. Then 3 tests are performed where the goal is to check that the application and Darshan are unaffected by any sort of LDMS related issue/error:
As for checking if LDMS is collecting Darshan data, the LDMS streams daemon that's created is also configured to store the Darshan data to a .csv file and this is checked if it exists and not empty (a new change that I need to update in this branch). There is also a log file that's created when and LDMS daemon is started so I can also cat that file so we can see if any errors show up. Now for checking if Darshan is working, I'm not currently doing anything else other than checking that the .darshan log files exist but I'll look into the runtime_ci.yml and see if there's anything there I could refer to. If you have any other suggestions then please let me know! As I mentioned earlier, there are some changes that I made to the workflow that I haven't updated in this branch. The reason why is because I've recently ran into an LDMS related issue with the latest LDMS code and have been trying to debug this since. So once that is all sorted out I'll update the workflow with these changes and have it clone this PR instead of the main branch. Thank you for the suggestions and input! |
Thanks for the update, @Snell1224 ! That all sounds great. The key thing I was interested in checking was that there was a way to ensure that LDMS is capturing the Darshan data properly. Sounds like you have a couple of different files/logs you can check for which is perfect.
I think this is sufficient, for now, probably. We have a more advanced regression testing system for darshan-runtime in the repo, but it's not currently running in any of our workflows (this is on my TODO list) so there's not an example to point to. I can always come back and modify this test to use that more thorough testing if we want, but I think a test that does the following is more than fine:
The main thing I think that would be useful to include here from darshan/.github/workflows/runtime_ci.yml Line 34 in c57f730
(That would allow the workflow to run against this PR now, but once it's merged into Let me know when you have the workflow ready, and I can review and approve it to run to ensure we're getting the expected result on this branch. After that, I think we can go ahead and merge all of this into |
@shanedsnyder Okay the test script is updated and ready to be reviewed! The script now builds the Darshan-LDMS integrated code and is triggered whenever this is pull/push on this branch. Apologies for the delay, that LDMS issue took a bit longer than expected to debug but I finally found out the problem and have a temporary workaround until it's addressed by the other LDMS developers. I've added some additional |
- only run against `main` branch - no need to setup python - no need to build/use hdf5/h5py/mpi4py
Awesome, I just reviewed and everything looks good. I made a few smaller simplifications (like removing HDF5 and python dependencies that aren't used), but that's all. Looks like the LDMS workflow is running and passing now, which is great to confirm. There are some workflow failures elsewhere, but those appear to be related to some PyDarshan issues not introduced in this PR -- I'm working on getting a fix in for those now in a separate PR. So, I went ahead and approved this -- I don't think I need any other specific changes from you, unless there was something you still wanted to add? I will add a note to our darshan-runtime docs about the new configure argument to enable LDMS (and will link to the docs you shared previously), but that's my last planned change before merging this PR in. Thanks a lot for the contribution and for working with us to get the code ready for merge! This is a really neat capability that I hope will lead to some interesting analysis of I/O behavior at runtime. |
Okay sounds good! And no there's nothing else that I need to add so it can be merged whenever you're ready! About the docs, I just emailed a version of the darshan-runtime doc with the added LDMS info so if you haven't started editing the document yet then feel free to refer to this! No problem! It was a great learning experience and thank you for reviewing and cleaning up the code! Yes, I'm also excited to see what analyses we can produce with this capability. |
EDIT: Pushed the changes. Everything is ready to go now. @shanedsnyder I actually did have one more thing to add which is a env "switch" so that users can choose to turn off/on the LDMS capability. Right now it automatically initializes a connection to the LDMS daemon (when enabled during Darshan build). I'm going to add the commit in a few to this branch and updating the darshan-runtime document that was emailed earlier. |
Looks good, only suggestion is that I think we want to use the new env variable you just added ( I'll try to push in the docs updates later today, thanks for sending those to me! |
The Although I still exported it in test case 3 for ease of mind. Please let me know if you have any questions about the documentation part! |
All sounds good. I just pushed in the docs changes you sent me with a couple small tweaks. I also fixed some of the unrelated CI failures in another PR, so hopefully this passes all checks. Assuming so, I'll go ahead and merge it in. Thanks again for the contribution! |
This includes LDMS API code that collects I/O event data in real time.
This PR is a work in progress and will be updated with the necessary test script and documentation.