LDMS implementation for runtime I/O collection #834

Snell1224 · 2022-10-19T22:31:55Z

This includes LDMS API code that collects I/O event data in real time.
This PR is a work in progress and will be updated with the necessary test script and documentation.

Snell1224 · 2022-11-16T23:09:17Z

Branch is now updated with a CI github test. This tests the Darshan-LDMS integrated code by running a simple mpi-io-test application and checks that both Darshan and LDMS are collecting I/O events. It then checks for completion of the run and generated Darshan log file when LDMS is not functioning properly (i.e. env variables not set, cannot connect to daemon).

Right now, the CI test clones and installs Darshan from my repository (i.e. Snell1224/darshanConnector) to test the LDMS implemented code.

Once this PR has been reviewed and/or approved to merge into the main branch, I will update the CI test accordingly (i.e. install Darshan from the main branch of this repository). If it's better to leave this PR as a separate branch and not merge the changes, that's fine too. I will also update the CI test accordingly (i.e. test will trigger and the Darshan will be installed from the darshanConnector branch of this repository).

Documentation on how to run the Darshan-LDMS Integrated code (i.e. darshanConnector) can be found here: https://ovis-hpcreadthedocs.readthedocs.io/en/latest/ldms-streams.html

darshan-runtime/lib/darshan.h

shanedsnyder

Can we move this LDMS functionality to a new darshan-ldms.c source file, rather than in the DXT source here?

The DXT module currently just piggybacks off of our POSIX and MPI-IO modules to provide more detailed tracing, similar to what you want to with the LDMS connector. But, there isn't really any integration between LDMS/DXT that I can see here (all calls to LDMS are in the POSIX module, in the MPI-IO module, etc.), so I'm not sure this is the best spot for the code responsible for talking with LDMS. Pushing as much code to a standalone LDMS source file as possible and trying to limit LDMS-specific code in other source files will probably make maintenance a little easier on our end.

Snell1224 · 2023-02-03T19:12:55Z

The LDMS functionality has been moved to a new darshan-ldms.c source file. Please let me know if there are any issues with these changes.

shanedsnyder · 2023-02-13T20:55:34Z

Thanks for the updates! I'll do my best to make another pass over this this week and see if I have more comments related to the timer issue.

darshan-runtime/lib/darshan-core.c

darshan-runtime/lib/darshan-ldms.c

shanedsnyder · 2023-02-24T17:47:55Z

I just added a bunch of review comments. Mostly just another pass to see how to tidy up the integration overall, but let me know if you have any questions or if anything seems unreasonable.

I still need to think more on the timespec issue to see what might be the cleanest solution there, but I'm sure we can figure something out.

* this allows `pip install -e .` to produce a working editable install locally * the basis for the `setuptools` pin here is based on reading through pypa/setuptools#3548; in short, looks like a behavior change in `setuptools` so let's just pin it for now * since this only affects Python developers for our project I've not added any "regression test"/"CI test" for it, but feel free to add one editable job to the CI if you want

Snell1224 · 2023-03-08T20:49:13Z

Made the necessary changes to jobid, uid, and exename and cleaned up the LDMS code.

The DARSHAN_LDMS_HOST is different from gethostname().
DARSHAN_LDMS_HOST: Used to connect to the ldmsd (I.e. what host the ldms daemon is running on)
gethostname(): Used for identifying the producer name (i.e. what host the application is running on).

However, after looking at setup_connection(), you are correct that gethostname() should not be here. Not only is this the wrong function call (needs to find the host of the ldms daemon and not the application) but the darshan_ldms_connector_initialized() already checks if DARSHAN_LDMS_HOST is set before calling setup_connection(). This chunk of code has been removed

Thank you for reviewing this! Please let me know if there is any other ldms code that can be cleaned up or doesn't make sense.

darshan-runtime/lib/darshan-hdf5.c

* move various LDMS code so it's contiguous where possible * various indentation/whitespace changes * remove unused dC vars * fix source copyright

shanedsnyder · 2023-11-30T23:30:22Z

I pushed some small updates over the last day or so, and think I'm ready to sign off on things. For the source code, I mostly did some refactoring to further reduce the diff (e.g., moving the definition of the extern dC variable into the LDMS header so it doesn't have to be repeated everywhere), and also made some whitespace changes to make sure indentations were consistent. I also made some relatively minor updates to the autotools files to simplify and cleanup the output from configure.

As far as the github workflow goes, I've only briefly scanned it and the most recent run from my latest commit and noticed that it's not building the LDMS module currently. It looks like a stage of the workflow is explicitly cloning our main branch, whereas I think we want it to just build the code associated with the workflow (in this case, this PR). Maybe have a look at the existing runtime_ci.yml example to see how we are building Darshan directly there?

At any rate, it'd be great if we could get the workflow to run on an LDMS-enabled version of Darshan before we merge so I can sanity check that it's not causing issues while running examples. We should also think about how we could modify the existing workflow to trip some sort of error if the Darshan-LDMS integration isn't working properly (e.g., the workflow is passing now even without LDMS support being built in Darshan). Doing the typical inspection of Darshan logs in this case doesn't tell much other than that Darshan didn't crash and produced some output. Is there anyway to "check" LDMS to see that it has ingested Darshan data? Maybe at the very least, we could enable the verbose option you have for the LDMS module and check that you see some of that output at runtime on stdout/stderr? Any other ideas?

Snell1224 · 2023-12-05T21:21:42Z

Doing the typical inspection of Darshan logs in this case doesn't tell much other than that Darshan didn't crash and produced some output. I

Thank you for making these changes and fixing the whitespaces.

Yes, right now this Darshan-LDMS github workflow is set to clone the main Darshan branch because, at the time when I first created this PR, I figured the code would be merged into the main branch once the PR was approved so we would want to test that branch instead. I have been checking this workflow by running it in a separate branch (identical to this one) where the LDMS integrated Darshan code is cloned. However, I do agree that we should test in this PR to make sure we are all on the same page about the Darshan LDMS code so I will update this.

As a general description of the workflow file, the LDMS and Darshan codes are cloned, installed and the environment variables are set. Then 3 tests are performed where the goal is to check that the application and Darshan are unaffected by any sort of LDMS related issue/error:

LDMS streams daemon is connected and env variables are set
LDMS env variables are not set
LDMS streams daemon does not exist (not connected or started)

As for checking if LDMS is collecting Darshan data, the LDMS streams daemon that's created is also configured to store the Darshan data to a .csv file and this is checked if it exists and not empty (a new change that I need to update in this branch). There is also a log file that's created when and LDMS daemon is started so I can also cat that file so we can see if any errors show up.

Now for checking if Darshan is working, I'm not currently doing anything else other than checking that the .darshan log files exist but I'll look into the runtime_ci.yml and see if there's anything there I could refer to. If you have any other suggestions then please let me know!

As I mentioned earlier, there are some changes that I made to the workflow that I haven't updated in this branch. The reason why is because I've recently ran into an LDMS related issue with the latest LDMS code and have been trying to debug this since. So once that is all sorted out I'll update the workflow with these changes and have it clone this PR instead of the main branch.

Thank you for the suggestions and input!

shanedsnyder · 2023-12-06T17:10:21Z

Thanks for the update, @Snell1224 ! That all sounds great. The key thing I was interested in checking was that there was a way to ensure that LDMS is capturing the Darshan data properly. Sounds like you have a couple of different files/logs you can check for which is perfect.

Now for checking if Darshan is working, I'm not currently doing anything else other than checking that the .darshan log files exist but I'll look into the runtime_ci.yml and see if there's anything there I could refer to. If you have any other suggestions then please let me know!

I think this is sufficient, for now, probably. We have a more advanced regression testing system for darshan-runtime in the repo, but it's not currently running in any of our workflows (this is on my TODO list) so there's not an example to point to. I can always come back and modify this test to use that more thorough testing if we want, but I think a test that does the following is more than fine:

ensure application runs fine, terminates without error
ensure LDMS properly collects data/logs as expected
ensure Darshan generates a log and that darshan-parser can parse it without error

The main thing I think that would be useful to include here from runtime_ci.yml is the code for making sure we build the current branch (rather than main):

darshan/.github/workflows/runtime_ci.yml

Line 34 in c57f730

- name: Install darshan-runtime

(That would allow the workflow to run against this PR now, but once it's merged into main, subsequent workflows will target main, if that makes sense.)

Let me know when you have the workflow ready, and I can review and approve it to run to ensure we're getting the expected result on this branch. After that, I think we can go ahead and merge all of this into main.

Snell1224 · 2023-12-13T20:44:15Z

@shanedsnyder Okay the test script is updated and ready to be reviewed! The script now builds the Darshan-LDMS integrated code and is triggered whenever this is pull/push on this branch.

Apologies for the delay, that LDMS issue took a bit longer than expected to debug but I finally found out the problem and have a temporary workaround until it's addressed by the other LDMS developers.

I've added some additional ls -ltrch commands for the log file directory (towards the bottom of the file) so that we can see all the Darshan log files, their parsed output files, LDMS output and LDMS csv file. Please feel free to make any edits to this .yaml file and/or let me know if there's any issues or questions!

- only run against `main` branch - no need to setup python - no need to build/use hdf5/h5py/mpi4py

shanedsnyder · 2023-12-15T20:17:48Z

Awesome, I just reviewed and everything looks good. I made a few smaller simplifications (like removing HDF5 and python dependencies that aren't used), but that's all. Looks like the LDMS workflow is running and passing now, which is great to confirm. There are some workflow failures elsewhere, but those appear to be related to some PyDarshan issues not introduced in this PR -- I'm working on getting a fix in for those now in a separate PR.

So, I went ahead and approved this -- I don't think I need any other specific changes from you, unless there was something you still wanted to add? I will add a note to our darshan-runtime docs about the new configure argument to enable LDMS (and will link to the docs you shared previously), but that's my last planned change before merging this PR in.

Thanks a lot for the contribution and for working with us to get the code ready for merge! This is a really neat capability that I hope will lead to some interesting analysis of I/O behavior at runtime.

Snell1224 · 2023-12-18T19:31:53Z

So, I went ahead and approved this -- I don't think I need any other specific changes from you, unless there was something you still wanted to add? I will add a note to our darshan-runtime docs about the new configure argument to enable LDMS (and will link to the docs you shared previously), but that's my last planned change before merging this PR in.

Thanks a lot for the contribution and for working with us to get the code ready for merge! This is a really neat capability that I hope will lead to some interesting analysis of I/O behavior at runtime.

Okay sounds good! And no there's nothing else that I need to add so it can be merged whenever you're ready!

About the docs, I just emailed a version of the darshan-runtime doc with the added LDMS info so if you haven't started editing the document yet then feel free to refer to this!

No problem! It was a great learning experience and thank you for reviewing and cleaning up the code! Yes, I'm also excited to see what analyses we can produce with this capability.

Snell1224 · 2023-12-19T20:58:46Z

EDIT: Pushed the changes. Everything is ready to go now.

@shanedsnyder I actually did have one more thing to add which is a env "switch" so that users can choose to turn off/on the LDMS capability. Right now it automatically initializes a connection to the LDMS daemon (when enabled during Darshan build). I'm going to add the commit in a few to this branch and updating the darshan-runtime document that was emailed earlier.

shanedsnyder · 2023-12-20T17:04:29Z

Looks good, only suggestion is that I think we want to use the new env variable you just added (DARSHAN_LDMS_ENABLE) in test case 3 for your github workflow, as in test case 1? To ensure nothing bad happens when trying to connect to an LDMS daemon that has shut down?

I'll try to push in the docs updates later today, thanks for sending those to me!

Snell1224 · 2023-12-20T19:10:34Z

Looks good, only suggestion is that I think we want to use the new env variable you just added (DARSHAN_LDMS_ENABLE) in test case 3 for your github workflow, as in test case 1? To ensure nothing bad happens when trying to connect to an LDMS daemon that has shut down?

The DARSHAN_LDMS_ENABLE is already set for test case 3 because it's under the same step name: "Test Preparation and Run" and it is not being unset in test case 2. From what I understand, github actions keeps the exported variables until that step is complete. I did re-run the test to double check this and, for test case 3, LDMS threw a connection error (Error 111) and exited but there were no runtime issues with the program or Darshan.

Although I still exported it in test case 3 for ease of mind.

Please let me know if you have any questions about the documentation part!

shanedsnyder · 2023-12-21T19:43:21Z

All sounds good. I just pushed in the docs changes you sent me with a couple small tweaks. I also fixed some of the unrelated CI failures in another PR, so hopefully this passes all checks. Assuming so, I'll go ahead and merge it in. Thanks again for the contribution!

Snell1224 added 2 commits October 19, 2022 16:31

LDMS implementation for runtime I/O collection

e8fff92

CI: Test for LDMS Integration

3f78ec9

github-actions bot added CI continuous integration pydarshan labels Nov 16, 2022

Merge branch 'main' into darshanConnector

7fd1142

github-actions bot removed the pydarshan label Nov 16, 2022

Updating branch with main

5ea56d0

github-actions bot added the pydarshan label Nov 28, 2022

Merge branch 'main' into darshanConnector

658859c

github-actions bot removed the pydarshan label Nov 28, 2022

Merge branch 'darshan-hpc:main' into darshanConnector

2f5bfda

shanedsnyder reviewed Jan 13, 2023

View reviewed changes

darshan-runtime/lib/darshan.h Outdated Show resolved Hide resolved

shanedsnyder requested changes Jan 17, 2023

View reviewed changes

moving ldms code to separate file

4451774

shanedsnyder requested changes Feb 24, 2023

View reviewed changes

darshan-runtime/lib/darshan-core.c Outdated Show resolved Hide resolved

darshan-runtime/lib/darshan-core.c Outdated Show resolved Hide resolved

darshan-runtime/lib/darshan-core.c Outdated Show resolved Hide resolved

darshan-runtime/lib/darshan-ldms.c Outdated Show resolved Hide resolved

shanedsnyder requested changes Feb 24, 2023

View reviewed changes

darshan-runtime/lib/darshan-ldms.c Outdated Show resolved Hide resolved

darshan-runtime/lib/darshan-ldms.c Outdated Show resolved Hide resolved

github-actions bot added the pydarshan label Mar 8, 2023

tylerjereddy and others added 3 commits March 8, 2023 13:42

updating github workflow

7cf38ea

merging main and cleaning up ldms related code

3859340

Snell1224 force-pushed the darshanConnector branch from 839612c to 3859340 Compare March 8, 2023 20:43

Merge branch 'darshan-hpc:main' into darshanConnector

2d83637

github-actions bot removed the pydarshan label Mar 8, 2023

Snell1224 and others added 2 commits March 14, 2023 14:15

Merge branch 'darshan-hpc:main' into darshanConnector

6dd84cb

removing ldms mdhim and updating ldms event fields

b80d878

Snell1224 and others added 2 commits November 6, 2023 14:20

fixing close counter

107a901

Merge branch 'darshan-hpc:main' into darshanConnector

d0bc9fd

shanedsnyder requested changes Nov 28, 2023

View reviewed changes

darshan-runtime/lib/darshan-hdf5.c Outdated Show resolved Hide resolved

Snell1224 and others added 4 commits November 28, 2023 15:51

fixing hdf5, rename variables and fixing spaces

2600470

whitespace changes and other small things

8732926

* move various LDMS code so it's contiguous where possible * various indentation/whitespace changes * remove unused dC vars * fix source copyright

declare dC Darshan connector as extern in header

a122878

LDMS build changes

b4c1b14

test script fixes and updates to LDMS output

9e6225d

shanedsnyder added 4 commits December 15, 2023 11:44

try to run LDMS workflow on PRs to main

3b7ee14

simplify ldms workflows slightly

fe594b5

- only run against `main` branch - no need to setup python - no need to build/use hdf5/h5py/mpi4py

more openmpi libs needed

054919c

use darshan-parser, not darshan-dxt-parser

a74ac94

shanedsnyder previously approved these changes Dec 15, 2023

View reviewed changes

adding LDMS switch and fixing env checks/messages

8d59bc6

Snell1224 dismissed shanedsnyder’s stale review via 8d59bc6 December 20, 2023 06:05

Update darshan_ldms_test_ci.yml

dc807f6

updated runtime docs for LDMS

35147e3

shanedsnyder approved these changes Dec 21, 2023

View reviewed changes

shanedsnyder merged commit cd84ee5 into darshan-hpc:main Dec 21, 2023
17 of 18 checks passed

shanedsnyder added this to the 3.4.5 milestone Apr 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LDMS implementation for runtime I/O collection #834

LDMS implementation for runtime I/O collection #834

Snell1224 commented Oct 19, 2022

Snell1224 commented Nov 16, 2022 •

edited

Loading

shanedsnyder left a comment

Snell1224 commented Feb 3, 2023

shanedsnyder commented Feb 13, 2023

shanedsnyder commented Feb 24, 2023

Snell1224 commented Mar 8, 2023 •

edited

Loading

shanedsnyder commented Nov 30, 2023

Snell1224 commented Dec 5, 2023 •

edited

Loading

shanedsnyder commented Dec 6, 2023

Snell1224 commented Dec 13, 2023

shanedsnyder commented Dec 15, 2023

Snell1224 commented Dec 18, 2023 •

edited

Loading

Snell1224 commented Dec 19, 2023 •

edited

Loading

shanedsnyder commented Dec 20, 2023

Snell1224 commented Dec 20, 2023 •

edited

Loading

shanedsnyder commented Dec 21, 2023

LDMS implementation for runtime I/O collection #834

LDMS implementation for runtime I/O collection #834

Conversation

Snell1224 commented Oct 19, 2022

Snell1224 commented Nov 16, 2022 • edited Loading

shanedsnyder left a comment

Choose a reason for hiding this comment

Snell1224 commented Feb 3, 2023

shanedsnyder commented Feb 13, 2023

shanedsnyder commented Feb 24, 2023

Snell1224 commented Mar 8, 2023 • edited Loading

shanedsnyder commented Nov 30, 2023

Snell1224 commented Dec 5, 2023 • edited Loading

shanedsnyder commented Dec 6, 2023

Snell1224 commented Dec 13, 2023

shanedsnyder commented Dec 15, 2023

Snell1224 commented Dec 18, 2023 • edited Loading

Snell1224 commented Dec 19, 2023 • edited Loading

shanedsnyder commented Dec 20, 2023

Snell1224 commented Dec 20, 2023 • edited Loading

shanedsnyder commented Dec 21, 2023

Snell1224 commented Nov 16, 2022 •

edited

Loading

Snell1224 commented Mar 8, 2023 •

edited

Loading

Snell1224 commented Dec 5, 2023 •

edited

Loading

Snell1224 commented Dec 18, 2023 •

edited

Loading

Snell1224 commented Dec 19, 2023 •

edited

Loading

Snell1224 commented Dec 20, 2023 •

edited

Loading