[9.0] feat(Resources): introduce fabric in SSHCE #7703

Draft
wants to merge 1 commit into
base: integration
from


aldbr
Contributor

@aldbr aldbr commented Jun 27, 2024

Replace the DIRAC-specific SSH class with fabric.
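
For readers unfamiliar with fabric, here is a minimal sketch of what running a remote command through it can look like. It is not the code of this PR: `make_connection` and `run_remote` are hypothetical helper names, and the host, user and key file are placeholders. Only `fabric.Connection`, its `connect_kwargs` argument and `Connection.run(..., hide=..., warn=...)` come from fabric itself.

```python
from typing import Any, Tuple


def make_connection(host: str, user: str, key_file: str) -> Any:
    """Open a fabric Connection (hypothetical helper, not part of this PR)."""
    # fabric is imported lazily so that run_remote() can be exercised
    # with a stand-in object on machines where fabric is not installed.
    from fabric import Connection  # third-party: pip install fabric

    # connect_kwargs is forwarded to paramiko; key_filename selects the private key.
    return Connection(host, user=user, connect_kwargs={"key_filename": key_file})


def run_remote(connection: Any, command: str) -> Tuple[int, str, str]:
    """Run a command remotely and return (exit code, stdout, stderr)."""
    # hide=True keeps the remote output off the local terminal;
    # warn=True returns a Result on a non-zero exit instead of raising.
    result = connection.run(command, hide=True, warn=True)
    return result.exited, result.stdout, result.stderr
```

Usage would be along the lines of `run_remote(make_connection("ce.example.org", "dirac", "/path/to/key"), "hostname")`, where all three arguments are made-up values.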

BEGINRELEASENOTES
*Resources
CHANGE: Replace SSH by fabric in SSHComputingElement
ENDRELEASENOTES

@aldbr aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch 2 times, most recently from 206c55e to 6fbeaed on June 27, 2024 08:07
@fstagni
Contributor

fstagni commented Jun 28, 2024

Just a note: we do not (yet) have a way to run proper integration tests for the Computing Elements, but we could think about adding them to our integration test setup. It would be nice if this were part of this PR.
This would involve creating a "site" with a "CE" (yet another container) to which the SiteDirector could send pilots.

@aldbr aldbr linked an issue Jul 3, 2024 that may be closed by this pull request
@aldbr aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch from 6fbeaed to 1c60e47 on July 25, 2024 15:24
@aldbr
Contributor Author

aldbr commented Jul 25, 2024

> Just a note: we do not (yet) have a way to run proper integration tests for the Computing Elements, but we could think about adding them to our integration test setup. It would be nice if this were part of this PR. This would involve creating a "site" with a "CE" (yet another container) to which the SiteDirector could send pilots.

I agree it would be great to add integration tests for CEs, at least to test basic features. But it will likely become complex because:

  • if we want to test things properly, we need to set up a CE and a Batch System.
  • we will have to choose one configuration, but it might not reflect the configuration of the sites in production.

I will give it a try with the SSHCE, let's see.

@aldbr
Contributor Author

aldbr commented Nov 29, 2024

I wonder if it really makes sense to add CEs (and Batch Systems) to the integration tests: while it would be great to have a "grid in a box" in a controlled environment, it would be cumbersome to maintain in the long term and would not be representative of all the instances we can find out there (e.g. Arc v6, v6 with a hack, v7, transferring jobs to Slurm, HTCondor, SSH, SSH tunnel, HTCondor with local scheduler, with remote scheduler...).

It would probably make more sense to add some scripts to run during the hackathons. For each type of CE supported it would:

  • get all the instances related to the given type of CE and for each of them:
    • submit a "hello world" job
    • get the CE status
    • get the job status until it reaches a final state
    • get the job output and logging info (if available)

Basically, it would be very similar to (i) submitting pilots with the Site Director and (ii) checking their results manually. But it would be more focused on the CE interfaces and would be more automated (though a human would need to check whether errors come from the CE instance itself or the Dirac CE interface).
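
The per-CE steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the certification test itself: `check_ce` and `FINAL_STATES` are hypothetical names, the CE object is assumed to expose `submitJob`, `getCEStatus`, `getJobStatus` and `getJobOutput` returning DIRAC-style `{"OK": ..., "Value": ...}` / `{"OK": False, "Message": ...}` dictionaries, and the set of final states is a guess to be adapted.

```python
import time
from typing import Any, Dict

# Assumption: statuses considered final; adapt to the real status names.
FINAL_STATES = {"Done", "Failed", "Aborted"}


def check_ce(ce: Any, executable: str, poll_interval: float = 30.0, max_polls: int = 20) -> Dict[str, Any]:
    """Run the four checks against one CE instance and collect a report."""
    report: Dict[str, Any] = {"submitted": False, "ce_status": None, "job_status": None, "output": None}

    # 1. Submit a "hello world" job (no proxy, a single job: both are assumptions).
    result = ce.submitJob(executable, None, 1)
    if not result["OK"]:
        report["error"] = result["Message"]
        return report
    job_id = result["Value"][0]
    report["submitted"] = True

    # 2. Get the CE status (stored as-is so that a human can inspect failures).
    report["ce_status"] = ce.getCEStatus()

    # 3. Poll the job status until it reaches a final state or we run out of polls.
    for _ in range(max_polls):
        result = ce.getJobStatus([job_id])
        if result["OK"]:
            report["job_status"] = result["Value"].get(job_id)
            if report["job_status"] in FINAL_STATES:
                break
        time.sleep(poll_interval)

    # 4. Get the job output, if available.
    result = ce.getJobOutput(job_id)
    if result["OK"]:
        report["output"] = result["Value"]
    return report
```

Looping `check_ce` over all the instances of a given CE type would produce one report per instance, which a human could then review to tell CE-side failures from interface bugs.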

Any opinion, @fstagni?

@fstagni
Contributor

fstagni commented Nov 29, 2024

I think the only one that would make sense to set up here is the SSHCE. The others, the "proper Grid ones", cannot be tested here.

@aldbr
Contributor Author

aldbr commented Nov 29, 2024

I don't even know if testing SSHCE in an integration test makes sense. The only easy test we can set up would be SSHCE + Host, which is not representative of what we can have in production.

@fstagni
Contributor

fstagni commented Nov 29, 2024

OK OK, give up on the idea...

@aldbr
Contributor Author

aldbr commented Nov 29, 2024

I will add a certification test focused on the CE interfaces, as I explained (plus a card in the kanban board describing how to execute it). I will run it in the LHCb environment to make sure the changes in this PR are correct.

And I can also try to add a container that would act as a "Site" and use SSH + Host so that we can at least test the Site Director "in a box". Would it be okay?

@fstagni
Contributor

fstagni commented Nov 29, 2024

Sure, thanks

Development

Successfully merging this pull request may close these issues.

Replace the SSH class by a Python library?
3 participants