add test for TensorFlow #2
I've been trying to make a TensorFlow test that supports multinode (essentially: this tutorial). Our previous test used Horovod for that, but that's not a very nice approach: we want to test TensorFlow functionality with this test, and it doesn't make sense to pull in another framework just for multinode testing. In fact, we may actually want to verify that TensorFlow's native multinode support works. The TensorFlow code is pretty straightforward:
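The original code block did not survive; as a rough, hedged sketch of what such native multinode training looks like (following TensorFlow's "Multi-worker training with Keras" tutorial, with a synthetic dataset standing in for the real one):

```python
import numpy as np
import tensorflow as tf

# MultiWorkerMirroredStrategy discovers the other workers from the
# TF_CONFIG environment variable; each process must have it set before
# this call. With no TF_CONFIG it falls back to a single local worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created under the scope are mirrored across workers.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(4, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# Synthetic data; the real test would use an actual dataset.
x = np.random.rand(32, 8).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(x, y, epochs=1, batch_size=8, verbose=0)
```

The hard part is not this training code but getting a correct `TF_CONFIG` into every process, which is what the rest of the thread is about.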
The challenge that I have been struggling with for the better part of today is how to set […]. My solution was to use […]. That worked, except... if I set […]. I thought this could be resolved by running the parallel launcher multiple times in the test, as is done in this ReFrame tutorial. If we keep the same default arguments to the parallel launcher, the number of processes and their distribution over the nodes should be consistent. That enables us to create a mapping from global ranks to local ranks. Based on that, we can then set […]. One alternative is just to say "this test only supports SLURM" (we could maybe implement a 'skip' if it's not run with SLURM), and use SLURM environment variables […].
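A hypothetical sketch of that SLURM-only alternative: have each rank build its own `TF_CONFIG` from SLURM environment variables. The variable names (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_JOB_NODELIST`) are standard SLURM, but the nodelist parsing below is deliberately simplified (a real test would expand compressed ranges like `node[01-04]` via `scontrol show hostnames`), and the port scheme is an assumption:

```python
import json
import os

def make_tf_config(port: int = 23456) -> str:
    """Return a TF_CONFIG JSON string for the current SLURM task."""
    # Simplified: assumes a plain comma-separated nodelist, no ranges.
    nodes = os.environ["SLURM_JOB_NODELIST"].split(",")
    ntasks = int(os.environ["SLURM_NTASKS"])
    rank = int(os.environ["SLURM_PROCID"])
    tasks_per_node = ntasks // len(nodes)
    # One TF "worker" entry per task, spread evenly over the nodes,
    # with a distinct port per local task on the same node.
    workers = [
        f"{nodes[i // tasks_per_node]}:{port + i % tasks_per_node}"
        for i in range(ntasks)
    ]
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": rank},
    })

# Each rank would then do, before creating the strategy:
#   os.environ["TF_CONFIG"] = make_tf_config()
```

Because every rank computes the same worker list from the same SLURM variables, the cluster spec is consistent across processes, and `SLURM_PROCID` directly provides the global rank.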
could […]
Hm, would be interesting to see if that still does something after […]
Unless I'm overlooking something, that should also work. But if I can solve it without a wrapper script, simply through the TensorFlow API, that might be less messy / more elegant. I'll give it a go.
Ok, that works. I'm now using […]
I only get an error (this one) on process teardown. Annoying and a bit ugly, but not a dealbreaker: the test has run completely by then. Still have to test it on multi-node; I only did 2 GPUs on a single node for now. Currently waiting in the queue for a multi-node test...
The 2-node, 8-GPU test also completed successfully (though it produces the same error on process teardown, of course).
maybe try calling […]
As far as I know, […]
Tried calling […]
Oh, also, the documentation of […]
chatgpt suggests the following solution :)
you might have to change the string "worker" into something more unique for TensorFlow. EDIT: this will probably not work for workers on other nodes...
closing since #38 got merged
For example, based on https://github.com/EESSI/eessi-demo/tree/main/TensorFlow