TensorFlow tests may need to run on an even number of cores. #2600
Comments
Maybe we can force the tests to run on an even number of cores? So if 7 cores are available, run on 6? If we can confirm this, though, it seems weird (but I guess I shouldn't be surprised). |
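As a rough illustration of the "7 available, run on 6" idea above, here is a minimal sketch. The function name `round_down_to_even` is hypothetical and not part of EasyBuild or TensorFlow; it only shows the arithmetic of picking the largest even core count that does not exceed what is available.

```python
def round_down_to_even(cores: int) -> int:
    """Return the largest even number <= cores (hypothetical helper).

    A single core is returned unchanged, since rounding down would give 0.
    """
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    if cores == 1:
        return 1
    # Subtract the remainder modulo 2: odd counts lose one core, even counts are kept.
    return cores - (cores % 2)


if __name__ == "__main__":
    for available in (1, 7, 8, 14, 17):
        print(available, "->", round_down_to_even(available))
```

Whether an even core count is actually the right condition is not confirmed in this thread, so this is only a sketch of the proposal, not a verified fix.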
Can I have some log files with the exact error, to help find out what is wrong? We could also report it as an issue to TF upstream to see what they say about this, which would give us something to reference in the upcoming workaround. |
Do you want the complete (debug) build log or just the snippet around interleave_test? |
Extra tests from me:
It looks like a minimum of 8 cores is needed. |
Yeah, that might be the problem. Running on 17 cores works, so it's neither a pure even/odd problem nor a prime number problem. |
Error seen (from
And an almost identical error for shard 9 of 24, with only the top few lines being different:
|
The "simple" solution here might be to add a min_parallel config option that we can set to 8 in the easyconfig, and that the framework can then check for. |
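A minimal sketch of what such a check could look like, assuming a `min_parallel` value as proposed above. Neither `min_parallel` nor the names `InsufficientCoresError` and `check_min_parallel` exist in EasyBuild; they are hypothetical and only illustrate the idea of failing early when too few cores are available.

```python
class InsufficientCoresError(RuntimeError):
    """Raised when fewer cores are available than required (hypothetical)."""


def check_min_parallel(parallel: int, min_parallel: int) -> None:
    """Fail early if `parallel` is below the required minimum.

    `parallel` is the number of cores the build would use; `min_parallel`
    is the hypothetical minimum that would be set in the easyconfig.
    """
    if min_parallel < 1:
        raise ValueError("min_parallel must be a positive integer")
    if parallel < min_parallel:
        raise InsufficientCoresError(
            f"{parallel} cores available, but at least {min_parallel} are required"
        )


if __name__ == "__main__":
    check_min_parallel(14, 8)  # passes silently
    try:
        check_min_parallel(7, 8)
    except InsufficientCoresError as err:
        print("check failed:", err)
```

The value 8 used here comes from the observation earlier in the thread that a minimum of 8 cores seems to be needed; it has not been confirmed as the exact threshold.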
Or the easyblock could detect |
the problem is probably due to multiple uses of this may be fixed by replacing with |
@smoors did you make a PR for that? |
not yet, too busy trying to make TF-2.7.1 pass the tests. |
there's quite a lot of places where is used, so not sure what's the best way to deal with this. I'm thinking to do a brute-force find/replace in the source tree, but maybe that's a bit too brute?
|
When doing a CPU-only build of TF 2.6.0 in a batch job with 7 cores, this test repeatedly failed:
//tensorflow/python/data/kernel_tests:interleave_test
When running it on 14 cores the test passed.
I haven't had time to verify this fully, but we might need to have the easyblock check that an even number of cores is available so that this test does not fail.
@Flamefire thoughts on this?