TensorFlow tests may require to be run on an even number of cores. #2600

akesandgren · 2021-10-14T05:40:53Z

A CPU-only build of TF 2.6.0 in a batch job with 7 cores repeatedly failed the test
//tensorflow/python/data/kernel_tests:interleave_test
When running on 14 cores, the test passed.

I haven't had time to verify this fully, but we might need to have the easyblock check that an even number of cores is available so that this test does not fail.

@Flamefire thoughts on this?

boegel · 2021-10-14T06:50:08Z

Maybe we can force the tests to run on an even number of cores? So if 7 cores are available, run on 6?
That only leaves a single-core installation of TensorFlow as a problematic case.

If we can confirm this, though, it seems weird (but I guess I shouldn't be surprised).

Flamefire · 2021-10-18T10:04:34Z

Can I have some log files with the exact error, to maybe find out what is wrong? We could also report it as an issue to TF upstream to see what they say about this, and to have something to reference in the upcoming workaround.

akesandgren · 2021-10-18T14:59:17Z

Do you want the complete (debug) build log or just the snippet around interleave_test?

branfosj · 2021-10-18T15:01:18Z

Extra tests from me:

6 cores: fail
7 cores: fail
8 cores: success
9 cores: success

It looks like a minimum of 8 cores is needed.

akesandgren · 2021-10-19T05:55:14Z

Yeah, that might be the problem. Running on 17 cores works, so it's not purely an even/odd problem, nor a prime-number problem.

branfosj · 2021-10-19T07:32:24Z

Error seen (from TensorFlow/2.6.0/foss-2021a/tmpkuca27g9-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/data/kernel_tests/interleave_test/shard_1_of_24/test.log):

======================================================================
FAIL: testInterleaveDataset_test_mode_graph_tfapiversion_2_blocklength_3_cyclelength_None_inputvalues_4567_numparallelcalls_None (__main__.InterleaveTest)
InterleaveTest.testInterleaveDataset_test_mode_graph_tfapiversion_2_blocklength_3_cyclelength_None_inputvalues_4567_numparallelcalls_None
testInterleaveDataset_test_mode_graph_tfapiversion_2_blocklength_3_cyclelength_None_inputvalues_4567_numparallelcalls_None(mode='graph', tf_api_version=2, block_length=3, cycle_length=None, input_values=array([4, 5, 6, 7]), num_parallel_calls=None)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/absl/testing/parameterized.py", line 314, in bound_param_test
    return test_method(self, **testcase_params)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/TensorFlow/2.6.0/foss-2021a/tmpzwc1qwnh-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/kernel_tests/interleave_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 366, in decorated
    execute_test_method()
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/TensorFlow/2.6.0/foss-2021a/tmpzwc1qwnh-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/kernel_tests/interleave_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 349, in execute_test_method
    test_method(**kwargs_to_pass)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/TensorFlow/2.6.0/foss-2021a/tmpzwc1qwnh-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/kernel_tests/interleave_test.runfiles/org_tensorflow/tensorflow/python/data/kernel_tests/interleave_test.py", line 197, in testInterleaveDataset
    self.assertDatasetProduces(dataset, expected_output)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/TensorFlow/2.6.0/foss-2021a/tmpzwc1qwnh-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/kernel_tests/interleave_test.runfiles/org_tensorflow/tensorflow/python/data/kernel_tests/test_base.py", line 232, in assertDatasetProduces
    self._compareOutputToExpected(result, expected_output, assert_items_equal)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/TensorFlow/2.6.0/foss-2021a/tmpzwc1qwnh-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/kernel_tests/interleave_test.runfiles/org_tensorflow/tensorflow/python/data/kernel_tests/test_base.py", line 151, in _compareOutputToExpected
    self.assertValuesEqual(expected_value, result_value)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/TensorFlow/2.6.0/foss-2021a/tmpzwc1qwnh-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/kernel_tests/interleave_test.runfiles/org_tensorflow/tensorflow/python/data/kernel_tests/test_base.py", line 92, in assertValuesEqual
    self.assertAllEqual(expected, actual)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/TensorFlow/2.6.0/foss-2021a/tmpzwc1qwnh-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/kernel_tests/interleave_test.runfiles/org_tensorflow/tensorflow/python/framework/test_util.py", line 1275, in decorated
    return f(*args, **kwds)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/TensorFlow/2.6.0/foss-2021a/tmpzwc1qwnh-bazel-tf/0db3977216c8a32952b67b0996300dd3/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/kernel_tests/interleave_test.runfiles/org_tensorflow/tensorflow/python/framework/test_util.py", line 2979, in assertAllEqual
    np.testing.assert_array_equal(a, b, err_msg="\n".join(msgs))
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 932, in assert_array_equal
    assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 842, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal

not equal lhs = array(6)
not equal rhs = array(4)
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 2
Max relative difference: 0.5
 x: array(6)
 y: array(4)

----------------------------------------------------------------------

And an almost identical error for shard 9 of 24, with only the top few lines being different:

FAIL: testInterleaveDataset_test_mode_eager_tfapiversion_2_blocklength_3_cyclelength_None_inputvalues_4567_numparallelcalls_None (__main__.InterleaveTest)
InterleaveTest.testInterleaveDataset_test_mode_eager_tfapiversion_2_blocklength_3_cyclelength_None_inputvalues_4567_numparallelcalls_None
testInterleaveDataset_test_mode_eager_tfapiversion_2_blocklength_3_cyclelength_None_inputvalues_4567_numparallelcalls_None(mode='eager', tf_api_version=2, block_length=3, cycle_length=None, input_values=array([4, 5, 6, 7]), num_parallel_calls=None)

akesandgren · 2021-10-19T09:28:39Z

The "simple" solution here would maybe be to add a min_parallel config option that we can set to 8 in the easyconfig, which the framework can then check for.

branfosj · 2021-10-19T09:38:26Z

Or the easyblock could detect < 8 cores and disable this test in that case, with a suitable warning.

smoors · 2022-02-10T17:14:19Z

the problem is probably due to the multiple uses of multiprocessing.cpu_count() in interleave_test, which reports the number of cores in the machine and so gives a wrong result if not all cores are allocated to the job.
see for example here: https://github.com/tensorflow/tensorflow/blob/39ff146388eb4b7f63d8e4c94786edf94433bb60/tensorflow/python/data/kernel_tests/interleave_test.py#L60

this may be fixed by replacing it with len(os.sched_getaffinity(0))
(updated: fixed obvious mistake in replacement)
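
To illustrate the difference: multiprocessing.cpu_count() reports the number of CPUs in the system, while os.sched_getaffinity(0) returns the set of CPUs the current process is allowed to run on. Whether the two differ inside a batch job depends on how the scheduler confines the job, which is an assumption here. The helper name usable_cpu_count is hypothetical, and the fallback is there because os.sched_getaffinity is not available on all platforms:

```python
import multiprocessing
import os


def usable_cpu_count():
    """Number of CPUs this process may run on (hypothetical helper)."""
    if hasattr(os, "sched_getaffinity"):
        # Set of CPU ids the current process (pid 0 means "self") is restricted to.
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity.
    return multiprocessing.cpu_count()


if __name__ == "__main__":
    print("cpu_count():       ", multiprocessing.cpu_count())
    print("usable_cpu_count():", usable_cpu_count())
```

Running this inside a job that was given only part of a node should show whether the two numbers disagree on a given system.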

akesandgren · 2022-02-15T12:18:45Z

@smoors did you make a PR for that?

smoors · 2022-02-15T12:24:29Z

not yet, too busy trying to make TF-2.7.1 pass the tests.

smoors · 2022-02-21T21:31:28Z

there are quite a lot of places where multiprocessing.cpu_count() is used, so not sure what's the best way to deal with this. I'm thinking of doing a brute-force find/replace in the source tree, but maybe that's a bit too brute?
TF-2.6.0:

tensorflow/tools/test/system_info_lib.py:  cpu_info.num_cores = multiprocessing.cpu_count()
tensorflow/python/keras/integration_test/parameter_server_custom_training_loop_test.py:    if multiprocessing.cpu_count() < num_workers + 1:
tensorflow/python/keras/integration_test/parameter_server_keras_preprocessing_test.py:  if multiprocessing.cpu_count() < num_workers + 1:
tensorflow/python/data/ops/dataset_ops.py:      local_shard_func = lambda index, _: index % multiprocessing.cpu_count()
tensorflow/python/data/ops/dataset_ops.py:          cycle_length=multiprocessing.cpu_count(),
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/snapshot_test.py:        num_snapshot_shards_per_run=multiprocessing.cpu_count())
tensorflow/python/data/kernel_tests/interleave_test.py:      cycle_length = multiprocessing.cpu_count()
tensorflow/python/data/kernel_tests/interleave_test.py:      cycle_length = (multiprocessing.cpu_count() + 2) // 3
tensorflow/python/data/kernel_tests/interleave_test.py:      cycle_length = min(num_parallel_calls, multiprocessing.cpu_count())
tensorflow/lite/tools/pip_package/setup.py:    return multiprocessing.cpu_count()
tensorflow/python/data/experimental/ops/io.py:          cycle_length=multiprocessing.cpu_count(),
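
The brute-force replacement could be sketched roughly as follows. This is only an illustration under assumptions: the function name patch_cpu_count is hypothetical, the replacement is purely textual, and "import os" is only added next to a top-level "import multiprocessing" line, so files that import things differently would need manual attention:

```python
import re

OLD_CALL = "multiprocessing.cpu_count()"
NEW_CALL = "len(os.sched_getaffinity(0))"


def patch_cpu_count(source):
    """Textually replace cpu_count() calls in one file's source (hypothetical helper)."""
    if OLD_CALL not in source:
        return source
    patched = source.replace(OLD_CALL, NEW_CALL)
    # Add "import os" after the first top-level "import multiprocessing",
    # unless the module already has a top-level "import os" line.
    if not re.search(r"^import os$", patched, flags=re.MULTILINE):
        patched = re.sub(
            r"^import multiprocessing$",
            "import multiprocessing\nimport os",
            patched,
            count=1,
            flags=re.MULTILINE,
        )
    return patched
```

Applying this over the whole tree would mean reading each listed file, passing its contents through the function, and writing the result back; a patch file limited to the affected tests is the less invasive alternative.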

smoors · 2022-06-23T06:08:23Z

fix for interleave_test: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.7.1_fix_cpu_count.patch

akesandgren added the bug report label Oct 14, 2021

boegel added this to the 4.x milestone Oct 14, 2021

boegel modified the milestones: 4.x, release after 4.5.0 Oct 27, 2021

boegel modified the milestones: 4.5.1, release after 4.5.1 Dec 7, 2021

boegel modified the milestones: 4.5.2, release after 4.5.2 Jan 14, 2022

boegel modified the milestones: 4.5.3, release after 4.5.3 Feb 9, 2022

boegel modified the milestones: next release (4.5.4), release after 4.5.4 Mar 25, 2022

boegel modified the milestones: 4.5.5, release after 4.5.5 Jun 4, 2022

boegel modified the milestones: next release (4.6.0), release after 4.6.0 Jul 6, 2022

boegel removed this from the next release (4.6.1) milestone Sep 9, 2022

boegel added this to the release after 4.6.1 milestone Sep 9, 2022

boegel modified the milestones: next release (4.6.2?), release after 4.6.2 Oct 18, 2022

boegel modified the milestones: next release (4.7.0), release after 4.7.0 Dec 20, 2022

boegel modified the milestones: next release (4.7.1), release after 4.7.1 Mar 1, 2023

boegel modified the milestones: next release (4.7.2), release after 4.7.2 Apr 12, 2023

boegel modified the milestones: 4.7.3, release after 4.7.3 Jul 6, 2023

boegel modified the milestones: next release (4.8.1?), release after 4.8.1 Sep 3, 2023

boegel modified the milestones: next release (4.8.2), release after 4.8.2 Oct 27, 2023

boegel modified the milestones: next release (4.9.0), release after 4.9.0 Dec 26, 2023

boegel modified the milestones: 4.9.1, release after 4.9.1 Apr 3, 2024

boegel modified the milestones: 4.9.2, release after 4.9.2 Jun 6, 2024

boegel modified the milestones: 4.9.3, release after 4.9.3 Sep 11, 2024
