[Fix] Fix cpu inference UT failure #4430

Merged 50 commits into microsoft:master on Jan 9, 2024

Conversation

@delock (Collaborator) commented on Oct 1, 2023

This PR fixes the UT failure described in this PR and in the test job below. It skips TestModelTask if the dtype is not supported by the accelerator, or if InferenceBuilder is not implemented for the accelerator.
#4419
https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538
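
(For context, a minimal sketch of the kind of guard this PR describes, not the actual test diff. get_accelerator() and InferenceBuilder are real DeepSpeed APIs; the helper name, the __compatible_ops__ lookup, and the exact checks are assumptions about how the skip could be wired up.)

```python
import pytest
import torch
from deepspeed.accelerator import get_accelerator
from deepspeed.ops.op_builder import InferenceBuilder
from deepspeed.ops import __compatible_ops__  # assumed lookup table for op-builder compatibility


def skip_if_unsupported(dtype):
    """Skip the current test when the accelerator cannot serve it (illustrative helper)."""
    accel = get_accelerator()
    if dtype == torch.float16 and not accel.is_fp16_supported():
        pytest.skip(f"Accelerator {accel.device_name()} does not support torch.float16.")
    if dtype == torch.bfloat16 and not accel.is_bf16_supported():
        pytest.skip(f"Accelerator {accel.device_name()} does not support torch.bfloat16.")
    if not __compatible_ops__[InferenceBuilder.NAME]:
        pytest.skip("InferenceBuilder is not implemented on this accelerator.")
```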

@loadams (Contributor) commented on Oct 2, 2023

FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.

@delock (Collaborator, Author) commented on Oct 3, 2023

Hi @loadams, we are currently seeking higher test coverage in the area of AutoTP. @Yejing-Lai is investigating whether more model coverage is possible. On the other hand, some tests are skipped because InferenceBuilder is not implemented on CPU. We will check whether any test is skipped for a different reason.

> FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.

@loadams (Contributor) commented on Oct 3, 2023

> Hi @loadams, we are currently seeking higher test coverage in the area of AutoTP. @Yejing-Lai is investigating whether more model coverage is possible. On the other hand, some tests are skipped because InferenceBuilder is not implemented on CPU. We will check whether any test is skipped for a different reason.
>
> FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.

That makes sense, thanks. Given that this increases the runtime of the cpu_inference tests drastically without increasing coverage, do you think it would make sense to either revert it or add the skip back to the top of the file, which seemed to run quicker? Then on subsequent PRs we can enable more tests and monitor the overall runtime.

@loadams (Contributor) commented on Oct 3, 2023

> Hi @loadams, we are currently seeking higher test coverage in the area of AutoTP. @Yejing-Lai is investigating whether more model coverage is possible. On the other hand, some tests are skipped because InferenceBuilder is not implemented on CPU. We will check whether any test is skipped for a different reason.
>
> FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.
>
> That makes sense, thanks. Given that this increases the runtime of the cpu_inference tests drastically without increasing coverage, do you think it would make sense to either revert it or add the skip back to the top of the file, which seemed to run quicker? Then on subsequent PRs we can enable more tests and monitor the overall runtime.

FYI @delock, we disabled the cpu_inference test, but kept it so we can run it on demand for PRs that we think will impact things while we fix the test time issue.

@loadams (Contributor) commented on Oct 3, 2023

Also, #4439 tried to speed up the tests by not setting everything up before skipping, but that wasn't quicker either.

@delock (Collaborator, Author) commented on Oct 7, 2023

I ran `pytest -m 'seq_inference' unit/` locally and saw 3 passed tests with this branch. Yet on the CI environment the same branch gets these three tests skipped. This will need some further investigation of the CI environment.
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] PASSED [ 66%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] PASSED [ 73%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED (Acceleraor cpu does not support torch.float16.) [ 80%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED (Acceleraor cpu does not support torch.float16.) [ 86%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] PASSED

> FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.

@delock (Collaborator, Author) commented on Oct 7, 2023

The slow test runs are caused by this line: it loads the model list from huggingface_hub, and doing that for every test is slow. A proper fix should make the list persistent during the test session. One way is to save it as a pickle in the first test and load the pickle in subsequent tests. I will test and update this PR.
https://github.com/microsoft/DeepSpeed/blob/master/tests/unit/inference/test_inference.py#L68

@delock (Collaborator, Author) commented on Oct 7, 2023

@loadams the slow tests have been fixed with the latest commit. Now the test run takes around 13 minutes on my local environment.

The strange thing is that on my local environment three tests are not skipped, and the skip messages are displayed.

On the GitHub CI environment, however, these three tests are skipped and the skip messages are not printed in the test report. I made several attempts to turn on skip-message printing, but it seems pytest in the CI environment is not affected by the pytest command in the workflow. Do you have any idea how I can turn on skip-message printing in the GitHub CI env?

This is my local env.

unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neo] SKIPPED (This op had not been implemented on this...) [  6%]
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neox] SKIPPED (Skipping gpt-neox-20b for now)              [ 13%]
unit/inference/test_inference.py::TestMPSize::test[fp32-bloom] SKIPPED (Bloom models only support half precision, ...) [ 20%]
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-j] SKIPPED (Test is currently broken)                      [ 26%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neo] SKIPPED (This op had not been implemented on this...) [ 33%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neox] SKIPPED (Skipping gpt-neox-20b for now)              [ 40%]
unit/inference/test_inference.py::TestMPSize::test[fp16-bloom] SKIPPED (This op had not been implemented on this s...) [ 46%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-j] SKIPPED (Test is currently broken)                      [ 53%]
unit/inference/test_inference.py::TestAutoTP::test[falcon] SKIPPED (Not enough GPU memory for this on V100 runners)    [ 60%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] PASSED                                            [ 66%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] PASSED                                       [ 73%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED (Acceleraor cpu does not su...) [ 80%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED (Acceleraor cpu does not s...) [ 86%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] PASSED                                  [ 93%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-codegen] SKIPPED (Codegen model(bf16) need ...) [100%]

This is the GitHub CI printout; all tests are skipped and no skip messages are printed.

unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-codegen] SKIPPED
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] SKIPPED
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-bloom] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-bloom] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-j] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neox] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neo] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-j] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neo] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neox] SKIPPED
unit/inference/test_inference.py::TestAutoTP::test[falcon] SKIPPED (...)

@delock (Collaborator, Author) commented on Oct 8, 2023

The skip messages are truncated because of the 80-column limit when pytest is running. We can expand the COLUMNS env variable to 140 to see the skip messages. Another issue is that PyTorch was recently upgraded to 2.1.0, which caused an Intel Extension for PyTorch load error and subsequently caused all tests to be skipped. I will discuss with the Intel Extension for PyTorch team when they will release a wheel for PyTorch 2.1.0.
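
(A minimal sketch of the workaround, for reference; the CI workflow presumably just exports the variable before invoking pytest. pytest derives its report width from shutil.get_terminal_size(), which honors COLUMNS, and -rs adds skip reasons to the short test summary.)

```python
import os
import pytest

# Widen the report so skip reasons are not truncated, then run the marked tests.
os.environ["COLUMNS"] = "140"
pytest.main(["-rs", "-m", "seq_inference", "unit/"])
```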

delock added 2 commits October 8, 2023 19:54
… check before run unit tests

* Reuse hf_model list among tests to avoid slow loading

* try to debug test skip

* another attempt to print test failure

* another attempt

* more attempt to print skip reason

* revert changes that are temporary

* remove extra flag for pytest

* add a dummy test to test pytest

* test skip message

* put old test and temp test together to compare

* try to find out the reason skip message are not printed

* comment all skips

* check skip in common.py

* revert last commits

* shorten name to show skip message

* change test name

* expand number of columns to 120 when running pytest

* detect deepspeed installation

* add test code for environment

* change pytorch version 2.1.0==>2.0.1

* add py-cpuinfo as requiiremetns to dev

* install py-cpuinfo manually

* Change COLUMNS to 140 to allow display of pytest skip message
@tjruwase added this pull request to the merge queue on Oct 16, 2023
@tjruwase removed this pull request from the merge queue due to a manual request on Oct 16, 2023
@delock (Collaborator, Author) commented on Oct 20, 2023

Intel Extension for PyTorch 2.1 has been released. I will update this PR to change the workflow to PyTorch 2.1 accordingly.

delock and others added 2 commits October 25, 2023 10:24
* ping pytorch to 2.0.1

* add pip list before install deepspeed

* install cpuinfo before install deepspeed

* change workflow to work with pytorch 2.1

* add torch install to CI workflow

* install py-cpuinfo

* enforce autotp test on single socket instance

* enforce 2 ranks in cpu autotp tests

* enable tests that can only run on torch 2.1 or above

* make build faster

* remove -j make option

* add back skip for codegen

* check UT result

* update tutorial
@delock (Collaborator, Author) commented on Oct 25, 2023

Hi @loadams @tjruwase, we are adding the tensor parallel UT back into the CPU inference workflow. One issue we met is that the GitHub "ubuntu-20.04" instance has only one CPU and two cores. We need a stronger system if we want to run the AutoTP test stably. We will use one of the self-hosted V100 runners to see whether it fits.

@delock (Collaborator, Author) commented on Dec 7, 2023

The failure in CI is due to a result mismatch from BF16 AutoTP. We have observed this on both CUDA devices and CPU. A probable reason is the precision difference between BF16 and FP16. We will disable the codegen UT for BF16 to avoid the UT failure.

@delock (Collaborator, Author) commented on Dec 21, 2023

Hi @loadams, this fix has passed all tests. I think this PR is critical because any change that affects CPU should be guarded by this workflow. Can we merge this on master and turn the CPU inference workflow back on? Thanks!

@loadams (Contributor) commented on Jan 2, 2024

> Hi @loadams, this fix has passed all tests. I think this PR is critical because any change that affects CPU should be guarded by this workflow. Can we merge this on master and turn the CPU inference workflow back on? Thanks!

Hi @delock, yes, apologies: many of us were on vacation for the last week or so. I've re-merged the master branch and we will get this reviewed/completed/merged and re-enable the cpu-inference tests ASAP.

@loadams (Contributor) commented on Jan 2, 2024

@delock - looks like a failure on cpu-inference, I'll re-run to check that it isn't transient, but might need one more fix?

FAILED unit/inference/test_inference_config.py::TestInferenceConfig::test_overlap_kwargs - ValueError: Type fp16 is not supported.
FAILED unit/inference/test_inference_config.py::TestInferenceConfig::test_json_config - ValueError: Type fp16 is not supported.

@delock (Collaborator, Author) commented on Jan 3, 2024

> @delock - looks like a failure on cpu-inference, I'll re-run to check that it isn't transient, but might need one more fix?
>
> FAILED unit/inference/test_inference_config.py::TestInferenceConfig::test_overlap_kwargs - ValueError: Type fp16 is not supported.
> FAILED unit/inference/test_inference_config.py::TestInferenceConfig::test_json_config - ValueError: Type fp16 is not supported.

I think it's due to this recent change: https://github.com/microsoft/DeepSpeed/pull/4843/files#diff-e6635a81d2c2bf0938b5f83b1c4945e0f344e0106191a6960df8ee4aa64cc55fR1020
I'll reproduce this locally and try to fix it with a test update.

@delock (Collaborator, Author) commented on Jan 3, 2024

@loadams I have reproduced these two failures locally. The failing tests are updated to set dtype in the config to float32; this should still exercise these two UTs as intended and make them work on devices without float16 support.
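
(A hedged sketch of the idea, not the exact test change: request fp32 in the inference config so the config tests still run on accelerators without float16 support. The stand-in model is illustrative.)

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # stand-in model for illustration
# dtype=torch.float32 keeps init_inference usable on accelerators without fp16 support.
engine = deepspeed.init_inference(model, dtype=torch.float32)
```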

@loadams changed the title from "[Bug fix] Fix cpu inference UT failure" to "[Fix] Fix cpu inference UT failure" on Jan 3, 2024
python -m pip install intel_extension_for_pytorch
python -m pip install oneccl_bind_pt==2.0 -f https://developer.intel.com/ipex-whl-stable-cpu
# the curl line is for troubleshootingn
A Contributor commented on this line:

typo on troubleshooting*

The Collaborator (Author) replied:

Thanks! Will fix it.

@@ -170,7 +186,7 @@ def get_all_ranks_from_group(self, group):
while True:
results.append(super(CCLBackend, self).get_global_rank(group, rank))
rank += 1
except ValueError:
except (ValueError, RuntimeError):
A Contributor asked:

What is the runtime error that we can hit here?

@delock (Collaborator, Author) replied on Jan 5, 2024:

The `while True:` loop iterates over local ranks and collects the global rank for each, until the local rank runs out of range. In older versions of PyTorch this out-of-range access raises a ValueError; in PyTorch 2 it raises a RuntimeError.
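
(For clarity, a reconstructed sketch of the method around that diff hunk; the real implementation lives in DeepSpeed's CCL backend, and the initialization and return shown here are inferred rather than quoted.)

```python
def get_all_ranks_from_group(self, group):
    # Walk local ranks until the group runs out; older torch signals the end
    # with ValueError, torch 2.x with RuntimeError, so both are caught.
    rank = 0
    results = []
    try:
        while True:
            results.append(super(CCLBackend, self).get_global_rank(group, rank))
            rank += 1
    except (ValueError, RuntimeError):
        pass
    return results
```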

@Liangliang-Ma

Comment on lines +69 to +75
try:
with open("hf_models.pkl", "rb") as fp:
_hf_models = pickle.load(fp)
except FileNotFoundError:
_hf_models = list(HfApi().list_models())
with open("hf_models.pkl", "wb") as fp:
pickle.dump(_hf_models, fp)
A Contributor commented:

I think that caching the model list can be a good idea for the tests, but we need to save it to blob storage so that it is persistent. Additionally, I think the cache should have a timestamp connected to it, such that we update it every hour/day/week. See how we do this in MII:
https://github.com/microsoft/DeepSpeed-MII/blob/4472e4e206182ed56399f225848a7721565922fb/mii/utils.py#L39
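
(A minimal sketch of that suggestion, assuming a local file cache with a time-based refresh; the file name and refresh period are illustrative, and the blob-storage part is omitted.)

```python
import os
import pickle
import time

from huggingface_hub import HfApi

CACHE_FILE = "hf_models.pkl"  # illustrative location
MAX_AGE = 24 * 60 * 60        # refresh once the cache is a day old


def get_hf_models():
    # Reuse the cached list while it is fresh; otherwise re-fetch and rewrite it.
    if os.path.exists(CACHE_FILE) and time.time() - os.path.getmtime(CACHE_FILE) < MAX_AGE:
        with open(CACHE_FILE, "rb") as fp:
            return pickle.load(fp)
    models = list(HfApi().list_models())
    with open(CACHE_FILE, "wb") as fp:
        pickle.dump(models, fp)
    return models
```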

@mrwyattii (Contributor) left a comment:

Given the long runtime of the cpu-inference tests, we should only require them on changes to deepspeed/inference.

Approving the PR so we can merge.

@loadams let's address the long test runtimes and the model list caching in a PR after this is merged.

@loadams enabled auto-merge on January 8, 2024, 21:40
@loadams added this pull request to the merge queue on Jan 8, 2024
@loadams removed this pull request from the merge queue due to a manual request on Jan 8, 2024
@loadams enabled auto-merge on January 8, 2024, 21:40
@loadams added this pull request to the merge queue on Jan 8, 2024
Merged via the queue into microsoft:master with commit d8d865f Jan 9, 2024
15 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
This PR fixes the UT failure described in the PR and in the test job below. It skips `TestModelTask` if the dtype is not supported by the accelerator, or if `InferenceBuilder` is not implemented for the accelerator.
microsoft#4419

https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Liangliang-Ma <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Xie Zejian <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
5 participants