[Fix] Fix cpu inference UT failure #4430

Merged 50 commits into microsoft:master on Jan 9, 2024

Conversation

@delock (Collaborator) commented on Oct 1, 2023

This PR fixes the UT failure described in this PR and in the test job below. It skips TestModelTask if the dtype is not supported by the accelerator, or if InferenceBuilder is not implemented for the accelerator.
#4419
https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538
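
(For context, a minimal sketch of the kind of guard this PR describes, not the actual test diff. get_accelerator() and InferenceBuilder are real DeepSpeed APIs; the helper name, the __compatible_ops__ lookup, and the exact checks are assumptions about how the skip could be wired up.)

```python
import pytest
import torch
from deepspeed.accelerator import get_accelerator
from deepspeed.ops.op_builder import InferenceBuilder
from deepspeed.ops import __compatible_ops__  # assumed lookup table for op-builder compatibility


def skip_if_unsupported(dtype):
    """Skip the current test when the accelerator cannot serve it (illustrative helper)."""
    accel = get_accelerator()
    if dtype == torch.float16 and not accel.is_fp16_supported():
        pytest.skip(f"Accelerator {accel.device_name()} does not support torch.float16.")
    if dtype == torch.bfloat16 and not accel.is_bf16_supported():
        pytest.skip(f"Accelerator {accel.device_name()} does not support torch.bfloat16.")
    if not __compatible_ops__[InferenceBuilder.NAME]:
        pytest.skip("InferenceBuilder is not implemented on this accelerator.")
```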

@loadams (Contributor) commented on Oct 2, 2023

FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.

@delock (Collaborator, Author) commented on Oct 3, 2023

Hi @loadams, we are currently seeking higher test coverage in the area of AutoTP. @Yejing-Lai is investigating whether more model coverage is possible. On the other hand, some tests are skipped because InferenceBuilder is not implemented on CPU. We will check whether any test is skipped for a different reason.

> FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.

@loadams (Contributor) commented on Oct 3, 2023

> Hi @loadams, we are currently seeking higher test coverage in the area of AutoTP. @Yejing-Lai is investigating whether more model coverage is possible. On the other hand, some tests are skipped because InferenceBuilder is not implemented on CPU. We will check whether any test is skipped for a different reason.
>
> FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.

That makes sense, thanks. Given that this increases the runtime of the cpu_inference tests drastically without increasing coverage, do you think it would make sense to either revert it or add the skip back to the top of the file, which seemed to run quicker? Then on subsequent PRs we can enable more tests and monitor the overall runtime.

@loadams (Contributor) commented on Oct 3, 2023

> Hi @loadams, we are currently seeking higher test coverage in the area of AutoTP. @Yejing-Lai is investigating whether more model coverage is possible. On the other hand, some tests are skipped because InferenceBuilder is not implemented on CPU. We will check whether any test is skipped for a different reason.
>
> FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.
>
> That makes sense, thanks. Given that this increases the runtime of the cpu_inference tests drastically without increasing coverage, do you think it would make sense to either revert it or add the skip back to the top of the file, which seemed to run quicker? Then on subsequent PRs we can enable more tests and monitor the overall runtime.

FYI @delock, we disabled the cpu_inference test, but kept it so we can run it on demand for PRs that we think will impact things while we fix the test time issue.

@loadams (Contributor) commented on Oct 3, 2023

Also, #4439 tried to speed up the tests by not setting everything up before skipping, but that wasn't quicker either.

@delock (Collaborator, Author) commented on Oct 7, 2023

I ran `pytest -m 'seq_inference' unit/` locally and saw 3 passed tests with this branch. Yet on the CI environment the same branch gets these three tests skipped. This will need some further investigation of the CI environment.
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] PASSED [ 66%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] PASSED [ 73%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED (Acceleraor cpu does not support torch.float16.) [ 80%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED (Acceleraor cpu does not support torch.float16.) [ 86%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] PASSED

> FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed inside the tests rather than at the top of the file, the tests run much slower than before. We will look at fixing that, but wanted to make sure whether the total number of tests should be higher.

@delock (Collaborator, Author) commented on Oct 7, 2023

The slow test runs are caused by this line: it loads the model list from huggingface_hub, and doing that for every test is slow. A proper fix should make the list persistent during the test session. One way is to save it as a pickle in the first test and load the pickle in subsequent tests. I will test and update this PR.
https://github.com/microsoft/DeepSpeed/blob/master/tests/unit/inference/test_inference.py#L68

@delock (Collaborator, Author) commented on Oct 7, 2023

@loadams the slow tests have been fixed with the latest commit. Now the test run takes around 13 minutes on my local environment.

The strange thing is that on my local environment three tests are not skipped, and the skip messages are displayed.

On the GitHub CI environment, however, these three tests are skipped and the skip messages are not printed in the test report. I made several attempts to turn on skip-message printing, but it seems pytest in the CI environment is not affected by the pytest command in the workflow. Do you have any idea how I can turn on skip-message printing in the GitHub CI env?

This is my local env.

unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neo] SKIPPED (This op had not been implemented on this...) [  6%]
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neox] SKIPPED (Skipping gpt-neox-20b for now)              [ 13%]
unit/inference/test_inference.py::TestMPSize::test[fp32-bloom] SKIPPED (Bloom models only support half precision, ...) [ 20%]
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-j] SKIPPED (Test is currently broken)                      [ 26%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neo] SKIPPED (This op had not been implemented on this...) [ 33%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neox] SKIPPED (Skipping gpt-neox-20b for now)              [ 40%]
unit/inference/test_inference.py::TestMPSize::test[fp16-bloom] SKIPPED (This op had not been implemented on this s...) [ 46%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-j] SKIPPED (Test is currently broken)                      [ 53%]
unit/inference/test_inference.py::TestAutoTP::test[falcon] SKIPPED (Not enough GPU memory for this on V100 runners)    [ 60%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] PASSED                                            [ 66%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] PASSED                                       [ 73%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED (Acceleraor cpu does not su...) [ 80%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED (Acceleraor cpu does not s...) [ 86%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] PASSED                                  [ 93%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-codegen] SKIPPED (Codegen model(bf16) need ...) [100%]

This is the GitHub CI printout; all tests are skipped and no skip messages are printed.

unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-codegen] SKIPPED
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] SKIPPED
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-bloom] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-bloom] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-j] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neox] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neo] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-j] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neo] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neox] SKIPPED
unit/inference/test_inference.py::TestAutoTP::test[falcon] SKIPPED (...)

@delock (Collaborator, Author) commented on Oct 8, 2023

The skip messages are truncated because of the 80-column limit when pytest is running. We can expand the COLUMNS env variable to 140 to see the skip messages. Another issue is that PyTorch was recently upgraded to 2.1.0, which caused an Intel Extension for PyTorch load error and subsequently caused all tests to be skipped. I will discuss with the Intel Extension for PyTorch team when they will release a wheel for PyTorch 2.1.0.
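
(A minimal sketch of the workaround, for reference; the CI workflow presumably just exports the variable before invoking pytest. pytest derives its report width from shutil.get_terminal_size(), which honors COLUMNS, and -rs adds skip reasons to the short test summary.)

```python
import os
import pytest

# Widen the report so skip reasons are not truncated, then run the marked tests.
os.environ["COLUMNS"] = "140"
pytest.main(["-rs", "-m", "seq_inference", "unit/"])
```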

delock added 2 commits October 8, 2023 19:54
… check before run unit tests

* Reuse hf_model list among tests to avoid slow loading

* try to debug test skip

* another attempt to print test failure

* another attempt

* more attempt to print skip reason

* revert changes that are temporary

* remove extra flag for pytest

* add a dummy test to test pytest

* test skip message

* put old test and temp test together to compare

* try to find out the reason skip message are not printed

* comment all skips

* check skip in common.py

* revert last commits

* shorten name to show skip message

* change test name

* expand number of columns to 120 when running pytest

* detect deepspeed installation

* add test code for environment

* change pytorch version 2.1.0==>2.0.1

* add py-cpuinfo as requiiremetns to dev

* install py-cpuinfo manually

* Change COLUMNS to 140 to allow display of pytest skip message
@tjruwase added this pull request to the merge queue on Oct 16, 2023
@tjruwase removed this pull request from the merge queue due to a manual request on Oct 16, 2023
@delock (Collaborator, Author) commented on Oct 20, 2023

Intel Extension for PyTorch 2.1 has been released. I will update this PR to change the workflow to PyTorch 2.1 accordingly.

delock and others added 2 commits October 25, 2023 10:24
* ping pytorch to 2.0.1

* add pip list before install deepspeed

* install cpuinfo before install deepspeed

* change workflow to work with pytorch 2.1

* add torch install to CI workflow

* install py-cpuinfo

* enforce autotp test on single socket instance

* enforce 2 ranks in cpu autotp tests

* enable tests that can only run on torch 2.1 or above

* make build faster

* remove -j make option

* add back skip for codegen

* check UT result

* update tutorial
@delock (Collaborator, Author) commented on Oct 25, 2023

Hi @loadams @tjruwase, we are adding the tensor parallel UT back into the CPU inference workflow. One issue we met is that the GitHub "ubuntu-20.04" instance has only one CPU and two cores. We need a stronger system if we want to run the AutoTP test stably. We will use one of the self-hosted V100 runners to see whether it fits.

@delock (Collaborator, Author) commented on Dec 7, 2023

The failure in CI is due to a result mismatch from BF16 AutoTP. We have observed this on both CUDA devices and CPU. A probable reason is the precision difference between BF16 and FP16. We will disable the codegen UT for BF16 to avoid the UT failure.

@delock (Collaborator, Author) commented on Dec 21, 2023

Hi @loadams, this fix has passed all tests. I think this PR is critical because any change that affects CPU should be guarded by this workflow. Can we merge this on master and turn the CPU inference workflow back on? Thanks!

@loadams (Contributor) commented on Jan 2, 2024

> Hi @loadams, this fix has passed all tests. I think this PR is critical because any change that affects CPU should be guarded by this workflow. Can we merge this on master and turn the CPU inference workflow back on? Thanks!

Hi @delock, yes, apologies: many of us were on vacation for the last week or so. I've re-merged the master branch and we will get this reviewed/completed/merged and re-enable the cpu-inference tests ASAP.

@loadams (Contributor) commented on Jan 2, 2024

@delock - looks like a failure on cpu-inference, I'll re-run to check that it isn't transient, but might need one more fix?

FAILED unit/inference/test_inference_config.py::TestInferenceConfig::test_overlap_kwargs - ValueError: Type fp16 is not supported.
FAILED unit/inference/test_inference_config.py::TestInferenceConfig::test_json_config - ValueError: Type fp16 is not supported.

@delock (Collaborator, Author) commented on Jan 3, 2024

> @delock - looks like a failure on cpu-inference, I'll re-run to check that it isn't transient, but might need one more fix?
>
> FAILED unit/inference/test_inference_config.py::TestInferenceConfig::test_overlap_kwargs - ValueError: Type fp16 is not supported.
> FAILED unit/inference/test_inference_config.py::TestInferenceConfig::test_json_config - ValueError: Type fp16 is not supported.

I think it's due to this recent change: https://github.com/microsoft/DeepSpeed/pull/4843/files#diff-e6635a81d2c2bf0938b5f83b1c4945e0f344e0106191a6960df8ee4aa64cc55fR1020
I'll reproduce this locally and try to fix it with a test update.

@delock (Collaborator, Author) commented on Jan 3, 2024

@loadams I have reproduced these two failures locally. The failing tests are updated to set dtype in the config to float32; this should still exercise these two UTs as intended and make them work on devices without float16 support.
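
(A hedged sketch of the idea, not the exact test change: request fp32 in the inference config so the config tests still run on accelerators without float16 support. The stand-in model is illustrative.)

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # stand-in model for illustration
# dtype=torch.float32 keeps init_inference usable on accelerators without fp16 support.
engine = deepspeed.init_inference(model, dtype=torch.float32)
```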

@loadams changed the title from "[Bug fix] Fix cpu inference UT failure" to "[Fix] Fix cpu inference UT failure" on Jan 3, 2024
python -m pip install intel_extension_for_pytorch
python -m pip install oneccl_bind_pt==2.0 -f https://developer.intel.com/ipex-whl-stable-cpu
# the curl line is for troubleshootingn
A Contributor commented on this line:

typo on troubleshooting*

The Collaborator (Author) replied:

Thanks! Will fix it.

@@ -170,7 +186,7 @@ def get_all_ranks_from_group(self, group):
while True:
results.append(super(CCLBackend, self).get_global_rank(group, rank))
rank += 1
except ValueError:
except (ValueError, RuntimeError):
A Contributor asked:

What is the runtime error that we can hit here?

@delock (Collaborator, Author) replied on Jan 5, 2024:

The `while True:` loop iterates over local ranks and collects the global rank for each, until the local rank runs out of range. In older versions of PyTorch this out-of-range access raises a ValueError; in PyTorch 2 it raises a RuntimeError.
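
(For clarity, a reconstructed sketch of the method around that diff hunk; the real implementation lives in DeepSpeed's CCL backend, and the initialization and return shown here are inferred rather than quoted.)

```python
def get_all_ranks_from_group(self, group):
    # Walk local ranks until the group runs out; older torch signals the end
    # with ValueError, torch 2.x with RuntimeError, so both are caught.
    rank = 0
    results = []
    try:
        while True:
            results.append(super(CCLBackend, self).get_global_rank(group, rank))
            rank += 1
    except (ValueError, RuntimeError):
        pass
    return results
```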

@Liangliang-Ma

Comment on lines +69 to +75
try:
with open("hf_models.pkl", "rb") as fp:
_hf_models = pickle.load(fp)
except FileNotFoundError:
_hf_models = list(HfApi().list_models())
with open("hf_models.pkl", "wb") as fp:
pickle.dump(_hf_models, fp)
A Contributor commented:

I think that caching the model list can be a good idea for the tests, but we need to save it to blob storage so that it is persistent. Additionally, I think the cache should have a timestamp connected to it, such that we update it every hour/day/week. See how we do this in MII:
https://github.com/microsoft/DeepSpeed-MII/blob/4472e4e206182ed56399f225848a7721565922fb/mii/utils.py#L39
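
(A minimal sketch of that suggestion, assuming a local file cache with a time-based refresh; the file name and refresh period are illustrative, and the blob-storage part is omitted.)

```python
import os
import pickle
import time

from huggingface_hub import HfApi

CACHE_FILE = "hf_models.pkl"  # illustrative location
MAX_AGE = 24 * 60 * 60        # refresh once the cache is a day old


def get_hf_models():
    # Reuse the cached list while it is fresh; otherwise re-fetch and rewrite it.
    if os.path.exists(CACHE_FILE) and time.time() - os.path.getmtime(CACHE_FILE) < MAX_AGE:
        with open(CACHE_FILE, "rb") as fp:
            return pickle.load(fp)
    models = list(HfApi().list_models())
    with open(CACHE_FILE, "wb") as fp:
        pickle.dump(models, fp)
    return models
```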

@mrwyattii (Contributor) left a comment:

Given the long runtime of the cpu-inference tests, we should only require them on changes to deepspeed/inference.

Approving the PR so we can merge.

@loadams let's address the long test runtimes and the model list caching in a PR after this is merged.

@loadams enabled auto-merge on January 8, 2024, 21:40
@loadams added this pull request to the merge queue on Jan 8, 2024
@loadams removed this pull request from the merge queue due to a manual request on Jan 8, 2024
@loadams enabled auto-merge on January 8, 2024, 21:40
@loadams added this pull request to the merge queue on Jan 8, 2024
Merged via the queue into microsoft:master with commit d8d865f Jan 9, 2024
15 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
This PR fixes the UT failure described in the PR and in the test job below. It skips `TestModelTask` if the dtype is not supported by the accelerator, or if `InferenceBuilder` is not implemented for the accelerator.
microsoft#4419

https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Liangliang-Ma <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Xie Zejian <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
5 participants