[Fix] Fix cpu inference UT failure #4430
Conversation
FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because the skips are now placed per-test rather than over the whole file, the tests run much slower than before. We will look at fixing that, but wanted to confirm whether the total number of tests should be higher.
Hi @loadams, currently we are seeking higher test coverage in the area of AutoTP. @Yejing-Lai is investigating whether more model coverage is possible. On the other hand, some tests are skipped when running on CPU.
That makes sense, thanks. Given that this increases the runtime of the cpu_inference tests drastically without increasing coverage, do you think it would make sense to either revert it or add the skip back to the top of the file, which seemed to run quicker? Then on subsequent PRs we can enable more tests and monitor the overall runtime.
FYI @delock, we disabled the cpu_inference test, but kept it so we can run it on demand for PRs that we think will impact things while we fix the test time issue.
Also #4439 had a way to try to speed up the tests by not setting everything up before skipping, but that wasn't quicker either.
I tried to run the line:
The slow running of the tests is caused by this line. It loads the model list from huggingface_hub, and doing that for every test is slow. A proper fix should make it persistent during the test session. One way is to save it as a pickle in the first test and load the pickle in subsequent tests. Will test and update this PR.
* Reuse hf_model list among tests to avoid slow loading
* try to debug test skip
* another attempt to print test failure
* another attempt
* more attempt to print skip reason
* revert changes that are temporary
@loadams the slow test has been fixed with the latest commit. Now the tests run in around 13 minutes on my local environment. The strange thing is that on my local environment there are three tests that are not skipped, and the skip message is displayed. On the GitHub CI environment, however, these three tests are skipped and the skip message is not printed in the test report. This is my local env.
This is the GitHub CI printout; all tests are skipped and no skip message is printed.
The skip message is truncated because of the 80-column limit when pytest is running. We can expand the 'COLUMNS' env variable to 140 to see the skip message. Another issue is that PyTorch was recently upgraded to 2.1.0, which caused an Intel Extension for PyTorch load error and subsequently caused all tests to be skipped. Will discuss with the Intel Extension for PyTorch team when they will release a wheel for PyTorch 2.1.0.
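For reference, a minimal way to check this locally might look like the sketch below; the CI fix sets `COLUMNS` in the workflow file, and the test path and flags here are illustrative assumptions, not the exact CI command.

```python
import os
import subprocess

# Widen the terminal width pytest sees so skip reasons are not truncated at 80 columns,
# and ask pytest ("-rs") to print a summary of skip reasons.
env = dict(os.environ, COLUMNS="140")
subprocess.run(["pytest", "-rs", "tests/unit/inference/test_inference.py"], env=env, check=False)
```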
* … check before run unit tests
* Reuse hf_model list among tests to avoid slow loading
* try to debug test skip
* another attempt to print test failure
* another attempt
* more attempt to print skip reason
* revert changes that are temporary
* remove extra flag for pytest
* add a dummy test to test pytest
* test skip message
* put old test and temp test together to compare
* try to find out the reason skip message are not printed
* comment all skips
* check skip in common.py
* revert last commits
* shorten name to show skip message
* change test name
* expand number of columns to 120 when running pytest
* detect deepspeed installation
* add test code for environment
* change pytorch version 2.1.0==>2.0.1
* add py-cpuinfo as requirements to dev
* install py-cpuinfo manually
* Change COLUMNS to 140 to allow display of pytest skip message
Intel Extension for PyTorch 2.1 is released. Will update this PR to change the workflow to PyTorch 2.1 accordingly.
* Reuse hf_model list among tests to avoid slow loading
* try to debug test skip
* another attempt to print test failure
* another attempt
* more attempt to print skip reason
* revert changes that are temporary
* remove extra flag for pytest
* add a dummy test to test pytest
* test skip message
* put old test and temp test together to compare
* try to find out the reason skip message are not printed
* comment all skips
* check skip in common.py
* revert last commits
* shorten name to show skip message
* change test name
* expand number of columns to 120 when running pytest
* detect deepspeed installation
* add test code for environment
* change pytorch version 2.1.0==>2.0.1
* add py-cpuinfo as requirements to dev
* install py-cpuinfo manually
* Change COLUMNS to 140 to allow display of pytest skip message
* pin pytorch to 2.0.1
* add pip list before install deepspeed
* install cpuinfo before install deepspeed
* change workflow to work with pytorch 2.1
* add torch install to CI workflow
* install py-cpuinfo
* enforce autotp test on single socket instance
* enforce 2 ranks in cpu autotp tests
* enable tests that can only run on torch 2.1 or above
* make build faster
* remove -j make option
* add back skip for codegen
* check UT result
* update tutorial
Hi @loadams @tjruwase, we are adding tensor parallel UTs back into the CPU inference workflow. One issue we met is that the GitHub instance "ubuntu-20.04" has only one CPU and two cores. We need a stronger system if we want to run the AutoTP tests stably. We will use one of the self-hosted V100 runners to see whether it fits.
The failure in CI is due to a result mismatch from BF16 AutoTP. We have observed this on both CUDA devices and CPU. A probable reason is the precision difference between BF16 and FP16. We will disable the codegen UT for BF16 to avoid the UT failure.
Hi @loadams, this fix has passed all tests. I think this PR is critical because all changes that affect CPU should be guarded by this workflow. Can we merge this workflow into master and turn the CPU inference workflow back on? Thanks!
Hi @delock, yes, apologies, many of us were on vacation for the last week or so. I've re-merged the master branch and we will get this reviewed/completed/merged and re-enable the cpu-inference tests ASAP.
@delock - looks like a failure on cpu-inference, I'll re-run to check that it isn't transient, but might need one more fix?
I think it's due to this recent change. https://github.com/microsoft/DeepSpeed/pull/4843/files#diff-e6635a81d2c2bf0938b5f83b1c4945e0f344e0106191a6960df8ee4aa64cc55fR1020
@loadams I have reproduced these two failures locally. The failing tests are updated to set dtype in the config to float32; this should still test these two UTs as intended and make them work on devices without float16 support.
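As a rough illustration of that change (the kwargs follow DeepSpeed's public `init_inference` API, but the exact test config in this PR may differ, and the model choice is arbitrary):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Use float32 so the same inference path is exercised on accelerators without
# float16 support; kernel injection is left off in this simplified sketch.
model = AutoModelForCausalLM.from_pretrained("gpt2")
engine = deepspeed.init_inference(model, dtype=torch.float32, replace_with_kernel_inject=False)
```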
.github/workflows/cpu-inference.yml
python -m pip install intel_extension_for_pytorch
python -m pip install oneccl_bind_pt==2.0 -f https://developer.intel.com/ipex-whl-stable-cpu
# the curl line is for troubleshootingn
typo on troubleshooting*
Thanks! Will fix it.
@@ -170,7 +186,7 @@ def get_all_ranks_from_group(self, group):
            while True:
                results.append(super(CCLBackend, self).get_global_rank(group, rank))
                rank += 1
-        except ValueError:
+        except (ValueError, RuntimeError):
What is the runtime error that we can hit here?
The `while True:` loop iterates over local ranks and collects the global rank for each, until the local rank is out of range. In older versions of PyTorch, this out-of-range access throws a ValueError. In PyTorch 2, it throws a RuntimeError.
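Reconstructing the surrounding method from the diff and the explanation above (a sketch only; the exact body in the PR may differ):

```python
def get_all_ranks_from_group(self, group):
    # Walk local ranks 0, 1, 2, ... and translate each to its global rank.
    # When `rank` runs past the end of the group, older PyTorch raises ValueError
    # while PyTorch 2.x raises RuntimeError, so both are caught to end the loop.
    rank = 0
    results = []
    try:
        while True:
            results.append(super(CCLBackend, self).get_global_rank(group, rank))
            rank += 1
    except (ValueError, RuntimeError):
        pass
    return results
```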
try:
    with open("hf_models.pkl", "rb") as fp:
        _hf_models = pickle.load(fp)
except FileNotFoundError:
    _hf_models = list(HfApi().list_models())
    with open("hf_models.pkl", "wb") as fp:
        pickle.dump(_hf_models, fp)
I think that caching the model list can be a good idea for the tests, but we need to save it to blob storage so that it is persistent. Additionally, I think the cache should have a timestamp connected to it, such that we update it every hour/day/week. See how we do this in MII:
https://github.com/microsoft/DeepSpeed-MII/blob/4472e4e206182ed56399f225848a7721565922fb/mii/utils.py#L39
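A hedged sketch of that idea (cache path and refresh interval are assumptions; the linked MII helper is the reference):

```python
import os
import pickle
import time

from huggingface_hub import HfApi

CACHE_PATH = "hf_models.pkl"   # illustrative; CI would point this at persistent blob storage
CACHE_TTL = 24 * 60 * 60       # assumed refresh interval of one day

def get_hf_model_list():
    """Return the Hugging Face model list, refreshing the local cache when it is stale."""
    if os.path.exists(CACHE_PATH) and time.time() - os.path.getmtime(CACHE_PATH) < CACHE_TTL:
        with open(CACHE_PATH, "rb") as fp:
            return pickle.load(fp)
    models = list(HfApi().list_models())
    with open(CACHE_PATH, "wb") as fp:
        pickle.dump(models, fp)
    return models
```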
Given the long runtime of the cpu-inference tests, we should only require them on changes to deepspeed/inference.
Approving the PR so we can merge.
@loadams let's address the long test runtimes and the model list caching in a PR after this is merged.
This PR fixes the UT failure as described in this PR and the following test job. It skips `TestModelTask` if the dtype is not supported by the accelerator, or if `InferenceBuilder` is not implemented by the accelerator. microsoft#4419 https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Liangliang-Ma <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Xie Zejian <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
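A minimal sketch of those skip conditions (the accelerator helpers below exist in DeepSpeed's accelerator abstraction, but the exact checks used by `TestModelTask` may differ):

```python
import pytest
import torch
from deepspeed.accelerator import get_accelerator

def skip_if_dtype_unsupported(dtype):
    # Skip the test when the current accelerator cannot run the requested dtype,
    # e.g. a CPU backend without float16 support.
    if dtype == torch.float16 and not get_accelerator().is_fp16_supported():
        pytest.skip("accelerator does not support float16")
    if dtype == torch.bfloat16 and not get_accelerator().is_bf16_supported():
        pytest.skip("accelerator does not support bfloat16")
```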