Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[userbenchmark] Broader error catching in Torch-TRT userbenchmark #1974

Closed
wants to merge 1 commit into from

Conversation

gs-olive
Copy link
Contributor

@gs-olive gs-olive commented Oct 9, 2023

  • Failures on a recent sample run indicate that errors raised by the subprocess are not always Exceptions, but are still sometimes recoverable
  • Add except clause to catch all such errors, short of keyboard interrupts, so the compilation can complete despite such errors

- Failures on a recent sample run indicate that errors raised by the
subprocess are not always Exceptions, but are still sometimes
recoverable
- Add `except` clause to catch all such errors, short of keyboard
interrupts, so the compilation can complete despite such errors
@gs-olive gs-olive force-pushed the torch_trt_error_catch_fix branch from cc34f3b to a609231 Compare October 9, 2023 19:37
@gs-olive gs-olive temporarily deployed to docker-s3-upload October 9, 2023 19:38 — with GitHub Actions Inactive
@gs-olive gs-olive temporarily deployed to docker-s3-upload October 9, 2023 19:39 — with GitHub Actions Inactive
Copy link
Contributor

@xuzhao9 xuzhao9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@facebook-github-bot
Copy link
Contributor

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@xuzhao9 merged this pull request in eb8f952.

@gs-olive
Copy link
Contributor Author

@xuzhao9 - thanks! If possible, could you start another benchmark run with this new change?

@gs-olive
Copy link
Contributor Author

Hi @xuzhao9 - I am wondering how we might be able to set up a weekly CI run of this userbenchmark? I noticed some of the other benchmarks have a .yaml. Would something like ci.yaml, with the following work?

platform:   "gcp_a100"
schedule:   "weekly"

@xuzhao9
Copy link
Contributor

xuzhao9 commented Oct 18, 2023

@gs-olive Currently, it is already running nightly: https://github.com/pytorch/benchmark/blob/main/userbenchmark/torch_trt/ci.yaml#L2

Changing it to weekly will work ootb - but we need to fix the existing CI errors first: https://github.com/pytorch/benchmark/actions/runs/6518636322/job/17704274032

@gs-olive
Copy link
Contributor Author

I see - thanks for sharing this, it is likely because I am saving the error message as a string in the JSON dictionary when compilation or model building fails. Should I not save an entry at all in such situations instead?

@xuzhao9
Copy link
Contributor

xuzhao9 commented Oct 18, 2023

I see - thanks for sharing this, it is likely because I am saving the error message as a string in the JSON dictionary when compilation or model building fails. Should I not save an entry at all in such situations instead?

Yes, the metric value field in the JSON dictionary accepts float only. We suggest using a separate file or multiple separate files to save the complete error message for readability. In the JSON dictionary, if a metric fails, we suggest to use a special float value (e.g., -1.0) to indicate an error or skip having this metric Id.

@gs-olive
Copy link
Contributor Author

Thanks for the information! I have added the necessary changes to fix the string-values in the dictionary, here: #1998

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants