Introduces manifest.yaml that is the "last working world state" #405
Merged
Commits on Dec 8, 2023
- Commit fa48558
- Commit 2ccf1a9

  author Yu-Hang Tang <[email protected]> 1698050497 +0000
  committer Terry Kong <[email protected]> 1701417045 -0800

  pip-compile changes

  Updated t5-large perf (#342)

  Update Pax README and sub file (#345)
  - Adds FP8 documentation
  - Updates perf table
  - Makes some other minor improvements for readability

  Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars (#329)

  Re-enable NVLS in nightly containers (#331)
  NVLS was disabled due to a known issue in NCCL 2.17 that caused intermittent hangs. The issue has been resolved in NCCL 2.18, so we are safe to re-enable NVLS.
  ---------
  Co-authored-by: Terry Kong <[email protected]>

  Update Pax TE patch to point to rebased branch (#348)

  Loosens t5x loss tests relative tolerances (#343)
  Relaxing the relative tolerance on the loss tests since it was leading to too many false positives. For reference, deviation in loss for the t5 model can sometimes be up to 15% at the start of training with real data.

  Adds rosetta-t5x TE + no-TE tests that enable the correct configs for testing (#332)
  - [ ] Add capability to retroactively test with newer test-t5x.sh like in [t5x-wget-test](https://github.com/NVIDIA/JAX-Toolbox/tree/t5x-wget-test)
  - [ ] Sets `ENABLE_TE=1` in the Dockerfile.t5x which is identical to the logic from before where it was always enabled in rosetta-t5x

  Fix markdown hyperlink for jax package on frontpage readme (#319)

  Adds a --seed option to test-t5x.sh to ensure determinism (#344)
  To ensure that the test results for a particular container are reproducible between runs, this change introduces a seed argument that sets the jax seed and dataset seed to 42. It remains configurable, but now there shouldn't be variance given the same container.
  - Also fixes a typo where --steps-per-epoch wasn't in the usage doc of this script
  Co-authored-by: NVIDIA <[email protected]>
  Co-authored-by: Yu-Hang "Maxin" Tang <[email protected]>

  Dynamic workflow run names (#356)
  This change introduces the dynamic [run name field](https://github.blog/changelog/2022-09-26-github-actions-dynamic-names-for-workflow-runs/#:~:text=GitHub%20Actions%20customers%20can%20now,visit%20the%20GitHub%20Actions%20community.) `run-name`. It's currently difficult on mobile to find the "workflow_run" that corresponds to a particular date, so hopefully this helps identify which builds were nightly vs which builds were manually triggered.
  I couldn't find a good way to dynamically look up the `name` field, so for now I copied all of the names. I also wasn't able to find a "created_at" for the scheduled workflows, so those don't have timestamps for now.
  __Assumptions__:
  * "workflow_run" == nightly since "scheduled" events only happen on `main` and `workflow_run` are only run for concrete workflows and not reusable workflows
  - [x] Test the workflow_run codepath
  - [x] Test the scheduled codepath
  ![image](https://github.com/NVIDIA/JAX-Toolbox/assets/7576060/4b916452-334a-4a73-9220-9fbadc70462f)

  Fix random failing tests for backend_independent on V100 (#351)
  Fixes random failures in the backend-independent section of JAX unit tests:
  ```
  Cannot find a free accelerator to run the test on, exiting with failure
  ```
  Changes: limit the number of concurrent test jobs even for backend-independent tests, which do create GPU contexts. As a clarification, `--jobs` and `--local_test_jobs` do not make a difference for our particular CI pipeline, since JAX is built in a separate CI job anyway.
  References (From Reed Wanderman-Milne @ Google):
  > 1. In particular, you have to set NB_GPUS, JOBS_PER_ACC, and J correctly or you can get that error (I recently got the same error by not setting those correctly)
  > 2. (also I think --jobs should be --local_test_jobs in that code block, no reason to restrict the number of jobs compiling JAX)

  Propagate error code in ViT tests (#357)

  Merges rosetta unit tests and takes off the marker which spun up another matrix job (#360)
  This should simplify the rosetta tests and save some time since another matrix job was started for one test

  Propagate build failures (#363)
  Always run the `publish-build` step, regardless of whether the rosetta pax/t5x build was attempted. This ensures that badges correctly reflect build failures due to dependent builds failing.

  Patch for JAX core container (ARM64) (#367)
  Add patch to XLA to be able to build JAX core container for ARM64

  Update the doc for USE_FP8 (#349)
  This PR provides guidance on how to use the new configuration option, `USE_FP8`, to enable native FP8 support on Hopper GPUs.

  Update the native-fp8 guide with cudnn layer norm (#368)
  This PR updates the guide to include the new flag to enable the cudnn layer norm.
  cc. @ashors1 @terrykong @nouiz

  Add WAR for XLA NCCL bug causing OOMs (#362)
  A stopgap for #346

  fix TE multi-device test
  fix lzma build issue
  edit TE test name
  fix TE arm64 test install error
  remove --install option from get-source.sh
  fix TE arm64 test install error
  disable sandbox
  i'm jet-lagged
  use Pax image for TE testing
  Fix job dependency
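The loss-tolerance change in #343 above can be pictured with a small sketch. This is not the repository's actual test code: the function name `losses_match` and the default `rel_tol=0.15` are assumptions chosen for illustration, the latter taken from the "up to 15%" deviation quoted in the commit message rather than from the real test configuration.

```python
from typing import Sequence


def losses_match(actual: Sequence[float],
                 baseline: Sequence[float],
                 rel_tol: float = 0.15) -> bool:
    """Return True when every step's loss is within rel_tol of the baseline.

    rel_tol=0.15 is a hypothetical default for this sketch; the tolerance
    used by the real t5x loss tests is not stated in the commit message.
    """
    if len(actual) != len(baseline):
        return False
    # Compare step by step, measuring deviation relative to the baseline loss.
    return all(abs(a - b) <= rel_tol * abs(b) for a, b in zip(actual, baseline))


# A run that drifts by roughly 10% early in training still passes...
within_tolerance = losses_match([2.0, 1.1], [2.2, 1.0])
# ...while a 30% deviation is flagged.
outside_tolerance = losses_match([1.3], [1.0])
```

A looser tolerance trades sensitivity for fewer false positives, which is the trade-off the commit message describes.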
- Commit 404d629

  Adds support for building rosetta with local patches and an already generated patch dir
  comment
  Add steps to archive patches in run
  Date the patches for readability
  Better log msg
  switch to --3way since that produces a merge conflict to help understand the conflict
  Switch to mealkit+finalize mechanic for rosetta builds
  Add github.run_id to artifacts for provenance
  Update all rosetta workflows with mealkit/final mechanism
- Commit 4404710

  author Terry Kong <[email protected]> 1700265014 -0800
  committer Terry Kong <[email protected]> 1701417338 -0800
  parent 8a43f4a
  author Terry Kong <[email protected]> 1700265014 -0800
  committer Terry Kong <[email protected]> 1701417298 -0800

  Adds
  (1) bump.sh which bumps the manifest and pins the patches
  (2) updates create-distribution.sh to work with manifests
  (3) move everything to .github/container

  sandbox fix write
  add propagation of trial branch to all workflows and update sandbox to test synchronous workflow
  check wip test wip changes wip wip don't need wip wip remove
  make trial branch contingent on publishing
  Update get-source and initial update for jax build to accept manifest
  update manifest
  jax build partially working + patches
  update pax/t5x dockerfiles, add more repos into manifest, and update pip-finalize to use *.in instead of manifest.txt
  update manifest with rest of repos and patches
  missing arg fix
  jax/pax/t5x all builds work now!
  update manifest file everywhere
  fix all workflows
  cleanup
  get the context right
  fix all broken tests
  custom pip distribution works
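To make the idea of "bumping" a manifest concrete, here is a minimal sketch. It does not reproduce bump.sh or the real manifest.yaml schema: the keys `url`, `tracking_ref`, and `commit`, the function `bump_manifest`, and the example repository URL are all hypothetical, and the manifest is modeled as a plain dict so the example needs no YAML library.

```python
from copy import deepcopy
from typing import Callable, Dict

# Hypothetical in-memory shape: package name -> {"url", "tracking_ref", "commit"}.
Manifest = Dict[str, Dict[str, str]]


def bump_manifest(manifest: Manifest,
                  resolve_head: Callable[[str, str], str]) -> Manifest:
    """Return a copy of the manifest with every entry pinned to its current head.

    resolve_head(url, ref) is supplied by the caller; a real implementation
    might query the remote repository, which this sketch deliberately avoids.
    """
    bumped = deepcopy(manifest)  # leave the previous known-good state untouched
    for entry in bumped.values():
        entry["commit"] = resolve_head(entry["url"], entry["tracking_ref"])
    return bumped


# Illustrative data only: the URL and commit IDs below are made up.
manifest = {
    "jax": {
        "url": "https://example.com/jax.git",
        "tracking_ref": "main",
        "commit": "0000000",
    },
}
fake_heads = {("https://example.com/jax.git", "main"): "abc1234"}
bumped = bump_manifest(manifest, lambda url, ref: fake_heads[(url, ref)])
```

Returning a new manifest rather than mutating the old one keeps the earlier pinned state available for comparison or rollback, which fits the "last working world state" framing in the PR title.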
- Commit 4951c31
- Commit 7a852f8
- Commit 6206f4c
- Commit 5688cac
- Commit 83c07fd
- Commit 27036f7
- Commit 2ef9313
- Commit ecf9a6a
- Commit c5f0b87
- Commit 645d4af
- Commit e780f9b
- Commit 53ee07e
- Commit f2f13f5
- Commit 48a3a09
- Commit e8b857b
- Commit 68b001a
- Commit 7281f59
- Commit 2c54e8d
- Commit 2e9fccd
- Commit a554f01
- Commit 723f91d
Commits on Dec 11, 2023
- Commit c0b683a

  Fix BASE_IMAGE description and upstream t5x/pax builds now allow base_image from workflow_dispatch
- Commit 0828455
Commits on Dec 13, 2023
- Commit 73df444
- Commit 811c9d2
Commits on Dec 18, 2023
- Commit aa7cea8