Introduces manifest.yaml that is the "last working world state" #405

Merged 30 commits on Dec 18, 2023

Commits on Dec 8, 2023

  1. wip

    Co-authored-by: Terry Kong <[email protected]>
    yhtang and terrykong committed Dec 8, 2023
    fa48558
  2. parent abb6f97

    author Yu-Hang Tang <[email protected]> 1698050497 +0000
    committer Terry Kong <[email protected]> 1701417045 -0800
    
    pip-compile changes
    
    Updated t5-large perf (#342)
    
    Update Pax README and sub file (#345)
    
    - Adds FP8 documentation
    - Updates perf table
    - Makes some other minor improvements for readability
    
    Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars (#329)
    
    Re-enable NVLS in nightly containers (#331)
    
    NVLS was disabled due to a known issue in NCCL 2.17 that caused
    intermittent hangs. The issue has been resolved in NCCL 2.18, so we are
    safe to re-enable NVLS.
    
    ---------
    
    Co-authored-by: Terry Kong <[email protected]>
    
    Update Pax TE patch to point to rebased branch (#348)
    
    Loosens t5x loss tests relative tolerances (#343)
    
    Relaxing the relative tolerance on the loss tests since it was leading
    to too many false positives. For reference, deviation in loss for the t5
    model can sometimes be up to 15% at the start of training with real
    data.
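A minimal sketch of what a relative-tolerance loss check could look like. The function name and the 15% figure used in the example call are illustrative assumptions, not the actual test code from #343:

```python
def loss_within_tolerance(actual: float, baseline: float, rel_tol: float) -> bool:
    """Return True if `actual` deviates from `baseline` by at most `rel_tol`,
    measured relative to the magnitude of `baseline`."""
    return abs(actual - baseline) <= rel_tol * abs(baseline)


# Hypothetical usage: a 10% deviation passes a 15% tolerance but fails a 5% one.
passes_loose = loss_within_tolerance(11.0, 10.0, rel_tol=0.15)
passes_tight = loss_within_tolerance(11.0, 10.0, rel_tol=0.05)
```

A looser `rel_tol` trades sensitivity for fewer false positives, which is the trade-off the commit message describes.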
    
    Adds rosetta-t5x TE + no-TE tests that enable the correct configs for testing (#332)
    
    - [ ] Add capability to retroactively test with newer test-t5x.sh like
    in
    [t5x-wget-test](https://github.com/NVIDIA/JAX-Toolbox/tree/t5x-wget-test)
    - [ ] Sets `ENABLE_TE=1` in the Dockerfile.t5x which is identical to the
    logic from before where it was always enabled in rosetta-t5x
    
    Fix markdown hyperlink for jax package on frontpage readme (#319)
    
    Adds a --seed option to test-t5x.sh to ensure determinism (#344)
    
    To ensure that the test results for a particular container are
    reproducible between runs, this change introduces a seed argument that
    sets the jax seed and dataset seed to 42. It remains configurable, but
    now there shouldn't be variance given the same container.
    
    - Also fixes a typo where --steps-per-epoch wasn't in the usage doc of
    this script
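To illustrate the idea of a configurable seed that defaults to 42, here is a rough Python sketch. The real script is a shell script (`test-t5x.sh`); the parser, the `fake_batches` helper, and the `--steps-per-epoch` default below are hypothetical stand-ins, and the standard-library `random` module takes the place of the JAX and dataset seeds:

```python
import argparse
import random


def parse_args(argv):
    """Parse a hypothetical subset of the test driver's options."""
    parser = argparse.ArgumentParser(description="Hypothetical test driver")
    # The seed stays configurable, but defaults to 42 so repeated runs match.
    parser.add_argument("--seed", type=int, default=42)
    # Default value here is an assumption made for the sketch.
    parser.add_argument("--steps-per-epoch", type=int, default=100)
    return parser.parse_args(argv)


def fake_batches(seed: int, steps: int):
    """Stand-in for seeded data loading: same seed, same sequence."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]


args = parse_args([])
run_a = fake_batches(args.seed, 3)
run_b = fake_batches(args.seed, 3)  # identical to run_a because the seed is fixed
```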
    
    Co-authored-by: NVIDIA <[email protected]>
    Co-authored-by: Yu-Hang "Maxin" Tang <[email protected]>
    
    Dynamic workflow run names (#356)
    
    This change introduces the dynamic [run name
    field](https://github.blog/changelog/2022-09-26-github-actions-dynamic-names-for-workflow-runs/#:~:text=GitHub%20Actions%20customers%20can%20now,visit%20the%20GitHub%20Actions%20community.)
    `run-name`.
    
    It's currently difficult on mobile to find the "workflow_run" that
    corresponds to a particular date, so hopefully this helps identify which
    builds were nightly vs which builds were manually triggered.
    
    I couldn't find a good way to dynamically look up the `name` field, so
    for now I copied all of the names. I also wasn't able to find a "created_at"
    for the scheduled workflows, so those don't have timestamps for now.
    
    __Assumptions__:
    * "workflow_run" == nightly since "scheduled" events only happen on
    `main` and `workflow_run` are only run for concrete workflows and not
    reusable workflows
    
    - [x] Test the workflow_run codepath
    - [x] Test the scheduled codepath
    
    ![image](https://github.com/NVIDIA/JAX-Toolbox/assets/7576060/4b916452-334a-4a73-9220-9fbadc70462f)
    
    Fix randomly failing tests for backend_independent on V100 (#351)
    
    Fixes random failures in the backend-independent section of JAX unit
    tests:
    ```
    Cannot find a free accelerator to run the test  on, exiting with failure
    ```
    
    Changes: limit the number of concurrent test jobs even for
    backend-independent tests, which do create GPU contexts.
    
    As a clarification, `--jobs` and `--local_test_jobs` do not make a
    difference for our particular CI pipeline, since JAX is built in a
    separate CI job anyway.
    
    References (From Reed Wanderman-Milne @ Google):
    
    > 1. In particular, you have to set NB_GPUS, JOBS_PER_ACC, and J
    correctly or you can get that error (I recently got the same error by
    not setting those correctly)
    > 2. (also I think --jobs should be --local_test_jobs in that code
    block, no reason to restrict the number of jobs compiling JAX)
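As a rough illustration of how the quoted variables relate, the sketch below assumes that `J` is the product of `NB_GPUS` and `JOBS_PER_ACC` and that it is passed to Bazel via `--local_test_jobs`. The helper name is hypothetical, and this is not the change made in #351 itself:

```python
def local_test_jobs_flag(nb_gpus: int, jobs_per_acc: int) -> str:
    """Build the Bazel flag that caps concurrent test jobs.

    Assumption: the total job count J is NB_GPUS * JOBS_PER_ACC, so that
    no more tests run at once than the accelerators can host.
    """
    if nb_gpus < 1 or jobs_per_acc < 1:
        raise ValueError("nb_gpus and jobs_per_acc must both be at least 1")
    j = nb_gpus * jobs_per_acc
    return f"--local_test_jobs={j}"


# Hypothetical usage: 2 GPUs with 4 test jobs per accelerator.
flag = local_test_jobs_flag(2, 4)
```

Using `--local_test_jobs` rather than `--jobs` limits only test execution, which matches the second quoted remark about not restricting compilation.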
    
    Propagate error code in ViT tests (#357)
    
    Merges rosetta unit tests and takes off the marker which spun up another matrix job (#360)
    
    This should simplify the rosetta tests and save some time since another
    matrix job was started for one test
    
    Propagate build failures (#363)
    
    Always run the `publish-build` step, regardless of whether the rosetta
    pax/t5x build was attempted. This ensures that badges correctly reflect
    build failures due to dependent builds failing.
    
    Patch for JAX core container (ARM64) (#367)
    
    Add patch to XLA to be able to build JAX core container for ARM64
    
    Update the doc for USE_FP8 (#349)
    
    This PR provides guidance on how to use the new configuration option,
    `USE_FP8`, to enable native FP8 support on Hopper GPUs.
    
    Update the native-fp8 guide with cudnn layer norm (#368)
    
    This PR updates the guide to include the new flag to enable the cudnn
    layer norm.
    
    cc. @ashors1 @terrykong @nouiz
    
    Add WAR for XLA NCCL bug causing OOMs (#362)
    
    A stopgap for #346
    
    fix TE multi-device test
    
    fix lzma build issue
    
    edit TE test name
    
    fix TE arm64 test install error
    
    remove --install option from get-source.sh
    
    fix TE arm64 test install error
    
    disable sandbox
    
    i'm jet-lagged
    
    use Pax image for TE testing
    
    Fix job dependency
    yhtang authored and terrykong committed Dec 8, 2023
    2ccf1a9
  3. Adds support for building rosetta with local patches and an already generated patch dir
    
    comment
    
    Add steps to archive patches in run
    
    Date the patches for readability
    
    Better log msg
    
    switch to --3way since that produces a merge conflict to help understand
    the conflict
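A small sketch of how a patch step might request a three-way merge. It assumes the patches are applied with `git apply`, which the commit message does not state, and the helper name and patch path are hypothetical:

```python
def build_apply_command(patch_path: str, three_way: bool = True) -> list:
    """Return the argv for applying a patch with git.

    Assumption: `git apply` is the command in use. With --3way, git falls
    back to a three-way merge and leaves conflict markers in the files
    when the patch does not apply cleanly, instead of only rejecting it.
    """
    cmd = ["git", "apply"]
    if three_way:
        cmd.append("--3way")
    cmd.append(patch_path)
    return cmd


# Hypothetical usage; pass the list to subprocess.run(cmd, check=True) inside a repo.
cmd = build_apply_command("patches/example.patch")
```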
    
    Switch to mealkit+finalize mechanic for rosetta builds
    
    Add github.run_id to artifacts for provenance
    
    Update all rosetta workflows with mealkit/final mechanism
    terrykong committed Dec 8, 2023
    404d629
  4. parent 8a43f4a

    author Terry Kong <[email protected]> 1700265014 -0800
    committer Terry Kong <[email protected]> 1701417338 -0800
    
    parent 8a43f4a
    author Terry Kong <[email protected]> 1700265014 -0800
    committer Terry Kong <[email protected]> 1701417298 -0800
    
    Adds
    (1) bump.sh which bumps the manifest and pins the patches
    (2) updates create-distribution.sh to work with manifests
    (3) move everything to .github/container
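To make the "bump the manifest" step concrete, here is a hedged sketch in Python. The real `bump.sh` is a shell script and the actual manifest schema is not shown in this log, so the field names (`url`, `tracking_ref`, `commit`) and the function are assumptions made for illustration only:

```python
import copy


def bump_manifest(manifest: dict, latest_commits: dict) -> dict:
    """Return a copy of `manifest` with each known repo pinned to a new commit.

    `manifest` maps a repo name to a dict of (assumed) fields;
    `latest_commits` maps a repo name to the commit to pin.
    The input manifest is left unmodified.
    """
    bumped = copy.deepcopy(manifest)
    for name, entry in bumped.items():
        if name in latest_commits:
            entry["commit"] = latest_commits[name]
    return bumped


# Hypothetical manifest with a single entry and placeholder commit ids.
manifest = {
    "jax": {
        "url": "https://github.com/google/jax.git",
        "tracking_ref": "main",
        "commit": "aaa",
    }
}
bumped = bump_manifest(manifest, {"jax": "bbb"})
```

Keeping the pinned commit alongside the tracked ref is one way a manifest can record a "last working world state" while still knowing where to look for updates.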
    
    sandbox
    
    fix
    
    write
    
    add propagation of trial branch to all workflows and update sandbox to
    test synchronous workflow check
    
    wip
    
    test
    
    wip
    
    changes
    
    wip
    
    wip
    
    don't need wip
    
    wip
    
    remove
    
    make trial branch contingent on publishing
    
    Update get-source and initial update for jax build to accept manifest
    
    update manifest
    
    jax build partially working + patches
    
    update pax/t5x dockerfiles, add more repos into manifest, and update
    pip-finalize to use *.in instead of manifest.txt
    
    update manifest with rest of repos and patches
    
    missing arg
    
    fix jax/pax/t5x
    
    all builds work now!
    
    update manifest file everywhere
    
    fix all workflows
    
    cleanup
    
    get the context right
    
    fix all broken tests
    
    custom pip distribution works
    terrykong committed Dec 8, 2023
    4404710
  5. fix pip patch

    terrykong committed Dec 8, 2023
    4951c31
  6. fix pip-finalize

    terrykong committed Dec 8, 2023
    7a852f8
  7. 6206f4c
  8. 5688cac
  9. style

    terrykong committed Dec 8, 2023
    83c07fd
  10. short -b switch

    terrykong committed Dec 8, 2023
    27036f7
  11. remove [optional]

    terrykong committed Dec 8, 2023
    2ef9313
  12. submodule init

    terrykong committed Dec 8, 2023
    ecf9a6a
  13. c5f0b87
  14. 645d4af
  15. e780f9b
  16. typo

    terrykong committed Dec 8, 2023
    53ee07e
  17. remove quotes

    terrykong committed Dec 8, 2023
    f2f13f5
  18. EOF & INNEREOF

    terrykong committed Dec 8, 2023
    48a3a09
  19. bump.sh documentation

    terrykong committed Dec 8, 2023
    e8b857b
  20. rm [Optional]

    terrykong committed Dec 8, 2023
    68b001a
  21. 7281f59
  22. trial branch description

    terrykong committed Dec 8, 2023
    2c54e8d
  23. revert sandbox

    terrykong committed Dec 8, 2023
    2e9fccd
  24. a554f01
  25. 723f91d

Commits on Dec 11, 2023

  1. Fix BASE_IMAGE description and upstream t5x/pax builds now allow base_image from workflow_dispatch

    terrykong committed Dec 11, 2023
    c0b683a
  2. cleanup

    terrykong committed Dec 11, 2023
    0828455

Commits on Dec 13, 2023

  1. 73df444
  2. 811c9d2

Commits on Dec 18, 2023

  1. aa7cea8