[moe] merge moe into main #4978

Merged 46 commits on Nov 2, 2023

Commits on Oct 26, 2023

  1. [moe] support moe fwd and bwd with low level zero (hpcaitech#4421)

    * fix test files
    
    * new file
    
    * add new
    
    * fix zero
    
    * update moe tests for forward and backward
    
    * remove useless test
    
    * remove print
    
    * moe
    
    * code style
    
    * code style
    
    * rename
    
    * rename
    
    * remove useless func
    
    * update param check
    
    * update utils and config
    oahzxl committed Oct 26, 2023
    75fa0b6
  2. [moe] support low level zero optim (hpcaitech#4429)

    * update optim
    
    * update grad handler
    
    * update moe param interface
    
    * update doc
    
    * move moe tensor
    oahzxl committed Oct 26, 2023
    4373d06
  3. [moe] refactor code to better adapt to llm (hpcaitech#4469)

    * polish code
    
    * rename
    
    * refactor code
    
    * fix test
    
    * refactor code
    
    * update flash attention version
    
    * Support TP (#6)
    
    * add tp test
    
    * update tp test
    
    * update
    
    * remove fa dependency
    
    * update dependency
    
    * update softmax
    
    * update checkpointio
    
    * update processgroupmesh
    
    * update name
    
    * update param
    
    * add keep vars
    oahzxl committed Oct 26, 2023
    8240463
  4. [moe] support local moe and fix bugs (hpcaitech#4574)

    * add local moe
    
    * update moe layer
    oahzxl committed Oct 26, 2023
    75fdcc2
  5. [moe] support openmoe inference (hpcaitech#4616)

    * init
    
    * update moe ckpt
    
    * update config
    
    * support openmoe inference
    
    * update config
    
    * remove pdb
    
    * update ci
    
    * update requirement
    
    * add build ffn experts
    
    * update requirement
    
    * update ci
    
    * update ci
    
    * update require
    
    * update ci
    oahzxl committed Oct 26, 2023
    61995f8
  6. [moe] support openmoe train (hpcaitech#4637)

    * init
    
    * update moe ckpt
    
    * update config
    
    * support openmoe inference
    
    * update config
    
    * remove pdb
    
    * support train
    
    * add ckpt download
    
    * update ckpt loading
    
    * use general ckpt
    oahzxl committed Oct 26, 2023
    bf53487
  7. [moe] align train settings and losses (hpcaitech#4655)

    * init
    
    * update moe ckpt
    
    * update config
    
    * support openmoe inference
    
    * update config
    
    * remove pdb
    
    * support train
    
    * add ckpt download
    
    * update ckpt loading
    
    * use general ckpt
    
    * add loss and optim
    
    * update ci
    
    * update require
    oahzxl committed Oct 26, 2023
    55a81a6
  8. [moe] move to moe and remove legacy (hpcaitech#4672)

    * init
    
    * update moe ckpt
    
    * update config
    
    * support openmoe inference
    
    * update config
    
    * remove pdb
    
    * support train
    
    * add ckpt download
    
    * update ckpt loading
    
    * use general ckpt
    
    * add loss and optim
    
    * update ci
    
    * update require
    
    * move
    
    * move
    
    * remove legacy
    
    * update file name and restore moe context
    
    * update module
    
    * update build_ffn_experts
    
    * update init
    
    * add ctx
    oahzxl committed Oct 26, 2023
    84f05b1
  9. [moe]: add top k router (hpcaitech#4597)

    * docs: add shape spec
    
    * docs: add doc
    
    * feat: add top_k router
    
    * feat: update init
    
    * test: add moe router tests
    
    * fix: reorder return values
    CWHer authored and oahzxl committed Oct 26, 2023
    d1d0de8
  10. [moe]: modify router loss, polish code (hpcaitech#4693)

    * feat: check z_loss and add doc
    
    * style: rename misleading variable
    
    * feat: modify auxiliary loss
    
    * feat: add aux_loss in topk router and modify doc
    
    * docs: add fn doc
    CWHer authored and oahzxl committed Oct 26, 2023
    708bf6f
  11. [moe] speed up embed and mlp (hpcaitech#4701)

    * update triton
    
    * update kernel
    
    * add init
    
    * add version check
    
    * update precision
    
    * update precision
    
    * update kernel in experts
    
    * update test arg
    
    * update settings
    oahzxl committed Oct 26, 2023
    fde57bf
  12. adb8ebe
  13. [moe]: add flash attention & optimize top2 router (hpcaitech#4712)

    * feat: add benchmark train
    
    * perf: use flash_attn
    
    * fix: modify benchmark config
    
    * fix: check flash attn installation
    
    * fix: update config with args
    
    * perf: optimize top2 router
    CWHer authored and oahzxl committed Oct 26, 2023
    3f02e57
  14. [moe] support hybrid parallel (hpcaitech#4748)

    * init policy
    
    * rename
    
    * update pp
    
    * finish pp
    
    * update script
    
    * update plugin
    
    * finish pp
    
    * update setup for different plugin
    
    * update ci
    
    * update ci
    
    * update ci
    
    * support ep inside or dp inside
    
    * update arg for kernel
    
    * disable ci
    
    * update train script
    
    * update plugin
    oahzxl committed Oct 26, 2023
    d12bbe7
  15. [moe] update benchmark (hpcaitech#4770)

    * init policy
    
    * rename
    
    * update pp
    
    * finish pp
    
    * update script
    
    * update plugin
    
    * finish pp
    
    * update setup for different plugin
    
    * update ci
    
    * update ci
    
    * update ci
    
    * support ep inside or dp inside
    
    * update arg for kernel
    
    * disable ci
    
    * update train script
    
    * fsdp
    
    * update train
    
    * update train
    
    * fsdp benchmark
    
    * rename
    
    * update fsdp bench
    
    * fix plugin
    
    * update benchmark
    oahzxl committed Oct 26, 2023
    b72fa37
  16. [moe] fix ci (hpcaitech#4772)

    * init policy
    
    * rename
    
    * update pp
    
    * finish pp
    
    * update script
    
    * update plugin
    
    * finish pp
    
    * update setup for different plugin
    
    * update ci
    
    * update ci
    
    * update ci
    
    * support ep inside or dp inside
    
    * update arg for kernel
    
    * disable ci
    
    * update train script
    
    * fsdp
    
    * update train
    
    * update train
    
    * fsdp benchmark
    
    * rename
    
    * update fsdp bench
    
    * fix plugin
    
    * update benchmark
    
    * fix ci
    
    * fix ci
    
    * rename
    
    * update ci
    
    * update test
    
    * update vocab
    
    * update chunk head
    oahzxl committed Oct 26, 2023
    5c97a96
  17. [moe] update benchmark scripts and ckpt io (hpcaitech#4804)

    * update benchmark script
    
    * update pp strategy
    
    * update plugin
    
    * update bench script
    
    * optimize
    
    * update pp layers
    
    * update zero ep
    
    * ep
    
    * update ckpt
    
    * update test
    oahzxl committed Oct 26, 2023
    c68303b
  18. [moe] support overlap for expert tp (hpcaitech#4851)

    * overlap comm
    
    * fix typo
    
    * update bench script
    
    * add option
    
    * update script
    
    * update bench
    oahzxl committed Oct 26, 2023
    4d74f83
  19. [moe] support hybrid zero strategy. (hpcaitech#4877)

    * overlap comm
    
    * fix typo
    
    * update bench script
    
    * add option
    
    * update script
    
    * update bench
    
    * param init
    
    * support dp zero
    
    * fix zero dp
    
    * fix bug
    
    * update pg bug
    
    * update experts
    
    * fix optim bug
    
    * update config
    
    * kaishen niubi
    
    * fix bug
    
    * embed
    
    * Merge branch 'feature/MoE' of https://github.com/hpcaitech/ColossalAI into bench
    
    * update bench
    
    * update optim
    
    * update doc
    
    * update sync
    
    * fix test
    
    * fix arg
    
    * update ckpt
    
    * update test
    
    * fix
    
    * remove print
    
    * polish code
    
    * update hybrid zero optim
    
    * update print
    oahzxl committed Oct 26, 2023
    2481b83
  20. update mm (hpcaitech#4893)

    oahzxl committed Oct 26, 2023
    7441a1f
  21. [moe] support load balance (hpcaitech#4914)

    * add load balance
    
    * update test
    
    * update param exchange
    
    * pass test
    
    * update test
    
    * update test
    
    * update test
    
    * update test
    
    * fix ranks
    
    * update
    oahzxl committed Oct 26, 2023
    5844f34
  22. update bench (hpcaitech#4923)

    oahzxl committed Oct 26, 2023
    5f20878
  23. [moe]: add overlap ep, and fix overlap tp (hpcaitech#4925)

    * test: add more ep/tp test case
    
    * to: add TPOverlap fn
    
    * fix: fix tp overlap
    
    * fix: remove useless variables
    
    * feat: add async all to all
    
    * feat: add overlap ep
    
    * fix: fix import error
    
    * fix: fix ep/tp tests
    
    * perf: optimize overlap
    
    * fix: add world_size check
    CWHer authored and oahzxl committed Oct 26, 2023
    b0e277b
  24. [moe] polish code (hpcaitech#4952)

    * doc
    
    * update script
    
    * update experts
    
    * update optim in fsdp
    
    * update kernel in sparse
    
    * empty cache
    
    * update script
    
    * update bench
    
    * update script
    
    * remove epzero2
    
    * fix
    
    * update print
    
    * update test script
    
    * update script
    
    * update manager
    
    * update host
    
    * update script
    oahzxl committed Oct 26, 2023
    4a7bf29
  25. [moe] update train script (hpcaitech#4959)

    * update
    
    * update ckpt
    
    * update train
    
    * update train
    oahzxl committed Oct 26, 2023
    c644b47
  26. update

    oahzxl committed Oct 26, 2023
    5cc3ad0
  27. delete context

    oahzxl committed Oct 26, 2023
    713446b
  28. remove moe

    oahzxl committed Oct 26, 2023
    1b19a5f
  29. fix bugs

    oahzxl committed Oct 26, 2023
    ca42bf4
  30. update timeout temporarily

    oahzxl committed Oct 26, 2023
    c381e4c
  31. resume time

    oahzxl committed Oct 26, 2023
    b19fb91

Commits on Oct 28, 2023

  1. fix bug

    oahzxl committed Oct 28, 2023
    61df786
  2. remove tp

    oahzxl committed Oct 28, 2023
    685c80a
  3. use kwargs

    oahzxl committed Oct 28, 2023
    9586f61

Commits on Oct 30, 2023

  1. polish and align with main

    oahzxl committed Oct 30, 2023
    6c0094c
  2. fix test

    oahzxl committed Oct 30, 2023
    b732ab0

Commits on Oct 31, 2023

  1. update doc

    oahzxl committed Oct 31, 2023
    e85122b
  2. Dist (#7)

    * dist bench
    
    * update fsdp
    oahzxl authored Oct 31, 2023
    25c329f
  3. update dist script

    oahzxl committed Oct 31, 2023
    6b03bd4
  4. update cai version

    oahzxl committed Oct 31, 2023
    659c9b1
  5. update fsdp

    oahzxl committed Oct 31, 2023
    caece56
  6. update zero

    oahzxl committed Oct 31, 2023
    9fe7680
  7. fix bug

    oahzxl committed Oct 31, 2023
    0eb5623

Commits on Nov 1, 2023

  1. reverse legacy

    oahzxl committed Nov 1, 2023
    4be194a
  2. update

    oahzxl committed Nov 1, 2023
    7e92e7b
  3. update readme

    oahzxl committed Nov 1, 2023
    da6392f