Pull in main from argonne-lcf/Megatron-DeepSpeed #9

Merged on Nov 15, 2024
553 commits
47bf9b5
Update `train_llama_alcf.sh`
saforem2 May 20, 2024
e68d270
Update `ALCF/README.md`
saforem2 May 20, 2024
4dd51dd
Merge pull request #14 from argonne-lcf/sunspot-frameworks-tests
saforem2 May 20, 2024
ac414a0
Update README.md
saforem2 May 20, 2024
9aa7fab
Fix path in `prof.export_chrome_trace()` from `pretrain_gpt_alcf.py`
saforem2 May 20, 2024
7d20359
Merge pull request #15 from argonne-lcf/fix-trace-output-path
saforem2 May 20, 2024
0508cf6
changed environment variable
zhenghh04 May 20, 2024
c4250a1
added torch profiler per step output support
zhenghh04 May 20, 2024
fa04d11
local changes
zhenghh04 May 22, 2024
bec5f9e
merge
zhenghh04 May 22, 2024
6cca87f
distributed loading
zhenghh04 May 22, 2024
62f8f56
fixed print issue
zhenghh04 May 23, 2024
2f01543
Update README.md
saforem2 May 23, 2024
13171c2
Update README.md
saforem2 May 24, 2024
06ac065
Added function for on-the-fly building the dataset
zhenghh04 May 24, 2024
120a2b5
fixed minor issue in _build_train_valid_test_datasets_single
zhenghh04 May 24, 2024
6fdbfd3
fixed variable order in Builder
zhenghh04 May 24, 2024
3e2aa23
fixed minor issue
zhenghh04 May 24, 2024
b371742
Add `setup_tokenizer_and_data()` function to `ALCF/helpers.sh`
saforem2 May 24, 2024
d93fb7f
Update `train_llama_alcf.sh`
saforem2 May 24, 2024
05d82c3
Update `train_aGPT_7B.sh`
saforem2 May 24, 2024
6de8496
Update `ALCF/README.md`
saforem2 May 24, 2024
03aa7c1
Update `ALCF/helpers.sh`
saforem2 May 24, 2024
3cd3f1a
Update `train_aGPT_7B.sh`
saforem2 May 24, 2024
bc1dbfd
Fix `--data-cache-path` in `ALCF/helpers.sh, train_llama_alcf.sh`
saforem2 May 24, 2024
c3a4451
Add `ALCF/sunspot-env-2024-04-15-002.sh`
saforem2 May 25, 2024
0fc3919
Update `train_aGPT_7B.sh`
saforem2 May 25, 2024
318d860
Merge branch 'tokenizer-tests' of https://github.com/argonne-lcf/Mega…
saforem2 May 25, 2024
c7a20cf
Merge pull request #17 from argonne-lcf/tokenizer-tests
saforem2 May 25, 2024
efb2a3a
added a barrier to make sure all the datasets are built before other …
zhenghh04 May 28, 2024
ccf8835
Update `{train_llama_alcf.sh,ALCF/helpers.sh}`
saforem2 May 29, 2024
358139f
Update `ALCF/helpers.sh`
saforem2 May 30, 2024
47585ed
Concat datasets that belong to the same corpus
zhenghh04 May 31, 2024
2b5b41f
convert MDS checkpoint to Hf Llama model
vksastry May 31, 2024
10a34ea
fixed bugs
zhenghh04 May 31, 2024
7c80c2c
Update `ALCF/helpers.sh`
saforem2 May 31, 2024
93db2a9
optimized loading blendable dataset meta data, by loading and broadca…
zhenghh04 May 31, 2024
89f2a95
added broadcast
zhenghh04 May 31, 2024
96cb1e5
fixed overflow issue
zhenghh04 May 31, 2024
cb2f1dc
removed unnecessary mpi4py
zhenghh04 May 31, 2024
b48d6f8
Merge pull request #18 from argonne-lcf/distributed_loading_v2
zhenghh04 May 31, 2024
0dea6aa
Update dataset_utils.py
zhenghh04 Jun 4, 2024
f16416a
merge distributed_loading
zhenghh04 Jun 5, 2024
5d26dfe
fixed a minor bug
zhenghh04 Jun 5, 2024
3dc424f
remove unnecessary barrier
zhenghh04 Jun 5, 2024
60fc482
added pfw tracing for test_blendable_dataset
zhenghh04 Jun 5, 2024
b1f17d5
fixed bug
zhenghh04 Jun 5, 2024
10a3737
added more logging
zhenghh04 Jun 5, 2024
bc28f84
removed allreduce calls that are not needed
zhenghh04 Jun 5, 2024
6eb21b7
removed allreduce call that are not needed any more
zhenghh04 Jun 5, 2024
20a2430
fixed a bug
zhenghh04 Jun 5, 2024
f718694
added more logging info
zhenghh04 Jun 5, 2024
699bde4
Merge branch 'distributed_loading' of ../Megatron-DeepSpeed-distribut…
zhenghh04 Jun 5, 2024
dd3b070
Merge branch 'distributed_loading' of github.com:argonne-lcf/Megatron…
zhenghh04 Jun 5, 2024
b4c832e
added more logging for index_dataset
zhenghh04 Jun 5, 2024
1719b0e
added new log
zhenghh04 Jun 5, 2024
053b42d
changed things into helper
zhenghh04 Jun 5, 2024
52b2cca
fixed issue with dlioprofiler
zhenghh04 Jun 6, 2024
cbc7830
fixed some bugs
zhenghh04 Jun 6, 2024
03a9bfa
Merge branch 'pfw_trace' of github.com:argonne-lcf/Megatron-DeepSpeed…
zhenghh04 Jun 6, 2024
36a2671
fixed profiler issue
zhenghh04 Jun 6, 2024
5c8d376
reduced printing
zhenghh04 Jun 6, 2024
d9085b6
added more timing info
zhenghh04 Jun 10, 2024
0ef6bfd
fixed timing issue for all reduce
zhenghh04 Jun 10, 2024
26ee1c3
Merge pull request #20 from argonne-lcf/pfw_trace
zhenghh04 Jun 10, 2024
9413dc9
Merge pull request #21 from argonne-lcf/distributed_loading_v2
zhenghh04 Jun 10, 2024
f6363fb
changed init
zhenghh04 Jun 12, 2024
a55df51
reducing printing from non-root ranks
zhenghh04 Jun 12, 2024
a24f01b
reduce printing
zhenghh04 Jun 12, 2024
5a54149
reducing printing
zhenghh04 Jun 12, 2024
3acdda7
added MiCS as an option
zhenghh04 Jun 13, 2024
73f6cee
Merge branch 'mics' into distributed_loading
zhenghh04 Jun 13, 2024
712d08d
Update `dropout` in `ALCF/helpers.sh`
saforem2 Jun 14, 2024
482c235
Update {`ALCF/helpers.sh`, `train_llama_alcf.sh`}
saforem2 Jun 14, 2024
2e26950
Merge pull request #22 from argonne-lcf/sequence-parallel
saforem2 Jun 14, 2024
f4c2c16
Add `ALCF/data-lists/aurora/*.txt`
saforem2 Jun 14, 2024
231d2b5
Add `setup_conda_aurora` to `ALCF/helpers.sh`
saforem2 Jun 14, 2024
852575d
Merge pull request #23 from argonne-lcf/aurora-updates
saforem2 Jun 14, 2024
aaf6152
Fix `ezpz_{save,get}jobenv` in `ALCF/helpers.sh`
saforem2 Jun 14, 2024
56a1c37
Merge pull request #24 from argonne-lcf/ezpz-hotfix
saforem2 Jun 14, 2024
b905e53
Correctly set `dfl_fallback` on Aurora if no `DATA_FILE_LIST` specified
saforem2 Jun 14, 2024
4a07103
Merge pull request #25 from argonne-lcf/aurora-dfl-fix
saforem2 Jun 14, 2024
ba5f871
added warning if the file list is not provided correctly
zhenghh04 Jun 14, 2024
c690202
make it still compatible to previous
zhenghh04 Jun 14, 2024
a96bcea
added support for XPU
zhenghh04 Jun 14, 2024
30fe479
Update README.md
saforem2 Jun 14, 2024
9208eae
Update README.md
saforem2 Jun 14, 2024
caf82d7
Merge pull request #26 from argonne-lcf/saforem2-patch-1
saforem2 Jun 14, 2024
1f983f3
Create `llama-toggle` branch
saforem2 Jun 14, 2024
f902e91
Merge pull request #19 from argonne-lcf/checkpoint_convert
saforem2 Jun 15, 2024
67d6810
Update README.md
saforem2 Jun 15, 2024
3091871
Update `setEnv` for Aurora in `ALCF/helpers.sh`
saforem2 Jun 15, 2024
81fe55f
Update README.md
saforem2 Jun 15, 2024
983a0bd
Merge pull request #27 from argonne-lcf/saforem2-patch-1
saforem2 Jun 15, 2024
7d1784b
Updates to `NO_LLAMA` mode
saforem2 Jun 15, 2024
bf979a7
Update `pretrain_gpt_alcf.py`
saforem2 Jun 15, 2024
84fa77c
Update `pretrain_gpt_alcf.py`
saforem2 Jun 15, 2024
e6461f5
Merge pull request #28 from argonne-lcf/llama-toggle
saforem2 Jun 16, 2024
f138b27
added more log
zhenghh04 Jun 17, 2024
a7249fe
resolve conflict in file list
zhenghh04 Jun 17, 2024
a36569e
added warning info when XPU profiling is not available
zhenghh04 Jun 17, 2024
79d11a7
Create `alcf-patch-1` branch
saforem2 Jun 18, 2024
e058427
Update `ALCF/helpers.sh`
saforem2 Jun 18, 2024
1ae3768
Update `ALCF/helpers.sh`
saforem2 Jun 19, 2024
abead32
Update `ALCF/README.md`
saforem2 Jun 19, 2024
025ff3f
Update `ALCF/README.md`
saforem2 Jun 19, 2024
d012937
Merge pull request #29 from argonne-lcf/alcf-patch-1
saforem2 Jun 19, 2024
ef5356b
Merge pull request #16 from argonne-lcf/distributed_loading
saforem2 Jun 19, 2024
732e567
Add `ALCF/data-lists/aurora/*.txt`
saforem2 Jun 19, 2024
0320b69
Update `ALCF/data-lists/sunspot/*.txt`
saforem2 Jun 19, 2024
a51fb11
Update `ALCF/data-lists/polaris/*.txt`
saforem2 Jun 19, 2024
9d10704
Update `.gitignore`
saforem2 Jun 19, 2024
ec600e5
Update `ALCF/helpers.sh`
saforem2 Jun 19, 2024
168cdda
Add `ALCF/requirements/requirements.txt`
saforem2 Jun 19, 2024
7df9329
Update `ALCF/helpers.sh`
saforem2 Jun 19, 2024
77ffd10
Update `ALCF/helpers.sh`
saforem2 Jun 19, 2024
e884f15
Update `ALCF/{helpers.sh,requirements/requirements.txt}`
saforem2 Jun 19, 2024
10a17e2
Merge pull request #30 from argonne-lcf/distributed-data-lists
saforem2 Jun 19, 2024
fb49de8
Update `ALCF/helpers.sh`
saforem2 Jun 19, 2024
7272326
Update `ALCF/helpers.sh`
saforem2 Jun 19, 2024
18ca369
Merge pull request #31 from argonne-lcf/alcf-helpers-patch-1
saforem2 Jun 19, 2024
f826667
Update `ALCF/helpers.sh` with kvs fix on Aurora
saforem2 Jun 21, 2024
26b846a
Update `ALCF/helpers.sh`
saforem2 Jun 21, 2024
7cd5bfa
Merge pull request #32 from argonne-lcf/alcf-aurora-kvs-fix
saforem2 Jun 21, 2024
bc7fbc6
Update `ALCF/README.md`
saforem2 Jun 21, 2024
f94b845
Update `ALCF/README.md`
saforem2 Jun 21, 2024
6f98d5a
Merge pull request #33 from argonne-lcf/alcf-update-readme
saforem2 Jun 21, 2024
06357f4
Create `alcf-startup-time`
saforem2 Jun 21, 2024
c7a1e36
Add `ALCF/notes/deepspeed_init_time.md`
saforem2 Jun 24, 2024
0548bfb
Update `ALCF/notes/deepspeed_init_time.md`
saforem2 Jun 24, 2024
6a8f55c
Update deepspeed_init_time.md
saforem2 Jun 24, 2024
d0e3d79
Update `ALCF/helpers.sh`
saforem2 Jun 25, 2024
bb690e3
Update `pretrain_gpt_alcf.py`
saforem2 Jun 25, 2024
aa698da
Update `train_llama_alcf.sh`
saforem2 Jun 25, 2024
12baf30
Update `megatron/training.py`
saforem2 Jun 25, 2024
8eabb7a
Update `megatron/training.py`
saforem2 Jun 25, 2024
d9fc18e
Update `ALCF/helpers.sh`
saforem2 Jun 25, 2024
1d413c6
Update `megatron/training.py`
saforem2 Jun 25, 2024
93e4a51
Update `megatron/utils.py`
saforem2 Jun 25, 2024
99bddfa
Update `ALCF/helpers.sh`
saforem2 Jun 25, 2024
9a8ccfd
Update `ALCF/helpers.sh`
saforem2 Jun 25, 2024
c6a63bc
Merge pull request #34 from argonne-lcf/alcf-startup-time
saforem2 Jun 25, 2024
57ba1fb
Update `ALCF/helpers.sh`
saforem2 Jun 26, 2024
7388c1a
Update `ALCF/helpers.sh`
saforem2 Jun 29, 2024
37a7c5c
Merge pull request #36 from argonne-lcf/alcf-helpers-patch
saforem2 Jun 29, 2024
b511a2e
Update `ALCF/helpers.sh`
saforem2 Jul 5, 2024
561ddc1
Fix micro batch size on Polaris
saforem2 Jul 10, 2024
9ee09fe
Update `ALCF/helpers.sh`
saforem2 Jul 10, 2024
76209f4
Update `ALCF/helpers.sh`
saforem2 Jul 10, 2024
541ebf1
Update `ALCF/helpers.sh`
saforem2 Jul 10, 2024
d76331f
Update `ALCF/helpers.sh`
saforem2 Jul 10, 2024
d017b4c
Update `ALCF/helpers.sh`
saforem2 Jul 11, 2024
bac8aab
Update `ALCF/helpers.sh`
saforem2 Jul 11, 2024
911cc5c
Update `ALCF/helpers.sh`
saforem2 Jul 12, 2024
2ac4fb0
Update `ALCF/helpers.sh`
saforem2 Jul 15, 2024
4876eb8
Update `ALCF/helpers.sh` on Polaris
saforem2 Jul 16, 2024
7385e3b
Update `ALCF/helpers.sh`
saforem2 Jul 16, 2024
5f5bbd4
Update `pretrain_gpt_alcf.py`
saforem2 Jul 16, 2024
b38bcb6
Update `ALCF/helpers.sh`
saforem2 Jul 19, 2024
0999de2
Update `ALCF/requirements/requirements.txt`
saforem2 Jul 19, 2024
6ad3a99
Fix opt hyperparams in `ALCF/helpers.sh`
saforem2 Jul 19, 2024
019dc3c
Update `ALCF/helpers.sh`
saforem2 Jul 20, 2024
54bd608
Track grad_norm in `megatron/training.py`
saforem2 Jul 20, 2024
969f4c5
Update `train_aGPT_7B.sh`
saforem2 Jul 20, 2024
9550656
Update `train_llama_alcf.sh`
saforem2 Jul 22, 2024
5d96d64
Update `train_aGPT_7B.sh`
saforem2 Jul 22, 2024
8897dc2
Merge pull request #43 from argonne-lcf/alcf-helpers-patch-1
saforem2 Jul 22, 2024
bcbe75f
Update README.md
saforem2 Jul 31, 2024
0270321
Merge pull request #49 from argonne-lcf/saforem2-patch-2
saforem2 Jul 31, 2024
b7c17ca
Move `ALCF/mds_to_hf.py` to `mds_to_hf.py`
saforem2 Aug 23, 2024
81470e9
Merge pull request #51 from argonne-lcf/checkpoint-conversion
saforem2 Aug 23, 2024
5001600
fixed data loader issue for TP>1 PP>1
zhenghh04 Aug 30, 2024
38b2505
Update `ALCF/data-lists/aurora/*.txt`
saforem2 Aug 30, 2024
461bc7f
Merge pull request #52 from argonne-lcf/bugfix/tp_pp_dataloader
saforem2 Aug 30, 2024
ea0c3c7
fixed dftracer compatibility
zhenghh04 Aug 30, 2024
50e2729
hf cp conversion and inference scripts added
Aug 31, 2024
464a0d2
Merge pull request #53 from argonne-lcf/checkpoint_hf
saforem2 Aug 31, 2024
a0ac750
added requirements.txt
zhenghh04 Sep 3, 2024
de7f22f
Update utils.py
zhenghh04 Sep 4, 2024
3edba7f
Add `--train-range-to-skip` to `megatron/arguments.py`
saforem2 Sep 9, 2024
76a259b
Add logic for `--train-range-to-skip` to `megatron/training.py`
saforem2 Sep 9, 2024
fd1ac6d
Update `ALCF/helpers.sh`
saforem2 Sep 10, 2024
6f27f5d
Update `train_aGPT_7B.sh`
saforem2 Sep 10, 2024
6df33ad
fix: `--override-opt_param-scheduler` if `OVERRIDE_CKPT_OPT_PARAM=1`
saforem2 Sep 11, 2024
73720c2
Merge pull request #56 from argonne-lcf/train-skip-range
saforem2 Sep 11, 2024
8bc5313
merge: Create `microsoft-main`
saforem2 Sep 12, 2024
a1ede68
Remove duplicate `--profile` arg
saforem2 Sep 12, 2024
6b32cff
debug: `sequence_parallel` issue in `RMSNorm` ??
saforem2 Sep 12, 2024
12f6f8e
fix check
zhenghh04 Sep 12, 2024
5ac877a
Update `megatron/training_log_alcf.py`
saforem2 Sep 12, 2024
b3e0f6f
Update `megatron/training.py`
saforem2 Sep 13, 2024
2113dbc
Update `megatron/utils.py`
saforem2 Sep 13, 2024
7f71572
Update `megatron/training_log.py`
saforem2 Sep 13, 2024
7cb9c11
Update `pretrain_gpt_alcf.py`
saforem2 Sep 15, 2024
e83de19
Update `megatron/training_log.py`
saforem2 Sep 15, 2024
29756d6
Warn if mismatch b/w iters in `megatron/checkpointing.py`
saforem2 Sep 15, 2024
1a7f03b
fix: `try/except` for non tensors in `megatron/training_log.py`
saforem2 Sep 16, 2024
828f6a9
fix: Correctly draw `grad_acc_steps` batches of data when skipping step
saforem2 Sep 17, 2024
295fcb3
Update `pretrain_gpt_alcf.py`
saforem2 Sep 17, 2024
cf80e6b
added sophia
Sep 23, 2024
09accde
Merge pull request #59 from mngom2/spike-skipper
saforem2 Sep 30, 2024
cef3fc7
Merge pull request #58 from argonne-lcf/spike-skipper
saforem2 Oct 8, 2024
fd94b37
merge: Resolve merge conflicts pulling in from Microsoft upstream
saforem2 Oct 8, 2024
9b5be12
merge: `argonne-lcf-microsoft-main` into `main`
saforem2 Oct 11, 2024
5394156
shuffle concate dataset index
zhenghh04 Oct 12, 2024
573b668
fixed bugs
zhenghh04 Oct 12, 2024
41ff059
Update `ALCF/helpers.sh`, `train_aGPT_7B.sh`
saforem2 Oct 12, 2024
89db92a
merge: `feature/profile` with data fix into `microsoft-main`
saforem2 Oct 12, 2024
9de83a9
Fix `shuffle_idx` in `megatron/data/gpt_dataset.py`
saforem2 Oct 12, 2024
d7a2594
Fix `shuffle_idx` in `megatron/data/gpt_dataset.py`
saforem2 Oct 12, 2024
3e33a6a
Update `ALCF/helpers.sh`, `train_aGPT_7B.sh`
saforem2 Oct 13, 2024
43cde2b
Update `pretrain_gpt_alcf.py`
saforem2 Oct 13, 2024
9f09733
Update `megatron/data/{blendable,gpt,indexed}_dataset.py`
saforem2 Oct 13, 2024
2b31b44
Update `ALCF/requirements/requirements.txt`
saforem2 Oct 13, 2024
5e9eed0
Update `megatron/utils.py`
saforem2 Oct 13, 2024
3dcb297
fixed bugs and added commandline option
zhenghh04 Oct 14, 2024
bec9b7a
Merge branch 'debug-logging' into feature/profile
saforem2 Oct 14, 2024
43fc2fe
fixed typo
zhenghh04 Oct 14, 2024
94d5337
Merge branch 'feature/profile' of github.com:argonne-lcf/Megatron-Dee…
zhenghh04 Oct 14, 2024
bb55e97
Merge pull request #67 from argonne-lcf/feature/profile
saforem2 Oct 14, 2024
d50239f
added support for blending samples across different files in the same…
zhenghh04 Oct 14, 2024
9b4f510
Merge pull request #64 from argonne-lcf/debug-logging
saforem2 Oct 14, 2024
324ef11
Merge branch 'alcf-hzheng-data-fix' into hzheng-data-fix
saforem2 Oct 15, 2024
45ff652
Discard changes to megatron/data/gpt_dataset.py
saforem2 Oct 15, 2024
52a406c
Consistent logging in `megatron/data/*.py`
saforem2 Oct 15, 2024
63b1901
Update `megatron/data/gpt_dataset.py`
saforem2 Oct 16, 2024
7ef26bf
Use `time.perf_counter` in `megatron/data/blendable_dataset.py`
saforem2 Oct 16, 2024
deb95cd
fix init issue for silently ignoring the deepspeed config (#452)
xylian86 Oct 17, 2024
68da2db
Update `ALCF/helpers.sh`
saforem2 Oct 17, 2024
ab3a8ec
Merge branch 'main' of https://github.com/microsoft/Megatron-DeepSpee…
saforem2 Oct 18, 2024
ed21bd9
Merge branch 'hzheng-data-fix' of https://github.com/argonne-lcf/Mega…
saforem2 Oct 18, 2024
6acc370
fix moe tflops (#445)
ranzhejiang Oct 18, 2024
467279b
Merge 'upstream/main' into `hzheng-data-fix`
saforem2 Oct 18, 2024
9e015cc
Remove duplicate `gradient_accumulation_steps` in DS config
saforem2 Oct 18, 2024
58dc2d7
Update default EVAL args
saforem2 Oct 21, 2024
277d308
Catch eval metrics in `megatron/training.py`
saforem2 Oct 21, 2024
af4cba1
Save git branch to env in `train_aGPT_7B.sh`
saforem2 Oct 21, 2024
8a8472c
fixed print out bug
zhenghh04 Oct 21, 2024
dfd0643
Merge pull request #68 from argonne-lcf/feature/blending_corpus
saforem2 Oct 21, 2024
6cb727d
Fix `args.shuffle` in `megatron/data/gpt_dataset.py`
saforem2 Oct 21, 2024
5d10179
Update `--{shuffle,blend}-sample-in-corpus` arg in `ALCF/helpers.sh`
saforem2 Oct 24, 2024
160d6a6
fix: `GRAD_ACC_STEPS` when `NHOSTS == 256`
saforem2 Oct 31, 2024
40db8c2
Merge pull request #63 from argonne-lcf/hzheng-data-fix
saforem2 Nov 5, 2024
ce7d553
🚧 `ALCF/ds_to_universal.py`
saforem2 Nov 7, 2024
8e0bff8
docs: Add `ALCF/notes/checkpoints.md`
saforem2 Nov 7, 2024
bd8c246
feat: Enable `--use-flash-attn-builder` by default on Aurora
saforem2 Nov 7, 2024
26f2e71
Update python.yml
saforem2 Nov 7, 2024
48b3c81
Update python.yml
saforem2 Nov 7, 2024
0a997bb
Update python.yml
saforem2 Nov 7, 2024
c4de4d1
Merge pull request #62 from argonne-lcf/microsoft-main
saforem2 Nov 12, 2024
2 changes: 1 addition & 1 deletion .github/workflows/python.yml
@@ -16,7 +16,7 @@ jobs:
   unit-tests:
     strategy:
       matrix:
-        pyVersion: ["3.7", "3.8", "3.9", "3.10"]
+        pyVersion: ["3.10"]
       fail-fast: false
 
     runs-on: ubuntu-22.04
42 changes: 42 additions & 0 deletions .gitignore
@@ -1,10 +1,52 @@
# User Added
.jobenv
**.e[0-9]**
**.o[0-9]**
**.e6**
**.o6**
**.e9**
**.o9**
**.e1**
**.o1**
*.o17*
*.e17*
*.o1
*.e1
deps/*
OUTPUTS/*
ALCF/OUTPUTS/*
*tmp*
*core.*
*old*
*.bak
**index-cache**
**pbslogs**
ezpz
*hostfile*
.deepspeed_env
*.DS_Store
old/*
**venv**
*.json
outputs/
venvs/
wandb/
llama-logs/
checkpoints/
*.gz
*.txt
*.idx
*.bin
*.log
__pycache__

.deepspeed_env
*.bak
.cache/*
outputs/
venvs/
wandb/
llama-logs/
checkpoints/
*.gz
*.txt
1,170 changes: 1,170 additions & 0 deletions ALCF/README.md

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions ALCF/aws_ofi_nccl_plugin.sh
@@ -0,0 +1,20 @@
#!/bin/bash --login

# AWS NCCL OFI Plugin settings below
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
#########################################################
# WARNING: !!!
# - Currently, `export NCCL_NET_GDR_LEVEL=PHB`
# causes a hang on Polaris.
# so, we don't set it for the time being [2024-05-14].
# - Seems to work on Perlmutter ???
#
# export NCCL_NET_GDR_LEVEL=PHB
#########################################################
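A quick sanity check for the settings above, as a minimal Python sketch. The variable names and values are copied from `ALCF/aws_ofi_nccl_plugin.sh`; the helper `check_plugin_env` is hypothetical and not part of this PR. It only compares the environment against what the script exports and flags `NCCL_NET_GDR_LEVEL`, which the script deliberately leaves unset.

```python
import os

# Values exported by ALCF/aws_ofi_nccl_plugin.sh (copied from the diff above).
EXPECTED = {
    "NCCL_CROSS_NIC": "1",
    "NCCL_COLLNET_ENABLE": "1",
    "NCCL_NET": "AWS Libfabric",
    "FI_CXI_DISABLE_HOST_REGISTER": "1",
    "FI_MR_CACHE_MONITOR": "userfaultfd",
    "FI_CXI_DEFAULT_CQ_SIZE": "131072",
}


def check_plugin_env(env=None):
    """Return a list of human-readable mismatches (hypothetical helper)."""
    env = os.environ if env is None else env
    problems = []
    for name, want in EXPECTED.items():
        got = env.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    # The script comments this export out because of a reported hang on Polaris.
    if "NCCL_NET_GDR_LEVEL" in env:
        problems.append("NCCL_NET_GDR_LEVEL is set; the script leaves it unset")
    return problems


if __name__ == "__main__":
    for problem in check_plugin_env():
        print(problem)
```

Running it after sourcing the script should print nothing if the exports took effect; it does not check `LD_LIBRARY_PATH`, since those paths are site-specific.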
16 changes: 16 additions & 0 deletions ALCF/data-lists/aurora/algebraic.txt
@@ -0,0 +1,16 @@
0.0018520780893211373 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0000_text_document algebraic-stack-train
0.0017591050606817512 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0001_text_document algebraic-stack-train
0.001459052794333798 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0002_text_document algebraic-stack-train
0.0007405667281569194 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0003_text_document algebraic-stack-train
0.00019420030110896795 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0004_text_document algebraic-stack-train
0.0009008668715801845 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0005_text_document algebraic-stack-train
0.00015115827957143057 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0006_text_document algebraic-stack-train
0.0014552844319220648 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0007_text_document algebraic-stack-train
0.0012469861325685161 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0008_text_document algebraic-stack-train
0.00136412011372413 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0009_text_document algebraic-stack-train
0.0007064279699221103 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0010_text_document algebraic-stack-train
0.0008472240000687427 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0011_text_document algebraic-stack-train
0.0001984375713341955 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0012_text_document algebraic-stack-train
0.0005472773881697123 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0013_text_document algebraic-stack-train
0.001815779629850992 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0014_text_document algebraic-stack-train
0.0018313600689757324 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0015_text_document algebraic-stack-train
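Each row in these data lists appears to hold three whitespace-separated fields: a sampling weight, a dataset path prefix, and a corpus name. That reading is inferred from the rows themselves, not confirmed by the diff. Under that assumption, a minimal Python sketch for reading such a file and totalling the weight per corpus might look like this (`DataListEntry`, `parse_data_list`, and `corpus_weights` are hypothetical names, not functions from this repository):

```python
from collections import defaultdict
from typing import NamedTuple


class DataListEntry(NamedTuple):
    weight: float  # assumed: relative sampling weight
    prefix: str    # assumed: path prefix of the tokenized dataset
    corpus: str    # assumed: corpus label shared by related files


def parse_data_list(text):
    """Parse 'weight prefix corpus' rows, skipping blank lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        weight, prefix, corpus = fields
        entries.append(DataListEntry(float(weight), prefix, corpus))
    return entries


def corpus_weights(entries):
    """Sum the weights of all entries that share a corpus name."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry.corpus] += entry.weight
    return dict(totals)


# Two rows copied from ALCF/data-lists/aurora/algebraic.txt.
SAMPLE = """\
0.0018520780893211373 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0000_text_document algebraic-stack-train
0.0017591050606817512 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0001_text_document algebraic-stack-train
"""

if __name__ == "__main__":
    print(corpus_weights(parse_data_list(SAMPLE)))
```

The sketch rejects rows that do not have exactly three fields, so a malformed list fails loudly instead of being silently misread.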
100 changes: 100 additions & 0 deletions ALCF/data-lists/aurora/arxiv.txt
@@ -0,0 +1,100 @@
0.0002583902668716813 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0000_text_document arxiv
0.0002646575141232155 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0001_text_document arxiv
0.0003165521247456758 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0002_text_document arxiv
0.0002920706460176214 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0003_text_document arxiv
0.00028396813182810215 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0004_text_document arxiv
0.00030445161883108107 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0005_text_document arxiv
0.00031628781276576474 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0006_text_document arxiv
0.0003083776568189157 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0007_text_document arxiv
0.0003176359471472902 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0008_text_document arxiv
0.0002536009369131698 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0009_text_document arxiv
0.0003067491424681363 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0010_text_document arxiv
0.0002597217257557784 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0011_text_document arxiv
0.0003788556450109768 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0012_text_document arxiv
0.0002796563272052598 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0013_text_document arxiv
0.00033573826524290287 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0014_text_document arxiv
0.00030523658022800287 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0015_text_document arxiv
0.00032211552192240096 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0016_text_document arxiv
0.0003329295675164247 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0017_text_document arxiv
0.0003101982186639862 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0018_text_document arxiv
0.00032361798234223355 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0019_text_document arxiv
0.0003495541581652915 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0020_text_document arxiv
0.0002821637448858042 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0021_text_document arxiv
0.00030399523537629673 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0022_text_document arxiv
0.0002955658968247219 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0023_text_document arxiv
0.00028942158502924254 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0024_text_document arxiv
0.00028769546171490733 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0025_text_document arxiv
0.0002938111057234182 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0026_text_document arxiv
0.0002711150403010948 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0027_text_document arxiv
0.00031130095874747565 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0028_text_document arxiv
0.0003002996118160777 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0029_text_document arxiv
0.0003732757901604459 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0030_text_document arxiv
0.00026784205751795894 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0031_text_document arxiv
0.0002799626521661984 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0032_text_document arxiv
0.00034334276069078164 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0033_text_document arxiv
0.0003582469803674965 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0034_text_document arxiv
0.00031094844818418623 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0035_text_document arxiv
0.0002766228384977191 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0036_text_document arxiv
0.00030297116159471485 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0037_text_document arxiv
0.00027033888377464685 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0038_text_document arxiv
0.00030090862368377933 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0039_text_document arxiv
0.00028543875802490955 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0040_text_document arxiv
0.00027559768459074204 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0041_text_document arxiv
0.0003182185533962886 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0042_text_document arxiv
0.0003311392971435837 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0043_text_document arxiv
0.00028751652060804325 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0044_text_document arxiv
0.000303466863212589 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0045_text_document arxiv
0.00033400462801277524 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0046_text_document arxiv
0.0002589234031777426 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0047_text_document arxiv
0.0002913508598466723 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0048_text_document arxiv
0.0002670572450004856 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0049_text_document arxiv
0.00032027399105647656 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0050_text_document arxiv
0.00032188376258379377 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0051_text_document arxiv
0.0003161585784100882 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0052_text_document arxiv
0.0003184249182974135 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0053_text_document arxiv
0.00030381336664000807 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0054_text_document arxiv
0.0003190437442184283 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0055_text_document arxiv
0.0002537961798200545 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0056_text_document arxiv
0.0003017817117223326 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0057_text_document arxiv
0.00028685268513240224 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0058_text_document arxiv
0.00031265179094451165 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0059_text_document arxiv
0.00034708319096986816 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0060_text_document arxiv
0.00026650837943080664 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0061_text_document arxiv
0.00034588832248507335 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0062_text_document arxiv
0.0002416982248399037 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0063_text_document arxiv
0.0003089296918222243 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0064_text_document arxiv
0.00029137184185700827 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0065_text_document arxiv
0.00026464226846800774 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0066_text_document arxiv
0.00030545397919456627 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0067_text_document arxiv
0.0003206778460448875 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0068_text_document arxiv
0.00030968971641110967 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0069_text_document arxiv
0.00023325653928600864 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0070_text_document arxiv
0.00030526899198338555 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0071_text_document arxiv
0.00035376719076633584 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0072_text_document arxiv
0.000290224385981026 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0073_text_document arxiv
0.000294650083382008 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0074_text_document arxiv
0.00028768858128616436 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0075_text_document arxiv
0.00030856965235527843 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0076_text_document arxiv
0.00030579942447879054 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0077_text_document arxiv
0.0002863101084704357 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0078_text_document arxiv
0.0002870032092492213 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0079_text_document arxiv
0.000264182727569885 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0080_text_document arxiv
0.0002974012367036449 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0081_text_document arxiv
0.00032238412143059203 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0082_text_document arxiv
0.00031683716893819036 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0083_text_document arxiv
0.00031157434937617524 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0084_text_document arxiv
0.0003411742735695989 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0085_text_document arxiv
0.00026778444816570715 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0086_text_document arxiv
0.0003037045797275201 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0087_text_document arxiv
0.00027746114370081314 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0088_text_document arxiv
0.00027148285946862043 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0089_text_document arxiv
0.00028042950114678207 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0090_text_document arxiv
0.0003235607816590721 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0091_text_document arxiv
0.0003086692227306295 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0092_text_document arxiv
0.00033990349455148105 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0093_text_document arxiv
0.00030945053208470265 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0094_text_document arxiv
0.00027309074552265303 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0095_text_document arxiv
0.00028737393506316194 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0096_text_document arxiv
0.0003098868328009879 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0097_text_document arxiv
0.0002614229162588409 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0098_text_document arxiv
0.0002884388407820923 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0099_text_document arxiv
3 changes: 3 additions & 0 deletions ALCF/data-lists/aurora/books.txt
@@ -0,0 +1,3 @@
0.0031025147279277244 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0000_text_document books
0.003102019887362634 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0001_text_document books
0.0009996745994661548 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0002_text_document books