test occl #471

Open

wants to merge 34 commits into base: main
Changes from all commits (34 commits):
- e13a735: ori of run small vit ok (Panlichen, Dec 19, 2022)
- f4faae9: Merge branch 'Oneflow-Inc:main' into main (Panlichen, Dec 22, 2022)
- e44fb2f: update path (Panlichen, Dec 22, 2022)
- 6c83d18: update path (Panlichen, Dec 22, 2022)
- 873be34: update path (Panlichen, Jan 6, 2023)
- 0a7cb5d: scripts (Panlichen, Jan 10, 2023)
- a6141a4: scripts (Panlichen, Jan 10, 2023)
- ade1893: + CUDA_VISIBLE_DEVICES control (Panlichen, Jan 10, 2023)
- 0666d42: 0 epoch; 200iter (Panlichen, Jan 11, 2023)
- 57e54dd: scripts (Panlichen, Jan 13, 2023)
- 2db72ef: + control enable_use_compute_stream (Panlichen, Jan 16, 2023)
- d23a706: control enable_use_compute_stream with env (Panlichen, Jan 17, 2023)
- 2e862fc: hyperparameters (Panlichen, Feb 5, 2023)
- c39a73f: +nsys (Panlichen, Feb 5, 2023)
- aeb4550: get iter from env (Panlichen, Feb 6, 2023)
- 790d4c8: set cfg.num_heads = 16 (Panlichen, Feb 6, 2023)
- 0ae9cca: hyperparemeters; adjust env (Panlichen, Feb 8, 2023)
- db2c433: hyperparameter (Panlichen, Feb 8, 2023)
- a6bb5ad: +zero (Panlichen, Feb 11, 2023)
- 37131e0: +ONEFLOW_TIME_SHAPE (Panlichen, Feb 16, 2023)
- 308da3e: +27 script (Panlichen, Feb 28, 2023)
- 283f0fb: script (Panlichen, Feb 28, 2023)
- 0ae238e: + 2 machine script (Panlichen, Mar 26, 2023)
- cf708ad: scripts (Panlichen, Apr 11, 2023)
- f30ea65: Update vit_imagenet.py (Panlichen, Apr 13, 2023)
- 48a67b4: Update README.md (Panlichen, Apr 13, 2023)
- 354ac44: scripts (Panlichen, Apr 13, 2023)
- 6867b51: Update README.md (Panlichen, Apr 14, 2023)
- b1c7e7f: scripts (Panlichen, Apr 14, 2023)
- 85a8cf8: scripts (Panlichen, Apr 14, 2023)
- 90837d0: +scripts (Panlichen, Apr 24, 2023)
- 3b51f86: +scripts (Panlichen, Apr 24, 2023)
- 0679d00: scripts (Panlichen, Apr 27, 2023)
- ddd7e4b: +4090_para scripts (Panlichen, May 12, 2024)
6 changes: 5 additions & 1 deletion .gitignore
@@ -129,4 +129,8 @@ venv.bak/
dmypy.json

# Pyre type checker
.pyre/
.pyre/

config/
version.py
output/
124 changes: 10 additions & 114 deletions README.md
@@ -1,117 +1,13 @@
<!-- illustration placeholder -->
Please refer to the [official repository](https://github.com/Oneflow-Inc/libai) and the [official documentation page](https://libai.readthedocs.io/en/latest/) for guidance on installation and other related topics.

<h2 align="center">LiBai</h2>
<p align="center">
<a href="https://libai.readthedocs.io/en/latest/index.html">
<img alt="docs" src="https://img.shields.io/badge/docs-latest-blue">
</a>
<a href="https://github.com/Oneflow-Inc/libai/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/Oneflow-Inc/libai.svg?color=blue">
</a>
<a href="https://github.com/Oneflow-Inc/libai/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/Oneflow-Inc/libai.svg">
</a>
<a href="https://github.com/Oneflow-Inc/libai/issues">
<img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-welcome-pink.svg">
</a>
<a href="https://github.com/Oneflow-Inc/libai/issues">
<img alt="Python Checks" src="https://github.com/Oneflow-Inc/libai/workflows/Python checks/badge.svg">
</a>
<a href="https://github.com/Oneflow-Inc/libai/issues">
<img alt="Docs Release Status" src="https://github.com/Oneflow-Inc/libai/workflows/Document Release/badge.svg">
</a>
</p>


## Introduction

**English** | [简体中文](/README_zh-CN.md)

LiBai is a large-scale open-source model training toolbox based on OneFlow. The main branch works with OneFlow 0.7.0.

<details open>
<summary> <b> Highlights </b> </summary>

- **Support a collection of parallel training components**

LiBai provides multiple parallelisms such as Data Parallelism, Tensor Parallelism, and Pipeline Parallelism. It is also extensible to other new parallelisms.

- **Varied training techniques**

LiBai provides many out-of-the-box training techniques such as Distributed Training, Mixed Precision Training, Activation Checkpointing, Recomputation, Gradient Accumulation, and Zero Redundancy Optimizer (ZeRO).

- **Support for both CV and NLP tasks**

LiBai has predefined data processing for both CV and NLP datasets such as CIFAR, ImageNet, and the BERT dataset.

- **Easy to use**

LiBai's components are designed to be modular for easier usage as follows:
- LazyConfig system for more flexible syntax and no predefined structures
- Friendly trainer and engine
- Can be used as a library to support building research projects on top of it. See [projects/](/projects) for some projects built on LiBai

- **High Efficiency**

</details>

## Installation

See [Installation instructions](https://libai.readthedocs.io/en/latest/tutorials/get_started/Installation.html).

## Getting Started

See [Quick Run](https://libai.readthedocs.io/en/latest/tutorials/get_started/quick_run.html) for the basic usage of LiBai.

## Documentation

See LiBai's [documentation](https://libai.readthedocs.io/en/latest/index.html) for full API documentation and tutorials.

## ChangeLog

**Beta 0.2.0** was released on 07/07/2022. The general changes in version **0.2.0** are as follows:

**Features:**
- Support enabling evaluation and setting `eval_iter`
- Support customized sampler in `config.py`
- Support RDMA for pipeline model parallelism
- Support multiple fused kernels
- fused_scale_mask_softmax_dropout
- fused_scale_tril_softmax_mask_scale
- fused_self_attention in branch `libai_bench`
- User Experience Optimization
- Optimization for training throughput, see [benchmark](https://libai.readthedocs.io/en/latest/tutorials/get_started/Benchmark.html) for more details

**Supported Models:**
- Support 3D parallel [Roberta](https://arxiv.org/abs/1907.11692) model
- Support 2D parallel (data parallel + tensor model parallel) [SimCSE](https://arxiv.org/abs/2104.08821) model
- Support Data parallel [MAE](https://arxiv.org/abs/2111.06377) model
- Support Data parallel [MOCOV3](https://arxiv.org/abs/2104.02057) model

See [changelog](./changelog.md) for details and release history.

## Contributing

We appreciate all contributions to improve LiBai. See [CONTRIBUTING](./CONTRIBUTING.md) for the contributing guideline.

## License

This project is released under the [Apache 2.0 license](LICENSE).

## Citation

If you find this project useful for your research, consider citing:

```BibTeX
@misc{of2021libai,
author = {Xingyu Liao and Peng Cheng and Tianhe Ren and Depeng Liang and
Kai Dang and Yi Wang and Xiaoyu Xu},
title = {LiBai},
howpublished = {\url{https://github.com/Oneflow-Inc/libai}},
year = {2021}
}
```

## Running experiments in the OCCL paper
```shell
bash tools/train.sh tools/train_net.py configs/vit_imagenet.py <NUM_LOCAL_GPUS>
```

## Join the WeChat group

![LiBai_Wechat_QRcode](./docs/source/tutorials/assets/LiBai_Wechat.png)
Notes:
- Prepare the ImageNet dataset in advance.
- Edit [configs/vit_imagenet.py](configs/vit_imagenet.py#L84-L86) to switch among different distributed DNN training methods, following the guidelines in the [official doc](https://libai.readthedocs.io/en/latest/tutorials/basics/Distributed_Configuration.html).
- For training across multiple machines, edit the `NODE`, `NODE_RANK`, `ADDR`, and `ADDR_RANK` variables in [tools/train.sh](tools/train.sh#L8-L11).
- Edit [configs/vit_imagenet.py](configs/vit_imagenet.py#L2) to choose between the base ViT configuration or the large ViT configuration.
- If the environment variable `ONEFLOW_ENABLE_OFCCL` in [train.sh](tools/train.sh#L28) is set to `1`, OCCL will be used during training; otherwise, NCCL will be employed (see the example launch after these notes).
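
The config files in this PR also read several environment variables at import time: `HOST` selects a host-specific ImageNet path in [configs/vit_imagenet.py](configs/vit_imagenet.py), and `NUM_ITER_ENV` sets the iteration count (the config calls `int(os.getenv("NUM_ITER_ENV"))`, so it must be defined). The following is a hypothetical single-machine launch, not a command taken from this PR; the exact variable handling lives in [tools/train.sh](tools/train.sh).

```shell
# Hypothetical 8-GPU launch; HOST must match one of the hostnames hard-coded
# in configs/vit_imagenet.py, and NUM_ITER_ENV is required by the config.
export HOST=oneflow-28          # selects the ImageNet path for this machine
export NUM_ITER_ENV=200         # number of training iterations to run
export ONEFLOW_ENABLE_OFCCL=1   # 1 = use OCCL collectives, otherwise NCCL
bash tools/train.sh tools/train_net.py configs/vit_imagenet.py 8
```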
2 changes: 1 addition & 1 deletion configs/common/models/vit/vit_base_patch16_224.py
@@ -6,6 +6,6 @@

cfg.patch_size = 16
cfg.embed_dim = 768
cfg.num_heads = 12
cfg.num_heads = 16

model = LazyCall(VisionTransformer)(cfg=cfg)
44 changes: 38 additions & 6 deletions configs/vit_imagenet.py
@@ -1,5 +1,5 @@
from libai.config import LazyCall
from .common.models.vit.vit_base_patch16_224 import model
from .common.models.vit.vit_base_patch16_224 import model #from .common.models.vit.vit_large_patch16_224 import model
from .common.models.graph import graph
from .common.train import train
from .common.optim import optim
@@ -12,6 +12,31 @@
dataloader.train.dataset[0].root = "/path/to/imagenet"
dataloader.test[0].dataset.root = "/path/to/imagenet"

import os
host = os.getenv("HOST")

if (host == "oneflow-28"):
dataloader.train.dataset[0].root = "/ssd/dataset/ImageNet/extract"
dataloader.test[0].dataset.root = "/ssd/dataset/ImageNet/extract"
elif (host == "oneflow-15"):
dataloader.train.dataset[0].root = "/minio/sdd/dataset/imagenet/extract"
dataloader.test[0].dataset.root = "/minio/sdd/dataset/imagenet/extract"
elif (host == "oneflow-16"):
dataloader.train.dataset[0].root = "/dataset/ImageNet/extract"
dataloader.test[0].dataset.root = "/dataset/ImageNet/extract"
elif (host == "oneflow-25"):
dataloader.train.dataset[0].root = "/data/dataset/ImageNet/extract"
dataloader.test[0].dataset.root = "/data/dataset/ImageNet/extract"
elif (host == "oneflow-26"):
dataloader.train.dataset[0].root = "/ssd/dataset/ImageNet/extract"
dataloader.test[0].dataset.root = "/ssd/dataset/ImageNet/extract"
elif (host == "oneflow-27"):
dataloader.train.dataset[0].root = "/ssd/dataset/ImageNet/extract"
dataloader.test[0].dataset.root = "/ssd/dataset/ImageNet/extract"
else:
print("NO LEGAL HOST, exit.")
exit(1)

# Refine model cfg for vit training on imagenet
model.cfg.num_classes = 1000
model.cfg.loss_func = SoftTargetCrossEntropy()
@@ -37,9 +62,12 @@
# Refine train cfg for vit model
train.train_micro_batch_size = 128
train.test_micro_batch_size = 128
train.train_epoch = 300
# train.train_epoch = 300
train.train_epoch = 0
train.train_iter = int(os.getenv("NUM_ITER_ENV"))
train.warmup_ratio = 5 / 300
train.evaluation.eval_period = 1000
train.evaluation.enabled = False
# train.evaluation.eval_period = 100
train.log_period = 1

# Scheduler
@@ -50,8 +78,12 @@
# Set fp16 ON
train.amp.enabled = True

# zero
train.zero_optimization.enabled = False
train.zero_optimization.stage = 1

# Distributed Settings
train.dist.pipeline_num_layers = model.cfg.depth
train.dist.data_parallel_size = 1
train.dist.tensor_parallel_size = 1
train.dist.pipeline_parallel_size = 1
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2
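
For reference, the distributed settings above replace the previous pure data-parallel defaults (1/1/1) with 2-way data, tensor, and pipeline parallelism. LiBai-style 3D parallelism generally expects the product of the three sizes to equal the total number of ranks, so this configuration targets 2 x 2 x 2 = 8 GPUs. A hedged two-node launch sketch under that assumption (NODE/ADDR handling in tools/train.sh as described in the README):

```shell
# Hypothetical launch on each of two 4-GPU nodes, giving 8 ranks in total.
# NODE, NODE_RANK, ADDR, ADDR_RANK are assumed to be edited in tools/train.sh.
export HOST=oneflow-26      # per-node ImageNet path selector
export NUM_ITER_ENV=200
bash tools/train.sh tools/train_net.py configs/vit_imagenet.py 4
```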
65 changes: 65 additions & 0 deletions configs/vit_imagenet_a100.py
@@ -0,0 +1,65 @@
from libai.config import LazyCall
from .common.models.vit.vit_base_patch16_224 import model #from .common.models.vit.vit_large_patch16_224 import model
from .common.models.graph import graph
from .common.train import train
from .common.optim import optim
from .common.data.imagenet import dataloader

from flowvision.data import Mixup
from flowvision.loss.cross_entropy import SoftTargetCrossEntropy

# Refine data path to imagenet
dataloader.train.dataset[0].root = "/data/ImageNet/extract"
dataloader.test[0].dataset.root = "/data/ImageNet/extract"

# Refine model cfg for vit training on imagenet
model.cfg.num_classes = 1000
model.cfg.loss_func = SoftTargetCrossEntropy()

# Add Mixup Func
dataloader.train.mixup_func = LazyCall(Mixup)(
mixup_alpha=0.8,
cutmix_alpha=1.0,
prob=1.0,
switch_prob=0.5,
mode="batch",
num_classes=model.cfg.num_classes,
)

# Refine optimizer cfg for vit model
optim.lr = 1e-3 # 5e-4 * 1024 (batchsize) / 512
optim.eps = 1e-8
optim.weight_decay = 0.05
optim.params.clip_grad_max_norm = None
optim.params.clip_grad_norm_type = None
optim.params.overrides = {"pos_embed": {"weight_decay": 0.0}, "cls_token": {"weight_decay": 0.0}}

# Refine train cfg for vit model
train.train_micro_batch_size = 128
train.test_micro_batch_size = 128
# train.train_epoch = 300
train.train_epoch = 0
import os
train.train_iter = int(os.getenv("NUM_ITER_ENV"))
train.warmup_ratio = 5 / 300
train.evaluation.enabled = False
# train.evaluation.eval_period = 100
train.log_period = 1

# Scheduler
train.scheduler.warmup_factor = 0.001
train.scheduler.alpha = 0.01
train.scheduler.warmup_method = "linear"

# Set fp16 ON
train.amp.enabled = True

# zero
train.zero_optimization.enabled = False
train.zero_optimization.stage = 1

# Distributed Settings
train.dist.pipeline_num_layers = model.cfg.depth
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2
68 changes: 68 additions & 0 deletions configs/vit_imagenet_para_4090.py
@@ -0,0 +1,68 @@
from libai.config import LazyCall
from .common.models.vit.vit_base_patch16_224 import model #from .common.models.vit.vit_large_patch16_224 import model
from .common.models.graph import graph
from .common.train import train
from .common.optim import optim
from .common.data.imagenet import dataloader

from flowvision.data import Mixup
from flowvision.loss.cross_entropy import SoftTargetCrossEntropy


import os
host = os.getenv("HOST")


dataloader.train.dataset[0].root = "/HOME/scw6cab/run/OCCL/ImageNet"
dataloader.test[0].dataset.root = "/HOME/scw6cab/run/OCCL/ImageNet"

# Refine model cfg for vit training on imagenet
model.cfg.num_classes = 1000
model.cfg.loss_func = SoftTargetCrossEntropy()

# Add Mixup Func
dataloader.train.mixup_func = LazyCall(Mixup)(
mixup_alpha=0.8,
cutmix_alpha=1.0,
prob=1.0,
switch_prob=0.5,
mode="batch",
num_classes=model.cfg.num_classes,
)

# Refine optimizer cfg for vit model
optim.lr = 1e-3 # 5e-4 * 1024 (batchsize) / 512
optim.eps = 1e-8
optim.weight_decay = 0.05
optim.params.clip_grad_max_norm = None
optim.params.clip_grad_norm_type = None
optim.params.overrides = {"pos_embed": {"weight_decay": 0.0}, "cls_token": {"weight_decay": 0.0}}

# Refine train cfg for vit model
train.train_micro_batch_size = 128
train.test_micro_batch_size = 128
# train.train_epoch = 300
train.train_epoch = 0
train.train_iter = int(os.getenv("NUM_ITER_ENV"))
train.warmup_ratio = 5 / 300
train.evaluation.enabled = False
# train.evaluation.eval_period = 100
train.log_period = 1

# Scheduler
train.scheduler.warmup_factor = 0.001
train.scheduler.alpha = 0.01
train.scheduler.warmup_method = "linear"

# Set fp16 ON
train.amp.enabled = True

# zero
train.zero_optimization.enabled = False
train.zero_optimization.stage = 1

# Distributed Settings
train.dist.pipeline_num_layers = model.cfg.depth
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2
8 changes: 6 additions & 2 deletions libai/models/utils/graph_base.py
@@ -73,8 +73,12 @@ def __init__(
# Enable cuda stream for computation and communication as the same stream.
# This will reduce memory when using model parallelism.
dist_util = dist.get_dist_util()
if dist_util.is_tensor_model_parallel() or dist_util.is_pipeline_model_parallel():
flow.boxing.nccl.enable_use_compute_stream(True)
import os
enable_occl = os.getenv("ONEFLOW_ENABLE_OFCCL")
disable_nccl_compute_stream = os.getenv("DISABLE_NCCL_COMPUTE_STREAM")
if enable_occl != "1" and disable_nccl_compute_stream != "1":
if dist_util.is_tensor_model_parallel() or dist_util.is_pipeline_model_parallel():
flow.boxing.nccl.enable_use_compute_stream(True)

# auto_parallel
if auto_parallel_conf is not None and auto_parallel_conf.enabled:
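
The change to libai/models/utils/graph_base.py above means the collective backend and stream placement are now chosen at launch time through environment variables. A hedged summary of the three modes implied by the new gating (HOST and NUM_ITER_ENV omitted for brevity; variable forwarding by tools/train.sh is assumed):

```shell
# Hypothetical invocations illustrating the gating added in graph_base.py.

# NCCL with communication on the compute stream (the previous default whenever
# tensor or pipeline model parallelism is enabled):
ONEFLOW_ENABLE_OFCCL=0 bash tools/train.sh tools/train_net.py configs/vit_imagenet.py 8

# NCCL with communication kept on its own stream:
ONEFLOW_ENABLE_OFCCL=0 DISABLE_NCCL_COMPUTE_STREAM=1 bash tools/train.sh tools/train_net.py configs/vit_imagenet.py 8

# OCCL collectives; enable_use_compute_stream(True) is skipped automatically:
ONEFLOW_ENABLE_OFCCL=1 bash tools/train.sh tools/train_net.py configs/vit_imagenet.py 8
```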