Commit 0b82892 (parent 561259f): 20 changed files with 369 additions and 497 deletions.

Contributing
###############

.. This repository is developed and maintained by `Wei Fu <https://garrett4wade.github.io>`_
.. and `Zhiyu Mei <https://openreview.net/profile?id=~Zhiyu_Mei1>`_, both of whom are
.. PhD students at `IIIS, Tsinghua University <https://iiis.tsinghua.edu.cn/en/>`_
.. advised by Professor `Yi Wu <https://jxwuyi.weebly.com/>`_.
.. We acknowledge that due to limited time and resources,
.. the quality of the documentation and code in this repository is not very high.
.. As a result, it can be quite challenging for potential developers to
.. read the code and contribute new features.

If you wish to contribute to this repository or have any questions about the code,
please feel free to open an issue or contact us directly.
We will do our best to assist you.
Currently, there is no template for issues or pull requests.

We hope the open-source community can help improve this repository
and enable RLHF technology to truly empower LLM applications.

Set Up Distributed Experiments
==================================

Currently, ReaL supports launching distributed experiments using
`SLURM <https://slurm.schedmd.com/documentation.html>`_
with the `Pyxis <https://github.com/NVIDIA/pyxis>`_ plugin.
This plugin allows the ``srun`` command to launch enroot containers.

To set up distributed experiments, create a JSON cluster configuration file
following the example in ``examples/cluster_config.json``.
The configuration contains the following fields:

- ``cluster_type``: The type of the cluster. Currently, only ``"slurm"`` is supported.
- ``cluster_name``: The name of the cluster. It can be arbitrary.
- ``fileroot``: An NFS path accessible by all nodes. This is where logs and checkpoints are stored.
- ``default_mount``: A comma-separated list of paths to mount on all nodes. It should include the ``fileroot`` path above.
- ``node_type_from_node_name``: A dictionary mapping regular expressions to node types. Every host in the cluster should match one of these regular expressions. Node types include ``"g1"``, ``"g2"``, ``"g8"``, and ``"a100"``, where the ``"g"`` prefix refers to low-end GPUs in the cluster.
- ``gpu_type_from_node_name``: A dictionary mapping regular expressions to GPU types. The GPU type is used by SLURM.
- ``cpu_image``: The Docker image for the controller and the master worker.
- ``gpu_image``: The Docker image for the model worker.
- ``node_name_prefix``: The prefix of the host names. Host names in the cluster are assumed to consist of this prefix followed by an integer, e.g., in ``"com-01"`` the prefix is ``"com-"``.

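
The following Python sketch builds a configuration with the fields listed above and checks some of the documented rules: every host should match one of the ``node_type_from_node_name`` patterns and follow the ``node_name_prefix`` convention, and ``default_mount`` should include ``fileroot``. All values (host names, paths, image names, GPU types) are hypothetical, and ``validate_cluster_config`` is not part of ReaL. It also assumes that a pattern must match the whole host name and that ``default_mount`` entries are plain paths; ReaL's own matching rules may differ.

.. code-block:: python

   import json
   import re

   # Hypothetical example values; replace them with your cluster's settings.
   CLUSTER_CONFIG = {
       "cluster_type": "slurm",  # Only "slurm" is currently supported.
       "cluster_name": "my-cluster",
       "fileroot": "/mnt/nfs/real",  # NFS path visible on every node.
       "default_mount": "/mnt/nfs/real,/mnt/nfs/datasets",
       "node_type_from_node_name": {
           r"com-0[1-4]": "g8",
           r"com-0[5-8]": "a100",
       },
       "gpu_type_from_node_name": {
           r"com-0[1-4]": "rtx4090",
           r"com-0[5-8]": "a100",
       },
       "cpu_image": "my-registry/real-cpu-with-spec:latest",
       "gpu_image": "my-registry/real-gpu-with-spec:latest",
       "node_name_prefix": "com-",
   }


   def validate_cluster_config(config, hostnames):
       """Return a list of problems found in ``config`` (empty if none).

       Hypothetical helper, not part of ReaL. It assumes that patterns must
       match the whole host name and that ``default_mount`` lists plain paths.
       """
       problems = []
       if config["cluster_type"] != "slurm":
           problems.append("cluster_type must be 'slurm'")
       mounts = [path.strip() for path in config["default_mount"].split(",")]
       if config["fileroot"] not in mounts:
           problems.append("default_mount should include fileroot")
       prefix = config["node_name_prefix"]
       for host in hostnames:
           # Every host should match one of the node-type regular expressions.
           if not any(re.fullmatch(pattern, host)
                      for pattern in config["node_type_from_node_name"]):
               problems.append(f"{host} matches no node_type_from_node_name pattern")
           # Host names are assumed to be the prefix followed by an integer.
           suffix = host[len(prefix):]
           if not (host.startswith(prefix) and suffix.isdigit()):
               problems.append(f"{host} is not '{prefix}' followed by an integer")
       return problems


   if __name__ == "__main__":
       hosts = [f"com-{i:02d}" for i in range(1, 9)]  # com-01 ... com-08
       issues = validate_cluster_config(CLUSTER_CONFIG, hosts)
       if issues:
           raise SystemExit("\n".join(issues))
       # Print the JSON; redirect it to a file such as /tmp/my-cluster.json.
       print(json.dumps(CLUSTER_CONFIG, indent=2))

Keeping the checks in a small function makes it easy to run them against the real host list of your cluster before submitting any jobs.
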
The path of this file should be specified in the ``CLUSTER_SPEC_PATH`` environment variable,
both inside the Docker images and in the environment where the experiment is launched. For example:

.. code-block:: console

   CLUSTER_SPEC_PATH=/tmp/my-cluster.json python3 -m realhf.apps.quickstart ppo ...

To set the variable inside the Docker images, add an extra layer to them as shown below:

.. code-block:: dockerfile

   FROM garrett4wade/real-cpu:22.04-0.1.0
   # Assumes the JSON file is also available at this path inside the container (e.g., mounted or copied in).
   ENV CLUSTER_SPEC_PATH=/tmp/my-cluster.json

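
Since ``CLUSTER_SPEC_PATH`` has to be set both when launching and inside the images, a process in either place can locate the cluster configuration through it. The Python sketch below shows one way such a process could load the file. ``load_cluster_spec`` is a hypothetical helper written for illustration, not ReaL's actual loader.

.. code-block:: python

   import json
   import os


   def load_cluster_spec():
       """Load the JSON cluster configuration named by CLUSTER_SPEC_PATH.

       Hypothetical helper for illustration; not part of ReaL's API.
       """
       path = os.environ.get("CLUSTER_SPEC_PATH")
       if not path:
           # Fail early with a clear message instead of guessing a default path.
           raise RuntimeError(
               "CLUSTER_SPEC_PATH is not set; export it before launching the "
               "experiment and set it with ENV in the Docker images."
           )
       with open(path) as f:
           return json.load(f)

Raising an explicit error when the variable is unset makes a missing ``ENV`` layer or a forgotten export at launch time easier to spot.
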