update readme (#484)
Cathy0908 authored Nov 12, 2024
1 parent 3566a8c commit d761af5
Showing 2 changed files with 4 additions and 3 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -305,8 +305,8 @@ python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo.
  - To run data processing across multiple machines, it is necessary to ensure that all distributed nodes can access the corresponding data paths (for example, by mounting the respective data paths on a file-sharing system such as NAS).
  - The deduplicator operators for RAY mode are different from the single-machine version, and all those operators are prefixed with `ray`, e.g. `ray_video_deduplicator` and `ray_document_deduplicator`. Those operators also rely on a [Redis](https://redis.io/) instance. So in addition to starting the RAY cluster, you also need to set up your Redis instance in advance and provide the `host` and `port` of your Redis instance in the configuration.
- > Users can also opt not to use RAY and instead split the dataset to run on a cluster with [Slurm](https://slurm.schedmd.com/) / [Aliyun PAI-DLC](https://www.aliyun.com/activity/bigdata/pai-dlc). In this case, please use the default Data-Juicer without RAY.
+ > Users can also opt not to use RAY and instead split the dataset to run on a cluster with [Slurm](https://slurm.schedmd.com/). In this case, please use the default Data-Juicer without RAY.
+ > [Aliyun PAI-DLC](https://www.aliyun.com/activity/bigdata/pai-dlc) supports the RAY framework, Slurm framework, etc. Users can directly create RAY jobs and Slurm jobs on the DLC cluster.
  ### Data Analysis
  - Run the `analyze_data.py` tool or the `dj-analyze` command-line tool with your config as the argument to analyze your dataset.
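
Reviewer note on the Redis requirement in the hunk above: because the `ray`-prefixed deduplicators need a reachable Redis instance, a quick pre-flight check may save a failed run. The sketch below is not part of Data-Juicer; `redis_reachable` is a hypothetical helper that uses only the Python standard library. It confirms only that something accepts TCP connections on the given `host` and `port`, not that the service is actually Redis.

```python
import socket


def redis_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it does not speak the Redis
    protocol, it only checks that the port accepts connections.
    """
    try:
        # create_connection resolves the host and connects, raising OSError
        # (including timeouts and refused connections) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `redis_reachable("127.0.0.1", 6379)` checks a local instance on Redis's default port; substitute the `host` and `port` from your own configuration.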
3 changes: 2 additions & 1 deletion README_ZH.md
@@ -282,7 +282,8 @@ python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo.
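  - To run data processing with RAY across multiple machines, make sure every node can access the corresponding data paths, i.e. mount those paths on a shared file system such as NAS.
  - The deduplication operators in RAY mode differ from the single-machine versions, and their names are all prefixed with `ray`, e.g. `ray_video_deduplicator` and `ray_document_deduplicator`. These operators depend on a [Redis](https://redis.io/) instance, so besides starting the RAY cluster you also need to start a Redis instance beforehand and fill in its `host` and `port` in the corresponding configuration file.

- > Users can also choose not to use RAY: split the dataset and run it on a cluster with [Slurm](https://slurm.schedmd.com/) / [Aliyun PAI-DLC](https://www.aliyun.com/activity/bigdata/pai-dlc). In that case, the original Data-Juicer without RAY is sufficient.
+ > Users can also choose not to use RAY: split the dataset and run it on a cluster with [Slurm](https://slurm.schedmd.com/). In that case, the original Data-Juicer without RAY is sufficient.
+ > [Aliyun PAI-DLC](https://www.aliyun.com/activity/bigdata/pai-dlc) supports the RAY framework, the Slurm framework, and others; users can create RAY jobs and Slurm jobs directly on a DLC cluster.

  ### Data Analysis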
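
Reviewer note: the changed lines above describe splitting the dataset and running default Data-Juicer under Slurm, which can be illustrated with a small sharding helper. This is a sketch under assumptions, not Data-Juicer code: `shard_lines` and `shard_for_this_task` are hypothetical names, the input is assumed to be a line-oriented file such as JSONL, and the job is assumed to be a Slurm job array, for which Slurm sets `SLURM_ARRAY_TASK_ID` in each task's environment.

```python
import os
from pathlib import Path
from typing import List


def shard_lines(src: Path, out_dir: Path, num_shards: int) -> List[Path]:
    """Split a line-oriented file into num_shards files, round-robin by line."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    # Shard names such as data.shard0000.jsonl are an arbitrary choice.
    paths = [
        out_dir / f"{src.stem}.shard{i:04d}{src.suffix}"
        for i in range(num_shards)
    ]
    handles = [p.open("w", encoding="utf-8") for p in paths]
    try:
        with src.open("r", encoding="utf-8") as f:
            for i, line in enumerate(f):
                handles[i % num_shards].write(line)
    finally:
        for h in handles:
            h.close()
    return paths


def shard_for_this_task(paths: List[Path]) -> Path:
    """Pick the shard for the current Slurm array task (index 0 outside Slurm)."""
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    return paths[task_id % len(paths)]
```

Each array task would then run default Data-Juicer on the shard returned by `shard_for_this_task`. How that path is passed into the processing config is left out here, since it depends on your setup.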

