diff --git a/README.md b/README.md index 1b5c5131d..3e747e7c6 100644 --- a/README.md +++ b/README.md @@ -37,6 +37,7 @@ We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/dat ---- ## News +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models" — Our 4th data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532251) for more information. - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through an co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms. - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systemic [survey](https://arxiv.org/abs/2407.08583) from model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute! - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint—Our third data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532219) for more information. @@ -96,8 +97,8 @@ Table of Contents visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models. ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg) -- **Enhanced Efficiency**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion) - requiring less memory and CPU usage, optimized for maximum productivity. +- **Towards production environment **: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion) + requiring less memory and CPU usage, optimized with automatic fault-toleration. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg) - **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data @@ -154,7 +155,7 @@ Table of Contents ## Installation -### From Source +### From Source - Run the following commands to install the latest basic `data_juicer` version in editable mode: @@ -229,6 +230,15 @@ You can install FFmpeg using package managers(e.g. sudo apt install ffmpeg on De Check if your environment path is set correctly by running the ffmpeg command from the terminal. + +

+
+ +[🔼 back to index](#documentation-index-) + +
+ + ## Quick Start @@ -259,6 +269,20 @@ export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models" export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets" ``` +#### Flexible Programming Interface +We provide various simple interfaces for users to choose from as follows. +```python +#... init op & dataset ... + +# Chain call style, support single operator or operator list +dataset = dataset.process(op) +dataset = dataset.process([op1, op2]) +# Functional programming style for quick integration or script prototype iteration +dataset = op(dataset) +dataset = op.run(dataset) +``` + + ### Distributed Data Processing We have now implemented multi-machine distributed data processing based on [RAY](https://www.ray.io/). The corresponding demos can be run using the following commands: @@ -376,6 +400,14 @@ docker run -dit \ # run the container in the background docker exec -it bash ``` + +

+
+ +[🔼 back to index](#documentation-index-) + +
+ ## Data Recipes - [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md) - [Recipes for data process in RedPajama](configs/redpajama/README.md) @@ -417,3 +449,24 @@ If you find our work useful for your research or development, please kindly cite year={2024} } ``` + +
+ More related papers from Data-Juicer Team: +> + +- [Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784) + +- [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583) + +- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505) + +
+ + + +

+
+ +[🔼 back to index](#documentation-index-) + +
diff --git a/README_ZH.md b/README_ZH.md index c8b1a4886..4b9880ce6 100644 --- a/README_ZH.md +++ b/README_ZH.md @@ -31,6 +31,7 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多 ---- ## 新消息 +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] “天池 Better Synth 多模态大模型数据合成赛”——第四届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532251),了解赛事详情。 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-17] 我们利用Data-Juicer[沙盒实验室套件](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox-ZH.md),通过数据与模型间的系统性研发工作流,调优数据和模型,在[VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard)文生视频排行榜取得了新的榜首。相关成果已经整理发表在[论文](http://arxiv.org/abs/2407.11784)中,并且模型已在[ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V)和[HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V)平台发布。 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-12] 我们的MLLM-Data精选列表已经演化为一个模型-数据协同开发的角度系统性[综述](https://arxiv.org/abs/2407.08583)。欢迎[浏览](docs/awesome_llm_data.md)或参与贡献! - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora“数据导演”创意竞速——第三届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532219),了解赛事详情。 @@ -82,7 +83,7 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多 * **数据反馈回路 & 沙盒实验室**:支持一站式数据-模型协同开发,通过[沙盒实验室](docs/Sandbox-ZH.md)快速迭代,基于数据和模型反馈回路、可视化和多维度自动评估等功能,使您更了解和改进您的数据和模型。 ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg) -* **效率增强**:提供高效并行化的数据处理流水线(Aliyun-PAI\Ray\Slurm\CUDA\算子融合),减少内存占用和CPU开销,提高生产力。 ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg) +* **面向生产环境**:提供高效并行化的数据处理流水线(Aliyun-PAI\Ray\Slurm\CUDA\算子融合),减少内存占用和CPU开销,支持自动化处理容错。 ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg) * **全面的数据处理菜谱**:为pre-training、fine-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)。 在LLaMA、LLaVA等模型上有效验证。 ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png) @@ -235,6 +236,19 @@ export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models" export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets" ``` +#### 灵活的编程接口 +我们提供了各种层次的简单编程接口,以供用户选择: +```python +# ... init op & dataset ... + +# 链式调用风格,支持单算子或算子列表 +dataset = dataset.process(op) +dataset = dataset.process([op1, op2]) +# 函数式编程风格,方便快速集成或脚本原型迭代 +dataset = op(dataset) +dataset = op.run(dataset) +``` + ### 分布式数据处理 Data-Juicer 现在基于[RAY](https://www.ray.io/)实现了多机分布式数据处理。 @@ -278,6 +292,9 @@ dj-analyze --config configs/demo/analyzer.yaml streamlit run app.py ``` + + + ### 构建配置文件 * 配置文件包含一系列全局参数和用于数据处理的算子列表。您需要设置: @@ -380,8 +397,6 @@ Data-Juicer 被各种 LLM产品和研究工作使用,包括来自阿里云-通 Data-Juicer 感谢并参考了社区开源项目: [Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), .... - - ## 参考文献 如果您发现我们的工作对您的研发有帮助,请引用以下[论文](https://arxiv.org/abs/2309.02033) 。 @@ -392,4 +407,24 @@ Data-Juicer 感谢并参考了社区开源项目: booktitle={International Conference on Management of Data}, year={2024} } -``` \ No newline at end of file +``` +
+更多Data-Juicer团队相关论文: +> + +- [Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784) + +- [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583) + +- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505) + +
+ + + +

+
+ +[🔼 back to index](#documentation-index-a-namedocuments) + +
diff --git a/docs/DJ_SORA.md b/docs/DJ_SORA.md index 4a228683b..2e5139b75 100644 --- a/docs/DJ_SORA.md +++ b/docs/DJ_SORA.md @@ -38,7 +38,8 @@ This project is being actively updated and maintained. We eagerly invite you to - [✅] Ray based multi-machine distributed running - [✅] Aliyun PAI-DLC & Slurm based multi-machine distributed running - [✅] Distributed scheduling optimization (OP-aware, automated load balancing) --> Aliyun PAI-DLC -- [ ] [WIP] Distributed storage optimization +- [WIP] Low precision acceleration support for video related operators. (git tags: dj_op, dj_efficiency) +- [WIP] SOTA model enhancement of existing video related operators. (git tags: dj_op, dj_sota_models) ## Basic Operators (video spatio-temporal dimension) - Towards Data Quality @@ -90,20 +91,24 @@ This project is being actively updated and maintained. We eagerly invite you to - [✅] **Youku-mPLUG-CN**: 36TB video-caption data: `{}` - [✅] **InternVid**: 234M data sample: `{}` - [✅] **MSR-VTT**: 10K video-caption data: `{}` - - [ ] [WIP] ModelScope's datasets integration - - [ ] VideoInstruct-100K, Panda70M, ...... + - [✅] ModelScope's datasets integration + - [✅] VideoInstruct-100K, Panda70M, ...... - [ ] Large-scale high-quality DJ-SORA dataset - [✅] (Data sandbox) Building and optimizing multimodal data recipes with DJ-video operators (which are also being continuously extended and improved). - - [ ] [WIP] Continuous expansion of data sources: open-datasets, Youku, web, ... - - [ ] [WIP] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes (OpenVideos, ...) - - [ ] [WIP] Large-scale generation of 3DPatch datasets based on DJ recipes. + - [✅] Continuous expansion of data sources: open-datasets, Youku, web, ... + - [ ] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes (OpenVideos, ...) + - [WIP] broad scenarios, high-dynamic - ... ## DJ-SORA Data Validation and Model Training - - [ ] [WIP] (DJ-Bench101) Exploring and refining the collaborative development of multimodal data and model, establishing benchmarks and insights. + - [ ] Exploring and refining the collaborative development of multimodal data and model, establishing benchmarks and insights. [paper](https://arxiv.org/abs/2407.11784) - [ ] [WIP] Integration of SORA-like model training pipelines - [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) + - [✅] [T2V](https://t2v-turbo.github.io/) + - [✅] [V-Bench](https://vchitect.github.io/VBench-project/) - ... - [✅] (Model-Data sandbox) With relatively small models and the DJ-SORA dataset, exploring low-cost, transferable, and instructive data-model co-design, configurations and checkpoints. - [ ] [WIP] Training SORA-like models with DJ-SORA data on larger scales and in more scenarios to improve model performance. + - [✅] Data-Juicer-T2v, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V) + - ... - ... diff --git a/docs/DJ_SORA_ZH.md b/docs/DJ_SORA_ZH.md index 8e5dda301..9afe4e1bc 100644 --- a/docs/DJ_SORA_ZH.md +++ b/docs/DJ_SORA_ZH.md @@ -38,7 +38,8 @@ DJ-SORA将基于Data-Juicer(包含上百个专用的视频、图像、音频、 - [✅] Ray多机分布式 - [✅] 基于阿里云PAI-DLC和Slurm的多机分布式 - [✅] 分布式调度优化(OP-aware、自动化负载均衡)--> Aliyun PAI-DLC -- [ ] [WIP] 分布式存储优化 +- [WIP] 视频相关算子的低精度加速支持, git tags: dj_op, dj_efficiency +- [WIP] 现有视频相关算子的SOTA模型增强, git tags: dj_op, dj_sota_models ## 基础算子(视频时空维度) - 面向数据质量 @@ -94,22 +95,25 @@ DJ-SORA将基于Data-Juicer(包含上百个专用的视频、图像、音频、 - [✅] **Youku-mPLUG-CN**: 36TB video-caption data:`{}` - [✅] **InternVid**: 234M data sample:`{}` - [✅] **MSR-VTT**: 10K video-caption data:`{}` - - [ ] [WIP] ModelScope数据集集成 - - [ ] VideoInstruct-100K, Panda70M, ...... + - [✅] ModelScope数据集集成 + - [✅] VideoInstruct-100K, Panda70M, ...... - [ ] 大规模高质量DJ-SORA数据集 - [✅] (Data sandbox) 基于DJ-video算子构建和优化多模态数据菜谱 (算子同期持续完善) - - [ ] [WIP] 数据源持续扩充:open-datasets, youku, web, ... - - [ ] [WIP] 基于DJ菜谱规模化分析、清洗、生成高质量多模态数据集 (OpenVideo, ...) - - [ ] [WIP] 基于DJ菜谱形成大规模3DPatch数仓 + - [✅] 数据源持续扩充:open-datasets, youku, web, ... + - [ ] 基于DJ菜谱规模化分析、清洗、生成高质量多模态数据集 + - [WIP] 多场景、高动态 - ... ## DJ-SORA数据验证及模型训练 - - [ ] [WIP] (DJ-Bench101) 探索及完善多模态数据和模型的协同开发,形成benchmark和insights - - [ ] [WIP] 类SORA模型训练pipeline集成 - - [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) + - [✅] 探索及完善多模态数据和模型的协同开发,形成benchmark和insights: [paper](https://arxiv.org/abs/2407.11784) + - [] [WIP] 类SORA模型训练pipeline集成 + - [✅] [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) + - [✅] [T2V](https://t2v-turbo.github.io/) + - [✅] [V-Bench](https://vchitect.github.io/VBench-project/) - ... - [✅] (Model-Data sandbox) 在相对小的模型和DJ-SORA数据集上,探索形成低开销、可迁移、有指导性的data-model co-design、配置及检查点 - [ ] [WIP] 更大规模、更多场景使用DJ-SORA数据训练类SORA模型,提高模型性能 - - ... + - [✅] Data-Juicer-T2v, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V) + - ... diff --git a/docs/DeveloperGuide.md b/docs/DeveloperGuide.md index 7940ed0c0..4bc80d1ae 100644 --- a/docs/DeveloperGuide.md +++ b/docs/DeveloperGuide.md @@ -11,7 +11,7 @@ ## Coding Style We define our styles in `.pre-commit-config.yaml`. Before committing, -please install `pre-commit` tool to check and modify accordingly: +please install `pre-commit` tool to automatically check and modify accordingly: ```shell # ===========install pre-commit tool=========== @@ -104,20 +104,22 @@ class StatsKeys(object): return False ``` - - If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `self._accelerator = 'cuda'` in the constructor, and ensure that `compute_stats` and `process` methods accept an additional positional argument `rank`. + - If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `_accelerator = 'cuda'` in the constructor, and ensure that `compute_stats` and `process` methods accept an additional positional argument `rank`. ```python # ... (same as above) @OPERATORS.register_module('text_length_filter') class TextLengthFilter(Filter): + + _accelerator = 'cuda' + def __init__(self, min_len: PositiveInt = 10, max_len: PositiveInt = sys.maxsize, *args, **kwargs): # ... (same as above) - self._accelerator = 'cuda' def compute_stats(self, sample, rank=None): # ... (same as above) diff --git a/docs/DeveloperGuide_ZH.md b/docs/DeveloperGuide_ZH.md index 9ec85a5ce..b3e424452 100644 --- a/docs/DeveloperGuide_ZH.md +++ b/docs/DeveloperGuide_ZH.md @@ -10,7 +10,7 @@ ## 编码规范 -我们将编码规范定义在 `.pre-commit-config.yaml` 中。在向仓库贡献代码之前,请使用 `pre-commit` 工具对代码进行规范化。 +我们将编码规范定义在 `.pre-commit-config.yaml` 中。在向仓库贡献代码之前,请使用 `pre-commit` 工具对代码进行自动规范化。 ```shell # ===========install pre-commit tool=========== @@ -99,20 +99,22 @@ class StatsKeys(object): return False ``` - - 如果在算子中使用了 Hugging Face 模型,您可能希望利用 GPU 加速。为了实现这一点,请在构造函数中声明 `self._accelerator = 'cuda'`,并确保 `compute_stats` 和 `process` 方法接受一个额外的位置参数 `rank`。 + - 如果在算子中使用了 Hugging Face 模型,您可能希望利用 GPU 加速。为了实现这一点,请在构造函数中声明 `_accelerator = 'cuda'`,并确保 `compute_stats` 和 `process` 方法接受一个额外的位置参数 `rank`。 ```python # ... (same as above) @OPERATORS.register_module('text_length_filter') class TextLengthFilter(Filter): + + _accelerator = 'cuda' + def __init__(self, min_len: PositiveInt = 10, max_len: PositiveInt = sys.maxsize, *args, **kwargs): # ... (same as above) - self._accelerator = 'cuda' def compute_stats(self, sample, rank=None): # ... (same as above) diff --git a/docs/Operators.md b/docs/Operators.md index 045bfd92c..6bc3599ba 100644 --- a/docs/Operators.md +++ b/docs/Operators.md @@ -2,7 +2,7 @@ Operators are a collection of basic processes that assist in data modification, cleaning, filtering, deduplication, etc. We support a wide range of data sources and file formats, and allow for flexible extension to custom datasets. -This page offers a basic description of the operators (OPs) in Data-Juicer. Users can refer to the [API documentation](https://modelscope.github.io/data-juicer/) for the specific parameters of each operator. Users can refer to and run the unit tests for [examples of operator-wise usage](../tests/ops) as well as the effects of each operator when applied to built-in test data samples. +This page offers a basic description of the operators (OPs) in Data-Juicer. Users can refer to the [API documentation](https://modelscope.github.io/data-juicer/) for the specific parameters of each operator. Users can refer to and run the unit tests (`tests/ops/...`) for [examples of operator-wise usage](../tests/ops) as well as the effects of each operator when applied to built-in test data samples. ## Overview diff --git a/docs/Operators_ZH.md b/docs/Operators_ZH.md index 5c32c4ee4..3ee94d381 100644 --- a/docs/Operators_ZH.md +++ b/docs/Operators_ZH.md @@ -2,7 +2,7 @@ 算子 (Operator) 是协助数据修改、清理、过滤、去重等基本流程的集合。我们支持广泛的数据来源和文件格式,并支持对自定义数据集的灵活扩展。 -这个页面提供了OP的基本描述,用户可以参考[API文档](https://modelscope.github.io/data-juicer/)更细致了解每个OP的具体参数,并且可以查看、运行单元测试,来体验[各OP的用法示例](../tests/ops)以及每个OP作用于内置测试数据样本时的效果。 +这个页面提供了OP的基本描述,用户可以参考[API文档](https://modelscope.github.io/data-juicer/)更细致了解每个OP的具体参数,并且可以查看、运行单元测试 (`tests/ops/...`),来体验[各OP的用法示例](../tests/ops)以及每个OP作用于内置测试数据样本时的效果。 ## 概览 diff --git a/docs/awesome_llm_data.md b/docs/awesome_llm_data.md index 0392ab367..05d4e1c96 100644 --- a/docs/awesome_llm_data.md +++ b/docs/awesome_llm_data.md @@ -1,5 +1,5 @@ # Awesome Data-Model Co-Development of MLLMs [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) -Welcome to the "Awesome List" for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research, insightful articles focusing on improving MLLMs involving with the data-model co-development of MLLMs, and tagged based on the proposed **taxonomy** from data-model co-development, as illustrated below. +Welcome to the "Awesome List" for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research, insightful articles focusing on improving MLLMs involving with the data-model co-development of MLLMs, and tagged based on the proposed **taxonomy** from our data-model co-development [survey](https://arxiv.org/abs/2407.08583), as illustrated below. ![Overview of Our Taxonomy](https://img.alicdn.com/imgextra/i1/O1CN01aN3TVo1mgGZAuSHJ4_!!6000000004983-2-tps-3255-1327.png) Soon we will provide a dynamic table of contents to help readers more easily navigate through the materials with features such as search, filter, and sort.