Skip to content

Commit

Permalink
minor fix for tutorial & add news for ImgDiff (#384)
Browse files Browse the repository at this point in the history
* - fix the bad link in tutorial_kdd24.html
- add news for ImgDiff

* - fix the bad link in tutorial_kdd24.html
- add news for ImgDiff
  • Loading branch information
yxdyc authored Aug 13, 2024
1 parent 2dfd318 commit 986b43d
Show file tree
Hide file tree
Showing 3 changed files with 10 additions and 2 deletions.
5 changes: 4 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
[[中文主页]](README_ZH.md) | [[Docs]](#documents) | [[API]](https://modelscope.github.io/data-juicer) | [[DJ-SORA]](docs/DJ_SORA.md) | [[Awesome List]](docs/awesome_llm_data.md)


# Data-Juicer: A One-Stop Data Processing System for Large Language Models
# Data-Juicer: A One-Stop Data Processing System for Large Language Models

<img src="https://img.alicdn.com/imgextra/i3/O1CN017Eq5kf27AlA2NUKef_!!6000000007757-0-tps-1280-720.jpg" width = "640" height = "360" alt="Data-Juicer"/>

Expand Down Expand Up @@ -37,6 +37,7 @@ We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/dat
----

## News
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-08-09] We propose Img-Diff, which enhances the performance of multimodal large language models through *contrastive data synthesis*, achieving a score that is 12 points higher than GPT-4V on the [MMVP benchmark](https://tsb0601.github.io/mmvp_blog/). See more details in our [paper](https://arxiv.org/abs/2408.04594), and download the dataset from [huggingface](https://huggingface.co/datasets/datajuicer/Img-Diff) and [modelscope](https://modelscope.cn/datasets/Data-Juicer/Img-Diff).
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models" — Our 4th data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532251) for more information.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through an co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systemic [survey](https://arxiv.org/abs/2407.08583) from model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute!
Expand Down Expand Up @@ -450,6 +451,8 @@ If you find our work useful for your research or development, please kindly cite
- [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
- [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https://arxiv.org/abs/2408.04594)
- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505)
</details>
Expand Down
3 changes: 3 additions & 0 deletions README_ZH.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多
----

## 新消息
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-08-09] 我们提出了Img-Diff,它通过*对比数据合成*来增强多模态大型语言模型的性能,在[MMVP benchmark](https://tsb0601.github.io/mmvp_blog/)中比GPT-4V高出12个点。 更多细节请参阅我们的 [论文](https://arxiv.org/abs/2408.04594), 以及从 [huggingface](https://huggingface.co/datasets/datajuicer/Img-Diff)[modelscope](https://modelscope.cn/datasets/Data-Juicer/Img-Diff)下载这份数据集。
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] “天池 Better Synth 多模态大模型数据合成赛”——第四届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532251),了解赛事详情。
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-17] 我们利用Data-Juicer[沙盒实验室套件](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox-ZH.md),通过数据与模型间的系统性研发工作流,调优数据和模型,在[VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard)文生视频排行榜取得了新的榜首。相关成果已经整理发表在[论文](http://arxiv.org/abs/2407.11784)中,并且模型已在[ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V)[HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V)平台发布。
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-12] 我们的MLLM-Data精选列表已经演化为一个模型-数据协同开发的角度系统性[综述](https://arxiv.org/abs/2407.08583)。欢迎[浏览](docs/awesome_llm_data.md)或参与贡献!
Expand Down Expand Up @@ -423,6 +424,8 @@ Data-Juicer 感谢并参考了社区开源项目:
- [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
- [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https://arxiv.org/abs/2408.04594)
- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505)
</details>
Expand Down
4 changes: 3 additions & 1 deletion docs/sphinx_doc/source/_static/tutorial_kdd24.html
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,8 @@ <h6 class="text-left">(10 min) | Conclusion and Resources</h6>
<div class="col-sm-6 col-lg-12">
<p><a href="coming_soon" target="_blank"><b>tutorial slides
(pdf)</b></a></p>
<p><a href="coming_soon" target="_blank"><b>tutorial jupyter nootbooks
(pdf)</b></a></p>
<p>
<a href="https://github.com/modelscope/data-juicer/blob/main/docs/awesome_llm_data.md"
target="_blank"><b>related awesome list</b></a></p>
Expand Down Expand Up @@ -152,7 +154,7 @@ <h4><a id="Speakers">On-site Speakers</a></h4>
style="width: 140px; height: 140px;"
src="https://img.alicdn.com/imgextra/i4/O1CN01wvOXAa231p3nkuMAL_!!6000000007196-2-tps-435-435.png"
data-holder-rendered="true">
<h4><a href="https:/yxdyc.github.io/"
<h4><a href="https://yxdyc.github.io/"
target="_blank" style="white-space: nowrap;">Daoyuan
Chen</a></h4>
</div>
Expand Down

0 comments on commit 986b43d

Please sign in to comment.