update readme
Yunnglin committed Aug 21, 2024
1 parent ba5870c commit e82fbbc
Showing 16 changed files with 460 additions and 725 deletions.
553 changes: 185 additions & 368 deletions README.md

Large diffs are not rendered by default.

517 changes: 179 additions & 338 deletions README_zh.md

Large diffs are not rendered by default.

Binary file removed docs/en/_static/images/evalscope.jpeg
Binary file not shown.
Binary file added docs/en/_static/images/evalscope_logo.png
39 changes: 39 additions & 0 deletions docs/en/best_practice/experiments.md
@@ -0,0 +1,39 @@

# Experiments

## [MMLU](https://modelscope.cn/datasets/modelscope/mmlu/summary)

### Settings: (Split: test, Total num: 13985, 0-shot)

| Model | Revision | Precision | Humanities | STEM | SocialScience | Other | WeightedAvg | Target | Delta |
|--------------------------------------------------------------------------------------------------|----------|-----------|-------------|------------|---------------|---------|-------------|-------------|--------|
| [Baichuan2-7B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Base/summary) | v1.0.2 | fp16 | 0.4111 | 0.3807 | 0.5233 | 0.504 | 0.4506 | - | |
| [Baichuan2-7B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-chat/summary) | v1.0.4 | fp16 | 0.4439 | 0.374 | 0.5524 | 0.5458 | 0.4762 | - | |
| [chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b/summary) | v1.0.12 | fp16 | 0.3834 | 0.3413 | 0.4708 | 0.4445 | 0.4077 | 0.4546(CoT) | -4.69% |
| [chatglm3-6b-base](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base/summary) | v1.0.1 | fp16 | 0.5435 | 0.5087 | 0.7227 | 0.6471 | 0.5992 | 0.614 | -1.48% |
| [internlm-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b/summary) | v1.0.1 | fp16 | 0.4005 | 0.3547 | 0.4953 | 0.4796 | 0.4297 | - | |
| [Llama-2-13b-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-ms/summary) | v1.0.2 | fp16 | 0.4371 | 0.3887 | 0.5579 | 0.5437 | 0.4778 | - | |
| [Llama-2-7b-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-ms/summary) | v1.0.2 | fp16 | 0.3146 | 0.3037 | 0.4134 | 0.3885 | 0.3509 | - | |
| [Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary) | v1.0.6 | bf16 | 0.5326 | 0.5397 | 0.7184 | 0.6859 | 0.6102 | - | |
| [Qwen-7B](https://modelscope.cn/models/qwen/Qwen-7B/summary) | v1.1.6 | bf16 | 0.387 | 0.4 | 0.5403 | 0.5139 | 0.4527 | - | |
| [Qwen-7B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary) | v1.1.6 | int8 | 0.4322 | 0.4277 | 0.6088 | 0.5778 | 0.5035 | - | |

- Target -- the score officially reported for the model on this dataset
- Delta -- the measured score (WeightedAvg here; Avg in the 5-shot table) minus the Target score, in percentage points
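The Delta column can be reproduced directly from the table. The sketch below (not part of EvalScope) uses the chatglm2-6b row from the 0-shot table as sample data:

```python
# Hypothetical sketch showing how Delta relates the measured score to the
# officially reported Target score, in percentage points.
weighted_avg = 0.4077  # WeightedAvg for chatglm2-6b (0-shot table)
target = 0.4546        # officially reported score (CoT)

# Delta = (measured - target) * 100, rounded to two decimals
delta = round((weighted_avg - target) * 100, 2)
print(f"{delta:+.2f}%")  # -4.69%
```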


### Settings: (Split: test, Total num: 13985, 5-shot)

| Model | Revision | Precision | Humanities | STEM | SocialScience | Other | WeightedAvg | Avg | Target | Delta |
|---------------------|----------|-----------|------------|--------|---------------|--------|-------------|--------|--------------------|---------|
| Baichuan2-7B-Base | v1.0.2 | fp16 | 0.4295 | 0.398 | 0.5736 | 0.5325 | 0.4781 | 0.4918 | 0.5416 (official) | -4.98% |
| Baichuan2-7B-Chat | v1.0.4 | fp16 | 0.4344 | 0.3937 | 0.5814 | 0.5462 | 0.4837 | 0.5029 | 0.5293 (official) | -2.64% |
| chatglm2-6b | v1.0.12 | fp16 | 0.3941 | 0.376 | 0.4897 | 0.4706 | 0.4288 | 0.4442 | - | - |
| chatglm3-6b-base | v1.0.1 | fp16 | 0.5356 | 0.4847 | 0.7175 | 0.6273 | 0.5857 | 0.5995 | - | - |
| internlm-chat-7b | v1.0.1 | fp16 | 0.4171 | 0.3903 | 0.5772 | 0.5493 | 0.4769 | 0.4876 | - | - |
| Llama-2-13b-ms | v1.0.2 | fp16 | 0.484 | 0.4133 | 0.6157 | 0.5809 | 0.5201 | 0.5327 | 0.548 (official) | -1.53% |
| Llama-2-7b-ms | v1.0.2 | fp16 | 0.3747 | 0.3363 | 0.4372 | 0.4514 | 0.3979 | 0.4089 | 0.453 (official) | -4.41% |
| Qwen-14B-Chat | v1.0.6 | bf16 | 0.574 | 0.553 | 0.7403 | 0.684 | 0.6313 | 0.6414 | 0.646 (official) | -0.46% |
| Qwen-7B | v1.1.6 | bf16 | 0.4587 | 0.426 | 0.6078 | 0.5629 | 0.5084 | 0.5151 | 0.567 (official) | -5.2% |
| Qwen-7B-Chat-Int8 | v1.1.6 | int8 | 0.4697 | 0.4383 | 0.6284 | 0.5967 | 0.5271 | 0.5347 | 0.554 (official) | -1.93% |

4 changes: 2 additions & 2 deletions docs/en/get_started/installation.md
@@ -1,6 +1,6 @@
# Installation

## Install Using pip
## Method 1: Install Using pip
We recommend using conda to manage your environment and installing dependencies with pip:

1. Create a conda environment (optional)
@@ -31,7 +31,7 @@ from llmuses import ...
```
````

## Install from Source
## Method 2: Install from Source
1. Download the source code
```shell
git clone https://github.com/modelscope/evalscope.git
13 changes: 6 additions & 7 deletions docs/en/index.rst
@@ -1,18 +1,16 @@
.. image:: _static/images/evalscope.jpeg
.. image:: _static/images/evalscope_logo.png

.. _pypi_downloads: https://pypi.org/project/evalscope
.. _github_pr: https://github.com/modelscope/evalscope/pulls

.. raw:: html

<p align="center">
<a href="`_pypi_downloads`">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/evalscope">
<a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
<a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope">
</a>
<a href="`_github_pr`">
<img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
<a href='https://evalscope.readthedocs.io/zh-cn/latest/?badge=latest'>
<img src='https://readthedocs.org/projects/evalscope/badge/?version=latest' alt='Documentation Status' />
<a href='https://evalscope.readthedocs.io/en/latest/?badge=latest'>
<img src='https://readthedocs.org/projects/evalscope-en/badge/?version=latest' alt='Documentation Status' />
</a>
</p>

@@ -76,6 +74,7 @@ We always welcome users' PRs and Issues to improve EvalScope.
:caption: Best Practices

best_practice/swift_integration.md
best_practice/experiments.md

Index and Tables
==================
2 changes: 1 addition & 1 deletion docs/en/user_guides/offline_evaluation.md
@@ -1,4 +1,4 @@
# Offline Environment Evaluation
# Offline Evaluation

By default, datasets are hosted on [ModelScope](https://modelscope.cn/datasets), which requires an internet connection to load. However, if you find yourself in an environment without internet access, you can use local datasets. Follow the steps below:

2 changes: 1 addition & 1 deletion docs/en/user_guides/vlmevalkit_backend.md
@@ -1,4 +1,4 @@
# VLMEvalKit Evaluation Backend
# VLMEvalKit Backend

To facilitate use of the VLMEvalKit evaluation backend, we have customized the VLMEvalKit source code and named the result `ms-vlmeval`. This version encapsulates the configuration and execution of evaluation tasks, is installable from PyPI, and lets users launch lightweight VLMEvalKit evaluation tasks through EvalScope. In addition, we support API-based evaluation using the OpenAI API format; multi-modal model services can be deployed with ModelScope [swift](https://github.com/modelscope/swift).
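As a rough sketch of what "OpenAI API format" means here, a chat request to such a deployed multi-modal service might be shaped as below. The model name, image URL, and endpoint path are hypothetical placeholders, not values defined by EvalScope or swift:

```python
import json

# Sketch of an OpenAI-format chat request body for a multi-modal model
# service (e.g. one deployed with ModelScope swift). All concrete values
# here are illustrative placeholders.
payload = {
    "model": "qwen-vl-chat",  # hypothetical served model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.png"}},
            ],
        }
    ],
}

# This body would typically be POSTed as JSON to the service's
# /v1/chat/completions endpoint.
print(json.dumps(payload, indent=2))
```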

Binary file removed docs/zh/_static/images/evalscope.jpeg
Binary file not shown.
Binary file added docs/zh/_static/images/evalscope_logo.png
39 changes: 39 additions & 0 deletions docs/zh/best_practice/experiments.md
@@ -0,0 +1,39 @@

# Experiments and Reports

## [MMLU](https://modelscope.cn/datasets/modelscope/mmlu/summary)

### Settings: (Split: test, Total num: 13985, 0-shot)

| Model | Revision | Precision | Humanities | STEM | SocialScience | Other | WeightedAvg | Target | Delta |
|--------------------------------------------------------------------------------------------------|----------|-----------|-------------|------------|---------------|---------|-------------|-------------|--------|
| [Baichuan2-7B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Base/summary) | v1.0.2 | fp16 | 0.4111 | 0.3807 | 0.5233 | 0.504 | 0.4506 | - | |
| [Baichuan2-7B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-chat/summary) | v1.0.4 | fp16 | 0.4439 | 0.374 | 0.5524 | 0.5458 | 0.4762 | - | |
| [chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b/summary) | v1.0.12 | fp16 | 0.3834 | 0.3413 | 0.4708 | 0.4445 | 0.4077 | 0.4546(CoT) | -4.69% |
| [chatglm3-6b-base](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base/summary) | v1.0.1 | fp16 | 0.5435 | 0.5087 | 0.7227 | 0.6471 | 0.5992 | 0.614 | -1.48% |
| [internlm-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b/summary) | v1.0.1 | fp16 | 0.4005 | 0.3547 | 0.4953 | 0.4796 | 0.4297 | - | |
| [Llama-2-13b-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-ms/summary) | v1.0.2 | fp16 | 0.4371 | 0.3887 | 0.5579 | 0.5437 | 0.4778 | - | |
| [Llama-2-7b-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-ms/summary) | v1.0.2 | fp16 | 0.3146 | 0.3037 | 0.4134 | 0.3885 | 0.3509 | - | |
| [Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary) | v1.0.6 | bf16 | 0.5326 | 0.5397 | 0.7184 | 0.6859 | 0.6102 | - | |
| [Qwen-7B](https://modelscope.cn/models/qwen/Qwen-7B/summary) | v1.1.6 | bf16 | 0.387 | 0.4 | 0.5403 | 0.5139 | 0.4527 | - | |
| [Qwen-7B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary) | v1.1.6 | int8 | 0.4322 | 0.4277 | 0.6088 | 0.5778 | 0.5035 | - | |

- Target -- the score officially reported for the model on this dataset
- Delta -- the measured score (WeightedAvg here; Avg in the 5-shot table) minus the Target score, in percentage points


### Settings: (Split: test, Total num: 13985, 5-shot)

| Model | Revision | Precision | Humanities | STEM | SocialScience | Other | WeightedAvg | Avg | Target | Delta |
|---------------------|----------|-----------|------------|--------|---------------|--------|-------------|--------|--------------------|---------|
| Baichuan2-7B-Base | v1.0.2 | fp16 | 0.4295 | 0.398 | 0.5736 | 0.5325 | 0.4781 | 0.4918 | 0.5416 (official) | -4.98% |
| Baichuan2-7B-Chat | v1.0.4 | fp16 | 0.4344 | 0.3937 | 0.5814 | 0.5462 | 0.4837 | 0.5029 | 0.5293 (official) | -2.64% |
| chatglm2-6b | v1.0.12 | fp16 | 0.3941 | 0.376 | 0.4897 | 0.4706 | 0.4288 | 0.4442 | - | - |
| chatglm3-6b-base | v1.0.1 | fp16 | 0.5356 | 0.4847 | 0.7175 | 0.6273 | 0.5857 | 0.5995 | - | - |
| internlm-chat-7b | v1.0.1 | fp16 | 0.4171 | 0.3903 | 0.5772 | 0.5493 | 0.4769 | 0.4876 | - | - |
| Llama-2-13b-ms | v1.0.2 | fp16 | 0.484 | 0.4133 | 0.6157 | 0.5809 | 0.5201 | 0.5327 | 0.548 (official) | -1.53% |
| Llama-2-7b-ms | v1.0.2 | fp16 | 0.3747 | 0.3363 | 0.4372 | 0.4514 | 0.3979 | 0.4089 | 0.453 (official) | -4.41% |
| Qwen-14B-Chat | v1.0.6 | bf16 | 0.574 | 0.553 | 0.7403 | 0.684 | 0.6313 | 0.6414 | 0.646 (official) | -0.46% |
| Qwen-7B | v1.1.6 | bf16 | 0.4587 | 0.426 | 0.6078 | 0.5629 | 0.5084 | 0.5151 | 0.567 (official) | -5.2% |
| Qwen-7B-Chat-Int8 | v1.1.6 | int8 | 0.4697 | 0.4383 | 0.6284 | 0.5967 | 0.5271 | 0.5347 | 0.554 (official) | -1.93% |

4 changes: 2 additions & 2 deletions docs/zh/get_started/installation.md
@@ -1,6 +1,6 @@
# Installation

## Install Using pip
## Method 1: Install Using pip
We recommend using conda to manage your environment and installing dependencies with pip:
1. Create a conda environment (optional)
```shell
@@ -32,7 +32,7 @@ from llmuses import ...
````


## Install from Source
## Method 2: Install from Source
1. Download the source code
```shell
git clone https://github.com/modelscope/evalscope.git
12 changes: 6 additions & 6 deletions docs/zh/index.rst
@@ -1,18 +1,17 @@
.. image:: _static/images/evalscope.jpeg
.. image:: _static/images/evalscope_logo.png

.. _pypi_downloads: https://pypi.org/project/evalscope
.. _github_pr: https://github.com/modelscope/evalscope/pulls

.. raw:: html

<p align="center">
<a href="`_pypi_downloads`">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/evalscope">
<a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
<a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope">
</a>
<a href="`_github_pr`">
<img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
<a href="https://github.com/modelscope/evalscope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
<a href='https://evalscope.readthedocs.io/zh-cn/latest/?badge=latest'>
<img src='https://readthedocs.org/projects/evalscope/badge/?version=latest' alt='Documentation Status' />
<img src='https://readthedocs.org/projects/evalscope/badge/?version=latest' alt='Documentation Status' />
</a>
</p>

@@ -78,6 +77,7 @@ EvalScope Getting Started
:caption: Best Practices

best_practice/swift_integration.md
best_practice/experiments.md

Index and Tables
==================
Binary file removed resources/evalscope.jpeg
Binary file not shown.
Binary file removed resources/evalscope_framework.png
Binary file not shown.
