Add support of Qwen models (7b, 14b, 72b) to DeepSpeed-FastGen #4913

ZonePG · 2024-01-08T03:40:14Z

This PR adds support for Qwen models 7b, 14b and 72b.

Test Code

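For the MII pipeline:

from mii import pipeline

# Load Qwen-7B-Chat through the MII (DeepSpeed-FastGen) pipeline.
pipe = pipeline("Qwen/Qwen-7B-Chat")
# Set the EOS token id explicitly on the underlying tokenizer.
pipe.tokenizer.tokenizer.eos_token_id = 151643
# Greedy decoding (do_sample=False) keeps the output deterministic for comparison.
output = pipe(["DeepSpeed is"], max_new_tokens=128, do_sample=False)
print(output)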

For Hugging Face:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen ships custom modeling code, so trust_remote_code=True is needed.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# Load the model in fp16 and switch it to eval mode.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
inputs = tokenizer('DeepSpeed is', return_tensors='pt')
inputs = inputs.to(model.device)
# Greedy decoding with no repetition penalty, matching the MII run above.
pred = model.generate(**inputs, max_new_tokens=128, do_sample=False, repetition_penalty=1.0)
# Decode the first (only) sequence, keeping special tokens visible.
test = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(test)
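The Hugging Face snippet decodes the whole sequence returned by `generate`, so the printed string starts with the prompt and may contain special tokens, while the outputs quoted below start at the continuation. A minimal sketch of normalizing the decoded string before comparing it with the MII output follows. The helper name `strip_prompt` and the default EOS string `<|endoftext|>` are assumptions for illustration and are not part of this PR.

```python
def strip_prompt(decoded: str, prompt: str, eos_text: str = "<|endoftext|>") -> str:
    """Return only the continuation of a decoded sequence.

    Hypothetical helper: `eos_text` is an assumed EOS string and should be
    replaced with whatever the tokenizer actually emits.
    """
    text = decoded
    # Drop the prompt if the decoded text begins with it verbatim.
    if text.startswith(prompt):
        text = text[len(prompt):]
    # Cut everything from the first EOS marker onward, if one is present.
    eos_at = text.find(eos_text)
    if eos_at != -1:
        text = text[:eos_at]
    return text
```

For example, `strip_prompt(test, "DeepSpeed is")` would leave just the generated continuation from the script above.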

Qwen 7B

Hugging Face output with prompt "DeepSpeed is":

 a high-performance, low-latency database management system designed for real-time analytics and machine learning applications. It is built on top of Apache Arrow, a columnar in-memory data format, and is optimized for processing large volumes of data in parallel.\n\nDeepSpeed offers several key features that make it well-suited for real-time analytics and machine learning applications:\n\n1. High Performance: DeepSpeed is designed to deliver high performance by leveraging parallel processing and optimized data structures. It can process large volumes of data in real-time, making it ideal for applications that require real-time analytics.\n\n2. Low Latency: DeepSpeed is designed to minimize latency by

DeepSpeed-FastGen output with prompt "DeepSpeed is":

 a high-performance, low-latency database management system designed for real-time analytics and machine learning applications. It is built on top of Apache Arrow, a columnar in-memory data format, and is optimized for processing large volumes of data in parallel.\n\nDeepSpeed offers several key features that make it well-suited for real-time analytics and machine learning applications:\n\n1. High Performance: DeepSpeed is designed to deliver high performance by leveraging parallel processing and optimized data structures. It can process large volumes of data in real-time, making it ideal for applications that require real-time analytics.\n\n2. Low Latency: DeepSpeed is designed to minimize latency by

Qwen 72B

Hugging Face output with prompt "DeepSpeed is":

是一个开源的深度学习优化库，它提供了多种优化技术，包括模型并行、数据并行、混合并行、ZeRO内存优化等。它可以帮助用户在大规模GPU集群上训练深度学习模型，提高训练速度，减少内存使用。\n在Deepspeed中，模型并行是一种将模型的不同部分分配到不同的GPU上的技术。这样可以处理模型太大，无法放在一个GPU上的问题。数据并行是将数据集分成多个部分，每个部分在不同的GPU上进行训练。混合并行则是结合了模型并行和数据并行，以更有效地利用GPU资源

DeepSpeed-FastGen output with prompt "DeepSpeed is", using 8-way sharding:

是一个开源的深度学习优化库，它提供了多种优化技术，包括模型并行、数据并行、混合并行、ZeRO内存优化等。它可以帮助用户在大规模GPU集群上训练深度学习模型，提高训练速度，减少内存使用。\n在Deepspeed中，模型并行是一种将模型的不同部分分配到不同的GPU上的技术。这样可以处理模型太大，无法放在一个GPU上的问题。数据并行是将数据集分成多个部分，每个部分在不同的GPU上进行训练。混合并行则是结合了模型并行和数据并行，以更有效地利用GPU资源
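The 7B and 72B comparisons above rely on the two backends producing identical text under greedy decoding. A small sketch of checking that programmatically, rather than by eye, is shown here. The function name `first_divergence` is hypothetical, and the check compares plain strings only, so it assumes both outputs have already been reduced to the continuation text.

```python
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    # Walk both strings in lockstep and stop at the first mismatch.
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # One string is a strict prefix of the other: they diverge where the
    # shorter one ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

A result of `None` means the two generations match character for character; any integer points at where to start looking for a numerical or tokenization difference.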

ZonePG · 2024-01-08T03:42:01Z

@microsoft-github-policy-service agree

mrwyattii · 2024-01-08T17:27:43Z

@ZonePG thank you for this contribution! Could you please run pre-commit and commit any modified files? `pre-commit run --all-files`

ZonePG · 2024-01-09T02:33:31Z

@mrwyattii pre-commit checks have been run and all modified files have been committed. Thanks!

…soft#4913)

Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Logan Adams <[email protected]>

ZonePG requested review from mrwyattii, awan-10 and arashb as code owners January 8, 2024 03:40

ZonePG force-pushed the master branch from f66b27c to 31e7580 Compare January 9, 2024 02:28

Add support of Qwen models (7b, 14b, 72b) to DeepSpeed-FastGen

31d924c

ZonePG force-pushed the master branch from 31e7580 to 31d924c Compare January 9, 2024 02:35

add comment for Qwen bf16 difference

1a7f6fe

mrwyattii approved these changes Jan 9, 2024

mrwyattii and others added 4 commits January 9, 2024 10:15

Merge branch 'master' into master

927253c

Merge branch 'master' into master

7855711

Merge branch 'master' into master

163e028

Merge branch 'master' into master

77d4387

mrwyattii added this pull request to the merge queue Jan 10, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 10, 2024

Merge branch 'master' into master

1d3c85a

loadams enabled auto-merge January 10, 2024 22:41

loadams added this pull request to the merge queue Jan 10, 2024

Merged via the queue into microsoft:master with commit ed10cc7 Jan 11, 2024
9 checks passed

ZonePG mentioned this pull request Jan 14, 2024

[BUG] Deepspeed inference does not support the Qwen model #4840

Closed
