[Hardware][Ascend] Add Ascend NPU backend #8054
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
Is there any document on how to use it?
This work is not ready yet. If you want to develop this together, follow this:
Very thankful, I'll try it.
@wyzanski There is a fatal error from git; I think you may need to recheck your git config.
Looking forward to support for domestically produced hardware!
Co-authored-by: MengqingCao <[email protected]>
(force-pushed from 6f89d38 to 6ae737e)
Thanks for supporting domestically produced hardware!
* pad slot indices
* use parameter passing instead of a global var to control whether pad length is calculated in sampling
TODO:
Thanks for supporting domestically produced hardware! Looking forward to how it performs on the Ascend series; an efficient inference engine has been sorely missing there.
Is online inference supported?
Does that mean starting an OpenAI-compatible API server? The latest code already supports it, like this:

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live. I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}
What Ascend NPU devices are currently supported?
Is the Qwen series of LLMs supported?
Hi @XYZliang, we don't have a device with this chip type; maybe you could test on your device with the latest code?
@WangxuP We do not check model correctness yet; here is a simple offline result:
Should we install MindIE first?
Is there a Dockerfile for NPU to build an image?
According to the official documentation, this operator has more restrictions on the 310P. The current PR is developed on the Atlas 300T A2 training card. If you are interested in supporting the 310P, you are welcome to join the development of this PR.
OK, thanks.
Can multi-card inference be supported?
Still a work in progress.
Testing on a 910A throws an error; the full log is as follows:
* remove unnecessary file copies in Dockerfile.npu
* replace is_npu in utils with it in platform
This error indicates that your device does not support this operator.
Bro, Ascend inference cards with the 310P chip don't really cut it; I've tested with LMDeploy v0.6.0. :(
(force-pushed from 7ec30ff to 0ca6849)
This looks like there's no support.
@WangxuP Quantization is not currently supported.
Okay, looking forward to support soon.
* fix swap blocks in ascend.py
* add UT for copy_blocks and swap_blocks
Thanks, and looking forward to the remaining features being completed. Also, with the current version on a 910B, inference performance trails MindIE by a wide margin: qwen1.5-7b-chat runs at 20 tokens/s here, versus up to 38 tokens/s on MindIE.
Flash Attn is used by the Ascend backend in
For Ascend vLLM, is there a plan to adapt qwen2-vl?
Support for VLMs is on our todo list, including qwen2-vl.
As mentioned in #7692, this PR makes the Ascend NPU backend available in vLLM.
Roadmap:
Supported Devices
Install:
Run VLLM_TARGET_DEVICE=npu pip install -e . to install vllm, then run python examples/offline_inference_npu.py to try it out.
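For reference, a minimal offline script in the spirit of examples/offline_inference_npu.py might look like the sketch below; the model and prompts are illustrative, and the standard vllm Python API is assumed:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
# Greedy decoding with short outputs to keep the smoke test quick.
sampling_params = SamplingParams(temperature=0.0, max_tokens=20)

# With a VLLM_TARGET_DEVICE=npu build, the NPU backend is picked up
# automatically; no extra device argument is assumed here.
llm = LLM(model="facebook/opt-125m")

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)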
Using Dockerfile.npu: modify --device /dev/davinci0 according to your device.
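As a hedged sketch (the image tag is made up, and the exact set of device and driver mounts depends on your Ascend installation), building and running the image might look like:

# build the image from the repo root
docker build -f Dockerfile.npu -t vllm-npu .

# run it, mapping the NPU devices and the host Ascend driver
docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    vllm-npu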
Collaborators:
@MengqingCao @dgy516 @hi-liuyifeng @Lin-Qingyang-Alec @liujie92 @JiasenTian @weiwei567 @JuntongMa @xiangjie @zhangxy1234 @ldh2020 @Eviannn @agoodnoob @rumoralot
This work is still at the WIP stage.