[Hardware][Ascend] Add Ascend NPU backend #8054

Draft
wants to merge 24 commits into main

Conversation


@wangshuai09 wangshuai09 commented Aug 31, 2024

As mentioned in #7692, this PR makes the Ascend NPU backend available in vLLM.

RoadMap:

  • Ascend Executor
  • Ascend Worker
  • Ascend Model Runner
  • Ascend SingleOps Backend
    • custom_ops with native implementations
    • padding for multiple prompts
    • update vllm/attention/backends/ascend.py to the latest version
    • model inference: opt, llama
    • multiprocessing
  • Platform for Ascend NPU
  • Server
  • Unit-test

Supported Devices

  • Atlas 800I A2 Inference Server
  • Atlas 800T A2 Training Server
  • Atlas 300T A2 Training Card

Install

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vLLM.
  3. Test with python examples/offline_inference_npu.py (a minimal sketch of such a script is shown below).
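
For reference, here is a minimal offline-inference sketch along the lines of examples/offline_inference_npu.py; the prompts and model below are illustrative, not the exact contents of that file:

# Minimal offline inference sketch; assumes vLLM was built with VLLM_TARGET_DEVICE=npu.
from vllm import LLM, SamplingParams

prompts = ["The president of the United States is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")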

Using Dockerfile.npu

  1. Clone the npu_support branch and step into vllm
git clone -b npu_support https://github.com/wangshuai09/vllm.git
cd vllm
  2. Build the docker image
docker build -t vllm-npu -f Dockerfile.npu .
  3. Run the docker container (modify --device /dev/davinci0 according to your device)
docker run -dit -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /etc/ascend_install.info:/etc/ascend_install.info --device /dev/davinci0 --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc --shm-size 16G --name vllm vllm-npu:latest bash
  4. Enter the container
docker exec -it vllm bash
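
Once inside the container, a quick sanity check that the mapped NPU is visible (a sketch assuming torch and torch_npu are provided by the image):

# Verify the NPU device is visible from PyTorch inside the container.
import torch
import torch_npu  # noqa: F401  registers the "npu" device with torch

print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())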

Collaborators

@MengqingCao @dgy516 @hi-liuyifeng @Lin-Qingyang-Alec @liujie92 @JiasenTian @weiwei567 @JuntongMa @xiangjie
@zhangxy1234 @ldh2020 @Eviannn @agoodnoob @rumoralot

This work is still in the WIP stage.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@zer0py2c

zer0py2c commented Sep 1, 2024

Is there any document on how to use it?

@wangshuai09
Author

Is there any document on how to use it?

This work is not ready yet. If you want to develop this together, follow these steps:

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vLLM.
  3. Test python examples/offline_inference_npu.py; only a single prompt is supported for now.

@zer0py2c

zer0py2c commented Sep 2, 2024

Is there any document on how to use it?

This work is not ready yet. If you want to develop this together, follow these steps:

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vLLM.
  3. Test python examples/offline_inference_npu.py; only a single prompt is supported for now.

Thank you very much, I'll try it.

@wyzanski

wyzanski commented Sep 2, 2024

[screenshot of the error]
I followed the above steps and got the error shown in the screenshot. What could be the reason?

@wangshuai09
Author

@wyzanski There is a fatal error about git; I think you may need to recheck your git config.

@Aiwenqiuyu

Looking forward to support for domestically produced hardware!

Co-authored-by: MengqingCao <[email protected]>
@jkl375

jkl375 commented Sep 11, 2024

Thanks for the support for domestically produced hardware!

 * pad slot indices
 * use parameter passing instead of global var to control whether pad length is calculated in the sampling
@MengqingCao

MengqingCao commented Sep 11, 2024

TODO:

  • update vllm/attention/backends/ascend.py to the latest version.

@XYZliang

Thanks for supporting domestically produced hardware! Looking forward to seeing how it performs on the Ascend series; an efficient inference engine is sorely needed.

@beardog6

Is online inference supported?

@wangshuai09
Author

wangshuai09 commented Sep 18, 2024

Is online inference supported?

Do you mean starting an OpenAI-compatible API server? The latest code already supports this, like so:

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}

@XYZliang

What Ascend NPU devices are currently supported?
The latest version of lmdeploy also supports Ascend NPU, but only 910B and 310P are supported, as other devices lack the operator support they require and will need to wait for CANN implementation. I encounter errors when testing with the 910A.
However, it seems that most users are using Ascend 910A. Is it possible to adapt it directly?

@WangxuP

WangxuP commented Sep 18, 2024

Is online inference supported?

Do you mean starting an OpenAI-compatible API server? The latest code already supports this, like so:

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}

Is the Qwen series of LLMs supported?

@wangshuai09
Author

Hi @XYZliang, we don't have a device with this chip type; maybe you could test on your device with the latest code?

@wangshuai09
Author

@WangxuP We have not checked model correctness yet; here is a simple offline result:

INFO 09-18 10:03:24 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-18 10:03:24 selector.py:161] Using ASCEND_TORCH backend.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
INFO 09-18 10:03:33 npu_model_runner.py:319] Starting to load model Qwen/Qwen2-7B-Instruct...
INFO 09-18 10:03:33 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-18 10:03:33 selector.py:161] Using ASCEND_TORCH backend.
INFO 09-18 10:03:34 weight_utils.py:235] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.90it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.43it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]

INFO 09-18 10:03:39 npu_model_runner.py:330] Loading model weights took 14.2487 GB
/workspace/cmq/ws-code/vllm/vllm/model_executor/layers/sampler.py:437: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:74.)
  top_p_mask[:, -1] = False
INFO 09-18 10:03:45 gpu_executor.py:122] # GPU blocks: 37996, # CPU blocks: 4681
Processed prompts: 100%|████████| 2/2 [00:04<00:00,  2.34s/it, est. speed input: 2.56 toks/s, output: 42.72 toks/s]
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president is the commander-in-chief of the armed forces, the head of the executive branch, and is responsible for enforcing federal laws, taking care that federal laws are faithfully executed, and serving as the commander in chief of the armed forces. The president is also the head of state and represents the nation to foreign governments and to the world at large. The president is the chief diplomat, the chief executive, and the chief legislator of'
Prompt: 'The future of AI is', Generated text: " here, and it's not just about robots and self-driving cars. AI is transforming every industry, from healthcare to finance, and it's changing the way we live and work. In this article, we'll explore the latest advancements in AI and how they're impacting our world.\nOne of the most exciting areas of AI research is natural language processing (NLP). NLP is the ability of machines to understand and interpret human language. This technology is being used to create chatbots, virtual assistants,"

@RogerWYQ

Should we install MindIE first?

@zhangzhiqiangcs

Is there a Dockerfile for NPU to build the image?

@MengqingCao

Could you offer your env info, including Python, CANN, and device name? And if you confirm your code has been updated to the latest in this PR, please offer a minimal reproduction method.

python: 3.10.12; cann: toolkit_8.0.RC2, kernels_310P_8.0.RC2; inference card: 300I DUO

According to the official documentation, this operator has more restrictions on the 310P.
https://www.hiascend.com/document/detail/zh/canncommercial/80RC22/apiref/appdevgapi/context/aclnnPromptFlashAttentionV3.md#%E7%BA%A6%E6%9D%9F%E4%B8%8E%E9%99%90%E5%88%B6

The current PR is developed based on the Atlas 300T A2 training card. If you are interested in supporting the 310P, you are welcome to join the development of this PR.

@WWCTF

WWCTF commented Sep 19, 2024

Could you offer your env info, including Python, CANN, and device name? And if you confirm your code has been updated to the latest in this PR, please offer a minimal reproduction method.

python: 3.10.12; cann: toolkit_8.0.RC2, kernels_310P_8.0.RC2; inference card: 300I DUO

According to the official documentation, this operator has more restrictions on the 310P. https://www.hiascend.com/document/detail/zh/canncommercial/80RC22/apiref/appdevgapi/context/aclnnPromptFlashAttentionV3.md#%E7%BA%A6%E6%9D%9F%E4%B8%8E%E9%99%90%E5%88%B6

The current PR is developed based on the Atlas 300T A2 training card. If you are interested in supporting the 310P, you are welcome to join the development of this PR.

OK, thanks.

@wrennywang

Is multi-card inference supported?

@MengqingCao

Is multi-card inference supported?

Still a work in progress.

@XYZliang

Testing on a 910A reports an error; the full log is as follows:

❯ python examples/offline_inference_npu.py
INFO 09-19 16:37:25 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch_npu/utils/path_manager.py:82: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
  warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch_npu/utils/path_manager.py:82: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC1/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
  warnings.warn(f"Warning: The {path} owner does not match the current user.")
WARNING 09-19 16:37:29 config.py:374] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 09-19 16:37:29 llm_engine.py:232] Initializing an LLM engine (v0.6.0) with config: model='/home/ma-user/work/model/qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/home/ma-user/work/model/qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/ma-user/work/model/qwen/Qwen2-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
/home/ma-user/work/vllm/vllm/model_executor/model_loader/ascend_mindie.py:132: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert (
INFO 09-19 16:37:29 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-19 16:37:29 selector.py:161] Using ASCEND_TORCH backend.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
[W compiler_depend.ts:631] Warning: expandable_segments feature is not supportted                     and the possible cause is that driver and firmware packages do not match. (function operator())
INFO 09-19 16:38:00 npu_model_runner.py:319] Starting to load model /home/ma-user/work/model/qwen/Qwen2-7B-Instruct...
INFO 09-19 16:38:01 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-19 16:38:01 selector.py:161] Using ASCEND_TORCH backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [01:22<04:08, 82.86s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [01:25<01:10, 35.48s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [01:27<00:20, 20.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:29<00:00, 13.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:29<00:00, 22.49s/it]

INFO 09-19 16:39:32 npu_model_runner.py:330] Loading model weights took 14.2487 GB
Traceback (most recent call last):
  File "/home/ma-user/work/vllm/examples/offline_inference_npu.py", line 17, in <module>
    llm = LLM(model="/home/ma-user/work/model/qwen/Qwen2-7B-Instruct")
  File "/home/ma-user/work/vllm/vllm/entrypoints/llm.py", line 177, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ma-user/work/vllm/vllm/engine/llm_engine.py", line 564, in from_engine_args
    engine = cls(
  File "/home/ma-user/work/vllm/vllm/engine/llm_engine.py", line 338, in __init__
    self._initialize_kv_caches()
  File "/home/ma-user/work/vllm/vllm/engine/llm_engine.py", line 467, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/home/ma-user/work/vllm/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/worker/npu_worker.py", line 173, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/worker/npu_model_runner.py", line 450, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/worker/model_runner.py", line 1450, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/model_executor/models/qwen2.py", line 277, in forward
    hidden_states, residual = layer(
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/model_executor/models/qwen2.py", line 210, in forward
    hidden_states = self.self_attn(
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/model_executor/models/qwen2.py", line 157, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/attention/layer.py", line 100, in forward
    return self.impl.forward(query,
  File "/home/ma-user/work/vllm/vllm/attention/backends/ascend.py", line 479, in forward
    output = torch_npu.npu_prompt_flash_attention(
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: call aclnnPromptFlashAttentionV3 failed, detail:EZ1001: 2024-09-19-16:39:32.263.802 PromptFlashAttention LaunchAicore failed.
        TraceBack (most recent call last):
        reserve memory address failed, runtime result = 207000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        Cannot parse json for config file [/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe//kernel/config/ascend910/prompt_flash_attention.json].
        Failed to parse kernel in prompt_flash_attention.json.
        AclOpKernelInit failed opType
        PromptFlashAttention LaunchAicore failed.

[ERROR] 2024-09-19-16:39:32 (PID:2458425, Device:0, RankID:-1) ERR01005 OPS internal error
start compile Ascend C operator RmsNorm. kernel name is te_rmsnorm_dd3c29b41435fcbd1d381b4336c633fa1930bccc0077864f290be3aedc4cca75_1
compile Ascend C operator: RmsNorm success!

  * remove unnecessary file copies in Dockerfile.npu
  * replace is_npu in utils with it in platform
@MengqingCao

MengqingCao commented Sep 19, 2024

Testing on a 910A reports an error; the full log is as follows:

RuntimeError: call aclnnPromptFlashAttentionV3 failed, detail:EZ1001: 2024-09-19-16:39:32.263.802 PromptFlashAttention LaunchAicore failed.
TraceBack (most recent call last):
reserve memory address failed, runtime result = 207000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
Cannot parse json for config file [/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe//kernel/config/ascend910/prompt_flash_attention.json].
Failed to parse kernel in prompt_flash_attention.json.
AclOpKernelInit failed opType
PromptFlashAttention LaunchAicore failed.

[ERROR] 2024-09-19-16:39:32 (PID:2458425, Device:0, RankID:-1) ERR01005 OPS internal error
start compile Ascend C operator RmsNorm. kernel name is te_rmsnorm_dd3c29b41435fcbd1d381b4336c633fa1930bccc0077864f290be3aedc4cca75_1
compile Ascend C operator: RmsNorm success!

This error indicates that your device does not support the operator aclnnPromptFlashAttentionV3, which is called by torch_npu.npu_prompt_flash_attention (see the probe sketch below).
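
A quick way to probe this on a given device is sketched here; the keyword arguments to torch_npu.npu_prompt_flash_attention follow my reading of the torch_npu docs and should be treated as assumptions:

# Try launching the prompt-flash-attention kernel on tiny tensors and report
# whether the current NPU/CANN stack supports it.
import torch
import torch_npu

def prompt_flash_attention_available(num_heads: int = 8, head_dim: int = 128) -> bool:
    shape = (1, 4, num_heads * head_dim)  # BSH layout: batch, seq_len, hidden
    q = torch.randn(shape, dtype=torch.float16, device="npu")
    k = torch.randn(shape, dtype=torch.float16, device="npu")
    v = torch.randn(shape, dtype=torch.float16, device="npu")
    try:
        torch_npu.npu_prompt_flash_attention(
            q, k, v,
            num_heads=num_heads,
            scale_value=head_dim ** -0.5,
            input_layout="BSH",
        )
        return True
    except RuntimeError:
        # e.g. "call aclnnPromptFlashAttentionV3 failed" on unsupported devices
        return False

print("npu_prompt_flash_attention usable:", prompt_flash_attention_available())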

@zer0py2c

zer0py2c commented Sep 20, 2024

What Ascend NPU devices are currently supported? The latest version of lmdeploy also supports Ascend NPU, but only 910B and 310P are supported, as other devices lack the operator support they require and will need to wait for CANN implementation. I encounter errors when testing with the 910A. However, it seems that most users are using Ascend 910A. Is it possible to adapt it directly?

Bro, Ascend inference cards with the 310P chip don't work well; I've tested with LMDeploy v0.6.0. :(

@WangxuP

WangxuP commented Sep 25, 2024

It looks like the Qwen1.5-72B-Chat-GPTQ-Int8 LLM is not supported?

(py39) [root@master vllm]# vllm serve /home/models/qwen_model/qwen/Qwen1___5-72B-Chat-GPTQ-Int8 --port 6006 --served-model-name Qwen1.5-72B-Chat-int8 --trust-remote-code --max-model-len 2048 --dtype auto --disable-log-stats  --chat-template /home/me/vllm/examples/template_chatml.jinja --tensor-parallel-size 4 --quantization gptq
INFO 09-25 11:02:33 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 09-25 11:02:35 api_server.py:495] vLLM API server version 0.6.0
INFO 09-25 11:02:35 api_server.py:496] args: Namespace(model_tag='/home/models/qwen_model/qwen/Qwen1___5-72B-Chat-GPTQ-Int8', config='', host=None, port=6006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/home/me/vllm/examples/template_chatml.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/home/models/qwen_model/qwen/Qwen1___5-72B-Chat-GPTQ-Int8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=True, quantization='gptq', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen1.5-72B-Chat-int8'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0xfffdc13b23a0>)
Traceback (most recent call last):
  File "/root/anaconda3/envs/py39/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/home/me/vllm/vllm/scripts.py", line 165, in main
    args.dispatch_function(args)
  File "/home/me/vllm/vllm/scripts.py", line 37, in serve
    asyncio.run(run_server(args))
  File "/root/anaconda3/envs/py39/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/root/anaconda3/envs/py39/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/vllm/vllm/entrypoints/openai/api_server.py", line 498, in run_server
    async with build_async_engine_client(args) as async_engine_client:
  File "/root/anaconda3/envs/py39/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/home/me/vllm/vllm/entrypoints/openai/api_server.py", line 110, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/root/anaconda3/envs/py39/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/home/me/vllm/vllm/entrypoints/openai/api_server.py", line 132, in build_async_engine_client_from_engine_args
    if (model_is_embedding(engine_args.model, engine_args.trust_remote_code,
  File "/home/me/vllm/vllm/entrypoints/openai/api_server.py", line 73, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/home/me/vllm/vllm/config.py", line 243, in __init__
    self._verify_quantization()
  File "/home/me/vllm/vllm/config.py", line 302, in _verify_quantization
    quantization_override = method.override_quantization_method(
  File "/home/me/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 90, in override_quantization_method
    can_convert = cls.is_gptq_marlin_compatible(hf_quant_cfg)
  File "/home/me/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 138, in is_gptq_marlin_compatible
    return check_marlin_supported(quant_type=cls.TYPE_MAP[(num_bits, sym)],
  File "/home/me/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 78, in check_marlin_supported
    cond, _ = _check_marlin_supported(quant_type, group_size, has_zp,
  File "/home/me/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 55, in _check_marlin_supported
    major, minor = current_platform.get_device_capability()
  File "/home/me/vllm/vllm/platforms/ascend.py", line 26, in get_device_capability
    raise RuntimeError("Ascend NPU does not have device capability.")
RuntimeError: Ascend NPU does not have device capability.
[ERROR] 2024-09-25-11:02:35 (PID:2673537, Device:-1, RankID:-1) ERR99999 UNKNOWN application exception

@wangshuai09
Author

@WangxuP Quantization is not currently supported.

@WangxuP

WangxuP commented Sep 25, 2024

@WangxuP Quantization is not currently supported.

Okay, looking forward to quantization support soon.

  * fix swap blocks in ascend.py
  * add UT for copy_blocks and swap_blocks
@beardog6

Thanks, and looking forward to the remaining features being completed. Also, with the current version on a 910B, inference performance lags MindIE considerably: qwen1.5-7b-chat runs at about 20 tokens/s, versus 38 tokens/s on MindIE.
Could the FA operator not being enabled be the reason? The startup log contains:
INFO 09-26 09:35:03 selector.py:222] Cannot use _Backend.FLASH_ATTN backend on NPU.

@wangshuai09
Author

Thanks, and looking forward to the remaining features being completed. Also, with the current version on a 910B, inference performance lags MindIE considerably: qwen1.5-7b-chat runs at about 20 tokens/s, versus 38 tokens/s on MindIE. Could the FA operator not being enabled be the reason? The startup log contains: INFO 09-26 09:35:03 selector.py:222] Cannot use _Backend.FLASH_ATTN backend on NPU.

Flash attention is used by the Ascend backend in attention/backends/ascend.py; this log only shows that the Ascend backend does not use the default FLASH_ATTN backend implemented in vLLM. There is room for improving performance, such as fused ops in custom_op (a sketch of that idea follows).
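
To illustrate the fused-op idea, here is a sketch of a CustomOp-style override whose NPU path calls a fused torch_npu kernel; the forward_npu hook name and the RMSNorm example are illustrative assumptions, not code from this PR:

# A CustomOp-style RMSNorm whose NPU path calls a fused torch_npu kernel
# instead of the plain PyTorch implementation.
import torch
import torch_npu
from vllm.model_executor.custom_op import CustomOp

class AscendRMSNorm(CustomOp):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Reference implementation in plain PyTorch.
        variance = x.float().pow(2).mean(-1, keepdim=True)
        x = (x.float() * torch.rsqrt(variance + self.variance_epsilon)).to(x.dtype)
        return x * self.weight

    def forward_npu(self, x: torch.Tensor) -> torch.Tensor:
        # Fused RMSNorm kernel from torch_npu; it returns (output, rstd).
        # (Dispatching to forward_npu is assumed to be wired up by the backend.)
        return torch_npu.npu_rms_norm(x, self.weight, self.variance_epsilon)[0]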

@verigle

verigle commented Sep 27, 2024

Are there plans to adapt Ascend vLLM for qwen2-vl?

@MengqingCao

Are there plans to adapt Ascend vLLM for qwen2-vl?

Support for VLMs is on our to-do list, including qwen2-vl.

@Yikun Yikun mentioned this pull request Sep 28, 2024