[Hardware][Ascend] Add Ascend NPU backend #8054

Draft
wants to merge 24 commits into main

Conversation


@wangshuai09 wangshuai09 commented Aug 31, 2024

As mentioned in #7692, this PR makes the Ascend NPU backend available in vLLM.

RoadMap:

  • Ascend Executor
  • Ascend Worker
  • Ascend Model Runner
  • Ascend SingleOps Backend
    • custom_ops with native implementations
    • padding for multiple prompts
    • update vllm/attention/backends/ascend.py to the latest version
    • model inference: opt, llama
    • multiprocessing
  • Platform for Ascend NPU
  • Server
  • Unit-test

Supported Devices

  • Atlas 800I A2 Inference Server
  • Atlas 800T A2 Training Server
  • Atlas 300T A2 Training Card

Install

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vLLM.
  3. Test with python examples/offline_inference_npu.py (a minimal sketch of such a script is shown below).
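
For reference, here is a minimal offline-inference sketch along the lines of examples/offline_inference_npu.py; the prompts and model below are illustrative, not the exact contents of that file:

# Minimal offline inference sketch; assumes vLLM was built with VLLM_TARGET_DEVICE=npu.
from vllm import LLM, SamplingParams

prompts = ["The president of the United States is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")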

Using Dockerfile.npu

  1. Clone the npu_support branch and step into vllm
git clone -b npu_support https://github.com/wangshuai09/vllm.git
cd vllm
  2. Build the docker image
docker build -t vllm-npu -f Dockerfile.npu .
  3. Run the docker container (modify --device /dev/davinci0 according to your device)
docker run -dit -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /etc/ascend_install.info:/etc/ascend_install.info --device /dev/davinci0 --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc --shm-size 16G --name vllm vllm-npu:latest bash
  4. Enter the container
docker exec -it vllm bash
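
Once inside the container, a quick sanity check that the mapped NPU is visible (a sketch assuming torch and torch_npu are provided by the image):

# Verify the NPU device is visible from PyTorch inside the container.
import torch
import torch_npu  # noqa: F401  registers the "npu" device with torch

print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())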

Collaborators

@MengqingCao @dgy516 @hi-liuyifeng @Lin-Qingyang-Alec @liujie92 @JiasenTian @weiwei567 @JuntongMa @xiangjie
@zhangxy1234 @ldh2020 @Eviannn @agoodnoob @rumoralot

This work is still in the WIP stage.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@zer0py2c

zer0py2c commented Sep 1, 2024

Is there any document on how to use it?

@wangshuai09
Author

Is there any document on how to use it?

This work is not ready yet. If you want to develop this together, follow these steps:

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vLLM.
  3. Test python examples/offline_inference_npu.py; only a single prompt is supported for now.

@zer0py2c

zer0py2c commented Sep 2, 2024

Is there any document on how to use it?

This work is not ready yet. If you want to develop this together, follow these steps:

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vLLM.
  3. Test python examples/offline_inference_npu.py; only a single prompt is supported for now.

Thank you very much, I'll try it.

@wyzanski

wyzanski commented Sep 2, 2024

[screenshot of the error]
I followed the above steps and got the error shown in the screenshot. What could be the reason?

@wangshuai09
Author

@wyzanski There is a fatal error about git; I think you may need to recheck your git config.

@Aiwenqiuyu

Looking forward to support for domestically produced hardware!

Co-authored-by: MengqingCao <[email protected]>
@jkl375

jkl375 commented Sep 11, 2024

Thanks for the support for domestically produced hardware!

 * pad slot indices
 * use parameter passing instead of global var to control whether pad length is calculated in the sampling
@MengqingCao

MengqingCao commented Sep 11, 2024

TODO:

  • update vllm/attention/backends/ascend.py to the latest version.

@XYZliang

Thanks for supporting domestically produced hardware! Looking forward to seeing how it performs on the Ascend series; an efficient inference engine is sorely needed.

@beardog6

Is online inference supported?

@wangshuai09
Author

wangshuai09 commented Sep 18, 2024

Is online inference supported?

Do you mean starting an OpenAI-compatible API server? The latest code already supports this, like so:

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}

@XYZliang

What Ascend NPU devices are currently supported?
The latest version of lmdeploy also supports Ascend NPU, but only 910B and 310P are supported, as other devices lack the operator support they require and will need to wait for CANN implementation. I encounter errors when testing with the 910A.
However, it seems that most users are using Ascend 910A. Is it possible to adapt it directly?

@WangxuP

WangxuP commented Sep 18, 2024

Is online inference supported?

Do you mean starting an OpenAI-compatible API server? The latest code already supports this, like so:

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}

Is the Qwen series of LLMs supported?

@wangshuai09
Author

Hi @XYZliang, we don't have a device with this chip type; maybe you could test on your device with the latest code?

@wangshuai09
Author

@WangxuP We have not checked model correctness yet; here is a simple offline result:

INFO 09-18 10:03:24 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-18 10:03:24 selector.py:161] Using ASCEND_TORCH backend.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
INFO 09-18 10:03:33 npu_model_runner.py:319] Starting to load model Qwen/Qwen2-7B-Instruct...
INFO 09-18 10:03:33 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-18 10:03:33 selector.py:161] Using ASCEND_TORCH backend.
INFO 09-18 10:03:34 weight_utils.py:235] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.90it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.43it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]

INFO 09-18 10:03:39 npu_model_runner.py:330] Loading model weights took 14.2487 GB
/workspace/cmq/ws-code/vllm/vllm/model_executor/layers/sampler.py:437: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:74.)
  top_p_mask[:, -1] = False
INFO 09-18 10:03:45 gpu_executor.py:122] # GPU blocks: 37996, # CPU blocks: 4681
Processed prompts: 100%|████████| 2/2 [00:04<00:00,  2.34s/it, est. speed input: 2.56 toks/s, output: 42.72 toks/s]
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president is the commander-in-chief of the armed forces, the head of the executive branch, and is responsible for enforcing federal laws, taking care that federal laws are faithfully executed, and serving as the commander in chief of the armed forces. The president is also the head of state and represents the nation to foreign governments and to the world at large. The president is the chief diplomat, the chief executive, and the chief legislator of'
Prompt: 'The future of AI is', Generated text: " here, and it's not just about robots and self-driving cars. AI is transforming every industry, from healthcare to finance, and it's changing the way we live and work. In this article, we'll explore the latest advancements in AI and how they're impacting our world.\nOne of the most exciting areas of AI research is natural language processing (NLP). NLP is the ability of machines to understand and interpret human language. This technology is being used to create chatbots, virtual assistants,"

@RogerWYQ

Should we install MindIE first?

@zhangzhiqiangcs

Is there a Dockerfile for NPU to build the image?

@MengqingCao

Could you offer your env info, including Python, CANN, and device name? And if you confirm your code has been updated to the latest in this PR, please offer a minimal reproduction method.

python: 3.10.12; cann: toolkit_8.0.RC2, kernels_310P_8.0.RC2; inference card: 300I DUO

According to the official documentation, this operator has more restrictions on the 310P.
https://www.hiascend.com/document/detail/zh/canncommercial/80RC22/apiref/appdevgapi/context/aclnnPromptFlashAttentionV3.md#%E7%BA%A6%E6%9D%9F%E4%B8%8E%E9%99%90%E5%88%B6

The current PR is developed based on the Atlas 300T A2 training card. If you are interested in supporting the 310P, you are welcome to join the development of this PR.

@WWCTF

WWCTF commented Sep 19, 2024

Could you offer your env info, including Python, CANN, and device name? And if you confirm your code has been updated to the latest in this PR, please offer a minimal reproduction method.

python: 3.10.12; cann: toolkit_8.0.RC2, kernels_310P_8.0.RC2; inference card: 300I DUO

According to the official documentation, this operator has more restrictions on the 310P. https://www.hiascend.com/document/detail/zh/canncommercial/80RC22/apiref/appdevgapi/context/aclnnPromptFlashAttentionV3.md#%E7%BA%A6%E6%9D%9F%E4%B8%8E%E9%99%90%E5%88%B6

The current PR is developed based on the Atlas 300T A2 training card. If you are interested in supporting the 310P, you are welcome to join the development of this PR.

OK, thanks.

@wrennywang

Is multi-card inference supported?

@MengqingCao

Is multi-card inference supported?

Still a work in progress.

@XYZliang

Testing on a 910A reports an error; the full log is as follows:

❯ python examples/offline_inference_npu.py
INFO 09-19 16:37:25 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch_npu/utils/path_manager.py:82: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
  warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch_npu/utils/path_manager.py:82: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC1/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
  warnings.warn(f"Warning: The {path} owner does not match the current user.")
WARNING 09-19 16:37:29 config.py:374] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 09-19 16:37:29 llm_engine.py:232] Initializing an LLM engine (v0.6.0) with config: model='/home/ma-user/work/model/qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/home/ma-user/work/model/qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/ma-user/work/model/qwen/Qwen2-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
/home/ma-user/work/vllm/vllm/model_executor/model_loader/ascend_mindie.py:132: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert (
INFO 09-19 16:37:29 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-19 16:37:29 selector.py:161] Using ASCEND_TORCH backend.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
[W compiler_depend.ts:631] Warning: expandable_segments feature is not supportted                     and the possible cause is that driver and firmware packages do not match. (function operator())
INFO 09-19 16:38:00 npu_model_runner.py:319] Starting to load model /home/ma-user/work/model/qwen/Qwen2-7B-Instruct...
INFO 09-19 16:38:01 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-19 16:38:01 selector.py:161] Using ASCEND_TORCH backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [01:22<04:08, 82.86s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [01:25<01:10, 35.48s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [01:27<00:20, 20.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:29<00:00, 13.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:29<00:00, 22.49s/it]

INFO 09-19 16:39:32 npu_model_runner.py:330] Loading model weights took 14.2487 GB
Traceback (most recent call last):
  File "/home/ma-user/work/vllm/examples/offline_inference_npu.py", line 17, in <module>
    llm = LLM(model="/home/ma-user/work/model/qwen/Qwen2-7B-Instruct")
  File "/home/ma-user/work/vllm/vllm/entrypoints/llm.py", line 177, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ma-user/work/vllm/vllm/engine/llm_engine.py", line 564, in from_engine_args
    engine = cls(
  File "/home/ma-user/work/vllm/vllm/engine/llm_engine.py", line 338, in __init__
    self._initialize_kv_caches()
  File "/home/ma-user/work/vllm/vllm/engine/llm_engine.py", line 467, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/home/ma-user/work/vllm/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/worker/npu_worker.py", line 173, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/worker/npu_model_runner.py", line 450, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/worker/model_runner.py", line 1450, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/model_executor/models/qwen2.py", line 277, in forward
    hidden_states, residual = layer(
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/model_executor/models/qwen2.py", line 210, in forward
    hidden_states = self.self_attn(
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/model_executor/models/qwen2.py", line 157, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/work/vllm/vllm/attention/layer.py", line 100, in forward
    return self.impl.forward(query,
  File "/home/ma-user/work/vllm/vllm/attention/backends/ascend.py", line 479, in forward
    output = torch_npu.npu_prompt_flash_attention(
  File "/home/ma-user/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: call aclnnPromptFlashAttentionV3 failed, detail:EZ1001: 2024-09-19-16:39:32.263.802 PromptFlashAttention LaunchAicore failed.
        TraceBack (most recent call last):
        reserve memory address failed, runtime result = 207000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        Cannot parse json for config file [/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe//kernel/config/ascend910/prompt_flash_attention.json].
        Failed to parse kernel in prompt_flash_attention.json.
        AclOpKernelInit failed opType
        PromptFlashAttention LaunchAicore failed.

[ERROR] 2024-09-19-16:39:32 (PID:2458425, Device:0, RankID:-1) ERR01005 OPS internal error
start compile Ascend C operator RmsNorm. kernel name is te_rmsnorm_dd3c29b41435fcbd1d381b4336c633fa1930bccc0077864f290be3aedc4cca75_1
compile Ascend C operator: RmsNorm success!

  * remove unnecessary file copies in Dockerfile.npu
  * replace is_npu in utils with it in platform
@MengqingCao

MengqingCao commented Sep 19, 2024

Testing on a 910A reports an error; the full log is as follows:

RuntimeError: call aclnnPromptFlashAttentionV3 failed, detail:EZ1001: 2024-09-19-16:39:32.263.802 PromptFlashAttention LaunchAicore failed.
TraceBack (most recent call last):
reserve memory address failed, runtime result = 207000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
Cannot parse json for config file [/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe//kernel/config/ascend910/prompt_flash_attention.json].
Failed to parse kernel in prompt_flash_attention.json.
AclOpKernelInit failed opType
PromptFlashAttention LaunchAicore failed.

[ERROR] 2024-09-19-16:39:32 (PID:2458425, Device:0, RankID:-1) ERR01005 OPS internal error
start compile Ascend C operator RmsNorm. kernel name is te_rmsnorm_dd3c29b41435fcbd1d381b4336c633fa1930bccc0077864f290be3aedc4cca75_1
compile Ascend C operator: RmsNorm success!

This error indicates that your device does not support the operator aclnnPromptFlashAttentionV3, which is called by torch_npu.npu_prompt_flash_attention (see the probe sketch below).
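
A quick way to probe this on a given device is sketched here; the keyword arguments to torch_npu.npu_prompt_flash_attention follow my reading of the torch_npu docs and should be treated as assumptions:

# Try launching the prompt-flash-attention kernel on tiny tensors and report
# whether the current NPU/CANN stack supports it.
import torch
import torch_npu

def prompt_flash_attention_available(num_heads: int = 8, head_dim: int = 128) -> bool:
    shape = (1, 4, num_heads * head_dim)  # BSH layout: batch, seq_len, hidden
    q = torch.randn(shape, dtype=torch.float16, device="npu")
    k = torch.randn(shape, dtype=torch.float16, device="npu")
    v = torch.randn(shape, dtype=torch.float16, device="npu")
    try:
        torch_npu.npu_prompt_flash_attention(
            q, k, v,
            num_heads=num_heads,
            scale_value=head_dim ** -0.5,
            input_layout="BSH",
        )
        return True
    except RuntimeError:
        # e.g. "call aclnnPromptFlashAttentionV3 failed" on unsupported devices
        return False

print("npu_prompt_flash_attention usable:", prompt_flash_attention_available())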

@zer0py2c

zer0py2c commented Sep 20, 2024

What Ascend NPU devices are currently supported? The latest version of lmdeploy also supports Ascend NPU, but only 910B and 310P are supported, as other devices lack the operator support they require and will need to wait for CANN implementation. I encounter errors when testing with the 910A. However, it seems that most users are using Ascend 910A. Is it possible to adapt it directly?

Bro, Ascend inference cards with the 310P chip don't work well; I've tested with LMDeploy v0.6.0. :(

@WangxuP

WangxuP commented Sep 25, 2024

It looks like the Qwen1.5-72B-Chat-GPTQ-Int8 LLM is not supported?

(py39) [root@master vllm]# vllm serve /home/models/qwen_model/qwen/Qwen1___5-72B-Chat-GPTQ-Int8 --port 6006 --served-model-name Qwen1.5-72B-Chat-int8 --trust-remote-code --max-model-len 2048 --dtype auto --disable-log-stats  --chat-template /home/me/vllm/examples/template_chatml.jinja --tensor-parallel-size 4 --quantization gptq
INFO 09-25 11:02:33 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 09-25 11:02:35 api_server.py:495] vLLM API server version 0.6.0
INFO 09-25 11:02:35 api_server.py:496] args: Namespace(model_tag='/home/models/qwen_model/qwen/Qwen1___5-72B-Chat-GPTQ-Int8', config='', host=None, port=6006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/home/me/vllm/examples/template_chatml.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/home/models/qwen_model/qwen/Qwen1___5-72B-Chat-GPTQ-Int8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=True, quantization='gptq', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen1.5-72B-Chat-int8'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0xfffdc13b23a0>)
Traceback (most recent call last):
  File "/root/anaconda3/envs/py39/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/home/me/vllm/vllm/scripts.py", line 165, in main
    args.dispatch_function(args)
  File "/home/me/vllm/vllm/scripts.py", line 37, in serve
    asyncio.run(run_server(args))
  File "/root/anaconda3/envs/py39/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/root/anaconda3/envs/py39/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/vllm/vllm/entrypoints/openai/api_server.py", line 498, in run_server
    async with build_async_engine_client(args) as async_engine_client:
  File "/root/anaconda3/envs/py39/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/home/me/vllm/vllm/entrypoints/openai/api_server.py", line 110, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/root/anaconda3/envs/py39/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/home/me/vllm/vllm/entrypoints/openai/api_server.py", line 132, in build_async_engine_client_from_engine_args
    if (model_is_embedding(engine_args.model, engine_args.trust_remote_code,
  File "/home/me/vllm/vllm/entrypoints/openai/api_server.py", line 73, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/home/me/vllm/vllm/config.py", line 243, in __init__
    self._verify_quantization()
  File "/home/me/vllm/vllm/config.py", line 302, in _verify_quantization
    quantization_override = method.override_quantization_method(
  File "/home/me/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 90, in override_quantization_method
    can_convert = cls.is_gptq_marlin_compatible(hf_quant_cfg)
  File "/home/me/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 138, in is_gptq_marlin_compatible
    return check_marlin_supported(quant_type=cls.TYPE_MAP[(num_bits, sym)],
  File "/home/me/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 78, in check_marlin_supported
    cond, _ = _check_marlin_supported(quant_type, group_size, has_zp,
  File "/home/me/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 55, in _check_marlin_supported
    major, minor = current_platform.get_device_capability()
  File "/home/me/vllm/vllm/platforms/ascend.py", line 26, in get_device_capability
    raise RuntimeError("Ascend NPU does not have device capability.")
RuntimeError: Ascend NPU does not have device capability.
[ERROR] 2024-09-25-11:02:35 (PID:2673537, Device:-1, RankID:-1) ERR99999 UNKNOWN application exception

@wangshuai09
Author

@WangxuP Quantization is not currently supported.

@WangxuP

WangxuP commented Sep 25, 2024

@WangxuP Quantization is not currently supported.

Okay, looking forward to quantization support soon.

  * fix swap blocks in ascend.py
  * add UT for copy_blocks and swap_blocks
@beardog6

Thanks, and looking forward to the remaining features being completed. Also, with the current version on a 910B, inference performance lags MindIE considerably: qwen1.5-7b-chat runs at about 20 tokens/s, versus 38 tokens/s on MindIE.
Could the FA operator not being enabled be the reason? The startup log contains:
INFO 09-26 09:35:03 selector.py:222] Cannot use _Backend.FLASH_ATTN backend on NPU.

@wangshuai09
Author

Thanks, and looking forward to the remaining features being completed. Also, with the current version on a 910B, inference performance lags MindIE considerably: qwen1.5-7b-chat runs at about 20 tokens/s, versus 38 tokens/s on MindIE. Could the FA operator not being enabled be the reason? The startup log contains: INFO 09-26 09:35:03 selector.py:222] Cannot use _Backend.FLASH_ATTN backend on NPU.

Flash attention is used by the Ascend backend in attention/backends/ascend.py; this log only shows that the Ascend backend does not use the default FLASH_ATTN backend implemented in vLLM. There is room for improving performance, such as fused ops in custom_op (a sketch of that idea follows).
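
To illustrate the fused-op idea, here is a sketch of a CustomOp-style override whose NPU path calls a fused torch_npu kernel; the forward_npu hook name and the RMSNorm example are illustrative assumptions, not code from this PR:

# A CustomOp-style RMSNorm whose NPU path calls a fused torch_npu kernel
# instead of the plain PyTorch implementation.
import torch
import torch_npu
from vllm.model_executor.custom_op import CustomOp

class AscendRMSNorm(CustomOp):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Reference implementation in plain PyTorch.
        variance = x.float().pow(2).mean(-1, keepdim=True)
        x = (x.float() * torch.rsqrt(variance + self.variance_epsilon)).to(x.dtype)
        return x * self.weight

    def forward_npu(self, x: torch.Tensor) -> torch.Tensor:
        # Fused RMSNorm kernel from torch_npu; it returns (output, rstd).
        # (Dispatching to forward_npu is assumed to be wired up by the backend.)
        return torch_npu.npu_rms_norm(x, self.weight, self.variance_epsilon)[0]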

@verigle

verigle commented Sep 27, 2024

Are there plans to adapt Ascend vLLM for qwen2-vl?

@MengqingCao

Are there plans to adapt Ascend vLLM for qwen2-vl?

Support for VLMs is on our to-do list, including qwen2-vl.

@Yikun Yikun mentioned this pull request Sep 28, 2024