
[DO NOT MERGE] Upstream codebase diff #470

Draft · wants to merge 1,509 commits into main

Conversation

@kzawora-intel commented Nov 6, 2024

Scope of changes:

  • Contiguous PA
  • Multi-step scheduling
  • Automatic prefix caching
  • Padding-aware scheduling/max_num_prefill_seqs
  • Guided decoding fixes
  • FP8 support (INC/w8a8/weights_load_device)
  • ApplyToppTopkScalar sampler optimization
  • LoRA/MultiLoRA support
  • FusedMoE support
  • Model changes (adding mark_steps)
  • Tests
  • FakeHPU mode
  • CI stuff (.jenkins, .github)
  • Lots of minor stuff (RNG, FSDPA flag, reduced block fragmentation)

@@ -0,0 +1,35 @@
name: cpu-test

Check failure · Code scanning / Scorecard · Token-Permissions (High)

score is 0: no topLevel permission defined
Remediation tip: visit https://app.stepsecurity.io/secureworkflow, tick 'Restrict permissions for GITHUB_TOKEN', and untick the other options. To resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead. Click the Remediation section below for further remediation help.
@kzawora-intel marked this pull request as draft November 6, 2024 13:49
@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label Nov 8, 2024
@@ -0,0 +1,45 @@
name: codespell

Check failure · Code scanning / Scorecard · Token-Permissions (High)

score is 0: no topLevel permission defined (same alert and remediation as for cpu-test above)
def test_stateless_process_group(worker):
    port1 = get_open_port()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", port1))

Check warning · Code scanning / CodeQL · Binding a socket to all network interfaces (Medium, test)

'' binds a socket to all interfaces.

Copilot Autofix AI about 2 months ago

To fix the problem, we need to bind the socket to a specific interface instead of all interfaces. In this case, we can bind it to the loopback interface 127.0.0.1, which is commonly used for local testing and development. This restricts the socket to accepting connections only from the local machine, reducing the security risk.

Suggested changeset 1: tests/distributed/test_utils.py (Autofix patch)
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/distributed/test_utils.py b/tests/distributed/test_utils.py
--- a/tests/distributed/test_utils.py
+++ b/tests/distributed/test_utils.py
@@ -124,3 +124,3 @@
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", port1))
+        s.bind(("127.0.0.1", port1))
         port2 = get_open_port()
EOF
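For a quick local check of the two bind modes, here is a minimal standard-library sketch (separate from the patch itself; the printed port is whatever the OS assigns):

```python
import socket

# "" binds the listening socket on every interface; "127.0.0.1" restricts
# it to the loopback device, which is all a local test needs.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    print(s.getsockname())    # e.g. ('127.0.0.1', 54321)
```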
sock = socket.socket(family=family, type=socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(addr)

Check warning · Code scanning / CodeQL · Binding a socket to all network interfaces (Medium)

'' binds a socket to all interfaces.

Copilot Autofix AI 12 days ago

To fix the problem, we need to ensure that the socket is not bound to all network interfaces. Instead, we should bind it to a specific interface. This can be achieved by modifying the create_server_socket function to check whether the provided address is empty or 0.0.0.0 and, if so, replace it with a specific interface address.

  1. Modify the create_server_socket function to check if the address is empty or 0.0.0.0.
  2. If the address is empty or 0.0.0.0, replace it with a specific interface address (e.g., 127.0.0.1 for localhost).
  3. Update the sock.bind(addr) call to use the modified address.
Suggested changeset 1: vllm/entrypoints/openai/api_server.py (Autofix patch)
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py
--- a/vllm/entrypoints/openai/api_server.py
+++ b/vllm/entrypoints/openai/api_server.py
@@ -759,2 +759,6 @@
 
+    # Bind to a specific interface if the address is empty or 0.0.0.0
+    if addr[0] in ("", "0.0.0.0"):
+        addr = ("127.0.0.1", addr[1])
+
     sock = socket.socket(family=family, type=socket.SOCK_STREAM)
EOF
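Worth noting: an OpenAI-compatible API server often binds 0.0.0.0 deliberately, so unconditionally rewriting the address to 127.0.0.1 as the autofix does would break remote clients. A hedged alternative sketch (the VLLM_LOOPBACK_ONLY environment variable is hypothetical, introduced here only for illustration, and the function below is a simplified stand-in for the real one):

```python
import os
import socket

def create_server_socket(addr: tuple[str, int]) -> socket.socket:
    # Illustrative only: restrict to loopback when the operator opts in,
    # since binding all interfaces is often intentional for an API server.
    if os.environ.get("VLLM_LOOPBACK_ONLY") and addr[0] in ("", "0.0.0.0"):
        addr = ("127.0.0.1", addr[1])
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(addr)
    return sock
```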
# Llama3.2 models more reliable.

TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]",

Check failure · Code scanning / CodeQL · Inefficient regular expression (High), reported nine times against this pattern

Parts of the regular expression may cause exponential backtracking on strings starting with:
  • '[' and containing many repetitions of 'AA(),'
  • '[' and containing many repetitions of 'AA()'
  • '[A(' and containing many repetitions of 'AA=,'
  • '[A(' and containing many repetitions of 'AA= ),A('
  • '[A(' and containing many repetitions of 'AA=)A('
  • '[A(A=' and containing many repetitions of ',A='
  • '[A(A=' and containing many repetitions of ')A(A='
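The backtracking is easy to reproduce locally. Below is a minimal, self-contained timing sketch (the adversarial input is taken from the first alert message; the repetition counts are deliberately small so the demo terminates quickly):

```python
import re
import time

# The flagged pattern, reproduced verbatim from the diff above.
TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*"
    r"([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]")

# '[' followed by repetitions of 'AA(),' and no closing ']': the match must
# fail, and the engine backtracks over the many ways each 'AA' can be split
# between [a-zA-Z]+ and \w*.
for n in (12, 14, 16, 18, 20):
    attack = "[" + "AA()," * n
    start = time.perf_counter()
    TOOL_CALL_REGEX.match(attack)  # returns None, but only after backtracking
    print(f"n={n:2d}  {time.perf_counter() - start:.3f}s")  # grows rapidly with n
```

One common remediation, not the fix applied in this PR, is to replace the unanchored `.*` inside the repeated groups with a character class that cannot cross the delimiters (for example `[^,()]*`), which removes the ambiguity the engine has to backtrack over.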
                return resp

            except Exception as e:
                return web.Response(text=f"Error: {str(e)}", status=500)

Check warning · Code scanning / CodeQL · Information exposure through an exception (Medium, test)

Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix AI about 2 months ago

To fix the problem, we need to ensure that detailed exception messages are not exposed to the end user. Instead, we should log the detailed error message on the server and return a generic error message to the user:

  1. Import the logging module to enable logging of exceptions.
  2. Configure the logging settings if not already configured.
  3. Modify the exception handling block to log the exception and return a generic error message.
Suggested changeset 1: benchmarks/disagg_benchmarks/round_robin_proxy.py (Autofix patch)
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/benchmarks/disagg_benchmarks/round_robin_proxy.py b/benchmarks/disagg_benchmarks/round_robin_proxy.py
--- a/benchmarks/disagg_benchmarks/round_robin_proxy.py
+++ b/benchmarks/disagg_benchmarks/round_robin_proxy.py
@@ -2,3 +2,3 @@
 import itertools
-
+import logging
 import aiohttp
@@ -6,2 +6,3 @@
 
+logging.basicConfig(level=logging.ERROR)
 
@@ -39,3 +40,4 @@
             except Exception as e:
-                return web.Response(text=f"Error: {str(e)}", status=500)
+                logging.error("An error occurred while handling the request", exc_info=True)
+                return web.Response(text="An internal error has occurred!", status=500)
 
EOF
mgoin and others added 20 commits January 3, 2025 22:36
divakar-amd and others added 30 commits January 17, 2025 14:49
Multimodality fix for llava after rebase

Fix for:
```
ERROR 12-16 12:31:11 engine.py:136] NotImplementedError: Unknown multi-modal data type: attention_mask
```
This PR updates `test/lora/utils.py` based on the latest rebase.
1. This PR updates the habana_main README_GAUDI to the Technical Writer-reviewed version shipped in v1.19.0 (the habana_main and v1.19.0 README_GAUDI files had diverged).
2. It also fixes URLs broken by the recent restructuring of the upstream vllm examples folder.
3. It adds notes in the examples folder for new users, redirecting them to the Gaudi-specific examples in README_GAUDI.md.
Change vllm-hpu-extension revision to ae726d4
Changes the sampler used by dummy sequences to greedy if any sequence is using it, which prevents sampler recompilations (see the sketch below).
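A simplified sketch of the idea (the class and helper below are illustrative stand-ins, not vLLM's actual API):

```python
from dataclasses import dataclass

# Dummy (padding) sequences adopt greedy sampling whenever any real sequence
# is greedy, so the HPU sampler graph is compiled for one variant instead of
# being recompiled for a mixed greedy/random batch.
@dataclass
class SamplingParams:
    temperature: float = 1.0  # 0.0 is treated as greedy sampling

def params_for_dummy(real_params: list[SamplingParams]) -> SamplingParams:
    if any(p.temperature == 0.0 for p in real_params):
        return SamplingParams(temperature=0.0)  # follow the greedy path
    return SamplingParams()  # default random sampling
```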
- Resolves an issue caused by the release of triton v3.2.0 (January 23rd, 2025). This is a workaround; a proper fix to support triton v3.2.0 may be required.

The error raised when triton v3.2.0 is used is shown below.

```bash
Traceback (most recent call last):
  File "/workspace/vllm/test_evaluation.py", line 15, in <module>
    from vllm import LLM, SamplingParams
  File "/workspace/vllm/vllm/__init__.py", line 7, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "/workspace/vllm/vllm/engine/arg_utils.py", line 11, in <module>
    from vllm.config import (CacheConfig, ConfigFormat, DecodingConfig,
  File "/workspace/vllm/vllm/config.py", line 16, in <module>
    from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
  File "/workspace/vllm/vllm/model_executor/layers/quantization/__init__.py", line 6, in <module>
    from vllm.model_executor.layers.quantization.awq_marlin import AWQMarlinConfig
  File "/workspace/vllm/vllm/model_executor/layers/quantization/awq_marlin.py", line 6, in <module>
    import vllm.model_executor.layers.fused_moe  # noqa
  File "/workspace/vllm/vllm/model_executor/layers/fused_moe/__init__.py", line 34, in <module>
    import vllm.model_executor.layers.fused_moe.fused_marlin_moe  # noqa
  File "/workspace/vllm/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 8, in <module>
    from vllm.model_executor.layers.fused_moe.fused_moe import (
  File "/workspace/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 18, in <module>
    from vllm_hpu_extension.ops import scaled_fp8_quant
  File "/usr/local/lib/python3.10/dist-packages/vllm_hpu_extension/ops.py", line 9, in <module>
    import habana_frameworks.torch as htorch
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/__init__.py", line 54, in <module>
    import habana_frameworks.torch.core
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/__init__.py", line 114, in <module>
    import_compilers()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/dynamo/compile_backend/backends.py", line 39, in import_compilers
    from .compilers import hpu_inference_compiler, hpu_training_compiler_bw, hpu_training_compiler_fw
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/dynamo/compile_backend/compilers.py", line 27, in <module>
    from .freezing_passes import freeze
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/dynamo/compile_backend/freezing_passes.py", line 28, in <module>
    from torch._inductor.freezing import discard_traced_gm_params, invalidate_eager_modules, replace_params_with_constants
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/freezing.py", line 15, in <module>
    from torch._inductor.fx_passes.freezing_patterns import freezing_passes
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/fx_passes/freezing_patterns.py", line 5, in <module>
    from torch._inductor.compile_fx import fake_tensor_prop
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 49, in <module>
    from torch._inductor.debug import save_args_for_compile_fx_inner
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/debug.py", line 26, in <module>
    from . import config, ir  # noqa: F811, this is needed
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/ir.py", line 77, in <module>
    from .runtime.hints import ReductionHint
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/runtime/hints.py", line 36, in <module>
    attr_desc_fields = {f.name for f in fields(AttrsDescriptor)}
  File "/usr/lib/python3.10/dataclasses.py", line 1198, in fields
    raise TypeError('must be called with a dataclass type or instance') from None
TypeError: must be called with a dataclass type or instance
```

Signed-off-by: Voas, Tanner <[email protected]>
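Until a proper fix lands, one way to fail fast with an actionable message instead of the opaque TypeError above (a sketch only; the >= 3.2.0 bound is taken from this report, and the check is not part of the PR):

```python
import importlib.metadata

def check_triton_version() -> None:
    # Assumption from the report above: triton >= 3.2.0 breaks the
    # habana_frameworks -> torch._inductor import chain.
    version = importlib.metadata.version("triton")
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (3, 2):
        raise RuntimeError(
            f"triton {version} detected; habana_frameworks' inductor import "
            "path fails with triton >= 3.2.0. Pin 'triton<3.2.0' until a "
            "proper fix lands.")
```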
…730)

Currently we get a hang at the end of the script when using TP>1 and multi-step scheduling. This is caused by the driver worker not notifying the remaining workers that the execution loop has ended. This PR works around the issue by making sure that all workers are notified at the end of the `llm_engine` loop.
Another possible workaround would be to extend this check:
https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/engine/llm_engine.py#L1379
with `or not self.has_unfinished_requests()`. A generic sketch of the stop-notification pattern follows.
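This sketch is illustrative only, using plain torch.distributed primitives rather than vLLM's actual worker plumbing:

```python
import torch.distributed as dist

STOP = None  # sentinel broadcast by the driver to end the loop

def driver_shutdown(group=None):
    # Driver (rank 0) tells every TP worker that the engine loop is over.
    dist.broadcast_object_list([STOP], src=0, group=group)

def worker_loop(group=None):
    while True:
        payload = [None]
        # Non-driver workers block here; without the final broadcast above
        # they would wait forever, which is the hang described in this PR.
        dist.broadcast_object_list(payload, src=0, group=group)
        if payload[0] is STOP:
            break
        # ... execute one model step using payload[0] ...
```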
This PR enables multi-step scheduling for encoder-decoder models.
This is required for running already-quantized models on HPU using the fp8 quantization method (rather than "inc").
Labels: habana (Issues or PRs submitted by Habana Labs)