
[DO NOT MERGE] Upstream codebase diff #470

Draft · wants to merge 1,509 commits into main

Conversation

@kzawora-intel commented Nov 6, 2024

Scope of changes:

  • Contiguous PA
  • Multi-step scheduling
  • Automatic prefix caching
  • Padding-aware scheduling/max_num_prefill_seqs
  • Guided decoding fixes
  • FP8 support (INC/w8a8/weights_load_device)
  • ApplyToppTopkScalar sampler optimization
  • LoRA/MultiLoRA support
  • FusedMoE support
  • Model changes (adding mark_steps)
  • Tests
  • FakeHPU mode
  • CI stuff (.jenkins, .github)
  • Lots of minor stuff (RNG, FSDPA flag, reduced block fragmentation)

@@ -0,0 +1,35 @@
name: cpu-test

Check failure · Code scanning / Scorecard · Token-Permissions (High)

score is 0: no topLevel permission defined
Remediation tip: visit https://app.stepsecurity.io/secureworkflow, tick 'Restrict permissions for GITHUB_TOKEN', and untick the other options. To resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead. Click the Remediation section below for further remediation help.
@kzawora-intel marked this pull request as draft November 6, 2024 13:49
@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label Nov 8, 2024
@@ -0,0 +1,45 @@
name: codespell

Check failure · Code scanning / Scorecard · Token-Permissions (High)

score is 0: no topLevel permission defined (same alert and remediation as for cpu-test above)
def test_stateless_process_group(worker):
    port1 = get_open_port()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", port1))

Check warning · Code scanning / CodeQL · Binding a socket to all network interfaces (Medium, test)

'' binds a socket to all interfaces.

Copilot Autofix AI about 2 months ago

To fix the problem, we need to bind the socket to a specific interface instead of all interfaces. In this case, we can bind it to the loopback interface 127.0.0.1, which is commonly used for local testing and development. This restricts the socket to accepting connections only from the local machine, reducing the security risk.

Suggested changeset 1: tests/distributed/test_utils.py (Autofix patch)
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/distributed/test_utils.py b/tests/distributed/test_utils.py
--- a/tests/distributed/test_utils.py
+++ b/tests/distributed/test_utils.py
@@ -124,3 +124,3 @@
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", port1))
+        s.bind(("127.0.0.1", port1))
         port2 = get_open_port()
EOF
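For a quick local check of the two bind modes, here is a minimal standard-library sketch (separate from the patch itself; the printed port is whatever the OS assigns):

```python
import socket

# "" binds the listening socket on every interface; "127.0.0.1" restricts
# it to the loopback device, which is all a local test needs.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    print(s.getsockname())    # e.g. ('127.0.0.1', 54321)
```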
sock = socket.socket(family=family, type=socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(addr)

Check warning · Code scanning / CodeQL · Binding a socket to all network interfaces (Medium)

'' binds a socket to all interfaces.

Copilot Autofix AI 12 days ago

To fix the problem, we need to ensure that the socket is not bound to all network interfaces. Instead, we should bind it to a specific interface. This can be achieved by modifying the create_server_socket function to check whether the provided address is empty or 0.0.0.0 and, if so, replace it with a specific interface address.

  1. Modify the create_server_socket function to check if the address is empty or 0.0.0.0.
  2. If the address is empty or 0.0.0.0, replace it with a specific interface address (e.g., 127.0.0.1 for localhost).
  3. Update the sock.bind(addr) call to use the modified address.
Suggested changeset 1: vllm/entrypoints/openai/api_server.py (Autofix patch)
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py
--- a/vllm/entrypoints/openai/api_server.py
+++ b/vllm/entrypoints/openai/api_server.py
@@ -759,2 +759,6 @@
 
+    # Bind to a specific interface if the address is empty or 0.0.0.0
+    if addr[0] in ("", "0.0.0.0"):
+        addr = ("127.0.0.1", addr[1])
+
     sock = socket.socket(family=family, type=socket.SOCK_STREAM)
EOF
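Worth noting: an OpenAI-compatible API server often binds 0.0.0.0 deliberately, so unconditionally rewriting the address to 127.0.0.1 as the autofix does would break remote clients. A hedged alternative sketch (the VLLM_LOOPBACK_ONLY environment variable is hypothetical, introduced here only for illustration, and the function below is a simplified stand-in for the real one):

```python
import os
import socket

def create_server_socket(addr: tuple[str, int]) -> socket.socket:
    # Illustrative only: restrict to loopback when the operator opts in,
    # since binding all interfaces is often intentional for an API server.
    if os.environ.get("VLLM_LOOPBACK_ONLY") and addr[0] in ("", "0.0.0.0"):
        addr = ("127.0.0.1", addr[1])
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(addr)
    return sock
```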
# Llama3.2 models more reliable.

TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]",

Check failure · Code scanning / CodeQL · Inefficient regular expression (High), reported nine times against this pattern

Parts of the regular expression may cause exponential backtracking on strings starting with:
  • '[' and containing many repetitions of 'AA(),'
  • '[' and containing many repetitions of 'AA()'
  • '[A(' and containing many repetitions of 'AA=,'
  • '[A(' and containing many repetitions of 'AA= ),A('
  • '[A(' and containing many repetitions of 'AA=)A('
  • '[A(A=' and containing many repetitions of ',A='
  • '[A(A=' and containing many repetitions of ')A(A='
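The backtracking is easy to reproduce locally. Below is a minimal, self-contained timing sketch (the adversarial input is taken from the first alert message; the repetition counts are deliberately small so the demo terminates quickly):

```python
import re
import time

# The flagged pattern, reproduced verbatim from the diff above.
TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*"
    r"([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]")

# '[' followed by repetitions of 'AA(),' and no closing ']': the match must
# fail, and the engine backtracks over the many ways each 'AA' can be split
# between [a-zA-Z]+ and \w*.
for n in (12, 14, 16, 18, 20):
    attack = "[" + "AA()," * n
    start = time.perf_counter()
    TOOL_CALL_REGEX.match(attack)  # returns None, but only after backtracking
    print(f"n={n:2d}  {time.perf_counter() - start:.3f}s")  # grows rapidly with n
```

One common remediation, not the fix applied in this PR, is to replace the unanchored `.*` inside the repeated groups with a character class that cannot cross the delimiters (for example `[^,()]*`), which removes the ambiguity the engine has to backtrack over.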
                return resp

            except Exception as e:
                return web.Response(text=f"Error: {str(e)}", status=500)

Check warning · Code scanning / CodeQL · Information exposure through an exception (Medium, test)

Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix AI about 2 months ago

To fix the problem, we need to ensure that detailed exception messages are not exposed to the end user. Instead, we should log the detailed error message on the server and return a generic error message to the user:

  1. Import the logging module to enable logging of exceptions.
  2. Configure the logging settings if not already configured.
  3. Modify the exception handling block to log the exception and return a generic error message.
Suggested changeset 1: benchmarks/disagg_benchmarks/round_robin_proxy.py (Autofix patch)
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/benchmarks/disagg_benchmarks/round_robin_proxy.py b/benchmarks/disagg_benchmarks/round_robin_proxy.py
--- a/benchmarks/disagg_benchmarks/round_robin_proxy.py
+++ b/benchmarks/disagg_benchmarks/round_robin_proxy.py
@@ -2,3 +2,3 @@
 import itertools
-
+import logging
 import aiohttp
@@ -6,2 +6,3 @@
 
+logging.basicConfig(level=logging.ERROR)
 
@@ -39,3 +40,4 @@
             except Exception as e:
-                return web.Response(text=f"Error: {str(e)}", status=500)
+                logging.error("An error occurred while handling the request", exc_info=True)
+                return web.Response(text="An internal error has occurred!", status=500)
 
EOF
mgoin and others added 20 commits January 3, 2025 22:36
divakar-amd and others added 30 commits January 17, 2025 14:49
Multimodality fix for llava after rebase

Fix for:
```
ERROR 12-16 12:31:11 engine.py:136] NotImplementedError: Unknown multi-modal data type: attention_mask
```
This PR updates `test/lora/utils.py` based on the latest rebase.
1. This PR updates the habana_main README_GAUDI to the Technical Writer-reviewed version shipped in v1.19.0 (the habana_main and v1.19.0 README_GAUDI files had diverged).
2. It also fixes URLs broken by the recent restructuring of the upstream vllm examples folder.
3. It adds notes in the examples folder for new users, redirecting them to the Gaudi-specific examples in README_GAUDI.md.
Change vllm-hpu-extension revision to ae726d4
Changes the sampler used by dummy sequences to greedy if any sequence is using it, which prevents sampler recompilations (see the sketch below).
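A simplified sketch of the idea (the class and helper below are illustrative stand-ins, not vLLM's actual API):

```python
from dataclasses import dataclass

# Dummy (padding) sequences adopt greedy sampling whenever any real sequence
# is greedy, so the HPU sampler graph is compiled for one variant instead of
# being recompiled for a mixed greedy/random batch.
@dataclass
class SamplingParams:
    temperature: float = 1.0  # 0.0 is treated as greedy sampling

def params_for_dummy(real_params: list[SamplingParams]) -> SamplingParams:
    if any(p.temperature == 0.0 for p in real_params):
        return SamplingParams(temperature=0.0)  # follow the greedy path
    return SamplingParams()  # default random sampling
```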
- Resolves an issue caused by the release of triton v3.2.0 (January 23rd, 2025). This is a workaround; a proper fix to support triton v3.2.0 may be required.

The error raised when triton v3.2.0 is used is shown below.

```bash
Traceback (most recent call last):
  File "/workspace/vllm/test_evaluation.py", line 15, in <module>
    from vllm import LLM, SamplingParams
  File "/workspace/vllm/vllm/__init__.py", line 7, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "/workspace/vllm/vllm/engine/arg_utils.py", line 11, in <module>
    from vllm.config import (CacheConfig, ConfigFormat, DecodingConfig,
  File "/workspace/vllm/vllm/config.py", line 16, in <module>
    from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
  File "/workspace/vllm/vllm/model_executor/layers/quantization/__init__.py", line 6, in <module>
    from vllm.model_executor.layers.quantization.awq_marlin import AWQMarlinConfig
  File "/workspace/vllm/vllm/model_executor/layers/quantization/awq_marlin.py", line 6, in <module>
    import vllm.model_executor.layers.fused_moe  # noqa
  File "/workspace/vllm/vllm/model_executor/layers/fused_moe/__init__.py", line 34, in <module>
    import vllm.model_executor.layers.fused_moe.fused_marlin_moe  # noqa
  File "/workspace/vllm/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 8, in <module>
    from vllm.model_executor.layers.fused_moe.fused_moe import (
  File "/workspace/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 18, in <module>
    from vllm_hpu_extension.ops import scaled_fp8_quant
  File "/usr/local/lib/python3.10/dist-packages/vllm_hpu_extension/ops.py", line 9, in <module>
    import habana_frameworks.torch as htorch
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/__init__.py", line 54, in <module>
    import habana_frameworks.torch.core
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/__init__.py", line 114, in <module>
    import_compilers()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/dynamo/compile_backend/backends.py", line 39, in import_compilers
    from .compilers import hpu_inference_compiler, hpu_training_compiler_bw, hpu_training_compiler_fw
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/dynamo/compile_backend/compilers.py", line 27, in <module>
    from .freezing_passes import freeze
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/dynamo/compile_backend/freezing_passes.py", line 28, in <module>
    from torch._inductor.freezing import discard_traced_gm_params, invalidate_eager_modules, replace_params_with_constants
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/freezing.py", line 15, in <module>
    from torch._inductor.fx_passes.freezing_patterns import freezing_passes
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/fx_passes/freezing_patterns.py", line 5, in <module>
    from torch._inductor.compile_fx import fake_tensor_prop
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 49, in <module>
    from torch._inductor.debug import save_args_for_compile_fx_inner
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/debug.py", line 26, in <module>
    from . import config, ir  # noqa: F811, this is needed
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/ir.py", line 77, in <module>
    from .runtime.hints import ReductionHint
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/runtime/hints.py", line 36, in <module>
    attr_desc_fields = {f.name for f in fields(AttrsDescriptor)}
  File "/usr/lib/python3.10/dataclasses.py", line 1198, in fields
    raise TypeError('must be called with a dataclass type or instance') from None
TypeError: must be called with a dataclass type or instance
```

Signed-off-by: Voas, Tanner <[email protected]>
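Until a proper fix lands, one way to fail fast with an actionable message instead of the opaque TypeError above (a sketch only; the >= 3.2.0 bound is taken from this report, and the check is not part of the PR):

```python
import importlib.metadata

def check_triton_version() -> None:
    # Assumption from the report above: triton >= 3.2.0 breaks the
    # habana_frameworks -> torch._inductor import chain.
    version = importlib.metadata.version("triton")
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (3, 2):
        raise RuntimeError(
            f"triton {version} detected; habana_frameworks' inductor import "
            "path fails with triton >= 3.2.0. Pin 'triton<3.2.0' until a "
            "proper fix lands.")
```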
…730)

Currently we get a hang at the end of the script when using TP>1 and multi-step scheduling. This is caused by the driver worker not notifying the remaining workers that the execution loop has ended. This PR works around the issue by making sure that all workers are notified at the end of the `llm_engine` loop.
Another possible workaround would be to extend this check:
https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/engine/llm_engine.py#L1379
with `or not self.has_unfinished_requests()`. A generic sketch of the stop-notification pattern follows.
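This sketch is illustrative only, using plain torch.distributed primitives rather than vLLM's actual worker plumbing:

```python
import torch.distributed as dist

STOP = None  # sentinel broadcast by the driver to end the loop

def driver_shutdown(group=None):
    # Driver (rank 0) tells every TP worker that the engine loop is over.
    dist.broadcast_object_list([STOP], src=0, group=group)

def worker_loop(group=None):
    while True:
        payload = [None]
        # Non-driver workers block here; without the final broadcast above
        # they would wait forever, which is the hang described in this PR.
        dist.broadcast_object_list(payload, src=0, group=group)
        if payload[0] is STOP:
            break
        # ... execute one model step using payload[0] ...
```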
This PR enables multi-step scheduling for encoder-decoder models.
This is required for running already-quantized models on HPU using the fp8 quantization method (rather than "inc").
Labels: habana (Issues or PRs submitted by Habana Labs)