TGI with IPEX fails on PVC 1100 on 8 gpus #737

Open
rahulunair opened this issue Nov 27, 2024 · 2 comments

@rahulunair

rahulunair commented Nov 27, 2024

Describe the bug

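I am using TGI with IPEX to deploy the facebook/opt-30b model on 8 PVC 1100 GPUs. The model loads and the shards report ready, but a generate request then fails in Prefill (error at the end of this report).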

Here is the command I am using:

docker run --privileged \
  --device=/dev/dri:/dev/dri \
  --shm-size=64g \
  --env ZE_AFFINITY_MASK=0,1,2,3,4,5,6,7 \
  --env SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 \
  --env HF_HOME=/data \
  --env ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 \
  --env CCL_ZE_IPC_EXCHANGE=sockets \
  -v $HOME/.cache/huggingface:/data \
  -p 8000:80 \
  ghcr.io/huggingface/text-generation-inference:latest-intel-xpu \
  --model-id facebook/opt-30b \
  --num-shard 8 \
  --max-total-tokens 4096 \
  --max-input-length 2048 \
  --dtype bfloat16 \
  --cuda-graphs 0 \
  --json-output
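
A quick pre-flight check on the flags above can rule out simple misconfiguration before digging into IPEX itself. This is only a sketch: the helper name `check_launch_config` is hypothetical, and it assumes each shard needs its own entry in `ZE_AFFINITY_MASK` and that the input length has to stay below the total token budget.

```python
def check_launch_config(affinity_mask, num_shard, max_input_length, max_total_tokens):
    """Return a list of human-readable problems; an empty list means nothing obvious."""
    problems = []
    # ZE_AFFINITY_MASK is a comma-separated list of device entries.
    devices = [d.strip() for d in affinity_mask.split(",") if d.strip()]
    if len(set(devices)) != len(devices):
        problems.append("ZE_AFFINITY_MASK lists a device more than once")
    # Assumption: one visible device per shard.
    if len(devices) < num_shard:
        problems.append(
            f"{num_shard} shards requested but only {len(devices)} devices visible"
        )
    # Assumption: the prompt must leave room for at least one generated token.
    if max_input_length >= max_total_tokens:
        problems.append("--max-input-length must be smaller than --max-total-tokens")
    return problems


# Values copied from the docker command above.
print(check_launch_config("0,1,2,3,4,5,6,7", 8, 2048, 4096))
```

For the command in this report the list comes back empty, which suggests the failure is not a simple flag mismatch.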

Output:

{"timestamp":"2024-11-27T20:58:46.977471Z","level":"INFO","fields":{"message":"Args {\n    model_id: \"facebook/opt-30b\",\n    revision: None,\n    validation_workers: 2,\n    sharded: None,\n    num_s
{"timestamp":"2024-11-27T20:58:46.977550Z","level":"INFO","fields":{"message":"Token file not found \"/data/token\"","log.target":"hf_hub","log.module_path":"hf_hub","log.file":"/usr/local/cargo/registr
{"timestamp":"2024-11-27T20:58:48.039397Z","level":"WARN","fields":{"message":"Cannot determine GPU compute capability: AssertionError: Torch not compiled with CUDA enabled"},"target":"text_generation_l
{"timestamp":"2024-11-27T20:58:48.039429Z","level":"INFO","fields":{"message":"Using attention paged - Prefix caching 0"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:58:48.039436Z","level":"INFO","fields":{"message":"Default `max_batch_prefill_tokens` to 2048"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:58:48.039440Z","level":"INFO","fields":{"message":"Sharding model on 8 processes"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:58:48.039562Z","level":"INFO","fields":{"message":"Starting check and download process for facebook/opt-30b"},"target":"text_generation_launcher","span":{"name":"download"},"
{"timestamp":"2024-11-27T20:58:56.095381Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download."},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:58:56.951464Z","level":"INFO","fields":{"message":"Successfully downloaded weights for facebook/opt-30b"},"target":"text_generation_launcher","span":{"name":"download"},"span
{"timestamp":"2024-11-27T20:58:56.951756Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"sh
{"timestamp":"2024-11-27T20:58:56.951757Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"sh
{"timestamp":"2024-11-27T20:58:56.951790Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"sh
{"timestamp":"2024-11-27T20:58:56.951818Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"sh
{"timestamp":"2024-11-27T20:58:56.954604Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":4,"name":"shard-manager"},"spans":[{"rank":4,"name":"sh
{"timestamp":"2024-11-27T20:58:56.957239Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":5,"name":"shard-manager"},"spans":[{"rank":5,"name":"sh
{"timestamp":"2024-11-27T20:58:56.958959Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":6,"name":"shard-manager"},"spans":[{"rank":6,"name":"sh
{"timestamp":"2024-11-27T20:58:56.960699Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":7,"name":"shard-manager"},"spans":[{"rank":7,"name":"sh
{"timestamp":"2024-11-27T20:59:01.797852Z","level":"INFO","fields":{"message":"Using prefix caching = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:01.797879Z","level":"INFO","fields":{"message":"Using Attention = paged"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:01.913329Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named 'mamba_ssm'"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:06.977742Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:06.979916Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":5,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:06.979964Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:06.984437Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":6,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:06.986784Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:06.987920Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:06.988679Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":4,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:06.989455Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":7,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:17.028043Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:17.028140Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:17.035122Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":4,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:17.035619Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:17.038192Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:17.043754Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":7,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:17.044176Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":5,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:17.046020Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":6,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.043862Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.057718Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":4,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.061494Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.062854Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.086525Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.092459Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":6,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.093298Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":5,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.095730Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":7,"name":"shard-manager"},"spans":[{"
{"timestamp":"2024-11-27T20:59:27.461531Z","level":"INFO","fields":{"message":"Using experimental prefill chunking = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.797806Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-7"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.798218Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-3"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.801258Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-1"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.802590Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-0"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.803582Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-4"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.811464Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-6"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.811881Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-2"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.826166Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-5"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.844393Z","level":"INFO","fields":{"message":"Shard ready in 30.880082279s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank
{"timestamp":"2024-11-27T20:59:27.858213Z","level":"INFO","fields":{"message":"Shard ready in 30.890063753s"},"target":"text_generation_launcher","span":{"rank":4,"name":"shard-manager"},"spans":[{"rank
{"timestamp":"2024-11-27T20:59:27.862034Z","level":"INFO","fields":{"message":"Shard ready in 30.897723226s"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank
{"timestamp":"2024-11-27T20:59:27.863362Z","level":"INFO","fields":{"message":"Shard ready in 30.895203452s"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank
{"timestamp":"2024-11-27T20:59:27.887029Z","level":"INFO","fields":{"message":"Shard ready in 30.920917242s"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank
{"timestamp":"2024-11-27T20:59:27.892963Z","level":"INFO","fields":{"message":"Shard ready in 30.922683334s"},"target":"text_generation_launcher","span":{"rank":6,"name":"shard-manager"},"spans":[{"rank
{"timestamp":"2024-11-27T20:59:27.893811Z","level":"INFO","fields":{"message":"Shard ready in 30.925779667s"},"target":"text_generation_launcher","span":{"rank":5,"name":"shard-manager"},"spans":[{"rank
{"timestamp":"2024-11-27T20:59:27.896269Z","level":"INFO","fields":{"message":"Shard ready in 30.927411147s"},"target":"text_generation_launcher","span":{"rank":7,"name":"shard-manager"},"spans":[{"rank
{"timestamp":"2024-11-27T20:59:27.956651Z","level":"INFO","fields":{"message":"Starting Webserver"},"target":"text_generation_launcher"}
{"timestamp":"2024-11-27T20:59:27.982808Z","level":"INFO","message":"Warming up model","target":"text_generation_router_v3","filename":"backends/v3/src/lib.rs","line_number":125}
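
Since the launcher was started with `--json-output`, the log can be filtered by level instead of read by eye. The sketch below is a minimal example and not part of TGI; `summarize_log` is a hypothetical helper. It relies only on what the lines above show: launcher records nest the text under `fields.message`, the router record carries a top-level `message`, and some pasted lines are cut off and will not parse.

```python
import json


def summarize_log(lines):
    """Count records per level and collect WARN/ERROR messages.

    Lines that are not complete JSON objects are counted as 'unparsed'.
    """
    counts = {}
    problems = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            counts["unparsed"] = counts.get("unparsed", 0) + 1
            continue
        level = record.get("level", "UNKNOWN")
        counts[level] = counts.get(level, 0) + 1
        if level in ("WARN", "ERROR"):
            # Launcher records use fields.message; router records use message.
            message = record.get("fields", {}).get("message") or record.get("message", "")
            problems.append((record.get("timestamp"), level, message))
    return counts, problems


# Three lines copied from the output above; the second one is truncated.
sample = [
    '{"timestamp":"2024-11-27T20:58:48.039440Z","level":"INFO","fields":{"message":"Sharding model on 8 processes"},"target":"text_generation_launcher"}',
    '{"timestamp":"2024-11-27T20:59:06.977742Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"',
    '{"timestamp":"2024-11-27T20:59:01.913329Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named \'mamba_ssm\'"},"target":"text_generation_launcher"}',
]
counts, problems = summarize_log(sample)
print(counts)
for timestamp, level, message in problems:
    print(timestamp, level, message)
```

Redirecting the container output to a file and passing `open(path)` to `summarize_log` would give the same summary for a full run.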

Here is the xpu-smi log:

21:01:58.000,    0, 10029.14
21:01:58.000,    1, 9491.29
21:01:58.000,    2, 9491.29
21:01:58.000,    3, 9491.37
21:01:58.000,    4, 9554.49
21:01:58.000,    5, 9554.50
21:01:58.000,    6, 9554.48
21:01:58.000,    7, 9554.45
21:01:59.000,    0, 10029.14
21:01:59.000,    1, 9491.29
21:01:59.000,    2, 9491.29
21:01:59.000,    3, 9491.37
21:01:59.000,    4, 9554.49
21:01:59.000,    5, 9554.50
21:01:59.000,    6, 9554.48
21:01:59.000,    7, 9554.45
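
The dump above has no header, so the sketch below treats it as three columns (timestamp, device index, one numeric metric) and reports the minimum and maximum of that metric per device. The column meaning is an assumption, and `per_device_range` is a hypothetical helper rather than an xpu-smi feature. A flat range across samples, as seen here, suggests the devices were idle on that metric while the request was outstanding.

```python
from collections import defaultdict

# Rows copied from the xpu-smi log above.
SAMPLE = """\
21:01:58.000,    0, 10029.14
21:01:58.000,    1, 9491.29
21:01:58.000,    2, 9491.29
21:01:58.000,    3, 9491.37
21:01:58.000,    4, 9554.49
21:01:58.000,    5, 9554.50
21:01:58.000,    6, 9554.48
21:01:58.000,    7, 9554.45
21:01:59.000,    0, 10029.14
21:01:59.000,    1, 9491.29
21:01:59.000,    2, 9491.29
21:01:59.000,    3, 9491.37
21:01:59.000,    4, 9554.49
21:01:59.000,    5, 9554.50
21:01:59.000,    6, 9554.48
21:01:59.000,    7, 9554.45
"""


def per_device_range(dump_text):
    """Map device index -> (min, max) of the third column across all samples."""
    values = defaultdict(list)
    for line in dump_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        try:
            device, value = int(parts[1]), float(parts[2])
        except ValueError:
            continue  # header or malformed row
        values[device].append(value)
    return {device: (min(v), max(v)) for device, v in sorted(values.items())}


for device, (low, high) in per_device_range(SAMPLE).items():
    status = "flat" if low == high else "changing"
    print(f"device {device}: min={low} max={high} ({status})")
```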

Versions

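Latest version of TGI (the ghcr.io/huggingface/text-generation-inference:latest-intel-xpu image used in the command above)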

When I send a generate request to the TGI server with curl, I get this error:

{"timestamp":"2024-11-27T21:02:41.401766Z","level":"ERROR","fields":{"message":"Method Prefill encountered an error.\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\"
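
The exact curl command is not shown in the report. As a rough reproduction aid, the sketch below builds an equivalent request in Python against TGI's `/generate` route. The host and port follow the `-p 8000:80` mapping from the docker command; the prompt, the `max_new_tokens` value, and both helper names are assumptions made for illustration.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text from the JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["generated_text"]
```

Calling `generate("What is deep learning?")` against the container started above should trigger the same Prefill path that fails in this report.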
@sywangyi
Contributor

sywangyi commented Dec 11, 2024

This looks related to IPEX or torch-ccl.
Inside the Docker container, replace IPEX 2.5 with 2.3 by running:
pip install torch==2.3.1+cxx11.abi torchvision==0.18.1+cxx11.abi torchaudio==2.3.1+cxx11.abi intel-extension-for-pytorch==2.3.110+xpu oneccl_bind_pt==2.3.100+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ --no-cache-dir
With that change, opt-30b on 8 cards can work.
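
After the downgrade it may help to confirm that the container really picked up the pinned versions before retrying. The sketch below compares installed package versions with the pins from the pip command; `find_mismatches` is a hypothetical helper, and it assumes the installed metadata reports the same version strings (including the `+xpu` and `+cxx11.abi` suffixes) that were passed to pip.

```python
from importlib import metadata

# Pins copied from the pip install command above.
EXPECTED = {
    "torch": "2.3.1+cxx11.abi",
    "torchvision": "0.18.1+cxx11.abi",
    "torchaudio": "2.3.1+cxx11.abi",
    "intel-extension-for-pytorch": "2.3.110+xpu",
    "oneccl_bind_pt": "2.3.100+xpu",
}


def find_mismatches(expected, lookup=metadata.version):
    """Return {package: (wanted, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Run inside the container, an empty result means all five packages match the 2.3 pins.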

@sywangyi
Contributor

The core dump is caused by some torch ops returning incorrect values in the distributed environment. I do not see the issue with version 2.3.
