[BUG] deepspeed inference for llama3.1 70b for 2 node, each node with 2 gpu #6805

Open
rastinrastinii opened this issue Nov 28, 2024 · 0 comments
Labels
bug (Something isn't working), inference

Comments

rastinrastinii commented Nov 28, 2024

Describe the bug
Hi,
I am running DeepSpeed inference for Llama 3.1 70B on 2 nodes, each node with 2 GPUs and each GPU with 24 GB of VRAM.
Loading is slow on node 1 but fast on node 2, which then goes OOM. What is the problem?

To Reproduce

Run the code below with the following command on node one only.

WORLD_SIZE=2 NCCL_SOCKET_IFNAME=enp0s31f6,eno49 TP_SOCKET_IFNAME=enp0s31f6,eno49 GLOO_SOCKET_IFNAME=enp0s31f6,eno49 NCCL_P2P_DISABLE=1 deepspeed --num_gpus 2 --num_nodes 2 --node_rank 0 --master_addr 172.16.22.61 --master_port 29123 --hostfile=hostfile.txt inference_test.py
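
For reference, the hostfile.txt passed to the launcher above is assumed to use the standard DeepSpeed slots format, one line per node; the second address below is a placeholder for node 2:

172.16.22.61 slots=2
<node-2-ip-or-hostname> slots=2

The contents of inference_test.py: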
import torch.multiprocessing as mp

mp.set_start_method("spawn", force=True)
import deepspeed
from fastapi import FastAPI
from pydantic import BaseModel
import torch
import argparse
import os
import time
from transformers.integrations.deepspeed import HfDeepSpeedConfig
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import datetime
import torch.distributed as dist
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "2"))
# Configuration
MODEL_NAME = "meta-llama/Llama-3.1-70B-Instruct"
DEEPSPEED_CONFIG = "ds_config.json"

print(f'local_rank: {local_rank}, world_size: {world_size}')
def run_deepspeed_inference():
    # Load the model on meta tensors
    print(f"##############\n\nrun_deepspeed_inference\n\n###########")
    nf_config = BitsAndBytesConfig(
        load_in_8bit=True,
    )
    config = AutoConfig.from_pretrained(MODEL_NAME, quantization_config=nf_config)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    with deepspeed.OnDevice(dtype=torch.float16, device="meta", enabled=True):
        model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

    # Define the checkpoint dict. You may need to convert *.safetensors to
    # *.bin for this to work. Make sure you include all of the *.bin and *.pt
    # files in the checkpoint_files list.
    checkpoint_dir = "/home/mshahsavari/.cache/huggingface/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/945c8663693130f8be2ee66210e062158b2a9693"
    checkpoint_files = [
        os.path.join(checkpoint_dir, f"model-{i:05d}-of-00030.safetensors")
        for i in range(1, 31)
    ]
    checkpoint_dict = {
        "type": "DS_MODEL",
        "checkpoints": checkpoint_files,
        "version": 1.0,
    }

    # Initialize DeepSpeed
    # deepspeed.init_distributed(dist_backend='nccl', rank=0, world_size=2)
    print(f"############## \n\n\n\n deepspeed.init_inference \n\n\n\n ##############")
    model = deepspeed.init_inference(
        model,
        # replace_with_kernel_inject=False,
        # mp_size=world_size,
        dtype=torch.float16,
        checkpoint=checkpoint_dict,
        tensor_parallel={
            "enabled": True,
            "tp_size": world_size,
        },
        # replace_method="auto",
        # replace_with_kernel_inject=True,
    )

    # Run inference
    start_time = time.time()
    inputs = tokenizer.encode("DeepSpeed is", return_tensors="pt").to(
        f"cuda:{local_rank}"
    )
    outputs = model.generate(inputs, max_new_tokens=20)
    output_str = tokenizer.decode(outputs[0])
    end_time = time.time()
    print("DeepSpeed-inference time:", end_time - start_time)

    return tokenizer, model


if __name__ == "__main__":
    # tokenizer, model = run_zero_inference()
    tokenizer, model = run_deepspeed_inference()

    # Load FastAPI
    app = FastAPI()

    # API Request schema
    class InferenceRequest(BaseModel):
        prompt: str
        max_length: int = 100

    # Text generation endpoint
    @app.post("/generate")
    async def generate(request: InferenceRequest):
        # Run inference
        start_time = time.time()
        inputs = tokenizer.encode(request.prompt, return_tensors="pt").to(
            f"cuda:{local_rank}"
        )
        outputs = model.generate(inputs, max_new_tokens=20)
        output_str = tokenizer.decode(outputs[0])
        end_time = time.time()
        print("DeepSpeed-inference time:", end_time - start_time)
        return {"generated_text": output_str}


    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
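
Once the server is up, the /generate endpoint defined above can be exercised with a minimal client such as the sketch below (host and port taken from the uvicorn.run call in the script; the prompt is illustrative):

import requests

# Hypothetical client for the /generate endpoint started by uvicorn above;
# host/port come from the uvicorn.run call, the prompt is arbitrary.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "DeepSpeed is", "max_length": 20},
)
print(resp.json()["generated_text"])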

Expected behavior
I expect the full Llama 3.1 70B model to be loaded in a distributed fashion across the VRAM of both nodes and inference to run.

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
evoformer_attn ......... [NO] ....... [OKAY]
 [WARNING]  FP Quantizer is using an untested triton version (3.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
gds .................... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
 [WARNING]  using untested triton version (3.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/mshahsavari/.pyenv/versions/3.11.10_venv/lib/python3.11/site-packages/torch']
torch version .................... 2.5.1+cu124
deepspeed install path ........... ['/home/mshahsavari/.pyenv/versions/3.11.10_venv/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.15.4, unknown, unknown
torch cuda version ............... 12.4
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 31.25 GB

System info (please complete the following information):

  • OS: Ubuntu 24.04.1 LTS
  • GPU: node 1: 2x RTX 3090; node 2: 2x RTX 4090
  • deepspeed 0.15.4, deepspeed-mii 0.3.1
  • transformers 4.46.3, accelerate 1.1.0 (1.0.1 on node 2)
  • Python 3.11.10