[BUG] deepspeed inference for llama3.1 70b for 2 node, each node with 2 gpu #6805

Open
rastinrastinii opened this issue Nov 28, 2024 · 0 comments
Labels
bug (Something isn't working), inference

Comments

rastinrastinii commented Nov 28, 2024

Describe the bug
Hi,
I am running DeepSpeed inference for Llama 3.1 70B on 2 nodes, each node with 2 GPUs and each GPU with 24 GB of VRAM.
Loading is slow on node 1 but fast on node 2, which then goes OOM. What is the problem?

To Reproduce

Run the code below with the following command on node one only.

WORLD_SIZE=2 NCCL_SOCKET_IFNAME=enp0s31f6,eno49 TP_SOCKET_IFNAME=enp0s31f6,eno49 GLOO_SOCKET_IFNAME=enp0s31f6,eno49 NCCL_P2P_DISABLE=1 deepspeed --num_gpus 2 --num_nodes 2 --node_rank 0 --master_addr 172.16.22.61 --master_port 29123 --hostfile=hostfile.txt inference_test.py
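
For reference, the hostfile.txt passed to the launcher above is assumed to use the standard DeepSpeed slots format, one line per node; the second address below is a placeholder for node 2:

172.16.22.61 slots=2
<node-2-ip-or-hostname> slots=2

The contents of inference_test.py: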
import torch.multiprocessing as mp

mp.set_start_method("spawn", force=True)
import deepspeed
from fastapi import FastAPI
from pydantic import BaseModel
import torch
import argparse
import os
import time
from transformers.integrations.deepspeed import HfDeepSpeedConfig
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import datetime
import torch.distributed as dist
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "2"))
# Configuration
MODEL_NAME = "meta-llama/Llama-3.1-70B-Instruct"
DEEPSPEED_CONFIG = "ds_config.json"

print(f'local_rank: {local_rank}, world_size: {world_size}')
def run_deepspeed_inference():
    # Load the model on meta tensors
    print(f"##############\n\nrun_deepspeed_inference\n\n###########")
    nf_config = BitsAndBytesConfig(
        load_in_8bit=True,
    )
    config = AutoConfig.from_pretrained(MODEL_NAME, quantization_config=nf_config)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    with deepspeed.OnDevice(dtype=torch.float16, device="meta", enabled=True):
        model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

    # Define the checkpoint dict. You may need to convert *.safetensors to
    # *.bin for this to work. Make sure you include all of the *.bin and *.pt
    # files in the checkpoint_files list.
    checkpoint_dir = "/home/mshahsavari/.cache/huggingface/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/945c8663693130f8be2ee66210e062158b2a9693"
    checkpoint_files = [
        os.path.join(checkpoint_dir, f"model-{i:05d}-of-00030.safetensors")
        for i in range(1, 31)
    ]
    checkpoint_dict = {
        "type": "DS_MODEL",
        "checkpoints": checkpoint_files,
        "version": 1.0,
    }

    # Initialize DeepSpeed
    # deepspeed.init_distributed(dist_backend='nccl', rank=0, world_size=2)
    print(f"############## \n\n\n\n deepspeed.init_inference \n\n\n\n ##############")
    model = deepspeed.init_inference(
        model,
        # replace_with_kernel_inject=False,
        # mp_size=world_size,
        dtype=torch.float16,
        checkpoint=checkpoint_dict,
        tensor_parallel={
            "enabled": True,
            "tp_size": world_size,
        },
        # replace_method="auto",
        # replace_with_kernel_inject=True,
    )

    # Run inference
    start_time = time.time()
    inputs = tokenizer.encode("DeepSpeed is", return_tensors="pt").to(
        f"cuda:{local_rank}"
    )
    outputs = model.generate(inputs, max_new_tokens=20)
    output_str = tokenizer.decode(outputs[0])
    end_time = time.time()
    print("DeepSpeed-inference time:", end_time - start_time)

    return tokenizer, model


if __name__ == "__main__":
    # tokenizer, model = run_zero_inference()
    tokenizer, model = run_deepspeed_inference()

    # Load FastAPI
    app = FastAPI()

    # API Request schema
    class InferenceRequest(BaseModel):
        prompt: str
        max_length: int = 100

    # Text generation endpoint
    @app.post("/generate")
    async def generate(request: InferenceRequest):
        # Run inference
        start_time = time.time()
        inputs = tokenizer.encode(request.prompt, return_tensors="pt").to(
            f"cuda:{local_rank}"
        )
        outputs = model.generate(inputs, max_new_tokens=20)
        output_str = tokenizer.decode(outputs[0])
        end_time = time.time()
        print("DeepSpeed-inference time:", end_time - start_time)
        return {"generated_text": output_str}


    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
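
Once the server is up, the /generate endpoint defined above can be exercised with a minimal client such as the sketch below (host and port taken from the uvicorn.run call in the script; the prompt is illustrative):

import requests

# Hypothetical client for the /generate endpoint started by uvicorn above;
# host/port come from the uvicorn.run call, the prompt is arbitrary.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "DeepSpeed is", "max_length": 20},
)
print(resp.json()["generated_text"])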

Expected behavior
I expect the full Llama 3.1 70B model to be loaded in a distributed fashion across the VRAM of both nodes and inference to run.

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
evoformer_attn ......... [NO] ....... [OKAY]
 [WARNING]  FP Quantizer is using an untested triton version (3.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
gds .................... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
 [WARNING]  using untested triton version (3.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/mshahsavari/.pyenv/versions/3.11.10_venv/lib/python3.11/site-packages/torch']
torch version .................... 2.5.1+cu124
deepspeed install path ........... ['/home/mshahsavari/.pyenv/versions/3.11.10_venv/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.15.4, unknown, unknown
torch cuda version ............... 12.4
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 31.25 GB

System info (please complete the following information):

  • OS: Ubuntu 24.04.1 LTS
  • GPU: node 1: 2x RTX 3090; node 2: 2x RTX 4090
  • deepspeed 0.15.4, deepspeed-mii 0.3.1
  • transformers 4.46.3, accelerate 1.1.0 (1.0.1 on node 2)
  • Python 3.11.10