
How to configure accelerate on 2 Mac machines #3356

Open
hsoftxl opened this issue Jan 20, 2025 · 4 comments

Comments

@hsoftxl

hsoftxl commented Jan 20, 2025

https://huggingface.co/docs/accelerate/usage_guides/distributed_inference

I used accelerate config, and when I run the model it hangs and then fails with an error saying it cannot connect to the IP and port.
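
For context, the multi-machine section of the accelerate config file (default_config.yaml) typically contains fields like the following — a sketch with placeholder values; the exact fields depend on your accelerate version:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU    # on Macs, multi-node setups typically fall back to the CPU/gloo backend
num_machines: 2
machine_rank: 0                # 0 on the head node, 1 on the other machine
main_process_ip: 192.168.1.10  # placeholder: the head node's IP, reachable from both machines
main_process_port: 29500
num_processes: 2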

Can anyone help me?

@BenjaminBossan
Member

With so little information, we cannot figure out the issue. Please follow the instructions for reporting bugs and provide the missing information.

@hsoftxl
Author

hsoftxl commented Jan 21, 2025

System Info

- macOS Sequoia 15.2
- Mac Studio M2 Ultra, 192 GB RAM

Information
I have 4 Mac computers, each with 192GB of memory. I want to use these 4 Macs to run the Falcon 180B model. I configured distributed training using accelerate config, but when I run the script, each machine always loads all the layers of the model. How can I load different parts of the model on different machines?

My script

from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time

accelerator = Accelerator()

# 10*10 prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
prompts_all = [
    "The King is dead. Long live the Queen.",
    "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
    "The story so far: in the beginning, the universe was created.",
    "It was a bright cold day in April, and the clocks were striking thirteen.",
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    "The sweat wis lashing oafay Sick Boy; he wis trembling.",
    "124 was spiteful. Full of Baby's venom.",
    "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
    "I write this sitting in the kitchen sink.",
    "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
] * 10

# load a base model and tokenizer
model_path = "tiiuae/falcon-180B-chat"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map={"": accelerator.process_index},
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# sync all processes and start the timer
accelerator.wait_for_everyone()
start = time.time()

# divide the prompt list among the available processes
with accelerator.split_between_processes(prompts_all) as prompts:
    # store output of generations in a dict
    results = dict(outputs=[], num_tokens=0)

    # have each process run inference, prompt by prompt
    for prompt in prompts:
        prompt_tokenized = tokenizer(prompt, return_tensors="pt").to("mps")
        output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

        # remove the prompt from the output
        output_tokenized = output_tokenized[len(prompt_tokenized["input_ids"][0]):]

        # store outputs and number of tokens in results
        results["outputs"].append(tokenizer.decode(output_tokenized))
        results["num_tokens"] += len(output_tokenized)

    results = [results]  # wrap in a list, otherwise gather_object() will not collect correctly

# collect results from all processes
results_gathered = gather_object(results)

if accelerator.is_main_process:
    timediff = time.time() - start
    num_tokens = sum(r["num_tokens"] for r in results_gathered)

    print(f"tokens/sec: {num_tokens // timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")

Tasks

accelerate launch test.py
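
For a multi-machine run, each machine typically needs matching launch flags — a sketch, with 192.168.1.10 as a placeholder for the head node's IP:

# on machine 0 (the head node)
accelerate launch --num_machines 2 --machine_rank 0 \
    --main_process_ip 192.168.1.10 --main_process_port 29500 \
    --num_processes 2 test.py

# on machine 1
accelerate launch --num_machines 2 --machine_rank 1 \
    --main_process_ip 192.168.1.10 --main_process_port 29500 \
    --num_processes 2 test.py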

@hsoftxl
Author

hsoftxl commented Jan 21, 2025

@BenjaminBossan thanks

@BenjaminBossan
Member

Thanks for the additional info, we're still missing the accelerate env output.

From what you shared, this looks like multi-node inference to me. AFAIK, this is not supported out of the box. Typically, people also use some framework to handle multiple nodes, but it's not clear to me whether those work with Macs (I'm not a Mac user).

As to the specific problem of avoiding loading the whole model on each node, did you check out the docs for big model inference?
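
For reference, the big model inference API on a single machine looks roughly like this — a sketch; device_map="auto" spreads the layers across the devices and memory of one machine, it does not shard the model across nodes:

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate place the model's layers across the
# available devices and RAM of a single machine, offloading to disk if
# needed; it does not split the model across multiple machines.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B-chat",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)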
