Multinode batch inference #2
base: main
Conversation
PR Summary
This PR introduces initial support for multinode batch inference using Ray in the vLLM batch inference process. The changes focus on setting up the infrastructure for distributed computing while maintaining compatibility with the existing system.
- Added init_ray.py to initialize a Ray cluster for multinode batch inference
- Modified Dockerfile_vllm and build_and_upload_image.sh to include init_ray.py and improve script flexibility
- Updated vllm_batch.py with TODOs for multinode setup, indicating ongoing implementation
- Potential conflict between JOB_COMPLETION_INDEX usage in vllm_batch.py and multinode setup noted in TODO comment
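The PR itself does not include the body of init_ray.py here, but a minimal sketch of the leader/worker command construction it might perform is shown below. This is an assumption, not the PR's actual code; the env var names (MASTER_ADDR, MASTER_PORT, JOB_COMPLETION_INDEX) are taken from the snippets reviewed further down, and both helper function names are hypothetical.

```python
import os


def build_ray_start_command(is_leader: bool, address: str, port: str) -> list:
    """Return the `ray start` invocation for the head node or a worker.

    Hypothetical helper: the real init_ray.py in this PR may be structured
    differently.
    """
    if is_leader:
        # The leader pod (JOB_COMPLETION_INDEX == "0") starts the Ray head.
        return ["ray", "start", "--head", f"--port={port}"]
    # Worker pods join the head at MASTER_ADDR:MASTER_PORT.
    return ["ray", "start", f"--address={address}:{port}"]


def init_ray_command() -> list:
    """Assemble the command from the environment variables used in this PR."""
    is_leader = os.getenv("JOB_COMPLETION_INDEX") == "0"
    return build_ray_start_command(
        is_leader,
        os.getenv("MASTER_ADDR", "127.0.0.1"),
        os.getenv("MASTER_PORT", "6379"),
    )
```

The command itself would then be launched with subprocess and its return code checked, as the snippet reviewed below suggests.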
4 file(s) reviewed, 4 comment(s)
if result.returncode == 0:
    print(f"Worker: Ray runtime started with head address {ray_address}:{ray_port}")
    sys.exit(0)
print(result.returncode)
style: Print the error message from the result object for better debugging
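To make the bot's suggestion concrete, here is a hedged sketch, assuming `result` is a `subprocess.CompletedProcess` created with `capture_output=True, text=True`; the helper name is hypothetical and not from the PR.

```python
import subprocess
import sys


def describe_ray_start(result: subprocess.CompletedProcess) -> str:
    """Format a launch result, surfacing stderr on failure instead of
    printing only the numeric return code (illustrative helper)."""
    if result.returncode == 0:
        return "Worker: Ray runtime started"
    return f"ray start failed (code {result.returncode}): {result.stderr.strip()}"


# Demonstrate with a deliberately failing stand-in command:
failed = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"],
    capture_output=True,
    text=True,
)
```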
is_leader = os.getenv("JOB_COMPLETION_INDEX") == "0"
ray_address = os.getenv("MASTER_ADDR")
ray_port = os.getenv("MASTER_PORT")
ray_cluster_size = os.getenv("NUM_INSTANCES")
logic: Ensure NUM_INSTANCES is converted to an integer
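A defensive version of that conversion might look like the following; the helper name and the fallback-to-default behavior are assumptions for illustration, not part of the PR.

```python
import os


def get_cluster_size(default: int = 1) -> int:
    """Parse NUM_INSTANCES as an int, as the bot suggests, falling back to
    `default` when the variable is unset or malformed (hypothetical helper)."""
    raw = os.getenv("NUM_INSTANCES")
    try:
        return int(raw)
    except (TypeError, ValueError):
        # TypeError covers the unset (None) case; ValueError covers "abc".
        return default
```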
@@ -312,12 +312,14 @@ def tool_func(text: str, past_context: Optional[str]):


 async def batch_inference():
-    job_index = int(os.getenv("JOB_COMPLETION_INDEX", 0))
+    job_index = int(os.getenv("JOB_COMPLETION_INDEX", 0))  # TODO this conflicts with multinode
logic: This TODO suggests a conflict with multinode setup. Consider resolving this before merging.
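One possible resolution, sketched below, is to have only the leader pod drive the batch job in multinode mode, so the index-based sharding never runs on Ray worker pods. This is an assumption about how the conflict could be resolved, not the PR's actual fix, and the function name is hypothetical.

```python
import os
from typing import Optional


def resolve_job_index(multinode: bool) -> Optional[int]:
    """In multinode mode, only the pod with JOB_COMPLETION_INDEX == 0 drives
    batch inference; other pods return None and skip index-based sharding
    (illustrative sketch, not the PR's implementation)."""
    index = int(os.getenv("JOB_COMPLETION_INDEX", 0))
    if multinode and index != 0:
        return None  # worker pods join the Ray cluster but do not shard work
    return index
```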
download_model(
    request.model_cfg.checkpoint_path, MODEL_WEIGHTS_FOLDER
)  # TODO move this out
style: Moving model download out of this function could improve performance for multinode setups.
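One way to support hoisting the download, sketched as an assumption rather than the PR's approach, is a guard that skips re-downloading when the weights folder is already populated (for example, on a shared volume or after the leader pod has downloaded). The helper name is hypothetical.

```python
import os


def should_download_weights(weights_folder: str) -> bool:
    """Return True when the weights folder is missing or empty, so the
    download can run once outside batch_inference() and be skipped
    elsewhere (illustrative helper, not from the PR)."""
    return not (os.path.isdir(weights_folder) and os.listdir(weights_folder))
```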