Fast server startup, using file_handle to load model param files #840
Description
Currently, when we load model parameter files, we read the entire contents of each file into memory and then mmap that data to the devices. This causes very long server startup times. For example, `70b` (~130 GB) took me 10 minutes to start the server, and `405b` (~750 GB) took me over 5 hours.

As an alternative, this PR uses `iree_io_file_handle_open` to obtain a handle to the parameter files, then streams that data to the devices instead of mmapping it. After this change, the server starts within seconds for both `70b` and `405b`.

We default to the new method and add a private function, `LoadMmap`, for cases where `mmap == true`. This should improve startup time for both `LLM` and `SDXL`, especially when loading large files.
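
To illustrate the control flow described above, here is a minimal Python sketch. It is not the code in this PR and does not use the IREE API: `load_streaming`, `load_mmap`, `load_parameters`, and `CHUNK_SIZE` are hypothetical names, and plain file reads stand in for `iree_io_file_handle_open` and the device transfer. It only shows the idea of defaulting to a streamed, handle-based load while keeping an mmap path behind a flag.

```python
import mmap

# Hypothetical chunk size for this sketch; the real transfer granularity
# is decided by the runtime, not by this constant.
CHUNK_SIZE = 1 << 20  # 1 MiB


def load_streaming(path, chunk_size=CHUNK_SIZE):
    """Open a handle and yield the file in bounded chunks.

    Only one chunk is held in host memory at a time, so nothing forces
    the whole parameter file to be read before the first byte is used.
    """
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def load_mmap(path, chunk_size=CHUNK_SIZE):
    """Fallback path: map the file and yield slices of the mapping.

    Note: Python's mmap raises ValueError for an empty file.
    """
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for offset in range(0, len(mapped), chunk_size):
                yield mapped[offset:offset + chunk_size]


def load_parameters(path, use_mmap=False, chunk_size=CHUNK_SIZE):
    """Default to streaming; use the mmap path only when requested."""
    if use_mmap:
        return load_mmap(path, chunk_size)
    return load_streaming(path, chunk_size)
```

A caller would iterate over `load_parameters(path)` and hand each chunk to whatever performs the device upload. Both paths yield the same bytes, so the flag changes only how the data is sourced.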