# Large Model Inference Containers

There are a number of shared configurations for Python models running large language models. These are also available through the Large Model Inference Containers.

## Common (doc)

| Item | Required | Description | Example value |
|------|----------|-------------|---------------|
| engine | Yes | The runtime engine of the code. MPI is an engine that allows the model server to start distributed processes to load the model; it is used by some of the frameworks LMI supports. Please check the code samples to see which engine you should use. As of 0.25.0, the TRTLLM, LMI-Dist, and DeepSpeed frameworks use the MPI engine; vLLM, TransformersNeuronX, Optimum Neuron, and HuggingFace Accelerate use the Python engine. | Python, MPI, DeepSpeed (deprecated) |
| option.model_dir | No | The directory path from which to load the model. Defaults to the current path with model files. On SageMaker, this is set to the location where SageMaker downloads the Model object from S3. | Default: /opt/djl/ml |
| option.model_id | No | The Hugging Face ID of a model or the S3 URL of the model artifacts. DJL Serving uses this value to download the model from Hugging Face or from the S3 URL. DJL Serving uses s5cmd to download from an S3 bucket, which is generally faster. | google/flan-t5-xl, s3://&lt;my-bucket&gt;/google/flan-t5-xl. Default: None |
| option.dtype | No | The datatype to which you plan to cast the model. Default is fp16. You can also set bf16 if you are using G5, P4D, or newer GPU machines. | fp16, fp32, bf16, int8 (only used in LMI-Dist) |
| option.tensor_parallel_degree | No | The number of GPUs (model slicing degree) across which to shard the model. If you are serving LLMs, set this value to achieve the best performance. If you are unsure what the value should be, start with "max" (split the model across the maximum number of GPUs on the machine). | Default for DeepSpeed, Transformers-NeuronX: 1. Default for HuggingFace Accelerate: -1. Default for TrtLLM container: max |
| option.rolling_batch | No | Also commonly known as continuous batching. Enables iteration-level batching using one of the supported strategies, allowing concurrent requests that arrive at different times to be merged into a batch and run through the model server together. Disabled by default for the DeepSpeed container, since there are many choices of backend; enabled by default for the TensorRT container; disabled by default for TransformersNeuronX. | DeepSpeed container: auto, scheduler, lmi-dist, vllm, deepspeed. Neuron container: auto. TrtLLM container: trtllm |
| option.max_rolling_batch_size | No | The maximum number of concurrent requests (batch size) the model can take. The model server feeds at most `<max_rolling_batch_size>` requests to the Python processes to prevent GPU OOM. Clients can still send more requests to the model server; requests beyond `<max_rolling_batch_size>` are queued and fed to Python as earlier requests finish. Note: this is a model-specific configuration. If you set `<max_rolling_batch_size>` on Model A and there are N copies of Model A inside the container, the model server can handle up to `<max_rolling_batch_size>` x N requests. | Default: 32 (all engines except DeepSpeed), 4 (DeepSpeed) |
| **Advanced parameters** | | | |
| option.trust_remote_code | No | Set to true to use a HF Hub model with custom code. The default is false to prevent malicious code execution from the Hugging Face Hub. | Default: false |
| option.revision | No | Use a particular version/commit hash of a HF Hub model. | ed94a7c6247d8aedce4647f00f20de6875b5b292. Default: None |
| option.entryPoint | No | Defines which built-in model-loading handler to use. You can also use a custom model handler (model.py) in the model directory. Specify it as `djl_python.<handler>` or as a path to your customized handler. Each DLC we offer chooses an entryPoint for you; for example, the TensorRT-LLM DLC uses djl_python.tensorrt_llm. | djl_python.deepspeed, djl_python.huggingface, djl_python.transformers_neuronx, djl_python.tensorrt_llm, djl_python.stable_diffusion |
| option.parallel_loading | No | Loads the workers in parallel, reducing model loading time if your model fits in CPU memory with multiple processes. Note: if you load N copies of the model at the same time, peak CPU memory can reach N x model_size and cause CPU OOM. | Default: false |
| option.model_loading_timeout | No | Sets a limit on how long the model may take to load before the model server times out. The default is 30 minutes (1800 s). On SageMaker, if you expect the model to take longer to load, also set `container_startup_health_check_timeout=<model_loading_timeout>`. | Default: 1800 |
| job_queue_size | No | The job queue size at the model level. The job queue handles concurrent requests beyond what the backend model server can take; the model server queues up to `<job_queue_size>` requests. | Default: 1000 |
| option.output_formatter | No | Only applies if option.rolling_batch is enabled. Defines the output format in which the model server sends back the result. If no value is set, tokens are sent back as JSON. | json, jsonlines. Default: json |
| option.enable_streaming [Deprecated since 0.25.0] | No | Deprecated starting in 0.25.0. Enables response streaming for static batching. Use huggingface to enable HuggingFace-style streaming output. Rolling batch already streams tokens by default, so setting this value has no effect with rolling batch. | false, true, huggingface |
| **Advanced parameters: Dynamic Batching** | | | |
| batch_size | No | Dynamic request-level batching. Commonly used in non-text inference to wait for requests to arrive and build a batch, which helps handle concurrent requests more efficiently. Dynamic batching cannot be used together with rolling batch; they are different batching algorithms defined by the model server. | Default: 1 |
| option.max_batch_delay | No | The maximum delay for batch aggregation in milliseconds. The model server waits up to `<max_batch_delay>` milliseconds to collect up to `<batch_size>` requests before sending them to the backend. | Default: 100 |
| option.max_idle_time | No | The maximum idle time in seconds before the worker thread is scaled down. | Default: 60 |
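
Putting the common options together, a `serving.properties` file placed alongside the model might look like the sketch below. This is illustrative only; the model ID is the example value from the table above, and the remaining values are assumptions you would tune for your own model and hardware.

```
# serving.properties — illustrative sketch of the common options
engine=MPI                          # MPI engine (used by LMI-Dist / TRT-LLM style backends)
option.model_id=google/flan-t5-xl   # HF model ID or s3:// URL (example value from the table)
option.dtype=fp16                   # default datatype
option.tensor_parallel_degree=max   # shard across all GPUs on the machine
option.rolling_batch=auto           # enable continuous batching
option.max_rolling_batch_size=32    # default for non-DeepSpeed engines
option.trust_remote_code=false
option.model_loading_timeout=1800   # 30 minutes
```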

## DeepSpeed (doc)

If you set the entryPoint to DeepSpeed or use the DeepSpeed engine, you will have access to the following parameters.

| Item | Required | Description | Example value |
|------|----------|-------------|---------------|
| option.task | No | The task used in Hugging Face for different pipelines. Default is text-generation. | text-generation |
| option.quantize | No | Specify this option to quantize your model using the quantization methods supported in DeepSpeed. SmoothQuant is our special offering that provides quantization with better quality. | dynamic_int8, smoothquant |
| option.max_tokens | No | The total number of tokens (input and output) that DeepSpeed can work with. The number of output tokens is the difference between the total number of tokens and the number of input tokens. By default the value is 1024. If you need long sequence generation, set this to a higher value (2048, 4096, ...). | 1024 |
| **Advanced parameters** | | | |
| option.low_cpu_mem_usage | No | Reduces CPU memory usage when loading models. We recommend setting this to true. | Default: true |
| option.enable_cuda_graph | No | Activates capturing the CUDA graph of the forward pass to accelerate it. | Default: false |
| option.triangular_masking | No | Whether to use triangular masking for the attention mask. This is application or model specific. | Default: true |
| option.return_tuple | No | Whether transformer layers need to return a tuple or a tensor. | Default: true |
| option.training_mp_size | No | If the model was trained with DeepSpeed, this indicates the tensor parallelism degree with which the model was trained. It can be different from the tensor parallel degree desired for inference. | Default: 1 |
| option.checkpoint | No | Path to a DeepSpeed-compatible checkpoint file. | ds_inference_checkpoint.json |
| option.smoothquant_alpha | No | If smoothquant is set in option.quantize, you can provide this alpha value. If not provided, DeepSpeed will choose one for you. | Any float value between 0 and 1 |
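
For reference, a DeepSpeed configuration using these options could be sketched as follows. The model ID is reused from the common table's example, and the quantization and max_tokens values are illustrative assumptions, not recommendations.

```
# serving.properties — DeepSpeed sketch (illustrative values)
engine=DeepSpeed
option.model_id=google/flan-t5-xl
option.tensor_parallel_degree=1      # DeepSpeed default
option.task=text-generation
option.quantize=smoothquant          # optional; dynamic_int8 is the other supported method
option.max_tokens=2048               # raise above the 1024 default for longer sequences
option.low_cpu_mem_usage=true
```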

## HuggingFace Accelerate (doc)

If you set the engine to Python in the DeepSpeed container, and either do not specify rolling_batch or set rolling_batch=scheduler, the following parameters will be accessible.

| Item | Required | Description | Example value |
|------|----------|-------------|---------------|
| option.task | No | The task used in Hugging Face for different pipelines. | text-generation |
| option.low_cpu_mem_usage | No | Reduces CPU memory usage when loading models. We recommend setting this to true. | TRUE |
| option.quantize | No | Quantize the model with the supported quantization methods. bitsandbytes4 is equivalent to load_in_4bit; bitsandbytes8 is equivalent to load_in_8bit. | bitsandbytes4, bitsandbytes8. Default: None |
| option.device_map | No | Enables fitting the model across multiple GPUs. | auto, balanced, balanced_low_0, sequential. Default: auto if tensor_parallel_degree > 0 and CUDA devices are available |
| **Advanced parameters: Rolling batch scheduler parameters** | | | |
| option.decoding_strategy | No | Specifies the decoding method among sample, greedy, and contrastive search. | sample, greedy, contrastive. Default: greedy |
| option.max_sparsity | No | Used in the max-sparsity thresholding mechanism. Limits the max sparsity in the token sequence caused by padding. | 0.01, 0.5. Default: 0.33 |
| option.max_splits | No | Used in the max-sparsity thresholding mechanism. Limits the max number of batch splits, where each split has its own inference call. | 1, 5. Default: 3 |
| option.disable_flash_attn | No | Toggles whether HuggingFace flash attention is used. Note that HuggingFace flash attention can be affected by padding; see huggingface/transformers#26990 and https://huggingface.co/docs/transformers/perf_infer_gpu_one#expected-speedups. | Default: true |
| option.load_in_4bit [Deprecated since 0.25.0] | No | Uses bitsandbytes quantization. Supported only on certain models. Deprecated since 0.25.0; use option.quantize=bitsandbytes4 instead. | Default: false |
| option.load_in_8bit [Deprecated since 0.25.0] | No | Uses bitsandbytes quantization. Supported only on certain models. Deprecated; use option.quantize=bitsandbytes8 instead. | Default: false |
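
A HuggingFace Accelerate setup under these options might look like the following sketch; the model ID and the specific choices of quantization and decoding strategy are assumptions for illustration.

```
# serving.properties — HuggingFace Accelerate sketch (illustrative values)
engine=Python
option.model_id=google/flan-t5-xl
option.task=text-generation
option.device_map=auto               # fit the model across available GPUs
option.quantize=bitsandbytes8        # optional 8-bit quantization
option.rolling_batch=scheduler       # rolling batch scheduler backend
option.decoding_strategy=greedy      # default decoding method
```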

## LMI-Dist

If you set the engine to MPI and rolling_batch to auto or lmi-dist in the DeepSpeed container, the following parameters will be accessible.

| Item | Required | Description | Example value |
|------|----------|-------------|---------------|
| option.quantize | No | Applies the specified quantization technique to the model. gptq quantization requires loading a GPTQ model. bitsandbytes is deprecated; use bitsandbytes8 instead (both options are the same). | bitsandbytes8, gptq. Default: None |
| **Advanced parameters** | | | |
| option.paged_attention | No | Whether to use PagedAttention. The default is to always use it. Disable this if you plan to run on G4 or older GPU architectures. | Default: true |
| option.max_rolling_batch_prefill_tokens [Deprecated since 0.25.0] | No | Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. As of 0.25.0 we calculate the best value for you, so this is no longer required. | Default: 1088 |
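
As an illustration, an LMI-Dist configuration could be sketched as below. The model ID placeholder and the choice of GPTQ quantization are assumptions; gptq requires that the artifacts are already a GPTQ model.

```
# serving.properties — LMI-Dist sketch (illustrative values)
engine=MPI
option.model_id=<hf-model-id-or-s3-url>   # placeholder; use a GPTQ checkpoint when quantize=gptq
option.rolling_batch=lmi-dist
option.tensor_parallel_degree=max
option.quantize=gptq                      # optional
option.paged_attention=true               # disable on G4 or older GPU architectures
```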

## vLLM

If you set the engine to MPI and rolling_batch to vllm in the DeepSpeed container, the following parameters will be accessible.

| Item | Required | Description | Example value |
|------|----------|-------------|---------------|
| option.quantize | No | Quantize the model with the supported quantization methods. | awq. Default: None |
| option.max_rolling_batch_prefill_tokens | No | Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. If you don't set it, vLLM will try to find a good number to fit. | Default: 1088 |
| option.load_format | No | The checkpoint format of the model. The default is auto, meaning bin/safetensors will be used if found. | Default: auto |
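
A vLLM configuration under these options might look like the sketch below; the model ID placeholder and the AWQ quantization choice are assumptions.

```
# serving.properties — vLLM sketch (illustrative values)
engine=MPI
option.model_id=<hf-model-id-or-s3-url>   # placeholder
option.rolling_batch=vllm
option.tensor_parallel_degree=max
option.quantize=awq                       # optional; requires an AWQ-quantized model
option.load_format=auto                   # bin/safetensors picked automatically
```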

## Transformers-NeuronX (doc)

If you are using the Neuron container with the engine set to Python, the following parameters will be accessible.

| Item | Required | Description | Example value |
|------|----------|-------------|---------------|
| option.n_positions | No | Total sequence length: input sequence length + output sequence length. | Default: 128 |
| option.load_in_8bit | No | Specify this option to quantize your model using the quantization methods supported in TransformersNeuronX. | False, True. Default: False |
| **Advanced parameters** | | | |
| option.unroll | No | Unroll the model graph for compilation. With unroll=None, the compiler has more opportunities to optimize across the layers. | Default: None |
| option.neuron_optimize_level | No | Neuron runtime compiler optimization level; determines the type of optimizations applied during compilation. The higher the optimization level, the longer compilation takes, but in exchange you get better latency/throughput. The default (not set) is optimization level 2, which balances compilation time and performance. | 1, 2, 3. Default: 2 |
| option.context_length_estimate | No | Estimated context input length for Llama models. You can specify different size buckets to increase KV cache reusability, which helps improve latency. | Example: 256,512,1024 (integers separated by commas if multiple values). Default: None |
| option.low_cpu_mem_usage | No | Reduces CPU memory usage when loading models. | Default: False |
| option.load_split_model | No | Set to True when using model artifacts that have already been split for Neuron compilation/loading. | Default: False |
| option.compiled_graph_path | No | Provide an S3 URI or a local directory that stores the pre-compiled graph for your model (NEFF cache) to skip runtime compilation. | Default: None |
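
A Transformers-NeuronX configuration on the Neuron container could be sketched as follows; the model ID placeholder, tensor parallel degree, sequence length, and NEFF cache path are illustrative assumptions.

```
# serving.properties — Transformers-NeuronX sketch (illustrative values)
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=<hf-model-id-or-s3-url>                  # placeholder
option.tensor_parallel_degree=2                          # assumption; default is 1
option.n_positions=512                                   # input + output sequence length
option.rolling_batch=auto
option.compiled_graph_path=s3://<my-bucket>/neff-cache/  # optional pre-compiled graph (NEFF cache)
```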

## TensorRT-LLM

If you specify the MPI engine in the TensorRT-LLM container, the following parameters will be accessible.

| Item | Required | Description | Example value |
|------|----------|-------------|---------------|
| option.max_input_len | No | Maximum input token size you expect the model to take per request. This is a compilation parameter passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. | Default: 512 for Llama, 1024 for Falcon |
| option.max_output_len | No | Maximum output token size you expect the model to produce per request. This is a compilation parameter passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default: 512 for Llama, 1024 for Falcon |
| option.use_custom_all_reduce | No | A custom all-reduce kernel is used for GPUs that have NVLink enabled. This can help speed up model inference through better communication. Turn this on by setting true on P4D, P4De, P5, and other NVLink-connected GPUs. | true, false. Default: false |
| **Advanced parameters** | | | |
| option.tokens_per_block | No | Tokens per block to be used in the paged attention algorithm. | Default: 64 |
| option.batch_scheduler_policy | No | Scheduler policy of the TensorRT-LLM batch manager. | max_utilization, guaranteed_no_evict. Default: max_utilization |
| option.kv_cache_free_gpu_mem_fraction | No | Fraction of free GPU memory allocated for the KV cache. The larger the value, the more GPU memory the model will try to take over. More reserved memory means a larger KV cache, which allows longer input+output sequences or larger batch sizes. | Float between 0 and 1. Default: 0.95 |
| option.max_num_sequences | No | Maximum number of input requests processed in the batch. If you don't set this, max_rolling_batch_size is used as the value. Generally you don't have to touch it unless you really want the model compiled to a batch size different from what the model server sets. | Integer greater than 0. Default: the batch size set while building the TensorRT engine |
| option.enable_trt_overlap | No | Overlaps the execution of batches of requests. It may have a negative impact on performance when the number of requests is too small. In our experiments, turning this on had more negative impact than leaving it off. | true, false. Default: false |
| option.baichuan_model_version | No | Parameter exclusively for Baichuan LLM models to specify the version of the model. Requires specifying the HF Baichuan checkpoint path. For v1_13b, use either baichuan-inc/Baichuan-13B-Chat or baichuan-inc/Baichuan-13B-Base. For v2_13b, use either baichuan-inc/Baichuan2-13B-Chat or baichuan-inc/Baichuan2-13B-Base. More Baichuan models can be found under baichuan-inc. | v1_7b, v1_13b, v2_7b, v2_13b. Default: v1_13b |
| **Advanced parameters: Quantization** | | | |
| option.quantize | No | Currently only supports smoothquant for Llama models in just-in-time compilation mode. | smoothquant |
| option.smoothquant_alpha | No | SmoothQuant alpha parameter. | Default: 0.8 |
| option.smoothquant_per_token | No | Only applied when option.quantize is set to smoothquant. Enables choosing a custom SmoothQuant scaling factor for each token at run time. This is usually a little slower and more accurate. | true, false. Default: false |
| option.smoothquant_per_channel | No | Only applied when option.quantize is set to smoothquant. Enables choosing a custom SmoothQuant scaling factor for each channel at run time. This is usually a little slower and more accurate. | true, false. Default: false |
| option.multi_query_mode | No | Only needed when option.quantize is set to smoothquant. This should be set for models that support multi-query attention, e.g. llama-70b. | true, false. Default: false |
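
A TensorRT-LLM configuration in this container might be sketched as follows; the model ID placeholder and the specific compilation limits are illustrative assumptions, not defaults.

```
# serving.properties — TensorRT-LLM sketch (illustrative values)
engine=MPI
option.model_id=<hf-model-id-or-s3-url>        # placeholder
option.rolling_batch=trtllm
option.tensor_parallel_degree=max
option.max_input_len=1024                      # JIT compilation limit for input tokens
option.max_output_len=512                      # JIT compilation limit for output tokens
option.use_custom_all_reduce=true              # only on NVLink-connected GPUs (P4D/P4De/P5)
option.kv_cache_free_gpu_mem_fraction=0.95     # default fraction of free GPU memory for KV cache
```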

## Aliases

DJLServing provides a few aliases for the Python engine to make common LLM configurations easier.

* engine=DeepSpeed is equivalent to:

  ```
  engine=Python
  option.mpi_mode=true
  option.entryPoint=djl_python.deepspeed
  ```

* engine=MPI is equivalent to:

  ```
  engine=Python
  option.mpi_mode=true
  option.entryPoint=djl_python.huggingface
  ```