There are a number of shared configurations for Python models running large language models. They are also available through the Large Model Inference (LMI) containers. An example serving.properties using the common options is shown after the table below.
Common (doc)
Item | Required | Description | Example value |
---|---|---|---|
engine | Yes | The runtime engine of the code. MPI is an engine that allows the model server to start distributed processes to load the model; it is used by some of the frameworks LMI supports. Please check the code samples to see which one you should use. As of 0.25.0, the TRTLLM, LMI-Dist, and DeepSpeed frameworks use the MPI engine; vLLM, TransformersNeuronX, Optimum Neuron, and HuggingFace Accelerate use the Python engine. | Python, MPI, DeepSpeed (deprecated) |
option.model_dir | No | The directory path from which to load the model. Defaults to the current path with model files. On SageMaker, this is set to the location where SageMaker downloads the Model object from S3. | Default: /opt/djl/ml |
option.model_id | No | The Hugging Face ID of a model or the S3 URL of the model artifacts. DJL Serving uses the ID to download the model from Hugging Face, or the S3 URL to download it from the bucket (using s5cmd, which is generally faster). | google/flan-t5-xl , s3://<my-bucket>/google/flan-t5-xl Default: None |
option.dtype | No | The data type to which you plan to cast the model. Default is fp16. You can also set bf16 if you are using G5, P4d, or newer GPU machines. | fp16, fp32, bf16, int8 (only used in LMI-Dist) |
option.tensor_parallel_degree | No | The number of GPUs (model slicing number) across which to shard the model. If you are using LLMs, you should set this value to achieve the best performance. If you are not sure what value to use, start with max (split the model across the maximum number of GPUs on the machine). | Default for DeepSpeed and Transformers-NeuronX: 1. Default for HuggingFace Accelerate: -1. Default for TrtLLM container: max |
option.rolling_batch | No | Also commonly known as continuous batching. Enables iteration-level batching using one of the supported strategies, allowing concurrent requests that arrive at different times to be merged into a batch for inference by the model server. Disabled by default for the DeepSpeed container given the many backend choices; enabled by default for the TensorRT container; disabled by default for TransformersNeuronX. | DeepSpeed container: auto, scheduler, lmi-dist, vllm, deepspeed. Neuron container: auto. TrtLLM container: trtllm |
option.max_rolling_batch_size | No | The maximum number of concurrent requests (batch size) the model can take. The model server feeds at most <max_rolling_batch_size> requests to the Python processes to prevent GPU OOM. Clients can still send more requests to the model server; requests beyond <max_rolling_batch_size> are queued and fed to Python as earlier requests finish. Note: this is a model-specific configuration. If you set <max_rolling_batch_size> on model A and there are N copies of model A inside the container, the model server can handle up to <max_rolling_batch_size> x N requests. | Default: 32 (all engines except DeepSpeed); 4 (DeepSpeed) |
Advanced parameters | |||
option.trust_remote_code | No | Set to true to use an HF hub model with custom code. The default is false to prevent executing malicious code from the HuggingFace Hub. | Default: false |
option.revision | No | Use a particular version/commit hash of a HF hub model | ed94a7c6247d8aedce4647f00f20de6875b5b292 Default: None |
option.entryPoint | No | Defines which built-in model loading handler to use. You can also use a custom model handler (model.py) in the model directory. Specify as djl_python.<handler> or the path to your customized handler. Each DLC we offer chooses an entryPoint for you; for example, the TensorRT-LLM DLC uses djl_python.tensorrt_llm. | djl_python.deepspeed , djl_python.huggingface , djl_python.transformers_neuronx , djl_python.tensorrt_llm , djl_python.stable_diffusion |
option.parallel_loading | No | Loads the workers in parallel, reducing model loading time if your model fits in CPU memory with multiple processes. Note: if you load N copies of the model at the same time, peak CPU memory can reach N x model_size and cause CPU OOM. | Default: false |
option.model_loading_timeout | No | Sets a limit on how long the model can take to load before the model server times out. The default is 30 minutes (1800 s). If you are on SageMaker and expect the model to take longer to load, also set container_startup_health_check_timeout=<model_loading_timeout> on SageMaker. | Default: 1800 |
job_queue_size | No | Specifies the job queue size at the model level. The job queue is typically used to handle concurrent requests beyond what the backend model server can take; the model server queues up to <job_queue_size> requests. | Default: 1000 |
option.output_formatter | No | Only applies when option.rolling_batch is enabled. Defines the output format in which the model server sends back results. If no value is set, tokens are sent back as JSON. | json, jsonlines Default: json |
option.enable_streaming [Deprecated since 0.25.0] | No | Deprecated starting in 0.25.0. Enables response streaming for static batching. Use huggingface to enable HuggingFace-like streaming output. Rolling batch already streams tokens by default, so setting this value has no effect with rolling batch. | false , true , huggingface |
Advanced parameters: Dynamic Batching | |||
batch_size | No | Dynamic request-level batching. Commonly used in non-text inference to wait for requests to arrive and build a batch, which helps handle concurrent requests more efficiently. Dynamic batching cannot be used with rolling batching; they are different batching algorithms defined by the model server. | Default: 1 |
option.max_batch_delay | No | The maximum delay for batch aggregation in milliseconds. The server waits up to <max_batch_delay> milliseconds to collect up to <batch_size> requests before sending them to the model. | Default: 100 |
option.max_idle_time | No | The maximum idle time in seconds before the worker thread is scaled down | Default: 60 |
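As an illustration of how the common options fit together, a serving.properties for a rolling-batch deployment might look like the sketch below. The engine, model location, and batch settings are placeholders to adapt to your model, container, and hardware.

```
# Illustrative serving.properties using the common options (placeholder values)
engine=MPI
option.model_id=s3://<my-bucket>/<my-model>/
option.dtype=fp16
option.tensor_parallel_degree=max
option.rolling_batch=auto
option.max_rolling_batch_size=32
option.model_loading_timeout=1800
```

On SageMaker, remember to set container_startup_health_check_timeout to at least the value of option.model_loading_timeout.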
DeepSpeed (doc)
If you set the entryPoint to DeepSpeed or use the DeepSpeed engine, you will have access to the following parameters (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.task | No | The task used in Hugging Face for different pipelines. Default is text-generation | text-generation |
option.quantize | No | Specify this option to quantize your model using the supported quantization methods in DeepSpeed. SmoothQuant is our special offering to provide quantization with better quality | dynamic_int8 , smoothquant |
option.max_tokens | No | Total number of tokens (input and output) with which DeepSpeed can work. The number of output tokens is the difference between the total number of tokens and the number of input tokens. By default we set the value to 1024. If you are looking for long sequence generation, you may want to set this to a higher value (2048, 4096, ...). | 1024 |
Advanced parameters | |||
option.low_cpu_mem_usage | No | Reduce CPU memory usage when loading models. We recommend that you set this to True. | Default:true |
option.enable_cuda_graph | No | Activates capturing the CUDA graph of the forward pass to accelerate inference. | Default: false |
option.triangular_masking | No | Whether to use triangular masking for the attention mask. This is application or model specific. | Default: true |
option.return_tuple | No | Whether transformer layers need to return a tuple or a tensor. | Default: true |
option.training_mp_size | No | If the model was trained with DeepSpeed, this indicates the tensor parallelism degree with which the model was trained. Can be different than the tensor parallel degree desired for inference. | Default: 1 |
option.checkpoint | No | Path to DeepSpeed compatible checkpoint file. | ds_inference_checkpoint.json |
option.smoothquant_alpha | No | If smoothquant is provided in option.quantize, you can provide this alpha value. If not provided, DeepSpeed will choose one for you. | Any float value between 0 and 1 |
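For example, a DeepSpeed configuration that enables SmoothQuant might look like the following sketch; the model ID, tensor parallel degree, and alpha value are illustrative placeholders, not tuned recommendations.

```
# Illustrative DeepSpeed serving.properties (placeholder values)
engine=DeepSpeed
option.model_id=<hf-model-id-or-s3-url>
option.task=text-generation
option.tensor_parallel_degree=4
option.dtype=fp16
option.max_tokens=2048
option.quantize=smoothquant
option.smoothquant_alpha=0.65
```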
HuggingFace Accelerate (doc)
If you set the engine to Python in the DeepSpeed container and either do not specify rolling_batch or set rolling_batch=scheduler, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.task | No | The task used in Hugging Face for different pipelines. | text-generation |
option.low_cpu_mem_usage | No | Reduce CPU memory usage when loading models. We recommend that you set this to True. | TRUE |
option.quantize | No | Quantize the model with the supported quantization methods | bitsandbytes4 , bitsandbytes8 Default: None bitsandbytes4 is equivalent to load_in_4bit bitsandbytes8 is equivalent to load_in_8bit |
option.device_map | No | Enables fitting the model across multiple GPUs. | auto , balanced , balanced_low_0 , sequential . Default: auto if tensor_parallel_degree > 0 and CUDA devices are available |
Advanced parameters: Rolling batch scheduler parameters | |||
option.decoding_strategy | No | Specifies the decoding method among sample, greedy and contrastive search | sample, greedy, contrastive Default: greedy |
option.max_sparsity | No | Used in max-sparsity thresholding mechanism. It limits the max_sparsity in the token sequence caused by padding. | 0.01, 0.5 Default: 0.33 |
option.max_splits | No | Used in max-sparsity thresholding mechanism. It limits the max number of batch splits, where each split has its own inference call. | 1, 5 Default: 3 |
option.disable_flash_attn | No | Toggles whether HuggingFace flash attention is used. Note that HuggingFace flash attention can be affected by padding (see huggingface/transformers#26990 and https://huggingface.co/docs/transformers/perf_infer_gpu_one#expected-speedups). | Default: true |
option.load_in_4bit [Deprecated since 0.25.0] | No | Uses bitsandbytes quantization. Supported only on certain models. | Deprecated since 0.25.0, use option.quantize=bitsandbytes4 instead. Default: false |
option.load_in_8bit [Deprecated since 0.25.0] | No | Uses bitsandbytes quantization. Supported only on certain models. | Deprecated, use option.quantize=bitsandbytes8 instead. Default: false |
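A sketch of a HuggingFace Accelerate configuration using the scheduler rolling batch; the model ID and the quantization and decoding choices are placeholders.

```
# Illustrative HuggingFace Accelerate serving.properties (placeholder values)
engine=Python
option.model_id=<hf-model-id-or-s3-url>
option.task=text-generation
option.dtype=fp16
option.device_map=auto
option.rolling_batch=scheduler
option.decoding_strategy=greedy
option.quantize=bitsandbytes8
```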
If you set the engine to MPI and rolling_batch to auto or lmi-dist in the DeepSpeed container, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.quantize | No | Applies the specified quantization technique to the model. gptq quantization requires loading a GPTQ model. bitsandbytes is deprecated; use bitsandbytes8 instead (both options are the same). | bitsandbytes8, gptq Default: None |
Advanced parameters | |||
option.paged_attention | No | Whether to use PagedAttention. Enabled by default; disable this if you plan to run on G4 or older GPU architectures. | Default: true |
option.max_rolling_batch_prefill_tokens [Deprecated since 0.25.0] | No | Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. From 0.25.0 the best value is calculated for you, so this is no longer required. | Default: 1088 |
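An illustrative lmi-dist configuration following the engine/rolling_batch combination above; the model ID and tensor parallel degree are placeholders, and the gptq line only applies if you load a GPTQ model.

```
# Illustrative lmi-dist serving.properties (placeholder values)
engine=MPI
option.model_id=<gptq-model-id-or-s3-url>
option.tensor_parallel_degree=4
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
option.quantize=gptq
option.paged_attention=true
```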
If you set the engine to MPI and rolling_batch to vllm in the DeepSpeed container, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.quantize | No | Quantize the model with the supported quantization methods | awq Default: None |
option.max_rolling_batch_prefill_tokens | No | Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. If you don't set this, vLLM will try to find a suitable value. | Default: 1088 |
option.load_format | No | The checkpoint format of the model. The default is auto, which means bin/safetensors files will be used if found. | Default: auto |
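An illustrative vLLM configuration following the engine/rolling_batch combination above; the model ID is a placeholder, and the awq quantization line only applies if you load an AWQ model.

```
# Illustrative vLLM serving.properties (placeholder values)
engine=MPI
option.model_id=<awq-model-id-or-s3-url>
option.tensor_parallel_degree=4
option.rolling_batch=vllm
option.max_rolling_batch_size=32
option.quantize=awq
option.load_format=auto
```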
Transformers-NeuronX (doc)
If you are using the Neuron container with the engine set to Python, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.n_positions | No | Total sequence length, input sequence length + output sequence length. | Default: 128 |
option.load_in_8bit | No | Specify this option to quantize your model using the supported quantization methods in TransformerNeuronX | False , True Default: False |
Advanced parameters | |||
option.unroll | No | Unroll the model graph for compilation. With unroll=None the compiler has more opportunities to perform optimizations across the layers. | Default: None |
option.neuron_optimize_level | No | Neuron runtime compiler optimization level; determines the type of optimizations applied during compilation. Higher optimization levels take longer to compile but, in exchange, give better latency/throughput. By default the value is not set (optimization level 2), which balances compilation time and performance. | 1, 2, 3 Default: 2 |
option.context_length_estimate | No | Estimated context input length for Llama models. You can specify buckets of different sizes to increase KV cache reusability, which helps improve latency. | Example: 256,512,1024 (integers separated by commas if multiple values) Default: None |
option.low_cpu_mem_usage | No | Reduce CPU memory usage when loading models. | Default: False |
option.load_split_model | No | Toggle to True when using model artifacts that have already been split for neuron compilation/loading. | Default: False |
option.compiled_graph_path | No | Provide an s3 URI, or a local directory that stores the pre-compiled graph for your model (NEFF cache) to skip runtime compilation. | Default: None |
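An illustrative Transformers-NeuronX configuration; the model ID, n_positions, batch size, and compiled-graph location are placeholders. A longer model_loading_timeout may be needed because Neuron compilation can take a while.

```
# Illustrative Transformers-NeuronX serving.properties (placeholder values)
engine=Python
option.model_id=<hf-model-id-or-s3-url>
option.tensor_parallel_degree=2
option.n_positions=512
option.rolling_batch=auto
option.max_rolling_batch_size=8
option.model_loading_timeout=3600
option.compiled_graph_path=s3://<my-bucket>/neff-cache/
```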
If you specify the MPI engine in the TensorRT-LLM container, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.max_input_len | No | Maximum input token size you expect the model to receive per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. | Default values: Llama 512, Falcon 1024 |
option.max_output_len | No | Maximum output token size you expect the model to produce per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default values: Llama 512, Falcon 1024 |
option.use_custom_all_reduce | No | The custom all-reduce kernel is used for GPUs that have NVLink enabled and can help speed up model inference through better communication. Turn this on (set to true) on P4d, P4de, P5, and other NVLink-connected GPUs. | true , false . Default is false |
Advanced parameters | |||
option.tokens_per_block | No | Tokens per block to be used in the paged attention algorithm. | Default value is 64 |
option.batch_scheduler_policy | No | Scheduler policy of the TensorRT-LLM batch manager. | max_utilization , guaranteed_no_evict Default value is max_utilization |
option.kv_cache_free_gpu_mem_fraction | No | Fraction of free GPU memory allocated for the KV cache. The larger the value, the more GPU memory the model will try to take over. More reserved memory allows a larger KV cache, which means longer input+output sequences or larger batch sizes. | Float between 0 and 1. Default is 0.95 |
option.max_num_sequences | No | Maximum number of input requests processed in the batch. If you don't set this, max_rolling_batch_size is used as the value. Generally you don't have to touch it unless you want the model compiled to a batch size different from the one the model server sets. | Integer greater than 0. Default value is the batch size set while building the TensorRT engine |
option.enable_trt_overlap | No | Overlaps the execution of batches of requests. It may have a negative impact on performance when the number of requests is too small; in our experiments, turning this on had more negative impact than leaving it off. | true , false . Default is false |
option.baichuan_model_version | No | Parameter exclusively for Baichuan models to specify the model version. You also need to specify the HF Baichuan checkpoint path. For v1_13b, use either baichuan-inc/Baichuan-13B-Chat or baichuan-inc/Baichuan-13B-Base. For v2_13b, use either baichuan-inc/Baichuan2-13B-Chat or baichuan-inc/Baichuan2-13B-Base. More Baichuan models can be found under baichuan-inc. | v1_7b , v1_13b , v2_7b , v2_13b . Default is v1_13b |
Advanced parameters: Quantization | |||
option.quantize | No | Currently only supports smoothquant for Llama models in just-in-time compilation mode. | smoothquant |
option.smoothquant_alpha | No | smoothquant alpha parameter | Default value is 0.8 |
option.smoothquant_per_token | No | Only applied when option.quantize is set to smoothquant. Enables choosing a custom SmoothQuant scaling factor for each token at run time. This is usually a little slower but more accurate. | true , false . Default is false |
option.smoothquant_per_channel | No | Only applied when option.quantize is set to smoothquant. Enables choosing a custom SmoothQuant scaling factor for each channel at run time. This is usually a little slower but more accurate. | true , false . Default is false |
option.multi_query_mode | No | Only needed when option.quantize is set to smoothquant. This should be set for models that use multi-query attention, e.g. llama-70b. | true , false . Default is false |
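An illustrative TensorRT-LLM configuration; the model ID, input/output length limits, and batch size are placeholders to match your workload, and use_custom_all_reduce is only worth enabling on NVLink-connected GPUs such as P4d/P5.

```
# Illustrative TensorRT-LLM serving.properties (placeholder values)
engine=MPI
option.model_id=<hf-model-id-or-s3-url>
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.max_rolling_batch_size=64
option.max_input_len=1024
option.max_output_len=512
option.use_custom_all_reduce=true
option.kv_cache_free_gpu_mem_fraction=0.9
```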
DJLServing provides a few aliases for the Python engine to make common LLM configurations easier.
`engine=DeepSpeed` is equivalent to:

```
engine=Python
option.mpi_mode=true
option.entryPoint=djl_python.deepspeed
```

`engine=MPI` is equivalent to:

```
engine=Python
option.mpi_mode=true
option.entryPoint=djl_python.huggingface
```