There are a number of shared configurations for Python models running large language models. They are also available through the Large Model Inference (LMI) containers. An example serving.properties using the common options is shown after the table below.
Common (doc)
Item | Required | Description | Example value |
---|---|---|---|
engine | Yes | The runtime engine of the code. MPI is an engine that allows the model server to start distributed processes to load the model; it is used by some of the frameworks LMI supports. Please check the code samples to see which one you should use. As of 0.25.0, the TRTLLM, LMI-Dist, and DeepSpeed frameworks use the MPI engine; vLLM, TransformersNeuronX, Optimum Neuron, and HuggingFace Accelerate use the Python engine. | Python, MPI, DeepSpeed (deprecated) |
option.model_dir | No | The directory path from which to load the model. Defaults to the current path with model files. On SageMaker, this is set to the location where SageMaker downloads the Model object from S3. | Default: /opt/djl/ml |
option.model_id | No | The Hugging Face ID of a model or the S3 URL of the model artifacts. DJL Serving uses the ID to download the model from Hugging Face, or the S3 URL to download it from the bucket (using s5cmd, which is generally faster). | google/flan-t5-xl , s3://<my-bucket>/google/flan-t5-xl Default: None |
option.dtype | No | The data type to which you plan to cast the model. Default is fp16. You can also set bf16 if you are using G5, P4d, or newer GPU machines. | fp16, fp32, bf16, int8 (only used in LMI-Dist) |
option.tensor_parallel_degree | No | The number of GPUs (model slicing number) across which to shard the model. If you are using LLMs, you should set this value to achieve the best performance. If you are not sure what value to use, start with max (split the model across the maximum number of GPUs on the machine). | Default for DeepSpeed and Transformers-NeuronX: 1. Default for HuggingFace Accelerate: -1. Default for TrtLLM container: max |
option.rolling_batch | No | Also commonly known as continuous batching. Enables iteration-level batching using one of the supported strategies, allowing concurrent requests that arrive at different times to be merged into a batch for inference by the model server. Disabled by default for the DeepSpeed container given the many backend choices; enabled by default for the TensorRT container; disabled by default for TransformersNeuronX. | DeepSpeed container: auto, scheduler, lmi-dist, vllm, deepspeed. Neuron container: auto. TrtLLM container: trtllm |
option.max_rolling_batch_size | No | The maximum number of concurrent requests (batch size) the model can take. The model server feeds at most <max_rolling_batch_size> requests to the Python processes to prevent GPU OOM. Clients can still send more requests to the model server; requests beyond <max_rolling_batch_size> are queued and fed to Python as earlier requests finish. Note: this is a model-specific configuration. If you set <max_rolling_batch_size> on model A and there are N copies of model A inside the container, the model server can handle up to <max_rolling_batch_size> x N requests. | Default: 32 (all engines except DeepSpeed); 4 (DeepSpeed) |
Advanced parameters | |||
option.trust_remote_code | No | Set to true to use an HF hub model with custom code. The default is false to prevent executing malicious code from the HuggingFace Hub. | Default: false |
option.revision | No | Use a particular version/commit hash of a HF hub model | ed94a7c6247d8aedce4647f00f20de6875b5b292 Default: None |
option.entryPoint | No | Defines which built-in model loading handler to use. You can also use a custom model handler (model.py) in the model directory. Specify as djl_python.<handler> or the path to your customized handler. Each DLC we offer chooses an entryPoint for you; for example, the TensorRT-LLM DLC uses djl_python.tensorrt_llm. | djl_python.deepspeed , djl_python.huggingface , djl_python.transformers_neuronx , djl_python.tensorrt_llm , djl_python.stable_diffusion |
option.parallel_loading | No | Loads the workers in parallel, reducing model loading time if your model fits in CPU memory with multiple processes. Note: if you load N copies of the model at the same time, peak CPU memory can reach N x model_size and cause CPU OOM. | Default: false |
option.model_loading_timeout | No | Sets a limit on how long the model can take to load before the model server times out. The default is 30 minutes (1800 s). If you are on SageMaker and expect the model to take longer to load, also set container_startup_health_check_timeout=<model_loading_timeout> on SageMaker. | Default: 1800 |
job_queue_size | No | Specifies the job queue size at the model level. The job queue is typically used to handle concurrent requests beyond what the backend model server can take; the model server queues up to <job_queue_size> requests. | Default: 1000 |
option.output_formatter | No | Only applies when option.rolling_batch is enabled. Defines the output format in which the model server sends back results. If no value is set, tokens are sent back as JSON. | json, jsonlines Default: json |
option.enable_streaming [Deprecated since 0.25.0] | No | Deprecated starting in 0.25.0. Enables response streaming for static batching. Use huggingface to enable HuggingFace-like streaming output. Rolling batch already streams tokens by default, so setting this value has no effect with rolling batch. | false , true , huggingface |
Advanced parameters: Dynamic Batching | |||
batch_size | No | Dynamic request-level batching. Commonly used in non-text inference to wait for requests to arrive and build a batch, which helps handle concurrent requests more efficiently. Dynamic batching cannot be used with rolling batching; they are different batching algorithms defined by the model server. | Default: 1 |
option.max_batch_delay | No | The maximum delay for batch aggregation in milliseconds. The server waits up to <max_batch_delay> milliseconds to collect up to <batch_size> requests before sending them to the model. | Default: 100 |
option.max_idle_time | No | The maximum idle time in seconds before the worker thread is scaled down | Default: 60 |
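As an illustration of how the common options fit together, a serving.properties for a rolling-batch deployment might look like the sketch below. The engine, model location, and batch settings are placeholders to adapt to your model, container, and hardware.

```
# Illustrative serving.properties using the common options (placeholder values)
engine=MPI
option.model_id=s3://<my-bucket>/<my-model>/
option.dtype=fp16
option.tensor_parallel_degree=max
option.rolling_batch=auto
option.max_rolling_batch_size=32
option.model_loading_timeout=1800
```

On SageMaker, remember to set container_startup_health_check_timeout to at least the value of option.model_loading_timeout.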
DeepSpeed (doc)
If you set the entryPoint to DeepSpeed or use the DeepSpeed engine, you will have access to the following parameters (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.task | No | The task used in Hugging Face for different pipelines. Default is text-generation | text-generation |
option.quantize | No | Specify this option to quantize your model using the supported quantization methods in DeepSpeed. SmoothQuant is our special offering to provide quantization with better quality | dynamic_int8 , smoothquant |
option.max_tokens | No | Total number of tokens (input and output) with which DeepSpeed can work. The number of output tokens is the difference between the total number of tokens and the number of input tokens. By default we set the value to 1024. If you are looking for long sequence generation, you may want to set this to a higher value (2048, 4096, ...). | 1024 |
Advanced parameters | |||
option.low_cpu_mem_usage | No | Reduce CPU memory usage when loading models. We recommend that you set this to True. | Default:true |
option.enable_cuda_graph | No | Activates capturing the CUDA graph of the forward pass to accelerate inference. | Default: false |
option.triangular_masking | No | Whether to use triangular masking for the attention mask. This is application or model specific. | Default: true |
option.return_tuple | No | Whether transformer layers need to return a tuple or a tensor. | Default: true |
option.training_mp_size | No | If the model was trained with DeepSpeed, this indicates the tensor parallelism degree with which the model was trained. Can be different than the tensor parallel degree desired for inference. | Default: 1 |
option.checkpoint | No | Path to DeepSpeed compatible checkpoint file. | ds_inference_checkpoint.json |
option.smoothquant_alpha | No | If smoothquant is provided in option.quantize, you can provide this alpha value. If not provided, DeepSpeed will choose one for you. | Any float value between 0 and 1 |
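For example, a DeepSpeed configuration that enables SmoothQuant might look like the following sketch; the model ID, tensor parallel degree, and alpha value are illustrative placeholders, not tuned recommendations.

```
# Illustrative DeepSpeed serving.properties (placeholder values)
engine=DeepSpeed
option.model_id=<hf-model-id-or-s3-url>
option.task=text-generation
option.tensor_parallel_degree=4
option.dtype=fp16
option.max_tokens=2048
option.quantize=smoothquant
option.smoothquant_alpha=0.65
```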
HuggingFace Accelerate (doc)
If you set the engine to Python in the DeepSpeed container and either do not specify rolling_batch or set rolling_batch=scheduler, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.task | No | The task used in Hugging Face for different pipelines. | text-generation |
option.low_cpu_mem_usage | No | Reduce CPU memory usage when loading models. We recommend that you set this to True. | TRUE |
option.quantize | No | Quantize the model with the supported quantization methods | bitsandbytes4 , bitsandbytes8 Default: None bitsandbytes4 is equivalent to load_in_4bit bitsandbytes8 is equivalent to load_in_8bit |
option.device_map | No | Enables fitting the model across multiple GPUs. | auto , balanced , balanced_low_0 , sequential . Default: auto if tensor_parallel_degree > 0 and CUDA devices are available |
Advanced parameters: Rolling batch scheduler parameters | |||
option.decoding_strategy | No | Specifies the decoding method among sample, greedy and contrastive search | sample, greedy, contrastive Default: greedy |
option.max_sparsity | No | Used in max-sparsity thresholding mechanism. It limits the max_sparsity in the token sequence caused by padding. | 0.01, 0.5 Default: 0.33 |
option.max_splits | No | Used in max-sparsity thresholding mechanism. It limits the max number of batch splits, where each split has its own inference call. | 1, 5 Default: 3 |
option.disable_flash_attn | No | Toggles whether HuggingFace flash attention is used. Note that HuggingFace flash attention can be affected by padding (see huggingface/transformers#26990 and https://huggingface.co/docs/transformers/perf_infer_gpu_one#expected-speedups). | Default: true |
option.load_in_4bit [Deprecated since 0.25.0] | No | Uses bitsandbytes quantization. Supported only on certain models. | Deprecated since 0.25.0, use option.quantize=bitsandbytes4 instead. Default: false |
option.load_in_8bit [Deprecated since 0.25.0] | No | Uses bitsandbytes quantization. Supported only on certain models. | Deprecated, use option.quantize=bitsandbytes8 instead. Default: false |
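A sketch of a HuggingFace Accelerate configuration using the scheduler rolling batch; the model ID and the quantization and decoding choices are placeholders.

```
# Illustrative HuggingFace Accelerate serving.properties (placeholder values)
engine=Python
option.model_id=<hf-model-id-or-s3-url>
option.task=text-generation
option.dtype=fp16
option.device_map=auto
option.rolling_batch=scheduler
option.decoding_strategy=greedy
option.quantize=bitsandbytes8
```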
If you set the engine to MPI and rolling_batch to auto or lmi-dist in the DeepSpeed container, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.quantize | No | Applies the specified quantization technique to the model. gptq quantization requires loading a GPTQ model. bitsandbytes is deprecated; use bitsandbytes8 instead (both options are the same). | bitsandbytes8, gptq Default: None |
Advanced parameters | |||
option.paged_attention | No | Whether to use PagedAttention. Enabled by default; disable this if you plan to run on G4 or older GPU architectures. | Default: true |
option.max_rolling_batch_prefill_tokens [Deprecated since 0.25.0] | No | Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. From 0.25.0 the best value is calculated for you, so this is no longer required. | Default: 1088 |
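An illustrative lmi-dist configuration following the engine/rolling_batch combination above; the model ID and tensor parallel degree are placeholders, and the gptq line only applies if you load a GPTQ model.

```
# Illustrative lmi-dist serving.properties (placeholder values)
engine=MPI
option.model_id=<gptq-model-id-or-s3-url>
option.tensor_parallel_degree=4
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
option.quantize=gptq
option.paged_attention=true
```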
If you set the engine to MPI and rolling_batch to vllm in the DeepSpeed container, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.quantize | No | Quantize the model with the supported quantization methods | awq Default: None |
option.max_rolling_batch_prefill_tokens | No | Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. If you don't set this, vLLM will try to find a suitable value. | Default: 1088 |
option.load_format | No | The checkpoint format of the model. The default is auto, which means bin/safetensors files will be used if found. | Default: auto |
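An illustrative vLLM configuration following the engine/rolling_batch combination above; the model ID is a placeholder, and the awq quantization line only applies if you load an AWQ model.

```
# Illustrative vLLM serving.properties (placeholder values)
engine=MPI
option.model_id=<awq-model-id-or-s3-url>
option.tensor_parallel_degree=4
option.rolling_batch=vllm
option.max_rolling_batch_size=32
option.quantize=awq
option.load_format=auto
```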
Transformers-NeuronX (doc)
If you are using the Neuron container with the engine set to Python, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.n_positions | No | Total sequence length, input sequence length + output sequence length. | Default: 128 |
option.load_in_8bit | No | Specify this option to quantize your model using the supported quantization methods in TransformerNeuronX | False , True Default: False |
Advanced parameters | |||
option.unroll | No | Unroll the model graph for compilation. With unroll=None the compiler has more opportunities to perform optimizations across the layers. | Default: None |
option.neuron_optimize_level | No | Neuron runtime compiler optimization level; determines the type of optimizations applied during compilation. Higher optimization levels take longer to compile but, in exchange, give better latency/throughput. By default the value is not set (optimization level 2), which balances compilation time and performance. | 1, 2, 3 Default: 2 |
option.context_length_estimate | No | Estimated context input length for Llama models. You can specify buckets of different sizes to increase KV cache reusability, which helps improve latency. | Example: 256,512,1024 (integers separated by commas if multiple values) Default: None |
option.low_cpu_mem_usage | No | Reduce CPU memory usage when loading models. | Default: False |
option.load_split_model | No | Toggle to True when using model artifacts that have already been split for neuron compilation/loading. | Default: False |
option.compiled_graph_path | No | Provide an s3 URI, or a local directory that stores the pre-compiled graph for your model (NEFF cache) to skip runtime compilation. | Default: None |
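An illustrative Transformers-NeuronX configuration; the model ID, n_positions, batch size, and compiled-graph location are placeholders. A longer model_loading_timeout may be needed because Neuron compilation can take a while.

```
# Illustrative Transformers-NeuronX serving.properties (placeholder values)
engine=Python
option.model_id=<hf-model-id-or-s3-url>
option.tensor_parallel_degree=2
option.n_positions=512
option.rolling_batch=auto
option.max_rolling_batch_size=8
option.model_loading_timeout=3600
option.compiled_graph_path=s3://<my-bucket>/neff-cache/
```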
If you specify the MPI engine in the TensorRT-LLM container, the following parameters will be accessible (an example configuration follows the table).
Item | Required | Description | Example value |
---|---|---|---|
option.max_input_len | No | Maximum input token size you expect the model to receive per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. | Default values: Llama 512, Falcon 1024 |
option.max_output_len | No | Maximum output token size you expect the model to produce per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default values: Llama 512, Falcon 1024 |
option.use_custom_all_reduce | No | The custom all-reduce kernel is used for GPUs that have NVLink enabled and can help speed up model inference through better communication. Turn this on (set to true) on P4d, P4de, P5, and other NVLink-connected GPUs. | true , false . Default is false |
Advanced parameters | |||
option.tokens_per_block | No | Tokens per block to be used in the paged attention algorithm. | Default value is 64 |
option.batch_scheduler_policy | No | Scheduler policy of the TensorRT-LLM batch manager. | max_utilization , guaranteed_no_evict Default value is max_utilization |
option.kv_cache_free_gpu_mem_fraction | No | Fraction of free GPU memory allocated for the KV cache. The larger the value, the more GPU memory the model will try to take over. More reserved memory allows a larger KV cache, which means longer input+output sequences or larger batch sizes. | Float between 0 and 1. Default is 0.95 |
option.max_num_sequences | No | Maximum number of input requests processed in the batch. If you don't set this, max_rolling_batch_size is used as the value. Generally you don't have to touch it unless you want the model compiled to a batch size different from the one the model server sets. | Integer greater than 0. Default value is the batch size set while building the TensorRT engine |
option.enable_trt_overlap | No | Overlaps the execution of batches of requests. It may have a negative impact on performance when the number of requests is too small; in our experiments, turning this on had more negative impact than leaving it off. | true , false . Default is false |
option.baichuan_model_version | No | Parameter exclusively for Baichuan models to specify the model version. You also need to specify the HF Baichuan checkpoint path. For v1_13b, use either baichuan-inc/Baichuan-13B-Chat or baichuan-inc/Baichuan-13B-Base. For v2_13b, use either baichuan-inc/Baichuan2-13B-Chat or baichuan-inc/Baichuan2-13B-Base. More Baichuan models can be found under baichuan-inc. | v1_7b , v1_13b , v2_7b , v2_13b . Default is v1_13b |
Advanced parameters: Quantization | |||
option.quantize | No | Currently only supports smoothquant for Llama models in just-in-time compilation mode. | smoothquant |
option.smoothquant_alpha | No | smoothquant alpha parameter | Default value is 0.8 |
option.smoothquant_per_token | No | Only applied when option.quantize is set to smoothquant. Enables choosing a custom SmoothQuant scaling factor for each token at run time. This is usually a little slower but more accurate. | true , false . Default is false |
option.smoothquant_per_channel | No | Only applied when option.quantize is set to smoothquant. Enables choosing a custom SmoothQuant scaling factor for each channel at run time. This is usually a little slower but more accurate. | true , false . Default is false |
option.multi_query_mode | No | Only needed when option.quantize is set to smoothquant. This should be set for models that use multi-query attention, e.g. llama-70b. | true , false . Default is false |
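An illustrative TensorRT-LLM configuration; the model ID, input/output length limits, and batch size are placeholders to match your workload, and use_custom_all_reduce is only worth enabling on NVLink-connected GPUs such as P4d/P5.

```
# Illustrative TensorRT-LLM serving.properties (placeholder values)
engine=MPI
option.model_id=<hf-model-id-or-s3-url>
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.max_rolling_batch_size=64
option.max_input_len=1024
option.max_output_len=512
option.use_custom_all_reduce=true
option.kv_cache_free_gpu_mem_fraction=0.9
```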
DJLServing provides a few aliases for the Python engine to make common LLM configurations easier.
`engine=DeepSpeed` is equivalent to:

```
engine=Python
option.mpi_mode=true
option.entryPoint=djl_python.deepspeed
```

`engine=MPI` is equivalent to:

```
engine=Python
option.mpi_mode=true
option.entryPoint=djl_python.huggingface
```