TensorRT provider does not use the cache it created #18272

Open
BengtGustafsson opened this issue Nov 3, 2023 · 17 comments
Labels: ep:CUDA, ep:DML, ep:TensorRT, platform:windows

Comments

@BengtGustafsson
Contributor

Describe the issue

The symptom I see is this: if I delete the cache, it takes 9 seconds to regenerate the cache files. If I create more engines for the same model in the same process, each takes about 40 ms. But if I restart the program, it takes 9 seconds again, yet the cache files are not rewritten. This does not seem normal.
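One way to check these symptoms from the outside is to time session creation and compare the cache directory before and after. Below is a minimal Python sketch of that check, using only the standard library. The helper names (snapshot_cache, diff_snapshots, timed) are made up for illustration, and the actual session-creation call is left to the caller.

```python
import os
import time


def snapshot_cache(cache_dir):
    """Map each file name in cache_dir to its (size, mtime_ns)."""
    snap = {}
    for entry in os.scandir(cache_dir):
        if entry.is_file():
            st = entry.stat()
            snap[entry.name] = (st.st_size, st.st_mtime_ns)
    return snap


def diff_snapshots(before, after):
    """Return (added, rewritten, untouched) file names, each sorted."""
    added = sorted(n for n in after if n not in before)
    rewritten = sorted(n for n in after if n in before and after[n] != before[n])
    untouched = sorted(n for n in after if n in before and after[n] == before[n])
    return added, rewritten, untouched


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Intended use (create_session is a hypothetical callable that builds the
# session with the TensorRT provider):
#   before = snapshot_cache(cache_dir)
#   session, seconds = timed(create_session)
#   added, rewritten, untouched = diff_snapshots(before, snapshot_cache(cache_dir))
```

If a restart reports a creation time of several seconds while every .engine and .profile file ends up in the untouched list and nothing is added, that matches the behaviour described above: nothing is rebuilt on disk, yet startup is still slow.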

I turned on verbose logging and it does not show any errors. The log is attached below.

The only suspicious output is: "Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32." I didn't feel this was important because a) we definitely only have F32 weights, and b) calculations are subsequently carried out correctly. Maybe also: "[TensorRT EP] Model path is empty", which could possibly be related to us providing the ONNX data as a memory buffer to the Ort::Session constructor. I have no idea how the "hash" part of the .engine and .profile file names is constructed, but I would guess it comes from the modelData.
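Since the model is passed as a memory buffer, one thing that can be ruled out cheaply is that the buffer itself differs between runs. The sketch below (Python, standard library only) computes a digest of the bytes for logging. It is not how the TensorRT EP computes the hash in its file names, which nothing in this issue establishes, and model_digest is a hypothetical helper name.

```python
import hashlib


def model_digest(model_bytes):
    """Return the SHA-256 hex digest of the serialized ONNX model bytes."""
    return hashlib.sha256(model_bytes).hexdigest()


# Log this value next to the cache file names on every start. Identical
# digests across restarts mean the same bytes were handed to the session
# each time, so a changing file-name hash would have to come from elsewhere.
```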

The ONNX file I use has a network that accepts different input sizes (and generates corresponding output sizes). I only use it with one size in our current application. Does this still mean that without specifying trt_profile_min_shapes etc. the cache is not used? Even if that were the case, I don't think it would be consistent with the symptoms.
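For reference, this is roughly what specifying an explicit shape profile looks like. It is a Python sketch that only builds the provider options; the input name "input", the 1x1x512x512 size, and the cache directory are placeholders, not values taken from this model. Whether setting these options changes the caching behaviour described here is exactly the open question, so treat it as an experiment rather than a fix.

```python
def format_profile(shapes):
    """Render {"input": (1, 1, 512, 512)} as "input:1x1x512x512".

    Multiple inputs are joined with commas, which is the form the
    trt_profile_*_shapes options take.
    """
    return ",".join(
        f"{name}:{'x'.join(str(d) for d in dims)}" for name, dims in shapes.items()
    )


def trt_provider_options(cache_dir, min_shapes, opt_shapes, max_shapes):
    """Build a TensorRT EP options dict with caching and one explicit profile."""
    return {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
        "trt_profile_min_shapes": format_profile(min_shapes),
        "trt_profile_opt_shapes": format_profile(opt_shapes),
        "trt_profile_max_shapes": format_profile(max_shapes),
    }


# Hypothetical input name and size. With a single fixed size, min, opt and
# max are all the same shape.
fixed = {"input": (1, 1, 512, 512)}
options = trt_provider_options(r"C:\trt_cache", fixed, fixed, fixed)

# With onnxruntime installed, the dict would be passed like this:
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       model_bytes,
#       providers=[("TensorrtExecutionProvider", options), "CPUExecutionProvider"],
#   )
```

The same option keys are accepted as strings by the C++ API when the TensorRT provider options are set by key and value.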

Here is the log output:
core 1 consist of logical processors: 1 2
core 2 consist of logical processors: 3 4
core 3 consist of logical processors: 5 6
core 4 consist of logical processors: 7 8
core 5 consist of logical processors: 9 10
core 6 consist of logical processors: 11 12
core 7 consist of logical processors: 13 14
core 8 consist of logical processors: 15 16
core 9 consist of logical processors: 17
core 10 consist of logical processors: 18
core 11 consist of logical processors: 19
core 12 consist of logical processors: 20
core 13 consist of logical processors: 21
core 14 consist of logical processors: 22
core 15 consist of logical processors: 23
core 16 consist of logical processors: 24
[TensorRT EP] Getting all registered TRT plugins from TRT plugin registry ...
[TensorRT EP] CaskDeconvShaderWeightsTransformerPlugin, version : 1
[TensorRT EP] CaskConvShaderWeightsTransformerPlugin, version : 1
[TensorRT EP] CustomQKVToContextPluginDynamic, version : 3
[TensorRT EP] CustomQKVToContextPluginDynamic, version : 1
[TensorRT EP] CustomQKVToContextPluginDynamic, version : 2
[TensorRT EP] DisentangledAttention_TRT, version : 1
[TensorRT EP] CustomEmbLayerNormPluginDynamic, version : 1
[TensorRT EP] CustomEmbLayerNormPluginDynamic, version : 2
[TensorRT EP] CustomEmbLayerNormPluginDynamic, version : 3
[TensorRT EP] CustomFCPluginDynamic, version : 1
[TensorRT EP] CustomGeluPluginDynamic, version : 1
[TensorRT EP] GroupNormalizationPlugin, version : 1
[TensorRT EP] RnRes2Br1Br2c_TRT, version : 1
[TensorRT EP] RnRes2Br1Br2c_TRT, version : 2
[TensorRT EP] RnRes2Br2bBr2c_TRT, version : 1
[TensorRT EP] RnRes2Br2bBr2c_TRT, version : 2
[TensorRT EP] SingleStepLSTMPlugin, version : 1
[TensorRT EP] CustomSkipLayerNormPluginDynamic, version : 3
[TensorRT EP] CustomSkipLayerNormPluginDynamic, version : 4
[TensorRT EP] CustomSkipLayerNormPluginDynamic, version : 1
[TensorRT EP] CustomSkipLayerNormPluginDynamic, version : 2
[TensorRT EP] RnRes2FullFusion_TRT, version : 1
[TensorRT EP] BatchedNMSDynamic_TRT, version : 1
[TensorRT EP] BatchedNMS_TRT, version : 1
[TensorRT EP] BatchTilePlugin_TRT, version : 1
[TensorRT EP] Clip_TRT, version : 1
[TensorRT EP] CoordConvAC, version : 1
[TensorRT EP] CropAndResizeDynamic, version : 1
[TensorRT EP] CropAndResize, version : 1
[TensorRT EP] DecodeBbox3DPlugin, version : 1
[TensorRT EP] DetectionLayer_TRT, version : 1
[TensorRT EP] EfficientNMS_Explicit_TF_TRT, version : 1
[TensorRT EP] EfficientNMS_Implicit_TF_TRT, version : 1
[TensorRT EP] EfficientNMS_ONNX_TRT, version : 1
[TensorRT EP] EfficientNMS_TRT, version : 1
[TensorRT EP] FlattenConcat_TRT, version : 1
[TensorRT EP] GenerateDetection_TRT, version : 1
[TensorRT EP] GridAnchor_TRT, version : 1
[TensorRT EP] GridAnchorRect_TRT, version : 1
[TensorRT EP] InstanceNormalization_TRT, version : 1
[TensorRT EP] InstanceNormalization_TRT, version : 2
[TensorRT EP] LReLU_TRT, version : 1
[TensorRT EP] ModulatedDeformConv2d, version : 1
[TensorRT EP] MultilevelCropAndResize_TRT, version : 1
[TensorRT EP] MultilevelProposeROI_TRT, version : 1
[TensorRT EP] MultiscaleDeformableAttnPlugin_TRT, version : 1
[TensorRT EP] NMSDynamic_TRT, version : 1
[TensorRT EP] NMS_TRT, version : 1
[TensorRT EP] Normalize_TRT, version : 1
[TensorRT EP] PillarScatterPlugin, version : 1
[TensorRT EP] PriorBox_TRT, version : 1
[TensorRT EP] ProposalDynamic, version : 1
[TensorRT EP] ProposalLayer_TRT, version : 1
[TensorRT EP] Proposal, version : 1
[TensorRT EP] PyramidROIAlign_TRT, version : 1
[TensorRT EP] Region_TRT, version : 1
[TensorRT EP] Reorg_TRT, version : 1
[TensorRT EP] ResizeNearest_TRT, version : 1
[TensorRT EP] ROIAlign_TRT, version : 1
[TensorRT EP] RPROI_TRT, version : 1
[TensorRT EP] ScatterND, version : 1
[TensorRT EP] SpecialSlice_TRT, version : 1
[TensorRT EP] Split, version : 1
[TensorRT EP] VoxelGeneratorPlugin, version : 1
[TensorRT EP] Getting all registered TRT plugins from TRT plugin registry ...
[TensorRT EP] CaskDeconvShaderWeightsTransformerPlugin, version : 1
[TensorRT EP] CaskConvShaderWeightsTransformerPlugin, version : 1
[TensorRT EP] CustomQKVToContextPluginDynamic, version : 3
[TensorRT EP] CustomQKVToContextPluginDynamic, version : 1
[TensorRT EP] CustomQKVToContextPluginDynamic, version : 2
[TensorRT EP] DisentangledAttention_TRT, version : 1
[TensorRT EP] CustomEmbLayerNormPluginDynamic, version : 1
[TensorRT EP] CustomEmbLayerNormPluginDynamic, version : 2
[TensorRT EP] CustomEmbLayerNormPluginDynamic, version : 3
[TensorRT EP] CustomFCPluginDynamic, version : 1
[TensorRT EP] CustomGeluPluginDynamic, version : 1
[TensorRT EP] GroupNormalizationPlugin, version : 1
[TensorRT EP] RnRes2Br1Br2c_TRT, version : 1
[TensorRT EP] RnRes2Br1Br2c_TRT, version : 2
[TensorRT EP] RnRes2Br2bBr2c_TRT, version : 1
[TensorRT EP] RnRes2Br2bBr2c_TRT, version : 2
[TensorRT EP] SingleStepLSTMPlugin, version : 1
[TensorRT EP] CustomSkipLayerNormPluginDynamic, version : 3
[TensorRT EP] CustomSkipLayerNormPluginDynamic, version : 4
[TensorRT EP] CustomSkipLayerNormPluginDynamic, version : 1
[TensorRT EP] CustomSkipLayerNormPluginDynamic, version : 2
[TensorRT EP] RnRes2FullFusion_TRT, version : 1
[TensorRT EP] BatchedNMSDynamic_TRT, version : 1
[TensorRT EP] BatchedNMS_TRT, version : 1
[TensorRT EP] BatchTilePlugin_TRT, version : 1
[TensorRT EP] Clip_TRT, version : 1
[TensorRT EP] CoordConvAC, version : 1
[TensorRT EP] CropAndResizeDynamic, version : 1
[TensorRT EP] CropAndResize, version : 1
[TensorRT EP] DecodeBbox3DPlugin, version : 1
[TensorRT EP] DetectionLayer_TRT, version : 1
[TensorRT EP] EfficientNMS_Explicit_TF_TRT, version : 1
[TensorRT EP] EfficientNMS_Implicit_TF_TRT, version : 1
[TensorRT EP] EfficientNMS_ONNX_TRT, version : 1
[TensorRT EP] EfficientNMS_TRT, version : 1
[TensorRT EP] FlattenConcat_TRT, version : 1
[TensorRT EP] GenerateDetection_TRT, version : 1
[TensorRT EP] GridAnchor_TRT, version : 1
[TensorRT EP] GridAnchorRect_TRT, version : 1
[TensorRT EP] InstanceNormalization_TRT, version : 1
[TensorRT EP] InstanceNormalization_TRT, version : 2
[TensorRT EP] LReLU_TRT, version : 1
[TensorRT EP] ModulatedDeformConv2d, version : 1
[TensorRT EP] MultilevelCropAndResize_TRT, version : 1
[TensorRT EP] MultilevelProposeROI_TRT, version : 1
[TensorRT EP] MultiscaleDeformableAttnPlugin_TRT, version : 1
[TensorRT EP] NMSDynamic_TRT, version : 1
[TensorRT EP] NMS_TRT, version : 1
[TensorRT EP] Normalize_TRT, version : 1
[TensorRT EP] PillarScatterPlugin, version : 1
[TensorRT EP] PriorBox_TRT, version : 1
[TensorRT EP] ProposalDynamic, version : 1
[TensorRT EP] ProposalLayer_TRT, version : 1
[TensorRT EP] Proposal, version : 1
[TensorRT EP] PyramidROIAlign_TRT, version : 1
[TensorRT EP] Region_TRT, version : 1
[TensorRT EP] Reorg_TRT, version : 1
[TensorRT EP] ResizeNearest_TRT, version : 1
[TensorRT EP] ROIAlign_TRT, version : 1
[TensorRT EP] RPROI_TRT, version : 1
[TensorRT EP] ScatterND, version : 1
[TensorRT EP] SpecialSlice_TRT, version : 1
[TensorRT EP] Split, version : 1
[TensorRT EP] VoxelGeneratorPlugin, version : 1
Flush-to-zero and denormal-as-zero are off
Creating and using per session threadpools since use_per_session_threads_ is true
Dynamic block base set to 0
SetThreadAffinityMask done for thread: 55872, group_id: 0, mask: 12
SetThreadAffinityMask done for thread: 61448, group_id: 0, mask: 48
SetThreadAffinityMask done for thread: 85932, group_id: 0, mask: 192
SetThreadAffinityMask done for thread: 97056, group_id: 0, mask: 768
SetThreadAffinityMask done for thread: 90348, group_id: 0, mask: 12288
SetThreadAffinityMask done for thread: 91864, group_id: 0, mask: 65536
SetThreadAffinityMask done for thread: 50880, group_id: 0, mask: 3072
SetThreadAffinityMask done for thread: 77020, group_id: 0, mask: 262144SetThreadAffinityMask done for thread: 93328, group_id: 0, mask: 2097152
SetThreadAffinityMask done for thread: 70988, group_id: 0, mask: 131072
SetThreadAffinityMask done for thread: 45040, group_id: 0, mask: 49152SetThreadAffinityMask done for thread: 94268, group_id: 0, mask: 524288
SetThreadAffinityMask done for thread: 68884, group_id: 0, mask: 8388608

SetThreadAffinityMask done for thread: 97236, group_id: 0, mask: 4194304
SetThreadAffinityMask done for thread: 99776, group_id: 0, mask: 1048576

I manually split this into multiple lines, the output was one long line:

[TensorRT EP] TensorRT provider options: device_id: 0,
trt_max_partition_iterations: 1000,
trt_min_subgraph_size: 1,
trt_max_workspace_size: 1073741824,
trt_fp16_enable: 0,
trt_int8_enable: 0,
trt_int8_calibration_cache_name: ,
int8_calibration_cache_available: 0,
trt_int8_use_native_tensorrt_calibration_table: 0,
trt_dla_enable: 0,
trt_dla_core: 0,
trt_dump_subgraphs: 0,
trt_engine_cache_enable: 1,
trt_cache_path: C:\ProgramData\ContextVision\cvn_cache\2beba869a9eb439b3c12a6bb70db02bc7d92381d,
trt_engine_decryption_enable: 0,
trt_engine_decryption_lib_path: ,
trt_force_sequential_engine_build: 0,
trt_context_memory_sharing_enable: 0,
trt_layer_norm_fp32_fallback: 0,
trt_build_heuristics_enable: 0,
trt_sparsity_enable: 0,
trt_builder_optimization_level: 3,
trt_auxiliary_streams: -1,
trt_tactic_sources: ,
trt_profile_min_shapes: ,
trt_profile_max_shapes: ,
trt_profile_opt_shapes: ,
trt_cuda_graph_enable: 0
Initializing session.
Adding default CPU execution provider.
Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
Creating 21 bins of max chunk size 256 to 268435456
Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
Creating 21 bins of max chunk size 256 to 268435456
Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
Creating 21 bins of max chunk size 256 to 268435456
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer TransposeOptimizer modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
GraphTransformer Level1_RuleBasedTransformer modified: 0 with status: OK
GraphTransformer DoubleQDQPairsRemover modified: 0 with status: OK
GraphTransformer ConstantSharing modified: 1 with status: OK
GraphTransformer CommonSubexpressionElimination modified: 0 with status: OK
GraphTransformer ConstantFolding modified: 0 with status: OK
GraphTransformer MatMulAddFusion modified: 0 with status: OK
GraphTransformer ReshapeFusion modified: 0 with status: OK
GraphTransformer FreeDimensionOverrideTransformer modified: 0 with status: OK
GraphTransformer QDQPropagationTransformer modified: 0 with status: OK
GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
GraphTransformer RocmBlasAltImpl modified: 0 with status: OK
[TensorRT EP] Model path is empty
[2023-11-03 12:17:33 WARNING] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT EP] TensorRT subgraph MetaDef name TRTKernel_graph_torch-jit-export_6809225732474184598_0
[TensorRT EP] TensorRT subgraph MetaDef name TRTKernel_graph_torch-jit-export_6809225732474184598_0
[TensorRT EP] Whole graph will run on TensorRT execution provider
[2023-11-03 12:17:36 WARNING] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Removing initializer 'blocks.15.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.0.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.15.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.0.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.2.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.18.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.2.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.11.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.4.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.11.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.4.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.13.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.6.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.29.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.13.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.6.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.35.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.9.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.35.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.9.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.18.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.20.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.20.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.22.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.22.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.24.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.41.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.24.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.27.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.27.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.29.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.31.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.31.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.33.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.33.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.38.conv.weight'. It is no longer used by any node.
Removing initializer 'blocks.38.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.41.conv.bias'. It is no longer used by any node.
Removing initializer 'blocks.44.conv.weight'. It is no longer used by any node.
Removing initializer 'ortshared_1_1_4_0_token_8'. It is no longer used by any node.
GraphTransformer TransposeOptimizer_CPUExecutionProvider modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK
GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK
GraphTransformer GemmActivationFusion modified: 0 with status: OK
GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK
GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK
GraphTransformer ConvActivationFusion modified: 0 with status: OK
GraphTransformer GeluFusion modified: 0 with status: OK
GraphTransformer LayerNormFusion modified: 0 with status: OK
GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK
GraphTransformer AttentionFusion modified: 0 with status: OK
GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK
GraphTransformer GatherToSplitFusion modified: 0 with status: OK
GraphTransformer GatherToSliceFusion modified: 0 with status: OK
GraphTransformer MatmulTransposeFusion modified: 0 with status: OK
GraphTransformer BiasGeluFusion modified: 0 with status: OK
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
GraphTransformer FastGeluFusion modified: 0 with status: OK
GraphTransformer QuickGeluFusion modified: 0 with status: OK
GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK
GraphTransformer BiasDropoutFusion modified: 0 with status: OK
GraphTransformer MatMulScaleFusion modified: 0 with status: OK
GraphTransformer MatMulActivationFusion modified: 0 with status: OK
GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK
GraphTransformer NchwcTransformer modified: 0 with status: OK
GraphTransformer NhwcTransformer modified: 0 with status: OK
GraphTransformer ConvAddActivationFusion modified: 0 with status: OK
GraphTransformer RemoveDuplicateCastTransformer modified: 0 with status: OK
GraphTransformer CastFloat16Transformer modified: 0 with status: OK
GraphTransformer MemcpyTransformer modified: 0 with status: OK
Node placements
All nodes placed on [TensorrtExecutionProvider]. Number of nodes: 1
SaveMLValueNameIndexMapping
Done saving OrtValue mappings.
Use DeviceBasedPartition as default
Saving initialized tensors.
Done saving initialized tensors
Session successfully initialized.

To reproduce

Possibly you haven't noticed the timing and it is trivial to recreate. Possibly it has to do with the loaded network's settable sizes, which should be easy to test; if so, please document the rules for when caching can't work, and why. Possibly it has to do with the particulars of our network structure, but I strongly feel that the graph transformer should have complained in that case.

Urgency

Our deadline is getting frightfully close. When we started integrating ONNX Runtime in June of 2022 we had no idea it would take this much time. You must realize it is super hard to use when you don't know which hardware you're going to run on.

Now our deadline is in 24Q1 so I'm getting really nervous as the only thing we have up and running on GPU is Windows/DirectML. Our measurements using Julia integration show that we can quadruple that performance with TensorRT on NVIDIA but it is soooo hard to get anything working when you have to build from source.

I think this is mostly a documentation issue. For instance there is about 1 line of documentation per trt_ option, and even if I go back to the TensorRT documentation it turns out it is not a 1:1 mapping. This ultra-brief description uses lots of unexplained concepts, such as "profile" in this case, that you may have picked up from the TensorRT documentation but which would naturally be unknown to your users. The idea of ONNX Runtime is hopefully that you don't have to be an expert on the libraries underlying all providers, but this is basically the situation right now.

Platform

Windows

OS Version

11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.16

ONNX Runtime API

C++

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

TensorRT 6.8.1.6 on CUDA 11.8

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:DML issues related to the DirectML execution provider ep:TensorRT issues related to TensorRT execution provider platform:windows issues related to the Windows platform labels Nov 3, 2023
@jywu-msft
Member

if verbose logging is enabled there should be log lines related to engine serialization/deserialization

LOGS_DEFAULT(VERBOSE) << "[TensorRT EP] Serialized engine " + engine_cache_path;

LOGS_DEFAULT(VERBOSE) << "[TensorRT EP] DeSerialized " + engine_cache_path;

In your log output, I see trt_cache_path: C:\ProgramData\ContextVision\cvn_cache\2beba869a9eb439b3c12a6bb70db02bc7d92381d
is there a single engine file there? (your log output shows "[TensorRT EP] Whole graph will run on TensorRT execution provider")

can you provide more details on your use case. you mentioned "creating more engines" in the same process.
are you loading a single model/multiple models?
and are you feeding random inputs in with different shapes or the same shape each time?

@chilo-ms
Contributor

chilo-ms commented Nov 3, 2023

But if I restart the program it takes 9 seconds again but the cache files are not rewritten. This does not seem normal.

From your log, I assume your model has some "dynamic shape" inputs. In that case, TRT EP won't write out the engine cache during session initialization. In fact, TRT EP will check/build the engine and write out the engine cache only at inference time (e.g. when you call InferenceSession.Run()).

When TRT EP writes out the engine cache, you should see a log line about it.

Note: If your model only contains "static shape" input, then you can expect TRT EP writing out engine cache during session initialization.

The ONNX file I use has a network that accepts different input sizes (and generates corresponding output sizes). I only use it with one size in our current application. Does this still mean that without specifying trt_profile_min_shapes etc. the cache is not used?

The cache will still be used: its name is checked and, if the name matches, the cache is used at inference time.

If the ONNX model has "dynamic shape" inputs (that's exactly your case), you have two options:
(1) Provide trt_profile_min_shapes etc. so that TRT EP checks for the corresponding engine cache, or builds the engine, at session initialization time.
(2) Provide nothing. TRT EP will then work out the shape range at inference time and check for the corresponding engine cache or build the engine. Everything happens at inference time.

@jywu-msft
Member

From your log, I assume your model has some "dynamic shape" inputs. In that case, TRT EP won't write out the engine cache during session initialization. In fact, TRT EP will check/build the engine and write out the engine cache only at inference time (e.g. when you call InferenceSession.Run()).

When TRT EP writes out the engine cache, you should see a log line about it.

Note: If your model only contains "static shape" input, then you can expect TRT EP writing out engine cache during session initialization.

when there are dynamic shapes involved, the engine might need to be rebuilt at inference time if the shape falls outside the range the engine supports.
either use static shapes for the model (if dynamic shapes are not truly needed),
or do some warmup runs that cover the full range of possible shapes to generate the engine/shape profiles, and use those.
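
A minimal sketch of the warmup idea, assuming (as the comment above suggests) that the engine's shape range grows to cover every shape it has seen, so that running the smallest and the largest expected shape once is enough. The helper name `warmup_shapes` is hypothetical and not part of ONNX Runtime.

```python
from typing import Iterator, Sequence, Tuple


def warmup_shapes(min_shape: Sequence[int], max_shape: Sequence[int]) -> Iterator[Tuple[int, ...]]:
    """Yield the smallest and the largest shape of the range (once if they coincide)."""
    if len(min_shape) != len(max_shape):
        raise ValueError("min and max shapes must have the same rank")
    if any(lo > hi for lo, hi in zip(min_shape, max_shape)):
        raise ValueError("min shape exceeds max shape")
    yield tuple(min_shape)
    if tuple(min_shape) != tuple(max_shape):
        yield tuple(max_shape)


# Example range for a single-channel image input (sizes are made up).
for shape in warmup_shapes([1, 1, 64, 64], [1, 1, 1024, 1024]):
    print(shape)
```

Each yielded shape would be fed to one throwaway Run() call with dummy data before the application starts serving real inputs.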

@tianleiwu
Contributor

tianleiwu commented Nov 3, 2023

@BengtGustafsson

Here is an example for TRT EP:

session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
logger.info("creating TRT EP session for %s", onnx_path)
ort_session = ort.InferenceSession(
    onnx_path,
    session_options,
    providers=[
        ("TensorrtExecutionProvider", self.ort_trt_provider_options),
    ],
)
logger.info("created TRT EP session for %s", onnx_path)
device = torch.device("cuda", device_id)
super().__init__(ort_session, device, enable_cuda_graph)

def get_tensorrt_provider_options(self, input_profile, workspace_size, fp16, device_id, enable_cuda_graph):
    trt_ep_options = {
        "device_id": device_id,
        "trt_fp16_enable": fp16,
        "trt_engine_cache_enable": True,
        "trt_timing_cache_enable": True,
        "trt_detailed_build_log": True,
        "trt_engine_cache_path": self.engine_path,

A few things to try:

Disable ORT graph optimization since TRT has its own optimization. It might help when the whole graph is run by TRT.

session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

Use a different engine cache path for each profile. For example, a static shape with batch_size=1 and sequence_length=64 can use a path like b_1_s_64. If the input is dynamic, e.g. batch_size=1 to 4 and sequence_length=64 to 128, you can use b_1_4_s_64_128 as the cache path. When serving, you need to select a cache path that can serve your query. For example, for an input with batch_size=2 and sequence_length=128 you need to use the second cache:

"trt_engine_cache_path": "b_1_4_s_64_128", 

For dynamic shape input, set input profile in provider config like:

if input_profile:
    min_shapes = []
    max_shapes = []
    opt_shapes = []
    for name, profile in input_profile.items():
        assert isinstance(profile, list) and len(profile) == 3
        min_shape = profile[0]
        opt_shape = profile[1]
        max_shape = profile[2]
        assert len(min_shape) == len(opt_shape) and len(opt_shape) == len(max_shape)
        min_shapes.append(f"{name}:" + "x".join([str(x) for x in min_shape]))
        opt_shapes.append(f"{name}:" + "x".join([str(x) for x in opt_shape]))
        max_shapes.append(f"{name}:" + "x".join([str(x) for x in max_shape]))
    trt_ep_options["trt_profile_min_shapes"] = ",".join(min_shapes)
    trt_ep_options["trt_profile_max_shapes"] = ",".join(max_shapes)
    trt_ep_options["trt_profile_opt_shapes"] = ",".join(opt_shapes)
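
The second suggestion above (one cache directory per profile, picked at serving time) could be sketched roughly as follows. The helper names `cache_dir_name` and `pick_cache_dir` and the short dimension keys `b` and `s` are hypothetical; they only mirror the b_1_4_s_64_128 naming used in the example and are not ONNX Runtime APIs.

```python
from typing import Dict, List, Optional, Tuple

# (min, max) per named dimension, e.g. {"b": (1, 4), "s": (64, 128)}
Profile = Dict[str, Tuple[int, int]]


def cache_dir_name(profile: Profile) -> str:
    """Build a name like 'b_1_4_s_64_128'; a static dimension collapses to 'b_1'."""
    parts = []
    for dim, (lo, hi) in profile.items():
        parts.append(f"{dim}_{lo}" if lo == hi else f"{dim}_{lo}_{hi}")
    return "_".join(parts)


def pick_cache_dir(profiles: List[Profile], query: Dict[str, int]) -> Optional[str]:
    """Return the cache directory of the first profile whose ranges contain the query."""
    for profile in profiles:
        if all(lo <= query[dim] <= hi for dim, (lo, hi) in profile.items()):
            return cache_dir_name(profile)
    return None


profiles = [
    {"b": (1, 1), "s": (64, 64)},
    {"b": (1, 4), "s": (64, 128)},
]
print(pick_cache_dir(profiles, {"b": 2, "s": 128}))  # b_1_4_s_64_128
```

The returned name would then be passed as "trt_engine_cache_path" when creating the session for that query; a result of None means no existing cache covers the shape.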

@chilo-ms,
I saw a log line indicating that ORT graph optimization is being applied:
GraphTransformer SkipLayerNormFusion modified: 1 with status: OK
Could you take a look at whether the optimizer needs some change for TRT EP?

@BengtGustafsson
Contributor Author

Thanks for the information. It does however seem that it should have worked:

We have an ONNX model allowing dynamic shapes. However, we only use one shape every time. I redid everything from scratch:

First creation of Ort::Session: 12 seconds. No cache files written.

First Run call: 9 seconds. Cache files written.

Second Run call: 25 ms. (Same Ort::Session object).

Second Ort::Session creation in same process: 300 ms.

First Run call on this new Session: 17 ms.

Second program run:

First creation of Ort::Session: 9 seconds. Same log output from GraphTransformer etc. as first time around.

First Run call: 1 second. No updated timestamp on cache files, probably read back in.

Second Run call: 25 ms. (Same Ort::Session object).

Second Ort::Session creation in same process: 300 ms.

First Run call on this new Session: 17 ms.
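
For anyone wanting to collect comparable numbers, here is a minimal timing sketch. It is not the harness used for the figures above; `measure_startup` is a hypothetical helper that takes two callables so it runs without a GPU, and the stand-in lambdas would be replaced by real session construction and a real Run() call.

```python
import time
from typing import Any, Callable, Dict


def measure_startup(make_session: Callable[[], Any], run_once: Callable[[Any], Any]) -> Dict[str, float]:
    """Time session construction and the first two runs, in milliseconds."""
    timings: Dict[str, float] = {}
    start = time.perf_counter()
    session = make_session()
    timings["session_ctor_ms"] = (time.perf_counter() - start) * 1000.0
    for label in ("first_run_ms", "second_run_ms"):
        start = time.perf_counter()
        run_once(session)
        timings[label] = (time.perf_counter() - start) * 1000.0
    return timings


# Stand-ins so the sketch is runnable anywhere; swap in real session code.
timings = measure_startup(lambda: object(), lambda session: None)
for label, value in timings.items():
    print(f"{label}: {value:.3f}")
```

Running it once with an empty cache directory and once after a process restart separates the cost of session construction from the cost of the first Run().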

My interpretation of this is that, as the Session constructor doesn't know the actual sizes that will be used, it does not presume that it can abstain from doing its preparatory work just because there is one set of engine/profile files already. Personally I find it strange that you don't provide the actual sizes to the constructor. That would make things much easier, and as it stands we can't push all setup times back to construction time (without doing an extra, wasted run). This is very bad for our situation, where a medical examination is held up for at least one second, and 10 seconds the first time.

Maybe this can be fixed by setting those extra size parameters, but where is the exact format of those values specified in your documentation?
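
Going by the snippet posted earlier in this thread, each of the three options is a comma-separated list of name:d1xd2x... entries, one entry per model input. A small self-contained sketch of that format follows; `profile_options` is a hypothetical helper and "image" is a made-up input name, so treat the shapes as placeholders.

```python
from typing import Dict, List


def profile_options(input_profile: Dict[str, List[List[int]]]) -> Dict[str, str]:
    """Turn {input_name: [min, opt, max]} into the three trt_profile_* option strings."""
    keys = ("trt_profile_min_shapes", "trt_profile_opt_shapes", "trt_profile_max_shapes")
    parts: Dict[str, List[str]] = {key: [] for key in keys}
    for name, shapes in input_profile.items():
        if len(shapes) != 3 or len({len(s) for s in shapes}) != 1:
            raise ValueError(f"{name}: expected [min, opt, max] shapes of equal rank")
        for key, shape in zip(keys, shapes):
            parts[key].append(f"{name}:" + "x".join(str(d) for d in shape))
    return {key: ",".join(values) for key, values in parts.items()}


# "image" is a hypothetical input name; use the input names of your own model.
options = profile_options({"image": [[1, 1, 64, 64], [1, 1, 512, 512], [1, 1, 1024, 1024]]})
for key, value in options.items():
    print(f"{key} = {value}")
```

The resulting dictionary can be merged into the TensorRT provider options alongside the engine cache settings.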

@BengtGustafsson
Contributor Author

BengtGustafsson commented Nov 6, 2023

Re: GraphOptimizer: we have the problem that we don't know the hardware we're running on, or which DLLs are actually present. We could figure this out by checking for the presence of all required DLLs in the library loading path, but this is cumbersome as it includes all the dependent DLLs required by each provider DLL. So it is hard to turn off the GraphOptimizer before creating the Session, as the Session constructor could select a provider that actually benefits from the GraphOptimizer. I haven't found any description of the provider selection process, but at this point my guess is that the Session constructor will choose between those for which you have called AppendExecutionProvider_* and select the first one that works, and use CPU if none does. On 64-bit Windows our idea is to append TensorRT, MIGraphX, DirectML and maybe OpenVINO and let the Session ctor select the best one depending on available hardware and secondary DLLs. I noted that TensorRT also requires the CUDA provider DLL, which seems very strange, but as both are for NVIDIA maybe there is an undocumented dependency there (build.py does build it, so I guess it is "normal").

With this scenario it is hard to understand how I would go about turning off GraphOptimizer before knowing if it is needed. Also, as I can't find any API to ask a Session which provider it is using my plan is to look which dlls have been loaded into the process after creating the Session (to make sure that I'm not running CPU just because some DLL was missing). I have seen requests for such APIs but I haven't found any work towards implementing one.
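
As a point of comparison, the Python API exposes onnxruntime.get_available_providers() and InferenceSession.get_providers(); whether the C++ API used here offers an equivalent is not something this sketch assumes. The helpers below are hypothetical and work on plain lists of provider names, so they run without onnxruntime installed. Note that a provider being listed as available only means the build includes it, not that the hardware or its dependent DLLs are usable.

```python
from typing import Iterable, List, Sequence

# Preference order from the comment above, with CPU as the final fallback.
PREFERRED = (
    "TensorrtExecutionProvider",
    "MIGraphXExecutionProvider",
    "DmlExecutionProvider",
    "OpenVINOExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available: Iterable[str], preferred: Sequence[str] = PREFERRED) -> List[str]:
    """Keep the preferred providers that are available, in preference order."""
    available_set = set(available)
    return [name for name in preferred if name in available_set]


def uses_accelerator(active: Sequence[str]) -> bool:
    """True if the first active provider is something other than the CPU provider."""
    return bool(active) and active[0] != "CPUExecutionProvider"


# With onnxruntime installed, `available` would come from ort.get_available_providers()
# and `active` from session.get_providers(); literal lists keep this sketch standalone.
chosen = pick_providers(["DmlExecutionProvider", "CPUExecutionProvider"])
print(chosen)  # ['DmlExecutionProvider', 'CPUExecutionProvider']
print(uses_accelerator(chosen))  # True
```

Checking the active list after session creation is one way to detect a silent fallback to CPU without inspecting which DLLs were loaded into the process.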

@BengtGustafsson
Contributor Author

I set the GraphOptimizer parameter to disable all. It got faster, but not fast:

First program run with no cache files.

First session ctor: 6s
First Run: 8s
Second session ctor: 94 ms
First run on second Session object: 14 ms

Second program run with cache files.

First session ctor: 6s
First Run: 1s
Second session ctor: 163 ms
First run on second Session object: 14 ms

Here it is the 6 seconds to create the first Session in the second program run that is really troublesome, along with the problems remarked above.

@BengtGustafsson
Contributor Author

Ok, I found some info about the explicit shape range format. Maybe it was added recently, or maybe I was just blind. But while the min and max are rather trivial to understand the opt eludes me. We have a step size for which the most downsampled version of the input is still of integer size, but I would not call that opt, so what is it?

@jywu-msft
Member

the shape profile shouldn't play a role here since you are always using the same shaped input.
can you confirm that your trt_cache_path is always C:\ProgramData\ContextVision\cvn_cache\2beba869a9eb439b3c12a6bb70db02bc7d92381d?
or can that change? does each program set the same cache path? do you see the engine file being created there?

@jywu-msft
Member

the shape profile shouldn't play a role here since you are always using the same shaped input. can you confirm that your trt_cache_path is always C:\ProgramData\ContextVision\cvn_cache\2beba869a9eb439b3c12a6bb70db02bc7d92381d? or can that change? does each program set the same cache path? do you see the engine file being created there?

upon further investigation, we believe the 6s session creation overhead is not related to engine caching at all.
during initialization, we need to call nvinfer1::createInferBuilder() and that takes around 5500 ms. We had previously asked Nvidia about it and they said it's because it needs to load a bunch of dependency libraries.
thus, your observed behavior is expected.

We're working to optimize this further by enabling a direct engine load path (loading an onnx model which wraps a previously generated trt engine). Then we can bypass builder creation since we don't need to do any onnx->tensorrt conversions at runtime.

@BengtGustafsson
Contributor Author

BengtGustafsson commented Nov 7, 2023

Thanks for that effort, several seconds is probably going to be prohibitive for most of our customers as an added startup time. They complain about 700 ms for CUDA now...

I tried to use the main branch of onnxruntime to get the trt_profile_min_shapes etc. as they are not in the struct in the 1.16 version it seems. Or my fork wasn't up to date. Anyhow now I can't compile with CUDA 11.6 because of cuBLAS being too old, it doesn't seem to have CUDA_R_8F_E4M3 which is something related to F8 processing. Is there a way to know what the main branch compilation needs by way of dependency/tools versions? Is there a release note-to-be somewhere in the repo for instance?

@BengtGustafsson
Contributor Author

Ok. I had missed that there was a V2 setup struct. So even after I updated to main there were no members for min/opt/max.

So after I figured that out I was able to run again, and now the first Run() of the first Session created takes around 50 ms and succeeding runs about 10 ms. The first Session ctor is still 6-7 seconds, but that is at least less troublesome than a really long first Run time. It also seems consistent with your timing of 5.5 seconds. I do hope you can avoid this later!

@jywu-msft
Member

jywu-msft commented Nov 7, 2023

Thanks for that effort, several seconds is probably going to be prohibitive for most of our customers as an added startup time. They complain about 700 ms for CUDA now...

I tried to use the main branch of onnxruntime to get the trt_profile_min_shapes etc. as they are not in the struct in the 1.16 version it seems. Or my fork wasn't up to date. Anyhow now I can't compile with CUDA 11.6 because of cuBLAS being too old, it doesn't seem to have CUDA_R_8F_E4M3 which is something related to F8 processing. Is there a way to know what the main branch compilation needs by way of dependency/tools versions? Is there a release note-to-be somewhere in the repo for instance?

https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#requirements

Btw, I wanted to confirm, were you able to resolve your build hang issue #17991? What was the resolution?

@BengtGustafsson
Contributor Author

Thanks for asking, but 17991 is still not working. I'll write more info there.

@BengtGustafsson
Contributor Author

We've realized that we will probably have to implement an off-line TensorRT optimizer to reduce subsequent startup times, and we're still very interested in progress on what you described above as:

"We're working to optimize this further by enabling a direct engine load path (loading an onnx model which wraps a previously generated trt engine). Then we can bypass builder creation since we don't need to do any onnx->tensorrt conversions at runtime."

Any progress on this so far?

@jywu-msft
Member

We've realized that we will probably have to implement an off-line TensorRT optimizer to reduce subsequent startup times, and we're still very interested in progress on what you described above as:

"We're working to optimize this further by enabling a direct engine load path (loading an onnx model which wraps a previously generated trt engine). Then we can bypass builder creation since we don't need to do any onnx->tensorrt conversions at runtime."

Any progress on this so far?

yes, this is under development and being reviewed. see #18217
we're aiming to include it in the upcoming ORT 1.17 release (ETA: end of Jan 2024)

@BengtGustafsson
Contributor Author

We have now implemented our off-line TensorRT optimizer, except that we have no solution for the encryption of the resulting .engine files. The undocumented encryption solution you have is ridiculously easy to break. More details in #22496.

Our solution handles varying image sizes by allowing optimization for a set of max size steps. We want to be able to transfer the result of optimization to other computers with the same type of GPU. We have come up with this directory structure:

GPUModelName/ModelHash/MaxSize/engine files

The dismal observability / queryability of onnx runtime forces us to use this structure. Our software runs either in "optimize" mode, where a new directory is created if no MaxSize directory with the exact same size is found, or in "run" mode, where the closest (in pixel count) larger size is used. This way you can squeeze more performance out by optimizing in more and closer steps.
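
The "run" mode lookup described above might look roughly like this. It is a sketch under assumptions, not the implementation discussed here: `pick_max_size` is a hypothetical name, sizes are taken to be (width, height) pairs, and "larger" is read as "at least as large in both dimensions".

```python
from typing import Iterable, Optional, Tuple

Size = Tuple[int, int]  # (width, height)


def pick_max_size(optimized: Iterable[Size], requested: Size) -> Optional[Size]:
    """Return the optimized size with the fewest pixels that still fits the request."""
    width, height = requested
    fitting = [s for s in optimized if s[0] >= width and s[1] >= height]
    return min(fitting, key=lambda s: s[0] * s[1], default=None)


# Sizes for which a MaxSize directory already exists (made-up values).
optimized = [(512, 512), (1024, 1024), (2048, 2048)]
print(pick_max_size(optimized, (600, 400)))    # (1024, 1024)
print(pick_max_size(optimized, (4096, 4096)))  # None
```

A result of None corresponds to the case where "optimize" mode would have to create a new directory first.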

To play it safe our "model hash" contains also the onnxruntime and trt version as prescribed by you.
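
One way such a combined hash could be computed is sketched below. The choice of SHA-256, the field order, and the version strings are all assumptions made for illustration; nothing here reproduces how ONNX Runtime derives the hash in its own cache file names.

```python
import hashlib


def model_hash(model_bytes: bytes, ort_version: str, trt_version: str) -> str:
    """Hash the model blob together with the library versions used to build the engine."""
    digest = hashlib.sha256()
    for part in (ort_version.encode(), trt_version.encode(), model_bytes):
        # Length-prefix each field so that adjacent fields cannot run together
        # and produce the same byte stream for different inputs.
        digest.update(len(part).to_bytes(8, "little"))
        digest.update(part)
    return digest.hexdigest()


# Placeholder inputs; a real caller would pass the ONNX blob and the actual versions.
print(model_hash(b"fake model bytes", "1.17.0", "8.6.1"))
```

Because the versions are part of the digest, upgrading either library yields a new ModelHash directory and leaves stale engines untouched.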

The reason for needing the MaxSize directory level is that we can't ask onnx runtime for the sizes used for an optimization, even though they are probably stored in the .profile file. We always set the min size to the minimum size the model can be used for, and the optimization size to the same as the max size, which is encoded in the directory name. We also need this level because all the resulting "indirection" onnx files are mysteriously called _ctx.onnx, so we could not have more than one optimization in the same directory even if we could read the size from the .profile file to see which one we want to use. This is probably due to the fact that we provide the original onnx as a blob.
