open source 315e9f5ccd286e906d4c0d402fefbf2f69a1febe (NVIDIA#2033)

ampdot-io · Jul 26, 2024 · 93293aa
1 parent 5fa9436
Showing 97 changed files with 3,045 additions and 6,755 deletions.
diff --git a/benchmarks/python/README.md b/benchmarks/python/README.md
@@ -10,8 +10,6 @@ multiple GPUs or multiple nodes with multiple GPUs using the Python runtime.
 
 The benchmark implementation and entrypoint can be found in [`benchmarks/python/benchmark.py`](./benchmark.py). There are some other scripts in the directory:
 
-* [`benchmarks/python/allowed_configs.py`](./allowed_configs.py) to define configuration for each supported model.
-* [`benchmarks/python/build.py`](./build.py) to build supported models for benchmarking.
 * [`benchmarks/python/base_benchmark.py`](./base_benchmark.py) to implement the base class for benchmark.
 * [`benchmarks/python/gpt_benchmark.py`](./gpt_benchmark.py) to implement benchmark scripts for GPT and GPT-like(LLaMA/OPT/GPT-J/SmoothQuant-GPT) models.
 * [`benchmarks/python/bert_benchmark.py`](./bert_benchmark.py) to implement benchmark scripts for BERT models.
@@ -25,37 +23,29 @@ python benchmark.py -h
 ```
 
 ### 1. Single GPU benchmark
-Take GPT-350M as an example:
+Take LLaMA 7B as an example:
 ```
 python benchmark.py \
-    -m gpt_350m \
-    --mode plugin \
+    -m dec \
+    --engine_dir llama_7b \
     --batch_size "1;8;64" \
     --input_output_len "60,20;128,20"
 ```
 Expected outputs:
 ```
-[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_kv_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 1 input_length 60 output_length 20 gpu_peak_mem(gb) 4.2 build_time(s) 25.67 tokens_per_sec 483.54 percentile95(ms) 41.537 percentile99(ms) 42.102 latency(ms) 41.362 compute_cap sm80
-[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_kv_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 8 input_length 60 output_length 20 gpu_peak_mem(gb) 4.28 build_time(s) 25.67 tokens_per_sec 3477.28 percentile95(ms) 46.129 percentile99(ms) 46.276 latency(ms) 46.013 compute_cap sm80
-[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_kv_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 64 input_length 60 output_length 20 gpu_peak_mem(gb) 4.8 build_time(s) 25.67 tokens_per_sec 19698.07 percentile95(ms) 65.739 percentile99(ms) 65.906 latency(ms) 64.981 compute_cap sm80
+[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 170.77 percentile95(ms) 117.591 percentile99(ms) 124.262 latency(ms) 117.115 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 110.189 total_generated_tokens 19.0 generation_tokens_per_second 172.43
+[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 8 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 1478.55 percentile95(ms) 108.641 percentile99(ms) 109.546 latency(ms) 108.214 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 98.194 total_generated_tokens 152.0 generation_tokens_per_second 1547.951
+[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 64 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 8214.87 percentile95(ms) 156.748 percentile99(ms) 160.203 latency(ms) 155.815 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 111.078 total_generated_tokens 1216.0 generation_tokens_per_second 10947.303
 ...
 ```
 *Please note that the expected outputs is only for reference, specific performance numbers depend on the GPU you're using.*
 
 ### 2. Multi-GPU benchmark
-Take GPT-175B as an example:
+Take LLaMA 7B as an example:
 ```
-mpirun -n 8 python benchmark.py \
-    -m gpt_175b \
-    --mode plugin \
+mpirun -n 2 python benchmark.py \
+    -m dec \
+    --engine_dir llama_7b \
     --batch_size "1;8;64" \
     --input_output_len "60,20;128,20"
 ```
-
-Note: Building multi-GPU engines in parallel could be a heavy workload for the CPU system. Tuning `mpirun --map-by <XXX>` option on your system may achieve significant boost in build time, for example:
-```
-mpirun --map-by socket -n 8 python build.py \
-    --model gpt_175b \
-    --mode ootb \
-    --quantization fp8
-```
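
Reviewer note: the new `[BENCHMARK]` lines in the hunk above are flat runs of space-separated `key value` pairs, so they can be post-processed without any extra tooling. The sketch below is not part of the commit or of `benchmark.py`; `parse_benchmark_line` and `_coerce` are hypothetical helper names, and the parser assumes that neither keys nor values contain spaces, which holds for the sample lines shown in this diff.

```python
# Hypothetical helper, not part of the repository: turns one "[BENCHMARK]"
# line into a dict. Assumes keys and values never contain spaces.

PREFIX = "[BENCHMARK]"


def _coerce(raw):
    """Convert a raw token to None, int, or float when possible; else keep the string."""
    if raw == "None":
        return None
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw)
    except ValueError:
        return raw


def parse_benchmark_line(line):
    """Return a dict for a benchmark line, or None if the line has no prefix."""
    line = line.strip()
    if not line.startswith(PREFIX):
        return None
    tokens = line[len(PREFIX):].split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating key/value tokens")
    # Even positions are keys, odd positions are the matching values.
    return {key: _coerce(raw) for key, raw in zip(tokens[0::2], tokens[1::2])}


# Abbreviated sample built from fields that appear in the diff above.
sample = (
    "[BENCHMARK] model_name dec world_size 2 precision float16 batch_size 8 "
    "build_time(s) None tokens_per_sec 1478.55 latency(ms) 108.214 "
    "compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE"
)
record = parse_benchmark_line(sample)
print(record["batch_size"], record["tokens_per_sec"], record["build_time(s)"])
```

Lines without the `[BENCHMARK]` prefix return `None`, so the helper can be applied to a full log and the results filtered afterwards.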