Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
This repo is based on TVM v0.12.0 and reuses some code from TenSet and TLP.
TLM has been integrated into Ansor, TVM (MetaSchedule), MindSpore's AKG, and AKG-MLIR.
- Build and install this repo following the guide.
- You can refer to my installation environment here.
- [Note] To avoid this issue, please remember to set the options below. If you are a PyTorch user, it is recommended to set

  ```cmake
  set(USE_LLVM "/path/to/llvm-config --link-static")
  set(HIDE_PRIVATE_SYMBOLS ON)
  ```

  to avoid potential symbol conflicts between the different LLVM versions used by TVM and PyTorch.
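
  If you are unsure which `llvm-config` to point `USE_LLVM` at, a quick check like the following may help (the path and version are system-specific, not part of this repo):

  ```bash
  # Locate llvm-config and confirm its version.
  which llvm-config
  llvm-config --version
  # Check that a statically linked configuration is available.
  llvm-config --link-static --libs core
  ```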
- TLM uses Hugging Face for training, so install the Python dependencies first:

  ```bash
  pip install -r requirements.txt
  ```
- You can download the tlm_dataset we have collected and put it in the corresponding path.

  ```text
  (py38) ➜  ~ tree tlm_dataset -L 2
  tlm_dataset
  ├── gen
  │   ├── dataset
  │   ├── gen_data
  │   └── utils.json
  └── meta
      ├── dataset
      ├── meta_data
      └── meta_utils.json

  7 directories, 2 files
  ```
To get started quickly, you need to download tlm_dataset. Here we take compiling `bert_base` as an example.

```bash
cd gen
```
- Optional. We have already trained TLM and saved it in `gen_data/v100_gen_best`. You can repeat the training with the following commands, which will overwrite `gen_data/v100_gen_best`. When executing `run_train_clm_best_v100.py`, you need to 1) adjust `CUDA_VISIBLE_DEVICES` and `--per_device_train_batch_size`, and 2) `apt install tmux`.

  ```bash
  python postprocess.py --target=nvidia/nvidia-v100
  python make_dataset.py \
      --for_type=for_gen_best_all \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/measure_records/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_best
  python run_train_clm_best_v100.py
  ```
- To generate tensor programs for `bert_base`, generate prompts first.

  ```bash
  python make_dataset.py \
      --for_type=for_gen_eval_sketch_only_bert \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_eval_only_bert \
      --keep_cnt=64
  ```
- Generate tensor programs for `bert_base`.

  ```bash
  CUDA_VISIBLE_DEVICES=3 python gen_state.py \
      --model_name_or_path=gen_data/clm_gen_best_v100 \
      --sketch_path=gen_data/v100_gen_eval_only_bert/0_merge.json \
      --save_path=gen_data/v100_gen_eval_only_bert/gen_eval.json \
      --allow_repeat=True \
      --target=nvidia/nvidia-v100 \
      --keep_cnt=32
  ```
- Measure the execution latency of the generated tensor programs. Ensure that the measurement hardware is not shared with other workloads to avoid inaccurate results.

  ```bash
  CUDA_VISIBLE_DEVICES=3 python measure_programs.py --batch-size=64 --target=nvidia/nvidia-v100 --to-measure-path=gen_data/v100_gen_eval_only_bert/gen_eval.json --measured-path=measured_only_bert.json
  ```
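
  To confirm that the GPU is not shared with other workloads, you can first check for running compute processes (GPU index 3 mirrors `CUDA_VISIBLE_DEVICES=3` above); this is only a suggested sanity check, not part of the repo's scripts:

  ```bash
  # List compute processes currently running on GPU 3; the list should be empty before measuring.
  nvidia-smi -i 3 --query-compute-apps=pid,process_name,used_memory --format=csv
  ```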
- Select the best-performing programs from the measured tensor programs to compile `bert_base` and measure its end-to-end execution latency. Again, ensure that the measurement hardware is not shared with other workloads to avoid inaccurate results.

  ```bash
  CUDA_VISIBLE_DEVICES=3 TLM_LOG_FILE=measured_only_bert.json python tune_relay.py --workload=bert_base --input-shape=\[1,128\] --target=nvidia/nvidia-v100 --backend=graph
  ```
```bash
cd gen
```
- Train TLM-base, taking the NVIDIA V100 as an example.
- Partition the workload into subgraphs and save the results to the path `dataset/network_info/v100`.

  ```bash
  python dump_network_info.py --target=nvidia/nvidia-v100
  ```

  For other hardware, `--target` can be found in `src/target/tag.cc`. For CPUs, it can be set to something like `--target="llvm -mcpu=core-avx2 -model=i7"`.
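
  If you want to double-check that a target tag (or a CPU target string like the one above) resolves in your TVM build, a quick check like this can help (not part of the repo's scripts):

  ```bash
  # Expand a target tag into its full attribute string; this raises an error if the tag is unknown.
  python -c 'import tvm; print(tvm.target.Target("nvidia/nvidia-v100"))'
  # The same works for explicit CPU target strings.
  python -c 'import tvm; print(tvm.target.Target("llvm -mcpu=core-avx2 -model=i7"))'
  ```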
- Dump tensor programs for those subgraphs. The resulting tensor programs have no measured execution latency yet and serve as unlabeled data. They will be saved to the path `dataset/to_measure_programs/v100`.

  ```bash
  python dump_programs.py --target=nvidia/nvidia-v100
  ```
- Use the unlabeled data to build a vocabulary and a tokenizer, and save them to `--tokenizer_path`.

  ```bash
  python make_dataset.py \
      --for_type=for_gen_tokenizer \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100
  ```
- Use the tokenizer to build the TLM-base pre-training dataset and save it to `--save_path`.

  ```bash
  python make_dataset.py \
      --for_type=for_gen \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_2154
  ```
- Pre-train TLM-base. Adjust parameters such as `batch_size` in run_train_clm.py according to the GPU memory size. Requires `apt install tmux`.

  ```bash
  python run_train_clm.py
  ```
- Train the TLM using iterative optimization. We provide two methods: a one-click script and step-by-step commands.

A. The one-click script. The script uses a pipeline system and requires two machines; both machines need to clone the TLM repository. The two machines communicate and exchange data using ssh and rsync.
- On the training machine: 1) configure passwordless ssh login and then configure the target machine in `~/.ssh/config`; 2) set `device_id_all` in run.py to specify the GPU card IDs that can be used to train TLM; 3) set `ssh_target` in run.py.

  ```bash
  python run.py --target=nvidia/nvidia-v100 --for_type=for_finetuning --finetuning_init=True
  ```
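
  As a reference, the passwordless login and `~/.ssh/config` entry can be set up roughly as follows; the alias `measure-node`, the user name, and the address are placeholders, not values required by run.py:

  ```bash
  # Generate a key pair (skip if one already exists) and install it on the measurement machine.
  ssh-keygen -t ed25519
  ssh-copy-id user@192.0.2.10

  # Add an alias for the measurement machine; the same name can then be used for ssh_target in run.py.
  printf 'Host measure-node\n    HostName 192.0.2.10\n    User user\n' >> ~/.ssh/config

  # Confirm that login now works without a password prompt.
  ssh measure-node hostname
  ```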
- On the measurement machine, configure `available_ids` to specify the GPU card IDs that can be used for measurement.

  ```bash
  python run_measure.py
  ```
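
  To see which GPU indices exist on the measurement machine before filling in `available_ids`, you can list them with:

  ```bash
  # Print every visible GPU with its index, name, and UUID.
  nvidia-smi -L
  ```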
B. Step-by-step commands.
- Before using TLM-base/TLM to generate tensor programs, generate prompts first.

  ```bash
  python make_dataset.py \
      --for_type=for_gen_train_sketch \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_train \
      --keep_cnt=48 \
      --test_file_idx=0
  ```
- Use TLM-base/TLM to generate tensor programs; `--model_name_or_path` specifies whether TLM-base or TLM is used.

  ```bash
  CUDA_VISIBLE_DEVICES=0,1,2,3 python gen_state.py \
      --target=nvidia/nvidia-v100 \
      --model_name_or_path=gen_data/clm_gen_v100/checkpoint-24000 \
      --sketch_path=gen_data/v100_gen_train/0_merge.json \
      --save_path=gen_data/v100_gen_train/gen_train.json \
      --allow_repeat=True \
      --keep_cnt=16
  ```
- Measure the execution latency of the generated tensor programs and manually update the path of the measurement results in the `utils.json` file. There will be many errors in the initial measurement data of the iterative optimization; these errors are normal and will gradually decrease as the iterations proceed.

  ```bash
  CUDA_VISIBLE_DEVICES=3 python measure_programs.py --batch-size=64 --target=nvidia/nvidia-v100 --to-measure-path=gen_data/v100_gen_train/gen_train.json --measured-path=gen_data/measure_data_v100/finetuning_0.json
  ```
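
  To get a rough sense of how many measurements failed, a small check like the one below can be run over the measured file; it assumes the output follows Ansor's JSON-lines measure-record format, where the second element of the "r" field is the error code (0 means success):

  ```bash
  python - <<'EOF'
  import json

  path = "gen_data/measure_data_v100/finetuning_0.json"  # measured file from the command above
  total = errors = 0
  with open(path) as f:
      for line in f:
          line = line.strip()
          if not line:
              continue
          record = json.loads(line)
          errors += int(record["r"][1] != 0)  # non-zero error code => failed measurement
          total += 1
  print(f"{errors}/{total} measured records reported an error")
  EOF
  ```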
- Organize the measured programs into `dataset/measure_records/v100`.

  ```bash
  python postprocess.py --target=nvidia/nvidia-v100
  ```
- Build an SFT dataset.

  ```bash
  python make_dataset.py \
      --for_type=for_gen_best \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/measure_records/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_best
  ```
- SFT TLM-base.

  ```bash
  python run_train_clm_best_v100.py
  ```
- Evaluation on the target workload.
- Generate prompts.

  ```bash
  python make_dataset.py \
      --for_type=for_gen_eval_sketch \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_eval \
      --keep_cnt=64
  ```
- Generate tensor programs.

  ```bash
  CUDA_VISIBLE_DEVICES=4 python gen_state.py \
      --model_name_or_path=gen_data/clm_gen_best_v100 \
      --sketch_path=gen_data/v100_gen_eval/0_merge.json \
      --save_path=gen_data/v100_gen_eval/gen_eval.json \
      --allow_repeat=True \
      --target=nvidia/nvidia-v100 \
      --keep_cnt=32
  ```
- Measure the execution latency of the generated tensor programs.

  ```bash
  CUDA_VISIBLE_DEVICES=3 python measure_programs.py --batch-size=64 --target=nvidia/nvidia-v100 --to-measure-path=gen_data/v100_gen_eval/gen_eval.json --measured-path=gen_data/measure_data_v100/0_test_3.json
  ```
- Use the script below to analyze the speedups.

  ```bash
  python speedup_eval.py --target=nvidia/nvidia-v100 --for_test=True
  ```
- When the tuning budget is ample, we continue to optimize TLM using the target workload data. There are again two ways.

A. The one-click script.
- On the training machine.

  ```bash
  python run.py --target=nvidia/nvidia-v100 --for_type=for_testtuning --testtuning_init=True
  ```
- On the measurement machine.

  ```bash
  python run_measure.py
  ```

B. Step-by-step commands.
- Generate prompts.

  ```bash
  python make_dataset.py \
      --for_type=for_gen_evaltuning_sketch \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_evaltuning \
      --keep_cnt=64
  ```
- Generate tensor programs.

  ```bash
  CUDA_VISIBLE_DEVICES=0,1,2,3 python gen_state.py \
      --model_name_or_path=gen_data/clm_gen_best_v100 \
      --sketch_path=gen_data/v100_gen_evaltuning/0_merge.json \
      --save_path=gen_data/v100_gen_evaltuning/gen_eval.json \
      --allow_repeat=True \
      --target=nvidia/nvidia-v100 \
      --keep_cnt=32
  ```
- Measure the execution latency of the generated tensor programs and manually update the path of the measurement results in the `utils.json` file.

  ```bash
  CUDA_VISIBLE_DEVICES=3 python measure_programs.py --batch-size=64 --target=nvidia/nvidia-v100 --to-measure-path=gen_data/v100_gen_evaltuning/gen_eval.json --measured-path=gen_data/measure_data_v100/testtuning_0.json
  ```
- Organize the measured programs into `dataset/measure_records/v100`.

  ```bash
  python postprocess.py --target=nvidia/nvidia-v100
  ```
- Build an SFT dataset.

  ```bash
  python make_dataset.py \
      --for_type=for_gen_best_all \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/measure_records/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_best
  ```
- SFT TLM-base.

  ```bash
  python run_train_clm_best_v100.py
  ```
- Not every task has the same optimization space. We use the task scheduler to allocate the tuning budget.

  ```bash
  python task_sheduler.py --target=nvidia/nvidia-v100 --for_testtuning=True
  ```
```bash
cd meta
```

- Similar to TLM-Ansor, the command lines can be found in run.sh and run.py.
TLM is licensed under the Apache-2.0 license.