Wrong result when using lora on multi gpus #2589

Closed
ShuaiShao93 opened this issue Dec 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments


ShuaiShao93 commented Dec 18, 2024

System Info

x86_64, debian 11, 8 A100 GPUs

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Start 2 VMs, one with 1 A100, the other one with 8 A100s
  2. git clone Llama-3.2-3B-Instruct and arbitrary fp16 LoRA weights
  3. Install trtllm 0.15.0
  4. On the 8-GPU VM, run the commands below. On the 1-GPU VM, remove --tp_size=8
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir ./Llama-3.2-3B-Instruct --output_dir ./tllm_3b_checkpoint_8gpu_fp16 --dtype float16 --tp_size=8

trtllm-build --checkpoint_dir ./tllm_3b_checkpoint_8gpu_fp16 --output_dir ./tmp/llama/3B/trt_engines/fp16/8-gpu  --gpt_attention_plugin auto  --gemm_plugin auto --max_num_tokens 128000 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=paged --lora_plugin auto  --lora_dir llama-3.2-3b-ins-finetuned-lora-weights
  5. On the 8-GPU VM, run this command. On the 1-GPU VM, remove mpirun -n 8
mpirun -n 8 python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/3B/trt_engines/fp16/8-gpu --max_output_len 1 --max_input_length=100000 --run_profiling --tokenizer_dir ./Llama-3.2-3B-Instruct --input_file input.txt --lora_dir llama-3.2-3b-ins-finetuned-lora-weights --lora_task_uids 0 --output_logits_npy test.npy && python3 -c "import numpy; print(numpy.load('test_generation.npy'))"
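
For reference, the single-GPU variants of the same commands look roughly like this (a sketch; the directory names tllm_3b_checkpoint_1gpu_fp16 and 1-gpu are my own placeholders, and the only real differences from the 8-GPU commands are the dropped --tp_size=8 and mpirun -n 8):

python3 TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir ./Llama-3.2-3B-Instruct --output_dir ./tllm_3b_checkpoint_1gpu_fp16 --dtype float16

trtllm-build --checkpoint_dir ./tllm_3b_checkpoint_1gpu_fp16 --output_dir ./tmp/llama/3B/trt_engines/fp16/1-gpu --gpt_attention_plugin auto --gemm_plugin auto --max_num_tokens 128000 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=paged --lora_plugin auto --lora_dir llama-3.2-3b-ins-finetuned-lora-weights

python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/3B/trt_engines/fp16/1-gpu --max_output_len 1 --max_input_length=100000 --run_profiling --tokenizer_dir ./Llama-3.2-3B-Instruct --input_file input.txt --lora_dir llama-3.2-3b-ins-finetuned-lora-weights --lora_task_uids 0 --output_logits_npy test.npy && python3 -c "import numpy; print(numpy.load('test_generation.npy'))"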

Expected behavior

On both VMs, the output text and the generation logits should be identical
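
A minimal sketch of the check, assuming the test_generation.npy files from the two VMs are copied to one machine and renamed (single_gpu_generation.npy and multi_gpu_generation.npy are placeholder names):

# Sketch: compare the generation logits saved by run.py on the two VMs.
# File names below are placeholders; run.py writes test_generation.npy on each machine.
import numpy as np

single = np.load("single_gpu_generation.npy")  # copied from the 1-GPU VM
multi = np.load("multi_gpu_generation.npy")    # copied from the 8-GPU VM

print("shapes:", single.shape, multi.shape)
print("max abs diff:", np.abs(single - multi).max())
print("allclose:", np.allclose(single, multi, rtol=1e-3, atol=1e-3))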

Actual behavior

Neither the output text nor the generation logits are identical, and the output text from the multi-GPU instance is complete garbage if we set max_output_len to a higher value.

Additional notes

If we disable LoRA, both the texts and the logits are identical.

ShuaiShao93 added the bug label Dec 18, 2024
@ShuaiShao93
Author

@nv-guomingz this is a very serious bug, can you help triage it? Thanks!

@ShuaiShao93
Author

OK, I think this only happens when we export the LoRA weights in a way where auto_mapping is not null and task_type is not CAUSAL_LM. My adapter_config.json is:

{
  "alpha_pattern": {},
  "auto_mapping": {
    "base_model_class": "LlamaForCausalLM",
    "parent_library": "transformers.models.llama.modeling_llama"
  },
  "base_model_name_or_path": "meta-llama/Llama-3.2-3B-Instruct",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 16,
  "lora_dropout": 0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "down_proj",
    "v_proj",
    "gate_proj",
    "q_proj",
    "up_proj",
    "k_proj",
    "o_proj"
  ],
  "task_type": null,
  "use_dora": false,
  "use_rslora": false
}

When auto_mapping is null and task_type is CAUSAL_LM, the output is correct.
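
A minimal workaround sketch, assuming it is acceptable to edit the exported adapter_config.json in place before running trtllm-build (the path below is a placeholder for the --lora_dir used above):

# Sketch: patch adapter_config.json so the LoRA dir passed to trtllm-build
# has task_type CAUSAL_LM and a null auto_mapping.
import json

path = "llama-3.2-3b-ins-finetuned-lora-weights/adapter_config.json"  # placeholder path
with open(path) as f:
    cfg = json.load(f)

cfg["task_type"] = "CAUSAL_LM"
cfg["auto_mapping"] = None  # serialized as null

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)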

@ShuaiShao93
Author

Note that this only affects trtllm-build, not run.py. This means that when we run trtllm-build, the --lora_dir must have task_type: CAUSAL_LM in its adapter_config.json.

@ShuaiShao93
Author

ShuaiShao93 commented Dec 27, 2024

Actually, even with task_type: CAUSAL_LM, it no longer outputs garbage, but the outputs are still different from those on a single GPU.

@ShuaiShao93
Author

The interesting thing is that if we add --use_py_session to TensorRT-LLM/examples/run.py, this bug doesn't happen. So I believe it's an inconsistency between the C++ and Python runners.
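
For reference, the workaround is just the repro command from above with the extra flag, everything else unchanged:

mpirun -n 8 python3 TensorRT-LLM/examples/run.py --use_py_session --engine_dir=./tmp/llama/3B/trt_engines/fp16/8-gpu --max_output_len 1 --max_input_length=100000 --run_profiling --tokenizer_dir ./Llama-3.2-3B-Instruct --input_file input.txt --lora_dir llama-3.2-3b-ins-finetuned-lora-weights --lora_task_uids 0 --output_logits_npy test.npy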

@ShuaiShao93
Author

OK, I have a good repro comparing the Python and C++ runners. Let me file a separate ticket with a newer version and a better repro. Closing this one for now.
