
Using two 8xH100 nodes to train, encountered error: bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above. #1924

Open
michaellin99999 opened this issue Sep 23, 2024 · 8 comments
Labels
bug · waiting for reporter

Comments

@michaellin99999

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

This issue should not occur, as the H100 definitely supports bf16.

Current behaviour

Axolotl outputs the error: Value error, bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above.
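For reference, bf16 support can be double-checked directly on each node with a few lines of torch; this is a generic sanity check, nothing axolotl-specific:

import torch
# AMP bf16 needs compute capability >= 8.0 (Ampere); H100 reports 9.0.
print("cuda available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")
print("bf16 supported:", torch.cuda.is_bf16_supported())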

Steps to reproduce

Follow the multi-node setup described in https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/multi-node.qmd and launch training across the two nodes.

Config yaml

base_model: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
dataset_prepared_path:
val_set_size: 0.02
adapter: lora
lora_model_dir:
sequence_len: 1024
sample_packing: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./outputs/lora-out
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
gptq_groupsize:
s2_attention:
gptq_model_v1:
warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.1
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

No idea what is causing this issue.

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11.9

axolotl branch-commit

none

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@michaellin99999 michaellin99999 added the bug Something isn't working label Sep 23, 2024
@michaellin99999
Author

The same settings work fine in regular (single-node) training.

@michaellin99999
Author

Settings in the accelerate config (main node, rank 0):
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

@michaellin99999
Author

This is the accelerate config snippet for the second node (rank 1):
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 1
main_process_ip: 192.168.108.22
main_process_port: 5000
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
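For what it's worth, the rank-0 snippet above has num_machines: 1, num_processes: 8 and mixed_precision: fp16, while this second-node snippet has num_machines: 2, num_processes: 16, and the training yaml requests bf16. A throwaway comparison script along these lines (the file names below are placeholders for wherever the configs actually live) makes the differences easy to spot:

import yaml
# Placeholder paths: point these at the two accelerate configs and the axolotl yaml.
def load(path):
    with open(path) as f:
        return yaml.safe_load(f)
main_cfg = load("accelerate_main.yaml")      # rank-0 node
worker_cfg = load("accelerate_worker.yaml")  # rank-1 node
train_cfg = load("config.yaml")              # axolotl training yaml
for key in ("num_machines", "num_processes", "mixed_precision",
            "main_process_ip", "main_process_port"):
    print(f"{key}: main={main_cfg.get(key)} worker={worker_cfg.get(key)}")
# The training yaml asks for bf16, so mixed_precision: fp16 on either node is worth a second look.
print("axolotl bf16:", train_cfg.get("bf16"), "/ accelerate mixed_precision:", main_cfg.get("mixed_precision"))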

@winglian
Collaborator

I recommend not using the accelerate config and removing that file. axolotl handles much of that automatically. See https://axolotlai.substack.com/p/fine-tuning-llama-31b-waxolotl-on
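For reference, the default accelerate config usually lives in the Hugging Face cache; a quick way to check whether one is still being picked up, assuming the standard location and no HF_HOME override:

from pathlib import Path
# Standard default path written by `accelerate config` (assumes no HF_HOME override).
cfg = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
print(cfg, "exists:", cfg.exists())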

@michaellin99999
Author

OK, is it the accelerate config causing the issue?

@ehartford
Collaborator

Often, it is.

@michaellin99999
Author

We tried that and still hit the same issue. We also went through https://axolotlai.substack.com/p/fine-tuning-llama-31b-waxolotl-on, but that guide requires Axolotl Cloud, and I'm using my own two 8xH100 clusters. Are there any scripts that work?

@NanoCode012
Collaborator

@michaellin99999, hey!

From my understanding, those scripts should work on any system, since Lambda just provides bare compute. Can you let us know if you still get this issue and how we can help solve it?
