Mlflow duplicate logging #2063

Open
6 of 8 tasks
jsh2581 opened this issue Nov 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

jsh2581 commented Nov 15, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

One log entry per step.

Current behaviour

Duplicated log entries within a single step.

[Screenshot: MLflow metric log showing duplicated entries for the same step]

Steps to reproduce

  1. Pull the docker image: winglian/axolotl:main-20241030-py3.11-cu124-2.4.1
  2. Set up mlflow (ghcr.io/mlflow/mlflow:v2.17.2)
  3. Run the axolotl docker container
  4. Prepare the dataset, base model, and training config file
  5. Run accelerate launch -m axolotl.cli.train my_config.yml
  6. Go to the mlflow logging dir
  7. Check the log file (a programmatic check is sketched below)
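
A minimal sketch for checking this programmatically, assuming the tracking URI and experiment name from the config below; the metric key "train/loss" is a guess, since the real key names depend on what the trainer reports:

# Count how many values MLflow stored per step for one metric.
from collections import Counter

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow-server:5000")
experiment = client.get_experiment_by_name("llama-3B")
run = client.search_runs([experiment.experiment_id], max_results=1)[0]

history = client.get_metric_history(run.info.run_id, "train/loss")
per_step = Counter(m.step for m in history)

# With the expected behaviour every step appears once; the bug shows up as count > 1.
for step, count in sorted(per_step.items()):
    if count > 1:
        print(f"step {step}: {count} entries")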

Config yaml

base_model: meta-llama/Llama-3.2-3B
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: false

strict: false
chat_template:
output_dir: /workspace/axolotl/3_model/pretraining
skip_prepare_dataset: true
datasets:
  - path: /workspace/axolotl/2_data/dataset-tokenized-8k/train
    split: train
    type:

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: false

# mlflow configuration if you're using it
mlflow_tracking_uri: http://mlflow-server:5000
mlflow_experiment_name: llama-3B
mlflow_run_name: llama-3B

gradient_accumulation_steps: 1
micro_batch_size: 2
# num_epochs: 1
# max_steps: 200000
optimizer: adamw_torch
lr_scheduler: cosine
lr_scheduler_kwargs:
cosine_min_lr_ratio: 1e-3

learning_rate: 1e-5

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
#flash_attention: true

warmup_steps: 20000
# evals_per_epoch: 2
eval_table_size:

save_steps: 40000
debug:
deepspeed:
weight_decay: 0.0
fsdp:
#   - full_shard
#   - auto_wrap
# fsdp_config:
#   fsdp_limit_all_gathers: true
#   fsdp_sync_module_states: true
#   fsdp_offload_params: false
#   fsdp_use_orig_params: false
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
#   fsdp_state_dict_type: FULL_STATE_DICT
#   fsdp_sharding_strategy: FULL_SHARD
#   fsdp_backward_prefetch: BACKWARD_PRE
special_tokens:
  pad_token: <|end_of_text|>
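
For reference, this is what the duplication looks like at the MLflow level (a minimal sketch of the symptom only, not a claim about the root cause in axolotl): MLflow keeps every value logged for a metric, so if two reporters log the same metric at the same step the history contains repeated steps.

# Minimal illustration of the symptom: logging a metric twice for the same
# step (e.g. from two callbacks) leaves two entries in the metric history.
# A local file-based tracking URI is used here purely for the demo.
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("file:./mlruns-demo")

with mlflow.start_run() as run:
    mlflow.log_metric("train/loss", 1.23, step=1)
    mlflow.log_metric("train/loss", 1.23, step=1)  # second reporter logging the same step

history = MlflowClient("file:./mlruns-demo").get_metric_history(run.info.run_id, "train/loss")
print([(m.step, m.value) for m in history])  # -> [(1, 1.23), (1, 1.23)]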

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main/8c3a727f9d60ffd3af385f90bcc3fa3a56398fe1

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
jsh2581 added the bug label Nov 15, 2024
NanoCode012 (Collaborator) commented

cc @awhazell, have you seen any duplicate logging to mlflow recently?
