Zero loss values during finetuning #22

Open
mmderakhshani opened this issue Oct 15, 2024 · 3 comments

mmderakhshani commented Oct 15, 2024

Dear @xiaoachen98,

Thank you very much for releasing the code. I am running your fine-tuning script to replicate your LLaMA 3 results, but the loss goes to zero after the first iteration. Have you encountered this issue before? I am using 8 A6000 GPUs with a total batch size of 128 (8 GPUs × per-device batch size 2 × 8 gradient-accumulation steps).


#!/bin/bash
# Logging and dataset locations.
export WANDB_DIR="/nvmestore/mderakh/wandb_llava/"
export LLaVA_PATH="/nvmestore/mderakh/vlm_datasets/playground/"

export DATA_PATH=${LLaVA_PATH}/data/open-llava-next/open-llava-next_instruct_mix1M.json
export SAVE_PATH=llava-v1.6-8b_llama3-8b_clip-large-336_pretrain_lcs-558k_sft-mix1M_lr-mlp-2e-5-vit-2e-6-llm-2e-5

OUTPUT="/nvmestore/mderakh/LLaVA-Next/clip-vit-large-patch14-336-llama3-8b/"
OUT_RESULTS=$OUTPUT/results/
mkdir -p "$OUT_RESULTS"

# Base LR for the LLM and projector; a smaller LR for the vision tower.
export BASE_LR=2e-5
export VIT_LR=2e-6
# Effective batch size: 8 GPUs x 2 per device x 8 accumulation steps = 128.
DEVICE_BATCH_SIZE=2
GRADIENT_ACCU_STEPS=8

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path Lin-Chen/open-llava-next-llama3-8b \
    --version llava_llama_3 \
    --data_path ${DATA_PATH} \
    --image_folder ${LLaVA_PATH}/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter ${OUTPUT}/checkpoints/llava-v1.6-8b_llama3-8b_pretrain_lcs-558k_ft-mlp-lr-1e-3/mm_projector.bin \
    --unfreeze_mm_vision_tower True \
    --mm_vision_tower_lr ${VIT_LR} \
    --image_aspect_ratio anyres \
    --group_by_modality_length True \
    --mm_vision_select_layer -2 \
    --mm_vision_select_feature patch \
    --mm_patch_merge_type spatial_unpad \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ${OUTPUT}/checkpoints/${SAVE_PATH} \
    --num_train_epochs 1 \
    --per_device_train_batch_size ${DEVICE_BATCH_SIZE} \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps ${GRADIENT_ACCU_STEPS} \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 7975 \
    --save_total_limit 1 \
    --learning_rate ${BASE_LR} \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 6144 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --run_name ${SAVE_PATH}
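
One cause I want to rule out (an assumption on my side, not something I have confirmed) is that every target token in a batch is being masked to the ignore index, e.g. by a conversation-template/tokenizer mismatch for the llava_llama_3 version; in that case the reported loss would be meaningless. A minimal sanity check in Python, assuming the Hugging Face convention of -100 as the ignore index (the helper name and the dummy batch are purely illustrative):

import torch

IGNORE_INDEX = -100  # Hugging Face default ignore index for the cross-entropy loss

def check_batch_labels(labels: torch.Tensor) -> None:
    """Warn if a batch carries no trainable tokens (all labels masked)."""
    trainable = (labels != IGNORE_INDEX).sum().item()
    print(f"trainable tokens: {trainable}/{labels.numel()}")
    if trainable == 0:
        print("WARNING: all labels masked; the loss over this batch is degenerate.")

# Dummy batch: row 1 is fully masked, which is what a silent
# template/tokenizer mismatch would produce for every row.
labels = torch.full((2, 8), IGNORE_INDEX)
labels[0, 4:] = torch.tensor([11, 12, 13, 14])  # row 0 keeps 4 supervised tokens
check_batch_labels(labels)  # -> trainable tokens: 4/16

Running the same count on the first few real batches out of the data collator should show whether this is a labeling problem rather than an optimization problem.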
@StevenSmith2000

Hi, I have encountered the same problem. Have you solved it?

@mmderakhshani (Author)

Hi @StevenSmith2000, I could not solve it.

@Lauch1ng

Hi, have you solved it?
