
fix: Deprecation Warnings in AutoCast API #113

Merged: 7 commits into foundation-model-stack:main on Dec 2, 2024

Conversation

@Abhishek-TAMU (Collaborator) commented Nov 27, 2024

This PR fixes issue: #107

Changes:
Modified the autocast decorators on the autograd function to @torch.amp.custom_fwd and @torch.amp.custom_bwd, which replace the deprecated torch.cuda.amp.custom_fwd and torch.cuda.amp.custom_bwd.
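
For context, a minimal sketch of what the change looks like on a torch.autograd.Function. This is not the repository's actual kernel code: ScaleByTwo is a hypothetical stand-in, and device_type="cuda" is assumed (the torch.amp decorators take an explicit device_type, unlike the deprecated torch.cuda.amp variants).

    # Minimal illustration of the decorator change; ScaleByTwo is a
    # hypothetical stand-in for the fused-ops autograd functions in this repo.
    import torch

    class ScaleByTwo(torch.autograd.Function):
        @staticmethod
        @torch.amp.custom_fwd(device_type="cuda")  # was: @torch.cuda.amp.custom_fwd (deprecated)
        def forward(ctx, x):
            return x * 2

        @staticmethod
        @torch.amp.custom_bwd(device_type="cuda")  # was: @torch.cuda.amp.custom_bwd (deprecated)
        def backward(ctx, grad_output):
            return grad_output * 2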

@fabianlim (Contributor) commented Nov 27, 2024

@Abhishek-TAMU it looks good in general.

I think it needs a lint pass; try tox -e fmt,lint in the accelerated-peft and fused-ops-and-kernels folders.

It would also be good to run a local bench test to see whether there is any regression in performance (I highly doubt there is).

To do this, run tox -e run-benches and then:

    PYTHONPATH=. \
    python scripts/compare_with_reference.py \
        --result_dir $RESULT_DIR \
        --reference_benchmark_filepath scripts/benchmarks/refs/a100_80gb.csv \
        --indices \
            framework_config model_name_or_path \
            num_gpus per_device_train_batch_size

If you want it to run faster, you can comment out the other models and leave just this line.

@Abhishek-TAMU (Collaborator, Author)

Thank you @fabianlim for the guidance.

I fixed the fmt and lint issues. I also ran tox -e run-benches and then compare_with_reference.py using the command above.
There is no outlier.csv in the benchmark_outputs folder, so I assume that is how you check for outliers and decide whether new code changes affect anything.

[screenshot: benchmark comparison output]
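
For illustration only, a rough sketch of how such an outlier check could work; this is not the actual compare_with_reference.py script, and the metric column name and the 5% tolerance are assumptions.

    # Hypothetical sketch: compare new results to a reference benchmark CSV
    # and flag rows whose metric drifted beyond a tolerance.
    import pandas as pd

    INDICES = [
        "framework_config", "model_name_or_path",
        "num_gpus", "per_device_train_batch_size",
    ]

    def find_outliers(result_csv, reference_csv, metric="train_loss", tol=0.05):
        result = pd.read_csv(result_csv).set_index(INDICES)
        reference = pd.read_csv(reference_csv).set_index(INDICES)
        joined = result[[metric]].join(
            reference[[metric]], lsuffix="_new", rsuffix="_ref"
        ).dropna()
        rel_diff = (
            joined[f"{metric}_new"] - joined[f"{metric}_ref"]
        ).abs() / joined[f"{metric}_ref"].abs()
        return joined[rel_diff > tol]  # empty frame -> nothing flagged

    # outliers = find_outliers("benchmark_outputs/benchmarks.csv",
    #                          "scripts/benchmarks/refs/a100_80gb.csv")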

@Abhishek-TAMU (Collaborator, Author)

Also, to run tox -e run-benches I had to add datasets and trl to the [testenv:run-benches] section of tox.ini (a sketch of the change follows below).
If this sounds good, I can push this change as well.

[screenshot]
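
For reference, the local workaround described above would look roughly like the fragment below. This is a hypothetical sketch: the existing contents of [testenv:run-benches] are elided, and (as the next comment notes) these dependencies normally come in via fms-hf-tuning.

    # Hypothetical tox.ini fragment; only the two added lines reflect the
    # change described above.
    [testenv:run-benches]
    deps =
        datasets
        trl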

@fabianlim (Contributor) commented Nov 28, 2024

It should not be the case, because those should come in with fms-hf-tuning. Also, I think something is wrong with the bench; it should take a while to run (a few hours). Can you open your benchmarks.csv and see what is inside?

@Abhishek-TAMU (Collaborator, Author) commented Nov 28, 2024

benchmarks.csv is empty except for the header below, after running tox -e run-benches and compare_with_reference.py:

    framework_config,mem_nvidia_mem_reserved,model_name_or_path,num_gpus,peft_method,per_device_train_batch_size,torch_dtype

Also, I commented out this line and the next line and just left the model TheBloke/Mistral-7B-v0.1-GPTQ.

@fabianlim (Contributor) commented Nov 28, 2024

@Abhishek-TAMU you need to inspect the benchmark_outputs folder and see what failed. If benchmarks.csv is empty it means everything failed.

@Abhishek-TAMU (Collaborator, Author)

Thanks for the input, Fabian. Sharing the benchmark_outputs folder here. There doesn't seem to be any regression in performance.

@fabianlim (Contributor) commented Nov 30, 2024

@Abhishek-TAMU looking at your benches, the train loss seems to be a bit higher; can you take a quick look?

  • Maybe take one example, revert the change, run it again, and see if you get the old loss.
  • If the loss is still higher after the revert, try reverting to older package versions and see whether it depends on that.

[screenshots: train-loss comparison plots]

@Abhishek-TAMU (Collaborator, Author)

@fabianlim Running the benchmark from the main branch code (without my changes) gives the train-loss plot below (the other plots are the same). Do you think this is a significant change in train loss compared to the loss with my code changes?

[plot: compare-train_loss]

@fabianlim (Contributor) commented Dec 2, 2024

@Abhishek-TAMU I see, OK, then it's due to variation. I approve. Also, did you verify that the warning messages went away?

@fabianlim (Contributor) left a review comment:

LGTM and there are no regressions

@Abhishek-TAMU (Collaborator, Author)

Also did you verify that the warning messages went away?

Yes, I checked by running fms-hf-tuning with the config arguments below and did not get those warnings again after the code changes in this PR.

"auto_gptq": ["triton_v2"],
"fp16": true,
"torch_dtype": "float16",
"fast_kernels": [true, true, true]

@fabianlim merged commit c70ffe0 into foundation-model-stack:main on Dec 2, 2024
7 checks passed