Finetuning 2b out of memory on Kaggle T4 x 2 #54

Open
yudataguy opened this issue Oct 23, 2024 · 1 comment

Comments

@yudataguy

I'm following colabs/fine_tuning_tutorial.ipynb, but still ran out of memory at this step:
params = params_lib.load_and_format_params(ckpt_path)

Error message:

E1023 00:10:32.928098      30 pjrt_stream_executor_client.cc:2809] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 1049100288 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation: 1000.50MiB
              constant allocation:         0B
        maybe_live_out allocation: 1000.50MiB
     preallocated temp allocation:         0B
                 total allocation:    1.95GiB
              total fragmentation:         0B (0.00%)
Peak buffers:
	Buffer 1:
		Size: 1000.50MiB
		Entry Parameter Subshape: bf16[256128,2048]
		==========================

	Buffer 2:
		Size: 1000.50MiB
		XLA Label: fusion
		Shape: bf16[256128,2048]
		==========================

I thought a 2b model shouldn't run out of memory on T4 x 2. How can I solve this issue?
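
For context, one way to check how much memory each GPU already has in use before the load is to query JAX's per-device memory stats (a minimal sketch; memory_stats() may return None on some backends):

```python
import jax

# Report per-device memory usage for every visible accelerator.
for device in jax.local_devices():
    stats = device.memory_stats()  # may be None depending on the backend
    if stats:
        in_use = stats.get("bytes_in_use", 0)
        limit = stats.get("bytes_limit", 0)
        print(f"{device}: {in_use / 1e9:.2f} GB in use of {limit / 1e9:.2f} GB")
    else:
        print(f"{device}: memory stats unavailable")
```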

@Gopi-Uppari

Hi @yudataguy,

I was able to reproduce the issue on Kaggle T4 x2 GPUs, but the error did not occur when I ran the same code in Google Colab with the v4 runtime (as mentioned in the tutorial notebook). On Kaggle, one GPU's memory is fully occupied, and params = params_lib.load_and_format_params(ckpt_path) does not automatically make use of the free GPU, which points to a memory-allocation issue. To work around the error, please refer to the code below and the linked gist for more details.
[screenshot of the suggested workaround code]
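
(The original screenshot is not reproduced here. Below is a rough sketch of this kind of workaround, not the exact code from the gist; it assumes the idea is to keep JAX from preallocating an entire GPU up front and to pin the checkpoint load to the GPU that still has free memory.)

```python
import os

# Must be set before JAX initializes its backends:
# allocate GPU memory on demand instead of grabbing most of one GPU up front.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

import jax
from gemma import params as params_lib

ckpt_path = "/kaggle/input/..."  # hypothetical; use your checkpoint path

# Pin the load to the GPU that still has memory available
# (index 1 is only an example; pick the free device on your machine).
free_gpu = jax.devices("gpu")[1]
with jax.default_device(free_gpu):
    params = params_lib.load_and_format_params(ckpt_path)
```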

Thank you.
