LLaMA 13B works on a single RTX 4080 16GB #17

kcchu opened this issue Mar 13, 2023 · 1 comment

kcchu commented Mar 13, 2023

meta-llama#79 (comment)

System:

  • RTX 4080 16GB
  • Intel Core i7-13700
  • 32GB RAM
  • Ubuntu 22.04.2 LTS

LLaMA 13B

  • Loading and quantizing uses more than 32 GB of host memory, so make sure you have enough RAM or swap (see the sketch after this list)
  • VRAM usage: about 15 GB
  • loading time: 5 min (using swap)
  • inference time: 30s
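For reference, here is a minimal sketch of what 8-bit quantized loading looks like using the Hugging Face transformers + bitsandbytes stack. This is only an illustration of the general technique, not this repo's own loading script, and the checkpoint path is a placeholder:

```python
# Hedged sketch: load a 13B LLaMA checkpoint with int8 weights so it fits in ~15 GB of VRAM.
# Assumes the Hugging Face transformers + bitsandbytes stack (plus accelerate), not the
# scripts used in this repo; the model path below is a placeholder for a local HF-format checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "path/to/llama-13b-hf"  # assumption: local HF-format 13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 weights
    device_map="auto",          # place layers on the single GPU as they load
    torch_dtype=torch.float16,  # fp16 for the non-quantized modules/activations
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The 8-bit weights are what keep the 13B model inside a 16 GB card; the unquantized fp16 weights alone are about 26 GB.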


LLaMA 7B

  • VRAM usage: about 8.6 GB
  • loading time: 34s
  • inference time: 20s
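For context (assuming int8 weights, as in the sketch above): raw parameter storage is roughly one byte per parameter, so about 13 GB for the 13B model and 7 GB for the 7B model, which lines up with the ~15 GB and ~8.6 GB VRAM figures once activation and buffer overhead is included.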
@chrisbward

Using the above methods on a 3090 Ti 24GB:

LLaMA 13B: 30 seconds loading (with 50 GB of swap), 30 seconds inference
