Optimizing for CPUs #5

Open
AntonOresten opened this issue Nov 28, 2024 · 3 comments

Comments

@AntonOresten (Member) commented Nov 28, 2024

While the on-GPU sampling is neat, moving the logits to the CPU and sampling there might still be faster.

One idea would be to add logitsample(f::Function, logits) with a generic fallback to logitsample(f(logits)), plus specialized methods like logitsample(::Top_pk, logits) that get better time complexity by using a partial sort.
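
Here is a minimal sketch of what that dispatch could look like. The Gumbel-max base case, the Top_pk fields, and the softmax_subset helper are illustrative assumptions, not the package's actual implementation:

```julia
# Base case: draw one index from raw logits via the Gumbel-max trick,
# which is equivalent to sampling from softmax(logits).
logitsample(logits::AbstractVector) = argmax(logits .- log.(-log.(rand(length(logits)))))

# Generic fallback: apply the transform to the full logit vector, then sample.
logitsample(f, logits::AbstractVector) = logitsample(f(logits))

# Hypothetical transform carrying its own parameters (assumed fields).
struct Top_pk
    p::Float64
    k::Int
end

# Numerically stable softmax over a small candidate set (assumed helper).
softmax_subset(x) = (e = exp.(x .- maximum(x)); e ./ sum(e))

# Specialized method: a partial sort pulls out the k largest logits in
# O(n + k log k), so top-p filtering and sampling only touch k candidates.
function logitsample(t::Top_pk, logits::AbstractVector)
    idx = partialsortperm(logits, 1:t.k; rev=true)
    probs = softmax_subset(logits[idx])
    cut = findfirst(>=(t.p), cumsum(probs))
    keep = idx[1:something(cut, t.k)]
    return keep[logitsample(logits[keep])]
end
```

With this layout, adding a new truncation rule only needs a new method; the generic fallback keeps everything correct, and specialized methods buy speed where it matters.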

Some rough benchmarks show that logitsample ∘ Top_p(0.5) on 100k logits takes ~2 microseconds on an A6000, which would set an upper bound on inference speed.
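
(For reference, a rough sketch of how such a timing could be reproduced with BenchmarkTools.jl and CUDA.jl; the array setup and synchronization here are my assumptions, not the exact benchmark that was run, and it assumes logitsample and Top_p are already in scope.)

```julia
using BenchmarkTools, CUDA

# 100k logits resident on the GPU, mirroring the setup described above.
logits = CUDA.randn(Float32, 100_000)
sampler = logitsample ∘ Top_p(0.5)

# CUDA.@sync ensures the kernel has finished before the timer stops.
@btime CUDA.@sync $sampler($logits)
```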

@murrellb (Member)

I'm ok with a speed limit of 500,000 tok/s.

@AntonOresten (Member, Author) commented Nov 28, 2024

Oops, I meant milliseconds. It seems the GPU I was using was already busy, though; Top_p is closer to 700 microseconds, with Top_nσ faster at 250 microseconds.

@murrellb (Member)

Fine, then I'll have to live with a speed limit of 1,428 tok/s.
