While the on-GPU sampling is neat, moving the logits to the CPU might still be faster.
One idea would be to have logitsample(f::Function, logits) fall back to logitsample(f(logits)), with specialized methods such as logitsample(::Top_pk, logits) that get better time complexity by using a partial sort.
Some rough benchmarks show that logitsample ∘ Top_p(0.5) on 100k logits takes ~2 milliseconds on an A6000, which sets an upper limit on inference speed.
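To make the dispatch idea concrete, here is a rough, CPU-only sketch in plain Julia. It is not the package's implementation: TopKSketch is a hypothetical stand-in for a top-k style transform, the Gumbel-max sampler is just one convenient way to draw from softmax(logits), and the method signatures only mirror the ones proposed above.

```julia
# Hypothetical stand-in for a top-k transform; NOT the package's own type.
# Subtyping Function lets it match the proposed logitsample(f::Function, logits).
struct TopKSketch <: Function
    k::Int
end

# Generic transform: mask everything below the k-th largest logit with -Inf.
# This does a full sort, and ties at the threshold may keep more than k entries.
function (t::TopKSketch)(logits::AbstractVector{<:AbstractFloat})
    threshold = sort(logits; rev=true)[t.k]
    return map(x -> x >= threshold ? x : oftype(x, -Inf), logits)
end

# Gumbel-max trick: argmax(logits + Gumbel noise) is a draw from softmax(logits).
function logitsample(logits::AbstractVector{<:AbstractFloat})
    u = rand(eltype(logits), length(logits))
    return argmax(logits .- log.(-log.(u)))
end

# Fallback: apply any transform, then sample over the whole transformed vector.
logitsample(f::Function, logits) = logitsample(f(logits))

# Specialized method: a partial sort finds the k largest logits without
# sorting the rest, and sampling then touches only those k candidates.
function logitsample(t::TopKSketch, logits::AbstractVector{<:AbstractFloat})
    idx = partialsortperm(logits, 1:t.k; rev=true)
    return idx[logitsample(logits[idx])]
end
```

Both paths sample from the same distribution when there are no ties at the cutoff, so logitsample(TopKSketch(40), logits) should be a drop-in replacement for logitsample(x -> TopKSketch(40)(x), logits). Whether the specialized method actually wins in practice (and whether it translates to the GPU) would need benchmarking.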
Oops, I meant milliseconds, not microseconds. It seems the GPU I was using was already in use, though: Top_p is actually closer to 700 microseconds, with Top_nσ faster at 250 microseconds.