tensorrt-llm possibility? #369
Closed · david-payjoy started this conversation in Ideas · 1 comment · 2 replies
-
It already has context caching and speculative decoding. I'm not really sure where TensorRT would fit in?
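For context, here is a minimal sketch of what "already has speculative decoding" looks like with exllamav2's Python API. The draft-model arguments (`draft_model`, `draft_cache`, `num_speculative_tokens`) reflect my reading of the streaming generator's constructor and may differ by version; the model directories are placeholders.

```python
# Sketch only: loads a target + draft model and enables speculative decoding
# via exllamav2's streaming generator. Paths and parameter names that are not
# confirmed by this thread are assumptions.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

def load(model_dir: str):
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)          # shard weights across all visible GPUs
    return model, cache, ExLlamaV2Tokenizer(config)

model, cache, tokenizer = load("/models/target-model-exl2")  # placeholder path
draft, draft_cache, _   = load("/models/draft-model-exl2")   # small draft model, placeholder

generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model = draft,                 # assumed: passing a draft model here
    draft_cache = draft_cache,           # enables speculative decoding
    num_speculative_tokens = 5,          # draft tokens proposed per verification step
)

settings = ExLlamaV2Sampler.Settings()
generator.begin_stream(tokenizer.encode("Hello"), settings)
chunk, eos, _ = generator.stream()       # decode one chunk; loop until eos in practice
```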
-
My use case is similar to exui: multi-GPU, one user, batch size 1, and I want it to be as fast as possible. I'm wondering whether there is a possibility / roadmap to integrate with TensorRT-LLM while keeping the time to first token down within a conversation (i.e. using something like sessions). Tabby and Aphrodite work, but with each message they have to re-read the whole context. So I'm trying to figure out whether it makes more sense to bring sessions to Aphrodite and SillyTavern, whether bringing TensorRT-LLM to exllamav2 is more feasible, or whether it's too crazy an idea. Thanks again for this amazing work. Cheers. Related, on my dream board would be exllamav2 + TensorRT + speculative decoding + context/conversation caching, as I believe that would fly.
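To make the "sessions" idea concrete, here is an illustrative-only sketch of keeping the KV cache alive between messages and prefilling only the suffix that changed, which is what avoids re-reading the whole context each turn. `KVBackend` and its methods are hypothetical placeholders, not a real Tabby / Aphrodite / exllamav2 API.

```python
# Hypothetical sketch of per-conversation KV-cache reuse ("sessions").
# None of these names come from an actual backend.
from typing import List, Protocol

class KVBackend(Protocol):
    def truncate_kv(self, n_tokens: int) -> None: ...        # drop cache past position n
    def prefill(self, token_ids: List[int]) -> None: ...     # compute KV for new tokens
    def decode(self, max_new_tokens: int) -> List[int]: ...  # ordinary decode loop

class Session:
    """One conversation; reuses the KV prefix shared with the previous request."""
    def __init__(self, backend: KVBackend):
        self.backend = backend
        self.cached_ids: List[int] = []  # tokens whose KV entries already exist

    def generate(self, prompt_ids: List[int], max_new_tokens: int) -> List[int]:
        # Longest common prefix between the cached conversation and the new prompt.
        n = 0
        while (n < len(self.cached_ids) and n < len(prompt_ids)
               and self.cached_ids[n] == prompt_ids[n]):
            n += 1
        self.backend.truncate_kv(n)           # invalidate anything past the match
        self.backend.prefill(prompt_ids[n:])  # only the new tokens pay prefill cost
        out = self.backend.decode(max_new_tokens)
        self.cached_ids = prompt_ids + out    # remember state for the next message
        return out
```

The point of the sketch: time to first token then scales with the length of the newest message, not the whole conversation, which is what makes sessions attractive for the batch-1 chat case.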