tensorrt-llm possibility? #369
Closed · david-payjoy started this conversation in Ideas · 1 comment · 2 replies
-
It already has context caching and speculative decoding. I'm not really sure where TensorRT would fit in?
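For context, here is a minimal sketch of what "already has speculative decoding" looks like with exllamav2's Python API. The draft-model arguments (`draft_model`, `draft_cache`, `num_speculative_tokens`) reflect my reading of the streaming generator's constructor and may differ by version; the model directories are placeholders.

```python
# Sketch only: loads a target + draft model and enables speculative decoding
# via exllamav2's streaming generator. Paths and parameter names that are not
# confirmed by this thread are assumptions.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

def load(model_dir: str):
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)          # shard weights across all visible GPUs
    return model, cache, ExLlamaV2Tokenizer(config)

model, cache, tokenizer = load("/models/target-model-exl2")  # placeholder path
draft, draft_cache, _   = load("/models/draft-model-exl2")   # small draft model, placeholder

generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model = draft,                 # assumed: passing a draft model here
    draft_cache = draft_cache,           # enables speculative decoding
    num_speculative_tokens = 5,          # draft tokens proposed per verification step
)

settings = ExLlamaV2Sampler.Settings()
generator.begin_stream(tokenizer.encode("Hello"), settings)
chunk, eos, _ = generator.stream()       # decode one chunk; loop until eos in practice
```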
-
My use case is similar to exui: multi-GPU, one user, batch size 1, and I want it to be as fast as possible. I'm wondering whether there is a possibility / roadmap to integrate with TensorRT-LLM while keeping the time to first token down within a conversation (i.e. using something like sessions). Tabby and Aphrodite work, but with each message they have to re-read the whole context. So I'm trying to figure out whether it makes more sense to bring sessions to Aphrodite and SillyTavern, whether bringing TensorRT-LLM to exllamav2 is more feasible, or whether it's too crazy an idea. Thanks again for this amazing work. Cheers. Related, on my dream board would be exllamav2 + TensorRT + speculative decoding + context/conversation caching, as I believe that would fly.
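To make the "sessions" idea concrete, here is an illustrative-only sketch of keeping the KV cache alive between messages and prefilling only the suffix that changed, which is what avoids re-reading the whole context each turn. `KVBackend` and its methods are hypothetical placeholders, not a real Tabby / Aphrodite / exllamav2 API.

```python
# Hypothetical sketch of per-conversation KV-cache reuse ("sessions").
# None of these names come from an actual backend.
from typing import List, Protocol

class KVBackend(Protocol):
    def truncate_kv(self, n_tokens: int) -> None: ...        # drop cache past position n
    def prefill(self, token_ids: List[int]) -> None: ...     # compute KV for new tokens
    def decode(self, max_new_tokens: int) -> List[int]: ...  # ordinary decode loop

class Session:
    """One conversation; reuses the KV prefix shared with the previous request."""
    def __init__(self, backend: KVBackend):
        self.backend = backend
        self.cached_ids: List[int] = []  # tokens whose KV entries already exist

    def generate(self, prompt_ids: List[int], max_new_tokens: int) -> List[int]:
        # Longest common prefix between the cached conversation and the new prompt.
        n = 0
        while (n < len(self.cached_ids) and n < len(prompt_ids)
               and self.cached_ids[n] == prompt_ids[n]):
            n += 1
        self.backend.truncate_kv(n)           # invalidate anything past the match
        self.backend.prefill(prompt_ids[n:])  # only the new tokens pay prefill cost
        out = self.backend.decode(max_new_tokens)
        self.cached_ids = prompt_ids + out    # remember state for the next message
        return out
```

The point of the sketch: time to first token then scales with the length of the newest message, not the whole conversation, which is what makes sessions attractive for the batch-1 chat case.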