- Support for act-order models (a bit slow for now)
- ~~Support for v1 models without groupsize~~ Nah.
- Test more models
- Consider support for loading GGML models
- Utility to scan and validate .safetensors files (see the sketch after this list)
- Figure out if there are quantized models with irregular groupsize (there are at least some with no groupsize)
- Test that CUDA code works on GTX 10-series and RTX 20-series at some point
- Test performance on P40 (would be a good GPU to support)
- Tunable kernel parameters
- Figure out an apples-to-apples way of comparing perplexity with other implementations (a baseline sketch follows the list)
- Compile charts of inference speed vs context length for variety of models, compare to other implementations
- Fix layer streaming so it isn't unusably slow
- Allow layer streaming to integrate with other features like device splitting
- Provide alternative backend to allow layers on CPU
- Support for de-quantizing select matrices at load time
- Better vector-matrix multiplication for de-quantized matrices (or show that it's bandwidth-limited now)
- Fused QKV projection (see sketch after the list)
- Fused MLP (done, still needs act-order support)
- Fused RoPE
- Build attention mask in CUDA rather than PyTorch
- Disable attention mask when it isn't needed
- Figure out why inference appears to be CPU-bound
- Measure PyTorch module overhead (nn.Modules aren't really needed for inference)
- Examine if scaled_dot_product_attention is actually the best attention method for single tokens
- Rewrite at least the quantized matmul kernel; there should be a bunch of special cases to consider
- Memory-efficient beam search implementation
- Optimized beam search
- Multi-token censoring/de-censoring
- Multi-token repetition penalties (see sketch after the list)
- (Multi) LoRA support
- Guided generation (chat with multiple bots at once, etc.)
- Multiple chat modes with prompt templates (instruct, etc.)
- Simple web interface?
- API server
- Allow for backpropagation
- LoRA training features
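
Regarding the .safetensors scan/validate utility above: the format is an 8-byte little-endian header length followed by a JSON header giving each tensor's dtype, shape and data offsets, so a basic validator only needs to cross-check those fields against the file size. A minimal sketch (the function name and the dtype table are illustrative, not part of the codebase):

```python
import json
import struct
import sys

# Byte sizes for the dtypes most commonly seen in .safetensors headers.
DTYPE_SIZES = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2,
               "I64": 8, "I32": 4, "I16": 2, "I8": 1, "U8": 1, "BOOL": 1}

def validate_safetensors(path):
    """Parse the header of a .safetensors file and sanity-check tensor offsets."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # u64, little-endian
        header = json.loads(f.read(header_len))
        f.seek(0, 2)
        data_size = f.tell() - 8 - header_len            # bytes in the data region

    checked, covered = 0, 0
    for name, info in header.items():
        if name == "__metadata__":
            continue
        begin, end = info["data_offsets"]
        numel = 1
        for dim in info["shape"]:
            numel *= dim
        expected = numel * DTYPE_SIZES[info["dtype"]]
        if end - begin != expected:
            return False, f"{name}: stored size {end - begin} != expected {expected}"
        if not 0 <= begin <= end <= data_size:
            return False, f"{name}: offsets [{begin}, {end}) outside data region"
        checked += 1
        covered += end - begin

    if covered > data_size:                              # weak overlap check
        return False, "tensor ranges exceed the data region"
    return True, f"{checked} tensors OK"

if __name__ == "__main__":
    ok, msg = validate_safetensors(sys.argv[1])
    print(("OK: " if ok else "FAIL: ") + msg)
```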
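For the perplexity comparison, one apples-to-apples baseline is plain next-token cross-entropy over fixed-length, non-overlapping chunks of a shared test set. A rough sketch, assuming a `model` callable that maps a (1, seq_len) tensor of token IDs to (1, seq_len, vocab) logits (that interface is an assumption, not ExLlama's actual API):

```python
import math
import torch
import torch.nn.functional as F

def perplexity(model, token_ids, chunk_len=2048, device="cuda"):
    """Average next-token NLL over non-overlapping chunks, exponentiated.

    token_ids: 1-D LongTensor of the tokenized test set.
    """
    total_nll, total_tokens = 0.0, 0
    for start in range(0, token_ids.numel() - 1, chunk_len):
        chunk = token_ids[start : start + chunk_len + 1].unsqueeze(0).to(device)
        if chunk.shape[1] < 2:
            break
        with torch.no_grad():
            logits = model(chunk[:, :-1])                # predict token t+1 from tokens <= t
        nll = F.cross_entropy(
            logits.reshape(-1, logits.shape[-1]).float(),
            chunk[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += chunk.shape[1] - 1
    return math.exp(total_nll / total_tokens)
```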
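The fused QKV projection item boils down to concatenating the three projection weights so one matmul produces Q, K and V together. A toy PyTorch sketch of the idea (ExLlama's kernels operate on quantized weights, so this only shows the shape-level picture, and the helper names are made up):

```python
import torch

def fuse_qkv(q_weight, k_weight, v_weight):
    """Stack the three (out_features, in_features) projection weights row-wise."""
    return torch.cat([q_weight, k_weight, v_weight], dim=0)

def qkv_forward(hidden, fused_weight, q_dim, k_dim):
    """One matmul instead of three, then split the result back into Q, K, V."""
    qkv = hidden @ fused_weight.T
    q = qkv[..., :q_dim]
    k = qkv[..., q_dim : q_dim + k_dim]
    v = qkv[..., q_dim + k_dim :]
    return q, k, v
```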
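One possible reading of multi-token repetition penalties is penalizing any token that would complete an n-gram already present in the context. The sketch below implements only that interpretation; it is not necessarily how the feature will be built, and every name in it is hypothetical:

```python
import torch

def ngram_repetition_penalty(logits, context_ids, penalty=1.3, n=3):
    """Down-weight tokens that would complete an n-gram already seen in context_ids.

    logits: (vocab_size,) tensor of scores for the next position.
    context_ids: list of previously generated token IDs.
    """
    if len(context_ids) < n - 1:
        return logits
    prefix = tuple(context_ids[-(n - 1):])            # the (n-1) most recent tokens
    banned = set()
    for i in range(len(context_ids) - n + 1):
        if tuple(context_ids[i : i + n - 1]) == prefix:
            banned.add(context_ids[i + n - 1])        # token that completed this n-gram before
    for token in banned:
        score = logits[token]
        # Same convention as the usual single-token repetition penalty:
        # divide positive scores, multiply negative ones.
        logits[token] = score / penalty if score > 0 else score * penalty
    return logits
```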