fixes and quality improvements, compile support
matatonic committed Sep 9, 2024
1 parent 4f87f4a commit 3e77c5a
Showing 25 changed files with 870 additions and 119 deletions.
3 changes: 3 additions & 0 deletions CONFIG.md
@@ -36,6 +36,7 @@ Sample Generator JSON:
"torch_dtype": "bfloat16"
},
"options": {
"compile": ["transformer", "vae"],
"enable_sequential_cpu_offload": false,
"enable_model_cpu_offload": false,
"enable_vae_slicing": false,
@@ -60,6 +61,8 @@

The format is very flexible and many entries are not pre-defined but are used as keywords in API calls to `diffusers` python objects.
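As a rough illustration of this pass-through design, boolean option names can double as zero-arg method names on the diffusers pipeline (diffusers pipelines really do expose methods like `enable_vae_slicing`). The helper and class names below are hypothetical, not this project's actual internals:

```python
# Hypothetical sketch: truthy boolean options become method calls on the
# pipeline object. apply_options and DummyPipe are illustrative names only.

def apply_options(pipe, options):
    """Call each truthy boolean option (e.g. enable_vae_slicing) as a
    zero-argument method on the pipeline, if such a method exists."""
    applied = []
    for name, value in options.items():
        if value is True and callable(getattr(pipe, name, None)):
            getattr(pipe, name)()
            applied.append(name)
    return applied

class DummyPipe:
    """Stand-in exposing the same method names a diffusers pipeline has."""
    def __init__(self):
        self.calls = []
    def enable_vae_slicing(self):
        self.calls.append("enable_vae_slicing")
    def enable_model_cpu_offload(self):
        self.calls.append("enable_model_cpu_offload")

pipe = DummyPipe()
applied = apply_options(pipe, {"enable_vae_slicing": True,
                               "enable_model_cpu_offload": False})
print(applied)  # ['enable_vae_slicing']
```

This keeps the JSON format open-ended: new diffusers options work without code changes as long as the key matches a pipeline method or keyword argument.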

The `compile` option accepts a list of components to compile (`["transformer", "vae"]`). Compiling can take a while, but the performance improvement may be worthwhile: in my tests it took almost 10 minutes before the first image was ready, and images generated approximately 10-20% faster after that.
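A sketch of how such a compile list might be applied. The helper name `apply_compile_option` is hypothetical, but wrapping individual pipeline components with `torch.compile` like this is the common diffusers pattern:

```python
# Hypothetical sketch of applying the "compile" option list to a pipeline.
import types

def apply_compile_option(pipe, options, compile_fn=None):
    """Replace each listed component (e.g. "transformer", "vae") with a
    compiled version. compile_fn defaults to torch.compile when available."""
    if compile_fn is None:
        import torch  # deferred import so the sketch runs without a GPU
        compile_fn = torch.compile
    compiled = []
    for name in options.get("compile", []):
        component = getattr(pipe, name, None)
        if component is not None:
            setattr(pipe, name, compile_fn(component))
            compiled.append(name)
    return compiled

# Usage with a stand-in pipeline and an identity "compiler":
pipe = types.SimpleNamespace(transformer=object(), vae=object())
done = apply_compile_option(pipe, {"compile": ["transformer", "vae"]},
                            compile_fn=lambda m: m)
print(done)  # ['transformer', 'vae']
```

The long warm-up noted above is expected behavior for `torch.compile`: the first call triggers graph capture and kernel compilation, and subsequent calls reuse the compiled graphs.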

#### Local model files

Here is another sample of how to use local model files with a fine-tune without downloading from huggingface:
25 changes: 15 additions & 10 deletions README.md
@@ -18,6 +18,7 @@ An OpenAI API compatible image generation server for the FLUX.1 family of models
- **Enhancements**: Supports flexible AI prompt enhancers
- **Standalone Image Generation**: Uses your Nvidia GPU for image generation, doesn't use ComfyUI, SwarmUI or any other backend
- **Lora Support**: Support for multiple loras with individual scaling weights (strength)
- **Torch Compile Support**: Faster image generation with `torch.compile` (up to 20% faster in my tests; results may vary on other setups).
- [ ] **Easy to setup and use**: Maybe?
- [ ] **Upscaler Support** (planned)
- [ ] **BNB NF4 Quantization** (planned)
@@ -92,7 +93,7 @@ For example, it's simple to use with Open WebUI. Here is a screenshot of the con
- - [x] model (whatever you configure, `dall-e-2` by default)
- - [x] size (anything that works with flux, `1024x1024` by default)
- - [X] quality (whatever you want, `standard` by default)
- - [x] response_format (`b64_json` preffered, `url` will use `data:` uri's)
- - [x] response_format (`b64_json` preferred, `url` will use `data:` uri's)
- - [x] n
- - [ ] style (`vivid` by default) (currently ignored)
- - [ ] user (ignored)
@@ -134,23 +135,27 @@ By default, the following models are configured (require ~40GB VRAM, bfloat16, <
- `dall-e-2` is set to use `schnell`
- `dall-e-3` is set to use `dev`, with prompt enhancement if an openai chat API is available.

Additional FP8 quantized models (require ~24GB VRAM and can be slow to load, `+enable_vae_slicing`, `+enable_vae_tiling`, ~3+s/step):
Additional FP8 quantized models (require 24GB VRAM and can be slow to load, `+enable_vae_slicing`, `+enable_vae_tiling`, ~3+s/step):

- `schnell-fp8`: `kijai-flux.1-schnell-fp8.json` Schnell with FP8 quantization, 4 steps (10-15s)
- `dev-fp8`: `kijai-flux.1-dev-fp8.json` Dev with FP8 quantization, 25/50 steps
- `dev-fp8-e5m2`: `kijai-flux.1-dev-fp8-e5m2.json` Dev with FP8_e5m2 quantization, 25/50 steps (slightly better)
- `merged-fp8`: `drbaph-flux.1-merged-fp8.json` Dev+Schnell merged, FP8 quantization, 12 steps by default
- `merged-fp8-4step`: `drbaph-flux.1-merged-fp8-4step.json` Dev+Schnell merged, FP8 quantization, 4 steps

Additional FP8 models (require ~16GB VRAM and can be slow to load, `+enable_model_cpu_offload`, ~5+s/step):
Additional FP8 models (require 16GB VRAM and can be slow to load, `+enable_model_cpu_offload`, ~5+s/step):

- `schnell-fp8-16G`: `kijai-flux.1-schnell-fp8-16G.json` Schnell, 4 steps (~15-30s)
- `dev-fp8-16G`: `kijai-flux.1-dev-fp8-16G.json` Dev with FP8 quantization, 25/50 steps
- `dev-fp8-e5m2-16G`: `kijai-flux.1-dev-fp8-e5m2-16G.json` Dev with FP8_e5m2 quantization, 25/50 steps (slightly better)
- `merged-fp8-4step-16G`: `drbaph-flux.1-merged-fp8-4step-16G.json` Dev+Schnell merged, 4 steps
- `merged-fp8-16G`: `drbaph-flux.1-merged-fp8-16G.json` Dev+Schnell merged, 12 steps by default
- `schnell-fp8-16GB`: `kijai-flux.1-schnell-fp8-16GB.json` Schnell, 4 steps (~15-30s)
- `dev-fp8-16GB`: `kijai-flux.1-dev-fp8-16GB.json` Dev with FP8 quantization, 25/50 steps
- `dev-fp8-e5m2-16GB`: `kijai-flux.1-dev-fp8-e5m2-16GB.json` Dev with FP8_e5m2 quantization, 25/50 steps (slightly better)
- `merged-fp8-4step-16GB`: `drbaph-flux.1-merged-fp8-4step-16GB.json` Dev+Schnell merged, 4 steps
- `merged-fp8-16GB`: `drbaph-flux.1-merged-fp8-16GB.json` Dev+Schnell merged, 12 steps by default

Low VRAM options (<4GB VRAM, ~32GB RAM, `+enable_sequential_cpu_offload`, float16 instead of bfloat16, 8-15+s/step):
Additional NF4 models (require 12GB VRAM):

- `sayakpaul-dev-nf4-12GB`: soon ...
- `sayakpaul-dev-nf4-compile-12GB`: soon ...

Low VRAM options (<4GB VRAM, 34GB RAM, `+enable_sequential_cpu_offload`, float16 instead of bfloat16, 8-15+s/step):

- `schnell-low`: `flux.1-schnell-low.json` Schnell FP16, (30-60s per image)
- `dev-low`: `flux.1-dev-low.json` Dev FP16, at least a few minutes per image
27 changes: 21 additions & 6 deletions config.default.json
@@ -11,13 +11,19 @@
"schnell": {
"generator": "lib/flux.1-schnell.json"
},
"schnell-compile": {
"generator": "lib/flux.1-schnell-compile.json"
},
"schnell-e": {
"generator": "lib/flux.1-schnell.json",
"enhancer": "lib/openai-enhancer.json"
},
"schnell-fp8": {
"generator": "lib/kijai-flux.1-schnell-fp8.json"
},
"schnell-fp8-compile": {
"generator": "lib/kijai-flux.1-schnell-fp8-compile.json"
},
"schnell-fp8-16GB": {
"generator": "lib/kijai-flux.1-schnell-fp8-16GB.json"
},
@@ -28,19 +34,28 @@
"merged": {
"generator": "lib/sayakpaul-flux.1-merged.json"
},
"merged-compile": {
"generator": "lib/sayakpaul-flux.1-merged-compile.json"
},
"merged-e": {
"generator": "lib/sayakpaul-flux.1-merged.json",
"enhancer": "lib/openai-enhancer.json"
},
"merged-fp8": {
"generator": "lib/drbaph-flux.1-merged-fp8.json"
},
"merged-fp8-compile": {
"generator": "lib/drbaph-flux.1-merged-fp8-compile.json"
},
"merged-fp8-16GB": {
"generator": "lib/drbaph-flux.1-merged-fp8-16GB.json"
},
"merged-fp8-4step": {
"generator": "lib/drbaph-flux.1-merged-fp8-4step.json"
},
"merged-fp8-4step-compile": {
"generator": "lib/drbaph-flux.1-merged-fp8-4step-compile.json"
},
"merged-fp8-4step-16GB": {
"generator": "lib/drbaph-flux.1-merged-fp8-4step-16GB.json"
},
@@ -51,22 +66,22 @@
"dev": {
"generator": "lib/flux.1-dev.json"
},
"dev-compile": {
"generator": "lib/flux.1-dev-compile.json"
},
"dev-e": {
"generator": "lib/flux.1-dev.json",
"enhancer": "lib/openai-enhancer.json"
},
"dev-fp8": {
"generator": "lib/kijai-flux.1-dev-fp8.json"
},
"dev-fp8-compile": {
"generator": "lib/kijai-flux.1-dev-fp8-compile.json"
},
"dev-fp8-16GB": {
"generator": "lib/kijai-flux.1-dev-fp8-16GB.json"
},
"dev-fp8-e5m2": {
"generator": "lib/kijai-flux.1-dev-fp8-e5m2.json"
},
"dev-fp8-e5m2-16GB": {
"generator": "lib/kijai-flux.1-dev-fp8-e5m2-16GB.json"
},
"dev-low": {
"generator": "lib/flux.1-dev-low.json"
}
10 changes: 8 additions & 2 deletions config/lib/drbaph-flux.1-merged-fp8-16GB.json
@@ -20,7 +20,13 @@
"enable_vae_tiling": true
},
"generation_kwargs": {
"guidance_scale": 3.5,
"num_inference_steps": 12
"standard": {
"guidance_scale": 3.5,
"num_inference_steps": 12
},
"hd": {
"guidance_scale": 3.5,
"num_inference_steps": 25
}
}
}
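The split into `standard` and `hd` blocks above suggests that the OpenAI-style `quality` field selects a named set of generation kwargs. A minimal sketch of that selection; the function name and the fallback-to-`standard` behavior are assumptions, not confirmed server behavior:

```python
# Assumed behavior: the request's "quality" value picks a generation_kwargs
# block; unknown values fall back to "standard".
generation_kwargs = {
    "standard": {"guidance_scale": 3.5, "num_inference_steps": 12},
    "hd": {"guidance_scale": 3.5, "num_inference_steps": 25},
}

def select_kwargs(quality, table):
    """Return the kwargs block for this quality, defaulting to standard."""
    return table.get(quality, table["standard"])

print(select_kwargs("hd", generation_kwargs)["num_inference_steps"])  # 25
```

Under this reading, the change above gives `hd` requests 25 steps instead of the previous single 12-step setting for all qualities.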
37 changes: 37 additions & 0 deletions config/lib/drbaph-flux.1-merged-fp8-4step-compile.json
@@ -0,0 +1,37 @@
{
"pipeline": {
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-schnell",
"torch_dtype": "bfloat16",
"FluxTransformer2DModel": {
"quantize": "fp8",
"pretrained_model_link_or_path_or_dict": "https://huggingface.co/drbaph/FLUX.1-schnell-dev-merged-fp8-4step/blob/main/FLUX.1-schnell-dev-merged-fp8-4step.safetensors",
"torch_dtype": "bfloat16"
},
"T5EncoderModel": {
"quantize": "fp8",
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-schnell",
"torch_dtype": "bfloat16",
"subfolder": "text_encoder_2"
}
},
"options": {
"compile": ["transformer", "vae"],
"enable_vae_slicing": true,
"enable_vae_tiling": true,
"to": {
"device": "cuda"
}
},
"generation_kwargs": {
"standard": {
"guidance_scale": 0.0,
"num_inference_steps": 4,
"max_sequence_length": 256
},
"hd": {
"guidance_scale": 0.0,
"num_inference_steps": 8,
"max_sequence_length": 256
}
}
}
35 changes: 35 additions & 0 deletions config/lib/drbaph-flux.1-merged-fp8-compile.json
@@ -0,0 +1,35 @@
{
"pipeline": {
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-schnell",
"torch_dtype": "bfloat16",
"FluxTransformer2DModel": {
"quantize": "fp8",
"pretrained_model_link_or_path_or_dict": "https://huggingface.co/drbaph/FLUX.1-schnell-dev-merged-fp8/blob/main/FLUX.1-schnell-dev-merged-fp8.safetensors",
"torch_dtype": "bfloat16"
},
"T5EncoderModel": {
"quantize": "fp8",
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-schnell",
"torch_dtype": "bfloat16",
"subfolder": "text_encoder_2"
}
},
"options": {
"compile": ["transformer", "vae"],
"enable_vae_slicing": true,
"enable_vae_tiling": true,
"to": {
"device": "cuda"
}
},
"generation_kwargs": {
"standard": {
"guidance_scale": 3.5,
"num_inference_steps": 12
},
"hd": {
"guidance_scale": 3.5,
"num_inference_steps": 25
}
}
}
10 changes: 8 additions & 2 deletions config/lib/drbaph-flux.1-merged-fp8.json
@@ -22,7 +22,13 @@
}
},
"generation_kwargs": {
"guidance_scale": 3.5,
"num_inference_steps": 12
"standard": {
"guidance_scale": 3.5,
"num_inference_steps": 12
},
"hd": {
"guidance_scale": 3.5,
"num_inference_steps": 25
}
}
}
32 changes: 32 additions & 0 deletions config/lib/flux.1-dev-compile.json
@@ -0,0 +1,32 @@
{
"pipeline": {
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"torch_dtype": "bfloat16"
},
"options": {
"compile": ["transformer", "vae"],
"enable_vae_slicing": true,
"enable_vae_tiling": true,
"to": {
"device": "cuda"
}
},
"generation_kwargs": {
"standard": {
"guidance_scale": 3.5,
"num_inference_steps": 25
},
"bfl": {
"guidance_scale": 3.5,
"num_inference_steps": 50
},
"hd": {
"guidance_scale": 5.5,
"num_inference_steps": 50
},
"xhd": {
"guidance_scale": 7.0,
"num_inference_steps": 50
}
}
}
44 changes: 44 additions & 0 deletions config/lib/flux.1-dev-int8-compile.json
@@ -0,0 +1,44 @@
{
"pipeline": {
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"torch_dtype": "bfloat16",
"FluxTransformer2DModel": {
"quantize": "qint8",
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"subfolder": "transformer",
"torch_dtype": "bfloat16"
},
"T5EncoderModel": {
"quantize": "qint8",
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"torch_dtype": "bfloat16",
"subfolder": "text_encoder_2"
}
},
"options": {
"compile": ["transformer", "vae"],
"enable_vae_slicing": true,
"enable_vae_tiling": true,
"to": {
"device": "cuda"
}
},
"generation_kwargs": {
"standard": {
"guidance_scale": 3.5,
"num_inference_steps": 25
},
"bfl": {
"guidance_scale": 3.5,
"num_inference_steps": 50
},
"hd": {
"guidance_scale": 5.5,
"num_inference_steps": 50
},
"xhd": {
"guidance_scale": 7.0,
"num_inference_steps": 50
}
}
}
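The `quantize` strings in these configs (`"fp8"`, `"qint8"`) presumably map to optimum-quanto weight types; the kijai config in this commit renames `qfloat8_e4m3fn` to `fp8`, which suggests an alias table. A hedged sketch of such a resolver (the mapping, the `fp8_e5m2` entry, and the function name are assumptions):

```python
# Hypothetical alias table: short config names -> optimum-quanto weight type
# names (qfloat8_e4m3fn, qfloat8_e5m2, qint8 are real quanto types). The
# actual quantization step would then call optimum.quanto.quantize/freeze.
QUANT_NAMES = {
    "fp8": "qfloat8_e4m3fn",   # this commit renames qfloat8_e4m3fn -> fp8
    "fp8_e5m2": "qfloat8_e5m2",  # assumed, by analogy with the e5m2 models
    "qint8": "qint8",
}

def resolve_quant(name):
    """Map a config-file quantize string to a quanto weight-type name."""
    if name not in QUANT_NAMES:
        raise ValueError(f"unknown quantize option: {name}")
    return QUANT_NAMES[name]

print(resolve_quant("fp8"))  # qfloat8_e4m3fn
```

An alias table like this lets the JSON files use short, stable names while the backend quantization library's type names can change underneath.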
19 changes: 19 additions & 0 deletions config/lib/flux.1-schnell-compile.json
@@ -0,0 +1,19 @@
{
"pipeline": {
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-schnell",
"torch_dtype": "bfloat16"
},
"options": {
"compile": ["transformer", "vae"],
"enable_vae_slicing": true,
"enable_vae_tiling": true,
"to": {
"device": "cuda"
}
},
"generation_kwargs": {
"guidance_scale": 0.0,
"num_inference_steps": 4,
"max_sequence_length": 256
}
}
2 changes: 1 addition & 1 deletion config/lib/kijai-flux.1-dev-fp8-16GB.json
@@ -3,7 +3,7 @@
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"torch_dtype": "bfloat16",
"FluxTransformer2DModel": {
"quantize": "qfloat8_e4m3fn",
"quantize": "fp8",
"pretrained_model_link_or_path_or_dict": "https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8-e4m3fn.safetensors",
"torch_dtype": "bfloat16"
},