
Use OpenVino to increase speed #2

Closed
deinferno opened this issue Oct 21, 2023 · 47 comments
Labels
enhancement New feature or request

Comments

@deinferno
Contributor

deinferno commented Oct 21, 2023

It's possible to adapt the pipeline and convert the weights to OpenVINO format with a little hacking around.

The currently missing piece is the timestep_cond input in the compiled UNet; without it, guidance breaks and images come out dim and messy. It can be bypassed by implementing classic cond/uncond, but that lowers inference speed by 33%. (I didn't use that in the benchmark because of it.)

For example, on a Xeon Gold with 48C/96T the speed increases a lot, making it possible to generate a 512x512 image every 4 seconds, or a batch of 4 in 12 seconds.

I will post the weights, the OpenVINO pipeline, and a comparable benchmark soon.

@rupeshs
Owner

rupeshs commented Oct 21, 2023

@deinferno sounds cool

@Disty0

Disty0 commented Oct 21, 2023

CPU performance is basically double that of standard PyTorch.

SDNext has OpenVINO support out of the box:
https://github.com/vladmandic/automatic/wiki/OpenVINO

SDNext's OpenVINO support is based on the official OpenVINO Script's torch.compile backend:
https://github.com/openvinotoolkit/stable-diffusion-webui/blob/master/scripts/openvino_accelerate.py#L117C10-L117C10

Using native OpenVINO instead of the torch.compile backend would be better for this app, though.

@deinferno deinferno reopened this Oct 22, 2023
@Amin456789

Amin456789 commented Oct 22, 2023

Very cool indeed! Could you please tell us how and when it will be ready to use? Can the model be converted to FP16 when it is ready, for a smaller size?

@Disty0

Disty0 commented Oct 22, 2023

Very cool indeed! Could you please tell us how and when it will be ready to use? Can the model be converted to FP16 when it is ready, for a smaller size?

Models will be cached as FP32 and converted for your hardware on the first run.
The torch.compile backend will handle everything for you; you just need to trigger a recompile in the code if a parameter changes.

Using native OpenVINO would be better for this app, since then you don't have to convert from PyTorch to OpenVINO at runtime.
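For context, a minimal sketch of what the torch.compile route looks like (this is not the exact SDNext code; it assumes a recent openvino package that ships the "openvino" backend, and the checkpoint name is only an example):

```python
import torch
import openvino.torch  # noqa: F401  # registers the "openvino" backend for torch.compile
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")

# Compile the heaviest submodule; the first run triggers (and caches) the conversion.
pipe.unet = torch.compile(pipe.unet, backend="openvino")

# If a parameter such as the resolution changes, force a recompile by resetting the cache:
# torch._dynamo.reset()
```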

@Amin456789

Thanks for the answer! If OpenVINO can give the same or better speed as ONNX on my Intel CPU, I can't wait for it!
I have no idea how to use those parameters, but I will ask you guys when it is ready to use.

Thanks!

@rupeshs
Owner

rupeshs commented Oct 22, 2023

Yesterday I tried to convert the model to OpenVINO; image generation is a bit blurry (using the LMS sampler worked). The full LCM pipeline conversion is not done yet. @deinferno any updates?

@Amin456789

@deinferno @rupeshs I asked this in the LCM repo yesterday, but I'm asking it here too since I really like this GUI as well, so we can have everything here.
Will you guys try to implement other SD features in the future, such as img2img, inpainting, and AnimateDiff?

If they can be run with OpenVINO too, that would be amazing!

@deinferno
Contributor Author

deinferno commented Oct 22, 2023

It's done; I uploaded the weights and inference code: https://huggingface.co/deinferno/LCM_Dreamshaper_v7-openvino
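For anyone wanting to try the converted weights outside this app, a rough usage sketch (the repo ships its own pipeline code, so treat the optimum-intel class name and arguments here as illustrative rather than the exact inference code from the model card):

```python
# Rough sketch, not the exact inference code shipped with the model card.
from optimum.intel import OVLatentConsistencyModelPipeline

pipe = OVLatentConsistencyModelPipeline.from_pretrained(
    "deinferno/LCM_Dreamshaper_v7-openvino"  # OpenVINO-converted weights from this thread
)
image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=8.0,
    height=512,
    width=512,
).images[0]
image.save("out.png")
```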

@rupeshs
Owner

rupeshs commented Oct 22, 2023

@deinferno Thanks, 21 seconds is now reduced to 9 to 12 seconds for a 512x512 image on a Core i7 (4 steps).

@rupeshs
Owner

rupeshs commented Oct 22, 2023

@deinferno Seems like memory usage is high compared to PyTorch inference:
512 x 512 - 9 GB
768 x 768 - 12 GB

@rupeshs
Owner

rupeshs commented Oct 22, 2023

@deinferno Added OpenVINO support
https://github.com/rupeshs/fastsdcpu/releases/tag/v1.0.0-beta.3

@Disty0

Disty0 commented Oct 22, 2023

@deinferno Added OpenVINO support https://github.com/rupeshs/fastsdcpu/releases/tag/v1.0.0-beta.3

Works fine on Linux too. Took 8.6 seconds at 512x512, 4 steps with my R7 5800X3D CPU & 3200 MHz CL18 RAM.

Also, I replaced the device: str = "CPU", line with device: str = "GPU", and an image with the same settings took 0.36 seconds on my Intel Arc A770.
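For reference, the device string is simply what gets handed to OpenVINO when the model is compiled, roughly like this (a minimal sketch, not the exact fastsdcpu code; the model path is a placeholder):

```python
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU'] when an Intel GPU driver is installed

model = core.read_model("unet/openvino_model.xml")  # placeholder path to a converted model
compiled = core.compile_model(model, "GPU")  # what the device: str = "GPU" change ends up selecting
```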

@rupeshs
Owner

rupeshs commented Oct 22, 2023

@Disty0 Wow, thanks for testing it.

@Disty0

Disty0 commented Oct 22, 2023

The closest I could get without LCM on the GPU is 1 FPS. LCM boosts the speed to 3 FPS.

Here is a video of it running on a GPU:

https://www.youtube.com/watch?v=-zso94H10hA

@rupeshs
Owner

rupeshs commented Oct 22, 2023

@Disty0 That's pretty fast

@patientx

It was around 11 sec/it at 512x512 with my CPU; now it is around 5 secs. Pretty good speedup.

@deinferno
Contributor Author

@deinferno Seems like memory usage is high compared to PyTorch inference: 512 x 512 - 9 GB, 768 x 768 - 12 GB

I tried converting without timestep_cond and disabling it in the pipeline too, but it doesn't seem to be the cause of the huge RAM usage. I found that if a compiled-shape .blob model file exists in the locally downloaded model folder, memory usage goes from 7 to 11 GB. Can someone test that with the official Stable Diffusion OpenVINO pipeline, using reshape and compile_model like in my example?
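A minimal sketch of the reshape-then-compile flow being asked about, using optimum-intel's OVStableDiffusionPipeline as the example (the model id is only an example, and the memory behaviour may differ from the custom LCM pipeline):

```python
from optimum.intel import OVStableDiffusionPipeline

pipe = OVStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id
    export=True,    # convert from PyTorch to OpenVINO on load
    compile=False,  # hold off compiling until the shapes are fixed
)

# Fix the input shapes so OpenVINO can compile a static-shape model, then compile.
pipe.reshape(batch_size=1, height=512, width=512, num_images_per_prompt=1)
pipe.compile()

image = pipe("a cat", num_inference_steps=20).images[0]
```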

@deinferno
Contributor Author

For some reason the OpenVINO-converted LCM model is no longer deterministic for me; I just can't get anywhere close to the previous output with the same random seed.

Also, it seems that guidance_scale itself doesn't do much, even in the official Latent Consistency Model Space.

@rupeshs
Owner

rupeshs commented Oct 25, 2023

@deinferno I tried with the NumPy random seed as per the OpenVINO docs, but it produces similar rather than identical images; I also added it in the master branch:
np.random.seed(seed)
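For reference, the kind of seeding being discussed (a minimal sketch; whether it reproduces images bit-exactly depends on how the pipeline draws its initial latents):

```python
import numpy as np

seed = 123456
np.random.seed(seed)  # seed NumPy's global RNG before running the pipeline

# The OpenVINO pipeline samples its initial latents with NumPy rather than torch.Generator,
# e.g. something along these lines inside the pipeline (512x512 -> a 64x64 latent grid):
latents = np.random.randn(1, 4, 64, 64).astype(np.float32)
```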

@deinferno
Contributor Author

I updated the inference code. It now produces the same results for the same seeds.

@rupeshs
Owner

rupeshs commented Oct 26, 2023

@deinferno great, if possible could you please create a PR?

@deinferno
Contributor Author

@rupeshs I opened the PR.

Also, the HF Space is up. It runs on cpu-basic and generates one image every ~22.5 seconds, which is pretty impressive for a 2-core vCPU.

@rupeshs
Owner

rupeshs commented Oct 26, 2023

@deinferno Thanks, that is cool.

@rupeshs
Owner

rupeshs commented Oct 26, 2023

@deinferno Added a comment in the PR, could you please check?

@deinferno
Contributor Author

@rupeshs That's odd, I can't find any comment or code review in #35, and I didn't receive any notifications either 🤔

@rupeshs
Owner

rupeshs commented Oct 26, 2023

@deinferno NVM, just merged. Thanks for this PR!

@rupeshs
Owner

rupeshs commented Oct 27, 2023

@deinferno Seems like we have a problem with the latest OpenVINO change, garbage output: #36 (comment)

@Amin456789

@deinferno Could you please work on an ONNX version? ONNX is as fast as OpenVINO on CPU, and I don't think it has these RAM usage problems.

@deinferno
Contributor Author

deinferno commented Oct 28, 2023

@Amin456789 You may want to watch this PR in optimum for the ONNX version.

OpenVINO should only use 7.1 GB instead of 14.1 GB after #40 was merged.
And from your problem in #36, it looks like your system is swapping horribly; that's why the smaller ONNX int8 model was a lot faster for you.
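For reference, the ONNX route via optimum's ONNX Runtime pipelines looks roughly like this (a generic sketch with an example model id; the PR mentioned above is about LCM-specific support, which this does not show):

```python
from optimum.onnxruntime import ORTStableDiffusionPipeline

pipe = ORTStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id
    export=True,  # export the PyTorch weights to ONNX on load
)
image = pipe("a cat", num_inference_steps=20).images[0]
```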

@Amin456789

Nice! Thanks for the answer, can't wait for these updates.

@rupeshs
Owner

rupeshs commented Oct 29, 2023

@rupeshs
Owner

rupeshs commented Nov 2, 2023

@deinferno I tried the Tiny AutoEncoder for SD and got some speed improvement in the diffusers workflow (a 25% speed boost); if we use it with OpenVINO we can probably increase the speed further.
https://huggingface.co/docs/diffusers/main/en/api/models/autoencoder_tiny
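For context, swapping in the Tiny AutoEncoder in the diffusers workflow looks roughly like this (a minimal sketch; madebyollin/taesd is the usual checkpoint for SD 1.x, and the pipeline checkpoint is only an example):

```python
from diffusers import AutoencoderTiny, DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd")  # swap in the tiny VAE

image = pipe("a cat", num_inference_steps=4, guidance_scale=8.0).images[0]
```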

@Amin456789

@rupeshs It's amazing, can you please implement this for the normal model too?

@rupeshs
Owner

rupeshs commented Nov 2, 2023

@Amin456789 Yes, implemented for the normal model. I will create a branch tomorrow.

@Amin456789

Nice, thank you!

@Amin456789

Could you please also add a dark mode for Windows in the future? @rupeshs

@rupeshs
Owner

rupeshs commented Nov 3, 2023

@Amin456789 yes

@rupeshs
Owner

rupeshs commented Nov 3, 2023

WIP: Added the tiny autoencoder for the normal pipeline; @deinferno, can you check the OpenVINO part?
https://github.com/rupeshs/fastsdcpu/tree/add-tae-sd-support

@deinferno
Contributor Author

deinferno commented Nov 4, 2023

@rupeshs Big speedup from TAESD indeed; a 4-image pipeline run now only takes 8.1 seconds instead of 12.5 with the OpenVINO-converted TAESD. I will push the converted weights and a PR soon.
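A rough sketch of how the TAESD decoder can be converted to OpenVINO IR (illustrative only; the actual conversion script, input shapes, and output filename may differ):

```python
import torch
import openvino as ov
from diffusers import AutoencoderTiny

taesd = AutoencoderTiny.from_pretrained("madebyollin/taesd")
example_latents = torch.randn(1, 4, 64, 64)  # example latent input for a 512x512 image

# Trace the decoder submodule with an example input and save it as OpenVINO IR.
ov_decoder = ov.convert_model(taesd.decoder, example_input=example_latents)
ov.save_model(ov_decoder, "taesd_decoder.xml")
```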

@rupeshs
Owner

rupeshs commented Nov 4, 2023

@deinferno That's great, cheers.

@rupeshs
Owner

rupeshs commented Nov 5, 2023

@patientx

patientx commented Nov 5, 2023

Interesting, it is actually slower with a Ryzen 2200G: normally 13 seconds for 4 steps, or about 3.25 sec/it, with OpenVINO, but if I enable the tiny autoencoder it is now 14 seconds, or about 3.5 sec/it. :) Maybe on faster CPUs it would be faster; I don't know what is happening here.

@Amin456789

Another model is out:
https://huggingface.co/furusu/LCM-Acertainty

@rupeshs
Owner

rupeshs commented Nov 11, 2023

@deinferno I have added LCM-LoRA support, but I'm not sure whether it is possible with OpenVINO:
https://github.com/rupeshs/fastsdcpu/releases/tag/v1.0.0-beta.12
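For context, LCM-LoRA in the plain diffusers workflow looks roughly like this (a minimal sketch; the base model id is only an example, and an OpenVINO version would additionally need the LoRA fused or exported, which is the open question above):

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained("Lykon/dreamshaper-7", torch_dtype=torch.float32)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")  # the LCM-LoRA adapter

image = pipe("a cat", num_inference_steps=4, guidance_scale=1.0).images[0]
```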

@rupeshs rupeshs added the enhancement New feature or request label Nov 12, 2023
@rupeshs rupeshs closed this as completed Nov 12, 2023
@onlyreportingissues

@deinferno Added OpenVINO support https://github.com/rupeshs/fastsdcpu/releases/tag/v1.0.0-beta.3

Works fine on Linux too. Took 8.6 seconds at 512x512, 4 steps with my R7 5800X3D CPU & 3200 MHz CL18 RAM.

Also, I replaced the device: str = "CPU", line with device: str = "GPU", and an image with the same settings took 0.36 seconds on my Intel Arc A770.

Where exactly did you change that, if I may ask?

@ExperimentDiffusion

Hi, I have a tech question. I'm already using FastSD CPU, but now I want to bypass the CPU and use Intel Arc A310 and A380 graphics cards with the same OpenVINO config that you've done so well in FastSD CPU. Is it possible to make the same app with an option to use GPU-1 (Arc) and share half of the processing on the CPU and half on a small 4 GB/8 GB Arc GPU? Thanks in advance if you can make it!

@ExperimentDiffusion

device: str = "GPU",

What line is safe to change, and in which file?
Is it possible to share the CPU/GPU processing, with a slider for the user's desired CPU/GPU usage percentage?
