Hello everyone, we are seeing slower than expected inference times on one of our CPU nodes with an Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz and the following instruction sets:
With the latest versions of `neuralchat_server` and `neural-speed` in combination with `intel-extension-for-transformers`, using the following config:

We are seeing extremely slow time to first token with example prompts like:
Tell me about Intel Xeon Scalable Processors.
With the following measured times:
Without `neural-speed` compression of said model, we get inference times of only around 20s.

Is there any misconfiguration on our part?
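For reference, here is a minimal sketch of the kind of setup we mean, assuming the 4-bit weight-only path of `intel-extension-for-transformers` that dispatches to `neural-speed` (the model name and generation parameters are placeholders, not our exact config):

```python
# Hypothetical minimal reproduction; model name and parameters are assumptions,
# not the actual server config from this issue.
import time

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # assumed model for illustration
prompt = "Tell me about Intel Xeon Scalable Processors."

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit=True routes generation through the neural-speed backend
# (weight-only INT4), i.e. the compressed path discussed above.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

# Rough time-to-first-token measurement: generate a single new token.
start = time.time()
model.generate(inputs, max_new_tokens=1)
print(f"time to first token: {time.time() - start:.2f}s")

# Full streamed generation for comparison.
streamer = TextStreamer(tokenizer)
model.generate(inputs, streamer=streamer, max_new_tokens=128)
```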
I would love to hear your feedback and appreciate any help.