Question about Thread pool and GEMV #221
I've done some AVX2 GEMV kernels in this PR: #209. They show good performance on next-token inference.
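For context, the core of an AVX2 GEMV is a fused multiply-accumulate over each matrix row. The sketch below is only a minimal illustration of that pattern (it assumes fp32 data, a row-major matrix, and FMA support), not the actual kernel from #209:

```cpp
#include <immintrin.h>
#include <cstddef>

// Illustrative AVX2 GEMV sketch: y = A * x, with A stored row-major (m x n).
// Not the kernel from PR #209; layout, blocking, and FMA use are assumptions.
void gemv_f32_avx2(const float* A, const float* x, float* y,
                   std::size_t m, std::size_t n) {
  for (std::size_t i = 0; i < m; ++i) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t j = 0;
    for (; j + 8 <= n; j += 8) {
      __m256 a = _mm256_loadu_ps(A + i * n + j);
      __m256 v = _mm256_loadu_ps(x + j);
      acc = _mm256_fmadd_ps(a, v, acc);  // acc += a * v
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);
    for (; j < n; ++j) sum += A[i * n + j] * x[j];  // scalar tail
    y[i] = sum;
  }
}
```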
We've provided a public threading interface class, which can be implemented with std::thread, OpenMP, or other thread pools such as ONNXRuntime's. We strongly recommend using our thread pool on Intel client CPUs, which have a hybrid architecture; it is much faster there than other thread pools.
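As a rough picture of what such a pluggable interface looks like (the class and method names below are hypothetical, not the actual BesTLA API): the library codes against a single parallel_for-style entry point, and any backend — std::thread, OpenMP, or an external pool like ONNXRuntime's — can implement it.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch of a pluggable threading interface; the real BesTLA
// class has different names and signatures.
class IThreading {
 public:
  virtual ~IThreading() = default;
  // Run fn(begin, end) on disjoint index ranges across workers.
  virtual void parallel_for(std::size_t total,
                            const std::function<void(std::size_t, std::size_t)>& fn) = 0;
};

// One possible backend: plain std::thread. An OpenMP backend would instead
// wrap a "#pragma omp parallel for" loop behind the same interface.
class StdThreading : public IThreading {
 public:
  explicit StdThreading(std::size_t nthreads) : nthreads_(nthreads) {}
  void parallel_for(std::size_t total,
                    const std::function<void(std::size_t, std::size_t)>& fn) override {
    std::size_t chunk = (total + nthreads_ - 1) / nthreads_;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthreads_; ++t) {
      std::size_t begin = t * chunk;
      std::size_t end = std::min(total, begin + chunk);
      if (begin >= end) break;
      workers.emplace_back([=] { fn(begin, end); });
    }
    for (auto& w : workers) w.join();
  }

 private:
  std::size_t nthreads_;
};
```

A backend that is aware of hybrid P-core/E-core topologies can partition work unevenly behind the same interface, which is the kind of scheduling a generic pool typically cannot do.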
We've provided all kinds of GEMMs: sgemm, igemm, hgemm, and bf16gemm, as well as their Weight-Only-Quantization versions: int3, int4, fp4, and other data types. You can use it as a BLAS library. You can refer to ONNXRuntime's code to see how to use BesTLA on its own in your project: https://github.com/microsoft/onnxruntime/blob/main/cmake/external/neural_speed.cmake
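To illustrate what a weight-only-quantization kernel does conceptually (this is not the BesTLA API): the low-bit weights stay packed in memory with a per-group scale and are dequantized on the fly inside the dot product, while the activations remain fp32. A minimal int4 sketch, with the packing layout and zero-point chosen here only for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative int4 weight-only-quantization dot product (not the BesTLA API).
// Two 4-bit weights are packed per byte; each group of `group_size` weights
// shares one fp32 scale. Activations stay in fp32.
float dot_woq_int4(const uint8_t* packed_w, const float* scales,
                   const float* x, std::size_t n, std::size_t group_size) {
  float sum = 0.0f;
  for (std::size_t j = 0; j < n; ++j) {
    uint8_t byte = packed_w[j / 2];
    int w4 = (j % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    float w = static_cast<float>(w4 - 8) * scales[j / group_size];  // zero-point 8 assumed
    sum += w * x[j];
  }
  return sum;
}
```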
Thank you for your prompt reply!
The igemv code is here: https://github.com/intel/neural-speed/pull/209/files#diff-3f2e40e478bc4fdc338616cf4c43969cdd035cc17df448ba42bc2277f628a52dR1329. sgemv code is not planned yet; it is slower than igemv.
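The reason an integer GEMV wins is throughput per instruction: with AVX2, a u8×s8 multiply-accumulate touches 32 weights per instruction pair, versus 8 fp32 lanes in the sgemv sketch above. A minimal illustration of that inner loop (not the PR #209 kernel; u8 activations and s8 weights are assumptions here):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Illustrative AVX2 int8 dot product in the igemv style (not the PR #209 code).
// Processes 32 u8*s8 products per iteration, which is why an integer GEMV
// outpaces an fp32 sgemv on the same hardware.
int32_t dot_u8s8_avx2(const uint8_t* x, const int8_t* w, std::size_t n) {
  __m256i acc = _mm256_setzero_si256();
  const __m256i ones = _mm256_set1_epi16(1);
  std::size_t j = 0;
  for (; j + 32 <= n; j += 32) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x + j));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(w + j));
    // u8*s8 -> pairwise sums in 16-bit lanes, then widen to 32-bit lanes.
    __m256i p16 = _mm256_maddubs_epi16(va, vb);
    __m256i p32 = _mm256_madd_epi16(p16, ones);
    acc = _mm256_add_epi32(acc, p32);
  }
  // Horizontal reduction of the eight 32-bit lanes.
  __m128i lo = _mm256_castsi256_si128(acc);
  __m128i hi = _mm256_extracti128_si256(acc, 1);
  __m128i s  = _mm_add_epi32(lo, hi);
  s = _mm_hadd_epi32(s, s);
  s = _mm_hadd_epi32(s, s);
  int32_t sum = _mm_cvtsi128_si32(s);
  for (; j < n; ++j) sum += static_cast<int32_t>(x[j]) * w[j];  // scalar tail
  return sum;
}
```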
Thank you again! I am going to study this code.
I've been working on efficient GEMV multiplication on CPUs lately, and I've found that I can only get a limited amount of improvement after adopting SIMD. Referring to your BesTLA library might inspire me, so I'm really looking forward to BesTLA's GEMV kernels.
Also, I have a question about BesTLA's thread pool: is it based on a custom thread pool, or on OpenMP?
By the way, I'm looking forward to seeing BesTLA become more widely used, or callable as a standalone library, just like OpenBLAS.