Performance database generation ? #2742
Comments
I have to remove the prebuilt DBs from the Fedora package, so there will be a performance hit. I expect there is some off-to-the-side Tensile command that is being used. |
I am curious, why? |
@trixirt If you know the convolution configs that will be used in your systems, then you can generate the databases (find-db, perf-db and kdb (precompiled kernels)) manually. First, set variables:
export MIOPEN_FIND_MODE=normal
export MIOPEN_FIND_ENFORCE=search
Second, run the convolution configs using any available tool. For example, you can use After that you can find the generated user-find-db, user-perf-db and user-kdb in The user's databases can be installed to the user's homes (from the package you are working on) or properly renamed and installed to the "system" directories for all users (let me omit the details of this topic for now).
Another important note is that the tuning process can be accelerated if the offline compiler is used, because that enables parallel compilation of kernels. Please ask if/when you need to know the details. |
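The two steps above could be wrapped in a small script. This is only a sketch under stated assumptions: the two `export` lines come from the comment, but `CONV_TOOL`, `run_configs` and the one-config-per-line file format are hypothetical names invented for illustration. `CONV_TOOL` defaults to `echo`, so the script performs a dry run until you point it at a real tool.

```shell
#!/bin/sh
# Tuning settings quoted from the comment above.
export MIOPEN_FIND_MODE=normal
export MIOPEN_FIND_ENFORCE=search

# CONV_TOOL is a hypothetical placeholder for whatever tool runs one
# convolution config. It defaults to `echo`, which only prints the config.
CONV_TOOL="${CONV_TOOL:-echo}"

# run_configs FILE
# Runs CONV_TOOL once per line of FILE, skipping blank lines and lines
# starting with '#'. Stops at the first failing config.
run_configs() {
    while IFS= read -r line; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        # Word splitting is intentional: each line is an argument list.
        # stdin is redirected so the tool cannot swallow the config file.
        $CONV_TOOL $line < /dev/null || return 1
    done < "$1"
}
```

Usage would be something like `CONV_TOOL=/path/to/tool sh tune.sh` followed by `run_configs my-configs.txt`, where both paths are yours to supply. The generated user databases then appear wherever MIOpen writes them on your system.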
On Fedora (and RHEL) there is a packaging requirement to be self-consistent, i.e. we cannot use anything that has not been built on Fedora. So to package the db on Fedora it must be built on Fedora. E.g. for rocBLAS we use Fedora's Tensile to generate rocBLAS's DBs so we can package them as part of the rocBLAS rpm. Yes, this adds a couple of hours to the build; that is the cost we have to pay. I am aware of the $HOME generation of the db, but that does not help with the system db. I am asking for the offline compiler details so we can do for MIOpen what we have done for rocBLAS. |
Why? The only difference between the user and system databases is their locations/filenames.
Let me first explain the difference between "offline compilation" in rocBLAS and in MIOpen. IIRC rocBLAS can generate databases truly offline, which means that rocBLAS does not require the target GPU to be installed in the system. MIOpen is different. We use a real GPU to benchmark kernels (because their performance can't be predicted algorithmically). This means that databases should be generated on a system that is identical to the target system, including ROCm and the target GPU.
What we call the offline compiler is a clang++ executable (that resides somewhere in the rocm tree) that supports parallel compilation of kernels, in contrast to the normal online kernel compilation, which uses hipRTC and COMgr calls and does not support parallel builds. Parallel compilation allows faster building of the MIOpen tuning database (perf-db) and the binary database (which contains pre-built kernels).
The second important thing is that MIOpen cannot generate system databases. The system dbs are read-only by design. Preparing the system databases means generating the user databases, then renaming the resulting files and moving them to the proper locations.
If the above may work for you, then I am ready to explain in detail how to engage the offline compiler for tuning (but let me remind you that this would only accelerate the database generation). |
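The "generate user databases, rename, move" step described above might look roughly like the following in a packaging script. This is a sketch, not MIOpen's actual procedure: the thread does not give the real file names, rename rule or system directory, so `stage_dbs`, its arguments and the optional sed expression are all hypothetical and must be supplied by the packager. The only detail taken from the comment is that system databases are read-only, which the sketch mirrors with `chmod 0444`.

```shell
#!/bin/sh
# stage_dbs SRC DST [SED_EXPR]
# Copies every regular file from SRC into DST. If SED_EXPR is given, it is
# applied to each file name to produce the new name (the real rename rule
# is not stated in the thread, so the caller must provide it).
# Copied files are marked read-only.
stage_dbs() {
    src=$1
    dst=$2
    rename=${3:-}
    mkdir -p "$dst" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        if [ -n "$rename" ]; then
            name=$(printf '%s\n' "$name" | sed "$rename")
        fi
        cp "$f" "$dst/$name" || return 1
        chmod 0444 "$dst/$name" || return 1
    done
}
```

A call would have the shape `stage_dbs "$user_db_dir" "$staging_dir" "$rename_expr"`, with all three values chosen by you. Keeping the rename rule as a parameter avoids hard-coding a naming scheme that this thread leaves unspecified.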
Without Fedora being able to generate the system db, there will be no system db. |
Hi, @atamazov! I dug into kern_db and have a few questions (this time for Gentoo, but it does not matter). First of all, this is what I see in gfx1030.kdb for the rocm-6.0.2 release:
If everything above is right, then it is possible to create an alternative representation similar to https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/jit/gemm/kernel.db, which can be published directly in the release. As I checked for gfx1030.kdb, name + tuned parameters in bz2 will use only 526 KB. Also, I've found your comment in ROCm/llvm-project#36, and there are multiple directions for solving the issue: forking in miopen and removing mutex locks in comgr when possible. With parallel compilation there will be no critical need to distribute precompiled kernels. Only tuned parameters will suffice, correct me if I'm wrong. My questions are:
|
@AngryLoki Let's try to keep things in order. Please open another issue for kdb matters, mention @cderb and @JehandadKhan there, copy your comment there and hide it here (as off-topic or duplicate). Thanks. |
Where can I find an explanation of why this is so? Thanks.
The process is explained at #2742 (comment) plus additional info at #2742 (comment). Please feel free to ask additional questions. Unfortunately, Fedora is not one of the platforms officially supported by ROCm, so support is limited. |
So I would need to have a set of hardware matching the GPUs? |
@trixirt Yes, because MIOpen uses actual HW to run and benchmark kernels. |
@trixirt Has this been resolved for you? If so, please close the ticket. Thanks! |
@trixirt Closing ticket. Please re-open if you still need assistance with this. Thanks! |
In ROCm 6.0 the performance db is unpacked from a tarball.
How is the db generated?