Performance database generation ? #2742

Closed
trixirt opened this issue Feb 9, 2024 · 13 comments

@trixirt

trixirt commented Feb 9, 2024

In ROCm 6.0 the performance db is unpacked from a tarball.
How is the db generated?

@trixirt
Author

trixirt commented Feb 16, 2024

I have to remove the prebuilt db's from the Fedora package, so there will be a performance hit.

I expect there is some separate, out-of-tree Tensile command being used.
Could this be included in the MIOpen build, as it is for rocBLAS and hipBLASLt?

@atamazov
Contributor

@trixirt

> I have to remove the prebuilt db's from the Fedora package

I am curious, why?

@atamazov
Contributor

@trixirt If you know the convolution configs that will be used in your systems, then you can generate the databases (find-db, perf-db and kdb (precompiled kernels)) manually.

First, set variables:

 export MIOPEN_FIND_MODE=normal
 export MIOPEN_FIND_ENFORCE=search

Second, run the convolution configs using any available tool. For example, you can use MIOpenDriver, or even real neural networks.
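
The two steps above can be sketched roughly as follows. This is only an illustration: it assumes `MIOpenDriver` is on `PATH` and a supported GPU is present, the shapes in `CONFIGS` are made-up placeholders rather than recommended configs, and `tune_configs`, `DRY_RUN` and `MIOPEN_DRIVER` are hypothetical helper names that are not part of MIOpen.

```shell
# Settings from the comment above; MIOPEN_FIND_ENFORCE=search forces tuning (slow).
export MIOPEN_FIND_MODE=normal
export MIOPEN_FIND_ENFORCE=search

# Placeholder shapes (assumptions), one config per line.
# Replace them with the convolution configs your workloads really use.
CONFIGS='-n 16 -c 64 -H 56 -W 56 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1
-n 16 -c 128 -H 28 -W 28 -k 128 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1'

# Run one MIOpenDriver command per config, or only print it when DRY_RUN=1.
tune_configs() {
  driver="${MIOPEN_DRIVER:-MIOpenDriver}"   # hypothetical override variable
  printf '%s\n' "$CONFIGS" | while IFS= read -r cfg; do
    [ -n "$cfg" ] || continue
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "$driver conv $cfg -F 1 -V 0 -t 1"
    else
      # -F 1: forward convolution only, -V 0: skip verification, -t 1: time kernels.
      # $cfg is left unquoted on purpose so it splits into separate arguments.
      "$driver" conv $cfg -F 1 -V 0 -t 1 </dev/null
    fi
  done
}
```

Calling `DRY_RUN=1 tune_configs` first prints the command lines for review; a plain `tune_configs` on the GPU machine then runs them.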

After that you can find the generated user-find-db, user-perf-db and user-kdb in $HOME/.config/miopen and in $HOME/.cache/miopen.
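
To check what was actually written, something like the sketch below may help. `list_user_dbs` is a hypothetical helper name; the default directories are the two mentioned above, and no assumption is made about the file names inside them.

```shell
# List every file under the given directories.
# With no arguments, fall back to the two user-database locations named above.
list_user_dbs() {
  [ "$#" -gt 0 ] || set -- "$HOME/.config/miopen" "$HOME/.cache/miopen"
  for dir in "$@"; do
    # Skip directories that do not exist yet (nothing has been generated there).
    [ -d "$dir" ] && find "$dir" -type f
  done
  return 0
}
```

Running `list_user_dbs` before and after a tuning run shows which files the run created.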

The user databases can be installed into users' home directories (from the package you are working on), or properly renamed and installed to the "system" directories for all users (let me omit the details of this topic for now).

⚠️ Please note that MIOPEN_FIND_ENFORCE=search enforces tuning, which dramatically increases the time required for the library to find the fastest kernel. So it may be worth trying without this setting first. See also https://github.com/ROCm/MIOpen/blob/develop/docs/perfdatabase.md.

Another important note is that the tuning process can be accelerated if the offline compiler is used, because that enables parallel compilation of kernels. Please ask if/when you need to know the details.

@trixirt
Author

trixirt commented Feb 26, 2024

On Fedora (and RHEL) there is a packaging requirement to be self-consistent, i.e. we cannot use anything that has not been built on Fedora. So to package the db on Fedora, it must be built on Fedora. For example, for rocBLAS we use Fedora's Tensile to generate rocBLAS's dbs so we can package them as part of the rocBLAS rpm. Yes, this adds a couple of hours to the build; that is the cost we have to pay. I am aware of the $HOME generation of the db, that does not help with the system db. I am asking for the offline compiler details so we can do for MIOpen, what we have done for rocBLAS

@atamazov
Contributor

@trixirt

> I am aware of the $HOME generation of the db, that does not help with the system db.

Why? The only difference between the user and system databases is their locations/filenames.

> I am asking for the offline compiler details so we can do for MIOpen, what we have done for rocBLAS

Let me first explain the difference between "offline compilation" in rocBLAS and in MIOpen.

IIRC rocBLAS can generate databases truly offline, which means that rocBLAS does not require the target GPU to be installed in the system.

MIOpen is different. We use a real GPU to benchmark kernels (because their performance can't be predicted algorithmically). This means that databases should be generated on a system that is identical to the target system, including ROCm and the target GPU.

What we call the offline compiler is a clang++ executable (residing somewhere in the ROCm tree) that supports parallel compilation of kernels. This is in contrast to normal online kernel compilation, which uses hipRTC and COMgr calls and does not support parallel builds. Parallel compilation allows faster building of the MIOpen tuning database (perf-db) and the binary database (which contains pre-built kernels).

The second important thing is that MIOpen cannot generate system databases. The system dbs are read-only by design. Preparing the system databases consists of generating the user databases, then renaming the resulting files and moving them to the proper locations.
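
As a rough sketch of the copy-and-lock part of that process (not the official procedure): the system file names and target directories are not given in this thread, so `map_name` below is a placeholder that returns the name unchanged, and `stage_dbs` with its arguments is a hypothetical helper.

```shell
# Placeholder: the real user-to-system renaming rule is not stated here,
# so this returns the file name unchanged.
map_name() { printf '%s\n' "$1"; }

# Copy every file found under src into dest (a packaging staging directory)
# and make the copies read-only. Subdirectories are flattened, so files
# with the same base name would overwrite each other.
stage_dbs() {
  src="$1"
  dest="$2"
  mkdir -p "$dest"
  find "$src" -type f | while IFS= read -r f; do
    target="$dest/$(map_name "$(basename "$f")")"
    cp -f "$f" "$target"
    chmod a-w "$target"   # system databases are meant to be read-only
  done
}
```

A call might look like `stage_dbs "$HOME/.config/miopen" ./staging`, with `map_name` replaced once the actual naming scheme is known.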

If the above may work for you, then I am ready to explain in detail how to engage the offline compiler for tuning (but let me remind you that this would only accelerate the database generation).

@trixirt
Author

trixirt commented Feb 29, 2024

Without Fedora being able to generate the system db, there will be no system db.
Is there a public, documented way to generate the system db?

@AngryLoki

AngryLoki commented Mar 19, 2024

Hi, @atamazov! I dug into kern_db and have a few questions (this time for Gentoo, but it does not matter).

First of all, this is what I see in gfx1030.kdb for rocm-6.0.2 release:

  • There are 51246 kernels for 42 unique source files (excluding .o)
  • Kernels are compiled from .cpp, .cpp.o, .cl, .cl.o, .s and .s.o files
  • All kernels were compiled from source code stored in this repo; however, some of them, like Conv_Winograd_v21_1_3_fp32_dilation2.s, were removed a long time ago.
  • All information about benchmark-based tunable parameters required for a rebuild is stored in -D parameters.
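
Counts like the ones in the first bullet can be obtained with a query along these lines. It is a sketch under assumptions: that the `.kdb` file is an SQLite database, that the table is named `kern_db`, and that it has a `kernel_name` column (worth confirming with `sqlite3 file.kdb .schema`). `kdb_summary` is a hypothetical helper name.

```shell
# Print "total_kernels|unique_source_names", where names ending in .o
# are excluded from the unique count.
kdb_summary() {
  sqlite3 "$1" "SELECT COUNT(*),
                       COUNT(DISTINCT CASE WHEN kernel_name NOT LIKE '%.o'
                                           THEN kernel_name END)
                FROM kern_db;"
}
```

Usage would be `kdb_summary gfx1030.kdb`.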

If everything above is right, then it is possible to create an alternative representation similar to https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/jit/gemm/kernel.db, which could be published directly in a release. As I checked for gfx1030.kdb, the names plus tuned parameters, compressed with bz2, take only 526 KB.

I also found your comment in ROCm/llvm-project#36, and there are multiple directions for solving the issue: forking in MIOpen, and removing mutex locks in comgr where possible. With parallel compilation there would be no critical need to distribute precompiled kernels; the tuned parameters alone would suffice. Correct me if I'm wrong.

My questions are:

  1. Removed kernels are just dead code and can be safely ignored, right?
  2. There are 2 types of kernels there: with -mcpu=gfx1030 and without -mcpu (50% / 50%). When -mcpu is not used, which offload target is used? Why does the db store the compilation results of both naive_conv.cpp <no mcpu> and naive_conv.cpp.o -mcpu=gfx1030?
  3. What do you think about publishing a plain file with build parameters, so that distro maintainers could build and package their own system database rather than downloading huge blobs of ELF code?

@atamazov
Contributor

atamazov commented Mar 19, 2024

@AngryLoki Let's try to keep things in order. Please open another issue for kdb matters, mention @cderb and @JehandadKhan there, copy your comment there and hide it here (as off-topic or duplicate). Thanks.

@atamazov
Contributor

@trixirt

> Without Fedora being able to generate the system db, there will be no system db.

Where can I find an explanation of why this is so? Thanks.

> Is there a public, documented way to generate the system db?

The process is explained at #2742 (comment) plus additional info at #2742 (comment). Please feel free to ask additional questions.

Unfortunately, Fedora is not one of the platforms officially supported by ROCm, so support is limited.

/cc @JehandadKhan @junliume

@trixirt
Author

trixirt commented Mar 28, 2024

So I would need to have a set of hardware matching the target GPUs?

@atamazov
Contributor

@trixirt Yes, because MIOpen uses actual HW to run and benchmark kernels.

@ppanchad-amd

@trixirt Has this been resolved for you? If so, please close the ticket. Thanks!

@ppanchad-amd

@trixirt Closing ticket. Please re-open if you still need assistance with this ticket. Thanks!
