rocBLAS performance differs with nonuniform data #1486

Open

ml2718 opened this issue Sep 23, 2024 · 2 comments

ml2718 commented Sep 23, 2024

When I run rocblas_sgemm with square matrices, the performance differs depending on whether I initialize the matrices with all 1.0f or with random floats. Sample code follows:

#define __HIP_PLATFORM_AMD__
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

#include <cassert>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

inline __attribute__((always_inline)) void hip_check(hipError_t err) {
    if (__builtin_expect(err != hipSuccess, 0)) {
        string name = hipGetErrorName(err);
        string desc = hipGetErrorString(err);
        cerr << "GPU error: " << name << ": " << desc << endl;
    }
}

int main(int argc, char** argv) {
    int N = 8192;

    void* A_cpu = malloc(N*N*4);
    void* A_gpu = nullptr;
    hip_check(hipMalloc(&A_gpu, N*N*4));
    hip_check(hipMemset(A_gpu, 0x00, N*N*4));

    void* B_cpu = malloc(N*N*4);
    void* B_gpu = nullptr;
    hip_check(hipMalloc(&B_gpu, N*N*4));
    hip_check(hipMemset(B_gpu, 0x00, N*N*4));

    void* C_cpu = malloc(N*N*4);
    void* C_gpu = nullptr;
    hip_check(hipMalloc(&C_gpu, N*N*4));
    hip_check(hipMemset(C_gpu, 0xcd, N*N*4));

    assert(argc == 2);
    string arg = argv[1];
    if (arg == "0") {
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = 1.0f;
            reinterpret_cast<float*>(B_cpu)[i] = 1.0f;
        }
    } else if (arg == "1") {
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = float((rand() % 1000000) / 1e6f);
            reinterpret_cast<float*>(B_cpu)[i] = float((rand() % 1000000) / 1e6f);
        }
    } else {
        assert(false);
    }

    hip_check(hipMemcpy(A_gpu, A_cpu, N*N*4, hipMemcpyHostToDevice));
    hip_check(hipMemcpy(B_gpu, B_cpu, N*N*4, hipMemcpyHostToDevice));

    hip_check(hipDeviceSynchronize());

    float alpha = 1.0f;
    float beta = 0.0f;

    rocblas_handle rh;
    rocblas_create_handle(&rh);

    for (int rep = 0; rep < 3; rep++) {
        hip_check(hipDeviceSynchronize());

        auto t0 = chrono::system_clock::now();

        int nrr = 50;

        for (int rr = 0; rr < nrr; rr++) {
            rocblas_sgemm(
                rh,
                rocblas_operation_none,
                rocblas_operation_none,
                N,
                N,
                N,
                &alpha,
                reinterpret_cast<float*>(A_gpu),
                N,
                reinterpret_cast<float*>(B_gpu),
                N,
                &beta,
                reinterpret_cast<float*>(C_gpu),
                N);
        }

        hip_check(hipDeviceSynchronize());

        auto t1 = chrono::system_clock::now();
        
        cerr << "Time: " << ((t1 - t0).count() / double(nrr) / 1e9) << endl;
    }

    rocblas_destroy_handle(rh);

    hip_check(hipFree(A_gpu));
    hip_check(hipFree(B_gpu));
    hip_check(hipFree(C_gpu));
    
    free(A_cpu);
    free(B_cpu);
    free(C_cpu);

    return 0;
}

When run with uniform initialization (0), the output is as follows:

Time: 0.0408275
Time: 0.0375156
Time: 0.0375371

whereas when run with random initialization (1), the output is as follows:

Time: 0.0464277
Time: 0.043631
Time: 0.0436978

i.e., about 16% slower. What could possibly be causing this? It appears that the same GPU kernel is being run in both cases, and it does not seem likely that the chip has a fast path for constant data.
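
As a sanity check, the same loop could also be timed on the device with HIP events instead of chrono, to rule out host-side timing noise. A rough sketch of that timing loop (illustrative only; it reuses the handle, buffers, and nrr from the program above and is not what produced the numbers here):

    hipEvent_t start, stop;
    hip_check(hipEventCreate(&start));
    hip_check(hipEventCreate(&stop));

    hip_check(hipEventRecord(start, 0));
    for (int rr = 0; rr < nrr; rr++) {
        rocblas_sgemm(
            rh, rocblas_operation_none, rocblas_operation_none,
            N, N, N,
            &alpha,
            reinterpret_cast<float*>(A_gpu), N,
            reinterpret_cast<float*>(B_gpu), N,
            &beta,
            reinterpret_cast<float*>(C_gpu), N);
    }
    hip_check(hipEventRecord(stop, 0));
    hip_check(hipEventSynchronize(stop));

    float ms = 0.0f;
    hip_check(hipEventElapsedTime(&ms, start, stop));
    // Seconds per GEMM, comparable to the "Time:" values above
    cerr << "Device time: " << (ms / 1e3 / nrr) << endl;

    hip_check(hipEventDestroy(start));
    hip_check(hipEventDestroy(stop));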

mahmoodw (Contributor) commented

Thank you for your inquiry! From what you've described, the performance variation you're seeing isn't due to any software difference but is related to how the GPU dynamically manages power and heat.

The key factor here is Dynamic Power Management. GPUs adjust their clock frequency to limit heat output and power usage. When a matrix is initialised to constant values like 0 or 1, fewer bit flips occur during arithmetic operations compared to when random values are used. Fewer bit flips lead to reduced heat output and power consumption. Consequently, the GPU can maintain higher clock frequencies for longer, resulting in better performance compared to random initialisation, which requires more bit manipulations and thus generates more heat.
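
To make the intuition concrete, here is a small host-only sketch (purely illustrative; it does not model the GPU datapath) that counts how many bits toggle between consecutive operands for constant versus random initialisation:

// Illustrative only: count bit toggles between consecutive float operands
// for constant vs. random data. Fewer toggles is a rough proxy for less
// dynamic switching activity, which is the effect described above.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static uint32_t bits_of(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof(u));   // reinterpret the float's bit pattern
    return u;
}

static long long toggles(const float* v, int n) {
    long long total = 0;
    for (int i = 1; i < n; i++) {
        total += __builtin_popcount(bits_of(v[i - 1]) ^ bits_of(v[i]));
    }
    return total;
}

int main() {
    const int n = 1 << 20;
    float* constant_data = new float[n];
    float* random_data = new float[n];
    for (int i = 0; i < n; i++) {
        constant_data[i] = 1.0f;
        random_data[i] = float((rand() % 1000000) / 1e6f);
    }
    printf("toggles (constant): %lld\n", toggles(constant_data, n));
    printf("toggles (random):   %lld\n", toggles(random_data, n));
    delete[] constant_data;
    delete[] random_data;
    return 0;
}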

For additional assistance or hardware-specific questions, I would recommend reaching out to AMD’s product support or posting on the appropriate community forums. Here’s the link for further support:
AMD Community Support

I hope this clarifies the behaviour you're observing. Let me know if you need any further assistance!

ml2718 (Author) commented Sep 25, 2024

Thanks for your reply! That makes sense, although I should note that I also tried random periodic data (repeating every 4 entries):

    if (arg == "0") {
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = 1.0f;
            reinterpret_cast<float*>(B_cpu)[i] = 1.0f;
        }
    } else if (arg == "1") {
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = float((rand() % 1000000) / 1e6f);
            reinterpret_cast<float*>(B_cpu)[i] = float((rand() % 1000000) / 1e6f);
        }
    } else if (arg == "2") {
        srand(17);
        float r[4];
        for (int i = 0; i < 4; i++) {
            r[i] = float((rand() % 1000000) / 1e6f);
        }
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = r[i%4];
            reinterpret_cast<float*>(B_cpu)[i] = r[i%4];
        }
    } else {
        assert(false);
    }

and this run is still slightly faster than with fully random data (though slower than with all 1.0f):

Time: 0.0451028
Time: 0.0423471
Time: 0.0423582

Is it possible that there is some other hardware mechanism responsible for this (e.g., caching)?
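
One further experiment that might separate the two explanations (just a sketch, I have not measured this variant): draw from the same small pool of random values but place them in random order, so there is no short period for a cache to exploit while the amount of bit switching stays limited. For example, as an extra branch in the initialisation above:

    } else if (arg == "3") {
        // Sketch only (hypothetical mode): a small pool of random values
        // placed in random order, so value diversity stays low but there
        // is no 4-entry period.
        srand(17);
        float r[4];
        for (int i = 0; i < 4; i++) {
            r[i] = float((rand() % 1000000) / 1e6f);
        }
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = r[rand() % 4];
            reinterpret_cast<float*>(B_cpu)[i] = r[rand() % 4];
        }
    }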

Alternatively, is there a specific forum that would be appropriate for this type of question? I already tried asking a related question on the ROCm forum but received no response:

ROCm/ROCm#3726

Thanks!
