rocBLAS performance differs with nonuniform data #1486

Open

ml2718 opened this issue Sep 23, 2024 · 2 comments

ml2718 commented Sep 23, 2024

When I run rocblas_sgemm with square matrices, the performance differs depending on whether I initialize the matrices with all 1.0f or with random floats. Sample code follows:

#define __HIP_PLATFORM_AMD__
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

#include <cassert>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

inline __attribute__((always_inline)) void hip_check(hipError_t err) {
    if (__builtin_expect(err != hipSuccess, 0)) {
        string name = hipGetErrorName(err);
        string desc = hipGetErrorString(err);
        cerr << "GPU error: " << name << ": " << desc << endl;
    }
}

int main(int argc, char** argv) {
    int N = 8192;

    void* A_cpu = malloc(N*N*4);
    void* A_gpu = nullptr;
    hip_check(hipMalloc(&A_gpu, N*N*4));
    hip_check(hipMemset(A_gpu, 0x00, N*N*4));

    void* B_cpu = malloc(N*N*4);
    void* B_gpu = nullptr;
    hip_check(hipMalloc(&B_gpu, N*N*4));
    hip_check(hipMemset(B_gpu, 0x00, N*N*4));

    void* C_cpu = malloc(N*N*4);
    void* C_gpu = nullptr;
    hip_check(hipMalloc(&C_gpu, N*N*4));
    hip_check(hipMemset(C_gpu, 0xcd, N*N*4));

    assert(argc == 2);
    string arg = argv[1];
    if (arg == "0") {
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = 1.0f;
            reinterpret_cast<float*>(B_cpu)[i] = 1.0f;
        }
    } else if (arg == "1") {
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = float((rand() % 1000000) / 1e6f);
            reinterpret_cast<float*>(B_cpu)[i] = float((rand() % 1000000) / 1e6f);
        }
    } else {
        assert(false);
    }

    hip_check(hipMemcpy(A_gpu, A_cpu, N*N*4, hipMemcpyHostToDevice));
    hip_check(hipMemcpy(B_gpu, B_cpu, N*N*4, hipMemcpyHostToDevice));

    hip_check(hipDeviceSynchronize());

    float alpha = 1.0f;
    float beta = 0.0f;

    rocblas_handle rh;
    rocblas_create_handle(&rh);

    for (int rep = 0; rep < 3; rep++) {
        hip_check(hipDeviceSynchronize());

        auto t0 = chrono::system_clock::now();

        int nrr = 50;

        for (int rr = 0; rr < nrr; rr++) {
            rocblas_sgemm(
                rh,
                rocblas_operation_none,
                rocblas_operation_none,
                N,
                N,
                N,
                &alpha,
                reinterpret_cast<float*>(A_gpu),
                N,
                reinterpret_cast<float*>(B_gpu),
                N,
                &beta,
                reinterpret_cast<float*>(C_gpu),
                N);
        }

        hip_check(hipDeviceSynchronize());

        auto t1 = chrono::system_clock::now();
        
        cerr << "Time: " << ((t1 - t0).count() / double(nrr) / 1e9) << endl;
    }

    rocblas_destroy_handle(rh);

    hip_check(hipFree(A_gpu));
    hip_check(hipFree(B_gpu));
    hip_check(hipFree(C_gpu));
    
    free(A_cpu);
    free(B_cpu);
    free(C_cpu);

    return 0;
}

When run with uniform initialization (0), the output is as follows:

Time: 0.0408275
Time: 0.0375156
Time: 0.0375371

whereas when run with random initialization (1), the output is as follows:

Time: 0.0464277
Time: 0.043631
Time: 0.0436978

i.e., about 16% slower. What could possibly be causing this? It appears that the same GPU kernel is being run in both cases, and it does not seem likely that the chip has a fast path for constant data.
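
As a sanity check, the same loop could also be timed on the device with HIP events instead of chrono, to rule out host-side timing noise. A rough sketch of that timing loop (illustrative only; it reuses the handle, buffers, and nrr from the program above and is not what produced the numbers here):

    hipEvent_t start, stop;
    hip_check(hipEventCreate(&start));
    hip_check(hipEventCreate(&stop));

    hip_check(hipEventRecord(start, 0));
    for (int rr = 0; rr < nrr; rr++) {
        rocblas_sgemm(
            rh, rocblas_operation_none, rocblas_operation_none,
            N, N, N,
            &alpha,
            reinterpret_cast<float*>(A_gpu), N,
            reinterpret_cast<float*>(B_gpu), N,
            &beta,
            reinterpret_cast<float*>(C_gpu), N);
    }
    hip_check(hipEventRecord(stop, 0));
    hip_check(hipEventSynchronize(stop));

    float ms = 0.0f;
    hip_check(hipEventElapsedTime(&ms, start, stop));
    // Seconds per GEMM, comparable to the "Time:" values above
    cerr << "Device time: " << (ms / 1e3 / nrr) << endl;

    hip_check(hipEventDestroy(start));
    hip_check(hipEventDestroy(stop));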

mahmoodw (Contributor) commented

Thank you for your inquiry! From what you've described, the performance variation you're seeing isn't due to any software difference but is related to how the GPU dynamically manages power and heat.

The key factor here is Dynamic Power Management. GPUs adjust their clock frequency to limit heat output and power usage. When a matrix is initialised to constant values like 0 or 1, fewer bit flips occur during arithmetic operations compared to when random values are used. Fewer bit flips lead to reduced heat output and power consumption. Consequently, the GPU can maintain higher clock frequencies for longer, resulting in better performance compared to random initialisation, which requires more bit manipulations and thus generates more heat.
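
To make the intuition concrete, here is a small host-only sketch (purely illustrative; it does not model the GPU datapath) that counts how many bits toggle between consecutive operands for constant versus random initialisation:

// Illustrative only: count bit toggles between consecutive float operands
// for constant vs. random data. Fewer toggles is a rough proxy for less
// dynamic switching activity, which is the effect described above.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static uint32_t bits_of(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof(u));   // reinterpret the float's bit pattern
    return u;
}

static long long toggles(const float* v, int n) {
    long long total = 0;
    for (int i = 1; i < n; i++) {
        total += __builtin_popcount(bits_of(v[i - 1]) ^ bits_of(v[i]));
    }
    return total;
}

int main() {
    const int n = 1 << 20;
    float* constant_data = new float[n];
    float* random_data = new float[n];
    for (int i = 0; i < n; i++) {
        constant_data[i] = 1.0f;
        random_data[i] = float((rand() % 1000000) / 1e6f);
    }
    printf("toggles (constant): %lld\n", toggles(constant_data, n));
    printf("toggles (random):   %lld\n", toggles(random_data, n));
    delete[] constant_data;
    delete[] random_data;
    return 0;
}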

For additional assistance or hardware-specific questions, I would recommend reaching out to AMD’s product support or posting on the appropriate community forums. Here’s the link for further support:
AMD Community Support

I hope this clarifies the behaviour you're observing. Let me know if you need any further assistance!

ml2718 (Author) commented Sep 25, 2024

Thanks for your reply! That makes sense, although I should note that I also tried random periodic data (repeating every 4 entries):

    if (arg == "0") {
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = 1.0f;
            reinterpret_cast<float*>(B_cpu)[i] = 1.0f;
        }
    } else if (arg == "1") {
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = float((rand() % 1000000) / 1e6f);
            reinterpret_cast<float*>(B_cpu)[i] = float((rand() % 1000000) / 1e6f);
        }
    } else if (arg == "2") {
        srand(17);
        float r[4];
        for (int i = 0; i < 4; i++) {
            r[i] = float((rand() % 1000000) / 1e6f);
        }
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = r[i%4];
            reinterpret_cast<float*>(B_cpu)[i] = r[i%4];
        }
    } else {
        assert(false);
    }

and this run is still slightly faster than with fully random data (though slower than with all 1.0f):

Time: 0.0451028
Time: 0.0423471
Time: 0.0423582

Is it possible that there is some other hardware mechanism responsible for this (e.g., caching)?
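
One further experiment that might separate the two explanations (just a sketch, I have not measured this variant): draw from the same small pool of random values but place them in random order, so there is no short period for a cache to exploit while the amount of bit switching stays limited. For example, as an extra branch in the initialisation above:

    } else if (arg == "3") {
        // Sketch only (hypothetical mode): a small pool of random values
        // placed in random order, so value diversity stays low but there
        // is no 4-entry period.
        srand(17);
        float r[4];
        for (int i = 0; i < 4; i++) {
            r[i] = float((rand() % 1000000) / 1e6f);
        }
        for (int i = 0; i < N*N; i++) {
            reinterpret_cast<float*>(A_cpu)[i] = r[rand() % 4];
            reinterpret_cast<float*>(B_cpu)[i] = r[rand() % 4];
        }
    }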

Alternatively, is there a specific forum that would be appropriate for this type of question? I already tried asking a related question on the ROCm forum but received no response:

ROCm/ROCm#3726

Thanks!
