Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: 256*256 double矩阵乘法耗时不符合预期 #21

Open
liuhuan2719 opened this issue Oct 26, 2023 · 14 comments
Open

[Bug]: 256*256 double矩阵乘法耗时不符合预期 #21

liuhuan2719 opened this issue Oct 26, 2023 · 14 comments
Assignees
Labels
bug Something isn't working

Comments

@liuhuan2719
Copy link

liuhuan2719 commented Oct 26, 2023

What happened

  1. 256x256 double矩阵乘法耗时不符合预期,相较于255x255和257x257差距过大
28d901f27ec0395efaee2d65b5ed6c4d

2.计算耗时没有线性关系,200x200矩阵 255x255的矩阵,计算量差距约2倍,实际耗时差6倍
image

Reproduction steps

代码如下,MATRIX_SIZE 修改矩阵维数,编译参数:
riscv64-unknown-linux-gnu-gcc -march=rv64imafdcxthead -mabi=lp64d -mcmodel=medany -ffunction-sections -fdata-sections -funroll-loops -Wall -std=gnu11 -mtune=c908 -O2 matrix.c -o matrix

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <stdint.h>

#define MATRIX_SIZE 256

double matrix1[MATRIX_SIZE][MATRIX_SIZE];
double matrix2[MATRIX_SIZE][MATRIX_SIZE];
double result[MATRIX_SIZE][MATRIX_SIZE];

double get_time_in_microseconds() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec * 1000000 + (double)tv.tv_usec;
}

static uint64_t mul_cnt = 0;

// 矩阵乘法函数
void matrix_multiply(double matrix1[][MATRIX_SIZE], double matrix2[][MATRIX_SIZE], double result[][MATRIX_SIZE]) {
    int i, j, k;
        double start_time;
        double end_time;
        double execution_time;
    for (i = 0; i < MATRIX_SIZE; i++) {
        for (j = 0; j < MATRIX_SIZE; j++) {
            result[i][j] = 0.0;
            for (k = 0; k < MATRIX_SIZE; k++) {
                result[i][j] = result[i][j] + (matrix1[i][k] * matrix2[k][j]);
                                mul_cnt++;
            }
        }
    }
}

int main() {

    // 生成两个随机的256x256的双精度矩阵
    int i, j;
    for (i = 0; i < MATRIX_SIZE; i++) {
        for (j = 0; j < MATRIX_SIZE; j++) {
            matrix1[i][j] = (double)rand() / RAND_MAX;
            matrix2[i][j] = (double)rand() / RAND_MAX;
        }
    }
#if 0
        printf("{");
    for (i = 0; i < MATRIX_SIZE; i++) {
        for (j = 0; j < MATRIX_SIZE; j++) {
            printf("%lf,", matrix1[i][j]);
        }
    }
        printf("}\n");
        printf("===============================\n");
        printf("{");
    for (i = 0; i < MATRIX_SIZE; i++) {
        for (j = 0; j < MATRIX_SIZE; j++) {
            printf("%lf,", matrix2[i][j]);
        }
    }
        printf("}\n");
#endif

    // 执行矩阵乘法
        double start_time = get_time_in_microseconds();
        for (int i = 0; i < 1; i++) {
                matrix_multiply(matrix1, matrix2, result);
        }
        double end_time = get_time_in_microseconds();

        double execution_time = end_time - start_time;

        printf("Execution Time: %.2f microseconds mul cnt %ld\n", execution_time, mul_cnt);
#if 0
    // 打印结果
    for (i = 0; i < MATRIX_SIZE; i++) {
        for (j = 0; j < MATRIX_SIZE; j++) {
            printf("%f ", result[i][j]);
        }
        printf("\n");
    }
#endif

    return 0;
}

Hardware board

k230 evb board

Software version

No response

Bug frequency

No response

Anything else

No response

@liuhuan2719 liuhuan2719 added the bug Something isn't working label Oct 26, 2023
@lawo123
Copy link

lawo123 commented Oct 30, 2023

该问题已复现,正在处理中。

@liuzm6217-jianan
Copy link

liuzm6217-jianan commented Nov 1, 2023

暂无进展,后面结合裸机仿真找找原因。

@liuzm6217-jianan
Copy link

  1. 向 soc 提供了测试 code,并沟通了测试的一些细节。
  2. soc 那边开始在做仿真,完成后看下结果,再做分析。(最早也要明天才能完成)

@liuzm6217-jianan
Copy link

仿真比较慢,今天未出结果,下周有初步结果再做进一步分析、测试。

@liuzm6217-jianan
Copy link

  1. soc 仿真(带 cache)暂未复现目前问题, 不带 cache 结果还没出来
  2. 开始在 k230 板子上做裸机测试,可以和 soc 仿真环境(裸机)对齐。

@liuzm6217-jianan
Copy link

  1. k230 裸机已经复现问题
  2. soc 使用 k230 裸机 bin 文件仿真。

@liuzm6217-jianan
Copy link

soc 仿真已经复现了问题,目前正在分析波形。
(基于 矩阵 64*64, O2 优化选项,带 cache)

@liuzm6217-jianan
Copy link

目前还在分析波形,未确定问题。

@liuzm6217-jianan
Copy link

liuzm6217-jianan commented Nov 13, 2023

  1. 目前发现一些规律,64 的倍数时,结果异常较为明显如:
    384 为 64 的倍数
    n = 383, release time = 5846470203.000000
    n = 384, release time = 16266221466.000000
    n = 385, release time = 5949622836.000000
  2. 根据目前的现象, 广超怀疑是 cache 替换策略的问题(硬件设计)。

@liuzm6217-jianan
Copy link

问题及目前测试情况已汇总到晶哥那边,晶哥后面和平头哥沟通,待反馈。

@liminn liminn removed their assignment Nov 15, 2023
@liuhuan2719
Copy link
Author

请问这个问题有结论了吗?

@zhangxiaojingCAN
Copy link

目前能确认是像64对齐下,dcache miss率显著提高,矩阵访存时addr[6]相同甚至就是同一个dcache index,一是同一index每次只能回填1路,另一路回填就得死等,另一个是伪随机替换策略导致可能刚回填的路马上被替换。建议可以试试以下2种优化:

  1. matrix2转置后存储,这样k就成了最后一个下标,可以提升第二个矩阵的局部性,让cache line内的数据尽可能多得被一起用到
  2. matrix2整体偏移1个cache line再存储,这样当matrix使用addr[6]=0的数据时,matrix 2更偏向于使用addr[6]=1的数据

@liuhuan2719
Copy link
Author

当前是demo程序,实际应用中也不能确定什么时候会出现这个问题,不容易规避,这个问题有办法解决吗?

@liuhuan2719
Copy link
Author

问题2的计算量和耗时不成线性关系的原因是什么呢?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants