block、thread、wrap间的执行顺序 #21

zhaiyi000 · 2022-11-26T06:23:14Z

您好，请教几个问题。
我的显卡是2080Ti

当我读到，109页10.3 更多线程束内的基本函数中的Listing 10.2（如下）

#include "error.cuh"
#include <stdio.h>

const unsigned WIDTH = 8;
const unsigned BLOCK_SIZE = 16;
const unsigned FULL_MASK = 0xffffffff;

void __global__ test_warp_primitives(void);

int main(int argc, char **argv)
{
    test_warp_primitives<<<1, BLOCK_SIZE>>>();
    CHECK(cudaDeviceSynchronize());
    return 0;
}

void __global__ test_warp_primitives(void)
{
    int tid = threadIdx.x;
    int lane_id = tid % WIDTH;

    if (tid == 0) printf("threadIdx.x: ");
    printf("%2d ", tid);
    if (tid == 0) printf("\n");

    if (tid == 0) printf("lane_id:     ");
    printf("%2d ", lane_id);
    if (tid == 0) printf("\n");

    unsigned mask1 = __ballot_sync(FULL_MASK, tid > 0);
    unsigned mask2 = __ballot_sync(FULL_MASK, tid == 0);
    if (tid == 0) printf("FULL_MASK = %x\n", FULL_MASK);
    if (tid == 1) printf("mask1     = %x\n", mask1);
    if (tid == 0) printf("mask2     = %x\n", mask2);

    int result = __all_sync(FULL_MASK, tid);
    if (tid == 0) printf("all_sync (FULL_MASK): %d\n", result);

    result = __all_sync(mask1, tid);
    if (tid == 1) printf("all_sync     (mask1): %d\n", result);

    result = __any_sync(FULL_MASK, tid);
    if (tid == 0) printf("any_sync (FULL_MASK): %d\n", result);

    result = __any_sync(mask2, tid);
    if (tid == 0) printf("any_sync     (mask2): %d\n", result);

    int value = __shfl_sync(FULL_MASK, tid, 2, WIDTH);
    if (tid == 0) printf("shfl:      ");
    printf("%2d ", value);
    if (tid == 0) printf("\n");

    value = __shfl_up_sync(FULL_MASK, tid, 1, WIDTH);
    if (tid == 0) printf("shfl_up:   ");
    printf("%2d ", value);
    if (tid == 0) printf("\n");

    value = __shfl_down_sync(FULL_MASK, tid, 1, WIDTH);
    if (tid == 0) printf("shfl_down: ");
    printf("%2d ", value);
    if (tid == 0) printf("\n");

    value = __shfl_xor_sync(FULL_MASK, tid, 1, WIDTH);
    if (tid == 0) printf("shfl_xor:  ");
    printf("%2d ", value);
    if (tid == 0) printf("\n");
}

输出是

threadIdx.x:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
lane_id:      0  1  2  3  4  5  6  7  0  1  2  3  4  5  6  7 
FULL_MASK = ffffffff
mask1     = fffe
mask2     = 1
all_sync (FULL_MASK): 0
all_sync     (mask1): 1
any_sync (FULL_MASK): 1
any_sync     (mask2): 0
shfl:       2  2  2  2  2  2  2  2 10 10 10 10 10 10 10 10 
shfl_up:    0  0  1  2  3  4  5  6  8  8  9 10 11 12 13 14 
shfl_down:  1  2  3  4  5  6  7  7  9 10 11 12 13 14 15 15 
shfl_xor:   1  0  3  2  5  4  7  6  9  8 11 10 13 12 15 14

其中的

if (tid == 0) printf("threadIdx.x: ");
printf("%2d ", tid);

想请问：

对于线程束中的不同线程，为什么第二行一定在第一行之后执行？

对于线程束中的不同线程，为什么会顺序执行第二行？
我尝试加入如下代码

 int k = 0;
 if (tid == 0) {
     for ( int i = 0; i < 100000000; i++ ) {
         k = k + 1;
     }
 }
 if (tid == 0) printf("threadIdx.x: ");
 printf("%2d ", tid);
 if (tid == 0) printf("\n");

但是仍然如下输出，我期待线程0应该是最晚执行完成的。

 threadIdx.x:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

书中106页讲从伏特架构开始，引入了独立线程调度（independent thread scheduling）机制。每个线程有自己的程序计数器。我的理解是线程之间的执行顺序应该是随机的。这也正是线程束内同步函数＿syncwarp()存在的意义。

The text was updated successfully, but these errors were encountered:

woodforest357 · 2024-11-14T01:04:41Z

我看到这里也是同样疑问：为什么54-56行能按照线程序号顺序输出呢？
自己机器上跑了一下：win11，vs2022,8.9计算能力，4060显卡。
程序前面加了两个头文件：
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
然后这个程序在Release x64上的输出结果两个all_sync输出0，两个any_sync输出1，其余结果一样。Debugx64上，在第二个all_sync调用时cuda error。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

block、thread、wrap间的执行顺序 #21

block、thread、wrap间的执行顺序 #21

zhaiyi000 commented Nov 26, 2022

woodforest357 commented Nov 14, 2024

block、thread、wrap间的执行顺序 #21

block、thread、wrap间的执行顺序 #21

Comments

zhaiyi000 commented Nov 26, 2022

woodforest357 commented Nov 14, 2024