CUDA Study Notes (1): Getting Started and the Thread Model

Hello World

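#include <stdio.h>

// Kernel: runs on the GPU, once per launched thread.
__global__ void helloCUDA() {
    printf("Hello from GPU! Thread ID: %d\n", threadIdx.x);
}

int main() {
    helloCUDA<<<1, 5>>>();     // launch 1 block of 5 threads
    cudaDeviceSynchronize();   // wait for the kernel so its output is printed
    return 0;
}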

Compile Command

nvcc hello.cu -o hello
./hello

CUDA Thread Model

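CUDA organizes threads into a three-level hierarchy:

  • Thread: the most basic unit of execution
  • Block: a group of threads that share a region of shared memory
  • Grid: a group of blocks that together make up the execution of one kernel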
Grid (1D)
├── Block 0 (1D)
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
├── Block 1 (1D)
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
└── ...
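
To make the hierarchy concrete, here is a minimal sketch in which every thread prints the block it belongs to and its position inside that block. The launch shape (2 blocks of 3 threads) and the names `showHierarchy`, `kBlocks`, and `kThreadsPerBlock` are chosen for illustration and are not part of the original notes.

```cuda
#include <cstdio>

// Illustrative launch shape (an assumption, not from the notes).
constexpr int kBlocks = 2;
constexpr int kThreadsPerBlock = 3;

// Each thread reports which block it is in and its index inside that block.
__global__ void showHierarchy() {
    printf("block %u of %u, thread %u of %u\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main() {
    showHierarchy<<<kBlocks, kThreadsPerBlock>>>();
    // Kernel launches are asynchronous; synchronize so the device-side
    // printf output is flushed before the program exits.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

This should print six lines in total, one per thread. The order of lines across blocks is not guaranteed.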

Computing a Unique Thread ID

1D Grid + 1D Block

// gridDim: number of blocks in the grid
// blockDim: number of threads in each block
int threadId = blockIdx.x * blockDim.x + threadIdx.x;

Example values (consistent with blockDim.x = 2):

Index  blockIdx.x  threadIdx.x  threadId
0      0           0            0
1      0           1            1
2      1           0            2
3      1           1            3
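
A common use of this 1D index is to let each thread own one array element. The sketch below assumes an illustrative array size and block size. The names `fillWithGlobalId` and `blocksFor` are hypothetical helpers, not from the original notes. Because the block count is rounded up, the last block may contain threads past the end of the array, so the kernel guards the write with a bounds check.

```cuda
#include <cstdio>
#include <vector>

// Each thread writes its global 1D index into the slot it owns.
__global__ void fillWithGlobalId(int *out, int n) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId < n) {            // skip threads past the end of the array
        out[threadId] = threadId;
    }
}

// Number of blocks needed to cover n elements (ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 10;               // illustrative size
    const int threadsPerBlock = 4;  // illustrative block size

    int *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    fillWithGlobalId<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(dOut, n);

    std::vector<int> hOut(n);
    // A blocking cudaMemcpy on the default stream runs after the kernel finishes.
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        printf("out[%d] = %d\n", i, hOut[i]);
    }
    return 0;
}
```

With these sizes the launch uses 3 blocks (12 threads) for 10 elements, and the two surplus threads do nothing.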

2D Grid + 2D Block

int blockId = blockIdx.x + blockIdx.y * gridDim.x;
int threadId = blockId * (blockDim.x * blockDim.y) +
               threadIdx.y * blockDim.x +
               threadIdx.x;
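
Example values (consistent with gridDim = (2, 2) and blockDim = (1, 1), so each block holds a single thread and threadId equals blockId):

blockIdx (x,y)  blockIdx.y  threadIdx.x  threadIdx.y  threadId
(0,0)           0           0            0            0
(1,0)           0           0            0            1
(0,1)           1           0            0            2
(1,1)           1           0            0            3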
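
One way to convince yourself that the 2D formula gives every thread a distinct ID is to have each thread increment a counter at its own ID and then check on the host that every counter ended at exactly 1. The sketch below does that. The helper `linearId2D`, the kernel `countHits`, and the launch shape are illustrative assumptions, not from the original notes.

```cuda
#include <cstdio>
#include <vector>

// Linear thread ID for a 2D grid of 2D blocks, using the formula above.
// Marked __host__ __device__ so the arithmetic can also be checked on the CPU.
__host__ __device__ inline int linearId2D(dim3 gDim, dim3 bDim,
                                          uint3 bIdx, uint3 tIdx) {
    int blockId = bIdx.x + bIdx.y * gDim.x;
    return blockId * (bDim.x * bDim.y) + tIdx.y * bDim.x + tIdx.x;
}

// Each thread bumps the counter that belongs to its linear ID.
__global__ void countHits(int *hits) {
    int id = linearId2D(gridDim, blockDim, blockIdx, threadIdx);
    atomicAdd(&hits[id], 1);
}

int main() {
    dim3 grid(2, 2);    // illustrative shape: 4 blocks
    dim3 block(3, 2);   // illustrative shape: 6 threads per block
    const int total = grid.x * grid.y * block.x * block.y;  // 24 threads

    int *dHits = nullptr;
    if (cudaMalloc(&dHits, total * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dHits, 0, total * sizeof(int));
    countHits<<<grid, block>>>(dHits);

    std::vector<int> hHits(total);
    cudaError_t err = cudaMemcpy(hHits.data(), dHits, total * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dHits);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    bool unique = true;
    for (int h : hHits) {
        unique = unique && (h == 1);   // 0 means a gap, 2+ means a collision
    }
    printf("%s\n", unique ? "every ID used exactly once" : "ID gap or collision");
    return unique ? 0 : 1;
}
```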

3D Grid + 3D Block

int blockId = blockIdx.x + 
              blockIdx.y * gridDim.x + 
              blockIdx.z * gridDim.x * gridDim.y;

int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) +
               threadIdx.z * (blockDim.x * blockDim.y) +
               threadIdx.y * blockDim.x +
               threadIdx.x;
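
In words: for a block index (x, y, z) and a thread index (x, y, z), first compute blockId = blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y, then multiply it by the number of threads per block and add the thread's linear offset within its block.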
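
The 3D arithmetic is easy to get wrong by hand, so here is a small worked check. It wraps the formula in a hypothetical helper `linearId3D` (not part of the original notes) that can run on both host and device, and evaluates it on the CPU for a 2x2x2 grid of 2x2x2 blocks.

```cuda
#include <cstdio>

// Linear thread ID for a 3D grid of 3D blocks, using the formula above.
// Callable from host and device so the arithmetic can be checked on the CPU.
__host__ __device__ inline int linearId3D(dim3 gDim, dim3 bDim,
                                          uint3 bIdx, uint3 tIdx) {
    int blockId = bIdx.x +
                  bIdx.y * gDim.x +
                  bIdx.z * gDim.x * gDim.y;
    return blockId * (bDim.x * bDim.y * bDim.z) +
           tIdx.z * (bDim.x * bDim.y) +
           tIdx.y * bDim.x +
           tIdx.x;
}

int main() {
    dim3 grid(2, 2, 2);    // 8 blocks
    dim3 block(2, 2, 2);   // 8 threads per block

    // blockIdx = (1,0,1): blockId = 1 + 0*2 + 1*2*2 = 5
    // threadIdx = (1,1,0): threadId = 5*8 + 0*4 + 1*2 + 1 = 43
    printf("%d\n", linearId3D(grid, block,
                              make_uint3(1, 0, 1), make_uint3(1, 1, 0)));  // prints 43

    // Last thread of the last block: blockId = 1 + 2 + 4 = 7
    // threadId = 7*8 + 1*4 + 1*2 + 1 = 63
    printf("%d\n", linearId3D(grid, block,
                              make_uint3(1, 1, 1), make_uint3(1, 1, 1)));  // prints 63
    return 0;
}
```

Inside a kernel the same helper would be called as `linearId3D(gridDim, blockDim, blockIdx, threadIdx)`.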

Complete Example

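#include <stdio.h>

__global__ void printThreadInfo() {
    // The launch below is 3D, so use the 3D formula; the 1D formula would
    // give the same ID to many different threads.
    int blockId = blockIdx.x +
                  blockIdx.y * gridDim.x +
                  blockIdx.z * gridDim.x * gridDim.y;
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) +
                   threadIdx.z * (blockDim.x * blockDim.y) +
                   threadIdx.y * blockDim.x +
                   threadIdx.x;
    printf("Thread %d: blockIdx=(%d,%d,%d) threadIdx=(%d,%d,%d)\n",
           threadId,
           blockIdx.x, blockIdx.y, blockIdx.z,
           threadIdx.x, threadIdx.y, threadIdx.z);
}

int main() {
    // Named block/grid to avoid shadowing the built-in blockDim/gridDim.
    dim3 block(2, 2, 2);      // 8 threads per block
    dim3 grid(2, 2, 2);       // 8 blocks, 64 threads in total
    printThreadInfo<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait for the kernel so its output is printed
    return 0;
}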

Compile and run:

nvcc thread_info.cu -o thread_info
./thread_info

This post is licensed by the author under CC BY 4.0.