Matrix Multiplication (cudaFlow)

Following up on Matrix Multiplication, this page studies how to accelerate a matrix multiplication workload on a GPU using tf::cudaFlow.

Define a Matrix Multiplication Kernel

A GPU can perform many more parallel computations than a CPU, which makes it especially useful for data-intensive workloads such as matrix multiplication. On a GPU, we express the parallel pattern at a fine-grained level through a kernel. The kernel, written in CUDA, is described as follows:

// CUDA kernel to perform matrix multiplication
__global__ void matmul(int *A, int *B, int *C, int M, int K, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int sum = 0;
  if(col < N && row < M) {
    for(int i = 0; i < K; i++) {
      sum += A[row * K + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
  }
}

Each CUDA thread corresponds to an element of C and computes its result. Instead of storing each matrix in a 2D array, we use a 1D layout to ease the data transfer between CPU and GPU. In a row-major layout, an element (x, y) of the 2D matrix is addressed at x * width + y in the transformed 1D layout.
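
For reference, a sequential CPU version using the same row-major 1D layout is sketched below. It is not part of the Taskflow example (the name cpu_matmul is our own), but it can serve as a correctness baseline for the GPU result:

// sequential CPU reference using the same 1D row-major layout
void cpu_matmul(const int* A, const int* B, int* C, int M, int K, int N) {
  for(int row = 0; row < M; row++) {
    for(int col = 0; col < N; col++) {
      int sum = 0;
      for(int i = 0; i < K; i++) {
        sum += A[row * K + i] * B[i * N + col];
      }
      C[row * N + col] = sum;
    }
  }
}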

Define a cudaFlow for Matrix Multiplication

The next step is to allocate memory for A, B, and C on the GPU. We create three tasks, each calling cudaMalloc to allocate space for one matrix. Then, we create a cudaFlow to offload the matrix multiplication to the GPU. The entire code is described as follows:

void matrix_multiplication(int* A, int* B, int* C, int M, int K, int N) {

  tf::Taskflow taskflow;
  tf::Executor executor;

  // device pointers for the three matrices
  int* da {nullptr};
  int* db {nullptr};
  int* dc {nullptr};

  // allocate the gpu storage for A
  tf::Task allocate_a = taskflow.emplace([&](){
    cudaMalloc(&da, M*K*sizeof(int));
  }).name("allocate_a");

  // allocate the gpu storage for B
  tf::Task allocate_b = taskflow.emplace([&](){
    cudaMalloc(&db, K*N*sizeof(int));
  }).name("allocate_b");

  // allocate the gpu storage for C
  tf::Task allocate_c = taskflow.emplace([&](){
    cudaMalloc(&dc, M*N*sizeof(int));
  }).name("allocate_c");

  // create a cudaFlow to run the matrix multiplication
  tf::Task cudaFlow = taskflow.emplace([&](tf::cudaFlow& cf){

    // copy data to da and db, and copy the result from dc back to C
    tf::cudaTask copy_da = cf.copy(da, A, M*K).name("H2D_A");
    tf::cudaTask copy_db = cf.copy(db, B, K*N).name("H2D_B");
    tf::cudaTask copy_hc = cf.copy(C, dc, M*N).name("D2H_C");

    // launch one 16x16 thread block per 16x16 tile of C (N columns, M rows)
    dim3 grid ((N+16-1)/16, (M+16-1)/16);
    dim3 block (16, 16);

    tf::cudaTask kmatmul = cf.kernel(grid, block, 0, matmul, da, db, dc, M, K, N)
                             .name("matmul");

    kmatmul.succeed(copy_da, copy_db)
           .precede(copy_hc);
  }).name("cudaFlow");

  // free the gpu storage
  auto free = taskflow.emplace([&](){
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
  }).name("free");

  // create the dependencies
  cudaFlow.succeed(allocate_a, allocate_b, allocate_c)
          .precede(free);

  // dump the graph without unfolding the cudaFlow
  taskflow.dump(std::cout);

  // run the taskflow
  executor.run(taskflow).wait();

  // dump the entire execution graph, including the unfolded cudaFlow
  taskflow.dump(std::cout);
}
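For completeness, a minimal driver might look like the following. This sketch is not part of the Taskflow example; the sizes, the initial values, and the use of std::vector (which requires including &lt;vector&gt;) are illustrative assumptions:

// hypothetical driver: row-major 1D host buffers for A (MxK), B (KxN), and C (MxN)
int M = 1000, K = 1000, N = 1000;
std::vector<int> A(M*K, 1), B(K*N, 2), C(M*N, 0);
matrix_multiplication(A.data(), B.data(), C.data(), M, K, N);
// with these inputs, every element of C should equal 2*K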

Within the cudaFlow, we create two host-to-device (H2D) tasks that copy data from A and B to da and db, one device-to-host (D2H) task that copies the result from dc back to C, and one kernel task that launches matmul on the GPU (GPU 0 by default). The H2D tasks precede the kernel, and the kernel precedes the D2H task. Together, these GPU operations form a GPU task graph managed by the cudaFlow. The first dump of the taskflow gives the following graph:

[Figure: the taskflow graph before execution, with the cudaFlow shown as a single, unexpanded task]

A cudaFlow encapsulates a GPU task dependency graph, similar to a tf::Subflow (see Dynamic Tasking). To visualize it, we need to execute the graph first and then dump the taskflow.
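
In code, the order is simply run-then-dump; this two-line sketch repeats the tail of matrix_multiplication above:

// the GPU graph is constructed only when the cudaFlow task runs ...
executor.run(taskflow).wait();
// ... so a dump afterwards includes the unfolded GPU tasks
taskflow.dump(std::cout);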

[Figure: the taskflow graph after execution, with the cudaFlow unfolded into its H2D, kernel, and D2H tasks]

Benchmarking

We run three versions of matrix multiplication (sequential CPU, parallel CPU, and GPU) on a machine with 12 Intel i7-8700 CPU cores at 3.20 GHz and an NVIDIA RTX 2080 GPU, using various sizes of A, B, and C.

A         | B         | C         | CPU Sequential | CPU Parallel | GPU Parallel
----------|-----------|-----------|----------------|--------------|-------------
10x10     | 10x10     | 10x10     | 0.142 ms       | 0.414 ms     | 82 ms
100x100   | 100x100   | 100x100   | 1.641 ms       | 0.733 ms     | 83 ms
1000x1000 | 1000x1000 | 1000x1000 | 1532 ms        | 504 ms       | 85 ms
2000x2000 | 2000x2000 | 2000x2000 | 25688 ms       | 4387 ms      | 133 ms
3000x3000 | 3000x3000 | 3000x3000 | 104838 ms      | 16170 ms     | 214 ms
4000x4000 | 4000x4000 | 4000x4000 | 250133 ms      | 39646 ms     | 427 ms

As the matrix size increases, the speed-up of the GPU over the CPUs becomes prominent. For example, at 4000x4000, the GPU is 585.8 times faster than the sequential CPU version (250133 ms vs 427 ms) and 92.8 times faster than the parallel CPU version (39646 ms vs 427 ms).