Taskflow  3.2.0-Master-Branch
Loading...
Searching...
No Matches
Matrix Multiplication

We study the classic problem, 2D matrix multiplication. We will start with a short introduction about the problem and then discuss how to solve it parallel CPUs.

Problem Formulation

We are multiplying two matrices, A (MxK) and B (KxN). The numbers of columns of A must match the number of rows of B. The output matrix C has the shape of (MxN) where M is the rows of A and N the columns of B. The following example multiplies a 3x3 matrix with a 3x2 matrix to derive a 3x2 matrix.

As a general view, for each element of C we iterate a complete row of A and a complete column of B, multiplying each element and summing them.

We can implement matrix multiplication using three nested loops.

for(int m=0; m<M; m++) {
for(int n=0; n<N; n++) {
C[m][n] = 0;
for(int k=0; k<K; k++) {
C[m][n] += A[m][k] * B[k][n];
}
}
}

Parallel Patterns

At a fine-grained level, computing each element of C is independent of each other. Similarly, computing each row of C or each column of C is also independent of one another. With task parallelism, we prefer coarse-grained model to have each task perform rather large computation to amortize the overhead of creating and scheduling tasks. In this case, we avoid intensive tasks each working on only a single element. by creating a task per row of C to multiply a row of A by every column of B.

// C = A * B
// A is a MxK matrix, B is a KxN matrix, and C is a MxN matrix
void matrix_multiplication(int** A, int** B, int** C, int M, int K, int N) {
tf::Taskflow taskflow;
tf::Executor executor;
for(int m=0; m<M; ++m) {
taskflow.emplace([m, &] () {
for(int n=0; n<N; n++) {
for(int k=0; k<K; k++) {
C[m][n] += A[m][k] * B[k][n]; // inner product
}
}
});
}
executor.run(taskflow).wait();
}
class to create an executor for running a taskflow graph
Definition executor.hpp:50
tf::Future< void > run(Taskflow &taskflow)
runs a taskflow once
Definition executor.hpp:1573
Task emplace(C &&callable)
creates a static task
Definition flow_builder.hpp:742
class to create a taskflow object
Definition core/taskflow.hpp:73

Instead of creating tasks one-by-one over a loop, you can leverage Taskflow::for_each_index to create a parallel-for task. A parallel-for task spawns a subflow to perform parallel iterations over the given range.

// perform parallel iterations on the range [0, M) with the step size of 1
tf::Task task = taskflow.for_each_index(0, M, 1, [&] (int m) {
for(int n=0; n<N; n++) {
for(int k=0; k<K; k++) {
C[m][n] += A[m][k] * B[k][n];
}
}
});
Task for_each_index(B first, E last, S step, C callable)
constructs a parallel-transform task
class to create a task handle over a node in a taskflow graph
Definition task.hpp:187

Please visit Parallel Iterations for more details.

Benchmarking

Based on the discussion above, we compare the runtime of computing various matrix sizes of A, B, and C between a sequential CPU and parallel CPUs on a machine of 12 Intel i7-8700 CPUs at 3.2 GHz.

A B C CPU Sequential CPU Parallel
10x10 10x10 10x10 0.142 ms 0.414 ms
100x100 100x100 100x100 1.641 ms 0.733 ms
1000x1000 1000x1000 1000x1000 1532 ms 504 ms
2000x2000 2000x2000 2000x2000 25688 ms 4387 ms
3000x3000 3000x3000 3000x3000 104838 ms 16170 ms
4000x4000 4000x4000 4000x4000 250133 ms 39646 ms

The speed-up of parallel execution becomes clean as we increase the problem size. For example, at 4000x4000, the parallel runtime is 6.3 times faster than the sequential runtime.