Parallel Reduction

cudaFlow provides template methods to create parallel reduction tasks on a CUDA GPU.

Include the Header

You need to include the header file, taskflow/cuda/algorithm/reduce.hpp, for creating a parallel-reduction task.
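For example, the includes might look like the following minimal sketch (the taskflow/cuda/cudaflow.hpp path for tf::cudaFlow is an assumption based on Taskflow's usual header layout):

#include <taskflow/cuda/cudaflow.hpp>           // tf::cudaFlow, tf::cudaTask
#include <taskflow/cuda/algorithm/reduce.hpp>   // parallel-reduction algorithms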

Reduce Items with an Initial Value

The reduction task created by tf::cudaFlow::reduce(I first, I last, T* result, C&& bop) performs parallel reduction over a range of elements specified by [first, last) using the binary operator bop and stores the reduced result in result. It represents the parallel execution of the following reduction loop on a GPU:

while (first != last) {
  *result = bop(*result, *first++);
}

The variable result participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum all the numbers in the given range with an initial value of 1000:

const size_t N = 1000000;
std::vector<int> cpu(N, 1);                 // N ones to sum on the GPU
int sol;                                    // solution on CPU
int* res = tf::cuda_malloc_device<int>(1);  // solution on GPU
int* gpu = tf::cuda_malloc_device<int>(N);  // data on GPU

tf::cudaFlow cf;
tf::cudaTask d2h = cf.copy(&sol, res, 1);
tf::cudaTask h2d = cf.copy(gpu, cpu.data(), N);
tf::cudaTask set = cf.single_task([res] __device__ () mutable { *res = 1000; });
tf::cudaTask kernel = cf.reduce(
  gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; }
);
kernel.succeed(h2d, set)
      .precede(d2h);
cf.offload();
assert(sol == N + 1000);

Reduce Items without an Initial Value

You can use tf::cudaFlow::uninitialized_reduce to perform a parallel reduction without an initial value. This method represents the parallel execution of the following reduction loop on a GPU, so there is no need for a tf::cudaFlow::single_task to initialize the variable:

*result = *first++;  // no initial value participates in the reduction loop
while (first != last) {
  *result = bop(*result, *first++);
}

The variable result is directly assigned the reduced value, with no initial value participating in the reduction loop. The following code performs a parallel reduction to sum all the numbers in the given range without any initial value:

const size_t N = 1000000;
std::vector<int> cpu(N, 1);                 // N ones to sum on the GPU
int sol;                                    // solution on CPU
int* res = tf::cuda_malloc_device<int>(1);  // solution on GPU
int* gpu = tf::cuda_malloc_device<int>(N);  // data on GPU

tf::cudaFlow cf;
tf::cudaTask d2h = cf.copy(&sol, res, 1);
tf::cudaTask h2d = cf.copy(gpu, cpu.data(), N);
tf::cudaTask kernel = cf.uninitialized_reduce(
  gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; }
);
kernel.succeed(h2d)
      .precede(d2h);
cf.offload();
assert(sol == N);

Reduce a Range of Transformed Items with an Initial Value

tf::cudaFlow::transform_reduce performs a parallel reduction over a range of transformed elements specified by [first, last) using a binary reduce operator bop and a unary transform operator uop. It represents the parallel execution of the following reduction loop on a GPU:

while (first != last) {
  *result = bop(*result, uop(*first++));
}

The variable result participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction that sums each number in the given range multiplied by 10, with an initial value of 1000:

const size_t N = 1000000;
int* res = tf::cuda_malloc_shared<int>(1);  // result
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
*res = 1000;
std::iota(vec, vec + N, 0);

tf::cudaFlow cf;
tf::cudaTask kernel = cf.transform_reduce(
  vec, vec + N, res,
  [] __device__ (int a, int b) { return a + b; },
  [] __device__ (int a) { return a*10; }
);
cf.offload();

// *res = 1000 + (0*10 + 1*10 + 2*10 + 3*10 + 4*10 + ... + (N-1)*10)

Reduce a Range of Transformed Items without an Initial Value

tf::cudaFlow::transform_uninitialized_reduce performs a parallel reduction over a range of transformed items without an initial value. This method represents the parallel execution of the following reduction loop on a GPU:

*result = uop(*first++);  // no initial value participates in the reduction loop
while (first != last) {
  *result = bop(*result, uop(*first++));
}

The variable result is directly assigned the reduced value, with no initial value participating in the reduction loop. The following code performs a parallel reduction that sums each number in the given range multiplied by 10, without any initial value:

const size_t N = 1000000;
int* res = tf::cuda_malloc_shared<int>(1);  // result
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
*res = 1000;  // overwritten by the reduction; it does not participate in the result
std::iota(vec, vec + N, 0);

tf::cudaFlow cf;
tf::cudaTask kernel = cf.transform_uninitialized_reduce(
  vec, vec + N, res,
  [] __device__ (int a, int b) { return a + b; },
  [] __device__ (int a) { return a*10; }
);
cf.offload();

// *res = 0*10 + 1*10 + 2*10 + 3*10 + 4*10 + ... + (N-1)*10

Miscellaneous Items

Parallel-reduction algorithms are also available in tf::cudaFlowCapturer with the same API.
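As a minimal sketch (assuming the same gpu data array, res pointer, and N from the first example above), the capturer-based version of the summation could be created inside a taskflow and run through an executor:

tf::Executor executor;
tf::Taskflow taskflow;

taskflow.emplace([=](tf::cudaFlowCapturer& capturer){
  // same call signature as tf::cudaFlow::reduce
  capturer.reduce(
    gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; }
  );
});

executor.run(taskflow).wait();

The capturer records the reduction through CUDA stream capture rather than explicit graph construction, but the algorithm call itself is unchanged.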