cudaFlow provides template methods to create parallel reduction tasks on a CUDA GPU.
Include the Header
You need to include the header file, taskflow/cuda/algorithm/reduce.hpp
, for creating a parallel-reduction task.
Reduce Items with an Initial Value
The reduction task created by tf::cudaFlow::reduce(I first, I last, T* result, C&& bop) performs parallel reduction over a range of elements specified by [first, last)
using the binary operator bop
and stores the reduced result in result
. It represents the parallel execution of the following reduction loop on a GPU:
while (first != last) {
*result = bop(*result, *first++);
}
The variable result
participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum all the numbers in the given range with an initial value 1000
:
const size_t N = 1000000;
int sol;
int* res = tf::cuda_malloc_device<int>(1);
int* gpu = tf::cuda_malloc_device<int>(N);
gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; }
);
assert(sol == N + 1000);
class to create a cudaFlow task dependency graph
Definition cudaflow.hpp:56
void offload()
offloads the cudaFlow and executes it once
Definition cudaflow.hpp:1654
cudaTask reduce(I first, I last, T *result, B bop)
performs parallel reduction over a range of items
Definition reduce.hpp:578
cudaTask single_task(C c)
runs a callable with only a single kernel thread
Definition for_each.hpp:169
cudaTask copy(T *tgt, const T *src, size_t num)
creates a memcopy task that copies typed data
Definition cudaflow.hpp:1348
class to create a task handle over an internal node of a cudaFlow graph
Definition cuda_task.hpp:65
cudaTask & succeed(Ts &&... tasks)
adds precedence links from other tasks to this
Definition cuda_task.hpp:189
cudaTask & precede(Ts &&... tasks)
adds precedence links from this to other tasks
Definition cuda_task.hpp:182
Reduce Items without an Initial Value
You can use tf::cudaFlow::uninitialized_reduce to perform parallel reduction without an initial value. This method represents a parallel execution of the following reduction loop on a GPU, thus in no need of tf::cudaFlow::single_task to initialize the variable:
*result = *first++;
while (first != last) {
*result = bop(*result, *first++);
}
The variable result
is directly assigned the reduced value without any initial value to participate in the reduction loop. The following code performs a parallel reduction to sum all the numbers in the given range without any initial value:
const size_t N = 1000000;
int sol;
int* res = tf::cuda_malloc_device<int>(1);
int* gpu = tf::cuda_malloc_device<int>(N);
gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; }
);
assert(sol == N);
cudaTask uninitialized_reduce(I first, I last, T *result, B bop)
similar to tf::cudaFlow::reduce but does not assume any initial value to reduce
Definition reduce.hpp:587
Reduce a Range of Transformed Items with an Initial Value
tf::cudaFlow::transform_reduce performs a parallel reduction over a range of transformed elements specified by [first, last)
using a binary reduce operator bop
and a unary transform operator uop
. It represents the parallel execution of the following reduction loop on a GPU:
while (first != last) {
*result = bop(*result, uop(*first++));
}
The variable result
participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum all the transformed numbers multiplied by 10
in the given range with an initial value 1000
:
const size_t N = 1000000;
int* res = tf::cuda_malloc_shared<int>(1);
int* vec = tf::cuda_malloc_shared<int>(N);
*res = 1000;
vec, vec + N, res,
[] __device__ (int a, int b) { return a + b; },
[] __device__ (int a) { return a*10; }
);
cudaTask transform_reduce(I first, I last, T *result, B bop, U uop)
performs parallel reduction over a range of transformed items
Definition reduce.hpp:596
Reduce a Range of Transformed Items without an Initial Value
tf::cuda_transform_uninitialized_reduce performs a parallel reduction over a range of transformed items without an initial value. This method represents a parallel execution of the following reduction loop on a GPU:
*result = *first++;
while (first != last) {
*result = bop(*result, uop(*first++));
}
The variable result
is directly assigned the reduced value without any initial value participating in the reduction loop. The following code performs a parallel reduction to sum all the transformed numbers multiplied by 10
in the given range without any initial value:
const size_t N = 1000000;
int* res = tf::cuda_malloc_shared<int>(1);
int* vec = tf::cuda_malloc_shared<int>(N);
*res = 1000;
vec, vec + N, res,
[] __device__ (int a, int b) { return a + b; },
[] __device__ (int a) { return a*10; }
);
Miscellaneous Items
Parallel-reduction algorithms are also available in tf::cudaFlowCapturer with the same API.