Taskflow 3.2.0-Master-Branch
Taskflow provides standard template methods for performing parallel iterations over a range of items on a CUDA GPU. You need to include the header file taskflow/cuda/algorithm/for_each.hpp to use the parallel-iteration algorithms.
Index-based parallel-for performs parallel iterations over a range [first, last) with a given step size. The task created by tf::cuda_for_each_index represents a kernel of parallel execution for the following loop:
Each iteration i is independent of the others and is assigned one kernel thread to run the callable. The following example creates a kernel that assigns each entry of data to 1 over the range [0, 100) with step size 1.
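A sketch of what this example might look like with the execution-policy-based API, assuming data is an int* already allocated in GPU memory (the allocation and variable names here are illustrative, not part of the original example):

```cpp
#include <taskflow/cuda/algorithm/for_each.hpp>

int main() {
  int* data;
  cudaMalloc(&data, 100 * sizeof(int));  // device buffer of 100 ints

  tf::cudaStream stream;
  tf::cudaDefaultExecutionPolicy policy(stream);

  // assigns each entry of data to 1 over the index range [0, 100) with step 1
  tf::cuda_for_each_index(policy, 0, 100, 1,
    [data] __device__ (int i) { data[i] = 1; }
  );

  stream.synchronize();  // wait for the asynchronous kernel to finish
  cudaFree(data);
  return 0;
}
```

Note the explicit stream.synchronize() call: as stated above, the algorithm runs asynchronously through the stream specified in the execution policy.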
The parallel-iteration algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results.
Iterator-based parallel-for performs parallel iterations over a range specified by two STL-style iterators, first and last. The task created by tf::cuda_for_each represents a parallel execution of the following loop:
The two iterators, first and last, are typically two raw pointers to the first element and one past the last element of the range in GPU memory. The following example creates a for_each kernel that assigns each element in gpu_data to 1 over the range [gpu_data, gpu_data + 1000).
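A sketch of what this example might look like, assuming gpu_data is an int* already allocated in GPU memory and using the execution-policy-based overload of tf::cuda_for_each (allocation details are illustrative):

```cpp
#include <taskflow/cuda/algorithm/for_each.hpp>

int main() {
  int* gpu_data;
  cudaMalloc(&gpu_data, 1000 * sizeof(int));  // device buffer of 1000 ints

  tf::cudaStream stream;
  tf::cudaDefaultExecutionPolicy policy(stream);

  // assigns each element in [gpu_data, gpu_data + 1000) to 1
  tf::cuda_for_each(policy, gpu_data, gpu_data + 1000,
    [] __device__ (int& item) { item = 1; }
  );

  stream.synchronize();  // wait for the asynchronous kernel to finish
  cudaFree(gpu_data);
  return 0;
}
```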
Each iteration is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on the GPU, it must be declared with a __device__ specifier.