Taskflow 3.2.0-Master-Branch
tf::cudaFlow provides two template methods, tf::cudaFlow::for_each and tf::cudaFlow::for_each_index, for creating tasks to perform parallel iterations over a range of items.
You need to include the header file taskflow/cuda/algorithm/for_each.hpp to create a parallel-iteration task.
Index-based parallel-for performs parallel iterations over a range [first, last) with the given step size. The task created by tf::cudaFlow::for_each_index(I first, I last, I step, C callable) represents a kernel of parallel execution for the following loop:
Each iteration i is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on a GPU, it must be declared with a __device__ specifier. The following example creates a kernel that assigns each entry of gpu_data to 1 over the range [0, 100) with step size 1.
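The example code appears to have been lost in this copy. A minimal sketch of what it would look like with the tf::cudaFlow::for_each_index API, assuming gpu_data is an int* to device memory holding at least 100 elements and taskflow is a tf::Taskflow:

```cpp
// assumes: #include <taskflow/cuda/algorithm/for_each.hpp>
// and an int* gpu_data allocated in GPU memory with >= 100 entries
taskflow.emplace([gpu_data](tf::cudaFlow& cf) {
  // assign each entry of gpu_data to 1 over [0, 100) with step size 1;
  // the callable runs on the GPU, hence the __device__ specifier
  cf.for_each_index(0, 100, 1, [gpu_data] __device__ (int idx) {
    gpu_data[idx] = 1;
  });
});
```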
Iterator-based parallel-for performs parallel iterations over a range specified by two STL-style iterators, first and last. The task created by tf::cudaFlow::for_each(I first, I last, C callable) represents a parallel execution of the following loop:
The two iterators, first and last, are typically raw pointers to the first element and to one past the last element of the range in GPU memory space. The following example creates a for_each kernel that assigns each element in gpu_data to 1 over the range [gpu_data, gpu_data + 1000).
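The example code appears to be missing from this copy. A minimal sketch with the tf::cudaFlow::for_each API, assuming gpu_data is an int* to device memory holding 1000 elements and taskflow is a tf::Taskflow:

```cpp
// assumes: #include <taskflow/cuda/algorithm/for_each.hpp>
// and an int* gpu_data allocated in GPU memory with 1000 entries
taskflow.emplace([gpu_data](tf::cudaFlow& cf) {
  // assign each element in [gpu_data, gpu_data + 1000) to 1;
  // the callable runs on the GPU, hence the __device__ specifier
  cf.for_each(gpu_data, gpu_data + 1000, [] __device__ (int& item) {
    item = 1;
  });
});
```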
Each iteration is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on a GPU, it must be declared with a __device__ specifier.
The parallel-iteration algorithms are also available in tf::cudaFlowCapturer::for_each and tf::cudaFlowCapturer::for_each_index.
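As a sketch, the iterator-based example above could be expressed through a capturer instead, under the same assumptions about gpu_data and taskflow:

```cpp
// capturer-based variant of the same parallel-iteration task
taskflow.emplace([gpu_data](tf::cudaFlowCapturer& capturer) {
  capturer.for_each(gpu_data, gpu_data + 1000, [] __device__ (int& item) {
    item = 1;
  });
});
```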