Parallel Iterations

Taskflow provides standard template methods for performing parallel iterations over a range of items on a CUDA GPU.

Include the Header

You need to include the header file taskflow/cuda/algorithm/for_each.hpp to use the parallel-iteration algorithms.
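For example, a translation unit that uses these algorithms would begin with:

#include <taskflow/cuda/algorithm/for_each.hpp>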

Index-based Parallel Iterations

Index-based parallel-for performs parallel iterations over a range [first, last) with the given step size. The task created by tf::cuda_for_each_index represents a kernel of parallel execution for the following loop:

// positive step: first, first+step, first+2*step, ...
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// negative step: first, first+step, first+2*step, ... (step < 0)
for(auto i=first; i>last; i+=step) {
  callable(i);
}

Each iteration i is independent of the others and is assigned one kernel thread to run the callable. The following example creates a kernel that assigns each entry of data to 1 over the range [0, 100) with step size 1.

auto data = tf::cuda_malloc_shared<int>(100);
// create an execution policy (uses the default stream here)
tf::cudaDefaultExecutionPolicy policy;
// assigns each element in data to 1 over the range [0, 100) with step size 1
tf::cuda_for_each_index(
  policy, 0, 100, 1, [data] __device__ (int idx) { data[idx] = 1; }
);
// synchronize the execution
policy.synchronize();
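The step size may also be negative, in which case the iteration proceeds from first down toward last. A minimal sketch that continues the example above (the index values here are illustrative):

// visits indices 99, 97, 95, ..., 1 with step size -2
tf::cuda_for_each_index(
  policy, 99, 0, -2, [data] __device__ (int idx) { data[idx] = 2; }
);
// synchronize the execution
policy.synchronize();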

The parallel-iteration algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results.
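For instance, here is a sketch of running the algorithm through a user-created stream, assuming the execution policy can be bound to a native cudaStream_t at construction (that stream-binding constructor is an assumption of this sketch):

cudaStream_t my_stream;
cudaStreamCreate(&my_stream);

// assumption: bind the policy to the user-created stream
tf::cudaDefaultExecutionPolicy policy(my_stream);

auto data = tf::cuda_malloc_shared<int>(100);
tf::cuda_for_each_index(
  policy, 0, 100, 1, [data] __device__ (int idx) { data[idx] = 0; }
);

// the kernel stays in flight on my_stream until we synchronize
policy.synchronize();
cudaStreamDestroy(my_stream);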

Iterator-based Parallel Iterations

Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, first and last. The task created by tf::cuda_for_each represents a parallel execution of the following loop:

for(auto i=first; i<last; i++) {
  callable(*i);
}

The two iterators, first and last, are typically two raw pointers to the first element and one past the last element of the range in GPU memory space. The following example creates a for_each kernel that assigns each element in data to 1 over the range [data, data + 1000).

auto data = tf::cuda_malloc_shared<int>(1000);
// create an execution policy (uses the default stream here)
tf::cudaDefaultExecutionPolicy policy;
// assigns each element in data to 1 over the range [data, data + 1000)
tf::cuda_for_each(
  policy, data, data + 1000, [] __device__ (int& item) { item = 1; }
);
// synchronize the execution
policy.synchronize();

Each iteration is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on the GPU, it must be declared with a __device__ specifier.
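Since tf::cuda_malloc_shared returns unified shared memory, the host can inspect the result directly once the stream has been synchronized. A minimal end-to-end sketch under that assumption (tf::cuda_free is taken to be the matching deallocation routine):

#include <cassert>
#include <taskflow/cuda/algorithm/for_each.hpp>

int main() {
  auto data = tf::cuda_malloc_shared<int>(1000);
  tf::cudaDefaultExecutionPolicy policy;
  // assign every element to 1 on the GPU
  tf::cuda_for_each(
    policy, data, data + 1000, [] __device__ (int& item) { item = 1; }
  );
  // wait for the kernel to finish before touching data on the host
  policy.synchronize();
  for(int i = 0; i < 1000; ++i) {
    assert(data[i] == 1);
  }
  tf::cuda_free(data);  // release the shared memory block
  return 0;
}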