Taskflow supports SYCL, a general-purpose heterogeneous programming model, to program heterogeneous tasks in a single-source C++ environment. This chapter discusses how to write SYCL C++ kernel code with Taskflow based on the SYCL 2020 Specification.
Include the Header
You need to include the header file, taskflow/sycl/syclflow.hpp, to use tf::syclFlow.
Create a syclFlow
Taskflow introduces a task graph-based programming model, tf::syclFlow, to program SYCL tasks and their dependencies. A syclFlow is a task in a taskflow and is associated with a SYCL queue to execute kernels on a SYCL device. To create a syclFlow task, emplace a callable with an argument of type tf::syclFlow and associate it with a SYCL queue. The following example (saxpy.cpp) implements the canonical saxpy (A·X Plus Y) task graph using tf::syclFlow.
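 1: #include <taskflow/sycl/syclflow.hpp>
 2:
 3: constexpr size_t N = 1000000;
 4:
 5: int main() {
 6:
 7:   tf::Executor executor;
 8:   tf::Taskflow taskflow("saxpy example");
 9:
10:   sycl::queue queue{sycl::gpu_selector{}};
11:
12:   // allocate shared memory
13:   float* X = sycl::malloc_shared<float>(N, queue);
14:   float* Y = sycl::malloc_shared<float>(N, queue);
15:
16:   // create a syclFlow to perform the saxpy operation
17:   taskflow.emplace_on([&](tf::syclFlow& sf){
18:     tf::syclTask fillX = sf.fill(X, 1.0f, N).name("fillX");
19:     tf::syclTask fillY = sf.fill(Y, 2.0f, N).name("fillY");
20:     tf::syclTask saxpy = sf.parallel_for(sycl::range<1>(N),
21:       [=] (sycl::id<1> id) {
22:         X[id] = 3.0f * X[id] + Y[id];
23:       }
24:     ).name("saxpy");
25:     saxpy.succeed(fillX, fillY);
26:   }, queue).name("syclFlow");
27:
28:   executor.run(taskflow).wait();  // run the taskflow
29:   taskflow.dump(std::cout);       // dump the taskflow
30:
31:   // free the shared memory to avoid memory leak
32:   sycl::free(X, queue);
33:   sycl::free(Y, queue);
34: }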
Debrief:
- Lines 7-8 create a taskflow and an executor
- Line 10 creates a SYCL queue on a default-selected GPU device
- Lines 13-14 allocate shared memory that is accessible on both host and device
- Lines 17-26 create a syclFlow to define the saxpy task graph that contains:
  - one fill task to fill the memory area X with 1.0f
  - one fill task to fill the memory area Y with 2.0f
  - one kernel task to perform the saxpy operation on the GPU
- Lines 28-29 execute the taskflow and dump its graph in DOT format
- Lines 32-33 deallocate the shared memory to avoid a memory leak
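As a quick sanity check on what the saxpy task graph is expected to compute, the arithmetic can be reproduced on the host in standard C++. This is only a sketch: it uses neither SYCL nor Taskflow, and the function name saxpy_reference is hypothetical. It mirrors the two fill tasks (1.0f into X, 2.0f into Y) followed by the kernel expression 3.0f * X[i] + Y[i].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: mirrors the two fill tasks and the
// saxpy kernel from the example, without any SYCL or Taskflow calls.
std::vector<float> saxpy_reference(std::size_t n) {
  std::vector<float> X(n, 1.0f);  // corresponds to the fill task on X
  std::vector<float> Y(n, 2.0f);  // corresponds to the fill task on Y
  for (std::size_t i = 0; i < n; ++i) {
    X[i] = 3.0f * X[i] + Y[i];    // same expression as the kernel body
  }
  return X;
}
```

Comparing the shared-memory array X against such a reference after the taskflow finishes is one simple way to validate a run; every element should equal 3.0f * 1.0f + 2.0f.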
tf::syclFlow is a lightweight task graph-based programming layer atop SYCL. We do not attempt to simplify kernel programming; instead, we focus on tasking SYCL operations and their dependencies. This organization lets users take full advantage of SYCL features that are commensurate with their domain knowledge, while leaving difficult task parallelism details to Taskflow.
Compile a syclFlow Program
Use DPC++ clang to compile a syclFlow program (the -fsycl-targets flag below selects the CUDA target):
~$ clang++ -fsycl -fsycl-unnamed-lambda \
   -fsycl-targets=nvptx64-nvidia-cuda \
   -I path/to/taskflow -pthread -std=c++17 saxpy.cpp -o saxpy
~$ ./saxpy
Please visit the page Compile Taskflow with SYCL for more details.
Create Memory Operation Tasks
tf::syclFlow provides a set of methods for creating tasks to perform common memory operations, such as copy, set, and fill, on memory areas pointed to by unified shared memory (USM) pointers. The following example creates a syclFlow task of two copy operations and one fill operation that together set the first N/2 elements in the vector to -1.
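tf::Taskflow taskflow;
tf::Executor executor;
sycl::queue queue;
size_t N = 1000;
int* hvec = new int[N];
std::fill(hvec, hvec + N, 100);  // every host element starts at 100
int* dvec = sycl::malloc_device<int>(N, queue);
// create a syclFlow task that sets the first N/2 elements to -1
taskflow.emplace_on([&](tf::syclFlow& syclflow){
  tf::syclTask ch2d = syclflow.copy(dvec, hvec, N);  // host to device
  tf::syclTask fill = syclflow.fill(dvec, -1, N/2);
  tf::syclTask cd2h = syclflow.copy(hvec, dvec, N);  // device to host
  fill.precede(cd2h)
      .succeed(ch2d);
}, queue);
executor.run(taskflow).wait();
// inspect the result
for(size_t i=0; i<N; i++) {
  assert(hvec[i] == ((i < N / 2) ? -1 : 100));
}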
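The precede and succeed calls in this example form a chain ch2d, then fill, then cd2h. To illustrate how such dependency links determine a valid execution order, here is a small host-only sketch of Kahn's algorithm in standard C++. It is not Taskflow's implementation; the function name topological_order and the integer task ids are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Returns one topological order of tasks 0..n-1 given (before, after) edges.
// Illustrative only; task ids are plain integers chosen by the caller.
std::vector<std::size_t> topological_order(
    std::size_t n,
    const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
  std::vector<std::vector<std::size_t>> successors(n);
  std::vector<std::size_t> pending(n, 0);  // unfinished predecessors per task
  for (const auto& e : edges) {
    successors[e.first].push_back(e.second);
    ++pending[e.second];
  }
  std::queue<std::size_t> ready;
  for (std::size_t t = 0; t < n; ++t) {
    if (pending[t] == 0) ready.push(t);  // tasks with no predecessors
  }
  std::vector<std::size_t> order;
  while (!ready.empty()) {
    std::size_t t = ready.front();
    ready.pop();
    order.push_back(t);
    for (std::size_t s : successors[t]) {
      if (--pending[s] == 0) ready.push(s);  // all predecessors finished
    }
  }
  return order;  // shorter than n if the edges contain a cycle
}
```

With ids 0 (ch2d), 1 (fill), and 2 (cd2h) and the edges (0, 1) and (1, 2), the only valid order is 0, 1, 2, which matches the intent of copying to the device, filling, and copying back.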
Both tf::syclFlow::copy and tf::syclFlow::fill operate on typed data. You can use tf::syclFlow::memcpy and tf::syclFlow::memset to operate on untyped data (i.e., array of bytes).
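taskflow.emplace_on([&](tf::syclFlow& syclflow){
  tf::syclTask ch2d = syclflow.memcpy(dvec, hvec, N*sizeof(int));
  tf::syclTask fill = syclflow.memset(dvec, -1, N/2*sizeof(int));
  tf::syclTask cd2h = syclflow.memcpy(hvec, dvec, N*sizeof(int));
  fill.precede(cd2h).succeed(ch2d);
}, queue);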
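The difference between typed and untyped operations is easy to get wrong, so here is a host-only sketch using the standard-library analogues std::fill and std::memset. It assumes, consistent with the text above, that the typed operations work per element while the untyped ones work per byte. The helper names typed_fill and byte_fill are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Typed fill: every element receives the full int value.
std::vector<int> typed_fill(std::size_t n, int value) {
  std::vector<int> v(n);
  std::fill(v.begin(), v.end(), value);
  return v;
}

// Untyped fill: every *byte* receives the low 8 bits of `value`.
// Callers should pass n > 0 so that v.data() is a valid pointer.
std::vector<int> byte_fill(std::size_t n, int value) {
  std::vector<int> v(n);
  std::memset(v.data(), value, n * sizeof(int));
  return v;
}
```

A byte-wise set with -1 writes 0xFF into every byte, which reads back as -1 for a two's-complement int, so it happens to agree with the typed fill. A byte-wise set with 1 does not produce the integer 1, because each byte of the int becomes 0x01.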
Create Kernel Tasks
SYCL allows a simple execution model in which a kernel is invoked over an N-dimensional index space defined by sycl::range<N>, where N is one, two or three. Each work item in such a kernel executes independently across a set of partitioned work groups. tf::syclFlow::parallel_for defines several variants to create a kernel task. The following variant pairs up a sycl::range and a sycl::id to set each element in data to 1.0f when it is not necessary to query the global range of the index space being executed across.
syclflow.parallel_for(
  sycl::range<1>(N), [data](sycl::id<1> id){ data[id] = 1.0f; }
);
For the same example, the following variant enables low-level functionality of work items and work groups using sycl::nd_range and sycl::nd_item. This becomes valuable when an execution requires groups of work items that communicate and synchronize.
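// N is the global size and M is the work-group size (N must be a multiple of M)
syclflow.parallel_for(
  sycl::nd_range<1>{sycl::range<1>(N), sycl::range<1>(M)},
  [data](sycl::nd_item<1> item){
    auto id = item.get_global_linear_id();
    data[id] = 1.0f;
  }
);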
All the kernel methods defined in the SYCL queue are applicable for tf::syclFlow::parallel_for.
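To make the relationship between the global range, the work-group size, and a work item's global linear id concrete, here is a host-only sketch of the one-dimensional index arithmetic. It uses plain C++ rather than SYCL, and the helper names num_groups and global_linear_id are hypothetical. It assumes a 1-D sycl::nd_range with global size N and local size M, where N is a multiple of M.

```cpp
#include <cassert>
#include <cstddef>

// Number of work groups in a 1-D nd_range with global size N and local size M.
std::size_t num_groups(std::size_t N, std::size_t M) {
  assert(M != 0 && N % M == 0);  // the global size must divide evenly
  return N / M;
}

// Global linear id of a work item, given its group index and its index
// inside that group.
std::size_t global_linear_id(std::size_t group, std::size_t local,
                             std::size_t M) {
  assert(local < M);
  return group * M + local;
}
```

For example, with N = 1024 and M = 256 there are four work groups, and the last work item of the last group has global linear id 1023.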
Create Command Group Function Object Tasks
SYCL provides a way to encapsulate a device-side operation and all its data and event dependencies in a single command group function object. The function object accepts a command group handler constructed by the SYCL runtime. The command group handler is the heart of SYCL programming, as it defines nearly all kernel-related methods, including submission, execution, and synchronization. You can directly create a SYCL task from a command group function object using tf::syclFlow::on.
syclflow.on(
  [=] (sycl::handler& handler) {
    handler.require(accessor);
    handler.single_task([=](){  // place a single-threaded kernel function
      data[0] = 1;
    });
  }
);
Offload a syclFlow
By default, the executor offloads and executes the syclFlow once. When a syclFlow is being executed, its task graph will be materialized by the Taskflow runtime and submitted to its associated SYCL queue in a topological order of task dependencies defined in that graph. You can explicitly execute a syclFlow using different offload methods:
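taskflow.emplace_on([](tf::syclFlow& sf) {
  // ... create SYCL tasks
  sf.offload();      // offload the syclFlow and run it once
  sf.offload_n(10);  // offload the syclFlow and run it 10 times
  sf.offload_until([repeat=5] () mutable { return repeat-- == 0; });  // run until the predicate returns true
}, queue);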
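The stateful predicate passed to offload_until can be hard to read at first glance. The following host-only sketch models a "run until the predicate becomes true" loop in plain C++, under the assumption that the predicate is evaluated before each run. The helper run_until is hypothetical and is not part of Taskflow.

```cpp
#include <cassert>

// Hypothetical stand-in for a "run until the predicate is true" loop.
// Returns how many times `run` was invoked.
template <typename Predicate, typename Run>
int run_until(Predicate&& stop, Run&& run) {
  int count = 0;
  while (!stop()) {  // the predicate is checked before every run
    run();
    ++count;
  }
  return count;
}
```

Under that assumption, a predicate such as [repeat=5] () mutable { return repeat-- == 0; } returns false five times before returning true, so the body runs five times. The mutable keyword is what lets the lambda decrement its own captured copy of repeat between calls.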
After you offload a syclFlow, it is considered executed, and the executor will not run an offloaded syclFlow after leaving the syclFlow task callable. On the other hand, if a syclFlow is not offloaded, the executor runs it once. For example, the following two versions represent the same execution logic.
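// version 1: explicitly offload the syclFlow once
taskflow.emplace_on([](tf::syclFlow& sf) { sf.single_task([](){}); sf.offload(); }, queue);
// version 2: let the executor offload the syclFlow once
taskflow.emplace_on([](tf::syclFlow& sf) { sf.single_task([](){}); }, queue);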
Update a syclFlow
You can update a SYCL task from an offloaded syclFlow and rebind it to another task type. For example, you can rebind a memory operation task to a parallel-for kernel task from an offloaded syclFlow and vice versa.
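size_t N = 10000;
sycl::queue queue;
int* data = sycl::malloc_shared<int>(N, queue);
taskflow.emplace_on([&](tf::syclFlow& syclflow){
  // create a task that sets each element to -1
  tf::syclTask task = syclflow.fill(data, -1, N);
  syclflow.offload();
  std::for_each(data, data+N, [](int v){ assert(v == -1); });
  // rebind the task to a parallel-for kernel that sets each element to 100
  syclflow.rebind_parallel_for(task, sycl::range<1>(N), [data](sycl::id<1> id){
    data[id] = 100;
  });
  syclflow.offload();
  std::for_each(data, data+N, [](int v){ assert(v == 100); });
}, queue);
executor.run(taskflow).wait();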
Each method of task creation in tf::syclFlow has a corresponding method of rebinding a task to that task type (e.g., tf::syclFlow::on and tf::syclFlow::rebind_on, tf::syclFlow::parallel_for and tf::syclFlow::rebind_parallel_for).
Use syclFlow in a Standalone Environment
You can use tf::syclFlow in a standalone environment without going through tf::Taskflow and offload it to a SYCL device from the caller thread. All the tasking methods we have discussed so far apply to the standalone use.
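sycl::queue queue;
tf::syclFlow sf(queue);  // create a standalone syclFlow
// dx and dy are device-accessible pointers of N floats allocated elsewhere
tf::syclTask saxpy = sf.parallel_for(
  sycl::range<1>(N), [=] (sycl::id<1> id) {
    dx[id] = 2.0f * dx[id] + dy[id];
  }
).name("saxpy");
sf.offload();  // offload and run the standalone syclFlow once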
Note: In standalone mode, a syclFlow will not be executed until you explicitly call an offload method, as there is neither a taskflow nor an executor.
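The deferred behavior described in the note can be pictured with a toy host-only class that records work and runs it only on request. This is a sketch of the general idea, not of how tf::syclFlow is implemented. The class DeferredFlow and its methods add and offload are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy class: records work and runs it only when offload()
// is called, mimicking "nothing executes until you offload".
class DeferredFlow {
 public:
  // Record a unit of work without running it.
  void add(std::function<void()> work) { tasks_.push_back(std::move(work)); }

  // Run all recorded work once, in insertion order.
  void offload() {
    for (auto& task : tasks_) task();
  }

 private:
  std::vector<std::function<void()>> tasks_;
};
```

Adding work to such an object has no visible effect on its own; the side effects appear only after offload() returns, which is the same discipline the standalone mode asks of the caller.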