Taskflow
3.2.0-Master-Branch
class to capture a CUDA graph using a round-robin algorithm
#include <cuda_optimizer.hpp>
Public Member Functions
cudaRoundRobinCapturing ()=default
    constructs a round-robin optimizer with 4 streams by default
cudaRoundRobinCapturing (size_t num_streams)
    constructs a round-robin optimizer with the given number of streams
size_t num_streams () const
    queries the number of streams used by the optimizer
void num_streams (size_t n)
    sets the number of streams used by the optimizer
Friends
class cudaFlowCapturer
A round-robin capturing algorithm levelizes the user-described graph and assigns streams to nodes in round-robin order, level by level. The algorithm is based on a paper published in Euro-Par 2021.
The round-robin optimization algorithm is best suited for large cudaFlow graphs that comprise hundreds or thousands of GPU operations (e.g., kernels and memory copies), many of which can run in parallel. You can configure the number of streams used by the optimizer to adjust the maximum kernel concurrency in the captured CUDA graph.
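The level-by-level round-robin assignment described above can be illustrated with a small, self-contained C++ sketch. This is not Taskflow's implementation: the function name `assign_streams_round_robin`, the adjacency-list graph representation, and the choice to restart the stream counter at every level are all assumptions made for illustration. It only shows how the stream count bounds how many nodes of one level can be placed on distinct streams.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Taskflow). The graph is a DAG given as
// adjacency lists: successors[u] holds the nodes that depend on node u.
// Returns the stream index assigned to each node.
std::vector<std::size_t> assign_streams_round_robin(
    const std::vector<std::vector<std::size_t>>& successors,
    std::size_t num_streams) {
  if (num_streams == 0) {
    throw std::invalid_argument("num_streams must be positive");
  }

  const std::size_t n = successors.size();
  std::vector<std::size_t> in_degree(n, 0);
  std::vector<std::size_t> level(n, 0);
  for (const auto& outs : successors) {
    for (std::size_t v : outs) ++in_degree[v];
  }

  // Levelize with a topological traversal: the level of a node is the
  // length of the longest path from any source node to it.
  std::queue<std::size_t> ready;
  for (std::size_t u = 0; u < n; ++u) {
    if (in_degree[u] == 0) ready.push(u);
  }
  std::size_t max_level = 0;
  while (!ready.empty()) {
    const std::size_t u = ready.front();
    ready.pop();
    for (std::size_t v : successors[u]) {
      level[v] = std::max(level[v], level[u] + 1);
      max_level = std::max(max_level, level[v]);
      if (--in_degree[v] == 0) ready.push(v);
    }
  }

  // Group nodes by level, in increasing node-index order.
  std::vector<std::vector<std::size_t>> levels(max_level + 1);
  for (std::size_t u = 0; u < n; ++u) levels[level[u]].push_back(u);

  // Assign streams in round-robin order within each level.
  // Assumption: the counter restarts at stream 0 for every level.
  std::vector<std::size_t> stream(n, 0);
  for (const auto& nodes : levels) {
    std::size_t next = 0;
    for (std::size_t u : nodes) {
      stream[u] = next;
      next = (next + 1) % num_streams;
    }
  }
  return stream;
}
```

For a graph with one source, three independent middle nodes, and one sink, calling the helper with two streams places the three middle nodes on streams 0, 1, and 0, so at most two of them can overlap. Raising the stream count to three or more would let all three run concurrently, which is the trade-off the `num_streams` setting controls.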