Taskflow
3.2.0-Master-Branch
Taskflow provides standalone template methods for sorting ranges of items on a CUDA GPU.
You need to include the header file taskflow/cuda/algorithm/sort.hpp to use the parallel-sort algorithm.
tf::cuda_sort performs an in-place parallel sort over a range of elements specified by [first, last) using the given comparator. A typical use is sorting one million random integers in increasing order on a GPU.
The sort algorithm runs asynchronously through the stream specified in the execution policy, so you need to synchronize the stream before reading the results. Because the GPU sort algorithm may require an extra buffer to store temporary results, you need to provide a buffer of at least the number of bytes returned by tf::cuda_sort_buffer_size.
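The buffer requirement follows a common pattern: the caller asks how much temporary storage a sort needs, allocates it, and passes it in. The sketch below illustrates that pattern on the host only, with a bottom-up merge sort. The names sort_scratch_size and sort_with_scratch are hypothetical and are not part of Taskflow; the scratch size is counted in elements here rather than bytes, and nothing in this sketch runs on a GPU or through a stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of scratch elements the sort below needs
// for n items (one temporary slot per input element).
inline std::size_t sort_scratch_size(std::size_t n) { return n; }

// Hypothetical host-side bottom-up merge sort that uses caller-provided
// scratch storage instead of allocating its own temporary memory.
template <typename T, typename C>
void sort_with_scratch(T* first, T* last, C comp, T* scratch) {
  const std::size_t n = static_cast<std::size_t>(last - first);
  assert(n == 0 || scratch != nullptr);
  for (std::size_t width = 1; width < n; width *= 2) {
    // Merge neighbouring sorted runs of length `width` into the scratch buffer.
    for (std::size_t lo = 0; lo < n; lo += 2 * width) {
      const std::size_t mid = std::min(lo + width, n);
      const std::size_t hi  = std::min(lo + 2 * width, n);
      std::merge(first + lo, first + mid, first + mid, first + hi,
                 scratch + lo, comp);
    }
    // Copy the merged runs back so the sort stays in place for the caller.
    std::copy(scratch, scratch + n, first);
  }
}
```

A caller would first size a buffer with sort_scratch_size(n), then pass its pointer as the last argument, much as the buffer sized by tf::cuda_sort_buffer_size is handed to the GPU sort.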
tf::cuda_sort_by_key sorts a range of key-value items into ascending key order. If i and j are any two valid iterators in [k_first, k_last) such that i precedes j, and p and q are the iterators in [v_first, v_first + (k_last - k_first)) corresponding to i and j respectively, then after the sort comp(*j, *i) is false, and the values *p and *q still accompany the keys *i and *j. In other words, the keys are sorted into ascending order and their corresponding values are swapped along with them.
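To make the key-value semantics concrete, here is a minimal host-side reference that produces the ordering described above. It is only a sketch: sort_by_key_reference is a hypothetical name, it runs on the CPU with std::stable_sort, and its stability is a property of this sketch rather than a statement about tf::cuda_sort_by_key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side reference: sorts `keys` according to `comp` and
// applies the same permutation to `values`, so each value stays with its key.
template <typename K, typename V, typename C>
void sort_by_key_reference(std::vector<K>& keys, std::vector<V>& values, C comp) {
  assert(keys.size() == values.size());

  // Compute the permutation that sorts the keys.
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return comp(keys[a], keys[b]);
                   });

  // Apply the permutation to keys and values together.
  std::vector<K> sorted_keys;
  std::vector<V> sorted_values;
  sorted_keys.reserve(keys.size());
  sorted_values.reserve(values.size());
  for (std::size_t i : order) {
    sorted_keys.push_back(keys[i]);
    sorted_values.push_back(values[i]);
  }
  keys.swap(sorted_keys);
  values.swap(sorted_values);
}
```

For instance, with keys {3, 1, 2} and values {30, 10, 20}, an ascending comparator leaves the keys as {1, 2, 3} and the values as {10, 20, 30}.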
The buffer size required by tf::cuda_sort_by_key is the same as for tf::cuda_sort: the buffer must be at least as large as the value returned by tf::cuda_sort_buffer_size. While you can capture the values into the lambda and sort them indirectly using plain tf::cuda_sort, this organization results in frequent and costly accesses to global memory. For example, you can sort an index array idx indirectly with a comparator that captures the keys in vec.
The comparator in this arrangement frequently accesses the global memory of vec, resulting in high memory latency. Instead, you should use tf::cuda_sort_by_key, which has been optimized for this purpose.
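The indirect arrangement discussed above can be sketched on the host as follows. This is an illustration only: indirect_sort is a hypothetical helper, it uses std::sort on the CPU, and the index type std::size_t is an assumption. It shows why the approach is access-heavy: every comparison looks up two keys through the captured array.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the indices of `vec` ordered by the keys
// they refer to, leaving `vec` itself untouched.
inline std::vector<std::size_t> indirect_sort(const std::vector<int>& vec) {
  std::vector<std::size_t> idx(vec.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});

  // Each comparison dereferences the captured keys twice. On a GPU these
  // would be scattered reads from global memory.
  std::sort(idx.begin(), idx.end(),
            [&vec](std::size_t a, std::size_t b) { return vec[a] < vec[b]; });
  return idx;
}
```

With keys {40, 10, 30, 20}, the returned index order is {1, 3, 2, 0}. A key-value sort yields the same pairing while moving keys and values together, which avoids the repeated lookups in the comparator.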