Taskflow provides standalone template methods for merging two sorted ranges of items into a sorted range of items.
Include the Header
You need to include the header file, taskflow/cuda/algorithm/merge.hpp
, for using the parallel-merge algorithm.
Merge Two Sorted Ranges of Items
tf::cuda_merge merges two sorted ranges of items into a sorted range. The following code merges two sorted arrays input_1
and input_2
, each of 1000 items, into a sorted array output
of 2000 items.
const size_t N = 1000;
int* input_1 = tf::cuda_malloc_shared<int>(N);
int* input_2 = tf::cuda_malloc_shared<int>(N);
int* output = tf::cuda_malloc_shared<int>(2*N);
for(size_t i=0; i<N; i++) {
input_1[i] = rand()%100;
input_2[i] = rand()%100;
}
auto bytes = tf::cuda_merge_buffer_size<tf::cudaDefaultExecutionPolicy>(N, N);
auto buffer = tf::cuda_malloc_device<std::byte>(bytes);
input_1, input_1 + N, input_2, input_2 + N, output,
[]__device__ (int a, int b) { return a < b; },
buffer
);
policy.synchronize();
class to define execution policy for CUDA standard algorithms
Definition cuda_execution_policy.hpp:29
void cuda_merge(P &&p, a_keys_it a_keys_first, a_keys_it a_keys_last, b_keys_it b_keys_first, b_keys_it b_keys_last, c_keys_it c_keys_first, C comp, void *buf)
performs asynchronous key-only merge over a range of keys
Definition merge.hpp:629
void cuda_free(T *ptr, int d)
frees memory on the GPU device
Definition cuda_memory.hpp:101
The merge algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results. Since the GPU merge algorithm may require extra buffer to store the temporary results, you must provide a buffer of size at least bytes returned from tf::cuda_merge_buffer_size.
- Attention
- You must keep the buffer alive before the merge call completes.
Merge Two Sorted Ranges of Key-Value Items
tf::cuda_merge_by_key performs key-value merge over two sorted ranges in a similar way to tf::cuda_merge; additionally, it copies elements from the two ranges of values associated with the two input keys, respectively. The following code performs key-value merge over a
and b:
const size_t N = 2;
int* a_keys = tf::cuda_malloc_shared<int>(N);
int* a_vals = tf::cuda_malloc_shared<int>(N);
int* b_keys = tf::cuda_malloc_shared<int>(N);
int* b_vals = tf::cuda_malloc_shared<int>(N);
int* c_keys = tf::cuda_malloc_shared<int>(2*N);
int* c_vals = tf::cuda_malloc_shared<int>(2*N);
a_keys[0] = 8, a_keys[1] = 1;
a_vals[0] = 1, a_vals[1] = 2;
b_keys[0] = 3, b_keys[1] = 7;
b_vals[0] = 3, b_vals[1] = 4;
auto bytes = tf::cuda_merge_buffer_size<tf::cudaDefaultExecutionPolicy>(N, N);
auto buffer = tf::cuda_malloc_device<std::byte>(bytes);
policy,
a_keys, a_keys+N, a_vals,
b_keys, b_keys+N, b_vals,
c_keys, c_vals,
[]__device__ (int a, int b) { return a < b; },
buffer
);
policy.synchronize();
void cuda_merge_by_key(P &&p, a_keys_it a_keys_first, a_keys_it a_keys_last, a_vals_it a_vals_first, b_keys_it b_keys_first, b_keys_it b_keys_last, b_vals_it b_vals_first, c_keys_it c_keys_first, c_vals_it c_vals_first, C comp, void *buf)
performs asynchronous key-value merge over a range of keys and values
Definition merge.hpp:560
The buffer size required by tf::cuda_merge_by_key is the same as tf::cuda_merge and must be at least equal to or larger than the value returned by tf::cuda_merge_buffer_size.