
Multiversion B-Tree

A fully concurrent GPU B-Tree that supports versioning (snapshots) and linearizable multipoint queries. Using our data structure and the tools we provide, you can launch one or more kernels in which you concurrently perform queries (e.g., point or range queries) and mutations (e.g., insert or update).

For more information, please check our PACT 2022 paper:

A GPU Multiversion B-Tree
Muhammad A. Awad, Serban D. Porumbescu, and John D. Owens

The repository also contains:

  1. An implementation of our epoch-based memory reclamation strategy (see the sketch after this list)
  2. A SlabAlloc memory allocator redesigned to allow building more complex allocators via composition
  3. An improved implementation of our B-Tree (a reference B-Tree that doesn't support snapshots; see Footnote 1)
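
For readers unfamiliar with the technique behind item 1, here is a minimal host-side C++ sketch of the general epoch-based reclamation idea. It is illustrative only: the repository's implementation is device-side and differs in detail, and every name below is an assumption rather than the library's API.

#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct ebr_state {
  static constexpr int max_threads = 128;
  std::atomic<uint64_t> global_epoch{0};
  // Each participating thread announces the epoch it entered; ~0ull means
  // "not inside a critical region".
  std::atomic<uint64_t> announced[max_threads];
  // Three retire lists suffice: only epochs e-1 and e can have live readers.
  // (A real implementation shards these per thread; left unsynchronized here.)
  std::vector<void*> retired[3];
  ebr_state() { for (auto& a : announced) a.store(~0ull); }
};

void enter(ebr_state& s, int tid) {  // call before touching shared nodes
  s.announced[tid].store(s.global_epoch.load(std::memory_order_acquire),
                         std::memory_order_seq_cst);
}

void leave(ebr_state& s, int tid) {  // call after the operation finishes
  s.announced[tid].store(~0ull, std::memory_order_release);
}

// A node unlinked from the structure may still be visible to in-flight
// readers, so it is retired into the current epoch's list, not freed.
void retire(ebr_state& s, void* node) {
  s.retired[s.global_epoch.load(std::memory_order_acquire) % 3].push_back(node);
}

// Advance the epoch only once every active thread has announced the current
// one; nodes retired two epochs ago can then no longer be reached.
void try_advance_and_reclaim(ebr_state& s) {
  const uint64_t e = s.global_epoch.load(std::memory_order_acquire);
  for (auto& a : s.announced) {
    const uint64_t seen = a.load(std::memory_order_acquire);
    if (seen != ~0ull && seen < e) return;  // a reader lags behind
  }
  s.global_epoch.store(e + 1, std::memory_order_release);
  for (void* node : s.retired[(e + 2) % 3]) std::free(node);  // epoch e-1 list
  s.retired[(e + 2) % 3].clear();
}

The core invariant: a node retired in epoch e can only be referenced by threads that entered in epoch e or earlier, so once the global epoch has advanced twice past the retire epoch, freeing the node is safe.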

Our vision

GPU data structures such as the multiversion GPU B-Tree and the other data structures we developed (see Footnotes 1 and 2) should be usable in the following concise and elegant manner:

#include <gpu_versioned_blink_tree.hpp>
#include <cooperative_groups.h>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>

int main() {
  using key_t   = uint32_t;
  using value_t = uint32_t;
  using tree_t  = GpuBTree::gpu_versioned_blink_tree<key_t, value_t>;

  tree_t vtree(....);                       // call the data structure constructor
  thrust::device_vector<key_t> keys(....);  // initialize keys

  // solve a problem, performing operations in a fully concurrent manner
  // (a __device__ lambda requires compiling with --extended-lambda)
  thrust::for_each(keys.begin(), keys.end(), [vtree] __device__ (auto key) {
    // perform operations in a tile-synchronous way
    auto block = cooperative_groups::this_thread_block();
    auto tile  = cooperative_groups::tiled_partition<tree_t::branching_factor>(block);
    // ... problem-specific code
    auto value = ...;
    vtree.cooperative_insert(key, value, tile, ...);  // insert
    // ... maybe more problem-specific application code
    auto snapshot_id = vtree.take_snapshot();         // take a snapshot
    // ... maybe even more problem-specific code
    auto found_value = vtree.cooperative_find(key, tile, snapshot_id, ...);  // query
    // ... maybe even more problem-specific code
  });
}

The previous example illustrates our vision for using GPU data structures. To a large extent, we can perform most of these operations using current CUDA/C++ abstractions and compilers; however, some of the APIs, such as memory allocators and reclaimers (especially on-device ones), still lack adequate support and standardization. BGHT (see Footnote 2) provides the same device-side APIs and would require almost zero modifications to run the example snippet above.
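
As a concrete illustration of what works today, the following hedged sketch hand-writes the kernel that the thrust::for_each in the vision snippet stands in for. It uses the tile-synchronous work-queue idiom from the repository's insertion kernel (quoted in full in the issues section below); member names such as allocator_, device_allocator_context_type, invalid_key, and invalid_value are copied from that snippet and may not match the current headers exactly.

#include <cooperative_groups.h>
#include <cstddef>
namespace cg = cooperative_groups;

template <typename key_type, typename value_type, typename btree>
__global__ void vision_insert_kernel(const key_type* keys,
                                     const value_type* values,
                                     std::size_t count,
                                     btree tree) {
  auto block = cg::this_thread_block();
  auto tile  = cg::tiled_partition<btree::branching_factor>(block);
  std::size_t thread_id = blockIdx.x * std::size_t{blockDim.x} + threadIdx.x;
  // Retire whole tiles only: cooperative calls are tile-synchronous.
  if (thread_id - tile.thread_rank() >= count) { return; }

  bool has_work    = thread_id < count;
  key_type   key   = has_work ? keys[thread_id] : btree::invalid_key;
  value_type value = has_work ? values[thread_id] : btree::invalid_value;

  typename btree::device_allocator_context_type allocator{tree.allocator_, tile};

  // Serialize the tile's keys: every thread joins every insert call.
  auto work_queue = tile.ballot(has_work);
  while (work_queue) {
    auto cur_rank = __ffs(work_queue) - 1;  // next thread with pending work
    tree.cooperative_insert(tile.shfl(key, cur_rank),
                            tile.shfl(value, cur_rank), tile, allocator);
    if (tile.thread_rank() == cur_rank) { has_work = false; }
    work_queue = tile.ballot(has_work);
  }
}

The per-thread lambda in the vision snippet hides exactly this ballot/shfl serialization, which is why tile-synchronous device APIs are one of the abstractions that still need standardization.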

Requirements and limitations

Please create an issue if you face challenges with any of the following limitations and requirements.

Requirements

  • C++17/CUDA C++17
  • NVIDIA Volta GPU or later microarchitectures
  • CMake 3.18 or later
  • CUDA 11.5 or later
  • GPU with 20 GiB of memory or more to run the benchmarks

Limitations

  • Keys and values must be unsigned 32-bit integers
  • Snapshots are limited to a maximum of 2^32 - 1 (this can easily be extended to 2^64 - 1)

Reproducing the paper results

To reproduce the results, follow the steps in the repository's reproduction guide. Our PACT 2022 paper was awarded the Results Reproduced v1.1 badge. If you find any mismatch (either faster or slower) between your results and those in the paper, please create an issue, and we will investigate the performance changes.

Questions and bug reports

Please create an issue. We welcome any contributions that improve the usability and quality of our repository.

BibTeX

The code in this repository is based on our Multiversion GPU B-Tree and GPU B-Tree publications:

@InProceedings{   Awad:2022:AGM,
  author        = {Muhammad A. Awad and Serban D. Porumbescu and John D.
                  Owens},
  title         = {A {GPU} Multiversion {B}-Tree},
  booktitle     = {Proceedings of the International Conference on Parallel
                  Architectures and Compilation Techniques},
  series        = {PACT 2022},
  year          = 2022,
  month         = oct,
  code          = {https://github.com/owensgroup/MVGpuBTree},
  doi           = {10.1145/3559009.3569681},
  url           = {https://escholarship.org/uc/item/4mz5t5b7},
  ucdcite       = {a146}
}
@InProceedings{   Awad:2019:EAH,
  author        = {Muhammad A. Awad and Saman Ashkiani and Rob Johnson and
                  Mart\'{\i}n Farach-Colton and John D. Owens},
  title         = {Engineering a High-Performance {GPU} {B}-Tree},
  booktitle     = {Proceedings of the 24th ACM SIGPLAN Symposium on
                  Principles and Practice of Parallel Programming},
  series        = {PPoPP 2019},
  year          = 2019,
  month         = feb,
  pages         = {145--157},
  acceptance    = {29 of 152 submissions, 19.1\%},
  doi           = {10.1145/3293883.3295706},
  acmauthorize  = {https://dl.acm.org/doi/10.1145/3293883.3295706?cid=81100458295},
  url           = {https://escholarship.org/uc/item/1ph2x5td},
  code          = {https://github.com/owensgroup/GpuBTree},
  ucdcite       = {a127}
}

Footnotes

  1. Awad et al., Engineering a High-Performance GPU B-Tree, https://github.com/owensgroup/GpuBTree

  2. Awad et al., Analyzing and Implementing GPU Hash Tables, https://github.com/owensgroup/BGHT


Issues

Occasional insertion deadlocks

Occasionally, when calling cooperative_insert from my own kernel, the function never returns.
I am running the code on an RTX 4090 with driver version 525.78.01, and CUDA 11.8.
I was able to reproduce this issue multiple times using the following code:

// Standard headers needed by this snippet (added for completeness); the
// tree type and kernel come from the repository's headers.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <unordered_set>
#include <vector>

void investigate_tree_deadlock() {
    using key_type = uint32_t;
    using value_type = uint32_t;

    size_t build_size = size_t{1} << 25;
    key_type min_usable_key = 1;
    key_type max_usable_key = std::numeric_limits<key_type>::max() - 2;

    std::mt19937_64 gen(42);
    std::uniform_int_distribution<key_type> key_dist(min_usable_key, max_usable_key);
    std::vector<key_type> build_keys(build_size);
    std::unordered_set<key_type> build_keys_set;
    while (build_keys_set.size() < build_size) {
        key_type key = key_dist(gen);
        build_keys_set.insert(key);
    }
    std::copy(build_keys_set.begin(), build_keys_set.end(), build_keys.begin());
    std::sort(build_keys.begin(), build_keys.end());

    key_type* keys_on_gpu;
    cudaMalloc(&keys_on_gpu, build_size * sizeof(key_type));
    cudaMemcpy(keys_on_gpu, build_keys.data(), build_size * sizeof(key_type), cudaMemcpyHostToDevice);

    for (size_t i = 0; i < 10000; ++i) {
        std::cout << "round " << i << " starting" << std::endl;

        gpu_blink_tree<key_type, value_type, 16> tree;
        modified_insert_kernel<<<(build_size + 511) / 512, 512>>>(keys_on_gpu, build_size, tree);

        std::cout << "tree uses " << tree.compute_memory_usage() << " GB" << std::endl;
        std::cout << "round " << i << " done" << std::endl;
    }

    cudaFree(keys_on_gpu);
}

I ran the snippet twice and observed the issue in iterations 61 and 1699, respectively. In both cases, I had to terminate the process forcefully using CTRL+C. My modified_insert_kernel is almost identical to the default insertion kernel; it looks like this:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;  // alias used throughout the kernel

template <typename key_type, typename size_type, typename btree>
__global__ void modified_insert_kernel(
    const key_type* keys,
    const size_type keys_count,
    btree tree
) {
  auto thread_id = threadIdx.x + blockIdx.x * blockDim.x;
  auto block     = cg::this_thread_block();
  auto tile      = cg::tiled_partition<btree::branching_factor>(block);

  if ((thread_id - tile.thread_rank()) >= keys_count) { return; }

  auto key       = btree::invalid_key;
  auto value     = btree::invalid_value;
  bool to_insert = false;
  if (thread_id < keys_count) {
    key       = keys[thread_id];
    value     = thread_id;
    to_insert = true;
  }
  using allocator_type = typename btree::device_allocator_context_type;
  allocator_type allocator{tree.allocator_, tile};

  size_type num_inserted = 1;
  auto work_queue        = tile.ballot(to_insert);
  while (work_queue) {
    auto cur_rank  = __ffs(work_queue) - 1;
    auto cur_key   = tile.shfl(key, cur_rank);
    auto cur_value = tile.shfl(value, cur_rank);

    tree.cooperative_insert(cur_key, cur_value, tile, allocator);

    if (tile.thread_rank() == cur_rank) { to_insert = false; }
    num_inserted++;
    work_queue = tile.ballot(to_insert);
  }
}

Bulk build

Implement a constructor that takes in key-value pairs and bulk-builds the tree. A size-planning sketch follows.
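
Purely as an illustrative starting point (hypothetical, not part of the repository): a bulk build would sort the input pairs, then materialize nodes bottom-up. The sketch below plans the node count per level from sorted input, reusing the ~2/3 node fill figure mentioned in the memory discussion further down.

#include <cstddef>
#include <vector>

// For n sorted pairs and branching factor b, compute how many nodes each
// level of a bottom-up build needs, from the leaves up to a single root.
std::vector<std::size_t> level_node_counts(std::size_t n, std::size_t b,
                                           double fill = 2.0 / 3.0) {
  std::size_t per_node = static_cast<std::size_t>(b * fill);  // slots used per node
  std::vector<std::size_t> levels;
  std::size_t count = (n + per_node - 1) / per_node;  // leaf level
  levels.push_back(count);
  while (count > 1) {  // internal levels until a single root remains
    count = (count + per_node - 1) / per_node;
    levels.push_back(count);
  }
  return levels;
}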

Build fails on CUDA 12.1

Running on a GeForce RTX 3080 GPU
CUDA 12.1
C++17
CMake version 3.26.3
Although the g++ compiler is version 9.4.0 on Ubuntu 20.04, CMake uses CXX_STANDARD 17, so it should be compiling with -std=c++17.

I am getting namespace errors such as the following when running `make -j` per these instructions:

/MVGpuBTree/include/btree_kernels.hpp(79): error: name followed by "::" must be a class or namespace name
__attribute__((shared)) cg::experimental::block_tile_memory<4, btree::reclaimer_block_size_> block_tile_shemm;
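
This looks like fallout from CUDA 12 removing the cooperative_groups::experimental namespace: block_tile_memory was promoted into cooperative_groups proper, and its template parameters changed across versions. A possible shim, offered only as a hedged sketch to verify against the Cooperative Groups documentation for your toolkit:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 12)
// CUDA 12.x: block_tile_memory lives in cooperative_groups and takes only
// the maximum block size.
template <unsigned int MaxBlockSize>
using btree_tile_memory = cg::block_tile_memory<MaxBlockSize>;
#else
// CUDA 11.x: the experimental variant, with a tile-communication-size
// parameter (the 4 matches the failing line above).
template <unsigned int MaxBlockSize>
using btree_tile_memory = cg::experimental::block_tile_memory<4, MaxBlockSize>;
#endif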

CMake error when building the project

Dear authors! Thanks for your great work. When I try to reproduce your experimental results and compile MVGpuBTree, I get an error after running `cmake ..`. I tried different gcc/g++ versions (9.3, 7.5, 11.3) and different CMake versions (3.8, 3.16.3), but nothing changed; those solutions did not work. Checking my CMakeError.log, the problem seems to be `/usr/bin/ld: cannot find -lpthread`, but I don't know where I could add -lpthread. My setup is 8 vCPUs, 32 GiB RAM, an NVIDIA V100 GPU, and CUDA 11.4.

Could you confirm why we are facing this issue?

Question about key and value size in the B-Tree

Hi,
It seems that each key-value pair stored in GpuBTree must use the same memory size for the key type and the value type. Why does it have to be the same? Can I use a custom data structure with a comparator as the key type?

Lowering the memory requirements for benchmarks

The benchmarking code currently requires 20 GiB of GPU memory for a complete set of benchmarks. It would be nice to limit the memory requirements for benchmarking on workstations. The code (Thrust) will throw an out-of-memory error or a similar exception when it runs out of memory. To help with limiting the memory requirements, here is why we need these 20 GiB:

  1. The tree data structure:
    All memory allocations for the tree are satisfied by either the device_bump_allocator or the SlabAllocator. Both allocators allocate 8 GiB on construction by default. You may reduce this to 4 GiB by changing template parameters (the SlabAllocator only supports power-of-two allocations), but keep in mind that when inserting keys into the tree, I don't check for out-of-memory errors in device code (the code will either segfault or deadlock). Also, keep in mind that the device_bump_allocator does not support freeing memory, so benchmarks like VoB-Tree will not scale.

  2. Input:
    2.1. Point query benchmarks require only keys and values. For 50 million key-value pairs, the code will need ~0.2 GiB per array, for a total of $0.2 \times 4 = 0.8$ GiB (two input arrays, queries, and results).
    2.2. Range query benchmarks require keys and values, range-query lower and upper bounds, and an output buffer. For an input size of 50 million pairs, the code will need $0.2 \times 4 = 0.8$ GiB for the queries and key-value pairs, plus $0.2 \times$ average_range_query_length GiB for the RQ output buffer. So for a range query length of 32 (Figure 3.b), the RQ output will be 6.4 GiB. Most nodes are ~2/3 full, so an RQ length of 32 will cover $32 / (2/3 \times 14) \approx 3.5$ nodes. Another example is Figure 4, where we have 2.5 million RQs with lengths up to 1000, which makes the required RQ output size 9.3 GiB.

  3. Memory reclaimer: allocates ~0.3 GiB and can be changed by setting this number.

The maximum is 8 (tree) + 0.8 (RQ input) + 9.3 (maximum RQ output) + 0.3 (reclaimer) = 18.4 GiB. Notice that I never explicitly free GPU memory, since I use a shared-pointer wrapper around all allocations (see example), which means any allocation is deallocated when it goes out of scope.
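
For convenience, here is the budget above as a tiny self-contained C++ program; the constants simply restate the figures from the list (nothing here queries the real allocators):

#include <cstdio>

int main() {
  const double tree_pool_gib = 8.0;  // bump/slab allocator pool (default)
  const double input_gib     = 0.8;  // keys, values, queries, results
  const double rq_output_gib = 9.3;  // worst case (Figure 4 workload)
  const double reclaimer_gib = 0.3;  // memory reclaimer buffers
  std::printf("peak benchmark footprint: %.1f GiB\n",
              tree_pool_gib + input_gib + rq_output_gib + reclaimer_gib);
}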
