RDMA-bench

A framework to understand RDMA performance. This is the source code for our USENIX ATC paper.

Required hardware and software

  • InfiniBand HCAs. Some C++ benchmarks work with RoCE HCAs.
  • Linux-based OS with RDMA drivers (Mellanox OFED or upstream OFED). Ubuntu, RHEL, and CentOS have been tested.
  • Required packages: cmake, memcached, gflags, libmemcached-dev, libnuma-dev
  • Root access is required only for hugepages.

Required settings

All benchmarks require one server machine and multiple client machines. Every benchmark is contained in one directory.

  • The number of client machines required is described in each benchmark's README file. The server will wait for all clients to launch, so the benchmarks won't make progress until the correct number of clients are launched.
  • Modify HRD_REGISTRY_IP in run-servers.sh and run-machines.sh to the IP address of the server machine. The server runs a memcached instance that is used as a queue pair registry.
  • Allocate hugepages on all machines, and set unlimited SHM limits:
  sudo bash -c "echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages"
  sudo bash -c "echo kernel.shmmax = 9223372036854775807 >> /etc/sysctl.conf"
  sudo bash -c "echo kernel.shmall = 1152921504606846720 >> /etc/sysctl.conf"
  sudo sysctl -p /etc/sysctl.conf
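
As a sanity check on the numbers above, the following sketch works out how much memory the hugepage command reserves on NUMA node 0. The helper name `hugepage_bytes` is hypothetical and not part of libhrd; the constants are copied from the commands above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of libhrd): bytes reserved when `num_pages`
// hugepages of `page_size_kb` kB each are allocated.
constexpr std::uint64_t hugepage_bytes(std::uint64_t num_pages,
                                       std::uint64_t page_size_kb) {
  return num_pages * page_size_kb * 1024;
}

// 8192 pages of 2048 kB each, as in the nr_hugepages command above.
static_assert(hugepage_bytes(8192, 2048) == (16ull << 30),
              "8192 x 2 MB hugepages reserve 16 GiB on node 0");

// The kernel.shmmax value written to /etc/sysctl.conf above is 2^63 - 1.
static_assert(9223372036854775807ull == static_cast<std::uint64_t>(INT64_MAX),
              "shmmax is the largest signed 64-bit value");
```

If node 0 has less than 16 GiB of free memory, the kernel may grant fewer pages than requested, so it is worth reading nr_hugepages back after writing it.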

Benchmark description

The benchmarks used in the paper are described below. This repository contains other benchmarks as well.

Benchmark          Description
herd               An improved implementation of the HERD key-value cache.
mica               A simplified implementation of MICA.
atomics-sequencer  Sequencer using one-sided fetch-and-add. Also emulates DrTM-KV.
ws-sequencer       Sequencer using HERD RPCs (UC WRITE requests, UD SEND responses).
ss-sequencer       Sequencer using header-only datagram RPCs (i.e., UD SENDs only).
rw-tput-sender     Microbenchmark to measure throughput of outbound READs and WRITEs.
rw-tput-receiver   Microbenchmark to measure throughput of inbound READs and WRITEs.
ud-sender          Microbenchmark to measure throughput of outbound SENDs.
ud-receiver        Microbenchmark to measure throughput of inbound SENDs.
rw-allsig          WQE cache misses for outbound READs and WRITEs.
write-incomplete   This PoC shows that a completed WRITE can be invisible to the remote CPU.
write-reordering   A test for left-to-right ordering of WRITEs.

Implementation details

libhrd

The libhrd library is used to implement all benchmarks. It consists of convenience functions for initial RDMA setup, such as creating and connecting QPs, and allocating hugepage memory.

Memcached

Distributing QP information (required for connection setup in connected transports, and routing in datagram transports) requires a temporary out-of-band communication channel. To simplify this process, we use a memcached instance to publish (e.g., hrd_publish_conn_qp()) and pull QP information (e.g., hrd_get_published_qp) using global QP names.
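
The publish/pull pattern can be illustrated with a small in-memory stand-in for the registry. This is only a sketch: the `Registry` class, the `QpInfo` fields, and the key "server-0-0" are assumptions made for illustration, and the real libhrd functions talk to memcached and may publish different attributes.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical record; the attributes libhrd actually publishes may differ.
struct QpInfo {
  std::string name;  // global QP name, used as the registry key
  unsigned lid;      // port LID
  unsigned qpn;      // queue pair number
};

// In-memory stand-in for the memcached registry (assumption: not libhrd's API).
class Registry {
 public:
  // Publish QP information under its global name, overwriting older entries.
  void publish(const QpInfo& info) { store_[info.name] = info; }

  // Pull QP information by name. Returns false if nothing has been
  // published under that name yet, in which case *out is left untouched.
  bool get(const std::string& name, QpInfo* out) const {
    std::map<std::string, QpInfo>::const_iterator it = store_.find(name);
    if (it == store_.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  std::map<std::string, QpInfo> store_;
};
```

Because a peer may look up a name before the owner has published it, a caller of `get` would normally retry until the lookup succeeds.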

Client connection logic

The code was written for a cluster with dual-port NICs whose switch connectivity does not allow cross-port communication. Using both ports in this constrained environment makes the initial QP connection setup slightly complicated. All benchmarks also work on single-port NICs. We usually use the following logic when setting up connections:

  • There are N client threads in the system and each client thread uses Q QPs.
  • The server has num_server_ports ports starting from port base_port_index. Similarly, clients have num_client_ports. The base_port_index may be different for server and clients.
  • On the CIB cluster, port i on a NIC can only communicate with port i on other NICs. So base_port_index must be the same for clients and server, and num_client_ports == num_server_ports.
  • One server thread (the master thread in case there are worker threads) creates N * Q QPs on each server port. For applications requiring a request region, only one memory region is created and registered with all of the num_server_ports control blocks. Only some of these QPs actually get used by clients.
  • Client threads have a global index clt_i. Each client thread uses a single control block and creates all its QPs on port index (using base base_port_index) clt_i % num_client_ports. It connects all these QPs to QPs on server port indexed clt_i % num_server_ports (using the server's base_port_index). This works on CIB as well as on the Apt and Intel clusters, which support any-to-any communication between ports.
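
The port selection described above can be sketched as follows. The function and struct names are hypothetical, and the sketch assumes that "using base base_port_index" means the modulo result is added to the base index.

```cpp
#include <cassert>
#include <cstddef>

// Ports chosen for one client thread (hypothetical type, not from libhrd).
struct PortPair {
  std::size_t client_port;  // local port on which the client creates its QPs
  std::size_t server_port;  // server port whose QPs the client connects to
};

// Map global client thread index clt_i to a (client port, server port) pair.
// Assumption: the offset clt_i % num_ports is added to the side's base index.
PortPair map_ports(std::size_t clt_i,
                   std::size_t client_base_port_index,
                   std::size_t num_client_ports,
                   std::size_t server_base_port_index,
                   std::size_t num_server_ports) {
  assert(num_client_ports > 0 && num_server_ports > 0);
  PortPair p;
  p.client_port = client_base_port_index + clt_i % num_client_ports;
  p.server_port = server_base_port_index + clt_i % num_server_ports;
  return p;
}

// On a cluster where port i can only reach port i, both sides must use the
// same base index and the same number of ports.
bool same_port_only_ok(std::size_t client_base_port_index,
                       std::size_t num_client_ports,
                       std::size_t server_base_port_index,
                       std::size_t num_server_ports) {
  return client_base_port_index == server_base_port_index &&
         num_client_ports == num_server_ports;
}
```

With identical bases and port counts, every client thread gets equal client and server port indices, which is what the same-port-only constraint requires.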

Selective signaling logic

Most benchmarks post one signaled work request per UNSIG_BATCH work requests. This is done to reduce CQE DMAs. With UNSIG_BATCH = 4, a sequence of work requests looks as follows. Note that a work request is not post()ed immediately; it is added to a list and posted when the number of work requests in the list equals postlist.

wr 0 -> signaled
wr 1 -> unsignaled
wr 2 -> unsignaled
wr 3 -> unsignaled
	Poll for wr 0's completion. A postlist should have ended.
wr 4 -> signaled
wr 5 -> unsignaled
wr 6 -> unsignaled
wr 7 -> unsignaled
	Poll for wr 4's completion. Another postlist should have ended.

This imposes 2 requirements:

  • Postlist check: postlist <= UNSIG_BATCH. We poll for a completion before queueing work request UNSIG_BATCH + 1. If postlist > UNSIG_BATCH, nothing will have been posted at this point, so polling will get stuck.

  • Queue capacity check: HRD_Q_DEPTH >= 2 * UNSIG_BATCH. With the above scheme, up to 2 * UNSIG_BATCH - 1 work requests can be un-ACKed by the QP. With a QP of size N, N - 1 work requests are allowed to be un-ACKed by the InfiniBand/RoCE specification.
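
A minimal sketch of the scheme and its two checks is shown below. It only models which work requests are signaled and where the polls happen; it does not call the verbs API, and the names `WrPlan`, `plan_work_requests`, and `config_ok` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Decision for one work request (hypothetical type, for illustration only).
struct WrPlan {
  bool signaled;     // request a completion for this work request
  bool poll_before;  // poll for the previous signaled WR before queueing it
};

// The two requirements from the text: the postlist check and the
// queue capacity check.
constexpr bool config_ok(std::size_t unsig_batch, std::size_t postlist,
                         std::size_t q_depth) {
  return postlist <= unsig_batch && q_depth >= 2 * unsig_batch;
}

// Plan num_wrs work requests: the first WR of every batch is signaled, and
// from the second batch onward we poll for the previous batch's signaled WR
// before queueing the new one.
std::vector<WrPlan> plan_work_requests(std::size_t num_wrs,
                                       std::size_t unsig_batch) {
  assert(unsig_batch > 0);
  std::vector<WrPlan> plan;
  plan.reserve(num_wrs);
  for (std::size_t i = 0; i < num_wrs; i++) {
    const bool first_in_batch = (i % unsig_batch == 0);
    WrPlan p;
    p.signaled = first_in_batch;
    p.poll_before = first_in_batch && i > 0;
    plan.push_back(p);
  }
  return plan;
}

// Worst case of un-ACKed WRs: a poll only confirms the signaled WR of the
// previous batch, so the rest of that batch plus a full new batch may be
// outstanding.
constexpr std::size_t max_unacked(std::size_t unsig_batch) {
  return 2 * unsig_batch - 1;
}
```

For UNSIG_BATCH = 4 this gives at most 7 un-ACKed work requests, which fits a QP of depth 8 because a QP of size N allows N - 1 of them.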

Work in progress

The benchmarks are being ported to use C++ and CMake. Some benchmarks will continue to use C (i.e., libhrd); others will move to C++ (i.e., libhrd_cpp).

Contact

Anuj Kalia ([email protected])

License

	Copyright 2016, Carnegie Mellon University

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

rdma_bench's Issues

Definite memory leak in libhrd_cpp

I ran a Valgrind analysis that revealed the following definite memory leaks in libhrd_cpp caused by the memcached connectors:

==37850== 1,168 bytes in 1 blocks are definitely lost in loss record 65 of 82
==37850==    at 0x4C2FA3F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==37850==    by 0x4C31D84: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==37850==    by 0x549639F: memcached_server_list_append_with_weight (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x10E682: hrd_create_memc() (hrd_util.cpp:332)
==37850==    by 0x10E7DC: hrd_publish(char const*, void*, unsigned long) (hrd_util.cpp:351)
==37850==    by 0x10C53E: hrd_publish_conn_qp(hrd_ctrl_blk_t*, unsigned long, char const*) (hrd_conn.cpp:396)
==37850==    by 0x10F69E: main (main.cpp:170)
==37850==
...
==37850== 2,048 bytes in 2 blocks are definitely lost in loss record 68 of 82
==37850==    at 0x4C2FA3F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==37850==    by 0x4C31D84: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==37850==    by 0x549A09C: ??? (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x54949FF: ??? (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x5494E26: ??? (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x548DD99: memcached_fetch_result (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x548DF91: memcached_fetch (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x548F68C: memcached_get_by_key (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x548F889: memcached_get (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x10E92F: hrd_get_published(char const*, void**) (hrd_util.cpp:383)
==37850==    by 0x10EBB6: hrd_wait_till_ready(char const*) (hrd_util.cpp:434)
==37850==    by 0x10FA17: main (main.cpp:231)
==37850==
...
==37850== 2,048 bytes in 2 blocks are definitely lost in loss record 69 of 82
==37850==    at 0x4C2FA3F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==37850==    by 0x4C31D84: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==37850==    by 0x549A09C: ??? (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x54949FF: ??? (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x5494E26: ??? (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x548DD99: memcached_fetch_result (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x548DF91: memcached_fetch (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x548F68C: memcached_get_by_key (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x548F889: memcached_get (in /usr/lib/x86_64-linux-gnu/libmemcached.so.11.0.0)
==37850==    by 0x10E92F: hrd_get_published(char const*, void**) (hrd_util.cpp:383)
==37850==    by 0x10EBB6: hrd_wait_till_ready(char const*) (hrd_util.cpp:434)
==37850==    by 0x10FA4E: main (main.cpp:234)
==37850==
...
==37850== LEAK SUMMARY:
==37850==    definitely lost: 5,264 bytes in 5 blocks
...

About Pilaf and FaRM

I have read your HERD paper, in which you compare HERD with Pilaf and FaRM. Could I take a look at the concrete implementation of the GET operation in your emulation of Pilaf or FaRM? I would appreciate an answer.

Failed to publish key server-0-0. Error A TIMEOUT OCCURRED, Reg IP= xx.xx.xx.xx

Hello!

I need your help. I set kRoCE = true and use the RoCEv2 protocol. I changed "memcached -l 0.0.0.0 1>/dev/null 2>/dev/null" to "memcached -l 0.0.0.0 -u root -p 11266" in run_server.sh and then ran it. It fails with the error: Failed to publish key server-0-0. Error A TIMEOUT OCCURRED, Reg IP= xx.xx.xx.xx. I have verified with telnet that memcached is running on the server, but hrd_publish() still fails. Can you tell me how to solve this?

Thank you very much!

Understanding the performance numbers of HERD

Hi @anujkaliaiitd

I am trying to understand the performance numbers printed by the HERD server and client. Could you please help? Thank you so much.

Server prints:

main: Worker 0: 3269603.08 IOPS. Avg per-port postlist = 1.00. HERD lookup fail rate = 0.0085
main: Worker 0: 3269148.67 IOPS. Avg per-port postlist = 1.00. HERD lookup fail rate = 0.0085
main: Worker 0: 3228738.91 IOPS. Avg per-port postlist = 1.00. HERD lookup fail rate = 0.0085
main: Worker 0: 3268642.09 IOPS. Avg per-port postlist = 1.00. HERD lookup fail rate = 0.0085

Client prints:

main: Client 0: 3252758.84 IOPS. nb_tx = 1301282816
main: Client 0: 3274758.97 IOPS. nb_tx = 1301807104
main: Client 0: 3267195.87 IOPS. nb_tx = 1302331392
main: Client 0: 3259528.61 IOPS. nb_tx = 1302855680

So what is the throughput here? 3 million I/O operations per second?
Is that normal, or am I running it with a wrong configuration? Thank you so much for
your time.

I am running the server (an 8-core machine with an RDMA NIC) with the following configuration:


blue "Reset server QP registry"
sudo killall memcached
memcached -l 0.0.0.0 1>/dev/null 2>/dev/null &
sleep 1

blue "Starting master process"
sudo LD_LIBRARY_PATH=/usr/local/lib/ -E \
        numactl --cpunodebind=0 --membind=0 ./main \
        --master 1 \
        --base-port-index  0 \
        --num-server-ports 1 &

# Give the master process time to create and register per-port request regions
sleep 1

blue "Starting worker threads"
sudo LD_LIBRARY_PATH=/usr/local/lib/ -E \
        numactl --physcpubind=0,2,4,6 --membind=0 ./main \
        --is-client 0 \
        --base-port-index 0 \
        --num-server-ports 1 \
        --postlist 1 &

Client (8-core machine with an RNIC) start script:

num_threads=1           # Threads per client machine
: ${HRD_REGISTRY_IP:?"Need to set HRD_REGISTRY_IP non-empty"}

blue "Running $num_threads client threads"

sudo LD_LIBRARY_PATH=/usr/local/lib/ -E \
        numactl --cpunodebind=0 --membind=0 ./main \
        --num-threads $num_threads \
        --base-port-index 0 \
        --num-server-ports 1 \
        --num-client-ports 1 \
        --is-client 1 \
        --update-percentage 0 \
        --machine-id $1 &

Running server and client:


./run-server.sh
./run-machine 0

Herd tests error: ibv_post_send error Error 22

For the herd tests, I'm having an issue on the client side: ibv_post_send returns "EINVAL".

The server-side program runs fine. I'm running 12 worker threads on the server and 2 client machines with 10 client threads on each machine.

Hardware: QLE7340 InfiniBand, single port.
OS: Ubuntu 14.04

Does anyone have a clue?

compile error

Hi, I get some errors when compiling the code. After I type "make build && cd build && cmake ..", it shows:
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc: In function ‘hrd_ctrl_blk_t* hrd_ctrl_blk_init(size_t, size_t, size_t, hrd_conn_config_t*, hrd_dgram_config_t*)’:
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:105:26: error: conversion to ‘unsigned int’ from ‘int’ may change the sign of the result [-Werror=sign-conversion]
cb->dgram_buf_mr = ibv_reg_mr(cb->pd, const_cast<uint8_t*>(cb->dgram_buf),
^~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:110:26: error: conversion to ‘unsigned int’ from ‘int’ may change the sign of the result [-Werror=sign-conversion]
cb->dgram_buf_mr = ibv_reg_mr(cb->pd, const_cast<uint8_t*>(cb->dgram_buf),
^~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:143:25: error: conversion to ‘unsigned int’ from ‘int’ may change the sign of the result [-Werror=sign-conversion]
cb->conn_buf_mr = ibv_reg_mr(cb->pd, const_cast<uint8_t*>(cb->conn_buf),
^~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:151:25: error: conversion to ‘unsigned int’ from ‘int’ may change the sign of the result [-Werror=sign-conversion]
cb->conn_buf_mr = ibv_reg_mr(cb->pd, const_cast<uint8_t*>(cb->conn_buf),
^~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc: In function ‘void hrd_create_dgram_qps(hrd_ctrl_blk_t*)’:
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:236:33: error: aggregate ‘hrd_create_dgram_qps(hrd_ctrl_blk_t*)::ibv_exp_cq_init_attr cq_init_attr’ has incomplete type and cannot be defined
struct ibv_exp_cq_init_attr cq_init_attr;
^~~~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:239:28: error: ‘ibv_exp_create_cq’ was not declared in this scope
cb->dgram_send_cq[i] = ibv_exp_create_cq(
^~~~~~~~~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:239:28: note: suggested alternative: ‘ibv_create_cq’
cb->dgram_send_cq[i] = ibv_exp_create_cq(
^~~~~~~~~~~~~~~~~
ibv_create_cq
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:253:33: error: aggregate ‘hrd_create_dgram_qps(hrd_ctrl_blk_t*)::ibv_exp_qp_init_attr create_attr’ has incomplete type and cannot be defined
struct ibv_exp_qp_init_attr create_attr;
^~~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:256:9: error: ‘IBV_EXP_QP_INIT_ATTR_PD’ was not declared in this scope
IBV_EXP_QP_INIT_ATTR_PD | IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS;
^~~~~~~~~~~~~~~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:256:9: note: suggested alternative: ‘IBV_QP_INIT_ATTR_PD’
IBV_EXP_QP_INIT_ATTR_PD | IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS;
^~~~~~~~~~~~~~~~~~~~~~~
IBV_QP_INIT_ATTR_PD
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:256:35: error: ‘IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS’ was not declared in this scope
IBV_EXP_QP_INIT_ATTR_PD | IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS;
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:256:35: note: suggested alternative: ‘IBV_QP_INIT_ATTR_CREATE_FLAGS’
IBV_EXP_QP_INIT_ATTR_PD | IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS;
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
IBV_QP_INIT_ATTR_CREATE_FLAGS
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:270:23: error: ‘ibv_exp_create_qp’ was not declared in this scope
cb->dgram_qp[i] = ibv_exp_create_qp(cb->resolve.ib_ctx, &create_attr);
^~~~~~~~~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:270:23: note: suggested alternative: ‘ibv_create_qp’
cb->dgram_qp[i] = ibv_exp_create_qp(cb->resolve.ib_ctx, &create_attr);
^~~~~~~~~~~~~~~~~
ibv_create_qp
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:274:28: error: aggregate ‘hrd_create_dgram_qps(hrd_ctrl_blk_t*)::ibv_exp_qp_attr init_attr’ has incomplete type and cannot be defined
struct ibv_exp_qp_attr init_attr;
^~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:284:9: error: ‘ibv_exp_modify_qp’ was not declared in this scope
ibv_exp_modify_qp(cb->dgram_qp[i], &init_attr, init_comp_mask) == 0,
^~~~~~~~~~~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:284:9: note: suggested alternative: ‘ibv_modify_qp’
ibv_exp_modify_qp(cb->dgram_qp[i], &init_attr, init_comp_mask) == 0,
^~~~~~~~~~~~~~~~~
ibv_modify_qp
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:288:28: error: aggregate ‘hrd_create_dgram_qps(hrd_ctrl_blk_t*)::ibv_exp_qp_attr rtr_attr’ has incomplete type and cannot be defined
struct ibv_exp_qp_attr rtr_attr;
^~~~~~~~
/home/lwh/rdma_bench-master/libhrd_cpp/hrd_conn.cc:296:28: error: aggregate ‘hrd_create_dgram_qps(hrd_ctrl_blk_t*)::ibv_exp_qp_attr rts_attr’ has incomplete type and cannot be defined
struct ibv_exp_qp_attr rts_attr;
^~~~~~~~
cc1plus: all warnings being treated as errors
make[2]: *** [CMakeFiles/sender-scalability.dir/build.make:76: CMakeFiles/sender-scalability.dir/libhrd_cpp/hrd_conn.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:73: CMakeFiles/sender-scalability.dir/all] Error 2
make: *** [Makefile:84: all] Error 2

How can I fix it? Thanks.

about ud-sender

Hello, I would like your help. When I try to use ud-sender to measure the throughput of outbound UD SENDs, I get the error "Failed to create SEND CQ". Could you give me some advice? Thanks.

How do I send when the payload is 4 kB?

Hello, I want to test the latency when the payload is 4 kB. I modified max_inline_data to increase the message size, but when max_inline_data is 828, create_qp fails.
Maybe that is not the right way to change it. How do I send when the payload is 4 kB?
