
hoplite's Introduction

Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems

This is the repo for the artifact evaluation of the SIGCOMM 2021 paper: Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems. For any questions or related issues, please feel free to contact Siyuan Zhuang ([email protected]) and Zhuohan Li ([email protected]).

Setup AWS Cluster & Hoplite

All the experiments in the paper are evaluated on AWS. We use the Ray cluster launcher to launch the cluster for all the experiments in the paper. We highly recommend using the Ray cluster launcher, as it automatically sets up the execution environment required for the experiments.

For every experiment, we include detailed instructions for setting up a cluster and reproducing the results in the paper.

Microbenchmarks (Section 5.1)

Please see microbenchmarks/ to reproduce the microbenchmark experiments in the paper.

Asynchronous SGD (Section 5.2)

Please see app/parameter-server/ to reproduce the Asynchronous SGD experiments in the paper.

Reinforcement Learning (Section 5.3)

Please see app/rllib/ to reproduce the RLlib experiments in the paper.

ML Model Serving Experiments (Section 5.4)

Please see app/ray_serve/ to reproduce the Ray Serve experiments and the Ray Serve fault tolerance experiments (Section 5.5, Figure 12a) in the paper.

hoplite's People

Contributors

danyangz, lambda7xx, suquark, zhuohan123


hoplite's Issues

Building instructions

Is there any tutorial or document explaining how to build Hoplite into the Ray system? I'm currently using Ray version 1.7.1 and Hoplite is not built in there. I installed Ray with pip. I tried to install gRPC and protobuf and run python setup.py build in the python directory, and got this error:

fatal error: object_store.grpc.pb.h: No such file or directory
5 | #include "object_store.grpc.pb.h"

Failure when running async DDP experiments

When I try to run the code under app/parameter-server on a Slurm cluster, I always get a segmentation fault. The failure always happens after

hoplite.start_location_server()
ray.init(address='auto', ignore_reinit_error=True)

and it reports a segmentation fault. The log looks like this:

1636525003466554268 object_directory[139.19.15.46]:63378:139778674558912 /home/dixiyao/hoplite/src/object_directory/notification.cc:535 main ]: Starting object directory at 139.19.15.46:7777
1636525003475934778 object_directory[139.19.15.46]:63378:139778485901056 /home/dixiyao/hoplite/src/object_directory/notification.cc:524 worker_loop ]: [NotificationServer] notification server started
2021-11-10 07:16:45,468	INFO worker.py:827 -- Connecting to existing Ray cluster at address: 139.19.15.46:6379
/var/lib/slurm/slurmd/job7000819/slurm_script: line 67: 63357 Segmentation fault      python3 hoplite_all_reduce.py --num-workers=4

I have tried this on an AWS cluster with one machine and on Slurm with 4 machines (one GPU per machine in both cases). Neither of them succeeds, and both report a segmentation fault.

Do you have any idea about possible causes of this fault, or could it result from an incorrect installation of Hoplite?

Error related to object store. "The location of ObjectID(xxx) is unavailable yet. Waiting for further notification."

Hey guys, when I was trying to do allreduce with Hoplite on a Slurm cluster, I ran into the following problem. My log files report Get failed for 139.19.15.49, which is the address of the head node. Later they say The location of ObjectID(0000000000000000000081131282117120887345) is unavailable yet. Waiting for further notification., and my process gets stuck there. It seems to be related to getting an object from the store, maybe this line of code: grad_buffer = self.store.get(reduction_id)?

For reference, this is the code of my main function:

hoplite.start_location_server()
ray.init(address='auto', ignore_reinit_error=True)
workers = [DataWorker.remote(args_dict, i, model_type=args.model, device='cuda') for i in range(num_workers)]

print("Running synchronous parameter server training.")
step_start = time.time()
for i in range(iterations):
    gradients = []
    reduction_id = hoplite.random_object_id()
    all_grad_ids = [hoplite.random_object_id() for worker in workers]
    all_updates = []
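    # pass each worker its own gradient ID, the full list of gradient IDs, and the shared reduction ID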
    for grad_id, worker in zip(all_grad_ids, workers):
        all_updates.append(worker.compute_gradients.remote(grad_id, all_grad_ids, reduction_id))
    ray.get(all_updates)
    now = time.time()
    print("step time:", now - step_start, flush=True)
    step_start = now 

and this is the code of the DataWorker:

@ray.remote(num_gpus=1)
class DataWorker(object):
    def __init__(self, args_dict, rank, model_type="custom", device="cpu"):
        self.store = hoplite.create_store_using_dict(args_dict)
        self.device = device
        self.model = bert.Vit_Wrapper().to(device)
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=0.02)
        self.rank = rank
        self.is_master = hoplite.get_my_address().encode() == args_dict['redis_address']

    def compute_gradients(self, gradient_id, gradient_ids, reduction_id, batch_size=128):
        start_time = time.time()
        if self.is_master:
            #print("i'm master and i start reduce")
            reduced_gradient_id = self.store.reduce_async(gradient_ids, hoplite.ReduceOp.SUM, reduction_id)
        data = torch.randn(batch_size, 3, 224, 224, device=self.device)
        self.model.zero_grad()
        output = self.model(data)
        loss = torch.mean(output)
        loss.backward()
        gradients = self.model.get_gradients()
        cont_g = np.concatenate([g.ravel().view(np.uint8) for g in gradients])
        print(cont_g.shape)
        buffer = hoplite.Buffer.from_buffer(cont_g)
        gradient_id = self.store.put(buffer, gradient_id)
        print(gradient_id)
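        # every worker blocks here until the reduced object becomes available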
        grad_buffer = self.store.get(reduction_id)
        print(grad_buffer)
        summed_gradients = self.model.buffer_to_tensors(grad_buffer)
        self.optimizer.zero_grad()
        self.model.set_gradients(summed_gradients)
        self.optimizer.step()
        print(self.rank, "in actor time", time.time() - start_time)
        return None

So, do you have any idea about the problem?

Apart from that, I have two other questions. First, as you can see, I have some print calls in my code for quick debugging, but in my Slurm log files there is no output from any of them. Second, I'm not sure whether my problem could actually be related to my permissions on the Slurm cluster. I have no root access, so maybe Hoplite cannot access the designated store location? Besides the function create_store_using_dict, is there another way to create a store where I can use my home directories?

Fast restart

Currently, port 6666 is not released when the object store is killed. This blocks fast test iterations.
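
A common cause of this is the listening socket lingering in TIME_WAIT after the object store process exits. Below is a minimal sketch of the usual workaround, assuming the server binds a plain TCP socket (the actual Hoplite server code may differ):

import socket

# hypothetical listener setup; 6666 is the port mentioned above
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# allow rebinding immediately after a previous process exits,
# instead of waiting for the old socket to leave TIME_WAIT
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 6666))
s.listen()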

[hoplite-cpp] Unexpected failure with multicast/allgather

When using 8 nodes, an object size of 32768, and 3 repeats, we get the following error on the last repeat:

1620340896969884106 172.31.56.237:6061:140167967596224 /home/ubuntu/efs/hoplite/src/common/buffer.cc:26 CopyFrom ]:  Check failed: size == size_ input size mismatch

It could be related to #143

Sealing an unfinished buffer

hoplite/src/client/local_store_client.cc:41 Seal ]: Sealing an unfinished buffer

This should be a false alarm.

Hoplite with Python failure

Surprisingly, when using the Python interface, Hoplite can suddenly fail and then cannot run anything, even with new sessions. This is very weird, and there are two possible causes:

  1. Issues with Ray.
  2. Logging module.

I'll investigate it.

[hoplite-python] Unexpected bug for roundtrip test using small objects

For example: ./run_test.sh roundtrip 2 $[2**15]

1620615376199987199 172.31.51.213:5793:139973848192832 /home/ubuntu/efs/hoplite/src/client/receiver.cc:180 check_and_store_inband_data ]: fetching object directly from inband data
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Error when I run the program

I wrote a demo to use Hoplite.

#!/usr/bin/env python3
###test_put.py 
import argparse
import time

from mpi4py import MPI
import numpy as np

import hoplite

notification_port = 7777
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()


def testPut(store,object_size):
    object_id = hoplite.ObjectID(b'\3' * 20)
    array = np.random.randint(2**30, size=object_size//4, dtype=np.int32)
    buffer = hoplite.Buffer.from_buffer(array)
    print("Buffer created, hash =", hash(buffer))
    print('testPut,the buffer:', buffer)
    store.put(buffer, object_id)
    comm.Barrier()

if rank == 0:
    object_directory_address = hoplite.start_location_server()
    time.sleep(1)
else:
    object_directory_address = None

# broadcast object directory address
object_directory_address = comm.bcast(object_directory_address, root=0)
store = hoplite.HopliteClient(object_directory_address)


globals()['testPut'](store,4096)

# avoid disconnecting from hoplite store before another one finishes
comm.Barrier()

# exit clients before the server to suppress connection error messages
del store
comm.Barrier()

Redis notification is not reliable

According to https://redis.io/topics/notifications

Because Redis Pub/Sub is fire and forget currently there is no way to use this feature if your application demands reliable notification of events, that is, if your Pub/Sub client disconnects, and reconnects later, all the events delivered during the time the client was disconnected are lost.

We need to use something else to implement the notification mechanism.
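
One possible direction (an illustrative sketch only, not the project's chosen fix) is Redis Streams, which retain entries so that a client reconnecting after a disconnect can replay everything past the last ID it processed. A minimal example with the redis-py client, using a hypothetical object_locations stream:

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# producer: append a location notification; the entry is persisted in the stream
r.xadd("object_locations", {"object_id": "obj-123", "node": "172.31.56.237"})

# consumer: read everything after the last ID it has processed ("0" = from the beginning),
# so notifications sent while the client was disconnected are not lost
last_id = "0"
for stream, messages in r.xread({"object_locations": last_id}, block=1000):
    for message_id, fields in messages:
        last_id = message_id  # remember progress for the next xread call
        print(message_id, fields)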

Dockerfile

A Dockerfile would help users without AWS instances reproduce the results more easily.
