
hoplite's Introduction

Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems

This is the repo for the artifact evaluation of the SIGCOMM 2021 paper: Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems. For any questions or related issues, please feel free to contact Siyuan Zhuang ([email protected]) and Zhuohan Li ([email protected]).

Setup AWS Cluster & Hoplite

All the experiments in the paper are evaluated on AWS. We use the Ray cluster launcher to launch the cluster for all the experiments in the paper. We highly recommend using the Ray cluster launcher, as it automatically sets up the execution environment required for the experiments.

For every experiment, we include detailed instructions for setting up a cluster and reproducing the results in the paper.

Microbenchmarks (Section 5.1)

Please see microbenchmarks/ to reproduce the microbenchmark experiments in the paper.

Asynchronous SGD (Section 5.2)

Please see app/parameter-server/ to reproduce the Asynchronous SGD experiments in the paper.

Reinforcement Learning (Section 5.3)

Please see app/rllib/ to reproduce the RLlib experiments in the paper.

ML Model Serving Experiments (Section 5.4)

Please see app/ray_serve/ to reproduce the Ray Serve experiments and the Ray Serve fault tolerance experiments (Section 5.5, Figure 12a) in the paper.

hoplite's People

Contributors

danyangz, lambda7xx, suquark, zhuohan123


hoplite's Issues

Building instructions

Is there any tutorial or document explaining how to build Hoplite into the Ray system? I'm currently using Ray version 1.7.1 and Hoplite is not built in there. I installed Ray with pip. I tried to install gRPC and protobuf and run python setup.py build in the python directory, and got this error:

fatal error: object_store.grpc.pb.h: No such file or directory
5 | #include "object_store.grpc.pb.h"

Failure when running async DDP experiments

When I try to run the code under app/parameter-server on a Slurm cluster, I always get a segmentation fault. The failure always happens after

hoplite.start_location_server()
ray.init(address='auto', ignore_reinit_error=True)

and it reports a segmentation fault. The log looks like this:

1636525003466554268 object_directory[139.19.15.46]:63378:139778674558912 /home/dixiyao/hoplite/src/object_directory/notification.cc:535 main ]: Starting object directory at 139.19.15.46:7777
1636525003475934778 object_directory[139.19.15.46]:63378:139778485901056 /home/dixiyao/hoplite/src/object_directory/notification.cc:524 worker_loop ]: [NotificationServer] notification server started
2021-11-10 07:16:45,468	INFO worker.py:827 -- Connecting to existing Ray cluster at address: 139.19.15.46:6379
/var/lib/slurm/slurmd/job7000819/slurm_script: line 67: 63357 Segmentation fault      python3 hoplite_all_reduce.py --num-workers=4

I have tried this on an AWS cluster with one machine and on Slurm with 4 machines (one GPU per machine in both cases). Neither of them succeeds, and both report a segmentation fault.

Do you have any idea about possible causes of this fault, or could it result from an incorrect installation of Hoplite?

Error related to object store. "The location of ObjectID(xxx) is unavailable yet. Waiting for further notification."

Hey guys, when I was trying to do allreduce with Hoplite on a Slurm cluster, I ran into the following problem. My log files report Get failed for 139.19.15.49, which is the address of the head node. Later they say The location of ObjectID(0000000000000000000081131282117120887345) is unavailable yet. Waiting for further notification., and my process gets stuck there. It seems to be related to getting an object from the store, maybe this line of code: grad_buffer = self.store.get(reduction_id)?

For reference, this is the code of my main function:

hoplite.start_location_server()
ray.init(address='auto', ignore_reinit_error=True)
workers = [DataWorker.remote(args_dict, i, model_type=args.model, device='cuda') for i in range(num_workers)]

print("Running synchronous parameter server training.")
step_start = time.time()
for i in range(iterations):
    gradients = []
    reduction_id = hoplite.random_object_id()
    all_grad_ids = [hoplite.random_object_id() for worker in workers]
    all_updates = []
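    # pass each worker its own gradient ID, the full list of gradient IDs, and the shared reduction ID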
    for grad_id, worker in zip(all_grad_ids, workers):
        all_updates.append(worker.compute_gradients.remote(grad_id, all_grad_ids, reduction_id))
    ray.get(all_updates)
    now = time.time()
    print("step time:", now - step_start, flush=True)
    step_start = now 

and this is the code of the DataWorker:

@ray.remote(num_gpus=1)
class DataWorker(object):
    def __init__(self, args_dict, rank, model_type="custom", device="cpu"):
        self.store = hoplite.create_store_using_dict(args_dict)
        self.device = device
        self.model = bert.Vit_Wrapper().to(device)
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=0.02)
        self.rank = rank
        self.is_master = hoplite.get_my_address().encode() == args_dict['redis_address']

    def compute_gradients(self, gradient_id, gradient_ids, reduction_id, batch_size=128):
        start_time = time.time()
        if self.is_master:
            #print("i'm master and i start reduce")
            reduced_gradient_id = self.store.reduce_async(gradient_ids, hoplite.ReduceOp.SUM, reduction_id)
        data = torch.randn(batch_size, 3, 224, 224, device=self.device)
        self.model.zero_grad()
        output = self.model(data)
        loss = torch.mean(output)
        loss.backward()
        gradients = self.model.get_gradients()
        cont_g = np.concatenate([g.ravel().view(np.uint8) for g in gradients])
        print(cont_g.shape)
        buffer = hoplite.Buffer.from_buffer(cont_g)
        gradient_id = self.store.put(buffer, gradient_id)
        print(gradient_id)
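        # every worker blocks here until the reduced object becomes available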
        grad_buffer = self.store.get(reduction_id)
        print(grad_buffer)
        summed_gradients = self.model.buffer_to_tensors(grad_buffer)
        self.optimizer.zero_grad()
        self.model.set_gradients(summed_gradients)
        self.optimizer.step()
        print(self.rank, "in actor time", time.time() - start_time)
        return None

So, do you have any idea about the problem?

Apart from that, I have two other questions. First, as you can see, I have some print calls in my code for quick debugging, but in my Slurm log files there is no output from any of them. Second, I'm not sure whether my problem could actually be related to my permissions on the Slurm cluster. I have no root access, so maybe Hoplite cannot access the designated store location? Besides the function create_store_using_dict, is there another way to create a store where I can use my home directories?

Fast restart

Currently, port 6666 is not released when the object store is killed. This blocks fast test iterations.
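
A common cause of this is the listening socket lingering in TIME_WAIT after the object store process exits. Below is a minimal sketch of the usual workaround, assuming the server binds a plain TCP socket (the actual Hoplite server code may differ):

import socket

# hypothetical listener setup; 6666 is the port mentioned above
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# allow rebinding immediately after a previous process exits,
# instead of waiting for the old socket to leave TIME_WAIT
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 6666))
s.listen()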

[hoplite-cpp] Unexpected failure with multicast/allgather

When using 8 nodes, an object size of 32768, and 3 repeats, we get the following error on the last repeat:

1620340896969884106 172.31.56.237:6061:140167967596224 /home/ubuntu/efs/hoplite/src/common/buffer.cc:26 CopyFrom ]:  Check failed: size == size_ input size mismatch

It could be related to #143

Sealing an unfinished buffer

hoplite/src/client/local_store_client.cc:41 Seal ]: Sealing an unfinished buffer

This should be a false alarm.

Hoplite with Python failure

Surprisingly, when using the Python interface, Hoplite can suddenly fail and then cannot run anything, even with new sessions. This is very weird, and there are two possible causes:

  1. Issues with Ray.
  2. Logging module.

I'll investigate it.

[hoplite-python] Unexpected bug for roundtrip test using small objects

For example: ./run_test.sh roundtrip 2 $[2**15]

1620615376199987199 172.31.51.213:5793:139973848192832 /home/ubuntu/efs/hoplite/src/client/receiver.cc:180 check_and_store_inband_data ]: fetching object directly from inband data
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Error when I run the program

I wrote a demo to use Hoplite.

#!/usr/bin/env python3
###test_put.py 
import argparse
import time

from mpi4py import MPI
import numpy as np

import hoplite

notification_port = 7777
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()


def testPut(store,object_size):
    object_id = hoplite.ObjectID(b'\3' * 20)
    array = np.random.randint(2**30, size=object_size//4, dtype=np.int32)
    buffer = hoplite.Buffer.from_buffer(array)
    print("Buffer created, hash =", hash(buffer))
    print('testPut,the buffer:', buffer)
    store.put(buffer, object_id)
    comm.Barrier()

if rank == 0:
    object_directory_address = hoplite.start_location_server()
    time.sleep(1)
else:
    object_directory_address = None

# broadcast object directory address
object_directory_address = comm.bcast(object_directory_address, root=0)
store = hoplite.HopliteClient(object_directory_address)


globals()['testPut'](store,4096)

# avoid disconnecting from hoplite store before another one finishes
comm.Barrier()

# exit clients before the server to suppress connection error messages
del store
comm.Barrier()

Redis notification is not reliable

According to https://redis.io/topics/notifications

Because Redis Pub/Sub is fire and forget currently there is no way to use this feature if your application demands reliable notification of events, that is, if your Pub/Sub client disconnects, and reconnects later, all the events delivered during the time the client was disconnected are lost.

We need to use something else to implement the notification mechanism.
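
One possible direction (an illustrative sketch only, not the project's chosen fix) is Redis Streams, which retain entries so that a client reconnecting after a disconnect can replay everything past the last ID it processed. A minimal example with the redis-py client, using a hypothetical object_locations stream:

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# producer: append a location notification; the entry is persisted in the stream
r.xadd("object_locations", {"object_id": "obj-123", "node": "172.31.56.237"})

# consumer: read everything after the last ID it has processed ("0" = from the beginning),
# so notifications sent while the client was disconnected are not lost
last_id = "0"
for stream, messages in r.xread({"object_locations": last_id}, block=1000):
    for message_id, fields in messages:
        last_id = message_id  # remember progress for the next xread call
        print(message_id, fields)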

Dockerfile

A Dockerfile would help users without AWS instances reproduce the results more easily.
