
msrflute's Introduction

FLUTE

Welcome to FLUTE (Federated Learning Utilities for Testing and Experimentation), a platform for conducting high-performance federated learning simulations.

Features

FLUTE is a PyTorch-based orchestration environment enabling GPU- or CPU-based FL simulations. The primary goal of FLUTE is to enable researchers to rapidly prototype and validate their ideas. Features include:

  • large scale simulation (millions of clients, sampling tens of thousands per round)
  • single/multi GPU and multi-node orchestration
  • local or global differential privacy
  • model quantization
  • a variety of standard optimizers and aggregation methods
  • most model types, including CNNs, RNNs, and Huggingface Transformers
  • extensibility, enabling new models, dataloaders, optimizers, and aggregators
  • local or cloud-based job staging using AzureML

Benchmarking

The following common tasks were used to evaluate the speed and memory utilization of FLUTE compared with the most representative simulation platforms, chosen by their number of stars on GitHub: FedML 0.7.303 and Flower 1.0.0.

| Task | Data Set         | Model                           | Algorithm | # Clients | Clients per round | Batch Size | Client Optimizer | lr   | Epochs | # Rounds | Test Freq |
|------|------------------|---------------------------------|-----------|-----------|-------------------|------------|------------------|------|--------|----------|-----------|
| CV   | MNIST            | LR                              | FedAvg    | 1000      | 10                | 10         | SGD              | 0.03 | 1      | 100      | 20        |
| CV   | Federated EMNIST | CNN (2 Conv + 2 FC)             | FedAvg    | 3400      | 10                | 20         | SGD              | 0.1  | 1      | 1500     | 50        |
| CV   | FED_CIFAR-100    | ResNet-18 + group normalization | FedAvg    | 500       | 10                | 20         | SGD              | 0.1  | 1      | 4000     | 50        |
| NLP  | Shakespeare      | RNN (2 LSTM + 1 FC)             | FedAvg    | 715       | 10                | 4          | SGD              | 0.8  | 1      | 1200     | 50        |

FedML Comparison

This comparison was carried out using Parrot (Simulator) on version 0.7.303 at commit ID 8f7f261f, showing that in some cases FLUTE can be up to 43x faster.

 _____________________________________________________________________________
|                    |   FedML (MPI) - Fastest   |   FLUTE (NCCL)  - Fastest  |
| Task               | Acc | Time     | GPU Mem  | Acc | Time     | GPU Mem   |
|--------------------|-----|----------|----------|-----|----------|-----------|
| LR_MNIST           | ~81 | 00:03:09 | ~3060 MB | ~81 | 00:01:35 | ~1060 MB  |
| CNN_FEMNIST        | ~83 | 05:49:52 | ~5180 MB | ~83 | 00:08:22 | ~1770 MB  |
| RESNET_FEDCIFAR100 | ~34 | 15:55:36 | ~5530 MB | ~33 | 01:42:01 | ~1900 MB  |
| RNN_FEDSHAKESPEARE | ~57 | 06:46:21 | ~3690 MB | ~57 | 00:21:50 | ~1270 MB  |
 -----------------------------------------------------------------------------

You can find the examples above in the experiments folder.

Flower Comparison

This comparison was carried out using Flower (Simulator) on version 1.0.0 at commit ID 4e7fad9 with the lr_mnist task, showing that in some cases FLUTE can be up to 53x faster.

 ________________________________________________
|        |    Flower (Ray)   | FLUTE (NCCL/Gloo) |
|        | Acc |    Time     | Acc |    Time     |
|--------|-----|-------------|-----|-------------|
| CPU    | ~80 |   00:30:14  | ~80 |   00:03:20  |
| GPU 2x | ~80 |   01:21:44  | ~80 |   00:01:31  |
| GPU 4x | ~79 |   00:56:45  | ~81 |   00:01:26  |
 ------------------------------------------------

You can find the example above in the cv_lr_mnist folder.

Quick Start

Install the requirements stated inside requirements.txt. Ideally this should be done inside a virtual environment, for instance using Anaconda.

conda create -n FLUTE python==3.7
conda activate FLUTE
pip install -r requirements.txt

FLUTE uses the torch.distributed API as its main communication backbone, supporting three built-in backends; for more information please refer to the Distributed Communication Package documentation. We highly suggest using the NCCL backend for distributed GPU training and Gloo for distributed CPU training. There is no setup.py, as FLUTE is not currently distributed as a package but is instead meant to run from the root of the repository.
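For reference, the snippet below is a generic sketch of the torch.distributed initialization pattern that a launcher like torch.distributed.run relies on; it is not FLUTE's internal code (that lives in e2e_trainer.py), just an illustration of how the NCCL/Gloo backend choice plays out:

import os
import torch
import torch.distributed as dist

# Generic torch.distributed initialization sketch (not FLUTE's own code).
# torch.distributed.run sets LOCAL_RANK, RANK and WORLD_SIZE for each process.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)  # init_method defaults to env://

local_rank = int(os.environ["LOCAL_RANK"])
if backend == "nccl":
    # With NCCL, each worker process is pinned to its own GPU.
    torch.cuda.set_device(local_rank)

print(f"rank {dist.get_rank()}/{dist.get_world_size()} using backend {backend}")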

After this initial setup, you can use the data created for the integration test for a first local run. Note that this data needs to be downloaded manually into the testing folder; for more instructions, please look at the README file inside testing.

For single-GPU runs:

python -m torch.distributed.run --nproc_per_node=1 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl

For multi-GPU runs (3 GPUs):

python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl

The config file testing/hello_world_nlg_gru.yaml has some comments explaining the major sections and some important details; essentially, it consists of a very short experiment where a couple of iterations are done for just a few clients. A scratch folder will be created containing detailed logs.
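As a quick sanity check before launching, you can load the config and list its top-level sections. The sketch below assumes PyYAML is available and uses field names that appear in the run logs shown later on this page; adjust them to whatever the file actually contains:

import yaml  # PyYAML; assumed to be available in the environment

# Inspect the hello-world config before launching a run.
with open("testing/hello_world_nlg_gru.yaml") as f:
    cfg = yaml.safe_load(f)

# Top-level sections (e.g. server_config, client_config) and a couple of knobs.
print(list(cfg.keys()))
print(cfg["server_config"].get("num_clients_per_iteration"))
print(cfg["server_config"].get("max_iteration"))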

Documentation

Online documentation is available at https://microsoft.github.io/msrflute/

Locally, the documentation is inside the doc/sphinx folder. To build the docs on Linux:

$ pip install sphinx
$ cd doc/sphinx
$ make html

On Windows, you can use the make.bat script. It may be necessary to export PYTHONPATH=../../ for sphinx to find the code.

Architecture

The core client/server training code is inside the core folder.

  • Server-side federation and global DP application takes place in server.py, more specifically in the OptimizationServer.train() method.
  • Client-side training updates take place in the static method Client.process_round(), inside client.py.

General FL orchestration code is in federated.py, but for most hub and spoke federation scenarios you won't need to touch this (unless you want to invest in optimizing server-client calls, which would be great!). Note that FLUTE does not implement secure aggregation since this is primarily a security feature for production scenarios; contributors are invited to add it for experimentation purposes.
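As a mental model only (the real orchestration is in federated.py and server.py), a hub-and-spoke FedAvg round reduces to something like the following generic sketch, where the server sends weights out, collects client updates, and applies a weighted average:

import copy

def fedavg_round(global_model, sampled_clients, train_on_client):
    # Generic FedAvg illustration, not FLUTE's implementation. Assumes the
    # state dict holds floating-point tensors and that train_on_client returns
    # (updated_state_dict, num_examples) for a given client.
    global_state = global_model.state_dict()
    updates, weights = [], []
    for client in sampled_clients:
        local_state, num_examples = train_on_client(client, copy.deepcopy(global_state))
        updates.append(local_state)
        weights.append(float(num_examples))

    total = sum(weights)
    new_state = {
        key: sum((w / total) * u[key] for w, u in zip(weights, updates))
        for key in global_state
    }
    global_model.load_state_dict(new_state)
    return global_model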

The primary entry point for an experiment is the script e2e_trainer.py. Primary config scripts for experiments are in configs. For instance, a basic training scenario for a next-word prediction task is set up in hello_world_nlg_gru_json.yaml.

Privacy accounting is expensive so the main parameters are logged and the actual accounting can be done offline. RDP privacy accounting is in extensions/privacy/analysis.py. A better accounting method is in the dp-accountant submodule.
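For intuition about what the offline accounting computes, here is a crude sketch for a plain (non-subsampled) Gaussian mechanism composed over several rounds, using the standard RDP-to-DP conversion; it is only an illustrative upper bound with made-up numbers, not a substitute for analysis.py or the dp-accountant submodule:

import math

def gaussian_rdp_epsilon(noise_multiplier, rounds, delta, orders=None):
    # Plain Gaussian mechanism: RDP at order alpha is alpha / (2 * sigma^2) per
    # round, composed linearly over rounds, then converted to (eps, delta)-DP
    # via eps = rdp + log(1/delta) / (alpha - 1), minimized over alpha.
    if orders is None:
        orders = [1 + x / 10.0 for x in range(1, 100)] + list(range(11, 64))
    rdp_per_round = lambda alpha: alpha / (2.0 * noise_multiplier ** 2)
    return min(rounds * rdp_per_round(a) + math.log(1.0 / delta) / (a - 1) for a in orders)

# Illustrative values only.
print(gaussian_rdp_epsilon(noise_multiplier=1.1, rounds=100, delta=1e-6))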

Customization

See the experiments folder for illustrations of how dataloaders and models are customized. In order to include a new experiment, the new scenario must be added following the same folder structure as nlg_gru and mlm_bert, naming the folder after the task.
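As a rough, hypothetical sketch of what such a folder's model definition tends to look like (the authoritative reference is the code in nlg_gru and mlm_bert; the constructor argument, method names, and batch/return formats below are assumptions to be checked against those examples):

# experiments/<my_task>/model.py -- hypothetical sketch; mirror the real
# structure and method names from experiments/nlg_gru or experiments/mlm_bert.
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, model_config):
        super().__init__()
        num_classes = model_config.get("num_classes", 10)  # assumed config key
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 14 * 14, num_classes),
        )
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.net(x)

    def loss(self, batch):
        # Training hook: the batch format is defined by the dataloader that
        # lives in the same experiment folder.
        features, labels = batch["x"], batch["y"]
        return self.criterion(self.forward(features), labels)

    def inference(self, batch):
        # Evaluation hook: return whatever the experiment's metrics code expects.
        features, labels = batch["x"], batch["y"]
        output = self.forward(features)
        accuracy = (output.argmax(dim=1) == labels).float().mean().item()
        return {"output": output, "acc": accuracy, "batch_size": labels.shape[0]}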

Experiments

Experiments are defined by YAML files; examples are provided in the configs folder. These can be run either locally or on AzureML.

For running experiments on AzureML, the CLI can help. You should first install the CLI (make sure you have v2) and create a resource group and workspace. You can then create a compute cluster; type az ml compute create -h for more info. Afterwards, you should write a YAML file with instructions for the job; we provide a simple example below.

experiment_name: basic_example
description: Basic example of AML config for submitting FLUTE jobs
code:
  local_path: .
compute: azureml:Test
environment:
  image: pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel
inputs:
  data:
    folder: azureml://datastores/data/paths/cifar
    mode: rw_mount
command: >
  apt -y update &&
  apt -y install openmpi-bin libopenmpi-dev openssh-client &&
  python3 -m pip install --upgrade pip &&
  python3 -m pip install -r requirements.txt &&
  python -m torch.distributed.run --nproc_per_node=4 e2e_trainer.py
  -outputPath=./outputs
  -dataPath={inputs.data}
  -task=classif_cnn
  -config=./experiments/classif_cnn/config.yaml
  -backend=nccl

You should replace compute with the name of the one you created before, and adjust the path of the datastore containing the data -- in the example above, we created a datastore called data and added to it a folder called cifar, which contained the two HDF5 files. The command passed above will install dependencies and then launch a distributed job with 4 threads, for the experiment defined in experiments/classif_cnn. Details on how to run a job using the AzureML CLI are given in its documentation, but typically it suffices to set up the environment and type az ml job create -f <name-of-the-yaml-file>. In the same page of the documentation, you can also find more info about how to set up the YAML file above, in case other changes are needed.

Note that the local_path above is relative to the location of the YAML file, so setting it to . assumes it is in the same folder as e2e_trainer.py. All files in this folder will be uploaded to Azure, including hidden folders such as .git, so make sure to temporarily get rid of large files and folders that are not needed.

After launching the experiment, you can follow it on AzureML Studio, which prints logs, plots metrics and makes the output easily available after the experiment is finished.

Privacy Accounting

Accounting is expensive, so we log all the privacy parameters so that accounting can be run offline; this is best done on a Linux box with a GPU. In particular, we use a DP accountant from another Microsoft repository, which is included in ours as a submodule. To use this accountant, just follow the instructions below:

$ git submodule update --init --recursive
$ cd utils
$ cd dp-accountant
$ python setup.py install
$ ./bin/compute-dp-epsilon --help
usage: compute-dp-epsilon [-h] -p SAMPLING_PROBABILITY -s NOISE_MULTIPLIER -i ITERATIONS -d DELTA

Third Party Notice

This software includes the files listed below from the Huggingface/Transformers library (https://github.com/huggingface/transformers) as part of task performance and preprocessing of pretrained models.

experiments/mlm_bert
└── utils
    ├── trainer_pt_utils.py
    └── trainer_utils.py

This software includes the file extensions/privacy/analysis.py from the Tensorflow/Privacy Library (https://github.com/tensorflow/privacy) as part of Renyi Differential Privacy implementation.

This software includes the script testing/build_vocab.py from LEAF Library (https://github.com/TalwalkarLab/leaf) to create the vocabulary needed to run a testing job.

This software includes the model implementation of the example ECG Classification | CNN LSTM Attention Mechanism from Kaggle Competition (https://www.kaggle.com/polomarco/ecg-classification-cnn-lstm-attention-mechanism) to reproduce the ecg_cnn experiment.

This software includes the model implementation of the FedNewsRec repository (https://github.com/taoqi98/FedNewsRec), from the paper "Privacy-Preserving News Recommendation Model Learning" (https://arxiv.org/abs/2003.09592), ported to the PyTorch framework to reproduce the fednewsrec experiment. For more information about third-party OSS licenses, please refer to NOTICE.txt.

This software includes the Data Augmentation scripts of the Fast AutoAugment repository (https://github.com/kakaobrain/fast-autoaugment) to preprocess the data used in the semisupervision experiment.

This software includes the FedProx logic implementation of the NIID-Bench repository (https://github.com/Xtra-Computing/NIID-Bench/tree/main) as a federated aggregation method used in the trainer object.

Support

You are welcome to open issues on this repository related to bug reports and feature requests.

Contributing

Contributions are welcomed and encouraged. For details on how to contribute, please see CONTRIBUTING.md.

msrflute's People

Contributors

jakob-98, microsoft-github-policy-service[bot], mirian-hipolito, simra


msrflute's Issues

Sample Code Running Error

Hi,

I recently installed FLUTE and was trying the sample example given in this repo's description.

Following is the command which I ran:

python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing/mockup -outputPath scratch -config testing/configs/hello_world_local.yaml -task nlg_gru -backend nccl

Following is the error I received.


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


The data can be found here: ./testing/mockup
Traceback (most recent call last):
File "/home/linuxsys/Downloads/msrflute/e2e_trainer.py", line 226, in
shutil.copyfile(args.config, cfg_out)
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/shutil.py", line 261, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'testing/configs/hello_world_local.yaml'
The data can be found here: ./testing/mockup
Traceback (most recent call last):
File "/home/linuxsys/Downloads/msrflute/e2e_trainer.py", line 226, in
shutil.copyfile(args.config, cfg_out)
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/shutil.py", line 261, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'testing/configs/hello_world_local.yaml'
The data can be found here: ./testing/mockup
Traceback (most recent call last):
File "/home/linuxsys/Downloads/msrflute/e2e_trainer.py", line 226, in
shutil.copyfile(args.config, cfg_out)
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/shutil.py", line 261, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'testing/configs/hello_world_local.yaml'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 44679) of binary: /home/linuxsys/anaconda3/envs/fluteLatest/bin/python
Traceback (most recent call last):
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/site-packages/torch/distributed/run.py", line 765, in
main()
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/linuxsys/anaconda3/envs/fluteLatest/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

e2e_trainer.py FAILED

Failures:
[1]:
time : 2022-10-03_17:01:38
host : linuxsys
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 44680)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2022-10-03_17:01:38
host : linuxsys
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 44681)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2022-10-03_17:01:38
host : linuxsys
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 44679)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Request you to please guide me to solve this issue.

Replay function on the Server is breaking

When the replay option is enabled on the server, it breaks because of the following:

  1. FLUTE does not allow these parameters in the config file. Schema.py should be updated for this.
  2. Server.py looks for the respective parameters in the client configuration instead of the server configuration, so the option is never enabled.

Request for Xbox client

Hello
This is a suggestion, not a bug report.
May we expect to see an Xbox client?
I am ready to help.

Annealing LR Scheduler required?

Is it possible to disable the annealing LR scheduler? If it is removed from the config.yaml file, the training process will not start.

FLUTE GPU utilisation vs performance

Hello,

While running a series of benchmarks between FLUTE and other frameworks we have observed a consistently high degree of GPU compute utilisation with low memory utilisation on the part of FLUTE. The backend used was NCCL.

Despite outclassing the other frameworks in compute utilisation, FLUTE underperforms in terms of round duration compared to one of the others by a factor of 2-4x. All experiments were carried out using the same hardware resource with either 2 or 4 GPUs and our results hold for both fast aggregation and normal aggregation.

Could you highlight some potential bottlenecks that FLUTE may encounter in an image task such that the high GPU utilisation does not translate to lower round duration?

We are interested in providing a fair comparison and would like some pointers for potential issues that may suppress the performance of FLUTE.

RuntimeError: CUDA error: invalid device ordinal and setting up NCCL + requesting subprocess module update for python 3.6+

Hi there maintainers,
First off, I'm thankful to the devs and the engineering that went into setting up this framework. I tried picking it up, and while trying to simulate GPU parallel computing with NCCL I ran into some issues.
Here's the error I'm currently trying to fix.

error [1]

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

My system is ZorinOS 16, which is based on Ubuntu 20.04, and I'm trying to use an Nvidia RTX 3060 GPU.

nvidia-smi returns the following

| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 31%   27C    P8    14W / 170W |   1426MiB / 12045MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1484      G   /usr/lib/xorg/Xorg                128MiB |
|    0   N/A  N/A      1633      G   /usr/bin/gnome-shell               89MiB |
|    0   N/A  N/A      7155      G   ...548701901119532058,131072       28MiB |
|    0   N/A  N/A     12476      C   ...da3/envs/FLUTE/bin/python     1175MiB |

and nvcc --version returns the following

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0

This screenshot shows that I have the PyTorch environment almost ready to go.

Screenshot from 2023-02-13 13-29-41

Now, when trying to install NCCL, I can't find a way to confirm whether the installation was successful, or where the NCCL home is.

Using the command from the readme (python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl) yields the following, with no models being stored in the scratch folder; this is error [1]'s original stack:

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
The data can be found here: The data can be found here: The data can be found here:   ./testing ./testing./testing


Mon Feb 13 12:39:20 2023 : Assigning default values for: {'batch_size', 'max_grad_norm'} in [server_config][val][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames', 'max_grad_norm'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'max_grad_norm', 'batch_size'} in [server_config][val][data_config]Mon Feb 13 12:39:20 2023 : Assigning default values for: {'batch_size', 'max_grad_norm'} in [server_config][val][data_config]

Mon Feb 13 12:39:20 2023 : Assigning default values for: {'max_grad_norm', 'num_frames'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames', 'max_grad_norm'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]

Mon Feb 13 12:39:20 2023 : Backend: nccl
Mon Feb 13 12:39:20 2023 : Backend: nccl
Mon Feb 13 12:39:20 2023 : Backend: nccl
Added key: store_based_barrier_key:1 to store for rank: 0Added key: store_based_barrier_key:1 to store for rank: 2

Added key: store_based_barrier_key:1 to store for rank: 1
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 1
Traceback (most recent call last):
  File "e2e_trainer.py", line 238, in <module>
    run_worker(model_path, config, task, data_path, local_rank, backend)
  File "e2e_trainer.py", line 100, in run_worker
    torch.cuda.set_device(device)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 0Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 2

Preparing model .. Initializing
Traceback (most recent call last):
  File "e2e_trainer.py", line 238, in <module>
    run_worker(model_path, config, task, data_path, local_rank, backend)
  File "e2e_trainer.py", line 100, in run_worker
    torch.cuda.set_device(device)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
GRU(
  (embedding): Embedding()
  (rnn): GRU2(
    (w_ih): Linear(in_features=160, out_features=1536, bias=True)
    (w_hh): Linear(in_features=512, out_features=1536, bias=True)
  )
  (squeeze): Linear(in_features=512, out_features=160, bias=False)
)
Mon Feb 13 12:39:20 2023 : initialize model with default settings
Mon Feb 13 12:39:20 2023 : trying to move the model to GPU
Mon Feb 13 12:39:21 2023 : model: GRU(
  (embedding): Embedding()
  (rnn): GRU2(
    (w_ih): Linear(in_features=160, out_features=1536, bias=True)
    (w_hh): Linear(in_features=512, out_features=1536, bias=True)
  )
  (squeeze): Linear(in_features=512, out_features=160, bias=False)
)
Mon Feb 13 12:39:21 2023 : torch.cuda.memory_allocated(): 10909184
/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/memory.py:397: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  FutureWarning)
Mon Feb 13 12:39:21 2023 : torch.cuda.memory_cached(): 23068672
Mon Feb 13 12:39:21 2023 : torch.cuda.synchronize(): None
Loading json-file:  ./testing/data/nlg_gru/val_data.json
Loading json-file:  ./testing/data/nlg_gru/test_data.json
Loading json-file:  ./testing/data/nlg_gru/train_data.json
Mon Feb 13 12:39:21 2023 : Server data preparation
Mon Feb 13 12:39:21 2023 : No server training set is defined
Mon Feb 13 12:39:21 2023 : Prepared the dataloaders
Mon Feb 13 12:39:21 2023 : Loading Model from: None
Could not load the run context. Logging offline
Attempted to log scalar metric System memory (GB):
15.414344787597656
Attempted to log scalar metric server_config.num_clients_per_iteration:
10
Attempted to log scalar metric server_config.max_iteration:
3
Attempted to log scalar metric dp_config.eps:
0
Attempted to log scalar metric dp_config.max_weight:
0
Attempted to log scalar metric dp_config.min_weight:
0
Attempted to log scalar metric server_config.optimizer_config.type:
adam
Attempted to log scalar metric server_config.optimizer_config.lr:
0.003
Attempted to log scalar metric server_config.optimizer_config.amsgrad:
True
Attempted to log scalar metric server_config.annealing_config.type:
step_lr
Attempted to log scalar metric server_config.annealing_config.step_interval:
epoch
Attempted to log scalar metric server_config.annealing_config.gamma:
1.0
Attempted to log scalar metric server_config.annealing_config.step_size:
100
Mon Feb 13 12:39:21 2023 : Launching server
Mon Feb 13 12:39:21 2023 : server started
Attempted to log scalar metric Max iterations:
3
Attempted to log scalar metric LR for agg. opt.:
0.003
Mon Feb 13 12:39:21 2023 : Running ['val'] at itr=0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12703 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12704) of binary: /home/crns/anaconda3/envs/FLUTE/bin/python
Traceback (most recent call last):
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
e2e_trainer.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-13_12:39:24
  host      : crns-IdeaCentre-Gaming5-14IOB6
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 12705)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-13_12:39:24
  host      : crns-IdeaCentre-Gaming5-14IOB6
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 12704)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

And before that, I tried running pytest -v -s in ./testing:
Screenshot from 2023-02-13 13-36-31
Screenshot from 2023-02-13 13-37-04

So my guess was that I hadn't set up NCCL properly. I tried to find the legacy build compatible with mine from https://developer.nvidia.com/nccl/nccl-legacy-downloads and got NCCL 2.11.4 for CUDA 11.4 (September 7, 2021),

and installed it with "sudo apt install libnccl2=2.11.4-1+cuda11.4 libnccl-dev=2.11.4-1+cuda11.4" as instructed, which went smoothly, but I still encountered the older stack trace.

Going to Nvidia's NCCL test repo, I skipped the installation steps because I have an official release, then tried to do "make" and then "./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1" (I tried changing the -g argument to 4 or keeping ngpus) and got the same error either way:
./build/all_reduce_perf: symbol lookup error: ./build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum

That's where I stopped, with those 2 issues where I feel solving one would help the other.

Before I got this far, I had to reformat the workstation a couple of times, seeing that Nvidia fails to keep all the necessary compatibility information in one place, but this post saved me.
In my previous environments, I managed to get FLUTE running on gloo; I still had a similar warning stack trace, but models could be saved.

In this fresh environment I also had trouble importing and using Python's built-in subprocess module, specifically because the "run" method generated errors that I worked around with https://stackoverflow.com/questions/40590192/getting-an-error-attributeerror-module-object-has-no-attribute-run-while, but even then I was still receiving an error because "text" couldn't be passed to the Popen class constructor: Failed: TypeError: __init__() got an unexpected keyword argument 'text'

So my investigation led to the fact that the text argument was added after Python 3.7, while your readme.md suggests 3.8; hence the problem. I can understand this if you have been working on this project for a long time, but it could have been a separate issue (which you can label as an enhancement) because it causes the tests in pytest -v -s to fail, and I felt it could be related to why the processes aren't being assigned to the virtual GPUs properly.

Other honorable mentions: using scikit-learn instead of the deprecated sklearn package in requirements.txt, and the fact that using the newest version of PyTorch (1.13, compatible with CUDA 11.7) leaves the speech recognition task with a deprecated torchaudio.

Apologies if I mentioned several irrelevant steps or issues, but I hope I can get an exact answer to error [1]'s stack trace and quickly get back to focusing on the experimentation side of my research. Thanks to the msrflute team; I hope to hear from you soon.

Sample code CUDA issue

Hi,
Following is the error I am getting when running the sample code given in the documentation. Request you to please look into it and help me resolve the issue.

vision@vision:~/aviral/msrflute$ python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl
WARNING:__main__:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


The data can be found here: ./testing
The data can be found here: ./testing
The data can be found here: ./testing
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'max_grad_norm', 'batch_size'} in [server_config][val][data_config]
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'max_grad_norm', 'num_frames'} in [server_config][test][data_config]
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]
Sat Nov 12 14:38:34 2022 : Backend: nccl
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'batch_size', 'max_grad_norm'} in [server_config][val][data_config]
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'num_frames', 'max_grad_norm'} in [server_config][test][data_config]
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]
Sat Nov 12 14:38:34 2022 : Backend: nccl
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'batch_size', 'max_grad_norm'} in [server_config][val][data_config]
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'num_frames', 'max_grad_norm'} in [server_config][test][data_config]
Sat Nov 12 14:38:34 2022 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]
Sat Nov 12 14:38:34 2022 : Backend: nccl
Added key: store_based_barrier_key:1 to store for rank: 2
Added key: store_based_barrier_key:1 to store for rank: 0
Added key: store_based_barrier_key:1 to store for rank: 1
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Traceback (most recent call last):
File "e2e_trainer.py", line 244, in
run_worker(model_path, config, task, data_path, local_rank, backend)
File "e2e_trainer.py", line 95, in run_worker
torch.cuda.set_device(rank)
File "/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/cuda/init.py", line 313, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Traceback (most recent call last):
File "e2e_trainer.py", line 244, in
run_worker(model_path, config, task, data_path, local_rank, backend)
File "e2e_trainer.py", line 95, in run_worker
torch.cuda.set_device(rank)
File "/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/cuda/init.py", line 313, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Sat Nov 12 14:38:34 2022 : Assigning worker to GPU 0
Preparing model .. Initializing
GRU(
(embedding): Embedding()
(rnn): GRU2(
(w_ih): Linear(in_features=160, out_features=1536, bias=True)
(w_hh): Linear(in_features=512, out_features=1536, bias=True)
)
(squeeze): Linear(in_features=512, out_features=160, bias=False)
)
Sat Nov 12 14:38:34 2022 : initialize model with default settings
Sat Nov 12 14:38:34 2022 : trying to move the model to GPU
Sat Nov 12 14:38:36 2022 : model: GRU(
(embedding): Embedding()
(rnn): GRU2(
(w_ih): Linear(in_features=160, out_features=1536, bias=True)
(w_hh): Linear(in_features=512, out_features=1536, bias=True)
)
(squeeze): Linear(in_features=512, out_features=160, bias=False)
)
Sat Nov 12 14:38:36 2022 : torch.cuda.memory_allocated(): 10909184
/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/cuda/memory.py:384: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
warnings.warn(
Sat Nov 12 14:38:36 2022 : torch.cuda.memory_cached(): 23068672
Sat Nov 12 14:38:36 2022 : torch.cuda.synchronize(): None
Traceback (most recent call last):
File "e2e_trainer.py", line 244, in
run_worker(model_path, config, task, data_path, local_rank, backend)
File "e2e_trainer.py", line 111, in run_worker
val_dataset = get_dataset(data_path, data_config["val"], task, mode="val", test_only=True)
File "/home/vision/aviral/msrflute/utils/dataloaders_utils.py", line 94, in get_dataset
dataset = dataset(data_file, test_only=test_only, user_idx=-1, args=data_config)
File "experiments/nlg_gru/dataloaders/dataset.py", line 26, in init
self.vocab = load_vocab(kwargs['args']['vocab_dict']) if 'args' in kwargs else load_vocab(vocab_dict)
File "/home/vision/aviral/msrflute/experiments/nlg_gru/utils/utility.py", line 28, in load_vocab
with open(url, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: './testing/models/vocab_reddit.vocab'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31988) of binary: /home/vision/anaconda3/envs/flute/bin/python
Traceback (most recent call last):
File "/home/vision/anaconda3/envs/flute/lib/python3.8/runpy.py", line 192, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/vision/anaconda3/envs/flute/lib/python3.8/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in
main()
File "/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/vision/anaconda3/envs/flute/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

e2e_trainer.py FAILED

Failures:
[1]:
time : 2022-11-12_14:38:38
host : vision
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 31989)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2022-11-12_14:38:38
host : vision
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 31990)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2022-11-12_14:38:38
host : vision
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 31988)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you provide a multi-node execution example?

Hello,
My research group at the University of Cambridge is looking to benchmark Flute on a multi-node setup using our machine cluster.
We have been unable to find an example script for launching multi-node executions, could you please provide it for us?

RFC: single-GPU setups, improving worker 0 utilization

This issue is to discuss a known limitation which is that FLUTE expects a minimum of two GPUs for any CUDA-based training. There must always be a Worker 0 GPU and then at least one more for client training. It would be valuable to be able to specify arbitrary mappings so that, say, Worker 0 and Worker 1 share the same GPU. From a memory standpoint this should be ok because they never need the GPU at the same time. I'm not sure that torch.distributed can support arbitrary mappings (note: CUDA_VISIBLE_DEVICES=0,0 doesn't work as a solution). Alternatively if we could assign worker 0 to cpu and worker 1+ to GPUs that might be a reasonable solution- relatively speaking, model aggregation is less expensive and could potentially be done on CPU.
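For concreteness, one possible shape of the CPU-server variant is a device mapping like the sketch below; this is just an illustration of the proposal, not existing FLUTE behaviour:

import torch

def pick_device(worker_rank, gpus_available):
    # Proposed mapping: worker 0 (the aggregating server) stays on CPU, while
    # workers 1..N are spread across whatever GPUs exist. Illustration only.
    if worker_rank == 0 or gpus_available == 0:
        return torch.device("cpu")
    return torch.device(f"cuda:{(worker_rank - 1) % gpus_available}")

# On a single-GPU box, worker 0 aggregates on CPU and every client worker
# trains on cuda:0.
print([pick_device(r, torch.cuda.device_count()) for r in range(4)])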

Thoughts?

profiling error

When profiling is enabled, server.py references self.run_metrics, which should be self.run_stats.
