ibm / fedma

Code for Federated Learning with Matched Averaging, ICLR 2020.

License: MIT License

Python 87.48% Jupyter Notebook 11.18% Shell 1.33%

fedma's Introduction

Federated Learning with Matched Averaging

This is the code accompanying the ICLR 2020 paper "Federated Learning with Matched Averaging". Paper link: [https://openreview.net/forum?id=BkluqlSFDS]

Overview

The FedMA algorithm is designed for federated learning of modern neural network architectures, e.g. convolutional neural networks (CNNs) and LSTMs. FedMA constructs the shared global model in a layer-wise manner by matching and averaging hidden elements (i.e. channels for convolution layers, hidden states for LSTMs, and neurons for fully connected layers) with similar feature extraction signatures.
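To make "matching and averaging" concrete, here is a minimal sketch for a single fully connected layer. It is not the repository's implementation: the function names are hypothetical, the cost is a plain squared distance, and the search is a brute force over permutations that only works for tiny layers. The matching used in this repository is more elaborate (the issues below mention BBP_MAP and a matched model that can be larger than the local ones).

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def match_and_average(global_neurons, client_neurons):
    """Hypothetical helper: align client neurons to global ones, then average.

    Each neuron is a list of incoming weights. All permutations are tried,
    so this is only an illustration for very small layers.
    """
    n = len(global_neurons)
    best_perm = min(
        permutations(range(n)),
        key=lambda p: sum(
            sq_dist(global_neurons[i], client_neurons[p[i]]) for i in range(n)
        ),
    )
    averaged = [
        [(g + c) / 2 for g, c in zip(global_neurons[i], client_neurons[best_perm[i]])]
        for i in range(n)
    ]
    return list(best_perm), averaged


# The client holds the same three neurons as the global layer, slightly
# perturbed and stored in a different order.
global_layer = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
client_layer = [[0.1, 0.9], [0.9, 1.1], [1.1, 0.1]]
perm, merged = match_and_average(global_layer, client_layer)
print(perm)    # client neuron index matched to each global neuron
print(merged)  # element-wise average after alignment
```

Averaging without the permutation step would blend unrelated neurons (for example `[1.0, 0.0]` with `[0.1, 0.9]`), which is the failure mode that matching is meant to avoid.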

Dependencies

Tested stable dependencies:

python 3.6.5 (Anaconda)
PyTorch 1.1.0
torchvision 0.2.2
CUDA 10.0.130
cuDNN 7.5.1
lapsolver 1.0.2

Data Preparation

Language Models:

For the language model experiments, we used the Shakespeare dataset provided by the Leaf project. Following the instructions for preparing the Shakespeare dataset, we chose the non-i.i.d., full-size dataset and split 80% of the data points into the training dataset. We also set the minimum number of samples per user to 9K. The following command returns our data partitioning:

./preprocess.sh -s niid --sf 1.0 -k 0 -t sample -tf 0.8 -k 9

Image Classification:

We simulate a heterogeneous partition, in which batch sizes and class proportions are unbalanced, by sampling the proportion of data points of each class assigned to each participating client from a Dirichlet distribution. Because the concentration parameter of the Dirichlet distribution is small (0.5), some sampled batches may contain no examples of certain classes. Details about this partition can be found in the partition_data function in ./utils.py.
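The idea of a Dirichlet-based split can be sketched as follows. This is an illustration under stated assumptions, not the partition_data function from ./utils.py: the names `dirichlet` and `partition_by_class` are hypothetical, and the Dirichlet draw is built from the standard library's gamma sampler so the example has no external dependencies.

```python
import random


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_by_class(labels, n_clients, alpha=0.5, seed=0):
    """Hypothetical helper: split sample indices across clients, class by class."""
    rng = random.Random(seed)
    client_indices = [[] for _ in range(n_clients)]

    # Group sample indices by class label.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = dirichlet(alpha, n_clients, rng)
        # Turn the sampled proportions into cut points over this class.
        start, cum = 0, 0.0
        for c in range(n_clients):
            cum += props[c]
            end = len(idxs) if c == n_clients - 1 else int(round(cum * len(idxs)))
            client_indices[c].extend(idxs[start:end])
            start = end
    return client_indices


labels = [i % 10 for i in range(1000)]  # toy labels: 10 balanced classes
parts = partition_by_class(labels, n_clients=4)
for c, p in enumerate(parts):
    classes = sorted({labels[i] for i in p})
    print(f"client {c}: {len(p)} samples, classes present: {classes}")
```

With a small `alpha`, most of the mass of each draw tends to land on a few clients, so client sizes differ and some clients may miss whole classes. Larger values of `alpha` push the split toward a uniform one.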

Experiments over Language Task:

The source code for the language task experiments, i.e. LSTM over the Shakespeare dataset, is located in the folder FedMA/language_modeling. We summarize the functionality of each script below.

Script	Functionality
`ensemble_accuracy_calculator.py`	Evaluating the performance of an ensemble across local models trained on participating clients.
`language_main.py`	Conducting `FedAvg` and `FedProx` experiments, which are used as baseline methods.
`language_oneshot_matching.py`	Evaluating the performance of one-shot match i.e. PFNM-style model fusion.
`language_whole_training.py`	Centralized training over one device, i.e. we combine the local datasets and conduct centralized training. This is the strongest possible baseline for any Federated Learning method.
`lstm_fedma_with_comm.py`	Our proposed "FedMA with communication algorithm".

Experiments over Image Classification Task:

The main results for the image classification task, i.e. VGG-9 on CIFAR-10, can be reproduced by running ./run.sh. The following arguments to ./main.py control the important parameters of the experiment.

Argument	Description
`model`	The CNN architecture that each client trains locally.
`dataset`	Dataset to use. We use CIFAR-10 to study FedMA.
`lr`	Initial learning rate to use.
`retrain_lr`	The learning rate for the local re-training process. Usually set to the same value as `lr`.
`batch-size`	Batch size for the optimizers e.g. SGD or Adam.
`epochs`	Number of local training epochs.
`retrain_epochs`	Number of local re-training epochs.
`n_nets`	Number of participating local clients.
`partition`	Data partitioning strategy. Set to `hetero-dir` for the simulated heterogeneous CIFAR-10 dataset.
`comm_type`	Federated learning methods. Set to `fedavg`, `fedprox`, or `fedma`.
`comm_round`	Number of communication rounds to use in `fedavg`, `fedprox`, and `fedma`.
`retrain`	Flag to retrain the model or load from checkpoint.
`rematching`	Flag to re-conduct the matching process or load from checkpoint.

Sample command

python main.py --model=moderate-cnn \
--dataset=cifar10 \
--lr=0.01 \
--retrain_lr=0.01 \
--batch-size=64 \
--epochs=20 \
--retrain_epochs=20 \
--n_nets=16 \
--partition=hetero-dir \
--comm_type=fedma \
--comm_round=50 \
--retrain=True \
--rematching=True

Interpretability of FedMA:

The interpretability results presented in the FedMA paper are summarized in a Jupyter notebook, i.e. ./jupyter_notebook/Interpretability_fedma.ipynb.

Handling Data Bias Experiments:

The data bias handling experiments presented in the FedMA paper are summarized in the script ./dist_skew_main.py. To reproduce the experiment, one can simply run:

bash run_dist_skew.sh

Sample command

python dist_skew_main.py --model=moderate-cnn \
--dataset=cifar10 \
--lr=0.01 \
--retrain_lr=0.01 \
--batch-size=64 \
--epochs=10 \
--retrain_epochs=20 \
--n_nets=2 \
--partition=homo \
--comm_type=fedma \
--retrain=True \
--rematching=True

Citing FedMA:

@inproceedings{
Wang2020Federated,
title={Federated Learning with Matched Averaging},
author={Hongyi Wang and Mikhail Yurochkin and Yuekai Sun and Dimitris Papailiopoulos and Yasaman Khazaeni},
booktitle={International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=BkluqlSFDS}
}

fedma's Issues

Provide more details about the experiments?

Good job! I am very interested in this work and I tried to run the experiments mentioned in the paper. My questions are:

How do I run the experiments (CIFAR10, MNIST, Shakespeare)? It seems that only the CIFAR10 experiment is available now.
How does FedMA work? The terms retrain and rematching confuse me.

Thank you.

Does hyper parameter "retrain_epoch" lead to extra training for FedMA?

Dear authors,

Your work is very impressive and thanks for open-sourcing the code!

Please correct me if I got it wrong - In each round of FedMA, you need to retrain (num_layers * retrain_epoch) total epochs, not including (retrain_epoch) epochs for the whole model to be updated. So the extra computation will be very intensive if you set a relatively large "retrain_epoch" (say 20 as shown in the sample command). Could you share the specific number for this hyper-parameter to reproduce your results on Cifar-10?

Thanks in advance and looking forward to your reply.

I want to run this on heterogeneous data, but there are some errors

When I try to use the simple-cnn model on heterogeneous data, I get some errors. My command is:
python dist_skew_main.py --model=simple-cnn --dataset=cifar10 --lr=0.01 --retrain_lr=0.01 --batch-size=64 --epochs=10 --retrain_epochs=10 --n_nets=10 --partition=hetero-dir --comm_type=fedma --comm_round=10 --retrain=True --rematching=True

and the error is:
Traceback (most recent call last):
File "dist_skew_main.py", line 1181, in
args.partition, args.n_nets, args_alpha, args=args)
File "/home/wjj/three/FedMA-master/utils.py", line 274, in partition_data_dist_skew
traindata_cls_counts = record_net_data_stats(y_train, net_dataidx_map, logdir)
UnboundLocalError: local variable 'net_dataidx_map' referenced before assignment

the code is

Can you tell me how to solve this problem?
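An UnboundLocalError of this kind usually means that a local variable is assigned only in some branches of a function and then read on a path where no branch ran. The sketch below reproduces the pattern with a hypothetical function. It is not the code of partition_data_dist_skew, and whether that function handles `hetero-dir` is an assumption that is not confirmed here. Note that the sample command for dist_skew_main.py in the README uses `--partition=homo`.

```python
def partition_sketch(partition, n_samples, n_clients):
    """Hypothetical stand-in that assigns the map only for one partition name."""
    if partition == "homo":
        net_dataidx_map = {
            c: list(range(c, n_samples, n_clients)) for c in range(n_clients)
        }
    # No branch assigns net_dataidx_map for any other partition name,
    # so the next line fails for e.g. "hetero-dir".
    return net_dataidx_map


ok = partition_sketch("homo", 8, 2)
print(ok)

try:
    partition_sketch("hetero-dir", 8, 2)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__
    print(error_name, "-", exc)
```

If the real function follows this pattern, the options would be to use a partition name it supports or to add a branch that builds the map for the missing case.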

UnboundLocalError: local variable 'shape_estimator' referenced before assignment

Thanks for sharing your great work. When I reimplemented your code, I encountered the following error:
UnboundLocalError: local variable 'shape_estimator' referenced before assignment

I used LeNet on MNIST.

Could you tell me what may cause this error?

Thanks.

Use CNN matching as a black box

I want to implement FedMA in an FML framework. Is there a function inside this repo that I can use as a black box?
I want it to take as input, let's say, two layers from client CNNs and one from the global model, and to return the matched output.

I want to change the hyperparameters of the language model

When I try to change NUM_LAYERS in lstm_fedma_with_comm.py, it causes problems, so I changed the RNNmodel hyperparameter nlayers, but I still can't get it to work.
For example, when I set NUM_LAYERS = 4 (2-layer LSTM, 4 layers: encoder|hidden LSTM1|hidden LSTM2|decoder), I set nlayers=2.

Traceback (most recent call last):
File "language_oneshot_matching.py", line 504, in
matching_shapes=matching_shapes)
File "/home/hx/github/FedMA/language_modeling/language_fedma.py", line 258, in layerwise_fedma
reconstructed_bias = [split_bias(batch_weights[j][layer_index+3+2]) for j in range(J)]
File "/home/hx/github/FedMA/language_modeling/language_fedma.py", line 258, in
reconstructed_bias = [split_bias(batch_weights[j][layer_index+3+2]) for j in range(J)]
IndexError: list index out of range

Unable to run lstm_fedma_with_comm.py file

I tried to run the lstm_fedma_with_comm.py file to reproduce the paper results, but I got a file-not-found error for the following files:
lstm_matching_assignments, lstm_matching_shapes and matched_global_weights.

Some questions about the oneshot_matching experiment

Hi, a few things confuse me:

What is the difference between oneshot_matching and BBP_MAP?
The retrain process in matching actually introduces information from the original data multiple times, so does the matching really reflect its ability to aggregate in a single communication round?

Some questions about initialization and retrain

Hi, thank you for sharing your outstanding work! I have some questions about the settings in the paper. Could you please provide some more details about the experiment settings?

In the code, the J clients do not share the same initialization, and they are retrained before the first global round. Is there any difference in settings between the initial retrain process and the later local retraining (like FedAvg local retraining)?
What is the batch-size of the dataset corresponding to the results in this paper?
Does the experiment in the paper use the following command?
python main.py --model=moderate-cnn \ --dataset=cifar10 \ --lr=0.01 \ --retrain_lr=0.01 \ --batch-size=64 \ --epochs=150 \ --retrain_epochs=150 \ --n_nets=16 \ --partition=hetero-dir \ --comm_type=fedma \ --comm_round=10 \ --retrain=True \ --rematching=True

Thanks very much

unable to install lapsolver

Running FedMA with simple cnn

Hi,
Very good job! I love your method.
I tried to run FedMA using the "simple-cnn" model, but there is a mismatch between the size of matched_cnn (which is bigger than the original model) and the weights after alignment.

The error is:
File "../main.py", line 97, in trans_next_conv_layer_backward
reshaped = layer_weight.reshape(reconstructed_next_layer_shape).transpose(1, 0, 2, 3).reshape(next_layer_shape[0], -1)
ValueError: cannot reshape array of size 3750 into shape (15,25,5,5)
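The numbers in this message can be checked by hand: a reshape only succeeds when the element counts agree. The snippet below is a small arithmetic sketch. The candidate channel counts it tries (6 or 15 outputs, 10 or 25 inputs) are illustrative assumptions, not values read from the repository's model definitions.

```python
from functools import reduce
from operator import mul


def n_elements(shape):
    """Number of elements an array of the given shape holds."""
    return reduce(mul, shape, 1)


target_shape = (15, 25, 5, 5)  # shape requested in the traceback
actual_size = 3750             # size of the array being reshaped

required_size = n_elements(target_shape)  # 15 * 25 * 5 * 5
fits = actual_size == required_size

# Illustrative 5x5-kernel shapes that do hold exactly 3750 values.
candidates = [
    (o, i, 5, 5)
    for o in (6, 15)
    for i in (10, 25)
    if n_elements((o, i, 5, 5)) == actual_size
]

print(required_size, fits)  # 9375 False
print(candidates)           # [(6, 25, 5, 5), (15, 10, 5, 5)]
```

The weights being reshaped hold 3750 values while the target shape needs 9375, so one of the channel counts used to build the target shape does not correspond to the layer the weights came from.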

Question on code in language_fedma.py

Hi,

Thank you for your wonderful work and making the code public.

I have a question regarding the code in language_fedma.py. In line 302-303, why is there a 'pass' if layer_index ==2 for example (please refer to attached screenshot)? What happens to the case where we have more layers?

Can I use the code from lines 309-313 again for layer_index ==2 instead of a pass?

Thank you very much!

Reproduce results from run_dist_skew.sh

Lots of interesting ideas in the paper.
I am trying to reproduce the results for fedavg using "run_dist_skew.sh". In Fig 4 of the paper the quoted accuracy is around 66%, but in my runs the accuracy hardly increases beyond 50%. Could you please help with the settings needed to reproduce the results in Fig 4 for fedavg?
PC

Running FedMA with large input data shape

Hi @hwang595, a few weeks ago I asked some questions in another issue thread about a problem that I had when trying to train a model with an input image shape greater than or equal to 224x224. Since then, I tried reducing the dimensions of my problem to the default size, i.e. 32x32, and it worked well! But when I run with 224x224, I'm still stuck in this training part.

So I'm gonna ask my questions here again:

Is there a relationship between the training input size and the FedMA communication process? If so, what can we do about it?
When adding a different model, which parts of the code should I take care of, besides changing, for example, the input dimensions to 1x224x224?

Obs.: As I'm working with medical images it is critical resize them.

Thanks for the great work!

The problem of "local_retrain" function in the main.py

Question about reproducing results in the paper (LSTM on Shakespeare dataset)

Hello,
Thank you for the great work. I am studying federated learning in NLP. I tried to reproduce the results in the paper (mainly LSTM on the Shakespeare dataset), but the results seem far from what they should be. Please help me check what I missed in my experiments.

(1) The Shakespeare data preprocessing is described as below in the paper:

So I use the following command to preprocess the data:

./preprocess.sh -s niid --sf 1.0 -k 0 -t sample -tf 0.8 -k 10000

(2) It is indicated in the paper that the experiments were done with a 1-layer LSTM.

Reading from the code, I believe this is equivalent to setting:

NUM_LAYERS=3

as it will have one input layer, one output layer and one hidden LSTM layer (where the invariant permutation problem is addressed by FedMA).

(3) It is noted in the paper that FedAvg and FedProx were trained with 33 communication rounds, while FedMA was trained with 11 communication rounds (because each round of FedMA requires 3 communication rounds, corresponding to the number of LSTM layers). I actually used 30 for FedAvg and FedProx and 10 for FedMA, like this:

For FedAvg
python language_main.py --mode=fedavg --comm-round=30

For FedProx
python language_main.py --mode=fedprox --comm-round=30

For FedMA
python language_main.py --mode=fedma --comm-round=10
(I do not think --comm-round has any effect in FedMA anyway, because the code performs a single round of FedMA)
Then I performed the rest of the FedMA communication rounds by running
python lstm_fedma_with_comm.py
(lstm_fedma_with_comm.py has 10 communication rounds hard-coded)

(4) The results do not seem aligned with what is indicated in the paper. FedProx got lower test accuracy than FedAvg, but FedMA also got lower accuracy than FedAvg.

For FedAvg

For FedProx

For FedMA
Result from the first step (language_main.py)

Result from the second step (lstm_fedma_with_comm.py)

Results from the paper

Actually, my FedAvg got substantially higher accuracy than in the paper. It reaches 0.5 test accuracy, while none of these 3 approaches reaches such accuracy in the paper.
** I did not tune E (local training epochs) and used the default value (5), but the results are still not aligned with what is indicated in the paper for E=5.

Thank you in advance for your help.
