securefederatedai / openfl
An open framework for Federated Learning.
Home Page: https://openfl.readthedocs.io/en/latest/index.html
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
This is more of an enhancement. Some environments don't allow you to use specific ports (for example 8888, which seems to be the default port used by the Jupyter notebook included in fx). It would be good to allow users to specify the port they want.
Describe the solution you'd like
It would be good to allow users to specify the port they want. There is already an --ip option in fx tutorial start. Adding --port would solve the problem.
Additional context
Logs from starting the notebook
[I 11:27:36.940 NotebookApp] Serving notebooks from local directory: /home/ubuntu/anaconda3/envs/test/lib/python3.6/site-packages/openfl-tutorials
[I 11:27:36.940 NotebookApp] Jupyter Notebook 6.4.0 is running at:
[I 11:27:36.940 NotebookApp] http://aggregator:8888/?token=f78c90d7649f1166bf83df9f0f6f69c6e605494b9b4a3a23
[I 11:27:36.940 NotebookApp] or http://127.0.0.1:8888/?token=f78c90d7649f1166bf83df9f0f6f69c6e605494b9b4a3a23
[I 11:27:36.940 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 11:27:36.945 NotebookApp] No web browser found: could not locate runnable browser.
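A minimal sketch of what the requested option could look like, assuming a hypothetical helper that assembles the Jupyter launch command (build_notebook_command is not an actual OpenFL function):

```python
# Hypothetical sketch: build the Jupyter launch command with a
# user-supplied port instead of the hard-coded default 8888.
def build_notebook_command(ip='127.0.0.1', port=8888):
    """Return the argv list used to launch the tutorial notebook server."""
    return [
        'jupyter', 'notebook',
        f'--ip={ip}',
        f'--port={port}',
        '--no-browser',  # useful on headless aggregator nodes
    ]

cmd = build_notebook_command(ip='0.0.0.0', port=8889)
print(' '.join(cmd))
```

A `--port` CLI option on `fx tutorial start` would simply forward its value into such a helper.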
Bug while creating CA:
fx pki install -p </path/to/ca/dir> --ca-url <host:port>
Got an exception:
Password:
Repeat for confirmation:
[16:23:46] INFO Creating CA ca.py:157
CA binaries from github will be downloaded now [Y/n]: y
[16:23:51] INFO Downloading step-ca_linux_0.17.2_amd64.tar.gz.sig ca.py:57
EXCEPTION : Unknown archive format './step-ca_linux_0.17.2_amd64.tar.gz.sig'
Traceback (most recent call last):
...
File "/home/akhorkin/.virtualenvs/openfl/bin/fx", line 8, in <module>
sys.exit(entry())
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/interface/cli.py", line 214, in entry
error_handler(e)
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/interface/cli.py", line 173, in error_handler
raise error
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/interface/cli.py", line 212, in entry
cli()
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/interface/pki.py", line 64, in install_
install(ca_path, ca_url, password)
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/component/ca/ca.py", line 168, in install
download_step_bin(url, 'step-ca_linux', 'amd', prefix=ca_path, confirmation=False)
File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/component/ca/ca.py", line 59, in download_step_bin
shutil.unpack_archive(f'{prefix}/{name}', f'{prefix}/step')
File "/usr/local/lib/python3.8/shutil.py", line 1223, in unpack_archive
raise ReadError("Unknown archive format '{0}'".format(filename))
shutil.ReadError: Unknown archive format './step-ca_linux_0.17.2_amd64.tar.gz.sig'
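One plausible guard for the unpacking step, sketched with hypothetical helpers (this is not the actual OpenFL fix): hand only real archives to shutil.unpack_archive and skip detached signature files such as *.sig.

```python
import shutil

# Suffixes shutil.unpack_archive understands out of the box.
ARCHIVE_SUFFIXES = ('.tar.gz', '.tgz', '.zip', '.tar')

def is_archive(filename):
    """True for real archives, False for e.g. *.tar.gz.sig signature files."""
    return filename.endswith(ARCHIVE_SUFFIXES)

def unpack_if_archive(path, dest):
    """Unpack path into dest only when it is actually an archive."""
    if not is_archive(path):
        return False  # skip step-ca_linux_0.17.2_amd64.tar.gz.sig and friends
    shutil.unpack_archive(path, dest)
    return True
```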
Desktop:
Expected behavior:
A valid .zip archive is downloaded and unpacked.
Describe the bug
When following the setup instructions using conda 4.6, openfl failed to install in the resulting conda environment. While not an openfl bug, we may want to determine how to make these instructions work on older conda versions.
NOTE: I am using the 'develop' branch.
To Reproduce
Steps to reproduce the behavior:
There are no tests for openfl-tutorials/interactive_api.
These notebooks cover core OpenFL scenarios, and if some change breaks this functionality it should be fixed as soon as possible, because they are the entry point for new users: if something is broken here, a user may decide that the whole library isn't working.
It would be great to create an environment for these notebooks and run them on CI.
https://arxiv.org/pdf/1910.07796.pdf
We'd like to support FedCurv for robust aggregation.
I'm a student trying to understand this FL framework. I've started with the notebook tutorials, like the one called "new_python_api_Tensorflow_MNIST.ipynb". I've run the scenario with 2 collaborators, each one in its own container. I noticed that one collaborator always (in more than 10 runs) has better accuracy metrics than the other, even if:
Could this be a feature of this FL framework that I don't know about, or maybe I haven't understood well how the learning phases work? Thanks for your patience and support.
On some machines, when I run a federation (even if both the collaborators and the aggregator are on the same machine), they fail to establish a connection.
I have started facing this problem specifically in the interactive API, I have been unsuccessful in running a federation as collaborator is unable to connect to the port where aggregator has started the gRPC server.
Reproducing the error:
I did not do anything differently than what is already mentioned in the tutorial.
I created a fresh conda environment, installed the openfl library and finally tried to replicate the experiment.
We tried to debug this error, and in the process we found out that the gRPC server from the aggregator runs exclusively on IPv6, whereas the collaborator tries to connect over IPv4. We even tried to hardcode the server address and port numbers, but we were unable to make it work. We suspect the error has something to do with the way the gRPC server and client are started in https://github.com/intel/openfl/blob/c3c0886aefeb09f426fc3726be0f65de2b344e22/openfl/transport/grpc/server.py and https://github.com/intel/openfl/blob/c3c0886aefeb09f426fc3726be0f65de2b344e22/openfl/transport/grpc/client.py
I think this error can pose a potentially big problem in the future. Therefore, please look into it.
Thanks
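A small stdlib diagnostic, independent of OpenFL, for checking which address family an endpoint actually accepts (all names here are illustrative):

```python
import socket

def can_connect(host, port, family):
    """Probe a TCP endpoint over one address family (AF_INET or AF_INET6)."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no address of this family for the host
    for af, socktype, proto, _, addr in infos:
        try:
            with socket.socket(af, socktype, proto) as s:
                s.settimeout(2.0)
                s.connect(addr)
                return True
        except OSError:
            continue  # this candidate address did not accept the connection
    return False
```

If the probe succeeds over IPv6 but fails over IPv4, the server is effectively IPv6-only; binding the gRPC server to '[::]:port' on a dual-stack host, or explicitly to '0.0.0.0:port', usually resolves this class of mismatch.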
OpenFL should provide an option to save a log of the TensorDB after a failure or when a user hits Ctrl-C. This will make debugging aggregated values significantly easier and allow users to submit more informative issues.
Hi there,
I was trying to do fx plan initialize but encountered the following error message from the data loader:
fx workspace create --prefix ${HOME}/2dunet --template tf_2dunet
pip install -r requirements.txt
fx plan initialize
FileNotFoundError: [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"
Nevertheless, we also tried with files and directories that are valid and must exist, such as '/home/', but the error message is still there, saying No such file or directory: "'/home/'"
Describe the bug
Following the OpenFL documentation instructions here, I get the following error when running pip install openfl:
ERROR: Could not find a version that satisfies the requirement openfl
ERROR: No matching distribution found for openfl
OS: macOS Big Sur
Is your feature request related to a problem? Please describe.
During the initialization of the federation (federated environment), the fx
command is very useful. However, it performs some "additional tasks" that are typically not required (or may be problematic in the future) and need to be rolled back manually.
A list of non-necessary actions:
Do not call pip install -r requirements.txt inside fx workspace create
fx workspace create creates a workspace from a template, but it also calls pip install -r requirements.txt inside. For tf_2dunet, it installs TensorFlow 2.3.1, which is not the current version as of 05/2021 (the current version is 2.4.1). So typically there is a big chance it will roll back an already installed (and working) TensorFlow version in the user's Python environment to a previous version, which may not work for them. Moreover, the user may want to change/modify the model and/or supply their own pretrained one.
Do not check data folders inside fx plan initialize
fx plan initialize also takes into consideration the data paths set in the <workspace>/plan/data.yaml file. But since fx plan initialize is called on the aggregator, and the data folders for individual clients are on completely different computers, it must not be assumed they are accessible from the aggregator.
Describe the solution you'd like
Remove the pip ... calls from the fx tool.
step_config, cert, pass_file, config - @dmitryagapov - #150
Set a default path (i.e. ~/.local/workspace/) with a standard naming convention ('director.crt', 'envoy_one.crt', 'envoy_two.crt', etc.) so long-living entities can start without always providing paths for root_cert, cert, and private_key (defaults can still be overridden) - a separate issue was created: #161
Describe the bug
When launching the federated training in the Jupyter notebooks of the openfl-tutorials folder, I noticed that different collaborators achieve the same validation metric values. That would only be possible if the collaborators had the same data, but the data is randomly split. It looks like there is an issue in how the Data Loader is defined for each collaborator.
To Reproduce
Steps to reproduce the behavior:
Run openfl-tutorials/Federated_PyTorch_UNET_Tutorial.ipynb in Jupyter.
Expected behavior
Collaborators have different metrics due to random data split.
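For reference, a disjoint split can be as simple as the round-robin sketch below (illustrative only, not the tutorial's actual loader); each collaborator gets non-overlapping indices, so identical validation metrics would be a red flag:

```python
# Illustrative shard assignment: collaborator k of n gets samples
# k, k+n, k+2n, ... so no two collaborators share a sample.
def shard_indices(num_samples, shard_num, num_shards):
    """Return the sample indices belonging to one collaborator's shard."""
    return list(range(shard_num, num_samples, num_shards))
```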
Screenshot
It would be great to have an fx + TAB combination (like standard Linux autocomplete) for the current CLI.
Describe the bug
To run an envoy we need to install extra requirements that are not installed with openfl.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The envoy works correctly.
I propose to create a general mechanism for receiving configurable data and to exclude the default values from the code, setting them in a default config file instead. For example, the director would take params from the CLI if they are passed, otherwise from director.yaml in the director workspace, otherwise from openfl-workspace/default/director.yaml.
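The proposed precedence can be sketched with the stdlib ChainMap (the keys below are illustrative, not the real director.yaml schema):

```python
from collections import ChainMap

# Lookup order: CLI args override the workspace director.yaml,
# which overrides the packaged default director.yaml.
packaged_defaults = {'listen_port': 50051, 'sample_shape': ['300', '400']}
workspace_cfg = {'listen_port': 50052}   # from <workspace>/director.yaml
cli_args = {}                            # nothing passed on the command line

settings = ChainMap(cli_args, workspace_cfg, packaged_defaults)
print(settings['listen_port'])   # taken from the workspace config
print(settings['sample_shape'])  # falls through to the packaged default
```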
Recently, OpenFL has expanded the types of custom aggregation functions that can be computed on collaborator models. Some participants from the FeTS Challenge had complex aggregation strategies they wished to implement that were based on novel calculated metrics that could be used for a future round.
Now we are using ['100-200','300-400'] to describe height and width. Maybe there is a better alternative, such as ((100, 200), (300, 400)), or we could even create a special type for it.
https://github.com/intel/openfl/pull/151/files#r692170647
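A tiny parser sketch showing the conversion from the current string form to the proposed tuple form (helper names are hypothetical):

```python
# Convert '100-200' style range strings into the proposed tuple form.
def parse_range(spec):
    low, high = spec.split('-')
    return int(low), int(high)

def parse_shape(specs):
    # ['100-200', '300-400'] -> ((100, 200), (300, 400))
    return tuple(parse_range(s) for s in specs)
```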
Describe the bug
The "model_states" dictionary in the run_experiment function appears to be unused. Perhaps a hold-over from graph sharing?
I am able to comment out all lines involving "model_states" without any impact to the experiment that I can find.
Hi,
If we want to use privacy-preserving technologies such as differential privacy in secure aggregation in openfl, are there any tutorials or class interfaces we can override to include the added security?
Are there any tutorials or class interfaces in openfl in which custom aggregation algorithms can be included, other than federated averaging? Edit: I just realized there is new documentation on custom aggregation at https://openfl.readthedocs.io/en/latest/overriding_agg_fn.html
Thanks.
As a user I would like to keep track of my Federated Learning experiments and plot statistics of the model performance.
One implementation may be using a Model Database, such as ModelDB https://github.com/VertaAI/modeldb
This could simply plug into our current code via the Python API. There are some nice features such as model and data versioning (Git-like) and dashboards.
As of now, the envoy command asks for the director's URI, whereas the director asks for an IP and port. It would be great if we could pass in the director's URI at both nodes.
Describe the bug
I am trying to set up a federation based on the '' template, following the documentation written here.
The problem is that the command fx plan initialize (as mentioned in point 7) fails due to checks for non-existing data folders. In the default setup, it looks for a path (which seems to be some 'leftovers' from your development environment), and even after specifying the local paths, it tries to look for them somewhere else.
To Reproduce
Steps to reproduce the behavior:
tf_2dunet
export WORKSPACE_TEMPLATE=tf_2dunet
export WORKSPACE_PATH=${HOME}/projects/my-work/openfl-federations/federation_0.2
cd ${WORKSPACE_PATH}
fx workspace create --prefix ${WORKSPACE_PATH} --template ${WORKSPACE_TEMPLATE}
Requirements from requirements.txt are installed via pip. Running pip install -r requirements.txt manually, as mentioned in point 6 of the tutorial, is not necessary; the fx command will not update pip requirements.
fx plan initialize ends with the error:
EXCEPTION : [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"
Traceback (most recent call last):
File "/home/rstoklas/miniconda3/envs/open-fl/bin/fx", line 8, in <module>
sys.exit(entry())
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 194, in entry
error_handler(e)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 155, in error_handler
raise error
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 192, in entry
cli()
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return __callback(*args, **kwargs)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/plan.py", line 78, in initialize
task_runner = plan.get_task_runner(collaborator_cname)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 298, in get_task_runner
defaults[SETTINGS]['data_loader'] = self.get_data_loader(
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 286, in get_data_loader
self.loader = Plan.Build(**defaults)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 173, in Build
instance = getattr(module, class_name)(**settings)
File "/home/rstoklas/projects/my-work/openfl-federations/federation_0.2/code/tfbrats_inmemory.py", line 29, in __init__
X_train, y_train, X_valid, y_valid = load_from_NIfTI(parent_dir=data_path,
File "/home/rstoklas/projects/my-work/openfl-federations/federation_0.2/code/brats_utils.py", line 93, in load_from_NIfTI
subdirs = os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"
After modifying plan/data.yaml to point to the existing directories, it still fails:
{'01-win': 'data/client-01', '02-pegas': 'data/client-02', '03-pegas': 'data/client-03'}
INFO Building 🡆 Object TensorFlowBratsInMemory from code.tfbrats_inmemory Module. plan.py:168
INFO Settings 🡆 {'batch_size': 64, 'percent_train': 0.8, 'collaborator_count': 2, 'data_group_name': 'brats', 'data_path': plan.py:171
'data/client-01'}
INFO Override 🡆 {'defaults': 'plan/defaults/data_loader.yaml'} plan.py:173
EXCEPTION : need at least one array to concatenate
Traceback (most recent call last):
File "c:\anaconda3\envs\open-fl\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\anaconda3\envs\open-fl\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\open-fl\Scripts\fx.exe\__main__.py", line 7, in <module>
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 194, in entry
error_handler(e)
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 155, in error_handler
raise error
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 192, in entry
cli()
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 610, in invoke
return __callback(*args, **kwargs)
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "C:\Anaconda3\envs\open-fl\Lib\site-packages\openfl\interface\plan.py", line 77, in initialize
data_loader = plan.get_data_loader(collaborator_cname)
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\federated\plan\plan.py", line 293, in get_data_loader
self.loader = Plan.Build(**defaults)
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\federated\plan\plan.py", line 179, in Build
instance = getattr(module, class_name)(**settings)
File "C:\Users\rstoklas\cernbox\work\my-projects\FL-phase-3_network\federation-0.1\code\tfbrats_inmemory.py", line 29, in __init__
X_train, y_train, X_valid, y_valid = load_from_NIfTI(parent_dir=data_path,
File "C:\Users\rstoklas\cernbox\work\my-projects\FL-phase-3_network\federation-0.1\code\brats_utils.py", line 125, in load_from_NIfTI
imgs_train = np.concatenate(imgs_all_train, axis=0)
File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
Expected behavior
Screenshots
If applicable, add screenshots to help explain your problem.
Error with modified and correct paths:
Desktop (please complete the following information):
Describe the bug
After a clean install of openfl (develop branch) in a new conda environment, when running the pytorch MNIST tutorial, cell 2 fails due to "PIL" not found. Fixed by installing 'pillow' in the conda env.
Hi all, I built the openfl Docker image (with the current master), and I'm trying the Keras MNIST tutorial in a Docker container. However, I currently get the following error:
final_fl_model = fx.run_experiment(collaborators,override_config={'aggregator.settings.rounds_to_train':5})
File "/usr/local/lib/python3.8/dist-packages/openfl/native/native.py", line 297, in run_experiment
collaborator.run_simulation()
File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 147, in run_simulation
self.do_task(task, round_number)
File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 192, in do_task
input_tensor_dict = self.get_numpy_dict_for_tensorkeys(
File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 214, in get_numpy_dict_for_tensorkeys
return {k.tensor_name: self.get_data_for_tensorkey(k) for k in tensor_keys}
File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 214, in <dictcomp>
return {k.tensor_name: self.get_data_for_tensorkey(k) for k in tensor_keys}
File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 290, in get_data_for_tensorkey
nparray = self.get_aggregated_tensor_from_aggregator(
File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 328, in get_aggregated_tensor_from_aggregator
tensor = self.client.get_aggregated_tensor(
File "/usr/local/lib/python3.8/dist-packages/openfl/component/aggregator/aggregator.py", line 334, in get_aggregated_tensor
raise ValueError("Aggregator does not have an aggregated tensor"
ValueError: Aggregator does not have an aggregated tensor for TensorKey(tensor_name='dense_3/kernel:0', origin='aggregator_plan.yaml_a379411e', round_number=0, report=False, tags=('model',))
Working environment in docker container
To Reproduce
I think there is an issue here: https://github.com/intel/openfl/blob/1aa2b16509a1a9a97983760a45aa1e5f133e9e30/openfl/native/native.py#L288
since the model of the first collaborator was not initialized the same way as the last collaborator's: https://github.com/intel/openfl/blob/1aa2b16509a1a9a97983760a45aa1e5f133e9e30/openfl/native/native.py#L259
In addition, there is a shallow copy of the plan for each collaborator: https://github.com/intel/openfl/blob/develop/openfl/native/native.py#L206
Could you please help to check?
Thank you!
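The shallow-copy pitfall can be reproduced in isolation with the stdlib copy module (an illustrative dict, not the real Plan object):

```python
import copy

# A shallow copy shares nested dicts, so mutating one "plan"
# silently mutates them all; a deep copy does not.
base_plan = {'aggregator': {'settings': {'rounds_to_train': 5}}}

shallow = copy.copy(base_plan)
shallow['aggregator']['settings']['rounds_to_train'] = 10
print(base_plan['aggregator']['settings']['rounds_to_train'])  # 10: base mutated!

base_plan['aggregator']['settings']['rounds_to_train'] = 5
deep = copy.deepcopy(base_plan)
deep['aggregator']['settings']['rounds_to_train'] = 10
print(base_plan['aggregator']['settings']['rounds_to_train'])  # still 5
```

If per-collaborator plans are built with a shallow copy, overrides applied for one collaborator can leak into the others, which would match the symptom described above.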
Describe the bug
On the Configuring the Federation page there is a "Syntax error in graph mermaid version 8.9.1" error
Screenshots:
Describe the bug
Trying to install docker image gives the following error:
$ sudo docker pull intel/openfl
Using default tag: latest
Error response from daemon: manifest for intel/openfl:latest not found: manifest unknown: manifest unknown
To Reproduce
sudo docker pull intel/openfl
Expected behavior
The intel/openfl image will be successfully installed
Desktop (please complete the following information):
Hey guys,
I really like OpenFL, and after some reading I found it would be quite interesting to have ARFL as an option to substitute FedAvg, since it should increase accuracy a lot in real-world scenarios where the data from each client isn't trustable.
article: https://arxiv.org/pdf/2101.05880.pdf
Hello!
I hope whoever is reading this is safe and doing great. It seems people outside Intel are unable to join the Slack; can anyone send an invite to people outside?
When I have some errors in my code while running a federation, I get only the error message without the full traceback and call stack. For example: [17:46:46] ERROR Collaborator failed: list index out of range. I have no way of knowing where specifically the error is; I only have the message above and a link to https://github.com/intel/openfl/blob/c2796b6c3a425436d38c3b3b7f8867e2ea4f9918/openfl/component/envoy/envoy.py#L59
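A stdlib sketch of what a fuller report could look like (format_failure is a hypothetical helper): logging.Logger.exception, or traceback.format_exception, preserves the call stack that the current message drops.

```python
import traceback

def format_failure(exc):
    """Full traceback text for an exception, suitable for an error log."""
    return ''.join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )

# Inside an except block, logger.exception('Collaborator failed') would
# achieve the same thing via the logging module.
try:
    [][0]  # reproduce 'list index out of range'
except IndexError as err:
    report = format_failure(err)
    print(report)
```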
Is your feature request related to a problem? Please describe.
Currently, there is an ambiguity around the image name "openfl" in Docker Hub, since there is a product called "Open Flash Library".
"docker pull openfl" (as stated in the documentation) will point to openfl/openfl, which is the other product:
https://hub.docker.com/r/openfl/openfl
Finding the correct Docker image, intel/openfl, does not inspire much confidence either, since there is no associated description.
Describe the solution you'd like
docker pull intel/openfl
To avoid manual copying in the current PKI/certificate exchange between the Aggregator and Collaborators, we need an automatic system for it.
Hi, I am following tf_2dunet on https://openfl.readthedocs.io/en/latest/running_the_federation.baremetal.html#creating-workspaces. However, the fx plan initialize command is killed without any error. I have downloaded the BraTS data and added the data path in the data.yaml file.
Hi there,
I am trying to process some 3D medical images (some .nii.gz files) with OpenFL but I am having some trouble doing so. My data loader is as follows (data loader from the 3D_unet model):
def get_dataset(self):
    self.num_train = int(self.numFiles * self.train_test_split)
    numValTest = self.numFiles - self.num_train
    ds = tf.data.Dataset.range(self.numFiles).shuffle(
        self.numFiles, self.random_seed)  # Shuffle the dataset
    ds_train = ds.take(self.num_train).shuffle(
        self.num_train, self.shard)  # Reshuffle based on shard
    ds_val_test = ds.skip(self.num_train)
    self.num_val = int(numValTest * self.validate_test_split)
    self.num_test = numValTest - self.num_val
    ds_val = ds_val_test.take(self.num_val)
    ds_test = ds_val_test.skip(self.num_val)
    ds_train = ds_train.map(lambda x: tf.py_function(self.read_nifti_file,
                                                     [x, True], [tf.float32, tf.float32]),
                            num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds_val = ds_val.map(lambda x: tf.py_function(self.read_nifti_file,
                                                 [x, False], [tf.float32, tf.float32]),
                        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds_test = ds_test.map(lambda x: tf.py_function(self.read_nifti_file,
                                                   [x, False], [tf.float32, tf.float32]),
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds_train = ds_train.repeat()
    ds_train = ds_train.batch(self.batch_size)
    ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)
    batch_size_val = 4
    ds_val = ds_val.batch(batch_size_val)
    ds_val = ds_val.prefetch(tf.data.experimental.AUTOTUNE)
    batch_size_test = 1
    ds_test = ds_test.batch(batch_size_test)
    ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)
    return ds_train, ds_val, ds_test
, which outputs some PrefetchDataset objects ds_train, ds_val and ds_test. However, according to the data loader file, I believe OpenFL expects data loaders to output X_train, y_train, X_valid, y_valid, with some follow-up operations (e.g., get batch) on them. I personally would find it easier if we had an option to use the PrefetchDataset objects directly instead of converting them to X_train, y_train, etc.
So I was wondering if OpenFL could add some way to enable data loaders for .nii.gz files?
Thank you so much for your attention!
Long-Living entities
The idea behind introducing Long-Living entities is that we would explicitly separate the stages of setting up a Federation (which is a set of connected nodes) and running an experiment. This allows users to set up a Federation once, with PKI exchange and correct network settings, and then run multiple experiments within one Federation.
To accomplish this goal we need to implement a few more logical entities:
For a simplified version of the proposed workflow, please refer to a picture:
Is your feature request related to a problem? Please describe.
Currently, the packaging is wheel-only, which is fine for pip but can introduce issues with mismatched dependency versions when packaging openfl together with other packages.
Describe the solution you'd like
Adding an sdist in addition to the wheel would make things much easier.
Describe alternatives you've considered
N.A.
Additional context
N.A.
Hi,
I am trying out "fx collaborator generate-cert-request -n COL.LABEL" from https://openfl.readthedocs.io/en/latest/running_the_federation.certificates.html and I got the below error:
I am using the "keras_cnn_mnist" template.
How do I resolve the issue?
Thanks.
In openfl/openfl-workspace/torch_unet_kvasir/code/fed_unet_runner.py, line 23, it should be
def __init__(self, device='cpu', **kwargs):
Instead of
def __init__(self, device='cuda', **kwargs):
This seems to be a typo, as the description says the default device is 'cpu'.
Without this change, one would encounter a run-time error when CUDA is not found.
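A hedged sketch of a safer default (pick_device is a hypothetical helper, not OpenFL API): fall back to CPU when CUDA is unavailable instead of hard-coding device='cuda'.

```python
# Hypothetical helper: honor a 'cuda' request only when CUDA is
# actually usable, otherwise degrade gracefully to 'cpu'.
def pick_device(requested='cpu'):
    if requested == 'cuda':
        try:
            import torch  # may be absent in some environments
            if torch.cuda.is_available():
                return 'cuda'
        except ImportError:
            pass
        return 'cpu'  # no usable CUDA: avoid the run-time error
    return requested
```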
Set a default path for the step-ca/step CLI binary and certificates (i.e. ~/.local/workspace/) with a standard naming convention ('director.crt', 'envoy_one.crt', 'envoy_two.crt', etc.) so long-living entities can start without always providing paths for root_cert, cert, and private_key (defaults can still be overridden).
Is your feature request related to a problem? Please describe.
When an experiment is set on the director, the director instantly creates and runs an aggregator. It would be better if only one aggregator ran at a time.
Describe the solution you'd like
Create a structure for an experiment.
Create a dict, list, or queue of experiments.
Task Assigner:
It would be great to have not only the Avg operator for tensor aggregation, but also a few other simple operations like Geometric Mean, Median, etc.
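Elementwise versions of these operators are straightforward; a stdlib sketch (tensors modeled as flat lists, not the real TensorDB types):

```python
import statistics

# Apply an aggregation operator elementwise across collaborator tensors.
def aggregate(tensors, op):
    return [op(values) for values in zip(*tensors)]

collab_tensors = [
    [1.0, 2.0, 8.0],  # collaborator one
    [2.0, 4.0, 2.0],  # collaborator two
    [4.0, 6.0, 2.0],  # collaborator three
]
print(aggregate(collab_tensors, statistics.median))  # [2.0, 4.0, 2.0]
print(aggregate(collab_tensors, statistics.geometric_mean))
```

The same pattern covers mean, median, geometric mean, or any other reducer that takes an iterable of per-collaborator values.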
gRPC is currently pinned to version 1.30. TensorFlow 2.4+ requires a later version. The gRPC version was originally pinned because of sporadic network issues, but this is likely fixed with the change to short-lived gRPC client connections.
At the end of a round, the aggregator currently calls a set sequence of functions to compute the aggregation of the collaborator models and task metrics. This aggregation procedure is highly tuned for the specific set of tasks normally called in an experiment (aggregated_model_validation
, train_batches
, and local_model_validation
). Adding new tasks with new TensorKey tags does not always behave as expected with this rigid aggregation procedure.
We should instead provide an interface where users can add their own aggregation tasks. This goes a step beyond the current AggregationFunctionInterface, because it would be applied beyond TensorKeys marked with the ('trained',) tag, and could be made more general. Aggregator Tasks could further be customized to be attached to collaborator tasks, run in sequence, or run one or several at the beginning/end of a round. The default set of Aggregator Tasks would execute at the end of the round. The first would compute the weighted average of metrics and report them; the second would run aggregation on the collaborator models with compression/decompression; and the decision logic for saving the best model could be a third (this would allow easy user customization for saving a model on a metric besides best accuracy).
The exact interface for the aggregator tasks is TBD, but the tasks should be provided access to the TensorDB (read+write), the TensorCodec, and an interface to save models.
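The exact interface is TBD, but as a sketch (all names hypothetical), end-of-round tasks could be plain objects registered in a list, the first of which computes the weighted metric average described above:

```python
# Hypothetical pluggable aggregator tasks executed at end of round.
class AggregatorTask:
    def run(self, tensor_db, round_num):
        raise NotImplementedError

class ReportMetricAverage(AggregatorTask):
    """Weighted average of a reported metric across collaborators."""
    def run(self, tensor_db, round_num):
        entries = tensor_db.get('acc', [])  # list of (value, weight) pairs
        total = sum(weight for _, weight in entries)
        return sum(value * weight for value, weight in entries) / total

end_of_round_tasks = [ReportMetricAverage()]  # users append their own

# Stand-in for the TensorDB: metric value and sample count per collaborator.
tensor_db = {'acc': [(0.5, 100), (0.75, 300)]}
for task in end_of_round_tasks:
    print(task.run(tensor_db, round_num=0))  # 0.6875
```

Model aggregation with compression/decompression and best-model selection would be further tasks in the same list.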
Describe the bug
The collaborator gives the following error: EXCEPTION : [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 451477504 bytes. Error code 12 (Cannot allocate memory)
while running the 'New Interactive Python API (experimental)' notebook (PyTorch using the Kvasir dataset).
After successfully running round 0, the collaborators receive the new weights from the aggregator and one of the collaborators crashes with the above error. We checked our RAM and disk, and memory was sufficient before this exception.
To Reproduce
Steps to reproduce the behaviour:
fx collaborator start -d data.yaml -n one
fx collaborator start -d data.yaml -n two
Expected behavior
The model should run more rounds.
Screenshots
Desktop (please complete the following information):
Is your feature request related to a problem? Please describe.
Currently, OpenFL can only be installed through pip, which prevents it from being added to packages that require C/C++ libraries.
Describe the solution you'd like
A conda recipe would be very useful to mitigate this, which I am happy to work on. 😄
Describe alternatives you've considered
N.A.
Additional context
Needs #44
The Python native API currently executes collaborators sequentially; however, it could be done in parallel, since their execution is independent.
for round_num in range(rounds_to_train):
    for col in plan.authorized_cols:
        collaborator = collaborators[col]
        model.set_data_loader(collaborator_dict[col].data_loader)
        if round_num != 0:
            model.rebuild_model(round_num, model_states[col])
        collaborator.run_simulation()
        model_states[col] = model.get_tensor_dict(with_opt_vars=True)
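Assuming each collaborator owned its own model state (which the loop above does not yet guarantee, since one model object is shared), the fan-out could look like this concurrent.futures sketch (run_round and run_one are hypothetical names):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: run each collaborator's round concurrently and collect results.
def run_round(collaborators, run_one):
    """run_one(col) must be independent per collaborator for this to be safe."""
    with ThreadPoolExecutor(max_workers=len(collaborators)) as pool:
        futures = {col: pool.submit(run_one, col) for col in collaborators}
        return {col: future.result() for col, future in futures.items()}
```

The prerequisite is giving each collaborator a private model copy (or process), so run_simulation calls no longer race on shared state.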
https://github.com/intel/openfl/blob/704dfd5b958fadf6aafd073c882beec5875b7006/openfl/interface/collaborator.py#L358
It takes index 0 of an empty list. I think it should be updated_crts or previous_crts instead of cert_difference.
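A hypothetical reconstruction of the intent with the empty case guarded (the names follow the issue's suggestion; this is not the actual patch):

```python
# Pick the newly signed cert from the difference between two directory
# listings, guarding the empty case instead of indexing blindly.
def newly_signed(previous_crts, updated_crts):
    cert_difference = [c for c in updated_crts if c not in previous_crts]
    if not cert_difference:
        return None  # nothing new was signed; avoid IndexError
    return cert_difference[0]
```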