Comments (17)
@iuserea I just ran our code using 4 and 8 compute nodes, and it works well. The issue you mentioned happens when you have only 1 compute node but do not change the compute topology. To address it, you can modify the compute topology in the init_training_device() function, where you can force all workers/clients to run on the same GPU device ID. Since many clients/workers then share the same GPU, you may also need to reduce the batch size to fit within the memory constraints, which may degrade the accuracy a little and slow down training. You should also change client_number/worker_number in run_xxx.sh.
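A minimal sketch of the idea, assuming a mapping from MPI process rank to a device string. The helper name map_worker_to_device and its parameters are hypothetical and are not the actual body of init_training_device():

```python
def map_worker_to_device(process_id: int, gpu_count: int,
                         force_single_gpu: bool = False) -> str:
    """Return a device string for one process (hypothetical helper)."""
    if gpu_count <= 0:
        # No GPU available: fall back to the CPU.
        return "cpu"
    if force_single_gpu:
        # Single-node workaround: every worker/client shares GPU 0.
        return "cuda:0"
    # Otherwise spread processes over the available GPUs round-robin.
    return f"cuda:{process_id % gpu_count}"


if __name__ == "__main__":
    for rank in range(3):
        print(rank, map_worker_to_device(rank, gpu_count=2, force_single_gpu=True))
```

With force_single_gpu=True every rank gets the same device, which is why the batch size may need to shrink.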
from fedml.
@chaoyanghe Could the FedGKT algorithm run with only 2 clients, or even one client? I tried but failed.
from fedml.
CMD for 10 clients:
sh run_FedGKT.sh 8 cifar10 homo 10 10 1 Adam 0.001 1 0 resnet56 fedml_resnet56_homo_cifar10 "./../../../data/cifar10" 64 10
from fedml.
The success flag I found is b_all_received = True.
When the process failed, either b_all_received = False, or the 'b_all_received' variable did not appear at all after all the clients had finished training.
However, that is only the surface of the real problem.
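To illustrate how such a flag can stay False, here is a minimal sketch, assuming the server waits for one upload from every expected client before it proceeds. The class UploadTracker and the function run_round are hypothetical names for illustration, not FedML's code:

```python
class UploadTracker:
    """Tracks which clients have reported in the current round (hypothetical)."""

    def __init__(self, expected_clients: int) -> None:
        self.received = {index: False for index in range(expected_clients)}

    def add_result(self, index: int) -> None:
        self.received[index] = True

    def check_all_received(self) -> bool:
        if not all(self.received.values()):
            return False
        # Everyone reported: reset the flags for the next round.
        for index in self.received:
            self.received[index] = False
        return True


def run_round(expected_clients: int, reporting_clients: int) -> bool:
    """Simulate one round and return the final value of the flag."""
    tracker = UploadTracker(expected_clients)
    b_all_received = False
    for index in range(reporting_clients):
        tracker.add_result(index)
        b_all_received = tracker.check_all_received()
    return b_all_received


if __name__ == "__main__":
    print(run_round(expected_clients=8, reporting_clients=2))  # False
    print(run_round(expected_clients=2, reporting_clients=2))  # True
```

Under this assumption, a server configured for 8 clients never sees the flag turn True when only 2 clients actually train.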
from fedml.
When I set the number of clients/workers to 2, the FedGKT algorithm still creates 8 processes, which may cause it to fail.
from fedml.
Change your sh script: -n is still 9
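A rough sketch of the arithmetic behind the -n value, assuming the common layout of one server process plus one process per client (this layout is an assumption; the thread only says that -n stayed at 9). The entry script name main.py is a placeholder:

```python
from typing import List


def mpi_process_count(client_num: int) -> int:
    """One server process plus one process per client (assumed layout)."""
    if client_num < 1:
        raise ValueError("client_num must be at least 1")
    return client_num + 1


def build_mpirun_args(client_num: int, hostfile: str = "mpi_host_file",
                      entry_script: str = "main.py") -> List[str]:
    """Build an mpirun argument list; entry_script is a placeholder name."""
    return ["mpirun", "-n", str(mpi_process_count(client_num)),
            "-hostfile", hostfile, "python3", entry_script]


if __name__ == "__main__":
    print(" ".join(build_mpirun_args(8)))
    print(" ".join(build_mpirun_args(2)))
```

Under that assumption, 8 clients give -n 9 and 2 clients give -n 3, so a script that keeps -n at 9 no longer matches a 2-client run.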
from fedml.
@chaoyanghe
The -n option is not essential for training with two clients.
The question is why, when training the two clients, the messages below did not appear:
handle_message_receive_feature_and_logits_from_client
add_model. index = 7
from fedml.
Hi @iuserea, we now support GPU mapping. Please have a look at this:
from fedml.
Hi @iuserea, could you please share your software and hardware configuration?
from fedml.
I only have two GPUs in one server. How can I train FedGKT?
from fedml.
@iuserea Hi, how do you set your mpi_host_file?
from fedml.
@rambo-coder @iuserea @chaoyanghe Did you figure out how to run FedGKT on a single machine with multiple GPUs? I followed the thread but was not able to make it finish successfully (b_all_received=False).
@chaoyanghe Most researchers have a single machine with multiple GPUs. It would be nice to have a guide for this, especially if the library is designed specifically for researchers.
from fedml.
Hello @korawat-tanwisuth, did you run FedGKT on a single machine with multiple GPUs?
from fedml.
Related Issues (20)
- fed_cifar10 sample does not download the dataset correctly
- KeyError. msg_type = 5. Please check whether you launch the server or client with the correct args.rank HOT 1
- Where can I find FedGraphNN? HOT 2
- On the problem of gradient processing in FedML HOT 1
- When running fedml.run_simulation(), TypeError: bind_simulation_device() takes 2 positional arguments but 3 were given HOT 4
- where is FedGraphNN HOT 3
- FedOpt for cross-silo HOT 2
- trained model path in single process simulation examples
- The compatibility issues of Nvidia Jetson
- Quickstart Guide
- log_file_dir arg not work
- Rookie question HOT 1
- from fedml.core.distributed.server.server_manager import ServerManager from fedml.core.distributed.client.client_manager import ClientManager from fedml.core.distributed.communication.comm_manager import CommManager shows
- Which communication protocol and serialization method is supported?
- typo "salve" instead of "slave" in identifiers
- possible bug in python/fedml/core/distributed/communication/trpc/utils.py
- FedGraphnn -- wandb utilization HOT 2
- [FedML-HE] How is the merging of decrypted weights done? HOT 1
- In Fed-ML HE example, the client model weights are not encrypted.
- bind_simulation_device() takes 2 positional arguments but 3 were given