Comments (8)
I guess it is because you ran your process in the foreground. Please run it in the background instead (e.g., with nohup).
Let me also double-check your configuration on our own machine and get back to you later.
from fedml.
I see. The WORKERs are sampled from the CLIENTs, so WORKER_NUM <= CLIENT_NUM.
@weiyikang Thank you for sharing your code!
- What is the difference between the parameters CLIENT_NUM and WORKER_NUM?
- How can I use specific GPUs, such as GPUs #6 and #7? The other GPUs are being used by other people.
- CLIENT_NUM is the number of users involved in training, while WORKER_NUM is the number of parallel processes used during training. If the client number is very large (e.g., 1 million users), a common practice for scalability is to use uniform sampling to select WORKER_NUM (e.g., 10) users to train in each round.
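The per-round sampling described above can be sketched roughly as follows. This is a minimal illustration, not FedML's actual implementation: the function name `sample_clients` and the choice to seed the generator with the round index are assumptions made for the example.

```python
import random


def sample_clients(round_idx, client_num, worker_num):
    """Pick worker_num distinct client indices uniformly at random for one round.

    Hypothetical helper for illustration; not taken from the FedML code base.
    """
    if worker_num > client_num:
        raise ValueError("WORKER_NUM must not exceed CLIENT_NUM")
    if worker_num == client_num:
        # Every client trains in every round, so no sampling is needed.
        return list(range(client_num))
    # Assumption: seeding with the round index makes each round reproducible.
    rng = random.Random(round_idx)
    return rng.sample(range(client_num), worker_num)


# Example: 1,000,000 clients, 10 parallel workers, round 0.
selected = sample_clients(round_idx=0, client_num=1_000_000, worker_num=10)
```

Each worker process would then train on the data of one of the selected clients for that round.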
- In init_training_device (main_fedavg.py), you can see that GPU_NUM_PER_SERVER is used to assign a GPU device to each worker. You can customize this function to change the assignment.
- You can always customize this function to meet your own physical configuration. In your case:
import logging

import torch

gpu_num_per_machine = 2

def init_training_device(process_ID, fl_worker_num, gpu_num_per_machine):
    # Process 0 is handled separately and placed on cuda:0.
    if process_ID == 0:
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        return device
    # Initialize the mapping from process ID to GPU ID: <process ID, GPU ID>.
    process_gpu_dict = dict()
    for client_index in range(fl_worker_num):
        # The offset of 6 places the workers on GPUs #6 and #7.
        gpu_index = client_index % gpu_num_per_machine + 6
        process_gpu_dict[client_index] = gpu_index
    logging.info(process_gpu_dict)
    # Worker process IDs start at 1, so process_ID - 1 is the dictionary key.
    device = torch.device("cuda:" + str(process_gpu_dict[process_ID - 1]) if torch.cuda.is_available() else "cpu")
    logging.info(device)
    return device
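If the free GPUs are not contiguous, the fixed offset can be replaced by an explicit list of GPU IDs. The sketch below shows only the mapping step in plain Python, so it runs without PyTorch; the helper name `build_process_gpu_map` and its `gpu_ids` parameter are hypothetical and not part of FedML.

```python
def build_process_gpu_map(fl_worker_num, gpu_ids):
    """Map worker process IDs 1..fl_worker_num onto gpu_ids in round-robin order.

    Hypothetical helper for illustration; process 0 is not included because
    the function above handles it separately.
    """
    if not gpu_ids:
        raise ValueError("gpu_ids must not be empty")
    return {
        process_id: gpu_ids[(process_id - 1) % len(gpu_ids)]
        for process_id in range(1, fl_worker_num + 1)
    }


# Four workers sharing GPUs #6 and #7.
mapping = build_process_gpu_map(fl_worker_num=4, gpu_ids=[6, 7])
```

The resulting GPU ID for a process could then be passed to `torch.device("cuda:" + str(gpu_id))`, as in the function above.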
The problem has been solved by customizing the function "init_training_device".
The result is as follows: