Running FedAvg with the configuration: 20 rounds, 10 epochs, 2 clients, CIFAR-10 dataset


Using your specified gpus' list by customizing the function "init_training_device"!

weiyikang commented on July 22, 2024

Using your specified gpus' list by customizing the function "init_training_device"!

from fedml.

Comments (8)

chaoyanghe commented on July 22, 2024

I guess it is because you ran your process in the foreground. Please run it in the background instead (with nohup).
I will also double-check your configuration on our own machine and get back to you.


weiyikang commented on July 22, 2024

I see. The WORKERs are sampled from the CLIENTs, so WORKER_NUM <= CLIENT_NUM.


chaoyanghe commented on July 22, 2024

@weiyikang Thank you for sharing your code!


weiyikang commented on July 22, 2024

What is the difference between the parameters CLIENT_NUM and WORKER_NUM?
How can I use specific GPUs, such as GPU #6 and GPU #7? The other GPUs are being used by other people.


chaoyanghe commented on July 22, 2024

CLIENT_NUM describes how many users are involved in training, while WORKER_NUM is the number of parallel processes used during training. If the number of clients is very large (e.g., 1 million users), a common practice for scalability is to uniformly sample WORKER_NUM (e.g., 10) users to train in each round.
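
The per-round sampling described above can be sketched as follows. This is only a minimal illustration, not FedML's actual implementation: the function name `sample_clients` and the choice to seed the generator with the round index are assumptions made here for the example.

```python
import random


def sample_clients(round_idx, client_num, worker_num):
    """Uniformly pick worker_num distinct client indices for one round.

    Hypothetical helper for illustration; FedML's own sampling may differ.
    """
    if worker_num > client_num:
        raise ValueError("worker_num must not exceed client_num")
    # Assumption: seed with the round index so a round is reproducible.
    rng = random.Random(round_idx)
    # random.sample draws without replacement, so no client repeats in a round.
    return sorted(rng.sample(range(client_num), worker_num))


if __name__ == "__main__":
    # 1 million clients, 10 parallel workers, 3 rounds.
    for round_idx in range(3):
        print(round_idx, sample_clients(round_idx, 1_000_000, 10))
```

With this scheme only WORKER_NUM processes are needed regardless of how large CLIENT_NUM is, which is also why WORKER_NUM cannot exceed CLIENT_NUM.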


chaoyanghe commented on July 22, 2024

In the init_training_device function (main_fedavg.py), you can see that GPU_NUM_PER_SERVER is used to assign a GPU device to each worker. By editing this function, you can customize the assignment.


chaoyanghe commented on July 22, 2024

You can always customize this function to meet your own physical configuration. In your case:

import logging

import torch

gpu_num_per_machine = 2  # number of GPUs available to you (here: #6 and #7)

def init_training_device(process_ID, fl_worker_num, gpu_num_per_machine):
    # initialize the mapping from process ID to GPU ID: <process ID, GPU ID>
    # process 0 is handled separately and is placed on cuda:0
    if process_ID == 0:
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        return device
    process_gpu_dict = dict()
    for client_index in range(fl_worker_num):
        # the offset of 6 makes workers alternate between GPU 6 and GPU 7
        gpu_index = client_index % gpu_num_per_machine + 6
        process_gpu_dict[client_index] = gpu_index

    logging.info(process_gpu_dict)
    # process IDs 1..fl_worker_num map to client indices 0..fl_worker_num-1
    gpu_id = process_gpu_dict[process_ID - 1]
    device = torch.device("cuda:" + str(gpu_id) if torch.cuda.is_available() else "cpu")
    logging.info(device)
    return device
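
The same idea can be generalized to an arbitrary list of GPU IDs instead of a hard-coded offset. The following is only a sketch under stated assumptions: `device_name_for_process` and its `gpu_ids` and `cuda_available` parameters are hypothetical names introduced here, not part of FedML. It returns a device string that could be passed to `torch.device`, and unlike the snippet above it places process 0 on the first listed GPU rather than on `cuda:0`.

```python
def device_name_for_process(process_id, fl_worker_num, gpu_ids, cuda_available=True):
    """Map a process ID to a device string, cycling over an explicit GPU list.

    Hypothetical helper for illustration only.
    """
    if not cuda_available or not gpu_ids:
        return "cpu"
    if process_id == 0:
        # Assumption: keep process 0 on a GPU from the allowed list as well.
        return f"cuda:{gpu_ids[0]}"
    if not 1 <= process_id <= fl_worker_num:
        raise ValueError("process_id must be in [0, fl_worker_num]")
    # Process IDs 1..fl_worker_num map to client indices 0..fl_worker_num-1,
    # which are assigned round-robin to the GPUs in gpu_ids.
    client_index = process_id - 1
    return f"cuda:{gpu_ids[client_index % len(gpu_ids)]}"


if __name__ == "__main__":
    for pid in range(5):
        print(pid, device_name_for_process(pid, fl_worker_num=4, gpu_ids=[6, 7]))
```

Another option that needs no code change is to set the `CUDA_VISIBLE_DEVICES` environment variable (for example to `6,7`) before launching, in which case the visible GPUs are renumbered from 0 inside the process.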


weiyikang commented on July 22, 2024

The problem has been solved by customizing the function "init_training_device" :