Comments (11)

hwang595 commented on June 21, 2024

@hkcao it seems that you enabled one attacker with the constant attack. Can you please try to set --worker-fail=0 and rerun the code?

from draco.

hkcao commented on June 21, 2024

Thanks for your kind reply.

It seems that the number of failing workers must be larger than 0.

[image]

However, I ran the code again with 6 workers, and this time the loss dropped suddenly after the 230th iteration (from more than 10000 to less than 10) and then got stuck at 2.3 for several hundred iterations. Is this an effect of SGD? I also tried other numbers of workers and other values of batch_size, and sometimes the loss falls below 2 after some iterations and tends to converge.

hwang595 commented on June 21, 2024

@hkcao I realized that you are using the cyclic coding scheme. We did observe that the cyclic coding scheme can suffer from numerical issues, depending on the setting.

Can you try the repetition coding scheme instead (by setting --mode=maj_vote and --approach=maj_vote)?
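
For readers unfamiliar with the repetition scheme, here is a minimal sketch of majority-vote decoding. It is illustrative only and is not Draco's code: the `majority_vote` helper and the toy gradient values are hypothetical, and it assumes that honest workers in the same group compute on the same batch and therefore report identical gradients.

```python
from collections import Counter


def majority_vote(gradients):
    """Return the gradient reported by a strict majority of workers.

    `gradients` is a list of equal-length tuples of floats, one per worker
    in a single repetition group (hypothetical representation).
    """
    counts = Counter(gradients)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(gradients) // 2:
        raise ValueError("no strict majority; too many faulty workers")
    return winner


# Two honest workers agree; one attacker sends a constant vector.
honest = (0.5, -1.0, 0.25)
attacker = (-100.0, -100.0, -100.0)  # illustrative constant-attack value
decoded = majority_vote([honest, attacker, honest])
print(decoded)
```

Because the decoder only compares and counts values, it involves no floating-point arithmetic, which may be why this scheme is less sensitive to precision than a linear code.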

hkcao commented on June 21, 2024

Thanks for your reply.

I tried the repetition scheme with the same parameters as before, and the loss does decrease.

As I understand it, numerical issues should only arise when the dimension of the encoding matrix grows large. The test I ran has only 6 workers, so that should not be an issue, should it?

hwang595 commented on June 21, 2024

@hkcao you are right about the numerical issue. We also found that when the dimension of the model becomes high, encoding and decoding need more precision. In the current version, we reduced the precision to get better communication efficiency. You may want to switch back to this commit for better precision (please also see what I changed there): 582616d.
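
To make the precision point concrete, here is a small sketch of how rounding encoded values to a narrower float before decoding increases the reconstruction error. It is not Draco's encoding: a plain discrete Fourier transform stands in for the linear code, and the helper names (`round_to`, `dft`, `round_trip_error`) and the toy gradient are assumptions made for illustration.

```python
import cmath
import struct


def round_to(fmt, x):
    """Round a Python float to a narrower IEEE format.

    fmt is a struct format character: 'e' (16-bit) or 'f' (32-bit).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def dft(values, sign):
    """Unnormalized discrete Fourier transform (sign=-1) or its inverse (+1)."""
    n = len(values)
    return [
        sum(v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
            for j, v in enumerate(values))
        for k in range(n)
    ]


def round_trip_error(gradient, fmt=None):
    """Encode, optionally round the message to `fmt`, decode, return max error."""
    encoded = dft(gradient, -1)
    if fmt is not None:
        # Simulate sending the encoded message at reduced precision.
        encoded = [complex(round_to(fmt, z.real), round_to(fmt, z.imag))
                   for z in encoded]
    n = len(gradient)
    decoded = [z / n for z in dft(encoded, +1)]
    return max(abs(d - g) for d, g in zip(decoded, gradient))


gradient = [0.1234567, -2.3456789, 3.1415926, 0.0009876, -0.5, 1.75]
err_full = round_trip_error(gradient)       # 64-bit floats throughout
err_f32 = round_trip_error(gradient, "f")   # message rounded to 32 bits
err_f16 = round_trip_error(gradient, "e")   # message rounded to 16 bits
print(err_full, err_f32, err_f16)
```

The transform used here is well conditioned, so the errors stay small in absolute terms; a less well-conditioned decoding step would amplify the same rounding error further.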

hkcao commented on June 21, 2024

Thanks for your reply. It works! :-)

After reverting the changes you mention in that commit, the loss does decrease, and it seems to converge faster than with the previous version. I also tried larger settings such as 40 workers, and they work now too. The precision you mentioned does seem to be important for the coding scheme.

hwang595 commented on June 21, 2024

Glad to help, @hkcao! We can probably close this issue for now and reopen it if needed.

hkcao commented on June 21, 2024

Thanks again for the help!

hkcao commented on June 21, 2024

Sorry to bother you again. I ran into another problem that confuses me.

I saw that there is a checkpoint on rank 1 that tests the accuracy on the test set. During training, the accuracy (prec1) is around 80, but when I use distributed_evaluator.py to check the accuracy, the result is much lower. For instance, after running just 200 iterations, the test result on rank 1 is as follows:

[image]
[image]

while the test result (prec@1) from distributed_evaluator.py is just 10 both times:

[image]

Do you have any idea why?

hwang595 commented on June 21, 2024

Hi @hkcao, can you double-check whether the model checkpoints were saved by the PS or by the workers? It seems the master also tries to save checkpoints: https://github.com/hwang595/Draco/blob/master/src/master/rep_master.py#L136-L137.

We should not use the checkpoints from the master for evaluation, as they do not contain any running statistics in the BN layers.
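
As a rough illustration of why missing BN running statistics hurt evaluation, the sketch below applies evaluation-mode batch normalization to the same activations twice: once with statistics that match the data, and once with the initial buffer values (mean 0, variance 1, which is how PyTorch initializes them). The `batchnorm_eval` helper and the synthetic activations are hypothetical, and it is an assumption that the master's buffers stay at their initial values.

```python
import math
import random


def batchnorm_eval(x, running_mean, running_var, eps=1e-5):
    """Evaluation-mode batch norm for one channel (no affine parameters)."""
    return [(v - running_mean) / math.sqrt(running_var + eps) for v in x]


def avg(xs):
    return sum(xs) / len(xs)


random.seed(0)
# Synthetic activations for one channel (mean 5, standard deviation 2).
activations = [random.gauss(5.0, 2.0) for _ in range(1000)]

# Statistics a worker would have accumulated during training
# (computed directly here for simplicity).
mean = avg(activations)
var = avg([(v - mean) ** 2 for v in activations])

tracked = batchnorm_eval(activations, mean, var)
untracked = batchnorm_eval(activations, 0.0, 1.0)  # initial buffer values

print(avg(tracked), avg(untracked))
```

With tracked statistics the output is centered near zero; with the initial buffers it keeps the raw mean of about 5, so later layers see inputs far from what they were trained on, which is consistent with accuracy falling to chance level.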

hkcao commented on June 21, 2024

Thanks for your reply.

I was using the checkpoints saved by the master before. When I use the models saved by worker 1 instead, the loss and accuracy reported by distributed_evaluator.py match the values seen during training.

Thanks again for the help!
