
Comments (19)

albertfgu avatar albertfgu commented on June 15, 2024

The model changed a bit since the initial release. We will release a branch or tag marking the initial release where those configs should reproduce the original results (you should also be able to find it in the commit history). We have also been working on re-creating those results with the newest version of the model, which should involve only minor changes to the configs.

from s4.

violet-zct avatar violet-zct commented on June 15, 2024

Thanks! Would you mind pointing me to the commit that I can use to reproduce the results?


albertfgu avatar albertfgu commented on June 15, 2024

Actually it looks like we already tagged it here: https://github.com/HazyResearch/state-spaces/releases/tag/v1


violet-zct avatar violet-zct commented on June 15, 2024

Great, thanks so much!


violet-zct avatar violet-zct commented on June 15, 2024

Hi, I used this commit and ran on two datasets, pathfinder-32 and cifar, with two random seeds each, including your default seed 1112.
For pathfinder, the best validation accuracies are 77.59 and 78.14.
For cifar, the best validation accuracies are 79.72 and 79.8.

For both datasets, the validation accuracies are much worse than the test results reported in the paper. I ran these experiments on an A40. Do you know why this happens, or is there a different commit of the code that can be used for reproduction?
Thanks!


albertfgu avatar albertfgu commented on June 15, 2024

I'm not sure why this is happening. Many other people have been able to reproduce the experiments. What versions of pytorch and pytorch-lightning are you running? Which Cauchy kernel do you have installed? Can you paste the command lines you're using?


violet-zct avatar violet-zct commented on June 15, 2024

Here are my specifications:
pytorch 1.11.0
pytorch-lightning 1.6.1
Cauchy kernel: cauchy_conj_slow(v, z, w) in state-spaces/src/models/functional/cauchy.py, following the issue here: #9 (comment).

The command line I used is exactly the same as in your README:

python -m train wandb=null experiment=s4-lra-cifar
python -m train wandb=null experiment=s4-lra-pathfinder

Thanks!


albertfgu avatar albertfgu commented on June 15, 2024

Did you have any issue installing either of the two faster Cauchy kernels? It is conceivable that they might have subtle numerical differences. We tested using the custom CUDA kernel.
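For reference, all three kernels evaluate the same Cauchy dot product k(z) = sum_n v_n / (z - w_n); the slow fallback just materializes the full matrix. A rough sketch of that computation (illustrative only, not the repo's exact cauchy_conj_slow, which additionally exploits conjugate symmetry):

```python
import torch

def cauchy_slow(v, z, w):
    # v, w: (N,) complex residues and poles; z: (L,) complex evaluation points.
    # Returns k(z) = sum_n v_n / (z - w_n) for each z, materializing the
    # full (L, N) Cauchy matrix -- simple but memory-hungry, which is why
    # the repo also ships fused CUDA and pykeops kernels.
    cauchy_matrix = v.unsqueeze(-2) / (z.unsqueeze(-1) - w.unsqueeze(-2))  # (L, N)
    return cauchy_matrix.sum(dim=-1)  # (L,)
```

Differences between this dense evaluation and the fused kernels (summation order, precision) are the kind of subtle numerical discrepancy meant above, though they would not normally account for gaps this large.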

I just checked out that commit and ran the CIFAR command and am getting to 80% val in 15 epochs and currently 82% at 30 epochs. So I think it's working properly.

It is also possible (although less likely) that pytorch-lightning changed something. If possible, I would suggest:

pip install pytorch-lightning==1.5.10
git checkout main
cd extensions/cauchy
python setup.py install
cd ../..
git checkout v1

and try running the command from there. The pykeops kernel installed with pip install pykeops==1.5 should also work.


violet-zct avatar violet-zct commented on June 15, 2024

Thanks so much for the response and instructions! I will try what you suggested.


albertfgu avatar albertfgu commented on June 15, 2024

The job I launched ended up getting to around 86% val accuracy. Let me know if you figure out the issue; if it ends up being a problem with cauchy_conj_slow or a package version I'll update the README.


violet-zct avatar violet-zct commented on June 15, 2024

Thanks! I was handling something else yesterday and will get back to you asap.


violet-zct avatar violet-zct commented on June 15, 2024

Hi Albert, sorry for the delay. I just created a new environment with pytorch 1.11.0 and pytorch_lightning 1.5.10 installed, and I also successfully compiled the custom CUDA Cauchy kernel. I ran experiments on CIFAR on both an A40 and an A100; however, I still could not reproduce the results and got something similar to my previous run:

(screenshot of validation accuracy curves omitted)

I have no clue what the reason could be, since I didn't modify anything in your code.
Thanks!


violet-zct avatar violet-zct commented on June 15, 2024

Hi Albert, to confirm: both my friend and I cannot reproduce the results with v1 independently, but I can reproduce your v2 results.


albertfgu avatar albertfgu commented on June 15, 2024

Thanks for reporting back! I'll leave this issue open for longer because some other people are still trying to reproduce V1.

Just to check more variables: Is your friend using the same computing resources (e.g. same cluster or machine types) as you?

I definitely checked these results on an A100 before the V1 release, and as I reported above a fresh version of the repo still gets to high 80's on CIFAR for me on a P100, so I am really confused as well.


violet-zct avatar violet-zct commented on June 15, 2024

We are using the same cluster but different machine types.


albertfgu avatar albertfgu commented on June 15, 2024

Hi,

Could you downgrade to PyTorch 1.10 and try again when you have time? We just discovered a bug in PyTorch 1.11 (pytorch/pytorch#77081) with Dropout2d which causes a noticeable difference on small sCIFAR models and will probably also affect larger models.
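For anyone hitting this on newer torch: the problem is that nn.Dropout2d applied to 3D (batch, channels, length) activations no longer reliably zeroes whole channels. A version-agnostic workaround is to build the channel mask explicitly; a sketch (my own stand-in, not the code in this repo):

```python
import torch
import torch.nn as nn

class ChannelDropout(nn.Module):
    """Zeroes entire channels of a (batch, channels, length) tensor.

    A stand-in for the nn.Dropout2d trick used on (B, H, L) activations;
    Dropout2d's handling of 3D inputs changed in PyTorch 1.11
    (pytorch/pytorch#77081), so building the mask explicitly avoids
    depending on that behavior.
    """
    def __init__(self, p=0.25):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # One keep/drop decision per (batch, channel), broadcast over length,
        # with the usual inverted-dropout rescaling of the kept channels.
        mask = torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.p
        return x * mask / (1.0 - self.p)
```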


violet-zct avatar violet-zct commented on June 15, 2024

Thanks! Did you use PyTorch 1.10 for your version 1? I can downgrade and see if it reproduces.


albertfgu avatar albertfgu commented on June 15, 2024

Yeah, we were on torch 1.10 for a long time. The run I did above was also on 1.10.


albertfgu avatar albertfgu commented on June 15, 2024

Closing this issue as the original problems were confirmed to be a PyTorch bug and have since been resolved.

