
Comments (19)

albertfgu avatar albertfgu commented on June 15, 2024

The model changed a bit since the initial release. We will release a branch or tag marking the initial release where those configs should reproduce the original results (you should also be able to find it in the commit history). We have also been working on re-creating those results with the newest version of the model, which should involve only minor changes to the configs.

from s4.

violet-zct avatar violet-zct commented on June 15, 2024

Thanks! Would you mind pointing me to the commit that I can use to reproduce the results?


albertfgu avatar albertfgu commented on June 15, 2024

Actually it looks like we already tagged it here: https://github.com/HazyResearch/state-spaces/releases/tag/v1


violet-zct avatar violet-zct commented on June 15, 2024

Great, thanks so much!


violet-zct avatar violet-zct commented on June 15, 2024

Hi, I used this commit and ran on two datasets, pathfinder-32 and cifar, with two random seeds each, including your default seed 1112.
For pathfinder, the best validation accuracies are 77.59 and 78.14.
For cifar, the best validation accuracies are 79.72 and 79.8.

For both datasets, the validation accuracies are much worse than the test results reported in the paper. I ran these experiments on an A40. Do you know why this happens, or is there a different commit of the code that can be used for reproduction?
Thanks!


albertfgu avatar albertfgu commented on June 15, 2024

I'm not sure why this is happening. Many other people have been able to reproduce the experiments. What versions of pytorch and pytorch-lightning are you running? Which Cauchy kernel do you have installed? Can you paste the command lines you're using?


violet-zct avatar violet-zct commented on June 15, 2024

Here are my specifications:
pytorch 1.11.0
pytorch-lightning 1.6.1
Cauchy kernel: cauchy_conj_slow(v, z, w) in state-spaces/src/models/functional/cauchy.py, following the issue here: #9 (comment).

The command line I used is exactly the same as in your README:

python -m train wandb=null experiment=s4-lra-cifar
python -m train wandb=null experiment=s4-lra-pathfinder

Thanks!


albertfgu avatar albertfgu commented on June 15, 2024

Did you have any issue installing either of the two faster Cauchy kernels? It is conceivable that they might have subtle numerical differences. We tested using the custom CUDA kernel.
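For reference, all three kernels evaluate the same Cauchy dot product k(z) = sum_n v_n / (z - w_n); the slow fallback just materializes the full matrix. A rough sketch of that computation (illustrative only, not the repo's exact cauchy_conj_slow, which additionally exploits conjugate symmetry):

```python
import torch

def cauchy_slow(v, z, w):
    # v, w: (N,) complex residues and poles; z: (L,) complex evaluation points.
    # Returns k(z) = sum_n v_n / (z - w_n) for each z, materializing the
    # full (L, N) Cauchy matrix -- simple but memory-hungry, which is why
    # the repo also ships fused CUDA and pykeops kernels.
    cauchy_matrix = v.unsqueeze(-2) / (z.unsqueeze(-1) - w.unsqueeze(-2))  # (L, N)
    return cauchy_matrix.sum(dim=-1)  # (L,)
```

Differences between this dense evaluation and the fused kernels (summation order, precision) are the kind of subtle numerical discrepancy meant above, though they would not normally account for gaps this large.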

I just checked out that commit and ran the CIFAR command and am getting to 80% val in 15 epochs and currently 82% at 30 epochs. So I think it's working properly.

It is also possible (although less likely) that pytorch-lightning changed something. If possible, I would suggest:

pip install pytorch-lightning==1.5.10
git checkout main
cd extensions/cauchy
python setup.py install
cd ../..
git checkout v1

and try running the command from there. The pykeops kernel installed with pip install pykeops==1.5 should also work.


violet-zct avatar violet-zct commented on June 15, 2024

Thanks so much for the response and instructions! I will try what you suggested.


albertfgu avatar albertfgu commented on June 15, 2024

The job I launched ended up getting to around 86% val accuracy. Let me know if you figure out the issue; if it ends up being a problem with cauchy_conj_slow or a package version I'll update the README.


violet-zct avatar violet-zct commented on June 15, 2024

Thanks! I was handling something else yesterday and will get back to you asap.


violet-zct avatar violet-zct commented on June 15, 2024

Hi Albert, sorry for the delay. I just created a new environment with pytorch 1.11.0 and pytorch_lightning 1.5.10 installed, and I also successfully compiled the custom CUDA Cauchy kernel. I ran experiments on CIFAR on both an A40 and an A100; however, I still could not reproduce the results and got something similar to my previous run:

(screenshot of validation accuracy curves omitted)

I have no clue what the reason could be, since I didn't modify anything in your code.
Thanks!


violet-zct avatar violet-zct commented on June 15, 2024

Hi Albert, to confirm: both my friend and I cannot reproduce the results with v1 independently, but I can reproduce your v2 results.


albertfgu avatar albertfgu commented on June 15, 2024

Thanks for reporting back! I'll leave this issue open for longer because some other people are still trying to reproduce V1.

Just to check more variables: Is your friend using the same computing resources (e.g. same cluster or machine types) as you?

I definitely checked these results on an A100 before the V1 release, and as I reported above a fresh version of the repo still gets to high 80's on CIFAR for me on a P100, so I am really confused as well.


violet-zct avatar violet-zct commented on June 15, 2024

We are using the same cluster but different machine types.


albertfgu avatar albertfgu commented on June 15, 2024

Hi,

Could you downgrade to PyTorch 1.10 and try again when you have time? We just discovered a bug in PyTorch 1.11 (pytorch/pytorch#77081) with Dropout2d which causes a noticeable difference on small sCIFAR models and will probably also affect larger models.
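For anyone hitting this on newer torch: the problem is that nn.Dropout2d applied to 3D (batch, channels, length) activations no longer reliably zeroes whole channels. A version-agnostic workaround is to build the channel mask explicitly; a sketch (my own stand-in, not the code in this repo):

```python
import torch
import torch.nn as nn

class ChannelDropout(nn.Module):
    """Zeroes entire channels of a (batch, channels, length) tensor.

    A stand-in for the nn.Dropout2d trick used on (B, H, L) activations;
    Dropout2d's handling of 3D inputs changed in PyTorch 1.11
    (pytorch/pytorch#77081), so building the mask explicitly avoids
    depending on that behavior.
    """
    def __init__(self, p=0.25):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # One keep/drop decision per (batch, channel), broadcast over length,
        # with the usual inverted-dropout rescaling of the kept channels.
        mask = torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.p
        return x * mask / (1.0 - self.p)
```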


violet-zct avatar violet-zct commented on June 15, 2024

Thanks! Did you use PyTorch 1.10 for your version 1? I can downgrade and see if it reproduces.


albertfgu avatar albertfgu commented on June 15, 2024

Yeah, we were on torch 1.10 for a long time. The run I did above was also on 1.10.


albertfgu avatar albertfgu commented on June 15, 2024

Closing this issue as the original problems were confirmed to be a PyTorch bug and have since been resolved.

