state-spaces / s4 Goto Github PK
View Code? Open in Web Editor NEWStructured state space sequence models
License: Apache License 2.0
Structured state space sequence models
License: Apache License 2.0
Hi,
Wu et al. recently published a paper on Memorizing Transformers (transformers with states/memory), which extends their perceptive field to unbounded contexts (https://www.youtube.com/watch?v=5AoOpFFjW28&list=PL0NRmB0fnLJQJ3fuIk3yVULtm6_JnQ_zI, https://arxiv.org/abs/2203.08913). I am curious to hear what you think about how S4/Sashimi might compare with this new transformer model. My hunch is that S4 might be theoretically similar if you use the exponential measure density.
Hi, I am trying to train the Sashimi model on .mat files. How should I go about doing this?
Hi, was unet used for the LRA tasks?
Dear all,
Sorry for a silly question. I'm having trouble install cauchy kernel by custom cuda kernel.
running python setup.py install
like this:
~/LongSeq/state-spaces/extensions/cauchy$ python setup.py install
running install
~/miniconda3/envs/lightning/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
~/miniconda3/envs/lightning/lib/python3.9/site-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running bdist_egg
running egg_info
creating cauchy_mult.egg-info
writing cauchy_mult.egg-info/PKG-INFO
writing dependency_links to cauchy_mult.egg-info/dependency_links.txt
writing top-level names to cauchy_mult.egg-info/top_level.txt
writing manifest file 'cauchy_mult.egg-info/SOURCES.txt'
reading manifest file 'cauchy_mult.egg-info/SOURCES.txt'
writing manifest file 'cauchy_mult.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
warning: install_lib: 'build/lib' does not exist -- no Python modules to install
creating build
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying cauchy_mult.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying cauchy_mult.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying cauchy_mult.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying cauchy_mult.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating dist
creating 'dist/cauchy_mult-0.0.0-py3.9.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing cauchy_mult-0.0.0-py3.9.egg
Copying cauchy_mult-0.0.0-py3.9.egg to ~/miniconda3/envs/lightning/lib/python3.9/site-packages
Adding cauchy-mult 0.0.0 to easy-install.pth file
Installed ~/miniconda3/envs/lightning/lib/python3.9/site-packages/cauchy_mult-0.0.0-py3.9.egg
Processing dependencies for cauchy-mult==0.0.0
Finished processing dependencies for cauchy-mult==0.0.0
but cannot import the cauchy_mult
when I running test_cauchy.py
~/LongSeq/state-spaces/extensions/cauchy$ python test_cauchy.py
Traceback (most recent call last):
File "~/LongSeq/state-spaces/extensions/cauchy/test_cauchy.py", line 8, in <module>
from cauchy import cauchy_mult_torch, cauchy_mult_keops, cauchy_mult
File "~/LongSeq/state-spaces/extensions/cauchy/cauchy.py", line 5, in <module>
from cauchy_mult import cauchy_mult_fwd, cauchy_mult_bwd, cauchy_mult_sym_fwd, cauchy_mult_sym_bwd
ModuleNotFoundError: No module named 'cauchy_mult'
I also try to run python -m pip install
and change python to 3.8, but they don't work too.
Hi,
Are you planning to make the pretrained WikiText-103 model available? Thank you very much!
Hey folks,
Congrats on the amazing and inspiring work!!
I have a quick question - how do I resume training + wandb logging if the training got terminated before completion. E.g. say the original command was CUDA_VISIBLE_DEVICES=0 python -m train experiment=s4-lra-pathx loader.batch_size=16 trainer.accumulate_grad_batches=2
and the training got suspended halfway through the full training. What command should I run to resume the training and the wandb logging?
Thanks in advance,
Ankit
I suggest adding topics such as state-space
, sequence-modeling
, time-series-model
, time-series
in the About section at https://github.com/HazyResearch/state-spaces.
Excellent idea and great paper!
Could you please provide a concrete example on how to both train and eval using the stateful RNN version of S4/S4D? I only find an evaluation example in the SaShiMi code but I have not found an example for training.
Thank you!
Hello,
Thanks for sharing this awesome work, when i try to run example.py i get the following error :
CUDA extension for cauchy multiplication not found. Install by going to extensions/cauchy/ and running
python setup.py install. This should speed up end-to-end training by 10-50% Falling back on slow Cauchy kernel. Install at least one of pykeops or the CUDA extension for efficiency. cuda ==> Preparing cifar10 data.. Files already downloaded and verified Files already downloaded and verified Files already downloaded and verified ==> Building model.. Optimizer group 0 | 28 tensors 0it [00:11, ?it/s] | 0/200 [00:00<?, ?it/s] Epoch: 0: 0%| | 0/200 [00:12<?, ?it/s] Traceback (most recent call last): File "example.py", line 373, in <module> train() File "example.py", line 312, in train outputs = model(inputs) File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl return forward_call(*input, **kwargs) File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward outputs = self.parallel_apply(replicas, inputs, kwargs) File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)]) File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply output.reraise() File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise raise self.exc_type(msg) RuntimeError: Caught RuntimeError in replica 0 on device 0. Original Traceback (most recent call last): File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker output = module(*input, **kwargs) File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl return forward_call(*input, **kwargs) File "example.py", line 196, in forward z, _ = layer(z) File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl return forward_call(*input, **kwargs) File "/media/data1/ameen/state-spaces/src/models/sequence/ss/standalone/s4.py", line 842, in forward k = self.kernel(L=L) # (C H L) (B C H L) File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl return forward_call(*input, **kwargs) File "/media/data1/ameen/state-spaces/src/models/sequence/ss/standalone/s4.py", line 752, in forward k = self.kernel(L=L) File "/media/data2/ameenali/anaconda3/envs/ameen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl return forward_call(*input, **kwargs) File "/media/data1/ameen/state-spaces/src/models/sequence/ss/standalone/s4.py", line 427, in forward C = _r2c(self.C) RuntimeError: Output 3 of BroadcastBackward is a view and its base or another view of its base has been modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
Any idea how to solve this?
Dear all,
I am having trouble compiling the Cauchy kernel, and although I have installed pykeops, running train.py always results in errors like this:
_RuntimeError: [KeOps] This KeOps shared object has been compiled without cuda support:
The only thing that fixed the issues for me is commenting out the following try/catch. Without that (sorry for its uglyness...) the code never did default back to the slow kernel... now it does, but that is certainly not the right way for me to go about it ;)
I wonder if the try/catch-phrase needs to check whether the kernel actually runs, not just lets itself be imported?
''' try:
import pykeops
from src.models.functional.cauchy import cauchy_conj
has_pykeops = True
except ImportError:
has_pykeops = False
from src.models.functional.cauchy import cauchy_conj_slow
if not has_cauchy_extension:
log.error(
"Falling back on slow Cauchy kernel. Install at least one of pykeops or the CUDA extension for efficiency."
)
'''
has_pykeops = False
from src.models.functional.cauchy import cauchy_conj_slow
if not has_cauchy_extension:
log.error(
"Falling back on slow Cauchy kernel. Install at least one of pykeops or the CUDA extension for efficiency."
)
Hi,
Thank you for your sharing.
I am doing some experiments on Ett dataset using S4.
I wonder how to load the best checkpoint, test it with new data, and then visualize the prediction and truth.
Is there any easy way by pytorch lighting?
Cheers,
Max
I'd like to train a sashimi model on my own data. Could you please let me know in what structure the data needs to be in order to be compatible with the dataloder?
Hello,
I'm considering trying your Sashimi model as the backbone of a diffusion model for audio generation. There is a detail I couldn't find in the paper, neither in the code (maybe I didn't look enough). How do you condition the model to the diffusion timestep? Do you do the same as in diffwave? (Element-wise addition of the diffusion-step embedding at each layer) Or use something similar to a FiLM layer , as in WaveGrad?
When I create a new environment per the instructions in the DSS setup, and then try to run any experiment, I get a segfault.
Hey, I wanted to give S4D a quick try in my research as a drop-in replacement of S4 (which, as far as I gathered, should be a good way to start), but I'm running into some hard memory limitations. I'm trying to train the DiffWave version of SaShiMi as a first experiment, but the memory requirements seem to increase significantly when replacing S4 with an equivalent S4D layer (with default settings), causing the model to go OOM in my case actually (so I don't have any precise measurements, but it's a 20% increase in overall memory consumption at least. I use the parameters as discussed in #46. Is this something you'd expect?
Hey, first of all, great work with the repository, I don't think I've worked with a repository for a paper that's so extensive and well-structured so far.
I'm currently trying to train the SaShiMi model on my own dataset (following your guide here: #23), and I run into some issues when trying to generate samples with the trained model.
In case this is relevant, I'm trying to do inference on the checkpoint files, and I changed the number of layers (model.n_layers
) to 4 to accommodate for the memory limitations of my GPU. Apart from that, I have done no changes to any of the training and model (code) except for switching the dataset to my own.
When I try to call the generation.py
script now, I run into a range of errors:
hurwitz
parameter does not exist anymore, and the setup_step
methods don't seem to correctly accept (or rather pass them downstream) the mode argument. I "fixed" this by removing the hurwitz
argument override and by adding the mode argument to all module.setup_step()
methods and just passing it downstream as required.model.layer.postact=null
causes the state_dict
to not load successfully anymore, giving me the following error:Missing key(s) in state_dict: "model.c_layers.0.layer.output_linear.weight", "model.c_layers.0.layer.output_linear.bias", "model.c_layers.2.layer.output_linear.weight", "model.c_layers.2.layer.output_linear.bias", "model.c_layers.4.layer.output_linear.weight", "model.c_layers.4.layer.output_linear.bias", "model.c_layers.6.layer.output_linear.weight", "model.c_layers.6.layer.output_linear.bias", "model.u_layers.0.1.layer.output_linear.weight", "model.u_layers.0.1.layer.output_linear.bias", "model.u_layers.0.3.layer.output_linear.weight", "model.u_layers.0.3.layer.output_linear.bias", "model.u_layers.0.5.layer.output_linear.weight", "model.u_layers.0.5.layer.output_linear.bias", "model.u_layers.0.7.layer.output_linear.weight", "model.u_layers.0.7.layer.output_linear.bias", "model.u_layers.1.1.layer.output_linear.weight", "model.u_layers.1.1.layer.output_linear.bias", "model.u_layers.1.3.layer.output_linear.weight", "model.u_layers.1.3.layer.output_linear.bias", "model.u_layers.1.5.layer.output_linear.weight", "model.u_layers.1.5.layer.output_linear.bias", "model.u_layers.1.7.layer.output_linear.weight", "model.u_layers.1.7.layer.output_linear.bias".
Unexpected key(s) in state_dict: "model.c_layers.0.layer.output_linear.0.weight", "model.c_layers.0.layer.output_linear.0.bias", "model.c_layers.2.layer.output_linear.0.weight", "model.c_layers.2.layer.output_linear.0.bias", "model.c_layers.4.layer.output_linear.0.weight", "model.c_layers.4.layer.output_linear.0.bias", "model.c_layers.6.layer.output_linear.0.weight", "model.c_layers.6.layer.output_linear.0.bias", "model.u_layers.0.1.layer.output_linear.0.weight", "model.u_layers.0.1.layer.output_linear.0.bias", "model.u_layers.0.3.layer.output_linear.0.weight", "model.u_layers.0.3.layer.output_linear.0.bias", "model.u_layers.0.5.layer.output_linear.0.weight", "model.u_layers.0.5.layer.output_linear.0.bias", "model.u_layers.0.7.layer.output_linear.0.weight", "model.u_layers.0.7.layer.output_linear.0.bias", "model.u_layers.1.1.layer.output_linear.0.weight", "model.u_layers.1.1.layer.output_linear.0.bias", "model.u_layers.1.3.layer.output_linear.0.weight", "model.u_layers.1.3.layer.output_linear.0.bias", "model.u_layers.1.5.layer.output_linear.0.weight", "model.u_layers.1.5.layer.output_linear.0.bias", "model.u_layers.1.7.layer.output_linear.0.weight", "model.u_layers.1.7.layer.output_linear.0.bias".
Does this mean that I should rename those keys manually (there's a fairly clear correspondence) to make it work after changing the activation?
mode
parameter in module.setup_step()
, I still get this error:Traceback (most recent call last):
File "/home/debaumas/state-spaces/sashimi/generation.py", line 192, in main
module.setup_step(mode='dense')
File "/home/debaumas/state-spaces/src/models/sequence/ss/kernel.py", line 1038, in setup_step
self.kernel.setup_step(mode=mode)
File "/home/debaumas/state-spaces/src/models/sequence/ss/kernel.py", line 515, in setup_step
dC = torch.linalg.solve(
torch._C._LinAlgError: linalg.solve: (Batch element 0): The diagonal element 1 is zero, the solve could not be completed because the input matrix is singular.
Do you have any idea what might be causing this and maybe an idea about how to fix/circumvent this?
It'd be awesome if you could help point me in the right direction with this.
Best,
Stefan
Hi, I was trying to reproduce some of your results using the SaShiMi model by running the command
python -m train experiment=sashimi-sc09 wandb=null
but I get the error
TypeError: __init__() got an unexpected keyword argument 'pool'
due to the DownPool
class no longer needing pool
parameter for initialization.
Can I ask if there are any plans to fix these issues so that they work with the current implementations of the different modules?
Hi!
When I was checking the code, I found that for most of the experimental configurations for S4, the A, B, C, dt parameters are not set to be trainable, while I originally thought that these parameters are trained in S4 presented in the paper. I don't know if I understand it correctly, but isn't this setting similar to HiPPO if these parameters are not trained? Or do you have any empirical findings on this?
Thanks!
Hi,
Thanks for the great work! When I ran your code on the LRA pathfinder dataset (using your config), I found it can't converge till the end of the 200th epoch as shown in the following log: loss=0.693, val/accuracy=0.499, val/loss=0.693, test/accuracy=0.495, test/loss=0.693, train/accuracy=0.501, train/loss=0.693
. The loss is 0.693 throughout training.
Do you have any thoughts on this? Thanks!
I think it would be quite useful if one could install this repository, either directly from git or preferably using pip
.
I invision a change to models where people only have to:
pip install state-spaces
from state_spaces import S4
nn.LSTM
or nn.Transformer
with S4
There is a package on pypi, not sure if it was pushed by you or not, but the latest version is older than the current code.
https://pypi.org/project/state-space/#history
Hi,
I bounced upon an issue while attempting to run the sashimi.py script, namely the LinearActivation function in the ./standalone/s4.py script does not accept the weight_norm argument upon initialization passed on through the **kwargs in the DownPool and UpPool classes.
The error trace:
File ".\PycharmProjects\state-spaces\sashimi\sashimi.py", line 450, in <module>
model = Sashimi(n_layers=2).cuda()
File ".\PycharmProjects\state-spaces\sashimi\sashimi.py", line 298, in __init__
d_layers.append(DownPool(H, expand, p))
File ".\PycharmProjects\state-spaces\sashimi\sashimi.py", line 28, in __init__
self.linear = LinearActivation(
File ".\PycharmProjects\state-spaces\src\models\sequence\ss\standalone\s4.py", line 137, in LinearActivation
linear = linear_cls(d_input, d_output, bias=bias, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'weight_norm'
Looking a bit further I noticed that neither the nn.Conv1d
(chosen in DownPool due to transposed = True) nor the nn.Linear
that could be called within LinearActivation()
, have the explicit weight_norm
argument in Pytorch.
Am I overlooking something?
Python: 3.9
Pytorch: 1.12 (latest stable release)
Thanks a lot for publishing your code with the papers!
Cheers,
Bavo
When I run:
python -m train wandb=null experiment=s4-cifar
I meet this problem:
File "/root/anaconda3/envs/statespaces/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/anaconda3/envs/statespaces/lib/python3.8/site-packages/torch/autograd/init.py", line 154, in backward
Variable._execution_engine.run_backward(
File "/root/anaconda3/envs/statespaces/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
return user_fn(self, *args)
File "/root/anaconda3/envs/statespaces/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 263, in backward
grad = grad.reshape(
RuntimeError: shape '[1024, 2, 2, 32, 2]' is invalid for input of size 4202496
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Breathtaking work, absolutely amazing application of linear algebra. Beautiful.
A few questions.
Appendix of S4 article mentions "... implementation of S4 uses the naive O(NL) algorithm ..._ " and README.md mentions custom kernels.
Question 0. Did you benchmark the naive O(NL) against the custom kernel?
Question 1. Is the 60x speedup in Table 8 with naive O(NL) or custom kernel? Or is the custom kernel only used during training?
Question 2. How big a percentage of compute is spent on S4 compared to mlp/lnorm/others in generation mode?
Apologies for any misunderstandings
Hi! I really want to try S4D in my research, but first I want to make a little proof of concept for myself, without going into too much theory or detail.
Do I understand correctly that standalone models can be used out of the box? What bothers me: in your publications, you talk about the importance of model initializing and tuning optimizer parameters. Is proper initialization taken into account in standalone S4D? Is there anywhere to see how to properly set up the optimizer for standalone S4D?
Thanks for your research!
Hello,
You have shown S4's great ability and efficiency on classification and unconditional generation (wikitext-103) tasks. I am wondering if S4 can be applied to conditional generation tasks such as summarization and machine translation? A simple idea is to re-organize these tasks to language modeling tasks, but I am not sure whether the generation quality would be affected.
Hi, I am getting the following error when trying to train S4 on the pathfinder dataset. Any help would be greatly appreciated.
Traceback (most recent call last):
File "/data/al451/state-spaces/train.py", line 553, in main
train(config)
File "/data/al451/state-spaces/train.py", line 498, in train
trainer.fit(model)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
self._call_and_handle_interrupt(
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1492, in _call_setup_hook
self._call_lightning_module_hook("setup", stage=fn)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/data/al451/state-spaces/train.py", line 56, in setup
self.dataset.setup()
File "/data/al451/state-spaces/src/dataloaders/datasets.py", line 1234, in setup
dataset = PathFinderDataset(self.data_dir, transform=self.default_transforms())
File "/data/al451/state-spaces/src/dataloaders/datasets.py", line 1130, in init
path_list = sorted(
File "/data/al451/state-spaces/src/dataloaders/datasets.py", line 1132, in
key=lambda path: int(path.stem),
ValueError: invalid literal for int() with base 10: '._142'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
The experiments using datasets like etth, ettm, and ecl don't run (despite the definitions in /config). I think it's because the datasets cannot be downloaded automatically like the LRA experiments.
Would it be possible to add this, or explain how these experimets could be run?
Hi,
Thanks for your excellent work!
In the HIPPO paper, the transition matrix A of SSM is time-variant, but has become time-invariant in s4 model. I didn't find any theories/experiments that discuss why it still works.
Can you kindly share your thoughts?
Thanks again,
Ziwei
Hello,
I loved Structured State Spaces, and obteined a fantastic performance compared to LSTM/SRU/Transformers.
I want to introduce S4 to some researchers and students, and the self-contained S4 layer is super great! However it requires the "cautchy kernel".
There are two versions of cautchy kernel, a cuda and a Pykeops version.
However extensions/cauchy requires cuda and Pykeops do not support Windows, and the target people have a very diverse number of environments.
Would be possible to HazyResearch team to implement a self-contained S4 layer including a simpler pytorch cautchy kernel ?
Dear all, many thanks for this extremely interesting code (and maths)!
Running
example.py --grayscale
with python 3.8 gives the following error on my system:
"RuntimeError: view_as_real doesn't work on unresolved conjugated tensors. To resolve the conjugate tensor so you can view it as real, use self.resolve_conj(); however, be warned that the resulting tensor will NOT alias the original."
Maybe some requirements need specific versions for the example to run?
''' dataset = load_dataset(
"csv",
data_files={
"train": str(self.data_dir / "basic_train.tsv"),
"val": str(self.data_dir / "basic_val.tsv"),
"test": str(self.data_dir / "basic_test.tsv"),
},
delimiter="\t",
keep_in_memory=True,
)''' this piece of code in src/dataloaders/datasets.py is giving error and the dataloading process is aborted
According to line 133 and following, batch_size will always be set to 128.
Might be relevant for users with smaller GPUs ;)
Hi there,
I recently tried upgrading my S4 setup / environment to be on the v2 tag but ran into the following issue when running the basic test script:
(base) ray@test-python:~/state-spaces$ python -m train wandb=null pipeline=mnist model=s4
CONFIG
├── train
│ └── seed: 0
│ interval: epoch
│ monitor: val/accuracy
│ mode: max
│ ema: 0.0
│ test: false
│ debug: false
│ ignore_warnings: false
│ state:
│ mode: null
│ chunk_len: null
│ overlap_len: null
│ n_context: 0
│ n_context_eval: 0
│ sweep: null
│ group: null
│ benchmark_step: false
│ benchmark_step_k: 1
│ benchmark_step_T: 1
│ checkpoint_path: null
│ visualizer: filters
│ disable_dataset: false
│
├── wandb
│ └── None
├── trainer
│ └── gpus: 1
│ accumulate_grad_batches: 1
│ max_epochs: 200
│ gradient_clip_val: 0.0
│ log_every_n_steps: 10
│ limit_train_batches: 1.0
│ limit_val_batches: 1.0
│ weights_summary: top
│ progress_bar_refresh_rate: 1
│ track_grad_norm: -1
│ resume_from_checkpoint: null
│
├── loader
│ └── batch_size: 50
│ num_workers: 4
│ pin_memory: true
│ drop_last: true
│ train_resolution: 1
│ eval_resolutions:
│ - 1
│
├── dataset
│ └── _name_: mnist
│ permute: true
│ val_split: 0.1
│ seed: 42
│
├── task
│ └── _name_: base
│ loss: cross_entropy
│ metrics:
│ - accuracy
│ torchmetrics: null
│
├── optimizer
│ └── _name_: adamw
│ lr: 0.001
│ weight_decay: 0.0
│
├── scheduler
│ └── _name_: plateau
│ mode: max
│ factor: 0.2
│ patience: 20
│ min_lr: 0.0
│
├── encoder
│ └── linear
├── decoder
│ └── _name_: sequence
│ mode: pool
│
├── model
│ └── layer:
│ _name_: s4
│ d_state: 64
│ channels: 1
│ bidirectional: false
│ activation: gelu
│ postact: null
│ hyper_act: null
│ dropout: 0.0
│ measure: legs
│ rank: 1
│ dt_min: 0.001
│ dt_max: 0.1
│ trainable:
│ dt: true
│ A: true
│ P: true
│ B: true
│ lr: 0.001
│ length_correction: true
│ tie_state: true
│ hurwitz: true
│ resample: false
│ deterministic: false
│ l_max: 784
│ verbose: false
│ _name_: model
│ prenorm: false
│ transposed: true
│ n_layers: 4
│ d_model: 256
│ residual: R
│ pool:
│ _name_: sample
│ pool: 1
│ expand: 1
│ norm: layer
│ dropout: 0.0
│
└── callbacks
└── learning_rate_monitor:
logging_interval: epoch
timer:
step: true
inter_step: false
epoch: true
val: true
params:
total: true
trainable: true
fixed: true
model_checkpoint:
monitor: val/accuracy
mode: max
save_top_k: 1
save_last: true
dirpath: checkpoints/
filename: val/accuracy
auto_insert_metric_name: false
verbose: true
Global seed set to 0
[2022-05-25 13:40:50,814][__main__][INFO] - Instantiating callback <src.callbacks.timer.Timer>
[2022-05-25 13:40:50,815][__main__][INFO] - Instantiating callback <src.callbacks.params.ParamsLog>
[2022-05-25 13:40:50,816][__main__][INFO] - Instantiating callback <pytorch_lightning.callbacks.ModelCheckpoint>
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
[2022-05-25 13:40:50,848][torch.distributed.nn.jit.instantiator][INFO] - Created a temporary directory at /tmp/tmpm51hqe7x
[2022-05-25 13:40:50,849][torch.distributed.nn.jit.instantiator][INFO] - Writing /tmp/tmpm51hqe7x/_remote_module_non_sriptable.py
Error executing job with overrides: ['wandb=null', 'pipeline=mnist', 'model=s4']
Traceback (most recent call last):
File "/home/ray/state-spaces/train.py", line 553, in main
train(config)
File "/home/ray/state-spaces/train.py", line 498, in train
trainer.fit(model)
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
self._call_and_handle_interrupt(
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1492, in _call_setup_hook
self._call_lightning_module_hook("setup", stage=fn)
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/ray/state-spaces/train.py", line 74, in setup
self.model = utils.instantiate(registry.model, self.hparams.model)
File "/home/ray/state-spaces/src/utils/config.py", line 99, in instantiate
return obj()
File "/home/ray/state-spaces/src/models/sequence/model.py", line 69, in __init__
block = SequenceResidualBlock(d, l+1, prenorm=prenorm, dropout=dropout, layer=layer, residual=residual, norm=norm, pool=pool)
File "/home/ray/state-spaces/src/models/sequence/block.py", line 36, in __init__
self.layer = utils.instantiate(registry.layer, layer, d_input)
File "/home/ray/state-spaces/src/utils/config.py", line 99, in instantiate
return obj()
File "/home/ray/state-spaces/src/models/sequence/ss/s4.py", line 86, in __init__
self.kernel = HippoSSKernel(self.h, N=self.n, L=l_max, channels=channels, verbose=verbose, **kernel_args)
File "/home/ray/state-spaces/src/models/sequence/ss/kernel.py", line 712, in __init__
self.kernel = SSKernelNPLR(
File "/home/ray/state-spaces/src/models/sequence/ss/kernel.py", line 217, in __init__
self.C = nn.Parameter(_c2r(_resolve_conj(C)))
RuntimeError: view_as_real doesn't work on unresolved conjugated tensors. To resolve the conjugate tensor so you can view it as real, use self.resolve_conj(); however, be warned that the resulting tensor will NOT alias the original.
Is this something you've seen before? I'd be happy to provide a fuller description of my package version, system architecture, etc. if you can let me know what might help get to the bottom of this bug.
Best,
Matthew
Hello thank you for the great library. I was wondering about the n_epoch_double flag and the best way to use it. If you use that flag will it not just give you a cuda OOM error when it doubles? Or is the appropriate usage to begin with a very small batch size so that it has room to double without going OOM? When running the wt103 model if I went over L_max=4096 with a batch size of 1 on a 4xV100 32GB GPU machine I ran out of memory when it got to the eval stage of the first epoch. I have a language modeling use case that requires very long sequence lengths (8192+) and wanted to try the s4 model on it, because the results seem pretty good for lower sequence lengths. Any help would be appreciated. Thank you!
hello, I trained a model using something like the wt103 task and modified the sashimi generate script to generate text like a CLM. So basically conditioning on a text string, generate the next N words sequentially in the same loop like the Sashimi generation script. I believe that I have it working however I don't know what integer output corresponds to what word in the vocab. Is there a hash table or something that stores the vocab somewhere that's easily accessible? Sorry I can't seem to find any obvious place that it would reside. Thank you for your help.
Hi,
I'm running experiments with default settings using this command:
python -m train wandb=null experiment=sashimi-sc09
I'd like to get the checkpoint of the best model in the training process and then load it to generate .wav
files. But i find the only checkpoint generated is a last.ckpt
under the directory outputs/yyyy-mm-dd/xx-xx-xx/checkpoints
, and i have no idea how to load it (since the checkpoint loading in sashimi/generation.py
loads a .pt
file)
Hello,
Sorry for another silly question. While training on s4-lra-cifar (on A100) I get the following final log:
Epoch 94, global step 85499: val/accuracy was not in top 1
Epoch 96: 80% 966/1200 [01:16<00:18, 12.63it/s, loss=0.144, v_num=xyj4, val/accuracy=0.866, val/loss=0.527, test/accuracy=0.865, test/loss=0.533, train/accuracy=0.946, train/loss=0.151]
Epoch 95, global step 86399: val/accuracy reached 0.86580 (best 0.86580), saving model to "blah/checkpoints/val/accuracy.ckpt" as top 1
Epoch 97: 78% 938/1200 [01:15<00:21, 12.38it/s, loss=0.142, v_num=xyj4, val/accuracy=0.864, val/loss=0.536, test/accuracy=0.864, test/loss=0.539, train/accuracy=0.945, train/loss=0.156]
Epoch 96, global step 87299: val/accuracy was not in top 1
Epoch 98: 75% 900/1200 [01:14<00:24, 12.06it/s, loss=0.143, v_num=xyj4, val/accuracy=0.866, val/loss=0.529, test/accuracy=0.865, test/loss=0.535, train/accuracy=0.945, train/loss=0.153]
Epoch 97, global step 88199: val/accuracy was not in top 1
Epoch 99: 79% 945/1200 [01:15<00:20, 12.45it/s, loss=0.137, v_num=xyj4, val/accuracy=0.863, val/loss=0.534, test/accuracy=0.864, test/loss=0.539, train/accuracy=0.946, train/loss=0.151]
Epoch 98, global step 89099: val/accuracy was not in top 1
Epoch 99, global step 89999: val/accuracy was not in top 1
Saving latest checkpoint...
I would be grateful if you could help with these questions and apologize in advance if they seem too basic.
Thank you again for sharing you wonderful repo,
Ankit
Hi,
Is there any way to train S4 with multiple GPUs? I have 2 GPUs, but only one of them is working.
Thanks
Hi, I did a simple test to verify the difference between forward and step (mode="dense") on a single unidirectional S4 layer. Given a random sequence, there difference, the absolute error is around 1e-2 and the square error is around 1e-4. I suspect these results are wrong. My verification follows test_step() in //src/models/sequence/ss/kernel.py. I'd love to know if you have examples that clearly compares their difference. Thanks:)
Hi, can you please upload the logs of the experiments that were reported in the paper?
I tried to reproduce the wikitext-103
experiment but had to change some configurations due to hardware constraints. Even though the changes were minor, the results were not as I expected them to be. I think that the logs from the original experiments might help me to reproduce the results more easily.
Thank you so much!
Hi! Very impressive results!
I'd like to try applying the S4 model on some toy example from NLP, like text generation (replicate examples in minGPT from @karpathy). I'm not very familiar with state-space models, so I don't understand a few things and have few questions:
Thanks!
Hi, I'm interested in recreating your WikiText-103 LM experiment. Is it possible you could make that easier for me? Thanks! CJ
When running CUDA_VISIBLE_DEVICES=1,2,3,4,7 python -m train wandb=null experiment=sashimi-youtubemix dataset=youtubemix, I get the following error:
Traceback (most recent call last):
File "/data/al451/state-spaces/train.py", line 553, in main
train(config)
File "/data/al451/state-spaces/train.py", line 498, in train
trainer.fit(model)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1371, in _run_sanity_check
self._evaluation_loop._reload_evaluation_dataloaders()
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 170, in _reload_evaluation_dataloaders
self.trainer.reset_val_dataloader()
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 551, in reset_val_dataloader
self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 508, in _reset_eval_dataloader
if has_len_all_ranks(dataloader, self.training_type_plugin, module)
File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py", line 118, in has_len_all_ranks
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: Total length of Dataloader
across ranks is zero. Please make sure that it returns at least 1 batch.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Dear authors/contributors,
First of all, thank you so much for publishing such a great work. I think it is really inspirational and we will see this model (or its variants) being deployed to solve variety of real-world problems in the next years.
I tried to go through your most recent papers starting from HiPPO, and I would like to kindly ask conceptual questions to deepen my understanding. As I couldn't find different sources of information other than your papers (and a couple of your recorded talks on Youtube and Annotated S4), I think this could be an appropriate place to ask those questions. If you prefer any other discussion platform, please let me know.
PS: These questions turned out to be a bit longer than I intended, but I don't expect you to clarify them all at once :)
It is a great pleasure for me to know more about your exciting work. Many thanks in advance. I would be also happy to know if there are other resources that you can suggest.
when I tried to run the code for the language model (wt103) with the command
HYDRA_FULL_ERROR=1 python3 -m train wandb=null experiment=s4-wt103
the exception happened!
in adaptive_softmax.py
module, the left index is greater than r_idx so nn.Parameter(torch.zeros(abs(r_idx - l_idx)))
can't pass (line 116)! so I did used abs()
function to pass it.
then in self.out_layers_weights
(line 182) again exception arises. list index out of range
so I did used try-except
to pass the lines...
You can find the full error in the attached file:
error.txt
Hi! I'm running some experiments using your code. For my use-case, I'm using torch.nn.DistributedDataParallel
, which automatically detects unused parameters, i.e., parameters that get no gradients.
The unused parameters are:
D
(from the S4
module)output_linear.weight
and output_linear.bias
(from the S4
module). These are instances of the TransposedLinear
layer.kernel.C
(from SSKernelNPLR
).I have manually confirmed these parameters don't get gradients by running the following code after computing the loss:
for name, param in model.named_parameters():
if param.grad is None:
print(name)
Usually, the above means the parameters are instantiated but not used. In this case, surprisingly, all the parameters get used in the forward
method. However, none of them get used in "vanilla" PyTorch ops. D
, output_linear.weight
and output_linear.bias
get used through opt_einsum.contract
, and kernel.C
gets used through your Cauchy GPU op.
Can you confirm the issue on your end? These parameters all look important for the model.
Hi, I'm trying to replicate your results for applying SaShiMi in a diffusion context, and have run into some questions about implementation details along the way. It'd be awesome if you could help me out with them.
bidirectional=True, unet=True, diffwave=True
and set the rest to the values specified in Appendix C.2.2 of the paper and their respective default values?Best,
Stefan
Dear authors and contributors,
There is an observation that I would be happy to get your confirmation on :-)
In all of the model hierarchy: SequenceModel
, SequenceResidualBlock
and S4
,you are using Dropout2d
which zeros at the batch dimension, i.e. ignores the entire sample. Without a residual link, with multiple layers, the probability that each sample is not ignored through the model becomes negligible. Consequently, the model does not see the inputs and will not train!
In the SequenceResidualBlock
, the dropout is applied only if a residual link is present. The residual link of SequenceResidualBlock
also takes care of the dropout from S4
.
So my issue is two-fold:
dropout > 0
, we never should set residual = None
in the parameters of SequenceResidualBlock
, right? Is it possible to add a check in the initialization to avoid possible misconfigurations?dropinp
input of SequenceModel
should not be used, as there is no residual link there. I've seen in all of the configs we have dropinp: 0.0
. So why is it there at all?Thanks and regards,
I was wondering what parameters I could change to be able to run it on GPU with limited RAM. I tried reducing the layers to 4, which did not help. Also, it seems like batch size is set to 1 by default. I am using 4x TITAN RTX 24GB.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.