spokenlanguage / platalea
Library for training visually-grounded models of spoken language understanding.
License: Apache License 2.0
Some changes done in relation to #37 have made the CI workflow fail due to the tox job. We should fix this, even if it means removing that job for now.
The last transformer run (crisp-bee-8 on wandb) ran on CPU instead of GPU on Carmine (which has 3 GPUs). Find out why it didn't detect the GPU anymore. Could be some configuration changed in the last reboot (which was around 1 November).
For consistency as we'll also have an audio dir. There is no plural 'audios' :-)
Some options for fine-tuning the schedule that may be promising:
Merge vector-quant and transformer branches before we start working on grounding in videos.
Maybe we should take this as an opportunity to use a uniform scheduler creation procedure. I see that different experiment functions use different code (e.g. `asr.experiment()` has very different rules than `basic.experiment()`). This should include:
- the choice of optimizer (`adadelta`, `adam`, ...)
- the choice of scheduler (`none`, `cyclic`, `noam`; where I would use `none` instead of `constant`, as adaptive optimizers don't use a constant learning rate even without a scheduler)
- renaming `constant_lr` -> `lr`
- a `scheduler.py` module to handle this, called from all `experiment()` functions (a rough sketch follows below)

Originally posted by @bhigy in #51 (comment)
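As a starting point, a hypothetical sketch of what such a shared factory could look like (the option names follow the list above; `lr_scheduler`, `min_lr`, `max_lr` and `warmup_steps` are made-up config fields for illustration, not existing platalea parameters):

```python
import torch

def create_scheduler(config, optimizer, steps_per_epoch):
    name = getattr(config, "lr_scheduler", "none")
    if name == "none":
        # Adaptive optimizers manage their own effective step size,
        # so "none" just keeps the base learning rate untouched.
        return torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)
    if name == "cyclic":
        return torch.optim.lr_scheduler.CyclicLR(
            optimizer, base_lr=config.min_lr, max_lr=config.max_lr,
            step_size_up=steps_per_epoch, cycle_momentum=False)
    if name == "noam":
        # Noam schedule: linear warmup followed by inverse-sqrt decay.
        warmup = config.warmup_steps
        return torch.optim.lr_scheduler.LambdaLR(
            optimizer,
            lambda step: warmup ** 0.5 * min((step + 1) ** -0.5,
                                             (step + 1) * warmup ** -1.5))
    raise ValueError(f"Unknown scheduler: {name}")
```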
Do every step for now; it may cost more, but we need fine-grained info for debugging the Transformer model.
When `platalea.config` is imported in a script, it catches the `-h` (help) parameter before the script's specific parameters are added. As an example, `python utils/evaluate_asr.py -h` doesn't show the script's parameters (`path`, `-b`), only the global ones (e.g. `data_root`, `meta`). Importing config after the script has parsed the parameters would result in the opposite behavior.
@egpbos, any idea how this could be handled better?
https://github.com/spokenlanguage/platalea/blob/510d5ef9270411dc971c19d988c30bed4c1c20d6/platalea/encoders.py#L443
For some inputs this returns negative numbers, which causes failures elsewhere and also logically makes no sense.
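For context, this is the generic Conv1d output-length rule and how it goes negative for very short inputs (the exact expression at the linked line may differ):

```python
import math

def conv1d_output_length(length, kernel_size, stride, padding=0, dilation=1):
    # Generic PyTorch Conv1d length rule; negative results mean the input is
    # shorter than the (dilated) kernel minus the padding.
    return math.floor((length + 2 * padding - dilation * (kernel_size - 1) - 1) / stride) + 1

print(conv1d_output_length(100, kernel_size=6, stride=2))  # 48
print(conv1d_output_length(3, kernel_size=6, stride=2))    # -1 for a very short input
```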
I tried to run the `basic_default` experiment and get the following error:
Traceback (most recent call last):
File "basic_default.py", line 49, in <module>
M.experiment(net, data, run_config)
File "/home/bjrhigy/dev/platalea/platalea/basic.py", line 125, in experiment
wandb_step_output["last_lr"] = scheduler.get_last_lr()[0]
AttributeError: 'LambdaLR' object has no attribute 'get_last_lr'
@egpbos and @cwmeijer, this seems related to the new code for wandb. Did you experience that?
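For reference, `get_last_lr()` was only added in PyTorch 1.4, so an older environment would trigger exactly this error. A small compatibility shim (hypothetical, not currently in platalea) could look like:

```python
def last_learning_rate(scheduler):
    # PyTorch >= 1.4 exposes get_last_lr(); older versions don't.
    if hasattr(scheduler, "get_last_lr"):
        return scheduler.get_last_lr()[0]
    # Fall back to reading the current value directly from the optimizer.
    return scheduler.optimizer.param_groups[0]["lr"]
```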
Default will be 0 (no regularization).
This will hopefully solve the problem mentioned in #30.
Currently, our CI test suite of running experiments is only a "smoke test", i.e. it checks only whether everything runs, but not if it runs correctly. We should add result value checks so that we can see when changes in code change model outcomes.
We want to easily share results with the team. Maybe we should request an academic license or maybe not. We need to find out.
The test suite as set up in #17 takes over 40 minutes. The major bottlenecks are:
For the experiments, would it be an idea if we run them with different hyperparameters so they train faster? If so, what parameters should we target?
If this is not enough, there are also a few possibilities within GitHub Actions to speed the whole test suite up:
Do a couple of runs with a constant learning rate, varied systematically between runs, and find out if we can find some learning rate that consistently reduces loss. After that, we can experiment with more complicated schemes.
Possible sources of inspiration for the visual part:
[1] https://arxiv.org/abs/2006.09199
[2] https://openaccess.thecvf.com/content_ICCV_2019/html/Miech_HowTo100M_Learning_a_Text-Video_Embedding_by_Watching_Hundred_Million_Narrated_ICCV_2019_paper.html
[3] https://openaccess.thecvf.com/content_CVPR_2020/html/Miech_End-to-End_Learning_of_Visual_Representations_From_Uncurated_Instructional_Videos_CVPR_2020_paper.html
We are trying to find out why the Transformer model is performing worse than expected, considering the better performance of the GRU model and of earlier Transformer-based models like ESPnet. In particular, we seem to be memory-bound on the GPU, whereas those other models achieve higher performance with less memory, which is puzzling.
We have started investigating the GRU and Transformer models with the `torchinfo` tool. Below are some reports.
We are using branch 67_torchinfo: https://github.com/spokenlanguage/platalea/tree/67_torchinfo
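A minimal sketch of how `torchinfo` is pointed at a model; the placeholder GRU and input shape below are illustrative, not the actual platalea encoders or batch shapes:

```python
import torch
from torchinfo import summary

# Stand-in for one of our encoders; the real reports use the platalea models.
model = torch.nn.GRU(input_size=39, hidden_size=256, num_layers=2, batch_first=True)
dummy_audio = torch.randn(8, 512, 39)  # (batch, frames, MFCC features) - assumed shape

# Prints parameter counts, per-layer output shapes and estimated memory usage.
print(summary(model, input_data=dummy_audio, depth=3))
```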
There was a hack at codecov. I don't think it really compromises anything important, but we should probably still regenerate the codecov token we use in this project to be sure.
When running e.g. basic-default, I now get the following warnings:
INFO:root:Loading data
/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.preprocessing.label module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.preprocessing. Anything that cannot be imported from sklearn.preprocessing is now part of the private API.
warnings.warn(message, FutureWarning)
/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/sklearn/base.py:313: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.21.3 when using version 0.22.1. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
To have something in between the tiny 1D and the large 8K datasets.
There's still one remaining explicit `.cuda()` call in decoders.py, in TextDecoder. It has a `use_cuda` boolean parameter and checks that before calling `.cuda()` on the model. Is the `use_cuda` parameter actually used? Or would it be preferable to remove the parameter and also switch to automatic device detection, as we're using everywhere else now?
In lunar-serenity-26, dark-serenity-27 and worthy-sunset-28, we notice that dropout may help slightly to get validation and training loss closer together, but at a cost to performance. Although it might be worth seeing how dropout models fare when trained for more epochs (only 32 in the mentioned runs), one other promising avenue may be to look into using deeper and wider models, as suggested here (in turn based on this article).
Compared to that paper, we are on the low end of both width (we used d_model = 256 in the above runs) and depth (4 now).
Glancing over their paper, it seems they are getting diminishing returns after d_model = 768 and depth = 12.
Their setting is vastly different: they use multi-million-word text datasets and hence also completely different tasks.
Still, we should look into scaling up these parameters and seeing what it does.
In run dainty-dawn-20 we saw the validation loss increasing again starting from epoch 20, approximately. We should find a way to train that reduces validation loss together with training loss.
We need to adapt `platalea/utils/preprocessing.py` to work with videos from HowTo100M.
What does this software do? Why would anyone want to use it? Why not just use keras/pytorch instead?
In case anyone stumbles upon this repo, what could make this person enthusiastic about it before browsing away?
If we want many users, a logo would also help.
Define actions still required by the software sustainability plan and implement them.
The `soundfile` package is in requirements.txt, but not in setup.py, so it won't be installed by pip install. We should add it there, since it was added to preprocessing in #11.
Related: do we need to keep requirements.txt for something specific? We had a recent discussion about this at the eScience Center: NLeSC/guide#156. Our conclusion, based on a survey of common practices, was that requirements.txt is often superfluous, except for instance when you use it for specifying package repositories/sources. I see that in this case that is actually done for `ursa`, but I'm not sure whether that dependency is actually used at all. It was originally used in the analysis subdirectory, but that was removed from this repo at some point.
Do you agree that we can remove requirements.txt?
When doing `pip install .`, currently the file `label_encoders.pkl` does not get installed, despite it being in the MANIFEST.in file.
I also tried putting it in setup.py, with `package_data={'platalea': ['platalea/label_encoders.pkl']}`, but that doesn't work either.
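A possible fix worth trying (untested here): setuptools interprets `package_data` paths relative to the package directory, so the `platalea/` prefix may be the culprit, and `include_package_data=True` may be needed for MANIFEST.in entries to be picked up. A sketch:

```python
from setuptools import setup, find_packages

setup(
    name="platalea",
    packages=find_packages(),
    include_package_data=True,  # pick up files listed in MANIFEST.in
    # Path is relative to the platalea package directory, not the repo root.
    package_data={"platalea": ["label_encoders.pkl"]},
)
```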
As mentioned here, we can now calculate scoring on CPU so that we have more memory available for training on the GPU. A similar option for the validation step might be useful; it also takes about 2GB extra, so getting rid of that would allow us to further increase model size during training on a single GPU.
The title says "try" because we'll have to see whether it slows down training too much.
While trying to fix warnings, I stumbled upon a new warning that only comes up since pytorch 1.8: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
The warning is triggered in the basic and transformer tests:
platalea/basic.py:39: in cost
speech_enc, image_enc = self.forward(item['audio'], item['audio_len'], item['image'])
platalea/basic.py:34: in forward
speech_enc = self.SpeechEncoder(audio, audio_len)
../../../sw/miniconda3/envs/platalea/lib/python3.8/site-packages/torch/nn/modules/module.py:914: in _call_impl
self._maybe_warn_non_full_backward_hook(input, result, grad_fn)
(Note: I added the forward function in SpeechImage as a test, this call to SpeechEncoder is the actual troublemaker, and it was in SpeechImage.cost() before.)
I frankly have no idea what a backward hook is, let alone a non-full one. Anyone have any clue?
My best guess is that it has something to do with the multiple inputs and multiple outputs and some kind of missing element regarding autograd somewhere. I tried to search the other models/experiments (asr, mtl) for hints in this direction, but couldn't really find anything.
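For what it's worth, my current understanding (a guess, not verified against our stack): something, either our own code or a library such as wandb, registers a module-level backward hook on the encoder via the old `register_backward_hook()` API, which can miss gradients when a module's forward pass spans multiple autograd nodes; PyTorch 1.8 added `register_full_backward_hook()` as the replacement and started warning about the old one. A standalone illustration of the two APIs:

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())

def hook(module, grad_input, grad_output):
    # grad_output is a tuple with one gradient per module output.
    print("grad_output norm:", grad_output[0].norm().item())

net.register_full_backward_hook(hook)   # new API, PyTorch >= 1.8
# net.register_backward_hook(hook)      # old API, triggers the deprecation warning

loss = net(torch.randn(2, 4)).sum()
loss.backward()                          # the hook fires during the backward pass
```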
Unfortunately, it seems like something has not gone right in #17 and hence in #4. I misread that the coverage went up to 80%, but it actually went down very slightly. Maybe not all the coverage reports are properly merging, maybe not all files are properly tracked (the experiment files for instance seem to be missing in the report), maybe something else is going on, but it seems very strange that coverage would go down when we now run all these experiments. Something to look into.
Currently, for the transformer experiment, we can only activate dropout on the transformer layers, not on the CNN and other layers. It may make sense to do this, so that we can counter overfitting using those layers as well.
We need to adapt `platalea/dataset.py` to be able to train a model with HowTo100M.
Update pillow to version 7.1.0 and get rid of dependabot alert.
I was working on fixing CI some weeks ago. For that purpose, I had set up interactive SSH debugging with `tmate`. Continue this.
Whenever I try to run an experiment I get the following prompt:
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:
By default, I would prefer if the user doesn't have to do anything. W&B could be inactive or use choice (3) above.
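One possible way around the prompt is to set `WANDB_MODE` before wandb is initialized (the accepted values have changed between wandb versions, e.g. "dryrun" in older releases and "offline"/"disabled" in newer ones), for example:

```python
import os

# Runs are stored locally and no login is requested.
os.environ.setdefault("WANDB_MODE", "offline")

import wandb
wandb.init(project="platalea")
```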
By doing so, we could use a smaller kernel and stride while keeping the same size or smaller input for the transformer layers.
We're getting different results with the same configuration settings, including seed, so possibly we are not setting all seeds. Check which ones they are. I suspect `torch.cuda.seed()` may be one.
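For reference, the usual checklist of seeds to set in a PyTorch project looks roughly like this (whether platalea already sets all of them is exactly what needs checking):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # also seeds the CUDA RNGs in recent PyTorch
    torch.cuda.manual_seed_all(seed)   # explicit, for older versions / multi-GPU
    # cuDNN can still introduce non-determinism unless these are set:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```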
On CI, the ASR experiment saves a file in a directory so that a subsequent test can reuse it. We should just save this file to the repo. This will allow for parallel running of the tests, and we can also check the result of the ASR test against the existing saved output network.
We should take a look at the coverage reports (now that they are fixed in #19) and figure out why some parts of the code are not covered by the experiment runs on CI. It seems for instance that in `encoders.py` there are a lot of unused models. Are they still used in some other dependent package or can we remove them?
Related to #41.
I would add:
We need to check that this is not going to take too much time, but it would cover the main experiments and architectures.
Currently, our custom config class does not fail when it encounters unknown parameters. This is mostly a legacy from the past, when we were using multiple parsing moments. Currently, the `parse_unknown_args` function is still used to do some complicated help-print delaying. I don't think we still need that either. So let's just switch to `parse_args`, so that the parser automatically checks for faulty arguments (see the illustration below).
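Standalone argparse illustration of the difference (our config class builds on argparse-style parsing, so the behavior should carry over):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--data_root")

# parse_known_args() tolerates unknown options and returns them separately...
args, unknown = parser.parse_known_args(["--data_root", "/data", "--typo_flag", "1"])
print(unknown)  # ['--typo_flag', '1'] - silently tolerated

# ...whereas parse_args() rejects them outright:
# parser.parse_args(["--data_root", "/data", "--typo_flag", "1"])
# -> error: unrecognized arguments: --typo_flag 1 (exits with status 2)
```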
The MFCC conversion routines could be checked against the torchaudio.transforms equivalent to see whether they match (possibly after tweaking parameters).
If they don't match, that might be interesting, because it could (a) hint at errors in either our or their implementation, or (b) mean that we have some interesting alternative algorithm to contribute (mentioned in point 3 of #41).
If they do match, that means we could consider replacing our audio folder altogether with the torchaudio implementation. This might be useful if we would want to switch to other available transforms in the torchaudio package in the future.
It could be that all this is not worth the effort, though. @bhigy what do you think?
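For the comparison, the torchaudio side would look roughly like this (parameter values are placeholders and would need to be tweaked to match our audio code):

```python
import torch
import torchaudio

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)
waveform = torch.randn(1, 16000)  # one second of fake audio
features = mfcc(waveform)         # shape: (channel, n_mfcc, frames)
print(features.shape)
```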
See #6 (comment)
We want to log more stuff, like number of layers, nodes per layer, etc.
Found this package, version-query, which may be useful, also for wandb logging.
I just noticed some additions in `basic.py` which rely on two config variables, `validate_on_cpu` and `score_on_cpu`. According to @cwmeijer, this was used to save some memory. It is not clear to me why that would be the case, as memory used during validation should be freed before training resumes. @egpbos, do you have more details on this?
In addition to satisfying my own curiosity, I would like to be sure we are not missing an issue in our memory management.
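A small diagnostic that could help settle the memory question (the numbers depend entirely on the run; note that `memory_reserved` often stays high after validation because of PyTorch's caching allocator, which might be what the CPU options work around):

```python
import torch

def report_gpu_memory(tag):
    # Allocated = tensors currently alive; reserved = memory held by the caching allocator.
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

# report_gpu_memory("before validation")
# ... run the validation step ...
# report_gpu_memory("after validation")
```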