
pytorch-lightning's People

Contributors

akihironitta, ananthsub, arnaudgelas, awaelchli, borda, carmocca, daniellepintz, dependabot[bot], duyicong515, edenlightning, ethanwharris, four4fish, jeremyjordan, jerome-habana, jjenniferdai, justusschock, kaushikb11, krishnakalyan3, krshrimali, mauvilsa, neggert, nicolai86, otaj, rohitgr7, s-rog, seannaren, skaftenicki, tchaton, victorprins, williamfalcon

pytorch-lightning's Issues

Proposal for help

Hi @williamFalcon! I saw your project and I am very pleased by the idea. I would like to help you write production-level code. Please let me know how I can help!

Streamlined UX in saving, loading, continue training.

Currently, each time I run the command with the same experiment name, a new version is created and trained from scratch.
If I want to load the model back and continue training, I have to create a different script and input the paths for the previous checkpoints and meta_tags.

Proposal:

  • Start training with only one log_path and one experiment_name as input. Test_tube data and saved models will be included in the same folder.
  • In each experiment, the default is to create a new model and train it from scratch (current behavior).
  • If the user passes --continue-training, load the latest model (not necessarily the best model) from the latest version, then create a new experiment version and continue training from there.
  • If the user passes --continue-training--best, load the best model from the latest version, then create a new experiment version and continue training from there.
  • If the user passes --continue-training(--best) --version=X, automatically load the model from version X and start training.

In the end, all paths and directories are determined only from the log_path and the experiment name, and the default behavior is the same as now:
python train.py --exp_name=exp1 --....
If the user wants to change some hyperparameters or start fine-tuning, they only need to add:
python train.py --exp_name=exp1 --.... --continue-training--best --version=x

What do you think?
Happy to discuss more, as I am not sure whether this is suitable for cluster training.
I can make a PR if you're interested.
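
As a concrete sketch of the flags above (spellings adapted slightly so argparse accepts them; the version_X folder layout and the best.ckpt/last.ckpt filenames are assumptions for illustration only):

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--exp_name', required=True)
    parser.add_argument('--log_path', default='lightning_logs')
    parser.add_argument('--continue-training', dest='continue_training', action='store_true')
    parser.add_argument('--continue-training-best', dest='continue_training_best', action='store_true')
    parser.add_argument('--version', type=int, default=None)
    return parser.parse_args()

def resolve_checkpoint(args):
    """Pick the checkpoint to resume from, using only log_path and exp_name."""
    if not (args.continue_training or args.continue_training_best):
        return None  # default behavior: train a new version from scratch
    exp_dir = os.path.join(args.log_path, args.exp_name)
    versions = sorted(
        int(d.split('_')[-1]) for d in os.listdir(exp_dir) if d.startswith('version_')
    )
    if not versions:
        return None
    version = args.version if args.version is not None else versions[-1]
    ckpt_name = 'best.ckpt' if args.continue_training_best else 'last.ckpt'
    return os.path.join(exp_dir, 'version_%d' % version, ckpt_name)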

ModuleNotFoundError: No module named 'demo'

simon:~/Desktop/pytorch-lightning/demo$ python fully_featured_trainer.py
Traceback (most recent call last):
File "fully_featured_trainer.py", line 20, in
from demo.example_model import ExampleModel
ModuleNotFoundError: No module named 'demo'
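
A common workaround, assuming the script lives in a demo/ package inside the repository root, is to run it from the repository root as a module (python -m demo.fully_featured_trainer), or to put the repository root on sys.path before the import. A sketch of the latter:

import os
import sys

# assumption: this file sits in <repo_root>/demo/, so the repo root is one directory up
repo_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, repo_root)

from demo.example_model import ExampleModel  # now resolvable as a package import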

Relax requirement for DistributedSampler with ddp

Is your feature request related to a problem? Please describe.
I have an application where I'm using a custom BatchSampler to construct batches for the N-Pairs metric learning loss. I need all of the data to be available on all processes when using DistributedDataParallel, so I wouldn't want to use DistributedSampler, even if it was compatible with a custom BatchSampler. Right now, I've hit a wall because lightning throws this exception:

pytorch_lightning.utilities.debugging.MisconfigurationException: 
when using multiple gpus and multiple nodes you must pass
 a DistributedSampler to DataLoader(sampler).

ie: this:
dataset = myDataset()
dataloader = Dataloader(dataset)

becomes:
dataset = myDataset()
dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = Dataloader(dataset, sampler=dist_sampler)

Describe the solution you'd like
Could this exception be turned into a warning? I'm all for letting the user know when they're violating best practices, but throwing an exception removes flexibility for advanced users.

Describe alternatives you've considered
I looked at using the dp backend, but that's not going to work because the n-pairs loss needs the entire batch to compute the loss. Splitting it into chunks breaks things.

If I'm understanding correctly, this is actually another limitation introduced by Lightning. In a usual DataParallel setting, the batch would be merged back together before computing the loss and everything would be fine.
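
A minimal sketch of the relaxation being requested, i.e. a warning in place of the exception (the helper function here is illustrative, not the actual Trainer internals):

import warnings
from torch.utils.data.distributed import DistributedSampler

def warn_if_not_distributed_sampler(dataloader, use_ddp):
    # let advanced users proceed with custom samplers, but tell them what they are doing
    if use_ddp and not isinstance(dataloader.sampler, DistributedSampler):
        warnings.warn(
            'Using multiple GPUs/nodes without a DistributedSampler: '
            'every process will see the full dataset. Make sure this is intentional.'
        )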

Update Lightning compatibility with PyTorch 1.2.0

Is your feature request related to a problem? Please describe.
PyTorch 1.2.0 has breaking changes for the experiment object, likely due to underlying changes to SummaryWriter.

For now, Lightning requires PyTorch 1.1.0, but compatibility needs to be updated.

Issue install the library

Hey, I wanted to give this library a try, so I did pip install pytorch-lightning, which gave the following error

C:\Users\cs>pip install pytorch-lightning
Collecting pytorch-lightning
  Using cached https://files.pythonhosted.org/packages/7e/3e/599dfe7b8c35ef9c72df4825d876c023fafe5e2618483ee3f3f2f4cdc3a9/pytorch-lightning-0.0.2.tar.gz
Collecting test-tube (from pytorch-lightning)
  Using cached https://files.pythonhosted.org/packages/3a/50/47ea5613be804c8e6e0b01b1719e1f8186b8bc626441002b141c8a962abb/test_tube-0.631.tar.gz
Collecting torch (from pytorch-lightning)
  Using cached https://files.pythonhosted.org/packages/5f/e9/bac4204fe9cb1a002ec6140b47f51affda1655379fe302a1caef421f9846/torch-0.1.2.post1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\cs\AppData\Local\Temp\pip-install-bufxsmuo\torch\setup.py", line 11, in <module>
        raise RuntimeError(README)
    RuntimeError: PyTorch does not currently provide packages for PyPI (see status at https://github.com/pytorch/pytorch/issues/566).

    Please follow the instructions at http://pytorch.org/ to install with miniconda instead.

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\cs\AppData\Local\Temp\pip-install-bufxsmuo\torch\

C:\Users\cs>

So I went to pytorch.org and got the following set of commands to install PyTorch on my system:

pip3 install https://download.pytorch.org/whl/cpu/torch-1.0.1-cp37-cp37m-win_amd64.whl
pip3 install torchvision

The first command generated an error

C:\Users\cs>pip3 install https://download.pytorch.org/whl/cpu/torch-1.0.1-cp37-cp37m-win_amd64.whl
torch-1.0.1-cp37-cp37m-win_amd64.whl is not a supported wheel on this platform.

C:\Users\cs>

Now the only option for me is to install PyTorch from source. I was wondering if we could provide pytorch-lightning as a Docker image: we would provide a template Dockerfile where people only need to provide the path to their test_python.py file. Is that a viable option?

My system: Windows 7, 32-bit, Python 3.7

Unable to import trainer

Hey, I am able to import pytorch_lightning but not the trainer. I am new to Python and have no idea how to deal with this. It throws the following error:

File "", line 1, in
ImportError: cannot import name Trainer

Thanks

change Checkpoint callback's `save_best_only` to `save_top_k`

Is your feature request related to a problem? Please describe.
save_best_only is a special case of save_top_k. However, save_top_k checkpoints can be used to create an ensemble model at test time.

Describe the solution you'd like
Keep a dict of {epoch: monitor} of length k; save any new checkpoint whose monitor value qualifies it to enter this dict, and remove the worst checkpoint.
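
A minimal sketch of that bookkeeping, assuming lower monitor values (e.g. val_loss) are better; the checkpoint file naming is just for illustration:

import os

class TopKCheckpoints:
    def __init__(self, k, dirpath):
        self.k = k
        self.dirpath = dirpath
        self.best = {}  # {epoch: monitor value}

    def update(self, epoch, monitor):
        """Return a path to save a checkpoint at, or None if this epoch is not in the top k."""
        path = os.path.join(self.dirpath, 'epoch_%d.ckpt' % epoch)
        if len(self.best) < self.k:
            self.best[epoch] = monitor
            return path
        worst_epoch = max(self.best, key=self.best.get)
        if monitor < self.best[worst_epoch]:
            del self.best[worst_epoch]
            os.remove(os.path.join(self.dirpath, 'epoch_%d.ckpt' % worst_epoch))
            self.best[epoch] = monitor
            return path
        return None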

Enable any ML experiment tracking framework

People seem to have strong preferences for either using MLFlow, test-tube, polyaxon, etc...

Let's just add generic support for whatever people want to use. I don't know if generic support is possible, but each can easily be supported individually.

To make this work we'd need to:

  • Change the logging to be non-test-tube-specific. Logging only happens in 2 places (train and validation completion).
  • Make each call to log process-safe, meaning that when using distributed training only rank 0 will log.
  • Generalize the experiment param in Trainer (keeping the signature the same) to take any logger.

I think that's all that's needed to add this support.
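
A rough sketch of what such a generic logger interface could look like (the class and method names here are illustrative only, not a finalized API):

from abc import ABC, abstractmethod

class GenericLogger(ABC):  # name chosen for illustration
    def __init__(self):
        self.rank = 0  # the Trainer would set this; only rank 0 actually writes

    @abstractmethod
    def log_metrics(self, metrics, step):
        """Write a dict of scalar metrics at a given global step."""

    @abstractmethod
    def log_hyperparams(self, params):
        """Record the hyperparameters of the run."""

    def log_metrics_if_rank_zero(self, metrics, step):
        # process-safe entry point for distributed training
        if self.rank == 0:
            self.log_metrics(metrics, step)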

Any suggestions and takers for working on this integration?
@Borda @alok

Trainer.fit() crashes if no checkpoint callback is provided

I hope it's okay that I keep posting issues...
Now that I can circumvent the GitHub installation issues, I pulled in the latest master and ran my simple CoolModel demo code. But now calling trainer.fit() crashes with:

AttributeError Traceback (most recent call last)
in
21 )
22
---> 23 trainer.fit(model)
24 # exp.close()

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/models/trainer.py in fit(self, model)
494 self.optimizers, self.lr_schedulers = self.optimizers
495
--> 496 self.__run_pretrain_routine(model)
497
498 # return 1 when finished

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/models/trainer.py in __run_pretrain_routine(self, model)
680
681 # restore training and model before hpc call
--> 682 self.restore_state_if_existing_checkpoint()
683
684 # enable cluster checkpointing

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/models/trainer.py in restore_state_if_existing_checkpoint(self)
261
262 # find last epoch
--> 263 checkpoints = os.listdir(self.checkpoint_callback.filepath)
264 for name in checkpoints:
265 # ignore hpc ckpts

AttributeError: 'NoneType' object has no attribute 'filepath'

Looking at the code, it appears to happen because I did not provide a checkpoint callback, and the trainer still tries to access it in restore_state_if_existing_checkpoint.
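
A minimal sketch of the kind of guard that would avoid the crash, treating a missing checkpoint callback as "nothing to restore" (my assumption about the intended behavior; the helper below is illustrative, not the actual Trainer method):

import os

def find_last_checkpoint(checkpoint_callback):
    """Return a checkpoint path to restore from, or None when checkpointing is disabled."""
    # guard: the Trainer may have been constructed without a checkpoint callback
    if checkpoint_callback is None or getattr(checkpoint_callback, 'filepath', None) is None:
        return None
    ckpt_dir = checkpoint_callback.filepath
    if not os.path.isdir(ckpt_dir):
        return None
    # ignore hpc checkpoints, mirroring the loop in the traceback above
    names = [n for n in os.listdir(ckpt_dir) if not n.startswith('hpc_')]
    if not names:
        return None
    # picks the lexicographically last name; real code would parse the epoch number
    return os.path.join(ckpt_dir, sorted(names)[-1])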

No real time Experiment logging

Currently I'm using your library in a simple setup

exp = Experiment(save_dir=save_dir)
trainer = Trainer(max_nb_epochs=1, experiment=exp)
trainer.fit(my_model)

on Google Colab. The folder default/version_0/tf/ gets created immediately, but sadly the tf experiment logs are only saved when the training finishes or I abort it with a KeyboardInterrupt. So I can't watch the training progress in TensorBoard. Do you have any suggestions on what to change to receive real-time updates?

pip installation using github repository incomplete

I tried to install pytorch-lightning using pip and the github repository.
Importing the module results in the following errors:

ModuleNotFoundError Traceback (most recent call last)
in
8 from torchvision import ops
9
---> 10 import pytorch_lightning as ptl
11 from pytorch_lightning import Trainer
12 from test_tube import Experiment

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/__init__.py in
----> 1 from .models.trainer import Trainer
2 from .root_module.root_module import LightningModule
3 from .root_module.decorators import data_loader

ModuleNotFoundError: No module named 'pytorch_lightning.models'

Following the error, I found out that there is no models folder under site-packages\pytorch_lightning.

codecov not updating

Awesome improvements to coverage and tests! thanks @Borda

Wondering what has to be done to update the badge now. I pushed a report from GPU coverage, but there are no updates.

@Borda

self-balancing architecture

This is a really awesome feature we're looking to add. Super hard problem too, if any ninjas want to try to tackle it :) (you'll be legendary haha).

Problem:
Some models are too big to fit in memory, so they can't use any of the distributed training currently available (even in PyTorch).

But... we can break up the model and put parts on each GPU. The trick, though, is to do it automatically, because doing this manually is a PITA (trust me, I spent weeks dealing with this haha).

Proposed solution:
A user hook in LightningModule where the user returns the modules they want balanced.

class MyModule(LightningModule):
    def __init__(self, ...):
        self.model_a = SomeModel()
        self.layer_1 = Linear(...)
        self.layer_2 = Linear(...)

    def forward(self, x):
        # in each of these module calls, auto place the input x on the gpu of the module
        x = self.model_a(x)
        x = self.layer_1(x)
        x = self.layer_2(x)
        return x

    def self_balance(self):
        return [self.model_a, self.layer_1, self.layer_2]

So the above does two cool things:

  1. user says how they want to break up the model.
  2. In the forward, we auto put the input on that module's GPU.

That's the easy part lol... the hard part is deciding how to balance: optimizing for speed so you minimize data transfer across GPUs while not blowing up the RAM, and using the RAM efficiently.
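
For the second point, a tiny sketch of the "auto place the input x on the gpu of the module" idea (the wrapper below is my illustration, not a proposed API):

import torch
from torch import nn

class DevicePlaced(nn.Module):
    """Wrap a submodule and move incoming tensors to its device before calling it."""

    def __init__(self, module, device):
        super().__init__()
        self.module = module.to(device)
        self.device = device

    def forward(self, x):
        return self.module(x.to(self.device))

# usage sketch: shard three submodules across two GPUs
# model_a = DevicePlaced(SomeModel(), torch.device('cuda:0'))
# layer_1 = DevicePlaced(nn.Linear(128, 64), torch.device('cuda:0'))
# layer_2 = DevicePlaced(nn.Linear(64, 10), torch.device('cuda:1'))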

Anyone want to give this a shot?

Arbitrary lr_scheduler?

Currently the only learning rate scheduler supported is MultiStepLR, specified through the params of the Trainer() constructor.
What do you think about a more flexible approach to the lr scheduler, maybe an optional user-defined function passed to Trainer(), similar to configure_optimizers?
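
A sketch of the kind of hook being suggested; the configure_lr_scheduler name and the lr_scheduler_fn Trainer argument are hypothetical, not existing Lightning options:

import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_lr_scheduler(optimizer):
    # hypothetical user-defined hook: return any torch scheduler for the given optimizer
    return CosineAnnealingLR(optimizer, T_max=50)

# usage sketch; an lr_scheduler_fn argument to Trainer() is hypothetical:
# trainer = Trainer(lr_scheduler_fn=configure_lr_scheduler, ...)
model = nn.Linear(10, 1)
scheduler = configure_lr_scheduler(torch.optim.SGD(model.parameters(), lr=0.1))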

Incorrect Implementation for Accumulating Batch Gradients in Trainer

Current Behavior:
If accumulate_grad_batches is greater than the default of 1, the Trainer takes the loss from each batch and runs loss.backward() for each accumulated batch, running optimizer.step() once the desired number of batches has undergone backprop.
Loss averaging is only done for batch_loss_value.

Correct Behavior:
The loss from the output needs to be divided by accumulate_grad_batches before loss.backward() is run; otherwise the overall magnitude of the gradient could be up to N times greater for a simulated batch size that is N times bigger than the actual one.
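
A minimal sketch of the proposed fix, scaling each loss before backprop so the accumulated gradient matches that of a single batch N times larger (the training loop below is illustrative, not the Trainer's internals):

import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 4

# fake data: 8 mini-batches of 16 samples each
batches = [(torch.randn(16, 10), torch.randn(16, 1)) for _ in range(8)]

for i, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y)

    # scale the loss so the summed gradients equal the gradient of the mean loss
    # over the simulated (N x larger) batch
    (loss / accumulate_grad_batches).backward()

    if (i + 1) % accumulate_grad_batches == 0:
        optimizer.step()
        optimizer.zero_grad()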

revert to absolute imports

Recent relative imports are causing issues. In addition, PEP 8 recommends absolute imports for clarity.

Let's go back to absolute imports.

Predict method for test set

The main LightningModule requires defining the test_dataloader function, but I'm not able to find any method that takes the test loader as input. Is there a model.predict() method to call on the test set?

0.4.0 release - final checks (releasing later today)

Want to release the last 2 core required features we were missing:

  1. continue training (and session) from checkpoint (added).
  2. 16-bit with single GPU and no DP or DDP (added).

any other stability things to consider before releasing? (3.7)

Nitpick: `ptl` may be better as `pl`

(Feel free to ignore.)

All the usage examples do import pytorch_lightning as ptl. Instead of ptl, pl may be better: it doesn't clash with any library I know of, is 2 characters like NumPy's np, and is harder to mistype as plt, which many researchers probably also have imported. Since the library is in its early days, I don't think it would be that dramatic a change, and it is a little easier to read for people like me who often mix up letters like that.

On the other hand, it's pretty clear from context that it's not matplotlib, it is yet another change, and it is an aesthetic choice at its root, so it may not be worth it.

Enable multiple dataset in validation_step

Allow validation step to use multiple datasets.

We also need to decide how it will be called (especially how to handle the case where the two datasets aren't the same length).

Option A:

for batch_a in dataset_a:
    model.validation_step(batch_a, batch_nb, dataset_index)

for batch_b in dataset_b:
    model.validation_step(batch_b, batch_nb, dataset_index)

Option B:

for batches in zip(dataset_a, dataset_b):
    # use both
    # new dynamic signature validation_step(batch_a, batch_b, batch_n, batch_nb)
    model.validation_step(*batches, batch_nb)

Option C:
(I think this is the only really generic way.) The user would have to make sure the datasets are the same length, or be OK with iterating only as far as the shortest one.

for batches in zip(dataset_a, dataset_b, ..., dataset_n):
    # use all of them
    # new dynamic signature validation_step(batch_a, batch_b, ..., batch_n, batch_nb)
    model.validation_step(*batches, batch_nb)

@cinjon @ppwwyyxx

AttributeError: 'TTNamespace' object has no attribute 'drop_prob'

Got the following error while running the Demo examples:

python single_gpu_node_template.py --gpus "0,1,2,3"

Traceback (most recent call last):
File "single_gpu_node_template.py", line 112, in <module>
main(hyperparams)
File "single_gpu_node_template.py", line 33, in main
model = LightningTemplateModel(hparams)
File "/home/dgueraco/projects/pytorch-lightning/pytorch_lightning/examples/new_project_templates/lightning_module_template.py", line 37, in __init__
self.__build_model()
File "/home/dgueraco/projects/pytorch-lightning/pytorch_lightning/examples/new_project_templates/lightning_module_template.py", line 49, in __build_model
self.c_d1_drop = nn.Dropout(self.hparams.drop_prob)
AttributeError: 'TTNamespace' object has no attribute 'drop_prob'

pip install -e . crash

Ran:

pip install -e .

Obtaining file:///Users/williamfalcon/Developer/opensource/pytorch-lightning
Complete output from command python setup.py egg_info:
error in pytorch-lightning setup command: ("EntryPoint must be in 'name=module:attrs [extras]' format", 'pytorch-lightning=pytorch-lightning.cli:main')

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /Users/williamfalcon/Developer/opensource/pytorch-lightning/

@shreyasbapat

Cannot load saved model.

I cannot load the model back after saving it, using load_from_metrics:

Traceback (most recent call last):
  File "eval-metric.py", line 60, in <module>
    main(hyperparams)
  File "eval-metric.py", line 32, in main
    on_gpu=False, map_location=None)
  File "/opt/anaconda3/lib/python3.7/site-packages/pytorch_lightning/root_module/root_module.py", line 112, in load_from_metrics
    model = cls(hparams)
  File "/home/jupyter/kaggle-CellSignal/arcface_module.py", line 74, in __init__
    super().__init__(hparams)
  File "/opt/anaconda3/lib/python3.7/site-packages/pytorch_lightning/root_module/root_module.py", line 12, in __init__
    super(LightningModule, self).__init__(*args, **kwargs)
TypeError: __init__() takes 1 positional argument but 2 were given

Really don't know what went wrong.
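
Judging from the traceback, hparams gets forwarded up through LightningModule.__init__ to a parent that takes no extra positional arguments. One likely fix, sketched here without having seen arcface_module.py, is to stop forwarding it and just store it:

import pytorch_lightning as ptl

class MyCoolModule(ptl.LightningModule):  # stand-in for the module in the traceback
    def __init__(self, hparams):
        super().__init__()      # do not forward hparams up the constructor chain
        self.hparams = hparams  # keep it around for building the model / reloading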

Quantisation and Pruning Support

Is your feature request related to a problem? Please describe.
Nowadays, there is a need to take the floating point models that have been trained and deploy them to edge devices. One popular way is to quantise the weights and activations of a neural network to a lower bit width (e.g. 8 bits or even 4 bits). The benefits of this are twofold:

  1. Some accelerators perform computation at lower bit widths much faster than fp16 or fp32 computation.
  2. The model takes less space, and the savings increase by a substantial factor every time we reduce a bit from the tensor data type.

People have tried other means to compress a model; one of them is pruning.
Pruning basically means that some of the weights of a neural network are set to zero, so we seek to introduce sparsity into the network.

The benefit is that you potentially do not have to perform the useless multiplications with zeros, which provides a potential computation saving. Research has shown that even after pruning ~80% of weights (this is fine-grained pruning), the network preserves its accuracy. This is a very surprising result. Coarse-grained pruning (setting all weights of a channel to zero) also works to an extent but results in significantly more accuracy loss. This is an active research area.

Describe the solution you'd like
Quantisation generally works through a scale value and a zero-point value, so each quantised tensor needs to carry the quantised values, its scale and its zero point. The scale and zero point are needed to convert between quantised and dequantised tensors.

There are 2 ways to quantize a model:

  1. Post-training quantisation: quantises a trained model, no retraining required (works well down to 8 bits).
  2. Quantisation-aware training: a way to train a model to induce robustness to quantisation (it works well for aggressive quantisation schemes, down to 4 bits).

I have successfully implemented the post-training quantisation algorithms and was able to get a quantised MNIST model down to 8 bits with next to no accuracy loss. Going down to 4 bits resulted in the model diverging. I am currently working on quantisation-aware training. If you want to see how post-training quantisation works, please check out this Google Colab notebook.
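
A minimal sketch of the affine scale/zero-point mapping described above, assuming an unsigned 8-bit target range:

import torch

def quantize(x, num_bits=8):
    """Affine-quantise a float tensor to unsigned ints using a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)          # assumes x is not constant
    zero_point = int(round(qmin - (x.min() / scale).item()))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.float() - zero_point)

x = torch.randn(4, 4)
q, scale, zp = quantize(x)
x_hat = dequantize(q, scale, zp)  # close to x, within roughly one quantisation step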

Now, let's come to pruning:

Pruning is a very general thing; there could be a lot of ways to perform it. As far as I know, there is generally a "pruning schedule": the researcher decides when to prune what percentage of weights (aka the degree of sparsity of the layer). They could prune some layers and leave others as is, slowly increasing the sparsity degree of the pruned layers over the course of training. There are also different types of pruning: a structured way to prune weights (e.g. take off full channels of a conv kernel, or reduce a dimension of a fully connected layer by 1) or an unstructured way to prune (zero out individual weights).
Lightning could potentially offer both a structured and an unstructured way to prune, to help out researchers. If you would like to see pruning in action, I have tried pruning on an MNIST model using the algorithm from the Google paper "To Prune or not to Prune". It is unstructured pruning with 90% sparsity, and I was able to reach roughly the same accuracy as the un-pruned model. This is the Google Colab link for it.
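
And a correspondingly small sketch of unstructured magnitude pruning, zeroing the smallest-magnitude weights until a target sparsity is reached:

import torch

def magnitude_prune(weight, sparsity):
    """Return a copy of `weight` with the smallest-magnitude entries set to zero."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask

w = torch.randn(64, 64)
w_pruned = magnitude_prune(w, sparsity=0.9)  # roughly 90% of entries are now zero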

Describe alternatives you've considered
Right now PyTorch doesn't have quantisation and pruning support; however, that is in the works. We could either wait for them to complete their work, or we could implement a small library ourselves.

The use case I was trying to target is that Lightning could become a playground where researchers test out quantisation and pruning on their models and potentially implement novel algorithms on top of its base support.

Additional context
If any of you want to learn more about quantization, I have embedded the resources I learnt from below. They were indeed invaluable.

Jacob Benoit et al’s Quantisation Paper (Google)
Raghuraman’s Paper on Quantisation (Google, he’s now at Facebook)
Distiller Docs on Quantisation
Gemmlowp’s Quantisation Tutorial

Is it possible to make `validation_step` and `val_dataloader` no-ops?

Is your feature request related to a problem? Please describe.
Sometimes I don't have a separate validation split, only a train/test split. I'm trying out pytorch-lightning to prototype / experiment, and trying to see what the best way of doing this is.

I could make the train dataset and then do torch.utils.data.random_split or use torch.utils.data.SubsetRandomSampler to build a validation set as well, but if I don't have enough data (or just don't want to do a separate validation step) this isn't ideal.

Describe the solution you'd like
I'd like to be able to implement only the training_step, train_dataloader, and test_dataloader methods and then have the validation step and validation metrics be omitted (maybe explicit no-ops). Right now, I'm experimenting with having an empty DataLoader for the validation data.

Describe alternatives you've considered

  • Implement val_dataloader with an empty (dummy) DataLoader
    • Not sure if this will work yet (if lightning will still call validation_step and validation_end).
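
A minimal sketch of the "empty DataLoader" workaround mentioned above; whether Lightning then skips validation_step and validation_end cleanly is exactly the open question:

import torch
from torch.utils.data import DataLoader, TensorDataset

# a dataset with zero samples yields zero batches, so a validation loop has nothing to do
empty_val_loader = DataLoader(TensorDataset(torch.empty(0, 1)), batch_size=1)

for batch in empty_val_loader:
    pass  # never executes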

Consider: ability to set seed

I dunno if this is in scope (feel free to close if not), but when experimenting, setting a fixed seed is handy since you can remove one source of randomness (Karpathy's recipe even includes it as an important beginning step).

Basically, being able to set the seeds for the random, numpy, torch, and other common modules in the config would be handy.
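
A small sketch of the kind of helper being asked for (the name seed_everything is just illustrative):

import os
import random

import numpy as np
import torch

def seed_everything(seed=42):
    """Seed the common sources of randomness so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)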

Add support for ReduceLROnPlateau

Is your feature request related to a problem? Please describe.
As of now it does not seem possible to use ReduceLROnPlateau, since a metric has to be passed to the step method of the lr_scheduler.

Describe the solution you'd like
A possibility to use ReduceLROnPlateau on some or any of the metrics calculated during training or validation.

Describe alternatives you've considered
In my use case I want to do the step based on a metric calculated on the validation set. As a workaround, I define the lr_scheduler in the __init__ of the model and perform the step in the validation_end function.
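
A sketch of that workaround: the scheduler is created up front and stepped manually with the validation metric (val_loss as the monitored quantity is an assumption):

import torch
from torch import nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# created up front (in the model's __init__ in the workaround above)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=3)

# ...and stepped manually with the validation metric at the end of each validation run
val_loss = 0.25  # stand-in for the metric computed in validation_end
scheduler.step(val_loss)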

Adding visualization module

Would you consider adding visualization abilities? For example, a TensorBoard utility to visualize the validation curve, scalar changes, etc.

codecov doesn't respect ignore

Need to add some ignored files to codecov... but it seems not to care about the ignore part:

coverage:
  precision: 0  # 2 = xx.xx%, 0 = xx%
  round: nearest # how coverage is rounded: down/up/nearest
  range: 40...100 # custom range of coverage colors from red -> yellow -> green
  status:
    # https://codecov.readme.io/v1.0/docs/commit-status
    project:
      default:
        against: auto
        target: 99% # specify the target coverage for each commit status
        threshold: 20% # allow this little decrease on project
        # https://github.com/codecov/support/wiki/Filtering-Branches
        # branches: master
        if_ci_failed: error
    # https://github.com/codecov/support/wiki/Patch-Status
    patch:
      default:
        against: auto
        target: 40% # specify the target "X%" coverage to hit
        # threshold: 50% # allow this much decrease on patch
    changes: false
    ignore:
      - "pytorch_lightning/utilities/arg_parse.py"
      - "raise *"

@Borda

how to setup slurm in a cluster

This is a great repo wrapping PyTorch for ddp use, especially with SLURM, for which there are few repos. I found SLURM support in mmdetection as well.

In my group, we have 3 nodes, each with 4 GPUs.
I want to set up a SLURM cluster to make full use of these nodes, but little documentation can be found.
So could you please share a tutorial on setting up SLURM on a cluster? My nodes are all Ubuntu 18.04 servers.

DDP support on Jupyter Notebook

I'm trying to get ddp, fp16, etc. running; however, just setting the gpus and distributed_backend='ddp' makes the CoolModel demo crash on my machine. I'm running it from a Jupyter notebook with Python 3.6 on an Ubuntu 18.04 machine with 2xV100.
The errors directly in the notebook are:

Exception Traceback (most recent call last)
in
69
70 # train (1 epoch only here for demo)
---> 71 trainer.fit(model)
72
73 # view tensorflow logs

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/models/trainer.py in fit(self, model)
425 """
426 warnings.warn(msg)
--> 427 mp.spawn(self.ddp_train, nprocs=len(self.data_parallel_device_ids), args=(model, ))
428
429 # 1 gpu or dp option triggers training using DP module

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon)
165
166 # Loop on join until it returns True or raises an exception.
--> 167 while not spawn_context.join():
168 pass

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
106 raise Exception(
107 "process %d terminated with exit code %d" %
--> 108 (error_index, exitcode)
109 )
110

Exception: process 0 terminated with exit code 1

and in the jupyter server output window:

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'CoolModel' on <module '__main__' (built-in)>

I installed apex as described in the repository.

moving examples out of the package

Hello, nice piece of work. I was wondering if it would be easier to have the examples outside the package (more intuitive to find, and it keeps the package simple), as well as all the tests?
(e.g. pytorch-lightning/pytorch_lightning/testing_models/lm_test_module.py)

How to set hyperparameters search range and run the search?

Thanks for your powerful project; it's really helpful. I'd like to try it for my current research if everything goes well.

My problem is: how do I set the hyperparameter search range and run the search? I've read the chapters 'CPU hyperparameter search' and 'Running grid search on a cluster' in your documentation; however, it is not very clear, as there are only a few lines of code in the 'CPU hyperparameter search' chapter without explanation (and main_local appears in the code without being declared).

Here is my attempt at changing LightningTemplateModel and single_cpu_template.py to be able to perform a hyperparameter search:

  1. set tunable=True for some params in def add_model_specific_args(parent_parser, root_dir) in LightningTemplateModel, e.g., parser.opt_list('--learning_rate', default=0.001*8, type=float, options=[0.0001, 0.0005, 0.001, 0.005], tunable=True)
  2. comment out main(hyperparams) and add hyperparams.optimize_parallel_cpu(main, nb_trials=20, nb_workers=1)

However, it doesn't seem to work. So how can I set the hyperparameter search range and run the search?

Sorry if my presentation is unclear (I'm not a native speaker). Thanks.

Returning None in validation_end method raises error

Hey,
If we define a validation_end method like

    def validation_end(self, outputs):
        return

it is going to raise an error:

AttributeError: 'NoneType' object has no attribute 'items'

Is this intended? If not, shouldn't this part of the code initialize the metrics dict
https://github.com/williamFalcon/pytorch-lightning/blob/018b8da50e90638e8aa8d3eda1f8637656c25f2d/pytorch_lightning/models/trainer.py#L987

like here

https://github.com/williamFalcon/pytorch-lightning/blob/018b8da50e90638e8aa8d3eda1f8637656c25f2d/pytorch_lightning/models/trainer.py#L886
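
A minimal sketch of the guard being suggested, treating a None return from validation_end the same as an empty metrics dict:

def to_metrics_dict(validation_end_output):
    # treat a None return from validation_end as "no metrics", not an error
    return validation_end_output if validation_end_output is not None else {}

for name, value in to_metrics_dict(None).items():
    pass  # nothing to iterate, and no AttributeError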

Training accuracy

I was wondering whether there is something like validation_end but for training (e.g., training_end). I want to compute the training accuracy at the end of each epoch. Thanks!

Allow optimizers to alternate at arbitrary intervals

For GANs or similar approaches, we may want optimizer A to step every batch while optimizer B might step every k batches.

This feature will enable this behavior.

Approach still needs to be scoped out. Open to suggestions here.
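
A rough sketch of the desired behavior, with optimizer A stepping every batch and optimizer B every k batches (the manual loop and GAN-style names stand in for whatever hook Lightning ends up exposing):

import torch
from torch import nn

generator = nn.Linear(10, 10)
discriminator = nn.Linear(10, 1)
opt_a = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
k = 5  # step optimizer B once every k batches

for batch_nb in range(20):
    x = torch.randn(8, 10)

    # optimizer A steps on every batch
    loss_a = generator(x).pow(2).mean()
    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    # optimizer B steps only every k batches
    if (batch_nb + 1) % k == 0:
        loss_b = discriminator(x).pow(2).mean()
        opt_b.zero_grad()
        loss_b.backward()
        opt_b.step()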

Dataset only available when the trainer is instantiated

Is your feature request related to a problem? Please describe.
This is half feedback, half feature request. Maybe our approach is not right, but here is what we felt when trying this awesome library:

We would like to use a LightningModule in our pipelines, but we have some constraints which make this difficult.

We have an experiment framework where we can register models (e.g. a LightningModule) by instantiating them. Then the framework trains the various models using train/val/test data specified at runtime and generates performance reports.

Pseudo code:

class TorchModel:
  def fit(self, x_train, y_train, x_val, y_val):
    trainer = Trainer(...)
    trainer.fit(self.module)

models = [
  ModelA(...),
  TorchModel(module=CoolModel()),  # TorchModel is actually a wrapper which exposes a common interface to Sklearn/Keras/Torch models
]

experiment_runner = Runner(models)
experiment_runner.run(train_dataset, val_dataset, test_dataset)

Or Uber's Ludwig would do:

from ludwig.api import LudwigModel

# train a model
model_definition = {...}
model = LudwigModel(model_definition)
train_stats = model.train(training_dataframe)

Describe the solution you'd like
For us, the datasets / input tensors don't belong to the definition of the module. We understand that it improves reproducibility, but it may reduce the portability of models.

They should probably be provided to the trainer at instantiation:

Trainer(train_dataset=..., val_dataset=...)

# And maybe
class CoolModel(pl.LightningModule):
    ...

    @pl.data_loader
    def tng_dataloader(self, dataset):
        return DataLoader(dataset, batch_size=32)

   ...

Describe alternatives you've considered
A temporary solution could be:

class TorchModel:
  def fit(self, x_train, y_train, x_val, y_val):
    self.module.set_train_dataset(x_train, y_train)
    self.module.set_val_dataset(x_val, y_val)
    trainer = Trainer(...)
    trainer.fit(self.module)

Additional context
Thanks for creating this library, this makes pytorch so much easier to use!
