
ranger21's Introduction

Ranger21 - integrating the latest deep learning components into a single optimizer

A rewrite of the Ranger deep learning optimizer to integrate newer optimization ideas and, in particular:

  • Uses the AdamW optimizer as its core (or, optionally, MadGrad)
  • Adaptive gradient clipping
  • Gradient centralization
  • Positive-Negative momentum
  • Norm loss
  • Stable weight decay
  • Linear learning rate warm-up
  • Explore-exploit learning rate schedule
  • Lookahead
  • Softplus transformation
  • Gradient Normalization

You can find a full description of most of our algorithm in the Ranger21 paper (only Softplus and Gradient Normalization were added after the paper). Researchers and library authors desiring to port the code might also be interested in the Flax implementation which was written with a focus on readability.

Installation

Until this is up on PyPI, it can either be installed by cloning the repository:

git clone https://github.com/lessw2020/Ranger21.git
cd Ranger21
python -m pip install -e .

or installed directly from GitHub:

python -m pip install git+https://github.com/lessw2020/Ranger21.git
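
A minimal usage sketch (toy data; the argument values are illustrative, not a recommendation). num_epochs and num_batches_per_epoch are required so the optimizer can build its internal warmup/warmdown schedule, as described in the updates below:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from ranger21 import Ranger21

# toy data purely for illustration
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
train_loader = DataLoader(dataset, batch_size=32)
model = torch.nn.Linear(10, 2)

optimizer = Ranger21(
    model.parameters(),
    lr=1e-3,
    num_epochs=60,                              # planned number of epochs
    num_batches_per_epoch=len(train_loader),    # iterations per epoch
    weight_decay=1e-4,
)

for epoch in range(60):
    for xb, yb in train_loader:
        loss = F.cross_entropy(model(xb), yb)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()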

History of the project and latest evolutions

Ranger, with its RAdam + Lookahead core, is now approaching two years old.
(Original publication, Aug 2019: New deep learning optimizer Ranger.)
In the interim, a number of new developments have happened including the rise of Transformers for Vision.

Thus, Ranger21 (as in 2021) is a rewrite with multiple new additions reflecting some of the most impressive papers of the past year. The focus for Ranger21 is that these internals are parameterized and, where possible, automated, so that you can easily test and leverage some of the newest concepts in AI training and tune the optimizer for your respective dataset.

Full Run on ImageNet in progress - results so far (going to 60 epochs, Ranger21 started later):
(chart: Ranger21 vs. Adam on ImageNet)

Latest Simple Benchmark comparison (Image classification, dog breed subset of ImageNet, ResNet-18):

Ranger21 (7/10/21 version):
Accuracy: 76.63% Validation Loss: 14.42

Adam:
Accuracy: 64.84% Validation Loss: 17.19

Net result: 18.18% (relative) higher accuracy with Ranger21 vs. Adam for the same number of training epochs.

Ranger21 Status:

July 10: 3 new improvements to Ranger21. Three new items have been added to Ranger21 after testing on the sub-ImageNet benchmark:

  1. Gradient Normalization - this extends the gradient centralization concept by normalizing the gradient by its standard deviation (whereas gradient centralization subtracts the mean). On ImageNet it produces faster convergence in the first 20 or so epochs. (A rough sketch of this and the softplus transform follows this list.)
  2. Softplus transform - running the final variance denominator through the softplus function lifts extremely small values to keep them viable. This helps refine the training updates and, in testing on our sub-ImageNet benchmark, set a new high in accuracy and validation loss. (Usage: softplus=True is the default; set it to False at init to turn it off.) Please see https://arxiv.org/abs/1908.00700 for the original paper.
  3. Adaptive clipping now supports unlimited dimensions - some users were hitting issues running with 3D or 4D convolutions. Ranger21 now handles tensors of any dimensionality with this update.
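
For illustration, here is a rough sketch of the two ideas in isolation (the helper names and the beta value are illustrative, not Ranger21's exact internals):

import torch
import torch.nn.functional as F

def normalize_gradient(grad, eps=1e-8):
    # gradient normalization: divide by the gradient's standard deviation,
    # complementing gradient centralization (which subtracts the mean)
    return grad / grad.std().clamp_min(eps)

def softplus_denom(denom, beta_softplus=50):
    # softplus transform: lift extremely small values in the variance
    # denominator to a viable floor instead of letting them blow up the update
    return F.softplus(denom, beta=beta_softplus)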

June 25: arXiv paper nearly ready, back to work on Ranger21 after that! The paper is in review and should be published on arXiv next week. Once that is done, work on Ranger21 will resume - including the tutorial notebook.

May 16 - ImageNet training finished, finishing paper, updated Ranger21 with an off-by-one iteration fix and a new show_schedule() feature:

  • ImageNet runs have finished, and the arXiv paper should hopefully be ready in the next week or so.
  • Big thanks to @zsgj-Xxx for finding that the warmup ended up one iteration short of the target lr. A fix has been checked in.
  • To make it easier to see the lr schedule, a new show_schedule() has been added that shows a pyplot image directly, along with the start/max/min values for the schedule. This info was already available via the tracking_lr list, but you'd have to pull the data and then plot it manually. Now it's even easier: train, and then make a single-line call:
optimizer.show_schedule() 

to quickly view the full schedule and key values. (screenshot: show_schedule() output)

May 1 PM - Multiple ImageNet runs in progress, updated Ranger code checked in: Multiple ImageNet runs are in progress to prep for a Ranger21 paper. The base comparison is simply Adam vs. Ranger21 on ImageNet with a ResNet-50. Ranger21 started later but has already matched Adam with half the epochs... the plan is to run to 60 epochs each.
(chart: Ranger21 vs. Adam on ImageNet)


In addition, a BN-free (no batch norm) ResNet-50 is being trained as an additional comparison. Of interest: even after 4 restarts, Adam was unable to get more than 3 epochs in on the NormFree ResNet-50. By comparison, Ranger21 is doing well, which already shows the improved resilience of training with Ranger21.

(chart: Adam vs. Ranger21 on NF-ResNet-50)

  • Ranger21 code updates - due to firsthand experience, safety guards have been added for the case where the num_epochs set for Ranger21 does not match the actual epochs being run, and the linear warmdown code has been updated to be simpler and to never go below the designated min_lr (defaults to 3e-5).
    If there is an epoch mismatch between the num_epochs passed to the optimizer and the actual run, this will start to print a lot of text to alert you on each iteration, but the lr itself will now be automatically guarded and will not go below the min_lr.
    (screenshot: epoch mismatch warning)

April 27 PM - Ranger21 now training on ImageNet! Starting work on benchmarking Ranger21 on ImageNet. Due to cost, training will go to 60 epochs on ImageNet and be compared with the same setup for 60 epochs using Adam, to have a basic "gold standard" comparison. Training is underway now.

April 26 PM - added smarter auto warmup based on Dickson Neoh's report (tested with only 5 epochs), and first pip install setup thanks to @BrianPugh!
The warmup structure for Ranger21 is based on the paper by Ma/Yarats, which uses the beta2 param to compute the default warmup. However, that assumes a longer training run. @DNH on the fastai forums tested with 5 epochs, which meant training never got past the warmup phase.
Thus, a check has been added for the warmup % relative to the total training time; it will automatically fall back to 30% (settable via warmup_pct_default) to account for shorter training runs. A sketch of the heuristic follows.
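
The following is illustrative only (the rule-of-thumb constant and the function name are assumptions of this sketch, not Ranger21's exact code):

import math

def auto_warmup_iterations(beta2, total_iterations, warmup_pct_default=0.30):
    # Ma/Yarats rule of thumb: roughly 2 / (1 - beta2) linear warmup steps
    # (about 2000 iterations for the usual beta2 = 0.999)
    untuned = math.ceil(2.0 / (1.0 - beta2))
    # for short runs, fall back to a fixed percentage of the whole run
    if untuned > warmup_pct_default * total_iterations:
        return int(warmup_pct_default * total_iterations)
    return untuned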

  • First pip install for Ranger21, thanks to @BrianPugh! In the next week or two the focus will be on making Ranger21 easier to install and use rather than adding new optimizer features, and thanks to @BrianPugh we're already underway with a basic pip install:
git clone https://github.com/lessw2020/Ranger21.git
cd Ranger21
python -m pip install -e .

or installed directly from GitHub:

python -m pip install git+https://github.com/lessw2020/Ranger21.git
April 25 PM - added guard for potential key error issue: An update has been checked in to add an additional guard to prevent a key error reported earlier today during the lookahead step. This should correct it, but since it could not be reproduced locally, please update to the latest code and raise an issue if you encounter this. Thanks!

April 25 - Fixed warmdown calculation error, moved to linear warmdown, new high in benchmark: Found that there was an error in the warmdown calculations. It has been fixed, and the schedule also moved to a linear warmdown. This resulted in another new high for the simple benchmark, with results now moved above so they don't get lost in the updates section.
Note that the warmdown now decays from the full lr down to the minimal lr (defaults to 3e-5), rather than declining to 0 as before.

Note that you can display the lr curves directly by simply using:

lr_curve = optimizer.tracking_lr
plt.plot(lr_curve)

Ranger21 internally tracks the lr per epoch for this type of review. Additional updates include a 'clear_cache' to reset the cached lookahead params; the lookahead processing has also been moved to its own function, and some naming conventions were cleaned up. Going forward, item_active=True/False will be used rather than the prior using_item=True/False, to keep the code simpler now that item properties are alphabetically grouped rather than cluttered into the using_item layout.
April 24 - New record on benchmark with NormLoss, Lookahead, PosNeg momentum, Stable decay, etc. all combined: NormLoss and Lookahead integrated into Ranger21 set a new high on our simple benchmark (ResNet-18, subset of ImageWoof).
Best Accuracy = 73.41 Best Val Loss = 15.06

For comparison, using plain Adam on this benchmark:
Adam Only Accuracy = 64.84 Best Adam Val Loss = 17.19

In other words, 12.5%+ (relative) higher accuracy atm for the same training epochs by using Ranger21 vs. Adam.

Basically, it shows that the integration of all these various new techniques is paying off, as combining them currently delivers better results than any single one of them added to Adam.

New code checked in - adds Lookahead and, of course, Norm Loss. The settings are also now viewable via .show_settings() as an easy way to check them.
(screenshot: show_settings() output)

Given that the extensive settings may become overwhelming, the plan is to create config-file support to make it easy to save out settings for various architectures, and ideally to have 'best settings' recipes for CNNs, Transformers for image/video, GANs, etc.

April 23 - Norm Loss will be added, initial benchmarking in progress for several features: A new soft regularizer, norm loss, was recently published in this paper on arXiv: https://arxiv.org/abs/2103.06583v1

It's in the spirit of weight decay, but approaches it in a unique manner by nudging the weights towards the oblique manifold. This means that, unlike weight decay, it can actually push smaller weights up towards unit norm, whereas weight decay only pushes down. Their paper also shows norm loss is less sensitive to hyperparameters such as batch size than regular weight decay.

One of the lead authors was kind enough to share their TF implementation, and it has been reworked into PyTorch form and integrated into Ranger21. Initial testing set a new best validation loss on my very basic benchmark. Thus, norm loss will be available with the next code update. A rough sketch of the idea follows.
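
This sketch only illustrates the concept (a direct penalty-gradient step toward unit norm); it is not the code that went into Ranger21, and the function name and factor are assumptions:

import torch

def norm_loss_step(weight, lr, norm_loss_factor=1e-4, eps=1e-8):
    # penalty 0.5 * (1 - ||w||)^2 has gradient (||w|| - 1) * w / ||w||,
    # so weights below unit norm are pushed up and weights above are pulled down
    norm = weight.norm(p=2).clamp_min(eps)
    weight.sub_(weight * (lr * norm_loss_factor * (norm - 1.0) / norm))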

Also did some initial benchmarking to set vanilla Adam as a baseline, plus ablation-style testing with positive-negative momentum. Pos-neg momentum alone is a big improvement over vanilla Adam, and I'm looking forward to mapping out the contributions and synergies between all of the new features being rolled into Ranger21, including norm loss, adaptive gradient clipping, GC, etc.

April 18 PM - Adaptive gradient clipping added, thanks to a suggestion and code from @kayuksel. AGC is used in NFNets to replace BN. For our use case here, it provides a smarter gradient clipping algorithm than the usual hard clipping, and should ideally better stabilize training.

Here's how the Ranger21 settings output looks atm: (screenshot: Ranger21 settings output)

April 18 AM - Chebyshev fractals added, cosine warmdown (cosine decay) added
Chebyshev performed reasonably well, but still needs more work before it can be recommended, so it defaults to off atm. There are two papers providing support for using Chebyshev, one of which is: https://arxiv.org/abs/2010.13335v1
Cosine warmdown has been added, so the default lr schedule for Ranger21 is a linear warmup, a flat run at the provided lr, and then cosine decay of the lr starting at the X% passed in (default 0.65). A sketch of the overall schedule follows.
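
The sketch below is illustrative only (the function and parameter names are assumptions, not Ranger21's API); it shows the three phases described above, with the cosine warmdown of this entry (per the April 25 entry above, the default warmdown later became linear, but the overall shape is the same):

import math

def lr_at_step(step, total_steps, max_lr, warmup_steps,
               warmdown_start_pct=0.65, min_lr=3e-5):
    # phase 1: linear warmup up to the provided lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # phase 2: flat "explore" run at the provided lr
    warmdown_start = int(total_steps * warmdown_start_pct)
    if step < warmdown_start:
        return max_lr
    # phase 3: "exploit" warmdown toward min_lr (cosine here, never below min_lr)
    progress = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))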

April 17 - building benchmark dataset(s): As a cost-effective way of testing Ranger21 and its various options, I am currently taking a subset of ImageNet categories and building out a high-level "ImageSubNet50" plus a few sub-category datasets. These are similar in spirit to ImageNette and ImageWoof, but with a few relative improvements, including pre-sizing to 224x224 for speed of training/testing. The first sub-dataset in progress is ImageBirds, which includes:
n01614925 bald eagle
n01616318 vulture
n01622779 grey owl

n01806143 peacock
n01833805 hummingbird

This is a medium-fine classification problem and will be used as the first test for this type of benchmarking. Ideally, a separate repo for ImageBirds will be created shortly so people can use it, though hosting the dataset poses a cost problem...

April 12 - positive negative momentum added, madgrad core checked in: Testing over the weekend showed that positive-negative momentum works really well, and even better with GC.
The code is a bit messy atm because AdaiW was also tested, but it did not do that well, so it was removed and positive-negative momentum was added. Pos-neg momentum is a new technique that adds parameter-based, anisotropic noise to the gradient, which helps it settle into flatter minima and also escape saddle points. In other words, better results. A rough sketch of the idea follows the paper link below.
Link to their excellent paper: https://arxiv.org/abs/2103.17182
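
This is a very rough, simplified SGD-style sketch of the positive-negative momentum idea under my reading of the paper; it is not Ranger21's Adam-based implementation, and the buffer/parameter names are assumptions:

import torch

def pnm_step(param, grad, state, lr=1e-3, beta1=0.9, beta0=1.0):
    # keep two momentum buffers, each fed by every other gradient
    if "step" not in state:
        state["step"] = 0
        state["momentum_a"] = torch.zeros_like(param)
        state["momentum_b"] = torch.zeros_like(param)
    state["step"] += 1
    if state["step"] % 2 == 1:
        current, other = state["momentum_a"], state["momentum_b"]
    else:
        current, other = state["momentum_b"], state["momentum_a"]
    current.mul_(beta1 ** 2).add_(grad, alpha=1 - beta1 ** 2)
    # combine the buffers with one positive and one negative coefficient; the
    # mismatch injects parameter-wise anisotropic noise that favors flatter minima
    scale = ((1 + beta0) ** 2 + beta0 ** 2) ** 0.5
    param.add_((1 + beta0) * current - beta0 * other, alpha=-lr / scale)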

You can toggle between the MadGrad and Adam engines with the use_madgrad=True/False flag. (screenshot: use_madgrad toggle)

April 10 - madgrad core engine integrated: MadGrad has been added in a way that lets you select either MadGrad or Adam as the core 'engine' for the optimizer.
Thus, you'll be able to simply toggle which optimization engine to use, as well as the various enhancements (warmup, stable weight decay, gradient centralization), and quickly find the best optimization setup for your specific dataset.

Still testing things and will update the code here... Gradient centralization is good for both engines - first findings are that gradient centralization definitely improves MadGrad (just like it does with the Adam core), so GC will be on by default for both engines.

(screenshot: MadGrad engine added to Ranger21)

LR selection is very different between the MadGrad and Adam core engines:

One item - the starting lr for MadGrad is very different (typically higher) than with Adam. Some testing has been done with automated LR scheduling (HyperExplorer and ABEL), but that will be added later if it proves successful. If you simply plug your usual Adam LRs into MadGrad, you won't be impressed :)

Note that AdamP projection was also tested as an option, but impact was minimal, so will not be adding it atm.

April 6 - Ranger21 alpha ready - automatic warmup added. Seeing impressive results with only 3 features implemented.
Stable weight decay + GC + automated linear warmup seem to sync very nicely. Thus, if you are feeling adventurous, Ranger21 is basically alpha-usable. Recommend you use the default warmup (automatic by default), but test lr and weight decay.
Ranger21 will output the settings at init to make it clear what you are running with: (screenshot: Ranger21 initialization output)

April 5 - stable weight decay added. Quick testing shows nice results with 1e-4 weight decay on a subset of ImageNet.

Current feature set planned:

1 - Feature complete - automated linear and exponential warmup in place of RAdam. This is based on the findings of https://arxiv.org/abs/1910.04209v3

2 - Feature in progress - MadGrad core engine. This is based on my own testing with Vision Transformers as well as the compelling MadGrad paper: https://arxiv.org/abs/2101.11075v1

3 - Feature complete - Stable Weight Decay instead of AdamW-style or Adam-style decay: needs more testing, but the paper is very compelling: https://arxiv.org/abs/2011.11152v3

4 - Feature complete - Gradient Centralization will be continued - as always, you can turn it on or off. https://arxiv.org/abs/2004.01461v2

5 - Lookahead may be brought forward - unclear how much it may help with the new MadGrad core, which already leverages dual averaging, but it will probably be included as a testable param.

6 - Feature in progress - dual optimization engines - Adam and MadGrad cores will both be present so that one can quickly test with either MadGrad or Adam (or AdamP) with the flip of a param.

If you have ideas/feedback, feel free to open an issue.

Referencing this work

You can use the following BibTeX entry to cite the Ranger21 paper in your research:

@article{wright2021ranger21,
      title={Ranger21: a synergistic deep learning optimizer}, 
      author={Wright, Less and Demeure, Nestor},
      year={2021},
      journal={arXiv preprint arXiv:2106.13731},
}


ranger21's Issues

Nice name of your project)

Just wanted to say thank you. I've been "playing" with deep learning libraries and the ZoneMinder DVR solution and really love this sphere as a hobby.

I will delete this comment later or you can do the same)

File "/home/.../site-packages/ranger21/ranger21.py", line 680, in step raise RuntimeError("hit nan for variance_normalized")

Any idea what might have happened here?

The training runs normally with Ranger (20); when switching to 21, it crashes with this error:

  File "/home/.../lib/python3.8/site-packages/ranger21/ranger21.py", line 680, in step
    raise RuntimeError("hit nan for variance_normalized")

BTW, for Ranger20 you recommended training with the Mish activation function - is this also true for Ranger21?

I am training a segmentation network and some of the samples are completely empty.

Request for documentation

To developer:
Thanks a lot for developing such a nice project. There are many parameters to be set in Ranger21, but I don't know what these parameters do. If possible, please provide explanatory documentation.

Best
Neng

Changes in lr

I got different learning rate curves in two identical experiments - do you understand the reason?
(image: lr curve from run 1)
(image: lr curve from run 2)
It looks like the first image is the desired result.

RuntimeError: hit nan for variance_normalized

Calling Ranger21 with mostly default parameters:

    optimizer = ranger21.Ranger21(
        net.parameters(), lr=0.001, num_epochs=50, weight_decay=1e-5,
        num_batches_per_epoch=len(train_loader)
    )

Training seems fine for half a day with decent progress on all loss metrics, but then halts:

File "./train_pt.py", line 727, in <module>
    main(sys.argv[1:])
  File "./train_pt.py", line 612, in main
    optimizer.step()
  File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/morbo/git/Ranger21/ranger21/ranger21.py", line 714, in step
    raise RuntimeError("hit nan for variance_normalized")
RuntimeError: hit nan for variance_normalized

Adaptive Gradient Clipping

Hello, I highly recommend adding AGC as well, as it is extremely helpful for training stability.

import torch
import torch.optim as opt


def unitwise_norm(x):
    # per-unit (per-output-channel) L2 norm; conv weights are 4D, others use dim 0
    dim = [1, 2, 3] if x.ndim == 4 else 0
    return torch.sum(x ** 2, dim=dim, keepdim=x.ndim > 1) ** 0.5


class AGC(opt.Optimizer):
    """Wraps an existing optimizer and applies adaptive gradient clipping
    before delegating the actual parameter update."""

    def __init__(self, params, optim: opt.Optimizer, clipping=1e-2, eps=1e-3):
        self.optim = optim
        defaults = dict(clipping=clipping, eps=eps)
        defaults = {**defaults, **optim.defaults}
        super(AGC, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # clip each unit's gradient so its norm stays below
                # clipping * max(unit weight norm, eps)
                param_norm = torch.max(
                    unitwise_norm(p), torch.tensor(group['eps']).to(p.device))
                grad_norm = unitwise_norm(p.grad)
                max_norm = param_norm * group['clipping']
                trigger = grad_norm > max_norm
                clipped = p.grad * (max_norm / torch.max(
                    grad_norm, torch.tensor(1e-6).to(p.device)))
                p.grad.data.copy_(torch.where(trigger, clipped, p.grad))

        self.optim.step(closure)
        return loss

sample usage in fastai

Can you provide a sample notebook on how to use Ranger in fastai? fastai has a Ranger optimizer, but how do I replace fastai's version with this one? And what about when doing lr_find()?

Example

Could you please add a demo notebook ?
TIA

Multi GPU problem

Hi, I think I'm having a new problem. I've compared Ranger with Ranger21 on a fine-grained dataset, but Ranger21's results are much worse than Ranger's. I do get exciting results on my own computer, but the results on a multi-card server are poor. Do you know why?

Ranger
(image: top-1 accuracy curve)

Ranger21
(image: top-1 accuracy curve)

comparing ranger21 to SAM optimizer

Do you have any metrics on how Ranger compares to the new SAM optimizer? I am using fastai and would like to incorporate Ranger and SAM into my pipeline, but I don't know which one to start with.

Performance of ResNet50 on ImageNet

Hi, thanks for the nice project. I noticed your paper achieved a 73.69 accuracy on ImageNet with ResNet50, which is much worse than reported by Keras https://keras.io/api/applications/ (74.9, 76.0 for v2) and PyTorch (76.15). I wonder, does this mean Ranger cannot achieve as high an accuracy as officially reported with SGD? Or is it caused by other settings being different? If so, how does Ranger compare to the best of SGD in a fair setting?

error when training with batch_size = 1


  File "/home/florian/miniconda3/envs/msblob/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/florian/miniconda3/envs/msblob/lib/python3.8/site-packages/ranger21/ranger21.py", line 680, in step
    raise RuntimeError("hit nan for variance_normalized")
RuntimeError: hit nan for variance_normalized

Hi, please help me

Hi, first of all, praise for your work. Because I'm a beginner, I want to ask how to use your Ranger optimizer, since I see a lot of parameters in your code settings. I am currently using:

optimizer = Ranger21(model.parameters(), lr=3e-5, num_epochs=90, num_batches_per_epoch=16)

However, "AssertionError: LR went negative" is displayed.

About gradient normalization

Hi,

Thanks for the great work. I think gradient normalization is a reasonable idea to extend GC. But I noticed 2 points which are confusing to me:

  1. Gradients whose size is greater than 1 are centralized by the mean, but all gradients (not filtered by size) are normalized by the std - is this an empirically better implementation, or is it just a bug?

  2. I also notice that the calculation dimensions of the mean and std in gradient normalization are different, which is not very intuitive to me.

Thanks for the reply.

TypeError: unsupported operand type(s) for *: 'NoneType' and 'NoneType' (ranger21.py, line 179, in __init__)

I get the following error when starting my training:

Traceback (most recent call last):
  File "tr_baseline.py", line 75, in <module>
    optimizer = Ranger21(params=model.parameters(), lr=learning_rate)
  File "/mnt/Drive1/florian/msblob/Ranger21/ranger21/ranger21.py", line 179, in __init__
    self.total_iterations = num_epochs * num_batches_per_epoch
TypeError: unsupported operand type(s) for *: 'NoneType' and 'NoneType'

initializing ranger with:

# ranger:
optimizer = Ranger21(params=model.parameters(), lr=learning_rate)

Adaptive Gradient Clipping

Hi @lessw2020, thanks for this awesome work. I came here from the fastai forums and have been playing around with Ranger21 for a few days now. The results seem pretty solid, and in most cases I was easily able to beat Ranger or get comparable results. Just a few points I noticed...

  1. I don't think AGC is working if we train using fp16. I was getting some weird losses if I kept use_adaptive_gradient_clipping on while training in fp16. It works fine if I keep training in fp32, though. Is this something to be expected, or am I doing something wrong?
  2. I also noticed that the learning rate of parameters in Ranger21 is not modified, i.e., optimizer.param_groups[n]["lr"] remains the same throughout. Are you computing the learning rate schedule on the fly and then updating the weights?

Error when using DDP

I used ZeroRedundancyOptimizer to wrap Ranger21 and ran it on a 4-GPU machine. But its performance is much worse compared with plain AdamW, and it showed this error:
error in warmdown pct calc. new pct = 67.11272727272727
auto handled but please report issue
error in warmdown - lr below min lr. current lr = 2.999999999999997e-05
auto handling but please report issue!
(the same two messages repeat on every iteration, interleaved with the progress bar around 19156/19505 [8:57:59<04:55, 1.18it/s])

learning rate scheduler

If I don't have a special purpose, I don't need to use an additional learning rate scheduler, right?

resuming training with ranger21?

As I understand it, Ranger21 does its own internal lr scheduling, etc.

How should training be resumed? Is there a state dict to be loaded, etc.?
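
One possible pattern, assuming Ranger21 follows the standard torch.optim.Optimizer state_dict API; this is only a sketch (fragment, assuming model/train_loader from your training script), and whether the internal lr-schedule counters are fully restored is exactly the open question here:

import torch
from ranger21 import Ranger21

# save a checkpoint
torch.save({"model": model.state_dict(), "optim": optimizer.state_dict()},
           "checkpoint.pt")

# resume: re-create the optimizer with the same lr / num_epochs /
# num_batches_per_epoch so the internal schedule lines up, then load its state
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer = Ranger21(model.parameters(), lr=1e-3, num_epochs=60,
                     num_batches_per_epoch=len(train_loader))
optimizer.load_state_dict(ckpt["optim"])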

PyTorch 1.3.1 not supported

To developer:
Thank you for developing such a great optimizer. I have used it with PyTorch 1.8 and PyTorch 1.9 successfully. When I use PyTorch 1.3.1, ranger21 reports some errors; I think ranger21 does not support PyTorch 1.3.1. Could you make it available in the future, please?
Here is the report info:

import torch
import ranger21
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/home/huangneng/tools/Ranger21/ranger21/__init__.py", line 1, in <module>
   from .ranger21 import Ranger21
 File "/home/huangneng/tools/Ranger21/ranger21/ranger21.py", line 49, in <module>
   from torch import linalg as LA
ImportError: cannot import name 'linalg'

Best,
Neng

hit nan for variance_normalized

Not certain this is a bug yet, but I'm getting this rarely after a while of training and am not finding an issue on my side. The input to the loss function looks good (no NaNs). I'm working with a fairly complex loss function though, so it's very possible I have a rare bug in my code.

I'm using the following options

Ranger21(
      params=params, lr=3e-4, 
      num_epochs=1e12, num_batches_per_epoch=1, num_warmup_iterations=1000, 
      using_gc=True, weight_decay=1e-4, use_madgrad=True
      )

I've seen this with batch sizes of 4-128 so far, so it doesn't seem to be dependent on that.

AttributeError: module 'collections' has no attribute 'Callable'

  File "/home/jack/miniconda3/envs/power_perceiver/lib/python3.10/site-packages/ranger21/ranger21.py", line 578, in step
    if closure is not None and isinstance(closure, collections.Callable):
AttributeError: module 'collections' has no attribute 'Callable'

collections.Callable was moved to collections.abc.Callable back in Python 3.3 🙂

Please see: rbarrois/xworkflows#16

I will submit a pull request to fix this ASAP 🙂
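
For reference, a minimal sketch of the fix (a patch fragment for the check quoted above, not the actual PR):

import collections.abc

# the isinstance check in step() should reference collections.abc.Callable ...
if closure is not None and isinstance(closure, collections.abc.Callable):
    loss = closure()

# ... or, more simply, use the callable() builtin
if closure is not None and callable(closure):
    loss = closure()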

torch.grad removed in PyTorch 1.8.1?

I'm getting the following error with PyTorch 1.8.1

AttributeError: module 'torch' has no attribute 'grad'

Swapping Line 515 for with torch.enable_grad(): seems to resolve the error.

I can't find it in the 1.8 release notes, but it appears torch.grad() might be deprecated? Not sure if anyone else can replicate.

Cheers!

SAM paper

First, great project - seeing good improvements over straight Adam. Have you seen the Sharpness-Aware Minimization for Efficiently Improving Generalization paper? I'm curious if you have any thoughts on how to integrate Ranger21 with a SAM implementation (https://github.com/davda54/sam, for example).

Thanks!

lr below min_lr check too aggressive

Hi,

First of all, thank you for providing such an awesome optimizer and releasing an arXiv reference!
I am still working on integrating the Optimizer into my project, but I am getting quite a few superfluous warnings:

error in warmdown - lr below min lr. current lr = 2.999999999999997e-05
auto handling but please report issue!

> min_lr = 3e-5

Which is caused by the following check:

 if new_lr < self.min_lr:

from here

Due to floating-point rounding errors, the new_lr might become lower than the predefined min_lr.
I would suggest replacing this check with something like:

if (new_lr - self.min_lr) < - eps:

This would be a simple fix; alternatively, a more sophisticated check similar to np.isclose could be used.

I am happy to make a PR if you'd like :)
