Comments (14)

juntang-zhuang commented on September 25, 2024

@cryu854 Could you open a pull request and add your contact info in the code? Perhaps also bump the version to 0.2.1 so I can upload it to pip. Thanks a lot.

sumanth-sadu commented on September 25, 2024

I have the same issue. Can anyone provide some insight?

Thanks

juntang-zhuang commented on September 25, 2024

The PyTorch version does not have this problem. I think it's due to the implementation, since I'm not so familiar with TensorFlow. The pip package 0.1.0 is an old version compared to the source code under pypi_package/adabelief_tf0.1.0/Adabelief_tf.py, which was merged from a pull request by @cryu854 and should be better optimized, but I have not updated it on pip (so pip install adabelief-tf installs an old version). Please try the source code. Let me know if there are any updates.
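
For reference, a quick way to confirm which release pip actually installed (standard-library Python only, not part of the AdaBelief code):

# Print the installed adabelief-tf release, e.g. "0.1.0" for the old pip version.
from importlib.metadata import version  # Python 3.8+
print(version("adabelief-tf"))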

juntang-zhuang commented on September 25, 2024

@ManoharSai2000 @sumanthsadhu could you provide the code to reproduce the result?

ManoharSai2000 commented on September 25, 2024

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from tabulate import tabulate
from colorama import Fore, Back, Style

class AdaBeliefOptimizer(tf.keras.optimizers.Optimizer):
"""
It implements the AdaBeliefOptimizer proposed by
Juntang Zhuang et al. in AdaBelief Optimizer: Adapting stepsizes by the belief
in observed gradients
.
Example of usage:
python from adabelief_tf impoty AdaBeliefOptimizer opt = AdaBeliefOptimizer(lr=1e-3)
Note: amsgrad is not described in the original paper. Use it with
caution.
AdaBeliefOptimizer is not a placement of the heuristic warmup, the settings should be
kept if warmup has already been employed and tuned in the baseline method.
You can enable warmup by setting total_steps and warmup_proportion:
python opt = AdaBeliefOptimizer( lr=1e-3, total_steps=10000, warmup_proportion=0.1, min_lr=1e-5, )
In the above example, the learning rate will increase linearly
from 0 to lr in 1000 steps, then decrease linearly from lr to min_lr
in 9000 steps.
Lookahead, proposed by Michael R. Zhang et.al in the paper
[Lookahead Optimizer: k steps forward, 1 step back]
(https://arxiv.org/abs/1907.08610v1), can be integrated with AdaBeliefOptimizer,
which is announced by Less Wright and the new combined optimizer can also
be called "Ranger". The mechanism can be enabled by using the lookahead
wrapper. For example:
python adabelief = AdaBeliefOptimizer() ranger = tfa.optimizers.Lookahead(adabelief, sync_period=6, slow_step_size=0.5)
Example of serialization:
python optimizer = AdaBeliefOptimizer(learning_rate=lr_scheduler, weight_decay=wd_scheduler) config = tf.keras.optimizers.serialize(optimizer) new_optimizer = tf.keras.optimizers.deserialize(config, custom_objects={"AdaBeliefOptimizer": AdaBeliefOptimizer})
"""

def __init__(
    self,
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-14,
    weight_decay=0.0,
    rectify=True,
    amsgrad=False,
    sma_threshold=5.0,
    total_steps=0,
    warmup_proportion=0.1,
    min_lr=0.0,
    name="AdaBeliefOptimizer",
    **kwargs):
    r"""Construct a new AdaBelief optimizer.
    Args:
        learning_rate: A `Tensor` or a floating point value, or a schedule
            that is a `tf.keras.optimizers.schedules.LearningRateSchedule`.
            The learning rate.
        beta_1: A float value or a constant float tensor.
            The exponential decay rate for the 1st moment estimates.
        beta_2: A float value or a constant float tensor.
            The exponential decay rate for the 2nd moment estimates.
        epsilon: A small constant for numerical stability.
        weight_decay: A `Tensor` or a floating point value, or a schedule
            that is a `tf.keras.optimizers.schedules.LearningRateSchedule`.
            Weight decay for each parameter.
        rectify: boolean. Whether to enable rectification as in RectifiedAdam.
        amsgrad: boolean. Whether to apply AMSGrad variant of this
            algorithm from the paper "On the Convergence of Adam and
            beyond".
        sma_threshold: A float value.
            The threshold for the simple moving average.
        total_steps: An integer. Total number of training steps.
            Enable warmup by setting a positive value.
        warmup_proportion: A floating point value.
            The proportion of increasing steps.
        min_lr: A floating point value. Minimum learning rate after warmup.
        name: Optional name for the operations created when applying
            gradients. Defaults to "AdaBeliefOptimizer".
        **kwargs: keyword arguments. Allowed to be {`clipnorm`,
            `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients
            by norm; `clipvalue` is clip gradients by value, `decay` is
            included for backward compatibility to allow time inverse
            decay of learning rate. `lr` is included for backward
            compatibility, recommended to use `learning_rate` instead.
    """
    super().__init__(name, **kwargs)

    # ------------------------------------------------------------------------------
    # Print modifications to default arguments
    # print(Fore.RED + 'Please check your arguments if you have upgraded adabelief-tf from version 0.0.1.')
    # print(Fore.RED + 'Modifications to default arguments:')
    # default_table = tabulate([
    #     ['adabelief-tf=0.0.1', '1e-8', 'Not supported', 'Not supported'],
    #     ['Current version (0.1.0)', '1e-14', 'supported', 'default: True']],
    #     headers=['eps', 'weight_decouple', 'rectify'])
    # print(Fore.RED + default_table)

    # print(Fore.RED + 'For a complete table of recommended hyperparameters, see')
    # print(Fore.RED + 'https://github.com/juntang-zhuang/Adabelief-Optimizer')

    print(Style.RESET_ALL)
    # ------------------------------------------------------------------------------

    self._set_hyper("learning_rate", kwargs.get("lr", learning_rate))
    self._set_hyper("beta_1", beta_1)
    self._set_hyper("beta_2", beta_2)
    self._set_hyper("decay", self._initial_decay)
    self._set_hyper("weight_decay", weight_decay)
    self._set_hyper("sma_threshold", sma_threshold)
    self._set_hyper("total_steps", int(total_steps))
    self._set_hyper("warmup_proportion", warmup_proportion)
    self._set_hyper("min_lr", min_lr)
    self.epsilon = epsilon or tf.keras.backend.epsilon()
    self.amsgrad = amsgrad
    self.rectify = rectify
    self._has_weight_decay = weight_decay != 0.0
    self._initial_total_steps = total_steps

def _create_slots(self, var_list):
    for var in var_list:
        self.add_slot(var, "m")
    for var in var_list:
        self.add_slot(var, "v")
    for var in var_list:
        self.add_slot(var, "grad_dif")
    if self.amsgrad:
        for var in var_list:
            self.add_slot(var, "vhat")

def set_weights(self, weights):
    params = self.weights
    num_vars = int((len(params) - 1) / 2)
    if len(weights) == 4 * num_vars + 1:
        weights = weights[: len(params)]
    super().set_weights(weights)

def _decayed_wd(self, var_dtype):
    wd_t = self._get_hyper("weight_decay", var_dtype)
    if isinstance(wd_t, tf.keras.optimizers.schedules.LearningRateSchedule):
        wd_t = tf.cast(wd_t(self.iterations), var_dtype)
    return wd_t

def _resource_apply_dense(self, grad, var):
    var_dtype = var.dtype.base_dtype
    lr_t = self._decayed_lr(var_dtype)
    wd_t = self._decayed_wd(var_dtype)
    m = self.get_slot(var, "m")
    v = self.get_slot(var, "v")
    beta_1_t = self._get_hyper("beta_1", var_dtype)
    beta_2_t = self._get_hyper("beta_2", var_dtype)
    epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
    local_step = tf.cast(self.iterations + 1, var_dtype)
    beta_1_power = tf.math.pow(beta_1_t, local_step)
    beta_2_power = tf.math.pow(beta_2_t, local_step)

    if self._initial_total_steps > 0:
        total_steps = self._get_hyper("total_steps", var_dtype)
        warmup_steps = total_steps * self._get_hyper("warmup_proportion", var_dtype)
        min_lr = self._get_hyper("min_lr", var_dtype)
        decay_steps = tf.maximum(total_steps - warmup_steps, 1)
        decay_rate = (min_lr - lr_t) / decay_steps
        lr_t = tf.where(
            local_step <= warmup_steps,
            lr_t * (local_step / warmup_steps),
            lr_t + decay_rate * tf.minimum(local_step - warmup_steps, decay_steps),
        )

    sma_inf = 2.0 / (1.0 - beta_2_t) - 1.0
    sma_t = sma_inf - 2.0 * local_step * beta_2_power / (1.0 - beta_2_power)

    m_t = m.assign(
        beta_1_t * m + (1.0 - beta_1_t) * grad, use_locking=self._use_locking
    )
    m_corr_t = m_t / (1.0 - beta_1_power)

    grad_dif = self.get_slot(var,'grad_dif')
    grad_dif.assign( grad - m_t )
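    # Note: the 'grad_dif' slot assigned above is never read in the rest of this
    # update; a later comment in this thread reports it was removed in version 0.2.0.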
    v_t = v.assign(
        beta_2_t * v + (1.0 - beta_2_t) * tf.math.square(grad - m_t) + epsilon_t,
        use_locking=self._use_locking,
    )

    if self.amsgrad:
        vhat = self.get_slot(var, "vhat")
        vhat_t = vhat.assign(tf.maximum(vhat, v_t), use_locking=self._use_locking)
        v_corr_t = tf.math.sqrt(vhat_t / (1.0 - beta_2_power))
    else:
        vhat_t = None
        v_corr_t = tf.math.sqrt(v_t / (1.0 - beta_2_power))

    r_t = tf.math.sqrt(
        (sma_t - 4.0)
        / (sma_inf - 4.0)
        * (sma_t - 2.0)
        / (sma_inf - 2.0)
        * sma_inf
        / sma_t
    )

    if self.rectify:
        sma_threshold = self._get_hyper("sma_threshold", var_dtype)
        var_t = tf.where(
            sma_t >= sma_threshold,
            r_t * m_corr_t / (v_corr_t + epsilon_t),
            m_corr_t,
        )
    else:
        var_t = m_corr_t / (v_corr_t + epsilon_t)

    if self._has_weight_decay:
        var_t += wd_t * var

    var_update = var.assign_sub(lr_t * var_t, use_locking=self._use_locking)

    updates = [var_update, m_t, v_t]
    if self.amsgrad:
        updates.append(vhat_t)
    return tf.group(*updates)

def _resource_apply_sparse(self, grad, var, indices):
    var_dtype = var.dtype.base_dtype
    lr_t = self._decayed_lr(var_dtype)
    wd_t = self._decayed_wd(var_dtype)
    beta_1_t = self._get_hyper("beta_1", var_dtype)
    beta_2_t = self._get_hyper("beta_2", var_dtype)
    epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
    local_step = tf.cast(self.iterations + 1, var_dtype)
    beta_1_power = tf.math.pow(beta_1_t, local_step)
    beta_2_power = tf.math.pow(beta_2_t, local_step)

    if self._initial_total_steps > 0:
        total_steps = self._get_hyper("total_steps", var_dtype)
        warmup_steps = total_steps * self._get_hyper("warmup_proportion", var_dtype)
        min_lr = self._get_hyper("min_lr", var_dtype)
        decay_steps = tf.maximum(total_steps - warmup_steps, 1)
        decay_rate = (min_lr - lr_t) / decay_steps
        lr_t = tf.where(
            local_step <= warmup_steps,
            lr_t * (local_step / warmup_steps),
            lr_t + decay_rate * tf.minimum(local_step - warmup_steps, decay_steps),
        )

    sma_inf = 2.0 / (1.0 - beta_2_t) - 1.0
    sma_t = sma_inf - 2.0 * local_step * beta_2_power / (1.0 - beta_2_power)

    m = self.get_slot(var, "m")
    m_scaled_g_values = grad * (1 - beta_1_t)
    m_t = m.assign(m * beta_1_t, use_locking=self._use_locking)
    m_t = self._resource_scatter_add(m, indices, m_scaled_g_values)
    m_corr_t = m_t / (1.0 - beta_1_power)

    grad_dif = self.get_slot(var,'grad_dif')
    grad_dif.assign(m_t)
    grad_dif = self._resource_scatter_add(grad_dif, indices, -1.0 * grad)
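    # As in the dense branch above, 'grad_dif' is computed here but never used afterwards.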

    v = self.get_slot(var, "v")
    m_t_indices = tf.gather(m_t, indices)
    v_scaled_g_values = tf.math.square(grad - m_t_indices) * (1 - beta_2_t)
    v_t = v.assign(v * beta_2_t + epsilon_t, use_locking=self._use_locking)
    v_t = self._resource_scatter_add(v, indices, v_scaled_g_values)

    if self.amsgrad:
        vhat = self.get_slot(var, "vhat")
        vhat_t = vhat.assign(tf.maximum(vhat, v_t), use_locking=self._use_locking)
        v_corr_t = tf.math.sqrt(vhat_t / (1.0 - beta_2_power))
    else:
        vhat_t = None
        v_corr_t = tf.math.sqrt(v_t / (1.0 - beta_2_power))

    r_t = tf.math.sqrt(
        (sma_t - 4.0)
        / (sma_inf - 4.0)
        * (sma_t - 2.0)
        / (sma_inf - 2.0)
        * sma_inf
        / sma_t
    )

    if self.rectify:
        sma_threshold = self._get_hyper("sma_threshold", var_dtype)
        var_t = tf.where(
            sma_t >= sma_threshold,
            r_t * m_corr_t / (v_corr_t + epsilon_t),
            m_corr_t,
        )
    else:
        var_t = m_corr_t / (v_corr_t + epsilon_t)

    if self._has_weight_decay:
        var_t += wd_t * var

    var_update = self._resource_scatter_add(
        var, indices, tf.gather(-lr_t * var_t, indices)
    )

    updates = [var_update, m_t, v_t]
    if self.amsgrad:
        updates.append(vhat_t)
    return tf.group(*updates)

def get_config(self):
    config = super().get_config()
    config.update(
        {
            "learning_rate": self._serialize_hyperparameter("learning_rate"),
            "beta_1": self._serialize_hyperparameter("beta_1"),
            "beta_2": self._serialize_hyperparameter("beta_2"),
            "decay": self._serialize_hyperparameter("decay"),
            "weight_decay": self._serialize_hyperparameter("weight_decay"),
            "sma_threshold": self._serialize_hyperparameter("sma_threshold"),
            "epsilon": self.epsilon,
            "amsgrad": self.amsgrad,
            "rectify": self.rectify,
            "total_steps": self._serialize_hyperparameter("total_steps"),
            "warmup_proportion": self._serialize_hyperparameter(
                "warmup_proportion"
            ),
            "min_lr": self._serialize_hyperparameter("min_lr"),
        }
    )
    return config

generator_g_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)
generator_f_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)

discriminator_x_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)
discriminator_y_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)

ManoharSai2000 commented on September 25, 2024

The first part is the AdaBelief source code, and the second part is its usage.
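
To make that usage concrete, here is a minimal, hypothetical sketch of applying one of these optimizers in a custom training loop (the model and data below are placeholders, not the actual CycleGAN code):

# Placeholder example: one gradient step with generator_g_optimizer.
import tensorflow as tf
from adabelief_tf import AdaBeliefOptimizer  # or use the class defined above

generator_g = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # stand-in model
generator_g_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)

x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(generator_g(x, training=True) - y))
grads = tape.gradient(loss, generator_g.trainable_variables)
generator_g_optimizer.apply_gradients(zip(grads, generator_g.trainable_variables))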

cryu854 commented on September 25, 2024

Hi @ManoharSai2000, @sumanthsadhu. I tried the TensorFlow CycleGAN example here and trained it with Adam and AdaBelief; one epoch took 490 s and 560 s respectively on a Tesla T4 in Google Colab.
In my opinion, since Adam is further optimized with fused kernels while AdaBelief is implemented purely with TensorFlow ops and then wrapped as a Keras optimizer, an efficiency gap is inevitable.
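
A rough way to reproduce this kind of comparison on a small stand-in model (a generic timing sketch, not the CycleGAN notebook; the model, data, and epoch counts are arbitrary):

# Compare wall-clock training time for Adam vs. AdaBelief on a toy regression model.
import time
import tensorflow as tf
from adabelief_tf import AdaBeliefOptimizer

x = tf.random.normal((1024, 32))
y = tf.random.normal((1024, 1))

def timed_fit(optimizer, epochs=3):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=optimizer, loss="mse")
    start = time.perf_counter()
    model.fit(x, y, epochs=epochs, batch_size=64, verbose=0)
    return time.perf_counter() - start

print("Adam:     ", timed_fit(tf.keras.optimizers.Adam(2e-4)))
print("AdaBelief:", timed_fit(AdaBeliefOptimizer(2e-4)))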

ManoharSai2000 commented on September 25, 2024

@cryu854, OK, thank you. Is this issue the same in PyTorch? I hope the optimizer will be added to TensorFlow soon, as you mentioned.

cryu854 commented on September 25, 2024

@ManoharSai2000. To my knowledge, PyTorch seems to need an additional compiler like JIT to fuse kernels automatically. Otherwise, PyTorch will launch a separate kernel for each operation. Please correct me if I'm wrong.

juntang-zhuang commented on September 25, 2024

@cryu854 Thanks for the update. I just found that the code creates a slot called "grad_dif" that never seems to be used; it might also cause some redundant computation. So I pushed a new version to a new branch called "update_0.2.0"; please see the new code at https://github.com/juntang-zhuang/Adabelief-Optimizer/blob/ce188ee2d8c8afc72810374a0fbbe7309f9658f9/pypi_packages/adabelief_tf0.2.0/adabelief_tf/AdaBelief_tf.py. Other updates include an option to turn the red warning messages on or off. Could you perform a quick check or test? Perhaps with a tool such as text-compare.com to better identify the exact changes. If everything works fine, we can push it to pip. Thanks a lot in advance.
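
A quick, generic way to check which slots a given release actually creates (uses only public Keras optimizer APIs; the slot names come from the code posted above):

# After one step, 0.1.0 creates 'm', 'v' and 'grad_dif' slots ('vhat' too with
# amsgrad=True); the 0.2.0 branch drops the unused 'grad_dif'.
import tensorflow as tf
from adabelief_tf import AdaBeliefOptimizer

var = tf.Variable([1.0, 2.0])
opt = AdaBeliefOptimizer(learning_rate=1e-3)
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(var ** 2)
opt.apply_gradients([(tape.gradient(loss, var), var)])
print(opt.get_slot_names())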

cryu854 commented on September 25, 2024

@juntang-zhuang The new code looks good to me, and it passes all the test cases in Adabelief_test.py.
Btw, should we move Adabelief_test.py out of the folder? I am not sure whether the PyPI package will include the test code.

juntang-zhuang commented on September 25, 2024

@cryu854 Thanks a lot. I just deleted the test code and uploaded it to pip as 0.2.0. BTW, do you want to add your name and email at the beginning of the file as a contributor? If so, I'll update it in version 0.2.1. Thanks again for your efforts and help.

juntang-zhuang commented on September 25, 2024

@ManoharSai2000 @sumanthsadhu Just removed some redundant computation in the code and released adabelief-tf==0.2.0. Please try it from pip with pip install adabelief-tf==0.2.0; it should be a little bit faster now, though we did not implement fused kernel operations. The source code is in pypi_packages/adabelief_tf0.2.0.
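
For anyone picking up the new release, a minimal usage sketch (the import path and constructor arguments follow the source posted above; the model is just a placeholder):

# Minimal usage of the 0.2.0 release after: pip install adabelief-tf==0.2.0
import tensorflow as tf
from adabelief_tf import AdaBeliefOptimizer

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer=AdaBeliefOptimizer(learning_rate=1e-3, rectify=True),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])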

cryu854 commented on September 25, 2024

@juntang-zhuang Yes, if it won't bother you, it would be an honor for me to be listed as a contributor. Thank you in advance.
