deep-spin / entmax
The entmax mapping and its loss, a family of sparse softmax alternatives.
License: MIT License
Hi, I was plotting the output of the entmax_bisect function at different levels of alpha:
import torch
from matplotlib import pyplot as plt
from entmax import entmax_bisect

# first column sweeps [-1, 1], second column is fixed at 0
x = torch.linspace(-1, 1, 10000).unsqueeze(-1)
x = torch.cat((x, torch.zeros_like(x)), dim=-1)
y = entmax_bisect(x, 10, dim=-1)[..., 0]
plt.plot(x[:, 0].numpy(), y.numpy())
I'm expecting this to be a smooth function except at two points, where it transitions to 0 or 1.
Any explanation for this?
Hi, very nice work! Do you have a notebook showing how to train the alpha parameter?
Hi folks!
It seems like the gradients of sparsemax are not the same when we have two "equal" tensors: one 2d, and the other with a time dimension.
Here is the code to reproduce the problem:
import torch
import entmax
def test_map_fn(activation_fn):
    x = torch.tensor([[-2, 0, 0.5], [0.1, 2, -0.4]], requires_grad=True)
    # >>> x.shape
    # torch.Size([2, 3])
    a_2d = activation_fn(x, dim=-1)
    z_2d = torch.sum(torch.pow(a_2d, 2))
    z_2d.backward()
    grad_2d = x.grad

    x = torch.tensor([[[-2, 0, 0.5]], [[0.1, 2, -0.4]]], requires_grad=True)
    # >>> x.shape
    # torch.Size([2, 1, 3])
    a_3d = activation_fn(x, dim=-1)
    z_3d = torch.sum(torch.pow(a_3d, 2))
    z_3d.backward()
    grad_3d = x.grad

    print(activation_fn.__name__)
    print('Ok acts:', torch.allclose(a_2d.squeeze(), a_3d.squeeze()))
    print('Ok grads:', torch.allclose(grad_2d.squeeze(), grad_3d.squeeze()))
    print(grad_2d.squeeze())
    print(grad_3d.squeeze())
    print('---\n')


if __name__ == '__main__':
    test_map_fn(torch.softmax)
    test_map_fn(entmax.entmax15)
    test_map_fn(entmax.sparsemax)
The output of this code is:
softmax
Ok acts: True
Ok grads: True
tensor([[-0.0421, -0.0883, 0.1304],
[-0.1325, 0.2198, -0.0873]])
tensor([[-0.0421, -0.0883, 0.1304],
[-0.1325, 0.2198, -0.0873]])
---
entmax15
Ok acts: True
Ok grads: True
tensor([[ 0.0000, -0.2344, 0.2344],
[-0.0926, 0.0926, 0.0000]])
tensor([[ 0.0000, -0.2344, 0.2344],
[-0.0926, 0.0926, 0.0000]])
---
sparsemax
Ok acts: True
Ok grads: False
tensor([[ 0.0000, -0.5000, 0.5000],
[ 0.0000, 0.0000, 0.0000]])
tensor([[ 0., -2., 0.],
[ 0., 1., 0.]])
---
So, using sparsemax, the gradients of the two tensors are different. Note: it seems that a quick fix of doing tensor.view(-1, nb_labels) to get a 2D tensor works fine in practice.
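For reference, here is a minimal sketch of that workaround (my own illustration, not an official fix): flatten the leading dimensions before calling sparsemax and restore the original shape afterwards.
import torch
import entmax
x = torch.randn(2, 1, 3, requires_grad=True)  # (batch, time, nb_labels)
orig_shape = x.shape
a = entmax.sparsemax(x.view(-1, orig_shape[-1]), dim=-1).view(orig_shape)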
The error shows that, when computing the loss, the size of 'target' is not the same as 'p_star': https://github.com/deep-spin/entmax/blob/master/entmax/losses.py#L156 Should it be switched to index_add_? Any hints?
Pytorch version: '0.4.1.post2'
Thanks
Thanks for creating this useful library. We recently included it as part of our low-code toolkit, Ludwig. However, we ran into an issue: if the user does not have torch installed before installing entmax, pip raises an exception:
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
Traceback (most recent call last):
File "<string>", line 36, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/setup.py", line 2, in <module>
from entmax import __version__
File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/entmax/__init__.py", line 3, in <module>
from entmax.activations import sparsemax, entmax15, Sparsemax, Entmax15
File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/entmax/activations.py", line 13, in <module>
import torch
ModuleNotFoundError: No module named 'torch'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
Looks like this was fixed some time back here, but this change was made after the v1.0
release, meaning the current production release has this bug. Can you create a patch release v1.0.1
that includes this fix?
Thanks.
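For context, a common pattern for this kind of fix (an assumption on my part, not necessarily the change that was actually merged) is to parse __version__ from the source file instead of importing the package, so setup.py no longer needs torch at install time:
import re
from pathlib import Path
from setuptools import setup, find_packages

# read the version string without executing entmax/__init__.py (which imports torch)
init_text = Path("entmax/__init__.py").read_text()
version = re.search(r"""__version__\s*=\s*["']([^"']+)["']""", init_text).group(1)

setup(name="entmax", version=version, packages=find_packages())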
The sparse loss functions (and their equivalent classes) return nans when there is -inf in the input.
Example:
import torch
import numpy as np
from entmax import entmax15_loss, sparsemax_loss
x = torch.rand(10, 5)
y = torch.randint(0, 4, [10])
x[:, 4] = -np.inf
entmax15_loss(x, y)
# tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
sparsemax_loss(x, y)
# tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
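A possible workaround sketch (my own suggestion, not an official fix): replace the -inf padding logits with a large negative finite value before calling the loss, so the threshold computation stays finite.
x_safe = x.masked_fill(x == float("-inf"), -1e9)
entmax15_loss(x_safe, y)   # finite per-example losses
sparsemax_loss(x_safe, y)  # finite per-example losses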
Hi,
I've used several different strategies with attention. I tried entmax on a small batch and it works well, but somewhere during training on the full dataset my loss becomes NaN. The behavior is irregular: occasionally I get through an epoch without it, but most of the time I get NaN as my loss. Can you please suggest some ways this could be fixed? nn.Softmax works fine.
Hi,
Firstly thanks for releasing the code of your paper.
I had some queries:
Here's what I'm doing:
>>> att_scores = torch.rand(128, 12, 36, 1024)
>>> alpha = AlphaChooser(12)
>>> p = entmax_bisect(att_scores,alpha()[0])
>>> val = p.mean().backward()
Am I training alpha correctly? Will setting it as a parameter be sufficient? In the paper you mentioned it cannot be solved simply by autograd differentiation.
>>> p = 0.
>>> for i in range(att_scores.size(1)):
>>>     p += entmax_bisect(X=att_scores, alpha=alpha()[i], dim=1)
>>> p /= num_attention_heads
Will p be representative of attention scores from all the heads?
I'm only using the first element of alpha(). How do I include all 12 values of alpha() while getting output scores from entmax? Do I run a loop over all the elements of alpha and then take the mean? That would increase computation time.
What does 'val' signify here?
How do you learn the shape of each attention head in your work?
Could you please answer these queries; any suggestion would be helpful. Thanks!
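In case it helps, here is a minimal sketch of applying one alpha per head without a Python loop. It assumes entmax_bisect accepts a tensor-valued alpha that broadcasts against the input, and that attention is normalized over the last (key) dimension:
import torch
from entmax import entmax_bisect

att_scores = torch.rand(128, 12, 36, 1024)        # (batch, heads, queries, keys)
alpha = 1 + torch.sigmoid(torch.randn(12))        # one alpha in (1, 2) per head
alpha = alpha.view(1, 12, 1, 1)                   # broadcast over batch/query/key dims
p = entmax_bisect(att_scores, alpha=alpha, dim=-1)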
Hi,
Can you provide the code to reproduce Figure 3 in this repository? I haven't fully understood how to create that figure.
Thanks,
Here's a tensorflow implementation of entmax
https://gist.github.com/justheuristic/60167e77a95221586be315ae527c3cbd
It should work on tf >= 1.8 and matches both the outputs and gradients of the official PyTorch implementation.
Thanks to lena-voita@ for the assistance.
The forward method of EntmaxBisect takes a required alpha argument. However, what is actually passed to the entmax_bisect function is self.alpha.
I just replaced the softmax function with the sparsemax function or the tsallis15 function in my transformer model. It works well during training, but the following error occurs during the testing phase:
RuntimeError: CUDA error: device-side assert triggered
If I replace it with the softmax function again, it works.
What could be the cause?
It would be great if entmax worked with torch.float16 and torch.bfloat16. Unfortunately, it currently does not. There are bugs for both bisection and the exact algorithm. Here I'll document a numerical stability problem that exists for the bisection-based algorithm for both torch.float16 and torch.bfloat16 (don't believe the propaganda that says that bf16 is a drop-in solution for float32).
Let's say you have a 32-bit vector of logits whose largest element is sufficiently negative.
a = torch.zeros(128, device="cuda").fill_(-5) # torch.float32
a[0] = 0
a -= 1000
With alpha=1.5, the correct output for this vector is a one-hot distribution peaked on index 0. We get this behavior with both entmax.entmax15 and entmax.entmax_bisect.
p1 = entmax.entmax15(a)
p2 = entmax.entmax_bisect(a, alpha=1.5)
p1[0] == p2[0] == 1 # True
Ok, great. But what happens if we use torch.float16?
b = a.to(torch.float16)
p3 = entmax.entmax_bisect(b, alpha=1.5)
p3.isnan().all() # True
And what about torch.bfloat16?
c = a.to(torch.bfloat16)
p4 = entmax.entmax_bisect(c, alpha=1.5)
p4.isnan().all() # True
Well that's not good! (solution after this commercial break)
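Until a fix lands, one interim workaround (a sketch of my own, not necessarily the solution alluded to above) is to run the mapping in float32 and cast back, at the cost of extra memory and compute:
p5 = entmax.entmax_bisect(b.float(), alpha=1.5).to(b.dtype)
p5[0] == 1  # True, and no NaNs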
Tensors with singleton dimensions like
x = torch.tensor([[[-0.0744, -0.0904]],
[[-0.0452, -0.0386]],
[[-0.0187, -0.0100]],
[[-0.0060, 0.0100]],
[[-0.0660, -0.1066]],
[[-0.0289, -0.0087]],
[[-0.0227, -0.0159]],
[[-0.0547, -0.0428]],
[[-0.0941, -0.0747]],
[[-0.0653, -0.0478]],
[[-0.0747, -0.0417]],
[[-0.0740, -0.0367]]], dtype=torch.float64, requires_grad=True)
yield gradients with NaNs for budget_bisect with budget=2. If we squeeze the singleton dimension, the gradients are correct.
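A sketch of that workaround (assuming budget_bisect is applied over the last dimension, and with an assumed import path): drop and then restore the singleton dimension.
from entmax import budget_bisect  # import path assumed
p = budget_bisect(x.squeeze(1), budget=2).unsqueeze(1)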
Hi,
May I know if we need to define a new trainable parameter for each head per layer for the alpha value? Could anyone be kind enough to show a simple example of how it could be used in a normal transformer?
Thanks!
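A rough sketch of one reasonable setup (an assumption on my part, not the paper's exact code): keep one unconstrained parameter per head and map it into (1, 2) with a sigmoid, so alpha stays in the valid range and can receive gradients through entmax_bisect.
import torch
import torch.nn as nn
from entmax import entmax_bisect

class PerHeadAlphaAttention(nn.Module):
    def __init__(self, n_heads):
        super().__init__()
        # unconstrained parameter; alpha = 1 + sigmoid(pre_alpha) lies in (1, 2)
        self.pre_alpha = nn.Parameter(torch.zeros(n_heads))

    def forward(self, scores):
        # scores: (batch, heads, queries, keys); normalize over the key dimension
        alpha = 1 + torch.sigmoid(self.pre_alpha).view(1, -1, 1, 1)
        return entmax_bisect(scores, alpha=alpha, dim=-1)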
Can the alpha value be less than one? I basically need it to be sum-normalized sigmoids in that case (as opposed to softmax, which is the case where alpha = 1.0).
When all inputs to entmax are -inf, it fails with
RuntimeError Traceback (most recent call last)
<ipython-input-404-217bd9c1ced2> in <module>
1 from entmax import entmax15
2 logits = torch.ones(10) * float('-inf')
----> 3 entmax15(logits)
~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in entmax15(X, dim, k)
254 """
255
--> 256 return Entmax15Function.apply(X, dim, k)
257
258
~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in forward(cls, ctx, X, dim, k)
176 X = X / 2 # divide by 2 to solve actual Entmax
177
--> 178 tau_star, _ = _entmax_threshold_and_support(X, dim=dim, k=k)
179
180 Y = torch.clamp(X - tau_star, min=0) ** 2
~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in _entmax_threshold_and_support(X, dim, k)
129
130 support_size = (tau <= Xsrt).sum(dim).unsqueeze(dim)
--> 131 tau_star = tau.gather(dim, support_size - 1)
132
133 if k is not None and k < X.shape[dim]:
RuntimeError: index -1 is out of bounds for dimension 0 with size 10
A minimal snippet to reproduce this behavior is
import torch
from entmax import entmax15
logits = torch.ones(10) * float('-inf')
entmax15(logits)
For reference, torch.softmax will return a tensor of NaNs. This is certainly a corner case, but sometimes padding may create -inf-only inputs, and it's easier to deal with NaNs later.
[This is possibly related to #9 ]
First of all, thanks for a great library! It's very nicely implemented!
From the documentation provided in the code, I understood that entmax_bisect should behave like softmax when alpha is set to 1.
I've done some experiments and the results seem to be different from softmax when the alpha is equal to 1. Yet, when it's close to 1 it approximates the softmax behavior.
Here is the code snippet for a very small example:
import torch
from entmax import entmax_bisect
torch.softmax(torch.Tensor([0., 0., 1.]), dim=-1) # tensor([0.2119, 0.2119, 0.5761])
entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.9) # tensor([0.2195, 0.2195, 0.5611])
entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.95) # tensor([0.2157, 0.2157, 0.5687])
entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.99) # tensor([0.2127, 0.2127, 0.5747])
entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.999999) # tensor([0.2119, 0.2119, 0.5761])
entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1) # tensor([0.3333, 0.3333, 0.3333]) <--
entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1.00001) # tensor([0.2119, 0.2119, 0.5761])
entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1.1) # tensor([0.1985, 0.1985, 0.6031])
Is this a bug or is it the intended behavior?
I think it's not very clear from the documentation.
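If the strict alpha = 1 case matters, a minimal workaround sketch (a hypothetical helper, not part of the entmax API) is to dispatch to torch.softmax when alpha is exactly 1, since softmax is the alpha = 1 limit of entmax:
import torch
from entmax import entmax_bisect

def entmax_or_softmax(x, alpha, dim=-1):
    # alpha == 1 is the softmax limit; handle it explicitly
    if alpha == 1:
        return torch.softmax(x, dim=dim)
    return entmax_bisect(x, alpha=alpha, dim=dim)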
Hi! I am training a language model similar to the one in the Sparse Text Generation project, with a custom input format. When I start training, it cannot compute the entmax loss.
My inputs and labels both have shape (batch_size, seq_len) before going into the loss; afterwards they are (batch_size*seq_len, vocab_size) and (batch_size*seq_len,) respectively. I use masking via -1 in the labels, and despite setting ignore_index=-1, my log is:
Traceback (most recent call last): │
File "run_lm_finetuning.py", line 782, in <module> │
main() │
File "run_lm_finetuning.py", line 736, in main │
global_step, tr_loss = train(args, train_dataset, model, tokenizer, gen_func) │
File "run_lm_finetuning.py", line 300, in train │
outputs = model(inputs, labels=labels) │
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 880, in _call_impl │
result = self.forward(*input, **kwargs) │
File "/app/src/pytorch_transformers/modeling_gpt2.py", line 607, in forward │
loss = self.loss(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) │
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 880, in _call_impl │
result = self.forward(*input, **kwargs) │
File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 17, in forward │
loss = self.loss(X, target) │
File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 278, in loss │
return entmax_bisect_loss(X, target, self.alpha, self.n_iter) │
File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 242, in entmax_bisect_loss │
return EntmaxBisectLossFunction.apply(X, target, alpha, n_iter) │
File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 129, in forward │
ctx, X, target, alpha, proj_args=dict(n_iter=n_iter) │
File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 45, in forward │
p_star.scatter_add_(1, target.unsqueeze(1), torch.full_like(p_star, -1)) │
RuntimeError: index -1 is out of bounds for dimension 1 with size 50257
How to fix this?
UPD:
I realized that the problem is not connected with ignore_index, but with a shape mismatch between target and p_star in the forward method of the _GenericLossFunction class. I still don't know how to fix this bug, so please help me if somebody knows how :)
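In case it's useful, here is a hedged workaround sketch (it assumes the functional entmax_bisect_loss from entmax.losses and that the loss does not implement ignore_index the way nn.CrossEntropyLoss does): drop the masked positions before calling the loss, so no -1 indices ever reach scatter_add_.
import torch
from entmax.losses import entmax_bisect_loss

def masked_entmax_loss(logits, labels, alpha=1.5, ignore_index=-1):
    # logits: (batch * seq_len, vocab), labels: (batch * seq_len,)
    keep = labels != ignore_index
    return entmax_bisect_loss(logits[keep], labels[keep], alpha=alpha)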
Dear team,
Thank you for your great work.
Could you please add support for TensorFlow 2? I need it for many projects. Do you have any plans to do so?
For the function "entmax_bisect", when given the parameter alpha, it can produce results like softmax (alpha = 1), entmax15 (alpha = 1.5), and sparsemax (alpha = 2). But when I try alpha = 1, it gives wrong results where all the numbers are the same. When I set alpha = 0.99999 or 1.00001 it works well, and for other alphas, like 2 and 1.5, the function also works well. So is this a bug, or am I just using it wrongly? Thank you a lot!