Comments (7)
It's a model parameter, so it will be updated by optimizer.step() like any other parameter.
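For concreteness, a minimal sketch (current_val follows the repo's naming; the module itself is a hypothetical stand-in) of why an nn.Parameter is updated by optimizer.step():

import torch
import torch.nn as nn

class AdaptiveMask(nn.Module):  # illustrative stand-in, not the repo's module
    def __init__(self, init_val=0.0):
        super().__init__()
        # registered as a parameter, so it shows up in model.parameters()
        self.current_val = nn.Parameter(torch.tensor([init_val]))

    def forward(self, x):
        # use the parameter in the forward pass so it receives a gradient
        return x * self.current_val

model = AdaptiveMask()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.ones(4)).sum()
loss.backward()
opt.step()  # current_val is nudged like any other weight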
from adaptive-span.
Thanks for your reply. I wanted to ask:
- Do you think adaptive span takes longer to converge compared to standard attention? In my case, I'm seeing improvements, but the extent is very small. Could this be due to trim_memory? Did you try this on other tasks besides char LM?
- In your experiments, did the adaptive span loss become non-zero at any point? Although current_val is a parameter and is constantly updated, the loss is a constant 0.
Thanks for your support.
from adaptive-span.
- Not sure what "converge" means here. If you're saying the span is not growing large enough, you might want to reduce the loss coefficient associated with it. trim_memory shouldn't affect learning. Yes, we used it on word-level LM without a problem.
- The loss can be zero if it has too large a weight compared to the LM loss. Try setting --adapt-span-loss to 0.
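For context, a hedged sketch of how that coefficient typically enters the objective (get_aux_loss is a hypothetical accessor for the mean-span penalty; the exact wiring in the repo may differ):

# the auxiliary term penalizes the average span size; if its weight is
# large relative to the LM loss, the optimizer drives the spans to zero
adapt_span_loss = 2e-6                        # value of --adapt-span-loss (illustrative)
aux = adapt_span_loss * model.get_aux_loss()  # hypothetical mean-span penalty
total_loss = lm_loss + aux
total_loss.backward()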
from adaptive-span.
Hi,
Thanks for replying. What did you use to calculate FLOPS?
from adaptive-span.
We just counted all the FLOPS in the model. For example, a linear layer has d_in x d_out FLOPS.
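As a rough illustration, a minimal sketch of that counting rule (multiply-accumulates only; biases and nonlinearities ignored):

def linear_flops(d_in, d_out):
    # one input vector through nn.Linear(d_in, d_out): d_in x d_out multiply-adds
    return d_in * d_out

# e.g. a 768 -> 3072 feed-forward projection
print(linear_flops(768, 3072))  # 2359296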
from adaptive-span.
Thanks for your reply.
- In the case where trim_len < 0, trim_memory will perform padding on the input tensor as specified here. In my case, trim_len < 0 since 1024 is large, so here's what happens:
# query.shape -> [128,36,768]
# key.shape -> [128,20,768]
# value.shape -> [128,20,768]
k, v, k_pe = adaptive.trim_memory(q, k, v, k_pe)
# k.shape -> [128,1060,768]
# v.shape -> [128,1060,768]
# k_pe.shape -> [1,64,768]
So in this case, I don't think memory consumption is being reduced, since the dimensions have now grown many-fold and more FLOPS are required. Am I right, or am I missing something? For now, I've removed this operation (see the sketch after this list).
- Using the masking function as specified in the paper, my FLOPS have stayed the same:
macs: 12.074G
params: 237.558M
These results were measured during inference. Did you measure FLOPS (as in the paper) during training, since spans only change during training? My spans are changing, but the FLOPS stay the same. Is it because the trimming operations are solely responsible for reducing FLOPS?
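A minimal sketch of the trim/pad logic discussed in the first bullet (semantics assumed from the shapes above, not the repo's exact code):

import torch.nn.functional as F

def trim_memory(key, value, attn_span):
    # key/value: [batch, mem_len, d_model]
    trim_len = key.size(1) - attn_span
    if trim_len > 0:
        # cache longer than the span: drop the oldest positions
        key, value = key[:, trim_len:], value[:, trim_len:]
    elif trim_len < 0:
        # cache shorter than the span: left-pad with zeros, which is what
        # inflates the shapes (and hence the FLOPS) observed above
        key = F.pad(key, (0, 0, -trim_len, 0))
        value = F.pad(value, (0, 0, -trim_len, 0))
    return key, value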
from adaptive-span.
As noted in the paper, FLOPS is the number of FLOPS necessary for computing a one-step prediction. So it's not the training-time FLOPS, where a batch of samples is processed together.
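A back-of-envelope version of that definition (assumed constants; only the attention term depends on the span):

def one_step_attention_flops(d_model, span):
    # cost of predicting a single token with one attention layer:
    # Q/K/V/output projections are span-independent; attending over the
    # `span` cached keys/values is the span-dependent part
    proj = 4 * d_model * d_model
    attend = 2 * span * d_model
    return proj + attend

print(one_step_attention_flops(768, 4096))  # large span
print(one_step_attention_flops(768, 256))   # shrunken span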
from adaptive-span.
Related Issues (20)
- Will adaptive-span have faster predictive speeds than gpt-2? HOT 2
- Compute attention span of individual attention heads HOT 1
- Queries about adaptive span HOT 1
- Warning with PyTorch 1.4 HOT 4
- A question about parameter z_t HOT 9
- Generate text HOT 1
- BPC HOT 6
- Understanding graphs from papers
- What does batch-size mean using distributed trainning? HOT 1
- Please convert to a permissive license
- confuse HOT 1
- Accept a mask to remove padding in batch HOT 1
- what is the cache_size mean? HOT 1
- Why does the hyper-parameter --batch-sz affect the bpc during evaluation? HOT 3
- Where to find the pretrained checkpoint? HOT 1
- Using mask can reduce FLOPs? HOT 2
- Question: How to reduce the memory in this project HOT 7
- did you try to start with maximum possibile cache size HOT 2
- why not compare other local attention methods? HOT 2