
Comments (7)

tesatory commented on June 15, 2024

It's a model parameter, so it will be updated by optimizer.step() like any other parameter.
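For anyone landing here, a minimal sketch (not the repo's actual class) of how a learnable value like current_val behaves once it's registered as an nn.Parameter:

import torch
import torch.nn as nn

# Minimal sketch: a learnable span-like scalar registered as nn.Parameter,
# so optimizer.step() updates it alongside the model weights.
class SoftSpan(nn.Module):
    def __init__(self, init_val=0.5):
        super().__init__()
        self.current_val = nn.Parameter(torch.tensor(init_val))

    def forward(self, x):
        # Clamp to [0, 1] so it acts like a soft gate; gradients still
        # flow to current_val through the multiplication.
        return x * self.current_val.clamp(0, 1)

span = SoftSpan()
opt = torch.optim.SGD(span.parameters(), lr=0.1)
loss = span(torch.ones(4)).sum()
loss.backward()
opt.step()  # current_val moves just like any other parameter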


prajjwal1 commented on June 15, 2024

Thanks for your reply. I wanted to ask:

  1. Do you think adaptive span takes longer to converge than standard attention? In my case I'm seeing improvements, but they are quite small. Could this be due to trim_memory? Did you try this on tasks other than char-level LM?
  2. In your experiments, did the adaptive span loss ever become non-zero? Although current_val is a parameter and is constantly updated, the loss stays at a constant 0.

Thanks for your support.


tesatory commented on June 15, 2024
  1. Not sure what "converge" means here. If you're saying the span isn't growing large enough, you might want to reduce the loss coefficient associated with it. trim_memory shouldn't affect learning. Yes, we used it on word-level LM without a problem.

  2. The loss can be zero if it has too large a weight compared to the LM loss. Try setting --adapt-span-loss to 0.
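To make that trade-off concrete, here is a hedged sketch of the span penalty (illustrative names and values, not the repo's exact API): if the coefficient dominates the LM loss, the optimizer drives the span, and hence the penalty, toward zero, which would show up as a constant-zero adaptive span loss.

import torch

# Illustrative span regularizer: total loss = LM loss + coeff * span.
# coeff plays the role of --adapt-span-loss; the value is made up.
current_span = torch.nn.Parameter(torch.tensor(512.0))
coeff = 2e-6

lm_loss = torch.tensor(3.2)          # stand-in for the language-model loss
span_penalty = coeff * current_span  # grows linearly with the span
total_loss = lm_loss + span_penalty
total_loss.backward()
print(current_span.grad)  # constant downward pressure on the span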


prajjwal1 commented on June 15, 2024

Hi,
Thanks for replying. What did you use to calculate FLOPS?


tesatory commented on June 15, 2024

We just counted all the FLOPS in the model. For example, a linear layer has d_in x d_out FLOPS.
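A rough illustration of that counting rule (the shapes are assumptions for a generic Transformer block, not the paper's exact model):

# Count d_in * d_out per linear layer, as described above.
def linear_flops(d_in, d_out):
    return d_in * d_out

d_model, d_ff = 768, 3072
block_flops = (
    4 * linear_flops(d_model, d_model)  # Q, K, V, and output projections
    + linear_flops(d_model, d_ff)       # FFN up-projection
    + linear_flops(d_ff, d_model)       # FFN down-projection
)
print(f"{block_flops / 1e6:.2f}M FLOPS per position per block")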


prajjwal1 commented on June 15, 2024

Thanks for your reply.

  1. In the case where trim_len < 0, trim_memory pads the input tensor, as specified here. So in my case trim_len < 0, since the span (1024) is large; here's what happens (see the padding sketch after this comment):
# q.shape -> [128, 36, 768]
# k.shape -> [128, 20, 768]
# v.shape -> [128, 20, 768]
k, v, k_pe = adaptive.trim_memory(q, k, v, k_pe)
# k.shape -> [128, 1060, 768]
# v.shape -> [128, 1060, 768]
# k_pe.shape -> [1, 64, 768]

So in this case I don't think memory consumption is being reduced; the time dimension has grown many-fold, and more FLOPS are required. Am I right, or am I missing something? For now, I've removed this operation.

  2. Using the masking function as specified in the paper, my FLOPS have stayed the same:
macs: 12.074G 
params: 237.558M

These results were measured during inference. Did you measure FLOPS (as reported in the paper) during training, since spans change only during training? My spans are changing (after some modifications), but the FLOPS stay the same. Is that because the trimming operations alone are responsible for reducing FLOPS?
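For reference, a hedged reconstruction of the padding branch discussed in point 1 (variable names follow the shapes above; the repo's exact logic may differ): when the cached memory is shorter than the maximum span, the keys/values are left-padded up to span length instead of trimmed.

import torch
import torch.nn.functional as F

attn_span = 1024
q = torch.randn(128, 36, 768)
k = torch.randn(128, 20, 768)
v = torch.randn(128, 20, 768)

cache_size = k.size(1) - q.size(1)         # 20 - 36 = -16
trim_len = 0                               # assume no trimming is possible
pad = (attn_span - cache_size) - trim_len  # 1024 + 16 = 1040
if pad > 0:
    # F.pad pairs run from the last dim inward: (0, 0) leaves the feature
    # dim alone, (pad, 0) left-pads the time dim up to the full span.
    k = F.pad(k, [0, 0, pad, 0])
    v = F.pad(v, [0, 0, pad, 0])
print(k.shape)  # torch.Size([128, 1060, 768]) -- matches the shapes above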


tesatory commented on June 15, 2024

As noted in the paper, FLOPS is the number of FLOPS needed to compute a one-step prediction. So it's not the training-time FLOPS, where a batch of samples is processed together.
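In other words (an illustrative calculation, not code from the repo), at generation time a single query attends over a span of S cached positions, so the attention cost per predicted token scales with S even though a fixed training batch keeps its measured MACs constant:

# Per-step attention cost for one query over a span of S cached keys:
# S * d for the QK^T scores plus S * d for the weighted sum over values.
def per_step_attention_flops(span, d_model):
    return 2 * span * d_model

d_model = 768
for span in (256, 1024, 4096):
    print(span, per_step_attention_flops(span, d_model))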

