
Comments (3)

HuanyuZhang commented on August 16, 2024

A quick question/guess: is there any model parameter that goes through several (X) forward passes but fewer than X backward passes? For those parameters, per_sample_grad will not be calculated or stored correctly, which might lead to this issue.
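A hedged illustration of that failure mode in plain PyTorch (the model and names here are hypothetical, not taken from the notebooks below): a layer whose output is used twice in the forward pass, but where one use is detached from the graph, records two forward passes and only one backward pass. Hook-based per-sample-gradient engines such as Opacus pair stored forward activations with backward-hook calls, so a mismatch like this can leave per_sample_grad incomplete.

```python
import torch
import torch.nn as nn

# Count forward and backward passes through a shared layer.
layer = nn.Linear(4, 4)
fwd_calls, bwd_calls = [], []
layer.register_forward_hook(lambda m, inp, out: fwd_calls.append(1))
layer.register_full_backward_hook(lambda m, gi, go: bwd_calls.append(1))

x = torch.randn(2, 4)
a = layer(x)           # forward pass 1: stays in the autograd graph
b = layer(x).detach()  # forward pass 2: cut out of the graph
loss = (a + b).sum()
loss.backward()

print(len(fwd_calls), len(bwd_calls))  # 2 forward passes, only 1 backward pass
```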

from opacus.

nhianK commented on August 16, 2024

Thank you for responding. I am not sure what you mean by X forward passes and fewer than X backward passes. Could you give me a minimal reproducible case like that?
Here are some relevant details about the architecture that could help with understanding the issue. Referring to the architecture in the picture:
Input data labeled Rasch embeddings are processed by the question and knowledge encoders to obtain xhat_t and yhat_t respectively (contextualized embeddings; both use self-attention). In the knowledge encoder (masked attention), xhat_t is used for the key and query, and yhat_t for the value. The way it is implemented in the code is that the inputs are passed sequentially through the transformer blocks. Initially, a single transformer class was defined for all three components, containing all the necessary conditional logic and flags.
One thing I did was remove the operations nested in the if-else statements meant for the knowledge retriever and knowledge encoder, because they were causing issues. These conditional statements dictated when masks or certain layers would be applied. After the removal, the original behavior of the model was still preserved. However, one component remains difficult to pin down: the transformer layer had one remaining condition.
This condition was only meant for the knowledge retriever and knowledge encoder, as shown in the illustration, where linear layers and an activation are applied; the question encoder is an exception:
if not question_encoder:
    query2 = self.linear2(self.dropout(self.activation(self.linear1(query))))
    query = query + self.dropout2(query2)
    query = self.layer_norm2(query)
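For context, the conditional feed-forward sublayer described above can be sketched as a self-contained module. This is a hypothetical reconstruction: the class name, dimensions, and dropout rates are illustrative and not taken from the notebooks.

```python
import torch
import torch.nn as nn

class BlockWithOptionalFFN(nn.Module):
    """Transformer sublayer whose feed-forward branch is conditional.

    Hypothetical sketch: when has_ffn is False (the question-encoder case
    described above), the input passes through unchanged.
    """

    def __init__(self, d_model=16, d_ff=32, has_ffn=True):
        super().__init__()
        self.has_ffn = has_ffn
        if has_ffn:
            self.linear1 = nn.Linear(d_model, d_ff)
            self.linear2 = nn.Linear(d_ff, d_model)
            self.activation = nn.ReLU()
            self.dropout = nn.Dropout(0.1)
            self.dropout2 = nn.Dropout(0.1)
            self.layer_norm2 = nn.LayerNorm(d_model)

    def forward(self, query):
        if self.has_ffn:
            query2 = self.linear2(self.dropout(self.activation(self.linear1(query))))
            query = query + self.dropout2(query2)
            query = self.layer_norm2(query)
        return query
```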
Here are notebooks reproducing the architecture with and without Opacus:

Model without Opacus (it runs): https://colab.research.google.com/drive/1CjPdzUaThLKrY0vVUUM-__zrLgMsFxe7?usp=sharing
Model with Opacus: https://colab.research.google.com/drive/1D0TwshmEzhc3_ymKo9PQPXFATyvFMWhM?usp=sharing

I defined all the encoders and the knowledge retriever separately because I needed to eliminate problems with the conditional statements/computation paths.
One behavior I observed is that with Opacus, the layers above need to be applied in all three forward computation paths, or the model cannot compute per-sample gradients; the question encoder can't be an exception (see the Question encoder class in the second notebook and my comment there).
So my question: why do you think I am seeing this strange behavior? I am trying to get the model to run without the question encoder having these linear layers. In other words, why does the absence of these layers in the question encoder prevent per-sample gradient computation? In the picture, I annotated where the layers should or should not be in the original architecture.
[Image: akt-ann, annotated architecture diagram]


HuanyuZhang commented on August 16, 2024

Thanks for your detailed response. By any chance could we experiment with one single block at a time (for example, the Question encoder) to see whether the problem replicates?

Specifically, take x to be the output of the Question encoder, define some dummy label y and any loss function L, and do the backward pass of L(x, y). Then we can see whether per_sample_grad is empty or not.

