Comments (3)
A quick question/guess: is there any model parameter that sees several (X) forward passes but fewer than X backward passes? For such parameters, the per-sample gradient will not be correctly computed/stored, which might lead to this issue.
from opacus.
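The mismatch described above can be sketched in plain PyTorch by counting hook invocations. This is an invented minimal example (the layer and wiring are not from the model under discussion): a module is run twice in forward, but one of its outputs is detached, so autograd reaches it only once. Opacus stores activations on every forward call and consumes them on every backward call, so a count mismatch like this leaves per-sample gradients incomplete.

```python
import torch
import torch.nn as nn

# Hypothetical module reused twice in forward(); one output is detached,
# so the backward pass only reaches it once.
layer = nn.Linear(4, 4)
counts = {"fwd": 0, "bwd": 0}

def fwd_hook(module, inp, out):
    counts["fwd"] += 1

def bwd_hook(module, grad_in, grad_out):
    counts["bwd"] += 1

layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(2, 4)
out = layer(x) + layer(x).detach()  # two forward calls, one on the autograd path
out.sum().backward()

print(counts)  # forward count (2) exceeds backward count (1)
```

Any architecture branch that detaches, masks out, or conditionally skips a shared submodule can produce this pattern.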
Thank you for responding. I am not sure what you mean by X forward passes and fewer than X backward passes. Could you give me a minimal reproducible example of such a case?
But here are some relevant details about the architecture that could help with understanding the issue. Referring to the architecture in the picture:
Input data labeled Rasch embeddings are processed by the question and knowledge encoders respectively to obtain xhat_t and yhat_t (contextualized embeddings; both use self-attention). In the knowledge encoder (masked attention), xhat_t is used for the key and query, and yhat_t for the value. The way it is implemented in the code is that the inputs are passed sequentially through the transformer blocks. Initially there was a single transformer class defined for all three components, and it contained all the necessary conditional logic and flags.
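The cross-attention wiring described above (query/key from the xhat_t stream, value from the yhat_t stream, with a causal mask) can be sketched with plain PyTorch; the dimensions and tensor names here are illustrative, not taken from the actual codebase:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch = 16, 4, 10, 2
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

xhat = torch.randn(batch, seq_len, embed_dim)  # contextualized question stream
yhat = torch.randn(batch, seq_len, embed_dim)  # contextualized knowledge stream

# Causal (masked attention) pattern: position t cannot attend to the future.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Query and key come from xhat_t, value from yhat_t.
out, _ = attn(query=xhat, key=xhat, value=yhat, attn_mask=mask)
print(out.shape)  # torch.Size([2, 10, 16])
```

Note that with Opacus, `nn.MultiheadAttention` is unsupported and must be swapped for Opacus's `DPMultiheadAttention`; the wiring stays the same.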
So one thing I did was remove the operations nested in the if-else statements that were meant for the knowledge retriever and knowledge encoder, because they were causing issues. These conditional statements dictated when masks or certain layers would be applied. After their removal, the original features of the model were still preserved. However, one component remains difficult to pin down: the Transformer layer had one remaining condition.
This condition applies only to the knowledge retriever and knowledge encoder, as shown in the illustration, where linear layers and an activation are applied; the question encoder is the exception.
if not self.is_question_encoder:  # flag name is illustrative; the FFN sub-block is skipped for the question encoder
    query2 = self.linear2(self.dropout(self.activation(self.linear1(query))))
    query = query + self.dropout2(query2)
    query = self.layer_norm2(query)
Here are the notebooks reproducing the architecture with and without opacus.
model without Opacus (it runs): https://colab.research.google.com/drive/1CjPdzUaThLKrY0vVUUM-__zrLgMsFxe7?usp=sharing
model with Opacus: https://colab.research.google.com/drive/1D0TwshmEzhc3_ymKo9PQPXFATyvFMWhM?usp=sharing
I defined all the encoders and the knowledge retriever separately because I needed to eliminate problems with conditional statements/computation paths.
So one behavior I observed is that with Opacus, the layers above need to be applied on all three forward computation paths, or the model cannot compute per-sample gradients. The question encoder can't be an exception. (See the class Question encoder in the second notebook and my comment there.)
So my question: why do you think I am seeing this strange behavior? I am trying to get it to run without the question encoder having these linear layers. In other words, why does the absence of these layers in the question encoder prevent per-sample gradient computation? In the picture, I annotated where the layers should or should not be in the original architecture.
from opacus.
Thanks for your detailed response. By any chance could we experiment with one single block at a time (for example, the question encoder) to see whether the problem replicates?
Specifically, let x be the output of the question encoder, define some dummy label y and any loss function L, and do the backward pass of L(x, y). Then we can check whether per_sample_grad is empty or not.
from opacus.
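The isolation test suggested above can be sketched as follows. The encoder here is a stand-in (not the actual question-encoder class), and it checks plain `.grad`; with an Opacus-wrapped model (`GradSampleModule`) you would inspect `p.grad_sample` in the same way:

```python
import torch
import torch.nn as nn

# Stand-in for one isolated block, e.g. the question encoder (illustrative).
encoder = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

data = torch.randn(4, 8)             # dummy batch
x = encoder(data)                    # x = output of the isolated block
y = torch.randn_like(x)              # dummy label
loss = nn.functional.mse_loss(x, y)  # any loss L(x, y)
loss.backward()

# Every trainable parameter on this path should now have a gradient.
# Under Opacus, check p.grad_sample instead of p.grad.
missing = [n for n, p in encoder.named_parameters() if p.grad is None]
print(missing)  # [] means all parameter gradients were populated
```

Running this per block narrows down which component leaves per-sample gradients uninitialized.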
Related Issues (20)
- UnsupportedModuleError: [IllegalModuleConfigurationError('Model needs to be in training mode')] HOT 1
- LLM finetuning with Opacus HOT 2
- Add context manager to toggle on/off privacy in training loop
- Some issue with loading model with 'weight' as opposed to 'pretrained=True' HOT 1
- Error: Trying to add hooks twice to the same model HOT 2
- OverflowError: cannot convert float infinity to integer HOT 2
- ModuleValidator.fix() causes layer gradients to be None. HOT 1
- Integrating Opacus for a custom pytorch model raises errors but works fine on its own HOT 2
- `BatchSplittingSampler` return wrong length HOT 5
- Error occurred when executing GroundingDinoSAMSegment (segment anything): Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
- Why GDP and RDP give different result for the same config HOT 1
- ValueError: Per sample gradient is not initialized. Not updated in backward pass? HOT 4
- Spectral normalization in Opacus HOT 4
- BatchMemoryManager for training private GANs
- Wrapper references can be easily replaced, consider using properties instead HOT 3
- Error in DPOptimizer: Inconsistency between batch_first argument of PrivacyEngine and DPMultiheadAttention HOT 2
- Grad Sample Module: Use full backward hook to save activations and backprop values. HOT 1
- Microbatching Support HOT 4
- Training a simple transformer model with Opacus produces runtime error due to mismatch in dimensions HOT 1