Comments (7)
Desired Behavior:
- When you commit changes to a model, only the diffs are stored
- Integrate with git so that the same system is used to manage source code and models
Background:
- Clean filter
  - Specifies a program that runs when staging files that match some pattern
  - git add foo.txt ==> clean foo.txt | git add
- Smudge filter
  - Specifies a program that runs when checking out files that match some pattern
  - git checkout <commit_hash> ==> smudge <commit_hash> | git checkout
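Concretely, filters are wired up through .gitattributes plus a filter definition in git config; a minimal sketch, where the filter name "ml" and the ml-clean/ml-smudge commands are placeholders:

```
# .gitattributes
*.pt filter=ml

# .git/config (e.g. set via `git config filter.ml.clean "ml-clean %f"`)
[filter "ml"]
    clean = ml-clean %f
    smudge = ml-smudge %f
```

Both programs receive file content on stdin and write the transformed content to stdout; %f is substituted with the file's path.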
Design 1:
- Workflow:
  - Commit the initial my_model.pt as you would any other file in git
  - Update my_model.pt in-place
  - Commit the updated my_model.pt as you would any other file in git
- Implementation:
  - Define a clean filter for *.pt files that:
    1. Creates the directory .git_ml/diffs/my_model/ if it does not yet exist
    2. Computes the diff between the updated my_model.pt and its previous version
    3. Stores the diff file in .git_ml/diffs/my_model/model.diff
    4. Stages the diff file
    5. Does not add my_model.pt to the staging area
  - Define a smudge filter for *.pt files that:
    1. Checks for a diff at .git_ml/diffs/my_model/model.diff
    2. Goes back through the revision history of model.diff
    3. Iteratively applies the diffs from the revision history of model.diff to my_model.pt
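For the simple dense case, the clean filter's diff step and the smudge filter's apply step could look something like the sketch below (plain dicts of floats stand in for real checkpoints, which would be loaded/saved with torch.load/torch.save; all names are illustrative):

```python
# Sketch of Design 1's diff logic for the simple dense case.
# Plain dicts of floats stand in for real torch checkpoints.

def compute_dense_diff(prev, curr):
    """Return the elementwise difference curr - prev for each parameter."""
    return {name: [c - p for c, p in zip(curr[name], prev[name])]
            for name in curr}

def apply_diff(base, diff):
    """Re-apply a stored diff to a base checkpoint (the smudge direction)."""
    return {name: [b + d for b, d in zip(base[name], diff[name])]
            for name in base}
```

The smudge filter's step 3 is then just folding apply_diff over the diffs in revision order. Note that this dense scheme stores a full-size diff; detecting that an update was actually low-rank (the con noted below) is the hard part.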
- Pros:
  - The fact that my_model.pt is handled specially is transparent to the user
  - git checkout only requires checking out the initial version of my_model.pt (which is large and never changes) and all the small diffs up until the commit you are checking out
- Cons:
  - Requires the clean filter to compute the diff, which is hard in general (e.g., disambiguating a low-rank vs. dense update)
  - model.diff is stored in .git_ml/diffs/my_model, which makes it impossible to rename my_model.pt once it's been initially committed
    - This could probably be solved by keying the directory on something like the model file's hash instead of its name
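The hash-based fix is straightforward to sketch: derive the diff directory from a content hash of the initially committed checkpoint, so later renames of my_model.pt don't orphan its diffs (the path layout below is illustrative):

```python
# Sketch of a hash-keyed diff directory layout: the directory name depends
# only on the initial checkpoint's bytes, not on its filename.
import hashlib

def diff_dir(initial_checkpoint_bytes):
    digest = hashlib.sha256(initial_checkpoint_bytes).hexdigest()
    return ".git_ml/diffs/" + digest + "/"
```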
Design 2:
- Workflow:
  - Commit the initial my_model.pt as you would any other file in git
  - Training code modifies my_model.pt in-place and also produces a diff file at .git_ml/diffs/my_model/model.diff
  - Commit the updated my_model.pt as you would any other file in git
- Implementation:
  - Define a clean filter for *.pt files that:
    1. Stages .git_ml/diffs/my_model/model.diff
    2. Restores my_model.pt to its previous version
  - Define a smudge filter for *.pt files that:
    1. Checks for a diff at .git_ml/diffs/my_model/model.diff
    2. Goes back through the revision history of model.diff
    3. Iteratively applies the diffs from the revision history of model.diff to my_model.pt
- Pros:
  - git checkout only requires checking out the initial version of my_model.pt (which is large and never changes) and all the small diffs up until the commit you are checking out
  - The user can specify the type of diff (e.g., low-rank vs. dense update), which is easier than inferring it from model checkpoints
- Cons:
  - Couples training code and version control, since the training code needs to "know about" diff files and where to store them
from git-theta.
I think there's a much simpler design to consider. Suppose a user has a model checkpoint with the following parameter group structure:
{
    'layer1': {
        'w': [1, 2, 3, 4],
        'b': [10]
    },
    'layer2': {
        'w': [-1, -2, -3, -4],
        'b': [3]
    },
    'other_params': {
        'a': 0.2
    }
}
When a user runs git add model.pt, the clean filter loads the model checkpoint and explodes the dictionary structure onto the filesystem under .git/ml:
.git
└── ml
└── model
├── layer1
│ ├── b.pt
│ └── w.pt
├── layer2
│ ├── b.pt
│ └── w.pt
└── other_params
└── a.pt
The clean filter puts this directory structure into the git index. Also, similar to git-lfs, instead of staging model.pt with its full contents, the clean filter will stage a placeholder model.pt that points to .git/ml/model.
When checking out a commit, all that the smudge filter needs to do is re-synthesize model.pt from the exploded version of that model at .git/ml/model.
This design has the following advantages:
- No need for diff files since updating a single parameter group will result in only that parameter group's file being updated in the git commit.
- In the future we'll want to store model data with LFS or something like it, since these will exceed git's maximum file size. In this design we simply need to make the .git/ml/model directory all LFS objects and everything should work the same. In the previous proposal, the smudge filter needed the whole history of diff files to re-synthesize the model checkpoint. This is problematic if the diff files are stored with LFS, since git pull-ing from an LFS store only pulls the latest version of the object.
Thanks @nkandpa2. I think this has a clear advantage in terms of the fact that it will make git natively aware of which parameter groups were updated. A disadvantage, I guess, is that you would ultimately need to effectively materialize a second copy of the checkpoint, right? I'm not clear on what you mean by:
> instead of staging model.pt with its full contents the clean filter will stage a placeholder model.pt that points to .git/ml/model.
Wouldn't we need to stage model.pt so that programs could make use of it?
Also, the directory structure you're proposing is (I think) very similar to how t5x(/flax?) represents checkpoints - see e.g. gsutil ls gs://t5-data/pretrained_models/t5x/byt5_base/checkpoint_1000000/. Each parameter "group" gets its own subdirectory, e.g. gs://t5-data/pretrained_models/t5x/byt5_base/checkpoint_1000000/target.encoder.layers_9.mlp.wo.kernel/. Each subdirectory contains a TensorStore object (TensorStore is a very nice library for storing and accessing tensors on disk) and a .zarray metadata file (required by TensorStore's storage format). Using TensorStore is probably much cleaner than using individual .pt files (though I know you were being illustrative).
I should mention the naming of the TensorStore object corresponds to sharding, e.g. in the example above the TensorStore file is called 0.0. I don't really understand that naming convention/the sharding but just FYI in case you were wondering.
> Thanks @nkandpa2 . I think this has a clear advantage in terms of the fact that it will make git natively aware of which parameter groups were updated. A disadvantage I guess is that you would ultimately need to effectively materialize a second copy of the checkpoint, right? I'm not clear on what you mean
> instead of staging model.pt with its full contents the clean filter will stage a placeholder model.pt that points to .git/ml/model.
> Wouldn't we need to stage model.pt so that programs could make use of it?
model.pt never gets staged (i.e., put into the area holding everything about to be committed); instead, a pointer to the exploded checkpoint view gets staged. The working directory (the user's view of the directory) still contains the full model.pt, so after git add/git commit the full model.pt file is still there from the user's perspective.
git-lfs does something very similar. For LFS-tracked files, (1) the file gets copied to .git/lfs, (2) a file pointing to the LFS-tracked file gets staged, and (3) on git push the file being pointed to gets synced to an LFS store. After git add, the LFS-tracked file is still in the working directory even though a pointer file is in the staging area.
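For concreteness, an LFS pointer file is just a small text stub like the following (the oid and size here are made up); the placeholder model.pt staged in this design could look similar, pointing at .git/ml/model instead:

```
version https://git-lfs.github.com/spec/v1
oid sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
size 134217728
```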
I see, that makes sense, thanks.