Comments (5)
This is one possible solution:
When a user runs `git theta add <checkpoint>`, the checkpoint's updated parameter groups get saved under `.git_theta/<path to model>` and then the command internally runs `git add <checkpoint>`. However, when the user unintentionally runs `git add <checkpoint>` themselves, the updated parameter groups never get saved under `.git_theta/<path to model>`. In this case the checkpoint file and its representation in `.git_theta/<path to model>` are inconsistent. We can check for this inconsistency in the clean filter. In cases where the checkpoint file and the directory are inconsistent, the clean filter can fail (so that the file does not get `git add`-ed) and log a message saying to use `git theta add` instead.
Doing this naively would be expensive for large models. It would require loading the model parameters into memory twice: once for the checkpoint file and once for the `.git_theta/<path to model>` directory. Instead, in the `.git_theta/<path to model>` directory, we can directly store the model metadata file (produced by the clean filter) containing the shape, type, and hash of each parameter group. With that file stored in the directory, there is no need to load the parameters from the directory; we can just check for consistent hashes.
This approach of saving the metadata file under `.git_theta/<path to model>` also has the benefit of simultaneously helping with #60 since it makes it simple to see what parameter groups have changed.
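To make the hash comparison concrete, here is a minimal sketch. It is not git-theta's actual metadata format: the `param_metadata` and `changed_groups` helpers, the `(shape, dtype, raw_bytes)` input layout, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def param_metadata(params):
    """Build {name: {"shape", "dtype", "hash"}} for each parameter group.

    `params` maps a group name to a (shape, dtype, raw_bytes) tuple. This is a
    hypothetical stand-in for a loaded checkpoint, not git-theta's real API.
    """
    return {
        name: {
            "shape": list(shape),
            "dtype": dtype,
            "hash": hashlib.sha256(raw).hexdigest(),
        }
        for name, (shape, dtype, raw) in params.items()
    }


def changed_groups(stored, current):
    """Names of groups whose metadata differs or that exist on only one side."""
    names = set(stored) | set(current)
    return sorted(n for n in names if stored.get(n) != current.get(n))


# Only the small metadata dicts are compared; the stored parameters are never reloaded.
old = param_metadata({"w": ((2,), "float32", b"\x00" * 8), "b": ((1,), "float32", b"\x00" * 4)})
new = param_metadata({"w": ((2,), "float32", b"\x01" * 8), "b": ((1,), "float32", b"\x00" * 4)})
print(changed_groups(old, new))
```

An empty result would mean the checkpoint and the stored metadata agree; a non-empty one lists exactly the groups that changed, which is the part that could help with #60.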
from git-theta.
The quick solution is to have a metadata file in the `.git_theta/<path-to-model>` directory. When the user runs `git add <checkpoint>`, we know that it calls the clean filter. Inside the clean filter, we check whether the metadata file exists. If it exists, we compare its contents with the metadata computed from the current checkpoint. If they differ, or the metadata file doesn't exist, we throw an error saying the user has to run `git theta add <checkpoint>`.
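A rough sketch of that check, assuming the metadata file is JSON. The `check_against_stored` name, the JSON format, and the messages are hypothetical, not what git-theta actually does.

```python
import json
import os


def check_against_stored(metadata_path, current_metadata):
    """Return (ok, message) for a hypothetical clean-filter consistency check.

    `current_metadata` is the metadata computed from the checkpoint being added;
    `metadata_path` is where the stored copy is assumed to live.
    """
    if not os.path.exists(metadata_path):
        return False, "no stored metadata; run `git theta add <checkpoint>`"
    with open(metadata_path) as f:
        stored = json.load(f)
    if stored != current_metadata:
        return False, "checkpoint and stored metadata differ; run `git theta add <checkpoint>`"
    return True, "consistent"
```

A filter built on this would print the message to stderr and exit non-zero whenever `ok` is false.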
One bad case: if the user runs `git add <checkpoint>` without any modifications to the checkpoint, we don't throw any error but overwrite the staged files with the same contents as before.
When I implement this, `git status` and `git diff` fail when run after modifying the checkpoint, because the clean filter is called and there is now a mismatch between the checkpoint and the metadata inside `.git_theta/<path-to-model>`.
from git-theta.
> we can directly store the model metadata file (produced by the clean filter) containing the shape, type, and hash of each parameter group.
We talked about how you can't run `git add` within a clean filter, but are other git commands allowed? Instead of having a copy of the metadata file in the `.git_theta/` dir, we could use git to look at the value of the checkpoint file at HEAD, which would be the metadata version. Would this be easier, and avoid any divergence in state between the checked-in metadata file under `.git_theta/` and the one checked in as the replacement for the checkpoint file?
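For reference, reading the committed version could look roughly like the sketch below. It shells out to `git cat-file -p HEAD:<path>`, which prints the stored blob without applying smudge filters. The `blob_at_head` helper is hypothetical, and whether it is safe to call from inside a clean filter is exactly the open question here.

```python
import subprocess


def blob_at_head(path, repo_dir="."):
    """Return the raw blob stored at HEAD for `path`, or None if unavailable.

    `path` is relative to the repository root. None covers: not a git repo,
    no commits yet, path absent at HEAD, or the git executable is missing.
    """
    try:
        result = subprocess.run(
            ["git", "cat-file", "-p", f"HEAD:{path}"],
            cwd=repo_dir,
            capture_output=True,
            check=False,
        )
    except FileNotFoundError:
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

The returned bytes would be the metadata file that was committed in place of the checkpoint, so they could be compared against freshly computed metadata without keeping a second copy under `.git_theta/`.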
from git-theta.
The major issue with this idea is that the clean filter isn't just run upon `git add`, so checking for consistency between the model metadata file and the `.git_theta/` directory in the clean filter won't work. An alternative solution would be to do this consistency check inside of a git pre-commit hook.
Implementation-wise we could just make a script `bin/git-theta-check` (or something like that). The script would
- create a `git.Repo` object
- check the files being committed by looking at the git index with `repo.index.entries`
- check whether any of them are being tracked by git-theta by looking at entries in `.gitattributes`
- for any files being tracked by git-theta, verify that the checkpoint file contents and the `.git_theta/` directory are consistent
If they are inconsistent, we should abort the commit and tell the user to run `git theta add <model>` before trying to commit again.
This script would just need to be called from `.git/hooks/pre-commit` so that git runs the check just before committing. We would add this call to `.git/hooks/pre-commit` when the user runs `git theta track <model>`.
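The steps above could be sketched as follows. Several things are assumptions: that tracked files are marked with `filter=theta` in `.gitattributes`, that `fnmatch` is a good enough approximation of gitattributes pattern matching, and that an `is_consistent(path)` callable exists to do the real comparison. The helper names are hypothetical; `git.Repo` and `repo.index.entries` are from GitPython, which is imported lazily so the pure helpers work without it.

```python
import fnmatch
import sys


def theta_patterns(gitattributes_text):
    """Patterns whose attributes include filter=theta (assumed marker)."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        if "filter=theta" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def tracked_by_theta(paths, patterns):
    """Paths matching any pattern; fnmatch only approximates git's rules."""
    return [p for p in paths if any(fnmatch.fnmatch(p, pat) for pat in patterns)]


def inconsistent_files(paths, gitattributes_text, is_consistent):
    """Tracked paths for which the hypothetical is_consistent(path) is False."""
    tracked = tracked_by_theta(paths, theta_patterns(gitattributes_text))
    return [p for p in tracked if not is_consistent(p)]


def index_paths(repo_dir="."):
    """All paths in the git index, via GitPython (entries are keyed by (path, stage))."""
    import git  # GitPython, imported here so the helpers above run without it

    repo = git.Repo(repo_dir)
    return sorted({path for (path, _stage) in repo.index.entries})


def main(is_consistent, repo_dir="."):
    """Pre-commit entry point: exit non-zero to abort the commit."""
    with open(f"{repo_dir}/.gitattributes") as f:
        attrs = f.read()
    bad = inconsistent_files(index_paths(repo_dir), attrs, is_consistent)
    for path in bad:
        print(f"{path}: run `git theta add {path}` before committing", file=sys.stderr)
    return 1 if bad else 0
```

Note that `repo.index.entries` lists everything in the index, not only the files changed by the commit, so a real script would likely narrow that set first.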
from git-theta.
Not needed after #114
from git-theta.