r-three / git-theta
git extension for {collaborative, communal, continual} model development
License: Apache License 2.0
After staging a change to a checkpoint with git theta add /path/to/my-model.pt, we should be able to use git reset --hard to unstage the changes and blow away working-tree modifications, restoring the last commit.
Currently this results in a file-not-found error for the ${git_repo}/path/to/my-model.pt file during one of the smudges.
Now that checkpoint handling has moved to plugins, we don't need every deep learning framework installed all the time, especially given how heavy they can be.
Update setup.py to include extras_require entries for the various frameworks so they can be installed alongside git-theta. Also include a target that installs all of the frameworks, or at least the most popular ones.
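A hedged sketch of what the extras table might look like; the extra names and package lists below are assumptions for illustration, not the project's actual dependency choices.

```python
# Hypothetical framework extras; names and packages are illustrative,
# not git-theta's actual dependency lists.
framework_extras = {
    "pytorch": ["torch"],
    "tensorflow": ["tensorflow"],
    "flax": ["flax", "jax"],
}
# An "all" target that installs every framework at once.
framework_extras["all"] = sorted(
    {pkg for pkgs in framework_extras.values() for pkg in pkgs}
)
# This dict would be passed as setup(..., extras_require=framework_extras),
# letting users run `pip install git-theta[pytorch]` or `pip install git-theta[all]`.
```

With this layout, the base install stays light and users only pull in the frameworks they need.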
We should record a .name
value on each of the plugins. This property on a checkpoint object should return a string such that calling get_checkpoint
with this name returns the class of this object.
This will make things like logging what checkpoint type is used (and making sure we use the same one across multiple cleans, etc) much easier, especially when the value is set via an environment variable.
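A minimal sketch of the proposed .name lookup; the class names and the registry below are illustrative, not git-theta's actual plugin machinery.

```python
# Minimal sketch of the proposed .name lookup; classes and registry are
# illustrative, not git-theta's actual plugin machinery.
class Checkpoint:
    name = None  # each plugin subclass sets a unique string


class PyTorchCheckpoint(Checkpoint):
    name = "pytorch"


class FlaxCheckpoint(Checkpoint):
    name = "flax"


_PLUGINS = {cls.name: cls for cls in (PyTorchCheckpoint, FlaxCheckpoint)}


def get_checkpoint(name):
    """Return the checkpoint class registered under ``name``."""
    return _PLUGINS[name]
```

The round trip from the issue then holds: `get_checkpoint(PyTorchCheckpoint().name)` returns `PyTorchCheckpoint`, and the name string can equally come from an environment variable.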
It should still store the parameters and how they are changed.
Currently the params
module depends on torch only to convert tensors back into numpy arrays.
Since we are working on supporting multiple checkpoint formats, can we just use numpy for most of these methods?
E.g. hyperformer/tracked_outputs/pytorch_model.bin
should appear in .git_cml/hyperformer/tracked_outputs/pytorch_model.bin
This would determine what to print out when git diff is run.
As a starting point, this could be:
Operation type (e.g. dense update of a parameter)
Parameter name
New value
Probably needs to be able to refer to an external location for the initial checkpoint, since we don't want to store it all in git - it's too big. Might look like Git LFS. Bonus: store the random seed that can be used to reconstruct the initial parameter values.
The scripts in our bin/
directory don't end in .py
so they seem to get missed by black (I have confirmed they are missed in the pre-commit hook and I am pretty sure they are missed in the CI lint).
Update both the pre-commit hook and the CI to actually format these files. This will probably require a regex, as I think specifying specific files in pre-commit removes the file-type-based default change detection.
Should take a PyTorch checkpoint and basically construct a dict-like object that is keyed by parameter name and whose values are the parameter values.
rather than assuming it's a pytorch checkpoint
With the change to using flat maps, we are no longer using the iterate_(dict|dir)_leaves
functions.
They should be removed. The biggest part of the effort is that the new functions like flatten
and walk_dir
are mostly tested indirectly through these iterate functions. The tests need to be updated to exercise the functions we actually use.
When running git theta add
all of the parameter groups are saved in .git_theta/<path to model>
even if only a few of them were actually modified. Instead of writing the whole model to disk every time we run git theta add
, check what has changed and only write those parameters to disk.
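One way to sketch the change check, assuming parameters can be compared by a content hash; using sha256 over serialized bytes here is illustrative, not necessarily the project's actual hashing scheme.

```python
import hashlib

# Sketch of skipping unchanged parameter groups during `git theta add`;
# hashing serialized bytes with sha256 is illustrative, not necessarily
# git-theta's actual scheme.
def changed_params(new_params, stored_hashes):
    """Return {name: digest} for parameters whose bytes differ from the stored hash."""
    changed = {}
    for name, raw_bytes in new_params.items():
        digest = hashlib.sha256(raw_bytes).hexdigest()
        if stored_hashes.get(name) != digest:
            changed[name] = digest  # only these get rewritten to disk
    return changed
```

Parameters whose hash matches the previously stored one are skipped entirely, so a small sparse update only rewrites the groups it touched.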
Probably requires looping through entire set of variables/initializers. https://github.com/bindog/onnx-surgery/blob/master/surgery.py#L113
Currently the metadata file produced by the clean filter looks like this:
{
"model/scoping/to/param/1-weight shape": List[int],
"model/scoping/to/param/1-weight dtype": str,
"model/scoping/to/param/1-weight hash": str,
...,
"model/scoping/to/param/2-bias shape": List[int],
"model/scoping/to/param/2-bias dtype": str,
"model/scoping/to/param/2-bias hash": str,
...
}
To make fetching metadata for a single parameter easier, we are converting to a nested format:
{
"model/scoping/to/param/1-weight": {
"tensor_metadata": {
"shape": List[str],
"dtype": str,
"hash": str,
},
},
...,
"model/scoping/to/param/2-bias": {
"tensor_metadata": {
"shape": List[str],
"dtype": str,
"hash": str,
},
},
...,
}
Tensor metadata is in its own nested dict because we may eventually add other keys, like git_theta_metadata
, for tracking things like update types.
Note: We need a consistent serialization order (lexical sort on keys of each dict) when writing to disk to support diffs.
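The flat-to-nested conversion can be sketched as follows; the helper name and example values are illustrative, but the key layout follows this issue's examples.

```python
import json

# Illustrative converter from the flat clean-filter metadata above to the
# nested per-parameter layout; key names follow this issue's examples.
def nest_metadata(flat):
    nested = {}
    for key, value in flat.items():
        param_name, field = key.rsplit(" ", 1)  # "model/p1 shape" -> ("model/p1", "shape")
        nested.setdefault(param_name, {"tensor_metadata": {}})
        nested[param_name]["tensor_metadata"][field] = value
    return nested


flat = {
    "model/p1 shape": [3, 4],
    "model/p1 dtype": "float32",
    "model/p1 hash": "abc123",
}
nested = nest_metadata(flat)
# sort_keys gives the consistent serialization order needed for diffs.
serialized = json.dumps(nested, sort_keys=True, indent=2)
```

Fetching a single parameter's metadata is then a single dict lookup, and `sort_keys=True` handles the lexical sort needed for stable on-disk diffs.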
Have the global checkpoint checksum written out somewhere so that git always flags a merge conflict.
Read the proposal and blog post for VCS for collaborative update of models
Created Drive Folder for project
Why do we need sparse updates and other communication efficiency strategies
With large models, updating all the parameters can create very large checkpoints that would become infeasible to store (diff history) and communicate
May not be as much of a problem with small models or models that are rarely updated
Merge updates from models not fully in the scope of this project. Next layer after building a version control system
Fall back on some kind of averaging method; for newly added layers that are not conflicting it would be a simple merge (e.g. kNN, mixture of models)
What do we do in the case of merge conflicts that cannot be resolved automatically?
Some form of distillation
Last semester tried to see how we could merge different update methods
Evaluation/Downstream tasks
Differentiate the scope of this project as building something similar to Git but not dealing with CI (continuous integration) just yet
Eventually we may also want to know what data and hyperparameters resulted in that model update. But that’s an added layer
If one were to update a large model, wouldn't one also need to be resource-rich to even load these large models for training?
Yes, but there are ways to run them on a single GPU -> DeepSpeedZero
A very basic version of a VCS using Git with a model stored in ONNX format? So every time you update the model, git saves your version history?
May support some update types and not others - need to explore this
Does git only store line-level changes, or is it more nuanced?
Please take a look at the notebook and see if you can figure out a cleaner way to update a specific parameter value in the ONNX checkpoint. I'm currently doing initializer[1], it would be nice to choose it by parameter name. And also figure out why it's called "6", etc. And possibly also play around with the on-disk format, see whether it's at all usable by git, etc.
Use something like the @file_or_name
decorator to remove our many checks for if the input is a file object or a string.
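A hedged sketch of what a @file_or_name-style decorator could look like (git-theta's real one may differ): if the first argument is a path string, open it and pass the file object along; otherwise assume it is already a file object.

```python
import functools
import io

# Hedged sketch of a @file_or_name-style decorator; git-theta's real
# implementation may differ in details.
def file_or_name(mode="r"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(f, *args, **kwargs):
            if isinstance(f, str):
                # Got a path: open it and hand the function a file object.
                with open(f, mode) as file_handle:
                    return func(file_handle, *args, **kwargs)
            # Already a file-like object: pass it straight through.
            return func(f, *args, **kwargs)
        return wrapper
    return decorator


@file_or_name()
def first_line(f):
    return f.readline()
```

Callers can then pass either a path or an open file without the function body repeating the isinstance check.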
As described under https://github.com/r-three/git-cml/blob/main/README.md#design-notes
Have multiple aliases in the plugins.
Currently logging is only done through basicConfig and everything is logged at the debug level.
We should update this: we should log to a file (whose location is user-controllable), especially for the clean and smudge filters; some messages should be at debug and some at info; and there should be a user-configurable way to control log verbosity.
Ideally there would also be a way to see our debug messages without getting the ones from GitPython, as some of their debug logs look like errors (the message about CYGWIN, for example) and appear to come from git-theta because we are currently configuring the root logger.
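A sketch of the proposed setup; the environment variable names GIT_THETA_LOG_LEVEL and GIT_THETA_LOG_FILE are assumptions, not existing configuration knobs.

```python
import logging
import os

# Sketch of the proposed logging setup; GIT_THETA_LOG_LEVEL and
# GIT_THETA_LOG_FILE are assumed names, not existing knobs. Using a named
# "git_theta" logger instead of the root logger keeps GitPython's debug
# output (e.g. the CYGWIN message) out of our stream.
def configure_logging():
    logger = logging.getLogger("git_theta")
    logger.setLevel(os.environ.get("GIT_THETA_LOG_LEVEL", "INFO").upper())
    log_file = os.environ.get("GIT_THETA_LOG_FILE")
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Because the filters run as subprocesses of git, a file handler is the most reliable way to see their output after the fact.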
Regular files are staged with git add <file>
while checkpoints are staged with git theta add <checkpoint file>
. Talking with users, a common mistake is trying to stage some code by running something like git add .
and unintentionally staging a checkpoint file in the current directory. We should prevent this behavior.
Replace git cml init
with git cml install
(only run once) and git cml track
(run separately for each file).
git cml track
should specify both the file to be tracked and the checkpoint format. There will need to be an attribute/metadata file somewhere in .git_cml that specifies that the checkpoint is a given format.
Given an initializer name and a new value for a variable, replace the value of the variable with the new value. See e.g. https://github.com/bindog/onnx-surgery/blob/master/surgery.py#L118
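The lookup-and-replace step can be sketched as below. Real ONNX graphs are protobufs, so the Tensor and Graph classes here are stand-ins that let the example run without the onnx package; the logic mirrors what the linked surgery.py does on real initializers.

```python
from dataclasses import dataclass, field
from typing import List

# Stand-ins for onnx's TensorProto/GraphProto so this sketch runs without
# the onnx package; the name-based lookup-and-replace mirrors the linked
# surgery.py snippet.
@dataclass
class Tensor:
    name: str
    value: list


@dataclass
class Graph:
    initializer: List[Tensor] = field(default_factory=list)


def replace_initializer(graph, name, new_value):
    """Replace the value of the initializer called ``name`` in place."""
    for index, tensor in enumerate(graph.initializer):
        if tensor.name == name:
            graph.initializer[index] = Tensor(name, list(new_value))
            return
    raise KeyError(f"no initializer named {name!r}")
```

With real onnx protobufs, the replacement step would build a new TensorProto rather than a dataclass, but the name-keyed scan over graph.initializer is the same.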
Lots of code uses small variations on (sorted) iteration through (key, value) pairs to do things like intersections and unions.
Convert these functions to use flattened maps and methods like dict.update
.
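The idea can be sketched as follows; the flatten helper here is illustrative, not necessarily git-theta's implementation.

```python
# Sketch of working with flattened maps; `flatten` is an illustrative
# helper, not necessarily git-theta's implementation.
def flatten(nested, prefix=()):
    """Turn a nested dict into {("path", "to", "leaf"): value}."""
    flat = {}
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


base = flatten({"layer1": {"weight": 1, "bias": 2}})
update = flatten({"layer1": {"weight": 10}})
# Union/override becomes a dict merge instead of a sorted pairwise walk.
merged = {**base, **update}
# Intersection becomes a key-set operation.
shared = base.keys() & update.keys()
```

Once everything is a flat map keyed by parameter path, the bespoke sorted-iteration merges collapse into built-in dict and set operations.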
Add any useful links below
We also need to decide what operations we would support. The obvious requirements for a POC, in order of implementation:
Beyond that, we would also want to consider:
Define all the pieces in the pipeline
Define input and output for each piece
Define functionality of each piece
Identify which pieces are required for the PoC and which pieces can be built later
If we are doing a lot of manipulating or parsing of gitattributes files, we might want to split that out into a separate (well-tested) library, or try to rely on another library for that if possible.
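For a sense of scope, the minimal parsing is small, but the edge cases are not; this toy parser handles only the pattern-plus-attributes line format, while a real shared library would also need quoting, macros, negation, and precedence rules.

```python
# Tiny illustrative .gitattributes parser (pattern plus attribute tokens
# per line); a real shared library would also handle quoting, macros,
# negation, and precedence, which is why splitting it out is appealing.
def parse_gitattributes(text):
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        pattern, *attributes = line.split()
        entries.append((pattern, attributes))
    return entries
```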
Add a simple test that creates a pytorch checkpoint and does a few operations on it; set up continuous integration to run it.
Instead of tracking/applying sparse updates manually (for example, storing them in a different directory), can we just check in sparse updates and then move backwards through git history to build the real value (applying updates as we go)?
I have written this recursive smudge where when you smudge a file it will be transformed to include the content at each point in the history where it changed (and the commit the change happened at).
#!/bin/bash
COMMIT=${2:-"HEAD"}
echo "----------------------------" >> /tmp/smudge.log
echo "${COMMIT}" >> /tmp/smudge.log
# Step back one commit unless we are already at HEAD.
if [ "${COMMIT}" != "HEAD" ]; then
    PREV_COMMIT="${COMMIT}~1"
else
    PREV_COMMIT="${COMMIT}"
fi
echo "${PREV_COMMIT}" >> /tmp/smudge.log
echo "I'm running smudge"
# Find the last commit at or before PREV_COMMIT that touched the file.
LAST_CHANGE=$(git rev-list -1 "${PREV_COMMIT}" -- "$1")
echo "${LAST_CHANGE}" >> /tmp/smudge.log
if [ -z "${LAST_CHANGE}" ]; then
    exit 0
else
    echo "The last time this file changed was ${LAST_CHANGE}"
    git show "${LAST_CHANGE}:$1"
    /usr/local/google/home/brianlester/dev/git-theta-test/smudge.sh "$1" "${LAST_CHANGE}"
fi
Note, we can't run something like git checkout ${COMMIT}
from inside a smudge but we can run things like git show
and git rev-list
.
We can apply this same idea to parameters. Reading in a sparse update will recurse backwards through history until it hits a dense update. Once the dense update (which just returns its value) is reached, each sparse update (read from git) will be applied as we move back up the stack.
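The recursion can be sketched with a toy history; the list below stands in for commits whose contents would really be read via git show.

```python
# Toy sketch of the recursion described above: walk back to the last dense
# update, then apply each sparse update while unwinding the stack. The
# history list stands in for commits read via `git show`.
def resolve(history, index):
    kind, payload = history[index]
    if kind == "dense":
        return dict(payload)  # base case: a full parameter value
    value = resolve(history, index - 1)  # recurse toward the dense update
    value.update(payload)  # apply this sparse update while unwinding
    return value


history = [
    ("dense", {"w": 0.0, "b": 0.0}),
    ("sparse", {"w": 1.5}),
    ("sparse", {"b": -0.5}),
]
```

Resolving the newest entry replays both sparse updates on top of the dense base, which is exactly the behavior the recursive smudge gives us for file contents.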
The main open questions are:
Can tensorstore read a tensor when the binary blob (and the metadata file) are byte sequences from git show?
Involves making a custom difftool
It should probably just always designate a merge conflict? We could also eventually implement parameter averaging, or allow merges when complementary sets of parameters are updated.
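A toy sketch of those fallback strategies: keep the side that changed a parameter, and average when both sides changed it. Real merges would operate on tensors, not the scalar stand-ins used here.

```python
# Toy sketch of the fallback merge strategies mentioned above: take the
# side that changed a parameter, average when both sides changed it.
# Scalars stand in for tensors to keep the example self-contained.
def merge_params(base, ours, theirs):
    merged = {}
    for name in base:
        if ours[name] == theirs[name]:
            merged[name] = ours[name]
        elif ours[name] == base[name]:
            merged[name] = theirs[name]  # only their side changed it
        elif theirs[name] == base[name]:
            merged[name] = ours[name]  # only our side changed it
        else:
            merged[name] = (ours[name] + theirs[name]) / 2  # both changed: average
    return merged
```

Complementary updates (each side touching different parameters) merge cleanly; only parameters changed by both sides fall back to averaging.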
Implement a function to rename initializers (variables) in ONNX checkpoints. May require traversing the graph and updating all references.
Change name to git-theta
(both repo name and in the code)