r-three / git-theta
git extension for {collaborative, communal, continual} model development
License: Apache License 2.0
After staging a change to a checkpoint with git theta add /path/to/my-model.pt, we should be able to use git reset --hard to unstage the changes and blow away working-tree modifications, restoring the last commit.
Currently this results in a file-not-found error for the ${git_repo}/path/to/my-model.pt file during one of the smudges.
Now that checkpoint handling has moved to plugins, we don't need every deep learning framework installed all the time, especially given how heavy they can be.
Update setup.py to include extras_require entries for the various frameworks so they can be installed alongside git-theta. Also include a target that installs all of the frameworks, or at least the most popular ones.
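A hedged sketch of what the extras table might look like; the extra names and package lists below are assumptions for illustration, not the project's actual dependency choices.

```python
# Hypothetical framework extras; names and packages are illustrative,
# not git-theta's actual dependency lists.
framework_extras = {
    "pytorch": ["torch"],
    "tensorflow": ["tensorflow"],
    "flax": ["flax", "jax"],
}
# An "all" target that installs every framework at once.
framework_extras["all"] = sorted(
    {pkg for pkgs in framework_extras.values() for pkg in pkgs}
)
# This dict would be passed as setup(..., extras_require=framework_extras),
# letting users run `pip install git-theta[pytorch]` or `pip install git-theta[all]`.
```

With this layout, the base install stays light and users only pull in the frameworks they need.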
We should record a .name
value on each of the plugins. This property on a checkpoint object should return a string such that calling get_checkpoint
with this name returns the class of this object.
This will make things like logging what checkpoint type is used (and making sure we use the same one across multiple cleans, etc) much easier, especially when the value is set via an environment variable.
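A minimal sketch of the proposed .name lookup; the class names and the registry below are illustrative, not git-theta's actual plugin machinery.

```python
# Minimal sketch of the proposed .name lookup; classes and registry are
# illustrative, not git-theta's actual plugin machinery.
class Checkpoint:
    name = None  # each plugin subclass sets a unique string


class PyTorchCheckpoint(Checkpoint):
    name = "pytorch"


class FlaxCheckpoint(Checkpoint):
    name = "flax"


_PLUGINS = {cls.name: cls for cls in (PyTorchCheckpoint, FlaxCheckpoint)}


def get_checkpoint(name):
    """Return the checkpoint class registered under ``name``."""
    return _PLUGINS[name]
```

The round trip from the issue then holds: `get_checkpoint(PyTorchCheckpoint().name)` returns `PyTorchCheckpoint`, and the name string can equally come from an environment variable.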
It should still store the parameters and how they are changed.
Currently the params
module depends on torch only to convert tensors back into numpy arrays.
Since we are working on supporting multiple checkpoint formats, can we just use numpy for most of these methods?
E.g. hyperformer/tracked_outputs/pytorch_model.bin
should appear in .git_cml/hyperformer/tracked_outputs/pytorch_model.bin
This would determine what to print out when git diff is run.
As a starting point, this could be:
Operation type (e.g. dense update of a parameter)
Parameter name
New value
Probably needs to be able to refer to an external location for the initial checkpoint, since we don't want to store it all in git - it's too big. Might look like Git LFS. Bonus: store the random seed that can be used to reconstruct the initial parameter values.
The scripts in our bin/
directory don't end in .py
so they seem to get missed by black (I have confirmed they are missed in the pre-commit hook and I am pretty sure they are missed in the CI lint).
Update both the pre-commit hook and the CI to actually format these files. This will probably require a regex, as I think specifying specific files in pre-commit removes the file-type-based default change detection.
Should take a PyTorch checkpoint and basically construct a dict-like object that is keyed by parameter name and whose values are the parameter values.
rather than assuming it's a pytorch checkpoint
With the change to using flat maps, we are no longer using the iterate_(dict|dir)_leaves
functions.
They should be removed. The biggest part of the effort is that the new functions like flatten
and walk_dir
are mostly tested indirectly through these iterate functions. The tests need to be updated to exercise the functions we actually use.
When running git theta add
all of the parameter groups are saved in .git_theta/<path to model>
even if only a few of them were actually modified. Instead of writing the whole model to disk every time we run git theta add
, check what has changed and only write those parameters to disk.
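One way to sketch the change check, assuming parameters can be compared by a content hash; using sha256 over serialized bytes here is illustrative, not necessarily the project's actual hashing scheme.

```python
import hashlib

# Sketch of skipping unchanged parameter groups during `git theta add`;
# hashing serialized bytes with sha256 is illustrative, not necessarily
# git-theta's actual scheme.
def changed_params(new_params, stored_hashes):
    """Return {name: digest} for parameters whose bytes differ from the stored hash."""
    changed = {}
    for name, raw_bytes in new_params.items():
        digest = hashlib.sha256(raw_bytes).hexdigest()
        if stored_hashes.get(name) != digest:
            changed[name] = digest  # only these get rewritten to disk
    return changed
```

Parameters whose hash matches the previously stored one are skipped entirely, so a small sparse update only rewrites the groups it touched.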
Probably requires looping through entire set of variables/initializers. https://github.com/bindog/onnx-surgery/blob/master/surgery.py#L113
Currently the metadata file produced by the clean filter looks like this:
{
"model/scoping/to/param/1-weight shape": List[int],
"model/scoping/to/param/1-weight dtype": str,
"model/scoping/to/param/1-weight hash": str,
...,
"model/scoping/to/param/2-bias shape": List[int],
"model/scoping/to/param/2-bias dtype": str,
"model/scoping/to/param/2-bias hash": str,
...
}
To make fetching metadata for a single parameter easier, we are converting to a nested format:
{
"model/scoping/to/param/1-weight": {
"tensor_metadata": {
"shape": List[str],
"dtype": str,
"hash": str,
},
},
...,
"model/scoping/to/param/2-bias": {
"tensor_metadata": {
"shape": List[str],
"dtype": str,
"hash": str,
},
},
...,
}
Tensor metadata is in its own nested dict because we may eventually add other keys, like git_theta_metadata
, for tracking things like update types.
Note: We need a consistent serialization order (lexical sort on keys of each dict) when writing to disk to support diffs.
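The flat-to-nested conversion can be sketched as follows; the helper name and example values are illustrative, but the key layout follows this issue's examples.

```python
import json

# Illustrative converter from the flat clean-filter metadata above to the
# nested per-parameter layout; key names follow this issue's examples.
def nest_metadata(flat):
    nested = {}
    for key, value in flat.items():
        param_name, field = key.rsplit(" ", 1)  # "model/p1 shape" -> ("model/p1", "shape")
        nested.setdefault(param_name, {"tensor_metadata": {}})
        nested[param_name]["tensor_metadata"][field] = value
    return nested


flat = {
    "model/p1 shape": [3, 4],
    "model/p1 dtype": "float32",
    "model/p1 hash": "abc123",
}
nested = nest_metadata(flat)
# sort_keys gives the consistent serialization order needed for diffs.
serialized = json.dumps(nested, sort_keys=True, indent=2)
```

Fetching a single parameter's metadata is then a single dict lookup, and `sort_keys=True` handles the lexical sort needed for stable on-disk diffs.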
Have the global checkpoint checksum written out somewhere so that git always flags a merge conflict.
Read the proposal and blog post for VCS for collaborative update of models
Created Drive Folder for project
Why do we need sparse updates and other communication efficiency strategies
With large models, updating all the parameters can create very large checkpoints that would become infeasible to store (diff history) and communicate
May not be as much of a problem with small models or models that are rarely updated
Merge updates from models not fully in the scope of this project. Next layer after building a version control system
Fall back on some kind of averaging method; for newly added layers that are not conflicting it would be a simple merge (e.g. kNN, mixture of models)
What do we do in the case of merge conflicts that cannot be resolved automatically?
Some form of distillation
Last semester tried to see how we could merge different update methods
Evaluation/Downstream tasks
Differentiate the scope of this project as building something similar to Git but not dealing with CI (continuous integration) just yet
Eventually we may also want to know what data and hyperparameters resulted in that model update. But that’s an added layer
If one were to update a large model, wouldn't one also need to be resource-rich to even load these large models for training?
Yes, but there are ways to run them on a single GPU -> DeepSpeedZero
A very basic version of a VCS using Git with a model stored in ONNX format? So every time you update the model, git saves your version history?
May support some update types and not others - need to explore this
Does git only store line-level changes, or is it more nuanced?
Please take a look at the notebook and see if you can figure out a cleaner way to update a specific parameter value in the ONNX checkpoint. I'm currently doing initializer[1], it would be nice to choose it by parameter name. And also figure out why it's called "6", etc. And possibly also play around with the on-disk format, see whether it's at all usable by git, etc.
Use something like the @file_or_name
decorator to remove our many checks for if the input is a file object or a string.
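A hedged sketch of what a @file_or_name-style decorator could look like (git-theta's real one may differ): if the first argument is a path string, open it and pass the file object along; otherwise assume it is already a file object.

```python
import functools
import io

# Hedged sketch of a @file_or_name-style decorator; git-theta's real
# implementation may differ in details.
def file_or_name(mode="r"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(f, *args, **kwargs):
            if isinstance(f, str):
                # Got a path: open it and hand the function a file object.
                with open(f, mode) as file_handle:
                    return func(file_handle, *args, **kwargs)
            # Already a file-like object: pass it straight through.
            return func(f, *args, **kwargs)
        return wrapper
    return decorator


@file_or_name()
def first_line(f):
    return f.readline()
```

Callers can then pass either a path or an open file without the function body repeating the isinstance check.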
As described under https://github.com/r-three/git-cml/blob/main/README.md#design-notes
Have multiple aliases in the plugins.
Currently logging is only done through basicConfig and everything is logged at the debug level.
We should update this: we should log to a file (whose location is user-controllable), especially for the clean and smudge filters; some messages should be at debug and some at info; and there should be a user-configurable way to control log verbosity.
Ideally there would also be a way to see our debug messages without getting the ones from GitPython, as some of their debug logs look like errors (the message about CYGWIN, for example) and appear to come from git-theta because we are currently configuring the root logger.
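A sketch of the proposed setup; the environment variable names GIT_THETA_LOG_LEVEL and GIT_THETA_LOG_FILE are assumptions, not existing configuration knobs.

```python
import logging
import os

# Sketch of the proposed logging setup; GIT_THETA_LOG_LEVEL and
# GIT_THETA_LOG_FILE are assumed names, not existing knobs. Using a named
# "git_theta" logger instead of the root logger keeps GitPython's debug
# output (e.g. the CYGWIN message) out of our stream.
def configure_logging():
    logger = logging.getLogger("git_theta")
    logger.setLevel(os.environ.get("GIT_THETA_LOG_LEVEL", "INFO").upper())
    log_file = os.environ.get("GIT_THETA_LOG_FILE")
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Because the filters run as subprocesses of git, a file handler is the most reliable way to see their output after the fact.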
Regular files are staged with git add <file>
while checkpoints are staged with git theta add <checkpoint file>
. Talking with users, a common mistake is trying to stage some code by running something like git add .
and unintentionally staging a checkpoint file in the current directory. We should prevent this behavior.
Replace git cml init
with git cml install
(only run once) and git cml track
(run separately for each file).
git cml track
should specify both the file to be tracked and the checkpoint format. There will need to be an attribute/metadata file somewhere in .git_cml that specifies that the checkpoint is a given format.
Given an initializer name and a new value for a variable, replace the value of the variable with the new value. See e.g. https://github.com/bindog/onnx-surgery/blob/master/surgery.py#L118
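The lookup-and-replace step can be sketched as below. Real ONNX graphs are protobufs, so the Tensor and Graph classes here are stand-ins that let the example run without the onnx package; the logic mirrors what the linked surgery.py does on real initializers.

```python
from dataclasses import dataclass, field
from typing import List

# Stand-ins for onnx's TensorProto/GraphProto so this sketch runs without
# the onnx package; the name-based lookup-and-replace mirrors the linked
# surgery.py snippet.
@dataclass
class Tensor:
    name: str
    value: list


@dataclass
class Graph:
    initializer: List[Tensor] = field(default_factory=list)


def replace_initializer(graph, name, new_value):
    """Replace the value of the initializer called ``name`` in place."""
    for index, tensor in enumerate(graph.initializer):
        if tensor.name == name:
            graph.initializer[index] = Tensor(name, list(new_value))
            return
    raise KeyError(f"no initializer named {name!r}")
```

With real onnx protobufs, the replacement step would build a new TensorProto rather than a dataclass, but the name-keyed scan over graph.initializer is the same.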
Lots of code uses small variations on (sorted) iteration through (key, value) pairs to do things like intersections and unions.
Convert these functions to use flattened maps and methods like dict.update
.
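The idea can be sketched as follows; the flatten helper here is illustrative, not necessarily git-theta's implementation.

```python
# Sketch of working with flattened maps; `flatten` is an illustrative
# helper, not necessarily git-theta's implementation.
def flatten(nested, prefix=()):
    """Turn a nested dict into {("path", "to", "leaf"): value}."""
    flat = {}
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


base = flatten({"layer1": {"weight": 1, "bias": 2}})
update = flatten({"layer1": {"weight": 10}})
# Union/override becomes a dict merge instead of a sorted pairwise walk.
merged = {**base, **update}
# Intersection becomes a key-set operation.
shared = base.keys() & update.keys()
```

Once everything is a flat map keyed by parameter path, the bespoke sorted-iteration merges collapse into built-in dict and set operations.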
Add any useful links below
We also need to decide what operations we would support. The obvious requirements for a POC, in order of implementation:
Beyond that, we would also want to consider:
Define all the pieces in the pipeline
Define input and output for each piece
Define functionality of each piece
Identify which pieces are required for the PoC and which pieces can be built later
If we are doing a lot of manipulating or parsing of gitattributes files, we might want to split that out into a separate (well-tested) library, or try to rely on another library for that if possible.
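For a sense of scope, the minimal parsing is small, but the edge cases are not; this toy parser handles only the pattern-plus-attributes line format, while a real shared library would also need quoting, macros, negation, and precedence rules.

```python
# Tiny illustrative .gitattributes parser (pattern plus attribute tokens
# per line); a real shared library would also handle quoting, macros,
# negation, and precedence, which is why splitting it out is appealing.
def parse_gitattributes(text):
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        pattern, *attributes = line.split()
        entries.append((pattern, attributes))
    return entries
```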
Add a simple test that creates a pytorch checkpoint and does a few operations on it; set up continuous integration to run it.
Instead of tracking/applying sparse updates manually (for example, storing them in a different directory), can we just check in sparse updates and then move backwards through git history to build the real value (applying updates as we go)?
I have written this recursive smudge where when you smudge a file it will be transformed to include the content at each point in the history where it changed (and the commit the change happened at).
#!/bin/bash
COMMIT=${2:-"HEAD"}
echo "----------------------------" >> /tmp/smudge.log
echo "${COMMIT}" >> /tmp/smudge.log
# Step back one commit unless we are already at HEAD.
if [ "${COMMIT}" != "HEAD" ]; then
    PREV_COMMIT="${COMMIT}~1"
else
    PREV_COMMIT="${COMMIT}"
fi
echo "${PREV_COMMIT}" >> /tmp/smudge.log
echo "I'm running smudge"
# Find the last commit at or before PREV_COMMIT that touched the file.
LAST_CHANGE=$(git rev-list -1 "${PREV_COMMIT}" -- "$1")
echo "${LAST_CHANGE}" >> /tmp/smudge.log
if [ -z "${LAST_CHANGE}" ]; then
    exit 0
else
    echo "The last time this file changed was ${LAST_CHANGE}"
    git show "${LAST_CHANGE}:$1"
    /usr/local/google/home/brianlester/dev/git-theta-test/smudge.sh "$1" "${LAST_CHANGE}"
fi
Note, we can't run something like git checkout ${COMMIT}
from inside a smudge but we can run things like git show
and git rev-list
.
We can apply this same idea to parameters. Reading in a sparse update will recurse backwards through history until it hits a dense update. Once the dense update (which just returns its value) is reached, each sparse update (read from git) will be applied as we move back up the stack.
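The recursion can be sketched with a toy history; the list below stands in for commits whose contents would really be read via git show.

```python
# Toy sketch of the recursion described above: walk back to the last dense
# update, then apply each sparse update while unwinding the stack. The
# history list stands in for commits read via `git show`.
def resolve(history, index):
    kind, payload = history[index]
    if kind == "dense":
        return dict(payload)  # base case: a full parameter value
    value = resolve(history, index - 1)  # recurse toward the dense update
    value.update(payload)  # apply this sparse update while unwinding
    return value


history = [
    ("dense", {"w": 0.0, "b": 0.0}),
    ("sparse", {"w": 1.5}),
    ("sparse", {"b": -0.5}),
]
```

Resolving the newest entry replays both sparse updates on top of the dense base, which is exactly the behavior the recursive smudge gives us for file contents.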
The main open questions are:
Can tensorstore read a tensor when the binary blob (and the metadata file) are byte sequences from git show?
Involves making a custom difftool
It should probably just always designate a merge conflict? We could also eventually implement parameter averaging, or allow merges when complementary sets of parameters are updated.
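A toy sketch of those fallback strategies: keep the side that changed a parameter, and average when both sides changed it. Real merges would operate on tensors, not the scalar stand-ins used here.

```python
# Toy sketch of the fallback merge strategies mentioned above: take the
# side that changed a parameter, average when both sides changed it.
# Scalars stand in for tensors to keep the example self-contained.
def merge_params(base, ours, theirs):
    merged = {}
    for name in base:
        if ours[name] == theirs[name]:
            merged[name] = ours[name]
        elif ours[name] == base[name]:
            merged[name] = theirs[name]  # only their side changed it
        elif theirs[name] == base[name]:
            merged[name] = ours[name]  # only our side changed it
        else:
            merged[name] = (ours[name] + theirs[name]) / 2  # both changed: average
    return merged
```

Complementary updates (each side touching different parameters) merge cleanly; only parameters changed by both sides fall back to averaging.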
Implement a function to rename initializers (variables) in ONNX checkpoints. May require traversing the graph and updating all references.
Change name to git-theta
(both repo name and in the code)