zincware / zntrack
Create, visualize, run & benchmark DVC pipelines in Python & Jupyter notebooks.
Home Page: https://zntrack.readthedocs.io
License: Apache License 2.0
For objects such as numpy, pandas, or TensorFlow data, simple methods to write them to a file often already exist.
Have some special options, e.g., DVC.<np>, that inherit from a base class which can simply be overwritten to include a write_to_file and a read_from_file method.
For example:
```python
import numpy as np

class DVC_Numpy(PyTrackOption):
    def write_to_file(self):
        self.data: np.ndarray
        np.save(self.file, self.data)

    def read_from_file(self):
        data = np.load(self.file)
        return data
```
Users would have to handle lists and dicts themselves, but this could easily be extended to allow for broader usage.
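Handling nested lists and dicts could build on a recursive converter like the following sketch (`to_serializable` and `from_serializable` are hypothetical helper names, not part of the PyTrack API):

```python
import json

import numpy as np

def to_serializable(value):
    """Recursively convert numpy arrays inside lists and dicts to tagged lists."""
    if isinstance(value, np.ndarray):
        return {"_type": "np.ndarray", "data": value.tolist()}
    if isinstance(value, dict):
        return {key: to_serializable(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_serializable(item) for item in value]
    return value

def from_serializable(value):
    """Inverse of to_serializable: rebuild numpy arrays from tagged dicts."""
    if isinstance(value, dict):
        if value.get("_type") == "np.ndarray":
            return np.array(value["data"])
        return {key: from_serializable(item) for key, item in value.items()}
    if isinstance(value, list):
        return [from_serializable(item) for item in value]
    return value

# round trip through JSON, arrays survive inside containers
data = {"weights": np.arange(3), "names": ["a", "b"]}
restored = from_serializable(json.loads(json.dumps(to_serializable(data))))
```

The tagging via a `"_type"` key is one of several possible conventions; pickle would be simpler but is not human-readable in the repository.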
The following will raise an error:
```python
@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps = DVC.deps(['dependecy1'])

    def __call__(self, deps):
        self.deps.append(deps)

    def run(self):
        print(self.deps[0])
```
```python
@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps = DVC.deps()

    def __call__(self, deps1, deps2):
        self.deps = [deps1, deps2]

    def run(self):
        print(self.deps[0])
```
This works:
```python
@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps1 = DVC.deps()
        self.deps2 = DVC.deps()

    def __call__(self, deps1, deps2):
        self.deps1 = deps1
        self.deps2 = deps2

    def run(self):
        print(self.deps1)
        print(self.deps2)
```
I think it should be possible to automatically detect parameters.
It could be an experimental feature that looks at all values changed in the __call__ method which are not any outs, deps, etc.
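Such detection could work by diffing the instance `__dict__` around the call. A minimal sketch; the `DVCPlaceholder` marker class is a stand-in for the real DVC option types that would be filtered out:

```python
class DVCPlaceholder:
    """Stand-in for DVC.outs / DVC.deps / ... attributes that must be excluded."""

def detect_parameters(instance, method, *args, **kwargs):
    """Return plain attributes that were added or rebound during the call."""
    before = dict(instance.__dict__)
    method(*args, **kwargs)
    changed = {}
    for name, value in instance.__dict__.items():
        if isinstance(value, DVCPlaceholder):
            continue  # outs, deps, ... are tracked elsewhere
        if name not in before or before[name] is not value:
            changed[name] = value
    return changed

class Stage:
    def __init__(self):
        self.outs = DVCPlaceholder()

    def configure(self, n_layers):
        self.n_layers = n_layers

stage = Stage()
params = detect_parameters(stage, stage.configure, 3)
# params == {"n_layers": 3}
```

In-place mutation of an existing attribute would escape this identity check, which is one reason the feature would have to stay experimental.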
License headers should be in all modules.
As corrected in 797d15e, this needs to be tested.
It might seem strange, but currently self.parameters cannot be left empty! It might be sensible to allow this.
Currently @Node raises a ValueError telling the user to write @Node() instead, but this should not be necessary and should be fixed.
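A decorator can support both forms by checking whether it received the class directly. A sketch with a hypothetical `Node` stand-in, not the actual implementation:

```python
def Node(cls=None, **kwargs):
    """Accept both @Node and @Node(...) usage (illustrative stand-in)."""

    def wrap(inner_cls):
        inner_cls._node_kwargs = kwargs  # keep the configuration, e.g. nb_name
        return inner_cls

    if cls is None:
        # Called with parentheses, @Node(...): return the actual decorator.
        return wrap
    # Called bare, @Node: cls already is the decorated class.
    return wrap(cls)

@Node
class A:
    pass

@Node(nb_name="Test01.ipynb")
class B:
    pass
```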
Discussion: What would be required to make writing PyTrack code easier? What would help with finding issues quicker?
Summary of possible Names for DVC_Op
Feel free to add / vote for names.
The main idea behind PyTrack is to easily run, reproduce, save and restore model pipelines that allow for easy parameter optimizations without any processing overhead.
Currently all of this is handled through DVC. Although DVC can do these things very nicely, it might make sense to think about the integration of other tools into PyTrack.
An example could be the integration of MLFlow 1363b05 to be able to use the MLFlow UI features. MLFlow also provides more detailed timestep logging.
Another very common tool for pipelines might be Airflow
I would keep the basic idea of PyTrack based on DVC, because of the way it builds the dependency graphs, handles the cache and overhead, and can run experiments in parallel in temporary directories. But for this discussion I would encourage everyone to collect missing functionality or desired integrations with common tools here.
This does not mean that PyTrack will be expanded to use multiple tools! There must be a very clear advantage to adding dependencies for multiple tools!
See https://pypi.org/search/?q=pytrack
We need to rename or slightly alter the package name.
Raise a ValueError when trying to change, e.g., results or parameters outside of __call__ or run.
The current PyTrack implementation aims at making complex classes a DVC stage.
Similar to https://metaflow.org/, it might be useful to allow simpler scenarios where a stage is an intermediate step that does not take any user arguments but is still executed as an individual stage.
In the extreme this would also apply to plain functions.
For example:

```python
from pytrack import PyTrack, DVC

@PyTrack
class HelloWorld:
    def __init__(self):
        self.output = DVC.result()

    def run(self):
        self.output = "Hello World"

@PyTrack
def hello_world(inp):
    return inp * 2
```
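One way the function form could be normalized internally into the class form; this is purely illustrative, not the actual PyTrack implementation:

```python
def function_node(func):
    """Wrap a plain function into a minimal class exposing a run() method."""

    class FunctionNode:
        def __init__(self, *args, **kwargs):
            self.args = args
            self.kwargs = kwargs
            self.output = None

        def run(self):
            self.output = func(*self.args, **self.kwargs)

    FunctionNode.__name__ = func.__name__  # keep the stage name readable
    return FunctionNode

@function_node
def hello_world(inp):
    return inp * 2

node = hello_world(3)
node.run()
# node.output == 6
```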
Add more tests for possible failures!
The usage of properties has to be tested thoroughly!
The decorated methods, especially __call__, suffer from the missing argument metadata caused by the descriptor.
This can be solved by applying @functools.wraps.
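A minimal sketch of that fix, with a toy wrapper standing in for the descriptor machinery:

```python
import functools

def tracked(method):
    """Toy wrapper standing in for the descriptor machinery."""

    @functools.wraps(method)  # copies __name__, __doc__, __wrapped__, ...
    def wrapper(self, *args, **kwargs):
        return method(self, *args, **kwargs)

    return wrapper

class Stage:
    @tracked
    def __call__(self, inp):
        """Store the input."""
        self.inp = inp

# Without functools.wraps, Stage.__call__ would report the wrapper's
# name ("wrapper") and lose the docstring, which confuses tooling.
```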
Don't use self.json_file in the config; instead try something like self.returns = True in __call__.
For Jupyter notebooks, for example, it might make sense to have an additional wd argument that specifies where the repository is generated.
It would be nice if users could add an alias to processes so that they can query them later for personal use.
Add an argument that allows removing the stage from the dvc.yaml.
The following does not work:
```python
from pytrack import PyTrack, DVC, PyTrackProject
import numpy as np

@PyTrack()
class CreateRandomNumber:
    def __init__(self) -> None:
        self.number = DVC.result()

    def run(self):
        self.number = np.random.normal(size=(2, 4))

@PyTrack()
class ComputeStd:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.std = DVC.result()

    def run(self):
        self.std = np.std(CreateRandomNumber(id_=1).number)

@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(CreateRandomNumber(id_=1).number)

@PyTrack()
class JoinResults:
    def __init__(self) -> None:
        self.std = DVC.deps([ComputeStd(id_=0), ComputeMean(id_=0)])

    def run(self):
        print(self.std)

if __name__ == "__main__":
    project = PyTrackProject()
    project.create_dvc_repository()
    create_random_number = CreateRandomNumber()()
    compute_std = ComputeStd()()
    compute_mean = ComputeMean()()
    join_results = JoinResults()()
```
and raises:
ValueError: {'Hello': {'0': {'deps': {'deps': 'outs/0_Start.json'}}}, 'default': None, 'End': {'0': {'deps': {'deps': 'outs/0_Hello.json'}}}, 'ComputeStd': {'0': {'deps': {'deps': 'outs/0_CreateRandomNumber.json', 'random_number': 'outs/0_CreateRandomNumber.json'}}}, 'ComputeMean': {'0': {'deps': {'deps': 'outs/0_CreateRandomNumber.json', 'random_number': 'outs/0_CreateRandomNumber.json'}}}, 'JoinResults': {'0': {'deps': {'std': [<__main__.ComputeStd object at 0x0000022D6DF3E2B0>, <__main__.ComputeMean object at 0x0000022D6DF3E8E0>]}}}} is not JSON serializable
Code example:
```python
from pytrack import PyTrack, DVC

@PyTrack()
class HelloWorld:
    def __init__(self):
        self.result = DVC.result()
        self.output_files = DVC.outs(['first_file.txt', 'second_file.txt'])

    def __call__(self):
        print(self.output_files)
        self.output_files.append('third_file.txt')
        print(self.output_files)
        self.output_files += ['third_file.txt']
        print(self.output_files)

    def run(self):
        pass

hello_world = HelloWorld()
hello_world()
```
This results in:
[WindowsPath('outs/first_file.txt'), WindowsPath('outs/second_file.txt')]
[WindowsPath('outs/first_file.txt'), WindowsPath('outs/second_file.txt')]
[WindowsPath('outs/outs/first_file.txt'), WindowsPath('outs/outs/second_file.txt'), WindowsPath('outs/third_file.txt')]
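A plausible cause (an assumption about the PyTrack internals, not verified against the source) is that the option behaves like a descriptor: `+=` rebinds the attribute and triggers `__set__`, while `.append()` only mutates the temporary list that `__get__` returned, so the change is lost. A minimal descriptor reproducing both the lost append and the double `outs/` prefix:

```python
class Outs:
    """Toy descriptor: stores a list and prefixes entries on every access."""

    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, instance, owner=None):
        # Returns a *new* list, so mutating it never touches the stored value.
        return ["outs/" + entry for entry in getattr(instance, self.name)]

    def __set__(self, instance, value):
        setattr(instance, self.name, list(value))

class Stage:
    output_files = Outs()

stage = Stage()
stage.output_files = ["first_file.txt"]
stage.output_files.append("second_file.txt")  # lost: mutates a temporary list
print(stage.output_files)                     # ['outs/first_file.txt']
stage.output_files += ["second_file.txt"]     # __get__ then __set__: prefix applied twice
print(stage.output_files)                     # ['outs/outs/first_file.txt', 'outs/second_file.txt']
```

If this matches the real implementation, the fix would be to strip the prefix in `__set__` (or store already-resolved paths) so that `__get__` output can be fed back in safely.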
Currently, dependencies are not explained well enough in the docs.
Add a section on all DVC.<placeholder> options with examples.
The following code fails:
```python
from pytrack import PyTrack, DVC, PyTrackProject

@PyTrack(package=False)
class HelloWorld:
    def __init__(self) -> None:
        self.output = DVC.result()

    def __call__(self, *args, **kwds):
        pass

    def run(self):
        self.output = "Hello World"

if __name__ == "__main__":
    project = PyTrackProject()
    project.create_dvc_repository()
    hello_world = HelloWorld()
    hello_world()
```
with
ImportError: cannot import name 'HelloWorld' from '__main__' (unknown location) ERROR: failed to reproduce 'dvc.yaml': failed to run: python -c "from __main__ import HelloWorld; HelloWorld(id_=0).run()", exited with 1
Fix live output of subprocess.run in Jupyter notebooks.
If you run, e.g., adding data, the TQDM progress bar should show up when using exec_=True.
Currently the output gets captured and printed only at the end of the process. For adding data this can mean a long time without any output.
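A common fix is to stream the child process output line by line instead of capturing it all. A standard-library sketch (note that TQDM bars rewriting a single line via `\r` may still render imperfectly with line-based reading):

```python
import subprocess
import sys

def run_live(cmd):
    """Run a command and forward its output line by line instead of at the end."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so progress output is seen too
        text=True,
    )
    for line in process.stdout:
        sys.stdout.write(line)  # shows up immediately, also inside a notebook cell
    return process.wait()

returncode = run_live([sys.executable, "-c", "print('adding data...')"])
```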
Stages without DVC.parameter are currently not possible.
Currently the only supported options are --deps and --outs, together with params and results.
PyTrack should support all options mentioned on https://dvc.org/doc/command-reference/stage/add#options
As PyTrack is being used for software development, it is important that there are lots of example cases available for people to use.
To reproduce:
```python
import numpy as np
from pytrack import PyTrack, DVC

@PyTrack()
class ComputeA:
    """PyTrack stage A"""

    def __init__(self):
        self.inp = DVC.params()
        self.out = DVC.result()

    def __call__(self, inp):
        self.inp = inp

    def run(self):
        self.out = np.power(2, self.inp).item()

a = ComputeA()
b = ComputeA(id_=0)
a(3)
```
This raises ValueError: This stage is being loaded. Parameters can not be set!
Feel free to contribute your own examples via code or GISTS. Also ask questions about the given examples!
Feedback is very much appreciated and will be used to optimize the documentation and API.
Some functions use from pytrack import PyTrack, DVC, which can be replaced by from zntrack import Node, dvc.
PyCharm and PEP 8 don't like class attributes defined outside __init__. Even though self.config is set within __init__, try to use __init__ plus super() instead.
Documentation in general - how to use, why, what are the benefits ...
If you use DVCParams.merge() multiple times, it can mess up the order in which entries were added.
Consider having a dictionary or list in the PyTrack class instead of merge, e.g.:
self.dvc = {key: DVCParams}
Give the possibility to name a stage, e.g.:
```python
from pytrack import PyTrack, DVC

@PyTrack(name="WriteInputToOutput")
class HelloWorld:
    def __init__(self):
        self.input = DVC.params()
        self.output = DVC.result()

    def __call__(self, input):
        self.input = input

    def run(self):
        self.output = self.input
```
Move the id_ from run_dvc to the __init__. This is clearer.
self.param = DVC.parameter("default") is currently not supported.
There seems to be an issue with just having a plain self.deps = DVC.deps() that causes some strange self-dependencies.
To avoid the confusion about why self.parameters can be used but other class attributes cannot, it might be helpful to save the whole self.__dict__ to a JSON file instead.
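A sketch of what saving the whole self.__dict__ could look like (`save_state` and `load_state` are hypothetical helpers; values that are not JSON-serializable would still need the special option handling discussed elsewhere):

```python
import json
import tempfile
from pathlib import Path

def save_state(instance, file: Path):
    """Dump all JSON-friendly instance attributes to a file."""
    state = {
        key: value
        for key, value in instance.__dict__.items()
        if isinstance(value, (str, int, float, bool, list, dict, type(None)))
    }
    file.write_text(json.dumps(state))

def load_state(instance, file: Path):
    """Restore previously saved attributes onto the instance."""
    instance.__dict__.update(json.loads(file.read_text()))

class Stage:
    def __init__(self):
        self.parameters = {"lr": 0.1}
        self.note = "every plain attribute is saved, not just self.parameters"

stage = Stage()
state_file = Path(tempfile.mkdtemp()) / "state.json"
save_state(stage, state_file)

fresh = Stage.__new__(Stage)  # skip __init__ to mimic loading a stage
load_state(fresh, state_file)
```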
For running PyTrack one only needs dvc and PyYAML, but for the docs further packages are needed.
These should only be installed when requested.
In the previous version you checked whether the parameters changed and then used --force to overwrite them. Does it make sense to do that?
The following does not work:
```python
stage = StageIO()
stage(deps.resolve())
project.run()
project.load()
stage.outs.read_text()
```
while this does:
```python
stage = StageIO()
stage(deps.resolve())
project.run()
project.load()
stage = StageIO(id_=0)  # <-- required!
stage.outs.read_text()
```
Instead of
```python
@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(CreateRandomNumber(id_=1).number)
```
it would be potentially easier to have:
```python
@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(self.random_number.number)
```
The only downside to this is that DVC.deps then does not strictly return Path objects but can also return PyTrack classes: it returns whatever it was given!
In its current state, the PyTrackProject only supports a very small number of functions.
Building it on top of https://github.com/DAGsHub/fds could help increase the functionality and decrease the complexity.
Numpy arrays are very common results and potentially parameters.
Currently they have to be converted manually with np.ndarray.tolist().
This could be automated.
Possible implementations
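One possible implementation hooks a custom encoder into the JSON step. This is a sketch; the actual serialization layer in PyTrack may differ:

```python
import json

import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """Transparently convert numpy arrays and scalars during json.dumps."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):
            return obj.item()  # numpy scalars such as np.float64
        return super().default(obj)

payload = {"result": np.arange(4).reshape(2, 2), "loss": np.float64(0.5)}
text = json.dumps(payload, cls=NumpyEncoder)
```

Note that this direction is lossy: loading the file back yields plain lists, not arrays, unless a matching decoder is added.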
Check if you can get rid of self.parameters and use real class attributes with setattr.
I don't know if this is possible, but it might improve usability.
Take the data selection methods for example.
We have different data selection methods, like random and uniform_energetic, which all inherit from one parent class but are independent.
We now have a method like TTV that depends on a data selection method.
There are two possible scenarios:
Possible approaches for 2.:
- Read the dvc.yaml and look for all known data selection methods. There shouldn't be multiple! But there can be?!
- Take the PyTrack instance, enforce it being used, and pass the method to its parameters.

Calling PyTrack from a privately written example results in:
ImportError: cannot import name 'parameter' from 'pytrack' (/Users/samueltovey/work/Repositories/py-track/pytrack/__init__.py)
When I go looking, I cannot find any declaration of 'parameter'; it does not seem to be in the parameter.py file.
Could it be params or the PyTrackParam class?:
```python
@staticmethod
def params(value=None):
    """Parameter for PyTrack

    Parameters
    ----------
    value: any value that the parameter will take on, so that type hinting does not raise issues

    Returns
    -------
    PyTrackParameter: class that inherits from PyTrackOption
    """

    class PyTrackParameter(PyTrackOption):
        pass

    return PyTrackParameter("params", value=value)
```
We need to improve all of the PyTrack documentation.