zntrack's People

Contributors

dependabot[bot], mrjulenergy, pre-commit-ci[bot], pythonfz, samtov

zntrack's Issues

Allow for DVC custom outs

For objects from NumPy, pandas, or TensorFlow, simple methods to write them to a file often already exist.

Provide special options, e.g., DVC.<np>, that inherit from a base class whose write_to_file and read_from_file methods can simply be overridden.

For example:

import numpy as np

class DVC_Numpy(PyTrackOption):
    def write_to_file(self):
        # self.data is expected to be a np.ndarray.
        # Note: np.save appends ".npy" if self.file lacks that suffix.
        np.save(self.file, self.data)

    def read_from_file(self):
        return np.load(self.file)

Users would have to handle lists and dicts themselves, but this could easily be extended and would allow for broader usage.
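A minimal sketch of the proposed pattern, using only the standard library so that it runs without PyTrack. The names `FileOption` and `CsvOption`, and the JSON default of the base class, are hypothetical stand-ins for illustration and not part of the PyTrack API.

```python
import csv
import json
from pathlib import Path


class FileOption:
    """Hypothetical base class: stores data as JSON unless overridden."""

    def __init__(self, file, data=None):
        self.file = Path(file)
        self.data = data

    def write_to_file(self):
        self.file.write_text(json.dumps(self.data))

    def read_from_file(self):
        return json.loads(self.file.read_text())


class CsvOption(FileOption):
    """Overrides both hooks to store rows of strings as CSV."""

    def write_to_file(self):
        with self.file.open("w", newline="") as handle:
            csv.writer(handle).writerows(self.data)

    def read_from_file(self):
        with self.file.open(newline="") as handle:
            return [row for row in csv.reader(handle)]
```

A NumPy variant would override the same two methods with np.save and np.load, as in the snippet above.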

Allow `DVC.<placeholder>` to be a list

The following will raise an error:

@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps = DVC.deps(['dependency1'])
    
    def __call__(self, deps):
        self.deps.append(deps)
        
    def run(self):
        print(self.deps[0])

    
@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps = DVC.deps()
    
    def __call__(self, deps1, deps2):
        self.deps = [deps1, deps2]
        
    def run(self):
        print(self.deps[0])

This works:

@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps1 = DVC.deps()
        self.deps2 = DVC.deps()
    
    def __call__(self, deps1, deps2):
        self.deps1 = deps1
        self.deps2 = deps2
        
    def run(self):
        print(self.deps1)
        print(self.deps2)

Automatically detect parameters

I think it should be possible to detect parameters automatically.
It could be an experimental feature that looks at all attributes that change in the __call__ method and are not outs, deps, etc.
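One way this detection might work is to snapshot the instance attributes before and after `__call__` and report the difference. The sketch below is an assumption about how such a feature could look; `detect_parameters`, its `reserved` argument, and the plain `Stage` class are hypothetical and do not use PyTrack itself.

```python
import copy


def detect_parameters(stage, *args, reserved=(), **kwargs):
    """Call the stage and return the attributes its __call__ changed.

    `reserved` names attributes (outs, deps, ...) that must never be
    reported as parameters.
    """
    before = copy.deepcopy(vars(stage))
    stage(*args, **kwargs)
    after = vars(stage)
    return {
        name: value
        for name, value in after.items()
        if name not in reserved
        and (name not in before or before[name] != value)
    }


class Stage:
    """Plain class standing in for a decorated stage."""

    def __init__(self):
        self.deps = None
        self.n_steps = 10

    def __call__(self, deps, n_steps):
        self.deps = deps
        self.n_steps = n_steps
```

A limitation of this approach: an argument that happens to equal the value set in `__init__` leaves the attribute unchanged and is therefore not detected.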

self.parameters can't be empty

It might seem strange, but currently self.parameters cannot be left empty! It might be sensible to allow this.

Improve debugging code!

Discussion: What would be required to make writing PyTrack code easier? What would help with finding issues quicker?

New Name

Summary of possible Names for DVC_Op

  • pyDVC
  • DVCDB
  • StageManager
  • DynDVC
  • SVC (Stage Version Control)
  • StageTrack
  • pyTrack
  • FlowManager
  • FlowControl / StageControl

Feel free to add / vote for names.

Discussion: Tools beyond DVC

The main idea behind PyTrack is to easily run, reproduce, save and restore model pipelines that allow for easy parameter optimizations without any processing overhead.

Currently all of this is handled through DVC. Although DVC can do these things very nicely, it might make sense to think about integrating other tools into PyTrack.

An example could be the integration of MLFlow 1363b05 to be able to use the MLFlow UI features. MLFlow also provides more detailed timestep logging.

Another very common tool for pipelines might be Airflow.

I would keep the basic idea of PyTrack based on DVC because of the way it builds the dependency graphs, handles the cache and overhead, and can run experiments in parallel in temporary directories. For this discussion, however, I would encourage everyone to collect missing functionality or integrations with common tools here.

This does not mean that PyTrack will be expanded to use multiple tools! There must be a very clear advantage to justify adding dependencies on multiple tools!

Allow ZnTrack on non-call classes and functions

The current PyTrack implementation aims at making complex classes a DVC stage.
Similar to https://metaflow.org/, it might be useful to allow simpler scenarios where a stage is an intermediate step that does not take any user arguments but is still executed as an individual stage.
In the extreme this would also apply to plain functions.

E.g.

from pytrack import PyTrack, DVC

@PyTrack
class HelloWorld:
    def __init__(self):
        self.output = DVC.result()
    def run(self):
        self.output = "Hello World"
        
@PyTrack
def hello_world(inp):
    return inp * 2

TESTS

Add more tests for possible failures!
The usage of properties has to be tested thoroughly!

autocomplete on decorated function

The decorated methods, especially __call__, lose their argument signatures because of the descriptor, so autocompletion cannot suggest their arguments.

This can be solved by applying @functools.wraps.
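A short illustration of what `functools.wraps` preserves. The `track` decorator here is a hypothetical stand-in for whatever wrapping PyTrack applies to methods; only `functools.wraps` and `inspect.signature` are real standard-library calls.

```python
import functools
import inspect


def track(func):
    """Hypothetical decorator standing in for PyTrack's method wrapping."""

    @functools.wraps(func)  # copies __name__, __doc__ and sets __wrapped__
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    return wrapper


class Stage:
    @track
    def __call__(self, inp, scale=1.0):
        """Store the stage parameters."""
        self.inp = inp
        self.scale = scale


# inspect.signature follows __wrapped__, so tools see the real arguments
signature = inspect.signature(Stage.__call__)
```

Without the `@functools.wraps(func)` line, the reported signature would be `(*args, **kwargs)` and the docstring would be lost.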

remove json_file

Don't use self.json_file in the config; instead, try to use self.returns = True in __call__ or something like that.

`DVC.deps` can not take lists as argument

The following does not work:

from pytrack import PyTrack, DVC, PyTrackProject
import numpy as np


@PyTrack()
class CreateRandomNumber:
    def __init__(self) -> None:
        self.number = DVC.result()

    def run(self):
        self.number = np.random.normal(size=(2, 4))

@PyTrack()
class ComputeStd:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.std = DVC.result()

    def run(self):
        self.std = np.std(CreateRandomNumber(id_=1).number)

@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(CreateRandomNumber(id_=1).number)

@PyTrack()
class JoinResults:
    def __init__(self) -> None:
        self.std = DVC.deps([ComputeStd(id_=0), ComputeMean(id_=0)])
    
    def run(self):
        print(self.std)

if __name__ == "__main__":
    project = PyTrackProject()
    project.create_dvc_repository()

    create_random_number = CreateRandomNumber()()
    compute_std = ComputeStd()()
    compute_mean = ComputeMean()()
    join_results = JoinResults()()

and raises:

ValueError: {'Hello': {'0': {'deps': {'deps': 'outs/0_Start.json'}}}, 'default': None, 'End': {'0': {'deps': {'deps': 'outs/0_Hello.json'}}}, 'ComputeStd': {'0': {'deps': {'deps': 'outs/0_CreateRandomNumber.json', 'random_number': 'outs/0_CreateRandomNumber.json'}}}, 'ComputeMean': {'0': {'deps': {'deps': 'outs/0_CreateRandomNumber.json', 'random_number': 'outs/0_CreateRandomNumber.json'}}}, 'JoinResults': {'0': {'deps': {'std': [<__main__.ComputeStd object at 0x0000022D6DF3E2B0>, <__main__.ComputeMean object at 0x0000022D6DF3E8E0>]}}}} is not JSON serializable
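The error arises because the list contains stage objects rather than strings. One possible fix, sketched here under assumptions, is to pass a `default` hook to `json.dumps` that replaces each stage with a result-file path. The `Stage` base class is a hypothetical stand-in, and the `outs/<id>_<ClassName>.json` pattern is inferred from the paths visible in the error message, not taken from the PyTrack source.

```python
import json


class Stage:
    """Hypothetical stand-in for a PyTrack-decorated class."""

    def __init__(self, id_=0):
        self.id_ = id_


class ComputeStd(Stage):
    pass


class ComputeMean(Stage):
    pass


def stage_to_path(obj):
    """json.dumps `default` hook: replace a stage by its result file."""
    if isinstance(obj, Stage):
        # Pattern assumed from the error message above
        return f"outs/{obj.id_}_{type(obj).__name__}.json"
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


config = {"JoinResults": {"0": {"deps": {"std": [ComputeStd(0), ComputeMean(0)]}}}}
serialized = json.dumps(config, default=stage_to_path)
```

The hook is called once per non-serializable object, so it also covers stages nested inside lists or dicts.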

Adding to a `DVC.outs` list adds the path prefix multiple times

Code example:

from pytrack import PyTrack, DVC

@PyTrack()
class HelloWorld:
    def __init__(self):
        self.result = DVC.result()
        self.output_files = DVC.outs(['first_file.txt', 'second_file.txt'])

    def __call__(self):
        print(hello_world.output_files)
        self.output_files.append('third_file.txt')
        print(hello_world.output_files)
        self.output_files += ['third_file.txt']
        print(hello_world.output_files)

    def run(self):
        pass

hello_world = HelloWorld()
hello_world()

This results in:

[WindowsPath('outs/first_file.txt'), WindowsPath('outs/second_file.txt')]
[WindowsPath('outs/first_file.txt'), WindowsPath('outs/second_file.txt')]
[WindowsPath('outs/outs/first_file.txt'), WindowsPath('outs/outs/second_file.txt'), WindowsPath('outs/third_file.txt')]
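The doubled `outs/outs/` prefix suggests the prefix is applied again whenever the list is reassigned. A sketch of an idempotent prefixing step, assuming the prefix is the single directory `outs`; the helper names are hypothetical, and `PurePosixPath` is used only to keep the output platform-independent.

```python
from pathlib import PurePosixPath

OUTS_DIR = PurePosixPath("outs")


def with_outs_prefix(path):
    """Prefix `path` with the outs directory unless it already has it."""
    path = PurePosixPath(path)
    if path.parts[:1] == OUTS_DIR.parts:
        return path
    return OUTS_DIR / path


def normalize(paths):
    """Apply the prefix to every entry; safe to call repeatedly."""
    return [with_outs_prefix(p) for p in paths]
```

The trade-off is that a file intentionally placed at `outs/outs/...` can no longer be expressed by passing `outs/...`, so a real fix might instead track whether each entry has already been processed.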

Extend docs by `DVC.deps`

Currently, dependencies are not explained well enough in the docs.

Add a section on all DVC.<placeholder> options with examples.

Issue with running from `__main__`

The following code fails:

from pytrack import PyTrack, DVC, PyTrackProject


@PyTrack(package=False)
class HelloWorld:
    def __init__(self) -> None:
        self.output = DVC.result

    def __call__(self, *args, **kwds):
        pass

    def run(self):
        self.output = "Hello World"


if __name__ == "__main__":
    project = PyTrackProject()
    project.create_dvc_repository()

    hello_world = HelloWorld()
    hello_world()

with

ImportError: cannot import name 'HelloWorld' from '__main__' (unknown location)
ERROR: failed to reproduce 'dvc.yaml': failed to run: python -c "from __main__ import HelloWorld; HelloWorld(id_=0).run()", exited with 1
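The generated command imports from `__main__`, which in the new `python -c` process no longer refers to the user's script. One possible workaround, sketched under the assumption that the script's directory is importable from where DVC runs the command, is to derive the module name from the script file. Both helper functions are hypothetical; the command format is copied from the error message.

```python
import sys
from pathlib import Path


def importable_module(cls, main_file=None):
    """Return a module name that a fresh interpreter can import `cls` from."""
    if cls.__module__ != "__main__":
        return cls.__module__
    if main_file is None:
        main_file = getattr(sys.modules["__main__"], "__file__", None)
    if main_file is None:
        raise ValueError("class was defined interactively; no file to import from")
    # script.py -> module name "script"
    return Path(main_file).stem


def build_command(cls, id_=0, main_file=None):
    """Build the stage command in the format seen in the error message."""
    module = importable_module(cls, main_file)
    name = cls.__name__
    return f'python -c "from {module} import {name}; {name}(id_={id_}).run()"'
```

This only works if the script guards its top-level calls with `if __name__ == "__main__":`, as the example does, since importing it would otherwise re-run them.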

Expand the examples directory.

As PyTrack is being used for software development, it is important that there are lots of example cases available for people to use.

Loaded stage prohibits `__call__` on non-loaded stage

To reproduce:

import numpy as np
from pytrack import PyTrack, DVC

@PyTrack()
class ComputeA:
    """PyTrack stage A"""

    def __init__(self):
        self.inp = DVC.params()
        self.out = DVC.result()

    def __call__(self, inp):
        self.inp = inp

    def run(self):
        self.out = np.power(2, self.inp).item()


a = ComputeA()
b = ComputeA(id_=0)
a(3)

This raises ValueError: This stage is being loaded. Parameters can not be set!

Documentation

Documentation in general - how to use, why, what are the benefits ...

Multiple dvc merge can cause ordering issues

If you use DVCParams.merge() multiple times, it can mess up the order in which the parameters were added.

Consider having a dictionary or list in the PyTrack class instead of merge:
self.dvc={key:DVCParams}

Allow named stages

Provide the possibility to name a stage, e.g.,

from pytrack import PyTrack, DVC

@PyTrack(name="WriteInputToOutput")
class HelloWorld:
    def __init__(self):
        self.input = DVC.params()
        self.output = DVC.result()
    
    def __call__(self, input):
        self.input = input

    def run(self):
        self.output = self.input

test that `--deps None` works

There seems to be an issue with just having a plain self.deps = DVC.deps() that causes some strange self-dependencies.

Accessing `stage.outs` requires re-instantiating the class

The following does not work:

stage = StageIO()
stage(deps.resolve())
project.run()
project.load()

stage.outs.read_text()

while this does:

stage = StageIO()
stage(deps.resolve())
project.run()
project.load()

stage = StageIO(id_=0) # <-- required!
stage.outs.read_text()

Update `DVC.deps` to give access to its results directly

Instead of

@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(CreateRandomNumber(id_=1).number)

it would be potentially easier to have:

@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(self.random_number.number)

The only downside is that DVC.deps then no longer strictly returns a Path but can also return PyTrack classes.
It returns what it was given!

Support numpy arrays in `DVC.results()` and `DVC.params()`

NumPy arrays are very common results and potentially parameters.
Currently they have to be converted manually with np.ndarray.tolist().
This could be automated.

Possible implementations

  • convert them to lists
  • allow custom result types that use, e.g., np.save and np.load instead of JSON files.
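The first option could be sketched as a `default` hook for `json.dumps` that converts anything exposing `tolist()`. This is an assumption about one possible implementation, not PyTrack's actual behavior; `to_jsonable` and `FakeArray` are hypothetical names, and `FakeArray` exists only so the example runs without NumPy installed.

```python
import json


def to_jsonable(obj):
    """json.dumps `default` hook for array-like objects.

    numpy.ndarray exposes `tolist()`, so duck typing keeps this
    sketch free of a hard numpy dependency.
    """
    tolist = getattr(obj, "tolist", None)
    if callable(tolist):
        return tolist()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


class FakeArray:
    """Stand-in for numpy.ndarray so the example runs without numpy."""

    def __init__(self, values):
        self._values = values

    def tolist(self):
        return list(self._values)


encoded = json.dumps({"std": FakeArray([0.5, 1.5])}, default=to_jsonable)
```

Loading the file again yields plain lists, so the dtype is lost; the second option (np.save and np.load) avoids that at the cost of a non-JSON result file.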

Remove self.parameters

Check if you can get rid of the self.parameters and use real class attributes with setattr.
I don't know if this is possible, but it might improve the usability.

How to handle Parent Classes

Take the data selection methods for example.

We have different data selection methods like random, uniform_energetic which all inherit from one parent class but are independent.
We now have a method like TTV that depends on a data selection method.
There are two possible scenarios:

  1. Pass it the name of the data selection method that you used (easy)
  2. Automatically determine the data selection method that has been used (not so easy)

Possible approaches for 2.

  1. Read the dvc.yaml and look for all known data selection methods. There shouldn't be multiple, but there can be?!
  2. Somehow make it only a single PyTrack instance and enforce it being used and pass the method to its parameters

Cannot import parameter

Calling PyTrack from a privately written example results in:

ImportError: cannot import name 'parameter' from 'pytrack' (/Users/samueltovey/work/Repositories/py-track/pytrack/__init__.py)

When I go looking, I cannot find any declaration of 'parameter' as it does not seem to be in the parameter.py file.

Could it be params or the PyTrackParam class?

    @staticmethod
    def params(value=None):
        """Parameter for PyTrack

        Parameters
        ----------
        obj: any class object that the parameter will take on, so that type hinting does not raise issues

        Returns
        -------
        cls: Class that inherits from obj

        """

        class PyTrackParameter(PyTrackOption):
            pass

        return PyTrackParameter("params", value=value)
