zincware / zntrack
Create, visualize, run & benchmark DVC pipelines in Python & Jupyter notebooks.
Home Page: https://zntrack.readthedocs.io
License: Apache License 2.0
For objects such as numpy, pandas, or TensorFlow data, simple methods to write them to a file often already exist.
Have some special options, e.g., DVC.<np>, that inherit from a base class which can simply be overwritten to include a write_to_file and a read_from_file method.
For example:
```python
import numpy as np

class DVC_Numpy(PyTrackOption):
    def write_to_file(self):
        self.data: np.ndarray
        np.save(self.file, self.data)

    def read_from_file(self):
        data = np.load(self.file)
        return data
```
Users would have to handle lists and dicts themselves, but this could easily be extended to allow for broader usage.
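Handling nested lists and dicts could build on a recursive converter like the following sketch (`to_serializable` and `from_serializable` are hypothetical helper names, not part of the PyTrack API):

```python
import json

import numpy as np

def to_serializable(value):
    """Recursively convert numpy arrays inside lists and dicts to tagged lists."""
    if isinstance(value, np.ndarray):
        return {"_type": "np.ndarray", "data": value.tolist()}
    if isinstance(value, dict):
        return {key: to_serializable(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_serializable(item) for item in value]
    return value

def from_serializable(value):
    """Inverse of to_serializable: rebuild numpy arrays from tagged dicts."""
    if isinstance(value, dict):
        if value.get("_type") == "np.ndarray":
            return np.array(value["data"])
        return {key: from_serializable(item) for key, item in value.items()}
    if isinstance(value, list):
        return [from_serializable(item) for item in value]
    return value

# round trip through JSON, arrays survive inside containers
data = {"weights": np.arange(3), "names": ["a", "b"]}
restored = from_serializable(json.loads(json.dumps(to_serializable(data))))
```

The tagging via a `"_type"` key is one of several possible conventions; pickle would be simpler but is not human-readable in the repository.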
The following will raise an error:
```python
@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps = DVC.deps(['dependecy1'])

    def __call__(self, deps):
        self.deps.append(deps)

    def run(self):
        print(self.deps[0])
```
```python
@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps = DVC.deps()

    def __call__(self, deps1, deps2):
        self.deps = [deps1, deps2]

    def run(self):
        print(self.deps[0])
```
This works:
```python
@PyTrack(nb_name="Test01.ipynb")
class Stage:
    def __init__(self):
        self.deps1 = DVC.deps()
        self.deps2 = DVC.deps()

    def __call__(self, deps1, deps2):
        self.deps1 = deps1
        self.deps2 = deps2

    def run(self):
        print(self.deps1)
        print(self.deps2)
```
I think it should be possible to automatically detect parameters.
It could be an experimental feature that looks at all values changed in the __call__ method which are not any outs, deps, etc.
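Such detection could work by diffing the instance `__dict__` around the call. A minimal sketch; the `DVCPlaceholder` marker class is a stand-in for the real DVC option types that would be filtered out:

```python
class DVCPlaceholder:
    """Stand-in for DVC.outs / DVC.deps / ... attributes that must be excluded."""

def detect_parameters(instance, method, *args, **kwargs):
    """Return plain attributes that were added or rebound during the call."""
    before = dict(instance.__dict__)
    method(*args, **kwargs)
    changed = {}
    for name, value in instance.__dict__.items():
        if isinstance(value, DVCPlaceholder):
            continue  # outs, deps, ... are tracked elsewhere
        if name not in before or before[name] is not value:
            changed[name] = value
    return changed

class Stage:
    def __init__(self):
        self.outs = DVCPlaceholder()

    def configure(self, n_layers):
        self.n_layers = n_layers

stage = Stage()
params = detect_parameters(stage, stage.configure, 3)
# params == {"n_layers": 3}
```

In-place mutation of an existing attribute would escape this identity check, which is one reason the feature would have to stay experimental.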
License headers should be in all modules.
As corrected in 797d15e, this needs to be tested.
It might seem strange, but currently self.parameters cannot be left empty! It might be sensible to allow this.
Currently @Node raises a ValueError telling the user to write @Node() instead, but this should not be necessary and should be fixed.
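A decorator can support both forms by checking whether it received the class directly. A sketch with a hypothetical `Node` stand-in, not the actual implementation:

```python
def Node(cls=None, **kwargs):
    """Accept both @Node and @Node(...) usage (illustrative stand-in)."""

    def wrap(inner_cls):
        inner_cls._node_kwargs = kwargs  # keep the configuration, e.g. nb_name
        return inner_cls

    if cls is None:
        # Called with parentheses, @Node(...): return the actual decorator.
        return wrap
    # Called bare, @Node: cls already is the decorated class.
    return wrap(cls)

@Node
class A:
    pass

@Node(nb_name="Test01.ipynb")
class B:
    pass
```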
Discussion: What would be required to make writing PyTrack code easier? What would help with finding issues quicker?
Summary of possible Names for DVC_Op
Feel free to add / vote for names.
The main idea behind PyTrack is to easily run, reproduce, save and restore model pipelines that allow for easy parameter optimizations without any processing overhead.
Currently all of this is handled through DVC. Although DVC can do these things very nicely, it might make sense to think about the integration of other tools into PyTrack.
An example could be the integration of MLFlow 1363b05 to be able to use the MLFlow UI features. MLFlow also provides more detailed timestep logging.
Another very common tool for pipelines might be Airflow
I would keep the basic idea of PyTrack based on DVC, because of the way it builds the dependency graphs, handles the cache and overhead, and can run experiments in parallel in temporary directories. But for this discussion I would encourage everyone to collect missing functionality or desired integrations with common tools here.
This does not mean that PyTrack will be expanded to use multiple tools! There must be a very clear advantage to adding dependencies for multiple tools!
See https://pypi.org/search/?q=pytrack
We need to rename or slightly alter the package name.
Raise a ValueError when trying to change, e.g., results or parameters outside of __call__ or run.
The current PyTrack implementation aims at making complex classes a DVC stage.
Similar to https://metaflow.org/, it might be useful to allow simpler scenarios where a stage is an intermediate step that does not take any user arguments but is still executed as an individual stage.
In the extreme this would also apply to plain functions.
For example:

```python
from pytrack import PyTrack, DVC

@PyTrack
class HelloWorld:
    def __init__(self):
        self.output = DVC.result()

    def run(self):
        self.output = "Hello World"

@PyTrack
def hello_world(inp):
    return inp * 2
```
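One way the function form could be normalized internally into the class form; this is purely illustrative, not the actual PyTrack implementation:

```python
def function_node(func):
    """Wrap a plain function into a minimal class exposing a run() method."""

    class FunctionNode:
        def __init__(self, *args, **kwargs):
            self.args = args
            self.kwargs = kwargs
            self.output = None

        def run(self):
            self.output = func(*self.args, **self.kwargs)

    FunctionNode.__name__ = func.__name__  # keep the stage name readable
    return FunctionNode

@function_node
def hello_world(inp):
    return inp * 2

node = hello_world(3)
node.run()
# node.output == 6
```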
Add more tests for possible failures!
The usage of properties has to be tested thoroughly!
The decorated methods, especially __call__, suffer from the missing argument metadata caused by the descriptor.
This can be solved by applying @functools.wraps.
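A minimal sketch of that fix, with a toy wrapper standing in for the descriptor machinery:

```python
import functools

def tracked(method):
    """Toy wrapper standing in for the descriptor machinery."""

    @functools.wraps(method)  # copies __name__, __doc__, __wrapped__, ...
    def wrapper(self, *args, **kwargs):
        return method(self, *args, **kwargs)

    return wrapper

class Stage:
    @tracked
    def __call__(self, inp):
        """Store the input."""
        self.inp = inp

# Without functools.wraps, Stage.__call__ would report the wrapper's
# name ("wrapper") and lose the docstring, which confuses tooling.
```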
Don't use self.json_file in the config; instead try something like self.returns = True in __call__.
For Jupyter notebooks, for example, it might make sense to have an additional wd argument that specifies where the repository is generated.
It would be nice if users could add an alias to processes so that they can query them later for personal use.
Add an argument that allows removing the stage from the dvc.yaml.
The following does not work:
```python
from pytrack import PyTrack, DVC, PyTrackProject
import numpy as np

@PyTrack()
class CreateRandomNumber:
    def __init__(self) -> None:
        self.number = DVC.result()

    def run(self):
        self.number = np.random.normal(size=(2, 4))

@PyTrack()
class ComputeStd:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.std = DVC.result()

    def run(self):
        self.std = np.std(CreateRandomNumber(id_=1).number)

@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(CreateRandomNumber(id_=1).number)

@PyTrack()
class JoinResults:
    def __init__(self) -> None:
        self.std = DVC.deps([ComputeStd(id_=0), ComputeMean(id_=0)])

    def run(self):
        print(self.std)

if __name__ == "__main__":
    project = PyTrackProject()
    project.create_dvc_repository()
    create_random_number = CreateRandomNumber()()
    compute_std = ComputeStd()()
    compute_mean = ComputeMean()()
    join_results = JoinResults()()
```
and raises:
ValueError: {'Hello': {'0': {'deps': {'deps': 'outs/0_Start.json'}}}, 'default': None, 'End': {'0': {'deps': {'deps': 'outs/0_Hello.json'}}}, 'ComputeStd': {'0': {'deps': {'deps': 'outs/0_CreateRandomNumber.json', 'random_number': 'outs/0_CreateRandomNumber.json'}}}, 'ComputeMean': {'0': {'deps': {'deps': 'outs/0_CreateRandomNumber.json', 'random_number': 'outs/0_CreateRandomNumber.json'}}}, 'JoinResults': {'0': {'deps': {'std': [<__main__.ComputeStd object at 0x0000022D6DF3E2B0>, <__main__.ComputeMean object at 0x0000022D6DF3E8E0>]}}}} is not JSON serializable
Code example:
```python
from pytrack import PyTrack, DVC

@PyTrack()
class HelloWorld:
    def __init__(self):
        self.result = DVC.result()
        self.output_files = DVC.outs(['first_file.txt', 'second_file.txt'])

    def __call__(self):
        print(self.output_files)
        self.output_files.append('third_file.txt')
        print(self.output_files)
        self.output_files += ['third_file.txt']
        print(self.output_files)

    def run(self):
        pass

hello_world = HelloWorld()
hello_world()
```
This results in:
[WindowsPath('outs/first_file.txt'), WindowsPath('outs/second_file.txt')]
[WindowsPath('outs/first_file.txt'), WindowsPath('outs/second_file.txt')]
[WindowsPath('outs/outs/first_file.txt'), WindowsPath('outs/outs/second_file.txt'), WindowsPath('outs/third_file.txt')]
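A plausible cause (an assumption about the PyTrack internals, not verified against the source) is that the option behaves like a descriptor: `+=` rebinds the attribute and triggers `__set__`, while `.append()` only mutates the temporary list that `__get__` returned, so the change is lost. A minimal descriptor reproducing both the lost append and the double `outs/` prefix:

```python
class Outs:
    """Toy descriptor: stores a list and prefixes entries on every access."""

    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, instance, owner=None):
        # Returns a *new* list, so mutating it never touches the stored value.
        return ["outs/" + entry for entry in getattr(instance, self.name)]

    def __set__(self, instance, value):
        setattr(instance, self.name, list(value))

class Stage:
    output_files = Outs()

stage = Stage()
stage.output_files = ["first_file.txt"]
stage.output_files.append("second_file.txt")  # lost: mutates a temporary list
print(stage.output_files)                     # ['outs/first_file.txt']
stage.output_files += ["second_file.txt"]     # __get__ then __set__: prefix applied twice
print(stage.output_files)                     # ['outs/outs/first_file.txt', 'outs/second_file.txt']
```

If this matches the real implementation, the fix would be to strip the prefix in `__set__` (or store already-resolved paths) so that `__get__` output can be fed back in safely.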
Currently, dependencies are not explained well enough in the docs.
Add a section on all DVC.<placeholder> options with examples.
The following code fails:
```python
from pytrack import PyTrack, DVC, PyTrackProject

@PyTrack(package=False)
class HelloWorld:
    def __init__(self) -> None:
        self.output = DVC.result()

    def __call__(self, *args, **kwds):
        pass

    def run(self):
        self.output = "Hello World"

if __name__ == "__main__":
    project = PyTrackProject()
    project.create_dvc_repository()
    hello_world = HelloWorld()
    hello_world()
```
with
ImportError: cannot import name 'HelloWorld' from '__main__' (unknown location) ERROR: failed to reproduce 'dvc.yaml': failed to run: python -c "from __main__ import HelloWorld; HelloWorld(id_=0).run()", exited with 1
Fix live output of subprocess.run in Jupyter notebooks.
If you run, e.g., adding data, the TQDM progress bar should show up when using exec_=True.
Currently the output gets captured and printed only at the end of the process. For adding data this can mean a long time without any output.
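A common fix is to stream the child process output line by line instead of capturing it all. A standard-library sketch (note that TQDM bars rewriting a single line via `\r` may still render imperfectly with line-based reading):

```python
import subprocess
import sys

def run_live(cmd):
    """Run a command and forward its output line by line instead of at the end."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so progress output is seen too
        text=True,
    )
    for line in process.stdout:
        sys.stdout.write(line)  # shows up immediately, also inside a notebook cell
    return process.wait()

returncode = run_live([sys.executable, "-c", "print('adding data...')"])
```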
Stages without DVC.parameter are currently not possible.
Currently the only supported options are --deps and --outs, together with params and results.
PyTrack should support all options mentioned on https://dvc.org/doc/command-reference/stage/add#options
As PyTrack is being used for software development, it is important that there are lots of example cases available for people to use.
To reproduce:
```python
import numpy as np
from pytrack import PyTrack, DVC

@PyTrack()
class ComputeA:
    """PyTrack stage A"""

    def __init__(self):
        self.inp = DVC.params()
        self.out = DVC.result()

    def __call__(self, inp):
        self.inp = inp

    def run(self):
        self.out = np.power(2, self.inp).item()

a = ComputeA()
b = ComputeA(id_=0)
a(3)
```
This raises ValueError: This stage is being loaded. Parameters can not be set!
Feel free to contribute your own examples via code or GISTS. Also ask questions about the given examples!
Feedback is very much appreciated and will be used to optimize the documentation and API.
Some functions use from pytrack import PyTrack, DVC, which can be replaced by from zntrack import Node, dvc.
PyCharm and PEP 8 don't like class attributes defined outside __init__. Even though self.config is set within __init__, try to use __init__ plus super() instead.
Documentation in general - how to use, why, what are the benefits ...
If you use DVCParams.merge() multiple times, it can mess up the order in which entries were added.
Consider having a dictionary or list in the PyTrack class instead of merge, e.g.:
self.dvc = {key: DVCParams}
Give the possibility to name a stage, e.g.:
```python
from pytrack import PyTrack, DVC

@PyTrack(name="WriteInputToOutput")
class HelloWorld:
    def __init__(self):
        self.input = DVC.params()
        self.output = DVC.result()

    def __call__(self, input):
        self.input = input

    def run(self):
        self.output = self.input
```
Move the id_ from run_dvc to the __init__. This is clearer.
self.param = DVC.parameter("default") is currently not supported.
There seems to be an issue with just having a plain self.deps = DVC.deps() that causes some strange self-dependencies.
To avoid the confusion about why self.parameters can be used but other class attributes cannot, it might be helpful to save the whole self.__dict__ to a JSON file instead.
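A sketch of what saving the whole self.__dict__ could look like (`save_state` and `load_state` are hypothetical helpers; values that are not JSON-serializable would still need the special option handling discussed elsewhere):

```python
import json
import tempfile
from pathlib import Path

def save_state(instance, file: Path):
    """Dump all JSON-friendly instance attributes to a file."""
    state = {
        key: value
        for key, value in instance.__dict__.items()
        if isinstance(value, (str, int, float, bool, list, dict, type(None)))
    }
    file.write_text(json.dumps(state))

def load_state(instance, file: Path):
    """Restore previously saved attributes onto the instance."""
    instance.__dict__.update(json.loads(file.read_text()))

class Stage:
    def __init__(self):
        self.parameters = {"lr": 0.1}
        self.note = "every plain attribute is saved, not just self.parameters"

stage = Stage()
state_file = Path(tempfile.mkdtemp()) / "state.json"
save_state(stage, state_file)

fresh = Stage.__new__(Stage)  # skip __init__ to mimic loading a stage
load_state(fresh, state_file)
```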
For running PyTrack one only needs dvc and PyYAML, but for the docs further packages are needed.
These should only be installed when requested.
In the previous version you checked whether the parameters changed and then used --force to overwrite them. Does it make sense to do that?
The following does not work:
```python
stage = StageIO()
stage(deps.resolve())
project.run()
project.load()
stage.outs.read_text()
```
while this does:
```python
stage = StageIO()
stage(deps.resolve())
project.run()
project.load()
stage = StageIO(id_=0)  # <-- required!
stage.outs.read_text()
```
Instead of
```python
@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(CreateRandomNumber(id_=1).number)
```
it would be potentially easier to have:
```python
@PyTrack()
class ComputeMean:
    def __init__(self) -> None:
        self.random_number = DVC.deps(CreateRandomNumber(id_=0))
        self.mean = DVC.result()

    def run(self):
        self.mean = np.mean(self.random_number.number)
```
The only downside to this is that DVC.deps then does not strictly return Path objects but can also return PyTrack classes: it returns whatever it was given!
In its current state, the PyTrackProject only supports a very small number of functions.
Building it on top of https://github.com/DAGsHub/fds could help increase the functionality and decrease the complexity.
Numpy arrays are very common results and potentially parameters.
Currently they have to be converted manually with np.ndarray.tolist().
This could be automated.
Possible implementations
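One possible implementation hooks a custom encoder into the JSON step. This is a sketch; the actual serialization layer in PyTrack may differ:

```python
import json

import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """Transparently convert numpy arrays and scalars during json.dumps."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):
            return obj.item()  # numpy scalars such as np.float64
        return super().default(obj)

payload = {"result": np.arange(4).reshape(2, 2), "loss": np.float64(0.5)}
text = json.dumps(payload, cls=NumpyEncoder)
```

Note that this direction is lossy: loading the file back yields plain lists, not arrays, unless a matching decoder is added.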
Check if you can get rid of self.parameters and use real class attributes with setattr.
I don't know if this is possible, but it might improve usability.
Take the data selection methods for example.
We have different data selection methods, like random and uniform_energetic, which all inherit from one parent class but are independent.
We now have a method like TTV that depends on a data selection method.
There are two possible scenarios:
Possible approaches for 2.:
- Read the dvc.yaml and look for all known data selection methods. There shouldn't be multiple! But there can be?!
- Take the PyTrack instance, enforce it being used, and pass the method to its parameters.

Calling PyTrack from a privately written example results in:
ImportError: cannot import name 'parameter' from 'pytrack' (/Users/samueltovey/work/Repositories/py-track/pytrack/__init__.py)
When I go looking, I cannot find any declaration of 'parameter'; it does not seem to be in the parameter.py file.
Could it be params or the PyTrackParam class?:
```python
@staticmethod
def params(value=None):
    """Parameter for PyTrack

    Parameters
    ----------
    value: any value that the parameter will take on, so that type hinting does not raise issues

    Returns
    -------
    PyTrackParameter: class that inherits from PyTrackOption
    """

    class PyTrackParameter(PyTrackOption):
        pass

    return PyTrackParameter("params", value=value)
```
We need to improve all of the PyTrack documentation.