jmarshrossney / pyrex
Simple tool for reproducible experiments using Python
License: GNU General Public License v3.0
It's probably not necessary to insist on creating a workspace from a Cookiecutter template; the .pyrex_workspace
file contains a few lines which can easily be provided by a user using Click's interactive CLI.
I could call the command pyrex workspace init.
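A minimal sketch of what such a command might do, assuming Click's prompt functions are used to collect the values. The setting names (name, experiments_dir) and the simple "key: value" layout are hypothetical, since the actual contents of .pyrex_workspace are not listed here.

```python
from __future__ import annotations

from pathlib import Path

WORKSPACE_FILE = ".pyrex_workspace"  # filename taken from the text above


def write_workspace_file(root: Path, settings: dict[str, str]) -> Path:
    """Write the settings as 'key: value' lines (hypothetical layout)."""
    path = root / WORKSPACE_FILE
    if path.exists():
        # Refuse to overwrite an existing workspace file.
        raise FileExistsError(f"{path} already exists")
    lines = [f"{key}: {value}" for key, value in settings.items()]
    path.write_text("\n".join(lines) + "\n")
    return path


def init_command() -> None:
    """Interactive entry point; could be registered as 'pyrex workspace init'."""
    import click  # third-party; imported lazily so the helper above works without it

    # Both setting names below are assumptions made for illustration.
    settings = {
        "name": click.prompt("Workspace name", default=Path.cwd().name),
        "experiments_dir": click.prompt("Experiments directory", default="experiments"),
    }
    path = write_workspace_file(Path.cwd(), settings)
    click.echo(f"Created {path}")
```

Keeping the file-writing helper separate from the prompts means it can be tested without simulating user input.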
At one point I had intended to create a tool for generating reproducible containers in which arbitrary commands could be executed, with the knowledge that this tool was logging all of the information required to exactly reproduce the result.
This proved to (a) be too ambitious given my skill level, and (b) involve a considerable amount of wheel re-invention. Instead, I decided to build a much simpler tool that uses templates to semi-automate laborious copying and typing, and to offload the responsibility of making appropriate use of version control and dedicated package managers to the user.
The result is something that I think is 90% as useful as what I had originally intended, and about 2% as complicated to build and maintain (I think that's called being smart). The main use-case is someone like myself who wants to run experiments using code that's absolutely nowhere near a finished product. These experiments might amount to nothing, but I sure as hell would like to be able to go back and reproduce that 'really good result' I got months ago with some random hacked version of the code!
For future reference.
My original intention was to create a tool that would act as a drop-in replacement for python when invoking scripts as in e.g. python script.py -c config.yaml, which would create and run the script inside an isolated environment, recording all the information required to exactly reproduce the result. Essentially I wanted to semi-automate the following steps:
I also wanted the tool to perform a similar function for my colleagues who use Jupyter notebooks (e.g. using nbconvert or papermill), Julia, R etc.
I found that tox (or nox) does the hard work of creating an isolated virtual environment, and what's more allows you to execute arbitrary commands from inside this environment, so for a while the idea was to basically build a wrapper around tox -c experiment.ini that created a new directory and ran everything from there. This proved to be a bit fiddly, however, since tox cares a lot about the directory in which the .ini config file resides, which made it difficult to refer to files in a local repository (which would be under version control, so storing their commit hash would be sufficient whereas copying them would be totally overkill).
For some time I played with using git -C /path/to/repo --work-tree . checkout <commit> -- <files> to 'checkout' a specific commit from a local repo into a different working directory (I even wanted to put this command into the tox config file). This was a bit annoying because (a) if your workspace is a subdirectory of the repo then you do not end up with just the workspace directory, but all of its parents in the git working tree, and (b) it's actually kind of annoying to have multiple copies of the worktree checked out all over the place. Actually the behaviour was fairly unintuitive and I think too complicated for a tool that is meant to have a very low barrier to entry. Here I tried symlinking the workspace to the experiment directory to avoid having to modify paths, but this was bad because (a) you have to symlink individual files, not directories, else the experiment outputs get sent back to your main workspace, and (b) you end up with a directory cluttered with symlinks of files you don't need.
Ultimately it's pretty overkill to insist on complete environment isolation just to run an experiment. It would probably be sufficient to check that a script can run inside an isolated environment, to confirm that no unrecorded software is being used, and perhaps test agreement between outputs run inside versus outside. Also, one of my aims was to build a tool that doesn't massively change someone's workflow, since I just don't think people would use it if it did. So basically I thought it best not to build tox/nox into the tool.
Anyway, after numerous from-scratch rewrites I ended up here with something fairly bloated that amounted to little more than an over-engineered copytree. This is roughly when I realised that by far the most useful bits of my code were the time-saving elements which copied files, logged commit hashes, created a .gitignore and a README etc. It occurred to me that I could make something incredibly simple which was nonetheless still useful, which seems kinda, idk, smart?
So I ended up building basically a wrapper around cookiecutter which also copies files and injects a bunch of useful parameters into the context that can be referred to in templates. Cookiecutter is really just a simple API that uses Jinja to render templated files and directories. This seems like a sweet spot where most people can just use basic templates out of the box, but others can create their own (at no maintenance cost to me :D ). See #3
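A rough sketch of that idea, not the actual pyrex code: build a dictionary of extra parameters, pass it to cookiecutter.main.cookiecutter via extra_context, then copy the requested files into the generated directory. The function names and context keys (experiment_name, commit, timestamp, files) are hypothetical. Depending on the Cookiecutter version, keys may need to be declared in the template's cookiecutter.json before they can be referenced as {{ cookiecutter.commit }} and so on.

```python
from __future__ import annotations

import shutil
from datetime import datetime, timezone
from pathlib import Path


def build_extra_context(name: str, commit: str, files: list[Path]) -> dict[str, str]:
    """Collect parameters to inject into the template context (hypothetical keys)."""
    return {
        "experiment_name": name,
        "commit": commit,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "files": ", ".join(f.name for f in files),
    }


def create_experiment(
    template: str, output_dir: Path, context: dict[str, str], files: list[Path]
) -> Path:
    """Render the template, then copy the listed files into the result."""
    from cookiecutter.main import cookiecutter  # third-party; imported lazily

    project_dir = Path(
        cookiecutter(
            template,
            no_input=True,  # take values from the context rather than prompting
            extra_context=context,
            output_dir=str(output_dir),
        )
    )
    for f in files:
        shutil.copy2(f, project_dir / f.name)
    return project_dir
```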
Hooray.
Sure, it's easy to open a YAML file and type, but this would add some useful checks, e.g.
... .pyrex_workspace.yaml file?
I could call the command pyrex workspace add-exp.
As regards the CLI help I should in places extend Click's default behaviour by letting the user know useful information like
... (pyrex workspace create)
... (pyrex create)
Tests:
... utils.py
At the moment an experiment config has to have a template field which maps to the Template dataclass containing attributes template, checkout and directory which mirror argument names in cookiecutter.main.cookiecutter.
However, checkout and directory are not required and are not even used if the template is a filepath. This leads to configuration files that look like
name:
  template:
    template: /path/to/template
which is annoying and confusing. If the template is one that has been added to the named templates file then that should also be accepted in the experiment config as template: name.
Also, .pyrex_workspace.yaml should contain a workspace-wide default template to save space in the experiments config file.
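One way the proposed behaviour could look, as a sketch only: the template field accepts either a bare string (a registered name, or else a path or URL) or the full mapping, and falls back to a workspace default when omitted. The Template fields follow the attribute names given above; resolve_template, the named mapping, and the default argument are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Template:
    template: str
    checkout: Optional[str] = None
    directory: Optional[str] = None


def resolve_template(
    spec: Any,
    named: Mapping[str, Template],
    default: Any = None,
) -> Template:
    """Turn the 'template' field of an experiment config into a Template."""
    if spec is None:
        # Field omitted: fall back to the workspace-wide default, if any.
        if default is None:
            raise ValueError("no template given and no workspace default set")
        spec = default
    if isinstance(spec, str):
        # A bare string is a registered name if one matches, else a path or URL.
        return named.get(spec, Template(template=spec))
    if isinstance(spec, Mapping):
        # The existing verbose form: template / checkout / directory keys.
        return Template(**spec)
    raise TypeError(f"unsupported template spec: {spec!r}")
```

With this, template: /path/to/template and template: name both work as single-line entries, while the nested form remains valid for templates that need checkout or directory.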
Log information during experiment creation. In particular, record the state of the git repository by logging the outputs of git status, git rev-parse etc.
Creating experiments with the repo in a dirty state should be strongly discouraged (via confirmation prompts) but not completely disabled.
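A possible shape for this, sketched with subprocess and Click's confirm prompt. The helper names and the keys of the returned dictionary are hypothetical; the git invocations (git status --porcelain, git rev-parse HEAD) are standard.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside the given repository and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


def is_dirty(porcelain: str) -> bool:
    """'git status --porcelain' prints one line per change; empty means clean."""
    return any(line.strip() for line in porcelain.splitlines())


def record_git_state(repo: Path) -> dict[str, str]:
    """Collect the information to be written to the experiment log."""
    return {
        "commit": git(repo, "rev-parse", "HEAD"),
        "branch": git(repo, "rev-parse", "--abbrev-ref", "HEAD"),
        "status": git(repo, "status", "--porcelain"),
    }


def confirm_if_dirty(state: dict[str, str]) -> bool:
    """Return True if creation should go ahead; prompt only when the repo is dirty."""
    if not is_dirty(state["status"]):
        return True
    import click  # third-party; imported lazily

    return click.confirm(
        "The repository has uncommitted changes. Create the experiment anyway?",
        default=False,
    )
```

Defaulting the prompt to "no" discourages dirty-state experiments without forbidding them.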