invenia / checkpoints.jl
A package for dynamically checkpointing program state
License: MIT License
We don't have a significant use case right now, but we may want to introduce a lock to ensure multiple threads don't update checkpoint storage at the same time.
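As a rough illustration of the locking idea, here is a minimal sketch using only Base. The `STORE` dictionary, `STORE_LOCK`, and `save_checkpoint!` are hypothetical names standing in for checkpoint storage; they are not part of the package.

```julia
# Hypothetical in-memory store standing in for checkpoint storage.
const STORE = Dict{String,Any}()
const STORE_LOCK = ReentrantLock()

# Serialise writes so that concurrent tasks cannot interleave updates.
function save_checkpoint!(name::AbstractString, value)
    lock(STORE_LOCK) do
        STORE[name] = value
    end
    return nothing
end

# Each task writes its own key while holding the lock.
tasks = [Threads.@spawn save_checkpoint!("job.$i", i) for i in 1:100]
foreach(wait, tasks)
```

Whether a single global lock or one lock per handler is the better fit would depend on how the handlers are actually used.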
Gives a 404
If I save a checkpoint called "Forecasters.predicted" in example/path without tags, the filepath is example/path/Forecasters/predicted.jlso.
This IndexEntry constructor assumes that prefixes are those path segments which do not contain "=", i.e. are not tags. However, when there are no tags in the first place, all path segments are included in this operation, because first_tag_ind defaults to 1:
Checkpoints.jl/src/indexing.jl
Lines 24 to 27 in f380d77
MWE:
julia> Checkpoints.register("Forecasters", ["predicted"])
julia> Checkpoints.config("Forecasters.predicted", "example/path")
julia> checkpoint("Forecasters.predicted", [1])
13361
julia> IndexEntry("example/path/Forecasters/predicted.jlso").prefixes
("example", "path", "Forecasters")
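One way to avoid picking up the storage directory as prefixes would be to take the segments relative to a known base directory. The sketch below uses plain strings and Base path functions; `split_segments` is a hypothetical helper, not the package's `IndexEntry` constructor.

```julia
# Hypothetical helper: classify the directory segments of `filepath`,
# taken relative to `base_dir`, into prefixes (no "=") and tags ("key=value").
function split_segments(filepath::AbstractString, base_dir::AbstractString)
    rel = relpath(dirname(filepath), base_dir)
    # A file directly under base_dir has no segments to classify.
    segments = rel == "." ? String[] : splitpath(rel)
    prefixes = filter(s -> !occursin('=', s), segments)
    tags = filter(s -> occursin('=', s), segments)
    return (prefixes=Tuple(prefixes), tags=Tuple(tags))
end

untagged = split_segments("example/path/Forecasters/predicted.jlso", "example/path")
tagged = split_segments("example/path/date=2021/Forecasters/predicted.jlso", "example/path")
```

Under this assumption, the untagged checkpoint from the MWE would get only "Forecasters" as a prefix.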
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
We introduced a force=false kwarg to the DictHandler constructor in #50 to ensure checkpoints aren't accidentally overwritten. We may also want to introduce that in the JLSOHandler. This would be a breaking change, as our code currently overwrites checkpoint JLSOs if a job fails.
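To make the proposed behaviour concrete, here is a minimal sketch of an overwrite guard on a file write. `save_file` is a hypothetical function, not the JLSOHandler API, and it writes a plain string rather than a JLSO file.

```julia
# Hypothetical stand-in for a handler's save step; not the JLSOHandler API.
function save_file(path::AbstractString, data::AbstractString; force::Bool=false)
    if !force && isfile(path)
        throw(ArgumentError("$path already exists; pass force=true to overwrite"))
    end
    write(path, data)
    return path
end

dir = mktempdir()
path = joinpath(dir, "predicted.jlso")
save_file(path, "first")

# A second save to the same path is rejected unless force=true is passed.
overwrite_blocked = try
    save_file(path, "second")
    false
catch err
    err isa ArgumentError
end

save_file(path, "third"; force=true)
```

A job that is rerun after a failure would then need to opt in to overwriting explicitly, which is the breaking part of the change.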
Not quite reproducible example using some files I have on S3:
julia> using Checkpoints, FilePathsBase
julia> dir = p"s3://eis-jobresultsbucket-15m8pleoysiwx/backrun/2021-10-28T22h56m45.360/";
julia> index_checkpoint_files(dir)
ERROR: ArgumentError: . cannot be parsed as AWSS3.S3Path{AWS.AWSConfig}
Stacktrace:
[1] #parse#6
@ ~/.julia/packages/FilePathsBase/YFK4h/src/path.jl:74 [inlined]
[2] parse
@ ~/.julia/packages/FilePathsBase/YFK4h/src/path.jl:73 [inlined]
[3] relative(fp::AWSS3.S3Path{AWS.AWSConfig}, start::AWSS3.S3Path{AWS.AWSConfig})
@ FilePathsBase ~/.julia/packages/FilePathsBase/YFK4h/src/path.jl:440
[4] relpath
@ ~/.julia/packages/FilePathsBase/YFK4h/src/aliases.jl:22 [inlined]
[5] IndexEntry(filepath::AWSS3.S3Path{AWS.AWSConfig}, base_dir::AWSS3.S3Path{AWS.AWSConfig})
@ Checkpoints ~/JuliaEnvs/Checkpoints.jl/src/indexing.jl:28
[6] #27
@ ~/JuliaEnvs/Checkpoints.jl/src/indexing.jl:159 [inlined]
[7] iterate
@ ./generator.jl:47 [inlined]
[8] grow_to!(dest::Vector{IndexEntry}, itr::Base.Generator{Base.Iterators.Filter{ComposedFunction{Base.Fix2{typeof(==), String}, typeof(extension)}, Channel{AWSS3.S3Path{AWS.AWSConfig}}}, Checkpoints.var"#27#28"{AWSS3.S3Path{AWS.AWSConfig}}})
@ Base ./array.jl:739
[9] collect
@ ./array.jl:676 [inlined]
[10] map
@ ./abstractarray.jl:2323 [inlined]
[11] index_checkpoint_files(dir::AWSS3.S3Path{AWS.AWSConfig})
@ Checkpoints ~/JuliaEnvs/Checkpoints.jl/src/indexing.jl:158
[12] top-level scope
@ REPL[5]:1
The problem arises from a JLSO file existing at the top level in dir, which means the dirname and the base_dir of this file path are the same here. FilePathsBase then calls parse(S3Path, ".") under the hood. AWSS3 doesn't accept "."; it only accepts paths with the "s3://" URI.
We should be able to fix the above specific problem in this package by just not looking for prefixes/tags if the dirname and base_dir are equal.
This is because isfile is run on each path returned by collect(walkpath) on this line.
One possible workaround would be to filter out paths which end in /. For a path with 3000 files, the difference is 0.8s vs 60s using isfile.
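The workaround can be sketched with plain strings. The `listing` below is made-up data imitating what an object store listing might return, with "directories" shown as keys ending in a slash; no real S3 call is made.

```julia
# Hypothetical listing: directory-like entries carry a trailing slash.
listing = ["run/", "run/a.jlso", "run/sub/", "run/sub/b.jlso", "run/notes.txt"]

# Keep checkpoint files using string checks only, with no per-path isfile call.
checkpoint_files = filter(p -> !endswith(p, "/") && endswith(p, ".jlso"), listing)
```

This relies on the assumption that directory entries always end in "/", which holds for the listing above but would need checking against the path type actually in use.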
If we know a checkpoint's name expressed in the MODULE.SUBMODULE.NAME form, then right now you need to do
split(fullname, ".")[1:end-1] == collect(prefixes(x)) && last(split(fullname, ".")) == checkpoint_name(x)
to see if it matches some x from the index.
That is pretty gross.
We should add checkpoint_fullname(x::IndexEntry) = join([prefixes(x)..., checkpoint_name(x)], ".") or something like that, so that it can be checked easily.
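A minimal sketch of such a helper follows. The `Entry` struct is a hypothetical stand-in with only the two fields needed here; the real IndexEntry and its accessors may differ.

```julia
# Hypothetical stand-in for the package's IndexEntry.
struct Entry
    prefixes::Tuple{Vararg{String}}
    checkpoint_name::String
end

# Join the prefixes and the checkpoint name with dots, e.g. "Forecasters.predicted".
checkpoint_fullname(x::Entry) = join((x.prefixes..., x.checkpoint_name), ".")

entry = Entry(("Forecasters",), "predicted")
matches = checkpoint_fullname(entry) == "Forecasters.predicted"
```

Matching against an index then reduces to a single string comparison per entry.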
index_checkpoint_files walks a given path to find checkpoint files, and organises the segments of each checkpoint path into tags.
One might want to call this many times on the same top-level checkpoint directory, to analyse checkpoint data while the program is running and new checkpoints are added. For example, if a checkpoint is made at regular time intervals, with the timestamp used as a tag.
If there are a lot of checkpoint files (e.g. 100s), walking the whole path becomes a big waste. One could index a subdirectory of the top-level checkpoint directory, but then not all of the tags would be found, because tags are part of the path.
Is there a way to update the checkpoint index incrementally, based on diffs in the file tree? For example, if I want to reindex per timestep, it only searches the checkpoints for that timestep and adds them to an existing index, but still knows all of the tags.
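One possible shape for this, sketched with Base only: walk just the new subdirectory, but compute the tags relative to the top-level directory so that none are lost. `index_subdir!` and the `Dict` index are hypothetical and much simpler than the package's index entries.

```julia
# Hypothetical incremental indexer: scans only `subdir`, but derives tags
# from each file's directory relative to the top-level `root`.
function index_subdir!(index::Dict{String,Vector{String}}, root, subdir)
    for (dir, _, files) in walkdir(subdir)
        for f in files
            endswith(f, ".jlso") || continue
            path = joinpath(dir, f)
            haskey(index, path) && continue   # already indexed earlier
            segments = splitpath(relpath(dir, root))
            index[path] = filter(s -> occursin('=', s), segments)
        end
    end
    return index
end

# Demo tree: two timesteps under the same run.
root = mktempdir()
for t in ("time=1", "time=2")
    mkpath(joinpath(root, "run=a", t))
    touch(joinpath(root, "run=a", t, "predicted.jlso"))
end

index = Dict{String,Vector{String}}()
index_subdir!(index, root, joinpath(root, "run=a", "time=1"))
index_subdir!(index, root, joinpath(root, "run=a", "time=2"))   # only the new timestep is walked
```

This assumes the caller knows which subdirectory is new, for example from the timestamp tag.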
We might like to make checkpoint into a macro.
Reasons for macro:
- [...] @checkpoint("RegressionSummary", value=expensive_summary_function(foo)). (This is what the Base Logging macros do)
- [...] register them in __init__ by making it, at parse time, register it. (Not 100% sure if this will work, since it is mutating a global variable at parse time; I think it does. If it doesn't then we shouldn't do this.)
- [...] checkpoint_info that would print a list. As a kind of documentation.
- [...] a be the same as :a=>a (though we also get this if we change to storing data in the kwarg position #16)
On the other hand, macros are harder to reason about, so the gains might not be worth it.
I think low priority
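One thing a macro can do, and which Base's logging macros do, is skip evaluating its arguments when nothing will be recorded. The sketch below shows that idea only; `CHECKPOINTS_ENABLED`, `SAVED`, and this `@checkpoint` are hypothetical and do not reflect how the package would implement it.

```julia
const CHECKPOINTS_ENABLED = Ref(false)
const SAVED = Dict{String,Any}()
const CALLS = Ref(0)

# Hypothetical macro: the value expression only runs when checkpointing is enabled.
macro checkpoint(name, ex)
    # Accept `key=value` whether it parses as an assignment or a keyword argument.
    (ex isa Expr && ex.head in (:(=), :kw)) || error("expected key=value")
    key, val = ex.args
    return quote
        if CHECKPOINTS_ENABLED[]
            SAVED[$(esc(name))] = $(QuoteNode(key)) => $(esc(val))
        end
        nothing
    end
end

# Counts how often it is actually called.
expensive_summary() = (CALLS[] += 1; 42)

@checkpoint("RegressionSummary", value=expensive_summary())   # disabled: not evaluated
CHECKPOINTS_ENABLED[] = true
@checkpoint("RegressionSummary", value=expensive_summary())   # enabled: evaluated once
```

With a plain function call, `expensive_summary()` would run both times regardless of whether the checkpoint is saved.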
Probably should have been done as part of #22, but I didn't think to; it just has API docs.
Consider a checkpoint index entry, index, constructed from the path results/backrun/2021-10-08T15:59:04.829/foo=BAR/sim_now=2019-02-14T10:15:00-05:00/strategy=1/Forecasters/predicted.jlso
checkpoint_path(index) would return that same path: results/backrun/2021-10-08T15:59:04.829/foo=BAR/sim_now=2019-02-14T10:15:00-05:00/strategy=1/Forecasters/predicted.jlso
I propose a new checkpoint_basepath(index) that returns results/backrun/2021-10-08T15:59:04.829/, giving whatever part is common to all tags.
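As a sketch of the proposal on plain string paths: take everything before the first "key=value" segment. `basepath_before_tags` is a hypothetical helper operating on a string, not on an index entry, and it returns the directory without a trailing slash.

```julia
# Hypothetical helper: the part of the path before the first "key=value" segment.
function basepath_before_tags(path::AbstractString)
    segments = splitpath(dirname(path))
    first_tag = findfirst(s -> occursin('=', s), segments)
    first_tag === nothing && return dirname(path)   # no tags: whole directory
    first_tag == 1 && return ""                     # path starts with a tag
    return joinpath(segments[1:first_tag-1]...)
end

base = basepath_before_tags("results/backrun/run1/foo=BAR/strategy=1/Forecasters/predicted.jlso")
```

For a path with no tags this falls back to the full directory, which may or may not be the desired behaviour.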
As of #15 (cc @mzgubic) we are allowed duplicate tags, and because of this with_tag is used as with_tag(:tag1=>1, :tag2=>2) do
Since we need to actually have a function that can take those duplicate tags, they can't be in the keyword position.
Conversely, the data keys must be unique, so having them in the varargs position is suboptimal.
Also, as of #15 it will be very rare to pass tags directly to a checkpoint.
So calls will be checkpoint("FooBar"; foo=1, bar=[1,2,3]), which will create JLSO files containing both foo and bar, in a location determined by the context tags.
Workaround added in #42
It can be removed once JuliaCloud/AWSS3.jl#227 or rofinn/FilePathsBase.jl#156 is closed.