valet

Overview

valet is a utility for performing data management tasks automatically. Once started, valet continues working until interrupted by SIGINT (^C) or SIGTERM (kill), at which point it stops gracefully.

It monitors filesystem events and also performs filesystem walks to ensure that tasks are done in a timely fashion.

Use cases

Oxford Nanopore DNA sequencing instruments (GridION, PromethION)

Compressing, checksumming and moving data files off the instrument during runs, to prevent the local disks from filling.

Compressing files

Some versions of MinKNOW do not compress fastq files, and no version compresses the large sequencing_summary.txt files. valet compresses files matching the patterns below with gzip (a sketch of this step follows the list).

  • File patterns supported

    • *.csv
    • *.fastq
    • *.txt
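As an illustration of this step, here is a minimal Go sketch of gzip-compressing a single file. compressFile is a hypothetical helper, not valet's API, and error handling is simplified.

```go
package main

import (
	"compress/gzip"
	"io"
	"log"
	"os"
)

// compressFile is a hypothetical helper: it writes <path>.gz alongside the
// input file. Cleanup of the original is intentionally omitted here.
func compressFile(path string) error {
	in, err := os.Open(path)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(path + ".gz")
	if err != nil {
		return err
	}
	defer out.Close()

	zw := gzip.NewWriter(out)
	if _, err := io.Copy(zw, in); err != nil {
		return err
	}
	return zw.Close() // flushes the gzip footer
}

func main() {
	if err := compressFile("sequencing_summary.txt"); err != nil {
		log.Fatal(err)
	}
}
```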

Archiving files

  • File patterns supported

    • *.csv
    • *.fast5
    • *.fastq
    • *.gz
    • *.md
    • *.pdf
    • *.tsv
    • *.txt

Creating up-to-date checksum files

No version of MinKNOW produces checksum files to ensure data integrity when moving files off the instrument. valet produces these for all files that it recognises for archiving.

  • Directory hierarchy styles supported

    • Any
  • File patterns supported

    • All supported for archiving
  • Checksum file patterns supported

    • <data file name>.md5

valet will monitor a directory hierarchy and locate data files within it that have no accompanying checksum file, or have a checksum file that is stale. valet will then calculate the checksum and create or update the checksum file.
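A hedged sketch of that behaviour: a staleness test plus MD5 (re)calculation. The function names are illustrative rather than valet's API, and the "<digest>  <name>" content layout is the conventional md5sum format, assumed here.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
)

// checksumStale reports whether <dataPath>.md5 is missing or older than the
// data file it describes (illustrative helper).
func checksumStale(dataPath string) (bool, error) {
	dataInfo, err := os.Stat(dataPath)
	if err != nil {
		return false, err
	}
	sumInfo, err := os.Stat(dataPath + ".md5")
	if os.IsNotExist(err) {
		return true, nil
	}
	if err != nil {
		return false, err
	}
	return sumInfo.ModTime().Before(dataInfo.ModTime()), nil
}

// writeChecksum creates or updates <dataPath>.md5, using md5sum's
// "<hex digest>  <file name>" layout (assumed format).
func writeChecksum(dataPath string) error {
	in, err := os.Open(dataPath)
	if err != nil {
		return err
	}
	defer in.Close()

	h := md5.New()
	if _, err := io.Copy(h, in); err != nil {
		return err
	}
	line := fmt.Sprintf("%s  %s\n", hex.EncodeToString(h.Sum(nil)), filepath.Base(dataPath))
	return os.WriteFile(dataPath+".md5", []byte(line), 0644)
}

func main() {
	path := "reads.fastq.gz" // hypothetical data file
	stale, err := checksumStale(path)
	if err != nil {
		log.Fatal(err)
	}
	if stale {
		if err := writeChecksum(path); err != nil {
			log.Fatal(err)
		}
	}
}
```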

Operation

valet is a command-line program with online help. Once launched, it continues to run until signalled with SIGINT (^C) or SIGTERM (kill), at which point it stops by cancelling the filesystem monitor and waiting for any running jobs to exit.
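The shutdown pattern described above can be sketched with signal.NotifyContext, which cancels a context on SIGINT or SIGTERM; the wait for running jobs is represented here by a sync.WaitGroup. This is a minimal illustration, not valet's code.

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// Cancel the context on SIGINT (^C) or SIGTERM (kill).
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	// ... start the filesystem monitor and worker goroutines here, each
	// adding itself to wg and watching ctx.Done() for cancellation ...

	<-ctx.Done() // block until signalled
	log.Print("shutting down: waiting for running jobs")
	wg.Wait() // wait for running jobs to exit
}
```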

Architecture

valet identifies filesystem paths as potential work targets, applies a test to each, and performs the work on those that pass the test (i.e. it applies a filter). This process is implemented as three components:

  • A filesystem monitor to identify work targets.

  • A set of predicate (filter) functions.

  • A set of work functions and a driver to run them.

Further details on each of these elements are below.
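Before those details, a brief sketch of how the three components might fit together as Go types and a channel pipeline; the type and function names are illustrative, not valet's exported API.

```go
package main

import (
	"fmt"
	"strings"
)

// FilePredicate decides whether a candidate path should be worked on.
type FilePredicate func(path string) (bool, error)

// WorkFunc performs the actual task (compress, checksum, archive, ...).
type WorkFunc func(path string) error

// filterPaths applies a predicate to a stream of candidate paths and
// forwards only those that pass the test.
func filterPaths(in <-chan string, pred FilePredicate) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for p := range in {
			if ok, err := pred(p); err == nil && ok {
				out <- p
			}
		}
	}()
	return out
}

func main() {
	in := make(chan string, 2)
	in <- "run1/reads.fastq"
	in <- "run1/reads.fastq.gz"
	close(in)

	isFastq := func(p string) (bool, error) {
		return strings.HasSuffix(p, ".fastq"), nil
	}
	for p := range filterPaths(in, isFastq) {
		fmt.Println("work target:", p)
	}
}
```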

Filesystem monitor

valet monitors filesystem events under a root directory to detect changes. Additionally, it performs a periodic sweep of all files under the root directory, because events are not guaranteed to be a complete description of changes: files may be added to a directory before a watch is established, another program on the system may exhaust the user's maximum permitted number of monitors, or valet may simply have been started after the target files were created.
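One way to combine the two detection mechanisms, sketched with the third-party github.com/fsnotify/fsnotify package for events and filepath.WalkDir for the sweep. valet's actual watcher may differ; note also that fsnotify watches a single directory rather than a whole hierarchy.

```go
package main

import (
	"io/fs"
	"log"
	"path/filepath"
	"time"

	"github.com/fsnotify/fsnotify"
)

func main() {
	root := "/data" // hypothetical root directory

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()
	// A real implementation would add watches recursively as directories
	// appear; fsnotify itself is not recursive.
	if err := watcher.Add(root); err != nil {
		log.Fatal(err)
	}

	paths := make(chan string) // candidate work targets

	// Event-driven detection.
	go func() {
		for ev := range watcher.Events {
			if ev.Op&(fsnotify.Create|fsnotify.Write) != 0 {
				paths <- ev.Name
			}
		}
	}()

	// Periodic sweep to catch anything the events missed.
	go func() {
		for range time.Tick(10 * time.Minute) {
			filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
				if err != nil {
					return nil // skip unreadable entries in this sketch
				}
				if !d.IsDir() {
					paths <- p
				}
				return nil
			})
		}
	}()

	for p := range paths {
		log.Print("candidate: ", p)
	}
}
```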

Predicate functions

These functions test filesystem paths to see whether they are work targets. If a function returns true, the path is forwarded to a work function. Predicates are permitted to do anything that has no side effects on the path argument, e.g. matching the path against a glob or regular expression, testing whether the path is a regular file, directory or symlink, or testing the file size.

A basic API toolkit is provided to create new predicates.
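An illustrative sketch of what such a toolkit can look like: a glob-matching predicate plus And/Or combinators. These names are assumptions for the example, not necessarily valet's API.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// FilePredicate tests a path (illustrative type).
type FilePredicate func(path string) (bool, error)

// MatchesGlob builds a predicate from a shell glob such as "*.fastq".
func MatchesGlob(pattern string) FilePredicate {
	return func(path string) (bool, error) {
		return filepath.Match(pattern, filepath.Base(path))
	}
}

// And passes only when every predicate passes.
func And(preds ...FilePredicate) FilePredicate {
	return func(path string) (bool, error) {
		for _, p := range preds {
			ok, err := p(path)
			if err != nil || !ok {
				return false, err
			}
		}
		return true, nil
	}
}

// Or passes when any predicate passes.
func Or(preds ...FilePredicate) FilePredicate {
	return func(path string) (bool, error) {
		for _, p := range preds {
			ok, err := p(path)
			if err != nil {
				return false, err
			}
			if ok {
				return true, nil
			}
		}
		return false, nil
	}
}

func main() {
	isCompressible := Or(MatchesGlob("*.fastq"), MatchesGlob("*.txt"), MatchesGlob("*.csv"))
	ok, _ := isCompressible("runs/run1/sequencing_summary.txt")
	fmt.Println(ok) // true
}
```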

Work functions

A work function is applied to every path that passes the filter. A number of these are executed in parallel, each on a different path. The maximum number of parallel jobs can be controlled from the command line, the default being to run as many jobs as there are CPUs. Work function failures will be logged and counted, but will not cause valet to terminate. However, once valet terminates it will do so with a non-zero exit code if any work function failed.

valet prevents more than one instance of a work function (whether the same function or a different one) from operating on a particular file concurrently.
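A hedged sketch of this execution model: a semaphore bounds the number of parallel jobs (defaulting to runtime.NumCPU()), an in-flight set stops two jobs touching the same path at once, and failures are counted rather than aborting the run. All names here are illustrative.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"sync"
	"sync/atomic"
)

// runWork applies work to each path, at most maxJobs at a time, and returns
// the number of failures.
func runWork(paths <-chan string, work func(string) error, maxJobs int) int64 {
	if maxJobs <= 0 {
		maxJobs = runtime.NumCPU() // default: one job per CPU
	}

	var (
		failures int64
		wg       sync.WaitGroup
		sem      = make(chan struct{}, maxJobs) // bounds parallel jobs

		mu       sync.Mutex
		inFlight = map[string]bool{} // paths currently being worked on
	)

	for p := range paths {
		mu.Lock()
		if inFlight[p] {
			mu.Unlock()
			continue // another work function already has this path
		}
		inFlight[p] = true
		mu.Unlock()

		sem <- struct{}{}
		wg.Add(1)
		go func(path string) {
			defer func() {
				mu.Lock()
				delete(inFlight, path)
				mu.Unlock()
				<-sem
				wg.Done()
			}()
			if err := work(path); err != nil {
				log.Printf("work failed on %s: %v", path, err)
				atomic.AddInt64(&failures, 1)
			}
		}(p)
	}

	wg.Wait()
	return atomic.LoadInt64(&failures)
}

func main() {
	paths := make(chan string, 1)
	paths <- "run1/reads.fastq"
	close(paths)

	n := runWork(paths, func(p string) error { return nil }, 0)
	if n > 0 {
		os.Exit(1) // non-zero exit if any work function failed
	}
}
```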


Dependencies

  • https://github.com/wtsi-npg/extendo (versions >= 2.0.0)
  • https://github.com/wtsi-npg/baton (versions >= 2.0.0)


Issues

Signal to downstream processes when a run is complete

The objective is to avoid downstream processes needing heuristics to tell whether a run has completed (and whether copying has completed too). The sequencing_summary.txt should be the last file to be created by the run. However, it may not be the last file to arrive at the destination when there are concurrent copy operations in progress. One option is to copy sequencing_summary.txt only after a period of idle time. This does not guard against the run being restarted (but does that create a new run folder?).
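One possible shape for the idle-period idea, sketched as a resettable timer; deferUntilIdle and the quiet period used in the demo are hypothetical.

```go
package main

import (
	"log"
	"time"
)

// deferUntilIdle calls action once no activity has been reported on the
// events channel for the given quiet period. If the event source closes,
// it gives up without acting.
func deferUntilIdle(events <-chan struct{}, quiet time.Duration, action func()) {
	timer := time.NewTimer(quiet)
	for {
		select {
		case _, ok := <-events:
			if !ok {
				return
			}
			// Activity seen: restart the quiet-period countdown.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(quiet)
		case <-timer.C:
			action()
			return
		}
	}
}

func main() {
	events := make(chan struct{}, 1)
	events <- struct{}{} // simulated run activity
	deferUntilIdle(events, 2*time.Second, func() {
		log.Print("run looks idle: copy sequencing_summary.txt")
	})
}
```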

Document TMPDIR behaviour

Document that valet respects TMPDIR and that it should be set appropriately (to the same filesystem as the data).

Multiple work functions may operate on the same path

It is possible for race conditions to arise, e.g. when filesystem sweeps are frequent relative to job runtime, resulting in more than one work function operating in parallel on the same path. valet should prevent this.

In the meantime, work functions should be written to fail gracefully in these cases, or to recover from errors internally.

Support PromethION device IDs

The parser currently supports GridION device IDs only.

Parsing the report Markdown file at the end of publishing fails with an error message such as: Failed to parse device ID '2-A7-D7'.
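For illustration only, a hypothetical pattern that accepts GridION-style IDs (e.g. GA10000, X1) as well as a PromethION-style ID such as the 2-A7-D7 above; the real parser's grammar may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// deviceID is a hypothetical pattern accepting GridION-style IDs (GA10000,
// X1) as well as PromethION-style IDs (2-A7-D7); the real grammar may differ.
var deviceID = regexp.MustCompile(`^(GA\d+|X\d+|\d+-[A-H]\d+-[A-H]\d+)$`)

func main() {
	for _, id := range []string{"GA10000", "X1", "2-A7-D7", "bogus"} {
		fmt.Printf("%-8s %v\n", id, deviceID.MatchString(id))
	}
}
```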

Error channel blocking on cancel

The channel for non-fatal runtime errors has a buffer size of 1, intended to allow at least one error to be sent so that a non-zero exit is possible. However, if multiple errors are sent by the watcher or finder, the senders block and a clean cancel can become impossible.

There should be an additional goroutine to constantly drain the error channel and act on the errors appropriately. Mostly we want to keep on working if we can, but also to have a clean cancel.

There will be some additional work to act appropriately on errors that we really want to be fatal.
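A sketch of the proposed drain goroutine; the function and channel names are illustrative.

```go
package main

import (
	"context"
	"errors"
	"log"
)

// drainErrors logs non-fatal errors as they arrive so that senders never
// block, and stops when the context is cancelled or the channel is closed.
func drainErrors(ctx context.Context, errs <-chan error) {
	for {
		select {
		case err, ok := <-errs:
			if !ok {
				return
			}
			log.Printf("non-fatal: %v", err)
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	errs := make(chan error)
	done := make(chan struct{})
	go func() {
		drainErrors(ctx, errs)
		close(done)
	}()

	errs <- errors.New("transient watcher error") // sender does not block
	close(errs)                                   // no more errors
	<-done
}
```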

Pruning the monitored directory hierarchy

In some cases there are deep sub-hierarchies within the target hierarchy that are never interesting and are simply a waste of monitoring resources. We need to be able to prune the tree for both monitoring and periodic sweeps. This should be settable from the command line.
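The sweep side of pruning could look something like this, using filepath.WalkDir and fs.SkipDir; the pruned paths shown are examples only.

```go
package main

import (
	"io/fs"
	"log"
	"path/filepath"
	"strings"
)

// sweepPruned walks root but skips any directory under one of the pruned
// prefixes, so uninteresting sub-hierarchies cost nothing to sweep.
func sweepPruned(root string, pruned []string, visit func(path string)) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // ignore unreadable entries in this sketch
		}
		if d.IsDir() {
			for _, p := range pruned {
				if path == p || strings.HasPrefix(path, p+string(filepath.Separator)) {
					return fs.SkipDir // prune this whole sub-hierarchy
				}
			}
			return nil
		}
		visit(path)
		return nil
	})
}

func main() {
	err := sweepPruned("/data", []string{"/data/tmp"}, func(p string) {
		log.Print("candidate: ", p)
	})
	if err != nil {
		log.Fatal(err)
	}
}
```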

Restarted runs on PromethION do not include the Markdown report

Runs on the PromethION-24 (MinKNOW 4.5.4) that are restarted do not get the usual Markdown report file created and sometimes do not get the final PDF report either.

We rely on the Markdown report for automated metadata extraction. The final report .txt file seems always to be present and contains most of the same data. However, it does not contain the MinKNOW or Guppy software versions.

Archive accompanying text and PDF files

The GridION and PromethION run folders contain additional text and PDF files that are not handled by the archiver. They should be archived too, perhaps not using md5 checksum files.

Dependabot can't resolve your Go dependency files

Dependabot can't resolve your Go dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

verifying github.com/kjsanger/extendo/[email protected]/go.mod: checksum mismatch
	downloaded: h1:ghBNpbUzejGnKbx54isdXlmXfMxh4vNih0olZyPTA8A=
	go.sum:     h1:Xibv5xvLjDUcqCB2dh4uuLxjY2NYa4inOVRCTwCJOcY=

SECURITY ERROR
This download does NOT match an earlier download recorded in go.sum.
The bits may have been replaced on the origin server, or an attacker may
have intercepted the download attempt.

For more information, see 'go help module-auth'.

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

View the update logs.

Compress Fastq files

The ONT pipeline creates uncompressed Fastq files. We should compress and checksum these locally before archiving.

Automatically exclude TMPDIR from directory walks

If TMPDIR is set to be within the /data filesystem, automatically exclude it from directory walks, rather than requiring the user to pass an --exclude CLI option when they have set things up this way.

Use namespaces for iRODS metadata

When adding ONT metadata to iRODS, use the ont namespace for the attributes to avoid clashes with existing (non-namespaced) attributes, e.g. sample_id.

Add metadata for experiment name and position

Experiment name is currently added under the ONT nomenclature of protocol_group_id (from the report.md file). The position is currently added as device_id (also from the report.md file); however, the value is of a form such as GA10000, or X1 for position 1, and needs to be normalised.
