FormatSpecimens

Bioinformatics is rife with formats and parsers for those formats.

These parsers don't always agree on the definitions of these formats, since many lack any sort of formal standard.

This repository aims to consolidate a collection of format specimens, forming a unified file set for testing software. Testing against the same cases is a first step towards agreeing on the details and edge cases of a format.

Unlike its predecessor BioFmtSpecimens, FormatSpecimens is version controlled and released to a Julia package registry, and it features a small Julia module to assist in unit testing.

Install

FormatSpecimens is built primarily for BioJulia and is maintained with the BioJulia ecosystem of tools and its developers in mind. It is available to install through BioJulia's package registry.

Julia by default only watches the "General" package registry, so before you start, you should add the BioJulia package registry.

Start a Julia REPL and hit the ] key to enter pkg mode (you should see the prompt change from julia> to pkg>), then enter the following command:

registry add https://github.com/BioJulia/BioJuliaRegistry.git

After you've added the registry, you can install FormatSpecimens from the Julia REPL. Press ] to enter pkg mode again, and enter the following:

add FormatSpecimens

Organization

This repository consists of a directory for every major format. Each directory contains format specimens along with a TOML file named index.toml. This index.toml contains two arrays, called valid and invalid. All the index records for specimen files that are considered valid (i.e. conform to the format definition) are found in the valid array. All the index records for specimen files that are considered invalid (i.e. violate the format definition in some way) are found in the invalid array.

Every entry in the valid and invalid arrays has the following fields:

  • filename (Required) The specimen's filename.
  • origin (Optional) The contributor or source from which a specimen was taken.
  • tags (Optional) One or more words used to group specimens by shared features.
  • comments (Optional) Any additional information that might be of interest.

The only field strictly required to retrieve a file using the FormatSpecimens Julia module is filename, but the other fields are useful for filtering and manipulating lists of specimen files in your unit tests.
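To make the index layout concrete, here is a minimal sketch that parses a made-up index.toml with Julia's TOML standard library (Julia 1.6 or later). The filenames, origin, tags and comments are hypothetical, and the sketch assumes that valid and invalid are arrays of tables and that tags is stored as an array of strings; check a real index.toml in this repository for the exact layout.

```julia
using TOML  # standard library since Julia 1.6

# Hypothetical index.toml contents. None of these files exist in the
# repository; they only illustrate the fields described above.
index_text = """
[[valid]]
filename = "example_dna.fastq"
origin = "hypothetical contributor"
tags = ["dna"]
comments = "A well-formed record."

[[invalid]]
filename = "example_truncated.fastq"
tags = ["truncated"]
"""

index = TOML.parse(index_text)

# Only `filename` is required, so read the optional fields with a default.
for entry in index["valid"]
    println(entry["filename"], " tags: ", get(entry, "tags", String[]))
end
```

In practice you would not parse the index yourself; the functions in the Julia module read it for you.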

Julia Module

To get a list of all valid or invalid file specimens for a given format, you can do the following:

using FormatSpecimens
goodfiles = list_valid_specimens("FASTQ")
badfiles = list_invalid_specimens("FASTQ")

You can test if a specimen in the list has a given tag, or get an attribute like so:

# Test if the first entry in the list of goodfiles has the tag "dna" in its
# list of tags...
hastag(goodfiles[1], "dna")

# Get the comments associated with an entry:
comments(goodfiles[1])

# Get the full path of the file for an entry:
fp = joinpath(path_of_format("FASTQ"), filename(goodfiles[1]))

You can also use do-block syntax to filter the records. For example, to list all the valid FASTA files that contain DNA sequences, filter by tag:

gooddnafiles = list_valid_specimens("FASTA") do x
    hastag(x, "dna")
end
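Putting these pieces together, a unit test over the specimens might look like the sketch below. It uses only the functions shown above (list_valid_specimens, path_of_format, filename). The try_parse function is a hypothetical stand-in for whatever parser you are testing; here it only checks that the file can be opened and read.

```julia
using Test
using FormatSpecimens

# Hypothetical stand-in for the parser under test: replace the body with a
# call to a real reader. As written, it only reads the raw bytes of the file.
function try_parse(path::AbstractString)
    open(path) do io
        read(io)
    end
    return true
end

@testset "valid FASTQ specimens" begin
    dir = path_of_format("FASTQ")
    for specimen in list_valid_specimens("FASTQ")
        # Build the full path to each specimen and feed it to the parser.
        @test try_parse(joinpath(dir, filename(specimen)))
    end
end
```

With a real parser in place, the same loop over list_invalid_specimens("FASTQ") could use @test_throws to check that malformed files are rejected.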

Contributors

jakobnissen, ciaranomara, kescobo
