
demultiplex's Introduction

Demultiplexer

Given four input fastq files (2 with biological reads, 2 with index reads) and a list of known indexes, this program will demultiplex reads by index-pair, outputting one R1 fastq file and one R2 fastq file per matching index-pair, another two fastq files for non-matching index-pairs (index-hopping), and two additional fastq files when one or both index reads are unknown or low quality.
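The bucketing decision described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: it assumes dual-matched indexes (a pair counts as matched when both index reads carry the same known index) and that index2 has already been reverse-complemented where needed. The name `classify_pair` and the bucket labels are hypothetical.

```python
def classify_pair(index1, index2, known_indexes):
    """Return a bucket label for one read pair.

    Assumption: index2 is already in the same orientation as the
    known index list (reverse-complemented beforehand if necessary).
    """
    # Any index not in the known list sends the pair to the unknown bucket.
    if index1 not in known_indexes or index2 not in known_indexes:
        return "unknown"
    # Both indexes are known and identical: a properly matched pair.
    if index1 == index2:
        return f"{index1}-{index2}"
    # Both indexes are known but differ: index-hopping.
    return "hopped"


known = {"AAAAAAAA", "CCCCCCCC"}
print(classify_pair("AAAAAAAA", "AAAAAAAA", known))
print(classify_pair("AAAAAAAA", "CCCCCCCC", known))
print(classify_pair("AAAAAAAA", "GGGGNGGG", known))
```

Low-quality index reads would also be routed to the unknown bucket; that check is left out here to keep the sketch short.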

The sequence of each index-pair will be added to the header of BOTH reads in all fastq files for all categories (e.g. “AAAAAAAA-CCCCCCCC” will be appended to the headers of every read pair that had an index1 of AAAAAAAA and an index2 of CCCCCCCC).
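As a small illustration of the header change, the sketch below appends the index pair to a header line. The helper name `tag_header`, the space separator, and the example header are assumptions made for this example; the README does not say how the pair is delimited from the rest of the header.

```python
def tag_header(header, index1, index2):
    """Append 'index1-index2' to a FASTQ header line.

    Assumption: the pair is separated from the original header by a
    single space; the real program may use a different delimiter.
    """
    return f"{header.rstrip()} {index1}-{index2}"


# Made-up header, used for both R1 and R2 of the same pair.
print(tag_header("@read1 1:N:0:1\n", "AAAAAAAA", "CCCCCCCC"))
```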

Final output stats files will report the number of read-pairs with properly matched indexes (per index-pair), the number of read-pairs with index-hopping observed, and the number of read-pairs with unknown index(es).

Input

  1. 4 fastq files (one read pair, one index pair)
  2. A text file with a list of known index sequences
  • argparse options:
    • -f, --files: required arg, Paths to input fastq files (one read pair, one index pair)
    • -i, --indexes: required arg, Path to file containing known index sequences + sample information
    • -d, --direct: required arg, Path to output directory
    • -s, --stats: required arg, Name for output stats files
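The options listed above could be declared with `argparse` along these lines. This is a sketch only: the flag names come from the list, but `nargs=4` for `--files`, the help strings, and the function name `build_parser` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the four required options listed in the README."""
    parser = argparse.ArgumentParser(
        description="Demultiplex paired-end reads by index pair."
    )
    # Assumption: the four fastq paths are passed after a single -f flag.
    parser.add_argument("-f", "--files", required=True, nargs=4,
                        help="Paths to input fastq files (one read pair, one index pair)")
    parser.add_argument("-i", "--indexes", required=True,
                        help="Path to file containing known index sequences")
    parser.add_argument("-d", "--direct", required=True,
                        help="Path to output directory")
    parser.add_argument("-s", "--stats", required=True,
                        help="Name for output stats files")
    return parser


# Parse an explicit argument list (placeholder file names) instead of sys.argv.
args = build_parser().parse_args(
    ["-f", "R1.fq", "R2.fq", "R3.fq", "R4.fq",
     "-i", "indexes.txt", "-d", "out", "-s", "stats"]
)
print(args.files, args.indexes, args.direct, args.stats)
```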

Output

  1. A pair of fastq files per known index pair, a pair for index-hopped read-pairs, and a pair for reads with unknown or low quality index-pairs
  2. Summary stats files
    • % and # of read-pairs with matched indexes, read-pairs with index-hopping, and read-pairs with unknown indexes
    • % and # of read-pairs with matched indexes reported per index-pair
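The count and percentage figures in the summary could be derived from a dictionary of per-category counters, as in this sketch. The function name `summarize`, the category labels, and the counts are hypothetical and chosen only for illustration.

```python
def summarize(counts):
    """Map each category to (count, percent of all read-pairs)."""
    total = sum(counts.values())
    return {
        category: (n, 100 * n / total if total else 0.0)
        for category, n in counts.items()
    }


# Illustrative counts, not real results.
example = summarize({"matched": 6, "hopped": 3, "unknown": 1})
for category, (n, pct) in example.items():
    print(f"{category}\t{n}\t{pct:.1f}%")
```

The same function can be applied to per-index-pair counters to produce the per-pair breakdown.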

demultiplex's People

Contributors

czakarian

demultiplex's Issues

Assignment the first comments

Great outline! I honestly cannot find anything to critique. Your algorithm is very well written, and I can easily follow it. It does everything the assignment asks for. I think it’s a fantastic idea that you only store a single record at a time since these are huge files.

Your revcomp function also seems well planned out. Nicely done!

John C's comments

Christina,

Really well-written pseudocode! Just some thoughts I had while I was reading it!

“For each pair of indices we will determine which bucket it should be sent to and output the 4 lines for each pair of reads to either the R1 or R2 file in corresponding bucket”

You explain this in detail further down but maybe throw on a “general idea for this code” label? I was a touch confused about how and what was going to take place.

“For the 2 read FASTQ files and 2 index FASTQ files, will extract one read at a time (line by line) and append the read to its appropriate bucket file”

You’re appending the full, 4-line record here, yeah? Not just the sequence line?

“If index1 and index2 RC are in the index dictionary // could be either matched or swapped
If index1 and index2 both meet the quality score cutoff:”

I wonder if these couple of lines in this order might elicit some errors for you?

Instead of checking whether these potentially erroneous indices match the index dictionary (we know they probably won’t make the cut), why can’t we just trash everything that doesn’t make the cutoff first?
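One way the suggested ordering might look, as a sketch under stated assumptions: quality scores are Phred+33 encoded, an index fails if any single base falls below the cutoff, and the default cutoff of 30 is arbitrary. The names `passes_cutoff` and `bucket` are hypothetical, and the pseudocode under review may define the cutoff differently.

```python
def passes_cutoff(quality_line, cutoff):
    """True if every base meets the cutoff (assumes Phred+33 encoding)."""
    return all(ord(ch) - 33 >= cutoff for ch in quality_line)


def bucket(index1, qual1, index2, qual2, known, cutoff=30):
    """Check quality first, then look the indexes up in the known set."""
    # Quality check comes first so low-quality pairs skip the lookups.
    if not (passes_cutoff(qual1, cutoff) and passes_cutoff(qual2, cutoff)):
        return "unknown"
    if index1 not in known or index2 not in known:
        return "unknown"
    return "matched" if index1 == index2 else "hopped"


known = {"AAAA", "CCCC"}
print(bucket("AAAA", "IIII", "AAAA", "IIII", known))  # 'I' encodes Q40
print(bucket("AAAA", "II#I", "AAAA", "IIII", known))  # '#' encodes Q2
```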

“Else if either or both indexes don't meet cutoff:”

The quality score cutoff or if we don’t know what the indices are?

Might be good to know exactly what data we are getting ‘rid’ of and what exactly that means? Are we sure we want to trash all the unknown reads?

“output the modified header to each file”

Same header for both read files? How might you be telling them apart as Read 1/Read 2 when all is said and done?

Functions are looking good. You might want to throw some “N”s into the rev comp function, just as a contingency in case your code picks up an “N”. That way you could use it later too if you want :)
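A reverse-complement helper that tolerates “N” might look like the sketch below. It is not the author's function; the name `reverse_complement` and the choice to map N to N are assumptions.

```python
# Translation table for uppercase DNA; N is mapped to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACGN"))  # NCGTT
```

Characters outside ACGTN pass through `str.translate` unchanged, so unexpected symbols are reversed but not complemented.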

Pseudocode Review

For your pseudocode:

  1. Very detailed and organized
  2. Good use of a dictionary to store your indexes and of if statements to organize your reads.

Optional:

  1. Move your quality score cutoff to your first conditional statement, b/c if that isn’t met, there's no use in going through your other conditional statements.
  2. Replace counter with enumerate
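For the `enumerate` suggestion, a rough sketch of what replacing a manual counter could look like when walking FASTQ lines. The function `count_records` and the sample lines are made up for illustration and are not part of the reviewed pseudocode.

```python
def count_records(lines):
    """Count FASTQ records by looking at every fourth line."""
    records = 0
    # enumerate supplies the line number, so no manual counter is needed.
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0 and line.startswith("@"):
            records += 1
    return records


sample = ["@r1", "ACGT", "+", "IIII", "@r2", "TTTT", "+", "####"]
print(count_records(sample))  # 2
```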

For your Function:

  1. Steps are clearly written out and examples are good

Nice work ☉ ‿ ⚆
