czakarian / demultiplex

Tool that performs demultiplexing and reports levels of index hopping

Python 95.38% Shell 4.62%

demultiplex's Introduction

Demultiplexer

Given four input fastq files (2 with biological reads, 2 with index reads) and a list of known indexes, this program will demultiplex reads by index-pair, outputting one R1 fastq file and one R2 fastq file per matching index-pair, another two fastq files for non-matching index-pairs (index-hopping), and two additional fastq files when one or both index reads are unknown or low quality.

The sequence of each index-pair is added to the header of BOTH reads in all fastq files for all categories (e.g. “AAAAAAAA-CCCCCCCC” is appended to the headers of every read pair that had an index1 of AAAAAAAA and an index2 of CCCCCCCC).
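As a rough illustration of the header change, here is a minimal sketch. The function name `add_index_pair`, the single-space separator, and the example header are assumptions made for illustration; the program's actual separator and naming may differ.

```python
def add_index_pair(header, index1, index2):
    """Return a FASTQ header line with 'index1-index2' appended.

    Assumption: the pair is separated from the original header by one space.
    """
    return f"{header.rstrip()} {index1}-{index2}"


# Hypothetical header; the same index pair goes on both the R1 and R2 header.
r1_header = "@read1 1:N:0:1"
print(add_index_pair(r1_header, "AAAAAAAA", "CCCCCCCC"))
```

Because the original header is kept intact, any read number already present in it still distinguishes R1 from R2.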

Final output stats files will report the number of read-pairs with properly matched indexes (per index-pair), the number of read pairs with index-hopping observed, and the number of read-pairs with unknown index(es).

Input

4 fastq files (one read pair, one index pair)
A text file with a list of known index sequences

argparse options:
- -f, --files: required arg, Paths to input fastq files (one read pair, one index pair)
- -i, --indexes: required arg, Path to file containing known index sequences + sample information
- -d, --direct: required arg, Path to output directory
- -s, --stats: required arg, Name for output stats files
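The options above could be declared with argparse roughly as follows. This is a sketch, not the repository's code: the `get_args` name, the use of `nargs=4` for `--files`, and the help strings are assumptions; only the flag names and their required status come from the list above.

```python
import argparse


def get_args(argv=None):
    """Parse the four required options listed above."""
    parser = argparse.ArgumentParser(
        description="Demultiplex reads by index pair and report index hopping."
    )
    # Assumption: the four fastq paths are passed after a single -f flag.
    parser.add_argument("-f", "--files", nargs=4, required=True,
                        help="Paths to input fastq files (one read pair, one index pair)")
    parser.add_argument("-i", "--indexes", required=True,
                        help="Path to file containing known index sequences")
    parser.add_argument("-d", "--direct", required=True,
                        help="Path to output directory")
    parser.add_argument("-s", "--stats", required=True,
                        help="Name for output stats files")
    return parser.parse_args(argv)


# Hypothetical file names, passed explicitly instead of reading sys.argv.
args = get_args(["-f", "R1.fq", "R2.fq", "R3.fq", "R4.fq",
                 "-i", "indexes.txt", "-d", "out", "-s", "stats"])
print(args.files, args.indexes, args.direct, args.stats)
```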

Output

A pair of fastq files per known index pair, a pair for index-hopped read-pairs, and a pair for reads with unknown or low quality index-pairs
Summary stats files
- % and # of read-pairs with matched indexes, read-pairs with index-hopping, and read-pairs with unknown indexes
- % and # of read-pairs with matched indexes reported per index-pair
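The summary numbers can be derived from three tallies kept while demultiplexing. The sketch below assumes a `Counter` keyed by index-pair for matched reads, plain integers for the other two categories, and percentages taken over all read-pairs; the function names, labels, and tab-separated layout are illustrative only.

```python
from collections import Counter


def percent(count, total):
    """Percentage of total, or 0.0 when there are no read-pairs."""
    return 100.0 * count / total if total else 0.0


def summarize(matched, hopped, unknown):
    """Return tab-separated rows: label, count, percent of all read-pairs."""
    total = sum(matched.values()) + hopped + unknown
    rows = [
        ("matched", sum(matched.values())),
        ("index-hopped", hopped),
        ("unknown", unknown),
    ]
    # One additional row per matched index-pair.
    rows += sorted(matched.items())
    return [f"{label}\t{n}\t{percent(n, total):.2f}%" for label, n in rows]


counts = Counter({"AAAAAAAA-AAAAAAAA": 6, "CCCCCCCC-CCCCCCCC": 2})
for line in summarize(counts, hopped=1, unknown=1):
    print(line)
```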

demultiplex's Issues

Assignment the first comments

Great outline! I honestly cannot find anything to critique on. Your algorithm is very well written, and I can easily follow it. It does everything the assignment asks for. I think it’s a fantastic idea that you only store a single record at a time since these are huge files.

Your revcomp function also seems well planned out. Nicely done!

John C's comments

Christina,

Really well written pseudocode! Just some thoughts I had when I was reading it!

“For each pair of indices we will determine which bucket it should be sent to and output the 4 lines for each pair of reads to either the R1 or R2 file in corresponding bucket”

You explain this in detail further down but maybe throw on a “general idea for this code” label? I was a touch confused about how and what was going to take place.

“For the 2 read FASTQ files and 2 index FASTQ files, will extract one read at a time (line by line) and append the read to its appropriate bucket file”

You’re appending the full, 4-line record here, yeah? Not just the sequence line?
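To make the question concrete: a full FASTQ record is four lines (header, sequence, "+", quality), so pulling one record at a time could look something like the sketch below. `read_record` is a made-up helper name, and the in-memory file stands in for a real fastq.

```python
import io
from itertools import islice


def read_record(handle):
    """Return the next 4-line FASTQ record as a list, or None at end of file."""
    lines = [line.rstrip("\n") for line in islice(handle, 4)]
    return lines if len(lines) == 4 else None


# Small in-memory stand-in for one of the four input files.
fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nJJJJ\n")
record = read_record(fastq)
while record is not None:
    header, seq, plus, qual = record
    print(header, seq, qual)
    record = read_record(fastq)
```

Calling this once per loop iteration on each of the four handles keeps only one record per file in memory at a time.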

“If index1 and index2 RC are in the index dictionary // could be either matched or swapped
If index1 and index2 both meet the quality score cutoff:”

I wonder if these couple of lines in this order might elicit some errors for you?

Instead of first checking whether these potentially erroneous indices match the index dictionary (we know they probably won’t make the cut), why can’t we just trash everything that doesn’t make the quality cutoff first?
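Something like this is what I have in mind, as a sketch only. The mean-quality rule, the Phred+33 encoding, the cutoff of 30, and the bucket labels are all my assumptions, not something taken from your pseudocode.

```python
def mean_quality(qual):
    """Mean Phred score of a non-empty quality string (assumes Phred+33)."""
    return sum(ord(c) - 33 for c in qual) / len(qual)


def classify(index1, index2_rc, qual1, qual2, known, cutoff=30):
    """Return 'unknown', 'matched', or 'hopped' for one pair of index reads.

    index2_rc is index2 already reverse complemented.
    """
    # Quality check first: low-quality pairs never reach the dictionary lookup.
    if mean_quality(qual1) < cutoff or mean_quality(qual2) < cutoff:
        return "unknown"
    if index1 not in known or index2_rc not in known:
        return "unknown"
    return "matched" if index1 == index2_rc else "hopped"


known = {"AAAA", "CCCC"}  # toy 4-base indexes for illustration
print(classify("AAAA", "AAAA", "IIII", "IIII", known))
print(classify("AAAA", "CCCC", "IIII", "IIII", known))
print(classify("AAAA", "AAAA", "####", "IIII", known))
```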

“Else if either or both indexes don't meet cutoff:”

Is this the quality score cutoff, or the case where we don’t know what the indices are?

Might be good to know exactly what data we are getting ‘rid’ of and what exactly that means? Are we sure we want to trash all the unknown reads?

“output the modified header to each file”

Same header for both read files? How might you be telling them apart as Read 1/Read 2 when all is said and done?

Functions are looking good. You might want to throw some “N”s into the rev comp function, just as a contingency in case your code picks up an “N”. That way you could use it later too if you want :)
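For example, a rev comp that tolerates N could be as small as this. It is only a sketch: `rev_comp` and `COMPLEMENT` are names I picked, and it assumes N should map to N.

```python
# Translation table: each base maps to its complement, N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def rev_comp(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(rev_comp("AACGN"))
```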

Pseudocode Review

For your pseudocode:

Very detailed and organized
Good use of a dictionary to store your indexes and of if statements to organize your reads.

Optional:

Move your quality score cutoff to your first conditional statement, b/c if that isn’t met, there's no use in going through your other conditional statements.
Replace counter with enumerate
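On the enumerate point, here is a tiny sketch of what I mean, assuming the counter just tracks which record you are on (the `numbered` helper and the sample headers are made up):

```python
def numbered(records):
    """Pair each record with its 1-based position, without a manual counter."""
    return list(enumerate(records, start=1))


headers = ["@read1", "@read2", "@read3"]
for record_number, header in numbered(headers):
    print(record_number, header)
```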

For your Function:

Steps are clearly written out and examples are good

Nice work ☉ ‿ ⚆
