
intSiteCaller


Introduction

This code takes fastq files produced by the MiSeq and returns integration sites, multihits, and chimeras. RData and fasta files are used for data persistence in place of a database, allowing massive parallelization through the LSF job submission system on the PMACS HPC.

Inputs

Analysis is started by having the user create the following directory structure:

primaryAnalysisDirectory
├── Data
│   ├── Undetermined_S0_L001_I1_001.fastq.gz
│   ├── Undetermined_S0_L001_R1_001.fastq.gz
│   └── Undetermined_S0_L001_R2_001.fastq.gz
├── processingParams.tsv
├── sampleInfo.tsv
└── vector.fasta
Primary Analysis Directory
  • Data/Undetermined_S0_L001_*_001.fastq.gz are the fastq files returned by the MiSeq (R1, R2, and I1). At present the file names are hard-coded and * can only be I1, R1, or R2

  • Optional processingParams.tsv contains 'dryside' processing parameters, shared by all samples:

    • qualityThreshold

    • badQualityBases

    • qualitySlidingWindow

    • mingDNA is the minimum length of genomic DNA (in nt) that is allowed to be passed on to the alignment step

    • minPctIdent is the minimum percent identity that a query sequence has to a putative alignment on the target sequence in order to be considered 'valid'

    • maxAlignStart is the maximum number of nucleotides of the query sequence that do not match the putative target sequence before a matched nucleotide is seen

    • maxFragLength is the maximum length of a properly-paired alignment (in nt)

    • refGenome is the reference genome to be used for alignment - this is passed in as a standard text string (ex. 'hg18', 'hg19', 'mm8', 'mm9')

      After error-correcting and demultiplexing, intSiteCaller trims raw MiSeq reads based on Illumina-issued quality scores. badQualityBases is the number of bases below qualityThreshold, using standard Illumina ASCII Q-score encoding (p.41-2), that can be observed within a window of qualitySlidingWindow bases before the read is trimmed.

    The default file default_processingParams.tsv is used when processingParams.tsv is not found in the directory.

  • Required sampleInfo.tsv contains 'wetside' sample metadata:

    • alias is the human-readable sample description
    • linkerSequence is the linker sequence as seen in MiSeq read 1. N's indicate the presence of a primerID
    • bcSeq is the barcode used during sample preparation
    • gender is either 'm' or 'f' for male/female, respectively
    • primer is the primer sequence as seen in MiSeq read 2
    • ltrBit is the LTR sequence as seen in MiSeq read 2
    • largeLTRFrag is 43 nt of the LTR sequence as seen from MiSeq read 1
    • vectorSeq is a filepath (either absolute or relative to the primary analysis directory) to the vector sequence in fasta format -- it is encouraged to place the vector sequence directly in the primary analysis directory, although that is not a requirement
  • Required vector.fasta is the vector sequence file specified by vectorSeq in sampleInfo.tsv

  • make_primaryAnalysisDirectoryByRunId.R will generate the directory automatically.
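The trimming parameters above can be read together as a sliding-window rule. The sketch below is illustrative only: the function name and exact semantics are assumptions, and the authoritative logic lives in intSiteLogic.R. It assumes a window containing badQualityBases low-quality bases triggers trimming, and that a read left shorter than mingDNA is dropped.

```python
def sliding_window_trim(quals, quality_threshold=30, bad_quality_bases=5,
                        window=10, min_length=30):
    """Scan Phred quality scores left to right; trim the read at the first
    window of `window` bases containing `bad_quality_bases` or more bases
    below `quality_threshold`. Return the trim position, the full length
    if no such window exists, or None if the trimmed read would be shorter
    than `min_length` (cf. mingDNA) and should be discarded."""
    for start in range(len(quals) - window + 1):
        bad = sum(1 for q in quals[start:start + window] if q < quality_threshold)
        if bad >= bad_quality_bases:
            return start if start >= min_length else None
    return len(quals)

# A clean read is kept whole; a read degrading at position 40 is cut there.
print(sliding_window_trim([40] * 50))             # → 50
print(sliding_window_trim([40] * 40 + [2] * 10))  # → 35
```

Defaults here mirror the parameter names in processingParams.tsv but the actual default values ship in default_processingParams.tsv.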

To process multiple genomes in the same run, use a non-standard sampleInfo.tsv and processingParams.tsv: add a refGenome column to sampleInfo.tsv and remove that column from processingParams.tsv.

Usage

After creating the above directory structure and cd-ing into primaryAnalysisDirectory, issue the following command:

Rscript path/to/intSiteCaller.R

The rest of the processing is fully automated and should take no more than 4 hours to process 1.5e7 raw reads.

After intSiteCaller.R is done, the attrition table can be examined with:

Rscript path/to/check_stats.R

The output is a tab-delimited summary table describing each step.

intSiteCaller.R accepts the following optional arguments:

  • -j, --jobID - Unique name by which to identify this instance of intSiteCaller [default: intSiteCallerJob]
  • -c, --codeDir - Directory where intSiteCaller code is stored, can be relative or absolute [default: codeDir as detected by Rscript]
  • -p, --primaryAnalysisDir - Location of primary analysis directory, can be relative or absolute [default: .]
  • -h, --help - Show the help message and exit

Code pipeline example for a run run20150505

  • We assume the packages intSiteCaller, intSiteUploader, and geneTherapyPatientReportMaker are installed in the $HOME directory.
  • We start by creating a folder Frances/run20150505 and moving GeneTherapy-20150505-sampleInfo.csv to that folder.
  • Note that the MiSeq run date 150505 matches between the folder name and the csv file for consistency.
#1. Prepare the structure of primaryAnalysisDirectory, assuming GeneTherapy-20150505-sampleInfo.csv exists
cd Frances/run20150505
Rscript ~/intSiteCaller/make_primaryAnalysisDirectory.R

#2. Align reads and call sites; wait until all bjobs are done; check exit status
Rscript ~/intSiteCaller/intSiteCaller.R
grep -i exit logs/*.txt                   #a good run returns nothing
grep -i error logs/*.txt                  #try to distinguish code error or warning

#3. Check attrition table, make sure the numbers are reasonable and the html or pdf file looks OK
Rscript ~/intSiteCaller/check_stats.R | cut -f1-20
Rscript ~/intSiteCaller/html_stats.R

#4. Upload to database
Rscript ~/intSiteUploader/intSiteUploader.R

#5. Check GTSP numbers, find patient metadata for this run; in this example, 
#   check_gtsp_patient.R shows the run was for pFR03. 
#   check_patient_gtsp.R pFR03 will give us all the sets saved in the database for pFR03
#   and we save that information as input file to generate report.
Rscript ~/geneTherapyPatientReportMaker/check_gtsp_patient.R                      #check patient info for this run
Rscript ~/geneTherapyPatientReportMaker/check_patient_gtsp.R                      #check all patients
Rscript ~/geneTherapyPatientReportMaker/check_patient_gtsp.R pFR03                #output to screen
Rscript ~/geneTherapyPatientReportMaker/check_patient_gtsp.R pFR03 > pFR03.csv    #dump to file

#6. Make report for pFR03
Rscript ~/geneTherapyPatientReportMaker/makeGeneTherapyPatientReport.R pFR03.csv 

#7. WAS.pFR03.20150617.html will be generated (assuming today is 20150617). 
#   If there are more patients in a run, repeat steps 5 and 6.

#8. Generate genomic heatmap
#   To be added

#9. The run folder, minus the intermediate files, should be saved in a central folder for permanent storage.
#   To be developed

#10. After all the above steps, the folder run20150505 can be deleted.

#11. Generate UCSC hub
#   To be developed

Clean up intermediate files to save space

We can remove all files except the RData files and logs with:

Rscript clean_primaryAnalysisDirectory.R

The script asks for confirmation.

Outputs

Integration sites

This code returns integration sites in two formats. allSites.RData is a GRanges object that contains a single record for each Illumina read. sites.final.RData is a GRanges object of dereplicated integration sites along with a tally of how many reads were seen for each site (regardless of sonic breakpoint). The revmap column in sites.final.RData links unique reads from allSites to dereplicated sites in sites.final.

Multihits

Multihits are stored in multihitData.RData which is a GRangesList. The first item in this list is a GRanges object where each record represents a properly-paired alignment. Individual multihit reads can be identified by analysing the ID column, which corresponds to the unique Illumina read identifier. The second item in multihitData is a GRanges object of dereplicated multihits, which lists each unique genomic integration site as a unique record. The revmap column pairs records from multihitData[[1]] to multihitData[[2]]. The third item is a GRanges object of multihit clusters. This is still in development.

Chimeras

Chimeras are stored in chimeraData.RData which is a list that contains some basic chimera frequency statistics and a GRangesList object. Each GRanges object contains two records, one for the read1 alignment and another for the read2 alignment.

PrimerIDs

PrimerIDs (if present in the linker sequence) are stored in primerIDData.RData. This file is a base R list holding a DNAStringSet of primerID sequences and a BStringSet of their quality scores.
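Conceptually, primerIDs are recovered from the read positions where the linker template carries 'N'. A minimal positional-matching sketch (hypothetical helper, not the pipeline's code, which works on DNAStringSet objects in R):

```python
def extract_primer_id(linker_template, read):
    """Return the read bases at every position where the linker template
    carries an 'N' (the degenerate primerID positions)."""
    return "".join(base for base, templ in zip(read, linker_template)
                   if templ == "N")

# Template with a 6-N primerID flanked by fixed linker sequence:
print(extract_primer_id("CTAGNNNNNNGTCA", "CTAGACGTACGTCA"))  # → ACGTAC
```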

Stats

Processing statistics are returned in the stats.RData file. This file contains a single data.frame. Details of the specific columns provided in this dataframe will be added later and can be inferred from intSiteLogic.R.

Dependencies

This code is highly dependent on Bioconductor packages for processing DNA data and collapsing/expanding alignments.

The following R packages and their dependencies are required for proper operation of intSiteCaller:

  • ShortRead
  • GenomicRanges
  • rtracklayer
  • BSgenome
  • argparse
  • igraph
  • BSgenome.*.UCSC.* packages corresponding to the reference genomes specified in processingParams.tsv

Specific versioning analysis has not yet been performed.

Additionally, blat and python are required and must be executable from any path. blat is available from https://genome.ucsc.edu/FAQ/FAQblat.html#blat3

intSiteCaller confirms the presence of all dependencies and will throw an error if a dependency is not met.

Code Structure

  • Primary read trimming and integration site calling logic is contained in intSiteLogic.R.
  • Branching and condensing of PMACS jobs is performed in programFlow.R
  • Barcode error correcting logic is performed in errorCorrectIndices/golay.py as wrapped by errorCorrectIndices/processGolay.py.
  • All code files are checked into the repository.
  • Flowcharts will be added to graphically describe the flow of the overall program as well as the flow/logic of individual functions
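The barcode error correction in golay.py decodes each observed index against the Golay(24,12) code, which can correct up to three bit errors algebraically. For intuition only, the same idea can be shown as a plain Hamming-distance search; this is an illustrative stand-in, not the actual Golay decoder:

```python
def correct_barcode(observed, known_barcodes, max_mismatches=1):
    """Map an observed barcode to the unique known barcode within
    `max_mismatches` Hamming distance; return None when no match exists
    or the match is ambiguous."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    hits = [bc for bc in known_barcodes
            if len(bc) == len(observed)
            and hamming(observed, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None

print(correct_barcode("AAAT", ["AAAA", "CCCC"]))  # → AAAA (one mismatch fixed)
print(correct_barcode("AACC", ["AAAA", "CCCC"]))  # → None (too far from both)
```

The real Golay decoding is far cheaper per read than this brute-force search, which is why the pipeline uses it.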

Tests

A sample dataset is included for verification of integration site calling accuracy. The testCases directory contains:

  • intSiteValidation folder, which includes the minimal number of files to process a test run,
  • intSiteValidation.digest, a digest (an R analogue of md5) file for the RData files that the test run should produce,
  • intSiteValidation.attr, an attrition table that describes the filtering and alignment process,
  • test_identical_run.R, the script to run the pipeline and check the output.

To analyze the test data, run the following commands from the root of the repository:

cd testCases/intSiteValidation/
Rscript test_identical_run.R

The test should finish in about 10 minutes if PMACS is not busy, and the output messages indicate whether the pipeline produced the same results as before. Note that this subset of data contains samples with borderline cases: for example, clone7 samples should all fail, and many of the clone1-clone4 samples should return no multihits or chimeras. The current implementation handles these gracefully.

Unit tests

Run unit tests with:

library(testthat)
test_dir('tests')

intsitecaller's People

Contributors

anatolydryga, cnobles, esherm, helixscript, yhwu


intsitecaller's Issues

PrimerID reporting is broken

In output, primerIDData.RData is empty.

intSiteLogic.R ln 354 seems to have an incorrect reference when assigning the variable "primerIDs": the output variable from trim_primerIDlinker_side_reads() is referenced twice. Line below:

primerIDs <-readslprimer$readslprimer$primerID

It should be:

primerIDs <-readslprimer$primerID

As the "readslprimer" object is a list whose first element is 'reads.l' and second is 'primerID'.

The section of code is below (ln 352-356 of intSiteLogic.R):
readslprimer <- trim_primerIDlinker_side_reads(reads[[1]], linker)
reads.l <- readslprimer$reads.l
primerIDs <-readslprimer$readslprimer$primerID
stats.bore$linkered <- length(reads.l)
save(primerIDs, file="primerIDData.RData")

intSiteCaller seems to hang, although RData objects were created.

981538 yinghua PEND normal node027.hpc *alidation May 27 10:24
997253 yinghua PEND normal node088.hpc *alidation May 28 13:55

[yinghua@node063 intSiteValidation]$ ls -l clone1-1/
total 384
-rw-rw-r-- 1 yinghua yinghua 1232 May 28 14:31 allSites.RData
-rw-rw-r-- 1 yinghua yinghua 61 May 28 14:31 callStatus.RData
-rw-rw-r-- 1 yinghua yinghua 1184 May 28 14:31 chimeraData.RData
-rw-rw-r-- 1 yinghua yinghua 1415 May 28 14:31 hits.R1.RData
-rw-rw-r-- 1 yinghua yinghua 1310 May 28 14:31 hits.R2.RData
-rw-rw-r-- 1 yinghua yinghua 487 May 28 14:27 keys.RData
-rw-rw-r-- 1 yinghua yinghua 1165 May 28 14:31 multihitData.RData
-rw-rw-r-- 1 yinghua yinghua 14509 May 28 14:27 primerIDData.RData
-rw-rw-r-- 1 yinghua yinghua 3108 May 28 14:27 R1-1.fa
-rw-rw-r-- 1 yinghua yinghua 494 May 28 14:31 R1-1.fa.psl.gz
-rw-rw-r-- 1 yinghua yinghua 1284 May 28 14:27 R2-1.fa
-rw-rw-r-- 1 yinghua yinghua 242 May 28 14:31 R2-1.fa.psl.gz
-rw-rw-r-- 1 yinghua yinghua 832 May 28 14:31 sites.final.RData
-rw-rw-r-- 1 yinghua yinghua 385 May 28 14:31 stats.RData
-rw-rw-r-- 1 yinghua yinghua 73 May 28 14:27 trimStatus.RData

hung job

It seems that after alignment, many jobs completed but one job was left pending. The bjobs -l output is below. A good number of samples did complete, but roughly 25% did not finish.

Job <142730>, Job Name <BushmanCleanup_CART19_Set_1>, User , Project <
default>, Status , Queue , Command <Rscript
-e "source('/home/cnobles/scripts/intSiteCaller/programFlo
w.R'); cleanup();">
Thu Jun 4 14:43:27: Submitted from host <node126.hpc.local>, CWD <$HOME/illumi
na/CART19/Sample_Set_1>, Output File <logs/cleanupOutput.t
xt>, Dependency Condition <done(BushmanCallIntSites_CART19
_Set_1)>;

MEMLIMIT
1000 M
PENDING REASONS:
Dependency condition invalid or never satisfied;

SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:32]
same[model] affinity[thread(1)*1]
Effective: -

Replace actual quality trimming.

Code should be changed back to looking at the quality scores rather than successive 'N' calls.

Quality trimming occurs around line 326 of intSiteLogic.R. Previous code which looked at quality scores was commented out.

This can cause problems if the low quality bases are not called as 'N'.

test_identical_run.R fails

Running_minutes: 0 jobs: 1
Running_minutes: 1 jobs: 42
Running_minutes: 2 jobs: 1
Running_minutes: 3 jobs: 108
Running_minutes: 4 jobs: 108
Running_minutes: 5 jobs: 92
Running_minutes: 6 jobs: 36
Running_minutes: 7 jobs: 0
Run stopped after: 7 minutes

Checking md5 digest for RData files
Error in if (any(!md5.all$same)) { :
missing value where TRUE/FALSE needed
Calls: source -> withVisible -> eval -> eval
Execution halted

@yhwu any guesses what is going on with the test?

bsub cannot submit array job with too many subjobs

Bad one:

bsub -q normal -n 1 -M 24000 -J "BushmanCallIntSites_intSiteCallerJob[1,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165]" -o logs/callSitesOutput%I.txt Rscript -e "source('/home/yinghua/intSiteCaller/programFlow.R'); callIntSites();"
Bad job name. Job not submitted.

Good one:

bsub -q normal -n 1 -M 24000 -J "BushmanCallIntSites_intSiteCallerJob[1,6,7,8,9]" -o logs/callSitesOutput%I.txt Rscript -e "source('/home/yinghua/intSiteCaller/programFlow.R'); callIntSites();"
Job <614488> is submitted to queue <normal>.

This bug affected run20150403, which contains 165 samples. The solution is to split the job array.
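A sketch of the splitting workaround — submit the array in chunks. The 100-index cap here is an assumption; the real limit comes from bsub's job-name length, so the chunk size may need tuning:

```python
def chunk_indices(indices, max_per_job=100):
    """Split a list of LSF array-job indices into chunks, each small enough
    to fit in a single bsub -J "name[i,j,...]" submission."""
    return [indices[i:i + max_per_job]
            for i in range(0, len(indices), max_per_job)]

# The failing submission used indices 1 and 6..165 (161 subjobs in one array):
samples = [1] + list(range(6, 166))
chunks = chunk_indices(samples)
print(len(chunks), [len(c) for c in chunks])  # → 2 [100, 61]
```

Each chunk then becomes its own bsub call with the same dependency conditions.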

fix_seg branch

Some differences for "-" strand rows; "+" strand rows seem to be identical

24    chr17 18145962 18146337   376      - 11, 12         6   clone1-4  D03FP:1:1101:16257:9296
24    chr17 18145962 18145962     1      - 11, 12         6   clone1-4  D03FP:1:1101:16257:9296

This is possibly caused by

# in chris's fix
standardizedStarts <- rep(start(dereplicated), dereplicated$counts)

# in original
standardizedStarts <- start(flank(standardized, -1, start=T))
standardized <- unname(dereplicated[rep(dereplicated$dereplicatedSiteID, dereplicated$counts)])

A possible fix to Chris's version could be

standardizedStarts <- rep(start(flank(dereplicated, -1, start=T)), dereplicated$counts)

Frances's WAS run

Seems to be OK. Some concerns:

[yinghua@node063 run20150505]$ grep -i exit logs/*
## clean

[yinghua@node063 run20150505]$ grep -i error logs/*
logs/callSitesOutput3.txt:[1] "Caught error: error in evaluating the argument 'f' in selecting a method for function 'split': Error in as.vector(Rle(pairedAlignments$pairingID, alignmentsPerPairing)) : \n  error in evaluating the argument 'x' in selecting a method for function 'as.vector': Error in (function (classes, fdef, mtable)  : \n  unable to find an inherited method for function ‘Rle’ for signature ‘\"integer\", \"list\"’\nCalls: Rle -> <Anonymous>\n"
... ...
logs/errorCorrectWorkerOutput9.txt:/home/yinghua/intSiteCaller/errorCorrectIndices/golay.py:74: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
logs/trimOutput1.txt:[1] "Caught error: error - no curated reads"
logs/trimOutput2.txt:[1] "Caught error: error - no curated reads"
logs/trimOutput4.txt:[1] "Caught ERROR in intSiteLogic: No hits found which passed the qualityThreshold"
logs/trimOutput4.txt:[1] "Caught ERROR in intSiteLogic: No hits found"
logs/trimOutput4.txt:[1] "Caught error: error - no curated reads"

The "no curated reads" and "no hits" errors were due to UninfectedControls-1,2,3,4.

processGolay.py

out_handle = open("Data/correctedI1-" + fileNumber + ".fasta","a")

a: append mode. It should start fresh. Reads will be duplicated if a run was not successfully processed by the pipeline on the first attempt and the folder was not cleaned before further attempts. Duplicated reads lead to duplicated read names and will cause an error later.

Suggested change:
outfile = ( "Data/correctedI1-" + fileNumber + ".fasta" )
os.remove(outfile) if os.path.exists(outfile) else None
out_handle = open(outfile, "a")

Error "Can't open NA.2bit to read"

Hi
I am trying to run the test on the fastq files contained in the validationDataSet/Data folder following your instructions but I get the following error after processing of clone1-4

(...)
sample clone1-4
barcoded 104
LTRed 92
linkered 102
ltredlinkered 90
[1] TRUE
mustOpen: Can't open NA.2bit to read: No such file or directory
[1] TRUE
Error in strsplit(alignFile, "/")[[1]] : subscript out of bounds
Calls: alignSeqs
Execution halted

The processingParams.tsv is as follows

qualityThreshold badQualityBases qualitySlidingWindow mingDNA minPctIdent maxAlignStart maxFragLength refGenome
? 5 10 30 95 5 2500 hg18

Genome hg18 is installed properly and hg18.2bit is generated as well
(tried also changing to hg19. Same error)
Can you help?

golay.py

logs/errorCorrectWorkerOutput8.txt:/home/yinghua/intSiteCaller/errorCorrectIndices/golay.py:74: FutureWarning: comparison to None will result in an elementwise object comparison in the future.

Low priority, but let's get rid of it.

JobID error intSiteCallerJob[1-0]

note: intSiteCallerJob[1-0]

bsub -q normal -n 1 -M 8000 -w "done(BushmanPostTrimProcessing_intSiteCallerJob)" -J "BushmanAlignSeqs_intSiteCallerJob[1-0]" -o logs/alignOutput%I.txt Rscript
-e "source('/home/yinghua/intSiteCaller/programFlow.R'); alignSeqs();"

Miscalling integration site and breakpoint positions on genome

On line 476 of intSiteLogic.R, we've assigned the "tStart" and "tEnd" from the alignment output to the GRanges start and end positions. This is correct most of the time, but the "tStart" and "tEnd" in the psl output file correspond to the edges of the alignment. If there are bases at the beginning and end of the read that do not align, they are not considered when assigning the position.

I think this is a mistake in logic. Instead, the range assignment should incorporate the flanking nucleotides that may not align, especially on the LTR end. This region is known to have a drop in sequence quality that can lead to base miscalling. These miscalled bases may not be aligned, which shifts the called position of the integration site. A quality metric used for filtering is that the "qStart" must be less than or equal to 5 nt (line 519, adjustable with maxAlignStart), which means that the called position could be up to 5 nt away from the actual junction of the LTR.

This is fixed downstream in our reports by standardizing within a 5bp window, but could be refined in the current informatic processing. The proposed change would look something like below:

intSiteLogic.R : Line 476

    ranges=IRanges(start=algns$tStart, end=algns$tEnd)

changed to:

    ranges = ifelse(algns$strand == "+",
        IRanges(
            start = (algns$tStart - algns$qStart + 1),
            end = (algns$tEnd + (algns$qSize - algns$qEnd))),
        IRanges(
            start = (algns$tStart - (algns$qSize - algns$qEnd)),
            end = (algns$tEnd + algns$qStart - 1)))
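Stated outside R (using 0-based half-open psl coordinates, so without the ±1 adjustments needed for 1-based IRanges), the proposed correction extends each alignment by its unaligned flanking query bases:

```python
def corrected_range(strand, t_start, t_end, q_start, q_end, q_size):
    """Extend a psl alignment so the genomic range also covers query bases
    that failed to align at the read ends; on the minus strand the query
    ends map to opposite genomic ends."""
    if strand == "+":
        return (t_start - q_start, t_end + (q_size - q_end))
    return (t_start - (q_size - q_end), t_end + q_start)

# 50 nt read with 3 unaligned bases at the start and 2 at the end:
print(corrected_range("+", 100, 150, 3, 48, 50))  # → (97, 152)
print(corrected_range("-", 100, 150, 3, 48, 50))  # → (98, 153)
```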

Register metadata with Bushman Registry

Please register samples with the lab registry at completion. I realize this isn't a generic feature of this script, but here seems like a convenient place to call it at the end.

/home/aubreybailey/parse_samplesheet.py /media/THING1/Illumina/1507..../Data/Intensities/BaseCalls/Undetermined_S0_L001_R1_001.fastq.gz to get the suggested registration line like:
register_run.py --file=$PWD/Undetermined_S0_L001_R1_001.fastq.gz
then
register_samples.py --insprid -r ### -s sampleInfo.tsv

ref genome requirement

There is no flag or check at run initialization for the reference genome supplied in the processing parameters.

If "hg18" is specified, then the R package "BSgenome.Hsapiens.UCSC.hg18" should be required to start the run. Likewise, other reference genomes should be supported the same way.

check_stats.R crashes on unfinished run

Tried running check_stats.R on runs that don't have 100% completion (about 95% or more samples completed), and the error message below is returned:

Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Calls: do.call -> -> rbind
Execution halted

I'm going to take a guess at the issue: since not all the samples finished, the stats.RData files from the samples don't have the same number of columns and therefore cannot be rbind()'ed.
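If that guess is right, the fix is to normalize the column set before binding. A sketch with plain dicts standing in for the per-sample data.frames (hypothetical helper, not check_stats.R itself):

```python
def combine_stats(rows):
    """Row-bind per-sample stats records that may have different columns,
    filling absent fields with None so every row has the same shape."""
    columns = sorted({key for row in rows for key in row})
    return [{col: row.get(col) for col in columns} for row in rows]

# A finished sample has more columns than an unfinished one:
rows = [{"sample": "clone1-1", "barcoded": 104, "linkered": 102},
        {"sample": "clone7-1", "barcoded": 12}]
print(combine_stats(rows)[1]["linkered"])  # → None
```

In R the same effect could be had by padding each data.frame to the union of column names before the rbind.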

still too much memory

Chris's fix improved the situation, but we still run into memory problems.

After fix:

[yinghua@node063 run150127]$ grep -i exit logs/*.txt
logs/callSitesOutput12.txt:Subject: Job 204647[12]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput12.txt:Exited with exit code 1.

[yinghua@node063 run150127]$ grep -i memory logs/*.txt | grep -i max | grep Output12.txt:
logs/alignOutput12.txt:    Max Memory :                                 3901 MB
logs/callSitesOutput12.txt:    Max Memory :                                 24396 MB
logs/errorCorrectWorkerOutput12.txt:    Max Memory :                                 22 MB
logs/trimOutput12.txt:    Max Memory :                                 530 MB

Before fix:

[yinghua@node063 run150127.bak]$ grep -i exit logs/*.txt
logs/callSitesOutput12.txt:Subject: Job 149629[12]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput12.txt:Exited with exit code 1.
logs/callSitesOutput1.txt:Subject: Job 149629[1]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput1.txt:Exited with exit code 139.
logs/callSitesOutput1.txt:Subject: Job 160200[1]: <BushmanCallIntSites_intSiteCallerJob[1]> in cluster <pennhpc> Exited
logs/callSitesOutput1.txt:Exited with exit code 139.
logs/callSitesOutput2.txt:Subject: Job 149629[2]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput2.txt:Exited with exit code 139.
logs/callSitesOutput3.txt:Subject: Job 149629[3]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput3.txt:Exited with exit code 139.
logs/callSitesOutput43.txt:Subject: Job 149629[43]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput43.txt:Exited with exit code 139.
logs/callSitesOutput44.txt:Subject: Job 149629[44]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput44.txt:Exited with exit code 139.
logs/callSitesOutput45.txt:Subject: Job 149629[45]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput45.txt:Exited with exit code 139.
logs/callSitesOutput4.txt:Subject: Job 149629[4]: <BushmanCallIntSites_intSiteCallerJob[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]> in cluster <pennhpc> Exited
logs/callSitesOutput4.txt:Exited with exit code 139.

Columns from technicians

alias, linkerSequence, bcSeq, gender, primer , ltrBit, largeLTRFrag, vectorSeq

  1. column names are case sensitive.
  2. barcode and alias should have a unique relationship.
  3. alias should be something like GTSP0001-1 for gene therapy.

Runs by yinghua and anatoly produced different results

  1. keys.RData are different for all samples.
$1. check md5 for correctedI1-1.fasta trimmedI1-1.fasta; same
-
$2. check md5 for demultiplexedReps/*.gz; same
-
$3. check completeMetadata.RData
Rscript -e "a=load('completeMetadata.RData'); print(get(a))" > tmp.txt
Rscript -e "a=load('~/intSiteValidation_4PM/completeMetadata.RData'); print(get(a))" > tmp2.txt
diff tmp.txt tmp2.txt
rm tmp.txt tmp2.txt
#only difference were the file names, pass
-
$4. check R1 R2 reads
wc -l ~/intSiteValidation_4PM/pool3-4/R1-1.fa                                   #Anatoly run
#116 /home/yinghua/intSiteValidation_4PM/pool3-4/R1-1.fa            
wc -l pool3-4/R1-1.fa                                                           #yinghua run
#70 pool3-4/R1-1.fa
(for f in ~/intSiteValidation.wu/*/R?-*.fa; do n=`grep "^>" $f | tail -1 | sed -e 's/>//'`; echo $f $n; done) > tmp.txt
(for f in ~/intSiteValidation.anatoly/*/R?-*.fa; do n=`grep "^>" $f | tail -1 | sed -e 's/>//'`; echo $f $n; done) > tmp2.txt
paste tmp.txt tmp2.txt | sed -e 's/\/home\/yinghua\/intSiteValidation.//g'
wu/clone1-1/R1-1.fa 24  anatoly/clone1-1/R1-1.fa 49
wu/clone1-1/R2-1.fa 15  anatoly/clone1-1/R2-1.fa 25
wu/clone1-2/R1-1.fa 19  anatoly/clone1-2/R1-1.fa 45
wu/clone1-2/R2-1.fa 7   anatoly/clone1-2/R2-1.fa 23
wu/clone1-3/R1-1.fa 27  anatoly/clone1-3/R1-1.fa 60
wu/clone1-3/R2-1.fa 10  anatoly/clone1-3/R2-1.fa 26
wu/clone1-4/R1-1.fa 25  anatoly/clone1-4/R1-1.fa 56
wu/clone1-4/R2-1.fa 12  anatoly/clone1-4/R2-1.fa 26
wu/clone2-1/R1-1.fa 31  anatoly/clone2-1/R1-1.fa 76
wu/clone2-1/R2-1.fa 13  anatoly/clone2-1/R2-1.fa 29
wu/clone2-2/R1-1.fa 41  anatoly/clone2-2/R1-1.fa 78
wu/clone2-2/R2-1.fa 18  anatoly/clone2-2/R2-1.fa 30
wu/clone2-3/R1-1.fa 49  anatoly/clone2-3/R1-1.fa 87
wu/clone2-3/R2-1.fa 22  anatoly/clone2-3/R2-1.fa 42
wu/clone2-4/R1-1.fa 36  anatoly/clone2-4/R1-1.fa 71
wu/clone2-4/R2-1.fa 15  anatoly/clone2-4/R2-1.fa 32
wu/clone3-1/R1-1.fa 21  anatoly/clone3-1/R1-1.fa 53
wu/clone3-1/R2-1.fa 11  anatoly/clone3-1/R2-1.fa 20
wu/clone3-2/R1-1.fa 21  anatoly/clone3-2/R1-1.fa 55
wu/clone3-2/R2-1.fa 10  anatoly/clone3-2/R2-1.fa 22
wu/clone3-3/R1-1.fa 23  anatoly/clone3-3/R1-1.fa 62
wu/clone3-3/R2-1.fa 5   anatoly/clone3-3/R2-1.fa 14
wu/clone3-4/R1-1.fa 13  anatoly/clone3-4/R1-1.fa 50
wu/clone3-4/R2-1.fa 4   anatoly/clone3-4/R2-1.fa 16
wu/clone4-1/R1-1.fa 28  anatoly/clone4-1/R1-1.fa 57
wu/clone4-1/R2-1.fa 11  anatoly/clone4-1/R2-1.fa 26
wu/clone4-2/R1-1.fa 31  anatoly/clone4-2/R1-1.fa 76
wu/clone4-2/R2-1.fa 10  anatoly/clone4-2/R2-1.fa 38
wu/clone4-3/R1-1.fa 41  anatoly/clone4-3/R1-1.fa 71
wu/clone4-3/R2-1.fa 22  anatoly/clone4-3/R2-1.fa 35
wu/clone4-4/R1-1.fa 43  anatoly/clone4-4/R1-1.fa 74
wu/clone4-4/R2-1.fa 20  anatoly/clone4-4/R2-1.fa 34
wu/HIV_CTRL_noLig-1/R1-1.fa 21  anatoly/HIV_CTRL_noLig-1/R1-1.fa 29
wu/HIV_CTRL_noLig-1/R2-1.fa 20  anatoly/HIV_CTRL_noLig-1/R2-1.fa 25
wu/HIV_CTRL_noLig-2/R1-1.fa 13  anatoly/HIV_CTRL_noLig-2/R1-1.fa 31
wu/HIV_CTRL_noLig-2/R2-1.fa 13  anatoly/HIV_CTRL_noLig-2/R2-1.fa 26
wu/HIV_CTRL_noLig-3/R1-1.fa 17  anatoly/HIV_CTRL_noLig-3/R1-1.fa 40
wu/HIV_CTRL_noLig-3/R2-1.fa 17  anatoly/HIV_CTRL_noLig-3/R2-1.fa 31
wu/HIV_CTRL_noLig-4/R1-1.fa 24  anatoly/HIV_CTRL_noLig-4/R1-1.fa 45
wu/HIV_CTRL_noLig-4/R2-1.fa 24  anatoly/HIV_CTRL_noLig-4/R2-1.fa 40
wu/HIV_CTRL_wLig-1/R1-1.fa 18   anatoly/HIV_CTRL_wLig-1/R1-1.fa 62
wu/HIV_CTRL_wLig-1/R2-1.fa 16   anatoly/HIV_CTRL_wLig-1/R2-1.fa 35
wu/HIV_CTRL_wLig-2/R1-1.fa 20   anatoly/HIV_CTRL_wLig-2/R1-1.fa 45
wu/HIV_CTRL_wLig-2/R2-1.fa 18   anatoly/HIV_CTRL_wLig-2/R2-1.fa 32
wu/HIV_CTRL_wLig-3/R1-1.fa 27   anatoly/HIV_CTRL_wLig-3/R1-1.fa 47
wu/HIV_CTRL_wLig-3/R2-1.fa 27   anatoly/HIV_CTRL_wLig-3/R2-1.fa 37
wu/HIV_CTRL_wLig-4/R1-1.fa 31   anatoly/HIV_CTRL_wLig-4/R1-1.fa 58
wu/HIV_CTRL_wLig-4/R2-1.fa 30   anatoly/HIV_CTRL_wLig-4/R2-1.fa 39
wu/pool1-1/R1-1.fa 33   anatoly/pool1-1/R1-1.fa 70
wu/pool1-1/R2-1.fa 22   anatoly/pool1-1/R2-1.fa 41
wu/pool1-2/R1-1.fa 32   anatoly/pool1-2/R1-1.fa 55
wu/pool1-2/R2-1.fa 20   anatoly/pool1-2/R2-1.fa 30
wu/pool1-3/R1-1.fa 40   anatoly/pool1-3/R1-1.fa 78
wu/pool1-3/R2-1.fa 30   anatoly/pool1-3/R2-1.fa 51
wu/pool1-4/R1-1.fa 28   anatoly/pool1-4/R1-1.fa 69
wu/pool1-4/R2-1.fa 24   anatoly/pool1-4/R2-1.fa 46
wu/pool2-1/R1-1.fa 40   anatoly/pool2-1/R1-1.fa 80
wu/pool2-1/R2-1.fa 25   anatoly/pool2-1/R2-1.fa 45
wu/pool2-2/R1-1.fa 33   anatoly/pool2-2/R1-1.fa 73
wu/pool2-2/R2-1.fa 18   anatoly/pool2-2/R2-1.fa 38
wu/pool2-3/R1-1.fa 40   anatoly/pool2-3/R1-1.fa 75
wu/pool2-3/R2-1.fa 26   anatoly/pool2-3/R2-1.fa 49
wu/pool2-4/R1-1.fa 34   anatoly/pool2-4/R1-1.fa 65
wu/pool2-4/R2-1.fa 22   anatoly/pool2-4/R2-1.fa 44
wu/pool3-1/R1-1.fa 20   anatoly/pool3-1/R1-1.fa 38
wu/pool3-1/R2-1.fa 9    anatoly/pool3-1/R2-1.fa 13
wu/pool3-2/R1-1.fa 42   anatoly/pool3-2/R1-1.fa 65
wu/pool3-2/R2-1.fa 24   anatoly/pool3-2/R2-1.fa 34
wu/pool3-3/R1-1.fa 35   anatoly/pool3-3/R1-1.fa 65
wu/pool3-3/R2-1.fa 14   anatoly/pool3-3/R2-1.fa 29
wu/pool3-4/R1-1.fa 26   anatoly/pool3-4/R1-1.fa 42
wu/pool3-4/R2-1.fa 16   anatoly/pool3-4/R2-1.fa 22
#### note, the number of reads is significantly different
#### given demultiplexing produced the same fa.gz files, something must have changed between the two runs in the filtering steps.
-
$5. check attrition table
Rscript ~/intSiteCaller/check_stats.R > tmp.wu.txt
Rscript ~/intSiteCaller/check_stats.R ~/intSiteValidation_4PM > tmp.anatoly.txt

After comparing the tables, a problem was identified: vector trimming was not effective at all in anatoly's run.

parallel demultiplexing

#Demultiplexing is currently a single-core process - perhaps it could be made
#more efficient by having each error-correct worker do its own
#mini-demultiplex with the barcodes that it error corrected, then write their
#own shorter fastq files which can be cat'd together after everything is done
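The suggested per-worker mini-demultiplex could look like the sketch below. This is illustrative only: the real worker would match error-corrected I1 barcodes against sampleInfo.tsv and append to per-sample fastq files, which would then be concatenated once all workers finish.

```python
def demultiplex(reads, barcode_to_sample):
    """Bin (name, barcode, sequence) records by sample, dropping reads
    whose barcode is not in this worker's assignment."""
    bins = {}
    for name, barcode, seq in reads:
        sample = barcode_to_sample.get(barcode)
        if sample is not None:
            bins.setdefault(sample, []).append((name, seq))
    return bins

reads = [("r1", "AAAA", "ACGT"), ("r2", "CCCC", "GGGG"), ("r3", "TTTT", "AAAA")]
print(demultiplex(reads, {"AAAA": "clone1-1", "CCCC": "clone1-2"}))
```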
