
five-dollar-genome-analysis-pipeline's People

Contributors

mmorgantaylor


five-dollar-genome-analysis-pipeline's Issues

Localization can be made optional for more stages.

For example, the SortSam task could be modified to operate on I=/dev/stdin, invoking
gsutil cat when the path is a Google Cloud Storage URL.

Alternatively, the Google Cloud bucket could be mounted as a Docker volume.

This could save a few more hours of processing time per run, since that time is currently spent copying data back and forth on the critical path.
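The idea can be sketched with a small wrapper that streams from GCS when given a gs:// URL and otherwise reads locally. This is only a sketch: stream_input is a hypothetical helper, and the gs:// branch assumes the Cloud SDK (gsutil) is available in the container.

```shell
# hypothetical helper: stream an input whether it is local or a gs:// object
stream_input() {
  case "$1" in
    gs://*) gsutil cat "$1" ;;   # stream from the bucket, skipping localization
    *)      cat "$1" ;;          # plain local file
  esac
}

# the SortSam task could then consume the stream without localizing first, e.g.:
#   stream_input "$INPUT_BAM" | java -jar /usr/gitc/picard.jar SortSam \
#     I=/dev/stdin O=sorted.bam SORT_ORDER=coordinate
```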

What does "five-dollar-genome-analysis" mean?

Hi, I have a question: why is this pipeline named "five-dollar-genome-analysis-pipeline"? Does it mean the whole analysis costs 5 dollars? And if the input uBAM file is around 100 GB, does the whole analysis still cost 5 dollars? Thanks

sleep after creating files

Hi,

I think it would be a good idea to add sleep commands after the steps where the pipeline creates text, TSV, or similar files and then has to open them immediately.

For example, in the GetBwaVersion task, where the txt file is created:
sed 's/Version: //' > bwa_version.txt

Some filesystems are not fast enough to make the file available immediately (in this case it is read right away by read_string("bwa_version.txt")), which causes a workflow failure.

The error I was getting was: IOException: Could not read from ...

By adding a sleep, the filesystem is actually ready when the file is read, and my workflow runs smoothly:
sed 's/Version: //' > bwa_version.txt;
sleep 5

I had been facing this issue for a long time. Googling suggests many people hit similar problems, so this fix could be the solution for them too.
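A fixed sleep can still race on a slow filesystem. A bounded polling loop avoids guessing the right delay; here is a sketch of the idea (wait_for_file is a hypothetical helper, not part of the pipeline):

```shell
# hypothetical helper: poll for the file instead of a fixed `sleep 5`
wait_for_file() {
  file=$1
  max_tries=${2:-10}   # give up after this many one-second attempts
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    [ -s "$file" ] && return 0   # file exists and is non-empty
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# usage in the task script would look like:
#   sed 's/Version: //' > bwa_version.txt
#   wait_for_file bwa_version.txt || exit 1
```

This fails fast with a clear exit code if the file never appears, instead of passing a missing file downstream.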

Best,
Dionysis

GATK3 haplotype caller

Hi,
Why does this pipeline still use GATK3 by default for the haplotype caller?

"Boolean use_gatk3_haplotype_caller = true"

GATK4 has been out for a while. Either GATK4 is reliable or it isn't, so it seems odd to keep using GATK3 for the haplotype caller in the official Broad workflows. There is no documentation explaining why this is being done.

Thanks.
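In the meantime the boolean can presumably be overridden in the workflow inputs JSON. A sketch, assuming the input is exposed under the same WholeGenomeGermlineSingleSample namespace shown elsewhere in this thread (the fully-qualified name is an assumption):

```json
{
  "WholeGenomeGermlineSingleSample.use_gatk3_haplotype_caller": false
}
```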

What is "fingerprint_genotypes_file"

I'm curious what fingerprint_genotypes_file is, and which WDL file it appears in?

"WholeGenomeGermlineSingleSample.references": {
    "fingerprint_genotypes_file": "gs://dsde-data-na12878-public/NA12878.hg38.reference.fingerprint.vcf",
    "fingerprint_genotypes_index": "gs://dsde-data-na12878-public/NA12878.hg38.reference.fingerprint.vcf.idx",

Fail if empty ref_alt ?

Hi,

The alignment task is taken to a failed state if there are no ALT contigs (the ref_alt file has no content). Why is this?

In the command, ref_alt itself is not used, but there is an if statement requiring it:

    # if ref_alt has data in it,
    if [ -s ${ref_alt} ]; then
      java -Xms5000m -jar /usr/gitc/picard.jar \
        SamToFastq \
        INPUT=${input_bam} \
        FASTQ=/dev/stdout \
        INTERLEAVE=true \
        NON_PF=true | \
      /usr/gitc/${bwa_commandline} /dev/stdin - 2> >(tee ${output_bam_basename}.bwa.stderr.log >&2) | \
      java -Dsamjdk.compression_level=${compression_level} -Xms3000m -jar /usr/gitc/picard.jar \
        MergeBamAlignment \
        VALIDATION_STRINGENCY=SILENT \
        EXPECTED_ORIENTATIONS=FR \
        ATTRIBUTES_TO_RETAIN=X0 \
        ATTRIBUTES_TO_REMOVE=NM \
        ATTRIBUTES_TO_REMOVE=MD \
        ALIGNED_BAM=/dev/stdin \
        UNMAPPED_BAM=${input_bam} \
        OUTPUT=${output_bam_basename}.bam \
        REFERENCE_SEQUENCE=${ref_fasta} \
        PAIRED_RUN=true \
        SORT_ORDER="unsorted" \
        IS_BISULFITE_SEQUENCE=false \
        ALIGNED_READS_ONLY=false \
        CLIP_ADAPTERS=false \
        MAX_RECORDS_IN_RAM=2000000 \
        ADD_MATE_CIGAR=true \
        MAX_INSERTIONS_OR_DELETIONS=-1 \
        PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
        PROGRAM_RECORD_ID="bwamem" \
        PROGRAM_GROUP_VERSION="${bwa_version}" \
        PROGRAM_GROUP_COMMAND_LINE="${bwa_commandline}" \
        PROGRAM_GROUP_NAME="bwamem" \
        UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
        ALIGNER_PROPER_PAIR_FLAGS=true \
        UNMAP_CONTAMINANT_READS=true \
        ADD_PG_TAG_TO_READS=false

      grep -m1 "read .* ALT contigs" ${output_bam_basename}.bwa.stderr.log | \
      grep -v "read 0 ALT contigs"

    # else ref_alt is empty or could not be found
    else
      exit 1;
    fi

The reason I'm asking is that I'd like to run this with hg19 / b37 reference data.

Thanks!
Juho

Docker Image

Are there any plans to create a docker image to run this pipeline?

SamToFastq error when running locally

I keep getting this when running the pipeline locally:
Error in writing fastq file /dev/stdout

This is SamToFastq with FASTQ=/dev/stdout. I have also tried FASTQ=/proc/self/fd/1.

Can someone please tell me what the fix is?
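One workaround sometimes suggested for local backends is writing to a named pipe instead of /dev/stdout. This is a hedged sketch of the pattern only: the printf/cat lines stand in for the real SamToFastq and bwa commands, which are shown in comments.

```shell
# hypothetical sketch: replace FASTQ=/dev/stdout with a named pipe
fifo=samtofastq_demo.fq
mkfifo "$fifo"

# writer: in the real task this would be
#   java -jar /usr/gitc/picard.jar SamToFastq INPUT=in.bam FASTQ="$fifo" INTERLEAVE=true ... &
printf 'fake fastq record\n' > "$fifo" &

# reader: in the real task this would be `/usr/gitc/bwa mem ... "$fifo"`
out=$(cat "$fifo")
wait          # reap the background writer
rm "$fifo"
echo "$out"
```

The writer blocks until the reader opens the pipe, so no data is lost even if the reader starts late.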

Starting from FASTQ with GATK4

What should I do if I want to start from FASTQ files with GATK4?
How do I get an unmapped BAM? Do I need to map the reads before using GATK?
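One common way to produce the unmapped BAM this pipeline expects is Picard's FastqToSam. A minimal sketch, assuming paired-end FASTQs (the file names and read-group values here are placeholders):

```shell
java -jar picard.jar FastqToSam \
  FASTQ=sample_R1.fastq.gz \
  FASTQ2=sample_R2.fastq.gz \
  OUTPUT=sample.unmapped.bam \
  SAMPLE_NAME=sample1 \
  READ_GROUP_NAME=rg1 \
  PLATFORM=ILLUMINA
```

The pipeline then does the mapping itself (SamToFastq piped into bwa mem and MergeBamAlignment, as in the alignment task above), so no separate alignment step is needed beforehand.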

The flow with example files fails

The verifyBamID used in the docker container seems to be verifyBamID2, which is known to be much less tolerant of low depth. The workflow fails with the default sample files at the contamination detection stage.

PCR INDEL model

Hi,

I notice that this pipeline doesn't specify a PCR indel model, even though your manual for HaplotypeCaller states in big bold capitals: "VERY IMPORTANT: when using PCR-free sequencing data we definitely recommend setting this argument to NONE." Was this an oversight?
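For reference, in GATK4 the model can be set explicitly on HaplotypeCaller. A sketch (the file names are placeholders):

```shell
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample.bam \
  -O sample.g.vcf.gz \
  -ERC GVCF \
  --pcr-indel-model NONE   # recommended for PCR-free libraries
```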

Hg37 support

Hi,

Is there a way to make this work with Hg37? Or is there a similar Hg37 pipeline available?

Thanks,
Juho
