
five-dollar-genome-analysis-pipeline's People

Contributors

mmorgantaylor


five-dollar-genome-analysis-pipeline's Issues

Localization can be made optional for more stages.

For example, the SortSam task could be modified to operate on I=/dev/stdin, invoking
gsutil cat when the path is a Google Cloud Storage URL.

Alternatively, the Google Cloud bucket could be mounted as a Docker volume.

This could save a few more hours of processing time per run, since that time is currently spent copying data back and forth on the critical path.
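The idea can be sketched with a small wrapper that streams from GCS when given a gs:// URL and otherwise reads locally. This is only a sketch: stream_input is a hypothetical helper, and the gs:// branch assumes the Cloud SDK (gsutil) is available in the container.

```shell
# hypothetical helper: stream an input whether it is local or a gs:// object
stream_input() {
  case "$1" in
    gs://*) gsutil cat "$1" ;;   # stream from the bucket, skipping localization
    *)      cat "$1" ;;          # plain local file
  esac
}

# the SortSam task could then consume the stream without localizing first, e.g.:
#   stream_input "$INPUT_BAM" | java -jar /usr/gitc/picard.jar SortSam \
#     I=/dev/stdin O=sorted.bam SORT_ORDER=coordinate
```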

What does "five-dollar-genome-analysis" mean?

Hi, I have a question: why is this pipeline named "five-dollar-genome-analysis-pipeline"? Does it mean the whole analysis costs 5 dollars? And if the input uBAM file is around 100 GB, does the whole analysis still cost 5 dollars? Thanks

sleep after creating files

Hi,

I think it would be a good idea to add sleep commands after the steps where the pipeline creates text, TSV, or similar files and then has to open them immediately.

For example, in the GetBwaVersion task, where the txt file is created:
sed 's/Version: //' > bwa_version.txt

Some filesystems are not fast enough to make the file available immediately (in this case it is read right away by read_string("bwa_version.txt")), which causes a workflow failure.

The error I was getting was: IOException: Could not read from ...

By adding a sleep, the filesystem is actually ready when the file is read, and my workflow runs smoothly:
sed 's/Version: //' > bwa_version.txt;
sleep 5

I had been facing this issue for a long time. Googling suggests many people hit similar problems, so this fix could be the solution for them too.
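A fixed sleep can still race on a slow filesystem. A bounded polling loop avoids guessing the right delay; here is a sketch of the idea (wait_for_file is a hypothetical helper, not part of the pipeline):

```shell
# hypothetical helper: poll for the file instead of a fixed `sleep 5`
wait_for_file() {
  file=$1
  max_tries=${2:-10}   # give up after this many one-second attempts
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    [ -s "$file" ] && return 0   # file exists and is non-empty
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# usage in the task script would look like:
#   sed 's/Version: //' > bwa_version.txt
#   wait_for_file bwa_version.txt || exit 1
```

This fails fast with a clear exit code if the file never appears, instead of passing a missing file downstream.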

Best,
Dionysis

GATK3 haplotype caller

Hi,
Why does this pipeline still use GATK3 by default for the haplotype caller?

"Boolean use_gatk3_haplotype_caller = true"

GATK4 has been out for a while. Either GATK4 is reliable or it isn't, so it seems odd to keep using GATK3 for the haplotype caller in the official Broad workflows. There is no documentation explaining why this is being done.

Thanks.
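In the meantime the boolean can presumably be overridden in the workflow inputs JSON. A sketch, assuming the input is exposed under the same WholeGenomeGermlineSingleSample namespace shown elsewhere in this thread (the fully-qualified name is an assumption):

```json
{
  "WholeGenomeGermlineSingleSample.use_gatk3_haplotype_caller": false
}
```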

What is "fingerprint_genotypes_file"

I'm curious what fingerprint_genotypes_file is, and which WDL file it appears in?

"WholeGenomeGermlineSingleSample.references": {
    "fingerprint_genotypes_file": "gs://dsde-data-na12878-public/NA12878.hg38.reference.fingerprint.vcf",
    "fingerprint_genotypes_index": "gs://dsde-data-na12878-public/NA12878.hg38.reference.fingerprint.vcf.idx",

Fail if empty ref_alt ?

Hi,

The alignment task is taken to a failed state if there are no ALT contigs (the ref_alt file has no content). Why is this?

In the command, ref_alt itself is not used, but there is an if statement requiring it:

    # if ref_alt has data in it,
    if [ -s ${ref_alt} ]; then
      java -Xms5000m -jar /usr/gitc/picard.jar \
        SamToFastq \
        INPUT=${input_bam} \
        FASTQ=/dev/stdout \
        INTERLEAVE=true \
        NON_PF=true | \
      /usr/gitc/${bwa_commandline} /dev/stdin - 2> >(tee ${output_bam_basename}.bwa.stderr.log >&2) | \
      java -Dsamjdk.compression_level=${compression_level} -Xms3000m -jar /usr/gitc/picard.jar \
        MergeBamAlignment \
        VALIDATION_STRINGENCY=SILENT \
        EXPECTED_ORIENTATIONS=FR \
        ATTRIBUTES_TO_RETAIN=X0 \
        ATTRIBUTES_TO_REMOVE=NM \
        ATTRIBUTES_TO_REMOVE=MD \
        ALIGNED_BAM=/dev/stdin \
        UNMAPPED_BAM=${input_bam} \
        OUTPUT=${output_bam_basename}.bam \
        REFERENCE_SEQUENCE=${ref_fasta} \
        PAIRED_RUN=true \
        SORT_ORDER="unsorted" \
        IS_BISULFITE_SEQUENCE=false \
        ALIGNED_READS_ONLY=false \
        CLIP_ADAPTERS=false \
        MAX_RECORDS_IN_RAM=2000000 \
        ADD_MATE_CIGAR=true \
        MAX_INSERTIONS_OR_DELETIONS=-1 \
        PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
        PROGRAM_RECORD_ID="bwamem" \
        PROGRAM_GROUP_VERSION="${bwa_version}" \
        PROGRAM_GROUP_COMMAND_LINE="${bwa_commandline}" \
        PROGRAM_GROUP_NAME="bwamem" \
        UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
        ALIGNER_PROPER_PAIR_FLAGS=true \
        UNMAP_CONTAMINANT_READS=true \
        ADD_PG_TAG_TO_READS=false

      grep -m1 "read .* ALT contigs" ${output_bam_basename}.bwa.stderr.log | \
      grep -v "read 0 ALT contigs"

    # else ref_alt is empty or could not be found
    else
      exit 1;
    fi

The reason I'm asking is that I'd like to run this with hg19 / b37 reference data.

Thanks!
Juho

Docker Image

Are there any plans to create a docker image to run this pipeline?

SamToFastq error when running locally

I keep getting this when running the pipeline locally:
Error in writing fastq file /dev/stdout

This is SamToFastq with FASTQ=/dev/stdout. I have also tried FASTQ=/proc/self/fd/1.

Can someone please tell me what the fix is?
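One workaround sometimes suggested for local backends is writing to a named pipe instead of /dev/stdout. This is a hedged sketch of the pattern only: the printf/cat lines stand in for the real SamToFastq and bwa commands, which are shown in comments.

```shell
# hypothetical sketch: replace FASTQ=/dev/stdout with a named pipe
fifo=samtofastq_demo.fq
mkfifo "$fifo"

# writer: in the real task this would be
#   java -jar /usr/gitc/picard.jar SamToFastq INPUT=in.bam FASTQ="$fifo" INTERLEAVE=true ... &
printf 'fake fastq record\n' > "$fifo" &

# reader: in the real task this would be `/usr/gitc/bwa mem ... "$fifo"`
out=$(cat "$fifo")
wait          # reap the background writer
rm "$fifo"
echo "$out"
```

The writer blocks until the reader opens the pipe, so no data is lost even if the reader starts late.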

Starting from FASTQ with GATK4

What should I do if I want to start from FASTQ files with GATK4?
How do I get an unmapped BAM? Do I need to map the reads before using GATK?
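One common way to produce the unmapped BAM this pipeline expects is Picard's FastqToSam. A minimal sketch, assuming paired-end FASTQs (the file names and read-group values here are placeholders):

```shell
java -jar picard.jar FastqToSam \
  FASTQ=sample_R1.fastq.gz \
  FASTQ2=sample_R2.fastq.gz \
  OUTPUT=sample.unmapped.bam \
  SAMPLE_NAME=sample1 \
  READ_GROUP_NAME=rg1 \
  PLATFORM=ILLUMINA
```

The pipeline then does the mapping itself (SamToFastq piped into bwa mem and MergeBamAlignment, as in the alignment task above), so no separate alignment step is needed beforehand.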

The flow with example files fails

The verifyBamID used in the docker container seems to be verifyBamID2, which is known to be much less tolerant of low depth. The workflow fails with the default sample files at the contamination detection stage.

PCR INDEL model

Hi,

I notice that this pipeline doesn't specify a PCR indel model, even though your manual for HaplotypeCaller states in big bold capitals: "VERY IMPORTANT: when using PCR-free sequencing data we definitely recommend setting this argument to NONE." Was this an oversight?
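For reference, in GATK4 the model can be set explicitly on HaplotypeCaller. A sketch (the file names are placeholders):

```shell
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample.bam \
  -O sample.g.vcf.gz \
  -ERC GVCF \
  --pcr-indel-model NONE   # recommended for PCR-free libraries
```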

Hg37 support

Hi,

Is there a way to make this work with Hg37? Or is there a similar Hg37 pipeline available?

Thanks,
Juho
