Code Monkey home page Code Monkey logo

Comments (5)

marvin-jens avatar marvin-jens commented on September 4, 2024

This would indeed fix the missing RG. However, it still deletes all previous history of the BAM file as documented via PG header elements.
Here is an example:

@RG     ID:A    SM:sts_083_1
@RG     ID:U    SM:unassigned_sts_083_1
@PG     ID:0    PN:preprocess_read1.py  CL:--sample=sts_083_1 --read1=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/raw_data/illumina/reads/raw/sts_083_1_R1.fastq.gz --read2=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/raw_data/illumina/reads/raw/sts_083_1_R2.fastq.gz --parallel=16 --save-stats=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/raw_data/illumina/reads/reversed/sts_083_1_reversed_R1.bc_stats.tsv --log-file=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/raw_data/illumina/reads/reversed/sts_083_1_reversed_R1.preprocessing.log --bc1-ref=/data/rajewsky/projects/combbeads/three_segments/rev1.fa --bc2-ref= --bc1-cache=None --bc2-cache=None --threshold=0.5 --cell=BC1[::-1] --cell-raw=bc1[::-1] --out-format=bam --out-unassigned=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/unaligned_bc_unassigned.bam --out-assigned=/dev/stdout --UMI=r2[0:8] --bam-tags=CR:{raw},CB:{cell},MI:{UMI},RG:{assigned} --min-opseq-score=16  VN:0.9
@PG     ID:samtools     PN:samtools     CL:samtools view -bh /dev/stdin PP:0    VN:1.12
@PG     ID:sambamba     CL:view -h -l9 -f bam /data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/unaligned_bc_tagged.uncompressed.bam       PP:samtools     VN:1.0
@PG     ID:1    PN:TrimStartingSequence CL:TrimStartingSequence INPUT=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/unaligned_bc_tagged.bam OUTPUT=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/unaligned_tagged_trimmed.bam OUTPUT_SUMMARY=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/reports/remove_smart_adapter.report.txt SEQUENCE=AAGCAGTGGTATCAACGCAGAGTGAATGGG MISMATCHES=0 NUM_BASES=5 COMPRESSION_LEVEL=0    TRIM_TAG=ZS VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false     VN:2.4.0(3d2b3d8_1600201514)
@PG     ID:2    PN:PolyATrimmer CL:PolyATrimmer INPUT=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/unaligned_tagged_trimmed.bam OUTPUT=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/unaligned_tagged_trimmed_polyA.bam OUTPUT_SUMMARY=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/reports/remove_polyA.report.txt MISMATCHES=0 NUM_BASES=6    USE_NEW_TRIMMER=false TRIM_TAG=ZP ADAPTER=~XM~XCACGTACTCTGCGTTGCTACCACTG MAX_ADAPTER_ERROR_RATE=0.1 MIN_ADAPTER_MATCH=4 MIN_POLY_A_LENGTH=20 MIN_POLY_A_LENGTH_NO_ADAPTER_MATCH=6 DUBIOUS_ADAPTER_MATCH_LENGTH=6 MAX_POLY_A_ERROR_RATE=0.1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false   VN:2.4.0(3d2b3d8_1600201514)
@PG     ID:STAR PN:STAR CL:STAR   --runThreadN 8   --genomeDir /data/rajewsky/home/nkarais/mm10_GRCm38.p5_gencode.vM23_STAR_2.7.1a/STAR_index   --readFilesType SAM   SE      --readFilesIn /data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/unaligned_tagged_trimmed_polyA.bam      --readFilesCommand samtools   view      --outFileNamePrefix /data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/star_   --outStd BAM_Unsorted   --outSAMtype BAM   Unsorted      --outSAMunmapped Within      --outFilterScoreMin 15   --outFilterMatchNmin 15    VN:2.7.1a
@PG     ID:3    PN:TagReadWithGeneFunction      CL:TagReadWithGeneFunction INPUT=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/star_Aligned.sorted.out.bam OUTPUT=/data/rajewsky/home/marjens/slide_seq/projects/sts_083/processed_data/sts_083_1/illumina/complete_data/final.bam ANNOTATIONS_FILE=/data/rajewsky/home/nkarais/mm10_GRCm38.p5_gencode.vM23_STAR_2.7.1a/gencode.vM23.primary_assembly.annotation.gtf    GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false       VN:2.4.0(3d2b3d8_1600201514)

Since the bam header fix script operates on a stream, this should not really add to disk I/O and also not much to CPU load. A new BAM is created and - with exception of the repaired header - all records are simply copied over.
I'd therefore propose that we keep the script for now. What do you think?

from spacemake.

marvin-jens avatar marvin-jens commented on September 4, 2024

Just checked dropseq.smk and indeed we are currently creating temp(dropseq_mapped_reads_unsorted_headerless) in the map_reads rule. I think this should probably be a pipe. @sztankatt has a similar situation after gene annotation, where a script he added recently can choose the best out of multiple alignments, also by operating on a BAM stream. I propose we try to make STAR output to pipe

  • [ turn STAR output to pipe instead of temp ]

from spacemake.

nukappa avatar nukappa commented on September 4, 2024

In principle if we skip i/o then running the script, should be okay.

STAR has plenty of options though, inlcuding a --outSAMheaderPG which can be used to add the PG entries to the header. One less script to maintain...

from spacemake.

sztankatt avatar sztankatt commented on September 4, 2024

Setting the output of STAR with the pipe(...) output snakemake command will work. however, if we set the rule map_reads to use 24 threads for instance, then snakemake will need at least 25 threads to run (24 for mapping 1 for pipe). And it will throw an error and refuse to run. If you assign the output of star as temp(...) snakemake will run even with 1 core. Somehow the threads parameter doesn't work with pipes, instead of maximising the number it requires it exactly.

so it is a trade off between mapping faster and writing to disk (with a temp file) and then taking this temp file and piping it through the fix_bam_header.py and then through the TagGeneWithReadFunction or slower mapping but then pipe-ing the output.

So basically, it is very tricky to run multithreaded scripts (like STAR) with snakemake piped output, as then snakemake will require a lot of cores, and refuse to run with providing less. So that's why currently the output of star is temp(...)

from spacemake.

sztankatt avatar sztankatt commented on September 4, 2024

2ec05d7 as a workaround I merged the commands and used unix pipes. I am not sure if this is faster though

from spacemake.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.