Code Monkey home page Code Monkey logo

Comments (5)

melnel000 avatar melnel000 commented on June 16, 2024

I am also having the same problem. I picked this up because I tried to convert the BAMs generated by vg surject to CRAMs which gives the following error using samtools:

[W::cram_encode_aux] Missing '@ RG' header for RG "ID:1 LB:lib1 SM:IC_UTH_00023 PL:illumina PU:unit 1"

This is the command I am running:
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.11
bash -c "vg surject -x /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/hprc-v1.1-mc-chm13.gbz
-G ./hprc-v1.1-mc-chm13.${2}.new.gaf.gz --interleaved -F /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/grch38.paths.txt -b -N ${2}
-R 'ID:1 LB:lib1 SM:${2} PL:illumina PU:unit1' > hprc-v1.1-mc-chm13.${2}_surject_hg38.bam"

I also tried the latest docker image (docker://quay.io/comparative-genomics-toolkit/cactus:v2.8.0-gpu) and the problem persists.

How do we add the @ RG line to the BAM file generated by vg surject? Is it possible for vg surject to output in CRAM format?

Thanks for the help

Kind regards,
Melissa

from vg.

jltsiren avatar jltsiren commented on June 16, 2024

If it's only a header issue, you can fix it easily with samtools. Extract the header with samtools view -H, update it as necessary, and apply the new header with samtools reheader.

from vg.

jltsiren avatar jltsiren commented on June 16, 2024

As for the actual issue, what does the header look like? These commands should include sample name and read group in the header, and the relevant code path should include them in the output, at least by superficial inspection.

from vg.

melnel000 avatar melnel000 commented on June 16, 2024

Thanks for the feedback. I also have BAM files generated with BWA. Here the header contains 1 @ HD line, followed by multiple @ SQ lines, then 1 @ RG line followed by multiple @ PG lines before the records begin:

[lines before]
\@SQ     SN:HLA-DRB1*16:02:01    LN:11005        AH:*    M5:4a972df76bd3ee2857b87bd5be5ea00a     UR:/cbio/projects/003/charlton/nextflow/samrc_wes/align/nextflow-work/d0/651e1eadc049faa2b86abc2a1a3670/Homo_sapiens_assembly38.fasta
\@RG     ID:0.0  LB:LIBA SM:NRG60.1CAN   PL:Illumina
\@PG     ID:bwa  CL:/opt/sentieon-genomics-202112.07/libexec/bwa mem -R \@RG\tID:0.0\tLB:LIBA\tSM:NRG60.1CAN\tPL:Illumina -t 32 -K 100000000 -Y Homo_sapiens_assembly38.fasta V350190975_A_L01_103_1.fq.gz V350190975_A_L01_103_2.fq.gz   PN:bwa  VN:0.7.17-r1188
[lines after]

Here is an example of the BAM header in BAM files output by vg surject which lack the @ RG line:

[lines before]
\@SQ     SN:GRCh38#0#chrY_KI270740v1_random      LN:37240        M5:69e42252aead509bf56f1ea6fda91405     UR:file:/cbio/projects/003/melissa/h3abionet_refgraph/GRCh38.fa
\@PG     ID:samtools     PN:samtools     VN:1.11 CL:/home/cactus/bin/samtools reheader /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/GRCh38.dict ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38.bam
\@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.11 CL:/home/cactus/bin/samtools sort -O BAM -o ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38_sort.bam --threads 8 ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38.bam
[lines after]

I also tried to output in CRAM format directly from vg surject by swopping the -b flag with the -c flag:

singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.11
bash -c "vg surject -x /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/hprc-v1.1-mc-chm13.gbz
-G ./hprc-v1.1-mc-chm13.${2}.new.gaf.gz --interleaved -F /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/grch38.paths.txt -c -N ${2}
-R 'ID:1 LB:lib1 SM:${2} PL:illumina PU:unit1' > hprc-v1.1-mc-chm13.${2}_surject_hg38.bam"

But then I get the following error:

INFO: Using cached SIF image
[E::cram_get_ref] Failed to populate reference for id 0
[vg::HTSWriter] error: failed to write the SAM header

Is there anything else I need to add to the command to output in CRAM format? Thanks for the suggestion to edit the BAM header- that could also be an option.

from vg.

jltsiren avatar jltsiren commented on June 16, 2024

When you use vg surject or map directly to htslib formats with options -N X -R Y, vg creates a SAM header with the following line:

@RG     ID:Y  SM:X

The header is then passed to htslib, which parses it and returns one of its internal data structures. @melnel000, you seem to be providing all tags of the RG line as the read group identifier, and its plausible that the information may not survive the transformations intact.

Another possible reason is that you are mapping initially to GAF format. GAF is a minimalistic format, and there is no standardized way of storing sample names or read groups.

As for CRAM output, CRAM is a reference-based format, and htslib needs reference fasta to output it. htslib apparently uses environment variables REF_PATH and REF_CACHE to locate them, but I'm not familiar with the specifics. Anyway, CRAM output is not included in vg tests, so it's possible that vg cannot actually output CRAM.

from vg.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.