milkschen / leviosam2 Goto Github PK

View Code? Open in Web Editor NEW

103.0 103.0 17.0 33.61 MB

Fast and accurate coordinate conversion between assemblies

License: MIT License

CMake 0.48% Dockerfile 0.10% Python 6.65% Shell 1.00% C++ 68.31% C 23.46%

leviosam2's People

Contributors

Stargazers

Watchers

Forkers

animesh jdidion sujunhao zhemingfan cbrueffer maxrossi91 ilyasoifer pythseq ultimagen maojanlin ch4rr0 gomezcan mattheww95 mystrarien yuzhenpeng xtmgah

leviosam2's Issues

Error when running full workflow using `bwa mem`

The samtools cat step fails because the committed and reailgned BAM files have different headers. I think you want to use samtools merge rather than cat.

Regarding time it takes to run leviosam2

Thank you for developing this useful tool!
I am currently trying to lift a bam file mapped to CHM13 using bwamem2; I installed leviosam2 and changed its path then downloaded leviosam2.py. I tried the bash version last night but the task still wasn't finished the next morning so I changed to the python workflow. These are my parameters:
-t 20 -i "/home/s/ASC_new/AS_02N.bam" -o "/home/s/ASC_new/AS_02N_lifted" -C "./t2t2hg38.clft" -f "/home/s/GRCH38/hg38_2.fa" -fi "/home/s/GRCH38/hg38_2.fai" --sequence_type ilmn_pe -a bwamem
However it is still taking a long time; Is there any mistakes in my code or should I just do leviosam2 lift as insctructed on the README page?
Thank you!

MAPQ of unmapped reads

Hi @milkschen et al.,

I received the following error when using picard MarkDuplicates with the BAM files I lifted using leviosam2:

ERROR::INVALID_MAPPING_QUALITY:Record 1, Read name ERR1938683.3, MAPQ should be 0 for unmapped read.

Above is an example error message. I ran leviosam2 for around 20 different BAM files and the error is the same for all of them (except that the read name and the record number are different).

I checked the the corresponding record in the BAM file. Here is what I have for that particular record:

Before leviosam2:

ERR1938683.3 99 chrI 1 60 29S121M = 37 183 CACCACACCCACACACCCACACACCACACCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCCCACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTCCCCAACCTGTCTCTCAACTTA BBB@BBBBBBBBGGGGFGGGGGEGGD222EEAECE2FGA1AEFGAE?FEGGCEAEGGGEEGDHAEEEE1GHHGGH11EE?/>EE/EEA/</BFCB2222FB??FC22GH2F/CA<C2<0?F1?1G0??///<0.<<001>1F=<G0<=0D NM:i:3 MD:Z:51A47G0G20 MC:Z:3S147M AS:i:106 XS:i:51 RG:Z:ERR1938683

After leviosam2

ERR1938683.3 77 * 0 60 29S121M * 0 0 CACCACACCCACACACCCACACACCACACCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCCCACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTCCCCAACCTGTCTCTCAACTTA BBB@BBBBBBBBGGGGFGGGGGEGGD222EEAECE2FGA1AEFGAE?FEGGCEAEGGGEEGDHAEEEE1GHHGGH11EE?/>EE/EEA/</BFCB2222FB??FC22GH2F/CA<C2<0?F1?1G0??///<0.<<001>1F=<G0<=0D MC:Z:3S147M AS:i:106 XS:i:51 RG:Z:ERR1938683 LO:Z:UM_UM

Clearly this particular read could not be lifted after leviosam2. The issue is the old mapping quality is still there (and the CIGAR string), which is why MarkDuplicates is complaining. I can easily filter out the unmapped reads from my lifted BAM files to eliminate this issue, but I thought you may want to check this in case you consider this as an unwanted behavior for your tool.

Below is the script I use for running leviosam2. The CHAIN files are downloaded from UCSC.

PREFIX=$1
CHAIN=$2
BAM=$3
FAI=$4
THREAD=$5

leviosam2 index -c "${CHAIN}" -p "${PREFIX}" -F "${FAI}"
leviosam2 lift -C "${PREFIX}".clft -a "${BAM}" -p "${PREFIX}" -O bam -t ${THREAD}

hg38 to hg19 performance

Hi!

I would like to ask for your opinion. Do you think it is better to map to hg38_T2T_masked and then use leviosam2 to convert to hg19 or is it better to just map to hg19 from the beginning? I saw the improvements mapping from chm13v2 to hg19 but I am not so sure how well is the transition from hg38 to hg19.
What is the leviosam2 behavior on non primary and secondary alignments? Do you think I should remove them from the BAM file before lifting to another assembly?

No length map is found. Please set `-F`.

Hello everyone, I am trying to create the index of the chain file but I am getting the following error: "[E::serialize_run] No length map is found. Please set -F.". The command I am using is docker run naechyun/leviosam2 leviosam2 index -c chm13v2-hg38.over.chain -p chm13v2-hg38 -F Homo_sapiens_GRCh38.p14.fasta.fai. I obtained the .fai file by using samtools faidx on my .fasta file downloaded from NCBI and previously modifying the chromosome names to chr1, chr2, etc to match the chain file. Both files are in the folder where I am running the command. I have also tried running "-F .", "-F /root/Resources/GRCh38/fasta/Homo_sapiens_GRCh38.p14.fasta.fai" and "./Homo_sapiens_GRCh38.p14.fasta.fai". Thank you in advance for your help.

Updating docker container to 0.3.0

Hi @milkschen ,

The latest version is 0.2.2 on dockerhub (see compressed file size), but did you mean to update bump it to 0.3.0? Or is 0.3.0 an early developmental version?

Integrating strobealign for pe-mapping?

Hi @milkschen,

I am interested in trying out levioSAM2 for strobealign's regarding the SNP/indel call improvements (chm13 to hg38) as per your thread and recent discussions on mapq scores on Twitter.

Strobealign is now on conda, would it be possible to integrate it? This is not urgent as we will resume the development of our aligner (hence this analysis) after summer.

Updating conda and singularity with the latest version of leviosam2

Hi @milkschen,

If its okay, could you update Conda or Singularity with the latest version of leviosam2? I'm eager to try out some of the new ONT flag. I've been running into some issues with cmake

Installation error

Hi,

I'd like to try your tool on my data, but I'm struggling to install it.
Do you have an idea on how to solve these conflicts?

Best
Francesco

conda install -c bioconda -c conda-forge leviosam2
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: - 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                                                                                                                                                                                                

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package _libgcc_mutex conflicts for:
_libgcc_mutex
libgomp -> _libgcc_mutex==0.1[build='conda_forge|main']
leviosam2 -> libgcc-ng[version='>=12'] -> _libgcc_mutex[version='*|0.1',build='main|conda_forge|main']
_openmp_mutex -> _libgcc_mutex==0.1[build='conda_forge|main']
libgcc-ng[version='>=10.3.0'] -> _libgcc_mutex[version='*|0.1',build='main|conda_forge|main']

Package _openmp_mutex conflicts for:
_openmp_mutex
leviosam2 -> libgcc-ng[version='>=12'] -> _openmp_mutex[version='>=4.5']
libgcc-ng[version='>=10.3.0'] -> _openmp_mutex[version='>=4.5']

Package libzlib conflicts for:
_openmp_mutex -> llvm-openmp[version='>=9.0.1'] -> libzlib[version='>=1.2.11,<1.3.0a0|>=1.2.13,<1.3.0a0']
leviosam2 -> htslib[version='>=1.18,<1.19.0a0'] -> libzlib[version='1.2.11|1.2.11|1.2.11|1.2.12|1.2.12|1.2.12|1.2.12|1.2.12|1.2.13|>=1.2.11,<1.3.0a0|>=1.2.12,<1.3.0a0|>=1.2.13,<1.3.0a0',build='h166bdaf_1|h166bdaf_2|h166bdaf_4|h166bdaf_3|h166bdaf_0|h166bdaf_1014|h36c2ea0_1013|h36c2ea0_1012']

Package llvm-openmp conflicts for:
libgcc-ng[version='>=10.3.0'] -> _openmp_mutex[version='>=4.5'] -> llvm-openmp[version='>=9.0.1']
_openmp_mutex -> llvm-openmp[version='>=9.0.1']

Package libgomp conflicts for:
libgomp
_openmp_mutex -> libgomp[version='>=7.3.0|>=7.5.0']
libgcc-ng[version='>=10.3.0'] -> _openmp_mutex[version='>=4.5'] -> libgomp[version='>=7.3.0|>=7.5.0']The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.17=0
  - feature:|@/linux-64::__glibc==2.17=0
  - leviosam2 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']

Your installed version is: 2.17

Note that strict channel priority may have removed packages required for satisfiability.

Possible bug in leviosam2.py

Hello,
I ran with the leviosam.py with the following command:
python leviosam2.py -t 23 -s ilmn_pe -a bwamem -i /home/user/ASC_v3/ori/AS_02N_sorted.bam -o AS_02N_lifted_new -O bam -C t2t2hg38.clft -f /home/user/GRCH38/hg38_1.fa -fi /home/user/GRCH38/hg38_1.fai --use_preset USE_PRESET

However, something went wrong, the followings are warnings showed:

[I::reconcile] Paired-end mode
File 0: AS_02N_lifted_new-paired-deferred-sorted_n.bam (source)
File 1: AS_02N_lifted_new-paired-realigned-sorted_n.bam (target)
[W::reconcile] Num REF in AS_02N_lifted_new-paired-realigned-sorted_n.bam differs with AS_02N_lifted_new-paired-deferred-sorted_n.bam. Please check.
Segmentation fault (core dumped)
[2024-02-22 17:09:15 /home/user/CHM13/leviosam2.py:743 INFO] samtools merge -f -@ 23 -o AS_02N_lifted_new-final.bam AS_02N_lifted_new-paired-committed.bam AS_02N_lifted_new-paired-deferred-reconciled-sorted.bam && samtools index AS_02N_lifted_new-final.bam
merge: invalid option -- 'o'

I guess I can manuallt fix the samtools issue but the Num REF issue seems harder to address. Could you kindly provide some insights? Thank you!

Bugs in leviosam2.sh

On line 189 you need to use =~ to do a regular expression comparison.

On line 193 it should be LR_MODE not LR_MORE and I think you also need a =~ there as well.

how to gain the chain information from two genomes

I want to use my assembled genome to liftover reference genome, how to do it? thanks ~

Phasing after levioSAM2

您好，謝謝您開發了如此具有突破性的軟體～
我們比較了單純hg002 align 到 GRCh38和 hg002 align 到 T2T-CHM13 並使用 levioSAM2 lift to GRCh38的座標系統後
分別跑phasing的結果，發現兩者產出VCF file完全相同
我們使用single-end workflow(minimap2 with Nanopore) :
bash leviosam2.sh
-a minimap2 -g 1500 -H 6000 -S -x ../configs/ont_all.yaml
-l map-ont
-i ont.bam
-o ont-lifted
-f grch38.fna
-C chm13v2-grch38.clft
-t 16
-R suppress_annotations.bed # optional
levioSAM2 將座標系統進行了完美的轉換，但接下來的phasing 結果如同開頭所述
想請我們的作法(T2T-chm13v2 lift to GRCh38 with Nanopore long read, no ultra long read)對於下游SNP-only phasing的改善結果有是否限呢 ?

Picard ValidateSamFile halts for 100 `Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped` Error after lifting from chm13v2 to hg38

Hi there,

I have been trying to adapt leviosam2 to out pipeline. This is the process I ran through:

bwa mem alignment to chm13v2
mark duplicates
sort bam & validate bam file was OK then
lift over to hg38 with leviosam2
bqsr & validate bam file throws the error and stops

These are the parameters I used to run leviosam2:

export LVS_SIF="$sw/leviosam2/leviosam2_latest.sif"
export LVS_PY="$sw/leviosam2/leviosam2/workflow/leviosam2.py"
export LVS_CMD="singularity exec ${LVS_SIF} python3 ${LVS_PY}"

${LVS_CMD} \
    -t "${n_threads}" \
    -i "${infile}" # infile is the input bam aligned to chm13v2\
    -o "${outfile_prefix}" \
    -O bam \
    -C "${clft_file}" \
    -f "${ref_fasta}" # ref_fasta is hg38.noalt.decoy.bwa/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna, used in previous pipeline as reference \
    -fi "${ref_fasta}"\
    -s ilmn_pe \
    -a bwamem \
    --use_preset True \
    --forcerun

And the error messages from ValidateSamFile:

ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 6, Read name ST-E00106:225:HHM22CCXX:7:2118:6238:47668, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 47, Read name ST-E00106:225:HHM22CCXX:7:1210:27732:7866, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 169, Read name ST-E00106:225:HHM22CCXX:7:1111:29630:48705, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 172, Read name ST-E00106:225:HHM22CCXX:7:1211:12763:57284, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 177, Read name ST-E00106:225:HHM22CCXX:7:1207:10439:33920, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 415, Read name ST-E00106:225:HHM22CCXX:7:2218:24515:6267, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MISMATCH_MATE_ALIGNMENT_START:Record 415, Read name ST-E00106:225:HHM22CCXX:7:2218:24515:6267, Mate alignment does not match alignment start of mate

... (more like this)

Maximum output of [100] errors reached.

It seems to me that no one has brought up that issue, but I did found a commit to remove MC:z tags for unmapped reads.

Is there anything I have missed?

Thanks in advance,
Yunkai

"ERROR: ': ' is not a valid token in plain flow (unquoted) scalars - While using selective re-mapping

Hello! I was wondering if you could help me to solve an issue that I have while running leviosam2. When I try to make a liftover using the leviosam2.sh file I got this error: "ERROR: ': ' is not a valid token in plain flow (unquoted) scalars". I also notice that the program autocompletes mapq, aln_score & hdist as follow:
Adding rule mapq:30
Adding rule aln_score:-10
Adding rule hdist:100
The code line I used was:
bash leviosam2.sh \ -a minimap2 -g 1000 -H 100 -S -x /mnt/Timina/cgonzaga/adiaz/PacBio_secuencias/PYM007/8.Liftover/pacbio_all.yaml \ -l map-hifi \ -i /mnt/Timina/cgonzaga/adiaz/PacBio_secuencias/PYM007/8.Liftover/PYM007.T2T.pbmm2.bam \ -o PYM007_T2TtoGRCh38_lifted \ -f /mnt/Timina/cgonzaga/adiaz/PacBio_secuencias/PYM007/8.Liftover/Homo_sapiens_GRCh38.p14.noMT.fasta \ -C /mnt/Timina/cgonzaga/adiaz/PacBio_secuencias/PYM007/8.Liftover/chm13v2-hg38.clft \ -t 20
I'm using the 0.2.2 version. I attach the log of the run in a text file
logoftherun_liftover.txt

Does not build on arm64

The issue is in your cmake file. You do STRING(FIND ${CPUINFO} "sse4_1" SSE4_1_TRUE) which sets SSE4_1_TRUE=-1 when "sse4_1" is not found. Then you do IF (SSE4_1_TRUE) and since -1 is a non-zero number it evaluates to true.

leviosam2 for nanopore?

Hi there,

Thanks for the cool software/preprint.

I'm wondering if there are good parameters for remapping nanopore reads, say error rates between 90-95% and lengths between 10 - 100kb? I see that there are workflows for HiFi reads, but I'm curious if parameters for nanopore reads have been tested. I tried the HiFi pipeline for nanopore reads and the results were not great. The default parameters for HiFI were too stringent, so I tried setting -H to 1000 and -G to 5000 but then there were issues with the remappings, especially around indels.

I see no mention to ONT reads in the preprint, so maybe this isn't a priority, but I would be highly interested in such a solution.

Thanks!

Adapting Leviosam2 to the data from Ultima Genomics

Hi @milkschen ,
My name is Ilya, I work at Ultima Genomics. You might have heard about us from Pichuan Chang.
Thank you for putting this together, we are starting to use LevioSam2 to improve our variant calling performance through aligning reads to t2t and then re-mapping them to hg38 to boost recall and precision. Overall, we are able to reproduce your results using our data (that is slightly different from Illumina's one), which is really great.

However, we do see some issues that we want to raise - fixing them will make the impact even bigger!

Ultima-specific BAM file contain additional custom tags that need to be reversed if the read is mapped to a reverse strand (we take care of this in the alignment, but leviosam does not take care of this if the read is mapped to a different strand on hg38)
Reads containing long deletion (due to specifics of UG data, all reads that span homopolymer longer than ~20 bases will contain a long deletion) are going to deferred reads
Some format inconsistencies: unmapped reads have mapping quality > 0 or CIGAR string, which drives Picard tools crazy (also raised in #33)

There might be some more issues with re-mapping our data (that contains more indel errors and are ~300bp long single end reads) that we did not encounter and would be good to hear your thoughts how to look for issues.

Anyway, we will be happy to contribute to this project that already looks very useful and, of course, would be happy if you could support us since some of the issues seem more complicated than the others.

Please, let me know how would you like to proceed - happy to chat after the holiday break!

Thanks again - happy new year
Ilya

Read 0 BED records (0 contigs)

I am getting the following

[I::add_split_rule] Adding rule `mapq:30`
[I::add_split_rule] Adding rule `aln_score:100`
[I::add_split_rule] Adding rule `hdist:5`
[I::Bed] Read 0 BED records (0 contigs) from /media/kokyriakidis/PRODUCTION/chm13v2-hg19/chm13v2-hg19-lt0.98-map_reduction_k100_e0.01.bed
[I::Bed] Read 0 BED records (0 contigs) from /media/kokyriakidis/PRODUCTION/chm13v2-hg19/chm13v2-hg19-source-unliftable-s_5000-winnowmap2-exc_1.bed
[I::lift_run] Loading levioSAM index...done
[I::load_fasta] Loading FASTA...done
[I::WriteDeferred::print_info] Alignments with any of the below features are deferred:
 - unlifted
 - MAPQ (pre-liftover) < 30
 - AS:i (pre-liftover) < 100
 - NM:i (post-liftover) > 5
 - In BED deferred (dest) /media/kokyriakidis/PRODUCTION/chm13v2-hg19/chm13v2-hg19-lt0.98-map_reduction_k100_e0.01.bed
 - Not in BED commit (source) /media/kokyriakidis/PRODUCTION/chm13v2-hg19/chm13v2-hg19-source-unliftable-s_5000-winnowmap2-exc_1.bed

the BED files seems to not be read? Is this ok? I am useing the files located in the Leviosam2-experiments.

EDIT: My BAD. I had wrong path.

how to index bam after liftover

i got the lifted_from_source.bam after liftover, but i dont obtained the index files for it.the error is showed in blow:

[E::hts_idx_push] Unsorted positions on sequence #1: 9031 followed by 5579
[E::sam_index] Read 'V350054331L4C002R03600248640' with ref_name='chrY', ref_length=47903866, flags=83, pos=5579 cannot be indexed
samtools index: failed to create index for "lifted_from_source.bam"

how to resolve it ? thanks~

small chain gaps cause unliftable alignments

Hi @milkschen :)

I am trying trying to lift a set of simulated reads derived from CHM13 from CHM13->hg38, but I am getting a curiously high amount un-liftable alignments. I don't specify any unliftable/to-defer regions, and all the source alignments are MAPQ=99.

GitHub isn't letting me upload a small example bam but here are the commands I used along with stdout.

$ leviosam2 lift -C chm13v2-grch38.clft -a data/truth_bams/example.bam -t 8 -p out.lift -O bam -S mapq:30 -m -f data/chm13v2.0.fa
[I::add_split_rule] Adding rule `mapq:30`
[I::lift_run] Loading levioSAM index...done
[I::load_fasta] Loading FASTA...done
[I::WriteDeferred::print_info] Alignments with any of the below features are deferred:
 - unlifted
 - MAPQ (pre-liftover) < 30

[I::main] Finished in 10.7 CPU seconds, or 15.3877 wall clock seconds
$ leviosam2 collate -a out.lift-committed.bam -b out.lift-deferred.bam -p out.lift-paired
Inputs:
 - BAM: out.lift-committed.bam
 - BAM (deferred): out.lift-deferred.bam

Outputs:
 - BAM (committed): out.lift-paired-committed.bam
 - BAM (deferred): out.lift-paired-deferred.bam
 - FASTQ1: out.lift-paired-deferred-R1.fq.gz
 - FASTQ2: out.lift-paired-deferred-R2.fq.gz

Num. unpaired reads1: 6
Num. unpaired reads2: 7
Num. merged unpaired: 13
[I::collate_core] Extract 13 reads from BAM

Finished in 0 CPU seconds, or 0.021676 wall clock seconds

$ samtools view -c out.lift-committed.bam 
65
$ samtools view -c out.lift-unliftable.bam 
15
$ samtools view -c out.lift-deferred.bam 
15

(the unliftable and deferred bams are equal here)

Here's an example of the results

Here, I color all the "unliftable" reads and their mates. Again, all these reads have extremely high mapq. That first BED track in the bottom panel shows lifted regions from CHM13->GRCh38. Notice that at least one mate from each colored pair overlaps that 4bp gap in the chain file.

Is a gap in the chain file supposed to cause a read to be unliftable/deferred, even if most of the read occurs outside of the gap? Is there an option in leviosam2 that lets me work around this?

Thanks,
Taher

About long read did not run with reconcile at setting

Hi,

I noticed that the provided workflow for the long read did not have the reconcile module as for the short read (workflow code). I wonder whether there is a reason for not using the reconcile module when running with the long read.

As in the recommended setting for long read (as in ../configs/pacbio_all.yaml), there is a setting likes nm_threshold: 0. I wonder whether this setting means that the tool will directly realign all alignments with an NM>0 while disregarding the lifted information.

Thanks.

Leviosam2 binary and python script questions

Hi,

Thank you for developing this useful program.

I am currently working on an SOP to hopefully incorporate leviosam2 into our pipeline suite. As I've worked through the SOP and processing of the lifting over of my HG002 T2T aligned cram to GRCh38 a few questions have arisen.

Firstly, I applied GATK4 mark duplicate (MD) to both the T2T cram and the liftover bam. From the metrics file generated by MD I noticed the following (note there are 4 libraries associated with our in-house sequencing of HG002 on Novaseq) :

T2T cram:
METRICS CLASS picard.sam.DuplicationMetrics LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE

A00266_0582_1 585705 115629635 443592 686841 168884 13197245 2524985 0.114574 561015435
A00266_0582_2 574547 112611123 435282 672881 173147 16866895 3233453 0.150166 401483145
A00266_0582_3 631460 115005353 452444 737914 177839 12356716 2540103 0.107922 606175323
A00266_0582_4 596197 111627691 443584 701687 168764 14068430 3143703 0.126448 501826973

Leviosam2 realigned bam on GRCh38:

METRICS CLASS picard.sam.DuplicationMetrics LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE

A00266_0582_1 276002 90264193 4605 1909682 66237 10555011 1960704 0.117122 423705208
A00266_0582_2 263087 88032381 4565 1859872 68118 13619508 2548337 0.154866 300868501
A00266_0582_3 275094 89707870 4823 1903425 64234 9881384 1981844 0.11034 457404529
A00266_0582_4 263004 87111049 4706 1846461 63651 11317226 2482239 0.130086 376592529
Unknown Library 2407942 95321444 0 3107400 575302 12933336 1570779 0.136969 354835070

As you may notice the additional "Unknown Library" name which I believe is associated with the realignment of the deferred/liftover reads. Now my question is the python scripts argument "--read_group", if set, will it rename the "Unknown Library" only or generate one read group name with the lose the library namings?

Second question. I've noticed the -O bam argument is hardcoded in python scripts and their is no documentation in the help for leviosam2 lift with regards to -O. Is it possible to use cram, if so, can an argument be added to the python script to be able to select either sam, bam or cram?

Thanks!

Best,
Rob E.

Alignment errors after Leviosam2?

After Leviosam2 (chm13v2 to hg19) I get the following error in the post-processing steps (Picard ReorderSam)

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR::INVALID_ALIGNMENT_START:Record 353, Read name A00371:69:HG73LDMXX:1:1161:16966:28808, Mate Alignment start (176517898) must be <= reference sequence length (171115067) on reference chr6

These are the reads from the final chm13-to-hg19 BAM:

A00371:69:HG73LDMXX:1:1161:16966:28808	65	chr1	544	60	151M	chr6	176517898	0	CCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTAC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFF,F,FFFF:FFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFF:F,FFFFFFFFFF,FFFFFFFFFFFF:,FF,:FFFFFFF::FFF::FFFFFF	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:77RF:Z:target

A00371:69:HG73LDMXX:1:1161:16966:28808	129	chr6	176517898	60	151M	chr1	544	0	GGAACACACGGTCATTCTAGGGGCCTTCCCCTGCCCTCCAGCACCCTACTGGACACACCCCCAGCGCATGGAGAAGAAACTGCATGCAGTACCTGCGGGGAACACCGTCAAGTTCCGCTGTCCAGCTGCAGGCAACCCCACGCCCACCATC	:F:FF,FFFFFFF:FFFFFFFF:FFFFFFF:FFFFFFF:::FFFFFF,FFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:20RF:Z:target

The start position of the second read is larger than the chr6 length.

These are the reads from the original mapping to chm13v2 BAM:

A00371:69:HG73LDMXX:1:1161:16966:28808	129	chr5	177634105	60	151M	chrM	1	0	GGAACACACGGTCATTCTAGGGGCCTTCCCCTGCCCTCCAGCACCCTACTGGACACACCCCCAGCGCATGGAGAAGAAACTGCATGCAGTACCTGCGGGGAACACCGTCAAGTTCCGCTGTCCAGCTGCAGGCAACCCCACGCCCACCATC	:F:FF,FFFFFFF:FFFFFFFF:FFFFFFF:FFFFFFF:::FFFFFF,FFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:151	MC:Z:35S116M	AS:i:151	XS:i:20RG:Z:SAMPLE

A00371:69:HG73LDMXX:1:1161:16966:28808	65	chrM	1	58	35S116Mchr5	177634105	0	CCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTAC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFF,F,FFFF:FFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFF:F,FFFFFFFFFF,FFFFFFFFFFFF:,FF,:FFFFFFF::FFF::FFFFFF	NM:i:0	MD:Z:116	MC:Z:151M	AS:i:116	XS:i:92RG:Z:SAMPLE	SA:Z:chrM,16535,+,35M116S,58,0;

A00371:69:HG73LDMXX:1:1161:16966:28808	2113	chrM	16535	58	35M116Hchr5	177634105	0	CCCGAACCAACCAAACCCCAAAGACACCCCCCACA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:35	MC:Z:151M	AS:i:35	XS:i:0	RG:Z:SAMPLE	SA:Z:chrM,1,+,35S116M,58,0;

Mapping to hg19 directly with dragmap gave those results:

A00371:69:HG73LDMXX:1:1161:16966:28808	65	chrM	544	60	151M	chr5	176517898	0	CCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTAC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFF,F,FFFF:FFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFF:F,FFFFFFFFFF,FFFFFFFFFFFF:,FF,:FFFFFFF::FFF::FFFFFF	RG:Z:1	AS:i:151	XS:i:111	NM:i:0	XQ:i:115

A00371:69:HG73LDMXX:1:1161:16966:28808	129	chr5	176517898	60	151M	chrM	544	0	GGAACACACGGTCATTCTAGGGGCCTTCCCCTGCCCTCCAGCACCCTACTGGACACACCCCCAGCGCATGGAGAAGAAACTGCATGCAGTACCTGCGGGGAACACCGTCAAGTTCCGCTGTCCAGCTGCAGGCAACCCCACGCCCACCATC	:F:FF,FFFFFFF:FFFFFFFF:FFFFFFF:FFFFFFF:::FFFFFF,FFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	RG:Z:1	AS:i:151	NM:i:0	XQ:i:250

So I guess the first alignment with BWA to chm13v2 was correct (aligning to chr5 and chrM). Leviosam2 somehow translated them to (chr1 and chr6???) in hg19. The alignment position on chr5 is suprisingly the same as the position on chr6 making me wonder if leviosam wanted to write chr5 and somehow wrote chr6. The same applies for chr1 and chrM (!!!)

The command I used to generate the BAM files is the following:

bwa mem -t 15 \
-R "@RG\tID:SAMPLE\tPL:illumina\tPU:SAMPLE\tSM:SAMPLE" -v 1 \
/media/kokyriakidis/PRODUCTION/reference/chm13v2/bwa/chm13v2 \
/media/kokyriakidis/PRODUCTION/input_trimmed/SAMPLE_R1.fastq.gz \
/media/kokyriakidis/PRODUCTION/input_trimmed/SAMPLE_R3.fastq.gz \
| samtools sort -@ 15 -m 4G -O bam \
-o /media/kokyriakidis/PRODUCTION/output/chm13v2/1_Mapping_BWA/SAMPLE-sort.bam -

bash /media/kokyriakidis/PRODUCTION/leviosam2/workflow/leviosam2.sh \
    -a bwamem -A 100 -q 30 -H 5 \
    -i /media/kokyriakidis/PRODUCTION/output/chm13v2/1_Mapping_BWA/SAMPLE-sort.bam \
    -o /media/kokyriakidis/PRODUCTION/output/chm13v2/2_chm13v2_to_hg19/SAMPLE_chm13v2_hg19 \
    -f /media/kokyriakidis/PRODUCTION/reference/hg19/hg19.fa \
    -b /media/kokyriakidis/PRODUCTION/reference/hg19/bwa/hg19.fa \
    -C /media/kokyriakidis/PRODUCTION/chm13v2-hg19/chm13v2-hg19.clft \
    -D /media/kokyriakidis/PRODUCTION/chm13v2-hg19/chm13v2-hg19-lt0.98-map_reduction_k100_e0.01.bed \
    -R /media/kokyriakidis/PRODUCTION/chm13v2-hg19/chm13v2-hg19-source-unliftable-s_5000-winnowmap2-exc_1.bed \
    -t 18

I also get some errors from ABRA2 that I do not understand

java.lang.ArrayIndexOutOfBoundsException: 19525

Do you have any clue? How do I solve these issues?

`lift` rules are confusing and/or broken

I am trying to run the full workflow. I end up with nearly half my reads in the unliftable BAM, and most of them do not fall within one of the un-liftable regions specified in the -R file.

When I set the -A and -q parameters to 0, then most of the un-liftable reads become liftable.

The documentation makes it sounds like -A and -q only determine whether a read goes to committed or deferred - they should not by themselves send any reads to the unliftable file. If this is not the case then this should be made more clear in the documentation, otherwise it seems there is a bug.