jonassibbesen / rpvg Goto Github PK

View Code? Open in Web Editor NEW

44.0 44.0 5.0 42.75 MB

Method for inferring path posterior probabilities and abundances from pangenome graph read alignments

License: MIT License

CMake 1.12% C++ 98.61% Dockerfile 0.27%

rpvg's People

Contributors

Stargazers

Watchers

Forkers

jeizenga schaudge yujimlong fastflair

rpvg's Issues

Software for differential expression analysis from rpvg's readcount

Overall question

Which software would you advise to perform differential expression analysis on the ReadCount produced by rpvg ?

Context

I've used the mpmap-rpvg pipeline to 1) map RNA-seq reads, 2) quantify their expression. I now have set of files with ReadCount for each genomic feature and I would want to perform differential expression analysiss.

Thoughts
From the "haplotype specific ..." paper:
"While HST expression estimates can always be marginalized to produce allele or transcript expression estimates, more general statistical frameworks will need to be developed to avoid information loss between these steps in transcriptomic pipelines".

I understand that, in the ideal scenario, it would be required to have something coming after mpmap and rpvg to specfically make use of all the information from those pantranscriptomic methods, and to build proper statistical model to perform differential expression analysis.

But, in the meantime, I was wondering if any of the existing software for differential expression analysis could be suited ? For instance, what do you think about Sleuth (Patcher lab, designed to work with kallisto's).

Crash: Assertion `best_align_score <= optimal_score' failed

1. What were you trying to do?

Infer expression from .gamp file from vg mpmap.

I have a bunch of .fastq file (replicates and different treatments), and I runned them through an array in vg mpmap. The process seemed to have worked properly (exit: 0).

I then made some test on rpvg using:

bin/rpvg -g splicepangenome.xg -p pantranscriptome.gbwt -f pantranscriptome.txt.gz -a xxx.gamp -o rpvg --inference-model haplotype-transcripts

With xxx being different samples.

2. What actually happened?

What is weird is that it worked for all but one sample (only 4 tested in total) - i.e., I wanted to try some stuff before running on whole data. Also, I have already re-run vg mpmap on that sample, but still got the same error.

Here is the error message I get:

~/work/rpvg/bin/rpvg -g splicepangenome.xg -p pantranscriptome.gbwt -f pantranscriptome.txt.gz -a OvMM1_L6.gamp -o rpvg --inference-model haplotype-transcripts
Running rpvg (commit: 0380bdb172bbe255a18ed070935fa0013dc02548)
Random number generator seed: 1653473836
Fragment length distribution parameters found in alignment (mean: 321.48, standard deviation: 137.106)
Loaded graph, GBWT and r-index (0.217732 seconds, 0.361549 GB)
rpvg: /users/lsoldini/work/rpvg/src/alignment_path_finder.cpp:472: std::vector<AlignmentSearchPath> AlignmentPathFinder<AlignmentType>::extendAlignmentSearchPath(const AlignmentSearchPath&, const vg::MultipathAlignment&) const [with AlignmentType = vg::MultipathAlignment]: Assertion `best_align_score <= optimal_score' failed.
Aborted (core dumped)

Problems on pan-transcriptome index

Hello,
Dear contributors,
I've prepared my files for pan-transcriptome building by vg autoindex using reference sequence, variations (SVs from other accessions) and gtf file generated from Stringtie. However, vg shows that only 3 file could not build index like this:

[vg autoindex] Excecuting command: vg autoindex --threads 8 --workflow mpmap --prefix test_A01 --ref-fasta A01.fa --vcf A01.vcf.gz --tx-gff A01.gtf
[IndexRegistry]: Checking for phasing in VCF(s).
[W::vcf_parse_info] INFO '.' is not defined in the header, assuming Type=String
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
GTF/GFF
Reference FASTA
VCF
are insufficient to create target index Haplotype-Transcript GBWT

There are 3 files which are consistent with examples given in vg autoindex illustration, I'm confused about the result that Inputs are insufficient to create index.
I'm looking forward for your helps.
Thanks for all contributors

Assertion `hasNodeId(node_id)' failed

Hi, I need help with an error running rpvg command.
I'm getting "paths_index.cpp:73: uint32_t PathsIndex::nodeLength(uint32_t) const: Assertion `hasNodeId(node_id)' failed" error, but I have no idea what that means.
The full command is:
rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -a mpmap.gamp -o rpvg --inference-model haplotypes-transcripts

Any idea of what the problem could come from, the .xg, .gbwt or .gamp parameter?
Thanks!

Could not install the rpvg in Linux

Dear developer:
We are trying to install your software, but we find it very difficult. I have installed it on two Linux servers separately, but I have encountered the same error.
When we print the cmake .. ,it showed all work well.
-- The C compiler identification is GNU 12.2.0
-- The CXX compiler identification is GNU 12.2.0
-- Check for working C compiler: /home/user/anaconda3/bin/x86_64-conda-linux-gnu-cc
-- Check for working C compiler: /home/user/anaconda3/bin/x86_64-conda-linux-gnu-cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /home/user/anaconda3/bin/x86_64-conda-linux-gnu-c++
-- Check for working CXX compiler: /home/user/anaconda3/bin/x86_64-conda-linux-gnu-c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Found Protobuf: /home/user/anaconda3/lib/libprotobuf.so;-lpthread (found version "3.20.1")
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1")
-- Checking for module 'htslib'
-- Found htslib, version 1.14
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/Software/rpvg/build

But when we tried to install, it showed below and l can not solve the problem well. Could you tell me what the problem is?

Protobuf will be /home/user/anaconda3/lib/libprotobuf.so;-pthread for PIC dynamic code and /home/user/anaconda3/lib/libprotobuf.so;-pthread for non-PIC static code
Using real Threads library
[ 21%] No build step for 'sdsl-lite-proj'
[ 23%] Performing install step for 'sdsl-lite-proj'
/bin/sh: 1: ./install.sh: Permission denied
make[2]: *** [CMakeFiles/sdsl-lite-proj.dir/build.make:74：sdsl-lite-proj-prefix/src/sdsl-lite-proj-stamp/sdsl-lite-proj-install] erro 126
make[1]: *** [CMakeFiles/Makefile2:255：CMakeFiles/sdsl-lite-proj.dir/all] erro 2
make[1]: ***
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/Software/rpvg/deps/libvgio
[ 25%] Performing build step for 'libvgio-proj'
[ 19%] Built target handlegraph
[ 19%] Built target link_target
[ 59%] Built target vgio_static
[ 61%] Linking CXX shared library libvgio.so
[100%] Built target vgio
[ 26%] No install step for 'libvgio-proj'
[ 28%] Completed 'libvgio-proj'
[ 28%] Built target libvgio-proj
make: *** [Makefile:95：all]

HST assignment : rpvg assigned two different homo HST instead of expected heterozygous transcripts.

Hi Developers!

I have just two haplotypes in my graph, and I assumed that there are three possible combinations of haplotypes from the rpvg results:

Hap1 Hap1 (Homo),
Hap1 and Hap2 (Hetero), and
Hap2 Hap2 (Homo) with expression values.

However, I got several isoforms with a probability of 1 from the same transcript pairing with the same haplotype, rather than different haplotypes as heterozygous. I have attached a screenshot below.

This is what I expected :

transcript8801.chr6.nnic_H1	transcript8801.chr6.nnic_R1	674	1	12.999715	3.324959	12.999917	3.649044

This is what I got with some transcripts :

transcript8801.chr6.nnic_H1	transcript8801.chr6.nnic_H1	674	1	12.999715	3.324959	12.999715	3.324959
transcript8801.chr6.nnic_R1	transcript8801.chr6.nnic_R1	638	1	12.999917	3.6490449	12.999917	3.6490449

What could be the reason for this discrepancy? If this is due to using a simple graph rather than incorporating multiple haplotypes, do you think manually replacing the values as per my expectations would be a viable solution?

Unable to run rpvg analyses without error

I was attempting to perform analyses using rpvg on the spliced graph, pantranscriptome, and alignment produced with vg however shortly after running rpvg it crashed. The minimal files and pipeline used to reproduce this error are given below, though the same error occurs when using the full sized files.

error:

rpvg: /home/rpvg/src/fragment_length_dist.cpp:23: FragmentLengthDist::FragmentLengthDist(double, double, double, uint32_t): Assertion `isValid()' failed.
Aborted (core dumped)

gff file genomic_chr8.gff

##gff-version 3
NC_050103.1	RefSeq	region	1	182411202	.	+	.	ID=NC_050103.1:1..182411202;Dbxref=taxon:4577;Name=8;chromosome=8;cultivar=B73;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_050103.1	BestRefSeq	gene	263720	266280	.	+	.	ID=gene-LOC100273199;Dbxref=GeneID:100273199;Name=LOC100273199;description=uncharacterized LOC100273199;gbkey=Gene;gene=LOC100273199;gene_biotype=protein_coding;gene_synonym=cl25415_1a,GRMZM6G040576
NC_050103.1	BestRefSeq	mRNA	263720	266280	.	+	.	ID=rna-NM_001147643.1;Parent=gene-LOC100273199;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;Name=NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	exon	263720	264201	.	+	.	ID=exon-NM_001147643.1-1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	exon	264910	265104	.	+	.	ID=exon-NM_001147643.1-2;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	exon	265204	265439	.	+	.	ID=exon-NM_001147643.1-3;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	exon	265540	266280	.	+	.	ID=exon-NM_001147643.1-4;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	CDS	264022	264201	.	+	0	ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
NC_050103.1	BestRefSeq	CDS	264910	265104	.	+	0	ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
NC_050103.1	BestRefSeq	CDS	265204	265439	.	+	0	ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1

vcf file chr8_3lines_renamed.vcf:

##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##contig=<ID=NC_050103.1>
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Mo17	P39	B97
NC_050103.1	363813	8-17988	C	G	.	PASS	DP=0;AC=2;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364616	8-18791	C	T	.	PASS	DP=0;AC=0;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0
NC_050103.1	364621	8-18796	C	G	.	PASS	DP=0;AC=4;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364622	8-18797	T	C	.	PASS	DP=0;AC=4;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364624	8-18799	A	C	.	PASS	DP=0;AC=4;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364629	8-18804	C	T	.	PASS	DP=0;AC=0;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0
NC_050103.1	364734	8-18909	C	G	.	PASS	DP=0;AC=0;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0
NC_050103.1	364782	8-18957	C	T	.	PASS	DP=0;AC=2;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364901	8-19076	G	T	.	PASS	DP=0;AC=2;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364944	8-19119	T	A	.	PASS	DP=0;AC=0;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0

fastq file SRR5911103.4reads.fastq:

@SRR5911103.1 1 length=90
CAAATTTGCATGGNTATCTGTTATTCCTTTTTANGNGTAANGNCTTNNAANANAATGTAATNNNANNNNNAAANNNNANNNNNNNNNNNN
+SRR5911103.1 1 length=90
AAAAAEEEEEEEE#EEEEEEEEEEEEEEEEEEE#E#EEEE#E#EEE##/E#/#AAEEEEEE###E#####6EA####/############
@SRR5911103.2 2 length=90
TCATATCGGTAGGTTGTGGTATTTCATTGCTACAAACATGGGTTATNGTANAATAAGACATGNNANNNNGATACNNNTCNNNNNNNNNNA
+SRR5911103.2 2 length=90
6AAAAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEE/E#EEA#AEEEEE/EEEE##E####EEEEE###EA##########/
@SRR5911103.3 3 length=90
ATTGATGCTGTGAGATGCATGTGTGTCTTTTGTTTCACGTTGCATTNCATAGGCAAGTCGAGATGNNNNGTTGGNNNTGTNCNNNANNNT
+SRR5911103.3 3 length=90
AAAAAEEAE6EEAEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEE#EAE/EEEEEEEEEEE6EE####EEEEE###EAE#A###/###E
@SRR5911103.4 4 length=90
TTGGGGTTTGGAGTAGTTCTCATATAGATCTTTATATTCAGCCCCTNTGGGAAGATACATTTCACNNNNAAATCNNNTGANTNNNANNNT
+SRR5911103.4 4 length=90
AAAAAEEEEEEEEEEEEEEEEEEEEEE6EEEEEAEEEEEEEAEEEE#EEEEEEEAEEEEEEEEAE####EEAEE###EEE#E###A###/

yaml file environment.yml for the conda environment:

name: pantranscriptome
channels:
  - conda-forge
  - bioconda
dependencies:
  - vg =1.49
  - cmake >=3.10
  - protobuf =3
  - htslib =1.17
  - jansson =2.14
  - llvm-openmp
  - bcftools =1.17
  - sra-tools =3
  - samtools =1.17

command line pipeline:

$ bgzip chr8_3lines_renamed.vcf
$ tabix chr8_3lines_renamed.vcf.gz
$ curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_902167145.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_902167145.1.zip" -H "Accept: application/zip"
$ unzip GCF_902167145.1.zip
$ mv ncbi_dataset/data/GCF_902167145.1/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna .
$ samtools faidx GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna
$ samtools faidx GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna NC_050103.1 > genomic_chr8.fa
$ samtools faidx genomic_chr8.fa
$ vg construct -p -r genomic_chr8.fa -v chr8_3lines_renamed.vcf.gz > 282_rpvg.vg
$ vg convert -p 282_rpvg.vg > 282_rpvg.pg
$ vg rna -p -s Parent -t 12 -n genomic_chr8.gff -b pantranscriptome.gbwt -i pantranscriptome.txt 282_rpvg.pg > 282_rpvg.spliced.pg
$ vg convert --xg-out 282_rpvg.spliced.pg > 282_rpvg.spliced.xg
$ vg index --threads 4 --dist-name 282_rpvg.dist 282_rpvg.spliced.xg
$ vg prune --progress --threads 4 --unfold-paths --gbwt-name pantranscriptome.gbwt  --append-mapping --mapping mapping 282_rpvg.spliced.pg > 282_rpvg.spliced.pruned.vg
$ vg index --progress --threads 4 --gcsa-out 282_rpvg.gcsa --mapping mapping 282_rpvg.spliced.pruned.vg 
$ vg mpmap --nt-type RNA --threads 4 --graph-name 282_rpvg.spliced.xg --gcsa-name 282_rpvg.gcsa --dist-name 282_rpvg.dist --fastq SRR5911103.4reads.fastq > SRR5911103.gamp
$ vg gbwt --num-threads 4 --r-index pantranscriptome.gbwt.ri pantranscriptome.gbwt
$ wget https://github.com/jonassibbesen/rpvg/releases/download/v1.0/rpvg-v1.0_linux.tar.gz
$ tar zxvf rpvg-v1.0_linux.tar.gz
$ rpvg-v1.0_linux/bin/rpvg --threads 8 --single-end --frag-mean 90 --frag-sd 0 --graph 282_rpvg.spliced.xg --paths pantranscriptome.gbwt -f pantranscriptome.txt --alignments Mo17/SRR5911103.gamp -o rpvg_SRR5911103 --inference-model haplotype-transcripts

Your help in identifying the cause of this error would be greatly appreciated!

Can't run rpvg successfully

Running rpvg (commit: 1d91a9e)
Random number generator seed: 1693234232
Fragment length distribution parameters found in alignment (mean: 247.706, standard deviation: 55.7348)
Loaded graph and GBWT (7.87753 seconds, 3.51136 GB)
Fragment length distribution parameters re-estimated from alignment paths (location: 204.212, scale: 76.2045, shape: 1.94677)
Found alignment paths (769.318 seconds, 3.51136 GB)
Clustered alignment paths (369.159 seconds, 3.51136 GB)
rpvg: /psd/container/micromamba-0.25.1/envs/VG_v1.49.0/share/rpvg/src/main.cpp:317: spp::sparse_hash_map<std::__cxx11::basic_string, PathInfo> parseHaplotypeTranscriptInfo(const string&, bool, bool): Assertion `haplotype_transcript_info_it.first->second.source_ids.emplace(haplotype_id_index_it.first->second).second' failed.

The above is the output log file of rpvg. I would like to understand where the error occurred.grateful
rpvg.log

About the compilation and installation of rpvg software

Hi Jonas A,
I'm trying to install the rpvg software on my centos7 linux server by compiling in C++, but I'm getting a series of errors like the following, which is mainly from libvgio. Is it possible to provide a pre-built binary download link? Also, can you give me some guidance on how to avoid this error message?

I am looking forward for your reply.
Best,
Du

build rpvg failed

-- The C compiler identification is GNU 12.2.1
-- The CXX compiler identification is GNU 12.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Found Protobuf: /usr/lib64/libprotobuf.so (found version "3.19.4")
-- Found PkgConfig: /usr/bin/pkg-config (found version "1.8.0")
-- Checking for module 'htslib'
-- Found htslib, version 1.13
-- Configuring done
-- Generating done
-- Build files have been written to: /home/fanghl/rpvg/build

===============================

[ 84%] Building CXX object CMakeFiles/rpvg-tests.dir/src/tests/main_test.cpp.o
In file included from /usr/include/signal.h:328,
from /home/fanghl/rpvg/deps/Catch2/single_include/catch2/catch.hpp:8034,
from /home/fanghl/rpvg/src/tests/main_test.cpp:4:
/home/fanghl/rpvg/deps/Catch2/single_include/catch2/catch.hpp:10822:58: error: call to non-‘constexpr’ function ‘long int sysconf(int)’
10822 | static constexpr std::size_t sigStackSize = 32768 >= MINSIGSTKSZ ? 32768 : MINSIGSTKSZ;
| ^~~~~~~~~~~
In file included from /usr/include/bits/sigstksz.h:24:
/usr/include/unistd.h:640:17: note: ‘long int sysconf(int)’ declared here
640 | extern long int sysconf (int __name) __THROW;
| ^~~~~~~
/home/fanghl/rpvg/deps/Catch2/single_include/catch2/catch.hpp:10881:45: error: size of array ‘altStackMem’ is not an integral constant-expression
10881 | char FatalConditionHandler::altStackMem[sigStackSize] = {};
| ^~~~~~~~~~~~
make[2]: *** [CMakeFiles/rpvg-tests.dir/build.make:76: CMakeFiles/rpvg-tests.dir/src/tests/main_test.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:256: CMakeFiles/rpvg-tests.dir/all] Error 2
make: *** [Makefile:101: all] Error 2

===============================

I checkout Catch2 to v2.13.9 and the problem solved.

Hope this is helpful.

Haplotype transcript length inconsistent with GTF

Hi, developer

I recently ran into a problem. I would like to compare the difference in expression between pan-transcriptomic (vg mpmap+RPVG) and conventional transcriptomic analysis (STAR+stingtie2) of the same transcript, however, I found that there is a large difference in expression.

When looking for the cause, it was found that the transcripts were not of the same length in the haplotype transcripts and in the reference GTF file. What is the reason for this?

Next is an example of a particular transcript.

Transcript "ENSSSCT00000055142" in txorigin.tsv file generated by vg autoindex :

Name    Length  Transcripts     Haplotypes
ENSSSCT00000055142_R1   4165    ENSSSCT00000055142      7,20D006155_20D006155#1#7,20D006157_20D006157#0#7
ENSSSCT00000055142_H1   4165    ENSSSCT00000055142      20D006132_20D006132#0#7,20D006132_20D006132#1#7,20D006133_
ENSSSCT00000055142_H2   4165    ENSSSCT00000055142      20D006145_20D006145#1#7,20D006187_20D006187#0#7,20D006719_
ENSSSCT00000055142_H3   4165    ENSSSCT00000055142      20D006184_20D006184#0#7,20D006256_20D006256#0#7,20D006256_
ENSSSCT00000055142_H4   4165    ENSSSCT00000055142      20D006233_20D006233#1#7,20D006251_20D006251#0#7,20D006253_
ENSSSCT00000055142_H5   4165    ENSSSCT00000055142      20D006301_20D006301#0#7,20D006659_20D006659#1#7,20D006662_
ENSSSCT00000055142_H6   4165    ENSSSCT00000055142      20D006357_20D006357#1#7,20D006758_20D006758#0#7,20D007732_
ENSSSCT00000055142_H7   4165    ENSSSCT00000055142      20D006505_20D006505#1#7,20D006919_20D006919#1#7,20D007213_
ENSSSCT00000055142_H8   4165    ENSSSCT00000055142      20D006575_20D006575#0#7,20D007488_20D007488#1#7,
ENSSSCT00000055142_H9   4165    ENSSSCT00000055142      20D006641_20D006641#0#7
ENSSSCT00000055142_H10  4165    ENSSSCT00000055142      20D006684_20D006684#0#7,20D007281_20D007281#1#7,20D007438_
ENSSSCT00000055142_H11  4165    ENSSSCT00000055142      20D006958_20D006958#0#7
ENSSSCT00000055142_H12  4165    ENSSSCT00000055142      20D007579_20D007579#1#7

rpvg run to get the expression of an individual ENSSSCT00000055142 transcript:

Name    ClusterID       Length  EffectiveLength ReadCount       TPM
ENSSSCT00000055142      72      4165    3913.4432       561.9941        780.99766

However, the expression of this transcript obtained by stringtie2 was far from RPVG on the same individual, and the transcript length was inconsistent.

7	StringTie	transcript	9971209	9983829	1000	+	.	gene_id "ENSSSCG00000040816"; transcript_id "ENSSSCT00000055142"; ref_gene_name "NOL7"; cov "109.753166"; FPKM "18.930826"; TPM "18.137791";

Looking forward to your answers and replies.

Thanks

gtf error

Hi Jonas A:
I have tried to construct and index spliced pangenome graph using the this gtf，but an error has occurred. I tried to solve the problem but the result was not satisfactory. So, can you give me some guidance and I would appreciate you！
I am looking forward for your reply.
Best

1       Liftoff gene    20590   30676   .       +       .       gene_id "ENSG00000272438"; gene_version "1"; gene_source "havana"; gene_biotype "lncRNA"; coverage "1.0"; sequence_ID "0.980"; extra_copy_number "0"; copy_num_ID "ENSG00000272438_0"; 
1       Liftoff transcript      20590   30676   .       +       .       gene_id "ENSG00000272438"; gene_version "1"; transcript_id "ENST00000607769"; transcript_version "1"; gene_source "havana"; gene_biotype "lncRNA"; transcript_source "havana"; transcript_biotype "lncRNA"; tag "basic"; transcript_support_level "4"; Parent "ENSG00000272438"; extra_copy_number "0"; 
1       Liftoff exon    20590   20713   .       +       .       gene_id "ENSG00000272438"; gene_version "1"; transcript_id "ENST00000607769"; transcript_version "1"; exon_number "1"; gene_source "havana"; gene_biotype "lncRNA"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00003700027"; exon_version "1"; tag "basic"; transcript_support_level "4"; Parent "ENST00000607769"; ID "exon_79172"; extra_copy_number "0"; 
1       Liftoff exon    30450   30676   .       +       .       gene_id "ENSG00000272438"; gene_version "1"; transcript_id "ENST00000607769"; transcript_version "1"; exon_number "2"; gene_source "havana"; gene_biotype "lncRNA"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00003696845"; exon_version "1"; tag "basic"; transcript_support_level "4"; Parent "ENST00000607769"; ID "exon_79173"; extra_copy_number "0";

[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Chunking FASTA(s).
[IndexRegistry]: Chunking VCF(s).
[IndexRegistry]: Chunking GTF/GFF(s).
[IndexRegistry]: Constructing spliced VG graph from FASTA and VCF input.
[transcriptome] ERROR: Exons are not in correct order according to attributes (line 5).

Errors while running rpvg

I have been trying to run the rpvg example but unfortunately keep getting errors. If I compile from source or use the docker container with singularity I get this error:

$ ../../Progs/rpvg/bin/rpvg -t 4 -g graph.xg -p pantranscriptome.gbwt -f pantranscriptome.txt.gz -a mpmap_align.gamp -o rpvg --inference-model haplotype-transcripts
Running rpvg (commit: 0380bdb172bbe255a18ed070935fa0013dc02548)
Random number generator seed: 1646476787
terminate called after throwing an instance of 'std::runtime_error'
  what():  [io::ProtobufIterator] could not parse message
Aborted

If try to use the docker container directly I get this one:

$ docker run quay.io/jsibbesen/rpvg rpvg -t 4 -g graph.xg -p pantranscriptome.gbwt -f pantranscriptome.txt.gz -a mpmap_align.gamp -o rpvg --inference-model haplotype-transcripts
Running rpvg (commit: 0380bdb172bbe255a18ed070935fa0013dc02548)
Random number generator seed: 1646476861
rpvg: /home/rpvg/src/main.cpp:512: int main(int, char**): Assertion `frag_alignments_istream.is_open()' failed.

Any help would be great.

Thanks
James

Unable to obtain GWBT

We built a pan-genome using minigraph-cactus, then converted to vg format and built index files, and everything was fine.
However, I don't have your pantranscriptom.gbwt, pantranscriptom.gbwt. ri and pantranscriptom.txt.gz in the sample file. I noticed that in your introduction, "The pantranscriptome paths should be compressed and indexed using the GBWT."
However, since I was a novice and not familiar with vg, I tried it as requested, but it was clear that there was still a problem. I implemented it successfully "Construct and index spliced pangenome graph" and "Map simulated RNA-seq reads using vg mpmap".
Best yours,

Code:
vg gbwt -x vg_rna.spliced.xg -o vg_rna.spliced.gbwt --vcf-input primates-pg.vcf.gz --num-threads 10
Result:
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 551c0d409cc2f3ef48b92e117f9f85b4a3dcbe4a at 1:475 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 11cdf7347abe3bc6f33e28762d9974805ea74508 at 1:477 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 4e18f5dfcc04b0f4cff5527391c48286331c6b03 at 1:480 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 0c9923fbf782e648cd1511bad9a8f116ee59727a at 1:483 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 6a19a010c3c8901cf9c52dfa93725cc33f736ed9 at 1:493 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 0de12e70ee148d8f93190a4f2995ca77ee8d359a at 1:509 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 851c9a881a775c4361e3a8d1d46a1702b2c86cd1 at 1:513 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 5de70cd73e7add41c65c83a32b7891e46920033e at 1:515 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 890c409bd790cb66ac2f85528a87ec6ff9b1e536 at 1:516 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 47a748b295f389bfe7052a394ceab94e85d8ce5c at 1:527 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 3198510/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 1cc6de4c3346fd58150ca3494b083634e9a2c57a at 9:27 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 6ce79784e3c8e81cff8ecf8a4cbea36dcb094452 at 9:33 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for e731c4a2cf7266bcc0a733e15495374a45e9f9b9 at 9:35 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for d69e78e7a7d7bb4c59ad3765ed503eeb515975ae at 9:39 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for ad6a0251fe8b99d93ffd593941d343bbc295e38b at 9:42 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for ff6e9609242d00346f9cfdd4a74cf48050376f40 at 9:45 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for fee027d669712b9a83e2c4cb7f23e140f6c26b68 at 9:48 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 183024f4a602aac9f9b0e3cd2170fa6f8b0d1888 at 9:51 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 1e7c415d80577aa2e5cf56a903af99c33079bc6c at 9:53 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for eea97e3450e3e2ace41b1835006ce0805d14fa59 at 9:56 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 3736445/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 2d5ebf3c355a29bfc1396fe04b0b2f4375b2890a at 6:10 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 260a460e2e81999d99d3755c89883faa55b4e224 at 6:147 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 99b990181e5492025f5622a2d756d02202836600 at 6:151 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 8c30ffac0a85efccd9a8fd401fab64def0eb1272 at 6:154 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for aa58be24542f897886dd0563ffdf641cece3da6e at 6:172 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 03c124788cf35f2eb47c05bfb1c51e3400cb5d21 at 6:187 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for e23cccd66cd3be84e222619bc1c1d790025534ec at 6:193 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 1aa92651d0fa7269556ed0b48902e7f695d7688d at 6:211 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 228c0af5369e7a1f35dc46dc93470596a1200346 at 6:250 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 51470b7acb41c982f23aab20d84f998340a4f019 at 6:277 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 3768320/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 5cc06dcbe07651acac02f1ffa866ae655a8c63ce at 8:124 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 26cbe1718a45847e8bafaeca91e1f08434ed5fec at 8:149 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 1746804253d49900d651432044ff960b7b75d0d7 at 8:151 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for f2e84cd68b31b3c456a80115c94d26f3252cbd3a at 8:697 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for ff9ba283c94b7f00d62422dd9aac691f97a44589 at 8:702 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for b75868657d54a26e4d5434cf0f4d45426cfb6515 at 8:705 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 6f62e6c6342aee0d0204637fb29c1b4e21353728 at 8:707 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for bf0d5ddfa07dc44c3a32f1a34bc22da8edd84817 at 8:714 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for cb6a4b8da090be595543e7d095bb90afffc5d36b at 8:718 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for d01eca693390f367a953c7689ad0c09ffcb30be8 at 8:722 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 2290678/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 588f8dddc1ebbd1458f34fe0b606c257304f0a2c at 14:1 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 476d7cb45ec56ebca462c452ef8d1a781ae1d28c at 14:5 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for bff7ac1a5d594a67d5985f1114108358b931570a at 14:19 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for e0ed32a990b64edabee53c259838cdb9343dba67 at 14:22 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for ead57a19db4fed5bcd4052bdf49df9566453c0e2 at 14:23 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 9498ff051a7651cf922eb19f04d5acaf77c21d68 at 14:421 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 8c50f953a821f30f9d110e3c370f52493003d6a0 at 14:428 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for b4deea24d7ad1f096313212f5db2c62a064176ce at 14:437 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for b42013329ff7c7279f1d504aaaa0b57f83c90e25 at 14:4794 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 1bb541a8f92e61280394d7a0e32efe12cc314e06 at 14:4810 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 3358433/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for ee3d56668c926850a7fee59ef244a7b08ae9f849 at 15:17 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for f843aaa7442a3560845c2ba2cca61e5d65518403 at 15:31 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 49c9946258d4fe49225ef6dcba7bec723d9ed10f at 15:34 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 53fe6f158161739032d924c47d7983ae7e71b3c1 at 15:38 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for ef32260df1c02a7168aaa2d8eec8914f50c18852 at 15:41 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 5dc8ed4644ec17321e11120048d6fc617bb42b8e at 15:43 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 808d9a7e08829019de7c96ab49eac9121b5916cc at 15:46 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 5fb333f44f0034e4a88066d63f1ca8a16f7c6f26 at 15:51 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 7b3f9bfec08e14be3ef809fabe399630f209c47f at 15:53 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 7f5d0b91d40a26639c60770cc95fb1f55efa01c3 at 15:63 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 3210329/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 9a79f3f9ea6e382cb4bf1a8cedfdf336f280a888 at 3:26 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 5e4c3eb6104a24b6589043cb53a7a9244168922b at 3:66 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 47a0950061bf75092e46316540bc37004ff46311 at 3:68 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 4285799aa36b23f390c7398147b8ed49c0cc9bc9 at 3:70 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for bf6b284589f60f653f4e5aea6542bd3b520e1623 at 3:82 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for a18eb3f563a5feca70497c02a9852d2d1e0e3fc0 at 3:117 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for a8447d3a85f1de57baf92d38d065269d0cf12544 at 3:146 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for e255d2394748ed777f85789b2a7eb53183dd0e40 at 3:171 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for dfcc5a32810104cc53e7e1cac548372bcdccaf03 at 3:184 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for dc1ca07ff857e8906f6e1a0279bd43557852699e at 3:204 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 3122888/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for cc4e7aaba463d824a0a0f17a3fe27b0c052208b9 at 4:16 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 729e8d90e50c3972d26762bc06553132e42f6bc0 at 4:60 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 08be2aae62b737742045206a9fd30bf589a319f7 at 4:61 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for ee16db5d5856aca7280ec2fb6ee5c085396f09ec at 4:111 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 4c75338fc89c199359c55a1ebe0976bc9387cabb at 4:121 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for fa117e65f307325433927bba2ff4d0214970e40e at 4:123 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 51b808a1c9ff7e730c8e0a7daabb1c68092c8b65 at 4:128 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 6f33aaf378382c39236b29ed1f2aade432cb8020 at 4:133 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for b188716644365ff1eba648b760acb8a460a4fba8 at 4:138 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for acb33d75d4e4db6656310a4a8eb93cda284f0355 at 4:191 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 3015392/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 9e7d19f0b2ab6ca1da80ff0b5c2c0bfc39f2ad9b at 13:13 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 9d6a8ecd8c0408e9a7b60c1b72705d9db43ec783 at 13:46 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 94e9be8eb64025f3196cb4689a637ec772e21184 at 13:57 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for efd4c75862fdbd672be9154801787ef48ed236ae at 13:62 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 6db3dc20a94e2beb379b94132e2dfc13fe3f5b1c at 13:74 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 37866e67a16b718130730b80e32aae7a47b38219 at 13:80 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for f9ae502bbd82741003fde35cd3f1414e5bc6fa59 at 13:96 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for bdbe762f47e220d4c39db48974fcb7293a3998f0 at 13:103 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for b7528c1ca42c559f6a05b1edea53597dd769096b at 13:117 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for cb2b5ca77dcdd0ec44da1c42a9660705cf58d158 at 13:133 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 2551281/0 variants in phasing VCF but not in graph! Do your graph and VCF match?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 132760e5fd75745fb988c6df867613cec821567f at 7:5 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for a0ddc661961331b24d0e5db7610e3bac68ba1606 at 7:7 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 6ae3d99e55cb1626a49e5693ac87847971ce41e3 at 7:32 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for c300ba005cb428d085ed01dc339c6ce2afbb53ec at 7:99 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for c1298bc42f7f3160412cb249229d33d5b12c9d5b at 7:725 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 5933fa9657ece989b16071ac1222e425a1cfa0de at 7:761 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 1c50b8e3e107ba652560f9c6b28a9a31d20661ff at 7:852 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 1e62fa762beedd86c43a323b0966df8c123d33e3 at 7:965 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for 5f2142f7b9aa069fb29e7e45fd99dee64e7a3120 at 7:1282 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] alt and ref paths for eb631a0aec70e056b21207806ea9b3cdd338aa3a at 7:1647 missing/empty! Was the variant skipped during construction?
warning: [HaplotypeIndexer::parse_vcf] suppressing further missing variant warnings
warning: [HaplotypeIndexer::parse_vcf] Found 2824500/0 variants in phasing VCF but not in graph! Do your graph and VCF match?

Error : void NestedPathAbundanceEstimator::inferPathSubsetAbundance

Hi developers!

I encountered the error below when I tried quantification using rpvg with a graph generated from minigraph-cactus, which has complex regions including some loops or multiple nodes, or loops spanning some genes as described in the vg tools pull request.

Running rpvg (commit: 301f553412a7f3b3c3dccad74e845868da4f0468)
Random number generator seed: 1701051741
Fragment length distribution parameters found in alignment (mean: 258.484, standard deviation: 61.0281)
Loaded graph, GBWT and r-index (6.69198 seconds, 2.5275 GB)
Fragment length distribution parameters re-estimated from alignment paths (location: 223.174, scale: 70.9671, shape: 1.38375)
Found alignment paths (654.778 seconds, 2.5275 GB)
Clustered alignment paths (0.519952 seconds, 2.5275 GB)
rpvg: /home/rpvg/src/path_abundance_estimator.cpp:715: void NestedPathAbundanceEstimator::inferPathSubsetAbundance(PathClusterEstimates*, const std::vector<ReadPathProbabilities>&, std::mt19937*, const spp::sparse_hash_map<std::vector<unsigned int>, double>&) const: Assertion `path_group.second.size() <= group_size' failed.

However, as far as I know, the haplotype-transcript information table will be only updated if I use the future vg tools that reflect the pull request.

So, I attempted to remove redundant haplotypes in the 4th column of the haplotype-transcript table myself and then rerun rpvg with the fixed table. But I got the error above.

Could you please provide advice on whether this error could be fixed by using the haplotype transcript table updated with future vg tools, or if it might be caused by another issue?

Thank you!

'std::bad_alloc' error in rpvg

I'm sorry to bother you.
I got the GWBT index followed by https://github.com/jonassibbesen/hprc-rnaseq-analyses-scripts/blob/main/graph_construction/minigraph_cactus_grch38_f1_clip_d9_m1k_D10M_m1k_cat_gencode38_vgrna_k32/index_gbwt.sh

When I use rpvg, I got an error like this:
GBWT::load(): Invalid header: GBWT v5: 269821 sequences of total length 18863910, alphabet size 195633865 with offset 1115
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

What's going on here, how can I solve it?

make static version?

Hi Jonas,

I was wondering if it would be possible to have a "make static" option for the program?

Louis

Too much time consumed in quantification

Dear contributers,
I've constructed the pan-transcriptome for each chromosome by the command:

vg autoindex --workflow mpmap -t 24 --prefix PT${chrom} --ref-fasta ../04generate_vcf/CRI12${chrom}.fa --vcf ../04generate_vcf/${chrom}.final.vcf.gz --tx-gff ../05transcript/chrom/${chrom}.gtf

and the fastq file divided into each chromosom were aligned to pan-transcriptome by the command:

vg mpmap -n rna -t 24 -x ../06construct/PT${chrom}.spliced.xg -g ../06construct/PT${chrom}.spliced.gcsa -d ../06construct/PT${chrom}.spliced.dist -f ../07split_fq/${sample}${chrom}_1.fastq -f ../07split_fq/${sample}${chrom}_2.fastq > ${sample}${chrom}.gamp

The quantification was implemented by the command :

../rpvg/bin/rpvg -t 4 -g PT${chrom}.spliced.xg -p PT${chrom}.haplotx.gbwt -f PT${chrom}.txorigin.tsv -a ${sample}${chrom}.gamp -o ${sample}${chrom} --inference-model haplotype-transcripts

However, for most chromosomes, the quantification could be finished in 1 or 2 minutes, while for a specific chromosome, the quantification has taken up over 5 hours. We found that the spliced graph and gbwt file has no significant difference with other chromosomes. This confused me that why this chromosome takes too much time in quantification.

(abnormal chromosome)
-rw-r--r-- 1 ls ls 11900160 08:45 PTX03.haplotx.gbwt
-rw-r--r-- 1 ls ls 17828845 08:45 PTX03.haplotx.gbwt.ri
-rw-r--r-- 1 ls ls 195644146 08:46 PTX03.spliced.xg
-rw-r--r-- 1 ls ls 504914 09:33 PTX03.txorigin.tsv
-rw-r--r-- 1 ls ls 158550365 08:29 PTX.gamp

(normal chromosome)
-rw-r--r-- 1 ls ls 11119600 08:45 PTY04.haplotx.gbwt
-rw-r--r-- 1 ls ls 15983392 08:45 PTY04.haplotx.gbwt.ri
-rw-r--r-- 1 ls ls 189164808 08:47 PTY04.spliced.xg
-rw-r--r-- 1 ls ls 291570 09:33 PTY04.txorigin.tsv
-rw-r--r-- 1 ls ls 110103782 08:29 PTY04.gamp

I'm looking forword to the reply.
Thanks
Yours sincerely Liu

Failed to read BGZF block data at offset 4088217654 expected 4336 bytes; hread returned 1914

Hello!
I've successfully gotten the .gamp file of the transcriptome file with vg mpmap and he seems to have no problem. But when I run rpvg, he has the following error:

Random number generator seed: 1679924110
Fragment length distribution parameters found in alignment (mean: 225.014, standard deviation: 224.567)
Loaded graph and GBWT (7.58161 seconds, 10.123 GB)
[E::bgzf_read_block] Failed to read BGZF block data at offset 5048343298 expected 8709 bytes; hread returned 6944
terminate called after throwing an instance of 'std::runtime_error'
what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 330847274859937 group 330837660586273

In fact, I successfully ran rpvg once, but when I used the same script but different .gamp files, he got the error above.

This is my code:
docker run --rm \ -v /home/yulong/Pan_genome/Pan_transcript/2023.3.21submit:/input \ quay.io/jsibbesen/rpvg:latest \ rpvg \ -t 10 -g /input/vg_rpvg.spliced.xg -p /input/vg_rpvg.haplotx.gbwt -f /input/vg_rpvg.txorigin.tsv -a /input/sample1.vg.gamp -o /input/sample --inference-model haplotype-transcripts
Looking forward to your useful help！
Sincerely
yulong

Haplotype probability and read count

Hi,
We've been trying out rpvg on a crop plant and it looks very promising!
I have a question regarding inferring haplotype probability when read counts are 0.
Index was built using autoindex and quantification was done in --inference-model haplotype-transcripts mode.

I have cases where both single and joint haplotype probability is 1, but read count is zero. For example:

fgrep -f key.txt 843A.mpmap.gamp.rpvg_joint.txt
Name_1  Name_2  ClusterID       HaplotypingProbability  ReadCount_1     TPM_1   ReadCount_2     TPM_2
C03p009100.1_BnaEXP_H1  C03p009100.1_BnaEXP_H1  54565   1       0       0       0       0

Since they all seemed to be _H1 (just from visual inspection of results), I thought that maybe if there is only one haplotype, it is automatically assigned probability of 1.

But when I then looked at marginal haplotype probabilities they could be either 1 or 0.

fgrep -f key2.txt ../843A.mpmap.gamp.rpvg.txt
Name    ClusterID       Length  EffectiveLength HaplotypeProbability    ReadCount       TPM
C03p009100.1_BnaEXP_H1  54565   573     407.61853       1       0       0

fgrep -f key2.txt ../845A.mpmap.gamp.rpvg.txt
Name    ClusterID       Length  EffectiveLength HaplotypeProbability    ReadCount       TPM
C03p009100.1_BnaEXP_H1  116249  573     407.63127       0       0       0

How are haplotype probabilities derived if there are no reads mapped? Could it be that there were some reads supporting assignment but they did not make it into count?

All the best,
Agnieszka

Meaning of input and output files of pantranscriptome

Hi, developer,
I have some confusion about rpvg input and output files and would like to get answers from you guys!
First, I ran vg autoindex:

vg autoindex \
--threads 32 \
--workflow mpmap \
--workflow rpvg \
--prefix chr1 \
--ref-fasta ./3.fasta/Sus_scrofa.Sscrofa11.1.chr1.dna.toplevel.fa \
--vcf .1.SNP/103ind_tran_chr1_qc.generegion.phased.vcf.gz \
--tx-gff ./4.gtf/Sus_scrofa.Sscrofa11.1.100.chr1.gtf

Then, I ran in vg mpmap and rpvg:

time vg mpmap \
--nt-type rna \
--threads 32 \
--graph-name ${autoindex}/chr1.spliced.xg \
--gcsa-name ${autoindex}/chr1.spliced.gcsa \
--dist-name ${autoindex}/chr1.spliced.dist \
--fastq ${rnapath}/${rnaseq[$SLURM_ARRAY_TASK_ID]}_1.qc.fq.gz \
--fastq ${rnapath}/${rnaseq[$SLURM_ARRAY_TASK_ID]}_2.qc.fq.gz \
> ${quanti}/${rnaseq[$SLURM_ARRAY_TASK_ID]}.mpmap.gamp


time rpvg \
-t 32 \
--graph ${autoindex}/chr1.spliced.xg \
--paths ${autoindex}/chr1.haplotx.gbwt \
--path-info ${autoindex}/chr1.txorigin.tsv \
--alignments ${quanti}/${rnaseq[$SLURM_ARRAY_TASK_ID]}.mpmap.gamp \
--inference-model haplotype-transcripts \
--output-prefix ${quanti}/${rnaseq[$SLURM_ARRAY_TASK_ID]}.rpvg

My questions:
(1) In the first column in the txorgin.tsv file, I found that each transcript has either "_R" or "_H" added after the original transcript ID. however, the R is usually only R1, and the H has a different number. May I ask what R and H denote？

ENSSSCT00000004431_R1   2699    ENSSSCT00000004431      1
ENSSSCT00000004431_H1   2699    ENSSSCT00000004431      20D006138_20D006138#0#1,20D006316_20D006316#0#1,20D006539_20D006539#0#1,20D006539_20D006539#1#1,20D006565_20D006565#1#1,20D006753_20D
ENSSSCT00000004431_H2   2699    ENSSSCT00000004431      20D006138_20D006138#1#1,20D006191_20D006191#0#1,20D006191_20D006191#1#1,20D006202_20D006202#0#1,20D006202_20D006202#1#1,20D006228_20D
ENSSSCT00000004431_H3   2699    ENSSSCT00000004431      20D006140_20D006140#0#1
ENSSSCT00000004431_H4   2699    ENSSSCT00000004431      20D006140_20D006140#1#1
ENSSSCT00000004431_H5   2699    ENSSSCT00000004431      20D006170_20D006170#0#1
ENSSSCT00000004431_H6   2699    ENSSSCT00000004431      20D006170_20D006170#1#1
ENSSSCT00000004431_H7   2699    ENSSSCT00000004431      20D006176_20D006176#0#1,20D006348_20D006348#0#1
ENSSSCT00000004431_H8   2699    ENSSSCT00000004431      20D006176_20D006176#1#1,20D006944_20D006944#0#1

(2) What is the meaning of the number between the two "#" in the "sample ID#0# chromosome number" in the "haplotype" column contained in the txorgin.tsv file generated by vg autoindex? Like 20D006140_20D006140#0#1

(3) The rpvg_joint.txt file contains the second column "." .What is the cause of this

ENSSSCT00000080639_H5   .       525     1       99.141873       41.252798       0       0
ENSSSCT00000004553_H5   .       158     0.80520778      2.6734957       0.48751255      0       0
ENSSSCT00000004553_H13  .       158     0.16104156      0.53469914      0.097502509     0       0
ENSSSCT00000004553_H8   .       158     0.0033411062    0       0       0       0
ENSSSCT00000025290_R1   .       82      1       185.37403       94.535129       0       0
ENSSSCT00000005985_H8   .       336     0.79019431      0       0       0       0
ENSSSCT00000005985_R1   .       336     0.20509095      0       0       0       0
ENSSSCT00000005985_H28  .       336     0.0047147401    0       0       0       0

Looking forward to your answers and replies.

Thanks

How to prepare the gtf?

Hello developers!
I notice that in your sample file (vgrna-project-paper/originals/data/transcripts/gencode29) that we download is gencode.v29.primary_assembly.annotation.gff3.gz and gencode.v29.primary_assembly.annotation.gtf.gz, there are exons.sh gene_transcripts.sh preprocess.sh subsample_transcripts.sh in the folder, and l do not know how to bash the subsample_transcripts.sh, and when we bash the code, there is nothing works in our folder. By the way, in what order should the four scripts be executed in?

Model for rpvg's output

Overall question

I was wondering about model choice (i.e., haplotypes, transcripts, or haplotype-transcripts).

Model choice

I have RNA-seq data for which the diplotype of each individual is known (from qPCR genotyping).

First, from my understanding, there is no way to give the diplotype of each sample as a parameter of rpvg, is that correct ? (the idea would be to kind of enforce the diplotype)

Second, would you suggest using the haplotype-transcript (since it seems more accurate due to the diplotype conditionning) or simply the transcript model (since I will not later use the diplotype inference information) ?

So far, I have used the transcript model, but I was wondering if there would be strong reason to rather use the haplotype-transcript one.