eturro / mmseq Goto Github PK

View Code? Open in Web Editor NEW

67.0 67.0 20.0 48.27 MB

Haplotype, isoform and gene level expression analysis using multi-mapping RNA-seq reads

License: GNU General Public License v2.0

Perl 0.67% Shell 3.18% Ruby 3.27% Makefile 1.02% R 3.61% C++ 85.74% C 2.52%

mmseq's People

Contributors

Stargazers

Watchers

Forkers

bioinformaticsarchive fawnshao jiach kauralasoo jeffhsu3 lingdudefeiteng verdurin olneykimberly i-claudius binlu1981 scchess atfg inambioinfo liupfskygre elvaxia junaruga jmarshall gmazzoni683

mmseq's Issues

How to use mmcollapse for diploid transcriptome?

I've run mmseq on a diploid transcriptome that includes for example ENST00000052754.10_A (paternal transcript) and ENST00000052754.10_B (maternal transcript) in the results (.mmseq file). I want to use mmcollapse to collapse transcripts but I don't want to merge a _A transcript with a _B transcript. How should I do this?

Editing reference file

Hi,

I am working with Rat genome and I have downloaded the files cdna and ncrna as directed and followed the steps of merging them together but I did not filter them for anySuper contig or haplotype using fastagrep.sh.

However, all the genome alignment and everything went fine, but I get the gene Ids/feature Id as:

ENSRNOG00000000001.5
  ENSRNOG00000000007.7
  ENSRNOG00000000008.7
  ENSRNOG00000000009.5


feature_id  log_mu  sd  mcse    iact    effective_length    true_length unique_hits ntranscripts    observed    percentiles5,25,50,75,95
ENSRNOG00000000001.5    -11.9754    10.5222 0.328819    1   1237    NA  0   1   0   1.28257e-14,1.14878e-07,0.000147833,0.00841203,0.149571
ENSRNOG00000000007.7    -4.31787    2.62437 0.0820116   1   1758.2  NA  0   4   0   9.38522e-05,0.00341925,0.0229834,0.0856848,0.341789
ENSRNOG00000000008.7    0.469715    0.361166    0.0106912   0.897298    1568    NA  8   1   1   0.883641,1.26489,1.62147,2.05065,2.75036
ENSRNOG00000000009.5    -11.5155    9.9814  0.311919    1   1182    NA  0   1   0   1.44401e-14,2.84568e-07,0.000241538,0.00899963,0.159891
ENSRNOG00000000010.5    -11.7032    9.00453 0.281392    1   2265    NA  0   1   0   1.0683e-13,1.63457e-07,0.000137861,0.00509959,0.0691979
ENSRNOG00000000012.5    0.763521    0.429914    0.0134396   1.00072 814 NA  6   1   1   1.04263,1.64183,2.23023,2.92813,4.18205
ENSRNOG00000000017.7    -12.0851    11.0173 0.344292    1   1101    NA  0   1   0   1.37857e-15,1.36881e-07,0.000152784,0.00878132,0.139119
ENSRNOG00000000021.3    1.7313  0.238328    0.00781276  1.10042 1021    NA  19  1   1   3.73811,4.85222,5.74354,6.68997,8.16492

Could you please let me know, why is this happening.

Also, I have single end data and not paired end.

Thank you

Best regards

Interpretation of a model for assessing the fold-change difference between two pairs of groups

Hi,

I'm trying to follow the example for assessing whether the log fold change between group A and group B is different (as opposed to equal) to the log fold change between group C and group D. In my case it is testing whether the fold change between treatment and control in group 2 is different from that in group 2. I.e. is log((group1.treatment/group1.control)/(group2.treatment/group2.control)) significantly different from 0.

The matrix I'm using is identical to that given in the https://raw.githubusercontent.com/eturro/mmseq/master/doc/dod2.mat. What I'm unsure of is the order of groups I need to provide to mmdiff and the interpretation of the output.

What I'm currently doing is calling mmdiff this way:
mmdiff -m matrices_file group1.treatment.rep1.mmseq group1.treatment.rep2.mmseq group1.control.rep1.mmseq group1.control.rep2.mmseq group2.treatment.rep1.mmseq group2.treatment.rep2.mmseq group2.control.rep1.mmseq group2.control.rep2.mmseq > out.mmdiff
Looking at the output eta1_1 and eta1_2 look well correlated with ln(mean(exp(group1.treatment.rep1.mmseq),exp(group1.treatment.rep2.mmseq))/mean(exp(group1.control.rep1.mmseq),exp(group1.control.rep2.mmseq))) and ln(mean(exp(group2.treatment.rep1.mmseq),exp(group2.treatment.rep2.mmseq))/mean(exp(group2.control.rep1.mmseq),exp(group2.control.rep2.mmseq))), respectively, so I guess the order in which I specified the input files to mmdiff is correct. However, when I look at the correlation between:
eta1_0 and
ln((mean(exp(group1.treatment.rep1.mmseq),exp(group1.treatment.rep2.mmseq))/mean(exp(group1.control.rep1.mmseq),exp(group1.control.rep2.mmseq)))/(mean(exp(group2.treatment.rep1.mmseq),exp(group2.treatment.rep2.mmseq))/mean(exp(group2.control.rep1.mmseq),exp(group2.control.rep2.mmseq))))

it doesn't look that strong. In addition, the scale of eta1_0 seems significantly larger than that of eta1_1/eta1_2. A few lines for example:

eta1_0 eta1_1 eta1_2
-3.34 0.33 0.35
9.2 0.36 -0.84
-3.37 0.13 0.35

Any idea?

High percentage of reads are not mapping

Hi,

This might not be the right place to put this up but I will appreciate if someone can help me.

I have single end RNA-seq data and when I align that data using bowtie1 by following command

zcat lane8_E18_3_CGATGT_L008_R1.fastq.gz | ~/Softwares/MMseq/bowtie-1.1.2/bowtie -a --best --strata -S -m 100 -X 500 --chunkmbs 256 -p 8 ~/Softwares/MMseq/mmseq-latest/Rat6ref/new/Rattus_norvegicus.Rnor_6.0.ref_transcripts - | samtools view -F 0xC -bS - | samtools sort -n - > E18_1

# reads processed: 27452186
# reads with at least one reported alignment: 4384153 (15.97%)
# reads that failed to align: 23064807 (84.02%)
# reads with alignments suppressed due to -m: 3226 (0.01%)
Reported 18182347 alignments to 1 output stream(s)
[bam_sort_core] merging from 8 files...

Only 15% of reads map, is this normal?

What shall I do, to get better mapping? Does this have to do something with parameters of bowtie or it is just my data is shit.

Thank you

gcc-10: error: no post-decrement operator for type

Hello,

I faced following error with gcc-10.0.1-0.9.fc32 on Fedora 32 when building mmseq.
mmseq version is the latest version 1.0.11.
Do you have any idea to fix?
Thanks.

+ cd src
+ make -j5 'CXXFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -fasynchronous-unwind-tables -fstack-clash-protection'
...
/usr/include/boost/serialization/collections_save_imp.hpp: In function 'void boost::serialization::stl::save_collection(Archive&, const Container&, boost::serialization::collection_size_type)':
/usr/include/boost/serialization/collections_save_imp.hpp:60:16: error: no post-decrement operator for type
   60 |     while(count-- > 0){
      |                ^~
In file included from /usr/include/boost/numeric/ublas/storage_sparse.hpp:20,
                 from /usr/include/boost/numeric/ublas/vector_sparse.hpp:16,
                 from /usr/include/boost/numeric/ublas/matrix_sparse.hpp:16,
                 from hitsio.hpp:17:
/usr/include/boost/serialization/map.hpp: In function 'void boost::serialization::load_map_collection(Archive&, Container&)':
/usr/include/boost/serialization/map.hpp:58:16: error: no post-decrement operator for type
   58 |     while(count-- > 0){
      |                ^~

mmseq STAR aligner

mmseq appears to work with STAR aligner output. Do you know of any issues that might arise from using STAR with mmseq?

Setting up matrix and identifying specific gene clusters

Hi,

I have same sample for 7 different time points and each with replicates. I would like to find out gene clusters for each time point that define that time point (gene signatures).

Here is the table of replicates and time points:

The matrix that I am using is:

# M; no. of rows = no. of observations
1 0 0 0 0 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0 0 
0 1 0 0 0 0 0
0 1 0 0 0 0 0
0 1 0 0 0 0 0
0 0 1 0 0 0 0
0 0 1 0 0 0 0
0 0 1 0 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 0 0 1 0 0
0 0 0 0 1 0 0
0 0 0 0 1 0 0
0 0 0 0 1 0 0
0 0 0 0 0 1 0
0 0 0 0 0 1 0
0 0 0 0 0 1 0
0 0 0 0 0 1 0
0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 1
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1

The command that I am using with mmdiff is :

`~/Softwares/MMseq/mmseq-latest/bin/mmdiff-linux -fixalpha -p 0 -T0_1.namesorted.bam.hits.gene.mmseq T0_2.namesorted.bam.hits.gene.mmseq T0_3.namesorted.bam.hits.gene.mmseq T0_4.namesorted.bam.hits.gene.mmseq T1_1.namesorted.bam.hits.gene.mmseq T1_2.namesorted.bam.hits.gene.mmseq T1_3.namesorted.bam.hits.gene.mmseq T2_1.namesorted.bam.hits.gene.mmseq T2_2.namesorted.bam.hits.gene.mmseq T2_3.namesorted.bam.hits.gene.mmseq T3_1.namesorted.bam.hits.gene.mmseq T3_2.namesorted.bam.hits.gene.mmseq T3_3.namesorted.bam.hits.gene.mmseq T4_1.namesorted.bam.hits.gene.mmseq T4_2.namesorted.bam.hits.gene.mmseq T4_3.namesorted.bam.hits.gene.mmseq T4_4.namesorted.bam.hits.gene.mmseq T5_1.namesort.bam.hits.gene.mmseq T5_2.namesort.bam.hits.gene.mmseq T5_3.namesort.bam.hits.gene.mmseq T5_4.namesort.bam.hits.gene.mmseq T6_1.namesort.bam.hits.gene.mmseq T6_2.namesort.bam.hits.gene.mmseq T6_3.namesort.bam.hits.gene.mmseq T6_4.namesort.bam.hits.gene.mmseq T6_5.namesort.bam.hits.gene.mmseq T6_6.namesort.bam.hits.gene.mmseq T6_7.namesort.bam.hits.gene.mmseq > out.mmdiff_p0_fixalpha

`
My first question, is everything correct till here ? if not could you please edit.

Second question: Once I have the output file from mmdiff, how to proceed to get clusters of time stage specific genes? I mean, how can I filter them out from the mmdiff output file or is it already done?

Could you please find sometime to answer these and guide me in my analysis.

Thank you.

Best regards
Abhishek

A 2^3 factorial design matrix

Hi @eturro,

I have a 2^3 factorial design (2 genotypes: WT and MUT, 2 tissues: HT and LV, and 2 time points: tp1 and tp2, each factor with dummy coding for its levels), and I'd like to estimate the genotype x tissue x time-point interaction effect.

For example in R, my data would be:

genotypes <- c("WT","MUT")
tissues <- c("HT","LV")
time.point <- c("tp1","tp2")

df <- data.frame(do.call(rbind, replicate(3, as.matrix(expand.grid(genotypes,tissues,time.point) %>% dplyr::rename(genotype=Var1,tissue=Var2,time.point=Var3)), simplify=FALSE))) %>%
  dplyr::mutate(tpm=rnorm(3*2^3))

df$genotype <- factor(df$genotype,levels=genotypes)
df$brain.region <- factor(df$brain.region,levels=brain.regions)
df$time <- factor(df$time,levels=time)

The model would be:

lm(tpm ~ genotype*tissue*time.point, data=df)
And the genotype x tissue x time-point interaction term would be the genotypeWT:tissueLV:time.pointtp2 coefficient from summary(lm(tpm ~ genotype*tissue*time.point, data=df))

Based on I thought that with MMDIFF what I need are 2 models.
The baseline model includes one regression coefficient representing the log(FC) between the tissues.
Model 1 would contrast a log(FC) between genotypes and log(FC) between tissues with log(FC) between tissues alone.
Model 2 would contrast a log(FC) between time.points and log(FC) between genotypes and log(FC) between tissues.

The samples order in the two model matrices will be:
HT,tp1,WT
HT,tp1,WT
HT,tp1,WT
HT,tp2,WT
HT,tp2,WT
HT,tp2,WT
HT,tp1,MUT
HT,tp1,MUT
HT,tp1,MUT
HT,tp2,MUT
HT,tp2,MUT
HT,tp2,MUT
LV,tp1,WT
LV,tp1,WT
LV,tp1,WT
LV,tp2,WT
LV,tp2,WT
LV,tp2,WT
LV,tp1,MUT
LV,tp1,MUT
LV,tp1,MUT
LV,tp2,MUT
LV,tp2,MUT
LV,tp2,MUT

Model 1's matrix will be:

# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 0
0 0
0 0
0 1
0 1
0 1
0 1
0 1
0 1
1 0
1 0
1 0
1 0
1 0
1 0
1 1
1 1
1 1
1 1
1 1
1 1
# P0(collapsed); no. of rows = no. of classes for model 0
0.5
-.5
# P1(collapsed); no. of rows = no. of classes for model 1
0.5 0.5
0.5 -.5
-.5 0.5
-.5 -.5

And model 2's matrix will be:

# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 1
0 1
0 1
0 2
0 2
0 2
0 3
0 3
0 3
1 4
1 4
1 4
1 5
1 5
1 5
1 6
1 6
1 6
1 7
1 7
1 7
# P0(collapsed); no. of rows = no. of classes for model 0
0.5
-.5
# P1(collapsed); no. of rows = no. of classes for model 1
0.5 0.5 0.5
0.5 -.5 0.5
0.5 0.5 -.5
0.5 -.5 -.5
-.5 0.5 0.5
-.5 -.5 0.5
-.5 0.5 -.5
-.5 -.5 -.5

I tried running model 1, where the samples order specified to MMDIFF is:
LV,tp2,MUT samples
LV,tp1,MUT samples
LV,tp2,WT samples
LV,tp1,WT samples
HT,tp2,MUT samples
HT,tp1,MUT samples
HT,tp2,WT samples
HT,tp1,WT samples

Trying this I get this error:
Error: more distinct rows of P than classes for model 1 (2 > 1).

The second model runs fine though.

Am I specifying the models correctly given the effect I'm interested in? And also, is the samples order correct?

Thanks a lot

What are the transcripts in *_mmseq.trace_gibbs.gz

There are much fewer transcripts in *_mmseq.trace_gibbs.gz than in *_mmseq.mmseq. How are the transcripts filtered?

Error: collinearity in combined matrix of intercept and covariates for model 0

Hi,

I am running mmdiff, with three groups each with 3 replicates. The program throws up this error:

The command that I am using is :

~/Softwares/MMseq/mmseq/bin/mmdiff-linux -m mat1.txt /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_1.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_2.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_3.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_1.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_2.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_3.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_1.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_2.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_3.namesorted.bam.hits.gene.mmseq > E18vsOthers.mmdiff
and this is what happens:

Min unique hits fraction for normalisation: 1
No. threads: 16
Parsing /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_1.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_2.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_3.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_1.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_2.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_3.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_1.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_2.namesorted.bam.hits.gene.mmseq /media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_3.namesorted.bam.hits.gene.mmseq 
Analysing 32637 features
Using 12943/32637 features for normalisation.
Log scale normalisation factors:
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_1.namesorted.bam.hits.gene.mmseq	-0.0922577
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_2.namesorted.bam.hits.gene.mmseq	0.157586
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/E18/E18_3.namesorted.bam.hits.gene.mmseq	-0.309913
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_1.namesorted.bam.hits.gene.mmseq	0.370167
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_2.namesorted.bam.hits.gene.mmseq	0.127327
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P1/P1_3.namesorted.bam.hits.gene.mmseq	-0.515323
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_1.namesorted.bam.hits.gene.mmseq	0.04334
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_2.namesorted.bam.hits.gene.mmseq	0.1999
	/media/data/u0106457/FASTQ/Rat/Ageing/AlignedBowtie1/P3/P3_3.namesorted.bam.hits.gene.mmseq	0.153414
Design matrix for model 0 ([1|M]):
Error: collinearity in combined matrix of intercept and covariates for model 0

Could you please let me know, how to fix this.

Thank you

Best regards
Abhishek

Compilation problem on Fedora

I'm updating the Fedora RPM to 1.0.10, and I see a compilation problem related to Boost:

+ make -j4 'CXXFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=g
eneric'
g++ -O3 -c sokal.cc
g++ -fopenmp -O3 -c uh.cpp
g++ -DVERSION=1.0.9 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=ge
neric -fopenmp -O3 hitsio.cpp hitsio.hpp -c hitsio.cpp
gcc    -c -o fasta.o fasta.c
g++ -DVERSION=1.0.9 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=ge
neric -O3 offsetGTF.cpp -o ../bin/offsetGTF
offsetGTF.cpp: In function 'int main(int, char**)':
offsetGTF.cpp:115:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
       while(gff_ind < gff_positions[chrom].size()) {
             ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
offsetGTF.cpp:117:57: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
               pos + tokens[3].size() - tokens[4].size() >= gff_positions[chrom][gff_ind]) {
offsetGTF.cpp:114:11: warning: unused variable 'curoff' [-Wunused-variable]
       int curoff=offset;
           ^~~~~~
offsetGTF.cpp:150:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
       while(gff_ind < gff_positions[prevchrom].size()) {
             ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
offsetGTF.cpp:163:53: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
           pos + tokens[3].size() - tokens[4].size() >= gff_positions[chrom][gff_ind]) {
offsetGTF.cpp:177:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     while(gff_ind < gff_positions[chrom].size() && gff_positions[chrom][gff_ind] <= pos) {
           ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
offsetGTF.cpp:235:20: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(int i=5; i < tokens.size(); i++) {
                  ~~^~~~~~~~~~~~~~~
offsetGTF.cpp:237:12: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
       if(i < tokens.size()-1) {
          ~~^~~~~~~~~~~~~~~~~
offsetGTF.cpp:108:7: warning: unused variable 'gff_pos' [-Wunused-variable]
   int gff_pos= -1;
       ^~~~~~~
g++ -DVERSION=1.0.9 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=ge
neric -O3 t2g_hits.cpp -o ../bin/t2g_hits
t2g_hits.cpp: In function 'int main(int, char**)':
t2g_hits.cpp:90:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         for(int i=2; i < tokens.size(); i++) {
                      ~~^~~~~~~~~~~~~~~
In file included from /usr/include/boost/numeric/ublas/storage_sparse.hpp:23:0,
                 from /usr/include/boost/numeric/ublas/vector_sparse.hpp:16,
                 from /usr/include/boost/numeric/ublas/matrix_sparse.hpp:16,
                 from uh.hh:13,
                 from uh.cpp:1:
/usr/include/boost/numeric/ublas/storage.hpp: In member function 'void boost::numeric::ublas::unbounded_array<T, ALLOC>::serialize(Archive&, unsigned int)':
/usr/include/boost/numeric/ublas/storage.hpp:299:33: error: 'make_array' is not a member of 'boost::serialization'
             ar & serialization::make_array(data_, s);
                                 ^~~~~~~~~~
/usr/include/boost/numeric/ublas/storage.hpp:299:33: note: suggested alternative: 'make_nvp'
             ar & serialization::make_array(data_, s);
                                 ^~~~~~~~~~
                                 make_nvp
/usr/include/boost/numeric/ublas/storage.hpp: In member function 'void boost::numeric::ublas::bounded_array<T, N, ALLOC>::serialize(Archive&, unsigned int)':
/usr/include/boost/numeric/ublas/storage.hpp:494:33: error: 'make_array' is not a member of 'boost::serialization'
             ar & serialization::make_array(data_, s);
                                 ^~~~~~~~~~
/usr/include/boost/numeric/ublas/storage.hpp:494:33: note: suggested alternative: 'make_nvp'
             ar & serialization::make_array(data_, s);
                                 ^~~~~~~~~~
                                 make_nvp
In file included from /usr/include/boost/numeric/ublas/vector_sparse.hpp:16:0,
                 from /usr/include/boost/numeric/ublas/matrix_sparse.hpp:16,
                 from uh.hh:13,
                 from uh.cpp:1:
/usr/include/boost/numeric/ublas/storage_sparse.hpp: In member function 'void boost::numeric::ublas::map_array<I, T, ALLOC>::serialize(Archive&, unsigned int)':
/usr/include/boost/numeric/ublas/storage_sparse.hpp:515:33: error: 'make_array' is not a member of 'boost::serialization'
             ar & serialization::make_array(data_, s);
                                 ^~~~~~~~~~
/usr/include/boost/numeric/ublas/storage_sparse.hpp:515:33: note: suggested alternative: 'make_nvp'
             ar & serialization::make_array(data_, s);
                                 ^~~~~~~~~~
                                 make_nvp
In file included from /usr/include/boost/numeric/ublas/matrix_sparse.hpp:20:0,
                 from uh.hh:13,
                 from uh.cpp:1:
/usr/include/boost/numeric/ublas/matrix.hpp: In member function 'void boost::numeric::ublas::c_matrix<T, M, N>::serialize(Archive&, unsigned int)':
/usr/include/boost/numeric/ublas/matrix.hpp:5977:33: error: 'make_array' is not a member of 'boost::serialization'
             ar & serialization::make_array(data_, N);
                                 ^~~~~~~~~~
/usr/include/boost/numeric/ublas/matrix.hpp:5977:33: note: suggested alternative: 'make_nvp'
             ar & serialization::make_array(data_, N);
                                 ^~~~~~~~~~
                                 make_nvp

This is with Boost 1.64.0.

How to run haploref.rb

Could someone explain how to prepare the inputs ("pos_file" and "hap_file") to run haploref.rb?
The manual just says:

pos_file: file containing, for each transcript, chromosome and positions of SNPs.
hap_file: file containing, for each transcript, two versions, e.g. suffixed _A and _B, on separate lines with the alleles for the two haplotypes at each position listed in the pos_file (A and B respectively).

Should I used genomic coordinates of the SNPs? Suppose I have the following transcript and SNPs:

ENST00000450305.2 (chr1 12010-13670)
chr1 12020 A T (in exon1; phased genotype is 0|1)
chr1 13300 G C (in exon5; phased genotype is 0|1)

Is the following correct?

#Pos_file:
ENST00000450305.2
chr1 12020
chr1 13300

#Hap_file
ENST00000450305.2_A
A
G
ENST00000450305.2_B
T
C

Thanks.

haploref.rb

It's not clear to me on when the pos_file and the hap_file were supposed to made or how to make them to run haploref.rb

Usage: haploref.rb [-m tg_regexp t_ind g_ind] cdna_file gff_file pos_file hap_file > hapiso_fasta
Generate haplotype and isoform specific FASTA reference.

Mandatory arguments:
cdna_file: default transcript FASTA file. I have this
gff_file: GFF file containing structure annotation for transcripts in cdna_file. I have this
pos_file: file containing, for each transcript, chromosome and positions of SNPs. How is this supposed to be set up? How can a make a pos_file?
hap_file: file containing, for each transcript, two versions, e.g. suffixed _A and _B, on separate lines with the alleles for the two haplotypes at each position listed in the pos_file (A and B respectively). How is this supposed to be set up? How can a make a pos_file?

Could you also show me a couple of sample lines for this files: pos_file hap_file. I'm really confused on how they are supposed to be set up and how to make them in the first place.

Thank you in advice for helping me resolve this.

Suggestion: adding a simple test with small sized data

Dear Ernest,

You merged my pull-request to run tests in Travis CI. Thank you.

How do you think about adding a very simple test for bam2hits and mmseq commands with very small sized data and less dependency bio tools?

I can see the tutorial to test it here.
https://github.com/eturro/mmseq#estimating-expression-levels

And what I would like to know is how to test with very small sized data, if possible, not using or less using other dependent bio tools like bowtie, kallisto, trim_galore or samtools commands.

It's like trim_galore's case here. https://github.com/FelixKrueger/TrimGalore/blob/master/.travis.yml#L23-L26 . You see that trim_galore command is tested with very small sized data in .travis.yml (Travis CI).

Because I would like to see it for this repository. And I started to manage mmseq RPM in Fedora project recently. And I like to add a simple test to verify the mmseq command in the process to create RPM file regularly.
https://src.fedoraproject.org/rpms/mmseq/blob/master/f/mmseq.spec#_45

Thank you.

Gene ID to Gene name

Hi,

Not completely related to MMSeq but you may know the way out of this situation.

I am using GRCh37.70 genome build of ensembl and I get gene IDs for log_mu's. Now, Iwould like to convert these gene Ids to gene name. I have tried biomart but it is taking me to ensembl genes 86 (GRCh37.p13). Is this correct?

Thank you

Latest release does not contain pre-compiled binaries

Hi Ernest,

The latest release https://github.com/eturro/mmseq/releases/tag/latest and https://github.com/eturro/mmseq/releases/tag/1.0.11 do not contain the pre-compiled binaries. Is it possible to add that, highly appreciated.

Which output contains logFPKM of each transcript?

Hi,

I've run MMSeq and got the outputs. What I need is the logFPKM of each transcript for ASE quantification. Could anyone tell me which file contains this data? I thought "log_mu" was what I was looking for but someone told me they were using "the actual Gibbs posterior samples trace file. In that file you have the posterior samples (log FPMK) of each transcript you quantified"

Thanks,

How to apply "--best --strata" for paired-end reads?

As the Bowtie manual says, "--best --strata" does not apply for paired-end reads. What's the best solution to this? I really need to get only the best stratum.

haploref.rb error

Anyone knows how to fix this error when running haploref.rb?

$ haploref.rb gencode.v29.transcripts.fa gencode.v29.annotation.gtf pos_99822 hap_99822
Reading FASTA file.../nfs/med-bfx-activeprojects/weishwu/common/mmseq-latest/bin/haploref.rb:79:in `block in <main>': undefined method `[]' for nil:NilClass (NoMethodError)
        from /nfs/med-bfx-activeprojects/weishwu/common/mmseq-latest/bin/haploref.rb:75:in `open'
        from /nfs/med-bfx-activeprojects/weishwu/common/mmseq-latest/bin/haploref.rb:75:in `<main>'

mmcollapse-linux error: arma::memory::acquire(): out of memory

Hello
I try to learn this paper (http://dx.doi.org/10.7554/eLife.07860.001), to find imprinting gene.

I use mmcollapse-linux combining transcripts, but

mmcollapse-linux tpm_MR1000Aligned tpm_MR1312Aligned
Stopping threshold: -97.5th percentile of the distribution of correlations
tpm_MR1000Aligned SD thres=1.33026
tpm_MR1312Aligned SD thres=1.31731
73716 transcripts or sets of identical transcripts are unidentifiable in all samples.

error: arma::memory::acquire(): out of memory

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted

what's wrong with it?

Compilation failures with Fedora 24

mmseq currently fails to build for the upcoming Fedora 24 release, which is using GCC 6.1.1 and Boost 1.60.0.

The error that breaks the build is:

/tmp/cclQ5X4W.o: In function `main':
/builddir/build/BUILD/mmseq-1.0.8a/src/bam2hits.cpp:671: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
/tmp/cclQ5X4W.o: In function `char const* boost::re_detail_106000::re_is_set_member<char const*, char, boost::regex_traits<char, boost::cpp_regex_traits<char> >, unsigned int>(char const*, char const*, boost::re_detail_106000::re_set_long<unsigned int> const*, boost::re_detail_106000::regex_data<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > > const&, bool)':
bam2hits.cpp:(.text._ZN5boost16re_detail_10600016re_is_set_memberIPKccNS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEjEET_S8_S8_PKNS0_11re_set_longIT2_EERKNS0_10regex_dataIT0_T1_EEb[_ZN5boost16re_detail_10600016re_is_set_memberIPKccNS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEjEET_S8_S8_PKNS0_11re_set_longIT2_EERKNS0_10regex_dataIT0_T1_EEb]+0x156): undefined reference to `boost::re_detail_106000::cpp_regex_traits_implementation<char>::transform_primary[abi:cxx11](char const*, char const*) const'
bam2hits.cpp:(.text._ZN5boost16re_detail_10600016re_is_set_memberIPKccNS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEjEET_S8_S8_PKNS0_11re_set_longIT2_EERKNS0_10regex_dataIT0_T1_EEb[_ZN5boost16re_detail_10600016re_is_set_memberIPKccNS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEjEET_S8_S8_PKNS0_11re_set_longIT2_EERKNS0_10regex_dataIT0_T1_EEb]+0x2fd): undefined reference to `boost::re_detail_106000::cpp_regex_traits_implementation<char>::transform[abi:cxx11](char const*, char const*) const'
collect2: error: ld returned 1 exit status
Makefile:54: recipe for target 'bam2hits' failed
make: *** [bam2hits] Error 1

while there are lots of warnings like these:

g++ -DVERSION=1.0.8 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generi
c -fopenmp -O2 bms.o mmdiff.cpp -o ../bin/mmdiff -lgsl -lgslcblas -lboost_regex -lbam -lboost_iostreams -llapack -lblas -lgfortran -Wl,--as-needed -lz -larmadillo
mmdiff.cpp: In function 'void parse_mmseq(std::vector<std::__cxx11::basic_string<char> >, std::vector<std::__cxx11::basic_string<char> >&, arma::Mat<double>&, arma::Mat<double>&, arma::Mat<double>&, int, int, double, bool)':
mmdiff.cpp:99:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for(int i=0; i < filenames.size(); i++) {
                ~~^~~~~~~~~~~~~~~~~~
mmdiff.cpp:110:20: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(int j=0; j < tokens.size(); j++) {
                  ~~^~~~~~~~~~~~~~~
mmdiff.cpp:137:16: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
       if(k + 1 > features.size()) {
          ~~~~~~^~~~~~~~~~~~~~~~~
mmdiff.cpp:146:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
       if(k+1 > y.n_rows) {
          ~~~~^~~~~~~~~~
mmdiff.cpp:151:20: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
       if(!useprops && tokens[lmg_ind]=="NA" || useprops && tokens[mp_ind]=="NA") {
          ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
mmdiff.cpp:172:62: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   if(range_start >=0 && range_end > range_start && range_end < y.n_rows) {
                                                    ~~~~~~~~~~^~~~~~~~~~
mmdiff.cpp:93:46: warning: variable 'mcse_ind' set but not used [-Wunused-but-set-variable]
   int feature_ind=-1, lmg_ind=-1, sd_ind=-1, mcse_ind=-1, iact_ind=-1, mp_ind=-1, sp_ind=-1, uh_ind=-1;
                                              ^~~~~~~~

Is it advised to remove transcripts with zero unique_hits?

Hi,

I'm using MMSeq to quantify haplotype-specific expression. I wonder if it is advised to filter out the transcripts/genes that:
(1) have zero "unique_hits" in the *.mmseq file
and
(2) have identical haplotype sequences as indicated in *.identical.mmseq file

Thanks,

A model matrix for estimating the effect of a covariate on treatment vs. control fold change between two groups

Hi @eturro,

I'm wondering what would be the model design matrix and interpretation for estimating tissue effect (2 different tissues) on the treatment vs. control data from 2 groups - assuming I have 3 samples from each tissue,group, and treatment (altogether 24)

Quantified expression does not go along with read counts

Hi,

I've run bowtie, bam2hits and MMSeq to quantify allele specific expression, and observed some discrepancy between the alignment counts and the final log_mu.

One problematic transcript is ENST00000381392.5 which is an IGF2 transcript. I have ENST00000381392.5_A and ENST00000381392.5_B in the reference representing paternal and maternal transcripts respectively.

In the Bowtie alignment, I counted the alignment counts in them and these are the results:
ENST00000381392.5_A: 725,664
ENST00000381392.5_B: 721,570

ENST00000381392.5_A has more counts than ENST00000381392.5_B, so the expression of the former should be higher than the latter. However the log_mu from MMSeq is:
ENST00000381392.5_A: -4.56312
ENST00000381392.5_B: 4.48249

How could this happen?

Thanks.

No *.mmseq output

Hi,

I've run MMSeq on my samples. For only a few of them MMSeq output the .mmseq files. But for most of them these outputs (*.mmseq, *.genes.mmseq, and *.identical.mmseq) are missing while the *_gibbs.gz, *.k and *.M files are there. There is no error message and the end of the log is as below:

Found 116641 transcripts in 51528 transcript combinations.
Transposing M (this may take a while)...done (83 seconds).
Counting unique hits to sets of identical transcripts...done.
Counting unique hits to genes...done.
EM iteration 513, log likelihood ratio: 0.09979
Gibbs iteration 16383

What's the problem? How can I get the *.mmseq files?

Thanks,

how to supply regex for bam2hits with single sequence name?

Hello,

All of my sequence headers share a single word format such as GENE_strain. I added a gene and transcript identifier originally, but my read mapping failed when using '--fullref' argument to bowtie1 using tophat (you can add all of the arguments at around line 2349 in the tophat executable). I had to go that route since using bowtie as provided in the manual was not producing a proper SAM alignment format, despite correct flags. Anyway, all my gene and transcript identifiers are 1:1 anyway, but I cannot figure out the regex pattern to make bam2hits assign the sequence header to both identifiers.

I've gotten this far but I can't get past a segfault.
bam2hits -m "(\S+)" 1 1

Thanks.

Refining the table of polytomous result

Hi,

I have applied the polytomous analysis on my dataset consisting of 7 groups and each group with a replicate. The results are in a tabular form for all the genes.

I would like to know on the basis of which column shall I sort or filter to get the signature genes for each group.

Thank you.
Best regards
Abhishek

Build error for Fedora 23

I'm the Fedora maintainer for mmseq.

We're in the middle of a mass rebuild of Fedora packages in advance of release 23, and for mmseq this produced an error:

In file included from /usr/include/armadillo:90:0,
from mmcollapse.cpp:28:
/usr/include/armadillo_bits/typedef_elem_check.hpp:18:5: error: size of array 'ERROR___TYPE_SIZE_T_IS_SMALLER_THAN_UWORD' is negative
arma_static_check( (sizeof(size_t) < sizeof(uword)), ERROR___TYPE_SIZE_T_IS_SMALLER_THAN_UWORD );
^

It's quite possible this is a bug in the packaging, but I thought I'd report it here just in case.

The armadillo package in Fedora 23 is currently 4.650.2.

Which version of samtools are you using?

Dear maintainers (CC: @verdurin)

I see sam.h in samtools project is used in following file.

https://github.com/eturro/mmseq/blob/master/src/bam2hits.cpp#L18

#include "sam.h"

But which version is samtools are you using actually?

On Fedora project, we are trying to update samtools RPM from version "0.1.19" to the latest version "1.9" [1] (big change), removing samtools-devel RPM package including the header files, because samtools upstream project does not provide the so file any more [2]

It seems this project is using libhts.
https://github.com/eturro/mmseq/blob/master/src/Makefile#L15

Is it htslib right?
https://github.com/samtools/htslib

mmseq error

Hi Ernest,

I have an error message with the mmseq command and I don't understand why.
I created a reference including two copies of transcripts for Black6 and JF1strains (to detect parental imbalance):

>NM_153523_Black6 gene:Tcstv3|236219_Black6
........................
>NM_153523_JF1 gene:Tcstv3|236219_JF1
........................

I aligned reads with Bowtie 1 and I sorted BAM by read name.

nohup /data/software/bowtie-1.1.2/bowtie -a --best --strata -S -m 100 --chunkmbs 256 -p 40 /data/MMseq/reference/transcrits_black6_JF1.uniq \
Bj1_21_TB.PF.fastq >Bj1_21_TB.PF.sam &
samtools view -bS Bj1_21_TB.PF.sam | samtools sort -n - Bj1_21_TB.PF.namesorted

I mapped reads to transcript sets:

bam2hits /data/MMseq/reference/transcrits_black6_JF1.fa Bj1_21_TB.PF.namesorted.bam > Bj1_21_TB.PF.hits
Parsing FASTA file...done (60548 transcripts).
Getting read length from BAM...50bp.
Reads are single-end.
Expected insert size: 180bp (provided by user)
Standard deviation: 25bp (provided by user)
Adjusting transcript lengths for insert size distribution...done.
Parsing BAM file...done.
Alignments removed by mismatch stratum filter: 5589063.
Alignments removed by repetitive transcript filter: 900807.
Alignments removed by isize deviation filter: 0.
Alignments retained: 111090628.
Reads retained: 41730531.
Now run `mmseq`.

But I have an error when I try to obtain expression estimates:


mmseq Bj1_21_TB.PF.hits Bj1_21_TB.PF
Running mmseq with parameters:
  alpha:         0.1
  beta:          0.1
  max_em_iter:   1000
  epsilon:       0.1
  gibbs_iter:    16384
  gibbs_ss:      16
  seed[0]:       1234
  debug:         0
  threads:       48
Error: transcripts must be nested within genes in GeneIsoforms metadata.

Ernest, could you help me to resolve this error?

Best regards,

Emeric

Matrices for time series-two conditions

Hi Ernest,

Could you please tell me which matrices to use for the following RNA-seq experimental design and model comparison?:

I have two parallel time-series, one for Treated individuals and one for Control individuals (two replicates each):

Time	Control	Treated
t0	2 replicates	2 rep
t1	2 rep	2 rep
t2	2 rep	2 rep

My goal is to identify those transcripts (or genes) whose linear expression trend over time is different in the treated and the untreated individuals. To put it in another way:

    model 1          model 0
expression       expression
    |*               |
    |    *   +       |        *+         * treated sample
    |    +   *       |    *+             + control sample
    |+               |*+    
    |__________time  |__________ time
     t0  t1  t2       t0  t1  t2
    6 classes?       3 classes?

In model 1 the slope and/or intersection of the treated and control trends would be different, while in model 0 they would be the same.

By now I have two alternative ideas on how to build the corresponding M and P matrices:

A) M would contain a single class (12 zeroes), and P would have two columns, each one of them depicting the 3 and 6 classes of models 0 and 1, respectively.
B) The P matrix will only depict the 3 classes corresponding to time, while the treated/control conditions would be represented as two classes in the M matrix.

I am not sure about the content of the P0 and P1 matrices for any of the two cases

I have found it difficult to understand the principles underlying the construction of this matrices, but I am willing to consult any material that you might suggest.

Aurora

Compilation failure with GCC 6

There is a build failure on Fedora Rawhide, which is using GCC 6.

Here is the error:

hitsio.cpp: In member function 'bool HitsfileReader::readReadMapRecordTranscriptIDSchema0(std::__cxx11::string&)':
hitsio.cpp:346:35: error: cannot convert 'std::basic_istream<char>' to 'bool' in return
  return(getline(ifs, transcriptID));
                                   ^

The person who first spotted this suggested the following fix:

return(static_cast<bool>(getline(ifs, transcriptID)));

Aggregate transcripts into groups by UTR

Hi Ernest,

I'm trying to aggregate transcripts by some features with mmseq. How to aggregate transcripts by common 5' UTR according to the "Other aggregations (e.g. of transcripts sharing the same UTRs) are also possible." https://github.com/eturro/mmseq

Best regards,
Elena

How to calculate ASE significance from MMSeq outputs

I've run alignment against a diploid transcriptome and used MMSeq to quantify expression on each of the two haplotypes for each transcript. How to calculate a p-value for allele specific expression (ASE)?

gene (tab) transcript

Request: would it be possible to support an alternative method allowing one to simply provide a gene-to-transcript mapping file (ie. tab-delimited like so) to bam2hits

gene (tab) transcript

Flipped ASE because of mapping issues

Hi Ernest,

@eturro Sorry to repost this question. You replied to my other post and closed it. I replied there with more questions but the post was gone. I'm not sure if you are still able to see that post, so I'm reposting here:

I asked in that post:
As the Bowtie manual says, "--best --strata" does not apply for paired-end reads. What's the best solution to this? I really need to get only the best stratum.

You replied "This is already done internally by bam2hits, don't worry. Using --best --strata just helps get rid of some of the sub-optimal alignments beforehand."

And below is my follow-up questions:

But I'm seeing some obviously wrong results which was caused by this issue. For example, I have a transcript that contains only a single SNP. When I quantified allele-specific expression at this SNP, it clearly showed strong paternal-biased expression (more than 95% reads support paternal allele), however the results from MMSeq showed strong maternal-biased expression. After some detective work, I figured it out that the major (not 100%) problem is that bowtie could not distinguish between the paternal version and the maternal version of this transcript because it is only a single nucleotide mismatch (because the default allows 2 mismatches) so it reported alignment to both in equal amount. I tried redoing the alignment by requiring zero mismatch then the results showed correct paternal-biased expression. For other transcripts, this does not necessarily cause flipped direction but can reduce the amplitude of allele-specific expression. Only after I required zero mismatch the results started to make sense. But rather than requiring zero mismatch, the ideal way is to let bowtie report only the best stratum.

Let me show you an example:

Below is mmseq results for the transcripts of IGF2. IGF2 is a maternally-imprinted gene in my samples. Column 4 and 5 is log_mu of paternal and maternal versions respectively and column 5 is column4-column5 (log ratio):
ENST00000381392.5 ENSG00000167244.21 IGF2 1.56433 -2.22791 3.79224
ENST00000381395.5 ENSG00000167244.21 IGF2 9.766810000000001 4.06606 5.700750000000001
ENST00000381406.8 ENSG00000167244.21 IGF2 8.01555 -6.4431 14.458649999999999
ENST00000416167.7 ENSG00000167244.21 IGF2 7.79913 -4.31979 12.11892
ENST00000434045.6 ENSG00000167244.21 IGF2 -7.343839999999999 4.61276 -11.956599999999998

As you can see, while all other transcripts are maternally imprinted, ENST00000434045.6 is paternally imprinted. And the first two transcripts show lower degree of imprinting than the next two. When I looked at SNP level ASE, all the SNPs in these transcripts have strong maternal imprinting.

I tried rerunning it by requiring zero mismatch in bowtie mapping, and got these results:
ENST00000381392.5 ENSG00000167244.21 IGF2 7.73735 -4.58045 12.3178
ENST00000381395.5 ENSG00000167244.21 IGF2 9.3704 2.71391 6.65649
ENST00000381406.8 ENSG00000167244.21 IGF2 7.44733 -5.74605 13.193380000000001
ENST00000416167.7 ENSG00000167244.21 IGF2 7.612660000000001 -2.5921 10.20476
ENST00000434045.6 ENSG00000167244.21 IGF2 4.867319999999999 -9.99843 14.86575

As you can see, now all transcripts show strong maternal imprinting, consistent with the SNP level results. Well, I'd expect ENST00000381395.5 to be similar with others, but at least there is no flipping any more.

The diploid transcriptome was created using the phased SNPs, so the transcript level ASE should be consistent with the SNP level ASE when all SNPs in this gene are strongly maternally imprinted.

Wrong tag 10.0.10a?

Seeing history of tags https://github.com/eturro/mmseq/tags here,

Tag 1.0.10 released at Fed 16 2017.
Tag 10.0.10a released at Feb 25 2019.

10.0.10a is wrong. 1.0.10a is correct, right?
If it is wrong, could you modify the tag information?

Thank you.

Making sample-specific transcript FASTAs through genotyping and phasing:

Hello,
I'd like to make a custom transcriptome but this section appears to be missing on github.

Matrices for time series-one condition

Dear developer,
I have a RNA-seq experimental design consisting of two individuals (biological replicates), who were sampled at 24, 48, 72, 96 and 120 hours after receiving a treatment. Could you be so kind to give me some advice on how to write the corresponding matrices file?

designed a matrix for identifying genes that are differential expression over all groups

Hi,
I have read a paper called "Transcriptional diversity during lineage commitment of human blood progenitors" recently. In the paper, author used mmdiff to find cell-specificed genes and transcripts. And I want to reproduce the results of the paper.
Now, I have three groups : CMP, GMP and MEP, and have 2 reps in each. And the decription of the models that paper used was "The simplest model assumes that the mean expression level is the same across cell types. The most complex model assumes that the mean expression level is different for each cell type. The remaining three models assume that two of the three cell types have the same mean expression level." I have creatd three matrics according your answer in "Setting up matrix and identifying specific gene clusters". But I have no idea about how to design "the most complex model". Following are the matrics that I have used:
mat1.txt:
# M; no. of rows = no. of observations
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 1
0 1
0 1
0 1
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

mat2.txt:
# M; no. of rows = no. of observations
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 1
0 1
0 0
0 0
0 1
0 1
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

mat3.txt:
# M; no. of rows = no. of observations
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 1
0 1
0 1
0 1
0 0
0 0
# P0(collapsed); no. of rows = no. of classes for model 0
1

# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

I don't konw if I explained it clearly. If there was anything that I didn't say correctly, please let me know.
Thank you.

Crash on large metatranscriptomics database

Hi,

I am trying to run mmseq-linux v1.0.9 using a large metatranscriptomics dataset as reference (>4 million ORFs). The run stops after saying “Amalgamating transcripts and calculating summary statistics...”, due to memory limits (SIGSEGV: 11) when I used the maximum RAM available to me (3TB).
Before the crash the following output files are produced along the way:
sample.k,
sample.M,
sample.gene.trace_gibbs.gz,
sample.identical.trace_gibbs.gz,
sample.prop.trace_gibbs.gz,
sample.trace_gibbs.gz.

I’ve run some tests on a E.coli dataset and observed that the memory consumption jumps during the “Amalgamation of transcripts and calculation of summary statistics” requiring several times the one used during the rest of the computation. Is there a way to circumvent this problem and complete the run?

Best regards,
Francesco

regex

Hi,
I am trying to use MMSEQ to estimate abundances of a custom fasta file (assembly transcripts from trinity). I modified the fasta headers and I believe I have the regular expressions written correctly, but for some reason bam2hits -m is not working or when it does work the output is not coherent and doesn't work for mmseq. I also used the bowtie --fullref option as instructed.

MMSEQ_files$ head tester.fna

comp1_c0_seq1 comp1_c0
GAGATCTCTTTTTACTTAACGCTTAAACATTGAGATGTCAGGATAAGAGGAAGAACTGCAGGCAGATTTTCAAGA...

my transcript id is the first descriptor and my gene id is the 2nd descriptor (same as transcript id but without the _seq)

The regular expression might be as
"comp[0-9][0-9]__c[0-9][0-9]__seq[0-9][0-9]* comp[0-9][0-9]__c[0-9][0-9]_" 1 2
but this isn't working.

I can modify my headers to be something more conventional, but if you could point me in the right direction with this fasta header i'd really appreciate it.

Thanks!

A model matrix for estimating the effect of a numeric variable on expression levels

Hi @eturro ,

A couple of more questions about setting up model matrices:

I have a time course data (3 samples obtained at 1h, 5h, 10h, and 24h) and I'd like to estimate the effect of time on expression.

What would be the appropriate model matrix for this design?
Is it simply:

# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 1
0 1
0 1
0 5
0 5
0 5
0 10
0 10
0 10
0 24
0 24
0 24
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

Same as in question 1 but in this case I have treatment and control samples (3 and 3) for each of the time points and I want to estimate the interaction between treatment and time - i.e., is the slope of the treatment different from the slope of the control.

What would be the appropriate model matrix for this design?
Would it be comparing a model with no treatment and control with a model with treatment and control?

So the no treatment and control model matrix: would be:

# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 1
0 1
0 1
0 1
0 1
0 1
0 5
0 5
0 5
0 5
0 5
0 5
0 10
0 10
0 10
0 10
0 10
0 10
0 24
0 24
0 24
0 24
0 24
0 24
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

And the treatment and control model matrix will be:

# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 1
0 1
0 1
0 5
0 5
0 5
0 10
0 10
0 10
0 24
0 24
0 24
1 1
1 1
1 1
1 5
1 5
1 5
1 10
1 10
1 10
1 24
1 24
1 24
# P0(collapsed); no. of rows = no. of classes for model 0
0.5
-.5
# P1(collapsed); no. of rows = no. of classes for model 1
0.5 0.5
0.5 -.5
-.5 0.5
-.5 -.5

Thanks a lot

Pairing samples

Hello,

I have ribosome profiling data from two conditions: treatment and control. For both treatment and control I have 3 RNA-seq libraries (quantified by MMSEQ) of the ribosome RNA footprints and 3 samples of the total RNA, where the footprints and total are matched with respect to the animals they came from, meaning tissue from each sample is split to two: one for the ribosome profiling protocol and the other for the total RNA protocol, hence are paired.

I would like to test if the ribosome occupancy is different between the treatment and control. From the examples in the README page it seems that the case of assessing if the log fold change between group A and group B is different from the fold change between group C and group D is the appropriate one, where A and B would be ribosome footprints and total RNA from the treatment group, respectively, and the same for C and D for the control group. The only thing this doesn't seem to account for is the fact that the samples in A and B (and in C and D) are paired.

Should that information be encoded in M matrix? (each animal is given and integer ranging from 0 to the number of animals minus one?), or is there another way to utilize the sample pairing?

Thanks a lot
Tata

bam2hits.cpp:669: pointless allocation ?

[bam2hits.cpp:669]: (style) Variable 'binpath' is allocated memory that is never used.

Source code is

char *binpath = strdup(argv[0]);

mmdiff index out of bounds error

Hi,

When I try to run mmdiff it gets through printing out the log scale normalisation factors and then it terminates with this error:

terminate called after throwing an instance of 'std::logic_error'
what(): Mat::col(): index out of bounds
Aborted

Any ideas?

Thanks!

eturro / mmseq Goto Github PK

mmseq's People

Contributors

Stargazers

Watchers

Forkers

mmseq's Issues

Recommend Projects

Recommend Topics

Recommend Org