Code Monkey home page Code Monkey logo

gtex-pipeline's People

Contributors

abhiachoudhary avatar francois-a avatar leipzig avatar rosswell avatar scchess avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gtex-pipeline's Issues

The run_FastQTL_threaded.py script produces different output files with the same input files for different chunks number

I am trying to run the run_FastQTL_threaded.py script on the VCF and BED files which contain only chromosome 22 and I noticed that for the different number of chunks the output files are always different. Moreover, as I increase the number of chunks, the output file is smaller. I have checked all output log files and it seems that all the chunks are processed (I expected that some chunks were skipped because of the memory issue but it seems that this is not the case). I am sending you the command lines and the output file's sizes which are produced with these command lines. Please note that these are the command lines that I tried to run locally. But, I tried the same on the AWS m4.16 xlarge instance and the same problem exists there. Moreover, when I run this script on the VCF and BED files which contain all chromosomes, the output files are the same despite the different chunk size between tasks. Could you help me to understand why this is happening when I run the fastQTL tool on chromosome 22?
I used BED file produced by the Geuvadis consortium and VCF file from the 1000 Genomes project.

Number of chunks 10:
Command:
/opt/fastqtl/python/run_FastQTL_threaded.py chr22.vcf.gz FastQTL.bed.gz chunks10 --window 1e6 --chunks 10 --threads 4
Output file size:
chunks10.allpairs.txt.gz - 317 MB

Number of chunks 20:
Command:
/opt/fastqtl/python/run_FastQTL_threaded.py chr22.vcf.gz FastQTL.bed.gz chunks20 --window 1e6 --chunks 20 --threads 4
Output file size:
chunks20.allpairs.txt.gz - 312M

Number of chunks 30:
Command:
/opt/fastqtl/python/run_FastQTL_threaded.py chr22.vcf.gz FastQTL.bed.gz chunks30 --window 1e6 --chunks 30 --threads 4
Output file size:
chunks30.allpairs.txt.gz - 308 MB

Number of chunks 100:
Command:
/opt/fastqtl/python/run_FastQTL_threaded.py chr22.vcf.gz FastQTL.bed.gz chunks100 --window 1e6 --chunks 100 --threads 4
Output file size:
chunks100.allpairs.txt.gz - 265 MB

RNA-seQC output empty

Hi,

I followed the TOPmed pipeline used STAR to get the bam file, picard to mark duplicate, then used RNA-seQC to generate gene tpm. All steps ran without any error, however the gene_tpm.gct file only contain nan for tpm column, and exon_reads.gct and other files contain only 0 for last column, the first two columns (ID and name) looks fine.
head gene_tpm.gct
image

head exon_reads.gct
image

Therefore I double checked my work, and noticed a warning from picard mark duplicate step, because the log is very long, I will only paste the head and tail here:

picardLOG

image

Did this cause the empty output? I've been searching online for solution for several days and did not get any valid results.

Thank you for your help!

Error at run _rnaseq.py

Hello, I am stuck at the run_rnaseq.py step while following the suggested v8 pipeline with docker. All the reference gtf/fasta were obtained from the recommended source on the GitHub page. Have I missed out something? Thank you very much!

nelson@hemepath:~/OPERATION/GTEX/data/star_out$ docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 /bin/bash -c "/src/run_rnaseqc.py
${sample_id}.Aligned.sortedByCoord.out.patched.md.bam
${genes_gtf}
${genome_fasta}
${sample_id}
--java /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java
--output_dir /data"
[Nov 25 10:10:12] Running RNA-SeQC

  • command: "/usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s HA10YLS,HA10YLS.Aligned.sortedByCoord.out.patched.md.bam,HA10YLS -t GENCODE_gencode.v26.GRCh38.annotation.gtf -r references_Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores"
    RNA-SeQC v1.1.9 06/26/16
    Additional GATK flags provided: --allow_potentially_misencoded_quality_scores
    Creating rRNA Interval List based on given GTF annotations
    Retriving contig names from reference
    contig names in reference: 195
    Loading GTF for Read Counting
    The required transcript_id attribute was not found on line chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
    Traceback (most recent call last):
    File "/src/run_rnaseqc.py", line 102, in
    subprocess.check_call(cmd, shell=True)
    File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '/usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s HA10YLS,HA10YLS.Aligned.sortedByCoord.out.patched.md.bam,HA10YLS -t GENCODE_gencode.v26.GRCh38.annotation.gtf -r references_Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores' returned non-zero exit status 1

Error at running RNA-SeQC from docker

Hello, I'm running the RNA-seq pipeline with docker gtex_rnaseq:V8 at the RNA-SeQC, the built in RNA-SeQC is supposed to be v2.3.6 but when I run the docker, the following error appeared. Is the docker is still referring to an older version of RNA-SeQC? Your help is much appreciated. Thank you!

nelson@hemepath:~/OPERATION/GTEX/data$ docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 /bin/ bash -c "/src/run_rnaseqc.py
${sample_id}.Aligned.sortedByCoord.out.md.bam
${genes_gtf}
${genome_fasta}
${sample_id}
--output_dir /data"
[Nov 12 01:00:24] Running RNA-SeQC

  • command: "java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s HA10YLS,HA10YLS.Aligned.sortedByCoord.out.md. bam,HA10YLS -t GENCODE_gencode.v26.GRCh38.genes.gtf -r references_Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta -o / data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores"
    RNA-SeQC v1.1.9 06/26/16
    Additional GATK flags provided: --allow_potentially_misencoded_quality_scores
    Creating rRNA Interval List based on given GTF annotations
    Retriving contig names from reference
    contig names in reference: 195
    Loading GTF for Read Counting
    Converting to refGene
    Transcript objects to RefGen format: 0 s
    Running IntronicExpressionReadBlock Walker ....
    Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:160)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:53)
    at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:54)
    at org.broadinstitute.cga.rnaseq.gatk.GATKTools.runIntronReadCount(GATKTools.java:203)
    at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runRegionCounting(ReadCountMetrics.java:220)
    at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runReadCountMetrics(ReadCountMetrics.java:59)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.runMetrics(RNASeqMetrics.java:211)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:162)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:133)
    Caused by: java.lang.NullPointerException
    at org.reflections.Reflections.scan(Reflections.java:220)
    at org.reflections.Reflections.scan(Reflections.java:166)
    at org.reflections.Reflections.(Reflections.java:94)
    at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:77)
    ... 9 more
    Traceback (most recent call last):
    File "/src/run_rnaseqc.py", line 102, in
    subprocess.check_call(cmd, shell=True)
    File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s HA10YLS,HA10YLS.Ali gned.sortedByCoord.out.md.bam,HA10YLS -t GENCODE_gencode.v26.GRCh38.genes.gtf -r references_Homo_sapiens_assembly38_noAL T_noHLA_noDecoy.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores' returned non -zero exit status 1
    nelson@hemepath:~/OPERATION/GTEX/data$ docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "java -version"

openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

Leafcutter junction files

Hi All,

I am curious to know if leafcutter raw junction counts are available somewhere. I could see exon-exon junction counts file, but all the splicing analysis is done on leafcutter generated junctions. I also saw that the normalized phenotype file used for leafcutter is available but I would like to know if I can access the raw junction counts from leafcutter directly or via google cloud.

Thanks,
Goutham A

MarkDuplicates Does Not Affect Subsequent Analyses?

Hi, first of all thanks a lot for this incredibly useful repository.

I am following GTEx v9 (using the branch, because v8 run_rnaseqc.py gives me 0 counts for all exons/transcripts/genes) pipeline for expression quantification and eQTL mapping.

To see the effect of Mark Duplicates step on quantification, I took an original bam file and its version processed by Mark Duplicates (that keeps all the reads, but changes only the second column of the bam file [flag] as far as I know).

Then, I run "run_rnaseqc.py" on both the original bam file and the output file of Mark Duplicates. When I compare the output, they seem to be identical.

Is this expected? MarkDuplicates step is not meant to affect expression quantification and later steps? If so, what is its functionality?

Thanks!

question for run eqtl_prepare_expression.py

when I run eqtl_prepare_expression.py ,it show error message:
Loading expression data
Traceback (most recent call last):
File "eqtl_prepare_expression.py", line 177, in
raise ValueError('Sample IDs in expression files and participant lookup table must match.')
ValueError: Sample IDs in expression files and participant lookup table must match.

Which files do I get the SampeID and participantID from? Is SampeID from counts_gct files?is participantID from genotype file?

question about gene number after normalizing in eQTL pipeline

Hi,
I run the eqtl_prepare_expression pipeline and set the same expression thresholds as the example. I'm confused that expression values of some genes I'm concerned about in GCT files are well above the threshold but removed in BED file. This is why?

chimMainSegmentMultNmax & -outSAMattributes

Hi,

I think STAR 2.4.2a does not support "chimMainSegmentMultNmax" option, however it is found in run_STAR.py. Therefore it crashes when using STAR 2.4.2a. (However, I am not sure if docker version has 2.4.2a or a later version; since I am not using it).

Same goes for -outSAMattributes , where "ch" is not supported by STAR 2.4.2a.

Cheers,
Fahri

command to run rnaseqc in the docker image

Hi, in the readme file there is no command to run the quantification of rnaseqc

RSEM transcript quantification

docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq /bin/bash -c "/src/run_RSEM.py /data/rsem_reference /data/star_out/$prefix.Aligned.toTranscriptome.out.bam /data/$prefix --threads 1"

Could you provide me with an example to run the last step to get the tpms and proceed then to qtl pipeline? thanks

questions about counts_gct file and sample_participant_lookup file in eQTL

Hi,

I'm working on the first step of eQTL pipeline. In the RNA-Seq pipeline, I've aggregated the gene_tpm.gct files of all samples into one tpm.gct file. When generating normalized expression in BED format, which tpm.gct file should I use, the one of each sample or the combined one?

And the sample_participant_lookup file I've generated is showed in the following figure. I am not sure about the format. Is this right?

image

When I run the following command, I got this error. How can I fix this error?

image

Thanks!

output of RNA-SeQC

Hi, I am trying to use your RNAseq pipeline and have two questions:

After I run RNA-SeQC, I only got the expression of the transcripts, in ENST. How can I get the expression of the gene (in ENSG), like in file "GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz"?
Could you please tell me your command when running RNA-SeQC?

Here is the command I used:
java -jar /home/zhangsj/data/SeQC/RNA-SeQC_v1.1.8.jar -n 1000 -s "SRR821498|SRR821498Aligned.sortedByCoord.out.patched.added.bam|Notes" -t gencode.v19.annotation.patched_contigs.gtf -r Homo_sapiens_assembly19.fasta -o /data/out_seqc

And how you converted the expression in rpkm to ones in tpm?

Thank you very much!

Version of GENCODE

Hi, I am using docker image V8 for RNA-seq. I noticed that GENCODE v19 was used in your code, but it also says that V8 will be based on GENCODE v26.
So I am wondering which version of GENCODE should I use for the V8 docker image? I hope to get the same result as RNA-Seq data in GTEx Portal.

Thank you!

Which python version for leafcutter sQTL analysis? And the error when ran the leafcutter sQTL scripts here

Dear All,

Could you please let me know which python version did you use for the leafcutter sQTL analysis in GTEx?

I know GTEx incorporated the leafcutter scripts itself for the sQTL analysis, and I tried these scripts which you posted here (https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/leafcutter/src/cluster_prepare_fastqtl.py) on my data, and got the error -KeyError: 'the label [WASH9P] is not in the [index]' (Please see the detailed below) which doesn't make sense to me.

Because I checked the bed_file (.perind.counts.filtered.qqnorm.bed.gz) which doesn't contain clusters (chr1:clu_4863_NA, chr1:clu_4864_NA, chr1:clu_4865_NA) for "WASH9P", but the cluster2gene_dict (leafcutter.clusters_to_genes.txt) contains three clusters for this gene. So according to the script (https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/leafcutter/src/cluster_prepare_fastqtl.py)--line 194-210, this gene shouldn't be selected for next step. Even though this gene "WASH9P" were selected for the next step, the manipulated gtf bed file ( tss_df ) indeed contains this gene, so this error should come out. Could you please let me know why? By the way, I chose the python2.7 which leafcutter recommended.


Traceback (most recent call last):
File "/shares/mgomery/Group-Montgomery/f.Yang/2020/sQTL_grp/gtex-pipeline-master/qtl/leafcutter/src/cluster_prepare_fastqtl.py", line 207, in
gene_bed_df.append(tss_df.loc[g, ['chr', 'start', 'end']].tolist() + [gi] + r.iloc[4:].tolist())
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1309, in getitem
return self._getitem_tuple(key)
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 795, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 921, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1481, in _getitem_axis
self._has_valid_type(key, axis)
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1418, in _has_valid_type
error()
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1405, in error
(key, self.obj._get_axis_name(axis)))
KeyError: 'the label [WASH9P] is not in the [index]'


Any my command for this script is below:
python2.7 /gtex-pipeline-master/qtl/leafcutter/src/cluster_prepare_fastqtl.py /pheno_try6/junc206_fastQTL.txt gencode.v31.exons.txt.gz collapse_gtf_v31.gtf TRY_7 -o /pheno_try7 --leafcutter_dir /leafcutter --check TRUE

I also tried the python3.7 for this script, the included leafcutter_cluster.py doesn't work and got the error like- NameError: name 'file' is not defined. So I feel that python version made a difference.

Could you please help figure out this issue? Unfortunately, there is no much description for incorporated leafcutter sQTL scripts. I have been stuck by this script for more than one week and really want to see the results. So, if you could help that will be very very appreciated.

Please let me know if you need more detailes about this error. I will reply immediately.

Many thanks,

Fei

collapse_annotation.py Error

Hi,

thank you for collapse_annotation.py script. I however, get the error below which I could not figure. Could you kindly assist to edit the script.

Traceback (most recent call last):
File "/home/banthony/software/gtex-pipeline-master/gene_model/collapse_annotation.py", line 249, in
annotation = Annotation(args.transcript_gtf)
File "/home/banthony/software/gtex-pipeline-master/gene_model/collapse_annotation.py", line 84, in init
e = Exon(str(len(t.exons)+1), len(t.exons)+1, attributes['exon_id'], start_pos, end_pos)
UnboundLocalError: local variable 't' referenced before assignment

Invariant sites are being called eQTLs

I ran the gtex eQTL pipeline on a subset of a panel of genotypes, i.e. the subset for which I have RNA-seq data. I forgot to remove sites in the subset genotype file that are invariant for that set of individuals before I ran the pipeline. I'll remove them and re-run. But first: I noticed that many sites that are invariant in the genotype file were erroneously called eQTLs by the pipeline. Is this is a known issue? Shouldn't the pipeline ignore any invariant sites in the input?

Thanks,
Eric

run_rnaseqc.py issue

Hello, I am testing and received below message. How should I try to fix this? Thank you for the help!

mz@mz:$ docker run --rm -v /RNA:/data -t rnatest:V1 /bin/bash -c "/src/run_rnaseqc.py K.Aligned.sortedByCoord.out.patched.md.bam gencode.v19.chr_patch_hapl_scaff.annotation.patched.gtf Homo_sapiens_assembly19.fasta K --output_dir /data"
[Aug 02 18:40:03] Running RNA-SeQC

  • command: "java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s K,K.Aligned.sortedByCoord.out.patched.md.bam,K -t gencode.v19.chr_patch_hapl_scaff.annotation.patched.gtf -r Homo_sapiens_assembly19.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores"
    RNA-SeQC v1.1.9 06/26/16
    Additional GATK flags provided: --allow_potentially_misencoded_quality_scores
    Creating rRNA Interval List based on given GTF annotations
    Retriving contig names from reference
    contig names in reference: 85
    Loading GTF for Read Counting
    Converting to refGene
    Transcript objects to RefGen format: 2 s
    Running IntronicExpressionReadBlock Walker ....
    Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:160)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:53)
    at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:54)
    at org.broadinstitute.cga.rnaseq.gatk.GATKTools.runIntronReadCount(GATKTools.java:203)
    at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runRegionCounting(ReadCountMetrics.java:220)
    at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runReadCountMetrics(ReadCountMetrics.java:59)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.runMetrics(RNASeqMetrics.java:211)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:162)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:133)
    Caused by: java.lang.NullPointerException
    at org.reflections.Reflections.scan(Reflections.java:220)
    at org.reflections.Reflections.scan(Reflections.java:166)
    at org.reflections.Reflections.(Reflections.java:94)
    at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:77)
    ... 9 more
    Traceback (most recent call last):
    File "/src/run_rnaseqc.py", line 102, in
    subprocess.check_call(cmd, shell=True)
    File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s K,K.Aligned.sortedByCoord.out.patched.md.bam,K -t gencode.v19.chr_patch_hapl_scaff.annotation.patched.gtf -r Homo_sapiens_assembly19.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores' returned non-zero exit status 1

utility of bamsync in rnaseq

Hi,

First thanks for maintaining this super useful repository.
In the documentation of RNAseq pipeline, it says

sync BAMs (optional; copy QC flags and read group IDs)

I wonder what's the utility of copying QC flags and read group IDs. Is including this step helping anything?

Thanks!

reference_junctions file

Hi All,

I would like to know where is the reference_junctions file that is required for the script:

gtex-pipeline/rnaseq/src/process_star_junctions.py

If not, how can I generate it ?

what is the input Bam file?

Hi, I am trying to use your RNAseq pipeline:
I have fastq format RNAseq raw data, but I saw the first step is “BAM to fastq” , what is that Bam file?
And I saw there is a step “sync Bams ” after Star alignment, so what is the difference between the original Bam file and the Bam file generated by STAR?
Thank you very much!
HG

Output files from rnaseq v8 pipeline

Hello. I finished running the v8 pipeline with docker on my own RNAseq data and there are 3 gcts generated, including sample.exon_reads.gct, sample.gene_reads.gct and sample.gene_rpkm.gct. I want to compare them with the reference dataset which uses TPM and have questions as follows.

  1. From the reference v8 dataset(https://www.gtexportal.org/home/datasets), does sample.gene_reads.gct correspond to GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct or GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct?

  2. Is there a corresponding reference file for sample.exon_reads.gct?

  3. Can I use the aggregating outputs command (docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8
    /bin/bash -c "python3 /src/combine_GCTs.py
    ${rnaseqc_rpkm_gcts} ${sample_set_id}.rnaseqc_rpkm"). for gcts (other than rpkm_gcts) and rsem files?

Thank you very much for your help!

to repeat GTEx result

./run_FastQTL_threaded.py GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz #BED# #NAME# --covariates #COVARIATE# --window 1e6 --chunks 100 --threads 16 --threshold 1 --fdr 1 --permute 1000 10000

And for all human genes, each gene has only one eqtl entry output. The tissue I used is the Cells_EBV-transformed_lymphocytes. However, according to the GTEx portal, the tissue has 45,8161. significant variant-gene pair. How can I repeat this result?

Thanks.

when getting 'transcript' or 'gene' from the gtf file.

Hi, I am currently using the eqtl_prepare_expresson.py to prepare my expression data.
I noticed that default setting of the function gtf_to_bed is to get the 'gene' instead of 'transcript', while the script uses 'transcript' to call the function in line 174.

And the number of the feature is largely different in the gtf file:

$ awk '$3=="transcript"' eqtl_data/gencode.v27lift37.annotation.gtf| wc -l
  202697
 $ awk '$3=="gene"' eqtl_data/gencode.v27lift37.annotation.gtf| wc -l
   60461

So I am wondering which is the proper one to use?

significant gene-SNP pairs for FastQTL

Dear,

Thanks for the script for generating the list of significant gene-SNP pairs: https://github.com/francois-a/fastqtl/blob/master/python/annotate_outputs.py

I was wondering what is the workflow for annotate_outputs.py ? How does the annotate_outputs.py work - calculating and filtering at FDR (i.e., 0.05) by combining nominal and permutation pass results? What happens If we apply the multiple testing correction on P-values in nominal pass, will the results be same?

I get an error running run_star.py script

Hi, I am getting an error running the STAR. Can you please help me with this. I downloaded the Index file from here
https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md (STAR index: STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_oh100.tar.gz).

The resultant file gets empty bam file and log file.

docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 /bin/bash -c "/src/run_STAR.py
/data/STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_oh100
/data/${sample_id}_1.fastq.gz
/data/${sample_id}_2.fastq.gz
${sample_id}
--threads 4
--output_dir /data/star_out"
Jun 24 18:16:56 ..... started STAR run
Jun 24 18:16:57 ..... loading genome
Traceback (most recent call last):
File "/src/run_STAR.py", line 92, in
subprocess.check_call(cmd, shell=True, executable='/bin/bash')
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'STAR --runMode alignReads --runThreadN 4 --genomeDir /data/STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_oh100 --twopassMode Basic --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterType BySJout --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 --readFilesIn /data/C42B_1.fastq.gz /data/C42B_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix /data/star_out/C42B. --outSAMstrandField intronMotif --outFilterIntronMotifs None --alignSoftClipAtReferenceEnds Yes --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM Unsorted --outSAMunmapped Within --genomeLoad NoSharedMemory --chimSegmentMin 15 --chimJunctionOverhangMin 15 --chimOutType WithinBAM SoftClip --chimMainSegmentMultNmax 1 --outSAMattributes NH HI AS nM NM ch --outSAMattrRGline ID:rg1 SM:sm1' returned non-zero exit status -9

Regards
Rimpi

Format of ${sample_participant_lookup}

Hello,
According to this instruction:
The file ${sample_participant_lookup} must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the VCF (these can be the same).

I generated the file containing lines like:
sample_id participant_id
0226-SM-5GZZ7 GTEX-1117F
0426-SM-5EGHI GTEX-1117F
0526-SM-5EGHJ GTEX-1117F
0626-SM-5N9CS GTEX-1117F
0726-SM-5GIEN GTEX-1117F
1326-SM-5EGHH GTEX-1117F

The columns are separated by tab.

When using eqtl_prepare_expression.py, I have this error:
raise ValueError('Sample IDs in expression files and participant lookup table must match.')
ValueError: Sample IDs in expression files and participant lookup table must match.

I think the sample-participant file I generated is problematic.

Thanks!

Genotype pcs

Hello,

Could someone tell me what are genotype based principal components in the step 3?
Where I could get this information?

Thanks,
Yanting

Import missing in rnaseq_pipeline_fastq.wdl

I'm getting an error for WDL from Cromwell:

Failed to import workflow https://api.firecloud.org/ga4gh/v1/tools/broadinstitute_gtex:samtofastq_v1-0_BETA/versions/3/plain-WDL/descriptor.:
Bad import https://api.firecloud.org/ga4gh/v1/tools/broadinstitute_gtex:samtofastq_v1-0_BETA/versions/3/plain-WDL/descriptor: Import file not found: 
https://api.firecloud.org/ga4gh/v1/tools/broadinstitute_gtex:samtofastq_v1-0_BETA/versions/3/plain-WDL/descriptor

It doesn't look like the import URL in https://github.com/broadinstitute/gtex-pipeline/blob/master/rnaseq/rnaseq_pipeline_fastq.wdl are valid. For example, https://api.firecloud.org/ga4gh/v1/tools/broadinstitute_gtex:rsem_v1-0_BETA/versions/3/plain-WDL/descriptor gives a 404 error.

May I ask if this is an expected behaviour? How should I run the WDL script with those imports missing?

Doesn't work with Ensemble GTF files?

Program works fine with Gencode but not Ensemble? How do I get it to work?

Steps to reproduce

  • wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
  • mkdir BUG
  • pigz -d Homo_sapiens.GRCh38.97.gtf.gz > BUG/Homo_sapiens.GRCh38.97.gtf
  • python3 SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py BUG/Homo_sapiens.GRCh38.97.gtf BUG/Homo_sapiens.GRCh38.97_compressed.gtf

Error

Traceback (most recent call last):
  File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 249, in <module>
    annotation = Annotation(args.transcript_gtf)
  File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 68, in __init__
    g = Gene(gene_id, attributes['gene_name'], attributes['gene_type'], chrom, strand, start_pos, end_pos)
KeyError: 'gene_type'

How to get chromosome list

Hello,

I do not quite understand how to get a chromosome list from a vcf file. Could you provide some suggestions?

Thanks

Homo_sapiens_assembly19

I am wondering why the assembly in v8 is named Homo_sapiens_assembly19 in rnaseq folder under building indexes. Shouldnt it be hg38 ? The original documentation on GTEX website says they used v26 of gencode and hg38 genome. Here it looks opposite:

docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "STAR \
        --runMode genomeGenerate \
        --genomeDir /data/star_index_oh75 \
        --genomeFastaFiles /data/Homo_sapiens_assembly19.fasta \
        --sjdbGTFfile /data/gencode.v19.annotation.patched_contigs.gtf \
        --sjdbOverhang 75 \
        --runThreadN 4"

Error during Fastq conversion

I'm trying the rnaseq pipeline with docker. At the fastq conversion step, the following error occurred. Would be grateful if you can advise. Thank you very much!

sudo docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \

/bin/bash -c "/src/run_SamToFastq.py /data/$input_bam -p ${sample_id} -o /data"

[Nov 07 01:04:25] Starting SamToFastq
[Sat Nov 07 01:04:25 UTC 2020] picard.sam.SamToFastq INPUT=/data/HA10YLS.bam FASTQ=read1_pipe SECOND_END_FASTQ=read2_pip e UNPAIRED_FASTQ=read0_pipe INCLUDE_NON_PF_READS=true INCLUDE_NON_PRIMARY_ALIGNMENTS=false VALIDATION_STRINGENCY=SILENT OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 VERBOS ITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT _SECRETS=client_secrets.json
[Sat Nov 07 01:04:25 UTC 2020] Executing as root@cee0d84bfc96 on Linux 5.3.0-64-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
INFO 2020-11-07 01:04:35 SamToFastq Processed 1,000,000 records. Elapsed time: 00:00:09s. Time for las t 1,000,000: 9s. Last read position: 1:27,415,567
INFO 2020-11-07 01:04:44 SamToFastq Processed 2,000,000 records. Elapsed time: 00:00:18s. Time for las t 1,000,000: 9s. Last read position: 1:43,164,794
INFO 2020-11-07 01:04:53 SamToFastq Processed 3,000,000 records. Elapsed time: 00:00:27s. Time for las t 1,000,000: 8s. Last read position: 1:91,716,929
INFO 2020-11-07 01:05:01 SamToFastq Processed 4,000,000 records. Elapsed time: 00:00:35s. Time for las t 1,000,000: 8s. Last read position: 1:117,624,489
INFO 2020-11-07 01:05:10 SamToFastq Processed 5,000,000 records. Elapsed time: 00:00:44s. Time for las t 1,000,000: 8s. Last read position: 1:153,855,959
INFO 2020-11-07 01:05:18 SamToFastq Processed 6,000,000 records. Elapsed time: 00:00:52s. Time for las t 1,000,000: 8s. Last read position: 1:178,824,140
INFO 2020-11-07 01:05:26 SamToFastq Processed 7,000,000 records. Elapsed time: 00:01:00s. Time for las t 1,000,000: 8s. Last read position: 1:205,087,027
INFO 2020-11-07 01:05:34 SamToFastq Processed 8,000,000 records. Elapsed time: 00:01:08s. Time for last 1,000,000: 8s. Last read position: 1:244,318,744
INFO 2020-11-07 01:05:42 SamToFastq Processed 9,000,000 records. Elapsed time: 00:01:16s. Time for last 1,000,000: 7s. Last read position: 10:22,636,254
INFO 2020-11-07 01:05:50 SamToFastq Processed 10,000,000 records. Elapsed time: 00:01:24s. Time for last 1,000,000: 8s. Last read position: 10:68,338,518
INFO 2020-11-07 01:05:58 SamToFastq Processed 11,000,000 records. Elapsed time: 00:01:32s. Time for last 1,000,000: 8s. Last read position: 10:96,535,790
INFO 2020-11-07 01:06:06 SamToFastq Processed 12,000,000 records. Elapsed time: 00:01:40s. Time for last 1,000,000: 8s. Last read position: 11:1,880,170
INFO 2020-11-07 01:06:10 SamToFastq Processed 13,000,000 records. Elapsed time: 00:01:44s. Time for last 1,000,000: 3s. Last read position: 11:5,225,474
INFO 2020-11-07 01:06:19 SamToFastq Processed 14,000,000 records. Elapsed time: 00:01:53s. Time for last 1,000,000: 8s. Last read position: 11:5,225,493
INFO 2020-11-07 01:06:33 SamToFastq Processed 15,000,000 records. Elapsed time: 00:02:07s. Time for last 1,000,000: 14s. Last read position: 11:5,225,528
INFO 2020-11-07 01:06:39 SamToFastq Processed 16,000,000 records. Elapsed time: 00:02:13s. Time for last 1,000,000: 6s. Last read position: 11:5,225,556
INFO 2020-11-07 01:07:04 SamToFastq Processed 17,000,000 records. Elapsed time: 00:02:38s. Time for last 1,000,000: 25s. Last read position: 11:5,225,581
INFO 2020-11-07 01:07:57 SamToFastq Processed 18,000,000 records. Elapsed time: 00:03:31s. Time for last 1,000,000: 52s. Last read position: 11:5,225,589
INFO 2020-11-07 01:08:51 SamToFastq Processed 19,000,000 records. Elapsed time: 00:04:25s. Time for last 1,000,000: 53s. Last read position: 11:5,225,601
INFO 2020-11-07 01:10:25 SamToFastq Processed 20,000,000 records. Elapsed time: 00:05:59s. Time for last 1,000,000: 94s. Last read position: 11:5,225,620
INFO 2020-11-07 01:12:17 SamToFastq Processed 21,000,000 records. Elapsed time: 00:07:51s. Time for last 1,000,000: 112s. Last read position: 11:5,225,647
INFO 2020-11-07 01:13:46 SamToFastq Processed 22,000,000 records. Elapsed time: 00:09:20s. Time for last 1,000,000: 89s. Last read position: 11:5,225,657
INFO 2020-11-07 01:15:05 SamToFastq Processed 23,000,000 records. Elapsed time: 00:10:39s. Time for last 1,000,000: 78s. Last read position: 11:5,225,669
INFO 2020-11-07 01:16:50 SamToFastq Processed 24,000,000 records. Elapsed time: 00:12:24s. Time for last 1,000,000: 104s. Last read position: 11:5,225,672
INFO 2020-11-07 01:31:59 SamToFastq Processed 25,000,000 records. Elapsed time: 00:27:33s. Time for last 1,000,000: 909s. Last read position: 11:5,225,692
INFO 2020-11-07 01:35:20 SamToFastq Processed 26,000,000 records. Elapsed time: 00:30:54s. Time for last 1,000,000: 200s. Last read position: 11:5,225,705
INFO 2020-11-07 01:36:50 SamToFastq Processed 27,000,000 records. Elapsed time: 00:32:24s. Time for last 1,000,000: 90s. Last read position: 11:5,226,586
INFO 2020-11-07 01:37:15 SamToFastq Processed 28,000,000 records. Elapsed time: 00:32:49s. Time for last 1,000,000: 24s. Last read position: 11:5,226,603
INFO 2020-11-07 01:37:36 SamToFastq Processed 29,000,000 records. Elapsed time: 00:33:10s. Time for last 1,000,000: 21s. Last read position: 11:5,226,620
INFO 2020-11-07 01:37:46 SamToFastq Processed 30,000,000 records. Elapsed time: 00:33:20s. Time for last 1,000,000: 9s. Last read position: 11:5,226,633
INFO 2020-11-07 01:38:08 SamToFastq Processed 31,000,000 records. Elapsed time: 00:33:42s. Time for last 1,000,000: 22s. Last read position: 11:5,226,647
INFO 2020-11-07 01:39:22 SamToFastq Processed 32,000,000 records. Elapsed time: 00:34:56s. Time for last 1,000,000: 73s. Last read position: 11:5,226,662
INFO 2020-11-07 01:40:54 SamToFastq Processed 33,000,000 records. Elapsed time: 00:36:28s. Time for last 1,000,000: 92s. Last read position: 11:5,226,687
INFO 2020-11-07 01:42:27 SamToFastq Processed 34,000,000 records. Elapsed time: 00:38:01s. Time for last 1,000,000: 93s. Last read position: 11:5,226,721
INFO 2020-11-07 01:43:57 SamToFastq Processed 35,000,000 records. Elapsed time: 00:39:31s. Time for last 1,000,000: 89s. Last read position: 11:5,226,753
INFO 2020-11-07 01:44:12 SamToFastq Processed 36,000,000 records. Elapsed time: 00:39:46s. Time for last 1,000,000: 15s. Last read position: 11:5,226,771
INFO 2020-11-07 01:44:28 SamToFastq Processed 37,000,000 records. Elapsed time: 00:40:02s. Time for last 1,000,000: 15s. Last read position: 11:5,226,774
INFO 2020-11-07 01:44:49 SamToFastq Processed 38,000,000 records. Elapsed time: 00:40:23s. Time for last 1,000,000: 21s. Last read position: 11:5,226,776
INFO 2020-11-07 01:45:01 SamToFastq Processed 39,000,000 records. Elapsed time: 00:40:35s. Time for last 1,000,000: 11s. Last read position: 11:10,766,183
INFO 2020-11-07 01:45:09 SamToFastq Processed 40,000,000 records. Elapsed time: 00:40:43s. Time for last 1,000,000: 8s. Last read position: 11:62,522,604
INFO 2020-11-07 01:45:17 SamToFastq Processed 41,000,000 records. Elapsed time: 00:40:51s. Time for last 1,000,000: 8s. Last read position: 11:65,502,777
INFO 2020-11-07 01:45:26 SamToFastq Processed 42,000,000 records. Elapsed time: 00:41:00s. Time for last 1,000,000: 8s. Last read position: 11:76,990,676
INFO 2020-11-07 01:45:34 SamToFastq Processed 43,000,000 records. Elapsed time: 00:41:08s. Time for last 1,000,000: 8s. Last read position: 11:121,496,986
INFO 2020-11-07 01:45:42 SamToFastq Processed 44,000,000 records. Elapsed time: 00:41:16s. Time for last 1,000,000: 8s. Last read position: 12:10,704,125
INFO 2020-11-07 01:45:50 SamToFastq Processed 45,000,000 records. Elapsed time: 00:41:24s. Time for last 1,000,000: 8s. Last read position: 12:50,461,337
INFO 2020-11-07 01:45:58 SamToFastq Processed 46,000,000 records. Elapsed time: 00:41:32s. Time for last 1,000,000: 8s. Last read position: 12:71,840,291
INFO 2020-11-07 01:46:06 SamToFastq Processed 47,000,000 records. Elapsed time: 00:41:40s. Time for last 1,000,000: 8s. Last read position: 12:112,314,456
INFO 2020-11-07 01:46:14 SamToFastq Processed 48,000,000 records. Elapsed time: 00:41:48s. Time for last 1,000,000: 8s. Last read position: 13:28,198,124
INFO 2020-11-07 01:46:22 SamToFastq Processed 49,000,000 records. Elapsed time: 00:41:56s. Time for last 1,000,000: 8s. Last read position: 13:77,191,780
INFO 2020-11-07 01:46:30 SamToFastq Processed 50,000,000 records. Elapsed time: 00:42:04s. Time for last 1,000,000: 8s. Last read position: 14:35,400,666
INFO 2020-11-07 01:46:35 SamToFastq Processed 51,000,000 records. Elapsed time: 00:42:09s. Time for last 1,000,000: 4s. Last read position: 14:49,586,632
INFO 2020-11-07 01:46:49 SamToFastq Processed 52,000,000 records. Elapsed time: 00:42:23s. Time for last 1,000,000: 14s. Last read position: 14:49,586,718
INFO 2020-11-07 01:46:55 SamToFastq Processed 53,000,000 records. Elapsed time: 00:42:29s. Time for last 1,000,000: 6s. Last read position: 14:49,862,578
INFO 2020-11-07 01:47:08 SamToFastq Processed 54,000,000 records. Elapsed time: 00:42:42s. Time for last 1,000,000: 13s. Last read position: 14:49,862,687
INFO 2020-11-07 01:47:18 SamToFastq Processed 55,000,000 records. Elapsed time: 00:42:52s. Time for last 1,000,000: 9s. Last read position: 14:71,390,545
INFO 2020-11-07 01:47:26 SamToFastq Processed 56,000,000 records. Elapsed time: 00:43:00s. Time for last 1,000,000: 8s. Last read position: 14:105,644,571
INFO 2020-11-07 01:47:34 SamToFastq Processed 57,000,000 records. Elapsed time: 00:43:08s. Time for last 1,000,000: 8s. Last read position: 15:56,700,946
INFO 2020-11-07 01:47:42 SamToFastq Processed 58,000,000 records. Elapsed time: 00:43:16s. Time for last 1,000,000: 8s. Last read position: 15:82,372,502
INFO 2020-11-07 01:47:49 SamToFastq Processed 59,000,000 records. Elapsed time: 00:43:23s. Time for last 1,000,000: 7s. Last read position: 16:172,890
INFO 2020-11-07 01:48:07 SamToFastq Processed 60,000,000 records. Elapsed time: 00:43:41s. Time for last 1,000,000: 17s. Last read position: 16:172,896
INFO 2020-11-07 01:48:14 SamToFastq Processed 61,000,000 records. Elapsed time: 00:43:48s. Time for last 1,000,000: 7s. Last read position: 16:172,901
INFO 2020-11-07 01:48:23 SamToFastq Processed 62,000,000 records. Elapsed time: 00:43:57s. Time for last 1,000,000: 8s. Last read position: 16:172,908
INFO 2020-11-07 01:50:46 SamToFastq Processed 63,000,000 records. Elapsed time: 00:46:20s. Time for last 1,000,000: 142s. Last read position: 16:172,939
[Sat Nov 07 02:03:15 UTC 2020] picard.sam.SamToFastq done. Elapsed time: 58.83 minutes.
Runtime.totalMemory()=4898947072
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:198)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:813)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:787)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:781)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:751)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:569)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:543)
at picard.sam.SamToFastq.doWork(SamToFastq.java:174)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Traceback (most recent call last):
File "/src/run_SamToFastq.py", line 52, in
+' VALIDATION_STRINGENCY=SILENT FASTQ=read1_pipe SECOND_END_FASTQ=read2_pipe UNPAIRED_FASTQ=read0_pipe', shell=True)
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'java -jar -Xmx8g /opt/picard-tools/picard.jar SamToFastq INPUT=/data/HA10YLS.bam INCLUDE_NON_PF_READS=true INCLUDE_NON_PRIMARY_ALIGNMENTS=false VALIDATION_STRINGENCY=SILENT FASTQ=read1_pipe SECOND_END_FASTQ=read2_pipe UNPAIRED_FASTQ=read0_pipe' returned non-zero exit status 1

Multi mapped reads

Hi,

I found that the outFilterMultimapNmax is set to 20 in STAR alignments. I could not find anywhere in the code that only uniquely mapped reads are considered for downstream analysis of splicing. May I know which step actually retains uniquely mapped reads only ? For example, to generate leafcutter junction files, I could not find if only uniquely mapped reads are considered ? Is is the WASP sam tag vW:1 that does extract only uniquely mapped reads ?

For rnaseqc, the default value of --mapping-quality is 255, so presumably, for gene expression quantifications, rnaseqc is considering only uniquely mapped reads through --mapping-quality set to 255.

Permission denied run_rnaseqc.py

Hello, I received below error testing run_rnaseqc.py in V8. Other components work out fine.

"/bin/bash: /src/run_rnaseqc.py: Permission denied"

Maybe this is an isolated issue, any suggestions? Thanks!

low correlation of gene count between HTSeq and RNA-SeQC

Hi,

I am trying to adopt GTEx pipeline to replace the pipeline I have now. But when I try HTSeq and RNA-SeQC on the same set of BAM and annotation file. I find there is a difference between the two.
For HTSeq (https://htseq.readthedocs.io/en/release_0.11.1/), I am using the following script

python3 -m HTSeq.scripts.count \
--stranded=no \
--mode=union \
        --type=gene ${BAM} ${GENEGTF} >> gene_union_count

And for RNA-SeQC v2(https://github.com/broadinstitute/rnaseqc), I am using the following script

${RNASEQC} ${GENE_GTF} ${BAM} ${OUT_DIR} --sample=${sample}

After I correlate the gene count output from two methods (${sample}.gene_reads.gct. I find very poor correlation (pearson correlation=~0.2, spearman correlation=~0.8). Also the coverage (sum of the gene count) is different (HTSeq ~10mil, RNA-SeQC ~6.5mil). HTSeq gives me the approximate coverage I expected.

Do you have idea on what's could be the possible cause for this difference? Thanks a lot in advance!

run_FastQTL_threaded.py

I got the genotype data from dbGap:
phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz
(It contains all genotypes from all tissues and individuals)

I also successfully generated the *.expression.bed.gz of one tissue as
well as covariates and tbi files etc.

Then I run the command ./fastqtl/python/run_FastQTL_threaded.py ./PhenoGenotypeFiles/RootStudyConsentSet_phs000424.GTEx.v8.p2.c1.GRU/GenotypeFiles/phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz ./tissue9_size176.expression.bed.gz tissue9_size176 --covariates ./tissue9_size176.combined_covariates.txt --window 1e6 --chunks 100 --threads 8 --permute 1000 10000

I tried 8 threads or even 1 thread.

The error message is

Processing chunk 99
Processing chunk 100
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/ysm-gpfs/apps/software/Python/3.7.0-fosscuda-2018b/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/ysm-gpfs/apps/software/Python/3.7.0-fosscuda-2018b/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/ysm-gpfs/pi/gerstein/jx98/bin/gtex_eqtl/gtex-pipeline-master/fastqtl/python/run_FastQTL_threaded.py", line 54, in perm_worker
s = subprocess.check_call(cmd, shell=True, executable='/bin/bash', stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
File "/ysm-gpfs/apps/software/Python/3.7.0-fosscuda-2018b/lib/python3.7/subprocess.py", line 328, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command './fastqtl/bin/fastQTL --vcf ./PhenoGenotypeFiles/RootStudyConsentSet_phs000424.GTEx.v8.p2.c1.GRU/GenotypeFiles/phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz --bed ./tissue9_size176.expression.bed.gz --window 1e6 --maf-threshold 0.0 --ma-sample-threshold 0 --interaction-maf-threshold 0 --cov ./tissue9_size176.combined_covariates.txt --permute 1000 10000 --chunk 1 100 --out tissue9_size176_chunk001.txt.gz --log tissue9_size176_chunk001.log' returned non-zero exit status 127.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "./fastqtl/python/run_FastQTL_threaded.py", line 95, in
assert res.get()[0]==0
File "/ysm-gpfs/apps/software/Python/3.7.0-fosscuda-2018b/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
subprocess.CalledProcessError: Command './fastqtl/bin/fastQTL --vcf ./PhenoGenotypeFiles/RootStudyConsentSet_phs000424.GTEx.v8.p2.c1.GRU/GenotypeFiles/phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz --bed ./tissue9_size176.expression.bed.gz --window 1e6 --maf-threshold 0.0 --ma-sample-threshold 0 --interaction-maf-threshold 0 --cov ./tissue9_size176.combined_covariates.txt --permute 1000 10000 --chunk 1 100 --out tissue9_size176_chunk001.txt.gz --log tissue9_size176_chunk001.log' returned non-zero exit status 127.

I tried to import multprocessing module of pythong. It works. Could you help on debugging the command run? Or is the error due to data format?

Thanks.

Identifying the list of all significant variant-gene pairs

Hi,
Thank you very much for sharing all these scripts.
Is the script used for Identifying the list of all significant variant-gene pairs in this folder?
If so, which one is it?

I am referring to the script that does the following (quoted from GTEX 2017 paper supplemental methods):
"To identify the list of all significant variant-gene pairs associated with eGenes, a genome-wide empirical P-value threshold, pt was defined as the empirical P-value of the gene closest to the 0.05 FDR threshold. pt was then used to calculate a nominal P-value threshold for each gene based on the beta distribution model (from FastQTL) of the minimum P-value distribution f(pmin) obtained from the permutations for the gene. Specifically, the nominal threshold was calculated as F −1(pt), where F −1 is the inverse cumulative distribution. For each gene, variants with a nominal P-value below the gene-level threshold were considered significant and included in the final list of variant-gene pairs."

Eldad

unindexed BAMs

Hi, I am using the RNA-Seq pipeline. After running STAR and mark duplicates, when I am using RNA-SeQC, I got the following error which says the BAM files are unindexed. I see there are sort and index commands in the run_STAR.py, but it doesn't work.
(error: Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in.)

Also, when I try to index the BAM files by samtools, I got the error that says my bam file may be corrupted or unsorted.

How should I try to fix this? Thank you for your help!

QC RNA-seq expression outliers

Hi, I have a question regarding the QC of the Gtex eQTL pipeline.

There is one step mentioned on the GtexPortal (https://gtexportal.org/home/documentationPage#staticTextAnalysisMethods) that I would like to use on my data:

"1. RNA-seq expression outliers were identified and excluded using a multidimensional extension of the statistic described in (Wright et al., Nat. Genet. 2014 ). Briefly, for each tissue, read counts from each sample were normalized using size factors calculated with DESeq2 and log-transformed with an offset of 1; genes with a log-transformed value >1 in >10% of samples were selected, and the resulting read counts were centered and unit-normalized. The resulting matrix was then hierarchically clustered (based on average and cosine distance), and a chi2 p-value was calculated based on Mahalanobis distance. Clusters with ≥60% samples with Bonferroni-corrected p-values <0.05 were marked as outliers, and their samples were excluded."

Is there a script or some code available that I could use to perform these steps?

Thank you for your help!

PEER Covariates

Hi,

Is the Rscript provided for PEER factor calculation accounting for already known cofounders? (such as gender and batch). If not, wouldn't we adjust the expression matrix multiple times for the same covariates if PEER is already detecting such cofounders? (and because the next step in the pipeline is combining known covariates with PEER factors)

I've modified the script with "p <- add_argument(p, "--covariates", help="")" and "invisible(PEER_setCovariates(model, argv$covariates))" to avoid this for example, but not sure if that's going to work.

Thanks,
Fahri

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.