broadinstitute / gtex-pipeline Goto Github PK

View Code? Open in Web Editor NEW

329.0 29.0 175.0 2.3 MB

GTEx & TOPMed data production and analysis pipelines

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.10% C++ 3.97% Python 62.26% Shell 0.54% R 1.55% WDL 24.77% Dockerfile 6.82%

gtex-pipeline's People

Contributors

Stargazers

Watchers

Forkers

xtmgah junjiezhujason jiamaozheng yilinxu liangma2017 yixf-self oldregan hakyimlab resurgo-genetics inambioinfo wkl1990 yu-1011 tankmermaid fahsan thyagoleal ddpinto drveera nrockweiler alextree yuntaotan ryetochondria xinwrli shengqh rlebron88 furbelows methionine23 shrutimarwaha tanimade capitaldata celalp parajago cuihuaxia yinuomarywang rosswell jindouzi94 rajlabmssm stekaz danielwyuan ayelletsegre igsr wangdi2014 bsun0802 leennabraham dalinlee stormlovetao daffay frogdong hnaeem liangyy kaichop santoroml chunyangbao remedic atps chaotang-scu leonguos akulan1 jasonlixinzhe rebjoh dongyuexie gamazonlab geneticresources chuanj chlorproaine shicheng-guo snkravitz cwarlysolsberg srividya22 hou bsmith030465 anksi jjenny leornardzhou general0710 gtex-etc-jusue404 pejlab gootec galexe2019 yw6 stivenming carlalbc kmfunder j-kreis yixf-education chrishusted yylu5 bayramakdeniz cjops haroon123 lichenbiostat dbrg77 sitazhou budsonjelmont zongchangli kadepettie hzaurzli yahgao standardgalactic bonder-mj osagiei

gtex-pipeline's Issues

The run_FastQTL_threaded.py script produces different output files with the same input files for different chunks number

I am trying to run the run_FastQTL_threaded.py script on the VCF and BED files which contain only chromosome 22 and I noticed that for the different number of chunks the output files are always different. Moreover, as I increase the number of chunks, the output file is smaller. I have checked all output log files and it seems that all the chunks are processed (I expected that some chunks were skipped because of the memory issue but it seems that this is not the case). I am sending you the command lines and the output file's sizes which are produced with these command lines. Please note that these are the command lines that I tried to run locally. But, I tried the same on the AWS m4.16 xlarge instance and the same problem exists there. Moreover, when I run this script on the VCF and BED files which contain all chromosomes, the output files are the same despite the different chunk size between tasks. Could you help me to understand why this is happening when I run the fastQTL tool on chromosome 22?
I used BED file produced by the Geuvadis consortium and VCF file from the 1000 Genomes project.

Number of chunks 10:
Command:
/opt/fastqtl/python/run_FastQTL_threaded.py chr22.vcf.gz FastQTL.bed.gz chunks10 --window 1e6 --chunks 10 --threads 4
Output file size:
chunks10.allpairs.txt.gz - 317 MB

Number of chunks 20:
Command:
/opt/fastqtl/python/run_FastQTL_threaded.py chr22.vcf.gz FastQTL.bed.gz chunks20 --window 1e6 --chunks 20 --threads 4
Output file size:
chunks20.allpairs.txt.gz - 312M

Number of chunks 30:
Command:
/opt/fastqtl/python/run_FastQTL_threaded.py chr22.vcf.gz FastQTL.bed.gz chunks30 --window 1e6 --chunks 30 --threads 4
Output file size:
chunks30.allpairs.txt.gz - 308 MB

Number of chunks 100:
Command:
/opt/fastqtl/python/run_FastQTL_threaded.py chr22.vcf.gz FastQTL.bed.gz chunks100 --window 1e6 --chunks 100 --threads 4
Output file size:
chunks100.allpairs.txt.gz - 265 MB

RNA-seQC output empty

Hi,

I followed the TOPmed pipeline used STAR to get the bam file, picard to mark duplicate, then used RNA-seQC to generate gene tpm. All steps ran without any error, however the gene_tpm.gct file only contain nan for tpm column, and exon_reads.gct and other files contain only 0 for last column, the first two columns (ID and name) looks fine.
head gene_tpm.gct

head exon_reads.gct

Therefore I double checked my work, and noticed a warning from picard mark duplicate step, because the log is very long, I will only paste the head and tail here:

Did this cause the empty output? I've been searching online for solution for several days and did not get any valid results.

Thank you for your help!

Error at run _rnaseq.py

Hello, I am stuck at the run_rnaseq.py step while following the suggested v8 pipeline with docker. All the reference gtf/fasta were obtained from the recommended source on the GitHub page. Have I missed out something? Thank you very much!

nelson@hemepath:~/OPERATION/GTEX/data/star_out$ docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 /bin/bash -c "/src/run_rnaseqc.py
${sample_id}.Aligned.sortedByCoord.out.patched.md.bam
${genes_gtf}
${genome_fasta}
${sample_id}
--java /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java
--output_dir /data"
[Nov 25 10:10:12] Running RNA-SeQC

command: "/usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s HA10YLS,HA10YLS.Aligned.sortedByCoord.out.patched.md.bam,HA10YLS -t GENCODE_gencode.v26.GRCh38.annotation.gtf -r references_Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores"
RNA-SeQC v1.1.9 06/26/16
Additional GATK flags provided: --allow_potentially_misencoded_quality_scores
Creating rRNA Interval List based on given GTF annotations
Retriving contig names from reference
contig names in reference: 195
Loading GTF for Read Counting
The required transcript_id attribute was not found on line chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
Traceback (most recent call last):
File "/src/run_rnaseqc.py", line 102, in
subprocess.check_call(cmd, shell=True)
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s HA10YLS,HA10YLS.Aligned.sortedByCoord.out.patched.md.bam,HA10YLS -t GENCODE_gencode.v26.GRCh38.annotation.gtf -r references_Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores' returned non-zero exit status 1

Error at running RNA-SeQC from docker

Hello, I'm running the RNA-seq pipeline with docker gtex_rnaseq:V8 at the RNA-SeQC, the built in RNA-SeQC is supposed to be v2.3.6 but when I run the docker, the following error appeared. Is the docker is still referring to an older version of RNA-SeQC? Your help is much appreciated. Thank you!

nelson@hemepath:~/OPERATION/GTEX/data$ docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 /bin/ bash -c "/src/run_rnaseqc.py
${sample_id}.Aligned.sortedByCoord.out.md.bam
${genes_gtf}
${genome_fasta}
${sample_id}
--output_dir /data"
[Nov 12 01:00:24] Running RNA-SeQC

command: "java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s HA10YLS,HA10YLS.Aligned.sortedByCoord.out.md. bam,HA10YLS -t GENCODE_gencode.v26.GRCh38.genes.gtf -r references_Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta -o / data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores"
RNA-SeQC v1.1.9 06/26/16
Additional GATK flags provided: --allow_potentially_misencoded_quality_scores
Creating rRNA Interval List based on given GTF annotations
Retriving contig names from reference
contig names in reference: 195
Loading GTF for Read Counting
Converting to refGene
Transcript objects to RefGen format: 0 s
Running IntronicExpressionReadBlock Walker ....
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:160)
at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:53)
at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:54)
at org.broadinstitute.cga.rnaseq.gatk.GATKTools.runIntronReadCount(GATKTools.java:203)
at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runRegionCounting(ReadCountMetrics.java:220)
at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runReadCountMetrics(ReadCountMetrics.java:59)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.runMetrics(RNASeqMetrics.java:211)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:162)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:133)
Caused by: java.lang.NullPointerException
at org.reflections.Reflections.scan(Reflections.java:220)
at org.reflections.Reflections.scan(Reflections.java:166)
at org.reflections.Reflections.(Reflections.java:94)
at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:77)
... 9 more
Traceback (most recent call last):
File "/src/run_rnaseqc.py", line 102, in
subprocess.check_call(cmd, shell=True)
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s HA10YLS,HA10YLS.Ali gned.sortedByCoord.out.md.bam,HA10YLS -t GENCODE_gencode.v26.GRCh38.genes.gtf -r references_Homo_sapiens_assembly38_noAL T_noHLA_noDecoy.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores' returned non -zero exit status 1
nelson@hemepath:~/OPERATION/GTEX/data$ docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \

/bin/bash -c "java -version"

openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

Leafcutter junction files

Hi All,

I am curious to know if leafcutter raw junction counts are available somewhere. I could see exon-exon junction counts file, but all the splicing analysis is done on leafcutter generated junctions. I also saw that the normalized phenotype file used for leafcutter is available but I would like to know if I can access the raw junction counts from leafcutter directly or via google cloud.

Thanks,
Goutham A

MarkDuplicates Does Not Affect Subsequent Analyses?

Hi, first of all thanks a lot for this incredibly useful repository.

I am following GTEx v9 (using the branch, because v8 run_rnaseqc.py gives me 0 counts for all exons/transcripts/genes) pipeline for expression quantification and eQTL mapping.

To see the effect of Mark Duplicates step on quantification, I took an original bam file and its version processed by Mark Duplicates (that keeps all the reads, but changes only the second column of the bam file [flag] as far as I know).

Then, I run "run_rnaseqc.py" on both the original bam file and the output file of Mark Duplicates. When I compare the output, they seem to be identical.

Is this expected? MarkDuplicates step is not meant to affect expression quantification and later steps? If so, what is its functionality?

Thanks!

question for run eqtl_prepare_expression.py

when I run eqtl_prepare_expression.py ，it show error message：
Loading expression data
Traceback (most recent call last):
File "eqtl_prepare_expression.py", line 177, in
raise ValueError('Sample IDs in expression files and participant lookup table must match.')
ValueError: Sample IDs in expression files and participant lookup table must match.

Which files do I get the SampeID and participantID from? Is SampeID from counts_gct files？is participantID from genotype file？

question about gene number after normalizing in eQTL pipeline

Hi,
I run the eqtl_prepare_expression pipeline and set the same expression thresholds as the example. I'm confused that expression values of some genes I'm concerned about in GCT files are well above the threshold but removed in BED file. This is why?

chimMainSegmentMultNmax & -outSAMattributes

Hi,

I think STAR 2.4.2a does not support "chimMainSegmentMultNmax" option, however it is found in run_STAR.py. Therefore it crashes when using STAR 2.4.2a. (However, I am not sure if docker version has 2.4.2a or a later version; since I am not using it).

Same goes for -outSAMattributes , where "ch" is not supported by STAR 2.4.2a.

Cheers,
Fahri

command to run rnaseqc in the docker image

Hi, in the readme file there is no command to run the quantification of rnaseqc

RSEM transcript quantification

docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq /bin/bash -c "/src/run_RSEM.py /data/rsem_reference /data/star_out/$prefix.Aligned.toTranscriptome.out.bam /data/$prefix --threads 1"

Could you provide me with an example to run the last step to get the tpms and proceed then to qtl pipeline? thanks

questions about counts_gct file and sample_participant_lookup file in eQTL

Hi,

I'm working on the first step of eQTL pipeline. In the RNA-Seq pipeline, I've aggregated the gene_tpm.gct files of all samples into one tpm.gct file. When generating normalized expression in BED format, which tpm.gct file should I use, the one of each sample or the combined one?

And the sample_participant_lookup file I've generated is showed in the following figure. I am not sure about the format. Is this right?

When I run the following command, I got this error. How can I fix this error?

Thanks!

output of RNA-SeQC

Hi, I am trying to use your RNAseq pipeline and have two questions:

After I run RNA-SeQC, I only got the expression of the transcripts, in ENST. How can I get the expression of the gene (in ENSG), like in file "GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz"?
Could you please tell me your command when running RNA-SeQC?

Here is the command I used:
java -jar /home/zhangsj/data/SeQC/RNA-SeQC_v1.1.8.jar -n 1000 -s "SRR821498|SRR821498Aligned.sortedByCoord.out.patched.added.bam|Notes" -t gencode.v19.annotation.patched_contigs.gtf -r Homo_sapiens_assembly19.fasta -o /data/out_seqc

And how you converted the expression in rpkm to ones in tpm?

Thank you very much!

Version of GENCODE

Hi, I am using docker image V8 for RNA-seq. I noticed that GENCODE v19 was used in your code, but it also says that V8 will be based on GENCODE v26.
So I am wondering which version of GENCODE should I use for the V8 docker image? I hope to get the same result as RNA-Seq data in GTEx Portal.

Thank you!

Does STAR WDL requires alignment index?

https://github.com/broadinstitute/gtex-pipeline/blob/master/rnaseq/rnaseq_pipeline_fastq.wdl is the WDL file.

File fastq1
File? fastq2
String prefix

call star_wdl.star {
    input: fastq1=fastq1, fastq2=fastq2, prefix=prefix
}

As far as I see, STAR doesn't ask for alignment index. May I ask how this WDL script can work out where to align the reads to?

Which python version for leafcutter sQTL analysis? And the error when ran the leafcutter sQTL scripts here

Dear All,

Could you please let me know which python version did you use for the leafcutter sQTL analysis in GTEx?

I know GTEx incorporated the leafcutter scripts itself for the sQTL analysis, and I tried these scripts which you posted here (https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/leafcutter/src/cluster_prepare_fastqtl.py) on my data, and got the error -KeyError: 'the label [WASH9P] is not in the [index]' (Please see the detailed below) which doesn't make sense to me.

Because I checked the bed_file (.perind.counts.filtered.qqnorm.bed.gz) which doesn't contain clusters (chr1:clu_4863_NA, chr1:clu_4864_NA, chr1:clu_4865_NA) for "WASH9P", but the cluster2gene_dict (leafcutter.clusters_to_genes.txt) contains three clusters for this gene. So according to the script (https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/leafcutter/src/cluster_prepare_fastqtl.py)--line 194-210, this gene shouldn't be selected for next step. Even though this gene "WASH9P" were selected for the next step, the manipulated gtf bed file ( tss_df ) indeed contains this gene, so this error should come out. Could you please let me know why? By the way, I chose the python2.7 which leafcutter recommended.

Traceback (most recent call last):
File "/shares/mgomery/Group-Montgomery/f.Yang/2020/sQTL_grp/gtex-pipeline-master/qtl/leafcutter/src/cluster_prepare_fastqtl.py", line 207, in
gene_bed_df.append(tss_df.loc[g, ['chr', 'start', 'end']].tolist() + [gi] + r.iloc[4:].tolist())
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1309, in getitem
return self._getitem_tuple(key)
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 795, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 921, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1481, in _getitem_axis
self._has_valid_type(key, axis)
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1418, in _has_valid_type
error()
File "/home/fei.yang/.local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1405, in error
(key, self.obj._get_axis_name(axis)))
KeyError: 'the label [WASH9P] is not in the [index]'

Any my command for this script is below:
python2.7 /gtex-pipeline-master/qtl/leafcutter/src/cluster_prepare_fastqtl.py /pheno_try6/junc206_fastQTL.txt gencode.v31.exons.txt.gz collapse_gtf_v31.gtf TRY_7 -o /pheno_try7 --leafcutter_dir /leafcutter --check TRUE

I also tried the python3.7 for this script, the included leafcutter_cluster.py doesn't work and got the error like- NameError: name 'file' is not defined. So I feel that python version made a difference.

Could you please help figure out this issue? Unfortunately, there is no much description for incorporated leafcutter sQTL scripts. I have been stuck by this script for more than one week and really want to see the results. So, if you could help that will be very very appreciated.

Please let me know if you need more detailes about this error. I will reply immediately.

Many thanks,

Fei

collapse_annotation.py Error

Hi,

thank you for collapse_annotation.py script. I however, get the error below which I could not figure. Could you kindly assist to edit the script.

Traceback (most recent call last):
File "/home/banthony/software/gtex-pipeline-master/gene_model/collapse_annotation.py", line 249, in
annotation = Annotation(args.transcript_gtf)
File "/home/banthony/software/gtex-pipeline-master/gene_model/collapse_annotation.py", line 84, in init
e = Exon(str(len(t.exons)+1), len(t.exons)+1, attributes['exon_id'], start_pos, end_pos)
UnboundLocalError: local variable 't' referenced before assignment

rsem_reference_GRCh38_gencode26_ercc.tar.gz reference file not available through link

The link for this reference file is not functioning:

https://personal.broadinstitute.org/francois/topmed/rsem_reference_GRCh38_gencode26_ercc.tar.gz

Invariant sites are being called eQTLs

I ran the gtex eQTL pipeline on a subset of a panel of genotypes, i.e. the subset for which I have RNA-seq data. I forgot to remove sites in the subset genotype file that are invariant for that set of individuals before I ran the pipeline. I'll remove them and re-run. But first: I noticed that many sites that are invariant in the genotype file were erroneously called eQTLs by the pipeline. Is this is a known issue? Shouldn't the pipeline ignore any invariant sites in the input?

Thanks,
Eric

run_rnaseqc.py issue

Hello, I am testing and received below message. How should I try to fix this? Thank you for the help!

mz@mz:$ docker run --rm -v /RNA:/data -t rnatest:V1 /bin/bash -c "/src/run_rnaseqc.py K.Aligned.sortedByCoord.out.patched.md.bam gencode.v19.chr_patch_hapl_scaff.annotation.patched.gtf Homo_sapiens_assembly19.fasta K --output_dir /data"
[Aug 02 18:40:03] Running RNA-SeQC

command: "java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s K,K.Aligned.sortedByCoord.out.patched.md.bam,K -t gencode.v19.chr_patch_hapl_scaff.annotation.patched.gtf -r Homo_sapiens_assembly19.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores"
RNA-SeQC v1.1.9 06/26/16
Additional GATK flags provided: --allow_potentially_misencoded_quality_scores
Creating rRNA Interval List based on given GTF annotations
Retriving contig names from reference
contig names in reference: 85
Loading GTF for Read Counting
Converting to refGene
Transcript objects to RefGen format: 2 s
Running IntronicExpressionReadBlock Walker ....
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:160)
at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:53)
at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:54)
at org.broadinstitute.cga.rnaseq.gatk.GATKTools.runIntronReadCount(GATKTools.java:203)
at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runRegionCounting(ReadCountMetrics.java:220)
at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runReadCountMetrics(ReadCountMetrics.java:59)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.runMetrics(RNASeqMetrics.java:211)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:162)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:133)
Caused by: java.lang.NullPointerException
at org.reflections.Reflections.scan(Reflections.java:220)
at org.reflections.Reflections.scan(Reflections.java:166)
at org.reflections.Reflections.(Reflections.java:94)
at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:77)
... 9 more
Traceback (most recent call last):
File "/src/run_rnaseqc.py", line 102, in
subprocess.check_call(cmd, shell=True)
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'java -Xmx4g -jar /opt/RNA-SeQC_1.1.9/RNA-SeQC.jar -n 1000 -s K,K.Aligned.sortedByCoord.out.patched.md.bam,K -t gencode.v19.chr_patch_hapl_scaff.annotation.patched.gtf -r Homo_sapiens_assembly19.fasta -o /data -noDoC -strictMode -gatkFlags --allow_potentially_misencoded_quality_scores' returned non-zero exit status 1

utility of bamsync in rnaseq

Hi,

First thanks for maintaining this super useful repository.
In the documentation of RNAseq pipeline, it says

sync BAMs (optional; copy QC flags and read group IDs)

I wonder what's the utility of copying QC flags and read group IDs. Is including this step helping anything?

Thanks!

reference_junctions file

Hi All,

I would like to know where is the reference_junctions file that is required for the script:

gtex-pipeline/rnaseq/src/process_star_junctions.py

If not, how can I generate it ?

what is the input Bam file？

Hi， I am trying to use your RNAseq pipeline:
I have fastq format RNAseq raw data, but I saw the first step is “BAM to fastq” , what is that Bam file?
And I saw there is a step “sync Bams ” after Star alignment, so what is the difference between the original Bam file and the Bam file generated by STAR?
Thank you very much!
HG

Output files from rnaseq v8 pipeline

Hello. I finished running the v8 pipeline with docker on my own RNAseq data and there are 3 gcts generated, including sample.exon_reads.gct, sample.gene_reads.gct and sample.gene_rpkm.gct. I want to compare them with the reference dataset which uses TPM and have questions as follows.

From the reference v8 dataset(https://www.gtexportal.org/home/datasets), does sample.gene_reads.gct correspond to GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct or GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct?
Is there a corresponding reference file for sample.exon_reads.gct?
Can I use the aggregating outputs command (docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8
/bin/bash -c "python3 /src/combine_GCTs.py
${rnaseqc_rpkm_gcts} ${sample_set_id}.rnaseqc_rpkm"). for gcts (other than rpkm_gcts) and rsem files?

Thank you very much for your help!

Error running docker eqtl_prepare_expression.py

Hi,

I run the eqtl_prepare_expression pipeline and got this error:

Does anyone know how to solve this ?

which will be the commando to run runseqc?

to repeat GTEx result

./run_FastQTL_threaded.py GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz #BED# #NAME# --covariates #COVARIATE# --window 1e6 --chunks 100 --threads 16 --threshold 1 --fdr 1 --permute 1000 10000

And for all human genes, each gene has only one eqtl entry output. The tissue I used is the Cells_EBV-transformed_lymphocytes. However, according to the GTEx portal, the tissue has 45,8161. significant variant-gene pair. How can I repeat this result?

Thanks.

when getting 'transcript' or 'gene' from the gtf file.

Hi, I am currently using the eqtl_prepare_expresson.py to prepare my expression data.
I noticed that default setting of the function gtf_to_bed is to get the 'gene' instead of 'transcript', while the script uses 'transcript' to call the function in line 174.

And the number of the feature is largely different in the gtf file:

$ awk '$3=="transcript"' eqtl_data/gencode.v27lift37.annotation.gtf| wc -l
  202697

 $ awk '$3=="gene"' eqtl_data/gencode.v27lift37.annotation.gtf| wc -l
   60461

So I am wondering which is the proper one to use?

missing normalize_expression.py file in qtl/src/ folder

Where is the normalize_expression.py file? There is just eqtl_prepare_expression.py file in the src/ folder. Are they the same file?

significant gene-SNP pairs for FastQTL

Dear,

Thanks for the script for generating the list of significant gene-SNP pairs: https://github.com/francois-a/fastqtl/blob/master/python/annotate_outputs.py

I was wondering what is the workflow for annotate_outputs.py ? How does the annotate_outputs.py work - calculating and filtering at FDR (i.e., 0.05) by combining nominal and permutation pass results? What happens If we apply the multiple testing correction on P-values in nominal pass, will the results be same?

I get an error running run_star.py script

Hi, I am getting an error running the STAR. Can you please help me with this. I downloaded the Index file from here
https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md (STAR index: STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_oh100.tar.gz).

The resultant file gets empty bam file and log file.

docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 /bin/bash -c "/src/run_STAR.py
/data/STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_oh100
/data/${sample_id}_1.fastq.gz
/data/${sample_id}_2.fastq.gz
${sample_id}
--threads 4
--output_dir /data/star_out"
Jun 24 18:16:56 ..... started STAR run
Jun 24 18:16:57 ..... loading genome
Traceback (most recent call last):
File "/src/run_STAR.py", line 92, in
subprocess.check_call(cmd, shell=True, executable='/bin/bash')
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'STAR --runMode alignReads --runThreadN 4 --genomeDir /data/STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_oh100 --twopassMode Basic --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterType BySJout --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 --readFilesIn /data/C42B_1.fastq.gz /data/C42B_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix /data/star_out/C42B. --outSAMstrandField intronMotif --outFilterIntronMotifs None --alignSoftClipAtReferenceEnds Yes --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM Unsorted --outSAMunmapped Within --genomeLoad NoSharedMemory --chimSegmentMin 15 --chimJunctionOverhangMin 15 --chimOutType WithinBAM SoftClip --chimMainSegmentMultNmax 1 --outSAMattributes NH HI AS nM NM ch --outSAMattrRGline ID:rg1 SM:sm1' returned non-zero exit status -9

Regards
Rimpi

Format of ${sample_participant_lookup}

Hello,
According to this instruction:
The file ${sample_participant_lookup} must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the VCF (these can be the same).

I generated the file containing lines like:
sample_id participant_id
0226-SM-5GZZ7 GTEX-1117F
0426-SM-5EGHI GTEX-1117F
0526-SM-5EGHJ GTEX-1117F
0626-SM-5N9CS GTEX-1117F
0726-SM-5GIEN GTEX-1117F
1326-SM-5EGHH GTEX-1117F

The columns are separated by tab.

When using eqtl_prepare_expression.py, I have this error:
raise ValueError('Sample IDs in expression files and participant lookup table must match.')
ValueError: Sample IDs in expression files and participant lookup table must match.

I think the sample-participant file I generated is problematic.

Thanks!

Genotype pcs

Hello,

Could someone tell me what are genotype based principal components in the step 3?
Where I could get this information?

Thanks,
Yanting

Import missing in rnaseq_pipeline_fastq.wdl

I'm getting an error for WDL from Cromwell:

Failed to import workflow https://api.firecloud.org/ga4gh/v1/tools/broadinstitute_gtex:samtofastq_v1-0_BETA/versions/3/plain-WDL/descriptor.:
Bad import https://api.firecloud.org/ga4gh/v1/tools/broadinstitute_gtex:samtofastq_v1-0_BETA/versions/3/plain-WDL/descriptor: Import file not found: 
https://api.firecloud.org/ga4gh/v1/tools/broadinstitute_gtex:samtofastq_v1-0_BETA/versions/3/plain-WDL/descriptor

It doesn't look like the import URL in https://github.com/broadinstitute/gtex-pipeline/blob/master/rnaseq/rnaseq_pipeline_fastq.wdl are valid. For example, https://api.firecloud.org/ga4gh/v1/tools/broadinstitute_gtex:rsem_v1-0_BETA/versions/3/plain-WDL/descriptor gives a 404 error.

May I ask if this is an expected behaviour? How should I run the WDL script with those imports missing?

Doesn't work with Ensemble GTF files?

Program works fine with Gencode but not Ensemble? How do I get it to work?

Steps to reproduce

wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
mkdir BUG
pigz -d Homo_sapiens.GRCh38.97.gtf.gz > BUG/Homo_sapiens.GRCh38.97.gtf
python3 SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py BUG/Homo_sapiens.GRCh38.97.gtf BUG/Homo_sapiens.GRCh38.97_compressed.gtf

Error

Traceback (most recent call last):
  File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 249, in <module>
    annotation = Annotation(args.transcript_gtf)
  File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 68, in __init__
    g = Gene(gene_id, attributes['gene_name'], attributes['gene_type'], chrom, strand, start_pos, end_pos)
KeyError: 'gene_type'

“RuntimeError: Invalid interval bounds!” when handle the UCSC mappability file in the ASE pipeline

Hello, I am testing the ASE pipeline using GTEx data. When I run the ase_aggregate_by_individual script and add the mappability file using pyBigWig, I got the error "RuntimeError: Invalid interval bounds!". It seems that some SNPs are out of the chromosome regions. I do not know how to fix it and beg for your help. Thanks .

How to get chromosome list

Hello,

I do not quite understand how to get a chromosome list from a vcf file. Could you provide some suggestions?

Thanks

Homo_sapiens_assembly19

I am wondering why the assembly in v8 is named Homo_sapiens_assembly19 in rnaseq folder under building indexes. Shouldnt it be hg38 ? The original documentation on GTEX website says they used v26 of gencode and hg38 genome. Here it looks opposite:

docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "STAR \
        --runMode genomeGenerate \
        --genomeDir /data/star_index_oh75 \
        --genomeFastaFiles /data/Homo_sapiens_assembly19.fasta \
        --sjdbGTFfile /data/gencode.v19.annotation.patched_contigs.gtf \
        --sjdbOverhang 75 \
        --runThreadN 4"

Error during Fastq conversion

I'm trying the rnaseq pipeline with docker. At the fastq conversion step, the following error occurred. Would be grateful if you can advise. Thank you very much!

sudo docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \

/bin/bash -c "/src/run_SamToFastq.py /data/$input_bam -p ${sample_id} -o /data"

Stranded setting in RSEM and RNASeQC

According to the instruction at https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq, when running RSEM, there will be a strand-specific setting. However, in the usage in the RNASeQC, we don't specify that. Could you explain why? Thanks very much!

Multi mapped reads

Hi,

I found that the outFilterMultimapNmax is set to 20 in STAR alignments. I could not find anywhere in the code that only uniquely mapped reads are considered for downstream analysis of splicing. May I know which step actually retains uniquely mapped reads only ? For example, to generate leafcutter junction files, I could not find if only uniquely mapped reads are considered ? Is is the WASP sam tag vW:1 that does extract only uniquely mapped reads ?

For rnaseqc, the default value of --mapping-quality is 255, so presumably, for gene expression quantifications, rnaseqc is considering only uniquely mapped reads through --mapping-quality set to 255.

Permission denied run_rnaseqc.py

Hello, I received below error testing run_rnaseqc.py in V8. Other components work out fine.

"/bin/bash: /src/run_rnaseqc.py: Permission denied"

Maybe this is an isolated issue, any suggestions? Thanks!

low correlation of gene count between HTSeq and RNA-SeQC

Hi,

I am trying to adopt GTEx pipeline to replace the pipeline I have now. But when I try HTSeq and RNA-SeQC on the same set of BAM and annotation file. I find there is a difference between the two.
For HTSeq (https://htseq.readthedocs.io/en/release_0.11.1/), I am using the following script

python3 -m HTSeq.scripts.count \
--stranded=no \
--mode=union \
        --type=gene ${BAM} ${GENEGTF} >> gene_union_count

And for RNA-SeQC v2(https://github.com/broadinstitute/rnaseqc), I am using the following script

${RNASEQC} ${GENE_GTF} ${BAM} ${OUT_DIR} --sample=${sample}

After I correlate the gene count output from two methods (${sample}.gene_reads.gct. I find very poor correlation (pearson correlation=~0.2, spearman correlation=~0.8). Also the coverage (sum of the gene count) is different (HTSeq ~10mil, RNA-SeQC ~6.5mil). HTSeq gives me the approximate coverage I expected.

Do you have idea on what's could be the possible cause for this difference? Thanks a lot in advance!

run_FastQTL_threaded.py

I got the genotype data from dbGap:
phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz
(It contains all genotypes from all tissues and individuals)

I also successfully generated the *.expression.bed.gz of one tissue as
well as covariates and tbi files etc.

Then I run the command ./fastqtl/python/run_FastQTL_threaded.py ./PhenoGenotypeFiles/RootStudyConsentSet_phs000424.GTEx.v8.p2.c1.GRU/GenotypeFiles/phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz ./tissue9_size176.expression.bed.gz tissue9_size176 --covariates ./tissue9_size176.combined_covariates.txt --window 1e6 --chunks 100 --threads 8 --permute 1000 10000

I tried 8 threads or even 1 thread.

The error message is

Processing chunk 99
Processing chunk 100
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/ysm-gpfs/apps/software/Python/3.7.0-fosscuda-2018b/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/ysm-gpfs/apps/software/Python/3.7.0-fosscuda-2018b/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/ysm-gpfs/pi/gerstein/jx98/bin/gtex_eqtl/gtex-pipeline-master/fastqtl/python/run_FastQTL_threaded.py", line 54, in perm_worker
s = subprocess.check_call(cmd, shell=True, executable='/bin/bash', stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
File "/ysm-gpfs/apps/software/Python/3.7.0-fosscuda-2018b/lib/python3.7/subprocess.py", line 328, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command './fastqtl/bin/fastQTL --vcf ./PhenoGenotypeFiles/RootStudyConsentSet_phs000424.GTEx.v8.p2.c1.GRU/GenotypeFiles/phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz --bed ./tissue9_size176.expression.bed.gz --window 1e6 --maf-threshold 0.0 --ma-sample-threshold 0 --interaction-maf-threshold 0 --cov ./tissue9_size176.combined_covariates.txt --permute 1000 10000 --chunk 1 100 --out tissue9_size176_chunk001.txt.gz --log tissue9_size176_chunk001.log' returned non-zero exit status 127.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "./fastqtl/python/run_FastQTL_threaded.py", line 95, in
assert res.get()[0]==0
File "/ysm-gpfs/apps/software/Python/3.7.0-fosscuda-2018b/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
subprocess.CalledProcessError: Command './fastqtl/bin/fastQTL --vcf ./PhenoGenotypeFiles/RootStudyConsentSet_phs000424.GTEx.v8.p2.c1.GRU/GenotypeFiles/phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz --bed ./tissue9_size176.expression.bed.gz --window 1e6 --maf-threshold 0.0 --ma-sample-threshold 0 --interaction-maf-threshold 0 --cov ./tissue9_size176.combined_covariates.txt --permute 1000 10000 --chunk 1 100 --out tissue9_size176_chunk001.txt.gz --log tissue9_size176_chunk001.log' returned non-zero exit status 127.

I tried to import multprocessing module of pythong. It works. Could you help on debugging the command run? Or is the error due to data format?

Thanks.

CANNOT find normalize_expression.py

Hello,

I cannot find the normalize_expression.py in the first step of qtl/README.md

Identifying the list of all significant variant-gene pairs

Hi,
Thank you very much for sharing all these scripts.
Is the script used for Identifying the list of all significant variant-gene pairs in this folder?
If so, which one is it?

I am referring to the script that does the following (quoted from GTEX 2017 paper supplemental methods):
"To identify the list of all significant variant-gene pairs associated with eGenes, a genome-wide empirical P-value threshold, pt was defined as the empirical P-value of the gene closest to the 0.05 FDR threshold. pt was then used to calculate a nominal P-value threshold for each gene based on the beta distribution model (from FastQTL) of the minimum P-value distribution f(pmin) obtained from the permutations for the gene. Specifically, the nominal threshold was calculated as F −1(pt), where F −1 is the inverse cumulative distribution. For each gene, variants with a nominal P-value below the gene-level threshold were considered significant and included in the final list of variant-gene pairs."

Eldad

unindexed BAMs

Hi, I am using the RNA-Seq pipeline. After running STAR and mark duplicates, when I am using RNA-SeQC, I got the following error which says the BAM files are unindexed. I see there are sort and index commands in the run_STAR.py, but it doesn't work.
(error: Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in.)

Also, when I try to index the BAM files by samtools, I got the error that says my bam file may be corrupted or unsorted.

How should I try to fix this? Thank you for your help!

QC RNA-seq expression outliers

Hi, I have a question regarding the QC of the Gtex eQTL pipeline.

There is one step mentioned on the GtexPortal (https://gtexportal.org/home/documentationPage#staticTextAnalysisMethods) that I would like to use on my data:

"1. RNA-seq expression outliers were identified and excluded using a multidimensional extension of the statistic described in (Wright et al., Nat. Genet. 2014 ). Briefly, for each tissue, read counts from each sample were normalized using size factors calculated with DESeq2 and log-transformed with an offset of 1; genes with a log-transformed value >1 in >10% of samples were selected, and the resulting read counts were centered and unit-normalized. The resulting matrix was then hierarchically clustered (based on average and cosine distance), and a chi2 p-value was calculated based on Mahalanobis distance. Clusters with ≥60% samples with Bonferroni-corrected p-values <0.05 were marked as outliers, and their samples were excluded."

Is there a script or some code available that I could use to perform these steps?

Thank you for your help!

Missing broadinstitute/gtex_rnaseq:V7

The documentation mentions docker for hg19 on "broadinstitute/gtex_rnaseq:V7". However, there is no such docker.

https://hub.docker.com/r/broadinstitute/gtex_rnaseq/tags/

There is "latest" and "V8". Does "latest" mean "V7"?

combine covariates

According to https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl#3-combine-covariates,

This step generates a combined covariates file, containing genotype PCs, PEER factors, and additional explicit covariates (e.g., genotyping platform).

I did not find how to generate genotype PCs in the GitHub document and GTEx portal. Could you provide more information on that?

Thanks.

PEER Covariates

Hi,

Is the Rscript provided for PEER factor calculation accounting for already known cofounders? (such as gender and batch). If not, wouldn't we adjust the expression matrix multiple times for the same covariates if PEER is already detecting such cofounders? (and because the next step in the pipeline is combining known covariates with PEER factors)

I've modified the script with "p <- add_argument(p, "--covariates", help="")" and "invisible(PEER_setCovariates(model, argv$covariates))" to avoid this for example, but not sure if that's going to work.

Thanks,
Fahri

broadinstitute / gtex-pipeline Goto Github PK

gtex-pipeline's People

Contributors

Stargazers

Watchers

Forkers

gtex-pipeline's Issues

RSEM transcript quantification

Recommend Projects

Recommend Topics

Recommend Org