genepi / nf-gwas
A Nextflow pipeline to perform state-of-the-art genome-wide association studies.
Home Page: https://genepi.github.io/nf-gwas
License: MIT License
Hi team,
Since we're already evaluating the use of bundled JARs in the container, I went ahead and tested the next steps (once we're past the jar issues).
The problem here seems to be the use of empty files, for example NO_COV_LOG, here:
Line 46 in 6367cc8
This works fine on real filesystems; however, since the cloud also offers the option of running a pipeline with blob storage as a backend (an Azure Blob container or AWS S3), this expectation fails there.
I'm currently testing this on a separate internal branch and will add it here for review once that's done.
UPDATE: I've merged that code into #46.
Hi all,
I have an annotation file that is a .tsv.gz with 1 million rows (variants), and the annotation step after REGENIE steps 1 & 2 seems to take a long time (NF_GWAS:SINGLE_VARIANT_TESTS:ANNOTATION:ANNOTATE_RESULTS).
Currently, 1 million rows take about 44 minutes on 8 CPUs and 4 GB, which seems long relative to the actual REGENIE steps 1 & 2. I looked at the Java source code, and it seems a bit inefficient: row-by-row sequential processing in a single thread. Do you think this step can be multi-threaded or sped up in any way?
Thanks
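One workflow-level idea, rather than changing the Java tool itself, would be to split the results into chunks and annotate them in parallel. A minimal sketch, assuming a hypothetical ANNOTATE_CHUNK process and params.regenie_results path (neither is part of the pipeline as shipped):

// Hedged sketch: annotate 100k-row chunks in parallel, then re-merge.
Channel
    .fromPath(params.regenie_results)                    // hypothetical param
    .splitText(by: 100000, keepHeader: true, file: true) // one file per chunk
    .set { chunks_ch }

annotated_ch = ANNOTATE_CHUNK(chunks_ch)                 // hypothetical process
merged_ch    = annotated_ch.collectFile(name: 'annotated.txt', keepHeader: true)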
For documentation, it might be useful to include the step 1 pruning and QC settings (e.g. prune_window_kbsize, qc_hwe, etc.) in the report.
Hi,
Apologies for the somewhat clueless question, as I am not a bioinformatician, but I want to stress test this Nextflow pipeline and bring it into the cloud for some in-house projects.
I see that you have provided example files for testing at this path:
genotypes_prediction = "$projectDir/tests/input/pipeline/example.{bim,bed,fam}"
To my understanding, those files contain 500 positions/SNPs; it is a small file just for testing that the pipeline works rather than crashing or erroring. I want to inflate this file to have more data, e.g. 500k SNPs or even 5 million, run it on a large machine in the cloud, and check the runtimes, compute costs and so on.
Is there an easy way for me to do this? I do not need the data to be "real".
I tried looking for public datasets, but I think you have to sign researcher agreements, since these data tend to be very confidential/sensitive (people's genomes, after all); basically I have not had success searching the web.
The other problem is that these are binary files, which seem to be produced by another program from upstream data, so I can't easily write them myself (as opposed to the TSVs, for example, which I could easily inflate with Python or even Linux commands).
Thanks a lot.
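If it helps, PLINK 1.9 can synthesize arbitrarily large fake .bed/.bim/.fam trios with its --dummy flag; the genotypes are random, so they are scientifically meaningless but fine for runtime and cost testing. A minimal sketch as a throwaway Nextflow process (assuming plink is on the PATH; the sample and variant counts are placeholders):

// Hedged sketch: generate 1,000 samples x 500,000 variants with ~1%
// missing genotypes, written as a standard PLINK binary file set.
process MAKE_DUMMY_GENOTYPES {
    output:
    path "dummy.{bed,bim,fam}"

    script:
    """
    plink --dummy 1000 500000 0.01 --make-bed --out dummy
    """
}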
Currently, genes of significant loci are not shown in the Manhattan plot. Either add genes for the genome-wide significance level or for the threshold chosen for the annotated file.
If a required parameter is missing, the message "Please specify all required parameters" appears. It would help to mention which parameter is missing.
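A hedged sketch of how the validation could name the offender (the required list here is illustrative, not the pipeline's full set):

// Hedged sketch: report exactly which required params are unset.
def required = ['genotypes_prediction', 'genotypes_association', 'phenotypes_filename']
def missing  = required.findAll { params[it] == null }
if (missing) {
    exit 1, "Please specify the missing required parameter(s): ${missing.join(', ')}"
}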
The output summary statistics are sorted in lexical order rather than by chromosome and position, which can raise errors when the file is used directly as input to PheWeb.
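A possible post-processing step, sketched as a Nextflow process (a hedged sketch, not part of the pipeline; CHROM and GENPOS are REGENIE's default first two columns, and GNU sort's -V gives natural chromosome order):

// Hedged sketch: coordinate-sort the merged output so it can be
// bgzipped/tabix-indexed and passed to PheWeb directly.
process SORT_BY_COORDINATE {
    input:
    path regenie_file

    output:
    path "sorted.regenie.gz"

    script:
    """
    zcat ${regenie_file} > tmp.regenie
    (head -n 1 tmp.regenie; tail -n +2 tmp.regenie | sort -k1,1V -k2,2n) | gzip > sorted.regenie.gz
    rm tmp.regenie
    """
}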
Hi, I am on Ubuntu 22.04, trying to run the test workflow.
I am using Nextflow 23.10.1 from https://www.nextflow.io/docs/latest/getstarted.html#installation, installed into my /home/<username>/ directory, with the path added to $PATH so that I can run nextflow from anywhere in my filesystem.
This is with Java 17.0.6 (OpenJDK Runtime Environment Corretto-17.0.6.10.1 (build 17.0.6+10-LTS)) installed via SDKMAN!, as advised by the Nextflow documentation.
Now, I just git clone the repo, then execute nextflow run genepi/nf-gwas -r v1.0.4 -profile test,docker.
I had to make my cloned GitHub directory writable to all users with chmod 777 -R <cloned_repo_dir>, which I thought was strange, but if I don't do this, Nextflow fails, saying it can't create the work directories within the folder, e.g. /home/minhtoo/repos/nf-gwas/work/e1/93c6238c11c0659e5b90902261c4e1.
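For what it's worth, a possible alternative to chmod 777, assuming the permission errors come from container-created files being owned by root, is to make Nextflow run the Docker containers with the host user's UID/GID:

// nextflow.config — hedged sketch, not the pipeline's shipped config.
docker {
    enabled    = true
    runOptions = '-u $(id -u):$(id -g)'
}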
It prints the log below but seems to get stuck, as the 0% never advances (it's been at least 20 minutes). I also have the .nextflow.log file contents in this link: CLICK ME; they do not raise any errors:
N E X T F L O W ~ version 23.10.1
Pulling genepi/nf-gwas ...
downloaded from https://github.com/genepi/nf-gwas.git
Launching `https://github.com/genepi/nf-gwas` [grave_bardeen] DSL2 - revision: 7927c67696 [v1.0.4]
Downloading plugin [email protected]
Core Nextflow options
revision : v1.0.4
runName : grave_bardeen
containerEngine : docker
container : quay.io/genepi/nf-gwas:v1.0.4
launchDir : /home/minhtoo/repos/nf-gwas
workDir : /home/minhtoo/repos/nf-gwas/work
projectDir : /home/minhtoo/.nextflow/assets/genepi/nf-gwas
userName : minhtoo
profile : test,docker
configFiles :
Input/output options
outdir : default
All Tests
project : test-gwas
genotypes_association : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/example.bgen
genotypes_association_format: bgen
phenotypes_filename : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/phenotype.txt
phenotypes_columns : Y1,Y2
genotypes_prediction : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/example.{bim,bed,fam}
Single-variant tests only
regenie_min_imputation_score: 0.00
General options
validate_params : true
R report
annotation_min_log10p : 2
plot_ylimit : 0
Other options
genotypes_build : hg19
rsids_filename : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/rsids.tsv.gz
!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
WARN: Option genotypes_build is deprecated. Please use association_build instead.
executor > local (2)
[67/2c9d8b] process > NF_GWAS:SINGLE_VARIANT_TESTS:INPUT_VALIDATION:VALIDATE_PHENOTYPES [ 0%] 0 of 1
[e1/93c623] process > NF_GWAS:SINGLE_VARIANT_TESTS:QUALITY_CONTROL:QC_FILTER_GENOTYPED (1) [ 0%] 0 of 1
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP1:REGENIE_STEP1_RUN -
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP1:REGENIE_LOG_PARSER_STEP1 -
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP2:REGENIE_STEP2_RUN -
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP2:REGENIE_LOG_PARSER_STEP2 -
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:ANNOTATION:ANNOTATE_RESULTS -
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:MERGE_RESULTS -
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:FILTER_RESULTS -
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REPORTING:REPORT -
[- ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REPORTING:REPORT_INDEX -
Currently, gene start and end (and consequently also distance) don't match the start and end reported for these genes in other sources.
Hi,
I got the following error message:
`Missing process or function exit([1, Parameter genotypes_prediction is required.])`
with stacktrace:
Caused by: groovy.lang.MissingMethodException: No signature of method: static WorkflowMain.exit() is applicable for argument types: (Integer, String) values: [1, Parameter genotypes_prediction is required.]
Possible solutions: wait(), wait(long), wait(long, int), getAt(java.lang.String), grep(), dump()
at groovy.lang.MetaClassImpl.invokeStaticMissingMethod(MetaClassImpl.java:1570)
at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1556)
at org.codehaus.groovy.runtime.callsite.StaticMetaClassSite.callStatic(StaticMetaClassSite.java:62)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:55)
(...)
Using NF 23.10.0.
It looks like Nextflow.exit (https://github.com/nextflow-io/nextflow/blob/master/modules/nextflow/src/main/groovy/nextflow/Nextflow.groovy#L196) doesn't accept an array/vector argument.
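If so, the hedged fix is presumably to pass the two arguments unpacked instead of as a single list:

// Hedged sketch: exit expects (int, String), not a List.
exit 1, "Parameter genotypes_prediction is required."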
When running the pipeline, the generated report for some of the phenotypes does not always contain the QQ plot, instead showing:
QQ Plot
Error in if (is.unsorted(-ypvs)) ypvs = sort.int(ypvs, decreasing = TRUE): missing value where TRUE/FALSE needed
Error in methods::is(x, "qqPlotInfo"): object 'qqPrepare' not found
I looked into my results, and it seems to happen for phenotypes where the p-values for some of the variants are NaN.
EDIT: I forgot to add the logs, my apologies.
./work/08$ cat 6337d26544fe60eefdc643c500cec7/.command.sh
#!/bin/bash -ue
java -jar /opt/genomic-utils.jar csv-filter --separator ' ' --output-sep ' ' --input BlutdrucksystolischmmHg.regenie.gz --limit 2 --filter-column "LOG10P" --gz --output BlutdrucksystolischmmHg.regenie.tmp
csvtk sort BlutdrucksystolischmmHg.regenie.tmp -t -kLOG10P:nr | gzip > BlutdrucksystolischmmHg.regenie.filtered.gz
rm BlutdrucksystolischmmHg.regenie.tmp
./work/08$ cat 6337d26544fe60eefdc643c500cec7/.command.log
Picked up JAVA_TOOL_OPTIONS: -Djdk.lang.Process.launchMechanism=vfork
genomic-utils 0.3.7
https://github.com/genepi/genomic-utils
(c) 2023 Lukas Forer and Sebastian Schönherr
Ignoring line 11189. Value 'NA' cannot be parsed.
Ignoring line 31339. Value 'NA' cannot be parsed.
Ignoring line 33364. Value 'NA' cannot be parsed.
Ignoring line 34436. Value 'NA' cannot be parsed.
Ignoring line 35160. Value 'NA' cannot be parsed.
Ignoring line 40564. Value 'NA' cannot be parsed.
Ignoring line 52187. Value 'NA' cannot be parsed.
Ignoring line 78387. Value 'NA' cannot be parsed.
Ignoring line 96562. Value 'NA' cannot be parsed.
Ignoring line 98599. Value 'NA' cannot be parsed.
Ignoring line 98964. Value 'NA' cannot be parsed.
Ignoring line 106249. Value 'NA' cannot be parsed.
Ignoring line 152650. Value 'NA' cannot be parsed.
Ignoring line 153719. Value 'NA' cannot be parsed.
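Until NA values are handled natively, one hedged workaround is to drop such rows before the filter and report steps; a sketch as a Nextflow process (the LOG10P column index is looked up from the header rather than hard-coded):

// Hedged sketch: strip rows whose LOG10P is NA so neither csv-filter
// nor the report's Double.parseDouble sees an unparseable value.
process DROP_NA_PVALUES {
    input:
    path regenie_gz

    output:
    path "nona.regenie.gz"

    script:
    """
    zcat ${regenie_gz} \\
        | awk 'NR==1 { for (i=1; i<=NF; i++) if (\$i == "LOG10P") c = i; print; next } \$c != "NA"' \\
        | gzip > nona.regenie.gz
    """
}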
Provide a QQ plot in the report stratified by MAF.
Currently there is no way to run the pipeline if more than 1M variants are included in step 1.
Gene-level analyses take a long time to run on big imputed datasets.
This is now used differently compared to older pipeline versions. 7.3 is better for the report module (top loci definition); is it also OK for the report_index module?
Hi, I'm trying to use the Nextflow pipeline version 0.6.3, and when the pipeline tries to download the rsID file, I get this error:
Process NF_GWAS:DOWNLOAD_RSIDS is unable to find [UnixPath]: /shares/CIBIO-Storage/NB/scratch/sfari_new/WGS123_complete_run_cov-pheno/c4/06e662ebb16c20061e9616ad88159b/rsids-v154-null.index.gz (pattern: rsids-v154-null.index.gz)
Jun-14 10:34:12.791 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=NF_GWAS:DOWNLOAD_RSIDS; work-dir=/shares/CIBIO-Storage/NB/scratch/sfari_new/WGS123_complete_run_cov-pheno/c4/06e662ebb16c20061e9616ad88159b
error [nextflow.exception.MissingFileException]: Missing output file(s) rsids-v154-null.index.gz
expected by process NF_GWAS:DOWNLOAD_RSIDS
Jun-14 10:34:12.820 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NF_GWAS:DOWNLOAD_RSIDS'
Caused by:
Missing output file(s) rsids-v154-null.index.gz
expected by process NF_GWAS:DOWNLOAD_RSIDS
I'm attaching the config file for reference and the entire log.
.nextflow.log.7.txt
gwas_conf.conf.txt
Thanks a lot.
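One hedged observation: the "null" in rsids-v154-null.index.gz suggests the genome-build parameter resolved to null when the rsID filename was composed, so explicitly setting the build in the config may help:

// Hedged sketch for the config; use the build matching your data
// (this parameter was later renamed association_build).
params {
    genotypes_build = 'hg19'
}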
An error is thrown when using PLINK file sets split by chromosome in the gene-level analysis.
In particular, for:
genotype_association=<mydir>/chr*.{bim,bed,fam}
there is an error in REGENIE_STEP2_GENE_TESTS:
chr1.bed Not Found
When running with:
genotype_association=<mydir>/chr1.{bim,bed,fam}
everything works fine.
The report process returns this error when the REGENIE test fails and LOG10P is NA:
java.lang.NumberFormatException: For input string: "NA"
at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
at java.base/jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.base/java.lang.Double.parseDouble(Double.java:651)
at genepi.io.table.reader.AbstractTableReader.getDouble(AbstractTableReader.java:38)
at genepi.io.table.reader.AbstractTableReader.getDouble(AbstractTableReader.java:30)
at genepi.genomic.utils.commands.GwasReportCommand.createHtmlReport(GwasReportCommand.java:316)
at genepi.genomic.utils.commands.GwasReportCommand.call(GwasReportCommand.java:216)
at genepi.genomic.utils.commands.GwasReportCommand.call(GwasReportCommand.java:31)
at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
at picocli.CommandLine.access$1500(CommandLine.java:148)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
at picocli.CommandLine.execute(CommandLine.java:2170)
at genepi.genomic.utils.App.main(App.java:38)
See also: genepi/genomic-utils#5
-firth 0.01 --approx is supported for binary traits.

Dear developers,
Regenie 3 allows gene-based testing (https://rgcgithub.github.io/regenie/options/#gene-based-testing), but it requires adjusting the step 2 configuration. Is it already possible to run such an analysis with this Nextflow pipeline?
If not, what do you suggest? I also attach the example step configurations, highlighting in red the differences between the two sets of analysis parameters.
All the best,
Giacomo
The pipeline looks like it is optimized for processing imputed VCF data from the Michigan or TOPMed imputation servers, which generate a DS field. Is it possible to run the pipeline on GATK WGS sequencing data without the DS field? Or does that need to be calculated from the PL field and written out to PLINK format before running the pipeline?
Hi again,
I have a question about the exact configuration of your runs for the experiments reported in the paper, which were conducted on either AWS Batch or SLURM.
For these large-scale GWAS runs, do you provide the input as single files (e.g. one giant phenotype.txt with N columns for N phenotypes, one giant .vcf.gz containing all variants, three giant .bed/.bim/.fam files), prepare a single config file, and then execute the Nextflow run?
Also, how do you specify the resources for AWS Batch? Since no single AWS machine has 450 CPUs, surely the work has to be split across smaller machines (e.g. 32 CPUs per machine). Or, alternatively, do you split the input files by phenotype, so that with 10 phenotypes you execute 10 separate Nextflow runs?
Basically, I am just curious about how you specified the total resources and orchestrated the entire workload.
Sorry for the repeated questions; I loved reading your paper, and the code is extremely well written, hence my interest in this work!
Thanks a lot
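For context, this is roughly how per-process resources are declared for AWS Batch in a Nextflow config. A hedged sketch: the queue, bucket, region and cpus/memory values are placeholders, not the paper's actual settings. Nextflow submits each task individually and Batch autoscales smaller instances to fit, so no single 450-CPU machine is needed:

// nextflow.config — hedged sketch of an AWS Batch setup.
workDir    = 's3://my-bucket/work'   // hypothetical bucket
aws.region = 'eu-west-1'             // hypothetical region

process {
    executor = 'awsbatch'
    queue    = 'my-batch-queue'      // hypothetical Batch queue

    // Per-process resources; Batch picks instances that fit each task.
    withName: 'REGENIE_STEP1_RUN' {
        cpus   = 16
        memory = '64 GB'
    }
}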
The ./bin folder is designed to hold small, self-contained executable scripts used by the Nextflow processes, ideally below 1 MB. The nf-gwas workflow contains a 22 MB jar file there, which does not seem to be called directly in the pipeline.
Is this jar file still necessary for the workflow? If not, I would recommend removing it. If the jar file is required by one of the executables contained in the Docker image, it would make sense for tabix-merge.jar to be included in the Docker image as well.
Hi all, can you please help with my understanding here? I have large-scale individual WGS VCF files and want to run the nf-gwas pipeline on the full dataset. I have the Nextflow setup and the infrastructure to handle data of this scale: ~50 nodes with 64 CPUs and 256 GB RAM each.
Hi genepi team,
Thanks for the neat pipeline!
I was able to successfully run the pipeline locally using the -profile test,docker profiles; however, it seems that jbang's native behavior of downloading JARs on the fly might not be well suited for the cloud environment.
When I tried to run the pipeline on the cloud (both Azure and AWS Batch) via the Nextflow CLI, adding the cloud-specific configs (azure.config) and invoking
$ nextflow -c ./azure.config run https://github.com/genepi/nf-gwas -profile test,docker,azb -r main -latest
I kept running into the following issue in the initial caching process:
Error executing process > 'NF_GWAS:CACHE_JBANG_SCRIPTS'
Caused by:
Process `NF_GWAS:CACHE_JBANG_SCRIPTS` terminated with an error exit status (1)
Command executed:
jbang export portable -O=RegenieLogParser.jar RegenieLogParser.java
jbang export portable -O=RegenieFilter.jar RegenieFilter.java
jbang export portable -O=RegenieValidateInput.jar RegenieValidateInput.java
Command exit status:
1
Command output:
(empty)
Command error:
Exception in thread "main" dev.jbang.cli.ExitException: Resource could not be copied from class path: jbang.properties
at dev.jbang.source.resolvers.ClasspathResourceResolver.getClasspathResource(ClasspathResourceResolver.java:66)
at dev.jbang.source.resolvers.ClasspathResourceResolver.resolve(ClasspathResourceResolver.java:31)
at dev.jbang.source.resolvers.CombinedResourceResolver.lambda$resolve$0(CombinedResourceResolver.java:26)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.base/java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:127)
at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:502)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:488)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:150)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:543)
at dev.jbang.source.resolvers.CombinedResourceResolver.resolve(CombinedResourceResolver.java:28)
at dev.jbang.source.ResourceRef.forResource(ResourceRef.java:136)
at dev.jbang.Configuration.getBuiltin(Configuration.java:265)
at dev.jbang.Configuration.defaults(Configuration.java:218)
at dev.jbang.Configuration.getMerged(Configuration.java:251)
at dev.jbang.Configuration.instance(Configuration.java:196)
at dev.jbang.cli.JBang$ConfigurationResourceBundle.getKeys(JBang.java:212)
at picocli.CommandLine$Model$Messages.keys(CommandLine.java:11250)
at picocli.CommandLine$Model$Messages.<init>(CommandLine.java:11234)
at picocli.CommandLine$Model$CommandSpec.setBundle(CommandLine.java:6346)
at picocli.CommandLine$Model$CommandSpec.resourceBundle(CommandLine.java:6342)
at picocli.CommandLine.setResourceBundle(CommandLine.java:3319)
at dev.jbang.cli.JBang.getCommandLine(JBang.java:227)
at dev.jbang.cli.JBang.getCommandLine(JBang.java:107)
at dev.jbang.Main.main(Main.java:12)
I suspect that this might have to do with how jbang relies on downloading jar dependencies into a local lib folder, which is not available to tasks on other nodes (or other instances of the container) in a multi-node setting.
Allow me to share a couple of suggestions which might be worth considering.
Use a compiled shadow-jar/uber-jar (a single jar with all dependencies baked in), so that the lib folder does not need to be available to the downstream processes that rely on these cached jar files (see the sketch at the end of this message).
Another alternative, perhaps with less effort, is to bake the compiled jar files into the container itself, since the tool is already available in the container. This way we can ensure that the dependencies (i.e. the lib folder) as well as the Regenie* JARs are all available within the container instances across different nodes.
I'd be happy to test the pipeline on the cloud and to discuss any changes which might be necessary for making the pipeline optimal (hardware config etc) for the cloud setting.
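To make the first suggestion concrete, here is a hedged sketch of a Gradle build that produces such a self-contained jar (not the project's actual build file; the main class is illustrative):

// build.gradle — hedged sketch using the Shadow plugin to bundle all
// dependencies into one jar, removing the need for jbang's lib folder.
plugins {
    id 'java'
    id 'application'
    id 'com.github.johnrengelman.shadow' version '8.1.1'
}

application {
    mainClass = 'RegenieLogParser'   // hypothetical main class
}

// Running `gradle shadowJar` then yields build/libs/<project>-all.jar,
// which could be copied into the Docker image at build time.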
If a set list contains a key containing the string "CHROM", the header of the summary statistics will span multiple lines and tabix fails to create the index.
It would be great if the annotated output also included the rsIDs of the SNPs.
Hi again,
I am having some trouble specifying the genotypes_prediction parameter on AWS. The problem is that it expects 3 files together as a pattern, e.g. "example.{bim,bed,fam}", but AWS S3 doesn't recognise this path format, and my run is failing on AWS because of it.
To solve this problem, I'd like to create 3 separate input params: genotypes_prediction_bim, genotypes_prediction_bed and genotypes_prediction_fam. This way, I can just provide the absolute path to all 3 files without patterns, and it will work on any platform.
I understand this involves changing the fromFilePairs channel into 3 separate fromPath channels, but I'm having trouble figuring out how to structure the Nextflow code so that it gives the same channel output as fromFilePairs, which I believe is something like [filename, [path_bim, path_bed, path_fam]].
I have never coded in Nextflow before, so apologies if this is too simple a problem.
Any assistance would be greatly appreciated!
Thank you so much
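A minimal sketch of the channel wiring, untested against the pipeline: the param names are the ones proposed above, and since fromFilePairs sorts the grouped files alphabetically (bed, bim, fam), the list below follows that order; adjust it if the downstream process expects something different.

// Hedged sketch: build a fromFilePairs-shaped tuple, [id, [files]],
// from three explicit path params that work on S3/Azure as-is.
bim_ch = Channel.fromPath(params.genotypes_prediction_bim, checkIfExists: true)
bed_ch = Channel.fromPath(params.genotypes_prediction_bed, checkIfExists: true)
fam_ch = Channel.fromPath(params.genotypes_prediction_fam, checkIfExists: true)

prediction_ch = bim_ch
    .combine(bed_ch)
    .combine(fam_ch)
    .map { bim, bed, fam -> tuple(bim.simpleName, [bed, bim, fam]) }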