human-pangenomics / hpp_production_workflows
WDLs and Dockerfiles for assembly QC process
License: MIT License
Hi all,
I want to use the Ensembl Mapping Pipeline from the HPRC paper (A Draft Human Pangenome Reference) to evaluate my assemblies, but I could not find the source code in this repository. Is this pipeline not publicly available? Or am I missing something?
BTW, the pipeline is described on page 46 of the HPRC manuscript and in Supplementary Figure 46.
Thanks!
Best,
Peng
Hi all,
I want to use the Ensembl Mapping Pipeline for annotation from this paper (A draft human pangenome reference), but I could not find the source code in this repository. Is this pipeline not publicly available? Or am I missing something?
The paper contains only a description of the process and some code, which is not enough for me to run the Ensembl Mapping Pipeline annotation method.
Thanks!
Best,
Tony
I see the .dockstore.yml. What is the URL on https://dockstore.org/ for this repository? I am interested in seeing the page. Thanks.
Various QC WDLs, such as QC/tasks/quast.wdl, reference their docker image dependency using the :latest tag. This makes it ambiguous which image is required, because latest is a moving tag whose target changes over time. This can cause a couple of issues.
Recommend that all images be referenced in WDL by an unambiguous tag or by their actual digest. That will create an unambiguous dependency.
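To find the affected references, something like the following sketch could be run over each WDL file. It is only a heuristic: it assumes image names appear as double-quoted string literals ending in :latest, and the WDL fragment in `sample` is a made-up illustration rather than the repository's actual task definition. The function name `find_latest_images` is likewise hypothetical.

```python
import re

# Heuristic: any double-quoted literal ending in ":latest" is treated as an
# image reference. This covers both `docker: "x:latest"` and
# `String dockerImage = "x:latest"` style declarations.
LATEST_REF = re.compile(r'"([^"\s]+):latest"')


def find_latest_images(wdl_text):
    """Return the sorted, de-duplicated image names that use the :latest tag."""
    return sorted(set(LATEST_REF.findall(wdl_text)))


# Illustrative WDL fragment (an assumption, not copied from the repository).
sample = '''
task asmgene {
    input {
        String dockerImage = "tpesout/hpp_minimap2:latest"
    }
    runtime {
        docker: dockerImage
    }
}
'''

print(find_latest_images(sample))
```

Each image reported this way would then be replaced with an explicit version tag or a digest reference of the form name@sha256:..., so that a given WDL revision always resolves to the same image.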
Please update BUSCO to the latest version. It contains the new BUSCO datasets (*_odb10).
A number of the workflows/tasks have lint as reported by miniwdl check. Some of the suggestions are worth implementing (e.g., quoting).
For example, I ran it on QC/wdl/workflows/standard_qc_haploid.wdl and it reported:
$ miniwdl check hpp_production_workflows/QC/wdl/workflows/standard_qc_haploid.wdl
standard_qc_haploid.wdl
workflow standardQualityControlHaploid
(Ln 17, Col 9) UnusedDeclaration, nothing references Boolean isMaleSample
call asmgene_t.asmgene
call quast_t.quast
call meryl_t.runMeryl as meryl
call merqury_t.merqury
call yak_t.runYakAssemblyStats as yak
call consolidate
task consolidate
(Ln 135, Col 11) CommandShellCheck, SC2035 Use ./*glob* or -- *glob* so names with dashes won't become options.
(Ln 135, Col 30) CommandShellCheck, SC2035 Use ./*glob* or -- *glob* so names with dashes won't become options.
asmgene_t : asmgene.wdl
workflow runAsmgene (not called)
call asmgene
(Ln 4, Col 2) IncompleteCall, required input(s) omitted in call to asmgene (assemblyFasta, genesFasta)
task asmgene
(Ln 46, Col 82) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 49, Col 46) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 49, Col 60) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
merqury_t : merqury.wdl
workflow runMerqury (not called)
call merqury
(Ln 4, Col 5) IncompleteCall, required input(s) omitted in call to merqury (assemblyFasta, kmerTarball)
task merqury
(Ln 31, Col 8) CommandShellCheck, SC2034 OMP_NUM_THREADS appears unused. Verify use (or export if used externally).
(Ln 46, Col 15) CommandShellCheck, SC2207 Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(Ln 48, Col 19) CommandShellCheck, SC2207 Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(Ln 49, Col 19) CommandShellCheck, SC2207 Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(Ln 56, Col 19) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 57, Col 15) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 68, Col 23) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 69, Col 19) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 77, Col 15) CommandShellCheck, SC2206 Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(Ln 80, Col 8) CommandShellCheck, SC2068 Double quote array expansions to avoid re-splitting elements.
(Ln 83, Col 17) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 83, Col 40) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
meryl_t : meryl.wdl
(Ln 4, Col 1) UnusedImport, no use of workflow, tasks, or structs defined in the imported document shardReads_t
workflow runMeryl
scatter readFile
call extractReads_t.extractReads as sampleReadsExtracted
scatter readFile
call extractReads_t.extractReads as maternalReadsExtracted
scatter readFile
call extractReads_t.extractReads as paternalReadsExtracted
call arithmetic_t.sum as sampleReadSize
call arithmetic_t.sum as maternalReadSize
call arithmetic_t.sum as paternalReadSize
call arithmetic_t.sum as allReadSize
call arithmetic_t.max as sampleReadSizeMax
call arithmetic_t.max as maternalReadSizeMax
call arithmetic_t.max as paternalReadSizeMax
scatter readFile
call merylCount as sampleMerylCount
scatter readFile
call merylCount as maternalMerylCount
scatter readFile
call merylCount as paternalMerylCount
call merylUnionSum as sampleMerylUnionSum
call merylUnionSum as maternalMerylUnionSum
call merylUnionSum as paternalMerylUnionSum
call merylHapmer
task merylCount
(Ln 198, Col 8) CommandShellCheck, SC2034 OMP_NUM_THREADS appears unused. Verify use (or export if used externally).
(Ln 201, Col 11) CommandShellCheck, SC2006 Use $(...) notation instead of legacy backticked `...`.
(Ln 202, Col 92) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 205, Col 16) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 205, Col 30) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 208, Col 15) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
task merylHapmer
(Ln 299, Col 8) CommandShellCheck, SC2034 OMP_NUM_THREADS appears unused. Verify use (or export if used externally).
(Ln 308, Col 13) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 311, Col 36) CommandShellCheck, SC2035 Use ./*glob* or -- *glob* so names with dashes won't become options.
(Ln 311, Col 42) CommandShellCheck, SC2035 Use ./*glob* or -- *glob* so names with dashes won't become options.
task merylUnionSum
(Ln 244, Col 8) CommandShellCheck, SC2034 OMP_NUM_THREADS appears unused. Verify use (or export if used externally).
arithmetic_t : arithmetic.wdl
task max
task sum
extractReads_t : extract_reads.wdl
workflow runExtractReads (not called)
scatter file
call extractReads
task extractReads
(Ln 58, Col 65) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 64, Col 12) CommandShellCheck, SC2226 This ln has no destination. Check the arguments, or specify '.' explicitly.
(Ln 65, Col 56) CommandShellCheck, SC2046 Quote this to prevent word splitting.
(Ln 65, Col 56) CommandShellCheck, SC2006 Use $(...) notation instead of legacy backticked `...`.
(Ln 65, Col 106) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 67, Col 46) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 73, Col 19) CommandShellCheck, SC2006 Use $(...) notation instead of legacy backticked `...`.
(Ln 74, Col 21) CommandShellCheck, SC2053 Quote the right-hand side of == in [[ ]] to prevent glob matching.
(Ln 75, Col 23) CommandShellCheck, SC2006 Use $(...) notation instead of legacy backticked `...`.
(Ln 77, Col 13) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
shardReads_t : shard_reads.wdl
workflow runShardReads (not called)
call shardReads
(Ln 4, Col 5) IncompleteCall, required input(s) omitted in call to shardReads (readFile)
task shardReads (not called)
(Ln 46, Col 97) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 48, Col 88) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 52, Col 24) CommandShellCheck, SC2046 Quote this to prevent word splitting.
(Ln 52, Col 26) CommandShellCheck, SC2012 Use find instead of ls to better handle non-alphanumeric filenames.
quast_t : quast.wdl
workflow runQuast (not called)
call quast
(Ln 4, Col 2) IncompleteCall, required input(s) omitted in call to quast (assemblyFasta)
task quast
(Ln 34, Col 19) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 37, Col 12) CommandShellCheck, SC2226 This ln has no destination. Check the arguments, or specify '.' explicitly.
(Ln 44, Col 18) CommandShellCheck, SC2206 Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(Ln 51, Col 23) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 54, Col 16) CommandShellCheck, SC2226 This ln has no destination. Check the arguments, or specify '.' explicitly.
(Ln 56, Col 22) CommandShellCheck, SC2206 Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(Ln 60, Col 14) CommandShellCheck, SC2236 Use -n instead of ! -z.
(Ln 65, Col 15) CommandShellCheck, SC2206 Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(Ln 71, Col 17) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 71, Col 38) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
yak_t : yak.wdl
(Ln 4, Col 1) UnusedImport, no use of workflow, tasks, or structs defined in the imported document shardReads_t
workflow runYakAssemblyStats
(Ln 16, Col 9) UnusedDeclaration, nothing references Int kmerSize
(Ln 17, Col 9) UnusedDeclaration, nothing references Int shardLinesPerFile
scatter readFile
call extractReads_t.extractReads as maternalReadsExtracted
scatter readFile
call extractReads_t.extractReads as paternalReadsExtracted
scatter readFile
call extractReads_t.extractReads as sampleReadsExtracted
call arithmetic_t.sum as maternalReadSize
call arithmetic_t.sum as paternalReadSize
call arithmetic_t.sum as sampleReadSize
call yakCount as yakCountMat
call yakCount as yakCountPat
call yakCount as yakCountSample
call yakAssemblyStats
task yakAssemblyStats
(Ln 187, Col 81) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 188, Col 81) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 191, Col 112) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 192, Col 112) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 196, Col 8) CommandShellCheck, SC2129 Consider using { cmd1; cmd2; } >> file instead of individual redirects.
(Ln 196, Col 26) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 197, Col 17) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 197, Col 42) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 198, Col 26) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 199, Col 17) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 199, Col 42) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 200, Col 30) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 201, Col 17) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 201, Col 52) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 202, Col 30) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 203, Col 17) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 203, Col 52) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 206, Col 17) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 206, Col 39) CommandShellCheck, SC2035 Use ./*glob* or -- *glob* so names with dashes won't become options.
task yakCount
arithmetic_t : arithmetic.wdl
task sum
task max (not called)
extractReads_t : extract_reads.wdl
workflow runExtractReads (not called)
scatter file
call extractReads
task extractReads
(Ln 58, Col 65) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 64, Col 12) CommandShellCheck, SC2226 This ln has no destination. Check the arguments, or specify '.' explicitly.
(Ln 65, Col 56) CommandShellCheck, SC2046 Quote this to prevent word splitting.
(Ln 65, Col 56) CommandShellCheck, SC2006 Use $(...) notation instead of legacy backticked `...`.
(Ln 65, Col 106) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 67, Col 46) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 73, Col 19) CommandShellCheck, SC2006 Use $(...) notation instead of legacy backticked `...`.
(Ln 74, Col 21) CommandShellCheck, SC2053 Quote the right-hand side of == in [[ ]] to prevent glob matching.
(Ln 75, Col 23) CommandShellCheck, SC2006 Use $(...) notation instead of legacy backticked `...`.
(Ln 77, Col 13) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
shardReads_t : shard_reads.wdl
workflow runShardReads (not called)
call shardReads
(Ln 4, Col 5) IncompleteCall, required input(s) omitted in call to shardReads (readFile)
task shardReads (not called)
(Ln 46, Col 97) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 48, Col 88) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 52, Col 24) CommandShellCheck, SC2046 Quote this to prevent word splitting.
(Ln 52, Col 26) CommandShellCheck, SC2012 Use find instead of ls to better handle non-alphanumeric filenames.
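A report of this length is easier to prioritise once the findings are counted per rule. The sketch below tallies them, assuming each finding line has the "(Ln N, Col N) Category, ..." shape seen in the report above; the function name `tally_findings` is hypothetical and the sample is a four-line excerpt of that report.

```python
import re
from collections import Counter

# A finding looks like:
#   (Ln 46, Col 82) CommandShellCheck, SC2086 Double quote to prevent ...
# Group 1 is the miniwdl category; group 2 is the ShellCheck code, if any.
FINDING = re.compile(r'\(Ln \d+, Col \d+\) (\w+),\s*(SC\d+)?')


def tally_findings(report):
    """Count findings per ShellCheck code, or per category when there is no code."""
    counts = Counter()
    for line in report.splitlines():
        match = FINDING.search(line)
        if match:
            counts[match.group(2) or match.group(1)] += 1
    return counts


sample = """\
(Ln 17, Col 9) UnusedDeclaration, nothing references Boolean isMaleSample
(Ln 46, Col 82) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 49, Col 46) CommandShellCheck, SC2086 Double quote to prevent globbing and word splitting.
(Ln 135, Col 11) CommandShellCheck, SC2035 Use ./*glob* or -- *glob* so names with dashes won't become options.
"""

for rule, count in tally_findings(sample).most_common():
    print(rule, count)
```

Lines that are not findings (workflow, task and call names) are skipped, so the whole miniwdl check output can be passed in unchanged.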
There's currently no license associated with the code in this repo, making it impossible to reuse. Please consider adding a permissive license.
It would be helpful to add the -e option to the paftools.js asmgene call in asmgene.wdl to aid in the evaluation of misassemblies. It prints both single-copy and multi-copy gene errors.
#paftools.js asmgene
Usage: paftools.js asmgene [options] <ref-splice.paf> <asm-splice.paf> [...]
Options:
-i FLOAT min identity [0.99]
-c FLOAT min coverage [0.99]
-a only evaluate genes mapped to the autosomes
-e print fragmented/missing genes
Thank you!!
Hi!
I ran the standard_haploid_qc workflow and saw that it was using a really old version of minimap2 in the asmgene call.
The asmgene task (here: https://github.com/human-pangenomics/hpp_production_workflows/blob/master/QC/wdl/tasks/asmgene.wdl), uses the docker image "tpesout/hpp_minimap2:latest"
If you pull and start that docker image, and run minimap2 --version:
$ docker run -it --rm tpesout/hpp_minimap2:latest /bin/bash
root@9826e39b4f73:/data# minimap2 --version
2.17-r941
However, the docker build file (here: https://github.com/human-pangenomics/hpp_production_workflows/blob/master/QC/docker/minimap2/Dockerfile) points at a much newer version (one release behind the current one).
We think Trevor (@tpesout) forgot to update the published docker image with the latest build; perhaps it was built on a local machine but never pushed.
Thanks,
Sara and Bruce @bkmartinjr
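A check like the one described above could be automated by comparing the version string printed inside the image with the version the Dockerfile is expected to build. The sketch below assumes the "MAJOR.MINOR-rREVISION" format shown in the output above; the function names are hypothetical, and the expected version used in the example is a placeholder, not a value taken from the Dockerfile.

```python
import re


def parse_minimap2_version(text):
    """Parse a string such as '2.17-r941' into (major, minor, revision)."""
    match = re.fullmatch(r'(\d+)\.(\d+)-r(\d+)', text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(group) for group in match.groups())


def is_older(installed, expected):
    """True if the installed version sorts before the expected one."""
    return parse_minimap2_version(installed) < parse_minimap2_version(expected)


installed = "2.17-r941"   # reported by the published image (see above)
expected = "2.24-r1122"   # placeholder only; read the real value from the Dockerfile

if is_older(installed, expected):
    print(f"image is stale: {installed} is older than {expected}")
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions such as 2.9 and 2.17.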
Hi,
could you please clarify the data source for the chm13v1.1/HSat annotation used in this step:
2. Incorporating HSATs Coverage Bias
[...]
To do so we need a bed file pointing to the HSat in the reference then we can run the script project_blocks.py to project it back to the assembly.
Or is that already packaged in the container? Thanks!
+Peter