genomic-medicine-sweden / jasen
Bacterial typing pipeline for clinical NGS data. Written in NextFlow, Python & Bash.
License: GNU General Public License v3.0
The variable cgmlstSchema found in the config is used to estimate the depth and breadth of sequencing coverage by defining a set of "core" regions in the reference genome.
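As a rough illustration of the idea (not jasen's actual implementation), breadth and depth over a set of core regions could be computed like this, assuming per-base depths and half-open `(start, end)` intervals:

```python
def coverage_stats(depths, core_regions):
    """Estimate coverage over "core" regions only.

    depths: dict mapping reference position -> read depth
    core_regions: list of (start, end) half-open intervals
    Returns (breadth, mean_depth) over the core positions.
    """
    core_positions = [p for start, end in core_regions for p in range(start, end)]
    per_base = [depths.get(p, 0) for p in core_positions]
    breadth = sum(1 for d in per_base if d > 0) / len(per_base)
    mean_depth = sum(per_base) / len(per_base)
    return breadth, mean_depth
```

For example, a four-base core region where one base has zero depth yields a breadth of 0.75.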
Put all the JSON output into one megafile.
The work lies in going through the files and assembling them in a way that's sensible.
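A minimal sketch of the mechanical part (the sensible-assembly part is the real work). This simply concatenates per-tool JSON documents into one list; the function name and layout are assumptions, not the project's API:

```python
import json
import pathlib


def merge_json_files(paths, out_path):
    """Read each per-tool JSON document and write them as one JSON array."""
    merged = [json.loads(pathlib.Path(p).read_text()) for p in paths]
    pathlib.Path(out_path).write_text(json.dumps(merged, indent=2))
    return merged
```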
An analysis profile is used to tweak the execution of the pipeline, tailoring it to different species etc.
The setup process is currently rather convoluted and is in dire need of some clarification.
The output should contain a "version" field that describes the output format version. This would simplify both writing parsers and working with older files.
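A small sketch of what that could look like. The version string and field layout here are illustrative assumptions, not a decided format; the point is that a parser can refuse (or branch on) major versions it doesn't know:

```python
import json

OUTPUT_FORMAT_VERSION = "1.0.0"  # hypothetical format version


def dump_result(result: dict) -> str:
    """Embed the output format version alongside the analysis result."""
    return json.dumps({"version": OUTPUT_FORMAT_VERSION, **result})


def load_result(text: str) -> dict:
    """Parse a result file, rejecting major versions we don't support."""
    doc = json.loads(text)
    major = int(doc.get("version", "0").split(".")[0])
    if major > 1:
        raise ValueError(f"unsupported output format version: {doc['version']}")
    return doc
```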
Files and comments are going in every direction. Do a quick review of everything and rename/remove whatever isn't necessary anymore.
Involve Amaya
Incorporate support for Nanopore sequence data.
I installed Singularity using Conda.
Then I get this error:
(jasen) [JASEN]$ cd container && sudo bash -i build_container.sh && cd ..
[sudo] password for xxx:
Building tool chewbbaca to chewbbaca_2.8.5.sif
sudo: singularity: command not found
(jasen) [container]$ singularity version
3.8.6
Is it possible to install JASEN without sudo privileges?
I have two input FASTQ files named DFGW55A_S24_L001_R1_001.fastq.gz and DFGW55A_S24_L001_R2_001.fastq.gz. When I run the workflow, the SPAdes assembler fails with:
== Error == file is empty: trim_unpair.fastq.gz (single reads, library number: 1, library type: paired-end)
Create script to verify the presence of fastq files, runtime settings file and sample metadata file (hypothetical). Also have the script perform basic sanity checks (fastq file is a fastq etc).
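One of those basic sanity checks could be sketched like this (names and the exact checks are assumptions for the hypothetical script): confirm the file exists, is actually gzipped, and that its first record follows the four-line FASTQ layout.

```python
import gzip


def looks_like_fastq(path):
    """Basic sanity check on a .fastq.gz file.

    Returns True if the file opens as gzip and its first record has the
    @header / sequence / '+' / quality FASTQ layout with matching lengths.
    """
    try:
        with gzip.open(path, "rt") as fh:
            header, seq, sep, qual = [fh.readline().rstrip("\n") for _ in range(4)]
    except OSError:  # missing file, or not actually gzip-compressed
        return False
    return (header.startswith("@") and sep.startswith("+")
            and len(seq) > 0 and len(seq) == len(qual))
```

A check like this would have caught the empty trim_unpair.fastq.gz before SPAdes did.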
Since a few labs, including the Swedish public health institute, are using Ion Torrent for sequencing, we need to support it.
The error profile of Ion Torrent data will likely cause additional problems down the line, but these steps should at least enable us to analyse it.
Set up the Wiki and move the existing documentation there.
Add to the readme how to build a minimal version of the kraken2 database to save resources.
I don't know whether that would also require changes to the code.
JASEN crashes when there are multiple files with identical sample IDs in the input CSV, because this causes mis-pairings of assemblies and assembly indexes.
This could be prevented by a "validate input" step.
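Such a validation step could start with a duplicate-ID check along these lines. The `id` column name is an assumption; adjust it to the real samplesheet schema:

```python
import csv
from collections import Counter


def duplicated_sample_ids(csv_path, id_column="id"):
    """Return sample IDs that occur more than once in the input CSV."""
    with open(csv_path, newline="") as fh:
        counts = Counter(row[id_column] for row in csv.DictReader(fh))
    return sorted(s for s, n in counts.items() if n > 1)
```

The pipeline could then fail fast with a clear message instead of mis-pairing assemblies later.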
Produce a NextFlow skeleton that handles the steps presented in the attached flowchart.
https://github.com/genomic-medicine-sweden/unnamed_microbial_pipeline/wiki/Flowchart
Should JASEN enforce limitations on how a sample ID may look, to avoid potential issues with character encoding etc.?
Limitations could be to limit ids to
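Purely for illustration (the actual policy is still an open question above), one conservative choice would be ASCII letters, digits, hyphen and underscore, with a length cap:

```python
import re

# Illustrative restriction only, not a decided rule:
# 1-64 characters from [A-Za-z0-9_-].
SAMPLE_ID_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def is_valid_sample_id(sample_id: str) -> bool:
    """Check a sample ID against the illustrative character whitelist."""
    return bool(SAMPLE_ID_RE.fullmatch(sample_id))
```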
Multiple tools don't produce json compatible output. In some cases it's just a matter of translating a tsv into a json, in some cases it is more complicated. Write small scripts to produce json compatible output for the following tools/steps/paths:
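For the simple cases, a generic headered-TSV-to-JSON translation could look like this sketch (whichever tools end up on the list; the more complicated cases need bespoke handling):

```python
import csv
import io
import json


def tsv_to_json(tsv_text: str) -> str:
    """Translate a headered TSV into a JSON array of row objects.
    All values stay as strings; type coercion is left to the caller."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return json.dumps(rows, indent=2)
```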
Despite being bureaucratic and slower, I think it would be good to implement reviewer approval rules, i.e. at least one person must approve a PR before merging to the main branch. Also, utilising GitHub Actions to perform tests (linting and unit tests similar to those of nf-core) would be elegant.
What needs to be done:
Essentially, the deploy_references.sh script didn't indicate that the ariba prepareref command had failed:
ariba prepareref -f nucleotide_fasta_protein_homolog_model.fasta --all_coding yes --force tmpdir
FileNotFoundError: [Errno 2] No such file or directory: 'card/00.info.txt'
Suggestions on how to get it done:
Catch the error in bash and fail the installation process
What are the arguments for getting it done:
Easily obscured installation errors like this can cost tons of debug time
When a novel mlst is identified (see below), prp throws an error as it expects an integer. Here is an example of a novel mlst output:
[
{
"filename" : "GMS18-fohm.fasta",
"scheme" : "saureus",
"id" : "GMS18-fohm.fasta",
"alleles" : {
"gmk" : "6",
"glpF" : "8",
"arcC" : "~862",
"tpi" : "3",
"yqiL" : "2",
"pta" : "10",
"aroE" : "14"
},
"sequence_type" : "-"
}
]
Here is the error output:
Traceback (most recent call last):
File "/data/bnf/sw/miniconda3/envs/jasen/bin/prp", line 33, in <module>
sys.exit(load_entry_point('prp', 'console_scripts', 'prp')())
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/data/bnf/dev/ryan/JASEN/bin/pipeline_result_processor/prp/cli.py", line 102, in create_output
res: MethodIndex = parse_mlst_results(mlst)
File "/data/bnf/dev/ryan/JASEN/bin/pipeline_result_processor/prp/parse/typing.py", line 17, in parse_mlst_results
result_obj = TypingResultMlst(
File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for TypingResultMlst
alleles -> arcC
value is not a valid integer (type=type_error.integer)
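The validation fails because the model requires integer allele calls, while novel or uncertain calls are reported as strings like "~862" (and the sequence type as "-"). A more permissive parsing step could keep such markers intact; this is a sketch of the idea, not prp's actual model code:

```python
def parse_allele(call: str):
    """Return an int for ordinary allele calls; keep the raw string for
    novel/uncertain calls such as "~862" or missing calls like "-"."""
    try:
        return int(call)
    except ValueError:
        return call


# Example from the mlst output above: arcC is a novel allele.
alleles = {"arcC": "~862", "gmk": "6"}
parsed = {locus: parse_allele(call) for locus, call in alleles.items()}
```

In the pydantic model this would correspond to allowing `int | str` (or a validator applying the same logic) for the allele fields.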
--
Essentially, it would be nice to have support for the PubMLST and BLAST databases being held outside of the singularity image. Creating scripts to update the respective external databases would make updating them much easier. Alternatively, a script to update them within the image would be neat as well, if this is at all possible. This should be modelled on the db updating process.
It would be nice if information regarding SE/PE and sequencing platform was added to the analysis_result.json output.
I got the following error when running mlst:
BLAST Database error: No alias or index file found for nucleotide database [mlst.fa] in search path [/path/to/wd/8e/96e6c1c132db78bcd252109b891c38::]
When encountering novel STs in the 7-locus MLST scheme, it would be nice to be able to submit them directly to PubMLST via the REST API. I at least think it should be possible via https://bigsdb.readthedocs.io/en/latest/rest.html#post-db-submissions
There are two scenarios
All in all, it is an asynchronous operation where we have to await the manual curation at each step before we can pull the new data (also via the API) and then update our local db.
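A very rough sketch of the submission side. The base URL and payload field names here are assumptions (the authoritative schema is defined by BigSDB, and real submissions require OAuth credentials), so this only assembles the request rather than sending it:

```python
import json

BASE = "https://rest.pubmlst.org"  # assumed base URL; see the BigSDB REST docs


def build_submission(db, alleles):
    """Assemble (url, body) for a hypothetical novel-profile submission.
    Field names are illustrative, not the real BigSDB submission schema."""
    url = f"{BASE}/db/{db}/submissions"
    body = json.dumps({"type": "profiles", "profiles": alleles}).encode()
    return url, body


# Sending it would then be an authenticated POST, e.g. via urllib.request,
# followed by polling the submission until curation completes.
```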