genomic-medicine-sweden / jasen
Bacterial typing pipeline for clinical NGS data. Written in NextFlow, Python & Bash.
License: GNU General Public License v3.0
The variable cgmlstSchema found in the config is used to estimate the depth and breadth of sequencing coverage by defining a set of "core" regions in the reference genome.
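As a rough illustration of the idea (not jasen's actual implementation), breadth and depth over a set of core regions could be computed like this, assuming per-base depths and half-open `(start, end)` intervals:

```python
def coverage_stats(depths, core_regions):
    """Estimate coverage over "core" regions only.

    depths: dict mapping reference position -> read depth
    core_regions: list of (start, end) half-open intervals
    Returns (breadth, mean_depth) over the core positions.
    """
    core_positions = [p for start, end in core_regions for p in range(start, end)]
    per_base = [depths.get(p, 0) for p in core_positions]
    breadth = sum(1 for d in per_base if d > 0) / len(per_base)
    mean_depth = sum(per_base) / len(per_base)
    return breadth, mean_depth
```

For example, a four-base core region where one base has zero depth yields a breadth of 0.75.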
Put all the JSON output into one megafile.
The work lies in going through the files and assembling them in a way that's sensible.
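A minimal sketch of the mechanical part (the sensible-assembly part is the real work). This simply concatenates per-tool JSON documents into one list; the function name and layout are assumptions, not the project's API:

```python
import json
import pathlib


def merge_json_files(paths, out_path):
    """Read each per-tool JSON document and write them as one JSON array."""
    merged = [json.loads(pathlib.Path(p).read_text()) for p in paths]
    pathlib.Path(out_path).write_text(json.dumps(merged, indent=2))
    return merged
```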
An analysis profile is used to tweak the execution of the pipeline, tailoring it to different species etc.
The setup process is currently rather convoluted and is in dire need of some clarification.
The output should contain a "version" field that describes the output format version. This would simplify both writing parsers and working with older files.
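A small sketch of what that could look like. The version string and field layout here are illustrative assumptions, not a decided format; the point is that a parser can refuse (or branch on) major versions it doesn't know:

```python
import json

OUTPUT_FORMAT_VERSION = "1.0.0"  # hypothetical format version


def dump_result(result: dict) -> str:
    """Embed the output format version alongside the analysis result."""
    return json.dumps({"version": OUTPUT_FORMAT_VERSION, **result})


def load_result(text: str) -> dict:
    """Parse a result file, rejecting major versions we don't support."""
    doc = json.loads(text)
    major = int(doc.get("version", "0").split(".")[0])
    if major > 1:
        raise ValueError(f"unsupported output format version: {doc['version']}")
    return doc
```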
Files and comments are going in every direction. Do a quick review of everything and rename/remove whatever isn't necessary anymore.
Involve Amaya
Incorporate support for Nanopore sequence data.
I installed Singularity using Conda.
Then I get this error:
(jasen) [JASEN]$ cd container && sudo bash -i build_container.sh && cd ..
[sudo] password for xxx:
Building tool chewbbaca to chewbbaca_2.8.5.sif
sudo: singularity: command not found
(jasen) [container]$ singularity version
3.8.6
Is it possible to install JASEN without sudo privileges?
I have two input FASTQ files named DFGW55A_S24_L001_R1_001.fastq.gz and DFGW55A_S24_L001_R2_001.fastq.gz. When I run the workflow, the SPAdes assembler fails with:
== Error == file is empty: trim_unpair.fastq.gz (single reads, library number: 1, library type: paired-end)
Create script to verify the presence of fastq files, runtime settings file and sample metadata file (hypothetical). Also have the script perform basic sanity checks (fastq file is a fastq etc).
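One of those basic sanity checks could be sketched like this (names and the exact checks are assumptions for the hypothetical script): confirm the file exists, is actually gzipped, and that its first record follows the four-line FASTQ layout.

```python
import gzip


def looks_like_fastq(path):
    """Basic sanity check on a .fastq.gz file.

    Returns True if the file opens as gzip and its first record has the
    @header / sequence / '+' / quality FASTQ layout with matching lengths.
    """
    try:
        with gzip.open(path, "rt") as fh:
            header, seq, sep, qual = [fh.readline().rstrip("\n") for _ in range(4)]
    except OSError:  # missing file, or not actually gzip-compressed
        return False
    return (header.startswith("@") and sep.startswith("+")
            and len(seq) > 0 and len(seq) == len(qual))
```

A check like this would have caught the empty trim_unpair.fastq.gz before SPAdes did.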
Since a few labs, including the Swedish public health institute, are using Ion Torrent for sequencing, we need to support it.
The error profile of Ion Torrent data will likely cause additional problems down the line, but these steps should at least enable us to analyse it.
Set up the Wiki and move the existing documentation there.
Add to the readme how to build a minimal version of the kraken2 database to save resources.
I don't know whether that would also require changes to the code.
JASEN crashes when there are multiple files with identical sample IDs in the input CSV, because this causes mis-pairings of assemblies and assembly indexes.
This could be prevented by a "validate input" step.
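Such a validation step could start with a duplicate-ID check along these lines. The `id` column name is an assumption; adjust it to the real samplesheet schema:

```python
import csv
from collections import Counter


def duplicated_sample_ids(csv_path, id_column="id"):
    """Return sample IDs that occur more than once in the input CSV."""
    with open(csv_path, newline="") as fh:
        counts = Counter(row[id_column] for row in csv.DictReader(fh))
    return sorted(s for s, n in counts.items() if n > 1)
```

The pipeline could then fail fast with a clear message instead of mis-pairing assemblies later.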
Produce a NextFlow skeleton that handles the steps presented in the attached flowchart.
https://github.com/genomic-medicine-sweden/unnamed_microbial_pipeline/wiki/Flowchart
Should JASEN enforce limitations on how a sample ID may look, to avoid potential issues with character encoding etc.?
Limitations could be to limit ids to
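Purely for illustration (the actual policy is still an open question above), one conservative choice would be ASCII letters, digits, hyphen and underscore, with a length cap:

```python
import re

# Illustrative restriction only, not a decided rule:
# 1-64 characters from [A-Za-z0-9_-].
SAMPLE_ID_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def is_valid_sample_id(sample_id: str) -> bool:
    """Check a sample ID against the illustrative character whitelist."""
    return bool(SAMPLE_ID_RE.fullmatch(sample_id))
```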
Multiple tools don't produce json compatible output. In some cases it's just a matter of translating a tsv into a json, in some cases it is more complicated. Write small scripts to produce json compatible output for the following tools/steps/paths:
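For the simple cases, a generic headered-TSV-to-JSON translation could look like this sketch (whichever tools end up on the list; the more complicated cases need bespoke handling):

```python
import csv
import io
import json


def tsv_to_json(tsv_text: str) -> str:
    """Translate a headered TSV into a JSON array of row objects.
    All values stay as strings; type coercion is left to the caller."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return json.dumps(rows, indent=2)
```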
Despite being bureaucratic and slower, I think it would be good to implement reviewer approval rules, i.e. at least one person must approve a PR before merging to the main branch. Also, utilising GitHub Actions to perform tests (linting and unit tests similar to those of nf-core) would be elegant.
What needs to be done:
Essentially, the deploy_references.sh script didn't indicate that the ariba prepareref command had failed:
ariba prepareref -f nucleotide_fasta_protein_homolog_model.fasta --all_coding yes --force tmpdir
FileNotFoundError: [Errno 2] No such file or directory: 'card/00.info.txt'
Suggestions on how to get it done:
Catch the error in bash and fail the installation process
What are the arguments for getting it done:
Easily obscured installation errors like this can cost tons of debug time
When a novel mlst is identified (see below), prp throws an error as it expects an integer. Here is an example of a novel mlst output:
[
{
"filename" : "GMS18-fohm.fasta",
"scheme" : "saureus",
"id" : "GMS18-fohm.fasta",
"alleles" : {
"gmk" : "6",
"glpF" : "8",
"arcC" : "~862",
"tpi" : "3",
"yqiL" : "2",
"pta" : "10",
"aroE" : "14"
},
"sequence_type" : "-"
}
]
Here is the error output:
Traceback (most recent call last):
File "/data/bnf/sw/miniconda3/envs/jasen/bin/prp", line 33, in <module>
sys.exit(load_entry_point('prp', 'console_scripts', 'prp')())
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/data/bnf/sw/miniconda3/envs/jasen/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/data/bnf/dev/ryan/JASEN/bin/pipeline_result_processor/prp/cli.py", line 102, in create_output
res: MethodIndex = parse_mlst_results(mlst)
File "/data/bnf/dev/ryan/JASEN/bin/pipeline_result_processor/prp/parse/typing.py", line 17, in parse_mlst_results
result_obj = TypingResultMlst(
File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for TypingResultMlst
alleles -> arcC
value is not a valid integer (type=type_error.integer)
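The validation fails because the model requires integer allele calls, while novel or uncertain calls are reported as strings like "~862" (and the sequence type as "-"). A more permissive parsing step could keep such markers intact; this is a sketch of the idea, not prp's actual model code:

```python
def parse_allele(call: str):
    """Return an int for ordinary allele calls; keep the raw string for
    novel/uncertain calls such as "~862" or missing calls like "-"."""
    try:
        return int(call)
    except ValueError:
        return call


# Example from the mlst output above: arcC is a novel allele.
alleles = {"arcC": "~862", "gmk": "6"}
parsed = {locus: parse_allele(call) for locus, call in alleles.items()}
```

In the pydantic model this would correspond to allowing `int | str` (or a validator applying the same logic) for the allele fields.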
--
Essentially, it would be nice to have support for the PubMLST and BLAST databases being held outside of the singularity image. Creating scripts to update the respective external databases would make updating them much easier. Alternatively, a script to update them within the image would be neat as well, if this is at all possible. This should be modelled on the db updating process.
It would be nice if information regarding SE/PE and sequencing platform was added to the analysis_result.json output.
I got the following error when running mlst:
BLAST Database error: No alias or index file found for nucleotide database [mlst.fa] in search path [/path/to/wd/8e/96e6c1c132db78bcd252109b891c38::]
When encountering novel STs in the 7-locus MLST scheme, it would be nice to be able to submit them directly to PubMLST via the REST API. I at least think it should be possible via https://bigsdb.readthedocs.io/en/latest/rest.html#post-db-submissions
There are two scenarios
All in all, it is an asynchronous operation where we have to await the manual curation at each step before we can pull the new data (also via the API) and then update our local db.
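A very rough sketch of the submission side. The base URL and payload field names here are assumptions (the authoritative schema is defined by BigSDB, and real submissions require OAuth credentials), so this only assembles the request rather than sending it:

```python
import json

BASE = "https://rest.pubmlst.org"  # assumed base URL; see the BigSDB REST docs


def build_submission(db, alleles):
    """Assemble (url, body) for a hypothetical novel-profile submission.
    Field names are illustrative, not the real BigSDB submission schema."""
    url = f"{BASE}/db/{db}/submissions"
    body = json.dumps({"type": "profiles", "profiles": alleles}).encode()
    return url, body


# Sending it would then be an authenticated POST, e.g. via urllib.request,
# followed by polling the submission until curation completes.
```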