
covid-19-signal's Introduction

SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL)

This is a complete standardized workflow for the assembly and subsequent analysis of short-read viral sequencing data. The core workflow is compatible with the illumina artic nf pipeline and produces consensus and variant calls using iVar (1.3) (Grubaugh, 2019) and FreeBayes (Garrison, 2012). However, it performs far more extensive quality control and visualisation of results, including an interactive HTML summary of run results.

Briefly, raw reads undergo QC using FastQC (Andrews) before removal of host-related reads by competitive mapping against a composite human and viral reference with BWA-MEM (0.7.5) (Li, 2013), samtools (1.9) (Li, 2009), and a custom script. This ensures that data as close to raw as possible can be deposited in central databases. After this, reads undergo adapter trimming and further QC with trim-galore (0.6.5) (Martin). Residual TruSeq sequencing adapters are then removed by another custom script. Reads are then mapped to the viral reference with BWA-MEM, and amplicon primer sequences are trimmed using iVar (1.3) (Grubaugh, 2019). FastQC is then used to perform a QC check on the reads that map to the viral reference. After this, iVar is used to generate a consensus genome, and variants are called using both ivar variants and breseq (0.35) (Deatherage, 2014). FreeBayes may optionally be run alongside iVar, generating a second set of consensus genome(s) and variant call(s), with comparisons made between iVar and FreeBayes to highlight differences in mutation calls. Coverage statistics are calculated using bedtools before a final QC via quast and a kraken2 taxonomic classification of mapped reads. Finally, data from all samples are collated via a post-processing script into an interactive summary for exploration of results and quality control. Optionally, users can run ncov-tools to generate additional quality control and summary plots and statistics.
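For orientation, the core per-sample steps resemble the commands below (a minimal sketch only, assuming an indexed viral reference MN908947.3.fasta and a primer scheme BED file primer_scheme.bed; the actual workflow rules add host removal, QC, logging, and tuned parameters):

    # Map trimmed reads to the viral reference and sort by coordinate
    bwa mem -t 4 MN908947.3.fasta sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -o sample.sorted.bam -
    samtools index sample.sorted.bam

    # Trim amplicon primer sequences with iVar, then re-sort the trimmed BAM
    ivar trim -i sample.sorted.bam -b primer_scheme.bed -p sample.primertrimmed
    samtools sort -o sample.primertrimmed.sorted.bam sample.primertrimmed.bam

    # Generate a consensus genome (0.75/20 thresholds as seen in SIGNAL consensus headers)
    samtools mpileup -aa -A -d 0 -Q 0 sample.primertrimmed.sorted.bam \
        | ivar consensus -p sample.consensus -t 0.75 -q 20

    # Call variants against the same reference
    samtools mpileup -aa -A -d 0 -B -Q 0 sample.primertrimmed.sorted.bam \
        | ivar variants -p sample_variants -r MN908947.3.fasta -q 20 -t 0.03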

If you use this software please cite:

Nasir, Jalees A., Robert A. Kozak, Patryk Aftanas, Amogelang R. Raphenya, Kendrick M. Smith, Finlay Maguire, Hassaan Maan et al. "A Comparison of Whole Genome Sequencing of SARS-CoV-2 Using Amplicon-Based Sequencing, Random Hexamers, and Bait Capture." Viruses 12, no. 8 (2020): 895.
https://doi.org/10.3390/v12080895

Contents:

Setup:

0. Clone the git repository (--recursive only needed to run ncov-tools postprocessing)

    git clone --recursive https://github.com/jaleezyy/covid-19-signal

1. Install conda and snakemake (version >5) e.g.

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh # follow instructions
    source $(conda info --base)/etc/profile.d/conda.sh
    conda create -n signal -c conda-forge -c bioconda -c defaults snakemake pandas conda mamba
    conda activate signal

There are some issues with conda failing to install newer versions of snakemake, so alternatively install mamba and use that (snakemake has beta support for it within the workflow):

    conda install -c conda-forge mamba
    mamba create -c conda-forge -c bioconda -n signal snakemake pandas conda mamba
    conda activate signal
    # mamba activate signal is equivalent
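Before continuing, it's worth a quick sanity check that the environment resolved correctly:

    snakemake --version            # should report a version >5
    python -c "import pandas"      # should exit without error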

Additional software dependencies are managed directly by snakemake using conda environment files:

  • trim-galore 0.6.5 (docs)
  • kraken2 2.1.1 (docs)
  • quast 5.0.2 (docs)
  • bwa 0.7.17 (docs)
  • samtools 1.7/1.9 (docs)
  • bedtools 2.26.0 (docs)
  • breseq 0.35.0 (docs)
  • ivar 1.3 (docs)
  • freebayes 1.3.2 (docs)
  • pangolin (latest; version can be specified by user) (docs)
  • pangolin-data (latest; version can be specified by user; required for Pangolin v4+) (docs)
  • pangolearn (latest; version can be specified by user) (docs)
  • constellations (latest; version can be specified by user) (docs)
  • scorpio (latest; version can be specified by user) (docs)
  • pango-designation (latest; version can be specified by user) (docs)
  • nextclade (v1.11.0) (docs)
  • ncov-tools postprocessing scripts require additional dependencies (see file).

SIGNAL Help Screen:

Using the provided signalexe.py script, the majority of SIGNAL functions can be accessed easily.

To display the help screen:

python signalexe.py -h

usage: signalexe.py [-h] [-c CONFIGFILE] [-d DIRECTORY] [--cores CORES] [--config-only] [--remove-freebayes] [--add-breseq] [-neg NEG_PREFIX] [--dependencies] [--data DATA] [-ri] [-ii] [--unlock]
                    [-F] [-n] [--quiet] [--verbose] [-v]
                    [all ...] [postprocess ...] [ncov_tools ...] [install ...]

SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL) aims to take Illumina short-read sequences and perform consensus assembly + variant calling for ongoing surveillance and research efforts towards
the emergent coronavirus: Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).

positional arguments:
  all                   Run SIGNAL with all associated assembly rules. Does not include postprocessing. '--configfile' or '--directory' required. The latter will automatically generate a
                        configuration file and sample table. If both are provided, then '--configfile' will take priority
  postprocess           Run SIGNAL postprocessing on completed SIGNAL run. '--configfile' is required but will be generated if '--directory' is provided
  ncov_tools            Generate configuration file and filesystem setup required and then execute ncov-tools quality control assessment. Requires 'ncov-tools' submodule! '--configfile' is required
                        but will be generated if '--directory' is provided
  install               Install individual rule environments and ensure SIGNAL is functional. The only parameters operable will be '--data' and '--unlock'. Will override other operations!

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIGFILE, --configfile CONFIGFILE
                        Configuration file (i.e., config.yaml) for SIGNAL analysis
  -d DIRECTORY, --directory DIRECTORY
                        Path to directory containing reads. Will be used to generate sample table and configuration file
  --cores CORES         Number of cores. Default = 1
  --config-only         Generate sample table and configuration file (i.e., config.yaml) and exit. '--directory' required
  --remove-freebayes    Configuration file generator parameter. Set flag to DISABLE freebayes variant calling (improves overall speed)
  --add-breseq          Configuration file generator parameter. Set flag to ENABLE optional breseq step (will take more time for analysis to complete)
  -neg NEG_PREFIX, --neg-prefix NEG_PREFIX
                        Configuration file generator parameter. Comma-separated list of negative control sample name(s) or prefix(es). For example, 'Blank' will cover Blank1, Blank2, etc.
                        Recommended if running ncov-tools. Will be left empty, if not provided
  --dependencies        Download data dependencies (under a created 'data' directory) required for SIGNAL analysis and exit. Note: Will override other parameters! (~10 GB storage required)
  --data DATA           SIGNAL install and data dependencies parameter. Set location for data dependencies. If '--dependencies' is run, a folder will be created in the specified directory. If '--
                        config-only' or '--directory' is used, the value will be applied to the configuration file. (Upcoming feature): When used with 'SIGNAL install', any tests run will use the
                        dependencies located at this directory. Default = 'data'
  -ri, --rerun-incomplete
                        Snakemake parameter. Re-run any incomplete samples from a previously failed run
  -ii, --ignore-incomplete
                        Snakemake parameter. Do not check for incomplete output files
  --unlock              Snakemake parameter. Remove a lock on the working directory after a failed run
  -F, --forceall        Snakemake parameter. Force the re-run of all rules regardless of prior output
  -n, --dry-run         Snakemake parameter. Do not execute anything and only display what would be done
  --quiet               Snakemake parameter. Do not output any progress or rule information. If used with '--dry-run', it will just display a summary of the DAG of jobs
  --verbose             Snakemake parameter. Display snakemake debugging output
  -v, --version         Display version number

Summary:

signalexe.py simplifies the execution of all functions of SIGNAL. At its simplest, SIGNAL can be run with one line, provided only the directory of sequencing reads.

# Download dependencies (only needs to be run once; ~10GB of storage required)
# --data flag allows you to rename and relocate dependencies directory
python signalexe.py --data data --dependencies

# Generate configuration file and sample table
# --neg-prefix can be used to note negative controls
# --data can be used to specify location of data dependencies
python signalexe.py --config-only --directory /path/to/reads

# Execute pipeline (step-by-step; --cores defaults to 1 if not provided)
# --data can be used to specify location of data dependencies
python signalexe.py --configfile config.yaml --cores NCORES all
python signalexe.py --configfile config.yaml --cores NCORES postprocess
python signalexe.py --configfile config.yaml --cores NCORES ncov_tools

# ALTERNATIVE
# Execute pipeline (one line)
# --data can be used to specify location of data dependencies
python signalexe.py --configfile config.yaml --cores NCORES all postprocess ncov_tools

# ALTERNATIVE
# Execute pipeline (one line; no prior configuration file or sample table steps)
# --directory can be used in place of --configfile to automatically generate a configuration file
# --data can be used to specify location of data dependencies
python signalexe.py --directory /path/to/reads --cores NCORES all postprocess ncov_tools

Each of the steps in SIGNAL can be run manually by accessing the individual scripts or running snakemake.

# Download dependencies (only needs to be run once; ~10GB of storage required)
bash scripts/get_data_dependencies.sh -d data -a MN908947.3

# Generate sample table
# Modify existing 'example_config.yaml' for your configuration file
bash scripts/generate_sample_table.sh -d /path/to/reads -n sample_table.csv

# Execute pipeline (step-by-step)
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda all
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda ncov_tools

Detailed setup and execution:

1. Download necessary database files:

The pipeline requires:

  • Amplicon primer scheme sequences

  • SARS-CoV-2 reference fasta

  • SARS-CoV-2 reference gbk

  • SARS-CoV-2 reference gff3

  • kraken2 viral database

  • Human GRCh38 reference fasta (for composite human-viral BWA index)

     python signalexe.py --dependencies
     # defaults to a directory called `data` in repository root
     # --data can be used to rename and relocate the resultant directory
    
     OR
    
     bash scripts/get_data_dependencies.sh -d data -a MN908947.3
     # allows you to rename and relocate the resultant directory
    

Note: Downloading the database files requires ~10GB of storage, with up to ~35GB required for all temporary downloads!
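Once the download finishes, it's worth a quick check that everything landed (a sketch assuming the default data directory name; exact filenames can vary between releases):

     du -sh data    # expect on the order of ~10GB after temporary files are removed
     ls data        # look for the reference fasta/gbk/gff3, Kraken2 database, and composite index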

1.5. Prepare per-rule conda environments (optional, but recommended):

SIGNAL uses controlled conda environments for individual steps in the workflow. These are generally produced upon first execution of SIGNAL with input data; however, an option to install the per-rule environments is available through the signalexe.py script.

   python signalexe.py install

   # Will install per-rule environments
   # Later versions of SIGNAL will include a testing module with curated data to ensure function

2. Generate configuration file:

You can use the --config-only flag to generate both config.yaml and sample_table.csv. The directory provided will be used to auto-generate a name for the run.

python signalexe.py --config-only --directory /path/to/reads

# Outputs: 'reads_config.yaml' and 'reads_sample_table.csv'
# --data can be used to specify the location of data dependencies

You can also create the configuration file through modifying the example_config.yaml to suit your system.

Note: Regardless of method, double-check your configuration file to ensure the information is correct!
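For example, to quickly eyeball fields that commonly need attention (key names as referenced elsewhere in this README; verify the full set against example_config.yaml):

grep -E 'results?_dir|run_freebayes|phylo_include_seqs' reads_config.yaml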

3. Specify your samples in CSV format:

See the example table example_sample_table.csv for an idea of how to organise this table.

Using the --config-only flag, both configuration file and sample table will be generated (see above in step 2) from a given directory path to reads.

Alternatively, you can attempt to use generate_sample_table.sh to circumvent manual creation of the table.

bash scripts/generate_sample_table.sh

Output:
You must specify a data directory containing fastq(.gz) reads.

ASSUMES FASTQ FILES ARE NAMED AS <sample_name>_L00#_R{1,2}*.fastq(.gz)

Flags:
    -d  :  Path to directory containing sample fastq(.gz) files (Absolute paths preferred for consistency, but can use relative paths)
    -n  :  Name or file path for final sample table (with extension) (default: 'sample_table.csv') - will overwrite if file exists
    -e  :  Name or file path for an existing sample table - will append to the end of the provided table

Select one of '-n' (new sample table) or '-e' (existing sample table).
If neither provided, a new sample table called 'sample_table.csv' will be created (or overwritten) by default.

General usage:

# Create new sample table 'sample_table.csv' given path to reads directory
bash scripts/generate_sample_table.sh -d /path/to/reads -n sample_table.csv

# Append to existing sample table 'sample_table.csv' given path to a directory with additional reads
bash scripts/generate_sample_table.sh -d /path/to/more/reads -e sample_table.csv
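For reference, a generated table looks roughly like this (hypothetical rows following the file-naming assumption above; consult example_sample_table.csv for the authoritative column layout):

head -3 sample_table.csv
# sample,r1_path,r2_path
# sampleA,/path/to/reads/sampleA_L001_R1.fastq.gz,/path/to/reads/sampleA_L001_R2.fastq.gz
# sampleB,/path/to/reads/sampleB_L001_R1.fastq.gz,/path/to/reads/sampleB_L001_R2.fastq.gz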

4. Execute pipeline:

For the main signalexe.py script, positional arguments determine which rules of the pipeline to execute, with flags supplying input parameters.

The main rules of the pipeline are as follows:

  • all = Sequencing pipeline. i.e., take a set of paired reads, perform reference-based assembly to generate a consensus, run lineage assignment, etc.
  • postprocess = Summarize the key results, including pangolin lineage, specific mutations, etc., after running all
  • ncov_tools = Create the required conda environment, generate the necessary configuration file, and link needed result files within the ncov-tools directory. ncov-tools will then be executed with output found within the SIGNAL directory.

The generated configuration file from the above steps can be used as input. To run the general pipeline:

python signalexe.py --configfile config.yaml --cores 4 all

is equivalent to running

snakemake -kp --configfile config.yaml --cores 4 --use-conda --conda-prefix=$PWD/.snakemake/conda all

You can run the snakemake command as written above, but note that if --conda-prefix is not set this way (i.e., $PWD/.snakemake/conda), then all environments will be reinstalled each time you change the results_dir in the config.yaml.

Alternatively, you can skip the above configuration and sample table generation steps by simply providing the directory of reads to the main script (see step 2):

python signalexe.py --directory /path/to/reads --cores 4 all

A configuration file and sample table will automatically be generated prior to running SIGNAL all.

FreeBayes variant calling and BreSeq mutational analysis are technically optional tools within the workflow. When using the --directory flag, FreeBayes will run by default and BreSeq will not. These defaults can be changed using the --remove-freebayes and --add-breseq flags, respectively.
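For example, to auto-generate the configuration and run the pipeline with FreeBayes disabled and BreSeq enabled (flags as documented in the help screen above):

python signalexe.py --directory /path/to/reads --remove-freebayes --add-breseq --cores 4 all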

5. Postprocessing analyses:

As with the general pipeline, the generated configuration file from the above steps can be used as input. To run postprocess which summarizes the SIGNAL results:

python signalexe.py --configfile config.yaml --cores 1 postprocess

is equivalent to running

snakemake -kp --configfile config.yaml --cores 1 --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess

After postprocessing finishes, you'll see the following summary files:

  - summary.html                top-level summary, with links to per-sample summaries
  - {sample_name}/sample.html   per-sample summaries, with links for more detailed info
  - {sample_name}/sample.txt    per-sample summaries, in text-file format instead of HTML
  - summary.zip                 zip archive containing all of the above summary files.

Note that the pipeline postprocessing (snakemake postprocess) is separated from the rest of the pipeline (snakemake all). This is because in a multi-sample run, it's likely that at least one pipeline stage will fail. The postprocessing script should handle failed pipeline stages gracefully, by substituting placeholder values when expected pipeline output files are absent. However, this confuses snakemake's dependency tracking, so there seems to be no good alternative to separating pipeline processing and postprocessing into all and postprocess targets.

Related: because pipeline stages can fail, we run (and recommend running if using the snakemake command to run SIGNAL) snakemake all with the -k flag ("Go on with independent jobs if a job fails").

Additionally, SIGNAL can prepare output and execute @jts' ncov-tools to generate phylogenies and alternative summaries.

python signalexe.py --configfile config.yaml --cores 1 ncov_tools

is equivalent to running

snakemake -kp --configfile config.yaml --cores 1 --use-conda --conda-prefix=$PWD/.snakemake/conda ncov_tools

SIGNAL manages installing the dependencies (within the conda_prefix) and will generate the necessary hard links to required SIGNAL input files for ncov-tools, provided ncov-tools has been cloned as a submodule (if not found, the script will attempt to pull the submodule) and a fasta containing sequences to include in the tree has been specified using phylo_include_seqs: in the main SIGNAL config.yaml. If run_freebayes is set to True, then SIGNAL will attempt to link the FreeBayes consensus FASTA and variant files, if found. Otherwise, the corresponding iVar files will be used instead.

SIGNAL will then execute ncov-tools and the output will be found within the SIGNAL results directory, specified in SIGNAL's configuration file, under ncov-tools-results. Refer to the ncov-tools documentation for information regarding specific output.

Multiple operations:

Using signalexe.py positional arguments, you can specify SIGNAL to perform multiple rules in succession.

python signalexe.py --configfile config.yaml --cores NCORES all postprocess ncov_tools

In the above command, SIGNAL all, postprocess, and ncov_tools will run using the provided configuration file as input, which links to a sample table.

Note: Regardless of order for positional arguments, or placement of other parameter flags, SIGNAL will always run in the set order priority: all > postprocess > ncov_tools!

Note: If install is provided as input, it will override all other positional arguments!

If no configuration file or sample table was generated for a run, you can provide --directory with the path to sequencing reads and SIGNAL will auto-generate both required inputs prior to running any rules.

python signalexe.py --directory /path/to/reads --cores NCORES all postprocess ncov_tools

Overall, this simplifies executing SIGNAL to one line!

Docker:

Alternatively, the pipeline can be deployed using Docker (see resources/Dockerfile_pipeline for specification). To pull from dockerhub:

    docker pull finlaymaguire/signal

Download data dependencies into a data directory that already contains your reads (data in this example, but use whatever name you wish):

    mkdir -p data && docker run -v $PWD/data:/data finlaymaguire/signal:latest bash scripts/get_data_dependencies.sh -d /data

Generate your config.yaml and sample_table.csv (with paths to the readsets underneath /data) and place them into the data directory:

    cp config.yaml sample_table.csv $PWD/data

WARNING: result_dir in config.yaml must be within /data, e.g., /data/results, for results to automatically be copied to your host system. Otherwise they will be automatically deleted when the container finishes running (unless docker is run interactively).

Then execute the pipeline:

    docker run -v $PWD/data:/data finlaymaguire/signal conda run -n snakemake snakemake --configfile /data/config.yaml --use-conda --conda-prefix /covid-19-signal/.snakemake/conda --cores 8 all
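To debug, or to keep full control over which outputs are copied back, you can instead start an interactive shell in the container using standard docker flags (a sketch; the snakemake environment name matches the command above):

    docker run -it -v $PWD/data:/data finlaymaguire/signal bash
    # then, inside the container:
    conda run -n snakemake snakemake --configfile /data/config.yaml --use-conda \
        --conda-prefix /covid-19-signal/.snakemake/conda --cores 8 all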

Data Summaries:

Convenient extraction script:

SIGNAL produces several output files and directories on its own alongside the output for ncov-tools. Select files from the output can be copied or transferred for easier parsing using a provided convenience bash script:

bash scripts/get_signal_results.sh

Usage:
bash get_signal_results.sh -s <SIGNAL_results_dir> -d <destination_dir> [-m] [-c]

This script aims to copy (rsync by default, or cp) or move (mv) select output from SIGNAL 'all', 'postprocess', and 'ncov_tools'.

The following files will be transferred over to the specified destination directory (if found):
SIGNAL 'all' & 'postprocess':
-> signal-results/<sample>/<sample>_sample.txt
-> signal-results/<sample>/core/<sample>.consensus.fa
-> signal-results/<sample>/core/<sample>_ivar_variants.tsv
-> signal-results/<sample>/freebayes/<sample>.consensus.fasta
-> signal-results/<sample>/freebayes/<sample>.variants.norm.vcf

SIGNAL 'ncov_tools':
-> ncov-tools-results/qc_annotation/<sample>.ann.vcf
-> ncov-tools-results/qc_reports/<run_name>_ambiguous_position_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_mixture_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_ncov_watch_variants.tsv
-> ncov-tools-results/qc_reports/<run_name>_negative_control_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_summary_qc.tsv

Flags:
        -s  :  SIGNAL results directory
        -d  :  Directory where summary will be outputted
        -m  :  Invoke 'mv' move command instead of 'rsync' copying of results. Optional
        -c  :  Invoke 'cp' copy command instead of 'rsync' copying of results. Optional

The script uses rsync to produce accurate copies of select output files, organized into signal-results and ncov-tools-results within a provided destination directory (which must exist). If -c is provided, cp will be used instead of rsync to produce copies. Similarly, if -m is provided, mv will be used instead (WARNING: any interruption during mv could result in data loss).
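For example (flag behaviour as described above; the destination directory must already exist):

# Copy select results with rsync (the default)
bash scripts/get_signal_results.sh -s /path/to/signal/results -d /path/to/summary

# Same, but using cp instead of rsync
bash scripts/get_signal_results.sh -s /path/to/signal/results -d /path/to/summary -c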

Pipeline details:

For a step-by-step walkthrough of the pipeline, see pipeline/README.md.

A diagram of the workflow is shown below (update pending).

[Workflow diagram, version 8]

covid-19-signal's People

Contributors

fmaguire, hkeward, jaleezyy, kmsmith137, nknox, nodrogluap, pvanheus, raphenya


covid-19-signal's Issues

Version 2 for long-reads

Bigger milestone, likely to start as a separate repo: a version of the pipeline to support MinION and PacBio, with test data for both

Move to standardised "core"

From discussion in teams, consider moving towards a standardised set of core tools (with our additional QC and analyses):

trim_galore -> bwa -> ivar trim -> ivar variant -> ivar consensus

Uploader code

Write accessory scripts to upload clean reads as biosamples and final assembly to nextstrain/GISAID

missing dependencies in c19_postprocess

Maybe add to dependencies install script:

Traceback (most recent call last):
  File "scripts/c19_postprocess.py", line 13, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
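Until the install script covers this, a manual workaround is to install the missing module into the active environment (assuming conda; pip would work equally well):

conda install -c conda-forge matplotlib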

RuntimeWarning for generated figures

scripts/c19_postprocess.py:1075: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (matplotlib.pyplot.figure) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam figure.max_open_warning).

location/BED file based primer trimming

Using a primer scheme:

        ivarCmd = "ivar trim -e"
    } else {
        ivarCmd = "ivar trim"
    }
        """
        samtools view -F4 -o ${sampleName}.mapped.bam ${bam}
        samtools index ${sampleName}.mapped.bam
        ${ivarCmd} -i ${sampleName}.mapped.bam -b ${bedfile} -m ${params.illuminaKeepLen} -q ${params.illuminaQualThreshold} -p ivar.out
        samtools sort -o ${sampleName}.mapped.primertrimmed.sorted.bam ivar.out.bam

Add primer QC check

Do a QC for primers after trimming to make sure there aren't any remaining

Running pipeline via docker

  • Running the pull command on both Mac OS and Linux, I'm getting the following error:
docker pull finlaymaguire/pipeline
Using default tag: latest
Error response from daemon: pull access denied for finlaymaguire/pipeline, repository does not exist or may require 'docker login'
  • Running build on Mac OS using docker build -f Dockerfile_pipeline . gave error:
Solving environment: ...working... The command '/bin/sh -c conda create --name snakemake --channel conda-forge --channel bioconda snakemake=5.11.2' returned a non-zero code: 137

Error running get_data_dependencies.sh

I ran bash pipeline/scripts/get_data_dependencies.sh -d data -a MN908947.3 and got the following error:

Downloading nucleotide gb accession to taxon map...rsync: failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.12): Connection timed out (110)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::11): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(127) [Receiver=3.1.3]

Looking through the script, I narrowed it down to line 45: kraken2-build --download-taxonomy --db Kraken2/db --threads 10. When I run that at the command line, I get the same error as above, which led me to DerrickWood/kraken2#38. Adding --use-ftp solved my issue.

It would also be helpful to add a comment on the estimated disk space needed for these databases & dependencies.
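For reference, the workaround described above amounts to adding the flag to that kraken2-build call:

kraken2-build --download-taxonomy --db Kraken2/db --threads 10 --use-ftp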

simplify env files

Ensuring full conda env ymls (i.e. from conda env export) work long-term has been a nightmare for me recently.

I'd suggest we simplify the conda env yaml files down to the core tools we actually want to install and let the installation of all the dependencies be automatic.

Slightly less reproducible but a lot more maintainable.

c19_postprocess script needs updating

Need to swap cutadapt parsing for trim galore and ivar trim. Also need to change the directory where Fastqc files are found post-trimming. May want to run fastqc after ivar trim instead.

coverage plot

consider creating a read coverage plot as an alternative to that provided by BreSeq, based on the step 13 HiSAT2 results, with the y-axis in log scale?

IonTorrent support

Implement support for IonTorrent data in the pipeline; this should largely be the same as Illumina, with some quirks around quality score issues

Snakemake/Conda install issues (temporary solution?)

In attempting to create the conda environment with snakemake:

$ conda create -c conda-forge -c bioconda -n snakemake snakemake pandas
$ conda activate snakemake
(snakemake) $ snakemake --version
5.3.0
(snakemake) $

So firstly, the channels need to be specified or else it fails to find snakemake.
Secondly, the "latest" version of snakemake found is 5.3.0 (5.11.2 not even found, hence why the version is not specified in the above command...which may or may not be a typo...).
Apparently this is a known problem.

The following can be seen at the installation instructions in the Snakemake documentation:

conda install -c conda-forge mamba # install workaround mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake # create conda environment using mamba, installing snakemake
conda activate snakemake 

Temporary rewording of instructions to install snakemake through this alternate method? ...at least until conda catches up?

pandas dependency error

The installation script needs to install the pandas module. I had to install it separately using:
conda install -c anaconda pandas

Update documentation

Since the big PR, the documentation needs updating, including the figure.

@agmcarthur can the SVG for the overview figure be added to the repo?

new summary visualization - %SARS versus completeness

I've been using this pipeline and making this summary plot based on the output statistics. The label for the sample is "Sample Name, Average Fold Coverage".

[Screenshot: %SARS versus completeness summary plot, labelled with sample name and average fold coverage]

It would be great if this summary figure was generated automatically.

Use LMAT conda env instead of LMAT docker

We're currently calling LMAT through Fin's docker container, via the wrapper script lmat_wrapper.py which addresses some nuisance issues such as file permissions. Fin has also made an LMAT conda recipe (conda create -n lmat -c fmaguire lmat) which should remove these nuisance issues and be more convenient.

Breseq failing for some samples

Breseq failing for some samples even after re-runs, see error log:

grep "FATAL ERROR" */breseq/breseq.log -A 20
S278/breseq/breseq.log:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S278/breseq/breseq.log-Attempt to translate codon without three bases.
S278/breseq/breseq.log-FILE: reference_sequence.cpp   LINE: 2383
S278/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S278/breseq/breseq.log-Backtrace with 9 stack frames.
S278/breseq/breseq.log-breseq(+0x1b176) [0x56276e113176]
S278/breseq/breseq.log-breseq(+0x1844d2) [0x56276e27c4d2]
S278/breseq/breseq.log-breseq(+0x1868c5) [0x56276e27e8c5]
S278/breseq/breseq.log-breseq(+0x189860) [0x56276e281860]
S278/breseq/breseq.log-breseq(+0x18c222) [0x56276e284222]
S278/breseq/breseq.log-breseq(+0x4c2f1) [0x56276e1442f1]
S278/breseq/breseq.log-breseq(+0xc652) [0x56276e104652]
S278/breseq/breseq.log-/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fc60065bb97]
S278/breseq/breseq.log-breseq(+0x1a629) [0x56276e112629]
S278/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--
S384/breseq/breseq.log:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S384/breseq/breseq.log-Attempt to translate codon without three bases.
S384/breseq/breseq.log-FILE: reference_sequence.cpp   LINE: 2383
S384/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S384/breseq/breseq.log-Backtrace with 9 stack frames.
S384/breseq/breseq.log-breseq(+0x1b176) [0x55c171d16176]
S384/breseq/breseq.log-breseq(+0x1844d2) [0x55c171e7f4d2]
S384/breseq/breseq.log-breseq(+0x1868c5) [0x55c171e818c5]
S384/breseq/breseq.log-breseq(+0x189860) [0x55c171e84860]
S384/breseq/breseq.log-breseq(+0x18c222) [0x55c171e87222]
S384/breseq/breseq.log-breseq(+0x4c2f1) [0x55c171d472f1]
S384/breseq/breseq.log-breseq(+0xc652) [0x55c171d07652]
S384/breseq/breseq.log-/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fd978c3db97]
S384/breseq/breseq.log-breseq(+0x1a629) [0x55c171d15629]
S384/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--
S56/breseq/breseq.log:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S56/breseq/breseq.log-Attempt to translate codon without three bases.
S56/breseq/breseq.log-FILE: reference_sequence.cpp   LINE: 2383
S56/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S56/breseq/breseq.log-Backtrace with 9 stack frames.
S56/breseq/breseq.log-breseq(+0x1b176) [0x560f40897176]
S56/breseq/breseq.log-breseq(+0x1844d2) [0x560f40a004d2]
S56/breseq/breseq.log-breseq(+0x1868c5) [0x560f40a028c5]
S56/breseq/breseq.log-breseq(+0x189860) [0x560f40a05860]
S56/breseq/breseq.log-breseq(+0x18c222) [0x560f40a08222]
S56/breseq/breseq.log-breseq(+0x4c2f1) [0x560f408c82f1]
S56/breseq/breseq.log-breseq(+0xc652) [0x560f40888652]
S56/breseq/breseq.log-/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f31a3734b97]
S56/breseq/breseq.log-breseq(+0x1a629) [0x560f40896629]
S56/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I'm suspecting the fastqs didn't end properly.

adjust Trimmomatic options

There are short sequences in the fastqs after the Trimmomatic step; we should update the options to specify lengths:

ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
LEADING: Cut bases off the start of a read, if below a threshold quality
TRAILING: Cut bases off the end of a read, if below a threshold quality
CROP: Cut the read to a specified length
HEADCROP: Cut the specified number of bases from the start of the read
MINLEN: Drop the read if it is below a specified length
TOPHRED33: Convert quality scores to Phred-33
TOPHRED64: Convert quality scores to Phred-64

run_bed_primer_trim rule fails

Pipeline consistently fails at this step. I can run the cmd manually (samtools view -F / ivar trim) and it works - might be a relative path issue?

Error: It can't generate and find reference.mapped.primertrimmed.sorted.bam

ncov-tools failing due to same headers in the consensus.fasta

Error message:

grep error ncovtools.log -B 20
Finished job 6.
1 of 7 steps (14%) done

[Wed May 27 14:33:06 2020]
rule make_msa:
    input: qc_analysis/default_consensus.fasta
    output: qc_analysis/default_aligned.fasta
    jobid: 5
    wildcards: prefix=default

Detected duplicate input strains "Consensus_virus.consensus_threshold_0.75_quality_20" but the sequences are different.
[Wed May 27 14:33:08 2020]
Error in rule make_msa:
    jobid: 5
    output: qc_analysis/default_aligned.fasta
    shell:
        augur align --sequences qc_analysis/default_consensus.fasta --reference-sequence /workspace/raphenar/sars-cov-2/data/MN908947.3.fasta --output qc_analysis/default_aligned.fasta --fill-gaps
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Need to rename the headers, i.e.:

$ head -1 S99.consensus.fasta
>Consensus_virus.consensus_threshold_0.75_quality_20
$ head -1 SB2.consensus.fasta
>Consensus_virus.consensus_threshold_0.75_quality_20
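A one-liner along these lines would make the headers unique before rerunning ncov-tools (a sketch; adjust to your own naming scheme):

for f in *.consensus.fasta; do
    sed -i "s|^>.*|>${f%%.*}|" "$f"    # rewrite the header to the sample prefix, e.g. >S99
done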

Output details on README

Generate a verbose README on the output files from individual assemblies, summaries among assemblies, ncov-tools, etc. Consider making a corresponding YouTube video.

KeyError phylo_include_seqs for ncov-tools integration

Activating conda environment: /workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/conda/afb830cb
Writing config for ncov to ncov-tools/config.yaml
Traceback (most recent call last):
File "/workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/scripts/tmptora1qx_.ncov-tools.py", line 59, in
set_up()
File "/workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/scripts/tmptora1qx_.ncov-tools.py", line 24, in set_up
'tree_include_consensus': f"'{os.path.abspath(snakemake.config['phylo_include_seqs'])}'",
KeyError: 'phylo_include_seqs'
[Tue May 26 22:37:32 2020]
Error in rule ncov_tools:
jobid: 0
conda-env: /workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/conda/afb830cb

RuleException:
CalledProcessError in line 120 of /workspace/raphenar/sars-cov-2/covid-19-sequencing/Snakefile:
Command 'source /home/raphenar/miniconda3/bin/activate '/workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/conda/afb830cb'; set -euo pipefail; python /workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/scripts/tmptora1qx_.ncov-tools.py' returned non-zero exit status 1.
File "/workspace/raphenar/sars-cov-2/covid-19-sequencing/Snakefile", line 120, in __rule_ncov_tools
File "/home/raphenar/miniconda3/envs/snakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/log/2020-05-26T223716.884897.snakemake.log

ivar failing with ARTIC bed

ivar trim -e -i Blank1/core/reference.mapped.bam -b /workspace/raphenar/sars-cov-2/covid-19-sequencing/resources/primer_schemes/nCoV-2019_v3.bed -m 20 -q 20 -p Blank1/core/reference.mapped.primertrimmed 2> Blank1/core/ivar_trim.log
iVar uses the standard 6 column BED format as defined here - https://genome.ucsc.edu/FAQ/FAQformat.html#format1.
It requires the following columns delimited by a tab: chrom, chromStart, chromEnd, name, score, strand
$ head /workspace/raphenar/sars-cov-2/covid-19-sequencing/resources/primer_schemes/nCoV-2019_v3.bed
MN908947.3	30	54	nCoV-2019_1_LEFT	nCoV-2019_1	+
MN908947.3	385	410	nCoV-2019_1_RIGHT	nCoV-2019_1	-
MN908947.3	320	342	nCoV-2019_2_LEFT	nCoV-2019_2	+
MN908947.3	704	726	nCoV-2019_2_RIGHT	nCoV-2019_2	-
MN908947.3	642	664	nCoV-2019_3_LEFT	nCoV-2019_1	+
MN908947.3	1004	1028	nCoV-2019_3_RIGHT	nCoV-2019_1	-
MN908947.3	943	965	nCoV-2019_4_LEFT	nCoV-2019_2	+
MN908947.3	1312	1337	nCoV-2019_4_RIGHT	nCoV-2019_2	-
MN908947.3	1242	1264	nCoV-2019_5_LEFT	nCoV-2019_1	+
MN908947.3	1623	1651	nCoV-2019_5_RIGHT	nCoV-2019_1	-

$ head /workspace/raphenar/sars-cov-2/covid-19-sequencing/resources/primer_schemes/Wuhan_liverpool_primers_28.01.20.bed 
MN908947.3	248	269	Wuhan_1_LEFT	1	+
MN908947.3	1184	1203	Wuhan_1_RIGHT	1	-
MN908947.3	944	963	Wuhan_2_LEFT	2	+
MN908947.3	2137	2156	Wuhan_2_RIGHT	2	-
MN908947.3	1912	1931	Wuhan_3_LEFT	1	+
MN908947.3	3146	3165	Wuhan_3_RIGHT	1	-
MN908947.3	2936	2957	Wuhan_4_LEFT	2	+
MN908947.3	4180	4199	Wuhan_4_RIGHT	2	-
MN908947.3	4052	4071	Wuhan_5_LEFT	1	+
MN908947.3	5324	5347	Wuhan_5_RIGHT	1	-

The 5th column needs to be a score

Snakemake running error

/usr/local/lib/python3.6/dist-packages/snakemake/workflow.py:18: FutureWarning: read_table is deprecated, use read_csv instead.
  
NameError in line 159 of /data0/fwhelan/mcarthur/covid-19-sequencing/pipeline/Snakefile.master:
name 'multiext' is not defined
  File "/data0/fwhelan/mcarthur/covid-19-sequencing/pipeline/Snakefile.master", line 159, in <module>

Output postprocessing

Writing python script to postprocess a pipeline run and populate a .csv file with high-level summary info from Andrew's spreadsheet

typo in install instructions

Hi all- Andrew asked if I could do an independent test of your workflow (install and usage). I'll drop any issues I find here.

I think the install instructions should read bash pipeline/scripts/get_data_dependencies.sh -d data -a MN908947.3 instead of bash scripts/get_data_dependencies.sh -d data -a MN908947.3
