
genome-grist's People

Contributors

bluegenes, ctb, dependabot[bot], marisalim, mr-eyes


genome-grist's Issues

add gene abundance information to output

tracy teal suggested in conversation that we output gene abundances - I suspect this can be done pretty easily with the bam files and bedtools, we'd just need to add the NCBI annotation files into the mix.

ref #10
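A minimal sketch of what that could look like, assuming a sorted BAM of reads mapped to one reference genome and the matching NCBI GFF annotation (file names here are hypothetical):

# count reads overlapping each annotated feature; -counts appends the
# number of overlapping reads as the last column of the GFF lines
bedtools coverage -counts \
    -a GCF_000012825.1_genomic.gff \
    -b SRR5855411.x.GCF_000012825.1.sorted.bam \
    > SRR5855411.x.GCF_000012825.1.feature_counts.tsv

(We'd probably want to filter the GFF down to gene features first, and normalize counts by feature length and library size.)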

use snakemake `temp` to remove temporary files automatically

I ran out of room in /tmp on my node, and noticed that leftovers from previous downloads that had failed or been cancelled with Ctrl-C hadn't been removed. It looks like temporary fastq removal from /tmp/tmp.* only happens if the download_sra_wc rule succeeds (removal line here).

Error:

2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - failed SRR4140278
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'

Might be worth getting snakemake to remove these automatically by declaring them as temporary output in the rule (documentation here; code: temp("/path/to/output_file")). I see that part of the issue is that $ACC doesn't always match wildcards.sample, but temp with directory should work.

update: working PR using temp with directory in #71.
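For reference, a minimal sketch of the temp-with-directory idea (rule and path names below are made up, not the actual grist rules):

rule download_sra_wc:
    output:
        # declare the whole scratch directory as temporary output, so
        # snakemake removes it once downstream jobs are done -- and also
        # cleans it up as a "failed job output" if the rule dies
        temp(directory("/tmp/grist_sra/{sample}"))
    shell: """
        mkdir -p {output}
        # fastq-dump into {output} would go here
    """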

computing ANI for metagenomes vs ref genomes

there are three ways to do this.

  1. mapping based: map reads, call variants, build a variant call file (VCF), estimate ANI from VCF SNP calls.
  2. mapping based: create a new consensus genome based on the VCF (bcftools supports this natively!)
  3. assembly based: do spacegraphcats queries (which will get all mapped reads + graph neighborhood), then run an assembler.

I’m in the process of trying (2) and (3) now. All very straightforward with tools we have.

It's not clear what to do about genomes that are not fully covered - I'll do some exploratory analysis! It's probably best to eliminate those regions from consideration (which is what an assembly-based approach will do).
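For the record, a rough sketch of what approach (2) could look like on the command line (standard bcftools/fastANI invocations; file names are placeholders):

# call variants from the mapped reads against the reference
bcftools mpileup -f ref_genome.fa mapped.sorted.bam | \
    bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz

# build a consensus genome from the VCF
bcftools consensus -f ref_genome.fa calls.vcf.gz > consensus.fa

# estimate ANI between the consensus and the reference, e.g. with fastANI
fastANI -q consensus.fa -r ref_genome.fa -o ani_report.txt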

comments from @taylorreiter -

ANI is a more accepted/understood metric for genome relatedness than the %s gather reports.
Would tell you how related the thing in Q (the query) is to all the Rs (references) in a database.
Theoretically could insert it into a tree based on that info, I think a la pplacer?

comments from @widdowquinn -

I suspect, but have not shown, that there is an asymptotic relationship between proportion of genome covered and ANI %identity. I wouldn’t be surprised if the proportion of a bacterial genome you need for an acceptable estimate of ANI %id is relatively small (cgMLST/CNI may well even be overkill…).
However, %coverage does matter for interpretation, and the extent of mapping is an upper bound on the observed %coverage.

and a somewhat tangential comment from nanopore considerations -

I would expect k-mer approaches to always work better here; I think it might be satisfying to demonstrate that ;)

config file location thoughts

titus 7:51 AM
what is good practice for user-override systems-level config files in python packages, that works well with conda? I would like to have non-git-archived config files that operate at “system” or “user” level, as well as per-project-directory overrides. Maybe something like:

~/.genome-grist.conf - system/user level overrides, processed first
./genome-grist.conf - per-project-directory overrides, processed after system, but before project- or sample-specific overrides
path/to/sample.conf - project-/sample-specific overrides

seem reasonable?

bluegenes 7:56 AM
can you name the first two with the level (e.g. genome-grist.system.conf, genome-grist.directory.conf) or similar? Just thinking it might be a little confusing to explain the ordering between identically named files, even if they are in different locations.

luizirber 8:06 AM
And please use ~/.config/genome-grist/ as basedir, don't pollute my home dir =)
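For what it's worth, a minimal sketch of the override logic being discussed, in plain Python with YAML config files (paths and the merge strategy are illustrative, not what genome-grist actually does):

import os
import yaml

def load_configs(project_conf=None):
    """Merge config files in increasing order of precedence."""
    candidates = [
        os.path.expanduser("~/.config/genome-grist/config.conf"),  # system/user level
        "./genome-grist.conf",                                      # per-project directory
    ]
    if project_conf:
        candidates.append(project_conf)                             # project/sample specific

    merged = {}
    for path in candidates:
        if os.path.exists(path):
            with open(path) as fp:
                merged.update(yaml.safe_load(fp) or {})
    return merged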

figure out how to retain/extract unmapped reads, too.

we already have the unmapped hashes, due to the split function.

for unmapped reads... what do we want to do with them?

  • feed them from a DNA query into a protein query
  • run sgc on them with a different set of queries (i.e. use genome-grist as a prefilter)
  • other things?
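As a starting point, pulling the unmapped reads themselves out of an existing BAM is a one-liner with samtools (a sketch; file names are placeholders):

# -f 4 keeps only reads with the "unmapped" SAM flag set;
# samtools fastq writes them back out as FASTQ
samtools fastq -f 4 sample.x.genome.bam | gzip > sample.unmapped.fq.gz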

thoughts on taxonomic comparisons, protein searches, and challenges therein

I'm becoming rather fond of the protein search (#40) and want to combine DNA and protein searches (#48), but I keep running into conceptual problems about how to deal with taxonomy. This issue is to explain those problems to future me.

In #40, I explored comparison of DNA vs protein taxonomy.

However, the challenge is that I used different databases! I used the complete GenBank database (700k+ genomes) for DNA searching, and the GTDB genus-level protein database for protein searching. That's OK, because both databases have GenBank identifiers that let me pull out NCBI taxonomy.

I do prefer the GTDB taxonomy for things, but the majority of GenBank genomes don't have GTDB taxonomy assigned! I suppose we could relabel all of GenBank with GTDB...

In addition, GTDB doesn't have euk sequences in it, so we'd have to figure out how to add euk in. This now makes the question relevant to charcoal dib-lab/charcoal#30, proving that ultimately all of our software gets stuck on the same set of hard problems :).

Anyway, for now I'm stuck using (a) NCBI taxonomy with (b) a protein database that doesn't contain euk sequences.

choose abundtrim memory parameters based on k-mer counting estimates

ref #51

Mo -
so, khmer pre-allocates the memory even when there's no need for it?
titus 9 minutes ago
Yes - after all, how do you know how many kmers there are if you haven’t read through the whole data set yet?
titus 8 minutes ago
MQF paper has to address same problem ;). In Khmer we use bloom filter based data structures that let us preallocate memory.
Mo 7 minutes ago
Got it now ^^ thanks
titus 6 minutes ago
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0101271

titus 6 minutes ago
😁
titus 6 minutes ago
(there WILL be a quiz… 😬 )

Mo 5 minutes ago
haha I am thrilled
Mo 4 minutes ago
If we were to develop an estimation function for the distinct k-mer count, would a tool like KmerStream or an alternative algorithm reduce the need for the fixed memory requirement?
Mo 4 minutes ago
https://github.com/pmelsted/KmerStream

titus 3 minutes ago
We have hll implemented too. Still have to walk across the data once.

titus 1 minute ago
now that I see your previous comment - yeah, we’ve explored this (extremely) thoroughly in khmer. with grist, we could totally integrate a k-mer counting scheme without much extra effort, since we already do multiple passes across the data.
titus 1 minute ago
…please let me know when the pull request is ready 😁
titus < 1 minute ago
(more seriously, it’s just not that big an issue for grist, so I don’t worry about it. the high water mark for grist’s memory usage is the database search at the moment, and for genbank that’s ~120 GB, so …)
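To make the earlier point concrete: khmer's unique-kmers.py already gives an HLL-based estimate of the number of distinct k-mers, which could in principle feed into the memory choice for trim-low-abund (a sketch; how to map the estimate onto -M is the actual open question):

# estimate distinct 31-mers in the input reads with khmer's HLL
unique-kmers.py -k 31 sample.fq.gz

# then pick the trim-low-abund memory cap (-M) based on that estimate
trim-low-abund.py -C 3 -Z 18 -V -M 20e9 sample.fq.gz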

custom database

Hello,

Thank you so much for your wonderful tool. I was wondering how we would adapt your pipeline if we have a set of isolate genomes that we want to use to query our metagenomes. We have analyzed our isolate genomes and have a VCF file representing all the variants that can be found, and we want to find the abundance of those strains/variants. Is this at all possible with your tool?

Many thanks in advance,
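Not an official answer, but one likely direction is to sketch the isolate genomes with sourmash and point the gather step at that database instead of GenBank (a minimal sketch of building such a database; the VCF/strain-resolution part is a separate question):

# sketch each isolate genome
sourmash sketch dna -p k=31,scaled=1000 isolates/*.fa

# bundle the signatures into a searchable database
sourmash index isolates-db.sbt.zip *.sig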

improve reports

our current reports are not really aimed at anyone other than internal lab members. it would be easy to add some more information that is already computed by genome-grist. for example, we could produce and provide --

  • FastQC reports on the metagenome itself
  • a summary of reference genomes - linking names to accessions, etc.
  • a summary of mapping statistics in numerical form, including abundances
  • an estimate or rundown of the number of unmapped reads

this is on top of a basic need for more explanatory information about what is going on in the first place :)

more advanced stuff could include gene abundances and so on for portions of genomes that were actually mapped to, although that starts to get into a territory of doing a lot more work 👀
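several of these are close to one-liners on top of files grist already produces, e.g. (a sketch, with placeholder file names):

# FastQC report on the trimmed metagenome
fastqc -o reports/fastqc/ sample.abundtrim.fq.gz

# numerical mapping statistics for one reference genome
samtools flagstat sample.x.GCF_000012825.1.sorted.bam > sample.x.GCF_000012825.1.flagstat.txt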

why are `checkpoint gather_genbank` and `checkpoint gather_genbank_wc` checkpoints instead of rules?


I understand the purpose of Checkpoint_GatherResults to dynamically solve for outputs based on accessions in gather results. However, the output of gather_genbank and gather_genbank_wc are based only on the wildcard sample, which is defined in the config and not dynamically altered by an unknown number of outputs in the workflow, so shouldn't/couldn't these checkpoints just be rules?

(wondering if this may be related to #77, and I want to reimplement a class like Checkpoint_GatherResults in another workflow so also curious if my mental model is missing something important)
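(For anyone landing here later, my working mental model -- which may be exactly what's being questioned here -- is sketched below: a checkpoint is only needed when a downstream input function has to call checkpoints.<name>.get() to discover new wildcard values from the output; if the outputs are fully determined by the wildcards, a plain rule should suffice.)

# outputs fully determined by the wildcards: a plain rule is enough
rule gather_genbank:
    input: "sigs/{sample}.sig"
    output: "genbank/{sample}.x.genbank.gather.csv"
    shell: "sourmash gather {input} genbank.sbt.zip -o {output}"

# a checkpoint only matters if a downstream input function needs to
# re-read the output to learn new wildcard values, e.g. (if the rule
# above were declared as a checkpoint):
#
#   gather_csv = checkpoints.gather_genbank.get(sample=w.sample).output[0]
#   accs = [row["name"].split(" ")[0] for row in csv.DictReader(open(gather_csv))]
#   return expand("genbank_genomes/{acc}_genomic.fna.gz", acc=accs)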

reporting on multiple samples

with #29, genome-grist now supports running on multiple samples. This raises the question: what kind of reporting should/could we be doing here for collections of samples?

some ideas -

dealing with host/human reads

In a conversation with @ctb, we realized that A LOT of the metagenomes I've been working with for a recent project appeared truncated. In fact, the files were much smaller than what is on the SRA because I had removed human reads before downstream analysis.
Here's a smattering of the file sizes for some of the files I've been working with. For some, there are almost no human (host) reads, and for others, there's quite a bit:

-rw-r--r--  1 tereiter tereiter  232M Feb 23  2020 4000_R1.human.fq.gz
-rw-r--r--  1 tereiter tereiter  1.5G Feb 23  2020 4000_R1.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  232M Feb 23  2020 4000_R2.human.fq.gz
-rw-r--r--  1 tereiter tereiter  1.6G Feb 23  2020 4000_R2.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  8.8M Feb 23  2020 4001_R1.human.fq.gz
-rw-r--r--  1 tereiter tereiter  1.5G Feb 23  2020 4001_R1.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  8.3M Feb 23  2020 4001_R2.human.fq.gz
-rw-r--r--  1 tereiter tereiter  1.5G Feb 23  2020 4001_R2.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  5.2M Feb 23  2020 4002_R1.human.fq.gz
-rw-r--r--  1 tereiter tereiter  2.4G Feb 23  2020 4002_R1.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  4.3M Feb 23  2020 4002_R2.human.fq.gz
-rw-r--r--  1 tereiter tereiter  2.4G Feb 23  2020 4002_R2.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  5.5M Feb 23  2020 4004_R1.human.fq.gz
-rw-r--r--  1 tereiter tereiter  978M Feb 23  2020 4004_R1.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  5.2M Feb 23  2020 4004_R2.human.fq.gz
-rw-r--r--  1 tereiter tereiter  987M Feb 23  2020 4004_R2.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter  5.3M Feb 26  2020 PSM7J4EF_R1.human.fq.gz
-rw-rw-r--  1 tereiter tereiter   54M Feb 26  2020 PSM7J4EF_R1.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter  5.1M Feb 26  2020 PSM7J4EF_R2.human.fq.gz
-rw-rw-r--  1 tereiter tereiter   54M Feb 26  2020 PSM7J4EF_R2.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter   80K Feb 24  2020 PSMA264O_R1.human.fq.gz
-rw-rw-r--  1 tereiter tereiter  738M Feb 24  2020 PSMA264O_R1.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter   83K Feb 24  2020 PSMA264O_R2.human.fq.gz
-rw-rw-r--  1 tereiter tereiter  755M Feb 24  2020 PSMA264O_R2.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter  346K Feb 25  2020 PSMA265X_R1.human.fq.gz
-rw-rw-r--  1 tereiter tereiter  540M Feb 25  2020 PSMA265X_R1.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter  337K Feb 25  2020 PSMA265X_R2.human.fq.gz
-rw-rw-r--  1 tereiter tereiter  544M Feb 25  2020 PSMA265X_R2.nohost.fq.gz

I used bbduk to remove the human reads. You can download the database here: https://drive.google.com/file/d/0B3llHR93L14wd0pSSnFULUlhcUk/edit?usp=sharing
The release of this database is documented here: http://seqanswers.com/forums/archive/index.php/t-42552.html

rule remove_host:
    output:
        r1 = 'outputs/bbduk/{library}_R1.nohost.fq.gz',
        r2 = 'outputs/bbduk/{library}_R2.nohost.fq.gz',
        human_r1='outputs/bbduk/{library}_R1.human.fq.gz',
        human_r2='outputs/bbduk/{library}_R2.human.fq.gz'
    input:
        r1 = 'outputs/cut/{library}_R1.cut.fq.gz',
        r2 = 'outputs/cut/{library}_R2.cut.fq.gz',
        human='inputs/host/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz'
    conda: 'envs/env.yml'
    shell:'''
    bbduk.sh -Xmx64g t=3 in={input.r1} in2={input.r2} out={output.r1} out2={output.r2} outm={output.human_r1} outm2={output.human_r2} k=31 ref={input.human}
    '''

I originally detected human reads by adding a human genome signature to my gather runs. Up to 10% of some samples was human. A genome sig is here: https://osf.io/fxup3/ and an rna sig is here: https://osf.io/anj6b/
If using the genome sig, repetitive regions may cause false matching, especially to other eukaryotes. However, the RNA sig is much smaller and would miss a lot of real human reads. The bbduk database above deals with this nicely, I think... sourmash is currently at the quick-and-dirty stage. Could it be possible to turn the bbduk database into a signature?
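Re that last question -- turning the bbduk reference into a signature should just be a sketch call along these lines (the k-size/scaled values are guesses, not a recommendation):

# build a sourmash signature from the bbduk human/host masking reference
sourmash sketch dna -p k=31,scaled=1000 \
    hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz \
    -o host-mask.sig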

Errors on WDL - connection timed out

Titus:

I wonder what the restrictions are on internet access during the jobs? there are a bunch of HTTP calls out to NCBI… it’d be annoying to have to pre-load the matching genomes (or, well, run gather… then exit wf, download matching genomes somewhere else, then restart wf)


Current error running genome-grist tutorial in WDL:

rule make_genbank_info_csv:
    output: genbank_genomes/GCF_000012825.1.info.csv
    jobid: 29
    wildcards: acc=GCF_000012825.1

opening directory: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/012/825
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1567, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1588, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2409, in __init__
    self.init()
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2418, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/root/miniconda3/lib/python3.8/ftplib.py", line 152, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/root/miniconda3/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/root/miniconda3/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 124, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 106, in main
    genome_url, assembly_report_url = url_for_accession(acc)
  File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 24, in url_for_accession
    with urllib.request.urlopen(url) as response:
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1585, in ftp_open
    raise exc.with_traceback(sys.exc_info()[2])
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1567, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1588, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2409, in __init__
    self.init()
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2418, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/root/miniconda3/lib/python3.8/ftplib.py", line 152, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/root/miniconda3/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/root/miniconda3/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
urllib.error.URLError: <urlopen error ftp error: TimeoutError(110, 'Connection timed out')>
[Thu Jan 28 17:51:21 2021]
Error in rule make_genbank_info_csv:
    jobid: 29
    output: genbank_genomes/GCF_000012825.1.info.csv
    shell:
        
        python -m genome_grist.genbank_genomes GCF_000012825.1             --output genbank_genomes/GCF_000012825.1.info.csv
    
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job make_genbank_info_csv since they might be corrupted:
genbank_genomes/GCF_000012825.1.info.csv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /cromwell-executions/ggristv2/8c56fb3e-b7dd-4e93-aaa7-1c7e2cecb4c1/call-run_test/execution/grist/.snakemake/log/2021-01-28T173100.518576.snakemake.log
Error in snakemake invocation: Command '['snakemake', '-s', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/Snakefile', '-j', '1', '--use-conda', 'summarize', '--configfile', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/defaults.conf', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/system.conf', '/cromwell-executions/ggristv2/8c56fb3e-b7dd-4e93-aaa7-1c7e2cecb4c1/call-run_test/inputs/-269893485/conf-tutorial.yml']' returned non-zero exit status 1.
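One low-effort mitigation (not something genome_grist.genbank_genomes currently does -- just a sketch) would be to wrap the urlopen call in a retry with backoff, so transient timeouts in the WDL environment don't kill the whole workflow:

import time
import urllib.error
import urllib.request

def urlopen_with_retries(url, tries=5, delay=30):
    """Retry urllib.request.urlopen on transient network errors."""
    for attempt in range(tries):
        try:
            return urllib.request.urlopen(url)
        except (urllib.error.URLError, TimeoutError):
            if attempt == tries - 1:
                raise
            time.sleep(delay * (attempt + 1))   # simple linear backoff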

comparing and evaluating DNA vs protein matches

So, I did a thing, and I don't really know how to evaluate it. Thoughts welcome!

summary: DNA and protein

so I gristed a bunch of the Northen data sets, using both DNA searching (against all of GenBank, ~700k genomes) and protein searching (against @bluegenes' genus-level GTDB databases).

Then I intersected with Genbank taxonomy and did taxonomic aggregation and summarization (#35).

taxonomy report on DNA search

default config, against genbank.

phylum level DNA:
dna gathergram-phylum-SRR5855411

genus level DNA:
dna gathergram-genus-SRR5855411

species level DNA:
dna gathergram-species-SRR5855411

taxonomy report on protein search

config:

outdir: outputs.roux.protein/
sourmash_scaled: 1000
sourmash_compute_ksizes:
- 11
sourmash_sigtype: protein
sourmash_database_glob_pattern: /home/ntpierce/thumper/databases/gtdb95-genus-n0.protein-k11-scaled100.sbt.zip
sourmash_database_ksize: 33
sourmash_database_threshold_bp: 50e3

phylum level protein:
prot gathergram-phylum-SRR5855411

genus level protein
prot gathergram-genus-SRR5855411

species level protein:
prot gathergram-species-SRR5855411

how to currently install and run cmds

To install genome-grist, you currently need to run:

git clone https://github.com/dib-lab/genome-grist.git
pip install -e genome-grist

As of today, this installation version is v0.6.dev6+g0e772df.

genome-grist commands need to be run from inside the repo right now, for example:

cd genome-grist
genome-grist process HSMA33MX smash_reads
# snakemake workflow should start running
# or you can run the whole test workflow 
make test
# snakemake workflow should start running

If you use the pip install instructions in the repo README, the installation code will run without error (v0.5). But that installation method is missing the config.yml file and you'll get errors like this when you try the quick start command genome-grist process HSMA33MX smash_reads:

File "/home/mcwlim/miniconda3/envs/grist/lib/python3.7/site-packages/genome_grist/__main__.py", line 77, in run_snakemake
    raise ValueError(f"cannot find config file '{configfile}'")
ValueError: cannot find config file 'conf.yml'

run both protein and DNA searches by default

I increasingly like the idea of generating a combined report for DNA and protein matches per #40.

This would add another layer of reporting. In addition to reporting summaries across multiple samples per #42, and aggregated taxonomic reporting #35 at the DNA level, we would add protein reporting, and also maybe do some kind of comparative analysis of DNA vs protein presence.

add summary reporting for multiple samples

using #56, add a notebook or output for summary reporting on sample details.

some code for loading and displaying:

import itables

import glob
import yaml
import pandas

x = []
for yf in glob.glob('outputs.paper/*.yaml'):
    with open(yf, 'rt') as fp:
        d = yaml.safe_load(fp)
        x.append(d)

df = pandas.DataFrame(x)
cols = ['sample', 'n_reads', 'n_bases', 'kmers', 'total_hashes', 'known_hashes', 'unknown_hashes']
df = df[cols]

# display as an interactive table (presumably what the itables import above is for)
itables.show(df)

add pandas/notebook utilities into genome-grist codebase?

this seems like generically useful code that ensures that the various dataframes are all copacetic and synchronized...

from collections import namedtuple
import pandas as pd

SampleDFs = namedtuple('SampleDFs', 'gather_df, all_df, left_df, names_df')

def load_dfs(outdir, sample_id):

    # load mapping CSVs
    all_df = pd.read_csv(f'{outdir}/minimap/depth/{sample_id}.summary.csv')
    left_df = pd.read_csv(f'{outdir}/leftover/depth/{sample_id}.summary.csv')

    # load gather CSV
    gather_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.x.genbank.gather.csv')

    # names!
    names_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.genomes.info.csv')

    # connect gather_df to all_df and left_df using 'genome_id'
    def fix_name(x):
        return "_".join(x.split('_')[:2]).split('.')[0]

    gather_df['genome_id'] = gather_df['name'].apply(fix_name)
    names_df['genome_id'] = names_df['acc'].apply(fix_name)

    # check that all dataframes are copacetic
    in_gather = set(gather_df.genome_id)
    in_left = set(left_df.genome_id)
    assert in_gather == in_left
    assert in_gather == set(names_df.genome_id)
    assert in_gather == set(all_df.genome_id)

    # re-sort left_df and all_df to match gather_df order, using matching genome_id column
    all_df = all_df.set_index("genome_id")
    all_df = all_df.reindex(index=gather_df["genome_id"])
    all_df = all_df.reset_index()

    left_df = left_df.set_index("genome_id")
    left_df = left_df.reindex(index=gather_df["genome_id"])
    left_df = left_df.reset_index()

    #left_df["mapped_bp"] = (1 - left_df["percent missed"]/100) * left_df["genome bp"]
    #left_df["unique_mapped_coverage"] = left_df.coverage / (1 - left_df["percent missed"] / 100.0)

    names_df = names_df.set_index("genome_id")
    names_df = names_df.reindex(index=gather_df["genome_id"])
    names_df = names_df.reset_index()
    
    return SampleDFs(gather_df, all_df, left_df, names_df)

paladin snakemake rules (map to proteins)

about paladin: https://github.com/ToniWestbrook/paladin

"PALADIN currently only supports single-end reads (or reads merged with FLASH, PEAR, abyss-mergepairs), and BWA-MEM based alignment. It makes use of many BWA parameters and is therefore compatible with many of its command line arguments."

working snakefile for PEAR --> paladin. On farm, see /home/ntpierce/genome-grist/test_paladin to run w/test data

rule all:
    input: expand("paladin_mapped/{sample}.bam", sample="test")

rule pear_read_merging:
    """
    Merge PE reads with PEAR, for input into PALADIN
    """
    input:
        r1="{sample}_1.fq.gz",
        r2="{sample}_2.fq.gz"
    output:
        assembled = '{sample}.assembled.fastq',
        discarded = '{sample}.discarded.fastq',
        unassembled_r1 = '{sample}.unassembled.forward.fastq',
        unassembled_r2 = '{sample}.unassembled.reverse.fastq',
    message:
        """--- Merging paired reads using PEAR  ---"""
    params:
        pval=0.01,# (default PEAR pval)
        max_mem="4G",
    threads: 6
    log: "logs/{sample}.pear.log"
    conda: 'paladin-env.yml'
    shell:
        """
        pear -f {input.r1} -r {input.r2} -p {params.pval} \
             -j {threads} -y {params.max_mem} -o {wildcards.sample} > {log} 2>&1
        """

# using swissprot as an example, even though you can run paladin prepare to autodownload this, instead of running index
rule paladin_index:
    input:
        "uniprot_sprot.fasta",
    output:
        "uniprot_sprot.fasta.bwt"
    log:
        "logs/uniprot_sprot.paladin-index.log"
    conda: 'paladin-env.yml'
    shell:
        """
        paladin index -r3 {input}
        """

rule paladin_align:
    input:
        reads='{sample}.assembled.fastq',
        index="uniprot_sprot.fasta.bwt",
    output:
        "paladin_mapped/{sample}.bam"
    params:
        index_base="uniprot_sprot.fasta"
    log:
        "logs/{sample}.paladin-align.log"
    threads: 4
    conda: "paladin-env.yml"
    shell:
        """
        paladin align -t {threads} -T 20 {params.index_base} {input.reads} | samtools view -Sb - > {output}
        """

paladin-env.yml:

name: paladin
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - pear=0.9.6
  - paladin=1.4.6
  - samtools=1.11

preparing your own metagenome samples - an example `Snakefile`

So @taylorreiter gave me some internally sequenced and processed files that I wanted to feed into genome-grist, and I wrote the following little Snakefile to bypass all of the SRA stuff. FYI.

(taylor gave me the .nohost.fq.gz files, and I wanted to put them in interleaved format after fixing the names)

NAMES=['AW_BI_1_Soil', 'AW_BI_1_Water']

rule some:
    input:
        expand('outputs.killi/abundtrim/{N}.abundtrim.fq.gz', N=['AW_M2_Lung'])

rule all:
    input:
        expand('outputs.killi/abundtrim/{N}.abundtrim.fq.gz', N=NAMES)

rule fixname: # fix sequence names
    input:
        r1="{name}_R1.nohost.fq.gz",
        r2="{name}_R2.nohost.fq.gz",
    output:
        r1="{name}_R1.name.fq.gz",
        r2="{name}_R2.name.fq.gz",
    conda: 'genome_grist/conf/env/sra.yml'
    shell: """
        echo processing R1...
        seqtk seq -C {input.r1} | \
            perl -ne 's/\.([12])$/\/$1/; print $_' | \
            gzip -c > {output.r1}

        echo processing R2...
        seqtk seq -C {input.r2} | \
            perl -ne 's/\.([12])$/\/$1/; print $_' | \
            gzip -c > {output.r2}
    """

rule merge: # interleave
    input:
        r1="{name}_R1.name.fq.gz",
        r2="{name}_R2.name.fq.gz",
    output:
        "{name}.abundtrim.fq.gz"
    conda: 'genome_grist/conf/env/sra.yml'
    shell: """
        interleave-reads.py {input.r1} {input.r2} --gzip -o {output}
    """

Determining relative abundance of strains

Hello,

My understanding is that genome-grist tells us what genomes are present. I am trying to determine the relative abundance of the different strains that are found to be present. Can I do this by just using --track-abundance when running sourmash compute to make the database that is used by sourmash gather?

Also, would you mind explaining the difference between "hashes classified to this genome", "mapped bp to this genome", and "hashes classified to this species"?

Thank you!
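Not a definitive answer, but my understanding is that the abundance information used by gather comes from the metagenome (query) sketch rather than from the database; a hedged sketch of the usual invocation:

# sketch the metagenome with abundance tracking
sourmash sketch dna -p k=31,scaled=1000,abund metagenome.abundtrim.fq.gz -o metagenome.sig

# gather against the genome database; the f_unique_weighted / average_abund
# columns in the output CSV then carry per-genome abundance estimates
sourmash gather metagenome.sig genome-db.sbt.zip -o metagenome.gather.csv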

What should we do about SRA data sets with unpaired reads?

this rule:

# download SRA IDs.
rule wc_download_sra:
    output:
        r1 = protected(outdir + "/raw/{sra_id}_1.fastq.gz"),
        r2 = protected(outdir + "/raw/{sra_id}_2.fastq.gz"),
    conda: "env/sra.yml"
    shell: '''
        fastq-dump --skip-technical  \
               --readids \
               --read-filter pass \
               --dumpbase \
               --split-spot \
               --clip \
               -Z \
               {wildcards.sra_id} | \
               perl -ne 's/\.([12]) /\/$1 /; print $_' | \
               split-paired-reads.py --gzip -1 {output.r1} -2 {output.r2}
        '''

is generating the following error:

Unpaired reads found starting at SRR5855412.25364063/1 25364063 length=150; exiting
fastq-dump (PID 111316) was killed [SIGPIPE (13)] (no core dump was generated) at /home/ctbrown/genome-grist/.snakemake/conda/2d8b6fef/bin/fastq-dump line 242.
[Wed Dec 16 00:03:26 2020]
Error in rule wc_download_sra:
...

for multiple different SRA data sets (SRR5855412, SRR5855424, SRR5855429) and I'm not sure what to do about it.

My current thought is to download the data sets manually and clean them, but I'm wondering if there is a reason this is happening...

I was also thinking about splitting this out in the workflow so that, rather than doing all the piping in one rule, it would save the SRA download to a file first and then do the perl & split-paired-reads steps second. This would give me access to the data more directly, and I could use temp() to indicate that (once the file is successfully processed) it can be deleted. A sketch of that two-rule version is below.
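A sketch of that two-rule version, reusing only the pieces from the rule above (the intermediate path is made up):

rule wc_download_sra_raw:
    output: temp(outdir + "/raw/{sra_id}.combined.fastq")
    conda: "env/sra.yml"
    shell: '''
        fastq-dump --skip-technical --readids --read-filter pass \
               --dumpbase --split-spot --clip -Z {wildcards.sra_id} \
               > {output}
        '''

rule wc_split_sra_reads:
    input: outdir + "/raw/{sra_id}.combined.fastq"
    output:
        r1 = protected(outdir + "/raw/{sra_id}_1.fastq.gz"),
        r2 = protected(outdir + "/raw/{sra_id}_2.fastq.gz"),
    conda: "env/sra.yml"
    shell: '''
        perl -ne 's/\.([12]) /\/$1 /; print $_' {input} | \
            split-paired-reads.py --gzip -1 {output.r1} -2 {output.r2}
        '''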

error inserting my own files into genome grist for target download_matching_genomes

I'm not sure what mistake I'm making to cause this error message.

directory structure:

$ ls *
conf.yml  README.sh

genbank:
gather_vita_vars_gtdb_shared_assemblies.x.genbank.gather.csv

genbank_genomes:

conf file:

sample:
- gather_vita_vars_gtdb_shared_assemblies
outdir: .
metagenome_trim_memory: 1e9

command:

genome-grist run conf.yml download_matching_genomes

Error:

genome-grist run conf.yml download_matching_genomes
sample: ['gather_vita_vars_gtdb_shared_assemblies']
outdir: .
Building DAG of jobs...
InputFunctionException in line 144 of /home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile:
Error:
  WorkflowError:
    Missing wildcard values for sample
Wildcards:

Traceback:
  File "/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile", line 131, in __call__
Error in snakemake invocation: Command '['snakemake', '-s', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile', '-j', '1', '--use-conda', 'download_matching_genomes', '--configfile', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/defaults.conf', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/system.conf', 'conf.yml']' returned non-zero exit status 1.

head of the gather file I'm using (I did read a bunch of gather files into R and then write this out, so there are NAs, but I don't think they're the problem):

intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gathe
868000,0.01065658301821932,0.1523876404494382,0.01065658301821932,0.01065658301821932,NA,NA,NA,"GCA_000210075.1 Bacteroides xylanisolvens XB1A strain=XB1A, ASM2100
58000,7.120758237980651e-4,2.918004085205719e-4,2.2098904876491677e-4,2.2098904876491677e-4,NA,NA,NA,"GCA_000256725.2 Toxoplasma gondii TgCatPRC2 strain=TgCatPRC2,
106000,0.0013013799538378432,0.004314477468839885,2.2098904876491677e-4,2.2098904876491677e-4,NA,NA,NA,"GCA_000285855.2 Ruminococcus sp. JC304, ASM28585v2",/group/
394000,0.004837204734076512,0.2008324661810614,0.0047389873790698814,0.0047389873790698814,NA,NA,NA,"GCA_000431575.1 Firmicutes bacterium CAG:83, MGS83",/group/ctb
134000,0.001645140696361047,0.030209140201394268,9.576192113146393e-4,9.576192113146393e-4,NA,NA,NA,"GCA_000432135.1 Firmicutes bacterium CAG:124, MGS124",/group/c
20000,2.455433875165742e-4,0.009478672985781991,2.455433875165742e-4,2.455433875165742e-4,NA,NA,NA,"GCA_000432155.1 Eubacterium sp. CAG:274, MGS274",/group/ctbrown
100000,0.001227716937582871,0.030973451327433628,6.875214850464077e-4,6.875214850464077e-4,NA,NA,NA,"GCA_000432375.1 Firmicutes bacterium CAG:103, MGS103",/group/c
96000,0.001178608260079556,0.0519774011299435,0.0011294995825762413,0.0011294995825762413,NA,NA,NA,"GCA_000432435.1 Eubacterium sp. CAG:180, MGS180",/group/ctbrown
62000,7.6118450130138e-4,0.01794071762870515,5.647497912881207e-4,5.647497912881207e-4,NA,NA,NA,"GCA_000434635.1 Firmicutes bacterium CAG:110, MGS110",/group/ctbro
22000,2.700977262682316e-4,0.015068493150684932,2.700977262682316e-4,2.700977262682316e-4,NA,NA,NA,"GCA_000435175.1 Clostridium sp. CAG:245, MGS245",/group/ctbrown
540000,0.0066296714629475026,0.209992193598751,0.006605117124195845,0.006605117124195845,NA,NA,NA,"GCA_000435395.1 Firmicutes bacterium CAG:94, MGS94",/group/ctbro
82000,0.0010067278888179541,0.04411764705882353,8.839561950596671e-4,8.839561950596671e-4,NA,NA,NA,"GCA_000435555.1 Firmicutes bacterium CAG:176, MGS176",/group/ct

Combine genome grist commands?

Would/could this work or do summarize and zip need to run separately?

genome-grist run <config.yml> summarize zip

It would be nice to run in one line :)

Update - right, so combining doesn't work if summarize hasn't been run yet, but it does create the zip file if the outputs directory already exists. :) Anyhow, I'll leave this here as an enhancement request.

potential command lines

I'm thinking of doing the command line for this snakemake app differently; right now I'm not using config files much, just the sample name. How does the below sound?

each command would be successive & snakemake-ified, of course, so you could run the last one and it would do all the previous ones.

# download SRA metagenome
genome-grist download-sra HSMA33MX

# trim and abundtrim SRA metagenome
genome-grist trim-reads HSMA33MX

# make a sourmash signature
genome-grist smash HSMA33MX

# download, abundtrim, genbank gather SRA metagenome
genome-grist gather HSMA33MX

# download genbank genomes based on gather results
genome-grist download-genomes HSMA33MX

# do iterative read mapping etc based on gather
genome-grist map HSMA33MX

# summarize content, abundance, taxonomy, mapping based on gather => pretty graphs etc.
genome-grist summarize HSMA33MX

# extract spacegraphcats neighborhoods
genome-grist neighborhoods HSMA33MX

what this pipeline currently does, and what it could do

(1) download a metagenome
(2) process it into trimmed reads + signature
(3) gather x genbank
(4) download matching genomes
(5) map all reads to all matching genomes
(6) extract “leftover” reads iteratively, by successively eliminating reads that matched to previous gather matches
(7) run mapping on “leftover” reads to genomes
(8) summarize all mapping results for comparison and graphing a la figure 2 in the gather draft paper

@luizirber comment:

(1), (2) and (5) are also useful for MAG search (where "matching genomes" are the queries)

@taylorreiter comment:

ya i think this generally gets people what they've been wanting forever -- the reads that match the actual thing
(parts of this are a thing that happens in distillerycats that would be awesome if it were already implemented and modularized)

so the question is:

should we appify/clickify it?

it seems like there could be utility in having these steps be made a bit more generic, and released as part of a separate package that behaves like elvers/etc.
e.g. “give me the genome files for this gather csv”
and “give me the mapped reads, both ALL and LEFTOVER, for this gather csv”

since this is boring functionality that may break routinely with NCBI updates, etc., we can package it and test the bejeezus out of it
and then rely on that, without having it be reimplemented in various different ways across a dozen individual lab projects
