
genome-grist's People

Contributors

bluegenes, ctb, dependabot[bot], marisalim, mr-eyes


genome-grist's Issues

add gene abundance information to output

tracy teal suggested in conversation that we output gene abundances - I suspect this can be done pretty easily with the bam files and bedtools, we'd just need to add the NCBI annotation files into the mix.

ref #10
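A minimal sketch of what that could look like, assuming a sorted BAM of reads mapped to one reference genome and the matching NCBI GFF annotation (file names here are hypothetical):

# count reads overlapping each annotated feature; -counts appends the
# number of overlapping reads as the last column of the GFF lines
bedtools coverage -counts \
    -a GCF_000012825.1_genomic.gff \
    -b SRR5855411.x.GCF_000012825.1.sorted.bam \
    > SRR5855411.x.GCF_000012825.1.feature_counts.tsv

(We'd probably want to filter the GFF down to gene features first, and normalize counts by feature length and library size.)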

use snakemake `temp` to remove temporary files automatically

I ran out of room in /tmp on my node, and noticed that leftovers from previous downloads that had failed or been cancelled with Ctrl-C hadn't been removed. It looks like temporary fastq removal from /tmp/tmp.* only happens if the download_sra_wc rule succeeds (removal line here).

Error:

2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - failed SRR4140278
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'

Might be worth getting snakemake to remove these automatically by declaring them as temporary output in the rule (documentation here; code: temp("/path/to/output_file")). I see that part of the issue is that $ACC doesn't always match wildcards.sample, but temp with directory should work.

update: working PR using temp with directory in #71.
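For reference, a minimal sketch of the temp-with-directory idea (rule and path names below are made up, not the actual grist rules):

rule download_sra_wc:
    output:
        # declare the whole scratch directory as temporary output, so
        # snakemake removes it once downstream jobs are done -- and also
        # cleans it up as a "failed job output" if the rule dies
        temp(directory("/tmp/grist_sra/{sample}"))
    shell: """
        mkdir -p {output}
        # fastq-dump into {output} would go here
    """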

computing ANI for metagenomes vs ref genomes

there are three ways to do this.

  1. mapping based: map reads, call variants, build a variant call file (VCF), estimate ANI from VCF SNP calls.
  2. mapping based: create a new consensus genome based on the VCF (bcftools supports this natively!)
  3. assembly based: do spacegraphcats queries (which will get all mapped reads + graph neighborhood), then run an assembler.

I’m in the process of trying (2) and (3) now. All very straightforward with tools we have.

It's not clear what to do about genomes that are not fully covered - I'll do some exploratory analysis! It's probably best to eliminate those regions from consideration (which is what an assembly-based approach will do).
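For the record, a rough sketch of what approach (2) could look like on the command line (standard bcftools/fastANI invocations; file names are placeholders):

# call variants from the mapped reads against the reference
bcftools mpileup -f ref_genome.fa mapped.sorted.bam | \
    bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz

# build a consensus genome from the VCF
bcftools consensus -f ref_genome.fa calls.vcf.gz > consensus.fa

# estimate ANI between the consensus and the reference, e.g. with fastANI
fastANI -q consensus.fa -r ref_genome.fa -o ani_report.txt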

comments from @taylorreiter -

ANI is a more accepted/understood metric for genome relatedness than the %s gather reports.
Would tell you how related the thing in Q (the query) is to all the Rs (references) in a database.
Theoretically could insert it into a tree based on that info, I think a la pplacer?

comments from @widdowquinn -

I suspect, but have not shown, that there is an asymptotic relationship between proportion of genome covered and ANI %identity. I wouldn’t be surprised if the proportion of a bacterial genome you need for an acceptable estimate of ANI %id is relatively small (cgMLST/CNI may well even be overkill…).
However, %coverage does matter for interpretation, and the extent of mapping is an upper bound on the observed %coverage.

and a somewhat tangential comment from nanopore considerations -

I would expect k-mer approaches to always work better here; I think it might be satisfying to demonstrate that ;)

config file location thoughts

titus 7:51 AM
what is good practice for user-override systems-level config files in python packages, that works well with conda? I would like to have non-git-archived config files that operate at “system” or “user” level, as well as per-project-directory overrides. Maybe something like:

~/.genome-grist.conf - system/user level overrides, processed first
./genome-grist.conf - per-project-directory overrides, processed after system, but before project- or sample-specific overrides
path/to/sample.conf - project-/sample-specific overrides

seem reasonable?

bluegenes 7:56 AM
can you name the first two with the level (e.g. genome-grist.system.conf, genome-grist.directory.conf) or similar? Just thinking it might be a little confusing to explain the ordering between identically named files, even if they are in different locations.

luizirber 8:06 AM
And please use ~/.config/genome-grist/ as basedir, don't pollute my home dir =)
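For what it's worth, a minimal sketch of the override logic being discussed, in plain Python with YAML config files (paths and the merge strategy are illustrative, not what genome-grist actually does):

import os
import yaml

def load_configs(project_conf=None):
    """Merge config files in increasing order of precedence."""
    candidates = [
        os.path.expanduser("~/.config/genome-grist/config.conf"),  # system/user level
        "./genome-grist.conf",                                      # per-project directory
    ]
    if project_conf:
        candidates.append(project_conf)                             # project/sample specific

    merged = {}
    for path in candidates:
        if os.path.exists(path):
            with open(path) as fp:
                merged.update(yaml.safe_load(fp) or {})
    return merged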

figure out how to retain/extract unmapped reads, too.

we already have the unmapped hashes, due to the split function.

for unmapped reads... what do we want to do with them?

  • feed them from a DNA query into a protein query
  • run sgc on them with a different set of queries (i.e. use genome-grist as a prefilter)
  • other things?
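As a starting point, pulling the unmapped reads themselves out of an existing BAM is a one-liner with samtools (a sketch; file names are placeholders):

# -f 4 keeps only reads with the "unmapped" SAM flag set;
# samtools fastq writes them back out as FASTQ
samtools fastq -f 4 sample.x.genome.bam | gzip > sample.unmapped.fq.gz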

thoughts on taxonomic comparisons, protein searches, and challenges therein

I'm becoming rather fond of the protein search (#40) and want to combine DNA and protein searches (#48), but I keep running into conceptual problems about how to deal with taxonomy. This issue is to explain those problems to future me.

In #40, I explored comparison of DNA vs protein taxonomy.

However, the challenge is that I used different databases! I used the complete GenBank database (700k+ genomes) for DNA searching, and the GTDB genus-level protein database for protein searching. That's OK, because both databases have GenBank identifiers that let me pull out NCBI taxonomy.

I do prefer the GTDB taxonomy for things, but the majority of GenBank genomes don't have GTDB taxonomy assigned! I suppose we could relabel all of GenBank with GTDB...

In addition, GTDB doesn't have euk sequences in it, so we'd have to figure out how to add euk in. This now makes the question relevant to charcoal dib-lab/charcoal#30, proving that ultimately all of our software gets stuck on the same set of hard problems :).

Anyway, for now I'm stuck using (a) NCBI taxonomy with (b) a protein database that doesn't contain euk sequences.

choose abundtrim memory parameters based on k-mer counting estimates

ref #51

Mo -
so, khmer pre-allocates the memory even when there's no need for it?
titus 9 minutes ago
Yes - after all, how do you know how many kmers there are if you haven’t read through the whole data set yet?
titus 8 minutes ago
MQF paper has to address same problem ;). In Khmer we use bloom filter based data structures that let us preallocate memory.
Mo 7 minutes ago
Got it now ^^ thanks
titus 6 minutes ago
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0101271

titus 6 minutes ago
😁
titus 6 minutes ago
(there WILL be a quiz… 😬 )

Mo 5 minutes ago
haha I am thrilled
Mo 4 minutes ago
If we were to develop an estimation function for the distinct k-mer count, would a tool like KmerStream or an alternative algorithm reduce the need for the fixed memory requirement?
Mo 4 minutes ago
https://github.com/pmelsted/KmerStream

titus 3 minutes ago
We have hll implemented too. Still have to walk across the data once.

titus 1 minute ago
now that I see your previous comment - yeah, we’ve explored this (extremely) thoroughly in khmer. with grist, we could totally integrate a k-mer counting scheme without much extra effort, since we already do multiple passes across the data.
titus 1 minute ago
…please let me know when the pull request is ready 😁
titus < 1 minute ago
(more seriously, it’s just not that big an issue for grist, so I don’t worry about it. the high water mark for grist’s memory usage is the database search at the moment, and for genbank that’s ~120 GB, so …)
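To make the earlier point concrete: khmer's unique-kmers.py already gives an HLL-based estimate of the number of distinct k-mers, which could in principle feed into the memory choice for trim-low-abund (a sketch; how to map the estimate onto -M is the actual open question):

# estimate distinct 31-mers in the input reads with khmer's HLL
unique-kmers.py -k 31 sample.fq.gz

# then pick the trim-low-abund memory cap (-M) based on that estimate
trim-low-abund.py -C 3 -Z 18 -V -M 20e9 sample.fq.gz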

custom database

Hello,

Thank you so much for your wonderful tool. I was wondering how we would adapt your pipeline if we have a set of isolate genomes that we want to use to query our metagenomes. We have analyzed our isolate genomes and have a VCF file representing all the variants that can be found, and we want to find the abundance of those strains/variants. Is this at all possible with your tool?

Many thanks in advance,
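Not an official answer, but one likely direction is to sketch the isolate genomes with sourmash and point the gather step at that database instead of GenBank (a minimal sketch of building such a database; the VCF/strain-resolution part is a separate question):

# sketch each isolate genome
sourmash sketch dna -p k=31,scaled=1000 isolates/*.fa

# bundle the signatures into a searchable database
sourmash index isolates-db.sbt.zip *.sig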

improve reports

our current reports are not really aimed at anyone other than internal lab members. it would be easy to add some more information that is already computed by genome-grist. for example, we could produce and provide --

  • FastQC reports on the metagenome itself
  • a summary of reference genomes - linking names to accessions, etc.
  • a summary of mapping statistics in numerical form, including abundances
  • an estimate or rundown of the number of unmapped reads

this is on top of a basic need for more explanatory information about what is going on in the first place :)

more advanced stuff could include gene abundances and so on for portions of genomes that were actually mapped to, although that starts to get into a territory of doing a lot more work 👀
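several of these are close to one-liners on top of files grist already produces, e.g. (a sketch, with placeholder file names):

# FastQC report on the trimmed metagenome
fastqc -o reports/fastqc/ sample.abundtrim.fq.gz

# numerical mapping statistics for one reference genome
samtools flagstat sample.x.GCF_000012825.1.sorted.bam > sample.x.GCF_000012825.1.flagstat.txt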

why are `checkpoint gather_genbank` and `checkpoint gather_genbank_wc` checkpoints instead of rules?


I understand the purpose of Checkpoint_GatherResults to dynamically solve for outputs based on accessions in gather results. However, the output of gather_genbank and gather_genbank_wc are based only on the wildcard sample, which is defined in the config and not dynamically altered by an unknown number of outputs in the workflow, so shouldn't/couldn't these checkpoints just be rules?

(wondering if this may be related to #77, and I want to reimplement a class like Checkpoint_GatherResults in another workflow so also curious if my mental model is missing something important)
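(For anyone landing here later, my working mental model -- which may be exactly what's being questioned here -- is sketched below: a checkpoint is only needed when a downstream input function has to call checkpoints.<name>.get() to discover new wildcard values from the output; if the outputs are fully determined by the wildcards, a plain rule should suffice.)

# outputs fully determined by the wildcards: a plain rule is enough
rule gather_genbank:
    input: "sigs/{sample}.sig"
    output: "genbank/{sample}.x.genbank.gather.csv"
    shell: "sourmash gather {input} genbank.sbt.zip -o {output}"

# a checkpoint only matters if a downstream input function needs to
# re-read the output to learn new wildcard values, e.g. (if the rule
# above were declared as a checkpoint):
#
#   gather_csv = checkpoints.gather_genbank.get(sample=w.sample).output[0]
#   accs = [row["name"].split(" ")[0] for row in csv.DictReader(open(gather_csv))]
#   return expand("genbank_genomes/{acc}_genomic.fna.gz", acc=accs)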

reporting on multiple samples

with #29, genome-grist now supports running on multiple samples. This raises the question: what kind of reporting should/could we be doing here for collections of samples?

some ideas -

dealing with host/human reads

In a conversation with @ctb, we realized that A LOT of the metagenomes I've been working with for a recent project appeared truncated. In fact, the files were much smaller than what is on the SRA because I had removed human reads before downstream analysis.
Here's a smattering of the file sizes for some of the files I've been working with. For some, there are almost no human (host) reads, and for others, there's quite a bit:

-rw-r--r--  1 tereiter tereiter  232M Feb 23  2020 4000_R1.human.fq.gz
-rw-r--r--  1 tereiter tereiter  1.5G Feb 23  2020 4000_R1.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  232M Feb 23  2020 4000_R2.human.fq.gz
-rw-r--r--  1 tereiter tereiter  1.6G Feb 23  2020 4000_R2.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  8.8M Feb 23  2020 4001_R1.human.fq.gz
-rw-r--r--  1 tereiter tereiter  1.5G Feb 23  2020 4001_R1.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  8.3M Feb 23  2020 4001_R2.human.fq.gz
-rw-r--r--  1 tereiter tereiter  1.5G Feb 23  2020 4001_R2.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  5.2M Feb 23  2020 4002_R1.human.fq.gz
-rw-r--r--  1 tereiter tereiter  2.4G Feb 23  2020 4002_R1.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  4.3M Feb 23  2020 4002_R2.human.fq.gz
-rw-r--r--  1 tereiter tereiter  2.4G Feb 23  2020 4002_R2.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  5.5M Feb 23  2020 4004_R1.human.fq.gz
-rw-r--r--  1 tereiter tereiter  978M Feb 23  2020 4004_R1.nohost.fq.gz
-rw-r--r--  1 tereiter tereiter  5.2M Feb 23  2020 4004_R2.human.fq.gz
-rw-r--r--  1 tereiter tereiter  987M Feb 23  2020 4004_R2.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter  5.3M Feb 26  2020 PSM7J4EF_R1.human.fq.gz
-rw-rw-r--  1 tereiter tereiter   54M Feb 26  2020 PSM7J4EF_R1.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter  5.1M Feb 26  2020 PSM7J4EF_R2.human.fq.gz
-rw-rw-r--  1 tereiter tereiter   54M Feb 26  2020 PSM7J4EF_R2.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter   80K Feb 24  2020 PSMA264O_R1.human.fq.gz
-rw-rw-r--  1 tereiter tereiter  738M Feb 24  2020 PSMA264O_R1.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter   83K Feb 24  2020 PSMA264O_R2.human.fq.gz
-rw-rw-r--  1 tereiter tereiter  755M Feb 24  2020 PSMA264O_R2.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter  346K Feb 25  2020 PSMA265X_R1.human.fq.gz
-rw-rw-r--  1 tereiter tereiter  540M Feb 25  2020 PSMA265X_R1.nohost.fq.gz
-rw-rw-r--  1 tereiter tereiter  337K Feb 25  2020 PSMA265X_R2.human.fq.gz
-rw-rw-r--  1 tereiter tereiter  544M Feb 25  2020 PSMA265X_R2.nohost.fq.gz

I used bbduk to remove the human reads. You can download the database here: https://drive.google.com/file/d/0B3llHR93L14wd0pSSnFULUlhcUk/edit?usp=sharing
The release of this database is documented here: http://seqanswers.com/forums/archive/index.php/t-42552.html

rule remove_host:
    output:
        r1 = 'outputs/bbduk/{library}_R1.nohost.fq.gz',
        r2 = 'outputs/bbduk/{library}_R2.nohost.fq.gz',
        human_r1='outputs/bbduk/{library}_R1.human.fq.gz',
        human_r2='outputs/bbduk/{library}_R2.human.fq.gz'
    input:
        r1 = 'outputs/cut/{library}_R1.cut.fq.gz',
        r2 = 'outputs/cut/{library}_R2.cut.fq.gz',
        human='inputs/host/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz'
    conda: 'envs/env.yml'
    shell:'''
    bbduk.sh -Xmx64g t=3 in={input.r1} in2={input.r2} out={output.r1} out2={output.r2} outm={output.human_r1} outm2={output.human_r2} k=31 ref={input.human}
    '''

I originally detected human reads by adding a human genome signature to my gather runs. Up to 10% of some samples was human. A genome sig is here: https://osf.io/fxup3/ and an rna sig is here: https://osf.io/anj6b/
If using the genome sig, repetitive regions may cause false matching, especially to other eukaryotes. However, the RNA sig is much smaller and would miss a lot of real human reads. The bbduk database above deals with this nicely, I think... sourmash is currently at the quick-and-dirty stage. Could it be possible to turn the bbduk database into a signature?
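Re that last question -- turning the bbduk reference into a signature should just be a sketch call along these lines (the k-size/scaled values are guesses, not a recommendation):

# build a sourmash signature from the bbduk human/host masking reference
sourmash sketch dna -p k=31,scaled=1000 \
    hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz \
    -o host-mask.sig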

Errors on WDL - connection timed out

Titus:

I wonder what the restrictions are on internet access during the jobs? there are a bunch of HTTP calls out to NCBI… it’d be annoying to have to pre-load the matching genomes (or, well, run gather… then exit wf, download matching genomes somewhere else, then restart wf)


Current error running genome-grist tutorial in WDL:

rule make_genbank_info_csv:
    output: genbank_genomes/GCF_000012825.1.info.csv
    jobid: 29
    wildcards: acc=GCF_000012825.1

opening directory: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/012/825
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1567, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1588, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2409, in __init__
    self.init()
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2418, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/root/miniconda3/lib/python3.8/ftplib.py", line 152, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/root/miniconda3/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/root/miniconda3/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 124, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 106, in main
    genome_url, assembly_report_url = url_for_accession(acc)
  File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 24, in url_for_accession
    with urllib.request.urlopen(url) as response:
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1585, in ftp_open
    raise exc.with_traceback(sys.exc_info()[2])
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1567, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1588, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2409, in __init__
    self.init()
  File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2418, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/root/miniconda3/lib/python3.8/ftplib.py", line 152, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/root/miniconda3/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/root/miniconda3/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
urllib.error.URLError: <urlopen error ftp error: TimeoutError(110, 'Connection timed out')>
[Thu Jan 28 17:51:21 2021]
Error in rule make_genbank_info_csv:
    jobid: 29
    output: genbank_genomes/GCF_000012825.1.info.csv
    shell:
        
        python -m genome_grist.genbank_genomes GCF_000012825.1             --output genbank_genomes/GCF_000012825.1.info.csv
    
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job make_genbank_info_csv since they might be corrupted:
genbank_genomes/GCF_000012825.1.info.csv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /cromwell-executions/ggristv2/8c56fb3e-b7dd-4e93-aaa7-1c7e2cecb4c1/call-run_test/execution/grist/.snakemake/log/2021-01-28T173100.518576.snakemake.log
Error in snakemake invocation: Command '['snakemake', '-s', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/Snakefile', '-j', '1', '--use-conda', 'summarize', '--configfile', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/defaults.conf', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/system.conf', '/cromwell-executions/ggristv2/8c56fb3e-b7dd-4e93-aaa7-1c7e2cecb4c1/call-run_test/inputs/-269893485/conf-tutorial.yml']' returned non-zero exit status 1.
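One low-effort mitigation (not something genome_grist.genbank_genomes currently does -- just a sketch) would be to wrap the urlopen call in a retry with backoff, so transient timeouts in the WDL environment don't kill the whole workflow:

import time
import urllib.error
import urllib.request

def urlopen_with_retries(url, tries=5, delay=30):
    """Retry urllib.request.urlopen on transient network errors."""
    for attempt in range(tries):
        try:
            return urllib.request.urlopen(url)
        except (urllib.error.URLError, TimeoutError):
            if attempt == tries - 1:
                raise
            time.sleep(delay * (attempt + 1))   # simple linear backoff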

comparing and evaluating DNA vs protein matches

So, I did a thing, and I don't really know how to evaluate it. Thoughts welcome!

summary: DNA and protein

so I gristed a bunch of the Northen data sets, using both DNA searching (against all of GenBank, ~700k genomes) and protein searching (against @bluegenes' genus-level GTDB databases).

Then I intersected with Genbank taxonomy and did taxonomic aggregation and summarization (#35).

taxonomy report on DNA search

default config, against genbank.

phylum level DNA:
dna gathergram-phylum-SRR5855411

genus level DNA:
dna gathergram-genus-SRR5855411

species level DNA:
dna gathergram-species-SRR5855411

taxonomy report on protein search

config:

outdir: outputs.roux.protein/
sourmash_scaled: 1000
sourmash_compute_ksizes:
- 11
sourmash_sigtype: protein
sourmash_database_glob_pattern: /home/ntpierce/thumper/databases/gtdb95-genus-n0.protein-k11-scaled100.sbt.zip
sourmash_database_ksize: 33
sourmash_database_threshold_bp: 50e3

phylum level protein:
prot gathergram-phylum-SRR5855411

genus level protein
prot gathergram-genus-SRR5855411

species level protein:
prot gathergram-species-SRR5855411

how to currently install and run cmds

To install genome-grist, you currently need to run:

git clone https://github.com/dib-lab/genome-grist.git
pip install -e genome-grist

As of today, this installation version is v0.6.dev6+g0e772df.

genome-grist commands need to be run from inside the repo right now, for example:

cd genome-grist
genome-grist process HSMA33MX smash_reads
# snakemake workflow should start running
# or you can run the whole test workflow 
make test
# snakemake workflow should start running

If you use the pip install instructions in the repo README, the installation code will run without error (v0.5). But that installation method is missing the config.yml file and you'll get errors like this when you try the quick start command genome-grist process HSMA33MX smash_reads:

File "/home/mcwlim/miniconda3/envs/grist/lib/python3.7/site-packages/genome_grist/__main__.py", line 77, in run_snakemake
    raise ValueError(f"cannot find config file '{configfile}'")
ValueError: cannot find config file 'conf.yml'

run both protein and DNA searches by default

I increasingly like the idea of generating a combined report for DNA and protein matches per #40.

This would add another layer of reporting. In addition to reporting summaries across multiple samples per #42, and aggregated taxonomic reporting #35 at the DNA level, we would add protein reporting, and also maybe do some kind of comparative analysis of DNA vs protein presence.

add summary reporting for multiple samples

using #56, add a notebook or output for summary reporting on sample details.

some code for loading and displaying:

import itables

import glob
import yaml
import pandas

x = []
for yf in glob.glob('outputs.paper/*.yaml'):
    with open(yf, 'rt') as fp:
        d = yaml.safe_load(fp)
        x.append(d)

df = pandas.DataFrame(x)
cols = ['sample', 'n_reads', 'n_bases', 'kmers', 'total_hashes', 'known_hashes', 'unknown_hashes']
df = df[cols]

# display as an interactive table (presumably what the itables import above is for)
itables.show(df)

add pandas/notebook utilities into genome-grist codebase?

this seems like generically useful code that ensures that the various dataframes are all copacetic and synchronized...

from collections import namedtuple
import pandas as pd

SampleDFs = namedtuple('SampleDFs', 'gather_df, all_df, left_df, names_df')

def load_dfs(outdir, sample_id):

    # load mapping CSVs
    all_df = pd.read_csv(f'{outdir}/minimap/depth/{sample_id}.summary.csv')
    left_df = pd.read_csv(f'{outdir}/leftover/depth/{sample_id}.summary.csv')

    # load gather CSV
    gather_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.x.genbank.gather.csv')

    # names!
    names_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.genomes.info.csv')

    # connect gather_df to all_df and left_df using 'genome_id'
    def fix_name(x):
        return "_".join(x.split('_')[:2]).split('.')[0]

    gather_df['genome_id'] = gather_df['name'].apply(fix_name)
    names_df['genome_id'] = names_df['acc'].apply(fix_name)

    # check that all dataframes are copacetic
    in_gather = set(gather_df.genome_id)
    in_left = set(left_df.genome_id)
    assert in_gather == in_left
    assert in_gather == set(names_df.genome_id)
    assert in_gather == set(all_df.genome_id)

    # re-sort left_df and all_df to match gather_df order, using matching genome_id column
    all_df = all_df.set_index("genome_id")
    all_df = all_df.reindex(index=gather_df["genome_id"])
    all_df = all_df.reset_index()

    left_df = left_df.set_index("genome_id")
    left_df = left_df.reindex(index=gather_df["genome_id"])
    left_df = left_df.reset_index()

    #left_df["mapped_bp"] = (1 - left_df["percent missed"]/100) * left_df["genome bp"]
    #left_df["unique_mapped_coverage"] = left_df.coverage / (1 - left_df["percent missed"] / 100.0)

    names_df = names_df.set_index("genome_id")
    names_df = names_df.reindex(index=gather_df["genome_id"])
    names_df = names_df.reset_index()
    
    return SampleDFs(gather_df, all_df, left_df, names_df)

paladin snakemake rules (map to proteins)

about paladin: https://github.com/ToniWestbrook/paladin

"PALADIN currently only supports single-end reads (or reads merged with FLASH, PEAR, abyss-mergepairs), and BWA-MEM based alignment. It makes use of many BWA parameters and is therefore compatible with many of its command line arguments."

working snakefile for PEAR --> paladin. On farm, see /home/ntpierce/genome-grist/test_paladin to run w/test data

rule all:
    input: expand("paladin_mapped/{sample}.bam", sample="test")

rule pear_read_merging:
    """
    Merge PE reads with PEAR, for input into PALADIN
    """
    input:
        r1="{sample}_1.fq.gz",
        r2="{sample}_2.fq.gz"
    output:
        assembled = '{sample}.assembled.fastq',
        discarded = '{sample}.discarded.fastq',
        unassembled_r1 = '{sample}.unassembled.forward.fastq',
        unassembled_r2 = '{sample}.unassembled.reverse.fastq',
    message:
        """--- Merging paired reads using PEAR  ---"""
    params:
        pval=0.01,# (default PEAR pval)
        max_mem="4G",
    threads: 6
    log: "logs/{sample}.pear.log"
    conda: 'paladin-env.yml'
    shell:
        """
        pear -f {input.r1} -r {input.r2} -p {params.pval} \
             -j {threads} -y {params.max_mem} -o {wildcards.sample} > {log} 2>&1
        """

# using swissprot as an example, even though you can run paladin prepare to autodownload this, instead of running index
rule paladin_index:
    input:
        "uniprot_sprot.fasta",
    output:
        "uniprot_sprot.fasta.bwt"
    log:
        "logs/uniprot_sprot.paladin-index.log"
    conda: 'paladin-env.yml'
    shell:
        """
        paladin index -r3 {input}
        """

rule paladin_align:
    input:
        reads='{sample}.assembled.fastq',
        index="uniprot_sprot.fasta.bwt",
    output:
        "paladin_mapped/{sample}.bam"
    params:
        index_base="uniprot_sprot.fasta"
    log:
        "logs/{sample}.paladin-align.log"
    threads: 4
    conda: "paladin-env.yml"
    shell:
        """
        paladin align -t {threads} -T 20 {params.index_base} {input.reads} | samtools view -Sb - > {output}
        """

paladin-env.yml:

name: paladin
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - pear=0.9.6
  - paladin=1.4.6
  - samtools=1.11

preparing your own metagenome samples - an example `Snakefile`

So @taylorreiter gave me some internally sequenced and processed files that I wanted to feed into genome-grist, and I wrote the following little Snakefile to bypass all of the SRA stuff. FYI.

(taylor gave me the .nohost.fq.gz files, and I wanted to put them in interleaved format after fixing the names)

NAMES=['AW_BI_1_Soil', 'AW_BI_1_Water']

rule some:
    input:
        expand('outputs.killi/abundtrim/{N}.abundtrim.fq.gz', N=['AW_M2_Lung'])

rule all:
    input:
        expand('outputs.killi/abundtrim/{N}.abundtrim.fq.gz', N=NAMES)

rule fixname: # fix sequence names
    input:
        r1="{name}_R1.nohost.fq.gz",
        r2="{name}_R2.nohost.fq.gz",
    output:
        r1="{name}_R1.name.fq.gz",
        r2="{name}_R2.name.fq.gz",
    conda: 'genome_grist/conf/env/sra.yml'
    shell: """
        echo processing R1...
        seqtk seq -C {input.r1} | \
            perl -ne 's/\.([12])$/\/$1/; print $_' | \
            gzip -c > {output.r1}

        echo processing R2...
        seqtk seq -C {input.r2} | \
            perl -ne 's/\.([12])$/\/$1/; print $_' | \
            gzip -c > {output.r2}
    """

rule merge: # interleave
    input:
        r1="{name}_R1.name.fq.gz",
        r2="{name}_R2.name.fq.gz",
    output:
        "{name}.abundtrim.fq.gz"
    conda: 'genome_grist/conf/env/sra.yml'
    shell: """
        interleave-reads.py {input.r1} {input.r2} --gzip -o {output}
    """

Determining relative abundance of strains

Hello,

My understanding is that genome-grist tells us what genomes are present. I am trying to determine the relative abundance of the different strains that are found to be present. Can I do this by just using --track-abundance when running sourmash compute to make the database that is used by sourmash gather?

Also, would you mind explaining the difference between "hashes classified to this genome", "mapped bp to this genome", and "hashes classified to this species"?

Thank you!
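Not a definitive answer, but my understanding is that the abundance information used by gather comes from the metagenome (query) sketch rather than from the database; a hedged sketch of the usual invocation:

# sketch the metagenome with abundance tracking
sourmash sketch dna -p k=31,scaled=1000,abund metagenome.abundtrim.fq.gz -o metagenome.sig

# gather against the genome database; the f_unique_weighted / average_abund
# columns in the output CSV then carry per-genome abundance estimates
sourmash gather metagenome.sig genome-db.sbt.zip -o metagenome.gather.csv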

What should we do about SRA data sets with unpaired reads?

this rule:

# download SRA IDs.
rule wc_download_sra:
    output:
        r1 = protected(outdir + "/raw/{sra_id}_1.fastq.gz"),
        r2 = protected(outdir + "/raw/{sra_id}_2.fastq.gz"),
    conda: "env/sra.yml"
    shell: '''
        fastq-dump --skip-technical  \
               --readids \
               --read-filter pass \
               --dumpbase \
               --split-spot \
               --clip \
               -Z \
               {wildcards.sra_id} | \
               perl -ne 's/\.([12]) /\/$1 /; print $_' | \
               split-paired-reads.py --gzip -1 {output.r1} -2 {output.r2}
        '''

is generating the following error:

Unpaired reads found starting at SRR5855412.25364063/1 25364063 length=150; exiting
fastq-dump (PID 111316) was killed [SIGPIPE (13)] (no core dump was generated) at /home/ctbrown/genome-grist/.snakemake/conda/2d8b6fef/bin/fastq-dump line 242.
[Wed Dec 16 00:03:26 2020]
Error in rule wc_download_sra:
...

for multiple different SRA data sets (SRR5855412, SRR5855424, SRR5855429) and I'm not sure what to do about it.

My current thought is to download the data sets manually and clean them, but I'm wondering if there is a reason this is happening...

I was also thinking about splitting this out in the workflow so that, rather than doing all the piping in one rule, it would save the SRA download to a file first and then do the perl & split-paired-reads steps second. This would give me access to the data more directly, and I could use temp() to indicate that (once the file is successfully processed) it can be deleted. A sketch of that two-rule version is below.
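A sketch of that two-rule version, reusing only the pieces from the rule above (the intermediate path is made up):

rule wc_download_sra_raw:
    output: temp(outdir + "/raw/{sra_id}.combined.fastq")
    conda: "env/sra.yml"
    shell: '''
        fastq-dump --skip-technical --readids --read-filter pass \
               --dumpbase --split-spot --clip -Z {wildcards.sra_id} \
               > {output}
        '''

rule wc_split_sra_reads:
    input: outdir + "/raw/{sra_id}.combined.fastq"
    output:
        r1 = protected(outdir + "/raw/{sra_id}_1.fastq.gz"),
        r2 = protected(outdir + "/raw/{sra_id}_2.fastq.gz"),
    conda: "env/sra.yml"
    shell: '''
        perl -ne 's/\.([12]) /\/$1 /; print $_' {input} | \
            split-paired-reads.py --gzip -1 {output.r1} -2 {output.r2}
        '''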

error inserting my own files into genome grist for target download_matching_genomes

I'm not sure what mistake I'm making to cause this error message.

directory structure:

$ ls *
conf.yml  README.sh

genbank:
gather_vita_vars_gtdb_shared_assemblies.x.genbank.gather.csv

genbank_genomes:

conf file:

sample:
- gather_vita_vars_gtdb_shared_assemblies
outdir: .
metagenome_trim_memory: 1e9

command:

genome-grist run conf.yml download_matching_genomes

Error:

genome-grist run conf.yml download_matching_genomes
sample: ['gather_vita_vars_gtdb_shared_assemblies']
outdir: .
Building DAG of jobs...
InputFunctionException in line 144 of /home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile:
Error:
  WorkflowError:
    Missing wildcard values for sample
Wildcards:

Traceback:
  File "/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile", line 131, in __call__
Error in snakemake invocation: Command '['snakemake', '-s', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile', '-j', '1', '--use-conda', 'download_matching_genomes', '--configfile', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/defaults.conf', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/system.conf', 'conf.yml']' returned non-zero exit status 1.

head of the gather file I'm using (I did read a bunch of gather files into R and then write this out, so there are NAs, but I don't think they're the problem):

intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gathe
868000,0.01065658301821932,0.1523876404494382,0.01065658301821932,0.01065658301821932,NA,NA,NA,"GCA_000210075.1 Bacteroides xylanisolvens XB1A strain=XB1A, ASM2100
58000,7.120758237980651e-4,2.918004085205719e-4,2.2098904876491677e-4,2.2098904876491677e-4,NA,NA,NA,"GCA_000256725.2 Toxoplasma gondii TgCatPRC2 strain=TgCatPRC2,
106000,0.0013013799538378432,0.004314477468839885,2.2098904876491677e-4,2.2098904876491677e-4,NA,NA,NA,"GCA_000285855.2 Ruminococcus sp. JC304, ASM28585v2",/group/
394000,0.004837204734076512,0.2008324661810614,0.0047389873790698814,0.0047389873790698814,NA,NA,NA,"GCA_000431575.1 Firmicutes bacterium CAG:83, MGS83",/group/ctb
134000,0.001645140696361047,0.030209140201394268,9.576192113146393e-4,9.576192113146393e-4,NA,NA,NA,"GCA_000432135.1 Firmicutes bacterium CAG:124, MGS124",/group/c
20000,2.455433875165742e-4,0.009478672985781991,2.455433875165742e-4,2.455433875165742e-4,NA,NA,NA,"GCA_000432155.1 Eubacterium sp. CAG:274, MGS274",/group/ctbrown
100000,0.001227716937582871,0.030973451327433628,6.875214850464077e-4,6.875214850464077e-4,NA,NA,NA,"GCA_000432375.1 Firmicutes bacterium CAG:103, MGS103",/group/c
96000,0.001178608260079556,0.0519774011299435,0.0011294995825762413,0.0011294995825762413,NA,NA,NA,"GCA_000432435.1 Eubacterium sp. CAG:180, MGS180",/group/ctbrown
62000,7.6118450130138e-4,0.01794071762870515,5.647497912881207e-4,5.647497912881207e-4,NA,NA,NA,"GCA_000434635.1 Firmicutes bacterium CAG:110, MGS110",/group/ctbro
22000,2.700977262682316e-4,0.015068493150684932,2.700977262682316e-4,2.700977262682316e-4,NA,NA,NA,"GCA_000435175.1 Clostridium sp. CAG:245, MGS245",/group/ctbrown
540000,0.0066296714629475026,0.209992193598751,0.006605117124195845,0.006605117124195845,NA,NA,NA,"GCA_000435395.1 Firmicutes bacterium CAG:94, MGS94",/group/ctbro
82000,0.0010067278888179541,0.04411764705882353,8.839561950596671e-4,8.839561950596671e-4,NA,NA,NA,"GCA_000435555.1 Firmicutes bacterium CAG:176, MGS176",/group/ct

Combine genome grist commands?

Would/could this work or do summarize and zip need to run separately?

genome-grist run <config.yml> summarize zip

It would be nice to run in one line :)

Update - right, so combining doesn't work if summarize hasn't been run yet, but it does create the zip file if the outputs directory already exists. :) Anyhow, I'll leave this here as an enhancement request.

potential command lines

I'm thinking of doing the command line for this snakemake app differently; right now I'm not using config files much, just the sample name. How does the below sound?

each command would be successive & snakemake-ified, of course, so you could run the last one and it would do all the previous ones.

# download SRA metagenome
genome-grist download-sra HSMA33MX

# trim and abundtrim SRA metagenome
genome-grist trim-reads HSMA33MX

# make a sourmash signature
genome-grist smash HSMA33MX

# download, abundtrim, genbank gather SRA metagenome
genome-grist gather HSMA33MX

# download genbank genomes based on gather results
genome-grist download-genomes HSMA33MX

# do iterative read mapping etc based on gather
genome-grist map HSMA33MX

# summarize content, abundance, taxonomy, mapping based on gather => pretty graphs etc.
genome-grist summarize HSMA33MX

# extract spacegraphcats neighborhoods
genome-grist neighborhoods HSMA33MX

what this pipeline currently does, and what it could do

(1) download a metagenome
(2) process it into trimmed reads + signature
(3) gather x genbank
(4) download matching genomes
(5) map all reads to all matching genomes
(6) extract “leftover” reads iteratively, by successively eliminating reads that matched to previous gather matches
(7) run mapping on “leftover” reads to genomes
(8) summarize all mapping results for comparison and graphing a la figure 2 in the gather draft paper

@luizirber comment:

(1), (2) and (5) are also useful for MAG search (where "matching genomes" are the queries)

@taylorreiter comment:

ya i think this generally gets people what they've been wanting forever -- the reads that match the actual thing
(parts of this are a thing that happens in distillerycats that would be awesome if it were already implemented and modularized)

so the question is:

should we appify/clickify it?

it seems like there could be utility in having these steps be made a bit more generic, and released as part of a separate package that behaves like elvers/etc.
e.g. “give me the genome files for this gather csv”
and “give me the mapped reads, both ALL and LEFTOVER, for this gather csv”

since this is boring functionality that may break routinely with NCBI updates, etc., we can package it and test the bejeezus out of it
and then rely on that, without having it be reimplemented in various different ways across a dozen individual lab projects
