dib-lab / genome-grist Goto Github PK
View Code? Open in Web Editor NEWmap Illumina metagenomes to genomes!
Home Page: https://dib-lab.github.io/genome-grist/
License: Other
map Illumina metagenomes to genomes!
Home Page: https://dib-lab.github.io/genome-grist/
License: Other
it should be fairly easy to cluster gather matches into "species groupings", for e.g. doing combined sgc queries.
using conf file:
sample:
- gather_vita_vars_gtdb_shared_assemblies
outdir: out
metagenome_trim_memory: 1e9
$ ls *
conf.yml README.sh
genbank_genomes:
GCA_000210075.1_genomic.fna.gz GCA_900543035.1.info.csv
out:
genbank sgc
Shouldn't dir genbank_genomes
be written to out
?
tracy teal suggested in conversation that we output gene abundances - I suspect this can be done pretty easily with the bam files and bedtools, we'd just need to add the NCBI annotation files into the mix.
ref #10
I ran out of room in the /tmp
on my node, and noticed that I had some previous versions of downloads that failed or were cancelled with Ctrl-C
hadn't been removed. It looks like temporary fastq removal from /tmp/tmp.*
only happens if the download_sra_wc
rule succeeds (removal line here).
Error:
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - failed SRR4140278
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='5'
2021-05-03T20:04:24 fastq-dump.2.10.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='6'
Might be worth getting snakemake to remove these automatically by declaring them as temporary output in the rule (documentation here; code: temp("/path/to/output_file")
. I see that part of the issue is that $ACC doesn't always match wildcards.sample
, but temp
with directory
should work.
update: working PR using temp
with directory
in #71.
there are three ways to do this.
I’m in the process of trying (2) and (3) now. All very straightforward with tools we have.
Not clear what to do about genomes that are not fully covered, will do some exploratory analysis! probably best to eliminate those regions from consideration (which is what an assembly based approach will do).
comments from @taylorreiter -
ANI is a more accepted/understood metric for genome relatedness than the %s gather reports.
Would tell you how related thing in Q is to all the Rs in a database.
Theoretically could insert into a tree based on that info, i think a la pplacer?
comments from @widdowquinn -
I suspect, but have not shown, that there is an asymptotic relationship between proportion of genome covered and ANI %identity. I wouldn’t be surprised if the proportion of a bacterial genome you need for an acceptable estimate of ANI %id is relatively small (cgMLST/CNI may well even be overkill…).
However, %coverage does matter for interpretation, and the extent of mapping is an upper bound on the observed %coverage.
and a somewhat tangential comment from nanopore considerations -
I would expect k-mer approaches to always work better here; I think it might be satisfying to demonstrate that ;)
titus:speech_balloon: 7:51 AM
what is good practice for user-override systems-level config files in python packages, that works well with conda? I would like to have non-git-archived config files that operate at “system” or “user” level, as well as per-project-directory overrides. Maybe something like:
7:51
~/.genome-grist.conf - system/user level overrides, processed first
7:52
./genome-grist.conf - per-project-directory overrides, processed after system, but before project- or sample-specific overrides
7:52
path/to/sample.conf - project-/sample-specific overrides
7:52
seem reasonable?
bluegenes:feet: 7:56 AM
can you name the first two with the level (e.g. genome-grist.system.conf , genome-grist.directory.conf ) or similar? Just thinking it might be a little confusing to explain the ordering between identically named files (even if they are in different locations (edited)
New
luizirber 8:06 AM
And please use ~/.config/genome-grist/ as basedir, don't pollute my home dir =)
we already have the unmapped hashes, due to the split function.
for unmapped reads... what do we want to do with them?
Current PR's are failing tests with this error:
----------------------------- Captured stderr call -----------------------------
Error: mamba package manager is not available. [...]
I'm becoming rather fond of the protein search (#40) and want to combine DNA and protein searches (#48) but keep on running into mental problems about how to deal with taxonomy. This issue is to explain the issues to future me.
In #40, I explored comparison of DNA vs protein taxonomy.
However, the challenge is that I used different databases! I used the Genbank compleat database (700k+ genomes) for DNA searching, and the GTDB genus-level protein database for protein searching. That's OK, because both databases have GenBank identifiers that let me pull out NCBI taxonomy.
I do prefer the GTDB taxonomy for things, but the majority of GenBank genomes don't have GTDB taxonomy assigned! I suppose we could relabel all of GenBank with GTDB...
In addition, GTDB doesn't have euk sequences in it, so we'd have to figure out how to add euk in. This now makes the question relevant to charcoal dib-lab/charcoal#30, proving that ultimately all of our software gets stuck on the same set of hard problems :).
Anyway, for now I'm stuck using (a) NCBI taxonomy with (b) a protein database that doesn't contain euk sequences.
(and then run gather on those genomes to get the order)
waiting for @mr-eyes to confirm that 8GB is enough :)
git log --oneline v0.6.1..
to get list of merges since last releasegit fetch --tags
rm dist/*
python setup.py sdist
twine upload dist/*
ref #51
Mo -
so, khmer pre-allocate the memory even there's no need for it?
titus:speech_balloon: 9 minutes ago
Yes - after all, how do you know how many kmers there are if you haven’t read through the whole data set yet?
titus:speech_balloon: 8 minutes ago
MQF paper has to address same problem ;). In Khmer we use bloom filter based data structures that let us preallocate memory.
Mo 7 minutes ago
Got it now ^^ thanks
titus:speech_balloon: 6 minutes ago
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0101271
titus:speech_balloon: 6 minutes ago
😁
titus:speech_balloon: 6 minutes ago
(there WILL be a quiz… 😬 )
😆
2
Mo 5 minutes ago
haha I am thrilled
Mo 4 minutes ago
If we were to develop an estimation function for the distinct kmers count, would a tool like KmerStream or any alternative algorithm would reduce the need for the fixed memory requirement?
Mo 4 minutes ago
https://github.com/pmelsted/KmerStream
GitHubGitHub
pmelsted/KmerStream
Streaming algorithm for computing kmer statistics for massive genomics datasets - pmelsted/KmerStream
titus:speech_balloon: 3 minutes ago
We have hll implemented too. Still have to walk across the data once.
👍
1
titus:speech_balloon: 1 minute ago
now that I see your previous comment - yeah, we’ve explored this (extremely) thoroughly in khmer. with grist, we could totally integrate a k-mer counting scheme without much extra effort, since we already do multiple passes across the data.
titus:speech_balloon: 1 minute ago
…please let me know when the pull request is ready 😁
titus:speech_balloon: < 1 minute ago
(more seriously, it’s just not that big an issue for grist, so I don’t worry about it. the high water mark for grist’s memory usage is the database search at the moment, and for genbank that’s ~120 GB, so …)
the tests added in #19 download genome files from NCBI, but they shouldn't have to... #pleasefix
e.g. /scratch
on farm
while I'm a big fan of various aspects of the current reports, and look forward to leveling it up over the coming months #10, there are many things missing.
one idea is to add taxonomy reporting, a la @luizirber efforts with OPAL and CAMI.
Hello,
Thank you so much for your wonderful tool. I was wondering how do we adapt your pipeline if we have a set of isolate genomes that we want to use to query our metagenomes. We have analyzed our isolate genomes and have a VCF file representing all the variants that can be found and want to find the abundance of those strains/variants. Is this at all possible with your tool?
Many thanks in advance,
@marisalim has a nice colorful image of the genome-grist workflow that we should co-opt and put in the README
our current reports are not really aimed at anyone other than internal lab members. it would be easy to add some more information that is already computed by genome-grist. for example, we could produce and provide --
this is on top of a basic need for more explanatory information about what is going on in the first place :)
more advanced stuff could include gene abundances and so on for portions of genomes that were actually mapped to, although that starts to get into a territory of doing a lot more work 👀
I think this will be part of the greyhound merge sourmash-bio/sourmash#1238, but it seems like it could be a significant speedup in the meantime.
why are checkpoint gather_genbank
and checkpoint gather_genbank_wc
checkpoints instead of rules?
I understand the purpose of Checkpoint_GatherResults
to dynamically solve for outputs based on accessions in gather results. However, the output of gather_genbank
and gather_genbank_wc
are based only on the wildcard sample
, which is defined in the config and not dynamically altered by an unknown number of outputs in the workflow, so shouldn't/couldn't these checkpoints just be rules?
(wondering if this may be related to #77, and I want to reimplement a class like Checkpoint_GatherResults
in another workflow so also curious if my mental model is missing something important)
with #29, genome-grist now supports running on multiple samples. which begs the question, what kind of reporting should/could we be doing here for collections of samples?
some ideas -
I suggest using Github action:build-and-push-docker-images to build and push genome-grsit docker image to DockerHub and GitHub Container Registry.
Why is this? to keep genome-grist docker image auto updated to be used easily in the various workflows etc ...
In a conversations with @ctb, @ctb realized that A LOT of metagenomes that I've been working with for a recent project appeared truncated. Instead, the files were much smaller than what is on the SRA because I had removed human reads before downstream analysis.
Here's a smattering of the file sizes for some of the files i've been working with. For some, there is almost no human (host) reads, and for others, there's quite a bit:
-rw-r--r-- 1 tereiter tereiter 232M Feb 23 2020 4000_R1.human.fq.gz
-rw-r--r-- 1 tereiter tereiter 1.5G Feb 23 2020 4000_R1.nohost.fq.gz
-rw-r--r-- 1 tereiter tereiter 232M Feb 23 2020 4000_R2.human.fq.gz
-rw-r--r-- 1 tereiter tereiter 1.6G Feb 23 2020 4000_R2.nohost.fq.gz
-rw-r--r-- 1 tereiter tereiter 8.8M Feb 23 2020 4001_R1.human.fq.gz
-rw-r--r-- 1 tereiter tereiter 1.5G Feb 23 2020 4001_R1.nohost.fq.gz
-rw-r--r-- 1 tereiter tereiter 8.3M Feb 23 2020 4001_R2.human.fq.gz
-rw-r--r-- 1 tereiter tereiter 1.5G Feb 23 2020 4001_R2.nohost.fq.gz
-rw-r--r-- 1 tereiter tereiter 5.2M Feb 23 2020 4002_R1.human.fq.gz
-rw-r--r-- 1 tereiter tereiter 2.4G Feb 23 2020 4002_R1.nohost.fq.gz
-rw-r--r-- 1 tereiter tereiter 4.3M Feb 23 2020 4002_R2.human.fq.gz
-rw-r--r-- 1 tereiter tereiter 2.4G Feb 23 2020 4002_R2.nohost.fq.gz
-rw-r--r-- 1 tereiter tereiter 5.5M Feb 23 2020 4004_R1.human.fq.gz
-rw-r--r-- 1 tereiter tereiter 978M Feb 23 2020 4004_R1.nohost.fq.gz
-rw-r--r-- 1 tereiter tereiter 5.2M Feb 23 2020 4004_R2.human.fq.gz
-rw-r--r-- 1 tereiter tereiter 987M Feb 23 2020 4004_R2.nohost.fq.gz
-rw-rw-r-- 1 tereiter tereiter 5.3M Feb 26 2020 PSM7J4EF_R1.human.fq.gz
-rw-rw-r-- 1 tereiter tereiter 54M Feb 26 2020 PSM7J4EF_R1.nohost.fq.gz
-rw-rw-r-- 1 tereiter tereiter 5.1M Feb 26 2020 PSM7J4EF_R2.human.fq.gz
-rw-rw-r-- 1 tereiter tereiter 54M Feb 26 2020 PSM7J4EF_R2.nohost.fq.gz
-rw-rw-r-- 1 tereiter tereiter 80K Feb 24 2020 PSMA264O_R1.human.fq.gz
-rw-rw-r-- 1 tereiter tereiter 738M Feb 24 2020 PSMA264O_R1.nohost.fq.gz
-rw-rw-r-- 1 tereiter tereiter 83K Feb 24 2020 PSMA264O_R2.human.fq.gz
-rw-rw-r-- 1 tereiter tereiter 755M Feb 24 2020 PSMA264O_R2.nohost.fq.gz
-rw-rw-r-- 1 tereiter tereiter 346K Feb 25 2020 PSMA265X_R1.human.fq.gz
-rw-rw-r-- 1 tereiter tereiter 540M Feb 25 2020 PSMA265X_R1.nohost.fq.gz
-rw-rw-r-- 1 tereiter tereiter 337K Feb 25 2020 PSMA265X_R2.human.fq.gz
-rw-rw-r-- 1 tereiter tereiter 544M Feb 25 2020 PSMA265X_R2.nohost.fq.gz
I used bbduk to remove the human reads. You can download the database here: https://drive.google.com/file/d/0B3llHR93L14wd0pSSnFULUlhcUk/edit?usp=sharing
The release of this database is documented here: http://seqanswers.com/forums/archive/index.php/t-42552.html
rule remove_host:
output:
r1 = 'outputs/bbduk/{library}_R1.nohost.fq.gz',
r2 = 'outputs/bbduk/{library}_R2.nohost.fq.gz',
human_r1='outputs/bbduk/{library}_R1.human.fq.gz',
human_r2='outputs/bbduk/{library}_R2.human.fq.gz'
input:
r1 = 'outputs/cut/{library}_R1.cut.fq.gz',
r2 = 'outputs/cut/{library}_R2.cut.fq.gz',
human='inputs/host/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz'
conda: 'envs/env.yml'
shell:'''
bbduk.sh -Xmx64g t=3 in={input.r1} in2={input.r2} out={output.r1} out2={output.r2} outm={output.human_r1} outm2={output.human_r2} k=31 ref={input.human}
'''
I originally detected human reads by adding a human genome signature to my gather runs. Up to 10% of samples were human. A genome sig is here: https://osf.io/fxup3/ and an rna sig is here: https://osf.io/anj6b/
If using the genome sig, repetitive regions may cause false matching, especially to other eukaryotes. However, the RNA sig is much smaller and would miss a lot of real human. The bbduk database above deals with this nicely I think...sourmash is currently at the quick and dirty stage I think. Could be possible to turn the bbduk database into a signature?
Titus:
I wonder what the restrictions are on internet access during the jobs? there are a bunch of HTTP calls out to NCBI… it’d be annoying to have to pre-load the matching genomes (or, well, run gather… then exit wf, download matching genomes somewhere else, then restart wf)
Current error running genome-grist tutorial in WDL:
rule make_genbank_info_csv:
output: genbank_genomes/GCF_000012825.1.info.csv
jobid: 29
wildcards: acc=GCF_000012825.1
opening directory: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/012/825
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1567, in ftp_open
fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1588, in connect_ftp
return ftpwrapper(user, passwd, host, port, dirs, timeout,
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2409, in __init__
self.init()
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2418, in init
self.ftp.connect(self.host, self.port, self.timeout)
File "/root/miniconda3/lib/python3.8/ftplib.py", line 152, in connect
self.sock = socket.create_connection((self.host, self.port), self.timeout,
File "/root/miniconda3/lib/python3.8/socket.py", line 808, in create_connection
raise err
File "/root/miniconda3/lib/python3.8/socket.py", line 796, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 124, in <module>
sys.exit(main())
File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 106, in main
genome_url, assembly_report_url = url_for_accession(acc)
File "/root/miniconda3/lib/python3.8/site-packages/genome_grist/genbank_genomes.py", line 24, in url_for_accession
with urllib.request.urlopen(url) as response:
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 542, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1585, in ftp_open
raise exc.with_traceback(sys.exc_info()[2])
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1567, in ftp_open
fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 1588, in connect_ftp
return ftpwrapper(user, passwd, host, port, dirs, timeout,
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2409, in __init__
self.init()
File "/root/miniconda3/lib/python3.8/urllib/request.py", line 2418, in init
self.ftp.connect(self.host, self.port, self.timeout)
File "/root/miniconda3/lib/python3.8/ftplib.py", line 152, in connect
self.sock = socket.create_connection((self.host, self.port), self.timeout,
File "/root/miniconda3/lib/python3.8/socket.py", line 808, in create_connection
raise err
File "/root/miniconda3/lib/python3.8/socket.py", line 796, in create_connection
sock.connect(sa)
urllib.error.URLError: <urlopen error ftp error: TimeoutError(110, 'Connection timed out')>
[Thu Jan 28 17:51:21 2021]
Error in rule make_genbank_info_csv:
jobid: 29
output: genbank_genomes/GCF_000012825.1.info.csv
shell:
python -m genome_grist.genbank_genomes GCF_000012825.1 --output genbank_genomes/GCF_000012825.1.info.csv
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job make_genbank_info_csv since they might be corrupted:
genbank_genomes/GCF_000012825.1.info.csv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /cromwell-executions/ggristv2/8c56fb3e-b7dd-4e93-aaa7-1c7e2cecb4c1/call-run_test/execution/grist/.snakemake/log/2021-01-28T173100.518576.snakemake.log
Error in snakemake invocation: Command '['snakemake', '-s', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/Snakefile', '-j', '1', '--use-conda', 'summarize', '--configfile', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/defaults.conf', '/root/miniconda3/lib/python3.8/site-packages/genome_grist/conf/system.conf', '/cromwell-executions/ggristv2/8c56fb3e-b7dd-4e93-aaa7-1c7e2cecb4c1/call-run_test/inputs/-269893485/conf-tutorial.yml']' returned non-zero exit status 1.
So, I did a thing, and I don't really know how to evaluate it. Thoughts welcome!
so I gristed a bunch of the Northen data sets, using both DNA searching (against all of genbank, ~700k) and protein searching (against @bluegenes genus-level GTDB databases).
Then I intersected with Genbank taxonomy and did taxonomic aggregation and summarization (#35).
default config, against genbank.
config:
outdir: outputs.roux.protein/
sourmash_scaled: 1000
sourmash_compute_ksizes:
- 11
sourmash_sigtype: protein
sourmash_database_glob_pattern: /home/ntpierce/thumper/databases/gtdb95-genus-n0.protein-k11-scaled100.sbt.zip
sourmash_database_ksize: 33
sourmash_database_threshold_bp: 50e3
To install genome-grist, you currently need to run:
git clone https://github.com/dib-lab/genome-grist.git
pip install -e genome-grist
As of today, this installation version is v0.6.dev6+g0e772df.
genome-grist commands need to be run from inside the repo right now, for example:
cd genome-grist
genome-grist process HSMA33MX smash_reads
# snakemake workflow should start running
# or you can run the whole test workflow
make test
# snakemake workflow should start running
If you use the pip install instructions in the repo README, the installation code will run without error (v0.5). But that installation method is missing the config.yml
file and you'll get errors like this when you try the quick start command genome-grist process HSMA33MX smash_reads
:
File "/home/mcwlim/miniconda3/envs/grist/lib/python3.7/site-packages/genome_grist/__main__.py", line 77, in run_snakemake
raise ValueError(f"cannot find config file '{configfile}'")
ValueError: cannot find config file 'conf.yml'
I increasingly like the idea of generating a combined report for DNA and protein matches per #40.
This would add another layer of reporting. In addition to reporting summaries across multiple samples per #42, and aggregated taxonomic reporting #35 at the DNA level, we would add protein reporting, and also maybe do some kind of comparative analysis of DNA vs protein presence.
as per charcoal https://github.com/ctb/2020-charcoal-sankey, could use sankey flow diagrams in #35
using #56, add a notebook or output for summary reporting on sample details.
some code for loading and displaying:
import itables
import glob
import yaml
import pandas
x = []
for yf in glob.glob('outputs.paper/*.yaml'):
with open(yf, 'rt') as fp:
d = yaml.safe_load(fp)
x.append(d)
df = pandas.DataFrame(x)
cols = ['sample', 'n_reads', 'n_bases', 'kmers', 'total_hashes', 'known_hashes', 'unknown_hashes']
df = df[cols]
this seems like generically useful code that ensures that the various dataframes are all copacetic and synchronized...
SampleDFs = namedtuple('SampleDFs', 'gather_df, all_df, left_df, names_df')
def load_dfs(outdir, sample_id):
# load mapping CSVs
all_df = pd.read_csv(f'{outdir}/minimap/depth/{sample_id}.summary.csv')
left_df = pd.read_csv(f'{outdir}/leftover/depth/{sample_id}.summary.csv')
# load gather CSV
gather_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.x.genbank.gather.csv')
# names!
names_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.genomes.info.csv')
# connect gather_df to all_df and left_df using 'genome_id'
def fix_name(x):
return "_".join(x.split('_')[:2]).split('.')[0]
gather_df['genome_id'] = gather_df['name'].apply(fix_name)
names_df['genome_id'] = names_df['acc'].apply(fix_name)
# check that all dataframes are copacetic
in_gather = set(gather_df.genome_id)
in_left = set(left_df.genome_id)
assert in_gather == in_left
assert in_gather == set(names_df.genome_id)
assert in_gather == set(all_df.genome_id)
# re-sort left_df and all_df to match gather_df order, using matching genome_id column
all_df.set_index("genome_id")
all_df.reindex(index=gather_df["genome_id"])
all_df.reset_index()
left_df.set_index("genome_id")
left_df.reindex(index=gather_df["genome_id"])
left_df.reset_index()
#left_df["mapped_bp"] = (1 - left_df["percent missed"]/100) * left_df["genome bp"]
#left_df["unique_mapped_coverage"] = left_df.coverage / (1 - left_df["percent missed"] / 100.0)
names_df.set_index("genome_id")
names_df.reindex(index=gather_df["genome_id"])
names_df.reset_index()
return SampleDFs(gather_df, all_df, left_df, names_df)
it's a little bit circuitous, but for people that really want taxonomic classification of reads and not entire data sets, we can do that by identifying their best-mapping genome(s) :)
about paladin: https://github.com/ToniWestbrook/paladin
"PALADIN currently only supports single-end reads (or reads merged with FLASH, PEAR, abyss-mergepairs), and BWA-MEM based alignment. It makes use of many BWA parameters and is therefore compatible with many of its command line arguments."
working snakefile for PEAR --> paladin. On farm, see /home/ntpierce/genome-grist/test_paladin
to run w/test data
rule all:
input: expand("paladin_mapped/{sample}.bam", sample="test")
rule pear_read_merging:
"""
Merge PE reads with PEAR, for input into PALADIN
"""
input:
r1="{sample}_1.fq.gz",
r2="{sample}_2.fq.gz"
output:
assembled = '{sample}.assembled.fastq',
discarded = '{sample}.discarded.fastq',
unassembled_r1 = '{sample}.unassembled.forward.fastq',
unassembled_r2 = '{sample}.unassembled.reverse.fastq',
message:
"""--- Merging paired reads using PEAR ---"""
params:
pval=0.01,# (default PEAR pval)
max_mem="4G",
threads: 6
log: "logs/{sample}.pear.log"
conda: 'paladin-env.yml'
shell:
"""
pear -f {input.r1} -r {input.r2} -p {params.pval} \
-j {threads} -y {params.max_mem} -o {wildcards.sample} {log}
"""
# using swissprot as an example, even though you can run paladin prepare to autodownload this, instead of running index
rule paladin_index:
input:
"uniprot_sprot.fasta",
output:
"uniprot_sprot.fasta.bwt"
log:
"logs/uniprot_sprot.paladin-index.log"
conda: 'paladin-env.yml'
shell:
"""
paladin index -r3 {input}
"""
rule paladin_align:
input:
reads='{sample}.assembled.fastq',
index="uniprot_sprot.fasta.bwt",
output:
"paladin_mapped/{sample}.bam"
params:
index_base="uniprot_sprot.fasta"
log:
"logs/{sample}.paladin-align.log"
threads: 4
conda: "paladin-env.yml"
shell:
"""
paladin align -t {threads} -T 20 {params.index_base} {input.reads} | samtools view -Sb - > {output}
"""
paladin-env.yml
:
name: paladin
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- pear=0.9.6
- paladin=1.4.6
- samtools=1.11
probably can be enabled as part of caching #8
So @taylorreiter gave me some internally sequenced and processed files that I wanted to feed into genome-grist, and I wrote the following little Snakefile to bypass all of the SRA stuff. FYI.
(taylor gave me the .nohost.fq.gz files, and I wanted to put them in interleaved format after fixing the names)
NAMES=['AW_BI_1_Soil', 'AW_BI_1_Water']
rule some:
input:
expand('outputs.killi/abundtrim/{N}.abundtrim.fq.gz', N=['AW_M2_Lung'])
rule all:
input:
expand('outputs.killi/abundtrim/{N}.abundtrim.fq.gz', N=NAMES)
rule fixname: # fix sequence names
input:
r1="{name}_R1.nohost.fq.gz",
r2="{name}_R2.nohost.fq.gz",
output:
r1="{name}_R1.name.fq.gz",
r2="{name}_R2.name.fq.gz",
conda: 'genome_grist/conf/env/sra.yml'
shell: """
echo processing R1...
seqtk seq -C {input.r1} | \
perl -ne 's/\.([12])$/\/$1/; print $_' | \
gzip -c > {output.r1}
echo processing R2...
seqtk seq -C {input.r2} | \
perl -ne 's/\.([12])$/\/$1/; print $_' | \
gzip -c > {output.r2}
"""
rule merge: # interleave
input:
r1="{name}_R1.name.fq.gz",
r2="{name}_R2.name.fq.gz",
output:
"{name}.abundtrim.fq.gz"
conda: 'genome_grist/conf/env/sra.yml'
shell: """
interleave-reads.py {input.r1} {input.r2} --gzip -o {output}
"""
Hello,
My understanding is that genome grist tells us what genomes are present. I am trying to determine the relative abundance of different strains that are found to be present. Can I do this by just using --track-abundance when running sourmash compute to make the database that is used by sourmash gather?
Also, would you mind explaining the difference between hashes classified to this genome, mapped bp to this genome, and hashes classified to this species?
Thank you!
right at the wc_sourmash_gather
step - sourmash gather
errors out if no matches found.
right now the only solution is to remove the sample from the config...
with sourmash 4.1, we can do the split into known/unknown using sourmash prefetch
, but it requires adjusting the reporting in this rule.
ref #68
...prior to trimming, etc. helps with postmortem/retrospective analyses of information loss!
this rule:
# download SRA IDs.
rule wc_download_sra:
output:
r1 = protected(outdir + "/raw/{sra_id}_1.fastq.gz"),
r2 = protected(outdir + "/raw/{sra_id}_2.fastq.gz"),
conda: "env/sra.yml"
shell: '''
fastq-dump --skip-technical \
--readids \
--read-filter pass \
--dumpbase \
--split-spot \
--clip \
-Z \
{wildcards.sra_id} | \
perl -ne 's/\.([12]) /\/$1 /; print $_' | \
split-paired-reads.py --gzip -1 {output.r1} -2 {output.r2}
'''
is generating the following error:
Unpaired reads found starting at SRR5855412.25364063/1 25364063 length=150; exiting
fastq-dump (PID 111316) was killed [SIGPIPE (13)] (no core dump was generated) at /home/ctbrown/genome-grist/.snakemake/conda/2d8b6fef/bin/fastq-dump line 242.
[Wed Dec 16 00:03:26 2020]
Error in rule wc_download_sra:
...
for multiple different SRA data sets (SRR5855412, SRR5855424, SRR5855429) and I'm not sure what to do about it.
My current thought is to download the data sets manually and clean them, but I'm wondering if there is a reason this is happening...
I was also thinking about splitting it out in the workflow so that rather than doing all the piping etc. it would save the SRA download file first, then do the perl & split-paired-reads steps second. This would give me access to the data more directly, & I could try using temporary
to indicate that (once the file is successfully processed) it can be deleted.
I'm not sure what error I'm making to cause this error message.
directory structure:
$ ls *
conf.yml README.sh
genbank:
gather_vita_vars_gtdb_shared_assemblies.x.genbank.gather.csv
genbank_genomes:
conf file:
sample:
- gather_vita_vars_gtdb_shared_assemblies
outdir: .
metagenome_trim_memory: 1e9
command:
genome-grist run conf.yml download_matching_genomes
Error:
genome-grist run conf.yml download_matching_genomes
sample: ['gather_vita_vars_gtdb_shared_assemblies']
outdir: .
Building DAG of jobs...
InputFunctionException in line 144 of /home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile:
Error:
WorkflowError:
Missing wildcard values for sample
Wildcards:
Traceback:
File "/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile", line 131, in __call__
Error in snakemake invocation: Command '['snakemake', '-s', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/Snakefile', '-j', '1', '--use-conda', 'download_matching_genomes', '--configfile', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/defaults.conf', '/home/tereiter/miniconda3/envs/grist/lib/python3.8/site-packages/genome_grist/conf/system.conf', 'conf.yml']' returned non-zero exit status 1.
head of the gather file i'm using (I did read a bunch of gather files into R and then write this out, so theres NAs, but I don't think they're the problem):
intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gathe
868000,0.01065658301821932,0.1523876404494382,0.01065658301821932,0.01065658301821932,NA,NA,NA,"GCA_000210075.1 Bacteroides xylanisolvens XB1A strain=XB1A, ASM2100
58000,7.120758237980651e-4,2.918004085205719e-4,2.2098904876491677e-4,2.2098904876491677e-4,NA,NA,NA,"GCA_000256725.2 Toxoplasma gondii TgCatPRC2 strain=TgCatPRC2,
106000,0.0013013799538378432,0.004314477468839885,2.2098904876491677e-4,2.2098904876491677e-4,NA,NA,NA,"GCA_000285855.2 Ruminococcus sp. JC304, ASM28585v2",/group/
394000,0.004837204734076512,0.2008324661810614,0.0047389873790698814,0.0047389873790698814,NA,NA,NA,"GCA_000431575.1 Firmicutes bacterium CAG:83, MGS83",/group/ctb
134000,0.001645140696361047,0.030209140201394268,9.576192113146393e-4,9.576192113146393e-4,NA,NA,NA,"GCA_000432135.1 Firmicutes bacterium CAG:124, MGS124",/group/c
20000,2.455433875165742e-4,0.009478672985781991,2.455433875165742e-4,2.455433875165742e-4,NA,NA,NA,"GCA_000432155.1 Eubacterium sp. CAG:274, MGS274",/group/ctbrown
100000,0.001227716937582871,0.030973451327433628,6.875214850464077e-4,6.875214850464077e-4,NA,NA,NA,"GCA_000432375.1 Firmicutes bacterium CAG:103, MGS103",/group/c
96000,0.001178608260079556,0.0519774011299435,0.0011294995825762413,0.0011294995825762413,NA,NA,NA,"GCA_000432435.1 Eubacterium sp. CAG:180, MGS180",/group/ctbrown
62000,7.6118450130138e-4,0.01794071762870515,5.647497912881207e-4,5.647497912881207e-4,NA,NA,NA,"GCA_000434635.1 Firmicutes bacterium CAG:110, MGS110",/group/ctbro
22000,2.700977262682316e-4,0.015068493150684932,2.700977262682316e-4,2.700977262682316e-4,NA,NA,NA,"GCA_000435175.1 Clostridium sp. CAG:245, MGS245",/group/ctbrown
540000,0.0066296714629475026,0.209992193598751,0.006605117124195845,0.006605117124195845,NA,NA,NA,"GCA_000435395.1 Firmicutes bacterium CAG:94, MGS94",/group/ctbro
82000,0.0010067278888179541,0.04411764705882353,8.839561950596671e-4,8.839561950596671e-4,NA,NA,NA,"GCA_000435555.1 Firmicutes bacterium CAG:176, MGS176",/group/ct
Would/could this work or do summarize
and zip
need to run separately?
genome-grist run <config.yml> summarize zip
It would be nice to run in one line :)
Update - right, so combining doesn't work if summarize
hasn't been run yet, but it does create the zip file if the outputs directory already exists. :) anyhow, i'll leave this here as an enhancement feature.
it would seem to make sense to have the sgc queries be done on pangenomes at the species or genus level, rather than individual genomes.
we could accomplish this by grouping the gathered genome with either taxonomic groupings or Jaccard similarity clustering.
I'm thinking of doing the command line for this snakemake app differently; right now I'm not using config files much, just the sample name. How does the below sound?
each command would be successive & snakemake-ified, of course, so you could run the last one and it would do all the previous ones.
# download SRA metagenome
genome-grist download-sra HSMA33MX
# trim and abundtrim SRA metagenome
genome-grist trim-reads HSMA33MX
# make a sourmash signature
genome-grist smash HSMA33MX
# download, abundtrim, genbank gather SRA metagenome
genome-grist gather HSMA33MX
# download genbank genomes based on gather results
genome-grist download-genomes HSMA33MX
# do iterative read mapping etc based on gather
genome-grist map HSMA33MX
# summarize content, abundance, taxonomy, mapping based on gather => pretty graphs etc.
genome-grist summarize HSMA33MX
# extract spacegraphcats neighborhoods
genome-grist neighborhoods HSMA33MX
(1) download a metagenome
(2) process it into trimmed reads + signature
(3) gather x genbank
(4) download matching genomes
(5) map all reads to all matching genomes
(6) extract “leftover” reads iteratively, by successively eliminating reads that matched to previous gather matches
(7) runs mapping on “leftover” reads to genomes
(8) summarize all mapping results for comparison and graphing a la figure 2 in the gather draft paper
@luizirber comment:
(1), (2) and (5) are also useful for MAG search (where "matching genomes" are the queries)
@taylorreiter comment:
ya i think this generally get’s the people what they've been wanting forever — the reads that match the actual thing
(parts of this is a thing that happens in distillerycats that would be awesome if it were already implemented and modularized)
so the question is:
should we appify/clickify it?
it seems like there could be utility in having these steps be made a bit more generic, and released as part of a separate package that behaves like elvers/etc.
e.g. “give me the genome files for this gather csv”
and “give me the mapped reads, both ALL and LEFTOVER, for this gather csv”
since this is boring functionality that may break routinely with NCBI updates, etc., we can package it and test the bejeezus out of it
and then rely on that, without having it be reimplemented in various different ways across a dozen individual lab projects
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.