sourmash-bio / branchwater

Searching large collections of sequencing data with genome-scale queries

Home Page: https://branchwater.sourmash.bio

License: Other

JavaScript 15.88% Python 21.55% HTML 27.84% Dockerfile 0.99% Rust 23.57% Nix 10.10% Makefile 0.07%

metagenomics sourmash

branchwater's Introduction

branchwater

This is the central repository for branchwater.

branchwater is the framework we use for searching large collections of sequencing data with genome-scale queries. At its core it is a new search index for sourmash signatures, allowing near real-time search of large scale databases. It is an inverted index implemented on top of RocksDB.
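To make the inverted-index idea concrete, here is a minimal in-memory sketch in Python. It is not the branchwater implementation (which is in Rust on top of RocksDB). The function names, the toy hash values, and the 0.1 default threshold are all assumptions made for illustration. The index maps each hash to the datasets containing it, so a query only touches the entries for its own hashes.

```python
from collections import Counter, defaultdict


def build_inverted_index(datasets):
    """Map each hash value to the set of dataset names that contain it."""
    index = defaultdict(set)
    for name, hashes in datasets.items():
        for h in hashes:
            index[h].add(name)
    return index


def containment_search(index, query_hashes, threshold=0.1):
    """Return (dataset, containment) pairs at or above the threshold.

    Containment here is the fraction of query hashes found in a dataset.
    """
    query_hashes = set(query_hashes)
    if not query_hashes:
        return []
    counts = Counter()
    for h in query_hashes:
        # .get avoids creating empty entries in the defaultdict
        for name in index.get(h, ()):
            counts[name] += 1
    results = [
        (name, n / len(query_hashes))
        for name, n in counts.items()
        if n / len(query_hashes) >= threshold
    ]
    # Best matches first, ties broken by name for stable output
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Toy data: dataset names and hash values are made up.
datasets = {"SRR_A": {1, 2, 3, 4}, "SRR_B": {3, 4, 5}, "SRR_C": {9}}
index = build_inverted_index(datasets)
matches = containment_search(index, {2, 3, 4, 5}, threshold=0.5)
```

The cost of a query in this layout depends on the number of query hashes and their posting lists rather than on scanning every dataset, which is the property that makes near real-time search plausible at scale.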

branchwater has gone by several names over time:

sra_search
MAGsearch
rocksdb-eval
mastiff

We finally brought it all together under the same umbrella.

Here are a few blog posts:

MinHashing all the things: searching for MAGs in the SRA
MinHashing all the things: a quick analysis of MAG search results
Searching all public metagenomes with sourmash
Discussion for the initial prototype for real time search of the SRA

Code repository links and details.

branchwater is based on sourmash, and the search index data structure has lived there since version 0.12 of the Rust crate.

As of January 2024, branchwater is mostly contained in this repo, along with the tools developed to work with the new index:

branchwater-api, a search server indexing ~946,000 SRA metagenomes.
branchwater-web, a webapp that takes a genome of interest and rapidly searches for publicly-available metagenomes within NCBI's sequence read archive with branchwater. Metadata associated with the metagenome accessions are summarized in interactive tables, plots, and maps.
branchwater-index, a command-line interface to build the search index. See the Query README for more details.
branchwater-query, a command-line interface to submit queries to a search server.

There are also additional resources:

The code for monitoring the SRA and building sourmash sketches from genomes and metagenomes is in wort.
sourmash_plugin_branchwater is a sourmash plugin exposing more features from branchwater in sourmash.

Need help? Have questions? Want to make a suggestion?

Please file branchwater-specific issues and pull requests in the branchwater repo. We also hang out in the sourmash repo a lot, if you have more general questions about sourmash. And there's a gitter/matrix channel where you can contact a number of the sourmash collaborators.

License information

branchwater is AGPL licensed.

The webapp was developed by the USDA Agricultural Research Service, Genomics and Bioinformatics Research Unit group in Gainesville, FL, primarily authored by Suzanne Fleishman and led by Adam Rivers. Check out their other work at https://tinyecology.com. As a work of the United States Government, the original code is available under the CC0 1.0 Universal Public Domain Dedication (CC0 1.0).

branchwater's Issues

document local and remote-accessible locations of wort sketches

On farm, all the SRA sketches are under /group/ctbrowngrp/irber/data/wort-data/wort-sra.

We have a second copy on the temporary quobyte disk per ctb/magsearch#12.

Remote accessibility to the sketches is a work in progress.

Selecting multiple queries

Dear branchwater team,

I was wondering if branchwater is intended to be able to accept several genomes as queries. The website allows it but I am noticing that if I select two FastA files as inputs, even though they both seem to load and get green checkmarks, the results in the CSV file are only for one of the genomes in my list (and they are identical to the ones I get if I only select that one genome as query). I am wondering if this feature isn't implemented yet or if it's not intended at all (or maybe I am doing something wrong).

Thanks!
Tanya

Add dev containers for easier dev environments

After #4 is merged, work on adding a dev container so that Gitpod or GitHub Codespaces can serve as a development platform (without needing to set up a local docker compose stack)

Links:
https://docs.github.com/en/codespaces/setting-up-your-project-for-codespaces/adding-a-dev-container-configuration/introduction-to-dev-containers
https://docs.github.com/en/codespaces/developing-in-a-codespace/creating-a-codespace-for-a-repository#creating-a-codespace-for-a-repository
https://containers.dev/templates
https://github.com/devcontainers/templates/blob/main/src/rust/.devcontainer/devcontainer.json

match to a query is missing from branchwater Web site

The accession BK010471 is for a crAssphage that is ubiquitous in human gut metagenomes (link), and in particular is found in the 454 data set SRR073439.

When I do a containment search, I see:

% sourmash search --containment BK010471.fa.sig SRR073439.sig -k 31

selecting specified query k=31
loaded query: BK010471.fa... (k=31, DNA)
--
loaded 3 total signatures from 1 locations.
after selecting signatures compatible with search, 1 remain.

1 matches above threshold 0.080:
similarity   match
----------   -----
 59.0%       SRR073439

and the Venn diagram is pleasing:

However, the FASTA sequence does not have any matches when searched at https://branchwater.jgi.doe.gov/. Any ideas?

thanks!

SRR073439.k31.sig.zip
BK010471.k31.sig.zip
BK010471.fa.zip

explore lat-lon-parser when normalizing metadata

https://pypi.org/project/lat-lon-parser/

#4 fixed variable name when normalizing metadata, but we are still missing lat/lon combinations in the values because it is a freeform text field...

(thanks @rchikhi for the suggestion!)
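As a rough illustration of what normalizing such a free-form field involves, here is a small standard-library sketch. It does not use lat-lon-parser. The function name and the regular expression are assumptions, and the pattern only covers one layout ("decimal degrees, hemisphere letter, decimal degrees, hemisphere letter"). Anything else is returned as None so it can be handed to a more complete parser.

```python
import re

# Matches strings such as "38.98 N 77.11 W". This is an assumption about one
# common layout, not a complete grammar for the free-form field.
LAT_LON_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])[\s,]+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)


def parse_lat_lon(text):
    """Return (lat, lon) in signed decimal degrees, or None if unparseable."""
    if not text:
        return None
    m = LAT_LON_RE.match(text)
    if m is None:
        return None
    lat = float(m.group(1))
    lon = float(m.group(3))
    # Reject values outside the valid coordinate ranges
    if lat > 90 or lon > 180:
        return None
    if m.group(2).upper() == "S":
        lat = -lat
    if m.group(4).upper() == "W":
        lon = -lon
    return lat, lon
```

Returning None instead of raising keeps the failure cases easy to count, which helps when deciding whether a dedicated parsing library is worth adding.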

allow users to modify search threshold in 'advanced' search

add information on how to cite branchwater to the Web site

ref #10

Use cargo-dist for release CI

https://github.com/axodotdev/cargo-dist/

bad signature - ERR1248659

@luizirber not sure where to post this, but don't want it to get lost -

/group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs/ERR1248659.sig is a bad sketch file.

What's the right way to handle this? Can you/we trigger rebuilding it?

No results from the Webserver

I have tried the web server at https://branchwater.jgi.doe.gov/ many times over the last few days and I have not gotten any results from any of the tries. I did the search using the E. coli K12 genome in fasta format.

Jonathan

operation timed out

$ ./mastiff sequence.fasta > matches.csv
[2023-07-15T04:19:00Z INFO mastiff_client] Preparing signature
[2023-07-15T04:19:00Z INFO mastiff_client] Sending request to https://mastiff.sourmash.bio
Error:
0: error sending request for url (https://mastiff.sourmash.bio/search): operation timed out
1: operation timed out

Location:
crates/client/src/main.rs:132

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets

Publish images for branchwater components

After #4 is merged, tag a new release and start publishing images, to make it easier for external projects to bring up/expand on branchwater

Questions:

where to host? dockerhub, quay.io, ~~ghcr.io~~?

running branchwater on large assemblies

Hello,
Thanks for developing such a great tool. I've been trying to run branchwater on some whole metagenome assemblies that are quite large (0.3G-1G). When I upload even the smaller ones and submit them I don't get any output. I've tried leaving a couple with the tab open for ~12 hours to no avail.
If I leave them for long enough will they eventually complete?
Thanks!
Jenny

How to cite mastiff?

Dear Luiz,

Thank you for creating this tool! Can you please tell me how to cite it?

add information about data privacy to branchwater web site

from an e-mail conversation, author luiz:

I'm enjoying the branchwater metagenome query tool. However, are the submitted
queries stored or used within your servers? I want to submit a couple of genomes
that are not published yet, and I want to make sure they are only available
after our manuscript is published.

They are stored only in memory while the search is happening, they are never
stored on disk [0].

I do check the server access logs from time to time to have an idea of how many
unique visitors we have, but this is only data that the HTTP server logs
regularly (time, IP), and doesn't include the HTTP request content (where the
query data actually lives).

Move metadata from mongodb into the index manifest?

Over at sourmash-bio/sourmash#3006 (comment) I mentioned adding extra columns to manifest to hold metadata not available in a signature. I think we can do the same approach to store the SRA metadata into the manifest, and remove the mongodb dependency, returning the metadata from the search index together with the containment.

More refs on the sourmash context: sourmash-bio/sourmash#2180

But... is it a good idea?

Over at #4 I'm trying to make it easy to bring up a new branchwater installation, and there is a bit of a dance for building the index, bringing up mongo, loading metadata, and then bringing up the server/frontend. Moving the metadata into the index building step makes things easier, but it requires being able to update the manifest in the index in case we want different data (which is not that hard, it's a CSV). It can be more constraining for developing new frontend features, tho?

pinging @bluegenes and @SuzanneFleishman for ideas =]
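The "extra columns in the manifest" idea discussed above can be sketched with the standard csv module. This is only an illustration under assumptions: the function name, the join on a `name` column, and the toy rows are hypothetical, and the real manifest layout may differ. Leading `#` comment lines are passed through unchanged.

```python
import csv
import io


def add_metadata_columns(manifest_text, metadata, key="name"):
    """Return manifest CSV text with extra columns joined from `metadata`.

    `metadata` maps a key value (e.g. an accession) to a dict of new columns.
    Rows without metadata get empty strings in the new columns.
    """
    lines = manifest_text.splitlines()
    # Keep any leading comment lines (such as a version header) as they are.
    n = 0
    while n < len(lines) and lines[n].startswith("#"):
        n += 1
    comments, body = lines[:n], lines[n:]

    reader = csv.DictReader(body)
    existing = list(reader.fieldnames or [])
    extra = sorted(
        {col for row in metadata.values() for col in row if col not in existing}
    )

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=existing + extra, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        meta = metadata.get(row[key], {})
        row.update({col: meta.get(col, "") for col in extra})
        writer.writerow(row)
    return "\n".join(comments + [out.getvalue().rstrip("\n")]) + "\n"


# Toy manifest and metadata; all values are made up.
manifest = "# manifest header\nname,ksize,scaled\nsample_A,21,1000\nsample_B,21,1000\n"
metadata = {"sample_A": {"biome": "gut"}}
updated = add_metadata_columns(manifest, metadata)
```

Because the manifest stays a plain CSV, swapping in different metadata is a rewrite of one file, which is the trade-off the issue weighs against keeping a separate database.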

visualize usage stats

thanks to @luizirber's nixos setup, we're already logging with caddy

log files on the mastiff server: wc -l /var/log/caddy/access-*

3868 /var/log/caddy/access-branchwater.sourmash.bio.log
23794 /var/log/caddy/access-mastiff.sourmash.bio.log
4511 /var/log/caddy/access-minke.sourmash.bio.log

But we should make this logging visible/accessible/usable.

suggestions from luiz:

goaccess (https://moth.monster/blog/caddygoaccess/)
https://www.goatcounter.com/

we could alternatively set up a self-hosted plausible.io if we have issues using the caddy logs...
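Before adopting one of those tools, a quick look at the logs can be done with a few lines of Python. The sketch below assumes the access logs are JSON lines with a numeric `ts` field and a client address under `request.remote_ip`. That layout is an assumption and should be checked against the actual log files. The function name and the sample entries are hypothetical.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def unique_visitors_per_day(log_lines):
    """Count distinct client IPs per UTC day from JSON access-log lines."""
    seen = defaultdict(set)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
            day = (
                datetime.fromtimestamp(entry["ts"], tz=timezone.utc)
                .date()
                .isoformat()
            )
            ip = entry["request"]["remote_ip"]
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that do not match the assumed layout
        seen[day].add(ip)
    return {day: len(ips) for day, ips in sorted(seen.items())}


# Made-up entries using documentation-range IP addresses.
sample = [
    '{"ts": 1689379200, "request": {"remote_ip": "203.0.113.1"}}',
    '{"ts": 1689379260, "request": {"remote_ip": "203.0.113.1"}}',
    '{"ts": 1689379300, "request": {"remote_ip": "203.0.113.2"}}',
    '{"ts": 1689465600, "request": {"remote_ip": "203.0.113.1"}}',
    "not json",
]
stats = unique_visitors_per_day(sample)
```

In practice the input would be an open log file, for example `unique_visitors_per_day(open(path))`, with `path` pointing at one of the access logs listed above.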

Build a k=31 SRA metagenomes index

I'll use this issue to document steps to build a k=31,scaled=1000 index for SRA metagenomes. This is the same process used for the current k=21,scaled=1000 index in branchwater.sourmash.bio, but considering the changes from #4, and bringing new SRA datasets added after the cutoff from the current index (2023-08-17).
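For readers unfamiliar with the two parameters, here is a simplified sketch of what `k` and `scaled` control in a FracMinHash-style sketch. It is not sourmash's implementation: it uses BLAKE2b from hashlib as a stand-in hash function and skips reverse-complement canonicalization, and the function names are hypothetical. The idea is to hash every k-mer and keep only hashes in the lowest 1/scaled fraction of the 64-bit hash space.

```python
import hashlib


def kmers(sequence, ksize):
    """Yield every overlapping substring of length ksize."""
    for i in range(len(sequence) - ksize + 1):
        yield sequence[i:i + ksize]


def frac_minhash(sequence, ksize=31, scaled=1000):
    """Return the set of k-mer hashes that fall below 2**64 / scaled."""
    max_hash = 2**64 // scaled
    kept = set()
    for kmer in kmers(sequence.upper(), ksize):
        # Stand-in 64-bit hash; an assumption, not the hash sourmash uses.
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h < max_hash:
            kept.add(h)
    return kept


# scaled=1 keeps every distinct k-mer; larger values keep a smaller fraction.
all_hashes = frac_minhash("ACGTACGTAC", ksize=4, scaled=1)
some_hashes = frac_minhash("ACGTACGTAC", ksize=4, scaled=2)
```

Because the cutoff depends only on the hash value, sketches built with the same `k` and `scaled` are directly comparable, and a sketch at a larger `scaled` is always a subset of one built at a smaller `scaled` from the same sequence.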

add links to FracMinHash paper
add FAQs like "how do I get the SRA sketches?" and point people at, umm, places. and mastiff.