
branchwater's Introduction

branchwater

This is the central repository for branchwater.

branchwater is the framework we use for searching large collections of sequencing data with genome-scale queries. At its core it is a new search index for sourmash signatures, allowing near real-time search of large-scale databases. It is an inverted index implemented on top of RocksDB.
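
To make the inverted-index idea concrete, here is a minimal in-memory sketch in Python. It is not branchwater's implementation (which is written in Rust on top of RocksDB); the class name, method names, and dataset names are hypothetical, and a plain dictionary stands in for the on-disk store. It maps each hash to the datasets that contain it, so a query only touches the postings for its own hashes.

```python
from collections import Counter, defaultdict


class InvertedIndex:
    """Toy inverted index: hash value -> set of dataset names (hypothetical API)."""

    def __init__(self):
        # A dict stands in for the key-value store (assumption for illustration).
        self.postings = defaultdict(set)

    def insert(self, name, hashes):
        """Record that dataset `name` contains every hash in `hashes`."""
        for h in hashes:
            self.postings[h].add(name)

    def search(self, query_hashes, threshold=0.0):
        """Return (name, containment) pairs, best match first.

        Containment is the fraction of query hashes found in a dataset.
        """
        query = set(query_hashes)
        if not query:
            return []
        shared = Counter()
        for h in query:
            # .get avoids creating empty postings for unseen hashes.
            for name in self.postings.get(h, ()):
                shared[name] += 1
        results = [
            (name, count / len(query))
            for name, count in shared.items()
            if count / len(query) >= threshold
        ]
        results.sort(key=lambda item: (-item[1], item[0]))
        return results


index = InvertedIndex()
index.insert("metagenome_A", {1, 2, 3, 4})
index.insert("metagenome_B", {3, 4, 5})
print(index.search({2, 3, 4, 9}))
```

Because the lookup cost depends on the number of query hashes rather than on the number of indexed datasets, this layout suits a small genome query against a very large collection.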

You can read more about branchwater in Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search, Irber et al., 2022, and you can read about one of the earliest use cases in Biogeographic Distribution of Five Antarctic Cyanobacteria Using Large-Scale k-mer Searching with sourmash branchwater, Lumian et al., 2022.

branchwater had a couple of names over time:

Here are a few blog posts:

Code repository links and details.

branchwater is based on sourmash, and the search index data structure has lived there since version 0.12 of the Rust crate.

branchwater is currently (Jan 2024) mostly contained in this repo, with the tools developed to work with the new index:

  • branchwater-api, a search server indexing ~946,000 SRA metagenomes.
  • branchwater-web, a webapp that takes a genome of interest and rapidly searches for publicly-available metagenomes within NCBI's sequence read archive with branchwater. Metadata associated with the metagenome accessions are summarized in interactive tables, plots, and maps.
  • branchwater-index, a command-line interface to build the search index. See the Query README for more details.
  • branchwater-query, a command-line interface to submit queries to a search server.

There are also additional resources:

  • The code for monitoring the SRA and building sourmash sketches from genomes and metagenomes is in wort.
  • sourmash_plugin_branchwater is a sourmash plugin exposing more features from branchwater in sourmash.

Need help? Have questions? Want to make a suggestion?

Please file branchwater-specific issues and pull requests in the branchwater repo. We also hang out in the sourmash repo a lot, if you have more general questions about sourmash. And there's a gitter/matrix channel where you can contact a number of the sourmash collaborators.

License information

branchwater is AGPL licensed.

The webapp was developed by the USDA Agricultural Research Service, Genomics and Bioinformatics Research Unit group in Gainesville, FL. It was primarily authored by Suzanne Fleishman and led by Adam Rivers. Check out their other work at https://tinyecology.com. As a work of the United States Government, the original code is available under the CC0 1.0 Universal Public Domain Dedication (CC0 1.0).

branchwater's People

Contributors

luizirber, bluegenes, suzannefleishman


branchwater's Issues

Selecting multiple queries

Dear branchwater team,

I was wondering if branchwater is intended to be able to accept several genomes as queries. The website allows it but I am noticing that if I select two FastA files as inputs, even though they both seem to load and get green checkmarks, the results in the CSV file are only for one of the genomes in my list (and they are identical to the ones I get if I only select that one genome as query). I am wondering if this feature isn't implemented yet or if it's not intended at all (or maybe I am doing something wrong).

Thanks!
Tanya

Add dev containers for easier dev environments

match to a query is missing from branchwater Web site

The accession BK010471 is for a crAssphage that is ubiquitous in human gut metagenomes (link), and in particular is found in the 454 data set SRR073439.

When I do a containment search, I see:

% sourmash search --containment BK010471.fa.sig SRR073439.sig -k 31

selecting specified query k=31
loaded query: BK010471.fa... (k=31, DNA)
--
loaded 3 total signatures from 1 locations.
after selecting signatures compatible with search, 1 remain.

1 matches above threshold 0.080:
similarity   match
----------   -----
 59.0%       SRR073439

and the Venn diagram is pleasing:

[Figure: venn2]

However, the FASTA sequence does not have any matches when searched at https://branchwater.jgi.doe.gov/. Any ideas?

thanks!

SRR073439.k31.sig.zip
BK010471.k31.sig.zip
BK010471.fa.zip

bad signature - ERR1248659

@luizirber not sure where to post this, but don't want it to get lost -

/group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs/ERR1248659.sig is a bad sketch file.

What's the right way to handle this? Can you/we trigger rebuilding it?

operation timed out

$ ./mastiff sequence.fasta > matches.csv
[2023-07-15T04:19:00Z INFO mastiff_client] Preparing signature
[2023-07-15T04:19:00Z INFO mastiff_client] Sending request to https://mastiff.sourmash.bio
Error:
0: error sending request for url (https://mastiff.sourmash.bio/search): operation timed out
1: operation timed out

Location:
crates/client/src/main.rs:132

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets

Publish images for branchwater components

After #4 is merged, tag a new release and start publishing images, to make it easier for external projects to bring up or expand on branchwater.

Questions:

  • where to host? dockerhub, quay.io, ghcr.io?

running branchwater on large assemblies

Hello,
Thanks for developing such a great tool. I've been trying to run branchwater on some whole metagenome assemblies that are quite large (0.3G-1G). When I upload even the smaller ones and submit them I don't get any output. I've tried leaving a couple with the tab open for ~12 hours to no avail.
If I leave them for long enough will they eventually complete?
Thanks!
Jenny

How to cite mastiff?

Dear Luiz,

Thank you for creating this tool! Can you please tell me how to cite it?

add information about data privacy to branchwater web site

From an e-mail conversation (reply by luiz):

> I'm enjoying the branchwater metagenome query tool. However, are the submitted
> queries stored or used within your servers? I want to submit a couple of genomes
> that are not published yet, and I want to make sure they are only available
> after our manuscript is published.

They are stored only in memory while the search is happening, they are never
stored on disk [0].

I do check the server access logs from time to time to have an idea of how many
unique visitors we have, but this is only data that the HTTP server logs
regularly (time, IP), and doesn't include the HTTP request content (where the
query data actually lives).

Move metadata from mongodb into the index manifest?

Over at sourmash-bio/sourmash#3006 (comment) I mentioned adding extra columns to manifest to hold metadata not available in a signature. I think we can do the same approach to store the SRA metadata into the manifest, and remove the mongodb dependency, returning the metadata from the search index together with the containment.

More refs on the sourmash context: sourmash-bio/sourmash#2180

But... is it a good idea?

Over at #4 I'm trying to make it easy to bring up a new branchwater installation, and there is a bit of a dance for building the index, bringing up mongo, loading metadata, and then bringing up the server/frontend. Moving the metadata into the index building step makes things easier, but requires being able to update the manifest in the index in case we want different data (which is not that hard, it's a CSV). It could be more constraining for developing new frontend features, tho?
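
As a rough illustration of the "extra columns in the manifest" idea, the Python sketch below appends metadata columns to a manifest-like CSV. The function name, the join key, and the column names (`biosample`, `organism`) are hypothetical, and the input is a simplified plain CSV; any leading comment line a real manifest might carry is not handled here.

```python
import csv
import io


def add_metadata_columns(manifest_csv, metadata, key="name",
                         columns=("biosample", "organism")):
    """Return CSV text with `columns` appended to every manifest row.

    `metadata` maps the value in the `key` column to a dict of extra fields.
    Rows without metadata get empty strings, so the CSV stays rectangular.
    """
    reader = csv.DictReader(io.StringIO(manifest_csv))
    fieldnames = list(reader.fieldnames)
    fieldnames += [c for c in columns if c not in fieldnames]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        extra = metadata.get(row[key], {})
        for column in columns:
            row[column] = extra.get(column, "")
        writer.writerow(row)
    return out.getvalue()


manifest = "name,md5\nSRR001,abc\nSRR002,def\n"
metadata = {"SRR001": {"biosample": "SAMN1", "organism": "gut metagenome"}}
print(add_metadata_columns(manifest, metadata))
```

Updating the metadata later would then amount to rewriting the CSV with different columns, which is the trade-off discussed above.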

pinging @bluegenes and @SuzanneFleishman for ideas =]

visualize usage stats

thanks to @luizirber's nixos setup, we're already logging with caddy

log files on the mastiff server: wc -l /var/log/caddy/access-*

3868 /var/log/caddy/access-branchwater.sourmash.bio.log
23794 /var/log/caddy/access-mastiff.sourmash.bio.log
4511 /var/log/caddy/access-minke.sourmash.bio.log

But we should make this logging visible/accessible/usable.

suggestions from luiz:

we could alternatively set up a self-hosted plausible.io if we have issues using the caddy logs...
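
One possible starting point for making the caddy logs usable is a small Python script that summarizes requests per day. This sketch assumes the access logs are in Caddy's JSON format, one entry per line, with a `ts` Unix timestamp and a `request` object carrying `remote_ip` (or `remote_addr` in older versions); those field names should be checked against the actual log files. The function name is hypothetical.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def daily_usage(lines):
    """Map each UTC day to (request count, unique client IP count)."""
    requests = Counter()
    visitors = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
            timestamp = float(entry["ts"])
            request = entry["request"]
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that are not in the assumed format
        if not isinstance(request, dict):
            continue
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
        # Assumed field names; adjust to whatever the real logs contain.
        ip = request.get("remote_ip") or request.get("remote_addr", "")
        requests[day] += 1
        visitors.setdefault(day, set()).add(ip)
    return {day: (requests[day], len(visitors[day])) for day in sorted(requests)}


sample = [
    '{"ts": 1689379200.5, "request": {"remote_ip": "10.0.0.1", "uri": "/search"}, "status": 200}',
    '{"ts": 1689382800, "request": {"remote_ip": "10.0.0.1", "uri": "/"}, "status": 200}',
    '{"ts": 1689465600, "request": {"remote_ip": "10.0.0.2", "uri": "/search"}, "status": 200}',
]
for day, (hits, unique_ips) in daily_usage(sample).items():
    print(day, hits, unique_ips)
```

To run it on a real file, pass an open file handle instead of the sample list, for example `daily_usage(open(path))`.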

Build a k=31 SRA metagenomes index

I'll use this issue to document steps to build a k=31,scaled=1000 index for SRA metagenomes. This is the same process used for the current k=21,scaled=1000 index in branchwater.sourmash.bio, but considering the changes from #4, and bringing new SRA datasets added after the cutoff from the current index (2023-08-17).
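
For readers unfamiliar with the `k` and `scaled` parameters, the Python sketch below shows roughly what they control: `k` is the k-mer length, and `scaled=1000` keeps only hashes that fall in the lowest 1/1000 of the 64-bit hash space. This is an illustration only. It uses BLAKE2b from the standard library as a stand-in hash, whereas sourmash uses MurmurHash3, so the hash values will not match real sourmash sketches. All function names are hypothetical.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    return sequence.translate(_COMPLEMENT)[::-1]


def canonical_kmers(sequence, ksize):
    """Yield each k-mer in strand-independent form (smaller of the two strands)."""
    sequence = sequence.upper()
    for start in range(len(sequence) - ksize + 1):
        kmer = sequence[start:start + ksize]
        yield min(kmer, reverse_complement(kmer))


def hash_kmer(kmer):
    """Hash a k-mer to a 64-bit integer (stand-in hash, not sourmash's)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fracminhash(sequence, ksize=31, scaled=1000):
    """Keep only hashes below MAX_HASH / scaled, about 1/scaled of all k-mers."""
    threshold = MAX_HASH // scaled
    return {
        h for h in map(hash_kmer, canonical_kmers(sequence, ksize)) if h < threshold
    }


sequence = "ACGTTGCAAGGCTTAACCGGTTAGC"
print(len(fracminhash(sequence, ksize=5, scaled=1)))
print(len(fracminhash(sequence, ksize=5, scaled=10)))
```

With `scaled=1` every distinct k-mer hash is kept; larger `scaled` values shrink the sketch at the cost of resolution for small queries.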

Set up integration CI

Bring up branchwater with {podman,docker}-compose and run a query end-to-end.

Needs #4 to be merged first
