It would be very useful (e.g in kmer based analyses) if the

Feature request: /reads/search to support searching by substring about ga4gh-schemas HOT 18 CLOSED

ga4gh commented on August 11, 2024

Feature request: /reads/search to support searching by substring

from ga4gh-schemas.

Comments (18)

ekg commented on August 11, 2024

I tend to see this as a separate layer. Perhaps I am misunderstanding the
scope of the schemas discussion?

pgrosu commented on August 11, 2024

It's not hard, but you need to create parallelized versions of several algorithms. For instance, one of the simplest is the suffix tree ( http://en.wikipedia.org/wiki/Suffix_tree ), whose build and search complexities are usually linear, though there are better approaches with extra optimizations. If you take the seed strings I posted in #142 (for inverted indices), you can then build a tree for each seed string. The parallel search will then be very fast per genome or for any read variation you want to search. Again, you need to build more layers to make it fast for variants (via graphs, etc.) and reads, but without making this post too long: yes, it is possible :)
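
To make the idea of suffix-based substring search concrete, here is a minimal single-threaded Python sketch. It swaps in a suffix array for the suffix tree mentioned above, because it is shorter to write, and it uses a naive construction that would not scale to a genome. The function names `build_suffix_array` and `find_occurrences` are made up for this illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, in sorted order.

    Naive construction (sorts full suffixes); fine for a demo, not for a genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of `pattern` in `text`."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with the pattern.
    return sorted(sa[start:lo])


text = "GATTACAGATTACA"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "TTA"))
```

Each query costs two binary searches over the suffix array, so the per-query work grows with the pattern length and the logarithm of the text length rather than with the text length itself.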

lh3 commented on August 11, 2024

I agree with @ekg. We should delay this feature until a backend wants to implement it.

calbach commented on August 11, 2024

I don't think it's entirely out of scope for the API, but this feature would significantly complicate a Reads API implementation. I think the API should be very conservative about its requirements for per-read indexing (ideally just a primary positional index) as it will be very expensive and/or complicated to maintain secondary indices over a repository which would likely host at least trillions of reads (1000 30x whole genomes, for example).

My thoughts are that it would be better to build an open k-mer indexing implementation on top of the API, and see where the pain points are. If this winds up being the most frequently used tool backed by the GA4GH Reads API, then perhaps it would be worth considering merging it into the API.
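
As a rough sketch of what a k-mer index layered on top of the API could look like, the Python below builds an inverted index from k-mers to read ids and uses it to answer substring queries. It assumes the reads have already been fetched from a Reads API and are available as `(read_id, sequence)` pairs; no real API call is made, and the names `build_kmer_index` and `search_substring` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to the set of read ids whose sequence contains it."""
    index = defaultdict(set)
    for read_id, seq in reads:
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def search_substring(reads_by_id, index, k, query):
    """Return the sorted ids of reads that contain `query` as a substring."""
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    # A read containing the query must contain every k-mer of the query,
    # so intersecting the posting lists gives a candidate set.
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    candidates = set.intersection(*(index.get(km, set()) for km in kmers))
    # The k-mers may occur in a candidate without being contiguous,
    # so each candidate is verified against the full query.
    return sorted(rid for rid in candidates if query in reads_by_id[rid])


# Stand-in for reads returned by a reads search (assumed data).
reads = [("r1", "ACGTACGT"), ("r2", "TTACGTTT"), ("r3", "GGGGCCCC")]
reads_by_id = dict(reads)
index = build_kmer_index(reads, k=4)
print(search_substring(reads_by_id, index, 4, "TACG"))
```

Even this toy version shows where the cost goes: the index stores one posting per k-mer per read, which is the secondary-index burden described above.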

pgrosu commented on August 11, 2024

@calbach, trillions would be way too easy! For instance, I can order an Nvidia Tesla K40 GPU from newegg.com and, if I remember my CUDA correctly, I multiply the maximum grid dimensions together to get the number of blocks, and then multiply by the maximum threads per block (1024), which would exceed a trillion with a lot of room to spare! Based on the following device capabilities:

http://www.microway.com/hpc-tech-tips/nvidia-tesla-k40-atlas-gpu-accelerator-kepler-gk110b-up-close/

The total thread count would be:

2147483647 * 65535 * 65535 * 1024 ~ 9.44444E+21 >> 10^12 (trillion)

This would be using a brute-force approach. Now if I add optimized parallel algorithms and data-structures on top of that, I can go quite far.
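
The arithmetic above can be checked with a few lines of Python. The grid limits and threads-per-block value are the ones quoted in this comment; the function name `max_addressable_threads` is made up for the sketch. Note that the result is the number of thread indices a single kernel launch can address, which is not the same as the number of threads running concurrently.

```python
# Limits quoted in the comment above for the Tesla K40.
MAX_GRID_DIM = (2_147_483_647, 65_535, 65_535)  # max blocks per grid in x, y, z
MAX_THREADS_PER_BLOCK = 1024


def max_addressable_threads(grid=MAX_GRID_DIM,
                            threads_per_block=MAX_THREADS_PER_BLOCK):
    """Multiply the grid dimensions to get blocks, then by threads per block."""
    blocks = grid[0] * grid[1] * grid[2]
    return blocks * threads_per_block


total = max_addressable_threads()
print(f"{total:.5e}")   # scientific notation, for comparison with the figure above
print(total > 10**12)   # compare against one trillion
```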

Paul

cassiedoll commented on August 11, 2024

I'm with @ekg, @lh3 and @calbach.

fnothaft commented on August 11, 2024

I don't have a strong preference; it's not hard to implement, but it can be expensive.

cassiedoll commented on August 11, 2024

Yeah, because it's expensive, we might as well leave it out until we have a strong need. It'll be difficult enough just to get everybody out there talking v0.5 :)

fnothaft commented on August 11, 2024

Yeah, I'm also not 100% convinced of the value of the feature without some fuzziness (i.e. inexact matching).

pgrosu commented on August 11, 2024

If people are worried about cost, Amazon (AWS) offers GPU instances (as g2.2xlarge - with Nvidia GK104 "Kepler" having 1,536 CUDA cores) for $0.650/hour, based on the following:

http://aws.amazon.com/ec2/pricing/

http://aws.amazon.com/blogs/aws/build-3d-streaming-applications-with-ec2s-new-g2-instance-type/

If not convinced about applications, there are several that I listed in the link below, but I'm willing to wait too :)

http://en.wikipedia.org/wiki/Alignment-free_sequence_analysis

fnothaft commented on August 11, 2024

That doesn't make it cheap. If you have a static index, you incur the cost of computing and maintaining the indices. You'll probably need to store multiple indices, as people may want to ask queries of different lengths, etc. If you choose not to store static indices, you need to compute them dynamically in response to arbitrary queries. Additionally, while there are plenty of alignment-free analyses, I'm not sure how many of them are really a great fit for running against a REST API...

Also, note that the k-mer spectrum of sequenced individuals contains many more k-mers than the k-mer spectrum of a reference genome due to errors in reads, which introduce many low frequency k-mers.
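
A small Python sketch of that last point: a single substitution error in a read can introduce up to k k-mers that do not occur in the reference. The sequences below are invented for the illustration, and `novel_kmers` is a hypothetical helper name.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers(reference, read, k):
    """Return the k-mers that occur in `read` but not in `reference`."""
    return kmers(read, k) - kmers(reference, k)


reference = "ACGTACCTGA"
clean_read = "ACGTACCTGA"   # error-free copy of the reference
noisy_read = "ACGTAGCTGA"   # same read with one substitution (C -> G at offset 5)

print(len(novel_kmers(reference, clean_read, 4)))   # no new k-mers
print(sorted(novel_kmers(reference, noisy_read, 4)))  # k-mers spanning the error
```

Every k-mer window that overlaps the erroneous base changes, so one error in the interior of a read adds up to k entries to the spectrum, most of them seen only once.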

pgrosu commented on August 11, 2024

I agree - we have plenty to play with in what we already have, as @cassiedoll mentioned :) With prioritized caches it should still be possible, though, especially if we restructure the reads as trees, graphs, or something else. Let's keep a running list of requests, and if something keeps being requested often we can revisit it.

richarddurbin commented on August 11, 2024

I strongly support the position that a search by substring facility is a different thing from the current API, although I agree it is seriously interesting.

There's quite a lot of work on making efficient indices on very large read sets for this purpose, based on suffix-array-based data structures. Heng, the Oxford group with Zam and Gil, and my group have all been involved, as has Tony Cox's group at Illumina. We have built an in-memory searchable index over all the (error-corrected) phase 3 1000 Genomes reads (70TB BAM data) that sits inside two 256GB RAM servers, though it requires some larger on-disk metadata resources to give back information on the hits.
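
For readers unfamiliar with this family of data structures, here is a toy Python sketch of one well-known member, the FM-index (a Burrows-Wheeler transform plus count tables, searched backwards). It is only an illustration of the general technique and is not the index described above: it stores the full suffix array and uncompressed tables, which a real implementation would avoid. The names `build_fm_index` and `locate` are made up for the sketch.

```python
def build_fm_index(text):
    """Build a toy FM-index for `text` (assumed not to contain '$')."""
    text += "$"  # sentinel; sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (text[-1] is the sentinel).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that are smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def locate(index, pattern):
    """Return the sorted start positions of `pattern` via backward search."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(locate(index, "TTA"))
```

The search loop touches the tables once per pattern character, so query time does not depend on the size of the indexed text; the engineering effort in real systems goes into compressing the tables and sampling the suffix array.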

Richard

jeromekelleher commented on August 11, 2024

Thanks for all the feedback everyone, lots of great points that I agree with. I certainly don't want to have to implement such a feature for the reference server!

I'm still a little bit confused though: is the consensus that:

(a) This is something that we'd like to see in the protocol ultimately, but is unhelpfully complicated to tackle at this early stage; or
(b) This is something that is conceptually distinct from the goals of the API and should be considered a downstream consumer of the protocol.

adamnovak commented on August 11, 2024

It makes sense to me that we would want to design a search by substring facility (and I would favor BWT-based approaches for implementation because I know them), but it also makes sense that asking everyone to provide it in order to implement the reads API would be kind of problematic.

Would it be possible to structure it as a server-side filter on search results (i.e. not guaranteeing that it be any more efficient than looping over all the reads the search would otherwise return)?
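
A minimal Python sketch of such a filter, assuming the server already produces an iterable of reads for the positional part of the search. The `Read` class and the `filter_by_substring` function are hypothetical and do not reflect the GA4GH schema; the filter makes no efficiency promise beyond a linear scan over the reads it is given.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Read:
    """Hypothetical minimal read record; not the GA4GH schema."""
    id: str
    sequence: str


def filter_by_substring(reads: Iterable[Read],
                        substring: Optional[str]) -> Iterator[Read]:
    """Yield the reads whose sequence contains `substring`.

    If `substring` is None, every read is passed through unchanged.
    """
    for read in reads:
        if substring is None or substring in read.sequence:
            yield read


# Stand-in for the result of an ordinary reads search (assumed data).
search_results = [Read("r1", "ACGTACGT"), Read("r2", "TTTTGGGG"), Read("r3", "CCACGTCC")]
print([r.id for r in filter_by_substring(search_results, "ACGT")])
```

Because the filter is a generator applied to existing results, a server could offer it without maintaining any secondary index.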

Are we going to have some sort of system for capabilities or optional methods?

lh3 commented on August 11, 2024

I don't know whether it is (a) or (b) ultimately, but the consensus largely is: we don't want such a feature now. We may review this topic later when we have a running instance of the current APIs and more experience with such queries.

pgrosu commented on August 11, 2024

@lh3, but isn't that what @jeromekelleher is doing with the GA4GH Server implementation at the following GitHub location?

https://github.com/ga4gh/server

jeromekelleher commented on August 11, 2024

Thanks for the input @lh3, I'll close the issue in a few days once everyone has had a chance to weigh in.
