
Comments (18)

ekg commented on August 11, 2024

I tend to see this as a separate layer. Perhaps I am misunderstanding the
scope of the schemas discussion?

from ga4gh-schemas.

pgrosu commented on August 11, 2024

It's not hard, but you need to create parallelized versions of several algorithms. For instance, one of the simplest is the suffix tree ( http://en.wikipedia.org/wiki/Suffix_tree ), which has a build and search complexity that is usually linear, but there are better approaches with extra optimizations. If you take the seed strings I posted in #142 (for inverted indices), you can then build a tree for each seed string. Then the search in parallel will be very fast per genome or any read variation you want to search. Again, you need to build more layers to make it fast for variants (via graphs, etc.) and reads, but without making this post too long, yes it is possible :)
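
To make the indexed-search idea concrete, here is a minimal sketch that swaps in a suffix array (a simpler relative of the suffix tree mentioned above) and answers exact substring queries by binary search. The function names are hypothetical, and the naive construction is only suitable for toy inputs, not genome-scale data.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction (sorts whole suffixes); fine for a toy example only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions where `pattern` occurs in `text`."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix whose length-m prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are contiguous in the suffix array.
    hits = []
    while lo < len(suffix_array):
        start = suffix_array[lo]
        if text[start:start + m] != pattern:
            break
        hits.append(start)
        lo += 1
    return sorted(hits)


text = "GATTACAGATTACA"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "TTA"))
```

Because each index is independent of the others, one index per seed string or per genome could in principle be searched concurrently, which is the kind of parallelism the comment has in mind.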


lh3 commented on August 11, 2024

I agree with @ekg. We should delay this feature until a backend wants to implement it.


calbach commented on August 11, 2024

I don't think it's entirely out of scope for the API, but this feature would significantly complicate a Reads API implementation. I think the API should be very conservative about its requirements for per-read indexing (ideally just a primary positional index) as it will be very expensive and/or complicated to maintain secondary indices over a repository which would likely host at least trillions of reads (1000 30x whole genomes, for example).

My thoughts are that it would be better to build an open k-mer indexing implementation on top of the API, and see where the pain points are. If this winds up being the most frequently used tool backed by the GA4GH Reads API, then perhaps it would be worth considering merging it into the API.
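
As a rough illustration of what a k-mer index layered on top of the API could look like, the sketch below builds an in-memory inverted index from (read id, sequence) pairs. The function names and the input shape are assumptions made for this example; in practice the pairs would be paged from whatever reads search call the server exposes.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to the set of read ids whose sequence contains it.

    `reads` is any iterable of (read_id, sequence) pairs (hypothetical shape).
    """
    index = defaultdict(set)
    for read_id, seq in reads:
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def candidate_reads(index, query, k):
    """Return read ids that contain every k-mer of `query`.

    For len(query) == k this is exact; for longer queries it is a superset
    that a caller would still need to verify against the sequences.
    """
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    candidates = set(index.get(kmers[0], set()))
    for kmer in kmers[1:]:
        candidates &= index.get(kmer, set())
    return candidates


reads = [("r1", "ACGTAC"), ("r2", "TTACGA"), ("r3", "GGGGGG")]
index = build_kmer_index(reads, k=3)
print(sorted(candidate_reads(index, "ACG", k=3)))
```

Even this toy version shows where the cost comes from: the index grows with the number of distinct k-mers, and a separate index is needed for each value of k.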


pgrosu commented on August 11, 2024

@calbach, trillions would be way too easy! For instance, I can order an Nvidia Tesla K40 GPU from newegg.com and, if I remember my CUDA correctly, I multiply the maximum grid dimensions to get the number of blocks and then multiply by the maximum threads per block (1024), which would exceed a trillion with a lot of room to spare! Based on the following device capabilities:

http://www.microway.com/hpc-tech-tips/nvidia-tesla-k40-atlas-gpu-accelerator-kepler-gk110b-up-close/

The total thread count would be:

2147483647 * 65535 * 65535 * 1024 ~ 9.44444E+21 >> 10^12 (trillion)
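
The product above can be checked with a few lines of Python. The limits are taken as quoted in this post (maximum grid dimensions and threads per block); the variable names are only for illustration.

```python
# Maximum grid dimensions (x, y, z) and threads per block, as quoted above.
grid_x, grid_y, grid_z = 2147483647, 65535, 65535
threads_per_block = 1024

total_threads = grid_x * grid_y * grid_z * threads_per_block
trillion = 10**12

print(f"{total_threads:.5e}")      # scientific notation, roughly 9.44e21
print(total_threads // trillion)   # how many trillions of threads that is
```

Note that this counts addressable threads in a single launch, not threads that execute simultaneously.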

This would be using a brute-force approach. Now if I add optimized parallel algorithms and data-structures on top of that, I can go quite far.

Paul


cassiedoll commented on August 11, 2024

I'm with @ekg, @lh3 and @calbach.


fnothaft commented on August 11, 2024

I don't have a strong preference; it's not hard to implement, but it can be expensive.


cassiedoll commented on August 11, 2024

Yeah, because it's expensive, we might as well leave it out until we have a strong need. It'll be difficult enough to just get everybody out there talking v0.5 :)


fnothaft commented on August 11, 2024

Yeah, I'm also not 100% convinced of the value of the feature without some fuzziness (i.e. inexact matching).


pgrosu commented on August 11, 2024

If people are worried about cost, Amazon (AWS) offers GPU instances (as g2.2xlarge - with Nvidia GK104 "Kepler" having 1,536 CUDA cores) for $0.650/hour, based on the following:

http://aws.amazon.com/ec2/pricing/

http://aws.amazon.com/blogs/aws/build-3d-streaming-applications-with-ec2s-new-g2-instance-type/

If not convinced about applications, there are several that I listed in the link below, but I'm willing to wait too :)

http://en.wikipedia.org/wiki/Alignment-free_sequence_analysis


fnothaft commented on August 11, 2024

That doesn't make it cheap. If you have a static index, you incur the cost of computing + maintaining the indexes. You'll probably need to store multiple indexes, as people may want to ask different length queries, etc. If you choose not to store static indices, you need to dynamically compute them in response to arbitrary queries. Additionally, while there are plenty of alignment-free analyses, I'm not sure how many alignment-free analyses are really a great fit for running against a REST API...

Also, note that the k-mer spectrum of sequenced individuals contains many more k-mers than the k-mer spectrum of a reference genome due to errors in reads, which introduce many low frequency k-mers.
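
A toy example of that effect, using made-up sequences: a single substitution error in one read adds several k-mers that never occur in the reference, each seen only once. The sequences and the choice of k = 4 are assumptions picked to keep the example small.

```python
from collections import Counter


def kmer_spectrum(sequences, k):
    """Count every k-mer across an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


k = 4
reference = "ACGTACGTTAGC"
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "TACGTTAGC",
    "TACGTTAGC",
    "ACGAACGT",  # same as the first read but with one substitution (T -> A)
]

ref_spec = kmer_spectrum([reference], k)
read_spec = kmer_spectrum(reads, k)

# k-mers seen in the reads but absent from the reference.
novel = {kmer: n for kmer, n in read_spec.items() if kmer not in ref_spec}

print(len(ref_spec), len(read_spec))
print(sorted(novel.items()))
```

In general one substitution can create up to k new k-mers, so the number of distinct k-mers an index must hold tends to grow with read count and error rate rather than with genome size alone.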


pgrosu commented on August 11, 2024

I agree - we have plenty to play with in what we already have now, as @cassiedoll mentioned :) Though with prioritized caches it should be possible, especially if we restructure the reads as trees, graphs, or something else. Let's keep a running list of requests, and if something keeps being requested we can revisit it.


richarddurbin commented on August 11, 2024

I strongly support the position that a search by substring facility is a different thing from the current API, although I agree it is seriously interesting.

There's quite a lot of work on making efficient indices on very large read sets for this purpose, based on suffix-array-based data structures.
Heng, the Oxford group with Zam and Gil, and my group have all been involved, as has Tony Cox's group at Illumina. We have built an
in-memory searchable index over all the (error-corrected) phase 3 1000 Genomes reads (70TB BAM data) that sits inside two 256GB RAM
servers, though it requires some larger on-disk metadata resources to give back information on the hits.
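
For readers unfamiliar with this family of indices, the following is a minimal sketch of counting exact matches with a Burrows-Wheeler transform and backward search. It is a textbook toy, not the index described above: the function names are hypothetical, the construction is naive, and the occurrence tables are stored uncompressed.

```python
def build_bwt_index(text):
    """Build a toy BWT index; assumes '$' does not occur in `text`."""
    text += "$"  # sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$').
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text that sort strictly before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact occurrences of `pattern` using backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of matching rows in the sorted suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_bwt_index("GATTACAGATTACA")
print(count_occurrences(index, "TTA"))
```

Counting needs only the BWT-derived tables; reporting where each hit came from requires extra sampled position data, which mirrors the point about needing larger metadata resources to describe the hits.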

Richard

On 16 Sep 2014, at 16:51, CH Albach [email protected] wrote:

I don't think it's entirely out of scope for the API, but this feature would significantly complicate a Reads API implementation. I think the API should be very conservative about its requirements for per-read indexing (ideally just a primary positional index) as it will be very expensive and/or complicated to maintain secondary indices over a repository which would likely host at least trillions of reads (1000 30x whole genomes, for example).

My thoughts are that it would be better to build an open k-mer indexing implementation on top of the API, and see where the pain points are. If this winds up being the most frequently used tool backed by the GA4GH Reads API, then perhaps it would be worth considering merging it into the API.



jeromekelleher commented on August 11, 2024

Thanks for all the feedback everyone, lots of great points that I agree with. I certainly don't want to have to implement such a feature for the reference server!

I'm still a little bit confused though: is the consensus that:

(a) This is something that we'd like to see in the protocol ultimately, but is unhelpfully complicated to tackle at this early stage; or
(b) This is something that is conceptually distinct from the goals of the API and should be considered a downstream consumer of the protocol.


adamnovak commented on August 11, 2024

It makes sense to me that we would want to design a search by substring facility (and I would favor BWT-based approaches for implementation because I know them), but it also makes sense that asking everyone to provide it in order to implement the reads API would be kind of problematic.

Would it be possible to structure it as a server-side filter on search results (i.e. not guaranteeing that it be any more efficient than looping over all the reads the search would otherwise return)?
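
A minimal sketch of that filter idea, assuming each read is a dictionary with an `alignedSequence` field (the field name and record shape are assumptions for this example, not a statement about the schema):

```python
def filter_reads_by_substring(reads, substring):
    """Lazily yield the reads whose sequence contains `substring`.

    This is a plain linear scan over whatever the underlying search returns,
    so it makes no promise of being faster than the client doing the same.
    """
    for read in reads:
        if substring in read["alignedSequence"]:
            yield read


search_results = [
    {"id": "r1", "alignedSequence": "ACGTACGT"},
    {"id": "r2", "alignedSequence": "TTTTGGGG"},
    {"id": "r3", "alignedSequence": "GGACGTTT"},
]
matching = [r["id"] for r in filter_reads_by_substring(search_results, "ACGT")]
print(matching)
```

The benefit of doing this on the server would be reduced data transfer rather than reduced compute, since non-matching reads never leave the server.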

Are we going to have some sort of system for capabilities or optional methods?


lh3 commented on August 11, 2024

I don't know whether it is (a) or (b) ultimately, but the consensus largely is: we don't want such a feature now. We may review this topic later when we have a running instance of the current APIs and more experience with such queries.


pgrosu commented on August 11, 2024

@lh3, but isn't that what @jeromekelleher is doing with the GA4GH Server implementation at the following github location?

https://github.com/ga4gh/server


jeromekelleher commented on August 11, 2024

Thanks for the input @lh3, I'll close the issue in a few days once everyone has had a chance to weigh in.
