noorts / dlsa

🧬 Distributing Local Sequence Alignment using Volunteer Computing

License: Apache License 2.0

Go 13.68% Python 35.25% Shell 17.73% Makefile 0.39% Rust 31.44% C 1.50%

bioinformatics distributed-systems sequence-alignment smith-waterman-algorithm crowdsourcing

dlsa's Issues

Master

Add parameters #12
Add logging (able to view what is happening)
Add new scheduler
- Simple work split (for one-to-many and for many-to-many jobs)
- Fancy heuristics once we have the benchmarking numbers sent by the worker
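
A minimal sketch of what the "simple work split" could look like, assuming a job is fully described by lists of query and target ids and that every query is aligned against every target. The names `split_work` and `Pair` are hypothetical and not taken from the repository; the same function covers one-to-many (a single query) and many-to-many jobs.

```python
from itertools import product
from typing import List, Tuple

Pair = Tuple[str, str]  # (query_id, target_id); hypothetical naming


def split_work(query_ids: List[str], target_ids: List[str],
               n_workers: int) -> List[List[Pair]]:
    """Deal every (query, target) pair out to the workers round-robin."""
    if n_workers < 1:
        raise ValueError("need at least one worker")
    pairs = list(product(query_ids, target_ids))
    # Worker i gets pairs i, i + n, i + 2n, ... so shares differ by at most one.
    return [pairs[i::n_workers] for i in range(n_workers)]


# One-to-many: a single query against four targets, two workers.
one_to_many = split_work(["q0"], ["t0", "t1", "t2", "t3"], 2)
# Many-to-many: two queries against three targets, four workers.
many_to_many = split_work(["q0", "q1"], ["t0", "t1", "t2"], 4)
```

This split ignores sequence lengths, so it only balances the number of pairs per worker, not the actual compute; that is where the benchmarking numbers would come in.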

One or multiple queries on a worker

Are we assuming the worker can receive multiple queries, or only one query and multiple targets?

Worker - Tracker interface

The communication between the worker nodes and the tracker is the most critical, I think. Here is a thread to discuss the development and design decisions for the interface.

This component serves a REST API and is what clients (e.g., a user on a laptop) interact with from the outside. It allows a client to 1) submit a sequence alignment job request and 2) poll for the results.

The provided job request will be parsed and converted into our internal format.

Worker - Metric computation and Benchmarking

Metric computation
- Time
- (Opt) CUPS
Benchmarking (for performance insight and experiments)

Metric granularity can be down to the following 4 metrics inside the worker node:

So computation consists of: | BUILD MATRIX | BACKTRACE |
                            | 1GCUPS,2TIME |   3TIME   |
                            |       4COMBINED TIME     |
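
The four metrics above could be collected roughly like this. This is only a sketch: the `Metrics` class and the `timed` helper are hypothetical names, and GCUPS is taken in its usual sense of billions of matrix cell updates per second during the matrix build.

```python
import time
from dataclasses import dataclass


@dataclass
class Metrics:
    build_seconds: float      # metric 2: time spent building the matrix
    backtrace_seconds: float  # metric 3: time spent in the backtrace
    cells: int                # query length * target length

    @property
    def gcups(self) -> float:
        # Metric 1: giga cell updates per second over the build phase only.
        if self.build_seconds <= 0:
            return 0.0
        return self.cells / self.build_seconds / 1e9

    @property
    def combined_seconds(self) -> float:
        # Metric 4: build and backtrace together.
        return self.build_seconds + self.backtrace_seconds


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

A worker would wrap the build and the backtrace in `timed` separately and send the resulting `Metrics` along with the alignment result.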

Benchmarking has two purposes:

- Performance benchmarking for the experiments and for testing in general (lets us see which code improvements deliver practical results)
- Compute capacity estimation for the “intelligent” scheduler

Roadmap

Sequences to compute

class WorkPackage(BaseModel):
    # work package id
    id: str
    targets: Dict[SequenceId, Sequence]
    queries: Dict[SequenceId, Sequence]

    sequences: List[TargetQueryCombination]

Is the point of the sequences field only to specify which of the sequences sent to the worker it should compute?

Tracking Issue: Interfaces

We have quite a few entities that have to communicate. Let this function as a tracking issue for general discussion of interfaces.

Overview of interface issues:

Configuration option passing

Adjust the client (TUI), master, and worker node such that the following configuration options are passed from the client all the way to the worker node.

Configuration options:

match score
mismatch score
gap (usually split into extension penalty δ and gap open penalty Δ, but fine to keep them the same in this PR)

Defaults for these could be as stated in the competition document: Match = +2, Mismatch = -1, Gap=1. It might make sense to define these defaults only in the client (the relevant functions in the master and worker will parameterize these options). We might want to group them into a configuration object (for maintainability’s sake).

Requirements

The client (TUI)
- allows the configuration options to be passed as arguments
- falls back to defaults in the case that some options are not specified
The master
- passes the configuration options to the worker, taking into account any splitting of a job into work packages
The worker node
- parses and uses the configuration options inside the Smith-Waterman algorithm
- its tests have been updated to use a default set of configuration options
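
As a sketch of the "configuration object" idea, the options could be grouped like this and threaded through to the scoring code. `AlignmentConfig` and `sw_score` are hypothetical names, and the sketch assumes the gap value is a positive penalty that is subtracted once per gap position (a single linear gap, as allowed for this PR). It computes only the best local score, with no backtrace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentConfig:
    # Defaults from the issue text: Match = +2, Mismatch = -1, Gap = 1.
    match_score: int = 2
    mismatch_score: int = -1
    gap_penalty: int = 1  # assumption: positive value, subtracted per gap


def sw_score(query: str, target: str,
             cfg: AlignmentConfig = AlignmentConfig()) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(target) + 1)  # previous matrix row, zero column included
    best = 0
    for q in query:
        curr = [0]  # leftmost zero column
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (cfg.match_score if q == t else cfg.mismatch_score)
            up = prev[j] - cfg.gap_penalty
            left = curr[j - 1] - cfg.gap_penalty
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

With the defaults, `sw_score("ACGT", "ACGT")` scores four matches; passing `AlignmentConfig(match_score=1)` changes the result without touching the algorithm, which is the point of parameterizing the master and worker functions.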

Write SIMD algorithm

To Do:

Don't overcompute diagonal parts of the Matrix
Use SIMD in the diagonal parts

About using SIMD in the diagonals: SIMD can be used as soon as the matrix is LANES + 1 wide (the extra one accounts for the leftmost zero column). This should speed up all cases that have a fairly high query length compared to target length.
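
To illustrate why diagonals suit SIMD, here is a plain Python sketch, assuming "diagonals" refers to anti-diagonals of the score matrix. Every cell on an anti-diagonal depends only on the two previous anti-diagonals, so the cells can be processed in independent groups of `LANES`. The function name, the `LANES` value, and the scoring defaults are assumptions for illustration; this is not the Rust implementation and uses no real vector instructions.

```python
LANES = 4  # hypothetical vector width


def sw_score_antidiagonal(query: str, target: str,
                          match: int = 2, mismatch: int = -1, gap: int = 1) -> int:
    """Best local alignment score, filling the matrix one anti-diagonal at a time."""
    n, m = len(query), len(target)
    # Row 0 and column 0 stay zero (the leftmost zero column from the issue).
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):  # cells (i, j) with i + j == d
        rows = range(max(1, d - m), min(n, d - 1) + 1)
        for start in range(0, len(rows), LANES):
            # Each chunk holds up to LANES mutually independent cells:
            # this inner loop is what a single SIMD operation would replace.
            for i in rows[start:start + LANES]:
                j = d - i
                s = match if query[i - 1] == target[j - 1] else mismatch
                H[i][j] = max(0, H[i - 1][j - 1] + s,
                              H[i - 1][j] - gap, H[i][j - 1] - gap)
                best = max(best, H[i][j])
    return best
```

The short diagonals in the matrix corners never fill a whole chunk, which is where a scalar fallback (and the "don't overcompute" item) would apply.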

Interface - master

This is just a rough draft (hopefully enough for the design to show Tiziano). Feel free to comment; obviously more specification is needed:

The master node is responsible for receiving jobs from the job scheduler and assigning them to workers. The master keeps a list of workers, recording whether each is available, which jobs it holds, and how long those jobs have taken. The master node also keeps track of the health of the workers and provides recovery in case of failures. Once the master has received confirmation that there are no more jobs to be submitted and all jobs have been processed, the node sends the results to an AWS bucket.

Jobs: Queue like data structure
Results: Data structure to store the results
Workers: List of workers available

Methods:

ReceiveTask()
Receives a task from the scheduler and adds it to the task queue

RegisterWorker(workerId: String or int, metadata: Custom class)
Allows a worker to register with the master under the given id

DelegateTask(workerId: String)
Delegates a task to the worker with the given workerId

ReportStatus(workerId, status: Custom class)
Receives and processes status updates from a worker node (completion, error, etc.)

SubmitResult()
Submits a result to the bucket or database
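
To make the draft above easier to discuss, here is an in-memory sketch of the state and the first four methods. It is only an illustration under stated assumptions: method names are written in snake_case, the "custom classes" are reduced to a plain `metadata` value and an `ok` flag, failed tasks are simply requeued, and `SubmitResult` is left out because the storage backend is not specified yet.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, Optional


@dataclass
class WorkerInfo:
    metadata: Any = None
    current_task: Optional[str] = None  # None means the worker is available


@dataclass
class Master:
    jobs: Deque[str] = field(default_factory=deque)                 # Jobs queue
    results: Dict[str, Any] = field(default_factory=dict)           # Results store
    workers: Dict[str, WorkerInfo] = field(default_factory=dict)    # Workers list

    def receive_task(self, task_id: str) -> None:
        """ReceiveTask: add a task from the scheduler to the queue."""
        self.jobs.append(task_id)

    def register_worker(self, worker_id: str, metadata: Any = None) -> None:
        """RegisterWorker: record a worker under its id."""
        self.workers[worker_id] = WorkerInfo(metadata=metadata)

    def delegate_task(self, worker_id: str) -> Optional[str]:
        """DelegateTask: hand the next queued task to an idle worker."""
        worker = self.workers[worker_id]
        if worker.current_task is not None or not self.jobs:
            return None
        worker.current_task = self.jobs.popleft()
        return worker.current_task

    def report_status(self, worker_id: str, ok: bool, result: Any = None) -> None:
        """ReportStatus: store the result on success, requeue the task on error."""
        worker = self.workers[worker_id]
        task_id = worker.current_task
        if task_id is None:
            return
        worker.current_task = None
        if ok:
            self.results[task_id] = result
        else:
            self.jobs.append(task_id)  # assumption: simple retry by requeueing
```

Health tracking and timing per job would hang off `WorkerInfo`; they are omitted here to keep the sketch short.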

Tracking: Rust

A tracking issue about all problems related to the Rust implementation

To Do

Convert unit tests to Rust
Set up benchmarking using Criterion
Create complete FFI bindings for Go. #27
Write a low-memory variant of the algorithm
Use Rust version in benchmarker @haraldurbjarni
Make Vec allocation fallible
Catch all panics at FFI @haraldurbjarni

Project structure

I created a proposed project structure branch, which also includes a gRPC config. Feel free to just leave a comment or change stuff around.
