genobase's Introduction

Genobase - A DNA file parser service (WIP)

These instructions assume you have, and know how to use, Python 3.6 and Docker.

Technologies
  1. Docker containers
  2. Python 3.6
  3. Django
  4. Postgres
  5. S3 buckets/objects
  6. RabbitMQ
  7. Celery workers
General architecture (and process)
  1. Client POSTs a GeneParser object to the API Server.
    1. API Server contacts S3 and retrieves a presigned URL for upload.
    2. API Server stores it in the GeneParser object and returns the object to the client.
  2. Client PUTs the file to the S3 presigned URL.
  3. Client POSTs to the API Server to report that the file was uploaded.
  4. API Server creates an AsyncJob and puts a task on RabbitMQ.
  5. Worker pulls job from RabbitMQ and begins processing.
    1. Worker downloads file in chunks from S3 and processes each chunk separately.
    2. Once the Worker finds a DNA sequence it stores it in Postgres.
    3. When the whole file has been processed, the Worker builds a JSON document from the stored DNA sequences and uploads it to S3.
  6. Client polls AsyncJob progress until completed.
  7. Client GETs the GeneParser object.
    1. API Server contacts S3 and retrieves a presigned URL for download.
  8. Client downloads file using presigned URL.
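Step 5.1 above (processing the file chunk by chunk) raises one practical question: a chunk boundary can fall in the middle of a record. The sketch below shows one common way to handle that, by holding back the incomplete tail of each chunk and prepending it to the next one. It is only an illustration, not the code in ./gene_parser/parsers.py: the function name iter_records and the newline-delimited record format are assumptions made for this example.

```python
from typing import Iterable, Iterator


def iter_records(chunks: Iterable[bytes], delimiter: bytes = b"\n") -> Iterator[bytes]:
    """Yield complete delimiter-separated records from arbitrarily split chunks.

    Hypothetical helper: the real file format and parser may differ.
    """
    carry = b""
    for chunk in chunks:
        buffer = carry + chunk
        # Everything before the last delimiter is complete. The final piece
        # may be cut off mid-record, so hold it back for the next chunk.
        *complete, carry = buffer.split(delimiter)
        for record in complete:
            if record:  # skip empty lines
                yield record
    if carry:
        # The input ended without a trailing delimiter.
        yield carry


# Three chunks that split two records across chunk boundaries.
example_chunks = [b"ACGT\nTT", b"GA\nCC", b"AT"]
print(list(iter_records(example_chunks)))
```

Because the function is a generator, a worker could store each record as soon as it is yielded instead of holding the whole file in memory.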
How to run the project
  1. Open a terminal and navigate to the project folder.
  2. docker network create genobase_default
  3. docker-compose -f data.yml up
    1. This will bring up Postgres, MinIO and RabbitMQ.
  4. Wait for the database to initialize...
  5. Open a second terminal and navigate to the project folder.
  6. docker-compose -f processors.yml up
    1. This will bring up the API server, Celery workers and a schema migrations server.
  7. Open a third terminal and navigate to the project folder.
    1. pip install requests (needed by the test script)
    2. Run the test script: python tests/test_full_flow.py
    3. Navigate to ./downloads/ to see the results.
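Step 4 above ("Wait for the database to initialize...") is a manual wait. If you would rather script it, the following is a minimal sketch that polls a TCP port until it accepts connections. The helper name wait_for_port is hypothetical, and the host and port to check depend on how data.yml publishes Postgres (5432 is the Postgres default, but that mapping is an assumption here). Note that an open port only shows the server is listening; it does not guarantee that initialization has finished.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if the port is not reachable yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, wait_for_port("localhost", 5432) could be called before starting processors.yml, assuming Postgres is published on that port.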
Key places to look at:
  1. ./gene_parser/tasks.py: The processor background job.
  2. ./gene_parser/parsers.py: The file text parsers.
  3. ./downloads/: Where the resulting output files are saved.
  4. ./tests/test_full_flow.py: The test script.
Closing notes
  1. Since the parsed objects are stored in the DB, it is simple to build an API that retrieves them in different ways.
  2. The code assumes that genes do not repeat within a file.
  3. Improvement idea: Shard file to multiple workers.
  4. Improvement idea: Use S3 bucket notifications instead of the file_uploaded endpoint.
  5. Improvement idea: Store file in S3 in parts instead of whole.
  6. I'm using python manage.py runserver as the server to keep the demo simple.
  7. There is no authentication layer or any other security, for simplicity.
  8. To reset everything, just delete the ./data folder.
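The sharding idea in note 3 could start from splitting the file into contiguous byte ranges, one per worker. The sketch below only computes those ranges. The function name shard_ranges is hypothetical, nothing like it exists in the project yet, and how each worker would fetch its range is left open.

```python
from typing import List, Tuple


def shard_ranges(total_size: int, shard_count: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into contiguous (start, end) ranges, end exclusive.

    Returns at most shard_count ranges; empty ranges are omitted.
    """
    if total_size < 0 or shard_count < 1:
        raise ValueError("total_size must be >= 0 and shard_count >= 1")
    base, extra = divmod(total_size, shard_count)
    ranges = []
    start = 0
    for i in range(shard_count):
        # The first `extra` shards take one additional byte each.
        length = base + (1 if i < extra else 0)
        if length == 0:
            break
        ranges.append((start, start + length))
        start += length
    return ranges


print(shard_ranges(10, 3))
```

A range boundary can still fall in the middle of a record, so each worker would also need a rule for records that straddle two shards, for example skipping a partial first record and reading past its own end to finish the last one.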

Contributors: yanivp
