These instructions assume you have (and know how to work with):

- Docker containers
- Python 3.6
- Django
- Postgres
- S3 buckets/objects
- RabbitMQ
- Celery workers
- `Client` `POST`s a `GeneParser` object to the `API Server`.
- `API Server` contacts `S3` and retrieves a presigned URL for upload.
- `API Server` stores it in the `GeneParser` object and returns it to the client.
- `Client` `PUT`s a file to the `S3` presigned URL.
- `Client` `POST`s to the `API Server` that the file was uploaded.
- `API Server` creates an `AsyncJob` and puts a task in `RabbitMQ`.
- `Worker` pulls the job from `RabbitMQ` and begins processing.
- `Worker` downloads the file in chunks from `S3` and processes each chunk separately.
- Once the `Worker` finds a DNA sequence it stores it in `Postgres`.
- When the file is complete, the worker builds a JSON from the stored DNA sequences and sends it to `S3`.
- `Client` polls `AsyncJob` progress until completed.
- `Client` `GET`s the `GeneParser`; the `API Server` contacts `S3` and retrieves a presigned URL for download.
- `Client` downloads the file using the presigned URL.
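The chunked-processing step above has one subtlety: a DNA sequence can straddle a chunk boundary. A minimal sketch of how a worker might handle that — assuming, purely for illustration, that a sequence is a maximal run of A/C/G/T characters (the real parsing rules live in `./gene_parser/parsers.py` and may differ):

```python
import re

# Illustrative assumption: a "DNA sequence" is a maximal run of A/C/G/T.
DNA_RUN = re.compile(r"[ACGT]+")

def parse_chunks(chunks):
    """Yield DNA sequences from an iterable of text chunks.

    A sequence may straddle a chunk boundary, so the unfinished tail of
    each chunk is carried over and prepended to the next one.
    """
    carry = ""
    for chunk in chunks:
        text = carry + chunk
        carry = ""
        for m in DNA_RUN.finditer(text):
            if m.end() == len(text):
                # The run touches the end of this chunk -- it may continue
                # in the next chunk, so hold it back for now.
                carry = m.group()
            else:
                yield m.group()
    if carry:
        yield carry  # flush the final unfinished run
```

With this carry-over, `parse_chunks(["xxACG", "Tyy"])` yields the single sequence `"ACGT"` rather than splitting it in two; each completed sequence can then be stored in Postgres as described above.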
- Create a terminal and navigate to the project folder.
- Run `docker network create genobase_default`.
- Run `docker-compose -f data.yml up`.
  - This will bring up `Postgres`, `Minio` and `RabbitMQ`.
- Wait for the database to initialize.
- Create another terminal and navigate to the project folder.
- Run `docker-compose -f processors.yml up`.
  - This will bring up the `API server`, `Celery workers` and a `schema migrations server`.
- Create another terminal and navigate to the project folder.
- Run `pip install requests` (needed for the test script).
- Run the test script: `python tests/test_full_flow.py`.
- Navigate to `./downloads/` and see the results.
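For reference, here is a rough sketch of what a client of this API (such as the test script) does. The endpoint paths and JSON field names below are illustrative assumptions, not the project's actual API — see `tests/test_full_flow.py` for the real flow:

```python
"""Sketch of the client flow: create a GeneParser, upload via presigned
URL, signal completion, poll the AsyncJob, then download the result.
All endpoint paths and field names here are hypothetical."""
import time

import requests

BASE = "http://localhost:8000"  # assumed dev-server address

def job_is_done(job):
    # Assumed AsyncJob payload shape: {"id": ..., "status": "..."}.
    return job.get("status") in ("completed", "failed")

def run_flow(path):
    # 1. Create a GeneParser; the server returns a presigned upload URL.
    gp = requests.post(f"{BASE}/gene_parsers/", json={"name": path}).json()
    # 2. Upload the file straight to S3 via the presigned URL.
    with open(path, "rb") as f:
        requests.put(gp["upload_url"], data=f)
    # 3. Tell the API the upload finished; this enqueues an AsyncJob.
    job = requests.post(f"{BASE}/gene_parsers/{gp['id']}/file_uploaded/").json()
    # 4. Poll the AsyncJob until the workers are done.
    while not job_is_done(job):
        time.sleep(1)
        job = requests.get(f"{BASE}/async_jobs/{job['id']}/").json()
    # 5. Fetch a presigned download URL for the resulting JSON.
    gp = requests.get(f"{BASE}/gene_parsers/{gp['id']}/").json()
    return requests.get(gp["download_url"]).content

if __name__ == "__main__":
    print(run_flow("sample.txt"))
```

Note that the file bytes go directly between the client and S3 in steps 2 and 5; the API server only hands out presigned URLs.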
- `./gene_parser/tasks.py`: The processor background job.
- `./gene_parser/parsers.py`: The file text parsers.
- `./downloads/`: Where the parser results end up.
- `./tests/test_full_flow.py`: The test script.
- Since objects are stored in the DB, it's simple to create an API to retrieve them in different ways.
- The code assumes genes are not repeated within the files.
- Improvement idea: shard the file to multiple workers.
- Improvement idea: use S3 bucket notifications instead of the `file_uploaded` endpoint.
- Improvement idea: store the file in S3 in parts instead of as a whole.
- I'm using `python manage.py runserver` as the server for demo simplicity.
- There is no authentication layer or any other security, for simplicity.
- If you want to reset everything, just delete the `./data` folder.