
Home Page: https://aws.amazon.com/solutions/guidance/multi-omics-and-multi-modal-data-integration-and-analysis/

License: Apache License 2.0


Guidance for Multi-Omics and Multi-Modal Data Integration and Analysis on AWS

This guidance creates a scalable environment in AWS to prepare genomic, clinical, mutation, expression and imaging data for large-scale analysis and perform interactive queries against a data lake. This solution demonstrates how to:

  1. Build, package, and deploy libraries used for genomics data conversion.
  2. Provision serverless data ingestion pipelines for multi-modal data preparation and cataloging.
  3. Visualize and explore clinical data through an interactive interface.
  4. Run interactive analytic queries against a multi-modal data lake.

This solution also demonstrates how to use Amazon Omics to create and work with a Sequence Store, Reference Store, and Variant Store in a multi-modal context.

Setup

You can set up the solution in your account by clicking the "Deploy sample code on Console" button on the solution home page.

Customization

Running unit tests for customization

  • Clone the repository, then make the desired code changes.
  • Run the unit tests to make sure your customizations still pass:
cd ./deployment
chmod +x ./run-unit-tests.sh
./run-unit-tests.sh

Prerequisites

  1. Create a distribution bucket, e.g., my-bucket-name (a CLI sketch for this and the next step follows the list).
  2. Create a region-based distribution bucket with the region appended, e.g., my-bucket-name-us-west-2.
  3. Create a Cloud9 environment.
  4. Clone this repo into that environment.
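
The two buckets can be created with the AWS CLI. This is a minimal sketch; the bucket name and region are placeholders to replace with your own:

# placeholder values -- substitute your own bucket name and region
export DIST_OUTPUT_BUCKET=my-bucket-name
export REGION=us-west-2

# create the distribution bucket and its region-suffixed counterpart
aws s3 mb s3://$DIST_OUTPUT_BUCKET --region $REGION
aws s3 mb s3://$DIST_OUTPUT_BUCKET-$REGION --region $REGION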

Building and deploying distributable for customization

Configure the bucket name and region of your target Amazon S3 distribution bucket, then run the following commands.

_Note:_ You must have already created an S3 bucket named 'my-bucket-name-<aws_region>'; aws_region is the region where you are testing the customized solution.
# bucket where the customized code will reside (without the -<region> suffix; it is appended automatically)
export DIST_OUTPUT_BUCKET=my-bucket-name 

#default region where resources will get created
#Use "us-east-1" to get publicly available data from AWS solution bucket
export REGION=my-region

#default name of the solution (use this name to get publicly available test datasets from AWS S3 bucket)
export SOLUTION_NAME=genomics-tertiary-analysis-and-data-lakes-using-aws-glue-and-amazon-athena

#version number for the customized code (use this version to get publicly available test datasets from AWS S3 bucket)
export VERSION=latest

Change to deployment directory.

cd deployment

Build the distributable.

chmod +x ./build-s3-dist.sh
./build-s3-dist.sh $DIST_OUTPUT_BUCKET $SOLUTION_NAME $VERSION
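
The build script writes its output into ./global-s3-assets and ./regional-s3-assets, which the next steps upload; a quick listing confirms the build produced output (an optional check):

ls ./global-s3-assets ./regional-s3-assets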

Deploy the distributable to an Amazon S3 bucket in your account. Note: you must have the AWS Command Line Interface (AWS CLI) installed.

aws s3 cp ./$SOLUTION_NAME.template s3://$DIST_OUTPUT_BUCKET-$REGION/$SOLUTION_NAME/$VERSION/

Deploy the global assets.

aws s3 cp ./global-s3-assets/ s3://$DIST_OUTPUT_BUCKET-$REGION/$SOLUTION_NAME/$VERSION --recursive

Deploy the regional assets.

aws s3 cp ./regional-s3-assets/ s3://$DIST_OUTPUT_BUCKET-$REGION/$SOLUTION_NAME/$VERSION --recursive
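
As an optional sanity check, list the uploaded assets and confirm the paths match what the template will look for:

aws s3 ls s3://$DIST_OUTPUT_BUCKET-$REGION/$SOLUTION_NAME/$VERSION/ --recursive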

Copy the static assets (the AWS profile argument is optional).

./copy-static-files.sh [AWSProfile]

Go to the DIST_OUTPUT_BUCKET in the Amazon S3 console and copy the Object URL for latest/guidance-for-multi-omics-and-multi-modal-data-integration-and-analysis-on-aws.template.

Go to the AWS CloudFormation console and create a new stack using the copied template URL.
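
Alternatively, the stack can be created from the CLI. The sketch below assumes the environment variables from the earlier steps are still set and uses an illustrative stack name; because the template provisions IAM resources, the capability flags are required:

aws cloudformation create-stack \
  --stack-name multi-omics-guidance \
  --template-url https://$DIST_OUTPUT_BUCKET-$REGION.s3.amazonaws.com/$SOLUTION_NAME/$VERSION/guidance-for-multi-omics-and-multi-modal-data-integration-and-analysis-on-aws.template \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM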

File Structure

The overall file structure of the application:

.
├── ATTRIBUTION.txt
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.txt
├── NOTICE.txt
├── README.md
├── buildspec.yml
├── deploy.sh
├── deployment
│   └── build-s3-dist.sh
├── source
│   ├── GenomicsAnalysisCode
│   │   ├── TCIA_etl.yaml
│   │   ├── code_cfn.yml
│   │   ├── copyresources_buildspec.yml
│   │   ├── omics_cfn.yml
│   │   ├── omicsresources_buildspec.yml
│   │   ├── quicksight_cfn.yml
│   │   ├── resources
│   │   │   ├── notebooks
│   │   │   │   ├── cohort-building.ipynb
│   │   │   │   ├── runbook.ipynb
│   │   │   │   └── summarize-tcga-datasets.ipynb
│   │   │   ├── omics
│   │   │   │   ├── create_annotation_store_lambda.py
│   │   │   │   ├── create_reference_store_lambda.py
│   │   │   │   ├── create_variant_store_lambda.py
│   │   │   │   ├── import_annotation_lambda.py
│   │   │   │   ├── import_reference_lambda.py
│   │   │   │   └── import_variant_lambda.py
│   │   │   └── scripts
│   │   │       ├── create_tcga_summary.py
│   │   │       ├── image_api_glue.py
│   │   │       ├── run_tests.py
│   │   │       ├── tcga_etl_common_job.py
│   │   │       └── transfer_tcia_images_glue.py
│   │   ├── run_crawlers.sh
│   │   └── setup
│   │       ├── lambda.py
│   │       └── requirements.txt
│   ├── GenomicsAnalysisPipe
│   │   └── pipe_cfn.yml
│   ├── GenomicsAnalysisZone
│   │   └── zone_cfn.yml
│   ├── TCIA_etl.yaml
│   ├── setup.sh
│   ├── setup_cfn.yml
│   └── teardown.sh
└── template_cfn.yml

This solution collects anonymous operational metrics to help AWS improve the quality and features of the solution. For more information, including how to disable this capability, please see the implementation guide.


Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at

http://www.apache.org/licenses/

or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and limitations under the License.

Contributors

amazon-auto, aws-hyunmin, nbulsara, rulaszek, staskh

Issues

error

An error occurred (EntityNotFoundException) when calling the StartCrawler operation: Crawler with name genomicsanalysis-annotation does not exist
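
A reasonable first diagnostic is to list the crawlers the stack actually created and start the matching one by name; the crawler name below is a placeholder:

aws glue list-crawlers
aws glue start-crawler --name <your-crawler-name>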

Update to Hail 0.2?

Can this solution be used with Hail 0.2, which is considered a formal release? We tried several ways to use Hail 0.2 in this solution, but they all failed for various reasons.

Is it possible for this solution to work with the latest Hail?

Thank you.

Package import error in notebook

File:
genomics-tertiary-analysis-and-data-lakes-using-aws-glue-and-amazon-athena/source/GenomicsAnalysisCode/resources/notebooks/runbook.ipynb

Environment:
Current version of SageMaker Jupyter and JupyterLab (as of June 2021)

Bug:
Notebook fails to run due to package import error

Error:
ImportError: cannot import name 'as_pandas'

Fix:
Change the import in the notebook:

# before
from pyathena.util import as_pandas
# after
from pyathena.pandas.util import as_pandas
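
If the import still fails, checking the installed PyAthena version may help, since as_pandas moved to pyathena.pandas.util in later releases (a suggested check, not part of the original issue report):

pip show pyathena
pip install --upgrade pyathena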
