JUMP Cell Painting Datasets

This is a collection of Cell Painting image datasets generated by the JUMP-Cell Painting Consortium, funded in part by a grant from the Massachusetts Life Sciences Center.

This repository contains notebooks and instructions to work with the datasets.

All the data is hosted on the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/). To browse a subset of the data interactively, use the JUMP-CP Data Explorer by Ardigen or the JUMP-CP Data Portal by Spring Discovery.
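Because the data sits in a public S3 bucket, it can in principle be browsed without AWS credentials. The sketch below uses only the Python standard library to build an anonymous S3 ListObjectsV2 request over HTTPS and parse the reply. The bucket name `cellpainting-gallery` and the example prefix `cpg0016-jump/` are assumptions that are not stated in this README; check the registry page for the current values.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the bucket shares the registry entry's name; confirm on the registry page.
BUCKET = "cellpainting-gallery"
# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(prefix, bucket=BUCKET):
    """Build an anonymous ListObjectsV2 URL for one folder level under `prefix`."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": "/"}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return sub-folder prefixes followed by object keys from a list response."""
    root = ET.fromstring(xml_text)
    folders = [e.text for e in root.findall(f"{S3_NS}CommonPrefixes/{S3_NS}Prefix")]
    keys = [e.text for e in root.findall(f"{S3_NS}Contents/{S3_NS}Key")]
    return folders + keys


def list_prefix(prefix):
    """Fetch and parse the first page of results (requires network access)."""
    with urllib.request.urlopen(listing_url(prefix), timeout=30) as response:
        return parse_listing(response.read().decode("utf-8"))


# Example (needs network; the prefix is an assumed dataset folder name):
# print(list_prefix("cpg0016-jump/"))
```

Only the first page of results is returned, which is usually enough to see the top-level folders of a dataset.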

Details about the data

Currently, this collection comprises 4 datasets:

  • The principal dataset (cpg0016): 116k chemical and >15k genetic perturbations that the partners created in tandem, split across 12 data-generating centers, using human U2OS osteosarcoma cells.
  • 3 pilot datasets created to test different perturbation conditions (cpg0000, including different cell types), staining conditions (cpg0001), and microscopes (cpg0002).

What’s available now

  • All data components of the three pilots.
  • Most data components (images, raw CellProfiler output, single-cell profiles, aggregated CellProfiler profiles) from 12 sources for the principal dataset. Each source corresponds to a unique data-generating center (except source_7 and source_13, which come from the same center).
  • First draft of metadata files.
  • A notebook to load and inspect the data currently available in the principal dataset.

Please note: At present, some compounds in the principal dataset (cpg0016) are missing replicates, and a full QC of the dataset is pending. We don’t recommend performing any analysis with the principal dataset until the full QC is complete. The other datasets are complete.
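For exploratory work before QC is finished, one lightweight precaution is to count how many replicate wells each compound has. The following is a minimal sketch under stated assumptions: the field name `compound_id`, the example records, and the threshold are all hypothetical and do not reflect the actual metadata schema.

```python
from collections import Counter


def under_replicated(wells, min_replicates, key="compound_id"):
    """Map each compound with fewer than `min_replicates` wells to its well count.

    `wells` is an iterable of dict-like rows, one per well. Compounds that are
    absent from `wells` altogether cannot be detected this way.
    """
    counts = Counter(row[key] for row in wells)
    return {compound: n for compound, n in counts.items() if n < min_replicates}


# Hypothetical well-level records, for illustration only.
wells = [
    {"compound_id": "cmpd_A", "well": "A01"},
    {"compound_id": "cmpd_A", "well": "B07"},
    {"compound_id": "cmpd_B", "well": "C03"},
]

print(under_replicated(wells, min_replicates=2))
```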

What’s coming up

  • Extending the metadata and notebooks to the three pilots so that all these datasets can be quickly loaded together.
  • Curated annotations for the compounds, obtained from ChEMBL and other sources.
  • The remaining data components (normalized profiles, feature selected profiles, treatment-level consensus profiles, quality control results) and the one remaining source for the principal dataset.
  • Deep learning embeddings using a pre-trained neural network for all 4 datasets.
  • Quality control results at the image level for the principal dataset to allow removing bad images.

How to load the data: notebooks and folder structure

See the sample notebook to learn more about how to load the data in the principal dataset.

To get set up to run the notebook, first install the Python dependencies and activate the virtual environment:

# install pipenv if you don't have it already https://pipenv.pypa.io/en/latest/#install-pipenv-today
pipenv install
pipenv shell

See the typical folder structure for datasets in the Cell Painting Gallery. Please note that not all components are currently available.
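To illustrate how such a folder layout translates into object paths, the sketch below assembles the S3 URI of one plate's aggregated profile from its source, batch, and plate identifiers. The path template, the bucket and dataset names, and the example identifiers are all assumptions made for illustration; defer to the folder structure documentation and the sample notebook for the actual layout.

```python
from dataclasses import dataclass

# Assumed path template; not confirmed by this README.
PROFILE_TEMPLATE = (
    "s3://{bucket}/{dataset}/{source}/workspace/profiles/"
    "{batch}/{plate}/{plate}.parquet"
)


@dataclass(frozen=True)
class PlateRef:
    """Identifies one plate by data source, batch, and plate name."""

    source: str
    batch: str
    plate: str

    def profile_uri(self, bucket="cellpainting-gallery", dataset="cpg0016-jump"):
        """Return the assumed S3 URI of this plate's aggregated profile file."""
        return PROFILE_TEMPLATE.format(
            bucket=bucket,
            dataset=dataset,
            source=self.source,
            batch=self.batch,
            plate=self.plate,
        )


# Hypothetical identifiers, for illustration only.
ref = PlateRef(source="source_4", batch="BATCH_A", plate="PLATE_001")
print(ref.profile_uri())
```

If pandas and s3fs are installed, a URI of this form can typically be read with `pandas.read_parquet(uri, storage_options={"anon": True})`, which requests anonymous access to the public bucket.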

Citation/license

Citing the JUMP resource as a whole

All the data is released under CC0 1.0 Universal (CC0 1.0). Still, professional ethics require that you cite the associated publication. Please use the following format to cite this resource as a whole:

We used the JUMP Cell Painting datasets (Chandrasekaran et al., 2023), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

Chandrasekaran et al., 2023: doi:10.1101/2023.03.23.534023

Citing individual JUMP datasets

To cite individual JUMP Cell Painting datasets, please follow the guidelines in the Cell Painting Gallery citation guide. Examples are as follows:

We used the dataset cpg0001 (Cimini et al., 2022), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

We used the dataset cpg0000 (Chandrasekaran et al., 2022), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

Gratitude

Thanks to Consortium Partner scientists for creating this data, from Ksilink, Amgen, AstraZeneca, Bayer, Biogen, the Broad Institute, Eisai, Janssen Pharmaceutica NV, Merck KGaA Darmstadt Germany, Pfizer, Servier, and Takeda.

Supporting Partners include Ardigen, Google Research, Nomic Bio, PerkinElmer, and Verily. Collaborators include the Pistoia Alliance, Umeå University, and the Stanford Machine Learning Group. The AWS Open Data Sponsorship Program is sponsoring data storage.

This work was funded by a major grant from the Massachusetts Life Sciences Center and the National Institutes of Health through MIRA R35 GM122547 to Anne Carpenter.

Questions?

Please ask questions by opening an issue at https://github.com/jump-cellpainting/datasets/issues.

To stay informed about future data updates, subscribe to our email list using the button at https://jump-cellpainting.broadinstitute.org/more-info

Contributors

alxndrkalinin, annecarpenter, bethac07, deflaux, erinweisbart, johnarevalo, niranjchandrasekaran, shntnu
