cytomining / profiling-handbook
Image-based Profiling Handbook
Home Page: https://cytomining.github.io/profiling-handbook/
License: Creative Commons Zero v1.0 Universal
Note this issue
broadinstitute/cmQTL#14 (comment)
We should specify software versions
grep -A 2 -B 2 -n "git clone" *.md
02-config.md-88-mkdir software
02-config.md-89-cd software
02-config.md:90:git clone https://github.com/broadinstitute/pe2loaddata.git
02-config.md:91:git clone https://github.com/CellProfiler/Distributed-CellProfiler.git
02-config.md-92-
02-config.md-93-cd ..
--
05-create-profiles.md-67-if [ -d pycytominer ]; then rm -rf pycytominer; fi
05-create-profiles.md-68-
05-create-profiles.md:69:git clone https://github.com/cytomining/pycytominer.git
05-create-profiles.md-70-
05-create-profiles.md-71-cd pycytominer
For cytominer-database, keep this issue in mind: https://stackoverflowteams.com/c/broad-institute-imaging-platform/questions/96 i.e. we need to pin pandas<2.1.4.
We needn't specify these versions explicitly because they are pinned via the AMI, but we should point to the versions used in the AMI.
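As a sketch only (not the handbook's actual install command), the pin mentioned above would look something like:

```shell
# Illustrative command fragment: install cytominer-database with pandas
# pinned below 2.1.4, per the Stack Overflow for Teams issue above.
pip install "pandas<2.1.4" cytominer-database
```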
We create the variable ${PLATES} in section 3.4 manually, using the text file plates_to_process.txt.
Instead, we could create the variable directly with
PLATES=$(ls ~/efs/${PROJECT_NAME}/${BATCH_ID}/images/ | cut -d '_' -f 1)
or we could write a text file containing the plate IDs using
ls ~/efs/${PROJECT_NAME}/${BATCH_ID}/images/ | cut -d '_' -f 1 >> ~/efs/${PROJECT_NAME}/workspace/scratch/${BATCH_ID}/platenames.txt
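A self-contained sketch of the same idea, using a local stand-in directory instead of the real EFS path (the plate names below are only examples):

```shell
# IMAGES_DIR stands in for ~/efs/${PROJECT_NAME}/${BATCH_ID}/images/.
IMAGES_DIR=./images_demo
mkdir -p "${IMAGES_DIR}/SQ00015167__2016-04-01" "${IMAGES_DIR}/SQ00015168__2016-04-02"

# Image folders are assumed to be named <plate-id>_<suffix>, so keeping
# the text before the first underscore yields the plate ID.
PLATES=$(ls "${IMAGES_DIR}" | cut -d '_' -f 1)
echo "${PLATES}"
```

Note this assumes plate IDs never contain underscores; if they do, the `cut` field delimiter would need rethinking.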
We want to address two issues here.
I will update this comment periodically as the strategy evolves. I realize this is not ideal because it upsets the chronology of discussions.
This is our current folder structure specified in the Profiling Handbook. It differs slightly from the folder structure specified in the Cell Painting Gallery. For this level of nesting (under workspace), the only discrepancy is metadata/platemaps (see #70); consensus and collated are currently missing in the Gallery, but that is not a discrepancy per se.
This is the proposed folder structure in the Profiling Handbook:
├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           ├── SQ00015167_normalized_feature_select.csv
│           └── SQ00015167_spherized.csv
├── collated (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── consensus (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── backend
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           └── SQ00015167.sqlite
├── load_data_csv
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── load_data.csv
│           └── load_data_with_illum.csv
├── log
├── metadata
│   └── 2016_04_01_a549_48hr_batch1
│       ├── barcode_platemap.csv
│       └── platemap
│           └── C-7161-01-LM6-006.txt
└── pipelines
* collated and consensus files are saved as parquet to allow fast loading.
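For orientation, a minimal Python sketch (folder, batch, and plate names are purely illustrative, taken from the tree above) that scaffolds this per-batch layout:

```python
from pathlib import Path

# Illustrative batch/plate names from the tree above.
BATCH = "2016_04_01_a549_48hr_batch1"
PLATE = "SQ00015167"

root = Path("workspace_demo")  # stand-in for the real workspace root

# Folders keyed per batch and per plate.
for folder in ["profiles", "backend", "load_data_csv"]:
    (root / folder / BATCH / PLATE).mkdir(parents=True, exist_ok=True)
# Folders keyed per batch only.
for folder in ["collated", "consensus", "metadata"]:
    (root / folder / BATCH).mkdir(parents=True, exist_ok=True)
# Flat folders.
for folder in ["log", "pipelines"]:
    (root / folder).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in root.iterdir()))
```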
We will version these folders by placing them inside the project repo
folder | generator |
---|---|
profiles | pycytominer |
collated | pycytominer |
consensus | pycytominer |
load_data_csv | pe2loaddata |
log | GNU parallel (when running various commands) |
metadata | manual |
pipelines | manual |
We will not version these folders:
folder | generator | reason |
---|---|---|
backend | cytominer-database | |
analysis | CellProfiler, Distributed-CellProfiler | redundant with SQLite backend |
images | Microscope | Never changes, and too big! |
Having the metadata broken out into platemaps and external is still relatively new, and the image analysis team has, to date, typically dealt only with platemap-level metadata. Thus, on AWS, metadata is stored as s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}, as evidenced in this sync command.
BUT, for the recipe, we eventually want it in s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}, as depicted in this tree structure. We currently handle that by just adding the extra platemaps level in during the sync command above.
BUT, that's not what the handbook implies: it implies that it should be uploaded in the tree structure. So we basically have a couple of choices: 1) explain everything I just wrote here in the handbook and show both tree structures there, or 2) from now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (though I don't know that anyone has checked the current bits of metadata over there), we want this new structure.
I'm guessing the preference here is 2, but we should make a decision and update the handbook accordingly.
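Option 2 would amount to a sync command along these lines (a sketch, not the actual recipe command; the variables are as used elsewhere in the handbook):

```shell
# Command fragment (requires AWS credentials): insert the extra
# "platemaps" level in the S3 destination so the uploaded layout matches
# the tree structure the handbook depicts.
aws s3 sync \
  ~/efs/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/ \
  s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}/
```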
git@github.com:CellProfiler/Distributed-CellProfiler.git
Once we fix it.
The VM mentioned in this manual currently has an older version of cytominer-database. Install the latest version, primarily to handle nans correctly.
pip install --upgrade cytominer-database
See this PR and comment cytomining/cytominer-database#104 (comment)
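This is not cytominer-database's actual code, but a generic illustration of the class of problem: missing values in per-object CSVs should survive the round trip into SQLite as NULLs, not as garbage strings. A sketch with pandas (toy column names):

```python
import math
import sqlite3

import pandas as pd

# Toy measurements table with a missing value, standing in for a
# CellProfiler per-object CSV.
df = pd.DataFrame(
    {"ObjectNumber": [1, 2], "Cells_AreaShape_Area": [512.0, float("nan")]}
)

with sqlite3.connect(":memory:") as conn:
    # pandas writes NaN as SQL NULL, so the missing value is preserved.
    df.to_sql("Cells", conn, index=False)
    out = pd.read_sql("SELECT * FROM Cells", conn)

# The NULL comes back as NaN in the float column.
assert math.isnan(out.loc[1, "Cells_AreaShape_Area"])
```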
(Tentatively as appendix B)
Probably extracted into a config file so we can otherwise share the AMI creation info.
Do we want to deploy the handbook for every pull request? For example, in #19, Travis builds and deploys the changes to the GitHub Pages site at every commit. We should consider performing a conditional build with Travis to deploy only when code is merged to master. Another relevant post.
In preselect, add justifications for why we use certain sections of the data to do different steps.
From: Beth Cimini
Date: Tue, Nov 30, 2021 at 6:49 AM
To: Shantanu Singh
FWIW, I don't think the "lift" is currently probably worth it, but if we did want to switch to JupyterBook somewhere down the road, it looks like it takes Rmd pages so it would be a fast start-up.
https://jupyterbook.org/file-types/jupytext.html?highlight=rmd
From: Shantanu Singh
Date: Tue, Nov 30, 2021 at 6:56 AM
To: Beth Cimini
Neat! The conversion would be a nice project (~2-3 days?) for someone who is interested in creating training material and learning about GitHub Actions.
From: Beth Cimini
Date: Tue, Nov 30, 2021 at 7:11 AM
To: Shantanu Singh
Since it can read Rmd files directly, and GH actions is already documented, assuming we stuck with the Rmd files we have in theory it's a 30 minute project; someone with not a ton of experience still could probably get it up and running in a day.
It doesn't look like there are any good tools (ala pandoc) for a straight Rmd-to-MyST conversion, but with the VSCode extensions for Rmd and for MyST both installed and a MyST cheat sheet, I still can't see it taking much more than a few hours.
From: Shantanu Singh
Date: Tue, Nov 30, 2021 at 7:15 AM
To: Beth Cimini
Good point – just 6 Rmd files, and it's mostly just the code blocks that will need conversion (if done by hand)
Originally posted by @shntnu in #64 (comment)
It took me a bit of troubleshooting to mount a volume as described in step 1.1. I eventually did get it to work, but the steps I used were not listed. It is possible (and even likely) that I am doing something slightly wrong that makes the instructions a bit off, but I thought it would be useful to document here:
The error I initially received upon login after SSH'ing into the instance was:
mount.nfs4: failed to resolve server name or service not known
I had initially thought this was because of improper region settings so I double checked these. They looked ok to me, so I looked into why I was getting this error on login.
I saw this line in the .bashrc:
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 us-east-1a.fs-3609f37f.efs.us-east-1.amazonaws.com:/ ~/efs
(I also tried swapping us-east-1a with us-east-1b, to no effect.)
However, the volume on the AWS gui still looked like it was assigned to the correct instance.
So, after adding my credentials via an aws configure command, I successfully mounted the volume by following these steps. It worked! 🎉
So, not sure to what extent this solution is specific to me (maybe I configured something slightly wrong) or if it belongs in the actual handbook. Either way I thought it was useful to document.
Currently imaging-platform-dev
The handbook uses mean for aggregating (i.e. creating level 3) as well as for collapsing (i.e. creating level 5). However, in other projects / papers / software, we decided to use median. This is a major issue and should be resolved!
Some notes on median vs mean:
- When this project was first executed in 2017, cytominer_scripts used median as default; this was later changed here: broadinstitute/cytominer_scripts#18 (more on this below).
- pycytominer uses median by default for aggregation.
- median performed better (that plot doesn't show mean).
- The switch to mean instead of median was to make cytominer_scripts/aggregate.R consistent with cytotools/aggregate.R. But it is unclear why cytotools (the new version of cytominer_scripts) used mean! I think this was because cytominer::aggregate used mean by default.
- pycytominer uses median by default (here and here), while the profiling handbook uses collate.py for the creation of the sqlite and the aggregation, and here is the key: collate.py hard-codes mean by default (here).
Is there a way to do that? That'd be ideal.
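The practical difference is easy to see with a toy pandas aggregation (illustrative column names; this is not pycytominer's or collate.py's actual code):

```python
import pandas as pd

# Toy single-cell table: one well, three cells, one outlier cell.
cells = pd.DataFrame({
    "Metadata_Well": ["A01", "A01", "A01"],
    "Cells_Intensity": [1.0, 1.2, 10.0],  # 10.0 is an outlier
})

# Level-3 profile via mean vs median: the outlier drags the mean
# well away from the bulk of the cells, but not the median.
mean_profile = cells.groupby("Metadata_Well").mean()
median_profile = cells.groupby("Metadata_Well").median()

print(float(mean_profile.loc["A01", "Cells_Intensity"]))    # ~4.07
print(float(median_profile.loc["A01", "Cells_Intensity"]))  # 1.2
```

This robustness to outlier cells is presumably why median performed better in the comparison mentioned above.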
The Cell Painting Gallery and the Profiling Handbook specify different nesting structures for images.
Should we try to resolve it by modifying the handbook to suit? Note that when deciding the folder structure for the gallery, we did start with the handbook structure and then modified it because we felt this new structure (used in Cell Painting Gallery) made more sense.
It's not mandatory to resolve this discrepancy because it can easily be handled during sync.
Cell Painting Gallery: https://github.com/broadinstitute/cellpainting-gallery/blob/0d63bf7db1c5db70675de37fe18577e3cb537e3c/folder_structure.md#images-folder-structure
└── <top-level>
    ├── images
    │   ├── YYYY_MM_DD_<batch-name>
    │   │   ├── illum
    │   │   │   ├── <plate-name>
    │   │   │   │   ├── <plate-name>_Illum<Channel>.npy
    │   │   │   │   └── <plate-name>_Illum<Channel>.npy
    │   │   │   └── <plate-name>
    │   │   └── images
    │   │       ├── <full-plate-name>
    │   │       └── <full-plate-name>
    │   └── YYYY_MM_DD_<batch-name>
    └── workspace
Profiling Handbook: https://github.com/cytomining/profiling-handbook/blob/2c4dc1ba62ef5141ceb789494d450f6ba14fe05e/06-appendix.md#directory-structure
└── <top-level>
    ├── YYYY_MM_DD_<batch-name>
    │   ├── illum
    │   │   ├── <plate-name>
    │   │   │   ├── <plate-name>_Illum<Channel>.npy
    │   │   │   └── <plate-name>_Illum<Channel>.npy
    │   │   └── <plate-name>
    │   └── images
    │       ├── <full-plate-name>
    │       └── <full-plate-name>
    ├── YYYY_MM_DD_<batch-name>
    └── workspace
A links page pointing to things like the gallery, videos on profiling and Morpheus, and various methods and protocol papers.