profiling-handbook's Issues

Specify software versions

We should specify the software versions used throughout the handbook.

Cloned repos

grep -A 2 -B 2 -n "git clone" *.md
02-config.md-88-mkdir software
02-config.md-89-cd software
02-config.md:90:git clone https://github.com/broadinstitute/pe2loaddata.git
02-config.md:91:git clone https://github.com/CellProfiler/Distributed-CellProfiler.git
02-config.md-92-
02-config.md-93-cd ..
--
05-create-profiles.md-67-if [ -d pycytominer ]; then rm -rf pycytominer; fi
05-create-profiles.md-68-
05-create-profiles.md:69:git clone https://github.com/cytomining/pycytominer.git
05-create-profiles.md-70-
05-create-profiles.md-71-cd pycytominer

For cytominer-database, keep this issue in mind: https://stackoverflowteams.com/c/broad-institute-imaging-platform/questions/96 i.e., pandas needs to be pegged below 2.1.4 (pandas<2.1.4).
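One way to apply the peg (a sketch only; the exact mechanism depends on how cytominer-database is installed in your environment) is a pip constraints file:

```shell
# Hypothetical: record the pandas peg in a pip constraints file
echo "pandas<2.1.4" > constraints.txt
cat constraints.txt
# then install against it, e.g.:
# pip install -c constraints.txt cytominer-database
```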

Other

  • CellProfiler: should be compatible with whatever version we specify for Distributed-CellProfiler

We needn't specify these versions because they are pegged via the AMI, but we should point to the versions used in the AMI:

  • Python
  • awscli
  • Miniconda
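A sketch of how those AMI-pegged versions could be captured for documentation (the `aws` and `conda` checks are guarded because the tools may not be on PATH in every environment):

```shell
# Hypothetical sketch: print the tool versions baked into the AMI so the
# handbook can point at them
python3 --version
command -v aws   >/dev/null 2>&1 && aws --version
command -v conda >/dev/null 2>&1 && conda --version
true  # don't let a missing optional tool set a failing exit status
```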

generate variable PLATES without file plates_to_process.txt

In Section 3.4, we currently create the ${PLATES} variable manually, using the text file plates_to_process.txt.

Instead, we could use the code

PLATES=$(ls ~/efs/${PROJECT_NAME}/${BATCH_ID}/images/ | cut -d '_' -f 1) 

to create the variable; or we could create a text file containing the PLATE_IDs using

ls ~/efs/${PROJECT_NAME}/${BATCH_ID}/images/ | cut -d '_' -f 1 > ~/efs/${PROJECT_NAME}/workspace/scratch/${BATCH_ID}/platenames.txt
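A quick sanity check of what the `cut` extracts, assuming plate folder names of the form `<plate>_<suffix>` (the folder names below are made up):

```shell
# Simulated `ls` listing of images/ (hypothetical folder names);
# `cut -d '_' -f 1` keeps everything before the first underscore, i.e. the plate ID
printf 'SQ00015167_2016-04-01T03\nSQ00015168_2016-04-01T09\n' | cut -d '_' -f 1
# prints:
# SQ00015167
# SQ00015168
```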

Define folder structures and implement data versioning

We want to address two issues here

  1. define a new folder structure for profiling experiments
  2. identify which of the components will be version controlled.

I will update this comment periodically as the strategy evolves. I realize this is not ideal because it upsets the chronology of discussions.

This is our current folder structure as specified in the Profiling Handbook; it differs slightly from the folder structure specified in the Cell Painting Gallery. For this level of nesting (under workspace), the only discrepancy is metadata/platemaps (see #70); consensus and collated are currently missing in the Gallery, but that is not a discrepancy per se.

This is the proposed folder structure in the Profiling Handbook:

├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           ├── SQ00015167_normalized_feature_select.csv
│           └── SQ00015167_spherized.csv
├── collated (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── consensus (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── backend
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           └── SQ00015167.sqlite 
├── load_data_csv
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── load_data.csv
│           └── load_data_with_illum.csv
├── log 
├── metadata
│   └── 2016_04_01_a549_48hr_batch1
│       ├── barcode_platemap.csv
│       └── platemap
│           └── C-7161-01-LM6-006.txt
└── pipelines

* collated and consensus files are saved as parquet to allow fast loading.

We will version these folders by placing them inside the project repo:

folder          generator
profiles        pycytominer
collated        pycytominer
consensus       pycytominer
load_data_csv   pe2loaddata
log             GNU parallel (when running various commands)
metadata        manual
pipelines       manual

We will not version these folders:

folder     generator                                 reason
backend    cytominer-database
analysis   CellProfiler, Distributed-CellProfiler    redundant with SQLite backend
images     Microscope                                Never changes, and too big!

Handbook discrepancy: metadata/<batch> vs metadata/<batch>/platemaps

Having the metadata broken out into platemaps and external is still relatively new, and the image analysis team has, to date, typically dealt only with platemap-level metadata. Thus, on AWS, metadata is stored as s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}, as evidenced in this sync command.

BUT, for the recipe, we eventually want it in s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}, as depicted in this tree structure. We currently handle that by simply inserting the extra platemaps level during the sync command above.

BUT, that's not what the handbook implies: it implies the metadata should be uploaded in the tree structure. So we basically have two choices: 1) explain everything I just wrote here in the handbook and show both tree structures there, or 2) from now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (though I don't know that anyone has checked the current bits of metadata over there), we want this new structure.

I'm guessing the preference here is 2, but we should make a decision and update the handbook accordingly.
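A sketch of the status quo described above (the bucket, project, and batch values below are placeholders, not the real sync command): the analysts' layout is mapped onto the recipe layout by inserting the extra platemaps level in the destination path at sync time.

```shell
# Placeholders, not real values
BUCKET=example-bucket
PROJECT_NAME=example_project
BATCH_ID=2016_04_01_a549_48hr_batch1
# analysts' layout: metadata/<batch>; recipe layout: metadata/platemaps/<batch>
SRC="s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}"
DEST="s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}"
# aws s3 sync "${SRC}" "${DEST}"   # the actual copy; echoed here instead
echo "${SRC} -> ${DEST}"
```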

Move builds to JupyterBook and GH Actions


From: Beth Cimini [email protected]
Date: Tue, Nov 30, 2021 at 6:49 AM
To: Shantanu Singh [email protected]

FWIW, I don't think the "lift" is currently probably worth it, but if we did want to switch to JupyterBook somewhere down the road, it looks like it takes Rmd pages so it would be a fast start-up.

https://jupyterbook.org/file-types/jupytext.html?highlight=rmd


From: Shantanu Singh [email protected]
Date: Tue, Nov 30, 2021 at 6:56 AM
To: Beth Cimini [email protected]

Neat! The conversion would be a nice project (~2-3 days?) for someone who is interested in creating training material and learning about GitHub Actions.


From: Beth Cimini [email protected]
Date: Tue, Nov 30, 2021 at 7:11 AM
To: Shantanu Singh [email protected]

Since it can read Rmd files directly, and GH Actions is already documented, then assuming we stuck with the Rmd files we have, in theory it's a 30-minute project; someone without a ton of experience could probably still get it up and running in a day.

It doesn't look like there are any good tools (ala pandoc) for a straight Rmd-to-MyST conversion, but with the VSCode extensions for Rmd and for MyST both installed and a MyST cheat sheet, I still can't see it taking much more than a few hours.


From: Shantanu Singh [email protected]
Date: Tue, Nov 30, 2021 at 7:15 AM
To: Beth Cimini [email protected]

Good point – just 6 Rmd files, and it's mostly just the code blocks that will need conversion (if done by hand)

Originally posted by @shntnu in #64 (comment)

Mounting AWS Volumes in Config

Introduction

It took me a bit of troubleshooting to mount a volume as described in step 1.1. I eventually got it to work, but the steps I used were not listed. It is possible (even likely) that I am doing something slightly wrong that makes the instructions seem off, but I thought it would be useful to document the process here:

Troubleshooting

The error I initially received upon login after SSH'ing into the instance was:

mount.nfs4: failed to resolve server name or service not known

I had initially thought this was because of improper region settings so I double checked these. They looked ok to me, so I looked into why I was getting this error on login.

I saw this line in the .bashrc

sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 us-east-1a.fs-3609f37f.efs.us-east-1.amazonaws.com:/ ~/efs

(I also tried swapping us-east-1a with us-east-1b to no effect)

However, the volume on the AWS gui still looked like it was assigned to the correct instance.

So, after adding my credentials via aws configure, I followed these steps and successfully mounted the volume. It worked! 🎉

Recommendation

I am not sure to what extent this solution is specific to me (maybe I configured something slightly wrong) or whether it belongs in the actual handbook. Either way, I thought it was useful to document.

Stated aggregation method (mean) is inconsistent with that used in other projects

The handbook uses mean for aggregating (i.e. creating level 3) as well as for collapsing (i.e. creating level 5). However, in other projects / papers / software, we decided to use median. This is a major issue and should be resolved!

Some notes

  1. In the LINCS dataset, we decided (broadinstitute/lincs-cell-painting#3 (comment)) to use medians, following the lead of the previously-processed version of the dataset, which used median. When this project was first executed in 2017, cytominer_scripts used median as the default; this was later changed in broadinstitute/cytominer_scripts#18 (more on this below).
  2. pycytominer uses median by default for aggregation.
  3. In our 2018 experiments, we found that median performed better (that plot doesn't show mean).
  4. The PR broadinstitute/cytominer_scripts#18 ("Change default aggregation to be mean instead of median") was meant to make cytominer_scripts/aggregate.R consistent with cytotools/aggregate.R. But it is unclear why cytotools (the new version of cytominer_scripts) used mean! I think this was because cytominer::aggregate used mean by default.
  5. Carmen Verdugo reported in June 2022: the profiling recipe does aggregation using median by default (here and here), while the profiling handbook uses collate.py to create the SQLite file and do the aggregation, and here is the key: collate.py hard-codes mean as the default (here).
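A toy illustration (not pipeline code) of why the choice matters: a single outlier measurement shifts the mean of a well's values far more than the median, which is one reason median performed better in our comparisons.

```shell
# Toy data: five single-cell measurements for one well, one of them an outlier (100).
# The awk median below assumes an odd number of sorted values; this is purely illustrative.
echo "1 2 3 4 100" | tr ' ' '\n' | sort -n |
  awk '{ s += $1; a[NR] = $1 } END { printf "mean=%g median=%g\n", s/NR, a[int((NR+1)/2)] }'
# prints: mean=22 median=3
```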

Handbook discrepancy: <top-level>/<batch> vs. <top-level>/images/<batch>

The Cell Painting Gallery and the Profiling Handbook specify different nesting structures for images.

Should we try to resolve it by modifying the handbook to suit? Note that when deciding the folder structure for the gallery, we did start with the handbook structure and then modified it because we felt this new structure (used in Cell Painting Gallery) made more sense.

It's not mandatory to resolve this discrepancy because it can easily be handled during sync.

Cell Painting Gallery: https://github.com/broadinstitute/cellpainting-gallery/blob/0d63bf7db1c5db70675de37fe18577e3cb537e3c/folder_structure.md#images-folder-structure

    └── <top-level>
        ├── images
        │   ├── YYYY_MM_DD_<batch-name>
        │   │   ├── illum
        │   │   │   ├── <plate-name>
        │   │   │   │   ├── <plate-name>_Illum<Channel>.npy
        │   │   │   │   └── <plate-name>_Illum<Channel>.npy
        │   │   │   └── <plate-name>
        │   │   └── images
        │   │       ├── <full-plate-name>
        │   │       └── <full-plate-name>
        │   └── YYYY_MM_DD_<batch-name>
        └── workspace

Profiling Handbook: https://github.com/cytomining/profiling-handbook/blob/2c4dc1ba62ef5141ceb789494d450f6ba14fe05e/06-appendix.md#directory-structure

    └── <top-level>
        ├── YYYY_MM_DD_<batch-name>
        │   ├── illum
        │   │   ├── <plate-name>
        │   │   │   ├── <plate-name>_Illum<Channel>.npy
        │   │   │   └── <plate-name>_Illum<Channel>.npy
        │   │   └── <plate-name>
        │   └── images
        │       ├── <full-plate-name>
        │       └── <full-plate-name>
        ├── YYYY_MM_DD_<batch-name>
        └── workspace
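If we keep the handbook structure, a sketch of how the discrepancy could be absorbed at sync time (bucket, project, and batch names below are placeholders): the destination path inserts the extra images level expected by the gallery.

```shell
# Placeholders, not real values
BUCKET=example-bucket
PROJECT_NAME=example_project
BATCH_ID=2016_01_01_batch1
# handbook layout: <top-level>/<batch>; gallery layout: <top-level>/images/<batch>
SRC="s3://${BUCKET}/projects/${PROJECT_NAME}/${BATCH_ID}"
DEST="s3://${BUCKET}/projects/${PROJECT_NAME}/images/${BATCH_ID}"
# aws s3 sync "${SRC}" "${DEST}"   # the actual copy; echoed here instead
echo "${SRC} -> ${DEST}"
```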

Add other resources page

Add a links page pointing to resources such as the gallery, videos on profiling and Morpheus, and various methods and protocol papers.
