cytomining / profiling-handbook
Image-based Profiling Handbook
Home Page: https://cytomining.github.io/profiling-handbook/
License: Creative Commons Zero v1.0 Universal
Note this issue
broadinstitute/cmQTL#14 (comment)
We should specify software versions
grep -A 2 -B 2 -n "git clone" *.md
02-config.md-88-mkdir software
02-config.md-89-cd software
02-config.md:90:git clone https://github.com/broadinstitute/pe2loaddata.git
02-config.md:91:git clone https://github.com/CellProfiler/Distributed-CellProfiler.git
02-config.md-92-
02-config.md-93-cd ..
--
05-create-profiles.md-67-if [ -d pycytominer ]; then rm -rf pycytominer; fi
05-create-profiles.md-68-
05-create-profiles.md:69:git clone https://github.com/cytomining/pycytominer.git
05-create-profiles.md-70-
05-create-profiles.md-71-cd pycytominer
For cytominer-database, keep this issue in mind: https://stackoverflowteams.com/c/broad-institute-imaging-platform/questions/96 i.e. we need to pin pandas<2.1.4.
We needn't specify these versions explicitly because they are pinned via the AMI, but we should point to the versions used in the AMI.
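As a sketch only (not the handbook's actual install command), the pin mentioned above would look something like:

```shell
# Illustrative command fragment: install cytominer-database with pandas
# pinned below 2.1.4, per the Stack Overflow for Teams issue above.
pip install "pandas<2.1.4" cytominer-database
```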
We create the variable ${PLATES} in section 3.4 manually, using the text file plates_to_process.txt.
Instead, we could create the variable directly with
PLATES=$(ls ~/efs/${PROJECT_NAME}/${BATCH_ID}/images/ | cut -d '_' -f 1)
or we could write a text file containing the plate IDs using
ls ~/efs/${PROJECT_NAME}/${BATCH_ID}/images/ | cut -d '_' -f 1 >> ~/efs/${PROJECT_NAME}/workspace/scratch/${BATCH_ID}/platenames.txt
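A self-contained sketch of the same idea, using a local stand-in directory instead of the real EFS path (the plate names below are only examples):

```shell
# IMAGES_DIR stands in for ~/efs/${PROJECT_NAME}/${BATCH_ID}/images/.
IMAGES_DIR=./images_demo
mkdir -p "${IMAGES_DIR}/SQ00015167__2016-04-01" "${IMAGES_DIR}/SQ00015168__2016-04-02"

# Image folders are assumed to be named <plate-id>_<suffix>, so keeping
# the text before the first underscore yields the plate ID.
PLATES=$(ls "${IMAGES_DIR}" | cut -d '_' -f 1)
echo "${PLATES}"
```

Note this assumes plate IDs never contain underscores; if they do, the `cut` field delimiter would need rethinking.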
We want to address two issues here.
I will update this comment periodically as the strategy evolves. I realize this is not ideal because it upsets the chronology of discussions.
This is our current folder structure specified in the Profiling Handbook. It differs slightly from the folder structure specified in the Cell Painting Gallery. For this level of nesting (under workspace), the only discrepancy is metadata/platemaps (see #70); consensus and collated are currently missing in the Gallery, but that is not a discrepancy per se.
This is the proposed folder structure in the Profiling Handbook:
├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           ├── SQ00015167_normalized_feature_select.csv
│           └── SQ00015167_spherized.csv
├── collated (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── consensus (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── backend
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           └── SQ00015167.sqlite
├── load_data_csv
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── load_data.csv
│           └── load_data_with_illum.csv
├── log
├── metadata
│   └── 2016_04_01_a549_48hr_batch1
│       ├── barcode_platemap.csv
│       └── platemap
│           └── C-7161-01-LM6-006.txt
└── pipelines
* collated and consensus files are saved as parquet to allow fast loading.
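For orientation, a minimal Python sketch (folder, batch, and plate names are purely illustrative, taken from the tree above) that scaffolds this per-batch layout:

```python
from pathlib import Path

# Illustrative batch/plate names from the tree above.
BATCH = "2016_04_01_a549_48hr_batch1"
PLATE = "SQ00015167"

root = Path("workspace_demo")  # stand-in for the real workspace root

# Folders keyed per batch and per plate.
for folder in ["profiles", "backend", "load_data_csv"]:
    (root / folder / BATCH / PLATE).mkdir(parents=True, exist_ok=True)
# Folders keyed per batch only.
for folder in ["collated", "consensus", "metadata"]:
    (root / folder / BATCH).mkdir(parents=True, exist_ok=True)
# Flat folders.
for folder in ["log", "pipelines"]:
    (root / folder).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in root.iterdir()))
```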
We will version these folders by placing them inside the project repo
folder | generator |
---|---|
profiles | pycytominer |
collated | pycytominer |
consensus | pycytominer |
load_data_csv | pe2loaddata |
log | GNU parallel (when running various commands) |
metadata | manual |
pipelines | manual |
We will not version these folders:
folder | generator | reason |
---|---|---|
backend | cytominer-database | |
analysis | CellProfiler, Distributed-CellProfiler | redundant with SQLite backend |
images | Microscope | Never changes, and too big! |
Having the metadata broken out into platemaps and external is still relatively new, and the image analysis team has, to date, typically dealt only with platemap-level metadata. Thus, on AWS, metadata is stored as s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}, as evidenced in this sync command.
BUT, for the recipe, we eventually want it in s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}, as depicted in this tree structure. We currently handle that by just adding the extra platemaps level in during the sync command above.
BUT, that's not what the handbook implies: it implies that it should be uploaded in the tree structure. So we basically have a couple of choices: 1) explain everything I just wrote here in the handbook and show both tree structures there, or 2) from now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (though I don't know that anyone has checked the current bits of metadata over there), we want this new structure.
I'm guessing the preference here is 2, but we should make a decision and update the handbook accordingly.
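Option 2 would amount to a sync command along these lines (a sketch, not the actual recipe command; the variables are as used elsewhere in the handbook):

```shell
# Command fragment (requires AWS credentials): insert the extra
# "platemaps" level in the S3 destination so the uploaded layout matches
# the tree structure the handbook depicts.
aws s3 sync \
  ~/efs/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/ \
  s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}/
```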
git@github.com:CellProfiler/Distributed-CellProfiler.git
Once we fix it.
The VM mentioned in this manual currently has an older version of cytominer-database. Install the latest version, primarily to handle nans correctly.
pip install --upgrade cytominer-database
See this PR and comment cytomining/cytominer-database#104 (comment)
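This is not cytominer-database's actual code, but a generic illustration of the class of problem: missing values in per-object CSVs should survive the round trip into SQLite as NULLs, not as garbage strings. A sketch with pandas (toy column names):

```python
import math
import sqlite3

import pandas as pd

# Toy measurements table with a missing value, standing in for a
# CellProfiler per-object CSV.
df = pd.DataFrame(
    {"ObjectNumber": [1, 2], "Cells_AreaShape_Area": [512.0, float("nan")]}
)

with sqlite3.connect(":memory:") as conn:
    # pandas writes NaN as SQL NULL, so the missing value is preserved.
    df.to_sql("Cells", conn, index=False)
    out = pd.read_sql("SELECT * FROM Cells", conn)

# The NULL comes back as NaN in the float column.
assert math.isnan(out.loc[1, "Cells_AreaShape_Area"])
```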
(Tentatively as appendix B)
Probably extracted into a config file so we can otherwise share the AMI creation info.
Do we want to deploy the handbook for every pull request? For example, in #19, Travis builds and deploys the changes to the GitHub Pages site at every commit. We should consider performing a conditional build with Travis to deploy only when code is merged to master. Another relevant post.
In preselect, add justifications for why we use certain sections of the data to do different steps.
From: Beth Cimini
Date: Tue, Nov 30, 2021 at 6:49 AM
To: Shantanu Singh
FWIW, I don't think the "lift" is currently probably worth it, but if we did want to switch to JupyterBook somewhere down the road, it looks like it takes Rmd pages so it would be a fast start-up.
https://jupyterbook.org/file-types/jupytext.html?highlight=rmd
From: Shantanu Singh
Date: Tue, Nov 30, 2021 at 6:56 AM
To: Beth Cimini
Neat! The conversion would be a nice project (~2-3 days?) for someone who is interested in creating training material and learning about GitHub Actions.
From: Beth Cimini
Date: Tue, Nov 30, 2021 at 7:11 AM
To: Shantanu Singh
Since it can read Rmd files directly, and GH actions is already documented, assuming we stuck with the Rmd files we have in theory it's a 30 minute project; someone with not a ton of experience still could probably get it up and running in a day.
It doesn't look like there are any good tools (ala pandoc) for a straight Rmd-to-MyST conversion, but with the VSCode extensions for Rmd and for MyST both installed and a MyST cheat sheet, I still can't see it taking much more than a few hours.
From: Shantanu Singh
Date: Tue, Nov 30, 2021 at 7:15 AM
To: Beth Cimini
Good point – just 6 Rmd files, and it's mostly just the code blocks that will need conversion (if done by hand)
Originally posted by @shntnu in #64 (comment)
It took me a bit of troubleshooting to mount a volume as described in step 1.1. I eventually did get it to work, but the steps I used were not listed. It is possible (and even likely) that I am doing something slightly wrong that makes the instructions a bit off, but I thought it would be useful to document here:
The error I initially received upon login after SSH'ing into the instance was:
mount.nfs4: failed to resolve server name or service not known
I had initially thought this was because of improper region settings so I double checked these. They looked ok to me, so I looked into why I was getting this error on login.
I saw this line in the .bashrc:
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 us-east-1a.fs-3609f37f.efs.us-east-1.amazonaws.com:/ ~/efs
(I also tried swapping us-east-1a with us-east-1b, to no effect.)
However, the volume on the AWS gui still looked like it was assigned to the correct instance.
So, after adding my credentials via an aws configure command, I successfully mounted the volume by following these steps. It worked! 🎉
So, not sure to what extent this solution is specific to me (maybe I configured something slightly wrong) or if it belongs in the actual handbook. Either way I thought it was useful to document.
Currently imaging-platform-dev
The handbook uses mean for aggregating (i.e. creating level 3) as well as for collapsing (i.e. creating level 5). However, in other projects / papers / software, we decided to use median. This is a major issue and should be resolved!
Some notes on median vs mean:
- When this project was first executed in 2017, cytominer_scripts used median as default; this was later changed here: broadinstitute/cytominer_scripts#18 (more on this below).
- pycytominer uses median by default for aggregation.
- median performed better (that plot doesn't show mean).
- The switch to mean instead of median was to make cytominer_scripts/aggregate.R consistent with cytotools/aggregate.R. But it is unclear why cytotools (the new version of cytominer_scripts) used mean! I think this was because cytominer::aggregate used mean by default.
- pycytominer uses median by default (here and here), while the profiling handbook uses collate.py for the creation of the sqlite and the aggregation, and here is the key: collate.py hard-codes mean by default (here).
Is there a way to do that? That'd be ideal.
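The practical difference is easy to see with a toy pandas aggregation (illustrative column names; this is not pycytominer's or collate.py's actual code):

```python
import pandas as pd

# Toy single-cell table: one well, three cells, one outlier cell.
cells = pd.DataFrame({
    "Metadata_Well": ["A01", "A01", "A01"],
    "Cells_Intensity": [1.0, 1.2, 10.0],  # 10.0 is an outlier
})

# Level-3 profile via mean vs median: the outlier drags the mean
# well away from the bulk of the cells, but not the median.
mean_profile = cells.groupby("Metadata_Well").mean()
median_profile = cells.groupby("Metadata_Well").median()

print(float(mean_profile.loc["A01", "Cells_Intensity"]))    # ~4.07
print(float(median_profile.loc["A01", "Cells_Intensity"]))  # 1.2
```

This robustness to outlier cells is presumably why median performed better in the comparison mentioned above.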
The Cell Painting Gallery and the Profiling Handbook specify different nesting structures for images.
Should we try to resolve it by modifying the handbook to suit? Note that when deciding the folder structure for the gallery, we did start with the handbook structure and then modified it because we felt this new structure (used in Cell Painting Gallery) made more sense.
It's not mandatory to resolve this discrepancy because it can easily be handled during sync.
Cell Painting Gallery: https://github.com/broadinstitute/cellpainting-gallery/blob/0d63bf7db1c5db70675de37fe18577e3cb537e3c/folder_structure.md#images-folder-structure
└── <top-level>
    ├── images
    │   ├── YYYY_MM_DD_<batch-name>
    │   │   ├── illum
    │   │   │   ├── <plate-name>
    │   │   │   │   ├── <plate-name>_Illum<Channel>.npy
    │   │   │   │   └── <plate-name>_Illum<Channel>.npy
    │   │   │   └── <plate-name>
    │   │   └── images
    │   │       ├── <full-plate-name>
    │   │       └── <full-plate-name>
    │   └── YYYY_MM_DD_<batch-name>
    └── workspace
Profiling Handbook: https://github.com/cytomining/profiling-handbook/blob/2c4dc1ba62ef5141ceb789494d450f6ba14fe05e/06-appendix.md#directory-structure
└── <top-level>
    ├── YYYY_MM_DD_<batch-name>
    │   ├── illum
    │   │   ├── <plate-name>
    │   │   │   ├── <plate-name>_Illum<Channel>.npy
    │   │   │   └── <plate-name>_Illum<Channel>.npy
    │   │   └── <plate-name>
    │   └── images
    │       ├── <full-plate-name>
    │       └── <full-plate-name>
    ├── YYYY_MM_DD_<batch-name>
    └── workspace
A links page pointing to things like the gallery, videos on profiling and Morpheus, and various methods and protocol papers.