cytominer's Introduction


cytominer

Typical morphological profiling datasets have millions of cells and hundreds of features per cell. When working with such data, you typically need to

  • clean the data

  • normalize the features so that they are comparable across experiments

  • transform the features so that their distributions are well-behaved (i.e., bring them in line with assumptions we want to make about their distributions)

  • select features based on their quality

  • aggregate the single-cell data, if needed

The cytominer package makes these steps fast and easy.
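In code, the workflow looks roughly like this (a sketch only; the exact arguments of each verb vary by operation, so consult the function documentation, and the `variables`/`strata` column sets here are illustrative):

```r
library(magrittr)

# A sketch of the pipeline; `population` is a single-cell data frame or a
# dplyr-backed database table, and the column names are illustrative.
variables <- c("AreaShape_Area", "Intensity_MeanIntensity_DNA")
strata    <- c("Metadata_Plate", "Metadata_Well")

profiles <-
  population %>%
  cytominer::normalize(variables, strata = "Metadata_Plate",
                       sample = population, operation = "standardize") %>%
  cytominer::transform(variables, operation = "generalized_log") %>%
  cytominer::variable_select(variables, sample = population,
                             operation = "variance_threshold") %>%
  cytominer::aggregate(variables, strata, operation = "mean")
```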

Installation

You can install cytominer from CRAN:

install.packages("cytominer")

Or, install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("cytomining/cytominer", dependencies = TRUE, build_vignettes = TRUE)

Occasionally, depending on your system, the Suggests dependencies may not be installed; in that case, install them explicitly.

Example

See vignette("cytominer-pipeline") for a basic example of using cytominer to analyze a morphological profiling dataset.

cytominer's People

Contributors

0x00b1, bethac07, cells2numbers, gwaybio, mcquin, mrohban, shntnu, vincerubinetti


cytominer's Issues

Implement generalized log to transform feature values

glog <- function(x, c = 1) log((x + sqrt(x^2 + c^2)) / 2)

Described in Pg. 8 of http://bioconductor.org/packages/release/data/experiment/vignettes/DmelSGI/inst/doc/DmelSGI.pdf

"For c = 0, the function is equivalent to an ordinary logarithm transformation. For c > 0, the function is smooth for all values of x (including 0 and negative values), avoiding the singularity of the ordinary logarithm at x = 0, but still approximately equivalent to the ordinary logarithm for x ≫ c. For each feature, we chose c to be the 3%-quantile of the feature’s empirical distribution."
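A runnable version of the transform, with c chosen per feature as the 3%-quantile of its empirical distribution, as the quoted vignette suggests:

```r
# Generalized log: smooth at 0 for c > 0, approximately log(x) for x >> c.
glog <- function(x, c = 1) log((x + sqrt(x^2 + c^2)) / 2)

# Per-feature choice of c: the 3%-quantile of the feature's values.
x <- c(-2, 0, 1, 10, 100, 1000)
c0 <- quantile(x, probs = 0.03)
glog(x, c0)
```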

Evaluate feature distributions

It would be good if we could get, for every feature (or combination of correlated features), a report on range and distribution: summary statistics and outliers. Something that analysts can use at a high level to review the data, so more likely a table than a report.

Here's how we plan to do it:
Create summary statistics and divide the features into these categories:

"almost normally distributed"
"single peaked but skewed"
"multi-modal or discrete".
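A sketch of what such a per-feature summary and categorization could look like; the specific thresholds and the crude discreteness proxy below are illustrative assumptions, not part of cytominer:

```r
# Summarize one feature and bin it into the three categories above.
# The cutoffs (unique-value count, skewness threshold) are illustrative.
summarize_feature <- function(x) {
  x <- x[is.finite(x)]
  skew <- mean((x - mean(x))^3) / sd(x)^3       # sample skewness
  n_values <- length(unique(round(x, 2)))       # crude discreteness proxy
  category <-
    if (n_values < 10) {
      "multi-modal or discrete"
    } else if (abs(skew) < 0.5) {
      "almost normally distributed"
    } else {
      "single peaked but skewed"
    }
  data.frame(mean = mean(x), sd = sd(x), skewness = skew, category = category)
}
```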

Move packages from Imports to Suggests

These packages are in Imports; move them to Suggests and test whether devtools::install_github("cytomining/cytominer", dependencies = TRUE, build_vignettes = TRUE) still works ok (i.e. is able to build vignettes)

* checking dependencies in R code ... NOTE
Namespaces in Imports field not imported from:
  ‘DBI’ ‘RSQLite’ ‘dbplyr’ ‘knitr’ ‘lazyeval’ ‘readr’ ‘rmarkdown’
  ‘stringr’ ‘testthat’
  All declared Imports should be used.    

Pool multiple backend instances

Enable pooling multiple backend instances so that calculations can be done across multiple plates. Currently, we can handle this by merging them (which is easy because we use UUIDs for primary keys), but this is not efficient.

Filter out bad wells using image QC metrics

Image QC metrics may be generated by a CellProfiler pipeline. Implement a filter to (optionally) exclude these wells or sites. Note that typical QC pipelines create multiple QC flags but not all should be used to filter out images, hence include a parameter that specifies which flags to use.
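A sketch of such a filter; the flag columns named here are hypothetical examples of CellProfiler QC output, and the convention that 0 means "not flagged" is an assumption:

```r
library(dplyr)

# Keep only rows whose selected QC flag columns are all unset (== 0).
# `qc_flags` names the image-level flag columns to respect.
filter_qc <- function(population, qc_flags) {
  population %>%
    filter(if_all(all_of(qc_flags), ~ .x == 0))
}

# e.g. filter_qc(population, c("Metadata_QC_Blurry", "Metadata_QC_Saturated"))
```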

rows with identical observations are removed during normalization

In the example below, a population with 8 rows is normalized. The normalized version of the data includes only 7 rows: rows with identical entries are collapsed into one.

This error will probably not happen during the analysis of profiling data; however, in the simple test data it shows up.

population <- tibble::data_frame(
   Metadata_group = c("control", "control", "control", "control", "experiment", "experiment", "experiment", "experiment"),
   Metadata_batch = c("a", "a", "b", "b", "a", "a", "b", "b"),
   AreaShape_Area = c(10, 12, 15, 16, 8, 9, 7, 7)
 ) %>%
  print

variables <- c("AreaShape_Area")
strata <- c("Metadata_batch")
sample <- population %>% dplyr::filter(Metadata_group == "control")

normalized <- cytominer::normalize(population, variables, strata, sample, operation = "standardize") %>% print

Population:

| Metadata_group<chr> | Metadata_batch<chr> | AreaShape_Area<dbl> |
| --- | --- | --- |
| control | a | 10 |
| control | a | 12 |
| control | b | 15 |
| control | b | 16 |
| experiment | a | 8 |
| experiment | a | 9 |
| experiment | b | 7 |
| experiment | b | 7 |

normalized:

| Metadata_group<chr> | Metadata_batch<chr> | AreaShape_Area<dbl> |
| --- | --- | --- |
| experiment | b | -12.0208153 |
| control | b | 0.7071068 |
| control | b | -0.7071068 |
| experiment | a | -1.4142136 |
| experiment | a | -2.1213203 |
| control | a | 0.7071068 |
| control | a | -0.7071068 |

Unused argument in correlation_threshold

The function

correlation_threshold <- function(population, variables, sample, cutoff = 0.90,
                                  method = "pearson") 

defines the argument population, but population is never used in the function body.

Question: is the population argument used to have a more unique signature for all cytominer functions or can it be removed?

Evaluate how to transform multimodal distributions

#21 may work ok for unimodal distributions. But how should we handle transformation of multimodal distributions? Should they be handled uniquely? Note that multimodality in a sample is special, but not much more special than a sample with high variance; both indicate heterogeneity. We don’t want to lose information that helps to separate populations.

Retrieve data into SQLite backend from MySQL

CellProfiler's ExportToDatabase creates MySQL tables – typically Per_Image and Per_Object. Load this into a SQLite database.

In our current data analysis workflow, cache.py performs the equivalent of this operation (it stores the data into npy files)

2nd order moment profiling

Replacing "mean" with "covariance" as the summarization function in profiling has recently been shown to be advantageous. It would be nice to implement this and use random projections to make it scale efficiently to a high number of features.
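One way this could be sketched in base R: project the cells-by-features matrix down with a random Gaussian matrix, then use the (much smaller) covariance of the projection as the profile. The function name and the choice of k are assumptions for illustration:

```r
# Sketch of 2nd-order (covariance) profiling with random projections.
# `X` is a cells-by-features matrix for one sample; k << ncol(X) is assumed.
covariance_profile <- function(X, k = 10, seed = 42) {
  set.seed(seed)
  R <- matrix(rnorm(ncol(X) * k), ncol = k) / sqrt(k)  # random projection
  Z <- X %*% R                                         # project cells
  C <- cov(Z)                                          # k x k covariance
  C[upper.tri(C, diag = TRUE)]                         # vectorized profile
}
```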

drop_na_rows does not work on data frames

cytominer::drop_na_rows does not work on a data frame

Example:

 population <- tibble::data_frame(
   Metadata_group = c("control", "control","control","control","experiment","experiment","experiment","experiment"),
   Metadata_batch = c("a","a","b","b","a","a","b","b"),
   AreaShape_Area = c(10,12,15,16,8,8,7,7),
   AreaShape_length = c(2,3,NA,NA,4,5,1,5)
 )
variables <- c('AreaShape_Area','AreaShape_length')

It does not work on a data frame:

na_per_row <- drop_na_rows(population, variables)

Error in filter_impl(.data, quo) : 
  Evaluation error: object 'Sepal.Length' not found. 

It works using a SQL database:

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
population <- dplyr::copy_to(db, population)
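Until this is fixed, a base-R workaround (a sketch, not part of cytominer's API) is to keep only rows that are complete in the selected variables:

```r
# Keep rows with no NA in the selected variables (base R only).
drop_na_rows_df <- function(population, variables) {
  population[stats::complete.cases(population[, variables, drop = FALSE]), , drop = FALSE]
}
```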

Evaluate mRMR for feature selection

Find a minimal set of features that results in the best correlation of replicates.
Evaluate mRMR (minimum redundancy, maximum relevance) for this purpose.

Similar to #18 except that this is multivariate

Document function signatures

@mcquin had said:
It would be nice to have information in the README or a contributing document which describes the standardized function signatures.

Decide on data structure for tracking datasets

@cells2numbers said:

I am implementing a simple csv export for my Matlab tracking software and I am looking for your advice
The csv is meant for importing into cytominer. Are there any conventions / restrictions for the csv I should know about?

In general, I use the following hierarchy / data structure:

  1. 2D info. In each frame I have an image segmentation and each cell is described by its morphology (area, position, length, mean intensity, etc.)
  2. 2D + t info. After cell tracking each neutrophil (cell) is represented as a trajectory. This trajectory is described using spatiotemporal information (displacement, velocity, length of the trajectory in frames, randomness of movement etc.).
  3. Experiment Info. Each experiment is described using the mean values of the spatiotemporal information. Additional parameters like the x- and y-FMI (forward migration index) are calculated; these parameters describe for example the portion of cells moving towards a specific site / chemoattractant.

To keep the implementation flexible, I plan to export the 2D information of the cells per frame. For each cell, the corresponding trajectory id is stored as well; this makes it easy to calculate features like velocity in cytominer and should allow easy import of other file formats / data.

Using this setup, the image segmentation and tracking performed using Matlab or CellProfiler are independent of the analysis in cytominer. I only need a good, flexible way to store the tracking data in R.
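For concreteness, one possible flat layout for such an export (all column names here are hypothetical; cytominer does not mandate a schema):

```r
# One row per cell per frame; TrackObjects_Label links the rows of one
# trajectory, so per-track features (velocity etc.) can be derived later.
tracks <- data.frame(
  Metadata_frame     = c(1, 2, 1, 2),
  TrackObjects_Label = c(1, 1, 2, 2),
  AreaShape_Area     = c(120, 118, 95, 97),
  Location_Center_X  = c(10.2, 12.8, 40.1, 39.5),
  Location_Center_Y  = c(5.0, 5.4, 22.3, 23.1)
)
```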

Extract subpopulations

Given a population and a reference set, extract subpopulations and use them to create profiles of the two sets.

Implement PCA

Implement PCA on sparse sampling of single cell data.

In our current data analysis workflow, decomp.py performs the equivalent of this operation
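A minimal sketch using base R's prcomp: fit the decomposition on a random subsample of cells, then project all cells into the fitted basis (the function name and sampling fraction are illustrative assumptions):

```r
# Fit PCA on a sparse sample of cells, then project every cell.
# `X` is a cells-by-features numeric matrix; `frac` is illustrative.
pca_on_sample <- function(X, frac = 0.1, seed = 42) {
  set.seed(seed)
  idx <- sample(nrow(X), size = max(2, ceiling(frac * nrow(X))))
  fit <- prcomp(X[idx, , drop = FALSE], center = TRUE, scale. = TRUE)
  predict(fit, X)  # scores for every cell in the sampled-fit basis
}
```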

Non-standard evaluation (NSE) in dplyr in drop_na_rows

devtools gives following note in checking R code for possible problems:

* checking R code for possible problems ... NOTE
drop_na_rows: no visible binding for global variable ‘value’
drop_na_rows: no visible binding for global variable ‘key’
drop_na_rows: no visible binding for global variable ‘rowname_temp’
Undefined global functions or variables:
  key rowname_temp value

We can get rid of it simply by using NSE operations in dplyr.
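For reference, a sketch of one common way to silence this NOTE: refer to columns via the `.data` pronoun instead of bare names (declaring the names with `utils::globalVariables()` is another option). The example data here is illustrative:

```r
library(dplyr)

df <- tibble::tibble(key = c("a", "b"), value = c(1, NA))

# Referring to columns via .data instead of bare names avoids the
# "no visible binding for global variable" NOTE in R CMD check.
cleaned <- df %>% filter(!is.na(.data$value))
```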

Normalize cell features by z-scoring - single plate

For each feature, compute mean and s.d. per plate across all the reference cells in the plate and use that to compute z-scores. Reference cells could be either a random sampling of all cells on the plate, or cells from some specific treatments on the plate (e.g. DMSO)

We previously used https://github.com/CellProfiler/CellProfiler-Analyst/blob/master/cpa/profiling/normalization.py for this operation

@shntnu needs to decide whether this operation should be eager or lazy.
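The operation described above can be sketched with dplyr; the `Metadata_Plate` and `Metadata_treatment` column names are illustrative, and the reference-well convention is an assumption:

```r
library(dplyr)

# Sketch of per-plate z-scoring against reference wells (e.g. DMSO).
# For each plate, compute mean and sd of each feature over the reference
# cells, then standardize all cells on that plate.
normalize_by_plate <- function(population, variables, reference = "DMSO") {
  population %>%
    group_by(Metadata_Plate) %>%
    mutate(across(all_of(variables), ~ {
      ref <- .x[Metadata_treatment == reference]  # reference cells on this plate
      (.x - mean(ref)) / sd(ref)                  # z-score
    })) %>%
    ungroup()
}
```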
