cytominer's Introduction


cytominer

Typical morphological profiling datasets have millions of cells and hundreds of features per cell. When working with such data, you typically need to

  • clean the data

  • normalize the features so that they are comparable across experiments

  • transform the features so that their distributions are well-behaved (i.e., bring them in line with assumptions we want to make about their distributions)

  • select features based on their quality

  • aggregate the single-cell data, if needed

The cytominer package makes these steps fast and easy.
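In code, the workflow looks roughly like this (a sketch only; the exact arguments of each verb vary by operation, so consult the function documentation, and the `variables`/`strata` column sets here are illustrative):

```r
library(magrittr)

# A sketch of the pipeline; `population` is a single-cell data frame or a
# dplyr-backed database table, and the column names are illustrative.
variables <- c("AreaShape_Area", "Intensity_MeanIntensity_DNA")
strata    <- c("Metadata_Plate", "Metadata_Well")

profiles <-
  population %>%
  cytominer::normalize(variables, strata = "Metadata_Plate",
                       sample = population, operation = "standardize") %>%
  cytominer::transform(variables, operation = "generalized_log") %>%
  cytominer::variable_select(variables, sample = population,
                             operation = "variance_threshold") %>%
  cytominer::aggregate(variables, strata, operation = "mean")
```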

Installation

You can install cytominer from CRAN:

install.packages("cytominer")

Or, install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("cytomining/cytominer", dependencies = TRUE, build_vignettes = TRUE)

Occasionally, depending on your system, the Suggests dependencies may not be installed; in that case, install them explicitly.

Example

See vignette("cytominer-pipeline") for a basic example of using cytominer to analyze a morphological profiling dataset.

cytominer's People

Contributors

0x00b1, bethac07, cells2numbers, gwaybio, mcquin, mrohban, shntnu, vincerubinetti


cytominer's Issues

Implement generalized log to transform feature values

glog <- function(x, c = 1) log((x + sqrt(x^2 + c^2)) / 2)

Described in Pg. 8 of http://bioconductor.org/packages/release/data/experiment/vignettes/DmelSGI/inst/doc/DmelSGI.pdf

"For c = 0, the function is equivalent to an ordinary logarithm transformation. For c > 0, the function is smooth for all values of x (including 0 and negative values), avoiding the singularity of the ordinary logarithm at x = 0, but still approximately equivalent to the ordinary logarithm for x ≫ c. For each feature, we chose c to be the 3%-quantile of the feature’s empirical distribution."
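A runnable version of the transform, with c chosen per feature as the 3%-quantile of its empirical distribution, as the quoted vignette suggests:

```r
# Generalized log: smooth at 0 for c > 0, approximately log(x) for x >> c.
glog <- function(x, c = 1) log((x + sqrt(x^2 + c^2)) / 2)

# Per-feature choice of c: the 3%-quantile of the feature's values.
x <- c(-2, 0, 1, 10, 100, 1000)
c0 <- quantile(x, probs = 0.03)
glog(x, c0)
```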

Evaluate feature distributions

It would be good if we could get, for every feature (or combination of correlated features), a report on range and distribution: summary statistics and outliers. Something that analysts can use at a high level to review the data, so more likely a table than a report.

Here's how we plan to do it:
Create summary statistics and divide the features into these categories:

"almost normally distributed"
"single peaked but skewed"
"multi-modal or discrete".
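A sketch of what such a per-feature summary and categorization could look like; the specific thresholds and the crude discreteness proxy below are illustrative assumptions, not part of cytominer:

```r
# Summarize one feature and bin it into the three categories above.
# The cutoffs (unique-value count, skewness threshold) are illustrative.
summarize_feature <- function(x) {
  x <- x[is.finite(x)]
  skew <- mean((x - mean(x))^3) / sd(x)^3       # sample skewness
  n_values <- length(unique(round(x, 2)))       # crude discreteness proxy
  category <-
    if (n_values < 10) {
      "multi-modal or discrete"
    } else if (abs(skew) < 0.5) {
      "almost normally distributed"
    } else {
      "single peaked but skewed"
    }
  data.frame(mean = mean(x), sd = sd(x), skewness = skew, category = category)
}
```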

Move packages from Imports to Suggests

These packages are in Imports; move them to Suggests and test whether devtools::install_github("cytomining/cytominer", dependencies = TRUE, build_vignettes = TRUE) still works ok (i.e. is able to build vignettes)

* checking dependencies in R code ... NOTE
Namespaces in Imports field not imported from:
  ‘DBI’ ‘RSQLite’ ‘dbplyr’ ‘knitr’ ‘lazyeval’ ‘readr’ ‘rmarkdown’
  ‘stringr’ ‘testthat’
  All declared Imports should be used.    

Pool multiple backend instances

Enable pooling multiple backend instances so that calculations can be done across multiple plates. Currently, we can handle this by merging them (which is easy because we use UUIDs for primary keys), but this is not efficient.

Filter out bad wells using image QC metrics

Image QC metrics may be generated by a CellProfiler pipeline. Implement a filter to (optionally) exclude these wells or sites. Note that typical QC pipelines create multiple QC flags but not all should be used to filter out images, hence include a parameter that specifies which flags to use.
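A sketch of such a filter; the flag columns named here are hypothetical examples of CellProfiler QC output, and the convention that 0 means "not flagged" is an assumption:

```r
library(dplyr)

# Keep only rows whose selected QC flag columns are all unset (== 0).
# `qc_flags` names the image-level flag columns to respect.
filter_qc <- function(population, qc_flags) {
  population %>%
    filter(if_all(all_of(qc_flags), ~ .x == 0))
}

# e.g. filter_qc(population, c("Metadata_QC_Blurry", "Metadata_QC_Saturated"))
```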

rows with identical observations are removed during normalization

In the example below, a population with 8 rows is normalized. The normalized version of the data includes only 7 rows: rows with identical entries are collapsed into one.

This error will probably not happen during the analysis of profiling data; however, in the simple test data it shows up.

population <- tibble::data_frame(
   Metadata_group = c("control", "control", "control", "control", "experiment", "experiment", "experiment", "experiment"),
   Metadata_batch = c("a", "a", "b", "b", "a", "a", "b", "b"),
   AreaShape_Area = c(10, 12, 15, 16, 8, 9, 7, 7)
 ) %>%
  print

variables <- c("AreaShape_Area")
strata <- c("Metadata_batch")
sample <- population %>% dplyr::filter(Metadata_group == "control")

normalized <- cytominer::normalize(population, variables, strata, sample, operation = "standardize") %>% print

Population:

| Metadata_group<chr> | Metadata_batch<chr> | AreaShape_Area<dbl> |
| --- | --- | --- |
| control | a | 10 |
| control | a | 12 |
| control | b | 15 |
| control | b | 16 |
| experiment | a | 8 |
| experiment | a | 9 |
| experiment | b | 7 |
| experiment | b | 7 |

normalized:

| Metadata_group<chr> | Metadata_batch<chr> | AreaShape_Area<dbl> |
| --- | --- | --- |
| experiment | b | -12.0208153 |
| control | b | 0.7071068 |
| control | b | -0.7071068 |
| experiment | a | -1.4142136 |
| experiment | a | -2.1213203 |
| control | a | 0.7071068 |
| control | a | -0.7071068 |

Unused argument in correlation_threshold

The function

correlation_threshold <- function(population, variables, sample, cutoff = 0.90,
                                  method = "pearson") 

defines the argument population, but population is never used in the function body.

Question: is the population argument used to have a more unique signature for all cytominer functions or can it be removed?

Evaluate how to transform multimodal distributions

#21 may work ok for unimodal distributions. But how should we handle transformation of multimodal distributions? Should they be handled uniquely? Note that multimodality in a sample is special, but not much more special than a sample with high variance; both indicate heterogeneity. We don’t want to lose information that helps to separate populations.

Retrieve data into SQLite backend from MySQL

CellProfiler's ExportToDatabase creates MySQL tables – typically Per_Image and Per_Object. Load this into a SQLite database.

In our current data analysis workflow, cache.py performs the equivalent of this operation (it stores the data into npy files)

2nd order moment profiling

Replacing "mean" with "covariance" as the summarization function in profiling has recently been shown to be advantageous. It would be nice to implement this and use random projections to make it scale efficiently to a high number of features.
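One way this could be sketched in base R: project the cells-by-features matrix down with a random Gaussian matrix, then use the (much smaller) covariance of the projection as the profile. The function name and the choice of k are assumptions for illustration:

```r
# Sketch of 2nd-order (covariance) profiling with random projections.
# `X` is a cells-by-features matrix for one sample; k << ncol(X) is assumed.
covariance_profile <- function(X, k = 10, seed = 42) {
  set.seed(seed)
  R <- matrix(rnorm(ncol(X) * k), ncol = k) / sqrt(k)  # random projection
  Z <- X %*% R                                         # project cells
  C <- cov(Z)                                          # k x k covariance
  C[upper.tri(C, diag = TRUE)]                         # vectorized profile
}
```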

drop_na_rows does not work on data frames

cytominer::drop_na_rows does not work on a data frame

Example:

 population <- tibble::data_frame(
   Metadata_group = c("control", "control","control","control","experiment","experiment","experiment","experiment"),
   Metadata_batch = c("a","a","b","b","a","a","b","b"),
   AreaShape_Area = c(10,12,15,16,8,8,7,7),
   AreaShape_length = c(2,3,NA,NA,4,5,1,5)
 )
variables <- c('AreaShape_Area','AreaShape_length')

It does not work on a data frame:

na_per_row <- drop_na_rows(population, variables)

Error in filter_impl(.data, quo) : 
  Evaluation error: object 'Sepal.Length' not found. 

It works using a SQL database:

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
population <- dplyr::copy_to(db, population)
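Until this is fixed, a base-R workaround (a sketch, not part of cytominer's API) is to keep only rows that are complete in the selected variables:

```r
# Keep rows with no NA in the selected variables (base R only).
drop_na_rows_df <- function(population, variables) {
  population[stats::complete.cases(population[, variables, drop = FALSE]), , drop = FALSE]
}
```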

Evaluate mRMR for feature selection

Find a minimal set of features that results in the best correlation of replicates.
Evaluate mRMR (minimum redundancy, maximum relevance) for this purpose.

Similar to #18 except that this is multivariate

Document function signatures

@mcquin had said:
It would be nice to have information in the README or a contributing document which describes the standardized function signatures.

Decide on data structure for tracking datasets

@cells2numbers said:

I am implementing a simple csv export for my Matlab tracking software and I am looking for your advice
The csv is meant for importing into cytominer. Are there any conventions / restrictions for the csv I should know about?

In general, I use the following hierarchy / data structure:

  1. 2D info. In each frame I have an image segmentation and each cell is described by its morphology (area, position, length, mean intensity, etc.)
  2. 2D + t info. After cell tracking each neutrophil (cell) is represented as a trajectory. This trajectory is described using spatiotemporal information (displacement, velocity, length of the trajectory in frames, randomness of movement etc.).
  3. Experiment Info. Each experiment is described using the mean values of the spatiotemporal information. Additional parameters like the x- and y-FMI (forward migration index) are calculated; these parameters describe for example the portion of cells moving towards a specific site / chemoattractant.

To keep the implementation flexible, I plan to export the 2D information of the cells per frame. For each cell, the corresponding trajectory id is stored as well; this makes it easy to calculate features like velocity in cytominer and should allow easy import of other file formats / data.

Using this setup, the image segmentation and tracking performed using Matlab or CellProfiler are independent of the analysis in cytominer. I only need a good, flexible way to store the tracking data in R.
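For concreteness, one possible flat layout for such an export (all column names here are hypothetical; cytominer does not mandate a schema):

```r
# One row per cell per frame; TrackObjects_Label links the rows of one
# trajectory, so per-track features (velocity etc.) can be derived later.
tracks <- data.frame(
  Metadata_frame     = c(1, 2, 1, 2),
  TrackObjects_Label = c(1, 1, 2, 2),
  AreaShape_Area     = c(120, 118, 95, 97),
  Location_Center_X  = c(10.2, 12.8, 40.1, 39.5),
  Location_Center_Y  = c(5.0, 5.4, 22.3, 23.1)
)
```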

Extract subpopulations

Given a population and a reference set, extract subpopulations and use them to create profiles of the two sets.

Implement PCA

Implement PCA on sparse sampling of single cell data.

In our current data analysis workflow, decomp.py performs the equivalent of this operation
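A minimal sketch using base R's prcomp: fit the decomposition on a random subsample of cells, then project all cells into the fitted basis (the function name and sampling fraction are illustrative assumptions):

```r
# Fit PCA on a sparse sample of cells, then project every cell.
# `X` is a cells-by-features numeric matrix; `frac` is illustrative.
pca_on_sample <- function(X, frac = 0.1, seed = 42) {
  set.seed(seed)
  idx <- sample(nrow(X), size = max(2, ceiling(frac * nrow(X))))
  fit <- prcomp(X[idx, , drop = FALSE], center = TRUE, scale. = TRUE)
  predict(fit, X)  # scores for every cell in the sampled-fit basis
}
```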

Non-standard evaluation (NSE) in dplyr in drop_na_rows

devtools gives following note in checking R code for possible problems:

* checking R code for possible problems ... NOTE
drop_na_rows: no visible binding for global variable ‘value’
drop_na_rows: no visible binding for global variable ‘key’
drop_na_rows: no visible binding for global variable ‘rowname_temp’
Undefined global functions or variables:
  key rowname_temp value

We can get rid of it simply by using NSE operations in dplyr.
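For reference, a sketch of one common way to silence this NOTE: refer to columns via the `.data` pronoun instead of bare names (declaring the names with `utils::globalVariables()` is another option). The example data here is illustrative:

```r
library(dplyr)

df <- tibble::tibble(key = c("a", "b"), value = c(1, NA))

# Referring to columns via .data instead of bare names avoids the
# "no visible binding for global variable" NOTE in R CMD check.
cleaned <- df %>% filter(!is.na(.data$value))
```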

Normalize cell features by z-scoring - single plate

For each feature, compute mean and s.d. per plate across all the reference cells in the plate and use that to compute z-scores. Reference cells could be either a random sampling of all cells on the plate, or cells from some specific treatments on the plate (e.g. DMSO)

We previously used https://github.com/CellProfiler/CellProfiler-Analyst/blob/master/cpa/profiling/normalization.py for this operation

@shntnu needs to decide whether this operation should be eager or lazy.
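The operation described above can be sketched with dplyr; the `Metadata_Plate` and `Metadata_treatment` column names are illustrative, and the reference-well convention is an assumption:

```r
library(dplyr)

# Sketch of per-plate z-scoring against reference wells (e.g. DMSO).
# For each plate, compute mean and sd of each feature over the reference
# cells, then standardize all cells on that plate.
normalize_by_plate <- function(population, variables, reference = "DMSO") {
  population %>%
    group_by(Metadata_Plate) %>%
    mutate(across(all_of(variables), ~ {
      ref <- .x[Metadata_treatment == reference]  # reference cells on this plate
      (.x - mean(ref)) / sd(ref)                  # z-score
    })) %>%
    ungroup()
}
```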
