Code Monkey home page Code Monkey logo

mlcp's Introduction

Machine learning pipeline for CellProfiler output

This is a vignette of performing supvervised and unsupervised classification of cellular phenotypes, as a downstream step from CellProfiler analysis (output as SQLite database)

Installation

Core requirement: CellProfiler 2.2+, Python and its machine learning libraries Scikit-learn

  1. Install Python (>=2.7.9 or >=3.4)

    In case you already had older version of Python that does not include pip, and you do not wish to upgrade, please follow the instruction here

  2. Open a command line window

    In Linux/Mac OS, open the "Terminal".

    In Windows, open the "cmd" (as administrator) here's how-to. Windows user may also have to first navigate to:

    cd C:\Users\YOUR_USER_NAME\AppData\Local\Programs\Python\Python36\Scripts
    

    Note:

    Conveniently, Windows user may want to add python installation folder to the %PATH variable. Here is a powershell script to do it. BE SURE TO ADD THE ";" before adding the new path.

    $ [Environment]::SetEnvironmentVariable("Path", $env:Path + ";C:\Users\<username>\AppData\Roaming\Python\Python35\Scripts", [EnvironmentVariableTarget]::User)
    
  3. In Terminal/cmd, type :

    pip3 install --upgrade pip

    In OSX/Linux you may have to use "sudo", e.g:

    sudo pip3 install --upgrade pip

    If you prefer to use Python 2, then run with "pip" instead of "pip3", e.g.:

    sudo pip install --upgrade pip
  4. If success, continue to run the following commands, one line at a time:

    pip3 install numpy
    pip3 install scipy
    pip3 install sklearn
    pip3 install pandas
    pip3 install matplotlib
    pip3 install jupyter

    Note:

    Windows user may need to manually download numpy package from this website

    cd C:\Users\YOUR_USER_NAME\AppData\Local\Programs\Python\Python36\Scripts
    pip3 install /YOUR_download_location/<package-name>.whl
    
  5. Once done, test if you can use jupyter notebook. In Terminal or cmd, type:

    jupyter notebook

    If it opens an interface in your default web-browser, you’re ready!

    If you have issues running jupyter, please visit Jupyter website

    If jupyter notebook is not available, the vignette can still be used in command line (see Use)

Use

  1. Feature extraction by CellProfiler

    Starting from raw images input, you will first need to perform CellProfiler analysis on those images. In this vignette, you can use the example pipeline located at CPpipeline/cellcycle.cppipe to analyze the example images provided in folder images. Read more about these images and the study.

    Input of this step:

    * raw images
    

    Output of this step:

    * SQLite database .DB files (preferably, see note)    
    

    Note 1:

    To validate the result of classification, each object often needs to be labelled with a class name. In this example, the objects are automatically labelled according to the names of the immediate folders containing its images. In CellProfiler terminology, it's called Image_Metadata_folder.

    For instance, a folder structure of a drug screen:

    └── Experiment_7817
        └── Positive_control
            └── B12_s1_w1.tif
            └── B12_s1_w2.tif
            └── B13_s1_w1.tif
            └── B13_s1_w2.tif                                  
        └── Treatment_1
            └── A1_s1_w1.tif
            └── A1_s1_w2.tif
            └── A3_s1_w1.tif
            └── A3_s1_w2.tif      
    

    Here Positive_control and Treatment_1 will be used as the labels for any identified objects from the images of wells "B" and "A", respectively. The regular expression rule used in CellProfiler pipeline will look for the prefix "Experiment" as in "Experiment_7817" of the parental folder. You can tune CellProfiler pipeline to fit your own use of metadata.

    Note 2:

    The machine learning script at its current stage will need to be placed in a folder at the same level with the CellProfiler output folder, i.e. “CPOut”.

    └── MLCP
        └── CPOutput
            └── DefaultDB_train.db
            └── DefaultDB_test.db
        └── MachineLearning
            └── MLCP.py
            └── MLCP.ipynb    
    

    The script will also look for files named DefaultDB_train.db and DefaultDB_test.db , which are SQLite databases, inside CPOut. Please consider this if you need to change the name of the output folder, and name and type of CellProfiler output database. If you wish, you can change these path and names easily by editing the script itself using any text editor.

    If you prefer to use .CSV outputs, please visit another vignette of the same work.

  2. Feature selection

    CellProfiler may extract large number of features. For classical machine learning, it is advisable to first remove the zero-variance features (same values for every objects) as well as redundant features (highly correlated features). This preprocessing step is demonstrated in the first part of MachineLeaning/MLCP.ipynb using tree-based feature selection method. You can easily modify this module to use any other feature selection methods.

    Input of this step:

    * SQLite database .DB files (from step 1)
    

    Output of this step:

    * Histogram of feature ranking (.PNG)
    * Data table of objects with selected features (.TXT files)
    * Metadata labels of objects (.TSV file)
    

    Feature ranking

    Alternatively, Cytominer is an R package that can perform various preprocessing tasks for large morphological profilling data. You are encouraged to visit its example use here

  3. Supervised learning / classification

    Preprocessed feature space can then be used to predict the cellular phenotypes, as demonstrated in the later part of MachineLeaning/MLCP.ipynb.

    Note : Supervised classification typically requires ground-truth annotation. In this example, the cells are manually sorted into folders with names correspond to the phases of cell cycle. CellProfiler pipeline can be tuned to automatically collect the names of these folders as ground-truth labels of the cells.

    Input of this step:

    * preprocessed feature space (from step 2) 
    

    Output of this step:

    * Plot of the recalls (.PNG), i.e. percent of cells correctly classified
    * Confusion matrix (.PNG and .TXT) 
    * Weights of the trained models (.SAV)
    

    Confusion matrix

    Note : This task can be computationally heavy, depends on how large your data is. Please save your current works, free your computer CPU and memory before running the script.

    To run the script (for step 2 and 3), use Jupyter notebook for MachineLeaning/MLCP.ipynb

    jupyter notebook

    or command line for MachineLeaning/MLCP.py (if jupyter is not available):

    cd /path_to_YOUR_folder/MachineLearning
    python MLCP.py
  4. Unsupervised learning

    Preprocessed feature space can also be used for unsupervised learning. In this example, we project the cells with selected features into t-SNE and PCA plots, which can be interactively visualized at http://projector.tensorflow.org

    This projector plot can also be used as a tool to isolate the cluster of objects-of-interest for further analysis.

    Input of this step:

    * Data table of objects with selected features (.TXT files) (from step 2)
    * Metadata labels of objects (.TSV file) (from step 2)
    

    Output of this step:

    * t-SNE / PCA plots on a web-browser at http://projector.tensorflow.org   
    

    tSNE

mlcp's People

Contributors

minh-doan avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.