Iniquitate Pipeline

This repository corresponds to the analysis and pipeline outlined in Characterizing the impacts of dataset imbalance on single-cell data integration. If you use the integration guidelines or any aspect of this pipeline, please consider citing our work.

Downsampling-based perturbation experiments for single-cell RNA sequencing integration


Table of contents

  • Using the imbalanced integration guidelines
  • Reproducing the paper analysis
  • Custom data perturbation configuration setup
  • Citation information

Using the imbalanced integration guidelines

A separate README for the imbalanced integration guidelines, including full environment installation instructions, is available in the docs folder.

Reproducing the paper analysis

Please note that mamba and snakemake are required to run the pipeline through conda. After installing conda (https://conda.io/projects/conda/en/latest/user-guide/install/index.html), please add mamba (https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) to the base environment, as well as snakemake (https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) in base or a new environment:

conda install -n base -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
mamba activate snakemake
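
To confirm the tools are available before proceeding, a quick version check helps (any recent releases should work; exact output will vary):

mamba --version
snakemake --version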

The environments necessary to run the pipeline will be automatically installed by snakemake. The only exception is the analysis environment, which is installed manually in step 4 below.

The following steps are necessary to reproduce the paper analysis:

  1. Clone the GitHub repository:
git clone https://github.com/hsmaan/Iniquitate.git
  2. Download the resources utilized in the study, extract them, and move them to the Iniquitate directory. The data can be downloaded from Figshare or Google Drive:

    Figshare: https://doi.org/10.6084/m9.figshare.24625302.v1

    Google Drive Link:

    After downloading:

    tar -xzvf resources.tar.gz 
    mv resources Iniquitate
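
    As a quick sanity check of the extraction (h5ad_files is one of the subdirectories referenced later in this README):

    ls Iniquitate/resources/h5ad_files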
    
  3. Run the different configurations utilized in the study through the Snakemake pipeline:

    • Change the configfile option at the top of workflow/Snakefile (e.g. configfile: "configs/config_control.json"). The following configs were utilized for different analyses in the study:

      • config_control
      • config_lowcap_modified
      • config_pdac_comp
    • Run the Snakemake pipeline specific to the selected config:

    snakemake --unlock 
    snakemake -j 1000 \
        --use-conda \
        --cluster-config cluster.json \
        --cluster "sbatch \
            --mem={cluster.mem} \
            --gres=gpu:{cluster.gpu} \
            -t {cluster.time} \
            -p {cluster.partition} \
            -c {threads}" \
        --restart-times 0 \
        --latency-wait 300 \
        --keep-going \
        --rerun-incomplete 
    

    Note that the above Snakemake run utilizes a workflow/cluster.json configuration file and HPC parallelization of the various steps in the pipeline. Users will need to create a cluster.json file specific to their HPC setup that has resources for all of the rules in workflow/Snakefile. Alternatively, users can also choose to employ Snakemake profiles. Details can be found here: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html.
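
    As a starting point, a minimal cluster.json sketch matching the placeholders in the command above (the __default__ entry is Snakemake's documented cluster-config fallback; all values here are illustrative, and per-rule entries keyed by rule name can override them):

    {
        "__default__": {
            "mem": "16G",
            "gpu": 0,
            "time": "12:00:00",
            "partition": "general"
        }
    }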

    Further, all of the temporary integration files are stored in /tmp by default, based on the TMPDIR variable in workflow/envs/integrate.yaml. These files can be quite large and accumulate during runtime, even though they are deleted at the end of each run. We therefore recommend pointing this directory to a location with enough space - at least 100 GB.
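
    Assuming integrate.yaml sets TMPDIR through conda's variables section, the relevant edit would look something like this (/scratch/username/iniq_tmp is a hypothetical path - substitute one on a filesystem with at least 100 GB free):

    variables:
      TMPDIR: /scratch/username/iniq_tmp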

  4. Analyze the results using the R and Python scripts/notebooks:

    • Begin by installing the analysis conda environment:
    cd Iniquitate/workflow/envs
    mamba env create -f analysis.yaml
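
    A quick check that the environment was created (iniq_analysis is the environment name used in the steps below):

    conda env list | grep iniq_analysis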
    
    • Next, run the Python notebook to completion for metric analysis:
    conda activate iniq_analysis
    jupyter-notebook 01_Fig_7_Imbal_Metric_Analysis.ipynb
    
    • Run the R scripts in order, through RStudio or the command line:
    conda activate iniq_analysis
    Rscript 05_Iniq_Control_Fig_2_Analysis_Plots.R
    Rscript 06_Iniq_Control_Fig_2_Analysis_Stat_Tests.R
    ...
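
    To run the whole numbered sequence unattended, a small shell loop works (a sketch: it assumes every analysis script matches the 0*.R pattern and that the numeric prefixes fully determine the order; note that scripts 08 and 09 are time-intensive, as mentioned below):

    for script in 0*.R; do
        Rscript "$script"
    done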
    

It is not possible to re-run all of the perturbation experiments and downstream analyses in a reasonable amount of time without high-performance computing (HPC). We highly recommend parallelizing the workflow over HPC.

It is also recommended to run the R and Python analysis notebooks in an HPC environment, as some of the steps are memory-intensive. In particular, we don't recommend running R scripts 08 or 09 without HPC, as they are time-intensive sampling experiments.


Custom data perturbation configuration setup

The same software requirements for the paper analysis apply to custom data perturbation experiments (conda, mamba, snakemake). Please see the first part of Reproducing the paper analysis for details on how to install these requirements.

The following steps are necessary to use a custom dataset:

  1. Process the batches/samples as necessary and convert them to h5ad format (see the sketch below). Create a folder in resources/h5ad_files/int_datasets/ (e.g. resources/h5ad_files/int_datasets/custom_data) and move the batches/samples to this location.
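
    A minimal Python sketch of the conversion using scanpy/anndata (the input path and 10x matrix format are hypothetical - adapt to however your batches are stored):

    import scanpy as sc

    # Read one batch; a 10x-style matrix directory is assumed purely for illustration
    adata = sc.read_10x_mtx("path/to/batch1_counts/")

    # The pipeline downsamples by cell-type, so each file should carry
    # cell-type annotations in .obs before this point
    adata.write_h5ad("resources/h5ad_files/int_datasets/custom_data/batch1.h5ad")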

  2. Modify the custom dataset configuration file (workflow/configs/config_custom.json). This .json file has the following parameters:

    • config_name - this can be left as custom, or changed to a different name, but this same name must be used when modifying the Snakefile

    • int_datasets - this is a nested dictionary of the datasets to be used in the downsampling experiments. In this case, it's best to use the name of the dataset in resources/h5ad_files/int_datasets as the top level name

      • data_folder - this value should be the same as the folder containing the batches/samples in resources/h5ad_files/int_datasets/
      • ds_celltypes - this value indicates how many cell-types to downsample and/or ablate in each run
      • ds_proportions - this value indicates what proportion of the selected cell-types remains in the downsampled batch after the perturbation. Here, 0 indicates ablation, and 0.1 indicates the same level of downsampling used for the main experiments in the study (e.g. a cell-type with 1000 cells would retain roughly 100)
      • num_batches - the number of batches to downsample in each perturbation. The 0 option is included here so that control experiments are possible (no perturbation to any batches).
      • repetitions - how many experiments based on the given grid of ds_celltypes, ds_proportions, and num_batches to perform. 200 is a good starting point. If the space of possible cell-types is very large (n celltypes > 20), then it may be useful to increase this value to ensure each cell-type is downsampled/ablated enough times in the total number of runs.
    • int_ti_datasets - if any datasets have an underlying trajectory and PAGA-based integration needs to be performed, they should be added here with the same options indicated in int_datasets

    • query_to_reference - a "Yes" or "No" option indicating whether or not to perform query-to-reference experiments. This functionality is not yet available (custom query-to-reference setups are planned), so it should be left as "No".

    • celltype_list - if the user has a list of specific cell-types to downsample (and not others), they can be included here as a JSON list of strings based on their names. We don't recommend specifying particular cell-types, as a priori knowledge of the effects of downsampling/perturbation may not be accurate. A full config sketch follows below.
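
    Putting the parameters together, a hedged sketch of what config_custom.json might look like (the exact schema and value types should be confirmed against the template file in workflow/configs/; all values here are illustrative):

    {
        "config_name": "custom",
        "int_datasets": {
            "custom_data": {
                "data_folder": "custom_data",
                "ds_celltypes": [1, 2],
                "ds_proportions": [0, 0.1],
                "num_batches": [0, 1],
                "repetitions": 200
            }
        },
        "int_ti_datasets": {},
        "query_to_reference": "No",
        "celltype_list": []
    }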

  3. Modify the Snakefile at workflow/Snakefile at line 3 to reference the name of the specific config being used. In the example given, the configfile line would be changed to:

    • configfile: "configs/config_custom.json"
  4. Run snakemake:

    snakemake --unlock 
    snakemake -j 1000 \
        --use-conda \
        --cluster-config cluster.json \
        --cluster "sbatch \
            --mem={cluster.mem} \
            --gres=gpu:{cluster.gpu} \
            -t {cluster.time} \
            -p {cluster.partition} \
            -c {threads}" \
        --restart-times 0 \
        --latency-wait 300 \
        --keep-going \
        --rerun-incomplete 
    

    Note that the above Snakemake run utilizes a workflow/cluster.json configuration file and HPC parallelization of the various steps in the pipeline. See the notes and cluster.json sketch in Reproducing the paper analysis above.

  5. Analyze the result files using the R scripts - all of the R scripts in the workflow/analysis/R folder can be used to analyze the results of the perturbation experiments. An exhaustive list of custom analysis scripts is not yet available, but the existing scripts can be modified to suit the user's needs. We provide one custom script for analysis of KNN classification accuracy in workflow/analysis/R/knn_example.R. Note that this file still needs to be modified at the appropriate input locations, which are indicated in the comments of the file; an example invocation follows below.
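
    After editing the input locations flagged in the script's comments, it can be run like the other analysis scripts (iniq_analysis is the environment installed in Reproducing the paper analysis):

    conda activate iniq_analysis
    Rscript workflow/analysis/R/knn_example.R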

Citation information

Maan, H. et al. (2024) ‘Characterizing the impacts of dataset imbalance on single-cell data integration’, Nature Biotechnology. Available at: https://doi.org/10.1038/s41587-023-02097-9.


Open issues

Analysis script cleanup

  • After merging all of the disparate branches, organize the analysis scripts
  • Renumber the scripts (as well as their outputs)
  • Ensure that the output directory creation statements are present, as they were reorganized
  • Ensure no overlap of any outputs

Add guidelines into README

  • PR for the guidelines RMD is complete
  • The examples folder has a separate README but needs a pointer in the main README
  • Also add the guidelines figure from the manuscript to the README of the examples folder

Invalid argument docstrings

  • Currently, many of the scripts in scripts/python have invalid help strings that were copied over from other scripts
  • Ensure each script has the correct help docstring based on its arguments, even though the arguments are mostly redundant across scripts

Custom user dataset input

  • Currently there is still no infrastructure in place for users to utilize their own datasets in the perturbation experiments and the Iniquitate pipeline
  • The first step would be to add a processing script for users' h5ad inputs to get them set up and ready for the perturbation experiments

Remaining custom user data input scripts

Added initial documentation and custom user data KNN classification script. Outstanding custom scripts include:

  • Metrics + balanced metrics analysis
  • DGE
  • Trajectory inference
  • Leiden clustering
  • Cell type center distance + cell type support scripts

Originally posted by @hsmaan in #26 (comment)

Pull additional analysis branches and resolve config/snakefile conflicts

  • Currently quite a few active branches for additional analyses are diverging in terms of file structure
  • After the analyses are finished, aggregate these changes and do PRs for each branch in a way that ensures all of the changes can be pushed to main
  • Additional constraints/rules in the Snakefile will likely be needed

After recent branches pulled, update configs

  • Currently, different branches have altered the config structure in different ways
  • Before merging, ensure that the config structure is up-to-date across configs and with any changes to main

Clean up `integrate` classes and downsampling functions

  • Currently, some unused code and variables declared as None are still present in these classes/functions
  • Clean things up, remove these unnecessary bits, and also remove the associated parameter calls from the scripts that call them

CIDER pbmc supplementary analyses

The correct changes to the Snakefile still need to be made following the addition of the CIDER supplementary experiments (PR: cider_pbmc_test). These changes should be made before finishing this merge.

Installation issues

Thank you for developing such a useful tool.
I would like to ask, does the installation require Snakemake, Python, and R to work together? Is there a simpler way to install?
Thank you once again.
