
batchCorrectionPublicData

Summary of the code accompanying the publication "reComBat: Batch effect removal in large-scale, multi-source omics data integration".

Installation of packages

All required packages are listed in the provided requirements.txt file. Simply install them via "pip install -r requirements.txt".

Run example

We provide all data and code to reproduce Figures 1, 2 and S1-S10 of our recent publication. Simply execute the main script by running

harmonizedDataCreation.py

Here, parameter options for the specific batch correction methods, evaluation metrics, and output folders are defined. This script comprises three main parts (sketched after the list below):

  1. Data loading and metadata preprocessing
  2. Batch correction
  3. Evaluation of the batch correction methods
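
The overall structure can be pictured with the skeleton below. Every function name, option name, and placeholder value in it is an assumption made for illustration; the actual interface of harmonizedDataCreation.py may differ.

```python
# Hypothetical skeleton of the three stages; every function, option, and
# placeholder value below is an assumption for illustration only.
import argparse

import numpy as np
import pandas as pd


def load_and_preprocess(data_dir="data"):
    """1. Data loading and metadata preprocessing (placeholder data here)."""
    expression = pd.DataFrame(np.random.rand(10, 5))               # stands in for the GEO samples
    metadata = pd.DataFrame({"GSE": ["GSE1"] * 5 + ["GSE2"] * 5})  # stands in for the annotation
    return expression, metadata


def correct_batches(expression, metadata, method="reComBat"):
    """2. Batch correction with the selected method (identity placeholder)."""
    return expression


def evaluate_correction(corrected, metadata):
    """3. Evaluation of the batch correction (placeholder metric)."""
    return {"example_metric": float(corrected.values.var())}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--method", default="reComBat")  # assumed option name
    args = parser.parse_args()
    expr, meta = load_and_preprocess()
    corrected = correct_batches(expr, meta, method=args.method)
    print(evaluate_correction(corrected, meta))
```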

Data loading and preparation

The relevant data associated with this code is provided as a .zip file and needs to be extracted into the 'data' folder. It comprises more than 1000 microarray gene expression samples extracted from the GEO database in October 2020, as indicated by the relevant GSE and GSM identifiers. All data was preprocessed using RMA normalization.

The data annotation (referred to as "metadata") is categorized to reflect the specific Pseudomonas aeruginosa (PA) strain and culture conditions (temperature, growth medium, culture geometry, antibiotic treatment, growth phase), and each sample is assigned to one of 39 unique metadata subsets (ZeroHops). Only ZeroHops comprising at least two batches (GSEs) of at least two samples (GSMs) each are kept.
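
The filtering rule can be sketched as follows; the column names "ZeroHop" and "GSE" are assumptions about the metadata layout and may not match the actual files.

```python
import pandas as pd


def filter_zerohops(metadata: pd.DataFrame) -> pd.DataFrame:
    """Keep only ZeroHops containing at least two batches (GSEs)
    with at least two samples (GSMs) each."""

    def has_enough_batches(group: pd.DataFrame) -> bool:
        gse_sizes = group.groupby("GSE").size()        # samples per GSE within this ZeroHop
        return int((gse_sizes >= 2).sum()) >= 2        # at least two GSEs with >= 2 samples

    valid = [zh for zh, grp in metadata.groupby("ZeroHop") if has_enough_batches(grp)]
    return metadata[metadata["ZeroHop"].isin(valid)]
```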

Batch correction

We provide code for the following (optional) batch correction methods:

  1. Uncorrected data
  2. Standardized data (Z-scoring to zero mean and unit variance; sketched after this list)
  3. Marker gene elimination for each of the ZeroHop Clusters (default top 8 marker genes)
  4. Principal component elimination for each of the ZeroHops
  5. reComBat

For each of these methods, overview figures showing t-SNE embeddings of the corrected data, colored by all metadata categories, are created to allow visual inspection of the batch correction success.
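
As a minimal sketch of methods 2 and 4 (standardization and principal component elimination), the code below assumes a samples-by-genes pandas DataFrame and uses scikit-learn; the default number of removed components and the per-ZeroHop application are illustrative only, and reComBat itself is distributed as a separate Python package whose interface is not reproduced here.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def standardize(expr: pd.DataFrame) -> pd.DataFrame:
    """Z-score each gene (column) to zero mean and unit variance (method 2)."""
    scaled = StandardScaler().fit_transform(expr.values)
    return pd.DataFrame(scaled, index=expr.index, columns=expr.columns)


def remove_top_components(expr: pd.DataFrame, n_remove: int = 1) -> pd.DataFrame:
    """Project out the leading principal components (method 4, illustrative default)."""
    pca = PCA()
    scores = pca.fit_transform(expr.values)  # samples x components
    scores[:, :n_remove] = 0.0               # zero out the leading components
    reconstructed = pca.inverse_transform(scores)
    return pd.DataFrame(reconstructed, index=expr.index, columns=expr.columns)


# Hypothetical usage, applied separately to the samples of each ZeroHop:
# corrected = remove_top_components(standardize(zerohop_expr), n_remove=1)
```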

Evaluation of the batch correction methods

We provide a range of custom evaluation metrics probing different aspects of a successfully batch-corrected dataset. These include:

  1. LDA score
  2. DRS score
  3. Cluster purity and Gini impurity
  4. Minimum Cluster Separation number
  5. Cluster Cross-distance
  6. Logistic Regression (or other classifier) classification performance for batch and ZeroHop labels (see the sketch after this list).
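
As one concrete example of the last metric, the sketch below estimates how well a logistic regression classifier can still predict a given label from the corrected data; the helper name and the use of scikit-learn cross-validation are assumptions for illustration. Low accuracy on the batch (GSE) label together with high accuracy on the ZeroHop label indicates a successful correction.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def classification_score(X, labels, cv=5):
    """Mean cross-validated accuracy of predicting the given labels from X."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=cv).mean()


# Hypothetical usage with a corrected expression matrix and its metadata:
# batch_acc   = classification_score(corrected.values, metadata["GSE"])      # low is good
# zerohop_acc = classification_score(corrected.values, metadata["ZeroHop"])  # high is good
```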

Synthetic data generation

We also provide code to generate and evaluate synthetic data in syntheticDataGeneration.py. Here, the user can define their choice of synthetic data properties and the properties of the imposed batch effects, and then correct these with the set of methods outlined above. The obtained results can be compared to the relevant ground truth.
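
A minimal sketch of the underlying idea, assuming a simple additive batch effect on Gaussian data; the dimensions, effect sizes, and random seed are arbitrary and do not reflect the defaults of syntheticDataGeneration.py.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_batches = 120, 500, 3

# Ground-truth biological signal shared by all batches.
ground_truth = rng.normal(size=(n_samples, n_genes))

# Impose an additive, batch-specific shift on top of the ground truth.
batches = rng.integers(0, n_batches, size=n_samples)
batch_shifts = rng.normal(scale=2.0, size=(n_batches, n_genes))
observed = ground_truth + batch_shifts[batches]

# A batch-corrected version of `observed` can then be compared to
# `ground_truth`, e.g. via the mean squared error to the known signal.
```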

Contact

This code is developed and maintained by members of the Machine Learning and Computational Biology Lab of Prof. Dr. Karsten Borgwardt: Michael F. Adamer and Sarah C. Brüningk.
