
batchCorrectionPublicData

Summary of the code accompanying the publication "reComBat: Batch effect removal in large-scale, multi-source omics data integration".

Installation of packages

All required packages are listed in the provided requirements.txt file. Simply install them via "pip install -r requirements.txt".

Run example

We provide all data and code to reproduce Figures 1, 2 and S1-S10 of our recent publication. Simply execute the main script by running

harmonizedDataCreation.py

Here, parameter options for the specific batch correction methods, evaluation metrics, and output folders are defined. This script comprises three main parts (sketched after the list below):

  1. Data loading and metadata preprocessing
  2. Batch correction
  3. Evaluation of the batch correction methods
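
The overall structure can be pictured with the skeleton below. Every function name, option name, and placeholder value in it is an assumption made for illustration; the actual interface of harmonizedDataCreation.py may differ.

```python
# Hypothetical skeleton of the three stages; every function, option, and
# placeholder value below is an assumption for illustration only.
import argparse

import numpy as np
import pandas as pd


def load_and_preprocess(data_dir="data"):
    """1. Data loading and metadata preprocessing (placeholder data here)."""
    expression = pd.DataFrame(np.random.rand(10, 5))               # stands in for the GEO samples
    metadata = pd.DataFrame({"GSE": ["GSE1"] * 5 + ["GSE2"] * 5})  # stands in for the annotation
    return expression, metadata


def correct_batches(expression, metadata, method="reComBat"):
    """2. Batch correction with the selected method (identity placeholder)."""
    return expression


def evaluate_correction(corrected, metadata):
    """3. Evaluation of the batch correction (placeholder metric)."""
    return {"example_metric": float(corrected.values.var())}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--method", default="reComBat")  # assumed option name
    args = parser.parse_args()
    expr, meta = load_and_preprocess()
    corrected = correct_batches(expr, meta, method=args.method)
    print(evaluate_correction(corrected, meta))
```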

Data loading and preparation

The relevant data associated with this code is provided as a .zip file and needs to be extracted into the 'data' folder. It comprises more than 1000 microarray gene expression samples extracted from the GEO database in October 2020, as indicated by the relevant GSE and GSM identifiers. All data was preprocessed using RMA normalization.

The data annotation (referred to as "metadata") is categorized to reflect the specific Pseudomonas aeruginosa (PA) strain and culture conditions (temperature, growth medium, culture geometry, antibiotic treatment, growth phase), and each sample is assigned to one of 39 unique metadata subsets (ZeroHops). Only ZeroHops comprising at least two batches (GSEs) of at least two samples (GSMs) each are kept.
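
The filtering rule can be sketched as follows; the column names "ZeroHop" and "GSE" are assumptions about the metadata layout and may not match the actual files.

```python
import pandas as pd


def filter_zerohops(metadata: pd.DataFrame) -> pd.DataFrame:
    """Keep only ZeroHops containing at least two batches (GSEs)
    with at least two samples (GSMs) each."""

    def has_enough_batches(group: pd.DataFrame) -> bool:
        gse_sizes = group.groupby("GSE").size()        # samples per GSE within this ZeroHop
        return int((gse_sizes >= 2).sum()) >= 2        # at least two GSEs with >= 2 samples

    valid = [zh for zh, grp in metadata.groupby("ZeroHop") if has_enough_batches(grp)]
    return metadata[metadata["ZeroHop"].isin(valid)]
```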

Batch correction

We provide code for the following (optional) batch correction methods:

  1. Uncorrected data
  2. Standardized data (Z-scoring to zero mean and unit variance; sketched after this list)
  3. Marker gene elimination for each of the ZeroHop Clusters (default top 8 marker genes)
  4. Principal component elimination for each of the ZeroHops
  5. reComBat

For each of these methods, overview figures showing t-SNE embeddings of the corrected data, colored by all metadata categories, are created to allow visual inspection of the batch correction success.
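
As a minimal sketch of methods 2 and 4 (standardization and principal component elimination), the code below assumes a samples-by-genes pandas DataFrame and uses scikit-learn; the default number of removed components and the per-ZeroHop application are illustrative only, and reComBat itself is distributed as a separate Python package whose interface is not reproduced here.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def standardize(expr: pd.DataFrame) -> pd.DataFrame:
    """Z-score each gene (column) to zero mean and unit variance (method 2)."""
    scaled = StandardScaler().fit_transform(expr.values)
    return pd.DataFrame(scaled, index=expr.index, columns=expr.columns)


def remove_top_components(expr: pd.DataFrame, n_remove: int = 1) -> pd.DataFrame:
    """Project out the leading principal components (method 4, illustrative default)."""
    pca = PCA()
    scores = pca.fit_transform(expr.values)  # samples x components
    scores[:, :n_remove] = 0.0               # zero out the leading components
    reconstructed = pca.inverse_transform(scores)
    return pd.DataFrame(reconstructed, index=expr.index, columns=expr.columns)


# Hypothetical usage, applied separately to the samples of each ZeroHop:
# corrected = remove_top_components(standardize(zerohop_expr), n_remove=1)
```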

Evaluation of the batch correction methods

We provide a range of custom evaluation metrics probing different aspects of a successfully batch-corrected dataset. These include:

  1. LDA score
  2. DRS score
  3. Cluster purity and Gini impurity
  4. Minimum Cluster Separation number
  5. Cluster Cross-distance
  6. Logistic Regression (or other classifier) classification performance for batch and ZeroHop labels (see the sketch after this list).
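
As one concrete example of the last metric, the sketch below estimates how well a logistic regression classifier can still predict a given label from the corrected data; the helper name and the use of scikit-learn cross-validation are assumptions for illustration. Low accuracy on the batch (GSE) label together with high accuracy on the ZeroHop label indicates a successful correction.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def classification_score(X, labels, cv=5):
    """Mean cross-validated accuracy of predicting the given labels from X."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=cv).mean()


# Hypothetical usage with a corrected expression matrix and its metadata:
# batch_acc   = classification_score(corrected.values, metadata["GSE"])      # low is good
# zerohop_acc = classification_score(corrected.values, metadata["ZeroHop"])  # high is good
```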

Synthetic data generation

We also provide code to generate and evaluate synthetic data in syntheticDataGeneration.py. Here, the user can define their choice of synthetic data properties and the properties of the imposed batch effects, and then correct these with the set of methods outlined above. The obtained results can be compared to the relevant ground truth.
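
A minimal sketch of the underlying idea, assuming a simple additive batch effect on Gaussian data; the dimensions, effect sizes, and random seed are arbitrary and do not reflect the defaults of syntheticDataGeneration.py.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_batches = 120, 500, 3

# Ground-truth biological signal shared by all batches.
ground_truth = rng.normal(size=(n_samples, n_genes))

# Impose an additive, batch-specific shift on top of the ground truth.
batches = rng.integers(0, n_batches, size=n_samples)
batch_shifts = rng.normal(scale=2.0, size=(n_batches, n_genes))
observed = ground_truth + batch_shifts[batches]

# A batch-corrected version of `observed` can then be compared to
# `ground_truth`, e.g. via the mean squared error to the known signal.
```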

Contact

This code is developed and maintained by members of the Machine Learning and Computational Biology Lab of Prof. Dr. Karsten Borgwardt: Michael F. Adamer and Sarah C. Brüningk.
