Code Monkey home page Code Monkey logo

combatharmonization's Introduction

ComBatHarmonization

Harmonization of multi-site imaging data with ComBat


Maintainer: Jean-Philippe Fortin, [email protected]

Licenses:

  • R code: Artistic 2.0 License
  • Python code: MIT License
  • Matlab code: MIT License

References: If you are using ComBat for the harmonization of multi-site imaging data, please cite the following papers:

Citation Paper Link
ComBat for multi-site DTI data Jean-Philippe Fortin, Drew Parker, Birkan Tunc, Takanori Watanabe, Mark A Elliott, Kosha Ruparel, David R Roalf, Theodore D Satterthwaite, Ruben C Gur, Raquel E Gur, Robert T Schultz, Ragini Verma, Russell T Shinohara. Harmonization Of Multi-Site Diffusion Tensor Imaging Data. NeuroImage, 161, 149-170, 2017 Link
ComBat for multi-site cortical thickness measurements Jean-Philippe Fortin, Nicholas Cullen, Yvette I. Sheline, Warren D. Taylor, Irem Aselcioglu, Philip A. Cook, Phil Adams, Crystal Cooper, Maurizio Fava, Patrick J. McGrath, Melvin McInnis, Mary L. Phillips, Madhukar H. Trivedi, Myrna M. Weissman, Russell T. Shinohara. Harmonization of cortical thickness measurements across scanners and sites. NeuroImage, 167, 104-120, 2018 Link
Original ComBat paper for gene expression array W. Evan Johnson and Cheng Li, Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118-127, 2007. Link

Table of content

1. Introduction

Imaging data suffer from technical between-scanner variation that hinders comparisons of images across imaging sites, scanners and over time. This includes common imaging modalities, such as MRI, fMRI and DTI, as well as measurements derived from those modalities, for instance ROI volumes, RAVENS maps, cortical thickness measurements, connectome matrices, etc. To maximize statistical power, post-processing data harmonization is a powerful technique to remove unwanted variation when combining data across scanners and sites.

In two recent papers (harmonization of DTI data and harmonization of cortical thickness measurements) we have shown that ComBat, a popular batch-effect correction tool used in genomics, succesffuly removes inter-site technical variability while preserving inter-site biological variability. We showed that ComBat performs well for multi-site imaging studies that only have a few participants per site. We also showed that ComBat was robust to unbalanced studies, in which the biological covariate of interest is not balanced across sites.

We recommend to use the ComBat harmonization method after imaging processing before downstream statistical analyses. The ComBat harmonization requires the imaging data to be represented in a matrix where rows are the imaging features (for instance voxels, ROIs or connectome edges) and columns are the participants. For example, for voxel-level analyses, this usually requires images to be registered to a common template space.

Input and parameters

Data inputs for ComBat are:

  • A data matrix. The data to harmonize. Rows are features (for instance voxels or brain regions) and columns are participants.
  • A batch id vector. A vector (length should be equal to the number of columns in the data matrix) that specifies the id for the batch, site, or scanner to correct for. ComBat only accepts one batch vector. You should provide the smallest unit of the study that you believe introduces unwanted variation. For instance, for a study with 2 sites and 3 scanners (1 site with 1 scanner, 1 site with 2 scanners), the id for scanner should be used.
  • Biological variables. Optional design matrix specifying biological covariates that should be protected for during the removal of scanner/site effects, such as disease status, age, gender, etc.

There are several alternative modes of running ComBat:

  • parametric=FALSE: will instead use a non-parametric prior method in the empirical Bayes procedure (default uses parametric priors).
  • eb=FALSE: will not run the empirical Bayes procedure, and therefore location and scale parameters are not shrunk towards common factors averaged across features. This is equivalent to running a location-and-scale correction method for each feature separately. This is particularly useful for debugging and method development.
  • mean.only=TRUE: will only adjust the mean of the site effects across sites (default adjusts for mean and variance). This option is recommended for studies where the variances are expected to be different across sites due to the biology.

2. Software implementations

The reference implementation (Standard Version) of ComBat, developed for gene expression analyses, is written in R and is part of the sva package available through the Bioconductor project here. We include here a reimplementation of ComBat in R, Matlab and Python (neuroCombat) for the harmonization of imaging data. Our R implementation extends the original code for more flexibility and additional visualization of the internal components of the algorithm. We are also currently working on several extensions of the original method that will be included here as well. We use the MIT license for the Python and Matlab code, and an Artistic License 2.0 for the R code to be compatible with the sva package.

Current implemented features

R Python Matlab
Parametric adjustments x x x
Non-parametric adjustments x x x
Empirical Bayes x x x
No empirical Bayes x x
Mean adjustment only x x
Reference batch x x
Can handle missing values x

Testing and comparing the different implementations

The Testing directory contains code for testing and comparing the outputs from the R, Matlab and Python implementations. We routinely perform the analyses to make sure that all versions and implementations agree with each other, as well as with the sva implementation of ComBat, for all modes of running ComBat (parametric/non-parametric/eb/mean.only).

3. Handling of missing values

  • For R, the current implementation accepts missing values. Constant rows, and rows with missing values only, need to be removed before running ComBat. Not removing such rows will results in an error, or a matrix of NaN values.

  • For Matlab and Python, the input data can only contain finite values (no NA or Nan). Constant rows, and rows with missing values only, need to be removed before running ComBat. Not removing such rows will results in an error, or a matrix of NaN values.

4. FAQs

5. News

05-23-2020: Reference batch option (ref.batch) now implemented in Python.

05-19-2020: Reference batch option (ref.batch) now implemented in R.

05-17-2020: Mean adjustment only option (mean.only=True) now implemented in both Python and R.

05-15-2020: Non-parametric adjustments (parametric=False), and eb=False now implemented in both Python and R.

05-14-2020: We migrated our official Python implementation (neuroCombat) here for maintainability.

03-06-2020: ComBat in R now accepts missing values.

05-19-2019: Added the option of running the non-parametric version of ComBat in the R implementation.

05-19-2019: Added the option of running the non-parametric version of ComBat in the Matlab implementation.

combatharmonization's People

Contributors

jfortin1 avatar willforan avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

combatharmonization's Issues

Possible to use parallelization to speed up non-parametric estimation in R?

Hi
I am testing out neuroCombat to harmonize fMRI data across two sites. The parametric version works great even on a larger number of input features. But it takes an exceedingly long time to run the non-parametric version with thousands of features.
I was wondering if it would be possible to use parallel processing to speed this up? Particularly the getEbEstimators helper function for "Finding non-parametric adjustments"
Thanks!

Working with voxel-wise data

Hey combat users,
I have a question regarding the use of voxel-wise data for combat (Python). I have implemented neurocombat for ROI-wise data, but I don't know how to input voxel-wise data which would require the shape of features (= participants) x samples (= 3D voxels). Can someone help on how to include 3D voxel information per participant into a numpy array?
Best, Melissa

Applying ComBat on single feature

I would like to use ComBat on a single feature. From the formulation and the fact that this is done in the cortical thickness paper referred to in the readme, I would think it should be possible. However, the parametric formulation gives an error, while the non-parametric version results in only NaNs:

data = randn(1,10);
batch = [1 1 1 1 1 2 2 2 2 2];
mod = [1 2 1 2 1 2 1 2 1 2]';
combat(data, batch, mod, 0)

Results in:

[combat] Found 2 batches
[combat] Adjusting for 1 covariate(s) of covariate level(s)
[combat] Standardizing Data across features
[combat] Fitting L/S model and finding priors
[combat] Finding non-parametric adjustments
[combat] Adjusting the Data
ans =

NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

combat(data, batch, mod, 1)

Results in:

[combat] Found 2 batches
[combat] Adjusting for 1 covariate(s) of covariate level(s)
[combat] Standardizing Data across features
[combat] Fitting L/S model and finding priors
[combat] Finding parametric adjustments
Index exceeds matrix dimensions.

Error in combat (line 93)
temp =
itSol(s_data(:,indices),gamma_hat(i,:),delta_hat(i,:),gamma_bar(i),t2(i),a_prior(i),b_prior(i),
0.001);

The error seems to be in the second iteration of the for loop line 93 is in. The index exceeding is in gamma_bar and t2.

Is this a bug in the code, or a limitation of the method I'm unaware of? Thanks in advance.

FYI: I'm using the MATLAB version in MATLAB 2015b.

Is it possible to use ComBat in native space?

Hello,
We know that ComBat requires the images to be registered to a common template space. For some of our analyses, we want to get the value of subjects in native space. Is it possible to use ComBat in native space? Or can we firstly use ComBat for all in a template space, then project the subjects back to native space? Any risk for this procedure (i.e. introduce the bias back again)?

Thanks!

mod.mean in neuroCombatFromTraining()

Hi Fortin,

Thanks for your update, I found the R version of ComBat begins to support using estimation from training data on unseen data.

For the function neuroCombatFromTraining(), I am curious about why you use rowMeans(mod.mean) instead of using something like beta.hat * X_{new_data}. beta_hat is estimated from training data, X_{new_data} is the covariates from new dataset.

It seems quite straightforward to me, is there is any reason why you do not do this?

Thanks!

aprior formular

Hello,
I have a question regarding the aprior function,
From the equation that is shown in Johnson 2007
image

y=(2s2+m^2)/s2 in the aprior function confused me,
if I understand it correctly, should it be written as
y=(2
s2+m)/s2?

zhipeng

Unable to install R package

I keep getting an error code when I try to install this package. I tried in RStudio on mac and on linux, I tried using devtools and remotes, I tried forking the package and installing from that, and I installed BiocParallel just in case that was the issue. The install always fails. Here is the output:

> remotes::install_github("Jfortin1/neuroCombat_Rpackage")
Downloading GitHub repo Jfortin1/neuroCombat_Rpackage@HEAD
Error: Failed to install 'neuroCombat' from GitHub:
  Command failed (69)
In addition: Warning message:
In system(full, intern = TRUE, ignore.stderr = quiet) :
  running command ''/usr/bin/git' ls-remote https://git.bioconductor.org/packages/BiocParallel RELEASE_3_17 2>/dev/null' had status 69

Help would be greatly appreciated.

Error in solve.default(t(design) %*% design)?

Hello,

I previously got ComBat to work with some test data, but recently ran into an error when working with "real data". The data I'm working with is located here.

My basic R syntax was--

source("scripts/utils.R");
source("scripts/combat.R")

dat<-t(as.matrix(read.csv("Freesurfer_HBN_w_Site_For_Combat.csv")))
batch<-read.csv("HBN_site.csv")
batch<-t(batch$Site)
data.harmonized <- combat(dat=dat, batch=batch)

After I run that command, I get the following error--

untitled

Any thoughts re: what I might be doing incorrectly?
Thanks much!

All the best,
Jamie.

Error using combat in Matlab, error about vertcat although dimensions of input are right

Hello Experts,

When I run combat using Matlab, I get the following error about dimensions of the the arrays although my batch is a vector of 125, and my data is 9 features by 125 (9x125).

I am using mode = [];
DataHarmonized = combat(DataVol, batch, mod, 0);

Error using vertcat
Dimensions of arrays being concatenated are not consistent.

Error in combat (line 80)
delta_hat = [delta_hat; var(s_data(:,indices)')];

[Python] Dividing by zeros in standardize_accros_features function

(neuroCombat migrated here for maintainability)

From @RocaVincent:

Hi,

Thanks for this code,

Line 170 of neuroCombat.py file, I got a problem when it divides using the following term : np.dot(np.sqrt(var_pooled), np.ones((1, n_sample))). In this term, almost half of the coefficients are zeros so I got a RuntimeWarning and Nan values create problems later. Have you any idea why I got zeros and how I can fix this problem ?

Thanks

Nonparametric version of ComBat in R

From email: Hello,
I am applying the ComBat software to a multisite harmonization question. I have seen mentions of ComBat being available in both parametric and non-parametric versions. Is the version available on GitHub parametric or non-parametric?

Call neuroCombat without ref_batch

Hi,

I want to use the Python version of neuroCombat and I see that in the neuroCombat function there is an optional argument ref_batch. I understand the role of the ref batch thanks to one of the ComBat papers, but I don't understand how it can be optional since all theoretical explanations use a reference site. Can you tell me where I can find explanations about ComBat used without a reference site ?

Thanks in advance

train/test in R

Hi Jfortin1,

Thank you for this ComBat code. Just like someone pointed out earlier, I would like to be able to use ComBat in new test data as well but using the parameter of the training set, I would like to ask how can I do that in R since I am primarily using R for the purpose of radiomics study.

I really appreciate your time and effort.

Error in apply(data, 1, sd) : dim(X) must have a positive length

Hello ComBatHarmonization users,

I was trying to test-drive ComBat Harmonization, but ran into some issues with the apply function. I'm unsure if there's an issue with another R library or my R version (?).

I'm using a mac (10.13.5) with R-studio 1.1.453 and R 3.5.0

source("scripts/utils.R"); source("scripts/combat.R")
p=10000
n=10
batch = c(1,1,1,1,1,2,2,2,2,2) #Batch variable for the scanner id
dat = matrix(runif(p*n), p, n) #Random Data matrix
nrow(dat)
[1] 10000
ncol(dat)
[1] 10

data.harmonized <- combat(dat=dat, batch=batch)
Show Traceback
Rerun with Debug
Error in apply(data, 1, sd) : dim(X) must have a positive length `

I wondered if others had run into this issues?
Any suggestions for troubleshooting this?

Thanks much!
Jamie.

How to deal with .nii files directly?

Hi, sorry for bothering. I am WangRong.
The codes you provided are very useful! :) But I have some questions.
I want to do combat harmonization for .nii files directly, because I want to do Independent Components Analysis, moreover, the parameter needed to group-level analysis is too much.

I tried following codes:
a = load_nifti('I:\combatdemo\411.nii');
b = load_nifti('I:\combatdemo\CDP.nii');
p = {a,b};
n = 2;
dat = repmat(p,n,1);

But results give me lots of error:
var (第 164 行)
y = sum(abs(x - sum(x,dim)./n).^2, dim) ./ denom;
std (第 59 行)
y = sqrt(var(varargin{:}));
combat (第 9 行)
[sds] = std(dat')';
Because sum should be used for a numeric or logical value.

I was wondering if you could tell me how to solve my problem. Thank you very much!

ComBat in large sample size

Hello, I am trying to run ComBat in Matlab using a relatively large sample size (n= 300) which contains DTI data from two different sites. Although I have been able to run ComBat, it returns a matrix with only NA in it. Any suggestions to clear this issue?
Thanks

Typo in matlab implementation tutorial

Where the documentation reads:

We create a p x 2 model matrix with age as the first column

Shouldn't it read?

We create a n x 2 model matrix with age as the first column

ComBat parametric adjustment returning NANs in Matlab

Dear Experts,

I am running ComBat in Matlab for Mac version 9.7.0.1190202 (R2019b).
When I run the batch correction using parametric adjustments, I get all NANs. This has been a recurring issue for me, even in other datasets.
There are no NANs in my data and, in the past, the non-parametric adjustment has worked.
Any idea what might be causing this?
Thank you,

Leonardo Tozzi

Non-parametric form by Matlab

Thanks for sharing the code!

I am using your Matlab code, and parametric format works perfectly, but non-paramteric format can't work, most values are"NaN" after ComBat.

Can you help me with this problem? Thank you very much.

Use of ComBat when I have travelling subjects

Hi,

Can I use this code if I have traveling subject data, means suppose, I have 100 subjects and all of them are acquired in 4 different sites. If yes, may I know how?

Thanks and regards,
Jagruti Patel

How can I detect overcorrection?

Hi guys, I'm using neuroCombat (python version) to assess the impact of different CT machines (and other stuff) on a radiomic dataset.
The only thing that I'm still trying to figure out is how to determine if neuroCombat leads me to overcorrection.
Can someone help?
Sorry, but I usually analyze other types of data and this is my first time working on features from CT images.

Thanks in advance :)

Harmonization for multi-study project where each study has multiple sites

Greetings,

I am hoping to combine the MRI data from 2 independent studies, each of which had 4 unique sites. Thus, there are in total 8 sites where data was collected that I am interested in aggregating.

I had two questions:

  1. In this scenario, because the actual study site (8 in total) is the smallest unit, would the batch id vector be made up of the site information, and not study?

  2. In any post-harmonization statistical analyses, for example linear regression, is it suggested/would there be any value in including study (2 in total) as a covariate? On that note, I assume there would be no need for site (which would be already addressed in the initial harmonization procedure) to be included as a covariate as well?

Sorry for what are naive questions I'm sure. Any response would be greatly appreciated!
Thank you.

Detection of single batch in one site so var() calculation is not needed

Hi Jfortin1,

I've been using combat to harmonize multisite MRI data and it has been very helpful so far! However, I recently ran into a bit of trouble after limiting my cohort for further analyses. I kept getting an error that points me to line 80 in the combat.m function:

delta_hat = [];
	for i=1:n_batch
		indices = batches{i};
		delta_hat = [delta_hat; var(s_data(:,indices)')];
	end

After some digging, I realize that for single column entries in a batch, var() would calculate the column variance as opposed to row variance which I assume would happen if there were more column entries in the batch. Would you be able to quickly amend this part to include situations illustrated above?

Thanks so much for the great work! Hope to hear back from you soon!!

Sincerely,
Judy

[matlab] need to raise error if there is only 1 batch, otherwise it produces incorrect output

For data that only includes a single batch (eg: you have a multi-site dataset but you mask some data out during a loop and are unexpectedly left with only a single site), this code ends up removing the batchmod column from the design matrix (in combat.m line 33: design(:,bad)=[];, and then later when it extracts design(:,1:n_batch) or design(:,(n_batch+1):nn, it ends up changing the data in unexpected ways.

One solution would be to simply add a check in the beginning after generating n_batch:

    if n_batch < 2
        error('Error. Input must have at least 2 batches to proceed.');
    end

Nonparametric implementation of ComBat in Matlab

From email: I would like to use your Combat Harmonization routine to harmonise imaging data coming from different acquisition systems. I'm working in Matlab. Is the non parametric routine also available for Matlab or it is just implemented in R?

Neurocombat on 3D volumes

Hello,

I am working on 3D CT images and trying to apply neurocombat on 3D volumes where my features are voxel intensities. I transformed for each image the 3d array of voxel intensities into a long 1d vector by concatenating the slides. This allows me to create a matrix where each column is one image. Then I applied neuroCombat on the matrix, and transformed back each harmonized column into a 3d array with the original dimensions. However, the background intensities interfere with the results and poses problems for the NeuroCombat model (harmonized results are worse than original images). Could you please tell what is the strategy to follow when features are individual voxels? I would appreciate it if you could direct me to a code example where neurocombat is applied on voxel intensities.

train/test matlab implementation

Hi Jean-Philippe,

I'm trying to workout how to use combat in a train/test situation because it's super important to us and we want to use your implementation in our studies.

It seems that this could be pretty easily done by:

  1. for training sample, output the stand_mean, var_pooled, gamma_star, delta_star.
  2. for test sample, get the s_data using the training stand_mean/var_pooled and forward to the adjustment using the gamma_star/delta_star
  3. for n_batch, use the training data n_batch and establish a dummy matrix for the test data with zeros if they do not contain a specific n_batch (i.e., if the test sample does not have any cases from a specific MRI site.

Is this on the right track? Is it possible to review the code if I do this? I talked with Russel Shinohara about it before a few times but haven't implemented it. My name is Dom Dwyer and I'm from LMU in Germany.

Thanks!

Best, Dom

Correct batch effect of test data

Hello, thank you for your work on this wonderful tool.
I'm using this tool for a machine learning related project.
To prevent information leakage when correct the batch effect of them together, would it be possible to correct the batch effect of test data with information generated from train data?

batch needed to be ordered

I had an issue in MATLAB when the batch was unordered. I didn't have data from sites 1 and 2 but had the other 19 sites (from 21 sites in ABCD). But n_batch showed 21 and had some error. I tried using "batchmode(:,~any(batchmod,1))=[];" inline 15 to remove zero columns.

harmonize DTI files

Hi,

I am using combat to harmonize DTI nifti file, and I am getting the error:
"Error. There are rows with constant values across samples. Remove
these rows and rerun ComBat."

I know there are many 0s in my 3D nifti image data, I am wondering if I how can I using combat in for those data?

Thanks.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.