
CUE

CpG impUtation Ensemble for DNA Methylation Levels Across the HumanMethylation450 (HM450) BeadChip and HumanMethylationEPIC (HM850) BeadChip Platforms

DNA methylation at CpG dinucleotides is one of the most extensively studied epigenetic marks. With technological advancements, geneticists can profile DNA methylation with multiple reliable approaches. However, different profiling platforms can differ substantially in the CpGs they assess, consequently hindering integrated analysis across platforms. Here, we present CpG impUtation Ensemble (CUE), which leverages multiple classical statistical and modern machine learning methods, to impute from the Illumina HumanMethylation450 (HM450) BeadChip to the Illumina HumanMethylationEPIC (HM850) BeadChip.

CUE is maintained by Gang Li [[email protected]] and Laura Raffield [[email protected]].

News and Updates

  • Version 0.8.9 released

Brief introduction

From this study, we provide two sets of imputation models: one for whole blood and one for placenta. Investigators can therefore impute placental or whole blood HM850 CpG sites from their own HM450 data without needing their own reference panel. Our method is also applicable to imputation in other tissues, provided the user can supply a reference dataset assayed on both HM450 and HM850.
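Conceptually, for each HM850-only target CpG, CUE trains several base learners (random forest, XGBoost, KNN, and penalized functional regression) on neighboring HM450 probes and keeps whichever method performed best for that CpG. The snippet below is a purely illustrative R sketch of that per-CpG dispatch idea; predict_rf, predict_xgb, predict_knn and predict_pfr are hypothetical placeholders, not CUE's actual functions (the real implementation ships in R/impute.R).

# Illustrative sketch only -- not CUE's actual code
impute_one_cpg <- function(cpg, X450, best_method) {
  switch(best_method[[cpg]],
         RF  = predict_rf(cpg, X450),   # hypothetical helpers standing in for the
         XGB = predict_xgb(cpg, X450),  # pre-trained per-CpG models that CUE loads
         KNN = predict_knn(cpg, X450),  # from the downloaded model archives
         PFR = predict_pfr(cpg, X450))
}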

Installation

The CUE package can be installed directly as follows:

git clone https://github.com/GangLiTarheel/CUE.git
cd CUE

Required libraries for imputation

Please install the following version of R and the following R packages before performing imputation.

  • R version 3.6.0 (2019-04-26)

  • class_7.3-15

  • xgboost_0.82.1

  • randomForest_4.6-14

  • parallel

Note: to install an R package, simply run install.packages("Name_of_package"). For example,

install.packages("xgboost")

Note: you can also install the specific version of xgboost we used from the CRAN source archive: https://cran.r-project.org/src/contrib/Archive/xgboost/xgboost_0.82.1.tar.gz
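For example, one way to install that archived source tarball directly from R (assuming a working build toolchain and the package's dependencies are already installed):

install.packages("https://cran.r-project.org/src/contrib/Archive/xgboost/xgboost_0.82.1.tar.gz",
                 repos = NULL, type = "source")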

library(randomForest) # RF
library(xgboost) # XGBoost
library("class") # KNN
source("R/refund_lib.R")

library(parallel) # Use this package if you need parallel computing to accelerate computation speed

Download the pretrained imputation models

Note: The compressed pre-trained models require around 58 GB of storage (ELGAN ~36 GB; PTSD ~21 GB).

Please save the pre-trained models (tar.gz files, linked below) in the same directory as your CUE package.

To extract the pre-trained models:

Make two directories for two sets of pre-trained models:

pwd # this should be your root directory of CUE
mkdir PTSD
mkdir ELGAN

Download the pre-trained models and move the PTSD and ELGAN archives to the corresponding folders.

Whole blood (PTSD):

ftp://yunlianon:[email protected]/CUE/PTSD_model.tar.gz

Placenta (ELGAN):

ftp://yunlianon:[email protected]/CUE/ELGAN_model.tar.gz
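Alternatively, the archives can be fetched from within R; a minimal sketch, with the placeholder strings replaced by the PTSD and ELGAN FTP links listed above:

# Replace the placeholders with the FTP links above, then download into the model folders
download.file("<PTSD_model_ftp_link>",  destfile = "PTSD/PTSD_model.tar.gz",  mode = "wb")
download.file("<ELGAN_model_ftp_link>", destfile = "ELGAN/ELGAN_model.tar.gz", mode = "wb")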

Then decompress:

cd PTSD
tar -xf PTSD_model.tar.gz
cd ../

cd ELGAN
tar -xf ELGAN_model.tar.gz
cd ../

CUE imputation

Here we show imputation for a toy dataset with three samples and 248K HM450 probes. We use only 248K probes because, when training the models, we retained only the 248,421 HM450 CpG sites that overlap between ELGAN and PTSD and have no missingness in our samples, so that incompleteness or pre-imputation of the HM450 data would not affect our evaluation of HM850 imputation. The full list of 248K HM450 probes can be found in Probes.RData.

The input dataset X should have rows as probes and columns as samples.
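Before running the steps below, it can help to confirm that your matrix X contains all 248K required probes and is ordered consistently. A minimal sketch, assuming X is your probes-by-samples beta matrix and that Probes.RData stores a character vector of the 248K HM450 probe names (the object name inside the file is looked up rather than assumed):

probe_obj <- load("Data/Probes.RData")       # returns the name(s) of the loaded object(s)
probes <- get(probe_obj[1])                  # assumed: character vector of 248K HM450 probe IDs
length(setdiff(probes, rownames(X)))         # should be 0 before running CUE
X <- X[probes, , drop = FALSE]               # order rows to match the training probes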

## Set the working directory
setwd("/YOUR_DIRECTORY/CUE")
# Replace YOUR_DIRECTORY with your directory where you downloaded CUE.

root_dir = getwd()


## load sample dataset
load("PTSD/Sample_Dataset.RData")
X<-sample_data                            # Input HM450 data
m<-dim(X)[2]                              # Number of samples

## load required datasets 
load("PTSD/Annotations.RData")                # load Annotation data
load("Data/PTSD.neighbors.RData")             # load All neighbors
load("Data/Probes.RData")                     # load All probes' names

## load required datasets for PTSD imputation
load("Data/PTSD_CpG_Best_method_list.RData")  # load the best method list 
load("PTSD/RF/best_RF.RData")                 # load neighbors for RF
load("PTSD/XGB/best_XGB.RData")               # load neighbors for XGB
load("PTSD/KNN/best_KNN.RData")               # load neighbors for KNN
load("PTSD/TCR/best_TCR.RData")               # load neighbors for PFR

## load all the required packages
library(randomForest) # RF
library(xgboost)      # XGBoost
library(class)        # KNN
source("R/refund_lib.R")

## load the CUE imputation function
source("R/impute.R")

## Check the input
CUE_check(X)                        # check whether the input X matches the requirements

## TCR required input	
temp <- TCR.input(X)
X_logit = temp[[1]]                 # Logit transformation, log(beta/(1-beta)), so the values better follow a Gaussian distribution
test.funcs <- temp[[2]]             # Pre-calculated sample-specific density functions

## Imputation (a single CPU may take 4-5 days to impute a medium-sized methylation dataset of n = ~100 samples)
m.imputed<-CUE.impute(X,m,"PTSD")

## Save the imputed probes as y_impute.RData
setwd(root_dir)
save(m.imputed,file="y_impute.RData")


Note: (a) We impute all 339K HM850-specific probes that had complete data in our reference whole blood and placenta datasets. Users of CUE must apply the following quality control steps to retain only well-imputed probes for subsequent analyses (such as epigenome-wide association studies).

(b) We provide the corresponding annotation files within the PTSD dataset ("PTSD/Annotations.RData") for building functional predictors for the penalized functional regression models. The annotation files can also be downloaded from the Illumina website.

File links:

HM850:

ftp://webdata2:[email protected]/downloads/productfiles/methylationEPIC/infinium-methylationepic-v-1-0-b4-manifest-file-csv.zip

HM450:

Quality Control

We provide two QC+ probe lists for the two datasets, using the following thresholds:

QC for ELGAN: RMSE < 0.1 and accuracy > 90%;

QC for PTSD: RMSE < 0.05 and accuracy > 95%.
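If you have a validation dataset measured on both platforms and want to derive your own QC+ list, the per-probe filtering can be sketched as follows. Here truth and imputed are hypothetical samples-by-probes matrices of measured and CUE-imputed HM850 beta values, and the accuracy shown (fraction of samples imputed within 0.1 of the measured value) is an assumed definition for illustration, not necessarily the exact metric used in the paper:

# truth, imputed: hypothetical samples x probes matrices over the same HM850 probes
rmse <- sqrt(colMeans((imputed - truth)^2))           # per-probe root mean squared error
acc  <- colMeans(abs(imputed - truth) < 0.1)          # assumed accuracy definition
my_QC_probes <- colnames(imputed)[rmse < 0.05 & acc > 0.95]   # PTSD-style thresholds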

QC_probes<-read.csv("Placenta (ELGAN)QC_probe_list.csv") # for Placenta
#QC_probes<-read.csv("Whole Blood (PTSD)QC_probe_list.csv") # for whole blood
load("y_impute.RData")
QC_probes<-unlist(QC_probes)
m.QC <- m.imputed[,paste(QC_probes)]
save(m.QC, file="m.imputed.QC.RData")

(Optional): We also provide a Shiny app, CUE_QC.R, a visualization tool for selecting the QC+ probe list.

library(shiny)
runApp('CUE_QC.R')

The Shiny app saves the list of well-imputed probes; users can then apply the code above to subset the imputed matrix to those probes.

The final QC+ imputed HM850 probes will be saved as "m.imputed.QC.RData". See the output:

## Output : DNA methylation matrix
dim(m.QC)
# [1]      3 238090
head(m.QC[,1:5])
#          cg16269199 cg22519184  cg24063007 cg10692041 cg02339369
# sample_1  0.7452506  0.3634424 0.012868836  0.9411898  0.9493930
# sample_2  0.7341876  0.3567242 0.006375319  0.9145329  0.9557632
# sample_3  0.7422996  0.3668391 0.012483532  0.9411898  0.9493930

Citation

Li G, Raffield L, Logue M, Miller MW, Santos HP Jr, O'Shea TM, Fry RC, Li Y. CUE: CpG impUtation ensemble for DNA methylation levels across the human methylation450 (HM450) and EPIC (HM850) BeadChip platforms. Epigenetics. 2021 Aug;16(8):851-861. doi: 10.1080/15592294.2020.1827716. Epub 2020 Oct 4. PMID: 33016200; PMCID: PMC8330997.


