tshmak / lassosum Goto Github PK
View Code? Open in Web Editor NEWLASSO for GWAS with summary statistics
License: MIT License
LASSO for GWAS with summary statistics
License: MIT License
Hi, Sam,
I was looking at the documentation to figure out how I would run PRSice when I have data from multiple chromosomes. I have generated PLINK (.bim, .bed and .fam) files for each chromosome 1-22 respectively.
The documentation mentions this:
"Target genotype file. Currently support both BGEN and binary PLINK format. For multiple chromosome input, simply substitute the chromosome number with #. PRSice will automatically replace # with 1-22. A separate fam/sample file can be specified by --target ,"
I tried substituting the number of the chromosome with the "#" but it gave me the error listed below. Am I using this option incorrectly? If not what do you think is the best way to run PRSice on the combined chromosome data?
Error: -----------------------------------------------------------------------------------------------------------------------
Initializing Genotype file:
(bed)
With external fam file:
Hi!
Thank you for your help on crosspred thread, I think this is more lassosum related so I will just put it here.
Is there a way to obtain the predicted phenotypes from the PGS and use them to calculate the specifcity, sensitivity and balanced accuracy for model comparison? I figured the pheno in result.table is the true phenotype we input for the test dataset.
Thank you for the help
Hey Tim, is there any documentation about the pseudovalidate? I am wondering what is the principle of pseudovalidate.
By the way, can I use the same training dataset to do lasso and training the parameter "s", then apply it to a new dataset?
Thanks!
Hi Sam,
So I have run PRSice on a sample of 106 individuals, 53 which are controls and 53 which are test cases. When running PRSice I seem to be getting all the same PRS for each individual.
PRSice.R --dir . --./PRSice_linux --base basefile --target mergedGenomeLNKAnc --thread 1 --a1 ALT --a2 REF --snp SNP --beta --stat all_inv_var_meta_beta --binary-target T --pvalue all_inv_var_meta_p
Here is what the PRSice .best file looks like:
d54f4d2c-9b1e-4a4f-a21a-59ab07425a93 d54f4d2c-9b1e-4a4f-a21a-59ab07425a93 Yes -0.062335
6c531200-d09a-4b6d-a30d-dd3d9ca4cbf0 6c531200-d09a-4b6d-a30d-dd3d9ca4cbf0 Yes -0.062335
922f420d-dbf2-40af-a712-6da062cec326 922f420d-dbf2-40af-a712-6da062cec326 Yes -0.062335
4622feae-5216-466a-90a6-704da2dd0e7c 4622feae-5216-466a-90a6-704da2dd0e7c Yes -0.062335
4d6eae1d-18e1-4183-ad80-c393c1c943aa 4d6eae1d-18e1-4183-ad80-c393c1c943aa Yes -0.062335
88b639cd-f358-4684-882c-7881162d5e51 88b639cd-f358-4684-882c-7881162d5e51 Yes -0.062335
2556ca53-88f0-4ac3-b3be-d124b0ec73aa 2556ca53-88f0-4ac3-b3be-d124b0ec73aa Yes -0.062335
6a1b8d2f-474a-4896-a9b2-e1d934021f90 6a1b8d2f-474a-4896-a9b2-e1d934021f90 Yes -0.062335
382f8f34-c91e-4d95-934a-2a81991dd74a 382f8f34-c91e-4d95-934a-2a81991dd74a Yes -0.062335
9d6bdebb-26d8-4848-b2cd-81c2f342d256 9d6bdebb-26d8-4848-b2cd-81c2f342d256 Yes -0.062335
9dd07cc4-d6cd-40c4-8b11-9c0992e242f5 9dd07cc4-d6cd-40c4-8b11-9c0992e242f5 Yes -0.062335
Do you know possible reasons for why this could be?
Thanks!
pval <- 10^-(0:500)
sign <- rnorm(length(pval))
lassosum::p2cor(pval, 10e3, sign)
Hi,
After I install the package into R and use library(lassosum), it says:
Error: package or namespace load failed for ‘lassosum’ in get(Info[i, 1], envir = env):
lazy-load database '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/lassosum/R/lassosum.rdb' is corrupt
In addition: Warning message:
In get(Info[i, 1], envir = env) : internal error -3 in R_decompress1
And I find when I try to install it, it says:
functions.cpp:424:9: warning: unused variable 'i' [-Wunused-variable]
int j,i;
^
functions.cpp:467:7: warning: unused variable 'nreps' [-Wunused-variable]
int nreps=startvec.n_elem;
^
2 warnings generated.
clang++ -std=gnu++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o lassosum.so RcppExports.o functions.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/usr/local/Cellar/gcc/10.2.0/lib/gcc/10 -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
installing to /Library/Frameworks/R.framework/Versions/3.6/Resources/library/00LOCK-lassosum/00new/lassosum/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
So I'm not sure whether I install the package correctly.
Thank you so much!
Hi,
I am trying to use lassosum in my pipeline and I'm getting the following error. Any help would be appreciated.
Running lassosum.pipeline
Coordinating summary stats with reference panel...
Splitting genome by LD blocks ...
Running lassosum ...
s = 0.2
Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
Calls: do.call ... postNode -> sendData -> sendData.SOCK0node -> serialize
Execution halted
Thank you
Dear Developers,
Could you please help to support PLink2 dosage format? which probably better to guessing the missing genotype.
Best regards
Wallace
Hi,
I ran the lassosum per chromosome using the multi-thread mode (10 threads). The code is as follows.
`for (i in 1:22){
message(paste0("Chr ",i," is calculating"))
ss<-fread(paste0(ssf.dir,"/severe_clean_chr",i,".txt"))
test.file <- paste0(test.dir,"/chr",i,".nodup")
ref.file <- paste0(ref.dir,"/chr",i)
# Transform the P-values into correlation
cor <- p2cor(p = ss$P,
n = ss$N,
sign = ss$BETA
)
if (i==1) {
out <- lassosum.pipeline(
cor = cor,
chr = ss$CHR,
pos = ss$BP,
A1 = ss$A1,
A2 = ss$A2,
ref.bfile = ref.file,
test.bfile = test.file,
LDblocks = LDblocks,
cluster=cl)
} else {
out.temp <- lassosum.pipeline(
cor = cor,
chr = ss$CHR,
pos = ss$BP,
A1 = ss$A1,
A2 = ss$A2,
ref.bfile = ref.file,
test.bfile = test.file,
LDblocks = LDblocks,
cluster=cl)
out<- merge(out,out.temp)
}
}
`
For most chromosomes, it worked well. However, for chromosome 3, an error occurred, which is
Running lassosum ...
s = 0.2
Error in checkForRemoteErrors(val) : one node produced an error: Not converging.....
I would appreciate it if you could provide some help with it.
Best wishes.
Yuting
When using function pseudovalidate.lassosum.pipeline.R with destandardize=T
, betas are destandardized and given to function pseudovalidation.R where they are destandardized again to compute bXXb. I was wondering what explains the need a such a double destandardization? I understand that this is not the case if one specifies destandardize=F
in the first function. Still, if one wishes to obtain pgs computed on destandardized betas, destandardize=T
needs to be specified.
Many thanks for the clarification,
Jasmin
@tshmak thanks for the nice tool I have been using for a while
I have these two snps in my summary stats which have really low pvalue and end up generating NaN correlations when using p2cor
function. The sample size is above 1000
snp | af | effectsize | se | pvalue |
---|---|---|---|---|
snp1 | NA | 0.6433 | 0.08915 | 4.940656e-324 |
snp2 | NA | 0.6434 | 0.08916 | 4.940656e-324 |
What is the best approach to handle such snps in this function/tool because the NaNs present issues downstream. I would really appreciate your feedback on this issue.
Hi, I have followed the suggestion to apply validated beta into new file as below
out2 = subset(out, out$s=v$best.s, out$lambda=v$best.lambda)
v2 = validate(out2, pheno=pheno.read[,3], covar = covar.read, test.bfile=newfile)
At the same time, we tried to generate the PRS using v$best.beta and the genotype in the "newflie" using
X %*% v$best.beta, where X is the genotype in the newfile, coded as 0/1/2.
However, v2$best.pgs is not the same as PRS generated using X %*% v$best.beta. Could you please shed some light on this issue? The "--fill-missing-a2" function seems to be no longer available in plink, the missing data in X is replaced by 0, which is consistent as what's being done in lassosum. We also make sure that the A2 in X is the same as the A2 in newfile.
We have tested our scripts on other PRS generation and encountered no problem. I am really grateful if you could give us some advice on this issue.
Many thanks!
Can you please give more document for lassosum.pipeline for help understand the results?
Thanks much!
Wallace
Greetings,
As mentionned in the title, i'm actually using the splitvalidation function for the parameter tuning and the estimation of the best PRSs. The distribution of the resulting PRSs is odd to me as it is bimodal. It's the first time that I'm seeing such a thing. Here the distribution in question:
In addition, I'm computing PRSs using splitvalidation with other popular methods (LDpred2, PRS-CS, EBPRS..) and the distributions are all pretty normal.
Unfortunately, I cannot give any further information about the data I'm using as they are confidention. Despite, does anyone has any idea on why this situation is happening?
Thanks a lot.
I'm using the lassosum.pipeline
/ validate
/ subset
/ validate
strategy of the README (splitting the test set in 2).
I'm getting v2$best.validation.result
equals to -0.491. And AUC(v2$best.pgs, v2$pheno - 1)
equals to 0.205.
Is this normal? Should I just take the opposite?
@tshmak thanks for the nice tool I have been used to caculate PRS.
I have a problem with best.pgs generating NaNs.
library(lassosum)
y<- fread("pheno_cov.txt")
cov<- y[,c("FID","IID" ,"sex","byear","genotype_array","PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10")]
target.pheno <- y[,c("FID", "IID", "pho")]
cl <- makeCluster(10)
sum.stat <- "prs_sumstat"
bfile <- "my_bfile"
ld.file <- "EUR.hg19"
ss <- fread(sum.stat)
head(ss)
#1: rs3115860 1 753405 A C 0.013735 0.008693 0.114096 71255
#2: rs2073813 1 753541 A G -0.017856 0.009096 0.049656 68419
nrow(ss)
#[1] 5067039
ss <- ss[!p == 0]
cor <- p2cor(p = ss$p,
n =ss$sample,
sign = ss$beta
)
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]
head(fam)
set.seed(1234)
out <- lassosum.pipeline(
cor = cor,
chr = ss$CHR,
pos = ss$BP,
A1 = ss$A1,
A2 = ss$A2,
ref.bfile = bfile,
sample=5000,
test.bfile = bfile,
LDblocks = ld.file,
cluster=cl
)
target.res <- validate(out, pheno = as.data.frame(target.pheno), covar=as.data.frame(cov))
lasso_score <- target.res$results.table
summary(lasso_score$best.pgs)
#-0.3927 -0.0565 0.0027 0.0000 0.0592 0.3293 2071
the sample size for bfile is 330,000
I want to know why 2071 individuasl have NA pgs...
I would be really appreciated for your feedback on this issue.
It doesn't work even if test .bed file is given. I think it's due to the an error in this script, more specifically this code snippet down below. It tries to check if LDblocks
is a data.frame object before it is read.table in R, so the error is imminent.
if(is.factor(LDblocks)) LDblocks <- as.integer(LDblocks)
if(is.vector(LDblocks)) stopifnot(length(LDblocks) == length(cor)) else
if(is.data.frame(LDblocks) || is.data.table(LDblocks)) {
LDblocks <- as.data.frame(LDblocks)
stopifnot(ncol(LDblocks) == 3)
stopifnot(all(LDblocks[,3] >= LDblocks[,2]))
LDblocks[,1] <- as.character(sub("^chr", "", LDblocks[,1], ignore.case = T))
And the resulting error:
Error in (function (cor, chr = NULL, pos = NULL, snp = NULL, A1 = NULL, :
I cannot recognize this LDblock. Specify one of EUR.hg19, AFR.hg19, ASN.hg19, EUR.hg38, AFR.hg38, ASN.hg38
Calls: do.call -> <Anonymous>
Execution halted
Dear tshmak
Hello, I am running the lassosum.pipeline in unix but I have some errors related to this package.
First chromosome-wise running is not working in some chromosomes.
I am using the code as follows.
out = lassosum.pipeline(cor=cor,chr=ss.df$CHR,pos=ss.df$BP,A1=ss.df$ALT,A2=ss.df$REF,ref.bfile="/data/pkg36/lassosum/data/refpanel",test.bfile=test.bfile,LDblocks="ASN.hg19")
out22 = lassosum.pipeline(cor=cor,chr=ss.df$CHR,pos=ss.df$BP,A1=ss.df$ALT,A2=ss.df$REF,ref.bfile="/data/pkg36/lassosum/data/refpanel",test.bfile=paste0(test.bfile,"_chr22"),LDblocks="ASN.hg19")
out21 = lassosum.pipeline(cor=cor,chr=ss.df$CHR,pos=ss.df$BP,A1=ss.df$ALT,A2=ss.df$REF,ref.bfile="/data/pkg36/lassosum/data/refpanel",test.bfile=paste0(test.bfile,"_chr21"),LDblocks="ASN.hg19")
out20 = lassosum.pipeline(cor=cor,chr=ss.df$CHR,pos=ss.df$BP,A1=ss.df$ALT,A2=ss.df$REF,ref.bfile="/data/pkg36/lassosum/data/refpanel",test.bfile=paste0(test.bfile,"_chr20"),LDblocks="ASN.hg19")
out22 and out21 works fine, but out20 gets an error like this:
But lassosum.pipeline for whole chromosomes; out works.
Second, as I merged the outcomes, in my case out22 and out21, validate works but pseudovalidate doesn't work.
out = merge(out22,out21)
v = validate(out)
v = pseudovalidate(out)
I tried to figure out the reason for these errors, but I am lost.
If you have any comments or tips for fixing these, please let me know.
Best Wishes
Hi, I ran the pipeline by chromosomes using the same ref.file and test.file for each chromosome, then merged the output variables together using "merge" in a loop.
However, when I used "validate", it threw the error:
Error in splitvec.from.bfile(bfile) :length(pvec) == length(bfile) is not TRUE
Could you explain to me what might caused the error?
this link does not work: https://github.com/tshmak/lassosum/releases/download/v0.2.2/lassosum_0.2.2.tar.gz
Hi there,
I was wondering if lassosum
could take pre-computed LD as part of the input instead of taking genotypes of a reference panel.
Thanks!
Yanyu
Dear authors,
From your paper (page3, just above equation (4)), the genotype matrix is first normalized and then divided by sqrt(n), such that R=X^{T}X is the LD matrix, which makes sense to me. However, in the runElnet function in funcions.cpp, the genotype matrix is only normalized (line 660), did the genotype matrix then divided by sqrt(n) afterward? I am trying to understand your implementation of lassosum algorithm, thanks!
Peng
Dear Timothy,
Thanks for and congratulations to your hard work on lassosum!
Some of the LD regions of the provided Berisa.XXX.hg38.bed files overlap. Also, there are gaps between the regions. However, they were no overlaps and no gaps in the original Berisa.XXX.hg19.bed files.
Berisa.EUR.hg38.bed:
chr pos pos2
1: chr1 10583 1892607
2: chr1 1961168 3666172
3: chr1 3582736 4380811
4: chr1 4320751 5853833
5: chr1 5853833 7187275
Overlap: line 2 ends at 3,666,172, line 3 starts at 3,582,736
Gap: line 2 ends at 1,892,607, line 3 starts at 1,961,168
Berisa.EUR.hg19.bed:
chr start stop
1: chr1 10583 1892607
2: chr1 1892607 3582736
3: chr1 3582736 4380811
4: chr1 4380811 5913893
5: chr1 5913893 7247335
Any suggestions on how to resolve this?
Thanks,
Lars
Hi Dr. Mak,
I just wanted to let you know that lassosum.pipeline
and parseselect
indicate that they can accept a two-column data.frame
(FID, IID) in order to keep/remove samples from the reference panel or test data. However, I notice that it won't accept tibbles
in place of a data.frame
, even though tibbles
inherit the data.frame
class.
class(keep_ids)
[1] "tbl_df" "tbl" "data.frame"
Not sure if this is relevant to making a fix, since it is rather simple to wrap the tibble
in as.data.frame()
call, but it did cause me a few minutes of confusion and I thought I would share my experience in case it benefits others/the package.
For context, here is the error produced when you use a tibble
instead of a data.frame
when making a call to either parseselect
or lassosum.pipeline
:
Error in parseselect(ref.bfile, keep = keep.ref, remove = remove.ref) :
No individuals left after keep/remove! Make sure the FID/IID are correct.
Can you please suggest how I can find the list of SNPs with non-zero best used in final pgs? TIA
Dear developers,
Is there a way to export the reweighed(LD panel adjusted) beta values? This can give us more flexibilities to do the following analysis.
Thanks much!
Wallace
I am benchmarking PRS/PGS methods practiced in the PRS tutorial. Regard to Lassosum, I got some confusions. I'd appreciate if you clarify them. Here they are:
1- In the paper (Polygenic scores via penalized regression on summary statistics) and the introduction of the lassosum repository, it is mentioned that 'reference panel' is deployed to use its LD information. In the exercise provided in the tutorial, ref.bfile is literally the bed file of the target dataset (EUR.QC.bed in the Height case), while it is expected ref.bfile refers to a reference panel file like HapMap3. Please clarify this confusion.
2- In the tutorial, fam variable is calculated via these codes:
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]
but, there is no application for fam. Please clarify.
3- Why there is no 'SNPs matching' like what we have in LDpred-2?
Thank you,
Farshad
Hi, I have applied the lassopipeline to a summary statitics and got a rds file succussefully, but when I run the pseudovalidate on a new data set (160K data) without external phenotype, it took me more 24h which is limit time on my server, could you give me some suggestions to solve this problem?
Looking forward to your kind reply,
Best Regards,
Junbiao
Hi,
I calculated the PRS using internal dataset (whole dataset )and now trying to use UK biobank as a validation dataset (per chromosome). I used the command below, the analysis is running without any error. However it is taking too long to compute the new PGS. It is stuck at the "Calculating PGS" step for almost 3 days. Is it normal? and is there a way to speedup? I am using the standalone version of lassosum.
lassosum --lassosum.pipeline <internal.lassosum.pipeline.rds> --validate.rds <internal.validate.rds> --applyto ukb_chr --out ukb_chr1 --nthreads 16
Please let me know if something is unclear.
Regards
One might encounter the following error when using lassosum in a Rscript with R version < 3.4.0
Error in validObject(.Object) :
invalid class “ddiMatrix” object: Not a valid 'Mnumeric' class object
This is because of some strange versioning problem and can be solved by also including
library(methods)
in the beginning of your Rscript (Or just use R >= 3.4.0), just in case anyone else encountered the same problem...
I have a binary outcome denoted by 1 and 2 in the .fam file of test data and I'm getting the following error when I tried to run validation step:
v <- validate(out) # Use the 6th column in .fam file in test dataset for test phenotype
Error:
Error in validate.lassosum.pipeline(out) :
There's no variation in phenotype
Can you please suggest how to address this?
PS: I have replaced missing values with NA instead of -9
Hi, below is my code.
library(lassosum)
library(data.table)
library(parallel)
library(fdrtool)
cl <- makeCluster(7, type="FORK")
#the data is downloaded from https://blog.nus.edu.sg/agen/summary-statistics/t2d-2020/ T2D_ALL_PRIMARY dataset
ss <- fread("/path/to/SpracklenCN_prePMID_T2D_ALL_Primary.txt")
ss$P[ss$P < .Machine$double.xmin] <- .Machine$double.xmin
test.bfile <- "bi.merged"
cor <- p2cor(p = ss$P, n = ss$Neff, sign=ss$Beta)
sum(is.na(cor))
ld <- fread("/path/to/Berisa.ASN.hg19.bed")
out <- lassosum.pipeline(cor=cor, chr=ss$Chr, pos = ss$Pos, A1 = ss$EAF, A2 = ss$NEA,
test.bfile = "/path/to/bi.merged",
LDblocks = ld, cluster = cl,exclude.ambiguous = F)
v <- pseudovalidate(out)
save(v,file="/path/to/ASN.DM.Rdata")
The output is:
[1] 0
Coordinating summary stats with reference panel...
Splitting genome by LD blocks ...
Running lassosum ...
s = 0.2
Error in lassosum(cor = cor2, bfile = ref.bfile, shrink = s, extract = ref.extract, :
!any(is.na(cor)) is not TRUE
Calls: lassosum.pipeline -> lapply -> FUN -> lassosum -> stopifnot
Execution halted
I managed to use the same "bi.merged" bfiles to construct PRS for other phenotypes but not this one. Sorry that I may not be able to provide the bfiles I used. But if you really need them to diagnose the problem, maybe I could revise the .fam data a bit and send them to you.
Hope my question is clear now. Thank you for your time.
Line 211,
lassosum(cor=cor2, bfile="./data/refpanel",
this prevents the pipeline being run without the refpanel being in that exact path.
Hi thank you for making this software. I'm trying it on my own data and I was wondering if you could explain what the out$pgs actually is. I can see that rows are values for different individuals, but what are the columns in each "s"?
thank you
Maria
I'm trying to run lassosum using the following code and I'm getting this error: Error in $<-.data.frame
(*tmp*
, "ok123", value = TRUE) :
replacement has 1 row, data has 0
out <- lassosum.pipeline(cor=cor, chr=ss$CHR, pos=ss$BP,
A1=ss$A1, A2=ss$A2, # A2 is not required but advised
ref.bfile=ref.bfile, test.bfile=test.bfile,
LDblocks = LDblocks)
#Log
Coordinating summary stats with reference panel...
Error in $<-.data.frame
(*tmp*
, "ok123", value = TRUE) :
replacement has 1 row, data has 0
Can you please suggest how to address this issue? TIA
Hi!
I am new to working with polygenic risk scores (PRS) and Lassosum. I started working on PRS using the diabetes summary statistics file as my base data and my own samples as test data. When I run the pipeline, I constantly get an error. I would really appreciate it if you could help me out with it.
Here's the code that I've used:
library(lassosum)
# Prefer to work with data.table as it speeds up file reading
library(data.table)
library(methods)
library(magrittr)
# For multi-threading, you can use the parallel package and
# invoke cl which is then passed to lassosum.pipeline
library(parallel)
# This will invoke 2 threads.
cl <- makeCluster(2)
sum.stat <- “d1new.QC.gz"
bfile <- “report.QC"
# Read in and process the covariates
covariate <- fread(“report.cov")
pcs <- fread(“report.eigenvec")
setnames(., colnames(.), c("FID","IID", paste0("PC",1:6)))
# Need as.data.frame here as lassosum doesn't handle data.table
# covariates very well
cov <- merge(covariate, pcs, by = c(“IID”)
# We will need the EUR.hg19 file provided by lassosum
# which are LD regions defined in Berisa and Pickrell (2015) for the European population and the hg19 genome.
ld.file <- "EUR.hg19"
# output prefix
prefix <- “report”
# Read in the target phenotype file
target.pheno <- fread(“report1.diabetes)[,c(“FID", "IID", “Diabetes”)]
# Read in the summary statistics
ss <- fread(sum.stat)
#Change names of the columns
names(ss)[names(ss) == "P-val"] <- "P"
names(ss)[names(ss) == "Position_hg19"] <- "BP"
names(ss)[names(ss) == "SNP"] <- "summary"
names(ss)[names(ss) == "rsid"] <- "SNP"
names(ss)[names(ss) == "Effect-allele"] <- "A1"
names(ss)[names(ss) == "Other-allele"] <- "A2"
names(ss)[names(ss) == "Other-allele-frequency"] <- "MAF"
names(ss)[names(ss) == "Other-allele-frequency-cases"] <- "MAF-cases"
names(ss)[names(ss) == "Sample-size-cases"] <- "N-cases"
names(ss)[names(ss) == "Sample-size"] <- "N"
# Remove P-value = 0, which causes problem in the transformation
ss <- ss[!P == 0]
#Remove Chromosomes>23
df <- subset(ss, Chr<23)
# Transform the P-values into correlation
cor <- p2cor(p = df$P,
n = df$N,
sign = log(df$OR)
)
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]
# Run the lassosum pipeline
# The cluster parameter is used for multi-threading
# You can ignore that if you do not wish to perform multi-threaded processing
out <- lassosum.pipeline(
cor = cor,
chr = df$CHR,
pos = df$BP,
A1 = df$A1,
A2 = df$A2,
ref.bfile = bfile,
test.bfile = bfile,
LDblocks = ld.file,
cluster=cl
)
# Store the R2 results
target.res <- validate(out, pheno = as.data.frame(target.pheno), covar=as.data.frame(cov))
# Get the maximum R2
r2 <- max(target.res$validation.table$value)^2
Here's the error that I'm getting:
0 out of 49 samples kept in pheno.
Error in parse.pheno.covar(pheno = pheno, covar = covar, parsed = parsed.test, :
No phenotype left. Perhaps the FID/IID do not match?
I have used the code given in [https://choishingwan.github.io/PRS-Tutorial/lassosum/]
Thanks in advance!!
Harshita
Hello. I was running lassosum to construct PRS,my sample size is 133129, I get this message :
Error in group.blocks(Blocks, parsed, mem.limit, chunks, cluster) :
Size of blocks too large. Suggest split into smaller blocks or increase mem.limit or use only a sample of individuals.
Does someone met this problem before? how to handle this issue? please let me know since this is urgent things for me.
I installed lassosum library on Rv4.0, ~/libraries_R/R_LIB4.0/lassosum/data
where I've following files:
list.files()
[1] "Berisa.AFR.hg19.bed" "Berisa.AFR.hg38.bed" "Berisa.ASN.hg19.bed"
[4] "Berisa.ASN.hg38.bed" "Berisa.EUR.hg19.bed" "Berisa.EUR.hg38.bed"
[7] "Berisa.R" "Berisa.README" "GenerateData.R"
[10] "refpanel.bed" "refpanel.bim" "refpanel.fam"
[13] "summarystats.txt" "testsample.bed" "testsample.bim"
[16] "testsample.covar.txt" "testsample.fam" "testsample.pheno.txt"
I've data for an admixed population for which none of the widely used population or LD panel will be helpful.
What is the format of LD used in pipeline with the lassosum.pipeline
function?
### Read LD region file ###
LDblocks <- "EUR.hg19" # This will use LD regions as defined in Berisa and Pickrell (2015) for the European population and the hg19 genome.
# Other alternatives available. Type ?lassosum.pipeline for more details.
I think from the ?lassosum.pipeline
LD block is a data.frame of the following format:
chr start stop
chr1 1961168 3666172
I'm at the repo https://bitbucket.org/nygcresearch/ldetect/src/master/
Can this be used to create an in-house LD acceptable data format for the lassosum?
I got link to this code repository from https://academic.oup.com/bioinformatics/article/32/2/283/1743626
Or, can output from plink be used? https://zzz.bwh.harvard.edu/plink/ld.shtml
plink --file mydata --r
thanks,
Does pseudovalidate() also provide a valid estimate of corr(PRS, Phenotype) when there is no LD? That is, I'm simulating SNP data with no LD, then run a GWAS of a disease phenotype using these SNPs. Using the resulting summary stats, and a test.bfile that also has no LD, will pseudovalidate provide a correct estimate?
Hi,
When I run the example of lassosum, it is difficult for me to get the same result from lassosum and plink. lassosum code is:
library(lassosum)
setwd(system.file("data", package="lassosum"))
ss <- fread("summarystats.txt")
ref.bfile <- "refpanel"
test.bfile <- "testsample"
LDblocks <- "EUR.hg19"
cor <- p2cor(p = ss$P_val, n = 60000, sign=log(ss$OR_A1))
out <- lassosum.pipeline(cor=cor, chr=ss$Chr, pos=ss$Position,
A1=ss$A1, A2=ss$A2, # A2 is not required but advised
ref.bfile=ref.bfile, test.bfile=test.bfile,
LDblocks = LDblocks)
set.seed(20190822)
pheno <- rnorm(nrow.bfile(out$test.bfile))
v <- validate(out, pheno=pheno)
beta <- v$best.beta
idx_eff <- which(beta != 0) ##lasso zero
idx <- out$sumstats$order[idx_eff]
beta_dat <- data.frame(ss[idx, 1], ss[idx, 3], ss[idx, 4], beta[idx_eff])
test_bim <- data.frame(fread("testsample.bim"))
test_snp <- test_bim[match(beta_dat[, 2], test_bim[, 4]), 2]
beta_dat2 <- data.frame(test_snp, beta_dat[, 3], beta_dat[, 4])
write.table(beta_dat2, file = "eff.txt", col.names = F, row.names = F, quote = F)
plink code is:
cd your/lassosum/datapath/
plink-1.9 --bfile testsample --score eff.txt 1 2 3 sum --out pheno_testsample
Result of lassosum:
head(v$beta.pgs)
-0.418 0.46 -0.451 0.165 0.102
Result of plink:
FID IID PHENO CNT CNT2 SCORESUM
0 HG00096 -9 1634 730 -0.307637
0 HG00097 -9 1634 748 -0.102278
0 HG00099 -9 1634 761 -0.202595
0 HG00100 -9 1634 744 -0.0630223
0 HG00101 -9 1634 734 -0.221991
0 HG00102 -9 1634 740 -0.139318
0 HG00103 -9 1634 716 -0.286505
0 HG00105 -9 1634 722 -0.221223
0 HG00106 -9 1634 746 -0.237664
Could you tell me the mistake? Thanks!
Best,
Sheng
If I understand correctly, the best.validation.result is the correlation between the actual and predicted phenotype and the square if the percent of the phenotype that is explained by the genetic model. However, the correlation in the validation$results.table pheno and best.pgs is dramatically different from the best.validation.result (ex. 0.44 vs. 0.72). Can you please clarify why this difference exists? Thank you in advance!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.