tshmak / lassosum Goto Github PK

View Code? Open in Web Editor NEW

70.0 70.0 24.0 2.02 MB

LASSO for GWAS with summary statistics

License: MIT License

R 70.93% C++ 28.93% C 0.13%

lassosum's People

Contributors

Stargazers

Watchers

lassosum's Issues

How to use the "#" for multiple chromosome input

Hi, Sam,

I was looking at the documentation to figure out how I would run PRSice when I have data from multiple chromosomes. I have generated PLINK (.bim, .bed and .fam) files for each chromosome 1-22 respectively.
The documentation mentions this:
"Target genotype file. Currently support both BGEN and binary PLINK format. For multiple chromosome input, simply substitute the chromosome number with #. PRSice will automatically replace # with 1-22. A separate fam/sample file can be specified by --target ,"
I tried substituting the number of the chromosome with the "#" but it gave me the error listed below. Am I using this option incorrectly? If not what do you think is the best way to run PRSice on the combined chromosome data?

Error: -----------------------------------------------------------------------------------------------------------------------
Initializing Genotype file:
(bed)
With external fam file:

Start processing basefile

Error: Cannot open file:
***/chr#.QC.bim

Getting predicted phenotypes from validate()

Hi!

Thank you for your help on crosspred thread, I think this is more lassosum related so I will just put it here.

Is there a way to obtain the predicted phenotypes from the PGS and use them to calculate the specifcity, sensitivity and balanced accuracy for model comparison? I figured the pheno in result.table is the true phenotype we input for the test dataset.

Thank you for the help

Any documentation of pseudovalidate?

Hey Tim, is there any documentation about the pseudovalidate? I am wondering what is the principle of pseudovalidate.

By the way, can I use the same training dataset to do lasso and training the parameter "s", then apply it to a new dataset?

Thanks!

Getting All the Same PRS in PRS.best

Hi Sam,

So I have run PRSice on a sample of 106 individuals, 53 which are controls and 53 which are test cases. When running PRSice I seem to be getting all the same PRS for each individual.

PRSice.R --dir . --./PRSice_linux --base basefile --target mergedGenomeLNKAnc --thread 1 --a1 ALT --a2 REF --snp SNP --beta --stat all_inv_var_meta_beta --binary-target T --pvalue all_inv_var_meta_p

Here is what the PRSice .best file looks like:
d54f4d2c-9b1e-4a4f-a21a-59ab07425a93 d54f4d2c-9b1e-4a4f-a21a-59ab07425a93 Yes -0.062335
6c531200-d09a-4b6d-a30d-dd3d9ca4cbf0 6c531200-d09a-4b6d-a30d-dd3d9ca4cbf0 Yes -0.062335
922f420d-dbf2-40af-a712-6da062cec326 922f420d-dbf2-40af-a712-6da062cec326 Yes -0.062335
4622feae-5216-466a-90a6-704da2dd0e7c 4622feae-5216-466a-90a6-704da2dd0e7c Yes -0.062335
4d6eae1d-18e1-4183-ad80-c393c1c943aa 4d6eae1d-18e1-4183-ad80-c393c1c943aa Yes -0.062335
88b639cd-f358-4684-882c-7881162d5e51 88b639cd-f358-4684-882c-7881162d5e51 Yes -0.062335
2556ca53-88f0-4ac3-b3be-d124b0ec73aa 2556ca53-88f0-4ac3-b3be-d124b0ec73aa Yes -0.062335
6a1b8d2f-474a-4896-a9b2-e1d934021f90 6a1b8d2f-474a-4896-a9b2-e1d934021f90 Yes -0.062335
382f8f34-c91e-4d95-934a-2a81991dd74a 382f8f34-c91e-4d95-934a-2a81991dd74a Yes -0.062335
9d6bdebb-26d8-4848-b2cd-81c2f342d256 9d6bdebb-26d8-4848-b2cd-81c2f342d256 Yes -0.062335
9dd07cc4-d6cd-40c4-8b11-9c0992e242f5 9dd07cc4-d6cd-40c4-8b11-9c0992e242f5 Yes -0.062335

Do you know possible reasons for why this could be?
Thanks!

p2cor() doesn't handle very small pvalues

pval <- 10^-(0:500)
sign <- rnorm(length(pval))
lassosum::p2cor(pval, 10e3, sign)

About installing the package

Hi,
After I install the package into R and use library(lassosum), it says:

Error: package or namespace load failed for ‘lassosum’ in get(Info[i, 1], envir = env):
lazy-load database '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/lassosum/R/lassosum.rdb' is corrupt
In addition: Warning message:
In get(Info[i, 1], envir = env) : internal error -3 in R_decompress1

And I find when I try to install it, it says:

functions.cpp:424:9: warning: unused variable 'i' [-Wunused-variable]
int j,i;
^
functions.cpp:467:7: warning: unused variable 'nreps' [-Wunused-variable]
int nreps=startvec.n_elem;
^
2 warnings generated.
clang++ -std=gnu++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o lassosum.so RcppExports.o functions.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/usr/local/Cellar/gcc/10.2.0/lib/gcc/10 -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
installing to /Library/Frameworks/R.framework/Versions/3.6/Resources/library/00LOCK-lassosum/00new/lassosum/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path

DONE (lassosum)

So I'm not sure whether I install the package correctly.

Thank you so much!

Lassosum standalone error

Hi,

I am trying to use lassosum in my pipeline and I'm getting the following error. Any help would be appreciated.

Running lassosum.pipeline
Coordinating summary stats with reference panel...
Splitting genome by LD blocks ...
Running lassosum ...
s = 0.2
Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
Calls: do.call ... postNode -> sendData -> sendData.SOCK0node -> serialize
Execution halted

Thank you

Support PLINK2 dosage format

Dear Developers,

Could you please help to support PLink2 dosage format? which probably better to guessing the missing genotype.

Best regards
Wallace

lassosum-pipeline: Not converging

Hi,

I ran the lassosum per chromosome using the multi-thread mode (10 threads). The code is as follows.
`for (i in 1:22){
message(paste0("Chr ",i," is calculating"))
ss<-fread(paste0(ssf.dir,"/severe_clean_chr",i,".txt"))
test.file <- paste0(test.dir,"/chr",i,".nodup")
ref.file <- paste0(ref.dir,"/chr",i)
# Transform the P-values into correlation
cor <- p2cor(p = ss$P,
n = ss$N,
sign = ss$BETA
)
if (i==1) {

    out <- lassosum.pipeline(
        cor = cor,
        chr = ss$CHR,
        pos = ss$BP,
        A1 = ss$A1,
        A2 = ss$A2,
        ref.bfile = ref.file,
        test.bfile = test.file,
        LDblocks = LDblocks, 
        cluster=cl)

} else {

    out.temp <- lassosum.pipeline(
        cor = cor,
        chr = ss$CHR,
        pos = ss$BP,
        A1 = ss$A1,
        A2 = ss$A2,
        ref.bfile = ref.file,
        test.bfile = test.file,
        LDblocks = LDblocks, 
        cluster=cl)       
    out<- merge(out,out.temp)

}

}
`
For most chromosomes, it worked well. However, for chromosome 3, an error occurred, which is
Running lassosum ...
s = 0.2
Error in checkForRemoteErrors(val) : one node produced an error: Not converging.....

I would appreciate it if you could provide some help with it.

Best wishes.

Yuting

pseudovalidation double destandardize

When using function pseudovalidate.lassosum.pipeline.R with destandardize=T, betas are destandardized and given to function pseudovalidation.R where they are destandardized again to compute bXXb. I was wondering what explains the need a such a double destandardization? I understand that this is not the case if one specifies destandardize=F in the first function. Still, if one wishes to obtain pgs computed on destandardized betas, destandardize=T needs to be specified.

Many thanks for the clarification,
Jasmin

p2cor generating NaNs

@tshmak thanks for the nice tool I have been using for a while

I have these two snps in my summary stats which have really low pvalue and end up generating NaN correlations when using p2cor function. The sample size is above 1000

snp	af	effectsize	se	pvalue
snp1	NA	0.6433	0.08915	4.940656e-324
snp2	NA	0.6434	0.08916	4.940656e-324

What is the best approach to handle such snps in this function/tool because the NaNs present issues downstream. I would really appreciate your feedback on this issue.

PRS from lassosum does not tally with PRS calculated manually

Hi, I have followed the suggestion to apply validated beta into new file as below

out2 = subset(out, out$s=v$best.s, out$lambda=v$best.lambda)
v2 = validate(out2, pheno=pheno.read[,3], covar = covar.read, test.bfile=newfile)

At the same time, we tried to generate the PRS using v$best.beta and the genotype in the "newflie" using

X %*% v$best.beta, where X is the genotype in the newfile, coded as 0/1/2.

However, v2$best.pgs is not the same as PRS generated using X %*% v$best.beta. Could you please shed some light on this issue? The "--fill-missing-a2" function seems to be no longer available in plink, the missing data in X is replaced by 0, which is consistent as what's being done in lassosum. We also make sure that the A2 in X is the same as the A2 in newfile.

We have tested our scripts on other PRS generation and encountered no problem. I am really grateful if you could give us some advice on this issue.

Many thanks!

[enhancement] Can you please give more document for lassosum.pipeline

Can you please give more document for lassosum.pipeline for help understand the results?

Thanks much!
Wallace

Splitvalidation results in a bimodal distribution of PRSs

Greetings,

As mentionned in the title, i'm actually using the splitvalidation function for the parameter tuning and the estimation of the best PRSs. The distribution of the resulting PRSs is odd to me as it is bimodal. It's the first time that I'm seeing such a thing. Here the distribution in question:

In addition, I'm computing PRSs using splitvalidation with other popular methods (LDpred2, PRS-CS, EBPRS..) and the distributions are all pretty normal.

Unfortunately, I cannot give any further information about the data I'm using as they are confidention. Despite, does anyone has any idea on why this situation is happening?

Thanks a lot.

Negative results

I'm using the lassosum.pipeline / validate / subset / validate strategy of the README (splitting the test set in 2).
I'm getting v2$best.validation.result equals to -0.491. And AUC(v2$best.pgs, v2$pheno - 1) equals to 0.205.
Is this normal? Should I just take the opposite?

target.res$best.pgs generating NaNs

@tshmak thanks for the nice tool I have been used to caculate PRS.
I have a problem with best.pgs generating NaNs.

library(lassosum)
y<- fread("pheno_cov.txt")
cov<- y[,c("FID","IID" ,"sex","byear","genotype_array","PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10")]
target.pheno <- y[,c("FID", "IID", "pho")]
cl <- makeCluster(10)
sum.stat <- "prs_sumstat"
bfile <- "my_bfile"
ld.file <- "EUR.hg19"
ss <- fread(sum.stat)
head(ss)

SNP CHR BP A1 A2 beta se p sample

#1: rs3115860 1 753405 A C 0.013735 0.008693 0.114096 71255
#2: rs2073813 1 753541 A G -0.017856 0.009096 0.049656 68419
nrow(ss)
#[1] 5067039
ss <- ss[!p == 0]
cor <- p2cor(p = ss$p,
n =ss$sample,
sign = ss$beta
)
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]
head(fam)
set.seed(1234)
out <- lassosum.pipeline(
cor = cor,
chr = ss$CHR,
pos = ss$BP,
A1 = ss$A1,
A2 = ss$A2,
ref.bfile = bfile,
sample=5000,
test.bfile = bfile,
LDblocks = ld.file,
cluster=cl
)

Store the R2 results

target.res <- validate(out, pheno = as.data.frame(target.pheno), covar=as.data.frame(cov))
lasso_score <- target.res$results.table
summary(lasso_score$best.pgs)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

#-0.3927 -0.0565 0.0027 0.0000 0.0592 0.3293 2071

the sample size for bfile is 330,000
I want to know why 2071 individuasl have NA pgs...
I would be really appreciated for your feedback on this issue.

--LDblocks options doesn't work for manually generated .bed file.

It doesn't work even if test .bed file is given. I think it's due to the an error in this script, more specifically this code snippet down below. It tries to check if LDblocks is a data.frame object before it is read.table in R, so the error is imminent.

 if(is.factor(LDblocks)) LDblocks <- as.integer(LDblocks)
    if(is.vector(LDblocks)) stopifnot(length(LDblocks) == length(cor)) else 
      if(is.data.frame(LDblocks) || is.data.table(LDblocks)) {   
        LDblocks <- as.data.frame(LDblocks)
        stopifnot(ncol(LDblocks) == 3)
        stopifnot(all(LDblocks[,3] >= LDblocks[,2]))
        LDblocks[,1] <- as.character(sub("^chr", "", LDblocks[,1], ignore.case = T))

And the resulting error:

Error in (function (cor, chr = NULL, pos = NULL, snp = NULL, A1 = NULL,  : 
  I cannot recognize this LDblock. Specify one of EUR.hg19, AFR.hg19, ASN.hg19, EUR.hg38, AFR.hg38, ASN.hg38
Calls: do.call -> <Anonymous>
Execution halted

chromosomewise lassosum and pseudovalidate error

Dear tshmak

Hello, I am running the lassosum.pipeline in unix but I have some errors related to this package.

First chromosome-wise running is not working in some chromosomes.

I am using the code as follows.

out = lassosum.pipeline(cor=cor,chr=ss.df$CHR,pos=ss.df$BP,A1=ss.df$ALT,A2=ss.df$REF,ref.bfile="/data/pkg36/lassosum/data/refpanel",test.bfile=test.bfile,LDblocks="ASN.hg19")

out22 = lassosum.pipeline(cor=cor,chr=ss.df$CHR,pos=ss.df$BP,A1=ss.df$ALT,A2=ss.df$REF,ref.bfile="/data/pkg36/lassosum/data/refpanel",test.bfile=paste0(test.bfile,"_chr22"),LDblocks="ASN.hg19")
out21 = lassosum.pipeline(cor=cor,chr=ss.df$CHR,pos=ss.df$BP,A1=ss.df$ALT,A2=ss.df$REF,ref.bfile="/data/pkg36/lassosum/data/refpanel",test.bfile=paste0(test.bfile,"_chr21"),LDblocks="ASN.hg19")
out20 = lassosum.pipeline(cor=cor,chr=ss.df$CHR,pos=ss.df$BP,A1=ss.df$ALT,A2=ss.df$REF,ref.bfile="/data/pkg36/lassosum/data/refpanel",test.bfile=paste0(test.bfile,"_chr20"),LDblocks="ASN.hg19")

out22 and out21 works fine, but out20 gets an error like this:

But lassosum.pipeline for whole chromosomes; out works.

Second, as I merged the outcomes, in my case out22 and out21, validate works but pseudovalidate doesn't work.

out = merge(out22,out21)
v = validate(out)
v = pseudovalidate(out)

I tried to figure out the reason for these errors, but I am lost.

If you have any comments or tips for fixing these, please let me know.

Best Wishes

Error in splitvec.from.bfile(bfile) : length(pvec) == length(bfile) is not TRUE

Hi, I ran the pipeline by chromosomes using the same ref.file and test.file for each chromosome, then merged the output variables together using "merge" in a loop.
However, when I used "validate", it threw the error:
Error in splitvec.from.bfile(bfile) :length(pvec) == length(bfile) is not TRUE
Could you explain to me what might caused the error?

the link to the .tar.gz does not work

this link does not work: https://github.com/tshmak/lassosum/releases/download/v0.2.2/lassosum_0.2.2.tar.gz

Using pre-computed LD

Hi there,

I was wondering if lassosum could take pre-computed LD as part of the input instead of taking genotypes of a reference panel.

Thanks!

Yanyu

Is genotype matrix divided by sqrt(n) in lassosum?

Dear authors,
From your paper (page3, just above equation (4)), the genotype matrix is first normalized and then divided by sqrt(n), such that R=X^{T}X is the LD matrix, which makes sense to me. However, in the runElnet function in funcions.cpp, the genotype matrix is only normalized (line 660), did the genotype matrix then divided by sqrt(n) afterward? I am trying to understand your implementation of lassosum algorithm, thanks!
Peng

Gaps and overlaps between LD regions of Berisa.XXX.hg38.bed files

Dear Timothy,

Thanks for and congratulations to your hard work on lassosum!
Some of the LD regions of the provided Berisa.XXX.hg38.bed files overlap. Also, there are gaps between the regions. However, they were no overlaps and no gaps in the original Berisa.XXX.hg19.bed files.

Berisa.EUR.hg38.bed:
chr pos pos2
1: chr1 10583 1892607
2: chr1 1961168 3666172
3: chr1 3582736 4380811
4: chr1 4320751 5853833
5: chr1 5853833 7187275
Overlap: line 2 ends at 3,666,172, line 3 starts at 3,582,736
Gap: line 2 ends at 1,892,607, line 3 starts at 1,961,168

Berisa.EUR.hg19.bed:
chr start stop
1: chr1 10583 1892607
2: chr1 1892607 3582736
3: chr1 3582736 4380811
4: chr1 4380811 5913893
5: chr1 5913893 7247335

Any suggestions on how to resolve this?

Thanks,
Lars

parseselect/lassosum.pipeline won't accept tibbles for keep/remove arguments

Hi Dr. Mak,

I just wanted to let you know that lassosum.pipeline and parseselect indicate that they can accept a two-column data.frame (FID, IID) in order to keep/remove samples from the reference panel or test data. However, I notice that it won't accept tibbles in place of a data.frame, even though tibbles inherit the data.frame class.

class(keep_ids)

[1] "tbl_df"     "tbl"        "data.frame"

Not sure if this is relevant to making a fix, since it is rather simple to wrap the tibble in as.data.frame() call, but it did cause me a few minutes of confusion and I thought I would share my experience in case it benefits others/the package.

For context, here is the error produced when you use a tibble instead of a data.frame when making a call to either parseselect or lassosum.pipeline:

Error in parseselect(ref.bfile, keep = keep.ref, remove = remove.ref) : 
  No individuals left after keep/remove! Make sure the FID/IID are correct.

How to find SNPs with non-zero best used in final pgs?

Can you please suggest how I can find the list of SNPs with non-zero best used in final pgs? TIA

Output the reweighed beta?

Dear developers,

Is there a way to export the reweighed(LD panel adjusted) beta values? This can give us more flexibilities to do the following analysis.

Thanks much!
Wallace

Lassosum benchmarking

Hi @privefl , @tshmak

I am benchmarking PRS/PGS methods practiced in the PRS tutorial. Regard to Lassosum, I got some confusions. I'd appreciate if you clarify them. Here they are:

1- In the paper (Polygenic scores via penalized regression on summary statistics) and the introduction of the lassosum repository, it is mentioned that 'reference panel' is deployed to use its LD information. In the exercise provided in the tutorial, ref.bfile is literally the bed file of the target dataset (EUR.QC.bed in the Height case), while it is expected ref.bfile refers to a reference panel file like HapMap3. Please clarify this confusion.

2- In the tutorial, fam variable is calculated via these codes:
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]

but, there is no application for fam. Please clarify.

3- Why there is no 'SNPs matching' like what we have in LDpred-2?

Thank you,
Farshad

took too much time to run the pseudovalidate

Hi, I have applied the lassopipeline to a summary statitics and got a rds file succussefully, but when I run the pseudovalidate on a new data set (160K data) without external phenotype, it took me more 24h which is limit time on my server, could you give me some suggestions to solve this problem?
Looking forward to your kind reply,
Best Regards,
Junbiao

Speed up lassosum

Hi,

I calculated the PRS using internal dataset (whole dataset )and now trying to use UK biobank as a validation dataset (per chromosome). I used the command below, the analysis is running without any error. However it is taking too long to compute the new PGS. It is stuck at the "Calculating PGS" step for almost 3 days. Is it normal? and is there a way to speedup? I am using the standalone version of lassosum.

lassosum --lassosum.pipeline <internal.lassosum.pipeline.rds> --validate.rds <internal.validate.rds> --applyto ukb_chr --out ukb_chr1 --nthreads 16

Please let me know if something is unclear.

Regards

Using lassosum in Rscript

One might encounter the following error when using lassosum in a Rscript with R version < 3.4.0

Error in validObject(.Object) : 
  invalid class “ddiMatrix” object: Not a valid 'Mnumeric' class object

This is because of some strange versioning problem and can be solved by also including

library(methods)

in the beginning of your Rscript (Or just use R >= 3.4.0), just in case anyone else encountered the same problem...

Error in validate.lassosum.pipeline(out) : There's no variation in phenotype

I have a binary outcome denoted by 1 and 2 in the .fam file of test data and I'm getting the following error when I tried to run validation step:

v <- validate(out) # Use the 6th column in .fam file in test dataset for test phenotype

Error:

Error in validate.lassosum.pipeline(out) :
There's no variation in phenotype

Can you please suggest how to address this?

PS: I have replaced missing values with NA instead of -9

!any(is.na(cor)) is not TRUE

Hi, below is my code.

library(lassosum)
library(data.table)
library(parallel)
library(fdrtool)
cl <- makeCluster(7, type="FORK") 

#the data is downloaded from https://blog.nus.edu.sg/agen/summary-statistics/t2d-2020/ T2D_ALL_PRIMARY dataset
ss <- fread("/path/to/SpracklenCN_prePMID_T2D_ALL_Primary.txt") 

ss$P[ss$P < .Machine$double.xmin] <- .Machine$double.xmin

test.bfile <- "bi.merged"
cor <- p2cor(p = ss$P, n = ss$Neff, sign=ss$Beta)
sum(is.na(cor))

ld <- fread("/path/to/Berisa.ASN.hg19.bed")
out <- lassosum.pipeline(cor=cor, chr=ss$Chr, pos = ss$Pos, A1 = ss$EAF, A2 = ss$NEA, 
                         test.bfile = "/path/to/bi.merged",
                         LDblocks = ld, cluster = cl,exclude.ambiguous = F)

v <- pseudovalidate(out)
save(v,file="/path/to/ASN.DM.Rdata")

The output is:

[1] 0
Coordinating summary stats with reference panel...
Splitting genome by LD blocks ...
Running lassosum ...
s =  0.2
Error in lassosum(cor = cor2, bfile = ref.bfile, shrink = s, extract = ref.extract,  :
  !any(is.na(cor)) is not TRUE
Calls: lassosum.pipeline -> lapply -> FUN -> lassosum -> stopifnot
Execution halted

I managed to use the same "bi.merged" bfiles to construct PRS for other phenotypes but not this one. Sorry that I may not be able to provide the bfiles I used. But if you really need them to diagnose the problem, maybe I could revise the .fam data a bit and send them to you.

Hope my question is clear now. Thank you for your time.

lassosum.pipeline.R has hardcoded paths

Line 211,

lassosum(cor=cor2, bfile="./data/refpanel",

this prevents the pipeline being run without the refpanel being in that exact path.

explain the "pgs" in out

Hi thank you for making this software. I'm trying it on my own data and I was wondering if you could explain what the out$pgs actually is. I can see that rows are values for different individuals, but what are the columns in each "s"?
thank you
Maria

Error in `$<-.data.frame`(`tmp`, "ok123", value = TRUE) : replacement has 1 row, data has 0

I'm trying to run lassosum using the following code and I'm getting this error: Error in $<-.data.frame(*tmp*, "ok123", value = TRUE) :
replacement has 1 row, data has 0

out <- lassosum.pipeline(cor=cor, chr=ss$CHR, pos=ss$BP,
A1=ss$A1, A2=ss$A2, # A2 is not required but advised
ref.bfile=ref.bfile, test.bfile=test.bfile,
LDblocks = LDblocks)

#Log

Coordinating summary stats with reference panel...
Error in $<-.data.frame(*tmp*, "ok123", value = TRUE) :
replacement has 1 row, data has 0

Can you please suggest how to address this issue? TIA

No phenotype left. Perhaps the FID/IID do not match?

Hi!

I am new to working with polygenic risk scores (PRS) and Lassosum. I started working on PRS using the diabetes summary statistics file as my base data and my own samples as test data. When I run the pipeline, I constantly get an error. I would really appreciate it if you could help me out with it.

Here's the code that I've used:

library(lassosum)
# Prefer to work with data.table as it speeds up file reading
library(data.table)
library(methods)
library(magrittr)
# For multi-threading, you can use the parallel package and 
# invoke cl which is then passed to lassosum.pipeline
library(parallel)
# This will invoke 2 threads. 
cl <- makeCluster(2)
sum.stat <- “d1new.QC.gz"
bfile <- “report.QC"
# Read in and process the covariates
covariate <- fread(“report.cov")
pcs <- fread(“report.eigenvec") 
    setnames(., colnames(.), c("FID","IID", paste0("PC",1:6)))
# Need as.data.frame here as lassosum doesn't handle data.table 
# covariates very well
cov <- merge(covariate, pcs, by = c(“IID”)
# We will need the EUR.hg19 file provided by lassosum 
# which are LD regions defined in Berisa and Pickrell (2015) for the European population and the hg19 genome.
ld.file <- "EUR.hg19"
# output prefix
prefix <- “report”
# Read in the target phenotype file
target.pheno <- fread(“report1.diabetes)[,c(“FID", "IID", “Diabetes”)]
# Read in the summary statistics
ss <- fread(sum.stat)
#Change names of the columns
names(ss)[names(ss) == "P-val"] <- "P"
names(ss)[names(ss) == "Position_hg19"] <- "BP"
names(ss)[names(ss) == "SNP"] <- "summary"
names(ss)[names(ss) == "rsid"] <- "SNP"
names(ss)[names(ss) == "Effect-allele"] <- "A1"
names(ss)[names(ss) == "Other-allele"] <- "A2"
names(ss)[names(ss) == "Other-allele-frequency"] <- "MAF"
names(ss)[names(ss) == "Other-allele-frequency-cases"] <- "MAF-cases"
names(ss)[names(ss) == "Sample-size-cases"] <- "N-cases"
names(ss)[names(ss) == "Sample-size"] <- "N"
# Remove P-value = 0, which causes problem in the transformation
ss <- ss[!P == 0]
#Remove Chromosomes>23
 df <- subset(ss, Chr<23)
# Transform the P-values into correlation
cor <- p2cor(p = df$P,
        n = df$N,
        sign = log(df$OR)
        )
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]
# Run the lassosum pipeline
# The cluster parameter is used for multi-threading
# You can ignore that if you do not wish to perform multi-threaded processing
out <- lassosum.pipeline(
    cor = cor,
    chr = df$CHR,
    pos = df$BP,
    A1 = df$A1,
    A2 = df$A2,
    ref.bfile = bfile,
    test.bfile = bfile,
    LDblocks = ld.file, 
    cluster=cl
)
# Store the R2 results
target.res <- validate(out, pheno = as.data.frame(target.pheno), covar=as.data.frame(cov))
# Get the maximum R2
r2 <- max(target.res$validation.table$value)^2

Here's the error that I'm getting:

0 out of 49 samples kept in pheno.
Error in parse.pheno.covar(pheno = pheno, covar = covar, parsed = parsed.test,  : 
  No phenotype left. Perhaps the FID/IID do not match?

I have used the code given in [https://choishingwan.github.io/PRS-Tutorial/lassosum/]

Thanks in advance!!

Harshita

Error in group.blocks ?

Hello. I was running lassosum to construct PRS,my sample size is 133129, I get this message :
Error in group.blocks(Blocks, parsed, mem.limit, chunks, cluster) :
Size of blocks too large. Suggest split into smaller blocks or increase mem.limit or use only a sample of individuals.
Does someone met this problem before? how to handle this issue? please let me know since this is urgent things for me.

what is the format of the LDfile used in ?

I installed lassosum library on Rv4.0, ~/libraries_R/R_LIB4.0/lassosum/data where I've following files:


list.files()
 [1] "Berisa.AFR.hg19.bed"  "Berisa.AFR.hg38.bed"  "Berisa.ASN.hg19.bed"
 [4] "Berisa.ASN.hg38.bed"  "Berisa.EUR.hg19.bed"  "Berisa.EUR.hg38.bed"
 [7] "Berisa.R"             "Berisa.README"        "GenerateData.R"
[10] "refpanel.bed"         "refpanel.bim"         "refpanel.fam"
[13] "summarystats.txt"     "testsample.bed"       "testsample.bim"
[16] "testsample.covar.txt" "testsample.fam"       "testsample.pheno.txt"

I've data for an admixed population for which none of the widely used population or LD panel will be helpful.
What is the format of LD used in pipeline with the lassosum.pipeline function?

### Read LD region file ###
LDblocks <- "EUR.hg19" # This will use LD regions as defined in Berisa and Pickrell (2015) for the European population and the hg19 genome.
# Other alternatives available. Type ?lassosum.pipeline for more details.

I think from the ?lassosum.pipeline LD block is a data.frame of the following format:

chr	start	stop
chr1	1961168	3666172

I'm at the repo https://bitbucket.org/nygcresearch/ldetect/src/master/
Can this be used to create an in-house LD acceptable data format for the lassosum?
I got link to this code repository from https://academic.oup.com/bioinformatics/article/32/2/283/1743626

Or, can output from plink be used? https://zzz.bwh.harvard.edu/plink/ld.shtml
plink --file mydata --r

thanks,

pseudovalidate() with no LD

Does pseudovalidate() also provide a valid estimate of corr(PRS, Phenotype) when there is no LD? That is, I'm simulating SNP data with no LD, then run a GWAS of a disease phenotype using these SNPs. Using the resulting summary stats, and a test.bfile that also has no LD, will pseudovalidate provide a correct estimate?

how to use result of lassosum to plink

Hi,
When I run the example of lassosum, it is difficult for me to get the same result from lassosum and plink. lassosum code is:

library(lassosum)
setwd(system.file("data", package="lassosum")) 
ss <- fread("summarystats.txt")
ref.bfile <- "refpanel"
test.bfile <- "testsample"
LDblocks <- "EUR.hg19"
cor <- p2cor(p = ss$P_val, n = 60000, sign=log(ss$OR_A1))
out <- lassosum.pipeline(cor=cor, chr=ss$Chr, pos=ss$Position, 
                         A1=ss$A1, A2=ss$A2, # A2 is not required but advised
                         ref.bfile=ref.bfile, test.bfile=test.bfile,
                         LDblocks = LDblocks)
set.seed(20190822)
pheno <- rnorm(nrow.bfile(out$test.bfile))
v <- validate(out, pheno=pheno)
beta <- v$best.beta
idx_eff <- which(beta != 0) ##lasso zero
idx <- out$sumstats$order[idx_eff]
beta_dat <- data.frame(ss[idx, 1], ss[idx, 3], ss[idx, 4], beta[idx_eff])
test_bim <- data.frame(fread("testsample.bim"))
test_snp <- test_bim[match(beta_dat[, 2], test_bim[, 4]), 2]
beta_dat2 <- data.frame(test_snp, beta_dat[, 3], beta_dat[, 4])
write.table(beta_dat2, file = "eff.txt", col.names = F, row.names = F, quote = F)

plink code is:

cd your/lassosum/datapath/
plink-1.9 --bfile testsample --score eff.txt 1 2 3 sum --out pheno_testsample

Result of lassosum:

head(v$beta.pgs)
-0.418 0.46 -0.451 0.165 0.102

Result of plink:

 FID       IID  PHENO    CNT   CNT2 SCORESUM
   0   HG00096     -9   1634    730 -0.307637
   0   HG00097     -9   1634    748 -0.102278
   0   HG00099     -9   1634    761 -0.202595
   0   HG00100     -9   1634    744 -0.0630223
   0   HG00101     -9   1634    734 -0.221991
   0   HG00102     -9   1634    740 -0.139318
   0   HG00103     -9   1634    716 -0.286505
   0   HG00105     -9   1634    722 -0.221223
   0   HG00106     -9   1634    746 -0.237664

Could you tell me the mistake? Thanks!
Best,
Sheng

Model Performance R2

If I understand correctly, the best.validation.result is the correlation between the actual and predicted phenotype and the square if the percent of the phenotype that is explained by the genetic model. However, the correlation in the validation$results.table pheno and best.pgs is dramatically different from the best.validation.result (ex. 0.44 vs. 0.72). Can you please clarify why this difference exists? Thank you in advance!

tshmak / lassosum Goto Github PK

lassosum's People

Contributors

Stargazers

Watchers

Forkers

lassosum's Issues

Start processing basefile

Error: Cannot open file: ***/chr#.QC.bim

SNP CHR BP A1 A2 beta se p sample

Store the R2 results

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

Recommend Projects

Recommend Topics

Recommend Org

Error: Cannot open file:
***/chr#.QC.bim