xinhe-lab / ctwas Goto Github PK

package for the causal TWAS project

Home Page: https://xinhe-lab.github.io/ctwas/

License: MIT License

R 100.00%

ctwas's Introduction

cTWAS: integrating molecular QTLs and GWAS for gene discovery

Expression Quantitative Trait Loci (eQTLs) have often been used to nominate candidate genes from Genome-wide association studies (GWAS). However, commonly used methods are susceptible to false positives largely due to Linkage Disequilibrium of eQTLs with causal variants acting on the phenotype directly. Our method, causal-TWAS (cTWAS), addressed this challenge by borrowing ideas from statistical fine-mapping. It is a generalization of Transcriptome-wide association studies (TWAS), but when analyzing any gene, it adjusts for other nearby genes and all nearby genetic variants.

Running cTWAS involves four main steps: preparing input data, imputing gene z-scores, estimating parameters, and fine-mapping genes and variants. The output of cTWAS are posterior inclusion probabilities (PIPs) for all variants and genes with expression models. We have included a tutorial of how to use the ctwas software. You can browse source code and report a bug here.

Install

Install ctwas:

remotes::install_github("xinhe-lab/ctwas",ref = "main")

Currently, ctwas has only been tested on linux systems. We recommend running ctwas on a High-Performance Computing system.

Citing this work

If you find the ctwas package or any of the source code in this repository useful for your work, please cite:

Zhao S, Crouse W, Qian S, Luo K, Stephens M, He X. Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits. Nat Genet (2024). https://doi.org/10.1038/s41588-023-01648-9

Useful resources

We have pre-computed the LD matrices of European samples from UK Biobank. They can be downloaded here.

cTWAS requires the expression prediction models, or weights, of genes. The pre-computed weights of GTEx expression and splicing traits can be downloaded from PredictDB.

Acknowledgement

We acknowledge the authors of susieR package for using their codes.

Original susieR code obtained by:

git clone [email protected]:stephenslab/susieR.git
git checkout c7934c0

Minor edits to make it accept different prior variances for each variable.

ctwas's People

Contributors

Stargazers

Watchers

Forkers

pcarbo chunmingl szhaolab shugamoe michael-g-callahan j3gu seanjosephjurgens xinhe-lab cuihuaxia allenlile xinhe-lab kevinlkx

ctwas's Issues

About chromosome number Settings

Dear author,
I would like to ask about how to set the number of chromosomes, because I want to use this software to calculate the TWAS of other species. For example, the number of chromosomes of my species is 29 pairs, how should I set it

Running genome-wide cTWAS with "ld_pgenfs=" option

Hello,

I have been attempting to perform cTWAS based on “Running cTWAS genome-wide” section of cTWAS tutorial page and encountered an error that I am having difficulty resolving.

I successfully ran “Imputing gene z-scores” section with “ld_pgenfs =” option in “impute_expr_z” function using the 1000G LD reference PLINK format files provided in TWAS/FUSION website (https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2) using the codes below:

res <- impute_expr_z(z_snp = z_snp_all,
weight = my_fusion_weights,
ld_pgenfs = sorted_gusev_ldref_bed,
outputdir = outputdir,
outname = outname,
harmonize_z = T,
harmonize_wgt = T,
strand_ambig_action_z = "none",
recover_strand_ambig_wgt = T)

However, when I tried to perform “ctwas_rss” function, the code attached below stopped at Susie iteration 1, with the following error message:

thin <- 0.1
max_snp_region <- 20000
ncore <- 6

ctwas_rss(z_gene = z_gene,
z_snp = z_snp,
ld_exprvarfs = ld_exprvarfs,
ld_pgenfs = sorted_gusev_ldref_bed,
ld_R_dir = NULL,
ld_regions = "EUR",
ld_regions_version = "b37",
outputdir = outputdir,
outname = outname,
thin = thin,
max_snp_region = max_snp_region,
ncore = ncore)

-------error code-------
2024-03-22 00:36:38 INFO::Run susie iteratively, getting rough estimate ...
2024-03-22 00:36:38 INFO::run iteration 1
starting worker pid=315008 on localhost:11154 at 00:36:39.033
starting worker pid=315007 on localhost:11154 at 00:36:39.036
starting worker pid=315011 on localhost:11154 at 00:36:39.038
starting worker pid=315009 on localhost:11154 at 00:36:39.039
starting worker pid=315010 on localhost:11154 at 00:36:39.105
starting worker pid=315012 on localhost:11154 at 00:36:39.117
Loading required package: ctwas
Loading required package: ctwas
loaded ctwas and set parent environment
Loading required package: ctwas
loaded ctwas and set parent environment
Loading required package: ctwas
loaded ctwas and set parent environment
Loading required package: ctwas
loaded ctwas and set parent environment
Loading required package: ctwas
loaded ctwas and set parent environment
loaded ctwas and set parent environment
Error in { :
task 1 failed - "input= must be a single character string containing a file name, a system command containing at least one space, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or, the input data itself containing at least one \n or \r"

I would be grateful if you could provide me with guidance or insights into potential sources of the error.
Thank you!

Dong Heon

".exprqc.Rd" file required

I am using imputed gene expression that was computed with other methods as ctwas_rss() input. I received an error message saying that “cannot open compressed file '/(path)/....chr1.exprqc.Rd', probable reason 'No such file or directory'”. While I have .expr and .exprvar files listed in the directory, would you please provide instructions about how to build .exprqc.Rd files and the format requirement of these files?

When using FUSION weight files, ctwas reads the "top1" model differently than FUSION

In read_weight_fusion(), as called by impute_expr_z() , if method="top1" or method="best" is used and the "top1" model is read, read_weight_fusion() simply selects all non-zero SNP weights. This is not what happens in FUSION's z score imputation when reading the top1 model.

From ctwas_read_data() (L271-283) we see that the "top1" column of wgt.matrix would be selected in its entirety, removing only zero weights.

      g.method = method
      if (g.method == "best") {
        g.method = names(which.max(cv.performance["rsq",]))
      }
      if (exists("cv.performance")){
        if (!(g.method %in% names(cv.performance[1,]))){
          next
        }
      }
      wgt.matrix <- wgt.matrix[abs(wgt.matrix[, g.method]) > 
                                 0, , drop = F]
      wgt.matrix <- wgt.matrix[complete.cases(wgt.matrix), 
                               , drop = F]

In FUSION's equivalent of ctwas::impute_expr_z(), when it selects the top1 model, it zeros out all other weights but the one with the max squared value, resulting in a model using only a single SNP.

	# if this is a top1 model, clear out all the other weights
	if ( substr( (colnames(cv.performance))[ mod.best ],1,4) == "top1" ) wgt.matrix[ -which.max(wgt.matrix[,mod.best]^2)  , mod.best] = 0

There are FUSION weights available online where the wgt.matrix variable in the *.wgt.RDat files look like the example below:

When FUSION reads the top1 model it results in a single SNP model, but with ctwas, selecting the top1 model results in the inclusion of weights not used in FUSION for the zscore calculation. To avoid this, ctwas should adopt a similar measure to FUSION when reading the "top1" model from wgt.matrix, and keep only the highest squared value weight.

Miscellaneous improvements to ctwas

@kevinlkx I'm going to use this issue to document some miscellaneous suggested improvements to the package. I will continue to add items here as I review the package. (Note: make these improvements to the "multigroup_test" branch.)

Add a CITATION file. See here for an example.
Replace @import statements with @importFrom statements. (Importing all the functions in a package seems risky; I would advise against it.)
Fix non-portable file names (I think the issue is the colon in these file names):

inst/extdata/sample_data/cor_matrix/region.16:71020125-72901251.R_gene.RDS
inst/extdata/sample_data/cor_matrix/region.16:71020125-72901251.R_snp_gene.RDS
inst/extdata/sample_data/cor_matrix/region.16:71020125-72901251.R_snp.RDS

Make sure all arguments to merge_finemap_regions() are documented. Currently, "LD_format" and "LD_loader_fun" are not documented.
It would be helpful if make_convergence_plots() had a single parameter controlling the overall font size; right now when I create the plots, the text for the axes is quite large. For example when I use theme_cowplot() this is quite easy since it has a "font_size" argument. The same comment applies to other plotting functions such as make_locusplot().
For the locus zoom plot, it is hard to distinguish squares from circles when there are overplotted. Can the color and shape be changed? Also, looks like you are varying size as well, but this does not show up in the legend. Also, there is no explanation in the legend what "red" represents.

Fully parameterize by-chromosome functionality in `ctwas_rss` | `impute_expr_z` maybe also need small tweak to accomodate

I've found it convenient when using cTWAS with sQTL models across multiple tissues (~20k models per tissue) to customize cTWAS to allow for by chromosome ctwas_rss parameter estimation/fine-mapping so that I'm not waiting more than 24 hours for a single tissue to finish running on one computer (even with single node parallelization), but instead waiting under that time for 22 chromosomes to finish independently instead.

Below is a link to some of the little hacks I made to allow for this functionality before the latest round of updates like pre-harmonization, separate parameter estimation/fine-mapping. (Some small messing around with the impute_expr_z output is required as well.)
multigroup...shugamoe:ctwas:multigroup_dev

I will update this issue with the appropriate branch from my fork when I port these hacks over to the latest multigroup updates that have been made, but I think it would be nice for ctwas_rss to be parameterized to allow for at least param. estimation and finemapping by chromosome. I am aware of an example Wesley made to run ctwas for a single region, but I'm not a fan of also having to subset the .db model as well. Now that parameter estimation and fine-mapping can be separated, it might even make sense to do genome-wide parameter estimation, but by-chromosome fine-mapping/final PIP calculations (with each chromsome on a multicore node).

Ideally, it would be nice if when specifying chromosome in impute_expr_z (already possible), that the output for this is able to transfer smoothly to ctwas_rss so that a chromosome can be run singly in that function as well. I.e. modify ld_exprfs from impute_expr_z results if a chromosome is specified, so that it can integrate more smoothly into a chromosome parameterized ctwas_rss function.

Reading PredictDB Weights | Don't hardcode rsid as the SNP identifier

ctwas/R/ctwas_read_data.R

Line 434 in 47c0e3b

rownames(wgt.matrix) <- wgt$rsid

When reading in weights from a .db it can be inconvenient to have rsid hard coded as the SNP identifier. Especially since there is another column, varID, which could be used. While working with predictdb mashr gtex v8 eQTL and sQTL models, I've used that identifier instead. Some of the SNPs in these models, if rsid is missing for whatever reason, use varID as a backup in the rsid row anyway as well.

Here's an example of a db_id parameter added to allow the user to specify the SNP identifier from a .db weight source.
shugamoe@bd42fc0

Questions about LD Reference file

Hello cTWAS developers,

Thank you for putting together this useful tool! I am in the process of implementing GWAS and had a few questions about preparing the LD matrix .RDS/ Rvar files. Specifically:

I'm having issues downloading files from the UKB Ref hg38 box folder directly onto my computing cluster (via both wget and curl), and unfortunately have to download the files individually ("The selected item(s) exceed the download size limit."). Do you have any tips on how to more efficiently download these Box files? Thank you.
I'm trying to adapt the convert_geno_to_LDR_chr.R script to create LD matrices from 1000G Phase 3 data (adapted from LDSC). Are you open to sharing your BASH file for running the R file, or the arguments that you're supplying to the R script?

Thank you!

Fine mapping with parameter estimation

I'm trying to estimate parameters for finemapping and for some reason I the first iteration estimates the gene prior to be zero:

 ctwas::ctwas_rss(
          z_gene = z_gene_subset,
          z_snp = z_snp_subset,
          ld_exprvarfs = ordered_ld_exprvarfs,
          ld_R_dir = ld_R_dir,
          ld_regions_custom = regions_file,
          outputdir = outputdir,
          outname = outname,
          estimate_group_prior = T,
          estimate_group_prior_var = T)
   ...
2024-04-18 13:22:30.132874 INFO::Adding R matrix info for chrom 20
2024-04-18 13:22:30.141519 INFO::Adding R matrix info for chrom 21
2024-04-18 13:22:30.14914 INFO::Adding R matrix info for chrom 22
2024-04-18 13:22:30.177372 INFO::Run susie iteratively, getting rough estimate ...
2024-04-18 13:22:30.179785 INFO::run iteration 1
starting worker pid=3144 on localhost:11502 at 13:22:31.111
Loading required package: ctwas
loaded ctwas and set parent environment
2024-04-18 13:22:57.624383 INFO::After iteration 1, gene prior 0:, SNP prior:0.000186636804777902
2024-04-18 13:22:57.658267 INFO::run iteration 2
starting worker pid=4189 on localhost:11502 at 13:22:58.565

This seems to result in an error in the second iteration specifically here as prior_variance is never assigned a variable. Unclear on what I may have done incorrectly or if this is a potential bug, but I'd greatly appreciate any thoughts on what might be going on!

User option ignored in ctwas_impute_expr.R

In ctwas_impute_expr.R, the function "impute_expr" has a flag for whether to harmonize the SNP data. However, the flag seems to be unused in the function, and the function calls for reading fusion or predictDB weights have harmonize=T hardcoded.

excluding genes with poor performance

Thank you for creating a nice software!
I'm wondering if it makes sense to exclude genes with poor prediction accuracy from prediction model as it could be noisy information.
Do you have any idea?

`index_regions` function very slow

Hi, thanks for your interesting tool!

I have been using cTWAS on a few datasets, and have got interesting results.
Overall, I have been able to run most parts of the analysis relatively smoothly. Unfortunately, however, the ctwas_rss function is very slow for me. One of the major bottlenecks, even when using many many cores, is during the two steps that call on index_regions. Specifically, steps for "Adding R matrix info for chrom ..." take extremely long.

I am using an LD matrix reference dataset I created myself based on the tutorial. This includes ~8.3M variants from the EUR subset of GTEx WGS data. The GWAS data also has ~8.3M variants after filtering for overlapping variants. Using this LD matrix and GWAS dataset, all steps run reasonably, with reasonable parameters for my GWAS (prior for genes 0.014, prior for variants 0.001). However, the "Adding R matrix info for chrom ..." steps take easily 10-20 minutes per chromosome, and are run twice in the algorithm, totalling 660 minutes for just this munging step.

I have managed to use multiple cores and Forking to make all other time-consuming steps reasonably efficient. However, the "Adding R matrix info for chrom ..." munging step does not currently support parallel computing.

Can you confirm that such a run time is expected for this step?
Is there anything I can do to speed this up? Eg is there a way the parallel cores can be used to speed this up?

Thanks so much in advance!
-- Sean

Use inherits to check class of object

There are a bunch of places where you check the class of an object incorrectly. e.g., instead of

class(snp_info) == "list"

it should be

inherits(snp_info,"list")

These are the lines of code that need to be fixed:

File ‘ctwas/R/ctwas_harmonize_data.R’: if (class(snp_info) == "list") ...
File ‘ctwas/R/ctwas_harmonize_data.R’: if (class(snp_info) == "list") ...
File ‘ctwas/R/ctwas_preprocess_weights.R’: if (class(snp_info) == "list") ...
File ‘ctwas/R/ctwas_preprocess_z_snp.R’: if (class(snp_info) == "list") ...
File ‘ctwas/R/ctwas_region_data.R’: if (class(snp_info) == "list") ...
File ‘ctwas/R/ctwas_region_data.R’: if (class(snp_info) == "list") ...
File ‘ctwas/R/ctwas_region_data.R’: if (class(snp_info) == "list") ...
File ‘ctwas/R/ctwas_summarize_finemap_res.R’: if (class(snp_info) == "list") ...

``ctwas_rss`` vignette error

Hello,

I am trying to run thectwas_rss function example from the ctwas_rss_wkfl.Rmd vignette. I ran the code:

library(ctwas)

data(z_snp)
head(z_snp)

ld_pgenfs <- system.file("extdata/example_genotype_files", paste0("example_chr", 1:22, ".pgen"), package = "ctwas")
ld_pgenfs

ld_R_dir <- system.file("extdata/example_ld_R", package = "ctwas")
list.files(ld_R_dir)[1:10]

head(read.table(system.file("extdata/example_ld_R","example_chr1.R_snp.0_5.Rvar",  package = "ctwas"), header = T))

weight.fusion <- system.file("extdata/example_fusion_weights", "Tissue", package = "ctwas")

weight.predictdb <- system.file("extdata", "example_tissue.db", package = "ctwas") 

regionsfile <- system.file("extdata", "example_regions.bed", package = "ctwas")
head(read.table(regionsfile, header = T))

outputdir <- "~/temp"
res <- impute_expr_z(z_snp = z_snp,  weight = weight.fusion, ld_pgenfs = ld_pgenfs,
                           method = "lasso", outputdir = outputdir,
                           outname = "test_ss")
z_gene <- res$z_gene
ld_exprfs <- res$ld_exprfs

head(z_gene)
ld_exprfs


pars <- ctwas_rss(z_gene = z_gene, z_snp = z_snp, ld_exprfs = ld_exprfs, ld_pgenfs = ld_pgenfs, ld_regions_custom = regionsfile, thin = 0.9, max_snp_region = 20, outputdir = outputdir, outname = "test_ss", ncore = 1)

When I run the final line above, I get the error:

Error in { : task 1 failed - "'length = 4' in coercion to 'logical(1)'"
In addition: Warning messages:
1: In data.table::fread(exprvarf, header = T) :
  File '~/temp/test_ss_chr21.exprvar' has size 0. Returning a NULL data.table.
2: In data.table::fread(exprvarf, header = T) :
  File '~/temp/test_ss_chr22.exprvar' has size 0. Returning a NULL data.table.

I am trying to run ctwas and ctwas_rss on some of my own data and get a very similar error to above.

Do you know what may be causing this? My sessionInfo() is below:

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ctwas_0.1.34

loaded via a namespace (and not attached):
 [1] doParallel_1.0.17 crayon_1.5.2      vctrs_0.6.3       cli_3.6.1        
 [5] rlang_1.1.1       generics_0.1.3    data.table_1.14.8 glue_1.6.2       
 [9] fansi_1.0.4       grid_4.3.1        tibble_3.2.1      foreach_1.5.2    
[13] lifecycle_1.0.3   compiler_4.3.1    dplyr_1.1.2       codetools_0.2-19 
[17] Rcpp_1.0.11       pkgconfig_2.0.3   pgenlibr_0.3.5    lattice_0.21-8   
[21] R6_2.5.1          tidyselect_1.2.0  utf8_1.2.3        parallel_4.3.1   
[25] pillar_1.9.0      magrittr_2.0.3    Matrix_1.6-0      tools_4.3.1      
[29] withr_2.5.0       iterators_1.0.14  logging_0.10-108

Thanks in advance,

Jeff

Error in { : task 1 failed - "'length = 4' in coercion to 'logical(1)'" in ctwas_rss

Hi,

I am sorry to bother you.

When I run the code below following the tutorial, I got the error:

starting worker pid=3276557 on localhost:11580 at 14:30:57.803
Loading required package: ctwas
loaded ctwas and set parent environment
Error in { : task 1 failed - "'length = 4' in coercion to 'logical(1)'"

library(ctwas)

data(z_snp)

ld_pgenfs <- system.file("extdata/example_genotype_files", paste0("example_chr", 1:22, ".pgen"), package = "ctwas")

ld_R_dir <- system.file("extdata/example_ld_R", package = "ctwas")

weight.fusion <- system.file("extdata/example_fusion_weights", "Tissue", package = "ctwas")

regionsfile <- system.file("extdata", "example_regions.bed", package = "ctwas")

outputdir <- "~/temp"
res <- impute_expr_z(z_snp = z_snp,  weight = weight.fusion, method = "lasso",ld_pgenfs = ld_pgenfs,
                          outputdir = outputdir,
                           outname = "test_ss")		
z_gene <- res$z_gene
ld_exprfs <- res$ld_exprfs

pars <- ctwas_rss(z_gene = z_gene, z_snp = z_snp, ld_exprfs = ld_exprfs, ld_pgenfs = ld_pgenfs, ld_regions_custom = regionsfile, thin = 0.9, max_snp_region = 20, outputdir = outputdir, outname = "test_ss", ncore = 1)

I guess there may be a dimensions mismatch of "prior_variance/V".

Btw, if I used my own data, especially with regard the plink file of reference panel as inputs, there also exist some issues, such as the missing definition of the colnames of "snps" if I set harmonize_wgt=F in ctwas_read_data.R.

I have tried many times and could not solve these problems. Could you give me some suggestions?

Thank you very much.

Best regards,
Zhongshang

Obtain NaN when estimating parameters

Hi, I'm trying to run cTWAS analysis for several locus. When running parameter estimation, I get normal parameter results in the first step of rough estimate.

2024-05-21 22:16:06 INFO::After iteration 3, gene prior 0.0714285714285714:, SNP prior:0.000139378469065089
2024-05-21 22:16:06 INFO::Blocks are filtered: 56 blocks left

But in the second step of precise estimate, the gene prior is NaN after iteration 1.

2024-05-21 22:16:06 INFO::Run susie iteratively, getting accurate estimate ...
2024-05-21 22:16:06 INFO::run iteration 1
2024-05-21 22:18:44 INFO::After iteration 1, gene prior NaN:, SNP prior:0.000137847448790878

Although the code runs correctly until the end of iteration process, an error occurs during the next step of Run susie for all regions.

2024-05-21 23:35:38 INFO::Run susie for all regions.
2024-05-21 23:35:38 INFO::run iteration 1
Error in { : task 1 failed - "missing value where TRUE/FALSE needed"
In addition: There were 24 warnings (use warnings() to see them)

Here are the codes:

ctwas_rss(z_gene = z_gene_subset,
        z_snp = z_snp_subset,
        ld_exprvarfs = ld_exprvarfs,
        ld_R_dir = ld_R_dir,
        ld_regions = "ASN",
        ld_regions_version = "b37",
        outputdir = outputdir,
        outname = outname,
        ncore = 10)

I'd appreciate any suggestions or ideas on how to resolve this issue. Thank you in advance!

A few function arguments that are undocumented

All arguments to exported functions should have a @param entry in the roxygen2 docs. A few are missing:

Documented arguments not in \usage in documentation object 'ctwas_sumstats_noLD':
  ‘...’
Undocumented arguments in documentation object 'merge_finemap_regions'
  ‘...’
Documented arguments not in \usage in documentation object 'merge_finemap_regions':
  ‘logfile’
Undocumented arguments in documentation object 'merge_region_data'
  ‘use_LD’ ‘LD_info’ ‘snp_info’
Undocumented arguments in documentation object 'preprocess_weights'
  ‘ncore’ ‘scale_by_ld_variance’

Add imports for all functions called explicitly from other packages

I noticed that many of the functions used in other packages are not explicitly imported with @importFrom statements. For example, in commit 27b8c2d I added this above function "anno_finemap_res" because data.table::rbindlist() is used in this function:

#' @importFrom data.table rbindlist

This isn't always strictly necessary but it is helpful to avoid issues. Note the one exception to this is for packages that are conditionally loaded, e.g., Rfast.

Here is a list of functions for which importFrom statements may be missing:

ctwas_EM.R:    EM_susie_res_list <- parallel::mclapply(region_ids, function(region_id){
ctwas_compute_gene_z.R:  z_gene <- parallel::mclapply(names(weights), function(id) {
ctwas_convert_geno_to_LD_matrix.R:    region_info$chr <- readr::parse_number(region_info$chr)
ctwas_convert_geno_to_LD_matrix.R:        R_snp <- Rfast::cora(X.g)
ctwas_convert_geno_to_LD_matrix.R:          R_snp_variances <- Rfast::colVars(X.g)
ctwas_convert_regionlist_to_region_data.R:  logging::loginfo("Convert the data structure from regionlist to region data")
ctwas_convert_regionlist_to_region_data.R:  logging::loginfo("%d regions in region_data", length(region_data))
ctwas_convert_regionlist_to_region_data.R:  logging::loginfo("Add z-scores to region_data...")
ctwas_diagnose_LD_mismatch.R:  condz_list <- parallel::mclapply(region_ids, function(region_id){
ctwas_diagnose_LD_mismatch.R:  condz_stats <- data.table::rbindlist(condz_list, idcol = "region_id")
ctwas_diagnose_LD_mismatch.R:    R_snp <- suppressWarnings(as.matrix(Matrix::bdiag(R_snp)))
ctwas_finemapping.R:        R_snp <- suppressWarnings(as.matrix(Matrix::bdiag(R_snp)))
ctwas_finemapping.R:  finemap_region_res_list <- parallel::mclapply(region_ids, function(region_id){
ctwas_harmonize_data.R:    snp_info <- as.data.frame(data.table::rbindlist(snp_info, idcol = "region_id"))
ctwas_harmonize_data.R:    snp_info <- as.data.frame(data.table::rbindlist(snp_info, idcol = "region_id"))
ctwas_merge_regions.R:  logging::loginfo("Merged %d boundary genes into %d regions", nrow(boundary_genes), nrow(merged_region_info))
ctwas_plots.R:        R_snp <- suppressWarnings(as.matrix(Matrix::bdiag(R_snp)))
ctwas_plots.R:  loc <- locuszoomr::locus(
ctwas_plots.R:  cowplot::plot_grid(p_pvalue, p_pip, p_qtl, p_genes, ncol = 1,
ctwas_preprocess_regions.R:  snp_info <- parallel::mclapply(region_ids, function(region_id){
ctwas_preprocess_weights.R:    snp_info_df <- as.data.frame(data.table::rbindlist(snp_info, idcol = "region_id"))
ctwas_preprocess_weights.R:    context <- tools::file_path_sans_ext(basename(weight_file))
ctwas_preprocess_weights.R:    cl <- parallel::makeCluster(ncore, outfile = "")
ctwas_preprocess_weights.R:    doParallel::registerDoParallel(cl)
ctwas_preprocess_weights.R:              R_snp <- suppressWarnings({as.matrix(Matrix::bdiag(R_snp))})
ctwas_preprocess_z_snp.R:    snp_info <- as.data.frame(data.table::rbindlist(snp_info, idcol = "region_id"))
ctwas_region_data.R:    snp_info <- as.data.frame(data.table::rbindlist(snp_info, idcol = "region_id"))
ctwas_region_data.R:          logging::loginfo("Trim region %s with SNPs more than %s", region_id, maxSNP)
ctwas_region_data.R:          logging::loginfo("Trim region %s with SNPs more than %s", region_id, maxSNP)
ctwas_region_data.R:  logging::loginfo("Adding z-scores to region_data ...")
ctwas_region_data.R:  region_data2 <- parallel::mclapply(region_ids, function(region_id){
ctwas_region_data.R:  logging::loginfo("Adjusting for boundary genes ...")
ctwas_region_data.R:    snp_info <- as.data.frame(data.table::rbindlist(snp_info, idcol = "region_id"))
ctwas_region_data.R:    snp_info <- as.data.frame(data.table::rbindlist(snp_info, idcol = "region_id"))
ctwas_region_data.R:  region_data <- parallel::mclapply(region_ids, function(region_id){
ctwas_simulations.R:      expr <- data.table::data.table(NULL)
ctwas_simulations.R:      geneinfo <- data.table::data.table(NULL)
ctwas_simulations.R:    data.table::fwrite(expr, file = exprf,
ctwas_simulations.R:    data.table::fwrite(geneinfo, file = exprvarf, sep = "\t", quote = F)
ctwas_simulations.R:  sqlite <- RSQLite::dbDriver("SQLite")
ctwas_simulations.R:    db = RSQLite::dbConnect(sqlite, weight)
ctwas_simulations.R:    query <- function(...) RSQLite::dbGetQuery(db, ...)
ctwas_simulations.R:    RSQLite::dbDisconnect(db)
ctwas_simulations.R:  cl <- parallel::makeCluster(ncore, outfile = "")
ctwas_simulations.R:  doParallel::registerDoParallel(cl)
ctwas_simulations.R:      weight_name <- tools::file_path_sans_ext(basename(weight))
ctwas_simulations.R:      db = RSQLite::dbConnect(sqlite, weight)
ctwas_simulations.R:      query <- function(...) RSQLite::dbGetQuery(db, ...)
ctwas_simulations.R:      RSQLite::dbDisconnect(db)
ctwas_simulations.R:  parallel::stopCluster(cl)
ctwas_simulations.R:        R_snp <- suppressWarnings({as.matrix(Matrix::bdiag(R_snp))})
ctwas_simulations.R:  exprvar <- try(data.table::fread(exprvarf, header = T))
ctwas_simulations.R:    return(as.matrix(data.table::fread(exprf, header = F,
ctwas_summarize_finemap_res.R:    snp_info <- as.data.frame(data.table::rbindlist(snp_info, idcol = "region_id"))
ctwas_summarize_parameters.R:    outlist$convergence_plot <- cowplot::plot_grid(p_pi, p_sigma2, p_enrich, p_pve)
ctwas_susie_rss.R:    # following the default in susieR::susie_rss
ctwas_utils.R:  snp_info <- as.data.frame(data.table::fread(file, header = TRUE))
ctwas_utils.R:    weight_name <- tools::file_path_sans_ext(basename(weight_file))
ctwas_utils.R:    sqlite <- RSQLite::dbDriver("SQLite")
ctwas_utils.R:    db = RSQLite::dbConnect(sqlite, weight_file)
ctwas_utils.R:    query <- function(...) RSQLite::dbGetQuery(db, ...)
ctwas_utils.R:      predictdb_LD_file <- paste0(tools::file_path_sans_ext(weight_file), ".txt.gz")
ctwas_utils.R:    RSQLite::dbDisconnect(db)
ctwas_utils.R:    weight_name <- tools::file_path_sans_ext(basename(weight_file))
ctwas_utils.R:    cl <- parallel::makeCluster(ncore, outfile = "", type = "FORK")
ctwas_utils.R:    doParallel::registerDoParallel(cl)
ctwas_utils.R:            left_join(tibble::as_tibble(snps) %>% select(-cm), by = "rsid")
ctwas_utils.R:      parallel::stopCluster(cl)
ctwas_utils.R:    file_ext_lower <- tolower(tools::file_ext(file))
ctwas_utils.R:    res <- as.matrix(data.table::fread(file))
ctwas_utils.R:        pvar <- data.table::fread(pvarf, header = F)
ctwas_utils.R:        data.table::fwrite(pvar, file = pvarf2 , sep="\t", quote = F)
ctwas_utils.R:      pvar <- data.table::fread(pvarf, header = F)
ctwas_utils.R:      data.table::fwrite(pvar, file = pvarf2 , sep="\t", quote = F)
ctwas_utils.R:  pvar <- pgenlibr::NewPvar(pvarf)
ctwas_utils.R:    pgen <- pgenlibr::NewPgen(pgenf, pvar = pvar)
ctwas_utils.R:    fam <- data.table::fread(famf, header = F)
ctwas_utils.R:    pgen <- pgenlibr::NewPgen(pgenf, pvar = pvar, raw_sample_ct = raw_s_ct)
ctwas_utils.R:    variantidx <- 1: pgenlibr::GetVariantCt(pgen)}
ctwas_utils.R:  pgenlibr::ReadList(pgen,
ctwas_utils.R:  pvar <- data.table::fread(pvarf, skip = "#CHROM")
ctwas_utils.R:  pvar <- dplyr::rename(pvar, "chrom" = "#CHROM", "pos" = "POS",
ctwas_utils.R:  bim <- data.table::fread(bimf)
susie_set_X_attributes.R:# @details This should give the same result as matrixStats::colSds(X),
susie_susie_utils.R:    get_upper_tri = Rfast::upper_tri
susie_susie_utils.R:    get_median    = Rfast::med
susie_susie_utils.R:    get_median    = stats::median

Undefined functions

These functions are used in various places but as far as I can tell are not defined anywhere in the package:

warning_message
read_ld_Rvar_RDS
read_weight_fusion
compute_tf_Xb
compute_tf_Xty
compute_tf_cm
compute_tf_csd
compute_tf_d

Remove help pages for non-exported functions

I see that many of the help pages are for non-exported functions, which presumably means they should not be called by the user, e.g.,

library(ctwas)
help(read_pvar)

I understand if you may want some of these help pages for developer documentation, but in general it is confusing for the user, so it should be used sparingly (and if you want these help pages, they should be carefully flagged as "internal").

These help pages are being generated automatically by roxygen2 wherever you have the #' syntax. A simple solution is to replace #' with an ordinary comment # so that the help pages for these non-exported functions are not generated.

Error in susie examples

I'm getting an error when I try to run the susie example:

set.seed(1)
n = 1000
p = 1000
beta = rep(0,p)
beta[1:4] = 1
X = matrix(rnorm(n*p),nrow = n,ncol = p)
X = scale(X,center = TRUE,scale = TRUE)
y = drop(X %*% beta + rnorm(n))
res1 = susie(X,y,L = 10)
# Error in s$V[l, ] : incorrect number of dimensions

Fix variable bindings in use of dplyr and ggplot2 functions

@kevinlkx At several points you use dplyr and ggplot2 functions that produce "no visible binding for global variable" errors in the R package checks. These errors can be fixed following the instructions given here.

I believe this approach should fix all or most of these errors:

anno_finemap_res: no visible binding for global variable 'start'
anno_finemap_res: no visible binding for global variable 'end'
anno_finemap_res: no visible binding for global variable 'chrom'
anno_finemap_res: no visible binding for global variable 'id'
anno_finemap_res: no visible binding for global variable 'susie_pip'
load_weights : <anonymous>: no visible binding for global variable 'cm'
make_locusplot: no visible binding for global variable 'pos'
make_locusplot: no visible binding for global variable 'p'
make_locusplot: no visible binding for global variable 'type'
make_locusplot: no visible binding for global variable 'object_type'
make_locusplot: no visible binding for global variable 'r2_levels'
make_locusplot: no visible binding for global variable 'label'
make_locusplot: no visible binding for global variable 'susie_pip'
make_locusplot: no visible binding for global variable 'id'
summarize_param: no visible binding for global variable 'niter'
summarize_param: no visible binding for global variable 'value'
summarize_param: no visible binding for global variable 'group'

Unclear error message

I am getting the following error:

res <- impute_expr_z(z_snp = z_snp, weight = weight.predictdb, ld_pgenfs = ld_pgenfs,
method = "lasso", outputdir = outputdir,
outname = "res_ss")
2022-09-30 14:52:23 INFO::flipping z scores to match LD reference
2022-09-30 14:52:32 INFO::will also flip weights to match LD reference for each gene
2022-09-30 14:52:32 INFO::Reading weights for chromosome 1
2022-09-30 14:52:33 INFO::number of genes with weights provided: 13199
2022-09-30 14:52:33 INFO::collecting gene weight information ...

2022-09-30 15:11:27 INFO::Start gene z score imputation ...
2022-09-30 15:11:27 INFO::ld genotype is given, using genotypes to impute gene z score.
Error in if (abs(max(gexpr) - min(gexpr)) < 1e-08) { :
missing value where TRUE/FALSE needed

It occurs from the file 'twas_impute_expr_z.R, lines 98-107. It is unclear to me what is generating this error and I was hoping you could help.

Question about SNP-gene correlation matrix

Thanks for developing this useful tools! In ctwas_process_regions.R at line 270, the code include R_snp[x,x] in the denominator. This term is not present in the equation from the paper's Supplementary Notes. Could you explain what is this term used for?

            R_snp_gene[,i] <- sapply(1:nrow(R_snp),
                                     function(x){crossprod(wgt,R_snp[ld.idx,x])/sqrt(crossprod(wgt,R.s)%*%wgt*R_snp[x,x])})

,Xinyu

testthat test errors

@kevinlkx I'm getting some errors when I try to run the tests with devtools::test():

Error (test-ctwas_compute_region_cor.R:23:3): get_region_cor works
Error in `get_region_cor(region_id, region_data = screened_region_data,
    LD_info = LD_info, snp_info = snp_info, weights = weights)`: all(file.exists(LD_matrix_files)) is not TRUE
Backtrace:
    ▆
 1. └─ctwas::get_region_cor(...) at test-ctwas_compute_region_cor.R:23:3
 2.   └─base::stopifnot(all(file.exists(LD_matrix_files))) at ctwas/R/ctwas_compute_region_cor.R:172:5

Error (test-ctwas_finemapping.R:21:5): finemap_regions works with LD
Error in `get_region_cor(region_id, region_data = region_data, LD_info = LD_info,
    snp_info = snp_info, weights = weights, force_compute_cor = force_compute_cor,
    save_cor = save_cor, cor_dir = cor_dir, LD_format = LD_format,
    LD_loader = LD_loader, verbose = verbose)`: all(file.exists(LD_matrix_files)) is not TRUE
Backtrace:
    ▆
 1. ├─base::suppressWarnings(...) at test-ctwas_finemapping.R:20:3
 2. │ └─base::withCallingHandlers(...)
 3. └─ctwas::finemap_regions(...) at test-ctwas_finemapping.R:21:5
 4.   └─parallel::mclapply(...) at ctwas/R/ctwas_finemapping.R:275:3
 5.     └─base::lapply(X = X, FUN = FUN, ...)
 6.       └─ctwas (local) FUN(X[[i]], ...)
 7.         └─ctwas::finemap_region(...) at ctwas/R/ctwas_finemapping.R:283:5
 8.           └─ctwas::get_region_cor(...) at ctwas/R/ctwas_finemapping.R:128:5
 9.             └─base::stopifnot(all(file.exists(LD_matrix_files))) at ctwas/R/ctwas_compute_region_cor.R:172:5

Use of "aggregate" in sum_pip_across_contexts()

It is unclear which version of the "aggregate" function you are using. Is this the aggregate function from the "stats" package?