
ldshrink's Introduction

ldshrink

a one-stop R package for shrinkage estimation of linkage disequilibrium

install.packages("devtools")
devtools::install_github("stephenslab/ldshrink")

NOTE: This package is still under very active development.

ldshrink's People

Contributors

crerecombinase, pcarbo, xiangzhu


ldshrink's Issues

Rcpp error: no matching function for call to 'R_useDynamicSymbols', shown in macOS 10.14 + R 3.5.1

I came across the following error in my mac laptop:

> devtools::load_all(".")

[omitted]

RcppExports.cpp:242:5: error: no matching function for call to 'R_useDynamicSymbols'
    R_useDynamicSymbols(dll, FALSE);
    ^~~~~~~~~~~~~~~~~~~
/Library/Frameworks/R.framework/Resources/include/R_ext/Rdynload.h:84:10: note: candidate function not viable: no known conversion from 'int' to 'Rboolean' for 2nd argument
Rboolean R_useDynamicSymbols(DllInfo *info, Rboolean value);
         ^
4 warnings and 1 error generated.
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package 'ldshrink'
* removing '/private/var/folders/dm/39__1gf52_s_9xbqrfd_46lc0000gp/T/RtmpjSVD5F/devtools_install_60b7e017afc/ldshrink'
Error: Command failed (1)

Here is my laptop info:

# xiangzhu @ stanford in ~ [22:27:51]
$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: macOS 10.14 (18A391)
      Kernel Version: Darwin 18.0.0
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Computer Name: STA-C02V26FCHTD5
      User Name: Xiang Zhu (xiangzhu)
      Secure Virtual Memory: Enabled
      System Integrity Protection: Enabled
      Time since boot: 5 days 22:05

Here is my R session info:

> devtools::session_info()
Session info --------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2018-10-04                  

Packages ------------------------------------------------------------------------------------------------------------------------------------------
 package      * version   date       source        
 base         * 3.5.1     2018-07-05 local         
 BH             1.66.0-1  2018-02-13 CRAN (R 3.5.0)
 commonmark     1.6       2018-09-30 CRAN (R 3.5.0)
 compiler       3.5.1     2018-07-05 local         
 datasets     * 3.5.1     2018-07-05 local         
 devtools       1.13.6    2018-06-27 CRAN (R 3.5.0)
 digest         0.6.17    2018-09-12 CRAN (R 3.5.1)
 graphics     * 3.5.1     2018-07-05 local         
 grDevices    * 3.5.1     2018-07-05 local         
 grid           3.5.1     2018-07-05 local         
 lattice        0.20-35   2017-03-25 CRAN (R 3.5.1)
 magrittr       1.5       2014-11-22 CRAN (R 3.5.0)
 Matrix         1.2-14    2018-04-13 CRAN (R 3.5.1)
 memoise        1.1.0     2017-04-21 CRAN (R 3.5.0)
 methods      * 3.5.1     2018-07-05 local         
 R6             2.3.0     2018-10-04 CRAN (R 3.5.1)
 Rcpp           0.12.19   2018-10-01 CRAN (R 3.5.0)
 RcppEigen      0.3.3.4.0 2018-02-07 CRAN (R 3.5.0)
 RcppParallel   4.4.1     2018-07-19 CRAN (R 3.5.0)
 RcppProgress   0.4.1     2018-05-11 CRAN (R 3.5.0)
 rlang          0.2.2     2018-08-16 CRAN (R 3.5.0)
 roxygen2       6.1.0     2018-07-27 CRAN (R 3.5.0)
 stats        * 3.5.1     2018-07-05 local         
 stringi        1.2.4     2018-07-20 CRAN (R 3.5.0)
 stringr        1.3.1     2018-05-10 CRAN (R 3.5.0)
 testthat       2.0.0     2017-12-13 CRAN (R 3.5.0)
 tools          3.5.1     2018-07-05 local         
 utils        * 3.5.1     2018-07-05 local         
 withr          2.1.2     2018-03-15 CRAN (R 3.5.0)
 xml2           1.2.0     2018-01-24 CRAN (R 3.5.0)
 yaml           2.2.0     2018-07-25 CRAN (R 3.5.0)
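The error comes from generated code in RcppExports.cpp: the C macro FALSE is an int, and C++ (unlike C) refuses the implicit conversion to the Rboolean enum, exactly as the compiler note says. A possible workaround, assuming the generated registration code is the culprit, is to regenerate it with an up-to-date Rcpp:

```r
## Possible fix (untested on this exact setup): newer Rcpp releases generate
## registration code that compiles cleanly, so update Rcpp and regenerate
## src/RcppExports.cpp before retrying the build.
install.packages("Rcpp")       # get the latest Rcpp from CRAN
Rcpp::compileAttributes(".")   # rewrite src/RcppExports.cpp
devtools::load_all(".")        # retry
## Manual alternative: edit the offending line to
##   R_useDynamicSymbols(dll, (Rboolean) FALSE);
```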

L2-penalty LD shrinkage

I wonder how hard it would be to implement the following L2-penalty (i.e. "ridge") LD shrinkage estimator: Sigma_ridge = Sigma + lambda * I, where Sigma is the sample covariance matrix for SNPs.

For simplicity we can fix the value of lambda (e.g. lambda=0.01) for now. (It might be conceptually straightforward to tune this parameter by cross-validation, but this definitely complicates the software implementation.)

This approach was used in ImpG-Summary: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4184260/

(Screenshot of the relevant equation from the ImpG-Summary paper omitted.)
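A minimal sketch of the proposal, with a hypothetical function name `ridge_ld` and `lambda` fixed as suggested:

```r
## Ridge ("L2-penalty") LD shrinkage: Sigma_ridge = Sigma + lambda * I.
## X is the n-by-p SNP genotype matrix; lambda is fixed for now (no tuning).
ridge_ld <- function(X, lambda = 0.01) {
  Sigma <- cov(X)                 # sample covariance matrix for SNPs
  Sigma + lambda * diag(ncol(X))  # inflate the diagonal by lambda
}
```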

Including external data

@pcarbo Hi Peter -- I just had a great meeting with Nick today, and we have the following question about including published external data in the LDshrink package. We would appreciate your input.

To make the package user-friendly, we would like to include the following external data files:

  1. Genetic maps from 1000 Genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/

  2. Approximately independent LD blocks: https://bitbucket.org/nygcresearch/ldetect-data/src

These data files have fairly simple structure. The genetic maps are data frames with the following headers:

[id] [physical position] [genetic position (cumulative)]

The LD blocks are data frames with the following headers:

[chromosome name] [region start] [region stop]

These data files also have formal publications available. I wonder if we could include these external files in the LDshrink package, provided that we explicitly cite these publications?

Do we have to worry about licensing issues?

An alternative plan is to provide users with the data preprocessing scripts/functions, but I think this will make the package harder to use.

External sources of genetic maps

Below are links to several genetic maps that I have used in my past work. After discussing this with Peter (see Issue #3), it seems fine to include these external data in LDshrink as long as we cite them properly. To make LDshrink a one-stop package for LD-related calculations, I think we should include all these files in the package (if possible).

Genetic maps listed in IMPUTE v1

Genetic maps created by Pickrell

Note that these maps have a very straightforward format:

$ head genetic_map_CEU_chr16.txt
position CEU_rate(cM/Mb) Genetic_Map(cM)
24045 0 0
24170 0.3020482 0
25057 0.3020482 0.0002679168
25065 0.2846052 0.0002703331
25561 0.2782453 0.0004114973
25658 0.2782453 0.0004384871
28165 0.2782453 0.001136048
29475 0.2794324 0.001500549
31010 0.2794324 0.001929478
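For reference, a sketch of reading such a map into R; the header contains characters like `(cM/Mb)` that `read.table` would mangle, so skip it and supply clean column names:

```r
## Read a genetic map in the three-column format shown above.
map <- read.table("genetic_map_CEU_chr16.txt", skip = 1,
                  col.names = c("position", "rate_cM_Mb", "map_cM"))
head(map$map_cM)  # cumulative genetic position in cM
```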

Return SNP info along with LD matrix

It seems the main function only returns an estimated LD matrix at this point?
https://github.com/stephenslab/LDshrink/blob/32b4ad3942f7cb429f23c529b86ab72cfbb1b257/R/LDshrink.R#L6

Ideally we want to have some basic SNP info available (e.g. position, allele), which is essential in combining LD with GWAS summary statistics in analyses.

I think the emeraLD package gives us a good example: https://github.com/statgen/emeraLD

> source('emeraLD2R.r');
Loading required package: data.table
data.table 1.11.4  Latest news: http://r-datatable.com
emeraLD v0.1 (c) 2018 corbin quick (corbinq@gmail.com)

reading from m3vcf file...

processed genotype data for 5008 haplotypes...

calculating LD for 60 SNPs...

done!! thanks for using emeraLD

> names(ld_data)
[1] "Sigma" "info"

> head(ld_data$info)
   chr   pos          id ref alt
1:  20 83061 rs549711487   C   T
2:  20 83196  rs62190472   A   T
3:  20 83252   rs6137896   G   C
4:  20 83570   rs6048967   T   G
5:  20 83611 rs114000219   C   A
6:  20 83792 rs529518485   A   G

> head(ld_data$Sigma[, 1:5], 5)
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,]  1.00000 -0.00602  0.03989 -0.00824 -0.00331
[2,] -0.00602  1.00000 -0.14013 -0.03102 -0.01245
[3,]  0.03989 -0.14013  1.00000 -0.05714 -0.04400
[4,] -0.00824 -0.03102 -0.05714  1.00000 -0.01704
[5,] -0.00331 -0.01245 -0.04400 -0.01704  1.00000
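An emeraLD-style return value could be a list bundling the matrix with per-SNP metadata (the `Sigma`/`info` names follow emeraLD; the helper below is hypothetical):

```r
## Return SNP info alongside the LD matrix, emeraLD-style.
## X: n-by-p genotype matrix; info: data frame with one row per SNP
## (e.g. chr, pos, id, ref, alt).
ld_with_info <- function(X, info) {
  stopifnot(ncol(X) == nrow(info))  # one metadata row per SNP column
  list(Sigma = cor(X),              # placeholder for the shrinkage estimate
       info  = info)
}
```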

Option to output traditional LD statistics

Although the main purpose of LDshrink is to provide a one-stop place for various LD shrinkage estimators that have been recently used in GWAS summary statistics literature, I think perhaps we should also provide the option to output the traditional LD statistics in population genetics literature.

  • For phased haplotypes: r2, D, and D’ statistics
  • For unphased genotypes: sample correlation or r2

Below are two potential use cases for this option.

  • Use case 1: a data analyst doesn't need or care about LD shrinkage estimators and just needs standard LD statistics computed from a dataset.
  • Use case 2: a method developer wants to assess the improvement of an LD shrinkage estimator over a standard LD estimator.

This issue may be redundant since Nick has already provided the option useLDshrink = TRUE.
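For phased haplotypes, the classical statistics can be computed directly from haplotype frequencies; a sketch (the function name is hypothetical):

```r
## Two-locus LD statistics from phased haplotypes.
## hapA, hapB: 0/1 vectors with one entry per haplotype.
ld_stats <- function(hapA, hapB) {
  pA  <- mean(hapA)
  pB  <- mean(hapB)
  pAB <- mean(hapA == 1 & hapB == 1)
  D   <- pAB - pA * pB                     # coefficient of disequilibrium
  Dmax <- ifelse(D >= 0,
                 min(pA * (1 - pB), (1 - pA) * pB),
                 min(pA * pB, (1 - pA) * (1 - pB)))
  c(D      = D,
    Dprime = D / Dmax,                     # D' in [-1, 1]
    r2     = D^2 / (pA * (1 - pA) * pB * (1 - pB)))
}
```

For unphased genotypes, `cor(g1, g2)^2` on the 0/1/2 dosage vectors gives the usual r2.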

How to use this package?

Hi Xiang,
I have read your AOAS paper, and I think the ldshrink package would be very helpful to me. But when I downloaded the package, it was hard for me to find the main function for calculating the shrinkage LD structure. Could you give me a toy example showing how to use the package?
Many thanks for your kind help.

                                              Sheng
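A hedged toy example (the exact exported function name and signature are assumptions here and may differ; check the package help after installation):

```r
## Hypothetical toy example: simulate genotypes and a genetic map, then
## compute the shrinkage LD matrix. Function name/signature are assumptions.
library(ldshrink)
set.seed(1)
n <- 100; p <- 50
X   <- matrix(rbinom(n * p, 2, 0.3), n, p)  # n-by-p genotype matrix
map <- sort(runif(p, 0, 50))                # cumulative genetic map (cM)
S   <- ldshrink(X, map)                     # p-by-p shrinkage LD estimate
```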

Segmentation Fault when merging ldm files with GCTB

Hello,
I just started using the GCTB software but am running into an issue when I try to merge sparse ldm files from a single chromosome into one ldm file. I get the following error:
/var/spool/slurmd.cn1605/job55094972/slurm_script: line 13: 82612 Segmentation fault

Here are the basics of the code I ran:
gctb --mldm chr1.mldmlist --make-sparse-ldm --out chr1 --thread 16

Thank you,
Alexis

Possible ways of exploiting the structure of LD matrix

Here we simply call R built-in function eigen().

https://github.com/stephenslab/LDshrink/blob/32b4ad3942f7cb429f23c529b86ab72cfbb1b257/R/block_utils.R#L118

According to the R documentation, this call does not exploit the banded structure of the Wen-Stephens estimator:

Source
eigen uses the LAPACK routines DSYEVR, DGEEV, ZHEEV and ZGEEV.
LAPACK is from http://www.netlib.org/lapack and its guide is listed in the references.

Nick also has a version for sparse LD matrix, which is not included here:

I also have a routine that computes and stores a sparse symmetric matrix for the whole chromosome
it's faster than the dense block-wise routine

Confirm that EVs are actually real numbers

During my recent whole-genome simulations I found that some trailing eigenvalues of LD matrices are (numerically) complex, e.g. 3.28 + 1e-14i.

This seems to be a numerical issue, since the magnitude of the imaginary part is so close to zero.

However, Nick hasn't noticed this pattern in his whole-genome data analyses.
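One plausible explanation: `eigen()` without `symmetric = TRUE` can route a numerically asymmetric matrix to the general (complex) solver. A sketch of a guard, assuming the LD matrix is meant to be symmetric:

```r
## Force real eigenvalues for a (nominally) symmetric LD matrix S.
real_evals <- function(S) {
  S  <- (S + t(S)) / 2                     # symmetrize away floating-point noise
  ev <- eigen(S, symmetric = TRUE)$values  # symmetric solver returns reals
  stopifnot(is.numeric(ev))                # never complex
  ev
}
```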

LD shrinkage estimation in GCTB

It seems that GCTB software has implemented Wen-Stephens shrinkage LD estimator (see also https://github.com/stephenslab/rss-private/issues/64):

Shrunk LD matrix
The shrinkage estimator for the LD correlation matrix was originally proposed by Wen and Stephens (2010). The estimator shrinks the off-diagonal entries of the sample LD matrix toward zero. Zhu and Stephens (2017) used the shrinkage estimator in their Regression with Summary Statistics (RSS) methodology and showed empirically that it can provide improved inference. The shrinkage estimator overcomes some of the issues arising from approximating the full LD matrix in the summary-data based model using a subset of the full LD matrix and constructing the matrix from a reference. The GCTB implementation is a C++ port from that provided with the RSS software and has been adapted for use with the GCTB software.

GCTB software doc page: http://cnsgenomics.com/software/gctb/#SummaryBayesianAlphabet

Revise object documentation based on Wickham book

Below is a check list of guidelines on files in man/ from Wickham's package book: http://r-pkgs.had.co.nz/man.html

Even as a total newbie to R package development, I know I should not manually edit files in the man/ folder most of the time ... Hence, this check list should be used together with #22.

Instead of writing these files by hand, we’re going to use roxygen2 which turns specially formatted comments into .Rd files.

  • Follow the four-step documentation workflow.
  • Roxygen comments start with #' and come before a function.
  • Each line should be wrapped in the same way as your code, normally at 80 characters.
  • Because @ has a special meaning in roxygen, you need to write @@ if you want to add a literal @ to the documentation.
  • The first sentence becomes the title of the documentation.
  • The second paragraph is the description: this comes first in the documentation and should briefly describe what the function does.
  • The third and subsequent paragraphs go into the details: this is a (often long) section that is shown after the argument description and should go into detail about how the function works.
  • All objects must have a title and description. Details are optional.
  • You can add arbitrary sections to the documentation with the @section tag. This is a useful way of breaking a long details section into multiple chunks with useful headings. Section titles should be in sentence case, must be followed by a colon, and they can only be one line long.
  • @seealso allows you to point to other useful resources, either on the web, \url{http://www.r-project.org}, in your package \code{\link{functioname}}, or another package \code{\link[packagename]{functioname}}.
  • If you have a family of related functions where every function should link to every other function in the family, use @family. The value of @family should be plural.
  • @aliases alias1 alias2 ... adds additional aliases to the topic. An alias is another name for the topic that can be used with ?.
  • @param name description describes the function’s inputs or parameters.
  • The description should start with a capital letter and end with a full stop. It can span multiple lines (or even paragraphs) if necessary. All parameters must be documented.
  • Can document multiple arguments in one place: @param x,y Numeric vectors.
  • @examples provides executable R code showing how to use the function in practice.
  • Example code must work without errors as it is run automatically as part of R CMD check.
  • \dontrun{} allows you to include code in the example that is not run.
  • Can put them in separate files and use @example path/relative/to/package/root to insert them into the documentation. (Note that the @example tag here has no ‘s’.)
  • @return description describes the output from the function.
  • Indent the second and subsequent lines of a tag so that when scanning the documentation it’s easy to see where one tag ends and the next begins.
  • Can use roxygen to provide a help page for your package as a whole.
  • Document S3, S4 and RC.
  • Special characters: @@ for @, %% for % and \\ for \.
  • There is a tension between the DRY (don’t repeat yourself) principle of programming and the need for documentation to be self-contained.
  • Inherit parameter descriptions from other functions using @inheritParams source_function and or @inheritParams package::function.
  • Document multiple functions in the same file by using either @rdname or @describeIn.
  • Within roxygen tags, you use .Rd syntax to format text: https://cran.r-project.org/doc/manuals/R-exts.html#Marking-text

Use `data.table` package to parse vcf

Currently ldshrink assumes the input genotype/haplotype data are stored in an n-by-p numerical matrix, which is convenient from statisticians' perspective. However, public genotype/haplotype data from 1000 Genomes are stored in vcf format.

In the past I first used vcftools to convert vcf data to IMPUTE2 format (which is indeed a p-by-n matrix), and then transpose IMPUTE2-formatted data in R. See https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh.

This two-step workflow is not so convenient (at least for statisticians): one has to learn a new program like vcftools before performing any LD-related operations in ldshrink.

It seems that now we can use data.table (https://cran.r-project.org/web/packages/data.table) to directly convert vcf data to the n-by-p matrix in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b.

Here is my own example.

> suppressPackageStartupMessages(library(data.table))
> my_dt <- data.table::fread(cmd="zcat B_CELL_NAIVE.vcf.gz")

|--------------------------------------------------|
|==================================================|
>
> dim(my_dt)
[1] 215708754         8
> head(my_dt)
   #CHROM       POS          ID REF ALT QUAL FILTER
1:  chr15 101094127  rs12904576   T   C    .   PASS
2:  chr15 101114640  rs36053285   T   C    .   PASS
3:  chr15  45685616 rs796721871  TA   T    .   PASS
4:  chr15  45670316   rs2114501   G   A    .   PASS
5:  chr15  45618730  rs59889118   G   A    .   PASS
6:  chr15  45620612   rs1980288   T   C    .   PASS
                                                                                                               INFO
1:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=3.714e-27;Beta=-1.10;Statistic=-15.72;FDR=2.681e-20
2:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=2.649e-24;Beta=-1.09;Statistic=-14.16;FDR=3.412e-18
3: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=2.614e-23;Beta=-1.10;Statistic=-13.63;FDR=3.412e-18
4:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.007e-23;Beta=-1.09;Statistic=-13.6;FDR=3.412e-18
5:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
6:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18

The benefit of using data.table here is two-fold: i) users don't have to leave R and use vcftools to get an n-by-p genotype matrix from vcf data; ii) data.table is a well-maintained and constantly updated package that can handle large datasets efficiently (at least in my past experience).

Hence, we can either add a wrapper that uses data.table to parse vcf for ldshrink users, or at minimum, we can simply provide a vignette showing how to use data.table to parse vcf.

Finally there exists a package vcfR (https://cran.r-project.org/web/packages/vcfR) that might be relevant (but I have not used it much).
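A sketch of such a wrapper (hypothetical helper; it assumes a standard genotype VCF with phased `0|1`-style fields starting at column 10, unlike the summary-level VCF shown above):

```r
## Read a genotype VCF into the n-by-p dosage matrix ldshrink expects.
library(data.table)
vcf_to_matrix <- function(path) {
  dt <- fread(cmd = paste("zcat", path), skip = "#CHROM")  # skip meta lines
  gt <- as.matrix(dt)[, -(1:9), drop = FALSE]  # drop CHROM..FORMAT columns
  dos <- apply(gt, 2, function(g)              # "0|1" -> dosage 0/1/2
    as.integer(substr(g, 1, 1)) + as.integer(substr(g, 3, 3)))
  t(dos)  # transpose variants-by-samples to samples-by-variants (n-by-p)
}
```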

C++17 requirement is limiting

@CreRecombinase @xiangzhu Can you find a way to replace the C++17 requirement (in the Makevars file) with the C++11 requirement, for example? The C++17 standard is much less widely supported; compare this and this. In the interest of making the LDshrink package as accessible as possible, can you try to find a way to avoid using C++17 features?

Thanks to @jean997 for identifying this issue.

Revise package metadata based on Wickham book

Below is a check list of guidelines on the DESCRIPTION file from Wickham's package book: http://r-pkgs.had.co.nz/description.html

  • Every package must have a DESCRIPTION.
  • DESCRIPTION uses a simple file format called DCF, the Debian control format.
  • Each line consists of a field name and a value, separated by a colon.
  • When values span multiple lines, they need to be indented.
  • Both Imports and Suggests take a comma separated list of package names.
  • Put one package on each line, and keep them in alphabetical order. That makes it easy to skim.
  • Imports: packages listed here must be present for your package to work. In fact, any time your package is installed, those packages will, if not already present, be installed on your computer.
  • Explicitly refer to external functions using the syntax package::function().
  • If you use a lot of functions from other packages this is rather verbose. There’s also a minor performance penalty associated with :: (on the order of 5µs, so it will only matter if you call the function millions of times).
  • Suggests: your package can use these packages, but doesn’t require them. You might use suggested packages for example datasets, to run tests, build vignettes, or maybe there’s only one function that needs the package.
  • Packages listed in Suggests are not automatically installed along with your package. This means that you need to check if the package is available before using it (use requireNamespace(x, quietly = TRUE)).
  • The easiest way to add Imports and Suggests to your package is to use devtools::use_package().
  • If you need a specific version of a package, specify it in parentheses after the package name: MASS (>= 7.3.0).
  • Generally, it’s always better to specify the version and to be conservative about which version to require. Unless you know otherwise, always require a version greater than or equal to the version you’re currently using.
  • You should almost always use Imports, not Depends.
  • You can also use Depends to require a specific version of R, e.g. Depends: R (>= 3.0.1).
  • LinkingTo: packages listed here rely on C or C++ code in another package.
  • You can also list things that your package needs outside of R in the SystemRequirements field.
  • Title is a one line description of the package, and is often shown in package listing. It should be plain text (no markup), capitalised like a title, and NOT end in a period. Keep it short: listings will often truncate the title to 65 characters.
  • Description is more detailed than the title. You can use multiple sentences but you are limited to one paragraph. If your description spans multiple lines (and it should!), each line must be no more than 80 characters wide. Indent subsequent lines with 4 spaces.
  • Include a README.md file that goes into much more depth and shows a few examples.
  • To identify the package’s author, and whom to contact if something goes wrong, use the Authors@R field.
  • If your package includes code that you didn’t write, you need to make sure you’re in compliance with its license.
  • If you want to release your package to CRAN, you must pick a standard license.
  • Formally, an R package version is a sequence of at least two integers separated by either . or -. For example, 1.0 and 0.9.1-10 are valid versions.
  • R uses version numbers to determine whether package dependencies are satisfied.
  • A released version number consists of three numbers, <major>.<minor>.<patch>.
  • An in-development package has a fourth component: the development version. This should start at 9000. For example, the first version of the package should be 0.0.0.9000.
  • Increment the development version, e.g. from 9000 to 9001 if you’ve added an important feature that another development package needs to depend on.
  • Collate controls the order in which R files are sourced. This only matters if your code has side-effects; most commonly because you’re using S4.
  • LazyData makes it easier to access data in your package. Because it’s so important, it’s included in the minimal description created by devtools.

Warnings from package installation

In my version of the clang compiler (version 4.0.1), I get several dozen warnings, which are attached. Some of these are in the Eigen source, but others are in our own code, and perhaps some of these can be easily addressed. I've labeled this as a "bug" but perhaps should be read as "potential bug".

ldshrink.install.err.gz

2018-06-14 meeting agenda

Below is a draft of meeting agenda.

  • Time: 2018-06-14, 10 am PST
  • Location: Skype
  • Attendees: Nick, Xiang

Review items

  • What is the current state of LDshrink development?
  • Can current LDshrink reliably produce results for one chromosome (i.e. Xiang's use case)?
  • Can current LDshrink reliably produce results for whole genome (i.e. Nick's use case)?
  • How far is the current LDshrink from an accepted Bioconductor package?

Action items

  • Freeze a set of features for the first public release.
  • Complete LDshrink package development so that it can be accepted by Bioconductor.
  • Draft a short manuscript on the LDshrink package (closely related to package vignettes).
  • Compare with related work: https://doi.org/10.1101/301366.
  • Split work by category (one for coding-related work, the other for writing-related work).
  • Set a clear time frame, to avoid Xiang's research "down time" when he has to teach.

Future items

  • Find public domains to share precomputed LD data for other projects (e.g. I vaguely remember that mashr revision used Nick's precomputed LD data on Midway).
  • Communicate with others (e.g. Matthew, Xin, Peter) once the package is near mature.

Revise vignettes based on Wickham book

Below is a check list of guidelines on files in vignettes/ from Wickham's package book: http://r-pkgs.had.co.nz/vignettes.html

This part seems straightforward since the main tools are rmarkdown and knitr, with which I have some experience. However, preparing and organizing the contents of vignettes may be as hard as writing articles.

  • Start with devtools::use_vignette("my-vignette").
  • Can build all vignettes from the console with devtools::build_vignettes(), but this is rarely useful. Instead use devtools::build() to create a package bundle with the vignettes included.
  • Need to put yourself in the readers’ shoes, and adopt a “beginner’s mind”.
  • Any packages used by the vignette must be declared in the DESCRIPTION.
  • Need to watch the file size.

Interface to export LD matrix for requested region

Here is a feature request for an export interface: given a chromosome and start and end positions, extract the complete LD matrix (or a sparse matrix in R, which can trivially be made complete) for that region from a given blocked LD database file.
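A sketch of what this interface could look like (all names are hypothetical, assuming the database is loaded as a per-chromosome list of `Sigma`/`info` pairs):

```r
## Extract the LD submatrix for [start, end] on one chromosome from a
## block-wise LD store.
extract_ld <- function(ld_db, chr, start, end) {
  blk <- ld_db[[chr]]                            # per-chromosome entry
  hit <- blk$info$pos >= start & blk$info$pos <= end
  list(Sigma = as.matrix(blk$Sigma[hit, hit]),   # densify if stored sparse
       info  = blk$info[hit, ])
}
```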

Revise R codes based on Wickham book

Below is a check list of guidelines on files in R/ from Wickham's package book: http://r-pkgs.had.co.nz/r.html

  • Two extremes are bad: don’t put all functions into one file and don’t put each function into its own separate file.
  • File names should be meaningful and end in .R.
  • Avoid capitalisation problems by never using filenames that differ only in capitalisation.
  • You don’t have to use my style, but I strongly recommend that you use a consistent style and you document it.
  • If you’re working on someone else’s code, don’t impose your own style. Instead, read their style documentation and follow it as closely as possible.
  • Variable and function names should be lowercase. Use an underscore _ to separate words within a name.
  • Generally, variable names should be nouns and function names should be verbs.
  • Where possible, avoid using names of existing functions and variables.
  • Format suggestions on spacing, curly braces, line lengths, assignment, and commenting.
  • Never run code at the top-level of a package: package code should only create objects, mostly functions.
  • Don’t use library() or require(). Use the DESCRIPTION to specify your package’s requirements.
  • Never use source() to load code from a file. Rely on devtools::load_all() which automatically sources all files in R/.
  • Clean up after yourself with on.exit(): options(), par() and setwd()
  • Isolate creating plots and printing output to the console in functions that only produce output.
  • Use .onAttach() for startup message.
  • To display startup messages, always use packageStartupMessage(), and not message().
  • Set custom options for your package with options().
  • If you use .onLoad(), consider using .onUnload() to clean up any side effects.
  • If you’re planning on submitting your package to CRAN, you must use only ASCII characters in your .R files.
  • The easiest way to use the special unicode escape "\u1234" format is to use stringi::stri_escape_unicode().

MHC region exclusion

We often exclude MHC region when estimating LD and performing further downstream analyses.

It would be nice to do this MHC exclusion for users.

The idea is to have the following simple data frame inside ldshrink, and a flag like remove_mhc=TRUE:

> HLA.hg19
    chrom start.base end.base
HLA     6   29719561 32883508
> is(HLA.hg19)
[1] "data.frame" "list"       "oldClass"   "vector"

Similarly we may want to remove centromeres as well.

Both MHC and centromeres data frames can be found here: https://github.com/smgogarten/GWASTools/tree/master/data.

This issue is also related to Issue #14, since they share the same idea: if we cannot estimate something reliably, it is probably better not to estimate it at all.

One more detailed note: pay attention to the sources of these regions -- e.g. if these regions are pulled from UCSC, then they are 0-based.
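The proposed flag could be implemented as a simple interval filter over the SNP info table (helper name hypothetical; the same function would serve for centromeres):

```r
## Drop SNPs falling inside an excluded region such as HLA.hg19.
remove_region <- function(info, region) {
  keep <- !(info$chr == region$chrom &
            info$pos >= region$start.base &
            info$pos <= region$end.base)
  info[keep, , drop = FALSE]
}
## With remove_mhc = TRUE: info <- remove_region(info, HLA.hg19)
```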
