
Knockoffs.jl's Introduction

Here are some of my past and ongoing projects, which I think are pretty cool. Not all projects are being actively developed, but I will certainly respond to issues and pull requests.

I am the main developer for (more recent projects are listed first):

  • GhostKnockoffGWAS: Package for performing knockoff-based analysis for GWAS summary statistics data
  • Ghostbasil.jl: (WIP) Provides Julia wrappers to the C++ code of ghostbasil
  • Knockoffs.jl: Implements the knockoff filter framework for variable selection, which performs conditional independence testing and controls the FDR (false discovery rate)
  • groupknockoffs: Simple app to solve group knockoff optimization, without Julia installed!
  • EasyLD.jl: Julia utilities for downloading and reading LD (linkage disequilibrium) matrices stored in Hail's BlockMatrix format
  • knockoffspy: A Python package that provides a direct wrapper over Knockoffs.jl
  • knockoffsr: An R package that provides a direct wrapper over Knockoffs.jl
  • MendelIHT.jl: Implements iterative hard thresholding (an l0-penalized regression solver). It is highly optimized for handling compressed (binary PLINK) genotype data
  • MendelImpute.jl: A package for genotype imputation, phasing, and (global/local) ancestry inference utilizing a reference haplotype panel. It is significantly faster than existing methods but slightly less accurate
  • Thyrosim.jl: An updated version of THYROSIM, Thyrosim.jl produces individualized thyroid hormone predictions (TT4/TT3/TSH) based on a rather complicated ODE model
  • VCFTools.jl: Julia utilities for handling VCF (Variant Call Format) files
  • fastPHASE.jl: Julia wrapper for the famous fastPHASE genetics software. The primary use case is to allow the original program to run on binary PLINK data.

I am a contributor for (at least 5 commits):

  • QuasiCopula.jl: Implements a new class of distributions (quasi-copulas) that efficiently captures correlation among non-Gaussian random variables
  • SnpArrays.jl: Julia package for handling binary PLINK formatted data. As far as I know, it has the fastest matrix-vector multiplication routine for compressed PLINK files.
  • MendelKinship.jl: Calculates various empirical and theoretical kinship coefficients, based on pedigree or genotype data.

Knockoffs.jl's People

Contributors

biona001
Knockoffs.jl's Issues

SDP solver a lot slower than Matlab/R equivalent

Currently the SDP solver in Knockoffs.jl is roughly 2 times slower than the knockoff package in R and roughly 4 times slower than the knockoff package in Matlab. The speed difference is entirely due to the underlying solvers (the Hypatia SDP solver + Convex.jl in Julia, vs. R's Rdsdp, vs. Matlab's SDPT3).

How could we get more performance?
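For context, the optimization in question is the standard knockoff SDP. Below is a minimal sketch of how it can be posed with Convex.jl; this is my own illustration, not the package's internals, and it assumes Sigma has been scaled to a correlation matrix (any SDP-capable solver can be substituted for SCS):

using Convex, SCS, LinearAlgebra

# Knockoff SDP sketch: maximize sum(s) subject to
# 0 ≤ s ≤ 1 and 2Σ - Diag(s) ⪰ 0 (Σ assumed to be a correlation matrix).
function solve_sdp_sketch(Σ::AbstractMatrix)
    p = size(Σ, 1)
    s = Variable(p)
    constraints = [s >= 0, s <= 1, 2 * Σ - diagm(s) ⪰ 0]
    problem = maximize(sum(s), constraints)
    solve!(problem, SCS.Optimizer)  # e.g. swap in Hypatia.Optimizer
    return evaluate(s)
end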

MWE

using Revise
using Knockoffs
using Test
using LinearAlgebra
using Random
using StatsBase
using Statistics
using Distributions
using ToeplitzMatrices
using RCall
using PositiveFactorizations
using UnicodePlots
using MATLAB
using SCS
using JuMP
using Convex

# simulate data
Random.seed!(2022)
n = 100
p = 400
ρ = 0.4
Sigma = Matrix(SymmetricToeplitz(ρ.^(0:(p-1))))
L = cholesky(Sigma).L
X = randn(n, p) * L' # each row of X ~ N(0, Σ), since cov = L*L' = Σ
true_mu = zeros(p)

Knockoffs.jl

@time knockoff = modelX_gaussian_knockoffs(X, :sdp, true_mu, Sigma) # precompile
s = knockoff.s
  2.686531 seconds (507.05 k allocations: 399.686 MiB, 2.05% gc time)

Compare with Matteo's knockoffs in R

@rput Sigma X true_mu
@time begin
    R"""
    library(knockoff)
    r_s = create.solve_sdp(Sigma)
    X_ko = create.gaussian(X, true_mu, Sigma, method = "sdp", diag_s=r_s)
    """
end
@rget r_s X_ko
  1.208416 seconds (54 allocations: 1.328 KiB)

Compare with Matteo's knockoffs in Matlab

# compare with Matteo's knockoffs in MATLAB
@mput Sigma X true_mu
@time begin
    mat"""
    addpath("/Users/biona001/Documents/MATLAB/knockoffs_matlab")
    matlab_s = knockoffs.create.solveSDP(Sigma)
    """
end
@mget matlab_s
  0.715965 seconds (7 allocations: 192 bytes)

At least the answers agree:

[s r_s matlab_s]
200×3 Matrix{Float64}:
 1.0       1.0       1.0
 0.971429  0.971427  0.971428
 0.811429  0.811431  0.811429
 0.875429  0.875427  0.875428
 0.849829  0.849829  0.849829
 0.860069  0.860068  0.860068
 0.855973  0.855973  0.855973
 0.857611  0.857611  0.857611
 0.856956  0.856956  0.856956
 0.857218  0.857218  0.857218
 0.857113  0.857113  0.857113
 0.857155  0.857155  0.857155
 0.857138  0.857138  0.857138
 ⋮
 0.857155  0.857155  0.857155
 0.857113  0.857113  0.857113
 0.857218  0.857218  0.857218
 0.856956  0.856956  0.856956
 0.857611  0.857611  0.857611
 0.855973  0.855973  0.855973
 0.860069  0.860068  0.860068
 0.849829  0.849829  0.849829
 0.875429  0.875427  0.875428
 0.811429  0.811431  0.811429
 0.971429  0.971427  0.971428
 1.0       1.0       1.0

Fixed-X knockoffs X.ko'X.ko

Hi Benjamin,

I hope you are doing well!

I wanted to check about the following: the two quantities below do not seem to match (for any of the fixed-X knockoff constructions):

me = fixed_knockoffs(X, :maxent)
maximum(abs.(me.Xko'me.Xko .- me.Sigma)) #large 

Do you know why that is? (Happy to provide a MWE if it helps, though it happens for all designs I have tried it on.)

Thank you,
Nikos
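For reference, fixed-X knockoffs are supposed to satisfy the Gram conditions X̃'X̃ = X'X and X'X̃ = X'X - diag(s). A hypothetical MWE along the lines Nikos describes (the field names X and s are my assumptions; Xko and Sigma are taken from the snippet above):

using Knockoffs, LinearAlgebra, Random

Random.seed!(2023)
X = randn(500, 50)
me = fixed_knockoffs(X, :maxent)
# both residuals should be ≈ 0 if the Gram conditions hold
maximum(abs.(me.Xko' * me.Xko .- me.Sigma))                   # X̃'X̃ vs X'X
maximum(abs.(me.X' * me.Xko .- (me.Sigma - Diagonal(me.s))))  # X'X̃ vs X'X - diag(s)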

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Possible numerical error when initializing with equi correlation solution

For group knockoffs, when we use the SDP/ME/MVR solvers, we initialize the optimization with the equicorrelated S matrix. However, that S matrix is the largest S such that (m+1)/m*Sigma - S remains positive semidefinite, so the difference sits at the PSD boundary and is nearly singular. This means that when we compute cholesky((m+1)/m*Sigma - S), we could potentially introduce numerical instabilities.

For this reason, the next minor patch release might initialize S with, for example, equi.S ./ 2 to naively circumvent this issue.
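A sketch of the proposed fix (the function name is hypothetical, and the closed-form equicorrelated solution below assumes Sigma is a correlation matrix):

using LinearAlgebra

# Initialize S at half the equicorrelated solution, keeping
# (m+1)/m*Σ - S safely away from the PSD boundary so that the
# subsequent Cholesky factorization cannot fail numerically.
function init_S_half_equi(Σ::AbstractMatrix, m::Int)
    λmin = eigmin(Symmetric(Matrix(Σ)))
    s = min((m + 1) / m * λmin, 1.0)  # largest feasible equicorrelated s
    return Diagonal(fill(s / 2, size(Σ, 1)))
end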

HMM normalizing constant overflows

The following creates MWE data (test.bed/bim/fam)

using SnpArrays
using MendelIHT
using Random
Random.seed!(2022) # set seed for reproducibility
n = 1000 # sample size
p = 10000 # number of covariates
x = simulate_correlated_snparray("test.bed", n, p, block_length=20, prob=0.75)
make_bim_fam_files(x, rand(1000), "test")

and a call to fastPHASE HMM knockoffs will throw a NaN error caused by the normalizing constants overflowing:

@time hmm_knockoff(data_path, plink_outfile="test.fastphase.knockoffs")
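The usual remedy is to carry the normalizing constants in log space. A generic sketch of the log-sum-exp trick (not the package's actual code):

# Numerically stable log(sum(exp.(x))): subtracting the maximum first
# keeps every exponential ≤ 1, so nothing can overflow.
function logsumexp(x::AbstractVector{<:Real})
    m = maximum(x)
    return m + log(sum(exp.(x .- m)))
end

logsumexp([1000.0, 1000.5])  # ≈ 1000.97, while sum(exp.([1000.0, 1000.5])) overflows to Inf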

Improve documentation

  • Group knockoffs: Missing documentation for representative group knockoffs
  • Calling from R: objective function wrong
  • Need docstrings for solve_s_graphical_group and related methods
  • GhostKnockoffs: update reference
  • Example in model-X knockoffs: increase sample size or effect size so power is better
  • The :SDP option in group knockoffs should be :sdp

Also update readme:

  • Maybe add a note about how to submit PRs/bug reports
  • update installation instructions once package is registered

Efficient sampling of multiple knockoffs

There is a trick to sample m knockoffs without forming a p(m+1) × p(m+1) covariance matrix in memory. The trick was used in sample_mvn_efficient but not yet in condition. Need to do so ASAP.
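For reference, one plausible form of such a trick (my own derivation, not necessarily what sample_mvn_efficient implements): conditional on X, the m knockoff copies have covariance I_m ⊗ S + 1_m 1_m' ⊗ (S - S Σ⁻¹ S), so a joint draw decomposes into m independent N(0, S) pieces plus one shared N(0, S - S Σ⁻¹ S) piece, all p-dimensional:

using LinearAlgebra

# Sample the conditional noise for m knockoff copies without forming the
# pm × pm covariance matrix. Assumes S and S - S*Σ⁻¹*S are positive definite.
function sample_m_knockoff_noise(Σ::AbstractMatrix, S::AbstractMatrix, m::Int)
    p = size(S, 1)
    Ls = cholesky(Symmetric(S)).L                # factor of per-copy part
    La = cholesky(Symmetric(S - S * (Σ \ S))).L  # factor of shared part
    u = La * randn(p)                            # one draw shared by all copies
    return [Ls * randn(p) + u for _ in 1:m]      # m correlated p-vectors
end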

Performance regression

As reported by Jiaqi, there seems to be a significant performance regression in the group knockoff solver.

MWE (Knockoffs.jl v1.1.7 on Julia 1.10):

using Knockoffs
using LinearAlgebra
using Random
using CSV
using DataFrames

# gnomad test data
datadir = "/Users/biona001/Benjamin_Folder/research/4th_project_groupKO/group_knockoff_test_data"
mapfile = CSV.read(joinpath(datadir, "Groups_2_127374341_128034347.txt"), DataFrame)
groups = mapfile[!, :group]
covfile = CSV.read(joinpath(datadir, "CorG_2_127374341_128034347.txt"), DataFrame)
Σ = covfile |> Matrix{Float64}
Σ = 0.99Σ + 0.01I

@time solve_s_group(
    Symmetric(Σ), groups, :maxent, 
    m = 1,          # number of knockoffs per variable to generate
    tol = 0.001,    # convergence tolerance
    outer_iter = 10,    # max number of coordinate descent iterations
    robust = false, # whether to use robust cholesky updates
    verbose = true    # whether to print informative intermediate results
)

┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs ~/.julia/dev/Knockoffs/src/group.jl:360
Maxent initial obj = -44822.11340914695
Iter 1 (PCA): obj = -26794.756871827496, δ = 2.6915336716594664, t1 = 24.96, t2 = 32.32
Iter 2 (CCD): obj = -23681.635922280064, δ = 0.5920258362537, t1 = 534.9, t2 = 881.71, t3 = 0.22
Iter 3 (PCA): obj = -23476.599883936684, δ = 1.0851838794654751, t1 = 559.7, t2 = 913.86
Iter 4 (CCD): obj = -23441.194586522815, δ = 0.01917853738286209, t1 = 1059.27, t2 = 1761.83, t3 = 0.44
Iter 5 (PCA): obj = -23428.13516213957, δ = 0.20912865614936338, t1 = 1085.4, t2 = 1794.69
... # terminated

On Knockoffs.jl v0.3.0 (Julia 1.6.7), see the Fast Cholesky update notebook:

solve_group_max_entropy_ccd: Optimizing 58997 variables
┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs /Users/biona001/.julia/dev/Knockoffs/src/group.jl:230
initial obj = -13811.93738418022
Iter 1: obj = -8863.340455999352, δ = 0.9033274319764998, t1 = 8.85, t2 = 18.23, t3 = 0.04
Iter 2: obj = -7509.4699925543855, δ = 0.7627302489344706, t1 = 19.04, t2 = 36.57, t3 = 0.08
Iter 3: obj = -7495.512979153219, δ = 0.153471691368552, t1 = 29.25, t2 = 54.77, t3 = 0.12
Iter 4: obj = -7491.027981075002, δ = 0.0467433749392695, t1 = 39.2, t2 = 72.95, t3 = 0.16
Iter 5: obj = -7489.0243504490645, δ = 0.007200479138871564, t1 = 49.08, t2 = 90.98, t3 = 0.2
Iter 6: obj = -7487.899557778442, δ = 0.004873425110742543, t1 = 58.98, t2 = 109.34, t3 = 0.24
Iter 7: obj = -7487.198762711129, δ = 0.0031886466238885947, t1 = 68.75, t2 = 127.98, t3 = 0.28
Iter 8: obj = -7486.7318922165505, δ = 0.002106998635746351, t1 = 78.17, t2 = 146.58, t3 = 0.32
Iter 9: obj = -7486.405698134362, δ = 0.0011212737859179697, t1 = 87.39, t2 = 165.21, t3 = 0.37
Iter 10: obj = -7486.1699950030525, δ = 0.0006708441935208511, t1 = 96.52, t2 = 183.44, t3 = 0.41
283.395020 seconds (3.23 M allocations: 329.596 MiB, 0.05% gc time, 0.61% compilation time)

The initial objective values differ because S is initialized differently. However, the timing is much slower, even allowing for the two runs being on different computers.

Normalizing to unit vector introduces numerical instabilities

MWE

First, only centering and scaling each column to mean 0 and variance 1 (but not to unit norm):

using Knockoffs, Random, StatsBase, Statistics

Random.seed!(2021)
n = 3000
p = 1000
X = randn(n, p)
zscore!(X, mean(X, dims=1), std(X, dims=1))
knockoff = knockoff_equi(X);

X̃ = knockoff.X̃
Σ = knockoff.Σ

julia> [vec(X̃' * X̃) vec(Σ)]
1000000×2 Matrix{Float64}:
 2999.0      2999.0
   68.3776     68.3776
  -22.3342    -22.3342
  -43.2302    -43.2302
  -93.228     -93.2281
   32.9963     32.9963
  -43.5092    -43.5092
   71.2002     71.2002
  -91.1404    -91.1404
  -28.554     -28.554
  -31.7598    -31.7598
  -13.1698    -13.1698
  -34.6885    -34.6885
   22.8903     22.8903
   15.2742     15.2742
   36.5119     36.5119
  -23.4127    -23.4127
     ⋮           ⋮
  -95.6452    -95.6452
  -79.5508    -79.5508
 -150.037    -150.037
  -24.5637    -24.5637
  -86.4215    -86.4215
  -26.914     -26.914
    2.37256     2.37256
  -68.1161    -68.1161
  -38.4121    -38.4121
  -17.9981    -17.9981
   16.1139     16.1139
   31.8385     31.8385
  -19.9923    -19.9923
   31.6304     31.6305
  151.434     151.434
  -32.3494    -32.3494
 2999.0      2999.0

Normalizing each column so that ||x_j||_2^2 = 1:

using Knockoffs, Random

Random.seed!(2021)
n = 3000
p = 1000
X = randn(n, p)
normalize_col!(X)
knockoff = knockoff_equi(X);

X̃ = knockoff.X̃
Σ = knockoff.Σ

julia> [vec(X̃' * X̃) vec(Σ)]
1000000×2 Matrix{Float64}:
  1.0006        0.999942
  0.0113254     0.0227959
 -0.0117438    -0.007447
 -0.00865152   -0.0144145
 -0.0298051    -0.0310659
  0.00509079    0.0110014
 -0.0118063    -0.0145065
  0.0198903     0.0237353
 -0.0204552    -0.0303879
 -0.00816176   -0.00951761
 -0.00766452   -0.0105895
 -0.00383353   -0.00439077
 -0.00735197   -0.0115651
  0.00505043    0.00762776
 -0.000776242   0.00509295
  0.0112497     0.0121713
 -0.00566405   -0.00780654
    ⋮             ⋮
 -0.0245307    -0.0318917
 -0.0281486    -0.0265195
 -0.0363757    -0.0500211
 -0.00516446   -0.00819044
 -0.0156618    -0.02881
 -0.0116239    -0.00897332
 -0.00495919    0.00079108
 -0.0227494    -0.0227123
 -0.00883139   -0.0128069
 -0.0114954    -0.00600044
  0.00750958    0.00537207
  0.0062047     0.010616
 -0.00481099   -0.00666484
  0.00555084    0.010545
  0.0406162     0.050493
 -0.0076684    -0.0107781
  0.994321      0.999956
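For reference, normalize_col! presumably rescales each column to unit Euclidean norm; a straightforward sketch (the package's actual implementation may differ):

using LinearAlgebra

# Rescale each column of X in place so that ||X[:, j]||_2 = 1.
function normalize_col!(X::AbstractMatrix)
    for j in axes(X, 2)
        X[:, j] ./= norm(@view X[:, j])
    end
    return X
end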
