
Knockoffs.jl's Introduction

Here are some of my past and ongoing projects, which I think are pretty cool. Not all projects are being actively developed, but I will certainly respond to issues and pull requests.

I am the main developer for (more recent projects are listed first):

  • GhostKnockoffGWAS: Package for performing knockoff-based analysis for GWAS summary statistics data
  • Ghostbasil.jl: (WIP) Provides Julia wrappers to the C++ code of ghostbasil
  • Knockoffs.jl: Implements the knockoff filter framework for variable selection, which performs conditional independence testing and controls the FDR (false discovery rate)
  • groupknockoffs: Simple app to solve group knockoff optimization, without Julia installed!
  • EasyLD.jl: Julia utilities for downloading and reading LD (linkage disequilibrium) matrices stored in Hail's BlockMatrix format
  • knockoffspy: A Python package that provides a direct wrapper over Knockoffs.jl
  • knockoffsr: An R package that provides a direct wrapper over Knockoffs.jl
  • MendelIHT.jl: Implements iterative hard thresholding (an l0-penalized regression solver). It is highly optimized for handling compressed (binary PLINK) genotype data
  • MendelImpute.jl: A package for genotype imputation, phasing, and (global/local) ancestry inference utilizing a reference haplotype panel. It is significantly faster than existing methods but slightly less accurate
  • Thyrosim.jl: An updated version of THYROSIM, Thyrosim.jl produces individualized thyroid hormone predictions (TT4/TT3/TSH) based on a rather complicated ODE model
  • VCFTools.jl: Julia utilities for handling VCF (Variant Call Format) files
  • fastPHASE.jl: Julia wrapper for the famous fastPHASE genetics software. The primary use case is to allow the original program to run on binary PLINK data.

I am a contributor for (at least 5 commits):

  • QuasiCopula.jl: Implements a new class of distributions (quasi-copulas) that efficiently captures correlation among non-Gaussian random variables
  • SnpArrays.jl: Julia package for handling binary PLINK formatted data. As far as I know, it has the fastest matrix-vector multiplication routine for compressed PLINK files.
  • MendelKinship.jl: Calculates various empirical and theoretical kinship coefficients, based on pedigree or genotype data.

Knockoffs.jl's People

Contributors

biona001
Knockoffs.jl's Issues

SDP solver a lot slower than Matlab/R equivalent

Currently the SDP solver in Knockoffs.jl is roughly 2 times slower than the knockoff package in R and roughly 4 times slower than the knockoff package in Matlab. The speed difference is entirely due to the underlying solvers (the Hypatia SDP solver + Convex.jl in Julia, vs. R's Rdsdp, vs. Matlab's SDPT3).

How could we get more performance?
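For context, the optimization in question is the standard knockoff SDP. Below is a minimal sketch of how it can be posed with Convex.jl; this is my own illustration, not the package's internals, and it assumes Sigma has been scaled to a correlation matrix (any SDP-capable solver can be substituted for SCS):

using Convex, SCS, LinearAlgebra

# Knockoff SDP sketch: maximize sum(s) subject to
# 0 ≤ s ≤ 1 and 2Σ - Diag(s) ⪰ 0 (Σ assumed to be a correlation matrix).
function solve_sdp_sketch(Σ::AbstractMatrix)
    p = size(Σ, 1)
    s = Variable(p)
    constraints = [s >= 0, s <= 1, 2 * Σ - diagm(s) ⪰ 0]
    problem = maximize(sum(s), constraints)
    solve!(problem, SCS.Optimizer)  # e.g. swap in Hypatia.Optimizer
    return evaluate(s)
end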

MWE

using Revise
using Knockoffs
using Test
using LinearAlgebra
using Random
using StatsBase
using Statistics
using Distributions
using ToeplitzMatrices
using RCall
using PositiveFactorizations
using UnicodePlots
using MATLAB
using SCS
using JuMP
using Convex

# simulate data
Random.seed!(2022)
n = 100
p = 400
ρ = 0.4
Sigma = Matrix(SymmetricToeplitz(ρ.^(0:(p-1))))
L = cholesky(Sigma).L
X = randn(n, p) * L' # each row of X ~ N(0, Σ), since cov = L*L' = Σ
true_mu = zeros(p)

Knockoffs.jl

@time knockoff = modelX_gaussian_knockoffs(X, :sdp, true_mu, Sigma) # precompile
s = knockoff.s
  2.686531 seconds (507.05 k allocations: 399.686 MiB, 2.05% gc time)

Compare with Matteo's knockoffs in R

@rput Sigma X true_mu
@time begin
    R"""
    library(knockoff)
    r_s = create.solve_sdp(Sigma)
    X_ko = create.gaussian(X, true_mu, Sigma, method = "sdp", diag_s=r_s)
    """
end
@rget r_s X_ko
  1.208416 seconds (54 allocations: 1.328 KiB)

Compare with Matteo's knockoffs in Matlab

# compare with Matteo's knockoffs in MATLAB
@mput Sigma X true_mu
@time begin
    mat"""
    addpath("/Users/biona001/Documents/MATLAB/knockoffs_matlab")
    matlab_s = knockoffs.create.solveSDP(Sigma)
    """
end
@mget matlab_s
  0.715965 seconds (7 allocations: 192 bytes)

At least the answers agree:

[s r_s matlab_s]
200×3 Matrix{Float64}:
 1.0       1.0       1.0
 0.971429  0.971427  0.971428
 0.811429  0.811431  0.811429
 0.875429  0.875427  0.875428
 0.849829  0.849829  0.849829
 0.860069  0.860068  0.860068
 0.855973  0.855973  0.855973
 0.857611  0.857611  0.857611
 0.856956  0.856956  0.856956
 0.857218  0.857218  0.857218
 0.857113  0.857113  0.857113
 0.857155  0.857155  0.857155
 0.857138  0.857138  0.857138
 ⋮
 0.857155  0.857155  0.857155
 0.857113  0.857113  0.857113
 0.857218  0.857218  0.857218
 0.856956  0.856956  0.856956
 0.857611  0.857611  0.857611
 0.855973  0.855973  0.855973
 0.860069  0.860068  0.860068
 0.849829  0.849829  0.849829
 0.875429  0.875427  0.875428
 0.811429  0.811431  0.811429
 0.971429  0.971427  0.971428
 1.0       1.0       1.0

Fixed-X knockoffs X.ko'X.ko

Hi Benjamin,

I hope you are doing well!

I wanted to check about the following: the two quantities below do not seem to match (for any of the fixed-X knockoff constructions):

me = fixed_knockoffs(X, :maxent)
maximum(abs.(me.Xko'me.Xko .- me.Sigma)) #large 

Do you know why that is? (Happy to provide a MWE if it helps, though it happens for all designs I have tried it on.)

Thank you,
Nikos
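For reference, fixed-X knockoffs are supposed to satisfy the Gram conditions X̃'X̃ = X'X and X'X̃ = X'X - diag(s). A hypothetical MWE along the lines Nikos describes (the field names X and s are my assumptions; Xko and Sigma are taken from the snippet above):

using Knockoffs, LinearAlgebra, Random

Random.seed!(2023)
X = randn(500, 50)
me = fixed_knockoffs(X, :maxent)
# both residuals should be ≈ 0 if the Gram conditions hold
maximum(abs.(me.Xko' * me.Xko .- me.Sigma))                   # X̃'X̃ vs X'X
maximum(abs.(me.X' * me.Xko .- (me.Sigma - Diagonal(me.s))))  # X'X̃ vs X'X - diag(s)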

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Possible numerical error when initializing with equi correlation solution

For group knockoffs, when we use the SDP/ME/MVR solvers, we initialize the optimization with the equicorrelated S matrix. However, that S matrix is the largest S such that (m+1)/m*Sigma - S remains positive semidefinite, so the difference sits at the PSD boundary and is nearly singular. This means that when we compute cholesky((m+1)/m*Sigma - S), we could potentially introduce numerical instabilities.

For this reason, the next minor patch release might initialize S with, for example, equi.S ./ 2 to naively circumvent this issue.
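A sketch of the proposed fix (the function name is hypothetical, and the closed-form equicorrelated solution below assumes Sigma is a correlation matrix):

using LinearAlgebra

# Initialize S at half the equicorrelated solution, keeping
# (m+1)/m*Σ - S safely away from the PSD boundary so that the
# subsequent Cholesky factorization cannot fail numerically.
function init_S_half_equi(Σ::AbstractMatrix, m::Int)
    λmin = eigmin(Symmetric(Matrix(Σ)))
    s = min((m + 1) / m * λmin, 1.0)  # largest feasible equicorrelated s
    return Diagonal(fill(s / 2, size(Σ, 1)))
end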

HMM normalizing constant overflows

The following creates MWE data (test.bed/bim/fam)

using SnpArrays
using MendelIHT
using Random
Random.seed!(2022) # set seed for reproducibility
n = 1000 # sample size
p = 10000 # number of covariates
x = simulate_correlated_snparray("test.bed", n, p, block_length=20, prob=0.75)
make_bim_fam_files(x, rand(1000), "test")

and a call to fastPHASE HMM knockoffs will throw a NaN error caused by the normalizing constants overflowing:

@time hmm_knockoff(data_path, plink_outfile="test.fastphase.knockoffs")
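The usual remedy is to carry the normalizing constants in log space. A generic sketch of the log-sum-exp trick (not the package's actual code):

# Numerically stable log(sum(exp.(x))): subtracting the maximum first
# keeps every exponential ≤ 1, so nothing can overflow.
function logsumexp(x::AbstractVector{<:Real})
    m = maximum(x)
    return m + log(sum(exp.(x .- m)))
end

logsumexp([1000.0, 1000.5])  # ≈ 1000.97, while sum(exp.([1000.0, 1000.5])) overflows to Inf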

Improve documentation

  • Group knockoffs: Missing documentation for representative group knockoffs
  • Calling from R: objective function wrong
  • Need docstrings for solve_s_graphical_group and related methods
  • GhostKnockoffs: update reference
  • Example in model-X knockoffs: increase sample size or effect size so power is better
  • The :SDP option in group knockoffs should be :sdp

Also update readme:

  • Maybe add a note about how to submit PRs/bug reports
  • update installation instructions once package is registered

Efficient sampling of multiple knockoffs

There is a trick to sample m knockoffs without forming a p(m+1) × p(m+1) covariance matrix in memory. The trick was used in sample_mvn_efficient but not yet in condition. Need to do so ASAP.
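For reference, one plausible form of such a trick (my own derivation, not necessarily what sample_mvn_efficient implements): conditional on X, the m knockoff copies have covariance I_m ⊗ S + 1_m 1_m' ⊗ (S - S Σ⁻¹ S), so a joint draw decomposes into m independent N(0, S) pieces plus one shared N(0, S - S Σ⁻¹ S) piece, all p-dimensional:

using LinearAlgebra

# Sample the conditional noise for m knockoff copies without forming the
# pm × pm covariance matrix. Assumes S and S - S*Σ⁻¹*S are positive definite.
function sample_m_knockoff_noise(Σ::AbstractMatrix, S::AbstractMatrix, m::Int)
    p = size(S, 1)
    Ls = cholesky(Symmetric(S)).L                # factor of per-copy part
    La = cholesky(Symmetric(S - S * (Σ \ S))).L  # factor of shared part
    u = La * randn(p)                            # one draw shared by all copies
    return [Ls * randn(p) + u for _ in 1:m]      # m correlated p-vectors
end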

Performance regression

As reported by Jiaqi, there seems to be a significant performance regression in the group knockoff solver.

MWE (Knockoffs.jl v1.1.7 on Julia 1.10):

using Knockoffs
using LinearAlgebra
using Random
using CSV
using DataFrames

# gnomad test data
datadir = "/Users/biona001/Benjamin_Folder/research/4th_project_groupKO/group_knockoff_test_data"
mapfile = CSV.read(joinpath(datadir, "Groups_2_127374341_128034347.txt"), DataFrame)
groups = mapfile[!, :group]
covfile = CSV.read(joinpath(datadir, "CorG_2_127374341_128034347.txt"), DataFrame)
Σ = covfile |> Matrix{Float64}
Σ = 0.99Σ + 0.01I

@time solve_s_group(
    Symmetric(Σ), groups, :maxent, 
    m = 1,          # number of knockoffs per variable to generate
    tol = 0.001,    # convergence tolerance
    outer_iter = 10,    # max number of coordinate descent iterations
    robust = false, # whether to use robust cholesky updates
    verbose = true    # whether to print informative intermediate results
)

┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs ~/.julia/dev/Knockoffs/src/group.jl:360
Maxent initial obj = -44822.11340914695
Iter 1 (PCA): obj = -26794.756871827496, δ = 2.6915336716594664, t1 = 24.96, t2 = 32.32
Iter 2 (CCD): obj = -23681.635922280064, δ = 0.5920258362537, t1 = 534.9, t2 = 881.71, t3 = 0.22
Iter 3 (PCA): obj = -23476.599883936684, δ = 1.0851838794654751, t1 = 559.7, t2 = 913.86
Iter 4 (CCD): obj = -23441.194586522815, δ = 0.01917853738286209, t1 = 1059.27, t2 = 1761.83, t3 = 0.44
Iter 5 (PCA): obj = -23428.13516213957, δ = 0.20912865614936338, t1 = 1085.4, t2 = 1794.69
... # terminated

On Knockoffs.jl v0.3.0 (Julia 1.6.7), see the Fast Cholesky update notebook:

solve_group_max_entropy_ccd: Optimizing 58997 variables
┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs /Users/biona001/.julia/dev/Knockoffs/src/group.jl:230
initial obj = -13811.93738418022
Iter 1: obj = -8863.340455999352, δ = 0.9033274319764998, t1 = 8.85, t2 = 18.23, t3 = 0.04
Iter 2: obj = -7509.4699925543855, δ = 0.7627302489344706, t1 = 19.04, t2 = 36.57, t3 = 0.08
Iter 3: obj = -7495.512979153219, δ = 0.153471691368552, t1 = 29.25, t2 = 54.77, t3 = 0.12
Iter 4: obj = -7491.027981075002, δ = 0.0467433749392695, t1 = 39.2, t2 = 72.95, t3 = 0.16
Iter 5: obj = -7489.0243504490645, δ = 0.007200479138871564, t1 = 49.08, t2 = 90.98, t3 = 0.2
Iter 6: obj = -7487.899557778442, δ = 0.004873425110742543, t1 = 58.98, t2 = 109.34, t3 = 0.24
Iter 7: obj = -7487.198762711129, δ = 0.0031886466238885947, t1 = 68.75, t2 = 127.98, t3 = 0.28
Iter 8: obj = -7486.7318922165505, δ = 0.002106998635746351, t1 = 78.17, t2 = 146.58, t3 = 0.32
Iter 9: obj = -7486.405698134362, δ = 0.0011212737859179697, t1 = 87.39, t2 = 165.21, t3 = 0.37
Iter 10: obj = -7486.1699950030525, δ = 0.0006708441935208511, t1 = 96.52, t2 = 183.44, t3 = 0.41
283.395020 seconds (3.23 M allocations: 329.596 MiB, 0.05% gc time, 0.61% compilation time)

The initial objective values differ because S is initialized differently. However, the timing is much slower, even allowing for the two runs being on different computers.

Normalizing to unit vector introduces numerical instabilities

MWE

First, only centering and scaling each column to mean 0 and variance 1 (but not to unit norm):

using Knockoffs, Random, StatsBase, Statistics

Random.seed!(2021)
n = 3000
p = 1000
X = randn(n, p)
zscore!(X, mean(X, dims=1), std(X, dims=1))
knockoff = knockoff_equi(X);

X̃ = knockoff.X̃
Σ = knockoff.Σ

julia> [vec(X̃' * X̃) vec(Σ)]
1000000×2 Matrix{Float64}:
 2999.0      2999.0
   68.3776     68.3776
  -22.3342    -22.3342
  -43.2302    -43.2302
  -93.228     -93.2281
   32.9963     32.9963
  -43.5092    -43.5092
   71.2002     71.2002
  -91.1404    -91.1404
  -28.554     -28.554
  -31.7598    -31.7598
  -13.1698    -13.1698
  -34.6885    -34.6885
   22.8903     22.8903
   15.2742     15.2742
   36.5119     36.5119
  -23.4127    -23.4127
     ⋮           ⋮
  -95.6452    -95.6452
  -79.5508    -79.5508
 -150.037    -150.037
  -24.5637    -24.5637
  -86.4215    -86.4215
  -26.914     -26.914
    2.37256     2.37256
  -68.1161    -68.1161
  -38.4121    -38.4121
  -17.9981    -17.9981
   16.1139     16.1139
   31.8385     31.8385
  -19.9923    -19.9923
   31.6304     31.6305
  151.434     151.434
  -32.3494    -32.3494
 2999.0      2999.0

Normalizing each column so that ||x_j||_2^2 = 1:

using Knockoffs, Random

Random.seed!(2021)
n = 3000
p = 1000
X = randn(n, p)
normalize_col!(X)
knockoff = knockoff_equi(X);

X̃ = knockoff.X̃
Σ = knockoff.Σ

julia> [vec(X̃' * X̃) vec(Σ)]
1000000×2 Matrix{Float64}:
  1.0006        0.999942
  0.0113254     0.0227959
 -0.0117438    -0.007447
 -0.00865152   -0.0144145
 -0.0298051    -0.0310659
  0.00509079    0.0110014
 -0.0118063    -0.0145065
  0.0198903     0.0237353
 -0.0204552    -0.0303879
 -0.00816176   -0.00951761
 -0.00766452   -0.0105895
 -0.00383353   -0.00439077
 -0.00735197   -0.0115651
  0.00505043    0.00762776
 -0.000776242   0.00509295
  0.0112497     0.0121713
 -0.00566405   -0.00780654
    ⋮             ⋮
 -0.0245307    -0.0318917
 -0.0281486    -0.0265195
 -0.0363757    -0.0500211
 -0.00516446   -0.00819044
 -0.0156618    -0.02881
 -0.0116239    -0.00897332
 -0.00495919    0.00079108
 -0.0227494    -0.0227123
 -0.00883139   -0.0128069
 -0.0114954    -0.00600044
  0.00750958    0.00537207
  0.0062047     0.010616
 -0.00481099   -0.00666484
  0.00555084    0.010545
  0.0406162     0.050493
 -0.0076684    -0.0107781
  0.994321      0.999956
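For reference, normalize_col! presumably rescales each column to unit Euclidean norm; a straightforward sketch (the package's actual implementation may differ):

using LinearAlgebra

# Rescale each column of X in place so that ||X[:, j]||_2 = 1.
function normalize_col!(X::AbstractMatrix)
    for j in axes(X, 2)
        X[:, j] ./= norm(@view X[:, j])
    end
    return X
end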
