*****************************************************************
*-- Data and code to reproduce figures and results in paper --*
*-- --*
*-- 'Accurate error control in high dimensional association --*
*-- testing using conditional false discovery rates' --*
*-- --*
*-- --*
*-- James Liley and Chris Wallace, 2020 --*
*-- Correspondence: JL, [email protected] --*
*****************************************************************
This folder contains all relevant material to reproduce plots
and results in the paper above. We assume that the associated R
package 'cfdr' is installed and loaded. If it is not installed, use
library(devtools)
install_github("jamesliley/cfdr")
As a failsafe, the directory code/ contains a reproduction of
the package in code/functions.R. This can be sourced instead.
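The two options above can be combined into one step. The sketch
below is illustrative only: load_cfdr is a hypothetical helper
written for this README, not part of the cfdr package. It tries
the installed package first and otherwise sources the bundled copy.

```r
# Hypothetical helper: load cfdr from the installed package if
# available, otherwise source the bundled copy of its functions.
load_cfdr <- function(fallback = file.path("code", "functions.R")) {
  if (requireNamespace("cfdr", quietly = TRUE)) {
    library(cfdr)
    return("package")
  }
  if (file.exists(fallback)) {
    source(fallback)  # evaluated in the global environment by default
    return("source")
  }
  message("cfdr is not installed and ", fallback, " was not found")
  "none"
}

how_loaded <- load_cfdr()
```

The return value records which route was taken, which may be
useful to note alongside any regenerated results.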
In some areas (mostly large-scale simulations), the complete
regeneration of all data used in the paper takes a prohibitively
long time to produce. For this reason, we include a script which
reproduces a single run of the simulation, and several matrices
of results from previous runs.
Prior to executing a run of the simulation, a folder should be
created in the same directory as this README called 'simulations'
(lowercase).
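The folder can also be created from within R. This is a minimal
sketch; ensure_sim_dir is a hypothetical helper name, and the
default assumes the working directory is the folder containing
this README.

```r
# Create the 'simulations' folder under 'root' if it does not exist.
ensure_sim_dir <- function(root = ".") {
  sim_dir <- file.path(root, "simulations")
  if (!dir.exists(sim_dir)) dir.create(sim_dir)
  dir.exists(sim_dir)  # TRUE once the folder is in place
}

ensure_sim_dir()
```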
In all processes involving random number generation, we set a
random seed explicitly. For this reason, all results should match
those in the text exactly.
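As a small illustration of why a fixed seed gives repeatable
output, the sketch below draws the same random numbers twice.
The seed value is arbitrary and is not one used in the paper.

```r
# Resetting the seed reproduces the same pseudo-random draws.
set.seed(20200101)  # arbitrary illustrative seed
first <- rnorm(5)
set.seed(20200101)
second <- rnorm(5)
identical(first, second)
```

Exact agreement also depends on the R version: for example, the
default algorithm behind sample() changed in R 3.6.0, which is one
reason to match the versions listed at the bottom of this README.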
In this guide, we will outline the subdirectories in this folder,
then run through what each script does, and what each data object
contains.
An additional README in ./data explains columns of the matrix of
simulation results.
All code should be run with the folder containing this README as
the working directory. File paths are otherwise relative. The
bottom of this readme indicates code and package versions, which
for full reproduction should be matched exactly.
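A quick way to confirm the working directory before running any
script is sketched below. missing_dirs is a hypothetical helper;
it only checks for the subdirectories named in this README.

```r
# Report which of the expected subdirectories are absent from 'root'.
expected_dirs <- c("code", "outputs", "data")
missing_dirs <- function(root = getwd()) {
  expected_dirs[!dir.exists(file.path(root, expected_dirs))]
}

absent <- missing_dirs()
if (length(absent) > 0) {
  message("Not found in ", getwd(), ": ",
          paste(absent, collapse = ", "))
}
```

If anything is reported as missing, setwd() to the folder
containing this README and check again.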
*****************************************************************
*-- Directories --*
*****************************************************************
This folder contains four subdirectories:
code: contains R scripts
outputs: contains figures included in the manuscript (as PDFs)
and tables of results
simulations: directory to which simulation output is written
(empty)
data: contains raw datasets used for analysis and matrices of
results from previously run simulations
*****************************************************************
*-- Scripts --*
*****************************************************************
Folder 'code' contains seven files:
code/run_simulation.R: runs a single iteration of the simulation,
given a random seed and other parameters
code/submit_codes.txt: details the scripts used to generate each
class of simulation results, and gives instructions for
reproducing them
code/simulation_analysis.R: given matrices of simulation results
and GWAS data, generates tables and draws plots as pdfs to
'outputs' directory.
code/twas_analysis.R: runs the analysis of transcriptome-wide
association study data in the motivating example in the paper.
code/functions.R: a reproduction of all necessary code in the
package cfdr.
code/reproducibility_check.R: checks R and package versions, and
for each simulation matrix (except sim_parametric_adjustment)
chooses a random row and reproduces it.
code/run_sim_paradj.R: a shortened version of run_simulation.R
which uses a parametric adjustment for parametric cFDR. Used
for reproducibility only.
*****************************************************************
*-- Data objects --*
*****************************************************************
Folder 'data' contains 15 objects. Nine are simulation matrices,
two are .RData files generated by code/simulation_analysis.R,
and three (in subfolder TWAS) are objects relating to the TWAS
analysis. The final object is a README describing columns of
simulation matrices.
data/sim_gen_high_fdr.txt: matrix of simulation results for
general circumstances, with parameters randomly chosen from
distribution specified in manuscript, controlling FDR at 0.1.
data/sim_gen_high_fdr_null.txt: matrix of simulation results for
general circumstances, with parameters chosen randomly from
distribution specified in manuscript conditional on
n1p + n1pq=0; that is, no true associations, controlling FDR
at 0.1.
data/sim_gen_low_fdr.txt: matrix of simulation results for
general circumstances, with parameters randomly chosen from
distribution specified in manuscript, controlling FDR at 0.01.
data/sim_gen_low_fdr_null.txt: matrix of simulation results for
general circumstances, with parameters chosen randomly from
distribution specified in manuscript conditional on
n1p + n1pq=0; that is, no true associations, controlling FDR
at 0.01.
data/sim_fixed.txt: matrix of simulation results in which
parameters are chosen from one of several fixed parameter sets.
data/sim_cov.txt: matrix of simulation results with parameters
selected randomly as for sim_gen_high_fdr.txt, but with
dependent observations according to either a block diagonal
or equicorrelated covariance matrix.
data/sim_cov_null.txt: matrix of simulation results with
parameters selected randomly as for sim_gen_high_fdr_null.txt;
that is, with no true associations, and with dependent
observations according to either a block diagonal or
equicorrelated covariance matrix.
data/sim_unrelated.txt: matrix of simulation results with
parameters chosen as for sim_gen_high_fdr.txt, but
conditioning on n1pq=0; that is, no shared associations.
data/sim_parametric_adjustment.txt: matrix of simulation results
using parametrised version of cFDR only, and using an
'adjustment' based on the parametrisation rather than the
empirical CDF.
data/iterated_cfdr_data.RData: data used in assessing iterated
cFDR. This is deterministically generated by a block of code in
code/simulation_analysis.R, but because it takes several hours,
it is saved and restored rather than regenerated.
data/convergence_data.RData: data used for drawing the figure showing
convergence of various L-regions. This is deterministically
generated by a block of code in code/simulation_analysis.R, but
as it takes several hours to generate it is saved and restored
rather than regenerated.
data/TWAS/raw/BCAC.dat: raw data for breast cancer GWAS,
downloaded from twashub.org.
data/TWAS/raw/OCAC.dat: raw data for ovarian cancer GWAS,
downloaded from twashub.org.
data/TWAS/twas_summary.RData: processed TWAS data, generated
deterministically by code/twas_analysis.R. Takes several hours to
generate, so it is saved and restored rather than regenerated.
data/README.txt: an explanation of the columns of each of the
simulation matrices
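The objects above can be inspected from R before running any
analysis. The sketch below is illustrative only: both helper names
are hypothetical, and it assumes the simulation matrices are
whitespace-delimited text with a header row, which should be
checked against the README in ./data.

```r
# Read one simulation matrix from a text file.
# ASSUMPTION: whitespace-delimited with a header row.
read_sim_matrix <- function(path, header = TRUE) {
  stopifnot(file.exists(path))
  as.matrix(read.table(path, header = header))
}

# List the names of the objects stored in an .RData file without
# loading them into the global environment.
list_rdata <- function(path) {
  env <- new.env()
  load(path, envir = env)
  ls(env)
}
```

For example, dim(read_sim_matrix("data/sim_fixed.txt")) would give
the number of simulation runs and recorded columns, and
list_rdata("data/convergence_data.RData") would list the saved
objects.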
*****************************************************************
*-- R and package versions --*
*****************************************************************
R version 3.3.3
mnormt version 1.5-5
mgcv version 1.8-17
pbivnorm version 0.6.0
MASS version 7.3-45
fields version 8.10
matrixStats version 0.51.0
latex2exp version 0.4.0
maps version 3.1.1
spam version 1.4-0
grid version 3.3.3
nlme version 3.1-131.1
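The versions above can be compared against the local installation
with a short check. This is a sketch only and is not the code in
code/reproducibility_check.R; check_versions is a hypothetical
helper, and the expected versions are copied from the list above.

```r
# Expected versions, copied from the list in this README.
expected <- c(mnormt = "1.5-5", mgcv = "1.8-17", pbivnorm = "0.6.0",
              MASS = "7.3-45", fields = "8.10",
              matrixStats = "0.51.0", latex2exp = "0.4.0",
              maps = "3.1.1", spam = "1.4-0", grid = "3.3.3",
              nlme = "3.1-131.1")

# Compare installed package versions with the expected ones.
# Packages that are not installed are reported as NA / FALSE.
check_versions <- function(expected) {
  installed <- vapply(names(expected), function(pkg) {
    tryCatch(as.character(utils::packageVersion(pkg)),
             error = function(e) NA_character_)
  }, character(1))
  match <- mapply(function(have, want) {
    !is.na(have) && package_version(have) == package_version(want)
  }, installed, expected)
  data.frame(package = names(expected), expected = unname(expected),
             installed = unname(installed), match = unname(match),
             stringsAsFactors = FALSE)
}

r_matches <- getRversion() == "3.3.3"
print(check_versions(expected))
```

Rows with match equal to FALSE indicate packages whose installed
version differs from the one used for the paper, or which are not
installed at all.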
Output of sessionInfo() on reproduction:
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux 7.8 (Nitrogen)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] latex2exp_0.4.0 fields_8.10 maps_3.1.1 spam_1.4-0 MASS_7.3-45 pbivnorm_0.6.0 mgcv_1.8-17 nlme_3.1-131.1 mnormt_1.5-5 matrixStats_0.51.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 Matrix_1.2-8 tools_3.3.3 stringi_1.1.6 stringr_1.2.0 lattice_0.20-34
>