neurogenomics / mungesumstats Goto Github PK

View Code? Open in Web Editor NEW

66.0 5.0 15.0 8.91 MB

Rapid standardisation and quality control of GWAS or QTL summary statistics

Home Page: https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

R 99.90% TeX 0.10%

bioinformatics r gwas summary-statistics bioconductor-package r-package standardisation genomics vcf-files database-api

mungesumstats's Introduction

`MungeSumstats`: Standardise the format of GWAS summary statistics

Authors: Alan Murphy, Brian Schilder and Nathan Skene

Updated: Apr-24-2024

Introduction

The MungeSumstats package is designed to facilitate the standardisation of GWAS summary statistics as utilised in our Nature Genetics paper¹.

Overview

The package is designed to handle the lack of standardisation of output files by the GWAS community. The MRC IEU Open GWAS team have provided full summary statistics for >10k GWAS, which are API-accessible via the ieugwasr and gwasvcf packages. But these GWAS are only standardised in the sense that they are VCF format, and can be fully standardised with MungeSumstats.

MungeSumstats provides a framework to standardise the format for any GWAS summary statistics, including those in VCF format, enabling downstream integration and analysis. It addresses the most common discrepancies across summary statistic files, and offers a range of adjustable Quality Control (QC) steps.

Citation

If you use MungeSumstats, please cite the original authors of the GWAS as well as:

Alan E Murphy, Brian M Schilder, Nathan G Skene (2021) MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics. Bioinformatics, btab665, https://doi.org/10.1093/bioinformatics/btab665

Installing `MungeSumstats`

MungeSumstats is available on Bioconductor (≥v3.13). To install MungeSumstats on Bioconductor run:

if (!require("BiocManager")) install.packages("BiocManager")

BiocManager::install("MungeSumstats")

You can then load the package and data package:

library(MungeSumstats)

Note that there is also a docker image for MungeSumstats.

Note that for a number of the checks implored by MungeSumstats a reference genome is used. If your GWAS summary statistics file of interest relates to GRCh38, you will need to install SNPlocs.Hsapiens.dbSNP155.GRCh38 and BSgenome.Hsapiens.NCBI.GRCh38 from Bioconductor as follows:

BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")
BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")

If your GWAS summary statistics file of interest relates to GRCh37, you will need to install SNPlocs.Hsapiens.dbSNP155.GRCh37 and BSgenome.Hsapiens.1000genomes.hs37d5 from Bioconductor as follows:

BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh37")
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")

These may take some time to install and are not included in the package as some users may only need one of GRCh37/GRCh38. If you are unsure of the genome build, MungeSumstats can also infer this information from your data.

Getting started

See the Getting started vignette website for up-to-date instructions on usage.

See the OpenGWAS vignette website for information on how to use MungeSumstats to access, standardise and perform quality control on GWAS Summary Statistics from the MRC IEU Open GWAS Project.

If you have any problems please do file an Issue here on GitHub.

Future Enhancements

The MungeSumstats package aims to be able to handle the most common summary statistic file formats including VCF. If your file can not be formatted by MungeSumstats feel free to report the Issue on GitHub along with your summary statistics file header.

We also encourage people to edit the code to resolve their particular issues too and are happy to incorporate these through pull requests on github. If your summary statistic file headers are not recognised by MungeSumstats but correspond to one of

SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON, 
NSTUDY, INFO or FRQ,

Feel free to update the data("sumstatsColHeaders") following the approach in the data.R file and add your mapping. Then use a Pull Request on GitHub and we will incorporate this change into the package.

Contributors

We would like to acknowledge all those who have contributed to MungeSumstats development:

References

1. Nathan G. Skene, T. E. B., Julien Bryois. Genetic identification of brain cell types underlying schizophrenia. Nature Genetics (2018). doi:10.1038/s41588-018-0129-5

mungesumstats's People

Stargazers

Watchers

Forkers

kittymurphy daklab sjfandrews nandobonf befh ukdri xshen796 jonathangriffiths lifebit-ai ekacar cfbeuchel sharkts666 rmgpanw laleoarrow

mungesumstats's Issues

Speed up VCF download

Speed up VCF download from Open GWAS with axel at this step in import_sumstats:

if(vcf_download){
                    message("Downloading VCF ==> ",save_path) 
                    download.file(vcf_url, save_path)
                    # system(paste0("axel ",vcf_url," -o ",save_path))
                    # system(paste0("axel ",vcf_url,".tbi -o ",save_path,".tbi"))
                    vcf_url <- save_path 
                }

non-numeric argument to mathematical function

Bug at this following step, within check_n_int():

is the vignette properly formatted

I think you need a space before your bulletted list in markdown -- this is what i see upon rendering via install.packages build_vignettes=TRUE

Add header hashes for quick reference in Rstudio

Create search function to find datasets of interest on Open GWAS

find_sumstats now flexibly searches for substrings and min sample size criteria.

https://github.com/neurogenomics/MungeSumstats/blob/bschilder_dev/R/find_sumstats.R

Downside: its dep ieugwasr is only available as on GitHub, and i seem to recall Bioc not allowing deps as remotes.
Potential solution: extract the necessary code and remove the dep entirely. Downside: we'll be more susceptible to errors that result from changes to their API.

BiocCheck: WARNING: Add non-empty \value sections to the following man pages:

* Checking man page documentation...
    * WARNING: Add non-empty \value sections to the following man pages: man/axel.Rd, man/check_save_path.Rd,
      man/convert_sumstats.Rd, man/download_vcf.Rd, man/downloader.Rd, man/find_sumstats.Rd, man/preview_sumstats.Rd,
      man/read_header.Rd, man/read_sumstats.Rd, man/remove_nonstandard_vcf_cols.Rd, man/report_summary.Rd, man/select_api.Rd,
      man/sort_coords.Rd

Return sum stats as GRanges or VRanges

Implemented using new functions to_GRanges and to_VRanges

Option to tabix-index

Tabix indexing makes querying your sum stats extremely efficient, with minimal changes to the format.

seqminer has a number of functions to do this within R.
I've also written a bunch of command line wrappers to do these things in echolocatoR.

Provide liftover to other genome builds

Currently, MungeSumstats detects the current genome build but does not do liftover to one genome build or another. Would be awesome to implement this.

I extracted the functions needed to do this from XGR, since XGR can be a bit tricky to install and this is the only part I needed for liftover.

https://github.com/RajLabMSSM/echolocatoR/blob/master/R/xLiftOver.R

Change tests to be more robust

Changed all tests so they're less susceptible to slight changes in code.
For example, many tests used row index to identify the removed SNP, but row indices can change when rows are added/removed from the example data in the future, or when rows are sorted.

This now tests "handling missing data" more specifically (as opposed to other sources of differences).

Example:

expect_equal(reformatted_lines, org_lines[-58])

vs.

problem_snp <- "rs9320913"
rsid_index <- grep(problem_snp, org_lines, ignore.case = TRUE) 
expect_equal(reformatted_lines, org_lines[-rsid_index])

test-missing_data.R

test_that("Handle missing data", {
  file <- tempfile()
  #Remove data from line 3 to check it is deleted
  eduAttainOkbay <- readLines(system.file("extdata","eduAttainOkbay.txt",
                                          package="MungeSumstats"))
  eduAttainOkbay_missing <- eduAttainOkbay
  eduAttainOkbay_missing[3] <-
    "rs12987662\t2\t100821548\tA\tC\t0.3787\t0.027\t0.003\t"
  problem_snp <- "rs9320913"
  #write the Educational Attainment GWAS to a temp file for testing
  writeLines(eduAttainOkbay_missing,con = file)
  #Run MungeSumstats code
  reformatted <- MungeSumstats::format_sumstats(file,ref_genome="GRCh37",
                                                on_ref_genome = FALSE,
                                                strand_ambig_filter=FALSE,
                                                bi_allelic_filter=FALSE,
                                                allele_flip_check=FALSE,
                                                sort_coordinates = FALSE)
  reformatted_lines <- readLines(reformatted)
  #Should equal org apart from this one line
  writeLines(eduAttainOkbay,con = file)
  org <- MungeSumstats::format_sumstats(file,ref_genome="GRCh37",
                                        on_ref_genome = FALSE,
                                        strand_ambig_filter=FALSE,
                                        bi_allelic_filter=FALSE,
                                        allele_flip_check=FALSE, 
                                        sort_coordinates = FALSE)
  org_lines <- readLines(org)
  rsid_index <- grep(problem_snp, org_lines, ignore.case = TRUE) 
  #reordering in function, line 3 is now 58
  expect_equal(reformatted_lines, org_lines[-rsid_index])
})

Add N_dropNA arg

Allow users to decide whether they want to drop SNPs where N is NA (only when N column is present and not empty).

#' Ensure all SNPs have N less than X std dev below mean
#' 
#'Incase some SNPs were genotyped by a specialized genotyping array and 
#'have substantially more samples than others. These will be removed.
#'
#' @param sumstats_dt data table obj of the summary statistics file for the GWAS
#' @param path Filepath for the summary statistics file to be formatted
#' @param N_std numeric The number of standard deviations above the mean a 
#' SNP's N is needed to be removed. Default is 5.
#' @param N_dropNA Whether to keep rows where N is \code{NA}. Default is FALSE.
#' @return list containing sumstats_dt, the modified summary statistics data table object
#' @keywords internal
#' @importFrom stats sd
check_n_num <- function(sumstats_dt, 
                        path, 
                        N_std, 
                        N_dropNA=FALSE){
  message("Ensuring all SNPs have N<",N_std," std dev below mean.")
  N = NULL
  col_headers <- names(sumstats_dt)
  if("N" %in% col_headers && N_std>0){
    mean_N <- mean(sumstats_dt$N)
    sd_N <- stats::sd(sumstats_dt$N)
    num_bad_ids <- nrow(sumstats_dt[N>((N_std*sd_N)+mean_N),])
    if(num_bad_ids>0){
      msg <- paste0(num_bad_ids, " SNPs have N values ",
                    N_std," standard deviations above the mean",
                    " and will be removed")
      message(msg)
      sumstats_dt <- sumstats_dt[N<=((N_std*sd_N)+mean_N),]
    }
    if(!N_dropNA){
      message("Removing rows where is.na(N)")
      sumstats_dt <- sumstats_dt[complete.cases(N)]
    }
    return(list("sumstats_dt"=sumstats_dt))
  }
  else{
    return(list("sumstats_dt"=sumstats_dt))
  }
}

Miscellaneous errors encountered when parsing summary stats

Hi,

Thanks for this great tool. It really eases what I've been doing somewhat manually over the last year.

I've ran into some issues which may be of interest to the general community, when trying to parse summary stats from the wild:

BOLT-LMM outputs files with headers A0/A1. This is not contained in the sumstatsColHeaders. There's also the additional issue that A1 is no longer the effect allele
SNP columns of the format chr:bp:a1:a2 fails to parse. Whenever it's present in the data I just get a Error in .new_IRanges_from_start_end(start, end) 'start' or 'end' cannot contain NAs. Setting the value to missing for these entries fixes the problem
Lower case alleles are not automatically converted to upper case
Entries with missing CHR/BP are not automatically removed and instead results in an error
Entries with multiple RSIDs seperated by a comma (e.g. rs1,rs2) leads to an error. The current code seems to only catch rs1_rs2.

On a minor note, it would also be nice if it was possible to feed the main function, format_sumstats, a data.frame with sumstats directly, to avoid loading the data a second time if any manipulation is needed.

Report how many SNPs you started with.

Adding a report at the very beginning of format_sumstats so we know how much the dataset gets whittled down by the end.

  sumstats_return[["sumstats_dt"]] <- read_sumstats(path = path, 
                                                      nThread = nThread)
    
    report_summary(sumstats_dt = sumstats_return$sumstats_dt)
    orig_dims <- dim(sumstats_return$sumstats_dt)

Also adding at the very end, as an optional arg to thereport_summary function.

report_summary <- function(sumstats_dt,
                           orig_dims=NULL){
    orig_dims_report <- if(!is.null(orig_dims)){paste0(" (started with ",orig_dims[1],")")} else {NULL};
    
    message("Formatted summary statistics report:",
            "\n   - ", formatC(nrow(sumstats_dt), big.mark = ",")," rows",orig_dims_report,
            "\n   - ", formatC(length(unique(sumstats_dt$SNP)),big.mark = ",")," unique variants",
            "\n   - ", formatC(nrow(subset(sumstats_dt, P<5e-8)),big.mark = ",")," genome-wide significant variants (P<5e-8)",
            "\n   - ", formatC(length(unique(sumstats_dt$CHR)),big.mark = ",")," chromosomes"
            ) 
}

read VCF with data.table

read_vcf can now read in and parse different VCF formats using data.table (much faster than readLines).

Allow users to return data directly

 if(return_data){
    message("Returning data.table directly.")
   return(sumstats_return$sumstats_dt)
  } else {
    message("Returning path to saved data.")
    return(check_save_out$save_path) # Returns address of modified file
  }

When `allele_filp_check = TRUE` , allele frequence should also be filped

Thank you for this wonderful tool. I noticed that when using allele_filp_check = TRUE, the A1/A2 and BETA are fliped according to the reference. However, the FRQ column is unchanged. Actually, I'm not clear what is the FRQ column. It should be "The minor allele frequency (MAF) of the SNP" based on the manual. However, it sometimes appears to be larger than 0.5. It seems like it just copies the frequency column in the raw file without changing anything.

Some software requires FRQ to be the effect allele frequency (e.g., GCTA). Therefore the FRQ should be set as the frequency of A1 by default, and be flipped if needed, especially when the EAF column is detected in the raw file.

Make arguments explicit

This avoid errors later on if you decide to change the order of arguments in a function., eg:

validate_parameters(path=path) vs. validate_parameters(path)

`test_that("non-biallelic SNPs are removed",` failing

I think this needs to be rewritten so it doesn't assume that omitting line 58 is the only thing required to make reformatted and org_lines the same. Should instead actually detect where the biallelic SNPs are. That way, it's more obvious what this function is trying to do, and it makes it robust to changes in example data.

https://github.com/neurogenomics/MungeSumstats/blob/master/tests/testthat/test-bi_alllelic.R

Error in ids2rowids(ids) :

Input file VCF:
https://gwas.mrcieu.ac.uk/files/finn-a-G6_ALZHEIMER/finn-a-G6_ALZHEIMER.vcf.gz

`missing value where TRUE/FALSE needed`

As always, HPC is a joy to run otherwise working scripts on.

Requested the following interactive resources on HPC:

qsub -I -l select=01:ncpus=8:mem=96gb -l walltime=08:00:00

Then tried the following in R:

# Gather OpenGWAS IDs
metagwas <- MungeSumstats::find_sumstats(traits = c("amyotrophic","frontotemporal","dementi
    a"))    
metagwas=metagwas[grep("alzheimer",metagwas$trait, ignore.case = T, invert=T),]      

# Import
gwas <- MungeSumstats::import_sumstats(ids = metagwas$id,  
                                           save_dir = "/rds/general/project/neurogenomics-lab/li
    ve/GWAS_sumstats/OpenGWAS",  
                                           nThread = 8,  
                                           parallel_across_ids = FALSE)

The VCFs seem to download ok, but then I get the missing value where TRUE/FALSE needed error right after.

I'm showing the output from when I had already downloaded the VCF files first, and then imported them on a 2nd run. I checked and it doesn't seem like the VCFs got screwed up or anything during download (e.g. only partially downloaded).

...
...
...
========== Processing dataset : ebi-a-GCST005647 ==========

Using previously downloaded VCF.
Formatted summary statistics will be saved to ==> /rds/general/project/neurogenomics-lab/live/GWAS_sumstats/OpenGWAS/ebi-a-GCST005647.tsv.gz
missing value where TRUE/FALSE needed
...
...
...

I ran again with set to effect_columns_nonzero=TRUE and effect_columns_nonzero=FALSE (just in case), but didn't seem to make any difference.

Session info

R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 8 (Core)

Matrix products: default
BLAS/LAPACK: /rds/general/user/bms20/home/anaconda3/envs/ewce_suite/lib/libopenblasp-r0.3.17.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MungeSumstats_1.1.14

loaded via a namespace (and not attached):
  [1] bitops_1.0-7                matrixStats_0.60.0          fs_1.5.0                   
  [4] usethis_2.0.1               devtools_2.4.2              bit64_4.0.5                
  [7] filelock_1.0.2              progress_1.2.2              httr_1.4.2                 
 [10] googleAuthR_1.4.0           rprojroot_2.0.2             GenomeInfoDb_1.28.0        
 [13] tools_4.1.1                 utf8_1.2.2                  R6_2.5.0                   
 [16] DBI_1.1.1                   BiocGenerics_0.38.0         withr_2.4.2                
 [19] tidyselect_1.1.1            prettyunits_1.1.1           processx_3.5.2             
 [22] bit_4.0.4                   curl_4.3.2                  compiler_4.1.1             
 [25] cli_3.0.1                   Biobase_2.52.0              desc_1.3.0                 
 [28] DelayedArray_0.18.0         rtracklayer_1.52.0          callr_3.7.0                
 [31] rappdirs_0.3.3              stringr_1.4.0               digest_0.6.27              
 [34] Rsamtools_2.8.0             R.utils_2.10.1              XVector_0.32.0             
 [37] pkgconfig_2.0.3             sessioninfo_1.1.1           MatrixGenerics_1.4.0       
 [40] dbplyr_2.1.1                fastmap_1.1.0               BSgenome_1.60.0            
 [43] rlang_0.4.11                rstudioapi_0.13             RSQLite_2.2.5              
 [46] BiocIO_1.2.0                generics_0.1.0              jsonlite_1.7.2             
 [49] BiocParallel_1.26.0         dplyr_1.0.7                 R.oo_1.24.0                
 [52] VariantAnnotation_1.38.0    RCurl_1.98-1.4              magrittr_2.0.1             
 [55] GenomeInfoDbData_1.2.6      Matrix_1.3-4                Rcpp_1.0.7                 
 [58] S4Vectors_0.30.0            fansi_0.4.2                 lifecycle_1.0.0            
 [61] R.methodsS3_1.8.1           stringi_1.7.3               yaml_2.2.1                 
 [64] SummarizedExperiment_1.22.0 zlibbioc_1.38.0             pkgbuild_1.2.0             
 [67] BiocFileCache_2.0.0         grid_4.1.1                  blob_1.2.2                 
 [70] parallel_4.1.1              crayon_1.4.1                lattice_0.20-44            
 [73] Biostrings_2.60.0           GenomicFeatures_1.44.1      hms_1.1.0                  
 [76] KEGGREST_1.32.0             ps_1.6.0                    pillar_1.6.2               
 [79] GenomicRanges_1.44.0        rjson_0.2.20                biomaRt_2.48.0             
 [82] stats4_4.1.1                pkgload_1.2.1               XML_3.99-0.7               
 [85] glue_1.4.2                  data.table_1.14.0           remotes_2.4.0              
 [88] png_0.1-7                   vctrs_0.3.8                 testthat_3.0.4             
 [91] purrr_0.3.4                 assertthat_0.2.1            cachem_1.0.5               
 [94] restfulr_0.0.13             gargle_1.2.0                tibble_3.1.3               
 [97] GenomicAlignments_1.28.0    AnnotationDbi_1.54.0        memoise_2.0.0              
[100] IRanges_2.26.0              ellipsis_0.3.2

Option to write as VCF

Implemented using :

        VariantAnnotation::writeVcf(vr, filename = check_save_out$save_path)

Speed improvements

Not sure if this is already implemented in some places, but MungeSumstats could really benefit from optimizing the speed of processes. Not always an easy task, but here's some ideas.

1. Comb through the code and search for inefficiencies

[DONE ✅] I can take a stab at this since i have fresh eyes.

2. Parallelize across CPUs

Run non-sequentially dependent steps in parallel
Chunk the sum stats and process each chunk on a different core. This is basically what DelayedArray does, so that might be another option.
[DONE ✅] Process each sumstats dataset in parallel (implemented in import_sumstats).

3. Parallelize across GPU(s)

A la cudf, which is part of the RAPIDS suite.

Use formatC for more readable reports

formatC(n, big.mark=",") makes large number more readable in reports.

398198 SNPs are on chromosomes X, Y, MT and will be removed

vs.

398,198 SNPs are on chromosomes X, Y, MT and will be removed

Add alternative means of getting sample ID from VCF

VCF headers don't always have ##Sample row. Providing a backup solution.

   sample_id <- get_vcf_sample_ids(path = path) 
    
    #### Read in full data ####
    sumstats_file <- data.table::fread(path, nThread = nThread, sep="\t", skip = "#CHR") 
    sumstats_file <- sumstats_file %>% dplyr::rename(CHROM="#CHROM")
    
    if(is.null(sample_id)){
        #### Infer sample ID from data colnames if necessary #### 
        idcol_index <- grep("FORMAT",colnames(sumstats_file),ignore.case = TRUE)
        if(any(length(colnames(sumstats_file))>=idcol_index)){
            sample_id <- colnames(sumstats_file)[seq(idcol_index[1]+1,length(colnames(sumstats_file)))] 
            message("sample ID(s) inferred from data colnames: ",paste(sample_id,collapse = ", "))
        } else {
            stop("Sample ID(s) could not be extracted from VCF header or inferred from data colnames.")
        }
    }

BiocCheck :* ERROR: Packages providing 2 object(s) used in this package should be imported in the NAMESPACE file

* ERROR: Packages providing 2 object(s) used in this package should be imported in the NAMESPACE file, otherwise packages
      importing this package may fail.
    
      package::object in function()
        data.table::year in find_sumstats()
        BSgenome::seqnames in get_genome_build()

Add progress indicators / improve messages

It'd be great to have progress bars wherever possible to let users know exactly what step the process it at and how long it might take. It always makes me nervous when i see something taking a long time and I don't know whether Rstudio has frozen or whether it's still chugging along.

More generally, it would be good to report not just how many SNPs were removed, but how many are remaining. This is ultimately what people care the most about.

Also, i think there's a couple points where the messages could be a bit more informative.

1. "VCF format detected, this will be converted to a standard summary statistics file format."

I would say, "...to a standardised table format."

2. 75 SNP IDs are not correctly formatted. These will be corrected from the reference genome.

"Correct" is a bit subjective here, best to clarify what you're doing, such as converting all SNPs to RSIDS from version y of X resource.

3. "940147 SNPs are not on the reference genome. These will be corrected from the reference genome."

Not sure what this means.

4. "7 RS IDs are duplicated in the sumstats file. These duplicates will be removed"

Are only the first instances of these duplicates removed, or all instances? If the former, are they selected based on the summary statistics (mean or max p-value?) or arbitrarily?

5. "Succesfully finished preparing sumstats file, preview:"

Minor thing, but the columns headers are not aligned with the data below in the preview. Makes it hard to read.

Also, typo: Succesfully --> Successfully

Error in ids2rowids(ids) : 'ids' cannot contain NAs

Encountered when processing VCF step-by-step in format_sumstats.

Specifically, the get_genome_build step.

Data source: https://gwas.mrcieu.ac.uk/files/ieu-a-1124/ieu-a-1124.vcf.gz

Sort sum stats by genomic coordinates

Currently, format_sumstats returns the sum stats sorted by RSIDs (alphabetically).
Usually GWAS sum stats files are (ideally) sorted by genomic coordinates (CHR then BP).

This is especially important for tabix, which requires all rows to be sorted by coordinates. #7

Account for INFO with all NAs

I tried processing Kunkle2019 from Open GWAS but format_sumstats returned an empty file.

Found out this is because, while there is an INFO column, they're all NAs. In order to format_sumstats I had to edit all these NAs to 1s and rerun format_sumstats. This worked, but it'd be ideal if format_sumstatscould detect these situations, provide a warning, and then replace with 1s automatically.

More generally, there should be some reports and checks at the end of format_sumstats to ensure that there is a reasonable number SNPs in the formatted file and that the data isn't just blank. Beyond number of SNPs, reporting to the user the number of remaining SNPs with p-value<5e-8 might be nice.

Allow user-defined `sumstatsColHeaders` mapping as argument

Hi, just a suggestion: I think pre-defined column-name mapping is convenient but not very flexible. For example, I have a summary stats table with a1 as effective allele and a0 as other allele. The format_sumstats will, however, treat a1 as reference allele and a0 as effective allele. I'll have to change the colnames of the file manually or update the sumstatsColHeaders just for this dataset to get it to work.

I think it would be better if the format_sumstats could take colnames mapping as the argument. For the example I showed, it would be much better if I can just do something like:

format_sumstats(path = raw_file, A1 = "a0", A2 = "a1", ...)

new_col_header = data.frame(Uncorrected = c("a0", "a1"), Corrected = c("A1", "A2"))
format_sumstats(path = raw_file, colheaders = new_col_header)

Thank you!

`sumstatsColHeaders` variable missing

Can't seem to find where this is defined, unless the user defines it as a global beforehand.

Provide API-access to MRC IEU Open GWAS

Open GWAS

ieugwasr

Update Bioconductor binaries?

Hi,

It seems the binaries in Bioconductor (https://bioconductor.org/packages/release/bioc/html/MungeSumstats.html) is still v1.0.1. Any chance of pushing a new release? The current guides don't really match up and it seems there's been a lot of bug fixes since then.

Br,
David

Make `data.table::copy()` explicit

Given the generic name of this function, best to make it explicit to avoid potential conflicts with other packages.

data.table::copy() vs. copy()

Ensure all X and Y chromosomes are uppercase

Now implemented incheck_chr():

#### Make all CHR uppercase 
message("Making all CHR uppercase.")
sumstats_dt[,CHR:=toupper(CHR)]

Also added to sort_coordinates() just to be safe.

Recognize previously formatted files of the same name

This is helpful when looping over multiple sum stats, and you need to restart midway. Rather than reprocessing them all, you can just detect the ones that exist and return them.

Now implemented at the top of format_sumstats

 check_save_out <- check_save_path(save_path = save_path, 
                                    write_vcf = write_vcf)
  
  if(file.exists(check_save_out$save_path) & force_new==FALSE){
    message("Importing previously formatted file. Set `force_new=TRUE` to override this.")
  } else { 
    #Avoid reloading ref genome every time, save it to this parent environment
    #after being made once - speed up code
    rsids = NULL
    #Check input parameters

Saved formatted files disappear after restarting R session

Not sure what's up here, thought temp files stuck around for a while (at least a couple days). I ran import_sumstats, which ran all the way through and supposedly wrote several files (according to the messages), restarted my R session and now they supposedly don't exist.

Are tempdir() and tempfile() doing something different what I thought they were doing?

Read data with`data.table::fread()`

data.table::fread() is faster than readLines, even when single-threaded.

It's also more flexible, as it doesn't require tab-delimited format, and instead infers the format and imports accordingly. This means we can get rid of the check_tab_delimited() step.

Allow users to specify output path

tempdir is a good default, but some users may want to directly create the file in a particular path.

This is especially desirable bc the resulting path name is only stored in memory. If Rstudio crashes or you close the session before you wrote down the path name, it's extremely hard to figure out the tempdir location bc it uses random letters as the path. If you have a specific location you know the file will be, it makes it easier to find it.

Warning in readLines(infile) : incomplete final line found

Warning in readLines(infile) :
  incomplete final line found on '/Users/schilder/Desktop/MungeSumstats/tests/testthat/test-info_filter.R'
Warning in readLines(infile) :
  incomplete final line found on '/Users/schilder/Desktop/MungeSumstats/tests/testthat/test-strand_ambig.R'

 message("+ downloader:: axel not installed.\n",
                    "For Mac users, please install via brew in the command line (`brew install axel`)\n",
                    "or visit https://github.com/axel-download-accelerator/axel for more details");

  message("+ downloader:: axel not installed.\n",
                    "For Mac users, please install via brew in the command line (`brew install axel`)\n",
                    "or visit axel GitHub repo for more details");

Account for sitations where there's a mismatch between write_vcf and save_path

Now in check_save_path:

if(write_vcf){ 
        save_path <- gsub(paste(suffixes,collapse = "|"),".vcf.gz", save_path)
        sep = "\t"
        file_type = "vcf"
    } else {
       #### Account for mismatch between write_vcf and save_path ####
        suffixes.vcf <- supported_suffixes(tabular = FALSE, tabular_compressed = FALSE)
        if(any(endsWith(save_path,suffixes.vcf))){
            message("`write_vcf=FALSE` but `save_path` suggests VCF output format. ",
                    "Switching output to tabular format (.tsv.gz).")
            save_path <- gsub(paste(suffixes.vcf,collapse = "|"),".tsv.gz", save_path)
            sep = "\t"
            file_type = ".tsv"
        } 
    }

We need to standardise this.

Add function to infer MAF

May be useful to infer MAF although the downside is this wouldn't be related to the population in question so could result in questionable results. Similar code is here, taking MAF from UK Biobank: https://github.com/RajLabMSSM/echolocatoR/blob/dev/R/UKBiobank_LD.R
function located here: https://github.com/RajLabMSSM/echolocatoR/blob/6b9086d6755be0cc802027b76fb8314058fa0bf1/R/standardize.R#L304