
An R package for phenotype generation and association testing for phenome-wide association studies (PheWAS)

License: GNU General Public License v3.0


deepphewas's Introduction

DeepPheWAS

Overview

DeepPheWAS is an R package for running phenome-wide association studies (PheWAS). It gives the user control over every stage of a PheWAS, from data wrangling through phenotype generation to association testing.

Installation

# The development version from GitHub:
# install.packages("devtools")
devtools::install_github("Richard-Packer/DeepPheWAS")

Use

For a detailed tutorial, see our GitHub Pages site:

https://richard-packer.github.io/DeepPheWAS_site/


deepphewas's Issues

Can this package be run on other datasets?

Hello,

I would like to know whether this package can be used to run the analysis on other datasets. From what I have observed, phenotype generation is based solely on UK Biobank data.

Regards
Akhil

"Can't convert <double> to <date>" during concept creation

If, for any reason, the earliest_date of a clinical code's occurrence is missing for a participant during concept creation, the program stops with the following error:

Error:
! Assigned data `value` must be compatible with existing data.
ℹ Error occurred for column `earliest_date`.
x Can't convert <double> to <date>.

The culprit seems to be this line, which also tries to assign 0 to the missing date values:

https://github.com/Richard-Packer/DeepPheWAS/blob/d569e975213e4138d026fffe28ea1f42d6aa15ee/R/concept_creation.R#L203C21-L203C21
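One way to avoid this class of error, sketched here under assumptions and not taken from the package source, is to fill missing dates with a value that is itself a Date (for example a typed NA) and never with the bare number 0. The helper fill_missing_date() and the toy concepts table are hypothetical; only the earliest_date column name comes from the error message.

```r
# Hypothetical helper: replace missing dates with a fill value that must be a
# Date, so the column never receives a bare number such as 0.
fill_missing_date <- function(dates, fill = as.Date(NA)) {
  stopifnot(inherits(dates, "Date"), inherits(fill, "Date"), length(fill) == 1)
  dates[is.na(dates)] <- fill
  dates
}

# Toy stand-in for a concept table; the values are invented for illustration.
concepts <- data.frame(
  eid = c(1001, 1002, 1003),
  earliest_date = as.Date(c("2010-05-01", NA, "2015-11-23"))
)

# Default behaviour: missing stays missing, and the column stays a Date.
concepts$earliest_date <- fill_missing_date(concepts$earliest_date)

# Alternative (an assumption, not the package's behaviour): an explicit
# placeholder date for downstream code that cannot handle NA.
with_placeholder <- fill_missing_date(
  concepts$earliest_date,
  fill = as.Date("1900-01-01")
)
```

Passing a number such as 0 as the fill value makes the helper stop early, which surfaces the type mismatch at the point of assignment.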

Handling duplicated columns in tab data files

We have our tab data spread across multiple files. However, the files have some overlap in the columns that they include. Currently, minimum_data_R() combines data from multiple files with a dplyr::full_join(), which causes duplicate columns to be suffixed with .x and .y respectively. As a result, the columns are essentially missing downstream, where such suffixes aren't expected.

Could minimum_data_R() be updated to be a bit smarter about this, and in the presence of duplicate columns for example use the one from the file specified last in the list?
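The "last file wins" behaviour requested above could look roughly like the following base R sketch. It is not how minimum_data_R() works today; combine_last_wins() and the toy tables are hypothetical, and the join key is assumed to be eid.

```r
# Hypothetical helper: full-join a list of tables on a key, letting columns
# from later tables replace same-named columns from earlier ones.
combine_last_wins <- function(tables, key = "eid") {
  Reduce(function(acc, new) {
    # Columns present in both tables, excluding the join key.
    overlap <- setdiff(intersect(names(acc), names(new)), key)
    # Drop the older copies so the join cannot create .x / .y suffixes.
    acc <- acc[, setdiff(names(acc), overlap), drop = FALSE]
    merge(acc, new, by = key, all = TRUE)
  }, tables)
}

# Invented example data: both files carry an "age" column.
file_a <- data.frame(eid = c(1, 2), sex = c(0, 1), age = c(50, 61))
file_b <- data.frame(eid = c(2, 3), age = c(62, 47), bmi = c(27.5, 31.0))

combined <- combine_last_wins(list(file_a, file_b))
```

A trade-off of this simple policy is that values held only in an earlier file are lost for any overlapping column (here, the age of participant 1 becomes NA). Coalescing the two copies would keep them, at the cost of a slightly longer function.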

Problem with test data

Hi!

I'm trying to run the test data to better understand the software, but I have the following problem. The "hesin.csv.gz" file does not seem to have the correct column names: it has only 4 columns and is supposed to have more. I also found an error with the "GPC.csv.gz" file caused by an incorrect column name, but I have changed it manually. Can you please help me with this issue?

./02_data_preparation.R \
--save_location $phewas_folder/data/ \
--min_data $phewas_folder/data/min_tab_test.gz \
--hesin_diag $package_folder/extdata/worked_example/HES_Diag.csv.gz \
--HESIN $package_folder/extdata/worked_example/hesin.csv.gz \
--death_cause $package_folder/extdata/worked_example/death.csv.gz \
--death $package_folder/extdata/worked_example/death_date.csv.gz \
--king_coef $package_folder/extdata/worked_example/KING_coef.csv.gz \
--GPC $package_folder/extdata/worked_example/GPC_new.csv.gz
Joining with by = join_by(eid)
Joining with by = join_by(eid)
Error in dplyr::na_if():
! Can't convert y to match type of x <data.table>.
Backtrace:
▆
├─DeepPheWAS::data_preparation_R(...)
│ └─... %>% tidyr::drop_na()
├─tidyr::drop_na(.)
├─dplyr::select(., .data$eid, .data$ins_index, .data$dates)
├─dplyr::mutate(., dates = lubridate::dmy(.data$dated))
├─dplyr::mutate(...)
├─dplyr::na_if(., "")
│ └─vctrs::vec_cast(x = y, to = x, x_arg = "y", to_arg = "x")
└─vctrs (local) <fn>()
  └─vctrs::vec_default_cast(...)
    ├─base::withRestarts(...)
    │ └─base (local) withOneRestart(expr, restarts[[1L]])
    │   └─base (local) doWithOneRestart(return(expr), restart)
    └─vctrs::stop_incompatible_cast(...)
      └─vctrs::stop_incompatible_type(...)
        └─vctrs:::stop_incompatible(...)
          └─vctrs:::stop_vctrs(...)
            └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)

Warning messages:
1: There was 1 warning in dplyr::mutate().
ℹ In argument: date_of_dx = lubridate::ymd(...).
Caused by warning:
! 295 failed to parse.
2: In data_preparation_R(min_data = arguments$min_data, GPC = arguments$GPC, :
'HESIN' does not have the correct colnames and may not produce the correct output, expected colnames are:
'eid,ins_index,dsource,source,epistart,epiend,epidur,bedyear,epistat,epitype,epiorder,spell_index,spell_seq,spelbgin,spelend,speldur,pctcode,gpprpct,category,elecdate,elecdur,admidate,admimeth_uni,admimeth,admisorc_uni,admisorc,firstreg,classpat_uni,classpat,intmanag_uni,intmanag,mainspef_uni,mainspef,tretspef_uni,tretspef,operstat,disdate,dismeth_uni,dismeth,disdest_uni,disdest,carersi'
not:
eid,ins_index,epistart,admidate
differences between inputed file and expected are:
dsource,source,epiend,epidur,bedyear,epistat,epitype,epiorder,spell_index,spell_seq,spelbgin,spelend,speldur,pctcode,gpprpct,category,elecdate,elecdur,admimeth_uni,admimeth,admisorc_uni,admisorc,firstreg,classpat_uni,classpat,intmanag_uni,intmanag,ma [... truncated]

Best regards,

Judit
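The column comparison reported in the warning above can be reproduced by hand before running the pipeline. The sketch below is illustrative only: missing_columns() is a hypothetical helper, and the expected vector is a short subset of the names listed in the warning, not the full list.

```r
# Hypothetical helper: report which expected columns are absent from a header.
missing_columns <- function(actual, expected) setdiff(expected, actual)

# Columns reported for the worked-example file in the warning.
actual <- c("eid", "ins_index", "epistart", "admidate")

# Illustrative subset of the expected names from the same warning.
expected <- c("eid", "ins_index", "dsource", "source",
              "epistart", "epiend", "admidate")

absent <- missing_columns(actual, expected)
print(absent)
```

For a real file, the header could be read with something like names(utils::read.csv(path, nrows = 1)), where path is a placeholder for the location of the file being checked.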

How to create input fields to run 02_data_preparation.R

Dear developer,
I have successfully finished the example pipeline, but I ran into a problem running 02_data_preparation.R on my real UKB data. My min_tab_test.gz contains the UKB fields 20002 and 20004.
Could you tell me how to create the following inputs?

--hesin_diag $package_folder/extdata/worked_example/HES_Diag.csv.gz \
--HESIN $package_folder/extdata/worked_example/hesin.csv.gz \
--death_cause $package_folder/extdata/worked_example/death.csv.gz \
--death $package_folder/extdata/worked_example/death_date.csv.gz \
--king_coef $package_folder/extdata/worked_example/KING_coef.csv.gz \
--GPC $package_folder/extdata/worked_example/GPC.csv.gz

Or can I use the worked-example files, such as $package_folder/extdata/worked_example/HES_Diag.csv.gz, with the whole UKB data?

Hope to receive your answer soon.

Best wishes

dplyr version must be less than v1.1.0

I kept getting an error when running 02_data_preparation.R: Can't convert 'y' <character> to match type of 'x' <integer>. After some searching online, I found the cause in the changelog for dplyr 1.1.0: na_if() now uses the vctrs package, which is stricter about type stability. After downgrading dplyr to v1.0.10, the error is gone. At the moment, your imports only require dplyr (>= 1.0.0). I recommend changing this to dplyr (>= 1.0.0 & <=1.0.10).
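An alternative to pinning dplyr, sketched here on the assumption that the failing step applies na_if(., "") to a whole table containing non-character columns, is to blank out empty strings in character columns only. The helper blank_to_na() and the toy hesin table are hypothetical and use base R so the example has no package dependencies.

```r
# Hypothetical helper: turn empty strings into NA in character columns only,
# leaving integer, numeric and date columns untouched.
blank_to_na <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) {
      col[!is.na(col) & col == ""] <- NA_character_
    }
    col
  })
  df
}

# Invented example: an integer id column next to a character date column.
hesin <- data.frame(
  eid = c(1L, 2L, 3L),
  epistart = c("01/02/2010", "", "15/07/2012"),
  stringsAsFactors = FALSE
)

cleaned <- blank_to_na(hesin)
```

With dplyr, a comparable idiom is likely dplyr::mutate(df, dplyr::across(where(is.character), ~ dplyr::na_if(.x, ""))), which compares "" only against character columns and so should satisfy the stricter type checks.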

Problem running phenotype generation

Hi Richard,
I've just tried running the phenotype generation script using the examples you provided in the manual and got the following error.

Could you help by pointing out where or how I can get zlib installed? I tried several ways (e.g. using install.packages("Rcompression", repos = "http://www.omegahat.net/R ") and the link http://www.gzip.org/zlib/ ), but none of them worked. Installing the package with "install.packages('zlib')" gave me the following error

Your help will be highly appreciated.
Many thanks
Hang