
r-cometsanalytics's People

Contributors

arsbiostatistica, ellatemprosa, ewymathe, kailingchen, mathelab, park-brian, wheelerb, wobenshain


r-cometsanalytics's Issues

COMETS 1.3. Minor tweaks to warning messages

For batch mode, model age.1, let's replace the existing warning message with "We removed one or more dummy variables that were redundant (i.e. perfectly correlated with another variable)."

Also, is it possible to specify which variable was removed?
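A hedged sketch (not the package's actual code) of how the warning could also name the redundant column(s): take the QR decomposition of the design matrix, keep only the pivoted columns up to its rank, and report the rest. The function name and the design argument are placeholders.

flag_redundant_dummies <- function(design) {
  # design: numeric model matrix of dummy/continuous covariates
  qr_fit  <- qr(design)
  keep    <- qr_fit$pivot[seq_len(qr_fit$rank)]
  dropped <- setdiff(colnames(design), colnames(design)[keep])
  if (length(dropped) > 0) {
    warning("We removed one or more dummy variables that were redundant ",
            "(i.e. perfectly correlated with another variable): ",
            paste(dropped, collapse = ", "))
  }
  design[, keep, drop = FALSE]
}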

COMETS 1.4. heatmap (showHClust) issue

When running a model that has several XXX, the heatmap function fails because there are duplicate rownames:

> excorrdata  <- COMETS::runCorr(exmodeldata,exmetabdata,"DPP")
NULL
NULL
[1] "running unadjusted"
> COMETS::showHClust(excorrdata)
Error: Duplicate identifiers for rows (532, 611), (1143, 1222)

This is a new error, and I'm not sure why it's cropping up now, since the vignette hasn't been changed in a while.
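A hedged workaround sketch rather than the package fix: the error arises when the long results are reshaped to wide with the same outcome/exposure pair appearing more than once, so dropping duplicate pairs before plotting lets the heatmap render. The column names assumed here (outcome, exposures) follow the getcorr output described elsewhere in this tracker.

excorrdata_dedup <- dplyr::distinct(excorrdata, outcome, exposures, .keep_all = TRUE)
COMETS::showHClust(excorrdata_dedup)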

COMETS 1.3. Handle metabolites where variance=0

It will sometimes happen that a metabolite has no variance, i.e. has the same value for every single participant. When this occurs, there should be no analysis/results for this metabolite, but analysis/results for other metabolites should carry forward as normal. Currently, however, the analysis crashes when it runs into any metabolite with variance=0.

We need a better method for handling metabolites where variance=0.
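A minimal sketch of one possible approach, assuming the metabolite abundances sit in a data frame named metab with one column per metabolite (not the package's internal layout): flag zero-variance metabolites up front, warn, and drop them so the remaining analyses run as normal.

zero_var <- vapply(metab, function(x) isTRUE(var(x, na.rm = TRUE) == 0), logical(1))
if (any(zero_var)) {
  warning("Skipping metabolites with zero variance: ",
          paste(names(metab)[zero_var], collapse = ", "))
  metab <- metab[, !zero_var, drop = FALSE]
}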

Create S3 class for getcorr output

We will define the object to have the following slots:

  1. outcome
  2. exposures
  3. adjusted
  4. corr
  5. corrmethod (spearman/pearson/mixed)
  6. pvalue
  7. n
  8. super_pathway
  9. biochemical
  10. harmflag
  11. hmdb
  12. mz
  13. rt
  14. uid_01
  15. multrow
  16. uidsource
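A hedged sketch of what an S3 constructor could look like; the class name "getcorr" and the choice to store corrmethod as an attribute (with the remaining slots as columns of the results data frame) are assumptions, not the implemented design.

new_getcorr <- function(corrdf, corrmethod = c("spearman", "pearson", "mixed")) {
  corrmethod <- match.arg(corrmethod)
  structure(corrdf,
            corrmethod = corrmethod,
            class = c("getcorr", class(corrdf)))
}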

COMETS 1.3. "Harmonization" file

For our rollout, we have proposed a two-step process for the cohorts:

  1. Prepare data file, test integrity, download "harmonization" file, and run one simple analysis (age.2). Send harmonization file and results file to IMS so that they can begin harmonization.

  2. IMS sends back a "Metabolites" tab that is identical to the original, except with an additional UID_01 column. With this new tab, the cohort then goes back to COMETS-Analytics and runs "All models". These models are now "pre-harmonized".

To accommodate this process change, I have two minor edits to the harmonization file, per discussion with Nathan Appel and David Ruggieri of IMS.

  1. The variable that is currently called "UID_01" should be renamed to make room for the IMS UID_01 variable. My suggested rename is "UID_01.comets_analytics", which reflects the fact that this UID_01 is based on the COMETS-Analytics algorithm. Making room for both columns will also give us data to track our algorithm's performance over time (% match between the algorithm and the IMS final UID).

  2. The harmonization file changes the case (lower case vs. upper case) of the metabid variable as compared with the original input. To ensure that IMS can fully replicate the original harmonization file, we should provide the metabid values in their original case. Remember that if the cohort is not initially proceeding with the "All models" analysis, then IMS only has the "Harmonization" file to work with and not the "Input file".
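An illustrative sketch of both edits, assuming the harmonization table is a data frame named harm and the original "Metabolites" tab is input_metab (both names are placeholders): rename our UID column, and restore metabid to the case supplied in the original input.

names(harm)[names(harm) == "UID_01"] <- "UID_01.comets_analytics"
orig_ids <- input_metab$metabid
harm$metabid <- orig_ids[match(tolower(harm$metabid), tolower(orig_ids))]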

COMETS 1.5. Handling of missing data

Several investigators have been trying to input missing data, which causes errors. All metabolites and subjectdata variables should be tested for this.

Possible fixes include:
a) An improved tutorial
b) Tests in the Data Input function
c) Tests in the "Read datafile", or "Integrity check", or "getModel" functions.

Because this check could add some processing time, we could make it optional, i.e. add a checkbox to test the variables, with the default being no test.
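A minimal sketch of such a check, assuming the subject data and metabolite abundances are data frames named subjdata and metab (placeholders): report every variable that contains missing values so the user can correct the input before modeling.

check_missing <- function(df, label) {
  n_missing <- colSums(is.na(df))
  bad <- n_missing[n_missing > 0]
  if (length(bad) > 0) {
    warning(label, " variables with missing values: ",
            paste0(names(bad), " (", bad, ")", collapse = ", "))
  }
  invisible(bad)
}
check_missing(subjdata, "Subject data")
check_missing(metab, "Metabolite")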

COMETS 1.3. Super-batch and Addition of metabolite meta-data table to zip file

The results files are outputting correctly from COMETS-Analytics. However, to link the results from one cohort to another's, we need to pull in their metabolite meta-data table in its entirety, most likely as a separate table in the zip file. In addition, if the results were auto-harmonized, we should also pull in the UID column and possibly one or two other columns.

Without meta-data or the UID, we cannot harmonize the metabolites on the back end. This is a high priority fix.
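A rough sketch (not the app's export code) of how the meta-data table could travel with the results: write it as its own CSV and include it when the zip archive is built. The object and file names are placeholders.

write.csv(metab_meta, "metabolite_metadata.csv", row.names = FALSE)
utils::zip("comets_results.zip",
           files = c("correlation_results.csv", "metabolite_metadata.csv"))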

Comets 1.3. Display in interactive mode needs modification

In the latest update, the display of results onscreen is using the wrong column of metabolite names in interactive mode. The display is correct, however, in batch mode. Screenshots of each are below.

Interactive mode:

[screenshot]

Batch mode (model age.1):

[screenshot]

COMETS 1.5 Rapid UID upload

After harmonization of study data, in our updated process, we will need to update COMETS with a new UID file to account for newly added metabolites. Is there a way to automate this so that, when the file changes, we can upload it to a specific location and have COMETS use it without any manual input processing?

Does the file we send need to be modified to make something like this work?

COMETS 1.3. Infinite loop

When running the following model, the app enters an infinite loop:

Exposure: Age
Outcome: All metabolites
Adjusted covariates: race_grp

I have noted that this model can work when running individual metabolites, or when running all models that do not adjust for race. This suggests that the problem arises when metabolites with only one or a few distinct values are combined with an adjustment variable in which some categories have only one or a few observations.

Thus, this could be a model singularity issue, like that described in issue #32.
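An illustrative guard, not the implemented fix: before fitting a model for a given metabolite, check whether the adjustment covariate still has at least two observed levels among the complete cases, and skip with a warning rather than loop. The objects dat and metab_name are placeholders.

keep <- complete.cases(dat[, c("age", metab_name, "race_grp")])
if (length(unique(dat$race_grp[keep])) < 2) {
  warning("race_grp has only one observed level for ", metab_name,
          "; skipping the adjusted model for this metabolite.")
}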

COMETS 1.3. New test dataset jams Amazon queue

I have been working with a new test dataset from our collaborators at the American Cancer Society. At least one of their models has been jamming the queue, for reasons unknown. I can confirm that the first two models are fine, and that the problem is not solely due to the "all metabolites*all metabolites" analysis.

More testing to be done once the queue is unjammed.

Scrambled women CPSII data.xlsx

COMETS 1.3

The filter function is not working correctly with the CPS scrambled file. The expected counts are:

table(exmetabdata$subjdata$prev_heart_dx)
0 1 2
454 84 18

but when you filter using the where statement:
[screenshot]

This should be 454,

and here it should be 18:
[screenshot]
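An illustrative check of the expected counts, applying the filter directly to the subject data with dplyr rather than through the app's where-clause handling:

nrow(dplyr::filter(exmetabdata$subjdata, prev_heart_dx == 0))  # expect 454
nrow(dplyr::filter(exmetabdata$subjdata, prev_heart_dx == 2))  # expect 18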

COMETS 1.5. Accommodate new field for data harmonization

For COMETS 1.4, I would like to focus on three things: 1) Harmonization; 2) Error handling; and 3) Queue management/troubleshooting. This issue applies to the first of these.

Currently, we are doing all the harmonization on the back end at IMS. For each cohort, they start with our attempt to auto-harmonize but then revise and edit substantially, until all entries are logically consistent. Nathan pointed out that, once this has been done for each study, the most sensible approach is to send our UID back to the cohort as a column to add to their data file, so that the file is permanently harmonized from then on. Ella, Ewy and I should meet with Nathan to discuss, but on a preliminary basis, I agree.

If we go this route, we will need to accommodate a new column for each data file in our harmonization algorithm. It may also change the (non-software) workflow for each cohort: for example, we have each study run the Integrity Check and one or two tables, which they send to IMS for pre-harmonization. Then, we feed back the harmonized metabolite UID, the cohort analyst adds it to their file and runs one or two tables again. Then, if IMS is able to harmonize these easily, the cohort runs the whole analysis.

Let's discuss once 1.3 is complete.

COMETS 1.3 Permit adjustment for categorical variables

Currently, categorical variables are not properly adjusted for: they are entered into the model as continuous variables. Models should distinguish between categorical and continuous variables using a new column that will be added to the Varmap tab.

This change will also require a change to the Sample file (to be logged separately) and to the "Create Input" utility (it needs to add this column to the input file that it creates).
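A hedged sketch of the modeling side of this change, assuming the Varmap tab has been read into a data frame named varmap and the new column is called VARTYPE with values "categorical"/"continuous" (the column name is an assumption): convert categorical covariates to factors before fitting, so that lm()/glm() expand them into dummy variables instead of treating them as continuous.

cat_vars <- varmap$VARREFERENCE[varmap$VARTYPE == "categorical"]
for (v in intersect(cat_vars, names(modeldata$gdta))) {
  modeldata$gdta[[v]] <- factor(modeldata$gdta[[v]])
}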

Warning: bad alloc

I believe this issue occurs when more processes are initiated than the server is currently configured to handle. That limit is currently 3, but it could be increased if the cost is justifiable on the basis of our usage-tracking data.

bad alloc

Comets 1.3 - Models adjusted for race (when analyses are stratified by BMI) return errors

In the test file that I had prepared, the N for non-white/European persons was quite small. In fact, only 1 individual had a race_grp=2. This seems to be causing all kinds of problems in the adjusted/stratified analyses.

To test, run in interactive mode:

Exposure: Age
Outcome: Any individual metabolite
Adjusted covariates: race_grp
Strata by: BMI_grp

Two of the three values returned will be NA. Possibly this reflects a degrees-of-freedom issue?

Input file is below.

cometsInput_March_2018.xlsx

Issues to deal with under R-COMETS 0.8004

[ ] rename all _ in variable names to .
[ ] add lm and lmer code
[ ] summary statistics for covariates in modeldata$gdta
[ ] concatenate really long names, or take only the first if there is more than one, in the display
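For the first item, an illustrative one-liner that replaces underscores with dots in the variable names of a data frame named df:

names(df) <- gsub("_", ".", names(df), fixed = TRUE)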

COMETS 1.4. Warning: CSV input file does not exist

This appears to occur primarily when the extension of the data input file is capitalized (".XLSX"), which results in the software not being able to find the CSV file at the analysis stage.

This is a bug that should be fixed in COMETS 1.4.

[screenshot]
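A hedged sketch of one way to make the extension check case-insensitive (not necessarily how the app reads files); input_path is a placeholder.

ext <- tolower(tools::file_ext(input_path))  # "XLSX" and "xlsx" both become "xlsx"
if (ext == "xlsx") {
  dat <- readxl::read_excel(input_path)
} else if (ext == "csv") {
  dat <- read.csv(input_path)
} else {
  stop("Unsupported file extension: ", ext)
}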

COMETS 1.3. Problems with some adjustment-strata combos

Certain combinations of adjustments and stratification can still cause problems with models. One of the simplest scenarios uses the data below, with the model as follows:

Exposure: age
Outcome: glycine (can also use All Metabolites)
Adjusted: bmi_grp, alc_grp
Strata: smk_grp

Initially, I thought this could be due to a code reversion, but that was a false lead.

I then thought it could reflect metabolites with high numbers of values below the limit of detection (i.e. little meaningful variance), but I tested against glycine (for which this issue does not apply) and still had the same problem.

I am thus forced to conclude that the issue reflects something about the joint distribution of the adjusted and strata variables that we are not quite fully handling.
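An illustrative diagnostic for that hypothesis: cross-tabulate the strata variable against each adjustment covariate and look for empty or near-empty cells, which would make the adjusted model singular within a stratum.

with(modeldata$gdta, table(smk_grp, bmi_grp, useNA = "ifany"))
with(modeldata$gdta, table(smk_grp, alc_grp, useNA = "ifany"))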

Ella and Ewy, the data are attached. Let me know if you have any insights. I hope to test again toward the end of today.

Scrambled CPSII data.xlsx

[screenshot]

COMETS 1.4. UID file integration

Some weird UIDs to be checked:

  • PC-source_fragment
  • the +NH4 suffix

C14_0_CE_+NH4
C16_0_CE_+NH4
C16_1_CE_+NH4
C18_0_CE_+NH4
C18_0_MAG_+NH4
C18_1_CE_+NH4
C18_2_CE_+NH4
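An illustrative way to pull these out for manual review, assuming the UIDs are in a character vector named uids:

suspect <- uids[grepl("\\+NH4$", uids) | grepl("source_fragment", uids, fixed = TRUE)]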

COMETS 1.3 Harmonization not consistently working

Sometimes the metabolites do not harmonize, even when HMDB IDs are present. To my understanding, Ella was looking into this issue. This also occurred with the R. Kelly VDAART file, which I can supply, if needed.

COMETS 1.3: "Where" functionality not working

The "where" functionality no longer appears to be working. This issue needs to be fixed before I can complete testing on categorical adjustment, since the model we have been testing includes a "where" statement.

COMETS 1.3. Update sample and template file

A number of changes need to be made to the test/sample file, including:

  1. Adding a column for continuous/categorical variables
  2. Adding a new value to the models tab for the age_grp variable (age<20 years), so that analyses can be stratified for the youngest participants
  3. Adding BMI as a continuous variable
  4. Adding models for the BMI analysis to the models tab (per R. Kelly), and cleanly distinguishing these models so that cohorts can delete them if they elect not to participate.

COMETS 1.4: Data input and "COHORTVARIABLE" column

In tests done to date, all investigators have elected to use the variable names that we use. They are not using the variable matching in any meaningful way.

Thus, I think we could perhaps encourage users to simply code "COHORTVARIABLE" the same as "VARREFERENCE" and, if using the "Create input" utility, we could assume the VARREFERENCE names as the default. This could help streamline the data input process and our writing of the tutorial. We should discuss as a group.

COMETS 1.3. Minor warning issue

I like the addition of the warnings--it will make testing easier.

Now that they are visible, there may be some tweaks needed. One such tweak is that, when running an analysis stratified by BMI, I received the following warning: "Warning: one of your models specifies bmi_grp as a stratification but that variable only has one possible value. Model will run without bmi_grp stratified"

There is a stratum of BMI that had very few observations, but bmi_grp itself definitely has more than one possible value, as evidenced in the screenshot below. Any suggestions for how to modify the wording?

[screenshot]

COMETS 1.2

  1. Should we create a varlabel (a prettier string to display on the heatmap), or just strip everything after "("?
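If we go the stripping route, an illustrative one-liner that removes everything from the first "(" onward and trims whitespace (varname is a placeholder):

varlabel <- trimws(sub("\\(.*$", "", varname))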

examples and vignettes

Currently, the code in the vignettes and the examples does not match. To minimize confusion, it may be best to sync them up at some point...
Do you agree?

COMETS 1.3. Add "model" table to zip file

My IMS collaborators have requested that we add the model tab to our zip file so that they can double-check that the correct models were run and have it documented. I think this is a great idea. We may want to add the varmap as well.

COMETS 1.3. SuperBatch::Table 1 functionality

COMETS manuscripts will need descriptive data from each of the participating studies that we can show in our Table 1. The descriptive data should be output as a zip file table. For categorical variables, the percent in each category will likely suffice. For continuous variables, I suggest outputting the mean, the standard deviation, and the values at the 0th (minimum), 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 100th (maximum) percentiles.
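A minimal sketch of the continuous-variable summary, assuming x is a numeric covariate vector:

desc_continuous <- function(x) {
  qs <- quantile(x, probs = c(0, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 1),
                 na.rm = TRUE)
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), qs)
}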

getModelData() error with modbatch

I've quick-fixed the vignette code because the getModelData() function was not working. I figured out it was because the original call was as follows:
exmodeldata<- getModelData(exmetabdata,colvars="age",modbatch="1.1 Unadjusted")

However, if modbatch is not specified, it errors because of this line (#62):
mods<-dplyr::filter(as.data.frame(readData[["mods"]]),model==modbatch)

I've fixed the call to the following, which now works:
exmodeldata <- getModelData(exmetabdata,colvars="age",modbatch="1.1 Unadjusted")

However, does this make sense? If it's in batch mode, then all models should be read in, right?
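One possible guard inside getModelData(), offered as a sketch rather than the adopted fix: only filter on model when a modbatch value is actually supplied, so the unspecified case reads all models.

if (!missing(modbatch) && !is.null(modbatch)) {
  mods <- dplyr::filter(as.data.frame(readData[["mods"]]), model == modbatch)
} else {
  mods <- as.data.frame(readData[["mods"]])
}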

COMETS 1.3. N does not update when "where" statement used

If the "where" statement is used in either "interactive" mode or in "batch" mode, the N listed does not update. This will create downstream problems when calculating standard errors for meta-analysis and so is an important problem.
