dracarys's Issues

cttso: Support for MergedSmallVariants VCF

Need to add a parser for MergedSmallVariants.vcf.gz files, which contain PASSed and non-PASSed variants with useful FORMAT fields. Consider using bcftools, with bedr as a fallback.

New Workflow: BCL Convert

See e.g. primary_data/230406_A00130_0254_AH5L3WDSX7/202304077f5d1d99/ for BCL Convert and InterOp results.

Directly write to db

Consider the option to directly write the data to an AWS database from dracarys itself (in addition to the TSV/Parquet export).
This would save code duplication and repeated data parsing/class guessing.
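For illustration, a minimal sketch of a direct write using {DBI}/{RPostgres}; the connection details, env var names, and table name are placeholders, not an agreed design:

library(DBI)
con <- dbConnect(
  RPostgres::Postgres(),
  host = Sys.getenv("DRACARYS_DB_HOST"),      # placeholder env vars
  dbname = "dracarys",
  user = Sys.getenv("DRACARYS_DB_USER"),
  password = Sys.getenv("DRACARYS_DB_PASS")
)
# tidy_tbl: a parsed tibble that would otherwise be exported to TSV/Parquet
dbWriteTable(con, "multiqc_tidy", tidy_tbl, append = TRUE)
dbDisconnect(con)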

Reading stack:

Python interface

As a conda app, dracarys is flexible enough to use any available conda pkgs like bcftools, bedtools, cyvcf2 etc., along with R pkgs from CRAN/Bioconductor and Python pkgs from PyPI.
We can explore enhancing its current Python interface (setup.py/setup.cfg/dracarys/ are already available) and use reticulate (https://rstudio.github.io/reticulate/) to connect with R.
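For instance, a rough sketch of pulling a conda-provided Python pkg into R via reticulate (the env name and input file are illustrative):

library(reticulate)
use_condaenv("dracarys", required = TRUE)  # assumes a conda env named "dracarys"
cyvcf2 <- import("cyvcf2")
vcf <- cyvcf2$VCF("sample.vcf.gz")         # iterate records in Python, post-process in R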

Export metadata to output directory

Upon writing the tidy files, also write an e.g. dr_metadata.json file with information about the dracarys version and the input GDS directory path. Maybe even write the full command used.
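A minimal sketch with {jsonlite}; the field names are illustrative, not a fixed schema:

meta <- list(
  dracarys_version = as.character(utils::packageVersion("dracarys")),
  gds_indir = in_dir,                          # the input GDS directory path
  cmd = paste(commandArgs(), collapse = " "),  # full command used
  date_processed = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z")
)
jsonlite::write_json(meta, file.path(out_dir, "dr_metadata.json"), auto_unbox = TRUE, pretty = TRUE)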

Duplicate File Types Detected error

Not entirely sure this is actually a bug. When running v0.8.0 via a Lambda with ~2GB of memory allocated, we sometimes end up with the following error message:

2023-04-23T21:58:50.696000+00:00 2023/04/23/[$LATEST]eefcad7e070f46039ac9deff72cb08ae Error: Aborting - duplicated file types detected in /tmp/dracarys/dracarys_gds_sync
2023-04-23T21:58:50.696000+00:00 2023/04/23/[$LATEST]eefcad7e070f46039ac9deff72cb08ae Execution halted
2023-04-23T21:58:50.696000+00:00 2023/04/23/[$LATEST]eefcad7e070f46039ac9deff72cb08ae Warning message:
2023-04-23T21:58:50.696000+00:00 2023/04/23/[$LATEST]eefcad7e070f46039ac9deff72cb08ae system call failed: Cannot allocate memory

The OOM error at the end might be a red herring; the uniqueness check that triggers the abort is in

dracarys/R/tidy.R

Lines 95 to 99 in bfba3ca

# TODO:
assertthat::assert_that(
!any(duplicated(d[["type"]])),
msg = glue("Aborting - duplicated file types detected in {in_dir}")
)
Here's an example input folder, gds://production/analysis_data/SBJ02892/tso_ctdna_tumor_only/20221104881bc21a/L2201559/Results/:

(screenshot: listing of the Results/ folder contents)

The input folder does not contain results from multiple re-runs, which was my initial assumption. Any ideas on what might be going on here are welcome! Once that is resolved we can run this across the collection again and check if memory is still an issue for some of the larger JSONs; that might be solvable by limiting resource sharing, bumping up the memory, or via your proposed #56.
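In the meantime, a quick diagnostic sketch for pinpointing the offending types (d being the tibble checked in tidy.R, with its "type" column):

d |>
  dplyr::count(.data$type, sort = TRUE) |>
  dplyr::filter(.data$n > 1)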

MultiQC: include 'Somatic And Germline' as workflow config title

Due to the recent update where the germline workflow runs alongside the somatic one instead of tagging along with umccrise (umccr/infrastructure#296), MultiQC is now run on the outputs of both the germline and somatic workflows (Slack thread).

Need to add a 'somatic and germline' option in

dragen_workflows <- c("alignment", "transcriptome", "somatic", "ctdna")
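A minimal sketch of the change, assuming the guessed workflow string is just the lowercased config title (the exact value is a guess):

dragen_workflows <- c("alignment", "transcriptome", "somatic", "ctdna", "somatic and germline")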

Error that gets output:

rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `dplyr::mutate()`:
ℹ In argument: `obj_parsed = list(.data$obj$read())`.
ℹ In row 1.
Caused by error:
! `%in%`(x = w, table = dragen_workflows) is not TRUE
---
Backtrace:
     ▆
  1. ├─dracarys::umccr_tidy(...)
  2. │ ├─dplyr::select(...) at dracarys/R/tidy.R:83:4
  3. │ ├─dplyr::mutate(...)
  4. │ └─dplyr:::mutate.data.frame(...)
  5. │   └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  8. │       └─mask$eval_all_mutate(quo)
  9. │         └─dplyr (local) eval()
 10. └─.data$obj$read()
 11.   └─dracarys::multiqc_tidy_json(x) at dracarys/R/multiqc.R:26:6
 12.     └─dracarys:::.multiqc_guess_workflow(p) at dracarys/R/multiqc.R:58:2
 13.       └─assertthat::assert_that(w %in% dragen_workflows) at dracarys/R/multiqc.R:150:6
 14.         └─base::stop(assertError(attr(res, "msg")))

Documentation: UMCCR Workflow Outputs

Generate a table with the UMCCR workflow outputs of interest for ingestion/tidying

  • bclconvert
  • alignment
  • somatic
  • germline
  • transcriptome
  • umccrise
  • cttso
  • oncoanalyser
  • ....

MultiQC: generate mini outputs

To simplify work for the engineering team, we can generate a subset of columns of interest.
Focus on the ctDNA TSO500 MultiQC output to get the ball rolling.
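For illustration, a mini output could be a straight column subset; the names below are placeholders, not the real MultiQC schema:

mini <- multiqc_tbl |>
  dplyr::select("umccr_id", dplyr::contains("coverage"), dplyr::contains("tmb"))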

CLI: unify subcommands

Currently the CLI is split over the tso and multiqc subcommands, even though the underlying functionality is the same (with input being a GDS or local directory, and output being a local directory).
We can unify these into a single dracarys tidy command.
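A possible unified invocation, assuming the existing flags carry over (illustrative only):

dracarys.R tidy -i gds://path/to/indir/ -o /path/to/outdir/ --format parquet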

CI/DVC error: failed to pull data from the cloud

Suddenly getting these for both gpgr and dracarys:

WARNING: No file hash info found for '/home/runner/work/dracarys/dracarys/inst/extdata/tso'. It won't be created.
WARNING: No file hash info found for '/home/runner/work/dracarys/dracarys/inst/extdata/wgs'. It won't be created.
2 files failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/home/runner/work/dracarys/dracarys/inst/extdata/tso
/home/runner/work/dracarys/dracarys/inst/extdata/wgs
Is your cache up to date?
<https://error.dvc.org/missing-files>
Error: Process completed with exit code 1.

Might have something to do with the DVC version, or the DVC GDrive version, or with conda, or with UMCCR GDrive authentication. Or with something completely unrelated.
Which narrows it down quite a bit.

Databricks: rlang/RJSONIO installation error

Setting up dracarys on a Databricks cluster: it installs fine, but {rlang} and {RJSONIO} need to be locked in beforehand:

> dracarys::TsoTmbTraceTsvFile$new(url1)

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace ‘rlang’ 1.0.6 is already loaded, but >= 1.1.0 is required
# before installing dracarys
> sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] SparkR_3.4.0   compiler_4.2.2 Rserve_1.8-12 

Solution:

install.packages("RJSONIO")
devtools::install_github("umccr/dracarys", upgrade_dependencies = FALSE, dependencies = FALSE)

Will experiment a bit with the custom Docker setup at some point.

AWS Lambda

Create an AWS Lambda that receives an event with an input directory and output prefix. Explore the option of specifying the name of the workflow that generated the input, in case we want to be more specific and have more control over the output.

-- Edit: also see Andrei's work in https://github.com/umccr/dracarys-to-s3-cdk
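A bare-bones handler sketch, assuming the {lambdr} R runtime; the event fields and the umccr_tidy() arguments are illustrative, not the real schema:

# runtime.R
handler <- function(gds_indir, out_prefix, workflow = NULL) {
  # hypothetical argument names - match to the actual umccr_tidy() signature
  dracarys::umccr_tidy(in_dir = gds_indir, out_dir = out_prefix)
}
lambdr::start_lambda()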

Specify inputs/outputs via TSV/JSON

Instead of running it one input at a time, allow multiple inputs (along with their outputs + prefixes) to be specified in a TSV/JSON file, e.g.:

dracarys.R tidy \
  --multi_input samples.tsv \
  --token <TOKEN> \
  --format {tsv,parquet,both}
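For illustration, samples.tsv might look like the following (tab-separated; column names are assumptions, not a fixed schema):

in_dir                                out_dir                  prefix
gds://production/analysis_data/runA/  s3://bucket/tidy/runA/   runA
/mnt/sync/runB/                       /mnt/tidy/runB/          runB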

Time Metrics Ingestion: Labels are params from a Presigned URL

Describe the bug
A clear and concise description of what the bug is. Include:

  • version of {dracarys}. (main branch as of 04-07-2023)
  • the complete command you used to invoke {dracarys}
gds_files_list <- dracarys::gds_files_list_filter_relevant(
  gds_output_directory, 
  token=Sys.getenv("ICA_ACCESS_TOKEN"), 
  include="PresignedUrl"
)

data_objects <- gds_files_list %>%
  # We only want the metrics files
  dplyr::filter(
    stringr::str_detect(.data[["type"]], "MultiqcFile|MetricsFile$")
  ) %>%
  # Work by row
  dplyr::rowwise() %>%
  # Use the func_selector to select the ingestion type
  # based on the input data type
  dplyr::mutate(
    gen = list(dracarys:::dr_func_eval(.data$type))
  ) %>%
  # Call the dracarys function selected by mutate above,
  # using the presigned URL as an input to the function
  dplyr::mutate(
    obj = list(.data$gen$new(.data$presigned_url))
  ) %>%
  # Now read the dracarys digested object
  dplyr::mutate(
    objp = list(.data$obj$read())
  ) %>%
  # Add the portal run id to the objp tbl
  dplyr::mutate(
    objp = list(
      .data$objp %>%
        dplyr::mutate(
          portal_run_id = portal_run_id
        )
    )
  ) %>%
  # Get output table name based on workflow type and data type
  dplyr::mutate(
    output_table = paste0(
      workflow_type_name,
      "__",
      stringr::str_replace(.data[["type"]], "MetricsFile$", "_metrics")
    )
  ) %>%
  # De-rowwise the tibble
  dplyr::ungroup() %>%
  # Select required outputs
  dplyr::select(
    "type",
    "output_table",
    "objp"
  )

# Write to table
data_objects %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    wrote_output = write_tbl_to_uc_table(sc, .data$objp, .data$output_table)
  ) %>%
  dplyr::ungroup()
  • the full error message

Screenshots
If applicable, add screenshots to help explain your problem.

(screenshot of the error)

Additional context
Add any other context about the problem here.

time_metrics.csv.csv

Decouple copying from tidying

e.g. dracarys.R copy -i gds://path/to/workflow_dir/ -o /mnt/dracarys_sync/

-- Note that if you already have synced data, this can work:

dracarys.R tidy -i /mnt/dracarys_sync/ -o /mnt/tidied_files/

Better ICA token validation logging

Handle cases where the token:

  • is valid but for a different context
  • is not a JWT/missing

Already handled in some part by {jose} at https://github.com/umccr/dracarys/blob/4d8738/R/ica.R#L134, but the logging and edge-case handling could be improved, methinks (see the sketch after the traces below).

  • e.g. for a valid token but for a different context:
backtrace
Error in `dplyr::mutate()`:
ℹ In argument: `size = fs::as_fs_bytes(.data$size)`.
Caused by error in `.data$size`:
! Column `size` not found in `.data`.
Backtrace:
     ▆
  1. ├─global tidy_parse_args(args)
  2. │ ├─base::do.call(umccr_tidy, tidy_args)
  3. │ └─dracarys (local) `<fn>`(...)
  4. │   └─dracarys::dr_gds_download(...)
  5. │     ├─dplyr::select(...)
  6. │     ├─dplyr::mutate(...)
  7. │     └─dracarys::gds_files_list(gdsdir = gdsdir, token = token)
  8. │       ├─dplyr::select(...)
  9. │       ├─dplyr::mutate(...)
 10. │       └─dplyr:::mutate.data.frame(...)
 11. │         └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 12. │           ├─base::withCallingHandlers(...)
 13. │           └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 14. │             └─mask$eval_all_mutate(quo)
 15. │               └─dplyr (local) eval()
 16. ├─fs::as_fs_bytes(.data$size)
 17. ├─size
 18. ├─rlang:::`$.rlang_data_pronoun`(.data, size)
 19. │ └─rlang:::data_pronoun_get(...)
 20. └─rlang:::abort_data_pronoun(x, call = y)
 21.   └─rlang::abort(msg, "rlang_error_data_pronoun_not_found", call = call)
Execution halted
  • e.g. for an invalid JWT
backtrace
Error in jose::jwt_split(token) : length(input) %in% c(2, 3) is not TRUE
Calls: tidy_parse_args -> ica_token_validate -> <Anonymous> -> stopifnot
Execution halted
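For instance, a hedged sketch of friendlier pre-flight checks; it assumes only standard JWT payload fields like exp (anything ICA-specific would need the real payload schema):

check_ica_token <- function(token) {
  if (is.null(token) || !nzchar(token)) {
    stop("No ICA token found - did you export ICA_ACCESS_TOKEN?", call. = FALSE)
  }
  parts <- strsplit(token, ".", fixed = TRUE)[[1]]
  if (!length(parts) %in% c(2, 3)) {
    stop("Token does not look like a JWT (expected 2-3 dot-separated parts).", call. = FALSE)
  }
  payload <- jsonlite::fromJSON(rawToChar(jose::base64url_decode(parts[2])))
  if (!is.null(payload$exp) && payload$exp < as.numeric(Sys.time())) {
    stop("Token has expired - please generate a new one.", call. = FALSE)
  }
  invisible(payload)
}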

Support umccrise outputs

Need to handle the TSV copying from the umccrise cancer_report_tables/ directory.
Think the data is tidy enough in there.
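A rough reading sketch, with the path layout assumed from the umccrise output structure and <sbj> as the subject ID placeholder:

tsvs <- fs::dir_ls("umccrised/<sbj>/cancer_report_tables", glob = "*.tsv*")
tbls <- purrr::map(tsvs, readr::read_tsv)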

Handle presigned URLs of gzipped JSONs

{RJSONIO} cannot handle URLs to gzipped JSONs:

> url_jsongz
[1] "https://stratus-gds-aps2.s3.ap-southeast-2.amazonaws.com/11143240-789c-4320-b4e2-08d89d4636a9/analysis_data/SBJ00006/tso_ctdna_tumor_only/20230506af3bd7d9/L2300565/Results/NTC_ctTSO230501_L2300565/NTC_ctTSO230501_L2300565.AlignCollapseFusionCaller_metrics.json.gz?X-Amz-Expires=604800&response-content-disposition=filename%3D%22NTC_ctTSO230501_L2300565.AlignCollapseFusionCaller_metrics.json.gz%22&response-content-type=binary%2Foctet-stream[...]

> RJSONIO::fromJSON(url_jsongz)
Error in nchar(content) : invalid multibyte string, element 1

# how about something non-ICA related
> RJSONIO::fromJSON("https://wiki.mozilla.org/images/f/ff/Example.json.gz")
Error in nchar(content) : invalid multibyte string, element 1

The solution (based on SO) is to create a gz connection to the URL, read it in as text via base::readLines or readr::read_lines, and finally feed that string to RJSONIO::fromJSON.

> "https://wiki.mozilla.org/images/f/ff/Example.json.gz" |>
+     base::url() |> 
+     base::gzcon() |> 
+     readr::read_lines() |>
+     RJSONIO::fromJSON() |> str()
List of 26                                                                                                                                                                            
 $ InstallTime          : chr "1295768962"
 $ Comments             : chr "Will test without extension."
 $ Theme                : chr "classic/1.0"
 $ Version              : chr "4.0b10pre"
 $ id                   : chr "ec8030f7-c20a-464f-9b0e-13a3a9e97384"
 $ Vendor               : chr "Mozilla"
 $ EMCheckCompatibility : chr "false"
 $ Throttleable         : chr "1"
[...]

Missing ggrepel dependency

Used for TsoTargetRegionCoverageFile read depth vs. cov pct ggplot:

Error in `dplyr::mutate()`:
ℹ In argument: `plot = ifelse(...)`.
ℹ In row 2.
Caused by error in `loadNamespace()`:
! there is no package called ‘ggrepel’

CRAN dependency check passed because it doesn't scan R6 classes properly.

Set up bare bones db

Assuming the dracarys AWS Lambda (issue #28) is able to drop data in an S3 bucket based on a (e.g. umccrise) workflow completion event, what are the next steps?

  • What will the db look like?
  • How are the S3 objects imported into the db?
  • What do the queries look like?
  • How does the sample metadata get dumped into the db?

There is already a starting point for this setup - see umccr/data-portal-apis#551 and Slack discussion.

There is also a 576-row x 18-col TSV/Parquet file that can be used as an 'end product' from dracarys in case we want to focus on this use case first (see Slack).

TSO: parse MergedSmallVariants.vcf.gz

MergedSmallVariants.vcf.gz contains some useful info in the FORMAT column that does not get captured elsewhere:

# A tibble: 13 × 4
   ID    Number Type    Description                                                                               
   <chr> <chr>  <chr>   <chr>                                                                                     
 1 GT    1      String  Genotype                                                                                  
 2 GQ    1      Integer Genotype Quality                                                                          
 3 AD    .      Integer Allele Depth                                                                              
 4 DP    1      Integer Total Depth Used For Variant Calling                                                      
 5 VF    .      Float   Variant Frequency                                                                         
 6 NL    1      Integer Applied BaseCall Noise Level                                                              
 7 SB    1      Float   StrandBias Score                                                                          
 8 NC    1      Float   Fraction of bases which were uncalled or with basecall quality below the minimum threshold
 9 US    .      Integer Supporting read type counts                                                               
10 AQ    1      Float   Variant artifact adjusted quality score                                                   
11 LQ    1      Float   Likelihood ratio quality score                                                            
12 LQUS  6      Float   Likelihood ratio quality score by supporting read types                                   
13 BFQ   1      Float   Variant support Bias Filter Quality score

Note that the VCF contains PASSed and non-PASSed variants. Consider using bcftools, with bedr as a pure-R fallback option.
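For illustration, a hedged extraction sketch that shells out to bcftools (assumed to be on PATH); the chosen FORMAT fields are a subset of the table above:

vcf <- "MergedSmallVariants.vcf.gz"
fmt <- "%CHROM\t%POS\t%REF\t%ALT\t%FILTER[\t%GT\t%DP\t%VF]\n"
out <- system2("bcftools", c("query", "-f", shQuote(fmt), vcf), stdout = TRUE)
d <- readr::read_tsv(
  I(paste(out, collapse = "\n")),
  col_names = c("chrom", "pos", "ref", "alt", "filter", "gt", "dp", "vf")
)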

Switch `{jose}` to Imports instead of Suggests

{jose} is used to validate the ICA JWT token. I had originally added it under Suggests since I didn't think it'd get used that much, but it should now move to Imports.

Oviraptor parsing

Grab oviraptor results from the umccrise work directory:

  • work/<sbj>/oncoviruses/present_viruses.txt
  • work/<sbj>/oncoviruses/oncoviral_breakpoints.tsv

Related somewhat to #73.
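A rough reading sketch (with <sbj> as the subject ID placeholder):

viruses <- readr::read_lines("work/<sbj>/oncoviruses/present_viruses.txt")
breakpoints <- readr::read_tsv("work/<sbj>/oncoviruses/oncoviral_breakpoints.tsv")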

MultiQC: `dplyr::bind_rows` can't combine character with double

Edge case: need to handle binding missing values with numeric columns, as in the multiqc parsing code:

# replace the "NA" strings with NA, else we get a column class error

Related to the germline workflow update I suspect - see #49.
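A minimal sketch of the fix idea: coerce literal "NA" strings to real NAs in character columns before binding, then re-guess the column types:

library(dplyr)
d <- d |>
  mutate(across(where(is.character), \(x) na_if(x, "NA"))) |>
  readr::type_convert()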

rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `dplyr::mutate()`:
ℹ In argument: `obj_parsed = list(.data$obj$read())`.
ℹ In row 1.
Caused by error in `dplyr::bind_rows()`:
! Can't combine `PRJ230202$(Chr X SNPs)/(chr Y SNPs) ratio over genome` <character> and `PRJ230203$(Chr X SNPs)/(chr Y SNPs) ratio over genome` <double>.
---
Backtrace:
     ▆
  1. ├─dracarys::umccr_tidy(...)
  2. │ ├─dplyr::select(...) at dracarys/R/tidy.R:83:4
  3. │ ├─dplyr::mutate(...)
  4. │ └─dplyr:::mutate.data.frame(...)
  5. │   └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  8. │       └─mask$eval_all_mutate(quo)
  9. │         └─dplyr (local) eval()
 10. └─.data$obj$read()
 11.   └─dracarys::multiqc_tidy_json(x) at dracarys/R/multiqc.R:26:6
 12.     ├─dplyr::select(...) at dracarys/R/multiqc.R:66:2
 13.     ├─dplyr::mutate(...)
 14.     └─dplyr::bind_rows(d, .id = "umccr_id")

Support additional DRAGEN outputs

Some of these are likely contained somewhere else too:

  • trimmer_metrics.csv
  • fastqc_metrics.csv
  • insert-stats.tab
  • roh.bed
  • roh_metrics.csv
  • wgs_overall_mean_cov.csv
  • sv_metrics.csv
  • wgs_hist.csv

Test data for development

Currently dracarys uses test data via DVC, which gets downloaded into the built conda pkg. It would be good to also have a folder on gds://development, potentially synced.

Slack Message

cttso: data normalisation

Think most of the tidy tables are fine as-is; just this one can probably be broken down further:

  • AlignCollapseFusionCaller_metrics.json.gz:
    • hist.tsv.gz split by:
      • num_supporting_fragments
      • unique_UMIs_per_fragment_position
    • main.tsv.gz split by:
      • CoverageSummary
      • MappingAligningPerRg
      • MappingAligningSummary
      • RunTime
      • SvSummary
      • TrimmerStatistics
      • UmiStatistics
