dracarys's Issues

cttso: Support for MergedSmallVariants VCF

Need to add a parser for MergedSmallVariants.vcf.gz files, which contain PASSed and non-PASSed variants with useful FORMAT fields. Consider using bcftools, with bedr as a fallback.

New Workflow: BCL Convert

See e.g. primary_data/230406_A00130_0254_AH5L3WDSX7/202304077f5d1d99/ for BCL Convert and InterOp results.

Directly write to db

Consider the option to directly write the data to an AWS database from dracarys itself (in addition to the TSV/Parquet export).
This would save code duplication and repeated data parsing/class guessing.
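For illustration, a minimal sketch of a direct write using {DBI}/{RPostgres}; the connection details, env var names, and table name are placeholders, not an agreed design:

library(DBI)
con <- dbConnect(
  RPostgres::Postgres(),
  host = Sys.getenv("DRACARYS_DB_HOST"),      # placeholder env vars
  dbname = "dracarys",
  user = Sys.getenv("DRACARYS_DB_USER"),
  password = Sys.getenv("DRACARYS_DB_PASS")
)
# tidy_tbl: a parsed tibble that would otherwise be exported to TSV/Parquet
dbWriteTable(con, "multiqc_tidy", tidy_tbl, append = TRUE)
dbDisconnect(con)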

Reading stack:

Python interface

As a conda app, dracarys is flexible enough to use any available conda pkgs like bcftools, bedtools, cyvcf2 etc., along with R pkgs from CRAN/Bioconductor and Python pkgs from PyPI.
We can explore enhancing its current Python interface (setup.py/setup.cfg/dracarys/ are already available) and use reticulate (https://rstudio.github.io/reticulate/) to connect with R.
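For instance, a rough sketch of pulling a conda-provided Python pkg into R via reticulate (the env name and input file are illustrative):

library(reticulate)
use_condaenv("dracarys", required = TRUE)  # assumes a conda env named "dracarys"
cyvcf2 <- import("cyvcf2")
vcf <- cyvcf2$VCF("sample.vcf.gz")         # iterate records in Python, post-process in R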

Export metadata to output directory

Upon writing the tidy files, also write an e.g. dr_metadata.json file with information about the dracarys version and the input GDS directory path. Maybe even write the full command used.
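A minimal sketch with {jsonlite}; the field names are illustrative, not a fixed schema:

meta <- list(
  dracarys_version = as.character(utils::packageVersion("dracarys")),
  gds_indir = in_dir,                          # the input GDS directory path
  cmd = paste(commandArgs(), collapse = " "),  # full command used
  date_processed = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z")
)
jsonlite::write_json(meta, file.path(out_dir, "dr_metadata.json"), auto_unbox = TRUE, pretty = TRUE)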

Duplicate File Types Detected error

Not entirely sure this is actually a bug. When running v0.8.0 via a Lambda with ~2GB of memory allocated, we sometimes end up with the following error message:

2023-04-23T21:58:50.696000+00:00 2023/04/23/[$LATEST]eefcad7e070f46039ac9deff72cb08ae Error: Aborting - duplicated file types detected in /tmp/dracarys/dracarys_gds_sync
2023-04-23T21:58:50.696000+00:00 2023/04/23/[$LATEST]eefcad7e070f46039ac9deff72cb08ae Execution halted
2023-04-23T21:58:50.696000+00:00 2023/04/23/[$LATEST]eefcad7e070f46039ac9deff72cb08ae Warning message:
2023-04-23T21:58:50.696000+00:00 2023/04/23/[$LATEST]eefcad7e070f46039ac9deff72cb08ae system call failed: Cannot allocate memory

The OOM error at the end might be a red herring; the uniqueness check that triggers the abort is in

dracarys/R/tidy.R

Lines 95 to 99 in bfba3ca

# TODO:
assertthat::assert_that(
!any(duplicated(d[["type"]])),
msg = glue("Aborting - duplicated file types detected in {in_dir}")
)
Here's an example input folder, gds://production/analysis_data/SBJ02892/tso_ctdna_tumor_only/20221104881bc21a/L2201559/Results/:

(screenshot: listing of the Results/ folder contents)

The input folder does not contain results from multiple re-runs, which was my initial assumption. Any ideas on what might be going on here are welcome! Once that is resolved we can run this across the collection again and check if memory is still an issue for some of the larger JSONs; that might be solvable by limiting resource sharing, bumping up the memory, or via your proposed #56.
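In the meantime, a quick diagnostic sketch for pinpointing the offending types (d being the tibble checked in tidy.R, with its "type" column):

d |>
  dplyr::count(.data$type, sort = TRUE) |>
  dplyr::filter(.data$n > 1)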

MultiQC: include 'Somatic And Germline' as workflow config title

Due to the recent update where the germline workflow runs alongside the somatic one instead of tagging along with umccrise (umccr/infrastructure#296), MultiQC is now run on the outputs of both the germline and somatic workflows (Slack thread).

Need to add a 'somatic and germline' option in

dragen_workflows <- c("alignment", "transcriptome", "somatic", "ctdna")
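A minimal sketch of the change, assuming the guessed workflow string is just the lowercased config title (the exact value is a guess):

dragen_workflows <- c("alignment", "transcriptome", "somatic", "ctdna", "somatic and germline")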

Error that gets output:

rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `dplyr::mutate()`:
ℹ In argument: `obj_parsed = list(.data$obj$read())`.
ℹ In row 1.
Caused by error:
! `%in%`(x = w, table = dragen_workflows) is not TRUE
---
Backtrace:
     ▆
  1. ├─dracarys::umccr_tidy(...)
  2. │ ├─dplyr::select(...) at dracarys/R/tidy.R:83:4
  3. │ ├─dplyr::mutate(...)
  4. │ └─dplyr:::mutate.data.frame(...)
  5. │   └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  8. │       └─mask$eval_all_mutate(quo)
  9. │         └─dplyr (local) eval()
 10. └─.data$obj$read()
 11.   └─dracarys::multiqc_tidy_json(x) at dracarys/R/multiqc.R:26:6
 12.     └─dracarys:::.multiqc_guess_workflow(p) at dracarys/R/multiqc.R:58:2
 13.       └─assertthat::assert_that(w %in% dragen_workflows) at dracarys/R/multiqc.R:150:6
 14.         └─base::stop(assertError(attr(res, "msg")))

Documentation: UMCCR Workflow Outputs

Generate a table with the UMCCR workflow outputs of interest for ingestion/tidying

  • bclconvert
  • alignment
  • somatic
  • germline
  • transcriptome
  • umccrise
  • cttso
  • oncoanalyser
  • ....

MultiQC: generate mini outputs

To simplify work for the engineering team, we can generate a subset of columns of interest.
Focus on the ctDNA TSO500 MultiQC output to get the ball rolling.
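For illustration, a mini output could be a straight column subset; the names below are placeholders, not the real MultiQC schema:

mini <- multiqc_tbl |>
  dplyr::select("umccr_id", dplyr::contains("coverage"), dplyr::contains("tmb"))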

CLI: unify subcommands

Currently the CLI is split over the tso and multiqc subcommands, even though the underlying functionality is the same (with input being a GDS or local directory, and output being a local directory).
We can unify these into a single dracarys tidy command.
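A possible unified invocation, assuming the existing flags carry over (illustrative only):

dracarys.R tidy -i gds://path/to/indir/ -o /path/to/outdir/ --format parquet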

CI/DVC error: failed to pull data from the cloud

Suddenly getting these for both gpgr and dracarys:

WARNING: No file hash info found for '/home/runner/work/dracarys/dracarys/inst/extdata/tso'. It won't be created.
WARNING: No file hash info found for '/home/runner/work/dracarys/dracarys/inst/extdata/wgs'. It won't be created.
2 files failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/home/runner/work/dracarys/dracarys/inst/extdata/tso
/home/runner/work/dracarys/dracarys/inst/extdata/wgs
Is your cache up to date?
<https://error.dvc.org/missing-files>
Error: Process completed with exit code 1.

Might have something to do with the DVC version, or the DVC GDrive version, or with conda, or with UMCCR GDrive authentication. Or with something completely unrelated.
Which narrows it down quite a bit.

Databricks: rlang/RJSONIO installation error

Setting up dracarys on a Databricks cluster: it installs fine, but {rlang} and {RJSONIO} need to be locked in beforehand:

> dracarys::TsoTmbTraceTsvFile$new(url1)

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace ‘rlang’ 1.0.6 is already loaded, but >= 1.1.0 is required
# before installing dracarys
> sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] SparkR_3.4.0   compiler_4.2.2 Rserve_1.8-12 

Solution:

install.packages("RJSONIO")
devtools::install_github("umccr/dracarys", upgrade_dependencies = FALSE, dependencies = FALSE)

Will experiment a bit with the custom Docker setup at some point.

AWS Lambda

Create an AWS Lambda that receives an event with an input directory and output prefix. Explore the option of specifying the name of the workflow that generated the input, in case we want to be more specific and have more control over the output.

-- Edit: also see Andrei's work in https://github.com/umccr/dracarys-to-s3-cdk
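A bare-bones handler sketch, assuming the {lambdr} R runtime; the event fields and the umccr_tidy() arguments are illustrative, not the real schema:

# runtime.R
handler <- function(gds_indir, out_prefix, workflow = NULL) {
  # hypothetical argument names - match to the actual umccr_tidy() signature
  dracarys::umccr_tidy(in_dir = gds_indir, out_dir = out_prefix)
}
lambdr::start_lambda()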

Specify inputs/outputs via TSV/JSON

Instead of running it one input at a time, allow multiple inputs (along with their outputs + prefixes) to be specified in a TSV/JSON file, e.g.:

dracarys.R tidy \
  --multi_input samples.tsv \
  --token <TOKEN> \
  --format {tsv,parquet,both}
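For illustration, samples.tsv might look like the following (tab-separated; column names are assumptions, not a fixed schema):

in_dir                                out_dir                  prefix
gds://production/analysis_data/runA/  s3://bucket/tidy/runA/   runA
/mnt/sync/runB/                       /mnt/tidy/runB/          runB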

Time Metrics Ingestion: Labels are params from a Presigned URL

Describe the bug
A clear and concise description of what the bug is. Include:

  • version of {dracarys}. (main branch as of 04-07-2023)
  • the complete command you used to invoke {dracarys}
gds_files_list <- dracarys::gds_files_list_filter_relevant(
  gds_output_directory, 
  token=Sys.getenv("ICA_ACCESS_TOKEN"), 
  include="PresignedUrl"
)

data_objects <- gds_files_list %>%
  # We only want the metrics files
  dplyr::filter(
    stringr::str_detect(.data[["type"]], "MultiqcFile|MetricsFile$")
  ) %>%
  # Work by row
  dplyr::rowwise() %>%
  # Use the func_selector to select the ingestion type
  # based on the input data type
  dplyr::mutate(
    gen = list(dracarys:::dr_func_eval(.data$type))
  ) %>%
  # Call the dracarys function selected by mutate above,
  # using the presigned URL as an input to the function
  dplyr::mutate(
    obj = list(.data$gen$new(.data$presigned_url))
  ) %>%
  # Now read the dracarys digested object
  dplyr::mutate(
    objp = list(.data$obj$read())
  ) %>%
  # Add the portal run id to the objp tbl
  dplyr::mutate(
    objp = list(
      .data$objp %>%
        dplyr::mutate(
          portal_run_id = portal_run_id
        )
    )
  ) %>%
  # Get output table name based on workflow type and data type
  dplyr::mutate(
    output_table = paste0(
      workflow_type_name,
      "__",
      stringr::str_replace(.data[["type"]], "MetricsFile$", "_metrics")
    )
  ) %>%
  # De-rowwise the tibble
  dplyr::ungroup() %>%
  # Select required outputs
  dplyr::select(
    "type",
    "output_table",
    "objp"
  )

# Write to table
data_objects %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    wrote_output = write_tbl_to_uc_table(sc, .data$objp, .data$output_table)
  ) %>%
  dplyr::ungroup()
  • the full error message

Screenshots
If applicable, add screenshots to help explain your problem.

(screenshot of the error)

Additional context
Add any other context about the problem here.

time_metrics.csv.csv

Decouple copying from tidying

e.g. dracarys.R copy -i gds://path/to/workflow_dir/ -o /mnt/dracarys_sync/

-- Note that if you already have synced data, this can work:

dracarys.R tidy -i /mnt/dracarys_sync/ -o /mnt/tidied_files/

Better ICA token validation logging

Handle cases where the token:

  • is valid but for a different context
  • is not a JWT/missing

Already handled in some part by {jose} at https://github.com/umccr/dracarys/blob/4d8738/R/ica.R#L134, but the logging and edge-case handling could be improved, methinks (see the sketch after the traces below).

  • e.g. for a valid token but for a different context:
backtrace
Error in `dplyr::mutate()`:
ℹ In argument: `size = fs::as_fs_bytes(.data$size)`.
Caused by error in `.data$size`:
! Column `size` not found in `.data`.
Backtrace:
     ▆
  1. ├─global tidy_parse_args(args)
  2. │ ├─base::do.call(umccr_tidy, tidy_args)
  3. │ └─dracarys (local) `<fn>`(...)
  4. │   └─dracarys::dr_gds_download(...)
  5. │     ├─dplyr::select(...)
  6. │     ├─dplyr::mutate(...)
  7. │     └─dracarys::gds_files_list(gdsdir = gdsdir, token = token)
  8. │       ├─dplyr::select(...)
  9. │       ├─dplyr::mutate(...)
 10. │       └─dplyr:::mutate.data.frame(...)
 11. │         └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 12. │           ├─base::withCallingHandlers(...)
 13. │           └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 14. │             └─mask$eval_all_mutate(quo)
 15. │               └─dplyr (local) eval()
 16. ├─fs::as_fs_bytes(.data$size)
 17. ├─size
 18. ├─rlang:::`$.rlang_data_pronoun`(.data, size)
 19. │ └─rlang:::data_pronoun_get(...)
 20. └─rlang:::abort_data_pronoun(x, call = y)
 21.   └─rlang::abort(msg, "rlang_error_data_pronoun_not_found", call = call)
Execution halted
  • e.g. for an invalid JWT
backtrace
Error in jose::jwt_split(token) : length(input) %in% c(2, 3) is not TRUE
Calls: tidy_parse_args -> ica_token_validate -> <Anonymous> -> stopifnot
Execution halted
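For instance, a hedged sketch of friendlier pre-flight checks; it assumes only standard JWT payload fields like exp (anything ICA-specific would need the real payload schema):

check_ica_token <- function(token) {
  if (is.null(token) || !nzchar(token)) {
    stop("No ICA token found - did you export ICA_ACCESS_TOKEN?", call. = FALSE)
  }
  parts <- strsplit(token, ".", fixed = TRUE)[[1]]
  if (!length(parts) %in% c(2, 3)) {
    stop("Token does not look like a JWT (expected 2-3 dot-separated parts).", call. = FALSE)
  }
  payload <- jsonlite::fromJSON(rawToChar(jose::base64url_decode(parts[2])))
  if (!is.null(payload$exp) && payload$exp < as.numeric(Sys.time())) {
    stop("Token has expired - please generate a new one.", call. = FALSE)
  }
  invisible(payload)
}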

Support umccrise outputs

Need to handle the TSV copying from the umccrise cancer_report_tables/ directory.
Think the data is tidy enough in there.
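A rough reading sketch, with the path layout assumed from the umccrise output structure and <sbj> as the subject ID placeholder:

tsvs <- fs::dir_ls("umccrised/<sbj>/cancer_report_tables", glob = "*.tsv*")
tbls <- purrr::map(tsvs, readr::read_tsv)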

Handle presigned URLs of gzipped JSONs

{RJSONIO} cannot handle URLs to gzipped JSONs:

> url_jsongz
[1] "https://stratus-gds-aps2.s3.ap-southeast-2.amazonaws.com/11143240-789c-4320-b4e2-08d89d4636a9/analysis_data/SBJ00006/tso_ctdna_tumor_only/20230506af3bd7d9/L2300565/Results/NTC_ctTSO230501_L2300565/NTC_ctTSO230501_L2300565.AlignCollapseFusionCaller_metrics.json.gz?X-Amz-Expires=604800&response-content-disposition=filename%3D%22NTC_ctTSO230501_L2300565.AlignCollapseFusionCaller_metrics.json.gz%22&response-content-type=binary%2Foctet-stream[...]

> RJSONIO::fromJSON(url_jsongz)
Error in nchar(content) : invalid multibyte string, element 1

# how about something non-ICA related
> RJSONIO::fromJSON("https://wiki.mozilla.org/images/f/ff/Example.json.gz")
Error in nchar(content) : invalid multibyte string, element 1

The solution (based on SO) is to create a gz connection to the URL, read it in as text via base::readLines or readr::read_lines, and finally feed that string to RJSONIO::fromJSON.

> "https://wiki.mozilla.org/images/f/ff/Example.json.gz" |>
+     base::url() |> 
+     base::gzcon() |> 
+     readr::read_lines() |>
+     RJSONIO::fromJSON() |> str()
List of 26                                                                                                                                                                            
 $ InstallTime          : chr "1295768962"
 $ Comments             : chr "Will test without extension."
 $ Theme                : chr "classic/1.0"
 $ Version              : chr "4.0b10pre"
 $ id                   : chr "ec8030f7-c20a-464f-9b0e-13a3a9e97384"
 $ Vendor               : chr "Mozilla"
 $ EMCheckCompatibility : chr "false"
 $ Throttleable         : chr "1"
[...]

Missing ggrepel dependency

Used for TsoTargetRegionCoverageFile read depth vs. cov pct ggplot:

Error in `dplyr::mutate()`:
ℹ In argument: `plot = ifelse(...)`.
ℹ In row 2.
Caused by error in `loadNamespace()`:
! there is no package called ‘ggrepel’

CRAN dependency check passed because it doesn't scan R6 classes properly.

Set up bare bones db

Assuming the dracarys AWS Lambda (issue #28) is able to drop data in an S3 bucket based on a (e.g. umccrise) workflow completion event, what are the next steps?

  • What will the db look like?
  • How are the S3 objects imported into the db?
  • What do the queries look like?
  • How does the sample metadata get dumped into the db?

There is already a starting point for this setup - see umccr/data-portal-apis#551 and Slack discussion.

There is also a 576-row x 18-col TSV/Parquet file that can be used as an 'end product' from dracarys in case we want to focus on this use case first (see Slack).

TSO: parse MergedSmallVariants.vcf.gz

MergedSmallVariants.vcf.gz contains some useful info in the FORMAT column that does not get captured elsewhere:

# A tibble: 13 × 4
   ID    Number Type    Description                                                                               
   <chr> <chr>  <chr>   <chr>                                                                                     
 1 GT    1      String  Genotype                                                                                  
 2 GQ    1      Integer Genotype Quality                                                                          
 3 AD    .      Integer Allele Depth                                                                              
 4 DP    1      Integer Total Depth Used For Variant Calling                                                      
 5 VF    .      Float   Variant Frequency                                                                         
 6 NL    1      Integer Applied BaseCall Noise Level                                                              
 7 SB    1      Float   StrandBias Score                                                                          
 8 NC    1      Float   Fraction of bases which were uncalled or with basecall quality below the minimum threshold
 9 US    .      Integer Supporting read type counts                                                               
10 AQ    1      Float   Variant artifact adjusted quality score                                                   
11 LQ    1      Float   Likelihood ratio quality score                                                            
12 LQUS  6      Float   Likelihood ratio quality score by supporting read types                                   
13 BFQ   1      Float   Variant support Bias Filter Quality score

Note that the VCF contains PASSed and non-PASSed variants. Consider using bcftools, with bedr as a pure-R fallback option.
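For illustration, a hedged extraction sketch that shells out to bcftools (assumed to be on PATH); the chosen FORMAT fields are a subset of the table above:

vcf <- "MergedSmallVariants.vcf.gz"
fmt <- "%CHROM\t%POS\t%REF\t%ALT\t%FILTER[\t%GT\t%DP\t%VF]\n"
out <- system2("bcftools", c("query", "-f", shQuote(fmt), vcf), stdout = TRUE)
d <- readr::read_tsv(
  I(paste(out, collapse = "\n")),
  col_names = c("chrom", "pos", "ref", "alt", "filter", "gt", "dp", "vf")
)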

Switch `{jose}` to Imports instead of Suggests

{jose} is used to validate the ICA JWT token. I had originally added it under Suggests since I didn't think it'd get used that much, but it should now move to Imports.

Oviraptor parsing

Grab oviraptor results from the umccrise work directory:

  • work/<sbj>/oncoviruses/present_viruses.txt
  • work/<sbj>/oncoviruses/oncoviral_breakpoints.tsv

Related somewhat to #73.
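A rough reading sketch (with <sbj> as the subject ID placeholder):

viruses <- readr::read_lines("work/<sbj>/oncoviruses/present_viruses.txt")
breakpoints <- readr::read_tsv("work/<sbj>/oncoviruses/oncoviral_breakpoints.tsv")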

MultiQC: `dplyr::bind_rows` can't combine character with double

Edge case: need to handle binding missing values with numeric columns, as in the multiqc parsing code:

# replace the "NA" strings with NA, else we get a column class error

Related to the germline workflow update I suspect - see #49.
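A minimal sketch of the fix idea: coerce literal "NA" strings to real NAs in character columns before binding, then re-guess the column types:

library(dplyr)
d <- d |>
  mutate(across(where(is.character), \(x) na_if(x, "NA"))) |>
  readr::type_convert()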

rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `dplyr::mutate()`:
ℹ In argument: `obj_parsed = list(.data$obj$read())`.
ℹ In row 1.
Caused by error in `dplyr::bind_rows()`:
! Can't combine `PRJ230202$(Chr X SNPs)/(chr Y SNPs) ratio over genome` <character> and `PRJ230203$(Chr X SNPs)/(chr Y SNPs) ratio over genome` <double>.
---
Backtrace:
     ▆
  1. ├─dracarys::umccr_tidy(...)
  2. │ ├─dplyr::select(...) at dracarys/R/tidy.R:83:4
  3. │ ├─dplyr::mutate(...)
  4. │ └─dplyr:::mutate.data.frame(...)
  5. │   └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  8. │       └─mask$eval_all_mutate(quo)
  9. │         └─dplyr (local) eval()
 10. └─.data$obj$read()
 11.   └─dracarys::multiqc_tidy_json(x) at dracarys/R/multiqc.R:26:6
 12.     ├─dplyr::select(...) at dracarys/R/multiqc.R:66:2
 13.     ├─dplyr::mutate(...)
 14.     └─dplyr::bind_rows(d, .id = "umccr_id")

Support additional DRAGEN outputs

Some of these are likely contained somewhere else too:

  • trimmer_metrics.csv
  • fastqc_metrics.csv
  • insert-stats.tab
  • roh.bed
  • roh_metrics.csv
  • wgs_overall_mean_cov.csv
  • sv_metrics.csv
  • wgs_hist.csv

Test data for development

Currently dracarys uses test data via DVC, which gets downloaded into the built conda pkg. It would be good to also have a folder on gds://development, potentially synced.

Slack Message

cttso: data normalisation

Think most of the tidy tables are fine as-is; just this one can probably be broken down further:

  • AlignCollapseFusionCaller_metrics.json.gz:
    • hist.tsv.gz split by:
      • num_supporting_fragments
      • unique_UMIs_per_fragment_position
    • main.tsv.gz split by:
      • CoverageSummary
      • MappingAligningPerRg
      • MappingAligningSummary
      • RunTime
      • SvSummary
      • TrimmerStatistics
      • UmiStatistics
