mtswirl's Introduction

Nuclear genetic control of mtDNA copy number and heteroplasmy in humans

This repo contains pipeline files for the reference-aware mtSwirl pipeline as well as the code used to run, merge, and annotate the results.

Citation and data

This pipeline was released as part of the manuscript: Nuclear genetic control of mitochondrial DNA copy number and heteroplasmy in humans, which can be found at Nature. If you use these resources in your work, please cite as Gupta et al. 2023 Nature:

Gupta, R., Kanai, M., Durham, T.J. et al. Nuclear genetic control of mtDNA copy number and heteroplasmy in humans. Nature, in press. https://doi.org/10.1038/s41586-023-06426-5.

Individual-level data

Individual-level data for mtDNA copy number (before and after covariate correction) and the post-QC variant callset are available:

  • For UKB, via the UKB data showcase. Note that final data return is currently in process.
  • For AoU, as part of the Nuclear genetic control of mtDNA copy number and heteroplasmy in humans workspace. Note that controlled tier access is required to clone this workspace.

Summary statistics

Summary statistics from UKB are available:

  • Via GWAS Catalog under ID GCP000614, where we have uploaded summary statistics from our largest analysis for each phenotype: the cross-ancestry meta-analysis when one was performed, or EUR when no other population had sufficient N for GWAS. These summary statistics are filtered to include only stringently "high_quality" variants; the full summary statistics, including all otherwise QC-passing variants, can be found on GCP (see below). PLEASE NOTE: these data were corrected in 03/2024 because the effect_allele and other_allele columns were originally reversed. No other columns were changed. No data deposited in other locations (e.g., GCP, AllofUs; see below) required updating.
  • On Google Cloud Platform, in the gs://mito-wgs-public-2023 bucket. Please note that this is a requester pays bucket. This bucket also contains ukb_b37_b38_lifted_variants.tsv.bgz, which maps GRCh37 coordinates in the UKB data to GRCh38. The summary statistics on GCP correspond to the same data, but are stored using the Pan UKB schema. These files contain the cross-ancestry meta-analysis as well as per-ancestry association statistics (and thus are more comprehensive than those on GWAS Catalog). The schema is described in more detail in the README_ukb.md file located in the gs://mito-wgs-public-2023 bucket.
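
Because the bucket is requester pays, every read has to name a Google Cloud project to bill. With gsutil this is done through the top-level `-u` option. The sketch below only assembles such commands. The project ID `my-billing-project` is a placeholder, the location of README_ukb.md at the bucket root is an assumption, and gsutil is assumed to be installed and authenticated.

```python
import shlex

BUCKET = "gs://mito-wgs-public-2023"  # bucket named in this README


def requester_pays_cmd(billing_project, *gsutil_args):
    """Build a gsutil command that bills `billing_project` for the request."""
    if not billing_project:
        raise ValueError("a billing project is required for a requester pays bucket")
    # `-u` is gsutil's top-level option naming the project to bill.
    return ["gsutil", "-u", billing_project, *gsutil_args]


if __name__ == "__main__":
    project = "my-billing-project"  # placeholder: replace with your own project ID
    # List the bucket contents.
    print(shlex.join(requester_pays_cmd(project, "ls", BUCKET)))
    # Copy the schema README (assumed here to sit at the bucket root).
    print(shlex.join(requester_pays_cmd(project, "cp", f"{BUCKET}/README_ukb.md", ".")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once the placeholder project is replaced.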

Summary statistics from AoU are available in the Nuclear genetic control of mtDNA copy number and heteroplasmy in humans workspace in the same format as UKB summary statistics found on GCP. Note that controlled tier access is required to clone this workspace.

See Supplementary table 1 for sample size information.

AllofUs workspace access

Please note that at the time of writing, there is no mechanism by which custom workspaces in AoU can be made available to anyone with controlled tier access. Thus, we ask that in the interim, any users who desire to work with these data in AoU contact us to be added to the workspace. We are committed to making these data automatically available when this mechanism becomes available, and plan to beta-test this functionality when it is possible to do so.

mtSwirl: Reference-aware quantification of mtDNA copy number and heteroplasmy using WGS

See the WDL folder for the self-contained WDL. The v2.5_MongoSwirl_Single folder contains the single-sample pipeline oriented for use with Cromwell. The v2.6_MongoSwirl_Multi folder contains a multi-sample pipeline for use on the UKB Research Analysis Platform using dxCompiler. This folder also contains supporting scripts and reference NUMTs used to generate nucDNA self-reference sequences. See manuscript Methods for more details.
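
For the single-sample pipeline, a local Cromwell run generally takes the form `java -jar <cromwell jar> run <workflow.wdl> --inputs <inputs.json>`. The helper below is a minimal sketch that assembles that command. The file names in the usage example are hypothetical placeholders, not the actual names of files in the WDL folder.

```python
from pathlib import Path


def cromwell_run_cmd(cromwell_jar, wdl, inputs_json):
    """Build a `cromwell run` command for one workflow and one inputs file."""
    wdl, inputs_json = Path(wdl), Path(inputs_json)
    if wdl.suffix != ".wdl":
        raise ValueError(f"expected a .wdl file, got {wdl}")
    if inputs_json.suffix != ".json":
        raise ValueError(f"expected a .json inputs file, got {inputs_json}")
    return [
        "java", "-jar", str(cromwell_jar),
        "run", str(wdl),
        "--inputs", str(inputs_json),
    ]


if __name__ == "__main__":
    # Hypothetical paths: substitute the WDL and inputs JSON you actually use.
    cmd = cromwell_run_cmd("cromwell.jar", "pipeline.wdl", "inputs.json")
    print(" ".join(cmd))
```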

Generate multi-sample MatrixTables and perform QC

The generate_mtdna_call_mt folder contains code used to merge single-sample VCFs into Hail MatrixTables. This code was written originally as an extension of code previously released for mtDNA analysis (Laricchia et al. 2022 Genome Res). Scripts in the root of this folder work across any platform; scripts in each sub-folder are platform specific.

dx

Run dx_pipeline.sh to run the merging pipeline.

AoU

  1. Currently, AoU does not have a central Cromwell implementation. Thus, we created aou_mtdna_analysis_launcher.sh to run the WDL. Tweak the parameters in the header for your configuration.
  2. To combine per-base coverage into an MT use aou_annotate_coverage.py
  3. To combine single-sample VCFs into an MT use aou_combine_vcfs.py

Terra

  1. To combine per-base coverage into an MT use annotate_coverage.py
  2. To combine single-sample VCFs into an MT use combine_vcfs.py

All platforms

  1. To generate sample statistics after QC (e.g., mtCN), use process_sample_stats.py
  2. To annotate the VCF MatrixTable, run QC, run VEP, and output a QC'd variant flat file, use add_annotations.py
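
As background for the mtCN statistic, a widely used coverage-based estimator scales the ratio of mean mtDNA coverage to nuclear coverage by the nuclear ploidy: mtCN ≈ 2 × (mean mtDNA coverage) / (nuclear coverage). The sketch below implements only that generic estimator. Whether process_sample_stats.py computes mtCN in exactly this way, and which nuclear coverage summary it uses, is an assumption here; see the manuscript Methods for the authoritative definition.

```python
def estimate_mtcn(mean_mt_coverage, nuclear_coverage, ploidy=2):
    """Generic coverage-ratio estimate of mtDNA copy number per cell.

    mean_mt_coverage: mean per-base coverage across the mtDNA.
    nuclear_coverage: a summary (e.g. mean or median) of nuclear coverage.
    ploidy: copies of the nuclear genome per cell (2 for diploid).
    """
    if nuclear_coverage <= 0:
        raise ValueError("nuclear coverage must be positive")
    if mean_mt_coverage < 0:
        raise ValueError("mtDNA coverage cannot be negative")
    return ploidy * mean_mt_coverage / nuclear_coverage


if __name__ == "__main__":
    # Illustrative numbers only: 3000x mtDNA coverage, 30x nuclear coverage.
    print(estimate_mtcn(3000, 30))  # 200.0
```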

Genome-wide association study pipeline

UKB

To run GWAS in UKB use the files in gwas_ukb. Using the outputs of QC, we run covariate correction with generate_covariate_corrected_traits.Rmd for mtCN (and for sensitivity analyses). To produce final heteroplasmy phenotypes, we use produce_final_HL_traits.Rmd. We use saige_pan_ancestry_custom.py to run SAIGE in UKB with custom_load_custom_sumstats_into_mt.py to combine results into an MT.

AllofUs

We use the files in gwas_aou to run GWAS in AoU. To produce custom PCs by recomputing them per-ancestry, we use run_per_ancestry_pca.py. We run aou_run_full_hl_gwas.py to run the GWAS.

mtswirl's People

Contributors

rahulg603

mtswirl's Issues

mtCN result from WDL output files

Hi, we have successfully run the v2.5_MongoSwirl_Single pipeline on a single WGS sample, and are reviewing the output files. Could you please point us to the output file that indicates the mtDNA copy number (mtCN)? Thanks!

Issue with jsontools.py file

While running the multi-sample code, I get an error. The stderr file is at:

multi_sam/cromwell-executions/MitochondriaPipeline/f10932f4-93f7-4b3c-8bea-e5c4e2455827/call-AlignAndCallR2/AlignAndCallR2/1887beac-6a1e-457e-8d2c-e25b68b1fa2a/call-AlignToMtRegShiftedAndMetrics/execution/stderr

The stderr file contains the following:

Traceback (most recent call last):
  File "/cromwell-executions/MitochondriaPipeline/f10932f4-93f7-4b3c-8bea-e5c4e2455827/call-AlignAndCallR2/AlignAndCallR2/1887beac-6a1e-457e-8d2c-e25b68b1fa2a/call-AlignToMtRegShiftedAndMetrics/inputs/603824970/jsontools.py", line 67, in <module>
    data_in = parse_vars(args.set, type=str)
  File "/cromwell-executions/MitochondriaPipeline/f10932f4-93f7-4b3c-8bea-e5c4e2455827/call-AlignAndCallR2/AlignAndCallR2/1887beac-6a1e-457e-8d2c-e25b68b1fa2a/call-AlignToMtRegShiftedAndMetrics/inputs/603824970/jsontools.py", line 31, in parse_vars
    key, value = parse_var(item)
  File "/cromwell-executions/MitochondriaPipeline/f10932f4-93f7-4b3c-8bea-e5c4e2455827/call-AlignAndCallR2/AlignAndCallR2/1887beac-6a1e-457e-8d2c-e25b68b1fa2a/call-AlignToMtRegShiftedAndMetrics/inputs/603824970/jsontools.py", line 20, in parse_var
    return (key, value)
UnboundLocalError: local variable 'value' referenced before assignment

The script is running with the default values using cromwell-87 on a local machine. Is the issue inside the jsontools.py file? How can I fix this error?
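
An `UnboundLocalError` on `return (key, value)` is what a key=value parser produces when it assigns `value` only if the argument contains an `=`. The sketch below reproduces that failure mode and shows a stricter variant. Both functions are hypothetical illustrations and are not copied from jsontools.py; that the real `parse_var` follows this pattern is an assumption. If it does, the likely trigger is an argument passed to the script without an `=` (for example, an empty or unset input).

```python
def parse_var_fragile(item):
    """Split 'KEY=VALUE'; breaks when `item` contains no '='."""
    parts = item.split("=")
    key = parts[0].strip()
    if len(parts) > 1:
        value = "=".join(parts[1:])
    # If there was no '=', `value` was never assigned and this line raises
    # UnboundLocalError.
    return (key, value)


def parse_var_strict(item):
    """Split 'KEY=VALUE' and report malformed input with a clear message."""
    key, sep, value = item.partition("=")
    if not sep:
        raise ValueError(f"expected KEY=VALUE, got {item!r}")
    return (key.strip(), value)


if __name__ == "__main__":
    print(parse_var_strict("sample=NA12878"))
    try:
        parse_var_fragile("sample")
    except UnboundLocalError as err:
        print("fragile parser failed:", err)
```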

Error connecting to https://us.gcr.io using address us.gcr.io:443

Hi rahulg, why does the pipeline need to connect to https://us.gcr.io? I get the following error:
[2024-02-24 00:54:22,33] [info] Request threw an exception on attempt #1. Retrying after 989 milliseconds
org.http4s.client.ConnectionFailure: Error connecting to https://us.gcr.io using address us.gcr.io:443 (unresolved: false)
at org.http4s.client.blaze.Http1Support.$anonfun$buildPipeline$1(Http1Support.scala:90)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:477)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
at async @ org.http4s.internal.package$.$anonfun$fromFuture$1(package.scala:144)
at flatMap @ org.http4s.internal.package$.fromFuture(package.scala:139)
at flatMap @ org.http4s.client.PoolManager.$anonfun$createConnection$2(PoolManager.scala:119)
at shift @ org.http4s.client.PoolManager.$anonfun$createConnection$2(PoolManager.scala:119)
at uncancelable @ org.http4s.client.ConnectionManager$.pool(ConnectionManager.scala:83)
at unsafeRunSync @ cromwell.docker.DockerInfoActor.preStart(DockerInfoActor.scala:173)

Summary-level data

Hi rahulg,
I tried to find the summary statistics from UKB via GWAS Catalog and on Google Cloud Platform, but without success.
In GWAS Catalog, I cannot find the resource under ID GCP000614.
On Google Cloud Platform, I do not have access, as shown in the picture below.
How can I apply for the summary-level GWAS data?
Thanks!

Regards,
Lyn
