Code Monkey home page Code Monkey logo

sethsh7 / hla-prs-toolkit Goto Github PK

View Code? Open in Web Editor NEW
9.0 2.0 5.0 957 KB

Toolkit to aid generation of advanced Polygenic Risk Scores (PRS) including SNP-allele tagging and interaction models, especially in HLA. Developed and maintained by Seth Sharp ([email protected]).

License: GNU General Public License v3.0

Python 92.17% Shell 7.83%
hla genetics polygenic-risk-scores genomics python bioinformatics immunogenetics genetic-risk-score

hla-prs-toolkit's Introduction

Polygenic Risk Score (PRS) Toolkit for HLA

This toolkit is for generating Single Nucleotide Polymorphism (SNP) based Polygenic Risk Scores (PRS) for common polygenic disease / traits where it is necessary to impute HLA alleles using single SNPs and model epistasis.

The major advantage of this is the ability to assign HLA haplogenotypes (e.g DR3/DQ2.5) using SNPs alone which is quicker and more cost effective than typical methods such array imputation and HLA typing. Demo data and guide files are provided for our own published PRS but these could be easily generated in order to apply code to your own HLA calling and or scoring schema.

For any questions about code on this page please contact Seth Sharp by email ([email protected]).

Updates

7/8/2022 - Updated the SNP lists for TOPMED (r2) imputations as per NIH TOPMED server
7/8/2022 - Reordered the SNP lists folder to split by TOPMED vs 1000G imputation
7/8/2022 - Added a wrapper script

Download

hla-prs-toolkit-0.22a.zip - Latest version of all scripts and demo data.
Alternatively "git clone https://github.com/sethsh7/hla-prs-toolkit.git" to clone onto your working local directory or server.

Requirements

PLINK 1.9 is required.
Anaconda or Python (Version 3.8).
PLINK genotyping data (.bed/.bim/.fam)

Scripts

1. PLINK2CALL - Hard calling of HLA haplogenotypes from SNP proxies

Usage:

python 1_plink2call.py --bfile <prefix> --mapping <mapping> --rank <ranking>

This script takes your PLINK data containing a set of SNPs for which you wish to generate HLA haplogenotypes and a mapping file which should be text (space or tab delimited) in the format:

SNP ALLELE
snp1  allele1
snp2 allele2
snp3 allele3
...

The SNPs are matched on RSID and should match exactly with your PLINK .bim file. The script will output the following results if succesful:

a) <prefix>_clean.txt - A table of allele counts by sample and mapped allele.
b) <prefix>_count.txt - Row wise sum of alleles per person for quality control.
c) <prefix>_cat.txt - List of categorical haplogenotypes by sample.

Additionally a ranking file is now needed to resolve impossible calls and invoke the HLA-LD tiebreak algorithm (see below).

HLA LD-Tiebreak algorithm (new December 2021)

Since only 2 HLA calls are possible per person (corresponding to 2 chromosomes) any samples with >2 calls were previously filtered out. This is no longer the case as an (unpublished) HLA LD-Tiebreak algorithm as been added based on a reference ranking. The reference ranking provided has been calculated based on the following:

Ranking score = SNP LD (r2) * Prior odds (frequency in reference population)

The reference prior odds were generated from HLA typing data take from Klitz et al (2003) and are therefore appropriate for a White US (European ancestry) population. This scenario typically occurs either in genotyping miscalls or in non-European ancestry samples.You can generate your own reference if you have other ethnicity data from any published HLA typing database. Our internal data shows ~72% of calls recovered this way are accurate compared to HLA typing.

The ranking file provided must have the following format:

ALELE RANK
allele1 1
allele2 2
allele3 3
...

2. CAT2SCORES - Assign scores to samples by HLA haplogenotype

Usage:

python 2_cat2scores.py --cat <your_cat.txt> --score <scorefile> 

The second script takes the by-sample categorical haplogenotype list file (c above) generated from PLINK2CALL and applies a scoring scheme. This can be EITHER on a full haplogenotype basis (one weight per combination) or on an additive basis (sum of each allele weight) or a mixture of both. The scorefile should be text (space or tab delimited) in the format:

ALLELE1 ALLELE2 BETA
allele1  allele2 value - Use this format for complete haplogenotypes
allele1 allele3 value
allele3 allele2 value
NULL allele1 value - For addition of an additive model use the NULL representation for the first column.
...

By default the script will NOT generate an additive term if a complete haplogenotype term is available (this differs from specific "interaction terms"). The script will output the following results file:

a) <prefix>_Scored.txt - List of samples and their corresponding scores as above

To generate a complete PRS (e.g. variants outside the HLA region) you can then sum this component with regular allele scoring generated using PLINK score function.

SNP Lists

Scoring and mapping files are provided to generate scores from our publications with either TOPMED or 1000G imputed data.

Type 1 Diabetes (T1D-GRS)

10-SNP / 30-SNP score ("GRS1") [1] - Original 10 and 30 SNP Exeter scores, they use a simple interaction model for DR3 and DR4 haplotypes.

67-SNP score ("GRS2") [2] - Updated 67 SNP Exeter score, complex interaction model, combines interaction and additive models (as described above), best prediction.

Coeliac Disease (CD-GRS)

42-SNP score [3] - 42 SNP Exeter score, interaction model only consisting of DQ2.5, DQ2.2, DQ8.1 and DQ2.2 haplotypes.

Example with Demo Data

In the scripts folder a subfolder "Demo" contains randomly generated data on 50 samples. Run with RunDemo.sh.

FAQ

I have array genotyping data. How do I get the desired SNPs?

You will need to impute your data using NIH Imputation Server. You will then need to convert your VCF of imputed dosages to PLINK format and extract the variants you need.

What genome build should my data be in?

For TOPMED it should be hg38, for 1000G hg19.

Should I impute to HRC, TOPMED or 1000Genomes?

We have provided SNP lists for TOPMED r2 and 1000 Genomes as an alternative.

How do I apply this to my own 23andMe data?

You will need to download your raw genome data and impute it via a third party service.

How are missing genotype values handled?

They are set to 0 by default as the algorithm is conservative with calls.

Supporting Publications

[1] "A Type 1 Diabetes Genetic Risk Score Can Aid Discrimination Between Type 1 and Type 2 Diabetes in Young Adults", DiabetesCare 2016, RA Oram et al

[2] "Development and Standardization of an Improved Type 1 Diabetes Genetic Risk Score for Use in Newborn Screening and Incident Diagnosis", DiabetesCare 2019, SA Sharp et al

[3] "A single nucleotide polymorphism genetic risk score to aid diagnosis of coeliac disease: a pilot study in clinical care", AP&T 2020, SA Sharp et al

License

Resources are provided under the GNU General Public License v3.0. For more information please LICENSE.md.

hla-prs-toolkit's People

Contributors

liutiming avatar sethsh7 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

hla-prs-toolkit's Issues

Typo in GRS67 mapping - 2 SNPs correspond to DQ92

In the file hla-prs-toolkit/Snplists/T1D_GRS67/mapping_TOPMED.txt, both rs28746898 and rs9405117 are marked as DQ92.

From the file T1D_GRS67_1000G_nopalin_pos_hg19.xlsx, it looks like rs9405117 should correspond to DQ93.

Option for specifying ref alleles with third column

A future feature needs to be added that allows you to assign your score allele for DQ Haplotypes with a mapping, this is to avoid situations where one DQ type is >50% frequency and therefore will be wrongly called. This therefore only applies to cohorts heavily selected on certain DQ types.

matching on rsid

Hi I have imputed my data with Topmed imputation,
I wasnt provided rsid only position is this a problem?

Reported spreadsheet errors

(row 63) rs2111485 is on Chr2 not Chr1
(row 19) rs1281934 - major and minor may need swapping
(row 16) alleles not matching
(Row 6) two SNPs appear synonymous (possibly not an issue?)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.