llm-for-clinical-variants

Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes

This repo contains the scripts and metadata used in our work presented at the NeurIPS 2022 Learning Meaningful Representations of Life (LMRL) workshop.

Workshop website | Paper

Abstract: Despite being self-supervised, protein language models have shown remarkable performance in fundamental biological tasks such as predicting the impact of genetic variation on protein structure and function. The effectiveness of these models on a diverse set of tasks suggests that they learn a meaningful representation of the fitness landscape that can be useful for downstream clinical applications. Here, we interrogate the use of these language models in characterizing known pathogenic mutations in curated, medically actionable genes through an exhaustive search of putative compensatory mutations on each variant's genetic background. Systematic analysis of the predicted effects of these compensatory mutations reveals unappreciated structural features of proteins that are missed by other structure predictors like AlphaFold. While deep mutational scan experiments provide an unbiased estimate of the mutational landscape, we encourage the community to generate and curate rescue mutation experiments to inform the design of more sophisticated co-masking strategies and leverage large language models more effectively for downstream clinical prediction tasks.
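The exhaustive search described in the abstract can be pictured with a minimal sketch. This is not the code from this repo: `rescue_scan`, `apply_mutation`, and `toy_score` are hypothetical names, the scorer is a pluggable callable standing in for a protein language model, and skipping the variant's own position (to avoid counting a plain reversion as a rescue) is an assumption.

```python
from typing import Callable, List, NamedTuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class Rescue(NamedTuple):
    mutation: str  # second-site substitution, e.g. "K2A" (1-based position)
    delta: float   # score(double mutant) - score(single mutant)


def apply_mutation(seq: str, mutation: str) -> str:
    """Apply a substitution written as <from><1-based position><to>."""
    wt, pos, mt = mutation[0], int(mutation[1:-1]) - 1, mutation[-1]
    if seq[pos] != wt:
        raise ValueError(f"{mutation}: expected {wt} at position {pos + 1}, found {seq[pos]}")
    return seq[:pos] + mt + seq[pos + 1:]


def rescue_scan(wt_seq: str, variant: str, score_fn: Callable[[str], float]) -> List[Rescue]:
    """Score every single substitution on the variant background, best first."""
    background = apply_mutation(wt_seq, variant)
    base = score_fn(background)
    variant_pos = int(variant[1:-1]) - 1
    results = []
    for pos, residue in enumerate(background):
        if pos == variant_pos:  # assumption: do not count reversion at the variant site
            continue
        for aa in AMINO_ACIDS:
            if aa == residue:
                continue
            second = f"{residue}{pos + 1}{aa}"
            delta = score_fn(apply_mutation(background, second)) - base
            results.append(Rescue(second, delta))
    return sorted(results, key=lambda r: r.delta, reverse=True)


def toy_score(seq: str) -> float:
    """Hypothetical stand-in scorer: penalize net charge (K/R positive, D/E negative)."""
    net = sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")
    return -abs(net)


if __name__ == "__main__":
    # Toy sequence and variant, not taken from any real gene.
    for rescue in rescue_scan("MKDAKD", "D3A", toy_score)[:5]:
        print(rescue.mutation, rescue.delta)
```

In practice `score_fn` would wrap a language model score for the full sequence; the toy charge-balance scorer is only there so the sketch runs without model weights.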

Pretrained models

| Model | Number of layers | Number of parameters | Training dataset | Implementation in our work |
| --- | --- | --- | --- | --- |
| ESM-2 | 33 | 650M | UR50/D | Single model with wt-marginals scoring strategy |
| ESM-1v | 33 | 650M | UR90/S | Ensemble of 5 models with the same scoring strategy as ESM-2 |
| ESMFold | 48 | 690M | PDB + UR50 | Structure prediction for BAG3 |
| AlphaFold2 | | | | AlphaFold2 structural model prediction for BAG3 |
| Cross-protein transfer | | | | Zero-shot prediction scores for all 53 ACMG genes except MAX and HNF1A |
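As a rough illustration of the wt-marginals strategy named in the table: the wild-type sequence is passed through the model once, unmasked, and a substitution is scored as the log-probability of the mutant residue minus that of the wild-type residue at the mutated position. The sketch below is not this repo's code. It replaces the model with a hypothetical `toy_log_probs` helper so it runs without weights, and averaging scores across ensemble members is an assumption about how the five ESM-1v models are combined.

```python
import math
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def wt_marginal_score(log_probs: List[List[float]], wt_seq: str, mutation: str) -> float:
    """log p(mutant) - log p(wild type) at the mutated position.

    log_probs[i] holds log-probabilities over AMINO_ACIDS at position i,
    computed from the unmasked wild-type sequence. A real model output
    usually carries an extra start token, so indices would shift by one.
    """
    wt, pos, mt = mutation[0], int(mutation[1:-1]) - 1, mutation[-1]
    if wt_seq[pos] != wt:
        raise ValueError(f"{mutation}: expected {wt} at position {pos + 1}, found {wt_seq[pos]}")
    row = log_probs[pos]
    return row[AMINO_ACIDS.index(mt)] - row[AMINO_ACIDS.index(wt)]


def ensemble_score(per_model_log_probs: List[List[List[float]]], wt_seq: str, mutation: str) -> float:
    """Average the wt-marginal score over models (averaging is an assumption)."""
    scores = [wt_marginal_score(lp, wt_seq, mutation) for lp in per_model_log_probs]
    return sum(scores) / len(scores)


def toy_log_probs(seq: str, bonus: float) -> List[List[float]]:
    """Hypothetical stand-in for a model: favor the observed residue by `bonus` logits."""
    return [log_softmax([bonus if aa == res else 0.0 for aa in AMINO_ACIDS]) for res in seq]


if __name__ == "__main__":
    wt = "MKT"  # toy sequence, not a real protein
    models = [toy_log_probs(wt, b) for b in (2.0, 4.0)]
    print(wt_marginal_score(models[0], wt, "K2A"))
    print(ensemble_score(models, wt, "K2A"))
```

Negative scores mean the model considers the mutant residue less likely than the wild-type residue at that position.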

Data on gene list and sequence variation

| Description | Data source |
| --- | --- |
| List of clinically actionable genes | ACMG v3.1 |
| Allele frequency | gnomAD v2 GRCh38 liftover |
| ClinVar annotations | Accessed on 09/17/2022 |
| Multiple sequence alignments | UCSC multiz-100 way CDS alignment (Placental mammals) |

Citation

If you find this work useful, please cite it as follows:

@misc{
  url = {https://arxiv.org/abs/2211.10000},
  author = {Soylemez, Onuralp and Cordero, Pablo},
  keywords = {Machine Learning (cs.LG), Genomics (q-bio.GN), FOS: Computer and information sciences, FOS: Biological sciences},
  title = {Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Feedback

If you have any questions or comments, or would like to collaborate, please feel free to reach out.
