
coronavirus_data's Introduction

Data and scripts for COVID-19

This GitHub repo contains data and scripts relevant to COVID-19, the disease caused by the virus SARS-CoV-2. For a full description of our efforts, please see https://www.aicures.mit.edu/.

Note that since relatively little data for SARS-CoV-2 is available, most of the data in this repo is for SARS-CoV (responsible for the 2002/3 SARS outbreak) and other related coronaviruses. The hope is that models trained on this data will be able to retain their predictive ability on SARS-CoV-2.

Although the data contained in this repo can be used by any model, we have primarily been working with the message passing neural network model chemprop. Our trained models are available on http://chemprop.csail.mit.edu/predict and the predictions from these models on the Broad Repurposing Hub are available in predictions/.

SARS-CoV data

  • AID1706_binarized_sars.csv - (N = 290,726; hits = 405) In-vitro assay that detects inhibition of SARS-CoV 3CL protease via fluorescence from PubChem AID1706.
  • evaluation_set_v2.csv - (N = 5,671; hits = 41) An evaluation set for SARS-CoV 3CL protease containing 41 experimentally validated hits along with 5,630 molecules from the Broad Repurposing Hub, which are treated as non-hits. There is no overlap with AID1706_binarized_sars.csv.
  • AID1706_binarized_sars_full_eval_actives.csv - (N = 290,767; hits = 446) AID1706_binarized_sars.csv combined with the 41 validated hits from evaluation_set_v2.csv.
  • PLpro.csv - (N = 233,891; hits = 697) Bioassay that detects activity against SARS-CoV in yeast models via PL protease inhibition. Combines PubChem data from AID652038 and AID485353.
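The N and hits figures above can be re-derived from any of these files with a few lines of Python. This is a minimal sketch, assuming each CSV has a header row and a single binary label column; the column names `smiles` and `activity` are assumptions, so adjust them to match the actual files.

```python
import csv
import io


def count_hits(csv_file, label_column="activity"):
    """Return (N, hits) for a CSV with one binary label column.

    The column name "activity" is an assumption, not taken from the repo.
    """
    reader = csv.DictReader(csv_file)
    n = hits = 0
    for row in reader:
        n += 1
        hits += int(float(row[label_column]) == 1.0)
    return n, hits


# Tiny in-memory stand-in for a file such as AID1706_binarized_sars.csv.
sample = io.StringIO("smiles,activity\nCCO,0\nc1ccccc1,1\nCC(=O)O,0\n")
n, hits = count_hits(sample)
print(f"N = {n}; hits = {hits}")

# Consistency checks on the counts quoted in the list above.
assert 290_726 + 41 == 290_767 and 405 + 41 == 446  # combined training set
assert 41 + 5_630 == 5_671  # evaluation set
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, for example `count_hits(open(path, newline=""))`.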

SARS-CoV-2 data

Data extracted from literature

  • corona_literature_idex.csv - (N = 101) FDA-approved drugs that are mentioned in generic coronavirus literature. Drug to SMILES mapping is generated through the PubChem idex service and may contain multiple SMILES for generic drug names. These are not guaranteed to be effective against any targets; they simply appear in the literature.
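Because one generic drug name may map to several SMILES strings, it can help to group the rows by name before using them. The sketch below assumes the data is available as (name, SMILES) pairs; the drug names are hypothetical placeholders, not entries from corona_literature_idex.csv.

```python
from collections import defaultdict


def group_smiles_by_name(rows):
    """Map each drug name to the de-duplicated list of SMILES seen for it."""
    grouped = defaultdict(list)
    for name, smiles in rows:
        if smiles not in grouped[name]:
            grouped[name].append(smiles)
    return dict(grouped)


# Hypothetical (name, SMILES) pairs used only for illustration.
rows = [
    ("drug_a", "CCO"),
    ("drug_a", "OCC"),
    ("drug_b", "CC(=O)O"),
    ("drug_a", "CCO"),  # exact duplicate, dropped
]
grouped = group_smiles_by_name(rows)

# Names that need manual review because they map to more than one SMILES.
ambiguous = [name for name, smiles in grouped.items() if len(smiles) > 1]
print(ambiguous)
```

Note that the comparison here is on raw strings: "CCO" and "OCC" describe the same molecule but count as two entries. Collapsing such cases would require canonicalizing the SMILES with a cheminformatics toolkit first.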

Catalogues of drugs that can be screened for repurposing

Other property prediction data

Contains train/dev/test splits (using a scaffold split) of some of the above datasets for benchmarking purposes.
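To illustrate what a scaffold split does, here is a minimal sketch that keeps every molecule sharing a scaffold in the same subset. It is not chemprop's implementation: the function name, the 80/10/10 fractions, and the greedy largest-first assignment are assumptions, and the scaffold keys are taken as precomputed strings (in practice they would typically be Bemis-Murcko scaffolds computed with RDKit).

```python
from collections import defaultdict


def scaffold_grouped_split(scaffolds, fractions=(0.8, 0.1, 0.1)):
    """Split indices into train/dev/test so no scaffold spans two subsets.

    scaffolds[i] is a precomputed scaffold key for molecule i.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first, so big scaffold families land in train.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = fractions[0] * n
    dev_cap = fractions[1] * n
    train, dev, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(dev) + len(group) <= dev_cap:
            dev.extend(group)
        else:
            test.extend(group)
    return train, dev, test


# Hypothetical scaffold keys for ten molecules.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, dev, test = scaffold_grouped_split(scaffolds)
print(train, dev, test)
```

Because whole scaffold groups move together, the test set contains chemotypes the model never saw during training, which gives a harder and more realistic benchmark than a random split.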

Original raw data files and format conversions.

Predictions made by trained models on some of the repurposing datasets. See the README inside the predictions/ directory for details.

t-SNE plots comparing the datasets. Note that in the plots, "sars_pos" and "sars_neg" refer to any hits or non-hits, respectively, across both AID1706_binarized_sars.csv and PLpro.csv.

Files for converting between SMILES/CID/name. Obtained from https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi.

The nearest neighbor computations from each test set to the training set.

Various data processing scripts for reuse/reproducibility.

Statistics about overlap between the SMILES strings of various datasets.
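An overlap statistic of this kind can be computed with plain set operations. The sketch below is an assumption about what such a statistic might look like, not the script used in this repo; it compares raw SMILES strings, so two different spellings of the same molecule are counted as distinct.

```python
def overlap_stats(smiles_a, smiles_b):
    """Count unique SMILES in each dataset and how many are shared."""
    set_a, set_b = set(smiles_a), set(smiles_b)
    shared = set_a & set_b
    return {
        "unique_a": len(set_a),
        "unique_b": len(set_b),
        "shared": len(shared),
        # Fraction of dataset A that also appears in dataset B.
        "frac_a_in_b": len(shared) / len(set_a) if set_a else 0.0,
    }


# Hypothetical SMILES lists standing in for two datasets.
dataset_a = ["CCO", "c1ccccc1", "CC(=O)O", "CCO"]
dataset_b = ["c1ccccc1", "CCN"]
stats = overlap_stats(dataset_a, dataset_b)
print(stats)
```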

t-SNE plots of chemical rationales extracted (using this code) from a model trained on the combined AID1706 and PLpro datasets.

Older versions of files from when we combined AID1706 data with other data that was unhelpful.

Experiment Commands

These commands are for running experiments using chemprop and should be run from the main directory in the chemprop repo. You may need to modify some paths depending on your directory structure. The commands below assume you are using AID1706_binarized_sars.csv but can be modified to work with any of the datasets.

Generating RDKit features

To speed up experiments, you can pre-generate RDKit features using the script save_features.py in chemprop/scripts. You should run this command:

python save_features.py \
    --data_path ../../coronavirus_data/data/AID1706_binarized_sars.csv \
    --save_path ../../coronavirus_data/features/AID1706_binarized_sars.npz \
    --features_generator rdkit_2d_normalized

By default this will run feature generation using parallel processing. On occasion the parallel processing gets stuck near the end of feature generation, so if this happens, just kill the process and restart with the --sequential flag. This will pick up where the parallel version stopped and will finish correctly.
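The kill-and-restart step can also be automated. The following is a rough sketch, assuming a timeout is an acceptable proxy for "stuck"; the helper names and the timeout value are hypothetical, and only the flags shown in the command above plus `--sequential` are used.

```python
import subprocess


def build_save_features_cmd(data_path, save_path, sequential=False):
    """Build the save_features.py command line as an argument list."""
    cmd = [
        "python", "save_features.py",
        "--data_path", data_path,
        "--save_path", save_path,
        "--features_generator", "rdkit_2d_normalized",
    ]
    if sequential:
        cmd.append("--sequential")
    return cmd


def run_with_fallback(data_path, save_path, timeout_s, runner=subprocess.run):
    """Try the parallel run; if it exceeds timeout_s, rerun sequentially."""
    try:
        runner(build_save_features_cmd(data_path, save_path),
               check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process when the timeout expires.
        runner(build_save_features_cmd(data_path, save_path, sequential=True),
               check=True)


# Show the fallback command without executing anything.
print(" ".join(build_save_features_cmd("data.csv", "features.npz", sequential=True)))
```

Choose the timeout generously: feature generation on a dataset of this size can legitimately take a long time, and a timeout that is too short would interrupt a healthy parallel run.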

Training and testing

python train.py \
    --data_path ../coronavirus_data/data/AID1706_binarized_sars.csv \
    --dataset_type classification \
    --save_dir ../coronavirus_data/ckpt/AID1706_binarized_sars \
    --features_path ../coronavirus_data/features/AID1706_binarized_sars.npz \
    --no_features_scaling \
    --split_type scaffold_balanced \
    --quiet

The data splitting mechanism in chemprop is seeded so that this will reproduce the same train/dev/test split as in splits.zip.

Class balance

To run experiments with class balance, switch to the class_weights branch of chemprop (git checkout class_weights) and add the --class_balance flag. This will train with an equal number of positives and negatives in each batch.
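For intuition, class-balanced batching can be sketched as follows. This is one common way to get equal numbers of positives and negatives per batch, not necessarily what the class_weights branch does: here each pass covers the negatives once and draws positives with replacement. The function name and parameters are assumptions.

```python
import random


def balanced_batches(labels, batch_size, seed=0):
    """Yield index batches with equal numbers of positives and negatives."""
    if batch_size % 2:
        raise ValueError("batch_size must be even")
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")

    rng = random.Random(seed)
    rng.shuffle(negatives)
    half = batch_size // 2
    # Walk through the negatives once; oversample the rare positives.
    for start in range(0, len(negatives) - half + 1, half):
        neg = negatives[start:start + half]
        pos = [rng.choice(positives) for _ in range(half)]
        yield pos + neg


# Hypothetical labels: 2 hits among 10 molecules.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
batches = list(balanced_batches(labels, batch_size=4))
for batch in batches:
    print(batch, [labels[i] for i in batch])
```

With hit rates as low as those in AID1706_binarized_sars.csv (405 of 290,726), an unbalanced batch will often contain no positives at all, which is the situation this kind of sampling is meant to avoid.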

Multi-task training for SARS-CoV-2 3CLpro and SARS-CoV 3CLpro

This experiment combines data on the 3CLpro target from SARS-CoV-2 (mpro_xchem.csv) and SARS-CoV (AID1706_binarized_sars.csv).

5-fold cross validation performance is 0.850 +/- 0.022.

python multitask.py \
    --data_path data/mpro_xchem.csv \
    --source_data_path data/AID1706_binarized_sars.csv \
    --dataset_type classification \
    --save_dir ckpt/

coronavirus_data's People

Contributors

rmwu, swansonk14, varal7, wengong-jin, yangkevin2


coronavirus_data's Issues

Some clarifications on the datasets

Thank you for the amazing effort! I am excited to try out your datasets and contribute to COVID-19 research.

I would like some clarification on the datasets and the splits that you would like us to use for evaluation. I initially thought they were in https://github.com/yangkevin2/coronavirus_data/blob/master/splits.zip, but you also used https://github.com/yangkevin2/coronavirus_data/blob/master/old/AID1706_splits.zip for 5-fold CV with scaffold splits.
Which one do you expect us to use, and which one did you use when you obtained E. coli 0.887 and SARS-CoV-3CLpro 0.722 in https://www.aicures.mit.edu/tasks? Also, what is the relationship between these two datasets and the main dataset on Pseudomonas aeruginosa? Finally, are SARS-CoV-3CLpro and AID1706 the same dataset or different ones? I initially thought they were the same, but I saw a section called "Multi-task training for SARS-COV-2 3CLPro and AID1706" and got a bit confused.

Thanks in advance.

Error Building SciPy

I have been using chemprop since early last year and recently had to rebuild the program, but I have continuously been met with an error while building the wheels for SciPy (Error with PEP 517). Not sure if the dependencies listed in the requirements txt file are outdated, but I have not been able to solve this issue.

Chemprop SARS models working?

Hi! The SMILES listed under predictions are sorted from highest to lowest probability of activity. But, for example, the values returned by Chemprop for the first three (lines 2-4) in AID1706_model_broad_repurposing_library_preds.csv using the "SARS" model are:
(1) 0.4181
(2) 0.4514
(3) 0.4549
and for lines 4500 and 5000 (ranked 4499 and 4999):
(4499) 0.1093
(4999) 0.3064
Apparently no specific relationship exists between the ranking and the predicted activity values. This is also true about AID1706_balanced_model_broad_repurposing_library_preds.csv using the "SARS - balanced" model.

The predictions are also rather incorrect regarding the experimental data on which the model has been trained. For example Chemprop gives 0.4531 for this compound which has the highest score (100) and inhibition (80.36%) against the SARS 3CLPro as reported in AID1706 and gives 0.47603 for this one which has a score of zero and an inhibition of -6.48%.

The same applies to the top SARS-CoV-2 antiviral molecules in the Broad drug repurposing hub listed in arXiv:2005.03004: Chemprop returns an activity of 0.2707 for the molecule with the highest activity (0.955; first line of Table 4 on page 8).
Am I missing a point?
