molssi / covid

MolSSI SARS-CoV-2 Biomolecular Simulation Data and Algorithm Store

Home Page: https://covid.molssi.org

CSS 1.85% Python 14.49% HTML 82.99% Shell 0.67%

covid-19 covid19-data sars-cov-2 molecular-dynamics molecular-dynamics-simulation md-simulations

covid's Issues

Allow view of data generated by a given organization

It would be extremely useful to allow a view of all data deposited by a specific organization (e.g. Folding@home), so that we can present views of all data deposited from a single organization at a time.

Add Swissmodel SARS-CoV-2 models

The Swissmodel SARS-CoV-2 models may be useful to link:
https://swissmodel.expasy.org/repository/species/2697049

We could link both "all" models (which are useful to browse) as well as specific high-value models for specific targets.

Determine Quality/Usefulness Metrics

The current data, after review, can be given a star rating. This might not be the best system.

What we need to do:

Determine the metrics by which a datum is higher/lower quality than others
Decide how to present that
Sort by that (can use the rating tag and then any system can be built around that)

cc @apayne97 @Binikarki @jchodera @sjayellis @Andrew-AbiMansour @egoldber

Distinguish between unsolvated and solvated models

@Lnaden:

It would be useful to distinguish between solvated and unsolvated models in the "Models" section associated with each Structure, as well as on the Models page. For example:

The explicitly solvated snapshot may be useful for MD simulations, but it matters mostly for provenance tracking of the associated simulation dataset that contains the trajectories, since very few modeling applications can use a solvated snapshot.

By contrast, the unsolvated protein models (into which missing loops have been built, structural modeling errors have been corrected, misperceived structural ions have been corrected, etc.) can be used in essentially all modeling workflows.

I would suggest we move the solvated snapshots into "Simulations", alongside their relevant simulation trajectories, and reserve "Models" for unsolvated models that have corrected issues with the original structural data.

Suggestions to ingest data automatically

@apayne97, @henriberger and I have been talking about ways to incorporate information from the Thorne Lab in a more automated way. We have come up with this "ideal" pipeline:

Tier 1) Create a script that can diff their PDB IDs with our PDB IDs. Report the set difference for a human to review which new ones are worth adding.

Tier 2) Create a GitHub Actions pipeline that does this automatically either with an hourly cronjob or, if technically possible, after every push to the Thorne Lab repo

Tier 3) Add bot features to GHA to submit the PRs needed for each new candidate PDB ID. A human reviews each PR, edits the information as needed, and merges or rejects it. The closed PRs serve as a history of what we have tried, so we don't resubmit the same entry.

Let us know if you have feedback!
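
A minimal sketch of the Tier 1 diff, assuming both ID lists have already been extracted from the two repositories as plain strings. The function names and the sample IDs are illustrative only, not taken from either codebase:

```python
from typing import Iterable, List


def normalize_pdb_id(raw: str) -> str:
    """Strip whitespace and upper-case the ID so that '6lu7' and '6LU7' compare equal."""
    return raw.strip().upper()


def new_pdb_ids(theirs: Iterable[str], ours: Iterable[str]) -> List[str]:
    """Return the IDs present in `theirs` but not in `ours`, sorted for stable review."""
    known = {normalize_pdb_id(pdb_id) for pdb_id in ours if pdb_id.strip()}
    candidates = {normalize_pdb_id(pdb_id) for pdb_id in theirs if pdb_id.strip()}
    return sorted(candidates - known)


if __name__ == "__main__":
    # Hypothetical inputs; in practice these would be read from the two repositories.
    thorne_lab_ids = ["6lu7", "6M17", "6VSB", " 6w9c "]
    hub_ids = ["6LU7", "6m17"]
    for pdb_id in new_pdb_ids(thorne_lab_ids, hub_ids):
        print(pdb_id)
```

The sorted output keeps the report identical between runs, which should make it easier for a scheduled job (Tier 2) to detect whether anything actually changed.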

Feig lab updated models

We should be sure to update the Feig lab models:
https://twitter.com/MeikelFeig/status/1254876370896855041

Improving internal and external links

Several times now I have been looking for something I know is on the website but just can't seem to find it - for instance, I just wanted to download a video showing a DESRES trajectory but just kept going in circles clicking on links that led nowhere.

One concrete suggestion is to have a constantly visible legend for the different labels telling me where I will go when I click on it (i.e. a "proteins" page, an external link, etc)

Additionally, it would help to have some way to orient the user in "space" within the website, e.g. by highlighting the panel you're in and showing how far down the page you are (maybe not possible with a static site? idk)

Remove stoplights from structural data

We need to remove the stoplights from structural data. They are entirely misleading as to the quality of the structures and their utility for different purposes.

Publication status has no impact on structure quality. If we want to communicate whether a preprint or published version exists, we should simply add an icon for each that is displayed when one is available and absent when it is not.

This is going to cause active harm to the community if we keep these.

The appropriate annotation data should instead be pulled from the Coronavirus Structural Task Force, but we shouldn't wait for the implementation of that to strip out the stoplight nonsense.

Have validation.py dynamically retrieve valid values from directory

As it says on the label...
validation.py should populate classes such as "ValidProteins" dynamically from the data/proteins directory files, so that users can simply add the yml files instead of adding directly to the python script.
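
A rough sketch of one way to do this, assuming each valid value corresponds to the stem of a `.yml` (or `.yaml`) file in the directory. The helper name `build_valid_enum` is hypothetical, and the real layout of `validation.py` may call for a different shape:

```python
from enum import Enum
from pathlib import Path
from typing import Type


def build_valid_enum(name: str, directory: Path) -> Type[Enum]:
    """Build an Enum whose members are the YAML file stems found in `directory`."""
    stems = sorted(
        path.stem
        for path in directory.iterdir()
        if path.is_file() and path.suffix in {".yml", ".yaml"}
    )
    if not stems:
        raise ValueError(f"no YAML files found in {directory}")
    # Functional Enum API: maps member name -> member value.
    return Enum(name, {stem: stem for stem in stems})
```

With this, something like `ValidProteins = build_valid_enum("ValidProteins", Path("data/proteins"))` would pick up a new protein as soon as its yml file is added, with no edit to the Python script.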

Request for data addition/refinement for variants

Please briefly describe your suggestion.
It would be good to have a field about variants so they are easily searchable rather than just being plain text.

Please provide the schema for the new/refined data class of interest.
List below all the keywords/values you would like to modify or add.

20E (EU1; D614G+A222V)
Alpha (B.1.1.7)
Beta (B.1.351)
Delta (B.1.617.2)
Epsilon (B.1.427 and B.1.429)
Omicron (B.1.1.529)
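
As a sketch of how such a field could be searched, assuming each entry carries a hypothetical `variants` list. The alias table uses only the labels listed above; the function names are made up for illustration:

```python
from typing import Dict, Iterable, List, Tuple

# Names and parenthesised aliases from the list above.
VARIANT_ALIASES: Dict[str, Tuple[str, ...]] = {
    "20E": ("EU1",),
    "Alpha": ("B.1.1.7",),
    "Beta": ("B.1.351",),
    "Delta": ("B.1.617.2",),
    "Epsilon": ("B.1.427", "B.1.429"),
    "Omicron": ("B.1.1.529",),
}


def canonical_variant(label: str) -> str:
    """Map a variant name or alias (case-insensitive) to its canonical name."""
    wanted = label.strip().lower()
    for name, aliases in VARIANT_ALIASES.items():
        if wanted == name.lower() or wanted in (alias.lower() for alias in aliases):
            return name
    raise ValueError(f"unknown variant: {label!r}")


def entries_with_variant(entries: Iterable[dict], label: str) -> List[dict]:
    """Return the entries whose `variants` list mentions the requested variant."""
    name = canonical_variant(label)
    return [
        entry
        for entry in entries
        if name in {canonical_variant(v) for v in entry.get("variants", [])}
    ]
```

Normalising to one canonical name means a search for "alpha" would also find entries tagged "B.1.1.7".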

Additional context
Modelling variants and their mutations is important to understand what they are doing and this area will likely continue to grow. I have recently worked on this in both my previous postdoc in the Bahar lab (https://dx.doi.org/10.2139/ssrn.3907841) and my current Marie Curie fellowship in the Carazo/Sorzano lab (https://www.biorxiv.org/content/10.1101/2021.12.05.471263v2).

How do data analysis offers get handled?

How will this hub take in potential data analysis contributions? We have taken the DESRES 3CLpro trajectories and extracted a small set (34, but adjustable) of diverse conformations of the catalytic domain from the 100,000 snapshots, something we think could be useful to those planning docking studies. "Analyses" are not a current data type, so I guess a new schema is needed. In our case the input data (in addition to the DESRES trajectories) is a Jupyter notebook and one PDB format file; the output is the 34 selected structures, again as PDB files. As a starter:

type: one of [Jupyter notebook, bash script....]
title: (required)
description: (required)
creator: (required)
organization: (optional)
lab: (optional)
institute: (optional)
models: (optional) must point to model in models dir
- modelname_1
- ...
proteins: (required) Must be a valid protein (see proteins dir)
- protein 1
- ...
structures: (optional) must point to structure which could be in structure dir
- structure 1
- ...
simulations: (optional) must point to simulation which could be in simulation dir
- simulation 1
- ...
rating: (optional) int on domain [1,5], 5 is better
files: (required) URLs to input and supporting files.
- file 1
- ...
references: (optional) List of references associated with the programs and methods you want to mention. For publications tied to this exact analysis, use the publication and preprint categories
- ref1
- ref2
publication: (optional) URL of the publication which includes THIS analysis
preprint: (optional) URL of the preprint for the publication. Can also be used to note if submitted to a peer reviewed journal by the exact word "Submitted"
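
To make the proposal concrete, here is a rough validation sketch for an entry already loaded into a dict. It checks only the fields marked "(required)" above, the list-valued fields, and the rating range; the function name and error messages are hypothetical, and cross-checks against the models/proteins/structures/simulations directories are left out:

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("title", "description", "creator", "proteins", "files")
LIST_FIELDS = ("models", "proteins", "structures", "simulations", "files", "references")


def validate_analysis(entry: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the entry passed."""
    errors: List[str] = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            errors.append(f"missing required field: {field}")
    for field in LIST_FIELDS:
        value = entry.get(field)
        if value is not None and not isinstance(value, list):
            errors.append(f"{field} must be a list")
    rating = entry.get("rating")
    if rating is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(rating, bool) or not isinstance(rating, int) or not 1 <= rating <= 5:
            errors.append("rating must be an int in [1, 5]")
    return errors
```

Returning all problems at once, rather than raising on the first, would let a contributor fix everything in a single pass through their PR.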

Collaborating -> Contributing?

It seems like it might be helpful to change "About > Collaborating" to "About > Contributing" to lower the perceptual barrier for others to get involved:

Here, it looks like we have to go through a significant process to decide whether someone is allowed to be a "collaborator" for a monolithic site that is intended to be a community hub.

Instead, it may make sense to list contributors so far (which can be automatically pulled from YAML files, which we could add contributor: fields to) and describe several ways in which folks can get involved by

contributing a PR about a new potential target, structure, model, or molecule (dataset)
joining a data/contribution review team
helping with web development
collaborating in a more substantial way (for orgs like BioExcel/JEDI)

Models page target's identification does not work

The logic which determines if a model is part of a target or not appears to be broken and everything is categorized as "No target"

Another dataset to link to: Stanford University Coronavirus Antiviral Research Database

This website appears to index some other investigational compounds of interest: https://covdb.stanford.edu/

Might be worth a link!

How to present drug discovery efforts against 3CLpro, PLpro, RdRP, etc

Not a bug, I just didn't see a format that looked right
There are a WHOLE bunch of 3CLpro (Mpro, Main Protease, nsp5) structures. And potentially a WHOLE BUNCH of molecules that will target it. I think it's worth thinking about the best way to curate and share this data.
My current idea would be to just:

identify useful key classes of small molecules
curate just a few structures / specific examples of those classes, and display those directly
have a separate page for linking to other repositories for more of this info.

This could be expanded to PLpro (nsp3) and RdRP (nsp12) in a similar fashion.

Add I-TASSER models

It would be great to add the I-TASSER models for various viral targets:
https://zhanglab.ccmb.med.umich.edu/COVID-19/

The beautiful image on the top of the page might be useful as well---perhaps we could get permission to use it and have it link to the appropriate proteins?

Add DOI for RIKEN trajectories

I believe the citation for RIKEN trajectories like https://covid.molssi.org/simulations/#riken-cpr-tms-tmd1_toup-trajectory should be:

Takaharu Mori, Jaewoon Jung, Chigusa Kobayashi, Hisham M Dokainish, Suyong Re, Yuji Sugita (2021):
Elucidation of interactions regulating conformational stability and dynamics of SARS-CoV-2 S-protein.
Biophysical Journal 120(6)
https://doi.org/10.1016/j.bpj.2021.01.012

Note that there are several deposits but I have not checked all of them.

Structure for 6M17.yml has double entries for "protein"

Describe the bug
As it says on the tin, that file has two top level keys for "protein." I don't know which set is correct.
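
A quick way to catch this class of problem across all data files might look like the sketch below. It is a line-based heuristic using only the standard library, not a real YAML parser: it assumes top-level keys are unquoted and start in column 0, and the function name is made up for illustration:

```python
import re
from collections import Counter
from typing import List

# An unindented, unquoted key followed by a colon, e.g. "protein:".
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*)\s*:")


def duplicate_top_level_keys(yaml_text: str) -> List[str]:
    """Return the top-level keys that appear more than once, sorted alphabetically."""
    counts: Counter = Counter()
    for line in yaml_text.splitlines():
        match = TOP_LEVEL_KEY.match(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(key for key, n in counts.items() if n > 1)
```

A check like this could run in CI, since a loader that silently keeps only one of the duplicate values would otherwise hide the problem.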

cc @apayne97 @Binikarki @egoldber

Connect to competitor database www.scov2-md.org

"SCoV2-MD (www.scov2-md.org) is a new online resource that systematically organizes atomistic simulations of the SARS-CoV-2 proteome." https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab977/6425545

This came up about a month ago but no one has made an issue yet. It is an alternative database with a similar goal to this one, and it doesn't seem to know we exist. It's also missing a good number of the useful datasets.

Incorporate the new images into the repository

This is a note to self (Levi) to add the images introduced by #29 to the pages

"New data entries" and "Tracking issue" issue tracker buttons don't open an issue

Describe the bug
The "New data entries" and "Tracking issue" issue tracker buttons take you nowhere useful.

To Reproduce
Click on the "New data entries" and "Tracking issue" issue tracker buttons.

Expected behavior
These buttons should take you to where they claim to.

Set up a review process for new data with goal to have review pre-merge

A review process needs to be established for data.

One process, proposed at the onset of the project, was to merge all data as it came in and give it a color-coded label to indicate review status. It has been pointed out that this will likely lead to chaos and be hard to maintain.

Another process suggested was to have data be reviewed before it ever gets merged and I think this is what should be done. The pipeline would be like this:

The contributor accumulates all the data they wish to submit and opens a PR with their own assessment of the quality of the data.
That assessment is reviewed by the curation team for that type of data
Adjustments are made / discussed in PR
Quality is assessed and noted in the PR
Merged

cc @Andrew-AbiMansour @sjayellis @jchodera

Simulation data descriptions do not render Markdown properly

Describe the bug
The description field in simulations does not render Markdown correctly.
We'll need this for our incoming Folding@home data sharing PRs.
Fortunately, this seems easy to fix---will create a PR momentarily.

To Reproduce
Example: https://covid.molssi.org//simulations/#sars-cov-2-spike-s-glycoprotein

Expected behavior
Markdown should render correctly to allow inclusion of links to simulation data sources and inline shell examples of how to download the data

Update targets for Folding@home dataset?

Some of the Folding@home simulation datasets end up under "no specified targets" when there are defined targets:

nsp7 appears under "No Targets Recorded", rather than "Inhibition of viral polymerases"
nsp8 appears under "No Targets Recorded", rather than "Inhibition of viral polymerases"
nsp12 (RdRP) appears under "No Targets Recorded", rather than "Inhibition of viral polymerases"
nsp13 appears under "No Targets Recorded", rather than "Inhibition of nsp13 helicase activity"

Any idea how we fix this?

Establish process for post-accepted data review adjustment

With respect to #43, data need to be re-assessed regularly to see whether their quality assessments need adjusting. A process should be established for how that is done.