
PhenCards

Please cite our paper:

Havrilla, J.M., Liu, C., Dong, X., Weng, C., Wang, K. PhenCards: a data resource linking human phenotype information to biomedical knowledge. Genome Med 13, 91 (2021). https://doi.org/10.1186/s13073-021-00909-8

Zenodo for Code: DOI

Zenodo for Data: DOI

This is the repository for the code used to make PhenCards.org.

(C) Wang Lab 2020-2021

Running the site

We have uploaded Docker images for PhenCards (use tag 1.0.0 for the version described in the paper), Doc2Hpo, and Phen2Gene at https://hub.docker.com/u/genomicslab; you can also click the links there to find them. You will also need the Docker image for Elasticsearch 7.8.1, which we used to build the Lucene indices that power the autocompletion and keep the site fast.

You need to set up certbot to obtain certificates and establish HTTPS for Doc2Hpo and for communication with Phen2Gene and UMLS. You will need nginx or, as we did, httpd (Apache) to run the services behind the site.

The docker-compose.yml file drives the builds: docker-compose build prod builds the production version of the site using the Dockerfile and the code in this repository. Since you already have the Docker images, you can simply run docker-compose up -d prod, which starts the Elasticsearch and Phen2Gene services for production. If you want to edit the code in dev mode and see how it affects the site, use docker-compose up -d app and watch it change in real time on port 5010. Production is served on port 5005, Elasticsearch runs on ports 9200 and 9300, Phen2Gene runs locally on port 6000, and Doc2Hpo on port 7000. For your purposes, however, you can use https://phen2gene.wglab.org and https://doc2hpo.wglab.org for these services; there is no real need to run Phen2Gene or Doc2Hpo locally.

As stated below, once your Elasticsearch service is running you will need to run index_db.py once on the data from Zenodo to create the Lucene index database; after that you should not have to run it again. The site's styling lives in custom HTML, CSS, and JS templates, which can be modified to your liking.

To run the Flask app locally:

Make sure Python 3 is installed.
cd into the directory and run pip install -r requirements.txt
Run python app.py
Go to localhost:5005 in your browser

If you would like to use debug mode when adjusting the features, run the following:

cd into the directory
Run export FLASK_DEBUG=1 (Linux/Mac) or set FLASK_DEBUG=1 (Windows)
Run flask run
Go to localhost:5000 in your browser; you can now watch your changes take effect in the browser as you edit the Flask code.

Additional note: use pip3 rather than pip, since on many systems pip still points to Python 2. To keep the server persistent, use nohup python3 app.py & to spin it up.

Elasticsearch for autocompletion

The autocompletion feature is implemented with Elasticsearch 7.8.1 on the backend and jQuery UI (esQuery.js) on the frontend. Install and start Elasticsearch first (path-to-elastic-search/bin/elasticsearch), then modify and execute index_db.py to index the database documents from https://zenodo.org/record/4755959. To avoid CORS header issues, add the following two lines to path-to-elastic-search/config/elasticsearch.yml:

http.cors.enabled: true
http.cors.allow-origin: "*"

More details can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html
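
For reference, here is a minimal sketch of the completion-suggester pattern that index_db.py and esQuery.js build on, using the elasticsearch Python client; the index and field names below are illustrative, not necessarily the ones the real scripts use:

from elasticsearch import Elasticsearch  # client 7.x, matching Elasticsearch 7.8.1

es = Elasticsearch(["http://localhost:9200"])

# A field mapped as type "completion" gets an in-memory prefix structure for fast suggestions.
es.indices.create(index="phenotype_suggest", body={
    "mappings": {"properties": {"suggest": {"type": "completion"}}}
}, ignore=400)  # ignore "index already exists"

es.index(index="phenotype_suggest", body={"suggest": "Seizure"})
es.index(index="phenotype_suggest", body={"suggest": "Seizure, focal"})
es.indices.refresh(index="phenotype_suggest")

# The jQuery UI autocomplete source issues a prefix query like this one.
resp = es.search(index="phenotype_suggest", body={
    "suggest": {"pheno": {"prefix": "sei", "completion": {"field": "suggest"}}}
})
for option in resp["suggest"]["pheno"][0]["options"]:
    print(option["text"])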

Development Logic

Front-end files include templates/index.html, which transfers input parameters from the user, and templates/results.html, which generates the result page with external links to the other result pages inside the templates folder. Another important file is templates/template.html, which provides the overall template for the whole front end; the other HTML files inherit from it.

Back-end files include API.py, which connects to external APIs and returns formatted data structures; app.py, the high-level Flask framework for the app; and queries.py, which executes local queries.

How to deploy the Docker image on DigitalOcean in a basic way

Documentation is here


PhenCards Issues

Implement database ID search

Can be a partial ID (e.g. LIKE searching in SQLite) and can be HPO id, OMIM, etc.

In the query search on the website, make the option no longer say "Search by HPO ID" but instead "Search by external database ID (e.g., OMIM, HPO, UMLS, etc.)".
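
A minimal sketch of what that partial-ID search could look like with SQLite; the table and column names here are hypothetical, not the ones queries.py actually uses:

import sqlite3

def search_by_external_id(conn: sqlite3.Connection, partial_id: str):
    # '%' on both sides makes this a substring match, so "0001250" also finds "HP:0001250".
    pattern = f"%{partial_id}%"
    cur = conn.execute(
        "SELECT db_name, db_id, phenotype FROM xrefs WHERE db_id LIKE ?",
        (pattern,),
    )
    return cur.fetchall()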

Add KEGG and WikiData/FAERS (if possible) to Drugs section

I think KEGG is doable; note that it is quite possible a given term has no link on KEGG, but when there is one, it is well established. WikiData needs a lot of filtering upon my inspection, and I think we can try it after KEGG.

Wikidata works like this:

You search for seizures: https://www.wikidata.org/w/index.php?sort=relevance&search=seizures&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&ns120=1

You can click a phenotype-like result which is also relevant:
https://www.wikidata.org/wiki/Q41571

Notice that it is an "instance of" disease. And in that result you can see a list of drugs under "drug used for treatment".

I think we can do an advanced search restricted to things that are "instances of" disease, and those should all have at least some drug data. I am looking into how to structure that search right now.
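
A sketch of that restricted search against the Wikidata SPARQL endpoint; P31 (instance of), Q12136 (disease), and P2176 (drug or therapy used for treatment) are the identifiers involved, though the query shape here is just one possible way to structure it:

import requests

QUERY = """
SELECT ?disease ?diseaseLabel ?drug ?drugLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "EntitySearch";
                    mwapi:search "seizures";
                    mwapi:language "en".
    ?disease wikibase:apiOutputItem mwapi:item.
  }
  ?disease wdt:P31 wd:Q12136;          # keep only items that are instances of disease
           wdt:P2176 ?drug.            # ...that list at least one treatment drug
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "PhenCards-sketch/0.1"},
)
for row in r.json()["results"]["bindings"]:
    print(row["diseaseLabel"]["value"], "->", row["drugLabel"]["value"])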

web application version 1

Implement basic functions: search by phenotype name and HPO-ID

The same code is also accessible at /mnt/nas-0-0/home/dongx4/Project_PhenCards/ on the biocluster.

Disease section

If the user searches for HPO terms, then the xref links or disease names containing the substring will be a good starting point, since we already have several processed disease DBs.

HTTPS site set up

I spent a very long time trying to get this to work with Nginx, uWSGI, and certbot, to no avail. I then tried httpd (Apache); both got me about the same result. I'm running the Flask app on port 5005 and listening on ports 80/443 for the domain phencards-dev.wglab.org, which, thanks to Kai and certbot, actually works. But for some reason, in both setups the proxy_pass redirect seems not to work: I get "bad gateway" from Nginx or "503 service unavailable" from Apache. I have tried using wsgi.py to run app.py and running app.py by itself; no good results either way.

My /etc/httpd/conf.d/phencards.conf file:

ServerName phencards-dev.wglab.org

<VirtualHost *:80>
    ServerName phencards-dev.wglab.org
    Redirect / https://phencards-dev.wglab.org
</VirtualHost>

<VirtualHost *:443>
    DocumentRoot "/var/www/html/phencards-dev.wglab.org"
    ServerName phencards-dev.wglab.org

    SSLEngine on
    SSLProtocol all -SSLv2 -SSLv3 -TLSv1 -TLSv1.1
    SSLHonorCipherOrder on
    SSLCipherSuite "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS !RC4"

    RewriteOptions inherit

    WSGIScriptAlias /dquest /data/html/phencards/app.wsgi

    DefaultType text/plain
    ErrorLog logs/phencards-error_log
    CustomLog logs/phencards-access_log common

    <Directory /data/html/phencards/FlaskApp/>
        Order allow,deny
        Allow from all
    </Directory>
    <Directory "/data/html/phencards/cgi-bin">
        Options +ExecCGI
        Order allow,deny
        Allow from all
    </Directory>

    ProxyPass / http://localhost:5005/
    ProxyPassReverse / http://localhost:5005/

    Include /etc/letsencrypt/options-ssl-apache.conf

    SSLCertificateFile /etc/letsencrypt/live/phencards-dev.wglab.org/cert.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/phencards-dev.wglab.org/privkey.pem
    SSLCertificateChainFile /etc/letsencrypt/live/phencards-dev.wglab.org/chain.pem
</VirtualHost>

The app works fine by itself if I use a random port, so I know there's nothing wrong with the Flask app.

Adding OHDSI

I think it may be worthwhile to add all of the OHDSI vocab to PhenCards. I looked for license information (which is, conveniently, an unfinished section of their website) but could not find anything concrete. From the OHDSI docs (https://ohdsi.github.io/TheBookOfOhdsi/StandardizedVocabularies.html) and the Athena website, it sounds like any vocabulary that is downloadable without requiring an explicit license can be redistributed open source. This includes SNOMED CT and ICD, among many other vocabularies, and it is what goes into COHD concept IDs. We can index it with Elasticsearch and use the same search as COHD, so we can basically add all of COHD into PhenCards.

Chunhua is not on the OHDSI paper but is officially listed as a collaborator on their site, so I think this should be okay to do. I have made an account on Athena and am awaiting download approval for the whole vocabulary database.

Parsing UMLS

/mnt/isilon/wang_lab/shared/datasets/UMLS_unzipped is where the data is stored. @jimhavrilla had to help with unzipping the MetamorphoSys software and extracting the data from binary .nlm files. Apparently Yunyun and Mengge had only used it through CLAMP, which has the library built in by default, so this is the first time the UMLS library has been manually extracted by the lab.

To parse the UMLS data, we plan to use the following GitHub code to extract what we need: https://github.com/Georgetown-IR-Lab/QuickUMLS

Improve PMC literature search section

After some testing, I am sure my queries already grab review articles, which is great. However, after some research, including posts on Biostars (one extremely informative), and reading the actual manual for Entrez searching, I have concluded that counting linked PMC-indexed articles is the best way to get and sort by citation count. Impact factor is a private, paid-for metric that we cannot redistribute, nor is it linked from PubMed or Google Scholar.

With that in mind, my current strategy is as follows (sketched in code after the list):

  1. Esearch by "relevance"
  2. Grab only top 200-500
  3. Count how many articles are linked to the top 200-500, sort by most cited
  4. Keep the top 25 and post them on the site.
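
A rough sketch of this strategy using the E-utilities endpoints directly; pubmed_pmc_refs is, to my understanding, the elink link name that returns the PMC articles citing a given PubMed ID (the counting trick from the Biostars post), but verify it against the current Entrez docs before relying on it:

import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Steps 1-2: search PubMed sorted by relevance and keep the top 200 IDs.
ids = requests.get(f"{EUTILS}/esearch.fcgi", params={
    "db": "pubmed", "term": "seizures", "sort": "relevance",
    "retmax": 200, "retmode": "json",
}).json()["esearchresult"]["idlist"]

# Step 3: count PMC articles linked to each ID (one linkset per &id= parameter).
params = [("dbfrom", "pubmed"), ("linkname", "pubmed_pmc_refs"), ("retmode", "json")]
params += [("id", pmid) for pmid in ids]
linksets = requests.get(f"{EUTILS}/elink.fcgi", params=params).json()["linksets"]

counts = {}
for ls in linksets:
    dbs = ls.get("linksetdbs", [])
    counts[ls["ids"][0]] = len(dbs[0]["links"]) if dbs else 0

# Step 4: keep the 25 most-cited to post on the site.
top25 = sorted(counts, key=counts.get, reverse=True)[:25]
print(top25)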

Parse and add HPO correctly

This has been done with the help of macarthur-lab's obo-parser.py script, which turns the OBO into a quick TSV. It is MIT-licensed and they have been credited for the parsing, but we will not distribute the code.

I have created two indices: one for the HPO obo (which was not done before) and one for the related phenotype databases.

Distributable docker image

@kaichop, I believe I have made a tar file that should work. I have uploaded it to the previous DigitalOcean droplet (142.93.205.155) as /home/havrillaj/phencards.tar.

docker load --input /home/havrillaj/phencards.tar will add the image. Then run it with Docker:

systemctl start docker
docker run -d -p 5000:5000 phencards

Drugs section make search buttons

This should be pretty straightforward. Since the drug search parses HTTP responses from the drug sites, it takes a while (not too long, but not instant). We can implement caching in the future, but for now users can opt into the longer search by clicking the Tocris and APEXBio buttons to generate each respective list.

Alias section

You can use root terms, or related terms with the same parent in the HPO tree, as well as UMLS terms, MeSH terms, and SNOMED terms.

ASHG 2020 Abstract

Guidelines:

You are allotted 2,300 characters (excluding spaces) for the body text of your abstract. Title, author, and institutional data are not included in the 2,300 characters. See Step-by-Step Instructions (step 10) for details.

The docx (since I know that is the preference of the lab):

For now it reads like a pretty short tool paper. I'm not sure what else can be said about it in its present state, and with the way queries work you can't just Google a phenotype and get our site first; we'd need pre-cached results for everything, like GeneCards and MalaCards have, which I imagine costs them a fortune in server space.

PhenCards.docx

Adding Doc2HPO to site, and fixing redirects for API

If running this test script on reslnklab01:

import requests

# String-based match; faster.
url = "http://localhost:8000/doc2hpo/parse/acdat"
payload = {
    "note": "He denies synophrys.",
    "negex": True,
}
# headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
r = requests.post(url, json=payload)
print(r.json())

You obtain the output:

{'hmName2Id': [{'start': 10, 'length': 9, 'hpoId': 'HP:0000664', 'hpoName': 'synophrys', 'negated': True}], 'hpoOption': False}

This works because the request goes over plain HTTP.

If you run the same script with https:

import requests

# String-based match; faster.
url = "https://localhost:8000/doc2hpo/parse/acdat"
payload = {
    "note": "He denies synophrys.",
    "negex": True,
}
# headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
r = requests.post(url, json=payload)
print(r.json())

You get the error:

requests.exceptions.SSLError: HTTPSConnectionPool(host='localhost', port=8000): Max retries exceeded with url: /doc2hpo/parse/acdat (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_record', 'wrong version number')])")))

Create external database TSVs

Shawn and I are grabbing UMLS, SNOMED, etc., and creating individual TSV files for each. These will be linked together into one large TSV keyed to each HPO ID, and also stored as individual databases on the future web server.

Link disease phenotype databases with phenotype_database.csv

Basically, you can create a dict from this file with the HPO ID (e.g., HP:0000003) as the key and, as the value, the set of OMIM, DECIPHER, ORPHA, etc. IDs linked to it. In Python, use a defaultdict(set) and then d[key].add(omim_id), as sketched below.
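
A minimal sketch of that dict, assuming the HPO ID is the first column of phenotype_database.csv and the external ID the second (adjust to the real column layout):

import csv
from collections import defaultdict

hpo_to_xrefs = defaultdict(set)
with open("phenotype_database.csv", newline="") as fh:
    for row in csv.reader(fh):
        hpo_id, xref = row[0], row[1]  # e.g. "HP:0000003", "OMIM:256100"
        hpo_to_xrefs[hpo_id].add(xref)

print(sorted(hpo_to_xrefs["HP:0000003"]))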

Fix ICD-10 bug

This should be simple as well: the search currently uses startswith and needs to be changed to a partial string match.
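
Sketched with hypothetical table and column names, the fix is just anchoring the LIKE pattern on both sides:

import sqlite3

def search_icd10(conn: sqlite3.Connection, query: str):
    # old behavior (startswith): pattern = f"{query}%"
    pattern = f"%{query}%"  # partial string match: finds the query anywhere in the code
    return conn.execute(
        "SELECT code, description FROM icd10 WHERE code LIKE ?", (pattern,)
    ).fetchall()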

add KEGG pathways

Made a lot of progress, finally! The site can now create temporary files that are served as images when the user clicks the KEGG pathway logo, and it lists all diseases linked to the string term query (HPO term or other) that return disease results on KEGG with assigned pathways. It gives the pathway image and the list of relevant diseases.

It still needs a lot of formatting, but this was a great learning experience and is good progress.
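
For the record, a rough sketch of that flow against the public KEGG REST API (https://rest.kegg.jp); the endpoint paths come from the KEGG REST docs, but the parsing here is illustrative rather than the site's actual implementation:

import requests

term = "epilepsy"

# 1. Find KEGG disease entries matching the query term.
found = requests.get(f"https://rest.kegg.jp/find/disease/{term}").text
disease_ids = [line.split("\t")[0] for line in found.splitlines() if line]

# 2. Link each disease to its assigned pathways (not every disease has one).
for ds in disease_ids[:5]:
    links = requests.get(f"https://rest.kegg.jp/link/pathway/{ds}").text
    for line in links.splitlines():
        if not line.strip():
            continue
        _, pathway = line.split("\t")  # e.g. "path:hsa04930"
        pid = pathway.split(":")[1]
        # 3. Fetch the pathway image to serve as the temporary file on the site.
        png = requests.get(f"https://rest.kegg.jp/get/{pid}/image").content
        with open(f"{pid}.png", "wb") as fh:
            fh.write(png)
        print(ds, "->", pid)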

Database ID search still does not work

It breaks no matter what is searched for. I can try to fix it (I'm not sure what @shawnxd changed), but we could also remove it, since users will probably not use it. My guess is that it is searching the wrong columns or not doing substring matching. An easy fix would be to treat the ID like a substring, though in that case, why not just search by string?

Allow searching of external databases not linked to HPO

Users don't all know what HPO is, or necessarily care, so allow the HPO column to be blank. My suggestion is to implement each database as a separate table result: HPO, OMIM, etc. If entries are linked to each other, specify that; HPO works great as the base for those links.

Get proper license approval and permissions

Now that Cong and I are starting to revamp the site, we need to do this before it goes public. I got a fairly thorough response back from UMLS/SNOMED:

The UMLS license does not convey any right to redistribute any data freely on the web. If you publish UMLS data on the web, you should have permission from each individual vocabulary publisher. Alternatively, you can ensure that your users also have a UMLS license / account. If you want to authenticate your users, we have an API for this purpose.

SNOMED CT license restrictions are covered in Appendix 2 of the license: https://uts.nlm.nih.gov//license.html. You may want to clear your particular use case with SNOMED International: http://www.snomed.org/help-center/contact-us

If you have questions about individual UMLS source vocabularies other than SNOMED CT, we can help.

The yearly report is generally due at the beginning of each year. You will receive an email with instructions for the annual report at the email address associated with your account.

All of the UMLS source vocabularies are documented here: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html

Contact information is available in our documentation. For example, go here and then click the "Metadata" tab: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CPT/index.html

Some vocabularies are fairly permissive, but restrictions vary. For example, see the license for Human Phenotype Ontology (HPO): https://hpo.jax.org/app/license

As for KEGG, I wrote to them and am still waiting to hear back. HPO is pretty chill; they mostly want us to cite them and display their logo on the site. Should we perhaps add a "logo" section and a "citation" section to the site, @kaichop, so it is clear? Or maybe we can add all of that to the index search page; I think that is probably best, since then every user sees it.

PhenCards

Working on a TSV linking HPO - OMIM - DECIPHER - ORPHANET; basically trying to emulate JAX for now. The repo for the web server is at https://github.com/WGLab/PhenCards. This repo is where the data-processing code goes.

Drugs section

We can grab from the APIs of drug databases, but the KEGG API mentioned in #16 can be used for both drugs and pathway information.

Elastic search integration

Trying to replace the SQL query entirely with Elasticsearch, and to improve the ranking as well. This may be a long-persisting issue.
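
A hedged sketch of what the swap could look like: a full-text match query with fuzziness, ranked by Elasticsearch's relevance _score instead of a SQL LIKE scan (index and field names are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
resp = es.search(index="hpo", body={
    "query": {"match": {"phenotype": {"query": "siezures", "fuzziness": "AUTO"}}},
    "size": 10,
})
for hit in resp["hits"]["hits"]:
    # hits arrive pre-sorted by _score, which is the ranking improvement over SQL LIKE
    print(hit["_score"], hit["_source"]["phenotype"])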

Add companies and foundations section

This has been a long time coming. I recently happened upon the 990 Finder. It works pretty darn well, and I think I can parse the tables and the URL.

As for licensing, nothing says I have to pay them a subscription for the data, though they do have paid APIs and such. Nothing says I can't scrape their site, either, or that I have to pay a fee to include their data in ours the way KEGG does...

Parse and add in UMLS (finally)

I have at long last done the monstrous task of parsing UMLS, something no one on the PhenCards team had done before. It is a very large index in Elasticsearch, ~326 MB, with lots of repetitive entries. I'm tempted to deduplicate in the near future by ignoring original sources.

Put UMLS into the results table.

The header should be like the following, so there will be 18 columns in total:

  • Unique identifier for concept
  • Language of term
  • Term status
  • Unique identifier for term
  • String type
  • Unique identifier for string
  • Atom status
  • Unique identifier for atom
  • Source asserted atom identifier
  • Source asserted concept identifier
  • Source asserted descriptor identifier
  • Abbreviated source name
  • Abbreviation for term type in source vocabulary
  • Most useful source asserted identifier
  • String
  • Source restriction level
  • Suppressible flag
  • Content View Flag

Implement tables for each database on website

Have a table showing only the top 10, not 100, for each DB (HPO, UMLS, OMIM, etc.). Then have a link that takes the user to the full table (e.g., 100 rows per page) for the DB of their choice. Maybe they want phenotype information, maybe they want HPO terms, etc.

Check imports and package version

@shawnxd make sure we have werkzeug>=1.0.1 and flask-wtf>=0.14.3 in requirements.txt, and remember to use conda to test locally and export the environment (conda env export > environment.yml) if you are unsure how versions clash.

ICD-11

Not yet relevant, since most data sources do not yet use it, but for the future there is a REST API for ICD-11 queries, documented via Swagger at https://id.who.int/swagger/index.html

For the future...not right now.

Design the site a bit more like GeneCards

Basically, @kaichop wants us to link out to more external websites, the way www.genecards.org/cgi-bin/carddisp.pl?gene=MYC does. We can implement a lot of these searches pretty easily, such as publication searches on PubMed, and via HPO we can add gene lists as another column. We can use the name to create a unified identifier for a phenotype term or syndrome across many databases and build a sort of summary page like GeneCards.

We basically need to redesign the whole site. This is why we need all the databases done ASAP, not just UMLS: we need to start making it user friendly, not just a simple table search with links. We could also place the external search links partially on the page, the way PhenCards does with the "Publications" section, for example.

I'm sure @kaichop will have more suggestions.
