
snorkeling's Introduction

Snorkeling

This repository stores data and code to scale up the extraction of biomedical relationships (i.e. Disease-Gene associations, Compounds binding to Genes, Gene-Gene interactions etc.) from the Pubmed Abstracts.

Deprecation Note

An updated version of this project can be found at greenelab/snorkeling-full-text. All new development takes place in that repository.

Quick Synopsis

This work uses a subset of Hetionet v1 (bolded in the resource schema below), a heterogeneous network that encodes pharmacological and biological information as nodes and edges. The network was built from publicly available data, which is usually populated via manual curation. Manual curation is time consuming and difficult to scale as the rate of publication continues to rise. The recently introduced "data programming" paradigm can circumvent this issue by generating large annotated datasets quickly. This paradigm combines distant supervision with simple rules and heuristics, written as labeling functions, to automatically annotate large datasets. Unfortunately, writing a useful label function takes a significant amount of time and effort. Because of this, we aimed to speed up the process by re-using label functions across edge types. Read the full paper here.

Highlighted edges used in Hetionet v1

Directories

Described below are the main folders for this project. By convention, the folder names are based on the schema shown above.

Name Description
compound_disease Head folder that contains all relationships compounds and diseases may share
compound_gene Head folder that contains all relationships compounds and genes may share
disease_gene Head folder that contains all relationships diseases and genes may share
gene_gene Head folder that contains all relationships genes may share with each other
dependency cluster This folder contains preprocessed results from the "A global network of biomedical relationships derived from text" paper.
figures This folder contains figures for this work
modules This folder contains helper scripts that this work uses
playground This folder contains early code written to test and understand the snorkel package.

Installing/Setting Up The Conda Environment

Snorkeling uses conda as a Python package manager. Before moving on to the instructions below, please make sure it is installed. Download conda here.

Once conda has been installed, type the following command in the terminal:

conda env create --file environment.yml

You can activate the environment by using the following command:

source activate snorkeling

Note: If you want to leave the environment, just enter the following command:

source deactivate 

License

This repository is dual licensed as BSD 3-Clause and CC0 1.0, meaning any repository content can be used under either license. This licensing arrangement ensures source code is available under an OSI-approved license, while non-code content, such as figures, data, and documentation, is maximally reusable under a public domain dedication.

snorkeling's People

Contributors

ajlee21, danich1, dhimmel


snorkeling's Issues

Specifying a conda environment

It would be great to create a conda environment for this project. The environment should use Python 2.7 rather than 3.6 (see snorkel-team/snorkel#509).

In private communication, @ajratner wrote:

In particular, if you're willing to absorb the bumps of code being actively developed, you can check out the dev branch: https://github.com/HazyResearch/snorkel/tree/dev

Also see @ajratner's note here:

However, it's certainly still at a stage where our active involvement -- both for help/bugfixes and active collaboration on code development -- seems to be a big help, in part since the code is evolving so rapidly in response to feedback. We're hoping that it will begin to increasingly stabilize soon; and any feedback you have for us on how we could make it easier to use would be greatly appreciated, either via github issues, a skype call, or on this platform!

@danich1, for these reasons I think we may want to use a dev version of snorkel. You can specify this in the conda environment.yml using a pip installation. Example at http://stackoverflow.com/a/32799944/4651668.
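A minimal sketch of what that could look like (the package list here is illustrative, not the repository's actual environment file):

```yaml
# environment.yml -- illustrative sketch
name: snorkeling
dependencies:
  - python=2.7
  - pip
  - pip:
      # install snorkel's dev branch directly from GitHub via pip
      - git+https://github.com/HazyResearch/snorkel@dev
```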

Planning the Snorkel for whether a compound treats a disease

@danich1 is a new rotation student in the Greene Lab. We were thinking a good rotation project would be to extract medical indications from the literature. The project is intended as a pilot. The ultimate goal is to automate the integration of all biological knowledge into a single hetnet.

We were thinking Compoundā€“treatsā€“Disease relationships were a good place to start for the following reasons:

  1. Comprehensive catalogs of treatments will be crucial for computational drug repurposing approaches.
  2. In Project Rephetio, we physician-curated a gold standard catalog of indications, called PharmacotherapyDB. In addition, we have cataloged treatments under investigation in clinical trials. Both of these catalogs of treatments could be used to create labeling functions.
  3. We shouldn't have to invent tagging methods for diseases and compounds, as there should already be mature/implemented solutions.

Project Rephetio used Disease Ontology as its disease vocabulary and DrugBank as its drug vocabulary. But we're flexible here.

Paging @ajratner and @stephenbach -- creators of Snorkel.

Verifying Experimental Analysis Design

I talked with @dhimmel yesterday and we came up with a design for determining whether or not adding input from a deep learning model (LSTM) is beneficial for predicting relationships between Diseases and Genes.

Background:

project overview

Within the image above we have all disease-gene pair mappings, where some edges are mentioned in PubMed abstracts (noted by the black dashes) and the majority of edges aren't mentioned at all. The edges in green are considered true edges, as they are currently contained in Hetionet v1; the other edges (not highlighted) have the potential to be true Disease-Gene relationships. We aim to classify each edge as either positive (true edge) or negative (false edge), under the hypothesis that using NLP and deep learning (long short-term memory networks, or LSTMs for short) will provide better accuracy than standard methods.

Analysis Design:
To test this hypothesis we plan to use the following design:

Category         Prior               Co-occurrences                                 Natural Language Processing (NLP)
Models           1 model             1 model with sentences, 1 model w/o sentences  1 model with sentences, 1 model w/o sentences
Literature use   Literature unaware  LSTM unaware                                   LSTM aware

The prior category is where we plan to use a model to classify each disease-gene edge without using any information from biomedical literature (hence literature unaware). The co-occurrence category is where we plan to use a model that combines the prior model with information obtained from biomedical literature (e.g. the expected number of sentences that mention a given disease-gene pair, the p-value for each disease-gene edge, how many unique abstracts mention a given disease-gene pair, etc.). Note that this model doesn't use the LSTM and relies only on features extracted from the literature itself; a challenge here will be handling edges that aren't mentioned in the literature at all (the model w/o sentences). Lastly, the NLP category combines the other two models and adds input from a deep learning model (the probability that a sentence is evidence for a true disease-gene relationship). We expect the NLP category model to outperform the models from the other two categories.

Challenges:

  1. What is a fair prior model to use for this analysis?
  2. What do we do about edges that are in the hetnet but aren't mentioned in the literature? How can we classify these edges?

Extracting relationships from Hetionet v1.0

For each relationship we're trying to model, we'll need to extract the Hetionet relationships. Right now we'll be using Hetionet v1.0 relationships as the only knowledge base for each relationship. In the future, we could use multiple resources as knowledge bases for a specific relationship type. Each knowledge base forms its own labeling function.

You can read all Hetionet relationships (with no relationship properties) from this TSV. It's formatted like:

source  metaedge        target
Gene::9021      GpBP    Biological Process::GO:0071357
Gene::51676     GpBP    Biological Process::GO:0098780
Gene::19        GpBP    Biological Process::GO:0055088
Gene::3176      GpBP    Biological Process::GO:0010243
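As a sketch of reading that file, assuming pandas (the inline rows below just mirror the sample above; in practice you would pass the TSV's path or URL to read_csv):

```python
import io

import pandas as pd

# A few rows mirroring the Hetionet edge TSV shown above.
tsv = (
    "source\tmetaedge\ttarget\n"
    "Gene::9021\tGpBP\tBiological Process::GO:0071357\n"
    "Gene::51676\tGpBP\tBiological Process::GO:0098780\n"
)

# In practice, replace io.StringIO(tsv) with the TSV's path or URL.
edges = pd.read_csv(io.StringIO(tsv), sep="\t")

# Filter to a single relationship type (metaedge), e.g.
# Gene-participates-Biological Process.
gpbp = edges[edges["metaedge"] == "GpBP"]
```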

Alternatively, you can make a cypher query for each relationship type to https://neo4j.het.io, like:

MATCH path = (disease:Disease)-[:ASSOCIATES_DaG]->(gene:Gene)
RETURN
  disease.identifier AS disease_id,
  gene.identifier AS gene_id
ORDER BY disease_id, gene_id

You can make these queries programmatically to return pandas DataFrames in Python.
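A hypothetical sketch of that, using the official neo4j Python driver (which driver the project actually used, and the bolt URI below, are assumptions):

```python
import pandas as pd

# The Disease-associates-Gene query from above.
QUERY = """
MATCH path = (disease:Disease)-[:ASSOCIATES_DaG]->(gene:Gene)
RETURN
  disease.identifier AS disease_id,
  gene.identifier AS gene_id
ORDER BY disease_id, gene_id
"""

def fetch_disease_gene_edges(uri="bolt://neo4j.het.io:7687"):
    """Run the Cypher query and return the rows as a pandas DataFrame."""
    # Imported lazily so the module loads without the `neo4j` package installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri)
    with driver.session() as session:
        return pd.DataFrame(session.run(QUERY).data())
```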

Update numbskull

In 66a5deb, we forgot to upgrade numbskull.

Let's update from

git+https://github.com/HazyResearch/numbskull@40ac1af20538c17b4726c963c69adcb81314efa5

to

git+https://github.com/HazyResearch/numbskull@ac52265038bac8edca3f8e930eff34ebaef4c7a0

Conda forge breaks lxml

There is a bug when relying on conda-forge to install the Python module lxml. Right now, if one creates the conda environment and imports lxml.etree, the following error occurs:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libicui18n.so.56: cannot open shared object file: No such file or directory

Therefore, I propose we move lxml to be a pip dependency, since installing it through pip fixes the error.
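A sketch of the proposed change in environment.yml (the surrounding entries are illustrative, not the repository's actual dependency list):

```yaml
dependencies:
  - python=2.7
  # lxml removed from the conda dependencies above...
  - pip
  - pip:
      # ...and installed through pip instead, avoiding the icu
      # shared-library mismatch from the conda-forge build
      - lxml
```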

Scaling Snorkeling To Handle Pubmed

I've got the version of the Snorkeling project from the greenelab repo downloaded and running. See [1] at the bottom for a suggestion about that.

I can't run David's code [2] so I can't test the issue myself. However, I've read the code and have a few questions that will help me in my investigation.

  1. What is the exact problem? Is it that CorpusParser.apply with the default implementation of XMLMultiDocPreparser runs out of memory when loading data from the file /home/davidnicholson/Documents/Data/pubmed_docs.xml?

  2. I heard something about a memory leak. Is it still thought that there is a memory leak? If so, why?

  3. Why was chunking done? It appears that "corpus_parser.apply(xml_parser)", when using XMLMultiDocPreparser from snorkeling/All_Relationships/utils/bigdata_utils.py, should follow the scalable process of reading in one document from the XMLMultiDocPreparser, calling CorpusParserUDF.apply once, dereferencing that document so that it can be garbage collected, and then repeating. The parallel version of corpus_parser.apply should have each process follow this loop independently.

-- Notes --

[1] Not sure if it's just me, but in the future you guys might want to recommend that new developers also do these things:

Add conda-forge to their conda channels.
Command: conda config --add channels conda-forge

Install icu 56.1 from the conda-forge channel.
Command: conda install icu==56.1

I was having trouble loading the library lxml for Snorkel after sourcing the conda env due to a missing shared object from icu version 56. Installing this dependency fixed that.

[2] David's code refers to a Postgres database that I don't have and a path on David's computer that is not in the git repo, /home/davidnicholson/Documents/Data.

Retrieving the pre-processed Snorkel PubMed data

In a private communication, @ajratner wrote:

we have all the pubmed articles pre-processed, tagged with some basic entities (genes, diseases, chemicals, species and mutations), and pre-loaded in Snorkel format, on an internal server; and we'd love to share with you. Do you have any preferred method of transferring large files is? Otherwise I'll figure something out!

@ajratner, awesome. Are chemicals what I'm calling a compound... i.e. a small molecule that could be in DrugBank? What vocabularies are your diseases and chemicals identified in?

How big are the files? How many files are there? The ideal solution would be to use Git LFS. You could fork this repository and create a pull request which adds these files. This would require you to make the files public... and we should consider whether we need to exclude them from the repo's licensing.

Gathering Tagged PubMed Corpora

The goal here is to have a full listing of tagged pubmed abstracts (full text down the road). A few key issues to sort out are:

  • What tagging resources are the best for tagging abstracts? (currently using Pubtator, but skeptical of its scaling abilities)
  • What is the best format to parse these tagged abstracts? (currently using Pubtator's xml format, but amenable to other formats)

If I am missing anything, feel free to let me know: @dhimmel or @cgreene.

Issue when installing from conda

Hello!

I am having the following issue when installing the conda environment from the YML file.

(base) MacBook-Pro-4:snorkeling alexanderli$ conda env create --file environment.yml
Collecting package metadata: done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:
  - gensim=3.8.1 -> python_abi=[build=*_cp37m] -> pypy[version='<0a0']
  - sqlalchemy=1.1.13
Use "conda search <package> --info" to see the dependencies for each package.

Assessing term co-occurrence across sentences

@danich1 has extracted a bunch of sentences that include both a gene and a disease. He's computing summary statistics for gene-disease pairs. One basic measure is the number of sentences with both the gene and the disease. Now we want to compute the expected number of such sentences, given the marginal frequencies of the gene and disease.

Let's take an approach similar to computing MEDLINE term co-occurrence. For each gene-disease pair, you will need to compute the values for a contingency table where:

  • a is the number of sentences containing both the gene and the disease (cooccurrence)
  • b is the number of sentences containing the gene but not the disease
  • c is the number of sentences containing the disease but not the gene
  • d is the number of sentences without either the gene or disease

We should limit ourselves to only sentences that contain a gene and a disease. You'll be able to compute the expected count and the p-value from a Fisher's exact test using code like here.

The expected number of sentences is actually quite easy to compute: multiply the number of sentences with the gene by the number of sentences with the disease, then divide by the total number of sentences (only considering sentences that contain both a gene and a disease).
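With hypothetical counts, and assuming scipy's fisher_exact stands in for the code linked above, the computation looks like:

```python
from scipy.stats import fisher_exact

# Hypothetical counts for one gene-disease pair, using the a/b/c/d
# definitions from the contingency table above.
a = 30   # sentences with both the gene and the disease
b = 70   # sentences with the gene but not the disease
c = 50   # sentences with the disease but not the gene
d = 850  # sentences with neither

total = a + b + c + d

# Expected co-occurrence count under independence:
# (sentences with gene) * (sentences with disease) / total sentences
expected = (a + b) * (a + c) / total

# One-sided test for enrichment above the expected count.
odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
```

Here the pair co-occurs in 30 sentences against an expectation of 8, so the one-sided p-value flags the pair as enriched.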

Is there a snorkel_labels_train.xlsx file anywhere?

I'd like to utilise these labels for another project. It seems the folder

snorkeling/disease_gene/disease_associates_gene/data/sentences

should also have snorkel_labels_train.xlsx to go along with its test and dev files. Does this exist and if so is there any chance of getting access?

Gold standard of epilepsy-associated genes

@danich1 has been prototyping with extracting epilepsy associated genes. This has been convenient since we don't have to deal with mapping PubTator diseases, which use the MEDIC vocabulary. Additionally, PubTator tags genes using Entrez identifiers, which Hetionet uses as well.

Here is a Cypher query to get a "gold standard" of epilepsy-associated genes from https://neo4j.het.io (adapted from here):

MATCH (disease:Disease)-[assoc:ASSOCIATES_DaG]-(gene:Gene)
WHERE disease.name = 'epilepsy syndrome'
RETURN
 gene.name AS gene_symbol,
 gene.description AS gene_name,
 gene.identifier AS entrez_gene_id,
 assoc.sources AS sources
ORDER BY gene_symbol

There are 399 epilepsy-associated genes. As an aside, these genes are not all guaranteed to be bona fide epilepsy genes. We integrated several databases -- the list is not perfect but it should be good enough.

I downloaded the results as a CSV: epilepsy-associated-genes.csv.txt. @danich1 does this look like it will suit your needs?
