
snorkeling's Introduction

Snorkeling

This repository stores data and code to scale up the extraction of biomedical relationships (i.e. Disease-Gene associations, Compounds binding to Genes, Gene-Gene interactions etc.) from the Pubmed Abstracts.

Deprecation Note

An updated version of this project can be found at greenelab/snorkeling-full-text. All new development takes place in that repository.

Quick Synopsis

This work uses a subset of Hetionet v1 (bolded in the resource schema below), a heterogeneous network that encodes pharmacological and biological information as nodes and edges. The network was built from publicly available data, which is usually populated via manual curation. Manual curation is time consuming and difficult to scale as the rate of publication continues to rise. The recently introduced "data programming" paradigm can circumvent this issue by generating large annotated datasets quickly. This paradigm combines distant supervision with simple rules and heuristics, written as labeling functions, to automatically annotate large datasets. Unfortunately, writing a useful label function takes a significant amount of time and effort. Because of this, we aimed to speed up the process by re-using label functions across edge types. Read the full paper here.

Highlighted edges used in Hetionet v1

Directories

Described below are the main folders for this project. By convention, the folder names are based on the schema shown above.

Name Description
compound_disease Head folder that contains all relationships compounds and diseases may share
compound_gene Head folder that contains all relationships compounds and genes may share
disease_gene Head folder that contains all relationships diseases and genes may share
gene_gene Head folder that contains all relationships genes may share with each other
dependency cluster This folder contains preprocessed results from the "A global network of biomedical relationships derived from text" paper.
figures This folder contains figures for this work
modules This folder contains helper scripts that this work uses
playground This folder contains early code written to test and understand the snorkel package.

Installing/Setting Up The Conda Environment

Snorkeling uses conda as a Python package manager. Before moving on to the instructions below, please make sure it is installed. Download conda here.

Once conda has been installed, type the following command in the terminal:

conda env create --file environment.yml

You can activate the environment by using the following command:

source activate snorkeling

Note: If you want to leave the environment, just enter the following command:

source deactivate 

License

This repository is dual licensed as BSD 3-Clause and CC0 1.0, meaning any repository content can be used under either license. This licensing arrangement ensures source code is available under an OSI-approved license, while non-code content, such as figures, data, and documentation, is maximally reusable under a public domain dedication.

snorkeling's People

Contributors

ajlee21, danich1, dhimmel


snorkeling's Issues

Specifying a conda environment

It would be great to create a conda environment for this project. The environment should use Python 2.7 rather than 3.6 (see snorkel-team/snorkel#509).

In private communication, @ajratner wrote:

In particular, if you're willing to absorb the bumps of code being actively developed, you can check out the dev branch: https://github.com/HazyResearch/snorkel/tree/dev

Also see @ajratner's note here:

However, it's certainly still at a stage where our active involvement -- both for help/bugfixes and active collaboration on code development -- seems to be a big help, in part since the code is evolving so rapidly in response to feedback. We're hoping that it will begin to increasingly stabilize soon; and any feedback you have for us on how we could make it easier to use would be greatly appreciated, either via github issues, a skype call, or on this platform!

@danich1, for these reasons I think we may want to use a dev version of snorkel. You can specify this in the conda environment.yml using a pip installation. Example at http://stackoverflow.com/a/32799944/4651668.
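A minimal sketch of what that could look like (the package list here is illustrative, not the repository's actual environment file):

```yaml
# environment.yml -- illustrative sketch
name: snorkeling
dependencies:
  - python=2.7
  - pip
  - pip:
      # install snorkel's dev branch directly from GitHub via pip
      - git+https://github.com/HazyResearch/snorkel@dev
```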

Planning the Snorkel for whether a compound treats a disease

@danich1 is a new rotation student in the Greene Lab. We were thinking a good rotation project would be to extract medical indications from the literature. The project is intended as a pilot. The ultimate goal is to automate the integration of all biological knowledge into a single hetnet.

We were thinking Compoundā€“treatsā€“Disease relationships were a good place to start for the following reasons:

  1. Comprehensive catalogs of treatments will be crucial for computational drug repurposing approaches.
  2. In Project Rephetio, we physician-curated a gold standard catalog of indications, called PharmacotherapyDB. In addition, we have cataloged treatments under investigation in clinical trials. Both of these catalogs of treatments could be used to create labeling functions.
  3. We shouldn't have to invent tagging methods for diseases and compounds, as there should already be mature/implemented solutions.

Project Rephetio used Disease Ontology as its disease vocabulary and DrugBank as its drug vocabulary. But we're flexible here.

Paging @ajratner and @stephenbach -- creators of Snorkel.

Verifying Experimental Analysis Design

I talked with @dhimmel yesterday and we came up with a design for determining whether or not adding input from a deep learning model (LSTM) is beneficial for predicting relationships between Diseases and Genes.

Background:

project overview

Within the image above we have all disease-gene pair mappings, where some edges are mentioned in PubMed abstracts (noted by the black dashes) and the majority of edges aren't mentioned at all. The edges in green are considered true edges, as they are currently contained in Hetionet v1; the other edges (not highlighted) have the potential to be true Disease-Gene relationships. We aim to classify each edge as either positive (true edge) or negative (false edge), under the hypothesis that using NLP and deep learning (long short-term memory networks, or LSTMs for short) will provide better accuracy than standard methods.

Analysis Design:
To test this hypothesis we plan to use the following design:

Category         Prior               Co-occurrences                                 Natural Language Processing (NLP)
Models           1 model             1 model with sentences, 1 model w/o sentences  1 model with sentences, 1 model w/o sentences
Literature use   Literature unaware  LSTM unaware                                   LSTM aware

The prior category is where we plan to use a model to classify each disease-gene edge without using any information from biomedical literature (hence literature unaware). The co-occurrence category is where we plan to use a model that combines the prior model with information obtained from biomedical literature (e.g. the expected number of sentences that mention a given disease-gene pair, the p-value for each disease-gene edge, how many unique abstracts mention a given disease-gene pair, etc.). Note that this model doesn't use the LSTM and relies only on features extracted from the literature itself; a challenge here will be handling edges that aren't mentioned in the literature at all (the model w/o sentences). Lastly, the NLP category combines the other two models and adds input from a deep learning model (the probability that a sentence is evidence for a true disease-gene relationship). We expect the NLP category model to outperform the models from the other two categories.

Challenges:

  1. What is a fair prior model to use for this analysis?
  2. What do we do about edges that are in the hetnet but aren't mentioned in the literature? How can we classify these edges?

Extracting relationships from Hetionet v1.0

For each relationship we're trying to model, we'll need to extract the Hetionet relationships. Right now we'll be using Hetionet v1.0 relationships as the only knowledge base for each relationship. In the future, we could use multiple resources as knowledge bases for a specific relationship type. Each knowledge base forms its own labeling function.

You can read all Hetionet relationships (with no relationship properties) from this TSV. It's formatted like:

source  metaedge        target
Gene::9021      GpBP    Biological Process::GO:0071357
Gene::51676     GpBP    Biological Process::GO:0098780
Gene::19        GpBP    Biological Process::GO:0055088
Gene::3176      GpBP    Biological Process::GO:0010243
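As a sketch of reading that file, assuming pandas (the inline rows below just mirror the sample above; in practice you would pass the TSV's path or URL to read_csv):

```python
import io

import pandas as pd

# A few rows mirroring the Hetionet edge TSV shown above.
tsv = (
    "source\tmetaedge\ttarget\n"
    "Gene::9021\tGpBP\tBiological Process::GO:0071357\n"
    "Gene::51676\tGpBP\tBiological Process::GO:0098780\n"
)

# In practice, replace io.StringIO(tsv) with the TSV's path or URL.
edges = pd.read_csv(io.StringIO(tsv), sep="\t")

# Filter to a single relationship type (metaedge), e.g.
# Gene-participates-Biological Process.
gpbp = edges[edges["metaedge"] == "GpBP"]
```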

Alternatively, you can make a cypher query for each relationship type to https://neo4j.het.io, like:

MATCH path = (disease:Disease)-[:ASSOCIATES_DaG]->(gene:Gene)
RETURN
  disease.identifier AS disease_id,
  gene.identifier AS gene_id
ORDER BY disease_id, gene_id

You can make these queries programmatically to return pandas DataFrames in Python.
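A hypothetical sketch of that, using the official neo4j Python driver (which driver the project actually used, and the bolt URI below, are assumptions):

```python
import pandas as pd

# The Disease-associates-Gene query from above.
QUERY = """
MATCH path = (disease:Disease)-[:ASSOCIATES_DaG]->(gene:Gene)
RETURN
  disease.identifier AS disease_id,
  gene.identifier AS gene_id
ORDER BY disease_id, gene_id
"""

def fetch_disease_gene_edges(uri="bolt://neo4j.het.io:7687"):
    """Run the Cypher query and return the rows as a pandas DataFrame."""
    # Imported lazily so the module loads without the `neo4j` package installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri)
    with driver.session() as session:
        return pd.DataFrame(session.run(QUERY).data())
```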

Update numbskull

In 66a5deb, we forgot to upgrade numbskull.

Let's update from

git+https://github.com/HazyResearch/numbskull@40ac1af20538c17b4726c963c69adcb81314efa5

to

git+https://github.com/HazyResearch/numbskull@ac52265038bac8edca3f8e930eff34ebaef4c7a0

Conda forge breaks lxml

There is a bug when relying on conda-forge to install the Python module lxml. Right now, if one creates the conda environment and imports lxml.etree, the following error occurs:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libicui18n.so.56: cannot open shared object file: No such file or directory

Therefore, I propose we move lxml to be a pip dependency, since installing it through pip fixes the error.
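A sketch of the proposed change in environment.yml (the surrounding entries are illustrative, not the repository's actual dependency list):

```yaml
dependencies:
  - python=2.7
  # lxml removed from the conda dependencies above...
  - pip
  - pip:
      # ...and installed through pip instead, avoiding the icu
      # shared-library mismatch from the conda-forge build
      - lxml
```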

Scaling Snorkeling To Handle Pubmed

I've got the version of the Snorkeling project from the greenelab repo downloaded and running. See [1] at the bottom for a suggestion about that.

I can't run David's code [2] so I can't test the issue myself. However, I've read the code and have a few questions that will help me in my investigation.

  1. What is the exact problem? Is it that CorpusParser.apply with the default implementation of XMLMultiDocPreparser runs out of memory when loading data from the file /home/davidnicholson/Documents/Data/pubmed_docs.xml?

  2. I heard something about a memory leak. Is it still thought that there is a memory leak? If so, why?

  3. Why was chunking done? It appears that "corpus_parser.apply(xml_parser)", when using XMLMultiDocPreparser from snorkeling/All_Relationships/utils/bigdata_utils.py, should follow the scalable process of reading in one document from the XMLMultiDocPreparser, calling CorpusParserUDF.apply once, dereferencing that document so that it can be garbage collected, and then repeating. The parallel version of corpus_parser.apply should have each process follow this loop independently.

-- Notes --

[1] Not sure if it's just me, but in the future you guys might want to recommend that new developers also do these things:

Add conda-forge to their conda channels.
Command: conda config --add channels conda-forge

Install icu 56.1 from the conda-forge channel.
Command: conda install icu==56.1

I was having trouble loading the library lxml for Snorkel after sourcing the conda env due to a missing shared object from icu version 56. Installing this dependency fixed that.

[2] David's code refers to a Postgres database that I don't have and a path on David's computer that is not in the git repo, /home/davidnicholson/Documents/Data.

Retrieving the pre-processed Snorkel PubMed data

In a private communication, @ajratner wrote:

we have all the pubmed articles pre-processed, tagged with some basic entities (genes, diseases, chemicals, species and mutations), and pre-loaded in Snorkel format, on an internal server; and we'd love to share with you. Do you have any preferred method of transferring large files is? Otherwise I'll figure something out!

@ajratner, awesome. Are chemicals what I'm calling a compound... i.e. a small molecule that could be in DrugBank? What vocabularies are your diseases and chemicals identified in?

How big are the files? How many files are there? The ideal solution would be to use Git LFS. You could fork this repository and create a pull request which adds these files. This would require you to make the files public... and we should consider whether we need to exclude them from the repo's licensing.

Gathering Tagged PubMed Corpora

The goal here is to have a full listing of tagged pubmed abstracts (full text down the road). A few key issues to sort out are:

  • What tagging resources are the best for tagging abstracts? (currently using Pubtator, but skeptical of its scaling abilities)
  • What is the best format to parse these tagged abstracts? (currently using Pubtator's xml format, but amenable to other formats)

If I am missing anything, feel free to let me know: @dhimmel or @cgreene.

Issue when installing from conda

Hello!

I am having the following issue when installing the conda environment from the YML file.

(base) MacBook-Pro-4:snorkeling alexanderli$ conda env create --file environment.yml
Collecting package metadata: done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:
  - gensim=3.8.1 -> python_abi=[build=*_cp37m] -> pypy[version='<0a0']
  - sqlalchemy=1.1.13
Use "conda search <package> --info" to see the dependencies for each package.

Assessing term co-occurrence across sentences

@danich1 has extracted a bunch of sentences that include both a gene and a disease. He's computing summary statistics for gene-disease pairs. One basic measure is the number of sentences with both the gene and the disease. Now we want to compute the expected number of such sentences, given the marginal frequencies of the gene and disease.

Let's take an approach similar to computing MEDLINE term co-occurrence. For each gene-disease pair, you will need to compute the values for a contingency table where:

  • a is the number of sentences containing both the gene and the disease (cooccurrence)
  • b is the number of sentences containing the gene but not the disease
  • c is the number of sentences containing the disease but not the gene
  • d is the number of sentences without either the gene or disease

We should limit ourselves to only sentences that contain a gene and a disease. You'll be able to compute the expected count and the p-value from a Fisher's exact test using code like here.

The expected number of sentences is actually quite easy to compute: multiply the number of sentences with the gene by the number of sentences with the disease, then divide by the total number of sentences (only considering sentences that contain both a gene and a disease).
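With hypothetical counts, and assuming scipy's fisher_exact stands in for the code linked above, the computation looks like:

```python
from scipy.stats import fisher_exact

# Hypothetical counts for one gene-disease pair, using the a/b/c/d
# definitions from the contingency table above.
a = 30   # sentences with both the gene and the disease
b = 70   # sentences with the gene but not the disease
c = 50   # sentences with the disease but not the gene
d = 850  # sentences with neither

total = a + b + c + d

# Expected co-occurrence count under independence:
# (sentences with gene) * (sentences with disease) / total sentences
expected = (a + b) * (a + c) / total

# One-sided test for enrichment above the expected count.
odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
```

Here the pair co-occurs in 30 sentences against an expectation of 8, so the one-sided p-value flags the pair as enriched.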

Is there a snorkel_labels_train.xlsx file anywhere?

I'd like to utilise these labels for another project. It seems the folder

snorkeling/disease_gene/disease_associates_gene/data/sentences

should also have snorkel_labels_train.xlsx to go along with its test and dev files. Does this exist and if so is there any chance of getting access?

Gold standard of epilepsy-associated genes

@danich1 has been prototyping with extracting epilepsy associated genes. This has been convenient since we don't have to deal with mapping PubTator diseases, which use the MEDIC vocabulary. Additionally, PubTator tags genes using Entrez identifiers, which Hetionet uses as well.

Here is a Cypher query to get a "gold standard" of epilepsy-associated genes from https://neo4j.het.io (adapted from here):

MATCH (disease:Disease)-[assoc:ASSOCIATES_DaG]-(gene:Gene)
WHERE disease.name = 'epilepsy syndrome'
RETURN
 gene.name AS gene_symbol,
 gene.description AS gene_name,
 gene.identifier AS entrez_gene_id,
 assoc.sources AS sources
ORDER BY gene_symbol

There are 399 epilepsy-associated genes. As an aside, these genes are not all guaranteed to be bona fide epilepsy genes. We integrated several databases -- the list is not perfect but it should be good enough.

I downloaded the results as a CSV: epilepsy-associated-genes.csv.txt. @danich1 does this look like it will suit your needs?
