BioConceptVec:
creating and evaluating literature-based biomedical concept embeddings on a large scale

Text corpora

We trained BioConceptVec on the entire PubMed corpus. The texts were split and tokenized with NLTK, and all words were lowercased.

Using PubTator to annotate concepts in PubMed

We used PubTator to annotate biomedical concepts in PubMed. The annotations cover genes, mutations, chemicals, diseases and cell lines. The trained embeddings contain over 400,000 concepts.

BioConceptVec: embeddings and concept files

We release four versions of BioConceptVec (cbow, skip-gram, glove and fastText). For each version, we provide both the embedding (which contains concepts and other words) in binary format and a concept-only file in JSON format.

  1. BioConceptVec cbow: embedding (2.4GB) and concept-only (798MB).
  2. BioConceptVec skip-gram: embedding (2.4GB) and concept-only (812MB).
  3. BioConceptVec glove: embedding (2.4GB) and concept-only (835MB).
  4. BioConceptVec fastText: embedding (2.4GB) and concept-only (813MB).
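
As a rough illustration of working with a concept-only file, the sketch below loads a JSON file and ranks concepts by cosine similarity. It assumes the JSON is a single object mapping each concept ID to a list of numbers; that layout, the function names, and the toy concept IDs are assumptions made for this example and are not taken from the release itself. The binary embedding files are not covered here.

```python
import json
import math


def load_concept_vectors(path):
    """Load a concept-only JSON file.

    Assumption: the file holds one JSON object of the form
    {concept_id: [float, float, ...], ...}.
    """
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(vectors, concept_id, top_n=5):
    """Return the top_n (concept_id, similarity) pairs closest to concept_id."""
    query = vectors[concept_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in vectors.items()
        if other_id != concept_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-in for a loaded file; these IDs and values are made up.
toy_vectors = {
    "Gene_A": [0.9, 0.1, 0.0],
    "Gene_B": [0.8, 0.2, 0.1],
    "Disease_X": [0.0, 0.1, 0.9],
}
print(most_similar(toy_vectors, "Gene_A", top_n=2))
```

With a real download, `load_concept_vectors` would be pointed at the local path of the concept-only file. The whole file is read into memory, so expect memory use well above the file sizes listed above.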

Tutorial

See the tutorial for a quick start on using BioConceptVec with both the embedding and the concept-only files.

Datasets

We also make all 9 evaluation datasets publicly available. They cover 4 applications:

  1. Drug-Gene interactions. The dataset contains (1) ID: the instance ID, (2) num_of_genes: the number of genes for this instance, (3) pos_rel_genes: the IDs of related genes, and (4) neg_rel_genes: the IDs of unrelated genes.

  2. Gene-Gene interactions. The 5 gene-gene interaction datasets use the same format as above.

  3. Protein-Protein interactions. There are two datasets: (1) combined: protein-protein interactions created based on STRING combined scores and (2) exp700: protein-protein interactions created based on STRING experimental scores over 700. Both datasets contain train, valid and test files. Each file contains (1) query: query protein ID, (2) subject: subject protein ID, (3) score: STRING score and (4) label: whether it is a protein-protein interaction.

  4. Drug-Drug interactions. This dataset comes from the SemEval-2013 Drug-Drug interaction task. Please see the details there.
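
To make the `pos_rel_genes` / `neg_rel_genes` fields more concrete, here is a minimal sketch of one way an instance could be scored against a set of embeddings: it counts how often a related gene is closer to a query vector than an unrelated gene. This is an illustrative measure, not necessarily the evaluation protocol used in the paper. The function names, the in-memory dict layout of the instance, and the idea that the caller supplies the query vector are all assumptions of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_ranking_accuracy(query_vec, instance, vectors):
    """Fraction of (related, unrelated) gene pairs ranked correctly.

    instance: dict with "pos_rel_genes" and "neg_rel_genes" lists of gene IDs
              (hypothetical in-memory form of one dataset row).
    vectors:  dict mapping gene ID -> embedding vector.
    Genes missing from `vectors` are skipped. Returns None when either
    side has no usable gene.
    """
    pos = [cosine(query_vec, vectors[g]) for g in instance["pos_rel_genes"] if g in vectors]
    neg = [cosine(query_vec, vectors[g]) for g in instance["neg_rel_genes"] if g in vectors]
    if not pos or not neg:
        return None
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))


# Made-up instance and vectors, for illustration only.
toy_vectors = {"gene_1": [1.0, 0.0], "gene_2": [0.0, 1.0]}
toy_instance = {
    "ID": "example_1",
    "num_of_genes": 2,
    "pos_rel_genes": ["gene_1"],
    "neg_rel_genes": ["gene_2"],
}
print(pairwise_ranking_accuracy([1.0, 0.1], toy_instance, toy_vectors))
```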

References

When using our resources, please cite the following paper:

Chen, Q., Lee, K., Yan, S., Kim, S., Wei, C. H., & Lu, Z. (2019). BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. To appear in PLOS Computational Biology.

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.
