
recursive_chemprot

This is the open-source repository for our Recursive Neural Network models for the BioCreative VI track 5 ChemProt task.
The paper is located at: https://academic.oup.com/database/article/doi/10.1093/database/bay060/5042822
Our experiments in the paper are divided into three approaches.
The first approach uses the tree-LSTM model; this approach was used in the ChemProt challenge.
After the challenge, we experimented with two enhancements. The second approach applies additional preprocessing steps to our tree-LSTM model, and the third approach tests the performance of another Recursive Neural Network model, SPINN.

We distribute only the newly preprocessed data (second and third approaches).
The two enhancement approaches are provided in separate directories.

Requirements:

tree-LSTM
 TensorFlow 1.0  (we have not tested other versions; TensorFlow Fold is known to work best with TensorFlow 1.0.0.)
 Tensorflow Fold https://github.com/tensorflow/fold

SPINN
 TensorFlow 1.8  (eager mode is required)

Common
 Python 3.4 or later
  gensim https://radimrehurek.com/gensim/install.html
  sklearn http://scikit-learn.org/stable/install.html
  numpy
  nltk

Data:

 All the data for the models is inside the data folder of each approach. Due to the file sizes, we zipped the files.
 Extract the zipped data files inside the data folder. None of the data is the original challenge data; all of it is pre-processed.
 Word embeddings are not included. Visit: http://evexdb.org/pmresources/vec-space-models/
 The original challenge data is on the BioCreative site: http://www.biocreative.org/resources/corpora/chemprot-corpus-biocreative-vi/ (login required)
 *Update: the BioCreative site has changed to ==> https://biocreative.bioinformatics.udel.edu
 *Update: the original BioCreative site is open again!

Training and Testing:

tree-LSTM
 First, run "saveWEUsedinDataOnly.py" to reduce the size of the word embedding. Our code assumes that the original word-embedding file is located in the "bioWordEmbedding" folder and that the shortened word-embedding file is placed inside the "shorten_bc6" folder.
 Second, run "BC6track5Recursive.py" for the tree-LSTM single-model classifier.  Note: you can switch between train and test modes using the "train" flag.

SPINN
 Note: You may want to delete the zipped files in the data folder before running any code.
 First, run "saveEmbedding.py" to reduce the size of the word embedding. Our code assumes that the original word-embedding file is located in the "bioWordEmbedding" folder and that the shortened word-embedding file is placed inside the "shorten_bc6" folder.
 Second, run "spinn_chemprot.py" for the SPINN single-model classifier.

Extra information:

 Position Embedding implementation - tree-LSTM only
 The relative distance values in the original position embedding range from -21 to 21. When the absolute distance is greater than 5, the same vector is shared in units of 5.
 In the preprocessed test data, the position-embedding range is 0 to 18. First, we merge the absolute distance values that share the same vector, which changes the range to -9 to 9. Second, we add 9 to each distance value, because we do not want unnecessary negative values in the tree nodes.
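 The mapping described above can be sketched as a small helper. This is a minimal sketch rather than the repository's actual code; in particular, the exact bucket boundaries (units of five starting at |d| = 6) are an assumption read from the text.

```python
import math

def distance_to_index(d):
    """Map a relative distance in [-21, 21] to a position-embedding index in [0, 18].

    Assumption: distances with |d| > 5 are merged into buckets of five
    (6-10, 11-15, 16-20, 21), giving a merged range of -9..9, which is
    then shifted by +9 to avoid negative values in the tree nodes.
    """
    d = max(-21, min(21, d))  # clamp to the original range
    if abs(d) > 5:
        sign = 1 if d > 0 else -1
        merged = sign * (5 + math.ceil((abs(d) - 5) / 5))
    else:
        merged = d
    return merged + 9  # shift -9..9 into 0..18
```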

 This code does not include automatic ensembling, because it currently requires manual steps. However, it is easy to implement.
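 One simple way to ensemble manually is to average the per-class probabilities of several trained models and take the argmax. This is a hypothetical sketch of that idea, not code from the repository:

```python
def ensemble_predict(model_probs):
    """Average per-class probabilities from several models and return the argmax class.

    model_probs: a list with one probability list per model,
    each containing one probability per class (six classes in this task).
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    # average each class's probability across all models
    avg = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    # pick the class with the highest averaged probability
    return max(range(n_classes), key=lambda c: avg[c])
```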

Data Format:

tree-LSTM
 After the preprocessing step, we separate each pair into its own line. Each line contains

  • PMID,
  • preprocessed sentence (words are separated by commas),
  • interaction,
  • BioCreative 6 type ("" if negative),
  • the first target entity id,
  • the first target entity name,
  • the second target entity id,
  • the second target entity name,
  • parsed tree of the sentence,
  • and the words of the sentence, separated by commas.
    Each element is separated by "\t".
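Reading such a line back into its ten fields is a single split on tabs. The following is a hypothetical helper; the dictionary keys and the example values in the test are invented for illustration, only the field order comes from the list above.

```python
# Field order as listed in the data-format description above
# (the key names themselves are invented for this sketch).
TREE_LSTM_FIELDS = [
    "pmid", "sentence", "interaction", "bc6_type",
    "entity1_id", "entity1_name", "entity2_id", "entity2_name",
    "tree", "words",
]

def parse_tree_lstm_line(line):
    """Split one preprocessed tree-LSTM line into a dict of its ten fields."""
    fields = line.rstrip("\n").split("\t")
    assert len(fields) == len(TREE_LSTM_FIELDS), "expected 10 tab-separated fields"
    return dict(zip(TREE_LSTM_FIELDS, fields))
```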

Every node in a tree has the following input format.

  • (label/Fsc/Fd1/Fd2 content) The label is the answer label of an entity pair, and all the nodes of a tree share the same label as the root node. The model classifies into six classes. Fsc/Fd1/Fd2 are the subtree-containment (context), position-1, and position-2 features of a node, respectively. A node in a tree can be a leaf or an internal node. The content of a leaf node is an input word; for an internal node, the content consists of the children of the current node.
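A node string of this shape can be read with a small recursive-descent parser. A minimal sketch, assuming the format is exactly `(label/Fsc/Fd1/Fd2 content)` with whitespace-separated children; the example tree in the test is invented:

```python
def parse_tree(s):
    """Parse a "(label/Fsc/Fd1/Fd2 content)" tree string into nested dicts."""
    # pad parentheses with spaces so a plain split tokenizes the string
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "(", "node must start with '('"
        pos += 1
        # the four slash-separated features of the node
        label, fsc, fd1, fd2 = tokens[pos].split("/")
        pos += 1
        node = {"label": label, "fsc": fsc, "fd1": fd1, "fd2": fd2,
                "children": [], "word": None}
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node["children"].append(parse())  # internal node: recurse
            else:
                node["word"] = tokens[pos]        # leaf node: the input word
                pos += 1
        pos += 1  # consume the closing ")"
        return node

    return parse()
```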

SPINN
 After the preprocessing step, we separate each pair into its own line. Each line contains

  • label (0 to 5),
  • parsed tree of the sentence,
  • preprocessed sentence (words are separated by commas),
  • PMID,
  • the first target entity id,
  • the second target entity id.
    Each element is separated by "\t".
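The SPINN lines can be read the same way, with six fields and an integer label. Again a hypothetical helper (key names and example values invented; only the field order comes from the list above):

```python
# Field order as listed in the SPINN data-format description above
# (the key names themselves are invented for this sketch).
SPINN_FIELDS = ["label", "tree", "sentence", "pmid", "entity1_id", "entity2_id"]

def parse_spinn_line(line):
    """Split one preprocessed SPINN line into its six fields; the label becomes an int 0-5."""
    fields = line.rstrip("\n").split("\t")
    assert len(fields) == len(SPINN_FIELDS), "expected 6 tab-separated fields"
    record = dict(zip(SPINN_FIELDS, fields))
    record["label"] = int(record["label"])
    return record
```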

Contributors

arwhirang, sunkyukim
