Agatha Overview

Check out our docs on ReadTheDocs.

We are currently doing a bunch of development around the CORD-19 dataset. These customizations have been funded by an NSF RAPID grant. Follow along with development on Trello.

If you're here looking for the CBAG: Conditional Biomedical Abstract Generation project, take a look in the agatha/ml/abstract_generator submodule.

Install Agatha to use pretrained models

In our paper we present state-of-the-art performance numbers on a range of recent biomedical discoveries drawn from popular biomedical sub-domains. We trained the Agatha system using only data published prior to 2015, and supply the necessary subset of that data in an easy-to-replicate package. Note, the full release is also available for those wishing to tinker further. Here's how to get started.

Setup a conda environment

conda create -n agatha python=3.8
conda activate agatha

Install PyTorch. We need a version >= 1.4, but different systems will require different CUDA library versions. We installed PyTorch using this command:

conda install pytorch cudatoolkit=9.2 -c pytorch
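
Optionally, you can sanity-check the PyTorch install from Python before moving on (a quick check, not an official setup step):

import torch
print(torch.__version__)          # Agatha needs >= 1.4
print(torch.cuda.is_available())  # should be True if the CUDA libs match your driver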

We use protobufs to help configure aspects of the Agatha pipeline. If you don't already have protoc installed, you can pull it in through conda.

conda install -c anaconda protobuf

Install Agatha. This comes along with the dependencies necessary to run the pretrained model. Note, we're aware of a pip warning produced by this install method; we're working on providing an easier pip-installable wheel.

cd <AGATHA_INSTALL_DIR>
git clone https://github.com/JSybrandt/agatha.git .
pip install -e .
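
Optionally, you can verify the editable install from a Python session (a quick sanity check; the printed path should point into your clone):

import agatha
print(agatha.__file__)  # e.g. <AGATHA_INSTALL_DIR>/agatha/__init__.py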

Now we can download the 2015 hypothesis prediction subset. Note, at the time of writing, we only provide a 2015 validation version of Agatha. We are in the process of preparing an up-to-date 2020 version. We recommend using the gdown tool, which is installed along with Agatha, to download our 38.5GB file. If you don't want to use that tool, you can download the same file from your browser via this link. We recommend you place it somewhere within <AGATHA_INSTALL_DIR>/data.

# Remember where you place your file
cd <AGATHA_DATA_DIR>
# This will place 2015_hypothesis_predictor_512.tar.gz in AGATHA_DATA_DIR
gdown --id 1Tka7zPF0PdG7yvGOGOXuEsAtRLLimXmP
# Unzip the download, creates hypothesis_predictor_512/...
tar -zxvf 2015_hypothesis_predictor_512.tar.gz

We can now load the Agatha model in python. After loading, we need to inform the model of where it can find its helper data. By default it looks in the current working directory.

# We need to load the pretrained agatha model.
import torch
model = torch.load("<AGATHA_DATA_DIR>/hypothesis_predictor_512/model.pt")

# We need to tell the model about its helper data.
model.set_data_root("<AGATHA_DATA_DIR>/hypothesis_predictor_512/")

# We need to set up the internal data structures around that helper data.
model.init()

# Now we can run queries specifying two UMLS terms! Note, this process has
# some random sampling involved, so your result might not look exactly like
# what we show here.
# Keywords:
#   Cancer: C0006826
#   Tobacco: C0040329
model.predict_from_terms([("C0006826", "C0040329")])
>>> [0.78358984]

# If you want to run loads of queries, we recommend first using
# model.init_preload(), and then the following syntax. Note that
# predict_from_terms will automatically compute in batches of size:
# model.hparams.batch_size.
queries = [("C###", "C###"), ("C###", "C###"), ..., ("C###", "C###")]
model = model.eval()
model = model.cuda()
with torch.no_grad():
  predictions = model.predict_from_terms(queries)
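
If your GPU has memory to spare, you may be able to improve throughput by raising the batch size before querying. This is a hypothetical tuning step, assuming the hparams namespace in your checkpoint is writable:

# Larger batches mean fewer, bigger GPU calls; tune to your GPU memory.
model.hparams.batch_size = 64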

Replicate the 2015 Validation Experiments

Provided in ./benchmarks are the files we use to produce the results found in our paper. Using the 2015 pretrained model, you should be able to replicate these results. This guide focuses on the recommendation experiments, wherein all pairs of elements from among the 100 most popular new predicates per sub-domain are evaluated by the Agatha model. For each of the 20 considered types, we generated all pairs and removed any pair that is trivially discoverable from within the Agatha semantic graph. The result is a list of predicates in the following JSON file: ./benchmarks/all_pairs_top_20_types.json

The JSON predicate file has the following schema:

{
  "<type1>:<type2>": [
    {
      "source": "<source keyword>",
      "target": "<target keyword>",
      "label": 0 or 1
    },
    ...
  ],
  ...
}
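
For illustration only, a single populated entry might look like the following. The type pair and CUIs here are invented; the real names come from the file itself:

{
  "gngm:dsyn": [
    {"source": "C0012345", "target": "C0023456", "label": 1},
    {"source": "C0012345", "target": "C0034567", "label": 0}
  ]
}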

Here's how to load the pretrained model and evaluate the provided set of predicates:

import torch
import json

# Load the pretrained model
model = torch.load("<AGATHA_DATA_DIR>/hypothesis_predictor_512/model.pt")
# Configure the helper data
model.set_data_root("<AGATHA_DATA_DIR>/hypothesis_predictor_512")
# Initialize the model for batch processing
model.init_preload()


# Load the JSON predicate file
with open("<AGATHA_INSTALL_DIR>/benchmarks/all_pairs_top_20_types.json") as file:
  types2predicates = json.load(file)

# prepare model
model = model.eval()
model = model.cuda()
with torch.no_grad():

  # Predict ranking criteria for each predicate
  types2predictions = {}
  for typ, predicates in types2predicates.items():
    types2predictions[typ] = model.predict_from_terms([
      (pred["source"], pred["target"])
      for pred in predicates
    ])

Note that the order of the resulting scores will be the same as the order of the input predicates per type. Using the label field of each predicate, we can then compare how the ranking criteria correlate with the true connections (label 1) and the undiscovered connections (label 0).
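
For example, one simple way to quantify that correlation is a per-type ROC AUC. The sketch below reuses the types2predicates and types2predictions dicts from the snippet above; scikit-learn is our assumption here, not an Agatha dependency:

# A minimal sketch: score how well predictions rank true pairs above undiscovered ones.
from sklearn.metrics import roc_auc_score

for typ, predicates in types2predicates.items():
  labels = [pred["label"] for pred in predicates]  # 1 = true, 0 = undiscovered
  scores = types2predictions[typ]                  # same order as the inputs
  print(typ, roc_auc_score(labels, scores))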

Installing Agatha for Development

These instructions are useful if you want to customize Agatha, especially if you are also running this system on the Clemson Palmetto Cluster. This guide also assumes that you have already installed anaconda3.

Step zero. Get yourself a node made in the last few years with a decent GPU. Currently supported GPUs on Palmetto include the P100 and the V100. Recent changes to PyTorch are incompatible with older GPU models.

The recommended node request is:

qsub -I -l select=5:ncpus=40:mem=365gb:ngpus=2:gpu_model=v100,walltime=72:00:00

First, load the following modules:

module load gcc/8.3.0          \
            cuDNN/9.2v7.2.1    \
            sqlite/3.21.0      \
            cuda-toolkit/9.2   \
            nccl/2.4.2-1       \
            hdf5/1.10.5        \
            mpc/0.8.1

Now follow the above list of installation instructions, beginning with creating a conda environment, through cloning the git repo, and ending with pip install -e ..

At this point, we can install all the additional dependencies required to construct the Agatha semantic graph and train the transformer model. To do so, return to <AGATHA_INSTALL_DIR> and install requirements.txt.

cd <AGATHA_INSTALL_DIR>
# Installs the developer requirements
pip install -r requirements.txt

Now you should be ready to roll! I recommend you create the following file in order to handle all the module loading and preparation.

# Remove current modules (if any)
module purge
# Leave current conda env (if any)
conda deactivate
# Load all necessary Palmetto modules
module load gcc/8.3.0 mpc/0.8.1 cuda-toolkit/9.2 cuDNN/9.2v7.2.1 nccl/2.4.2-1 \
            sqlite/3.21.0 hdf5/1.10.5
# Include hdf5, needed to build tools
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/software/hdf5/1.10.5/include
# Load python modules
conda activate agatha
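
If you save this as, say, run_agatha_env.sh (the filename is arbitrary), remember to source it rather than execute it, e.g. source run_agatha_env.sh, so that the module and conda changes apply to your current shell.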

Agatha's Issues

Could not find config.proto or config_pb2 for abstract generator

Hi,

I was trying to run this repo but could not find the config file (Python class or proto). Hope you can add it.

abstract_generator/generation_util.py", line 4, in
from agatha.config import config_pb2 as cpb
ModuleNotFoundError: No module named 'agatha.config'

ModuleNotFoundError: No module named 'agatha.ml.util.embedding_index'

Hi,
When following the README to load the Agatha model with
model = torch.load("<AGATHA_DATA_DIR>/hypothesis_predictor_512/model.pt")
it prompts
ModuleNotFoundError: No module named 'agatha.ml.util.embedding_index'.
I looked in the directory agatha/ml/util/ and there is no file named embedding_index.
Could you tell me how to solve it?
Thanks

Can't remote download pretrained model

When I try using gdown to get the pretrained model, I get the following error:

Access denied with the following error:

        Cannot retrieve the public link of the file. You may need to change
        the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

         https://drive.google.com/uc?id=1Tka7zPF0PdG7yvGOGOXuEsAtRLLimXmP 

I can download it from the link, but would prefer to be able to use gdown since I'm working on a cluster, and it's a very large model.

ModuleNotFoundError: No module named 'agatha.ml.abstract_generator.sentencepiece_pb2'

Hi,

I am trying to run the abstract-generation code from the ReadMe page.

However, I could not even properly import dependencies:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-d9a5f573c05f> in <module>
      1 import torch
----> 2 from agatha.ml.abstract_generator import generation_util
      3 
      4 # Load model from the current working directory
      5 model = torch.load("model/model.pt")

/reproducibility/agatha/agatha/ml/abstract_generator/__init__.py in <module>
----> 1 from .abstract_generator import AbstractGenerator

/reproducibility/agatha/agatha/ml/abstract_generator/abstract_generator.py in <module>
----> 1 from agatha.ml.abstract_generator import datasets
      2 from agatha.ml.util.lamb_optimizer import Lamb
      3 from agatha.ml.abstract_generator.tokenizer import AbstractGeneratorTokenizer
      4 from agatha.ml.util.kv_store_dataset import KVStoreDictDataset
      5 from argparse import Namespace

/reproducibility/agatha/agatha/ml/abstract_generator/datasets.py in <module>
      1 import torch
      2 from torch.utils.data import Dataset
----> 3 from agatha.ml.abstract_generator.tokenizer import AbstractGeneratorTokenizer
      4 import random
      5 from typing import Dict, Tuple, List, Any, Set

/reproducibility/agatha/agatha/ml/abstract_generator/tokenizer.py in <module>
      1 # from agatha.ml.abstract_generator.sentencepiece import SentencePieceText
----> 2 from agatha.ml.abstract_generator.sentencepiece_pb2 import SentencePieceText
      3 from agatha.util.misc_util import Record
      4 from pathlib import Path
      5 from typing import Any, Dict, List

ModuleNotFoundError: No module named 'agatha.ml.abstract_generator.sentencepiece_pb2'

There is indeed no module (file) "agatha.ml.abstract_generator.sentencepiece_pb2" in the repo. Instead, there is a "sentencepiece.proto" file. Unfortunately, changing the import statement from "agatha.ml.abstract_generator.sentencepiece_pb2" to "sentencepiece.proto" did not help.

Could you please point out how to replace the import of the non-existent file and thus avoid the error?

Many thanks!

Can't remotely download the pretrained model

When I try using gdown to get the pretrained model, I get the following error:
PS D:\Agatha\agatha\data> gdown --id 1Tka7zPF0PdG7yvGOGOXuEsAtRLLimXmP
D:\anaconda3\Lib\site-packages\gdown\__main__.py:132: FutureWarning: Option --id was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
warnings.warn(
D:\anaconda3\Lib\site-packages\gdown\download.py:32: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
soup = bs4.BeautifulSoup(line, features="html.parser")
Failed to retrieve file url:

    Cannot retrieve the public link of the file. You may need to change
    the permission to 'Anyone with the link', or have had many accesses.
    Check FAQ in https://github.com/wkentaro/gdown?tab=readme-ov-file#faq.

You may still be able to access the file from the browser:

    https://drive.google.com/uc?id=1Tka7zPF0PdG7yvGOGOXuEsAtRLLimXmP

but Gdown can't. Please check connections and permissions.
I can download it from the link, but would prefer to be able to use gdown since I'm working on a cluster, and it's a very large model.

Looking for the decoder alone

Hi, very happy to find an off-the-shelf biomed decoder. I am trying to use the decoder alone for a generation task and was wondering how I can access the decoder weights, and how to finetune the decoder on other data if necessary. Which Python file contains the decoder model?
