
docproduct's Introduction


Doc Product: Medical Q&A with Deep Language Models

Collaboration between Santosh Gupta, Alex Sheng, and Junpeng Ye

Download trained models and embedding file here.

Winner: Top 6 Finalist of the ⚡#PoweredByTF 2.0 Challenge! https://devpost.com/software/nlp-doctor. Doc Product will be presented to the TensorFlow Engineering Team at TensorFlow Connect. Stay tuned for details.

We wanted to use TensorFlow 2.0 to explore how well state-of-the-art natural language processing models like BERT and GPT-2 could respond to medical questions by retrieving and conditioning on relevant medical data, and this is the result.

DISCLAIMER

The purpose of this project is to explore the capabilities of deep learning language models for scientific encoding and retrieval. IT SHOULD NOT BE USED FOR ACTIONABLE MEDICAL ADVICE.

How we built Doc Product

As a group of friends with diverse backgrounds, ranging from broke undergrads to data scientists to top-tier NLP researchers, we drew inspiration for our design from many different areas of machine learning. By combining the power of transformer architectures, latent vector search, negative sampling, and generative pre-training within TensorFlow 2.0's flexible deep learning framework, we were able to come up with a novel solution to a difficult problem that at first seemed like a herculean task.

  • 700,000 medical questions and answers scraped from Reddit, HealthTap, WebMD, and several other sites
  • Fine-tuned TF 2.0 BERT with pre-trained BioBERT weights for extracting representations from text
  • Fine-tuned TF 2.0 GPT-2 initialized with OpenAI's 117M-parameter GPT-2 weights for generating answers to new questions
  • Network heads for mapping question and answer embeddings to metric space, implemented as tf.keras.Model feed-forward networks
  • Over a terabyte of TFRecord, CSV, and checkpoint data

If you're interested in the whole story of how we built Doc Product and the details of our architecture, take a look at our GitHub README!

Challenges

Our project was fraught with too many challenges to count, from compressing astronomically large datasets, to re-implementing the entirety of BERT in TensorFlow 2.0, to running GPT-2 with 117 million parameters in Colaboratory, to rushing to get the last parts of our project ready with a few hours left until the submission deadline. Oddly enough, the biggest challenges often came when we disagreed about the direction the project should take. But although we'd disagree about the best course of action, we all shared the same end goal of building something meaningful and potentially valuable for a lot of people. Because of that, we were always eventually able to sit down, come to an agreement, and, with each other's support and late-night pep talks over Google Hangouts, rise to the challenges and overcome them together.

What's next?

Although Doc Product isn't ready for widespread commercial use, its surprisingly good performance shows that advancements in general language models like BERT and GPT-2 have made previously intractable problems like medical information processing accessible to deep NLP-based approaches. Thus, we hope that our work serves to inspire others to tackle these problems and explore the newly open NLP frontier themselves.

Nevertheless, we still plan to continue work on Doc Product, specifically expanding it to take advantage of the 345M, 762M, and 1.5B parameter versions of GPT-2 as OpenAI releases them as part of their staged release program. We also intend to continue training the model, since we still have quite a bit more data to go through.

NOTE: We are currently working on research in scientific/medical NLP and information retrieval. If you're interested in collaborating, shoot us an e-mail at [email protected]!

Try it out!

Install from pip

You can install Doc Product directly from pip and run it on your local machine. Here's the code to install Doc Product, along with TensorFlow 2.0 and FAISS:

# Download a prebuilt CPU FAISS package from Anaconda and install it manually
!wget https://anaconda.org/pytorch/faiss-cpu/1.2.1/download/linux-64/faiss-cpu-1.2.1-py36_cuda9.0.176_1.tar.bz2
# To use GPU FAISS instead:
# !wget https://anaconda.org/pytorch/faiss-gpu/1.2.1/download/linux-64/faiss-gpu-1.2.1-py36_cuda9.0.176_1.tar.bz2
!tar xvjf faiss-cpu-1.2.1-py36_cuda9.0.176_1.tar.bz2
# Copy the extracted packages into Colab's site-packages directory
!cp -r lib/python3.6/site-packages/* /usr/local/lib/python3.6/dist-packages/
!pip install mkl

# Install TensorFlow 2.0 and Doc Product itself
!pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow as tf
!pip install https://github.com/Santosh-Gupta/DocProduct/archive/master.zip

Our repo contains scripts for generating .tfrecords data, training Doc Product on your own Q&A data, and running Doc Product to get answers to medical questions. Please see the Google Colaboratory demos section below for code samples to load data/weights and run our models.
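For orientation, here is a minimal sketch of what using the installed package looks like. The class name comes from our Colab demos, but the constructor and predict arguments shown here are assumptions rather than the authoritative API, so defer to the demo notebooks for the exact call and file paths.

from docproduct.predictor import RetreiveQADoc

# Placeholder path; point it at the downloaded BioBERT/model files.
pretrained_path = 'BioBertFolder/biobert_v1.0_pubmed_pmc/'

# NOTE: the arguments below are illustrative assumptions, not the exact
# released signature -- see the Colab demos for the real parameters.
doc = RetreiveQADoc(pretrained_path=pretrained_path)
print(doc.predict('my eyes hurt when I look at bright light',
                  search_by='answer', topk=5, answer_only=True))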

Colaboratory demos

Take a look at our Colab demos! We plan on adding more demos as we go, allowing users to explore more of the functionalities of Doc Product. All new demos will be added to the same Google Drive folder.

The demos include code for installing Doc Product via pip, downloading/loading pre-trained weights, and running Doc Product's retrieval functions and fine-tuning on your own Q&A data.

Run our interactive retrieval model to get answers to your medical questions

https://colab.research.google.com/drive/11hAr1qo7VCSmIjWREFwyTFblU2LVeh1R

Train your own medical Q&A retrieval model

https://colab.research.google.com/drive/1Rz2rzkwWrVEXcjiQqTXhxzLCW5cXi7xA

[Experimental] Run the full Doc Product pipeline with BERT, FCNN, FAISS, and GPT-2 to get your medical questions answered by state-of-the-art AI.

The end-to-end Doc Product demo is still experimental, but feel free to try it out! https://colab.research.google.com/drive/1Bv7bpPxIImsMG4YWB_LWjDRgUHvi7pxx

What it does

Our BERT model has been trained to encode medical questions and medical information. A user can type in a medical question, and our model will retrieve the medical information most relevant to that question.

Data

We created datasets from several medical question-and-answer forums: WebMD, HealthTap, eHealthForums, iClinic, Question Doctors, and Reddit.com/r/AskDocs.

Architecture

The architecture consists of a fine-tuned BioBERT model (the same one for both questions and answers) that converts text input into an embedding representation. That embedding is then fed into an FCNN (a separate one for questions and for answers) to produce an embedding used for similarity lookup. The most similar questions and answers are then used by GPT-2 to generate an answer. The full architecture is shown below.
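Before walking through the diagram, here is a minimal sketch of the similarity-lookup step using FAISS. It assumes the FCNN answer embeddings have already been computed; the shapes, variable names, and index choice are illustrative (the actual lookup lives in docproduct/predictor.py).

import faiss
import numpy as np

d = 768                                                          # embedding dimension (illustrative)
answer_embeddings = np.random.rand(10000, d).astype('float32')   # stand-in FCNN outputs

# Normalize so that inner product equals cosine similarity, then build a flat index.
faiss.normalize_L2(answer_embeddings)
index = faiss.IndexFlatIP(d)
index.add(answer_embeddings)

# Embed the user's question with BioBERT + the question FCNN, then search.
question_embedding = np.random.rand(1, d).astype('float32')      # stand-in query vector
faiss.normalize_L2(question_embedding)
scores, top_k_ids = index.search(question_embedding, 10)         # indices of the 10 best answers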

Let's take a look at the first half of the diagram above in more detail: the training of BERT and the FCNNs. A detailed figure of this part is shown below.

During training, we take a batch of medical questions and their corresponding medical answers and convert them to BioBERT embeddings. The same BERT weights are used for both the questions and the answers.

DoctorBert

These embeddings are then fed into an FCNN layer. There are separate FCNN layers for the question and answer embeddings. To recap: we use the same weights in the BERT layer, but the questions and answers each have their own separate FCNN layer.

DoctorBert

Now here's where things get a little tricky. Embedding similarity training usually involves negative samples, as in word2vec's NCE loss. However, we cannot use NCE loss in our case, since the embeddings are generated fresh at each step and the weights change with every training step.

So instead of NCE loss, we compute the dot product of every combination of question and answer embeddings within the batch. This is shown in the figure below.

DoctorBert

Then, a softmax is taken across the rows; for each question, all of its answer combinations are softmaxed.

DoctorBert

Finally, the loss is cross-entropy: the softmaxed matrix is compared to a ground-truth matrix in which the correct combinations of questions and answers are labeled '1' and all other combinations are labeled '0'.
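In TensorFlow 2.0 terms, the whole loss boils down to a matrix multiply, a row-wise softmax, and a cross entropy against the identity matrix. Below is a minimal sketch, assuming the FCNN outputs for a batch have already been computed; it illustrates the idea rather than reproducing our exact training code.

import tensorflow as tf

def in_batch_crossentropy_loss(q_embeddings, a_embeddings):
    # q_embeddings, a_embeddings: [batch_size, dim]; row i of each is a true pair.
    # logits[i, j] = dot product of question i with answer j.
    logits = tf.matmul(q_embeddings, a_embeddings, transpose_b=True)
    # Ground truth: '1' on the diagonal (correct pairs), '0' everywhere else.
    labels = tf.eye(tf.shape(logits)[0])
    # Softmax across each row + cross entropy against the identity labels.
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))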

Technical Obstacles we ran into

Data Gathering and Wrangling

The data gathering was tricky because the formatting of the different medical sites varied significantly. Custom work needed to be done for each site in order to pull questions and answers from the correct portion of the HTML tags. Some of the sites also allowed multiple doctors to respond to a single question, so we needed a method of gathering multiple responses to individual questions. To deal with this, we created a separate row for every question-answer pair.

From here we needed to run the data through BERT and store the outputs from one of the final layers in order to produce BioBERT embeddings we could pass through the dense layers of our feed-forward neural network (FFNN). 768-dimensional vectors were stored for both the questions and the answers and concatenated with the corresponding text in a CSV file. We tried various formats for more compact and faster loading and sharing, but CSV ended up being the easiest and most flexible method. After the BioBERT embeddings were created and stored, we ran the similarity training process and produced FFNN embeddings that capture the similarity of questions to answers. These were also stored, along with the BioBERT embeddings and source text, for later visualization and querying.
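As a rough illustration of that storage format (the column names and example rows below are made up for this sketch; the released CSVs may differ), each row keeps the raw question/answer text next to its 768-dimensional embedding serialized as text:

import numpy as np
import pandas as pd

questions = ["My eyes hurt in bright light.", "I have a persistent dry cough."]
answers = ["This sounds like photophobia ...", "A dry cough can be caused by ..."]
q_vecs = np.random.rand(len(questions), 768)   # stand-in BioBERT embeddings
a_vecs = np.random.rand(len(answers), 768)

df = pd.DataFrame({
    "question": questions,
    "answer": answers,
    # Each 768-dimensional vector is flattened to a comma-separated string so
    # the whole dataset stays a single flat, easily shareable CSV.
    "question_bert_embedding": [",".join(map(str, v)) for v in q_vecs],
    "answer_bert_embedding": [",".join(map(str, v)) for v in a_vecs],
})
df.to_csv("qa_embeddings.csv", index=False)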

Combining Models Built in TF 1.X and TF 2.0

The embedding models are built in TF 2.0, which lets us take advantage of the flexibility of eager execution. However, the GPT-2 model that we use is built in TF 1.X. Luckily, we can train the two models separately. At inference time, we need to disable eager execution with tf.compat.v1.disable_eager_execution and maintain two separate sessions. We also need to manage the GPU memory of the two sessions to avoid OOM errors.
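A simplified sketch of this setup (the real wiring is in docproduct/predictor.py; the session configuration shown here is an illustrative choice, not the exact values we use):

import tensorflow as tf

# The GPT-2 code is TF 1.x graph-based, so eager execution is turned off at
# inference time and each model gets its own session.
tf.compat.v1.disable_eager_execution()

# Let each session grab GPU memory incrementally instead of all at once,
# so the retrieval model and GPT-2 can live side by side without OOM.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True

embedding_sess = tf.compat.v1.Session(config=config)   # BERT + FCNN retriever
gpt2_sess = tf.compat.v1.Session(config=config)        # GPT-2 generator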

Accomplishments that we're proud of

Robust Model with Careful Loss and Architecture Design

One obvious approach to retrieving answers based on a user's question is to use a powerful encoder (BERT) to encode the input question and the questions in our database, and then do a similarity search. There is no training involved, and the performance of this approach relies entirely on the encoder. Instead, we use separate feed-forward networks for questions and answers and calculate the cosine similarity between them. Inspired by the negative sampling of the word2vec paper, we treat the other answers in the same batch as negative samples and calculate a cross-entropy loss. This approach pushes the question and answer embeddings of a matching pair as close together as possible in terms of Euclidean distance. It turns out that this yields more robust results than doing similarity search directly on the BERT embedding vectors.

High-performance Input Pipeline

BERT preprocessing is complicated, and we have around 333K QA pairs and over 30 million tokens in total. Since shuffling is very important for our training, the shuffle buffer needs to be sufficiently large to train the model properly. It used to take over 10 minutes to preprocess the data before training could start in each epoch. So we used tf.data and TFRecords to build a high-performance input pipeline. After the optimization, it only takes around 20 seconds to start training, with no GPU idle time.
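A condensed sketch of such a pipeline (the feature names, sequence length, filename, and batch size below are illustrative, not the exact ones we used):

import tensorflow as tf

feature_spec = {
    'input_ids': tf.io.FixedLenFeature([256], tf.int64),
    'segment_ids': tf.io.FixedLenFeature([256], tf.int64),
}

def parse_fn(serialized):
    # Decode one serialized QA example back into tensors.
    return tf.io.parse_single_example(serialized, feature_spec)

dataset = (tf.data.TFRecordDataset(['train.tfrecord'])
           .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(buffer_size=500000)      # large buffer for proper shuffling
           .batch(64)
           .prefetch(tf.data.experimental.AUTOTUNE))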

Another problem with BERT preprocessing is that it pads all data to a fixed length, so for short sequences a lot of computation and GPU memory is wasted. This matters especially with big models like BERT. So we rewrote the BERT preprocessing code to make use of tf.data.experimental.bucket_by_sequence_length, which buckets sequences of different lengths and pads them dynamically. By doing this, we achieved a longer maximum sequence length and faster training.
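A sketch of the bucketing transformation; the toy dataset, bucket boundaries, and batch sizes here are illustrative choices, not our production configuration.

import tensorflow as tf

# Toy dataset of variable-length token-id sequences, standing in for the
# parsed-but-unpadded QA examples.
sequences = [[101, 2023, 102], [101] + [2023] * 150 + [102]]
dataset = tf.data.Dataset.from_generator(
    lambda: ({'input_ids': s} for s in sequences),
    output_types={'input_ids': tf.int32},
    output_shapes={'input_ids': [None]})

bucketing = tf.data.experimental.bucket_by_sequence_length(
    element_length_func=lambda ex: tf.shape(ex['input_ids'])[0],
    bucket_boundaries=[64, 128, 256],        # group sequences by length
    bucket_batch_sizes=[128, 64, 32, 16],    # larger batches for shorter sequences
    padded_shapes={'input_ids': [None]},     # pad only to the longest in each batch
)
dataset = dataset.apply(bucketing)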

Imperative BERT Model

After some modification, Keras-Bert is able to run in a TF 2.0 environment. However, when we tried to use Keras-Bert as a sub-model in our embedding models, we found the following two problems.

  • It uses the functional API. The functional API is very flexible, but it is still symbolic. That means that even with eager execution enabled, we still cannot use traditional Python debugging methods at run time. In order to fully utilize the power of eager execution, we need to build the model by subclassing tf.keras.Model.
  • We do not directly use the input layer of Keras-Bert and ran into this issue. It's not easy to avoid the bug without changing our input pipeline.

As a result, we decided to re-implement an imperative version of BERT. We used some components of Keras-Bert (multi-head attention, checkpoint weight loading, etc.) and wrote the call method of BERT ourselves. Our implementation is easier to debug and is compatible with both the flexible eager mode and the high-performance static graph mode.
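The difference is essentially Keras's subclassing style: the model is a plain Python class whose call method you can step through in eager mode. Below is a toy sketch of that style, not our actual BERT code (which lives in docproduct/bert.py).

import tensorflow as tf

class TinyEncoder(tf.keras.Model):
    """Toy imperative encoder illustrating the subclassing style."""

    def __init__(self, vocab_size=30522, hidden_size=768, **kwargs):
        super().__init__(**kwargs)
        self.embedding = tf.keras.layers.Embedding(vocab_size, hidden_size)
        self.pooler = tf.keras.layers.Dense(hidden_size, activation='tanh')

    def call(self, inputs):
        input_ids, input_mask = inputs
        x = self.embedding(input_ids)
        mask = tf.cast(input_mask, x.dtype)[:, :, tf.newaxis]
        # Masked mean pooling; in eager mode you can print or debug any of these tensors.
        pooled = tf.reduce_sum(x * mask, axis=1) / tf.reduce_sum(mask, axis=1)
        return self.pooler(pooled)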

Answer Generation with Auxiliary Inputs

Users may experience multiple symptoms under various conditions, which means the ideal answer might be a combination of several answers. To tackle this, we make use of the powerful GPT-2 model and feed it the user's question along with the top-K auxiliary answers retrieved from our data. The GPT-2 model then conditions on the question and the top-K answers to generate a better answer. To properly train the GPT-2 model, we create the training data as follows: for every question in our dataset, we do a similarity search to obtain the top K+1 answers, use the original answer as the target, and use the other answers as auxiliary inputs. This gives us the same amount of GPT-2 training data as embedding-model training data.
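Concretely, each GPT-2 training example can be thought of as the question, the retrieved auxiliary answers, and the original answer concatenated into one sequence, with the original answer as the generation target. The sketch below uses made-up delimiter strings; the actual formatting tokens in our preprocessing may differ.

def build_gpt2_example(question, auxiliary_answers, target_answer):
    # Question + top-K retrieved answers form the conditioning context;
    # the original answer is what GPT-2 learns to generate.
    context = "QUESTION: " + question + " "
    for aux in auxiliary_answers:
        context += "ANSWER: " + aux + " "
    return context + "TARGET: " + target_answer

example = build_gpt2_example(
    "My eyes hurt when I look at bright light. What should I do?",
    ["Light sensitivity can be caused by ...", "Try reducing screen brightness and ..."],
    "You may be experiencing photophobia; consider seeing an ophthalmologist ...",
)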

What we learned

BERT is fantastic for encoding medical questions and answers and for developing robust vector representations of them.

We trained a fine-tuned version of our model initialized with Naver's BioBERT. We also trained a version where the BioBERT weights were frozen and only the two FCNNs for the questions and answers were trained. While we expected the fine-tuned version to work well, we were surprised at how robust the latter was. This suggests that BioBERT has an innate ability to encode the meaning of medical questions and answers.
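For reference, the frozen variant works out to something like the sketch below. The encoder and FCNN heads here are stand-in layers and the loss is a dummy; only the shape of the idea (freeze the encoder, apply gradients to the heads) matters.

import tensorflow as tf

bert_encoder = tf.keras.Sequential([tf.keras.layers.Dense(768)], name='biobert_stub')
q_ffn = tf.keras.Sequential([tf.keras.layers.Dense(768)], name='question_ffn')
a_ffn = tf.keras.Sequential([tf.keras.layers.Dense(768)], name='answer_ffn')

bert_encoder.trainable = False      # frozen variant: BioBERT weights stay fixed

features = tf.random.normal([8, 768])              # stand-in pooled BERT inputs
with tf.GradientTape() as tape:
    shared = bert_encoder(features)                # no gradients flow into the encoder
    loss = tf.reduce_mean(tf.square(q_ffn(shared) - a_ffn(shared)))   # dummy loss
train_vars = q_ffn.trainable_variables + a_ffn.trainable_variables
grads = tape.gradient(loss, train_vars)
tf.keras.optimizers.Adam(1e-4).apply_gradients(zip(grads, train_vars))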

What's next for Information Retrieval w/BERT for Medical Question Answering

Explore whether there is any practical use for this project outside of research/exploratory purposes. A model like this should not be used by the public for obtaining medical information, but perhaps it could be used by trained/licensed medical professionals to gather information for vetting.

Explore applying the same method to other domains (e.g. history information retrieval, engineering information retrieval, etc.).

Explore how the recently released SciBERT (from Allen AI) compares against Naver's BioBERT.

Thanks!

We give our thanks to the TensorFlow team for providing the #PoweredByTF2.0 Challenge as a platform through which we could share our work with others, and a special thanks to Dr. Llion Jones, whose insights and guidance had an important impact on the direction of our project.

docproduct's People

Contributors

ash3n, fostiropoulos, jayyip, sanketg10, santosh-gupta


docproduct's Issues

Training Data

Hello,

Is it possible to get the training data used to train the DocProduct? I mean all the scraped data.

generating FFNNEmbed csv files

hello,
How do we derive the FFNNEmbed CSV files for a new dataset?
Do we use QAEmbed, train_data_to_embedding, or keras-bert feature extraction to derive the FFNN embeddings?
Thanks

Downloading the Model to my drive FAILED

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

model.save('model.h5')
model_file = drive.CreateFile({'title' : 'model.h5'})
model_file.SetContentFile('model.h5')
model_file.Upload()

drive.CreateFile({'id': model_file.get('id')})

I used this code to download the model, but it failed.

ERROR :-

NameError: name 'model' is not defined

ImportError: cannot import name 'dense_features'

Unable to import the package. The solution seems to be to update TF, but then it throws an error that DocProduct doesn't support tf==2.1.

Failed to load GPU Faiss: No module named 'faiss.swigfaiss_gpu'
Faiss falling back to CPU-only.

ImportError Traceback (most recent call last)
in ()
1 #@title Import and initalize the fine-tuned GPT2 generator. Double-click to see code.
2
----> 3 from docproduct.predictor import GenerateQADoc
4
5 pretrained_path = 'BioBertFolder/biobert_v1.0_pubmed_pmc/'

8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/canned/dnn.py in ()
21 import six
22
---> 23 from tensorflow.python.feature_column import dense_features
24 from tensorflow.python.feature_column import dense_features_v2
25 from tensorflow.python.feature_column import feature_column

ImportError: cannot import name 'dense_features'


NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

System requirements for pandas.load_parquet

My system crashes when I load pandas.load_parquet(bertffn_crossentropy.zip).
Can you tell me the system specifications needed to load this file?
I'm running this on my laptop.

Google Colab not running

Hello Team,
I am unable to run the Google Colab notebook. Can you please tell me where I am going wrong?

Apparently TensorFlow has issues with the version. I am getting this issue when I run the "Load model weights and Q&A data. Double click to see code" cell.

AttributeError: module 'tensorflow' has no attribute 'compat'

Can you please help me save the model?

I am having a problem saving the model in this format, so can you please help me with that in the demo you created? This is the format:

├── checkpoint
├── frozen_inference_graph.pb
├── model.ckpt.data-00000-of-00001
├── model.ckpt.index
├── model.ckpt.meta
└── saved_model (folder)

├── saved_model.pb
└── variables

possible to upload dataset on Kaggle?

Hi, it is really astounding work to curate such data, but I think it would be even more beneficial for the community if you could add the dataset to Kaggle.

Need help understanding "Train Your Own QA Models" Tutorial

Hi all,
First of all, thank you so much for releasing such brilliant work. I need your help in understanding the tutorial Jupyter notebook for training our own QA model. Before I tried to train "our own data", I managed to run DocProductPresentation successfully (as well as download all the necessary files).

To train our own data, I downloaded "sampleData.csv" and the Train_Your_Own_QA notebook file. When I tried to run the training, I noticed that I got an OOM error (my GPU is an RTX 2070 8GB). So my first step was to reduce the batch size by half, and so on. However, even after I set the batch size to 1, I still got an OOM error. Therefore, I played a little bit with "bert_config.json" from the BioBERT pre-trained model and changed num_hidden_layers to 6 (default is 12), and it ran. Also, I noticed that you set num_epochs to 1, so I didn't change it.

Once the training finished (it took around 35 minutes), I used the DocProductPresentation notebook to test the new model. However, the result was totally off-topic from the question I asked. To test whether this new model works as intended, I then copied one question from "sampleData.csv" and I still got an off-topic answer.

So, my questions are,

  1. What GPU did you use to train your model? Is 8 GB of VRAM not enough? Does the OOM error come from loading the BioBERT model or from your architecture?
  2. Did you use num_epochs = 1 when training on "sampleData.csv" and get good results? If not, what parameters should I use?
  3. I noticed you used "Float16EmbeddingsExpanded.pkl" in the DocProductPresentation notebook but not when training our own QA. What is the importance of this file?
  4. Is the answer auto-generated or just retrieved from "sampleData.csv"? If it is retrieval, the model must look at some kind of database or pool of QA; where is this?
  5. Also, I couldn't find some of the resulting answers in "sampleData.csv"; where do these answers come from?

Thank you so much for your help.

"TF record not found"

Upon running the "Train your own medical Q&A retrieval model" Colab notebook, I found that even though pandas can read the file "data/sampleData.csv", running
"d = create_dataset_for_bert(
'data/sampleData.csv', tokenizer=tokenizer, batch_size=batch_size,
shuffle_buffer=500000, dynamic_padding=True, max_seq_length=max_seq_len)"
gives an output of "TF record not found".
Also, if "d" is printed, we can see that all the attributes like question id, question mask, etc. have None as their value, and running the last cell of the notebook results in a "ValueError: empty training data".
How can I fix this?

Train your own medical Q&A retrieval model

When I try to execute the Train_Your_Own_Q_A_ModelsV0.2.0.ipynb file locally on my machine, I get the error below:

File "Train_Your_Own_Q_A_ModelsV0.2.0.ipynb", line 47, in
medical_qa_model = MedicalQAModelwithBert(config_file=os.path.join(pretrained_path, 'bert_config.json'),checkpoint_file=os.path.join(pretrained_path, 'biobert_model.ckpt'))
File "/docproduct/models.py", line 79, in init
build=build)
File "/docproduct/bert.py", line 214, in build_model_from_config
model.build(input_shape=[(None, None), (None, None), (None, None)])
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/network.py", line 720, in build
raise ValueError('You cannot build your model by calling build '
ValueError: You cannot build your model by calling build if your layers do not support float type inputs. Instead, in order to instantiate and build your model, call your model on real tensor data (of the correct dtype).

Please help me out!!
Thanks !

ModuleNotFoundError: No module named 'docproduct', even after performing `!pip install docproduct` in the official Colab notebook.

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-10-856d4f647535> in <module>()
      1 #@title Load model weights and Q&A data. Double click to see code
      2 
----> 3 from docproduct.predictor import RetreiveQADoc
      4 
      5 pretrained_path = 'BioBertFolder/biobert_v1.0_pubmed_pmc/'

ModuleNotFoundError: No module named 'docproduct'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

Hello!! Can the above issue be resolved in the notebook DocProductPresentationV6-0.2.0.ipynb? Thank you!!

How can I convert this to tf.keras h5 model?

I have little experience with pure TensorFlow and NLP, and I'm having a really hard time converting this to a Keras or tf.keras h5 model so I can then convert it to TensorFlow Lite or TensorFlow.js. Has someone managed to do that?

File damage

I tried to download docproduct_0.2.0.zip from OneDrive. Everything was fine until I tried to extract the bertffn_crossentropy.zip folder in qa_embeddings. I got this message:

! C:\Users\Afdal\Downloads\DocProduct_0.2.0\qa_embeddings\bertffn_crossentropy.zip: The archive is either in unknown format or damaged
! No archives found

Any ideas?

Train your own medical Q&A retrieval model

ImportError: cannot import name 'export_saved_model' from 'tensorflow.python.keras.saving.saved_model' (/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/saving/saved_model/init.py)

I get this error when I run the "Install Faiss, TF 2.0, and our Github. Double Click to see code" block.

pandas can't read_parquet Embedded File:- BioBertFolder/bertffn_crossentropy.zip

in predictor.py> Class FaissTopK > self.df = pd.read_parquet(self.embedding_file)
embedding_file="BioBertFolder/bertffn_crossentropy.zip"
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'. pyarrow or fastparquet is required for parquet support

Also, I can't manually extract that zip file.

Continual training

Hello,
Thanks for sharing your code. It is greatly appreciated. I am trying to apply this project to answer COVID-19 related questions. So far, I trained the Biobert retriever model on a Kaggle dataset of COVID-19 QA pairs. My goal is for the model to continually train as new questions arise on the topic of COVID-19. I have a few questions regarding how I can do this:

  1. Is there any way I can scrape QA pairs related to COVID-19 from medical question-answering websites?
  2. Is there any way I can continually train the model without the problem of catastrophic forgetting?

Thanks,
Weichen Huang.

Data found under https://github.com/Santosh-Gupta/datasets is not 700,000 question/answer pairs

Hello!
There are 2 links under https://github.com/Santosh-Gupta/datasets:

1 - https://drive.google.com/drive/folders/1PymmjbrgfOIs-HJ7oBmjZKH8j4rYsGZj
2 - https://drive.google.com/drive/folders/1kYD57uStDd4kXyb3JOYCTQd92Al6Il4K

However, there are duplicates in both links; for example, AskDocs.csv and icliniqQAs.csv are each found in both links. Therefore, when I import all the non-duplicate data, I only see about 200,000 QA pairs, not the 700,000 which your repo mentions. Is the rest of the data somewhere else? Please kindly let me know how to import the entire 700,000-pair QA dataset.

Thank you!

TypeError while running DocProduct/docproduct/prediction.py

File "predictor.py", line 314, in
gen = GenerateQADoc()
File "predictor.py", line 273, in init
load_pretrain=False
File "predictor.py", line 85, in init
self.predict(questions=question, answers=answer, dataset=False)
File "predictor.py", line 154, in predict
model_outputs = self.model(model_inputs)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 703, in call
outputs = self.call(inputs, *args, **kwargs)
File "/home/bhuvanesh/Documents/pystuff/Medical_dr_source/DocProduct/docproduct/models.py", line 114, in call
(inputs['q_input_ids'], inputs['q_segment_ids'], inputs['q_input_masks']))[self.layer_ind]
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 703, in call
outputs = self.call(inputs, *args, **kwargs)
File "/home/bhuvanesh/Documents/pystuff/Medical_dr_source/DocProduct/docproduct/bert.py", line 169, in call
trainable=self.trainable)
File "/home/bhuvanesh/Documents/pystuff/Medical_dr_source/DocProduct/docproduct/bert.py", line 141, in _wrap_layer
build_output = build_func(input_layer)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 696, in call
self.build(input_shapes)
File "/home/bhuvanesh/Documents/pystuff/Medical_dr_source/DocProduct/keras_bert/keras_multi_head/multi_head_attention.py", line 94, in build
name='%s_Wq' % self.name,
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 534, in add_weight
use_resource=use_resource)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/training/checkpointable/base.py", line 497, in _add_variable_with_custom_getter
**kwargs_for_getter)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1873, in make_variable
use_resource=use_resource)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 2234, in variable
use_resource=use_resource)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 2224, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 2196, in default_variable_creator
constraint=constraint)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 312, in init
constraint=constraint)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 417, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1860, in
shape, dtype=dtype, partition_info=partition_info)
File "/home/bhuvanesh/anaconda3/envs/stark/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py", line 468, in call
scale /= max(1., (fan_in + fan_out) / 2.)
TypeError: unsupported operand type(s) for /: 'Dimension' and 'float'
