castorini / castor
PyTorch deep learning models for text processing
Home Page: http://castor.ai/
License: Apache License 2.0
Update code for Kim's sentence classification model to use Python 3 instead of Python 2.
Hi,
Is there a possibility to use the models with your own data (specifically the mp-cnn)? I couldn't find anything in the documentation.
Thanks!
SM CNN needs to be refactored to match the API of MP CNN.
Replace this with the PTB tokenizer.
This issue is to keep track of Linguistically Annotated Input (LAI) CNNs for SM model.
The current implementation runs on the CPU - let's port it over to the GPU.
NCE = Noise Contrastive Estimation
http://dl.acm.org/citation.cfm?id=2983872
Use our SM model as the building block.
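NCE for answer selection is typically trained with a max-margin objective over sampled negative answers. Here's a minimal sketch of such a pairwise hinge loss; the function name, margin value, and tensor shapes are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def nce_pairwise_hinge(pos_scores, neg_scores, margin=0.5):
    """Pairwise max-margin loss over sampled negatives.

    pos_scores: (batch,) model scores for the correct answers
    neg_scores: (batch, k) scores for k sampled negative answers
    """
    # Penalize any negative whose score comes within `margin` of the positive.
    return F.relu(margin - pos_scores.unsqueeze(1) + neg_scores).mean()
```

The SM model would supply `pos_scores` and `neg_scores` by scoring the question against the correct answer and against the sampled negatives, respectively.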
We have access to this GPU cluster: https://docs.computecanada.ca/wiki/Graham
Let's try to see how difficult it would be to take advantage of these resources...
Assigning to @dishant-mittal
kim_cnn code needs refactoring: move the data into the data repo. Other things to work on will be added as they come up.
@salman1993 Given that we have https://github.com/salman1993/BuboQA do we still need simple_qa_rnn/ here? If not, let's clean up so we don't get confused... (send PR to remove?)
Let's reorganize Castor-data:
embeddings/ : subdirectory for embeddings
datasets/ : subdirectory for datasets

Clean up the top-level README:
├── Castor
│ ├── README.md
│ ├── ...
│ └── mp_cnn/
├── data
│ ├── README.md
│ ├── ...
│ ├── msrvid/
│ ├── sick/
│ └── GloVe/
We'll keep this thread open as additional items come up.
Replicate results on all the datasets used in the initial Kim paper, along with any other common ones...
We need high-level documentation to make sure everyone's environment is sync'ed. For example:
The TrecQA dataset comes in clean- and raw- versions. The sm_cnn main file loads one of these hardcoded prefixes (with a badly named and poorly described argument to switch between them; code ref), but other datasets (such as WikiQA) don't have this distinction, leading to an exception being thrown:
Traceback (most recent call last):
File "main.py", line 152, in <module>
trainer.load_input_data(args.dataset_folder, cache_file, train_set, dev_set, test_set)
File "/castorini/castor/sm_cnn/train.py", line 51, in load_input_data
utils.read_in_dataset(dataset_root_folder, set_folder)
File "/castorini/castor/sm_cnn/utils.py", line 141, in read_in_dataset
questions = read_in_data(dataset_folder, set_folder, "a.toks", False, stop_punct, dash_split)
File "/castorini/castor/sm_cnn/utils.py", line 98, in read_in_data
with open(os.path.join(datapath, set_name, file), encoding='utf-8') as inf:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/WikiQA/raw-test/a.toks'
I'm torn between whether this is the dataset loader's fault, or whether it may just be simpler to change the dataset generation to match TrecQA -- that is, by default, if a collection doesn't have the distinction, generate everything under the raw- prefix.
Thoughts and comments?
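If the generation side stays as-is, one loader-side option would be a fallback over prefixes. A hypothetical sketch, where the function and argument names are mine and not the actual sm_cnn API:

```python
import os

def resolve_split_dir(dataset_root, split):
    """Hypothetical helper: prefer a TrecQA-style prefixed folder if present,
    falling back to an unprefixed folder for datasets (e.g. WikiQA) that
    don't have the clean-/raw- distinction."""
    for prefix in ("raw-", "clean-", ""):
        candidate = os.path.join(dataset_root, prefix + split)
        if os.path.isdir(candidate):
            return candidate
    raise FileNotFoundError("no folder for split '%s' under %s" % (split, dataset_root))
```

This keeps the TrecQA behavior intact while letting WikiQA-style layouts load without a spurious raw- path.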
According to @daemon - the VDPWI works https://github.com/castorini/Castor/tree/master/vdpwi
But the effectiveness is still below the state of the art because the hyperparameters haven't been tuned yet.
I saw that the dataset loading is fixed to four datasets (sick, msrvid, trecqa, wikiqa). I wanted to know how to train vdpwi with other datasets, and how to organize such a dataset properly. I tried copying my dataset into the 'sick' folder and my embeddings into the 'GloVe' folder, but the model trained with 0 loss. Can you give me the correct instructions?
Now that we have https://github.com/castorini/MP-CNN-Torch integrated into Castorini - it would be nice to have a PyTorch implementation of the model.
@gauravbaruah it occurs to me that MP-CNN does not have any of the issues associated with SM re: idf feature. re: #8
$ python train.py --mode static --gpu 1
Note: You are using GPU for training
Dataset TREC Mode static
VOCAB num 13
LABEL.target_class: 13
LABELS: ['', '2', '0', '7', '3', '1', '8', '4', '5', '9', '6', '\t', '.']
Train instance 53417
Dev instance 1148
Test instance 1517
Shift model to GPU
Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
Traceback (most recent call last):
  File "train.py", line 147, in <module>
    for batch_idx, batch in enumerate(train_iter):
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/iterator.py", line 151, in __iter__
    self.train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/batch.py", line 27, in __init__
    setattr(self, name, field.process(batch, device=device, train=train))
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 188, in process
    tensor = self.numericalize(padded, device=device, train=train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 308, in numericalize
    arr = self.postprocessing(arr, None, train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 37, in __call__
    x = pipe.call(x, *args)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in call
    return [self.convert_token(tok, *args) for tok in x]
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in <listcomp>
    return [self.convert_token(tok, *args) for tok in x]
  File "train.py", line 62, in <lambda>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
  File "train.py", line 62, in <listcomp>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
ValueError: could not convert string to float: ''
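A vocabulary of 13 "labels" that includes '', '\t', and '.' suggests the label column is being split into stray tokens before it reaches the float-conversion pipeline. A defensive postprocessing step could skip blank tokens; this is a sketch of a workaround (function name is mine), not the root-cause fix, which is probably in the Field/tokenizer configuration:

```python
def parse_label_tokens(tokens):
    """Drop empty/whitespace tokens (e.g. '' or '\t' left over from
    splitting the label column) before casting to float."""
    return [float(t) for t in tokens if t.strip()]
```

Plugged into the torchtext Pipeline in place of the bare `float(y)` list comprehension, this would avoid the ValueError, though the '\t' and '.' entries in the label vocab would still deserve investigation.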
The e2e QA demo over free text still uses the buggy SM code. Port the code to support the new SM model. I will take this issue.
Given our code clean-up, it should be fairly straightforward to build a demo iPython notebook for MP-CNN to walk through its features?
We can also try https://github.com/szagoruyko/pytorchviz to visualize e.g.,
https://github.com/szagoruyko/pytorchviz/blob/master/examples.ipynb
Dataset: SST-1
Mode: multichannel
This was run on dragon with GPU:
Accuracy:
run1: 45.47061422889969
run2: 43.923994697304465
Something worth looking at:
https://discuss.pytorch.org/t/non-reproducible-result-with-gpu/1831
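As the linked thread discusses, repeatable GPU runs need explicit seeding plus the cuDNN determinism flags; a sketch of the usual incantation (the helper name is mine):

```python
import random

import numpy as np
import torch

def set_seed(seed):
    """Seed every RNG in play. Note that seeding alone is not enough on
    GPU: cuDNN's autotuned kernels are nondeterministic unless the two
    backend flags below are also set."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling `set_seed` at the top of each run should make the SST-1 accuracies comparable across runs (some ops may still be nondeterministic on older PyTorch versions).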
Extend SM model to run on WikiQA dataset. I'll take this issue.
Merge in https://github.com/castorini/vdpwi-nn-pytorch under Castor/vdpwi.
Currently the Kim CNN model runs until the user explicitly stops training. Stop training after there are no improvements in validation accuracy, say over 5 iterations.
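A patience-based early-stopping check along those lines might look like this (the class and parameter names are illustrative):

```python
class EarlyStopper:
    """Minimal sketch of patience-based early stopping on dev accuracy."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float('-inf')
        self.bad_epochs = 0

    def step(self, dev_accuracy):
        """Record one validation result; return True when training should stop."""
        if dev_accuracy > self.best:
            self.best = dev_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The training loop would call `stopper.step(dev_acc)` after each validation pass and break out of the loop when it returns True.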
Here's the repo from Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging at EMNLP 2017.
The implementation is based on Keras 1.x and can be run with Theano (0.9.0) or Tensorflow (0.12.1) as backend.
We should port it to PyTorch.
I think there is something we need to discuss about the implementation of the SM-CNN and MP-CNN models.
For the convolutional net, both models use nn.Conv1d and pass word_dim as the first argument. However, according to the API, the first argument is the number of input channels. In these two models, this argument should be set to 1, and nn.Conv2d should be used instead.
Using the SM-CNN model as an example, the printed model parameters are as follows:
QAModel (
(conv_q): Sequential (
(0): Conv1d(50, 100, kernel_size=(5,), stride=(1,), padding=(4,))
(1): Tanh ()
)
(conv_a): Sequential (
(0): Conv1d(50, 100, kernel_size=(5,), stride=(1,), padding=(4,))
(1): Tanh ()
)
(combined_feature_vector): Linear (204 -> 204)
(combined_features_activation): Tanh ()
(dropout): Dropout (p = 0.5)
(hidden): Linear (204 -> 2)
(logsoftmax): LogSoftmax ()
)
In the forward pass, the input has the following size:
input size: torch.Size([1, 50, 20])
The first dimension is the batch size, the second is the word dimension, and the third is the sentence length. (The tensor is transposed when constructed.) This tensor is fed into the Conv1d declared above.
Now, let's look at the example in the PyTorch documentation:
m = nn.Conv1d(16, 33, 3, stride=2)
input = autograd.Variable(torch.randn(20, 16, 50))
output = m(input)
The API says:
Input: (N, C_in, L_in)
Output: (N, C_out, L_out)
That is, N is the batch size, C_in is the number of input channels, and L_in is the input length; C_out is the number of output channels, and L_out is computed from L_in, the kernel size, the padding, and the stride.
Accordingly, the SM model's input shape (1, 50, 20) does not match the API: the model treats the input channel count as 50, which is wrong.
After the incorrect Conv1d, we get a tensor with the following shape:
after conv 1d sequential: torch.Size([1, 100, 24])
Per the output format, batch size = 1, output channel count = 100, and output length = 24, computed from input length 20, stride 1, kernel size 5, and padding 4.
After max pooling over the third dimension, the SM model gets tensors of shape:
after max pool 1d: torch.Size([1, 100, 1])
after reshape: torch.Size([1, 100])
In this case, each dimension of the word embedding is treated as a channel.
But according to the original paper, the kernel size should be (width, word_dim).
So my suggestion is:
nn.Conv2d(1, 100, (5, words_dim), padding=(4, 0))
with input size
(1, 1, 20, 50)
where batch = 1, number of input channels = 1, sentence length = 20, and word_dim = 50.
After the conv net, we get a tensor of size
(1, 100, 24, 1)
which, after squeezing the last dimension, is the same shape as before, but with a different meaning.
The MP-CNN model seems to have a similar issue.
But which approach works better depends on experiments and datasets.
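The shape arithmetic above can be checked directly. This sketch (tensor sizes taken from the discussion; variable names are mine) builds both formulations and confirms they produce the same (1, 100, 24) feature map:

```python
import torch
import torch.nn as nn

batch, sent_len, words_dim = 1, 20, 50

# Current SM-CNN formulation: each embedding dimension is an input channel.
conv1d = nn.Conv1d(words_dim, 100, kernel_size=5, padding=4)
x1 = torch.randn(batch, words_dim, sent_len)     # (N, C_in=50, L=20)
y1 = conv1d(x1)                                  # (1, 100, 24)

# Proposed formulation per the paper: one channel, kernel spans full word_dim.
conv2d = nn.Conv2d(1, 100, (5, words_dim), padding=(4, 0))
x2 = torch.randn(batch, 1, sent_len, words_dim)  # (N, 1, H=20, W=50)
y2 = conv2d(x2).squeeze(3)                       # (1, 100, 24) after squeezing W

assert tuple(y1.shape) == tuple(y2.shape) == (1, 100, 24)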
Write up documentation about using group-shared configs on dragon. Should probably go in docs/.
Also write up docs on how you set up the shared env, for when we need to upgrade later... For example, are we using a dedicated pytorch user? There should also be a canonical checkout of the data and models repos, right?
Will do this after #128
I ran the sm_cnn model as it is in the master branch and I get different results each time I run it:
Run | MAP | MRR |
---|---|---|
1 | 0.7476 | 0.8126 |
2 | 0.7289 | 0.8006 |
3 | 0.7566 | 0.8175 |
(I cleared .cache before the 1st and 3rd runs, although it wasn't really necessary.)

I have followed the README to set up the environment, and I also exported the PYTHONPATH of Castor inside the castor virtual environment. When I use the vdpwi tool, it exits with an error ("ModuleNotFoundError: No module named 'utils.relevancy_metrics'"). However, other Python files (for example, the files in the "common" directory) can be imported successfully. I want to get some help.
@tuzhucheng Can you check in an MP-CNN model in Castor-models/ and write up instructions on how to actually use it? I should be able to open a Python shell, copy-and-paste a few commands, and run forward inference. This way I can start playing with the code...
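The save/load/infer workflow being asked for could look roughly like this. TinyScorer is a hypothetical stand-in for the real MP-CNN class (which lives in mp_cnn/), and the checkpoint path is illustrative:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for the real MP-CNN model class.
class TinyScorer(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, a, b):
        # toy similarity score from an elementwise interaction
        return self.proj(a * b)

# Check-in side: save the trained weights (this file would go in Castor-models).
ckpt = os.path.join(tempfile.mkdtemp(), "scorer.pt")
torch.save(TinyScorer().state_dict(), ckpt)

# What the copy-and-paste instructions could boil down to for a user:
model = TinyScorer()
model.load_state_dict(torch.load(ckpt))
model.eval()
with torch.no_grad():
    score = model(torch.randn(1, 8), torch.randn(1, 8))
```

Saving the `state_dict` rather than the pickled module keeps the checkpoint loadable even after the model code is refactored.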
@daemon can then follow up with the same for VDPWI.
These two new datasets should be checked into https://git.uwaterloo.ca/jimmylin/Castor-data?
@Victor0118 please coordinate with @tuzhucheng ?
Too slow to run original MP-CNN on insuranceQA
Too slow to run original NCE-MP on insuranceQA
mode | Dev | Test1 | Test2 |
---|---|---|---|
non-static | 0.6137 | 0.6118 | 0.6028 |
mode | Dev | Test1 | Test2 |
---|---|---|---|
random | 0.6121 | 0.6149 | 0.6087 |
max | 0.6400 | 0.6391 | 0.6257 |
model | Dev | Test1 | Test2 |
---|---|---|---|
Bag-of-word | 31.9 | 32.1 | 32.2 |
Metzler-Bendersky IR model | 52.7 | 55.1 | 50.8 |
GRU | 59.4 | 53.2 | 58.1 |
CNN (Feng et al., 2015) | 61.8 | 62.8 | 59.2 |
CNN with GESD (Feng et al., 2015) | 65.4 | 65.3 | 61.0 |
Attentive LSTM (Tan et al. 2016) | 68.4 | 68.1 | 62.2 |
IARNN (Wang et al., 2016) | 69.9 | 70.1 | 62.8 |
AP-BiLSTM (Santos et al., 2016) | 68.7 | 71.7 | 64.4 |
I propose we replicate this paper:
https://arxiv.org/abs/1606.05029
Ferhan can help us if we run into any issues.
For @rosequ - we currently have both sm_cnn and sm_cnn_modified - we need to clean up and reconnect the e2e pipeline...
Might want to rename sm_model to sm_cnn_model to be consistent with @Jeffyrao 's impl?
https://github.com/castorini/SM-CNN-Torch
I am working on this.
Opening issue for @gauravbaruah to keep track of progress on SM model over wikiQA.
@Impavidity also noted that we are using conv1d but in reality we should be using conv2d here, since the input is in 2D space. It'd be interesting to compare the scores with this implementation.
Ref #99
conv_rnn
and kim_cnn
are both sentence classification models - they should share the same API, and in general be structured the same way.
@Impavidity @daemon please coordinate on this.
According to @rosequ the end-to-end QA system has lower accuracy than just using idf passage scorer. This makes no sense. We need to figure out why.
As of 2018-08-18: the data paths used in Castor/sm_cnn/create_dataset.sh, such as '../../Castor-data/TrecQA', do NOT match the real paths in the Castor-data dir. Can you please check?
According to https://arxiv.org/abs/1510.03820 and other places, "non-static" word-embeddings are a win. We should adapt our implementation of the SM model to do so.
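In PyTorch terms, the static vs. non-static distinction comes down to whether the embedding weights are frozen; a minimal sketch using stand-in random vectors in place of the real GloVe matrix:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 50
pretrained = torch.randn(vocab_size, dim)  # stand-in for loaded GloVe vectors

# "static": embeddings stay fixed at their pretrained values
static_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# "non-static": same initialization, but fine-tuned by the optimizer
nonstatic_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
```

With `freeze=False`, the embedding weights receive gradients like any other parameter, which is the "non-static" behavior the paper reports as a win.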
Let's try and reimplement A Hybrid Framework for Text Modeling with Convolutional RNN. It reports SST-1 at 51.67.
Data should go into this repo: https://github.com/lintool/Castor-data
Instructions should be at a sufficient level of detail that I can copy/paste into my shell and replicate some reasonable facsimile of the original model.
Add data loaders, trainer, and evaluator for training TrecQA data for MP-CNN per request from @Victor0118.