castorini / castor
PyTorch deep learning models for text processing
Home Page: http://castor.ai/
License: Apache License 2.0
Update code for Kim's sentence classification model to use Python 3 instead of Python 2.
Hi,
Is there a possibility to use the models with your own data (specifically the mp-cnn)? I couldn't find anything in the documentation.
Thanks!
SM CNN needs to be refactored to match the API of MP CNN.
Replace this with the PTB tokenizer.
This issue is to keep track of Linguistically Annotated Input (LAI) CNNs for SM model.
The current implementation runs on the CPU - let's port it over to the GPU.
NCE = Noise Contrastive Estimation
http://dl.acm.org/citation.cfm?id=2983872
Use our SM model as the building block.
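NCE for answer selection is typically trained with a max-margin objective over sampled negative answers. Here's a minimal sketch of such a pairwise hinge loss; the function name, margin value, and tensor shapes are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def nce_pairwise_hinge(pos_scores, neg_scores, margin=0.5):
    """Pairwise max-margin loss over sampled negatives.

    pos_scores: (batch,) model scores for the correct answers
    neg_scores: (batch, k) scores for k sampled negative answers
    """
    # Penalize any negative whose score comes within `margin` of the positive.
    return F.relu(margin - pos_scores.unsqueeze(1) + neg_scores).mean()
```

The SM model would supply `pos_scores` and `neg_scores` by scoring the question against the correct answer and against the sampled negatives, respectively.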
We have access to this GPU cluster: https://docs.computecanada.ca/wiki/Graham
Let's try to see how difficult it would be to take advantage of these resources...
Assigning to @dishant-mittal
kim_cnn code needs refactoring: move the data into the data repo. Other things to work on will be added as they come up.
@salman1993 Given that we have https://github.com/salman1993/BuboQA do we still need simple_qa_rnn/ here? If not, let's clean up so we don't get confused... (send PR to remove?)
Let's reorganize Castor-data:
embeddings/ : subdirectory for embeddings
datasets/ : subdirectory for datasets

Clean up the top-level README:
├── Castor
│ ├── README.md
│ ├── ...
│ └── mp_cnn/
├── data
│ ├── README.md
│ ├── ...
│ ├── msrvid/
│ ├── sick/
│ └── GloVe/
We'll keep this thread open as additional items come up.
Replicate results on all the datasets used in the initial Kim paper, along with any other common ones...
We need high-level documentation to make sure everyone's environment is sync'ed. For example:
The TrecQA dataset comes in clean- and raw- versions. The sm_cnn main file loads one of these hardcoded prefixes (with a badly named and poorly described argument to switch between them; code ref), but other datasets (such as WikiQA) don't have this distinction, leading to an exception being thrown:
Traceback (most recent call last):
File "main.py", line 152, in <module>
trainer.load_input_data(args.dataset_folder, cache_file, train_set, dev_set, test_set)
File "/castorini/castor/sm_cnn/train.py", line 51, in load_input_data
utils.read_in_dataset(dataset_root_folder, set_folder)
File "/castorini/castor/sm_cnn/utils.py", line 141, in read_in_dataset
questions = read_in_data(dataset_folder, set_folder, "a.toks", False, stop_punct, dash_split)
File "/castorini/castor/sm_cnn/utils.py", line 98, in read_in_data
with open(os.path.join(datapath, set_name, file), encoding='utf-8') as inf:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/WikiQA/raw-test/a.toks'
I'm torn between whether this is the dataset loader's fault, or whether it may just be simpler to change the dataset generation to match TrecQA -- that is, by default, if a collection doesn't have the distinction, generate everything under the raw- prefix.
Thoughts and comments?
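If the generation side stays as-is, one loader-side option would be a fallback over prefixes. A hypothetical sketch, where the function and argument names are mine and not the actual sm_cnn API:

```python
import os

def resolve_split_dir(dataset_root, split):
    """Hypothetical helper: prefer a TrecQA-style prefixed folder if present,
    falling back to an unprefixed folder for datasets (e.g. WikiQA) that
    don't have the clean-/raw- distinction."""
    for prefix in ("raw-", "clean-", ""):
        candidate = os.path.join(dataset_root, prefix + split)
        if os.path.isdir(candidate):
            return candidate
    raise FileNotFoundError("no folder for split '%s' under %s" % (split, dataset_root))
```

This keeps the TrecQA behavior intact while letting WikiQA-style layouts load without a spurious raw- path.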
According to @daemon - the VDPWI works https://github.com/castorini/Castor/tree/master/vdpwi
But the effectiveness is still below the state of the art because the hyperparameters haven't been tuned yet.
I saw that the dataset loading is fixed to four datasets (sick, msrvid, trecqa, wikiqa). I wanted to know how to train vdpwi with other datasets, and how to organize such a dataset properly. I tried copying my dataset into the 'sick' folder and my embeddings into the 'GloVe' folder, but the model trained with 0 loss. Can you give me the correct instructions?
Now that we have https://github.com/castorini/MP-CNN-Torch integrated into Castorini - it would be nice to have a PyTorch implementation of the model.
@gauravbaruah it occurs to me that MP-CNN does not have any of the issues associated with SM re: idf feature. re: #8
$ python train.py --mode static --gpu 1
Note: You are using GPU for training
Dataset TREC Mode static
VOCAB num 13
LABEL.target_class: 13
LABELS: ['', '2', '0', '7', '3', '1', '8', '4', '5', '9', '6', '\t', '.']
Train instance 53417
Dev instance 1148
Test instance 1517
Shift model to GPU
Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
Traceback (most recent call last):
  File "train.py", line 147, in <module>
    for batch_idx, batch in enumerate(train_iter):
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/iterator.py", line 151, in __iter__
    self.train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/batch.py", line 27, in __init__
    setattr(self, name, field.process(batch, device=device, train=train))
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 188, in process
    tensor = self.numericalize(padded, device=device, train=train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 308, in numericalize
    arr = self.postprocessing(arr, None, train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 37, in __call__
    x = pipe.call(x, *args)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in call
    return [self.convert_token(tok, *args) for tok in x]
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in <listcomp>
    return [self.convert_token(tok, *args) for tok in x]
  File "train.py", line 62, in <lambda>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
  File "train.py", line 62, in <listcomp>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
ValueError: could not convert string to float: ''
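A vocabulary of 13 "labels" that includes '', '\t', and '.' suggests the label column is being split into stray tokens before it reaches the float-conversion pipeline. A defensive postprocessing step could skip blank tokens; this is a sketch of a workaround (function name is mine), not the root-cause fix, which is probably in the Field/tokenizer configuration:

```python
def parse_label_tokens(tokens):
    """Drop empty/whitespace tokens (e.g. '' or '\t' left over from
    splitting the label column) before casting to float."""
    return [float(t) for t in tokens if t.strip()]
```

Plugged into the torchtext Pipeline in place of the bare `float(y)` list comprehension, this would avoid the ValueError, though the '\t' and '.' entries in the label vocab would still deserve investigation.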
The e2e QA demo over free text still uses the buggy SM code. Port the code to support the new SM model. I will take this issue.
Given our code clean-up, it should be fairly straightforward to build a demo iPython notebook for MP-CNN to walk through its features?
We can also try https://github.com/szagoruyko/pytorchviz to visualize e.g.,
https://github.com/szagoruyko/pytorchviz/blob/master/examples.ipynb
Dataset: SST-1
Mode: multichannel
This was run on dragon with GPU:
Accuracy:
run1: 45.47061422889969
run2: 43.923994697304465
Something worth looking at:
https://discuss.pytorch.org/t/non-reproducible-result-with-gpu/1831
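As the linked thread discusses, repeatable GPU runs need explicit seeding plus the cuDNN determinism flags; a sketch of the usual incantation (the helper name is mine):

```python
import random

import numpy as np
import torch

def set_seed(seed):
    """Seed every RNG in play. Note that seeding alone is not enough on
    GPU: cuDNN's autotuned kernels are nondeterministic unless the two
    backend flags below are also set."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling `set_seed` at the top of each run should make the SST-1 accuracies comparable across runs (some ops may still be nondeterministic on older PyTorch versions).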
Extend SM model to run on WikiQA dataset. I'll take this issue.
Merge in https://github.com/castorini/vdpwi-nn-pytorch under Castor/vdpwi.
Currently the Kim CNN model runs until the user explicitly stops training. Stop training after there are no improvements in validation accuracy, say over 5 iterations.
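A patience-based early-stopping check along those lines might look like this (the class and parameter names are illustrative):

```python
class EarlyStopper:
    """Minimal sketch of patience-based early stopping on dev accuracy."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float('-inf')
        self.bad_epochs = 0

    def step(self, dev_accuracy):
        """Record one validation result; return True when training should stop."""
        if dev_accuracy > self.best:
            self.best = dev_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The training loop would call `stopper.step(dev_acc)` after each validation pass and break out of the loop when it returns True.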
Here's the repo from Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging at EMNLP 2017.
The implementation is based on Keras 1.x and can be run with Theano (0.9.0) or Tensorflow (0.12.1) as backend.
We should port it to PyTorch.
I think there is something we need to discuss about the implementation of the SM-CNN and MP-CNN models.
For the convolutional net, both models use nn.Conv1d and pass word_dim as the first argument. However, according to the API, the first argument is the number of input channels. In these two models, this argument should be set to 1, and nn.Conv2d should be used instead.
Using the SM-CNN model as an example, the printed model parameters are as follows:
QAModel (
(conv_q): Sequential (
(0): Conv1d(50, 100, kernel_size=(5,), stride=(1,), padding=(4,))
(1): Tanh ()
)
(conv_a): Sequential (
(0): Conv1d(50, 100, kernel_size=(5,), stride=(1,), padding=(4,))
(1): Tanh ()
)
(combined_feature_vector): Linear (204 -> 204)
(combined_features_activation): Tanh ()
(dropout): Dropout (p = 0.5)
(hidden): Linear (204 -> 2)
(logsoftmax): LogSoftmax ()
)
In the forward pass, the input has the following size:
input size: torch.Size([1, 50, 20])
The first dimension is the batch size, the second is the word dimension, and the third is the sentence length. (The tensor is transposed when constructed.) This tensor is fed into the Conv1d declared above.
Now, let's look at the example in the PyTorch documentation:
m = nn.Conv1d(16, 33, 3, stride=2)
input = autograd.Variable(torch.randn(20, 16, 50))
output = m(input)
The API says:
Input: (N, C_in, L_in)
Output: (N, C_out, L_out)
That is, N is the batch size, C_in is the number of input channels, and L_in is the input length; C_out is the number of output channels, and L_out is computed from L_in, the kernel size, the padding, and the stride.
Accordingly, the SM model's input shape (1, 50, 20) does not match the API: the model treats the input channel count as 50, which is wrong.
After the incorrect Conv1d, we get a tensor with the following shape:
after conv 1d sequential: torch.Size([1, 100, 24])
Per the output format, batch size = 1, output channel count = 100, and output length = 24, computed from input length 20, stride 1, kernel size 5, and padding 4.
After max pooling over the third dimension, the SM model gets tensors of shape:
after max pool 1d: torch.Size([1, 100, 1])
after reshape: torch.Size([1, 100])
In this case, each dimension of the word embedding is treated as a channel.
But according to the original paper, the kernel size should be (width, word_dim).
So my suggestion is:
nn.Conv2d(1, 100, (5, words_dim), padding=(4, 0))
with input size
(1, 1, 20, 50)
where batch = 1, number of input channels = 1, sentence length = 20, and word_dim = 50.
After the conv net, we get a tensor of size
(1, 100, 24, 1)
which, after squeezing the last dimension, is the same shape as before, but with a different meaning.
The MP-CNN model seems to have a similar issue.
But which approach works better depends on experiments and datasets.
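The shape arithmetic above can be checked directly. This sketch (tensor sizes taken from the discussion; variable names are mine) builds both formulations and confirms they produce the same (1, 100, 24) feature map:

```python
import torch
import torch.nn as nn

batch, sent_len, words_dim = 1, 20, 50

# Current SM-CNN formulation: each embedding dimension is an input channel.
conv1d = nn.Conv1d(words_dim, 100, kernel_size=5, padding=4)
x1 = torch.randn(batch, words_dim, sent_len)     # (N, C_in=50, L=20)
y1 = conv1d(x1)                                  # (1, 100, 24)

# Proposed formulation per the paper: one channel, kernel spans full word_dim.
conv2d = nn.Conv2d(1, 100, (5, words_dim), padding=(4, 0))
x2 = torch.randn(batch, 1, sent_len, words_dim)  # (N, 1, H=20, W=50)
y2 = conv2d(x2).squeeze(3)                       # (1, 100, 24) after squeezing W

assert tuple(y1.shape) == tuple(y2.shape) == (1, 100, 24)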
Write up documentation about using group-shared configs on dragon. Should probably go in docs/.
Also write up docs on how you set up the shared env, for when we need to upgrade later... For example, are we using a dedicated pytorch user? There should also be a canonical checkout of the data and models repos, right?
Will do this after #128
I ran the sm_cnn model as it is in the master branch and I get different results each time I run it:
Run | MAP | MRR |
---|---|---|
1 | 0.7476 | 0.8126 |
2 | 0.7289 | 0.8006 |
3 | 0.7566 | 0.8175 |
(I cleared .cache before the 1st and 3rd runs, although it wasn't really necessary.)

I have followed the README to set up the environment, and I also exported the PYTHONPATH of Castor inside the castor virtual environment. When I use the vdpwi tool, it exits with an error ("ModuleNotFoundError: No module named 'utils.relevancy_metrics'"). However, other Python files (for example, the files in the "common" directory) can be imported successfully. I want to get some help.
@tuzhucheng Can you check in an MP-CNN model in Castor-models/ and write up instructions on how to actually use it? I should be able to open a Python shell, copy-and-paste a few commands, and run forward inference. This way I can start playing with the code...
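The save/load/infer workflow being asked for could look roughly like this. TinyScorer is a hypothetical stand-in for the real MP-CNN class (which lives in mp_cnn/), and the checkpoint path is illustrative:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for the real MP-CNN model class.
class TinyScorer(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, a, b):
        # toy similarity score from an elementwise interaction
        return self.proj(a * b)

# Check-in side: save the trained weights (this file would go in Castor-models).
ckpt = os.path.join(tempfile.mkdtemp(), "scorer.pt")
torch.save(TinyScorer().state_dict(), ckpt)

# What the copy-and-paste instructions could boil down to for a user:
model = TinyScorer()
model.load_state_dict(torch.load(ckpt))
model.eval()
with torch.no_grad():
    score = model(torch.randn(1, 8), torch.randn(1, 8))
```

Saving the `state_dict` rather than the pickled module keeps the checkpoint loadable even after the model code is refactored.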
@daemon can then follow up with the same for VDPWI.
These two new datasets should be checked into https://git.uwaterloo.ca/jimmylin/Castor-data?
@Victor0118 please coordinate with @tuzhucheng ?
Too slow to run original MP-CNN on insuranceQA
Too slow to run original NCE-MP on insuranceQA
mode | Dev | Test1 | Test2 |
---|---|---|---|
non-static | 0.6137 | 0.6118 | 0.6028 |
mode | Dev | Test1 | Test2 |
---|---|---|---|
random | 0.6121 | 0.6149 | 0.6087 |
max | 0.6400 | 0.6391 | 0.6257 |
model | Dev | Test1 | Test2 |
---|---|---|---|
Bag-of-word | 31.9 | 32.1 | 32.2 |
Metzler-Bendersky IR model | 52.7 | 55.1 | 50.8 |
GRU | 59.4 | 53.2 | 58.1 |
CNN (Feng et al., 2015) | 61.8 | 62.8 | 59.2 |
CNN with GESD (Feng et al., 2015) | 65.4 | 65.3 | 61.0 |
Attentive LSTM (Tan et al. 2016) | 68.4 | 68.1 | 62.2 |
IARNN (Wang et al., 2016) | 69.9 | 70.1 | 62.8 |
AP-BiLSTM (Santos et al., 2016) | 68.7 | 71.7 | 64.4 |
I propose we replicate this paper:
https://arxiv.org/abs/1606.05029
Ferhan can help us if we run into any issues.
For @rosequ - we currently have both sm_cnn and sm_cnn_modified - we need to clean up and reconnect the e2e pipeline...
Might want to rename sm_model to sm_cnn_model to be consistent with @Jeffyrao 's impl?
https://github.com/castorini/SM-CNN-Torch
I am working on this.
Opening issue for @gauravbaruah to keep track of progress on SM model over wikiQA.
@Impavidity also noted that we are using conv1d but in reality we should be using conv2d here, since the input is in 2D space. It'd be interesting to compare the scores with this implementation.
Ref #99
conv_rnn
and kim_cnn
are both sentence classification models - they should share the same API, and in general be structured the same way.
@Impavidity @daemon please coordinate on this.
According to @rosequ the end-to-end QA system has lower accuracy than just using idf passage scorer. This makes no sense. We need to figure out why.
As of 2018-08-18: the data paths used in Castor/sm_cnn/create_dataset.sh, such as '../../Castor-data/TrecQA', do NOT match the real paths in the Castor-data dir. Can you please check?
According to https://arxiv.org/abs/1510.03820 and other places, "non-static" word-embeddings are a win. We should adapt our implementation of the SM model to do so.
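In PyTorch terms, the static vs. non-static distinction comes down to whether the embedding weights are frozen; a minimal sketch using stand-in random vectors in place of the real GloVe matrix:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 50
pretrained = torch.randn(vocab_size, dim)  # stand-in for loaded GloVe vectors

# "static": embeddings stay fixed at their pretrained values
static_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# "non-static": same initialization, but fine-tuned by the optimizer
nonstatic_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
```

With `freeze=False`, the embedding weights receive gradients like any other parameter, which is the "non-static" behavior the paper reports as a win.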
Let's try and reimplement A Hybrid Framework for Text Modeling with Convolutional RNN. It reports SST-1 at 51.67.
Data should go into this repo: https://github.com/lintool/Castor-data
Instructions should be at a sufficient level of detail that I can copy/paste into my shell and replicate some reasonable facsimile of the original model.
Add data loaders, trainer, and evaluator for training TrecQA data for MP-CNN per request from @Victor0118.