
castor's People

Contributors

achyudh, ashutosh-adhikari, daemon, gauravbaruah, hatianzhang, impavidity, likicode, lintool, meng-f, rosequ, salman1993, tuzhucheng, victor0118


castor's Issues

LAI-CNNs for SM

This issue is to keep track of Linguistically Annotated Input (LAI) CNNs for the SM model.

kim_cnn refactoring

kim_cnn code needs refactoring:

  • Code should have some sort of generic interface for classification (a rough sketch follows this list).
  • Data shouldn't be pulled from your private repo; it should be in the data repo.
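As a starting point, here is a minimal sketch of what a shared classification interface could look like. The class and method names are hypothetical, not an existing Castor API:

import torch.nn as nn


class SentenceClassifier(nn.Module):
    """Hypothetical shared interface for sentence classification models
    (e.g. kim_cnn, conv_rnn); the method names here are assumptions."""

    def forward(self, sentences):
        """Return unnormalized class scores of shape (batch, num_classes)."""
        raise NotImplementedError

    def predict(self, sentences):
        """Return the most likely class index for each sentence."""
        return self.forward(sentences).argmax(dim=1)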

Other things to work on will be added as they come up.

Reorganize Castor-data

Let's reorganize Castor-data:

  • embeddings/ subdirectory for embeddings
  • datasets/ subdirectory for datasets

Clean up top-level documentation

Clean up top-level README:

├── Castor
│   ├── README.md
│   ├── ...
│   └── mp_cnn/
├── data
│   ├── README.md
│   ├── ...
│   ├── msrvid/
│   ├── sick/
│   └── GloVe/

  • Should provide a high-level overview of the main dependencies, e.g., PyTorch 0.4.0.

We'll keep this thread open as additional items come up.

High-level documentation on setting up the dev environment

We need high-level documentation to make sure everyone's environment is in sync (a version-check sketch follows the list below). For example:

  • Python 3.5 or 3.6?
  • What version of PyTorch to install
  • Basically, installation instructions for different OSes (Mac for me)
  • etc.
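As one possible form this documentation could take, here is a small sanity-check script. Python 3.5/3.6 and PyTorch 0.4.x are just the versions mentioned in these issues, not agreed-upon pins:

import sys

import torch

# Example pins only; the real versions should come from the docs we write.
assert sys.version_info[:2] in [(3, 5), (3, 6)], "expected Python 3.5 or 3.6"
assert torch.__version__.startswith("0.4."), "expected PyTorch 0.4.x"
print("Python", sys.version.split()[0], "| PyTorch", torch.__version__)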

Old sm-cnn has hardcoded paths that match TrecQA

The TrecQA dataset comes in clean- and raw- versions, and the sm_cnn main file loads one of these hardcoded prefixes (with a really badly named/described argument to switch between them; code ref). Other datasets (such as WikiQA) don't have this distinction, which leads to an exception being thrown:

Traceback (most recent call last):
  File "main.py", line 152, in <module>
    trainer.load_input_data(args.dataset_folder, cache_file, train_set, dev_set, test_set)
  File "/castorini/castor/sm_cnn/train.py", line 51, in load_input_data
    utils.read_in_dataset(dataset_root_folder, set_folder)
  File "/castorini/castor/sm_cnn/utils.py", line 141, in read_in_dataset
    questions = read_in_data(dataset_folder, set_folder, "a.toks", False, stop_punct, dash_split)
  File "/castorini/castor/sm_cnn/utils.py", line 98, in read_in_data
    with open(os.path.join(datapath, set_name, file), encoding='utf-8') as inf:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/WikiQA/raw-test/a.toks'

I'm torn between whether this is the dataset loader's fault (a fallback sketch follows the list below), or whether it may just be simpler to change the dataset generation to match TrecQA -- that is, if a collection doesn't have the distinction, generate everything under the raw- prefix by default.

  • If the dataset generator changes, then other model files are almost certainly going to break if they were written with a non-TrecQA dataset in mind
  • Other models written for the TrecQA dataset will almost certainly have a similar problem
  • Adding a flag for the prefix means yet more flags
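For the loader-side option, here is a minimal sketch of the fallback behavior; the helper name and arguments are hypothetical, not existing sm_cnn code:

import os


def resolve_split_dir(dataset_folder, split, prefix="raw-"):
    """Return the directory for a split (e.g. 'test'), falling back to the
    unprefixed name for collections like WikiQA that have no clean-/raw-
    distinction. Sketch only, under the assumptions stated above."""
    prefixed = os.path.join(dataset_folder, prefix + split)
    if os.path.isdir(prefixed):
        return prefixed
    return os.path.join(dataset_folder, split)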

Thoughts and comments?

Training VDPWI on other datasets

I see that the dataset loading is fixed to four datasets (sick, msrvid, trecqa, wikiqa). I want to know how to train vdpwi on other datasets, and how to organize such a dataset properly. I tried copying my dataset into the 'sick' folder and my embeddings into the 'GloVe' folder, but the model trained with 0 loss. Can you give me the correct instructions?

ValueError when training sm_cnn: could not convert string to float: '<pad>'

$ python train.py --mode static --gpu 1
Note: You are using GPU for training
Dataset TREC Mode static
VOCAB num 13
LABEL.target_class: 13
LABELS: ['', '2', '0', '7', '3', '1', '8', '4', '5', '9', '6', '\t', '.']
Train instance 53417
Dev instance 1148
Test instance 1517
Shift model to GPU
Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
Traceback (most recent call last):
  File "train.py", line 147, in <module>
    for batch_idx, batch in enumerate(train_iter):
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/iterator.py", line 151, in __iter__
    self.train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/batch.py", line 27, in __init__
    setattr(self, name, field.process(batch, device=device, train=train))
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 188, in process
    tensor = self.numericalize(padded, device=device, train=train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 308, in numericalize
    arr = self.postprocessing(arr, None, train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 37, in __call__
    x = pipe.call(x, *args)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in call
    return [self.convert_token(tok, *args) for tok in x]
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in <listcomp>
    return [self.convert_token(tok, *args) for tok in x]
  File "train.py", line 62, in <lambda>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
  File "train.py", line 62, in <listcomp>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
ValueError: could not convert string to float: '<pad>'
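The failure can be reproduced outside torchtext: the postprocessing pipeline maps float() over every token in the padded batch, including the '<pad>' token that torchtext inserts when batching. A minimal illustration, with a possible (unconfirmed) workaround noted in the comments:

# What the lambda in train.py line 62 effectively does to a padded label sequence:
tokens = ['2', '0', '<pad>']
try:
    values = [float(tok) for tok in tokens]
except ValueError as err:
    print(err)  # could not convert string to float: '<pad>'

# Possible workaround (an assumption, not a confirmed fix): declare the label
# field as non-sequential, e.g. data.Field(sequential=False, use_vocab=False),
# so that torchtext never inserts '<pad>' into the labels in the first place.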

Port E2E Code

The e2e QA demo over free text still uses the buggy SM code. Port the code to support the new SM model. I will take this issue.

Kim CNN: Early Stop

Currently the Kim CNN model runs until the user explicitly stops training. It should stop automatically when there is no improvement in validation accuracy, say over 5 iterations (a sketch follows).
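A minimal sketch of patience-based early stopping. The helper callables and the loop-per-epoch structure are assumptions about how the trainer is organized; only the patience of 5 comes from the suggestion above:

def train_with_early_stopping(model, train_iter, dev_iter,
                              train_one_epoch, evaluate,
                              max_epochs=30, patience=5):
    """Stop once dev accuracy fails to improve for `patience` evaluations.
    `train_one_epoch` and `evaluate` are caller-supplied callables."""
    best_dev_acc = 0.0
    rounds_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_iter)
        dev_acc = evaluate(model, dev_iter)
        if dev_acc > best_dev_acc:
            best_dev_acc = dev_acc
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                print("Early stopping after epoch", epoch)
                break
    return best_dev_acc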

Implementation comparison of conv1d and conv2d for SM and MP-CNN

I think there is something we need to discuss about the implementation of the SM-CNN and MP-CNN models.
For the convolution layer, both models use nn.Conv1d and pass word_dim as the first argument. However, according to the API, the first argument is the number of input channels. In these two models that argument should be set to 1, and nn.Conv2d should be used instead.

Take the SM-CNN model as an example.
The printed model parameters are as follows:

QAModel (
  (conv_q): Sequential (
    (0): Conv1d(50, 100, kernel_size=(5,), stride=(1,), padding=(4,))
    (1): Tanh ()
  )
  (conv_a): Sequential (
    (0): Conv1d(50, 100, kernel_size=(5,), stride=(1,), padding=(4,))
    (1): Tanh ()
  )
  (combined_feature_vector): Linear (204 -> 204)
  (combined_features_activation): Tanh ()
  (dropout): Dropout (p = 0.5)
  (hidden): Linear (204 -> 2)
  (logsoftmax): LogSoftmax ()
)

In the forward pass, we have the following input size:

input size: torch.Size([1, 50, 20])

The first dimension is the batch size, the second is the word dimension, and the third is the sentence length. The tensor is transposed when constructed and is then fed into the Conv1d declared above.

Now, let's look at the example in the PyTorch documentation:

m = nn.Conv1d(16, 33, 3, stride=2)
input = autograd.Variable(torch.randn(20, 16, 50))
output = m(input)

The API says:

Input: (N,C_in,L_in)
Output: (N,C_out,L_out)

In the input, N is the batch size, C_in is the number of input channels, and L_in is the length.
In the output, N is the batch size, C_out is the number of output channels, and L_out is computed from L_in, the kernel size, the padding, and the stride.

According to this, the SM model's input shape (1, 50, 20) does not match the API: the SM model treats the number of input channels as 50, which is wrong.

After this wrongly specified Conv1d, we get a tensor with the following shape:

after conv 1d sequential: torch.Size([1, 100, 24])

Following the output format, the batch size is 1, the number of output channels is 100, and the output length is 24, computed from input length 20, stride 1, kernel size 5, and padding 4.
After max pooling, the SM model takes the max over the third dimension and gets tensors of shape:

after max pool 1d: torch.Size([1, 100, 1])
after reshape: torch.Size([1, 100])

In this case, each dimension of the word embedding is treated as a channel.
But according to the original paper, the kernel size should be (width, word_dim).
So my suggestion is:

nn.Conv2d(1, 100, (5, words_dim), padding=(4, 0))

and the input size should be

(1, 1, 20, 50)

where batch = 1, number of input channels = 1, sentence length = 20, word_dim = 50.
After the conv layer (and squeezing the trailing width dimension, which is 1), we get a tensor of size

(1, 100, 24)

which is the same shape as before but carries a different meaning.
The MP-CNN model seems to have a similar issue.
Which formulation is better depends on experiments and datasets.
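To make the shape difference concrete, here is a small, self-contained comparison using the numbers from this discussion (batch 1, sentence length 20, word_dim 50, 100 filters of width 5); it is an illustration, not the actual Castor code:

import torch
import torch.nn as nn

words_dim, sent_len, n_filters, width = 50, 20, 100, 5

# Current formulation: embedding dimensions are treated as input channels.
conv1d = nn.Conv1d(words_dim, n_filters, width, padding=4)
x1 = torch.randn(1, words_dim, sent_len)      # (batch, word_dim, length)
print(conv1d(x1).shape)                       # torch.Size([1, 100, 24])

# Proposed formulation: one input channel, kernel spanning the full word_dim.
conv2d = nn.Conv2d(1, n_filters, (width, words_dim), padding=(4, 0))
x2 = torch.randn(1, 1, sent_len, words_dim)   # (batch, 1, length, word_dim)
print(conv2d(x2).shape)                       # torch.Size([1, 100, 24, 1])
# Squeezing the trailing dimension gives the same (1, 100, 24) shape, but each
# of the 100 filters now spans whole words instead of one embedding dimension
# per channel.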

Write documentation about dragon

@tuzhucheng

Write up documentation about using group-shared configs on dragon. Should probably go in docs/.

Also write up docs on how you set up the shared env, for when we need to upgrade later... for example, are we using a dedicated pytorch user? There should also be a canonical checkout of the data and models repos, right?

The vdpwi code does not work

I followed the README to set up the environment, and I also exported the PYTHONPATH of Castor inside the castor virtual environment. When I use the vdpwi tool, it exits with the error "ModuleNotFoundError: No module named 'utils.relevancy_metrics'". However, other Python files (for example, the files in the "common" directory) can be imported successfully. I would appreciate some help.
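One quick way to check whether the Castor root is actually visible to the Python process that runs vdpwi; the path below is only an example, not where your checkout necessarily lives:

import os
import sys

# Example path only; point this at your actual Castor checkout.
castor_root = os.path.expanduser("~/Castor")
if castor_root not in sys.path:
    sys.path.insert(0, castor_root)

import utils.relevancy_metrics  # should now import without ModuleNotFoundError
print(utils.relevancy_metrics.__file__)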

Add insuranceQA

Results of MP-CNN

Too slow to run original MP-CNN on insuranceQA

Results of NCE-MP

Too slow to run original NCE-MP on insuranceQA

Results of SM-CNN

mode Dev Test1 Test2
non-static 0.6137 0.6118 0.6028

Results of NCE-SM

mode Dev Test1 Test2
random 0.6121 0.6149 0.6087
max 0.6400 0.6391 0.6257

Results of Baselines

model Dev Test1 Test2
Bag-of-word 31.9 32.1 32.2
Metzler-Bendersky IR model 52.7 55.1 50.8
GRU 59.4 53.2 58.1
CNN (Feng et al., 2015) 61.8 62.8 59.2
CNN with GESD (Feng et al., 2015) 65.4 65.3 61.0
Attentive LSTM (Tan et al. 2016) 68.4 68.1 62.2
IARNN (Wang et al., 2016) 69.9 70.1 62.8
AP-BiLSTM (Santos et al., 2016) 68.7 71.7 64.4

SM_CNN: Refactor

  • use torchtext
  • support batching
  • simplify code to support backprop over word embeddings, linguistic features, etc. (#13)
    I'll work on this issue.

@Impavidity also noted that we are using conv1d but in reality we should be using conv2d here since the input is in 2D space. It'd be interesting to compare the scores with this implementation.

conv_rnn refactoring

Ref #99

conv_rnn and kim_cnn are both sentence classification models - they should share the same API, and in general be structured the same way.

@Impavidity @daemon please coordinate on this.

Dataset path mismatch

As of 2018-08-18, the data path used in Castor/sm_cnn/create_dataset.sh, such as "../../Castor-data/TrecQA",
does NOT match the real path in the Castor-data directory.

Can you please check it?
