
ordered-neurons's People

Contributors

arkaung, bharatr21, shawntan, yikangshen


ordered-neurons's Issues

Question about dataset construction

Hello Yikang,

I'm Yangming Li, a research intern at the HIT-SCIR lab. Thank you for your work on this repository. However, I found some problems with the dataset construction (including the test set):

1. The use of the PyTorch API `narrow` unexpectedly discards some words and results in an incorrect PPL score (see the sketch after this list).

2. The sliding window over the whole corpus does not appear to be contiguous, and therefore generates far less data than usual.
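
For context, a minimal sketch of the kind of batchify that point 1 refers to. The exact code in utils.py may differ; this is only an illustration of how `narrow` silently truncates trailing tokens:

    import torch

    def batchify(data, bsz):
        # Keep only as many tokens as fit into an integral number of columns.
        nbatch = data.size(0) // bsz
        # narrow() drops the remaining data.size(0) - nbatch * bsz tokens,
        # so up to bsz - 1 words never contribute to the reported PPL.
        data = data.narrow(0, 0, nbatch * bsz)
        # Reshape into (seq_len, bsz) for sequential batching.
        return data.view(bsz, -1).t().contiguous()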

Thanks again for your contribution to this repository.
Yangming, 19/08/02

Question about the model design details

Hi, thanks for sharing the source code.

According to Equation (10) in your paper, I guess the last element of $\tilde{i}_t$ will always be zero, e.g., [0.8, 0.3, 0.1, 0].
Is this on purpose? If so, could you please explain why? I think this would let the topmost neuron chunk keep copying history without ever writing in anything new; is that correct?
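
For readers hitting this thread, a small numerical sketch of why the last element comes out as zero, as I read Eq. (10): the master input gate is $1 - \mathrm{cumax}(\cdot)$, and cumax is a cumulative sum over a softmax, whose final entry is always 1. NumPy is used here purely for illustration; the repository computes this inside its ON-LSTM cell:

    import numpy as np

    logits = np.array([1.0, 0.5, -0.2, -1.0])   # arbitrary pre-activations
    softmax = np.exp(logits) / np.exp(logits).sum()
    cumax = np.cumsum(softmax)                  # monotonically increasing, ends at 1.0
    i_tilde = 1.0 - cumax                       # monotonically decreasing, ends at 0.0
    print(cumax[-1], i_tilde[-1])               # prints: 1.0 0.0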

How to train with main.py using multiple GPUs?

@yikangshen @shawntan Is there an easy way to train the model with main.py on multiple GPUs to replicate the experiments?

When using model = nn.DataParallel(model) before train(), the initialization goes into the LSTM stack and then into the ONLSTM cell to return the weights, but it throws an error.

We also tried doing model = nn.DataParallel(model) after hidden = model.init_hidden(args.batch_size), and it seems the LinearDropConnect layer can't access the .weight tensors.
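
For anyone else hitting this, a rough sketch of the two orderings described above, assuming model and args are already set up as in main.py; it only restates the attempts and is not a working fix:

    import torch.nn as nn

    # Attempt 1: wrap before training; the error is raised while the wrapper
    # walks into the ONLSTM stack during initialization.
    model = nn.DataParallel(model)

    # Attempt 2: initialize the hidden state first, then wrap; the
    # LinearDropConnect layer then cannot access its .weight tensors.
    hidden = model.init_hidden(args.batch_size)
    model = nn.DataParallel(model)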

cur_loss suddenly increases to a larger number

Hi Yikang, thanks a lot for this awesome paper!

When I try to run the command below,

python main.py --batch_size 20 --dropout 0.45 --dropouth 0.3 --dropouti 0.5 --wdrop 0.45 --chunk_size 10 --seed 141 --epoch 1000

the following error is triggered at a certain (5th) epoch:

File "main.py", line 269, in
train()
File "main.py", line 245, in train
elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss), cur_loss / math.log(2)))
OverflowError: math range error

My initial finding is that cur_loss suddenly increases to a much larger number (from ~5 to more than 10000), which results in this error.
However, I am not sure what causes this sudden, huge increase.
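
For context, the OverflowError itself comes from Python's math.exp, which raises once its argument exceeds roughly 709. A sketch of a guard one could put around the logging line; this is not part of the repository, only an illustration:

    import math

    cur_loss = 10000.0              # hypothetical diverged loss value
    try:
        ppl = math.exp(cur_loss)
    except OverflowError:
        ppl = float('inf')          # exp(x) overflows for x > ~709, so report inf instead
    print(ppl)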

utils/batchify() cannot work for unsupervised parsing

Hi, it seems that this function should be different for the PTB dataset when doing unsupervised parsing, right?

Also, could you give detailed guidance on how to train the unsupervised model? I found that a lot of code would need to be changed manually.

Thanks a lot!

High performance for right-branching strategy

I really appreciate you releasing the code.

I found that when I test the right-branching baseline on the WSJ test set, the F1 is really high (39.87), which does not match the result in the paper (16.5).

I have just changed the code

distance = model.distance[0].squeeze().data.cpu().numpy()
distance_in = model.distance[1].squeeze().data.cpu().numpy()

into

distance = numpy.array([numpy.arange(len(sen), 0, -1)] * 3)
distance_in = numpy.array([numpy.arange(len(sen), 0, -1)] * 3)

This represents a right-branching strategy.

And the result on WSJ test set is:
[screenshot of the evaluation output]

So, what might be the reason? Thanks a lot if you could help me out.

Default Parameters

Hi Yikang,

If I want to reproduce your work, what parameters should I use?

In the README, you suggest using the default parameters in main.py. At the same time, you provide another set of parameters: "python main.py --batch_size 20 --dropout 0.45 --dropouth 0.3 --dropouti 0.5 --wdrop 0.45 --chunk_size 10 --seed 141 --epoch 1000 --data /path/to/your/data".

Which one should I use? I tried both, and after 48 hours the quoted parameters outperform the defaults, so I would like to double-check with you.

Thanks,
Ian

Did you use the test data during training in the Unsupervised Parsing experiment?

Reviewing the following code, I find that the training split condition 'WSJ/00/WSJ_0000.MRG' <= id <= 'WSJ/24/WSJ_2499.MRG' covers all sections from 00 through 24, so the training data also contains the validation (22) and test (23) data. Is this correct?

for id in file_ids:
    if 'WSJ/00/WSJ_0000.MRG' <= id <= 'WSJ/24/WSJ_2499.MRG':
        train_file_ids.append(id)
    if 'WSJ/22/WSJ_2200.MRG' <= id <= 'WSJ/22/WSJ_2299.MRG':
        valid_file_ids.append(id)
    if 'WSJ/23/WSJ_2300.MRG' <= id <= 'WSJ/23/WSJ_2399.MRG':
        test_file_ids.append(id)
    # elif 'WSJ/00/WSJ_0000.MRG' <= id <= 'WSJ/01/WSJ_0199.MRG' or 'WSJ/24/WSJ_2400.MRG' <= id <= 'WSJ/24/WSJ_2499.MRG':
    #     rest_file_ids.append(id)

Data Directory used when running test_phrase_grammar.py

Hi Yikang and other Contributors,

Thank you for making the source code public! I am trying to reproduce your results, but I am not sure what path to pass as the --data command-line argument of test_phrase_grammar.py. I downloaded the PTB data and am currently using treebank_3/parsed/mrg as the data argument, but it does not work.

The listings under treebank_3/parsed/mrg:
atis brown readme.mrg swbd wsj

The listings under treebank_3/parsed/mrg/wsj:

00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 MERGE.LOG

Thank you for your time!
Ian

Tuning contextual embeddings with hierarchical relations

I have a masked LM pretrained with BERT.

The embeddings are poor at the sentence level, but do well for base tokens.
There is a natural tree structure to my corpus that I believe stands to gain from something like ON-LSTM.

Do you think swapping out the embedding layer of the ON-LSTM with pretrained BERT embeddings could be fruitful?

Question about the unidirectional ON-LSTM

In your paper, you use a unidirectional ON-LSTM to train a language model and then parse with the output distances of the pretrained language model. How can we explain that the level of the first token is independent of the future tokens? Is there any bidirectional way to do it?

ZeroDivisionError in test_phrase_grammar

Hi, when I run test_phrase_grammar.py, I get the following error:

ZeroDivisionError: float division by zero

This is the specific error:
[screenshot of the full traceback]

checkpoint download

I'm sorry to bother you, but when I try to test this model, I cannot find where to download the checkpoint. With the link you provided, I only found '.txt' files; where can I download 'PTB.pt'?

FileNotFoundError: [Errno 2] No such file or directory: 'PTB.pt'

How to train parsing

Hi
I wonder how to train the parsing model. main.py seems to be only for training the LM.

Besides, when I try testing the parsing, python test_phrase_grammar.py --cuda gives the error No such file or directory: 'PTB.pt'.

Best regards,
Ron

Can main.py run without a GPU?

I installed PyTorch 0.4 with CUDA set to none, and after I run main.py I get an error: torch.cuda.LongTensor is not enabled. I haven't found a similar problem online. Do I need a computer with a GPU and CUDA?
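
The error message suggests the script constructs torch.cuda tensors unconditionally. A generic PyTorch pattern for keeping everything on the CPU looks like the sketch below; this is not the repository's actual code, only an illustration of the device-guard idiom:

    import torch

    # Pick the device once, then move tensors and the model explicitly,
    # instead of constructing torch.cuda.* tensors directly.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    batch = torch.zeros(35, 20, dtype=torch.long, device=device)  # hypothetical (bptt, batch) shape
    # model = model.to(device)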

Where is the `corpus` object?

Hi, I found that test_phrase_grammar.py refers to a corpus object many times, but I could not find its definition or initialization.

What does the chunk_size mean in the ONLSTMStack object?

From the argparser, there is a variable chunk_size, described as “number of units per chunk”. What does this chunk refer to? Is it a mini-batch, part of a batch, or part of the sequence length?

Is it related to the paragraph before section 5 in the paper?

As the master gates only focus on coarse-grained control, modeling them with the same dimensions as the hidden states is computationally expensive and unnecessary. In practice, we set $\tilde{f}_t$ and $\tilde{i}_t$ to be $D_m = \frac{D}{C}$-dimensional vectors, where $D$ is the dimension of the hidden state and $C$ is a chunk size factor. We repeat each dimension $C$ times before the element-wise multiplication with $f_t$ and $i_t$. The downsizing significantly reduces the number of extra parameters that we need to add to the LSTM. Therefore, every neuron within each $C$-sized chunk shares the same master gates.

i.e., where the hidden state of the RNN cell is split into chunks and each chunk is gated individually? If so, could you give an example of what the chunk and hidden-state computation looks like?
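
A small sketch of how I read that paragraph, in plain PyTorch (1.1+ for repeat_interleave); the names below are made up and this is not the repo's ONLSTMCell. With hidden size D and chunk size C, the master gates have D/C dimensions and each value is repeated C times before it gates the ordinary gates:

    import torch

    D, C = 12, 3                       # hypothetical hidden size and chunk size
    n_chunks = D // C                  # master gates live in this smaller space
    master_logits = torch.randn(n_chunks)
    f_master = torch.cumsum(torch.softmax(master_logits, dim=-1), dim=-1)  # cumax
    # Repeat each master-gate value C times so that all C neurons in a chunk
    # share the same master gate before multiplying with the ordinary f_t / i_t.
    f_master_full = f_master.repeat_interleave(C)      # shape: (D,)
    print(f_master_full.shape)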

Question about ON-LSTM hidden state initialization

I find that the hidden vector is initialized with zeros only at the very beginning.
Then, for each batch, the hidden vector is detached by calling

hidden = repackage_hidden(hidden)

My question is: why not create a new hidden vector for each batch instead of reusing the old one, i.e.

hidden = model.init_hidden(batch_size)
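
For readers unfamiliar with the AWD-LSTM convention this repo follows: repackage_hidden keeps the hidden values (so state carries across the contiguous batches produced by batchify) but cuts the autograd graph. A sketch of what such a helper typically looks like; the repo's own version may differ slightly:

    import torch

    def repackage_hidden(h):
        # Detach hidden states from the graph of the previous batch,
        # keeping their values for truncated back-propagation through time.
        if isinstance(h, torch.Tensor):
            return h.detach()
        return tuple(repackage_hidden(v) for v in h)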

Confusion on eq. 15

Dear Yikang,

I am new to NLP, but I really like this paper and appreciate your work. I am now reading the paper and have a question about Eq. (15) in Section 5.2: what does $y_t$ mean? Thank you!

Processing of variable length sequences in a batch

Hello, I have just started learning about language models, and I became very interested in your method after reading the paper. After reading it carefully, however, I have a question I would like to ask you. In the paper, you directly split the words in the corpus into equal-length batches. But when every sentence in a batch has a different length, how should I handle it? I looked into the official PyTorch handling (nn.utils.rnn.pad_packed_sequence), but I don't know whether this method is suitable for your code. Could you please give me some advice?
Thanks
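
For reference, the standard PyTorch padding/packing pattern the question mentions looks roughly like the sketch below. Note, however, that the ON-LSTM stack in this repo is a custom cell and, as far as I can tell, does not consume PackedSequence objects directly, so this is only a generic illustration:

    import torch
    from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

    # Three hypothetical sentences of different lengths, already mapped to word ids.
    sents = [torch.tensor([4, 8, 15]), torch.tensor([16, 23]), torch.tensor([42])]
    lengths = torch.tensor([len(s) for s in sents])

    padded = pad_sequence(sents, batch_first=True)            # shape: (batch, max_len)
    packed = pack_padded_sequence(padded, lengths, batch_first=True,
                                  enforce_sorted=False)       # for nn.LSTM-style modules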
