coreference-resolution's Introduction

Coreference Resolution

PyTorch 0.4.1 | Python 3.6.5

This repository consists of an efficient, annotated PyTorch reimplementation of the EMNLP paper "End-to-end Neural Coreference Resolution" by Lee et al., 2017. Main code can be found in this file.

Data

The source code assumes access to the English train, test, and development data of OntoNotes Release 5.0. This data should be located in a folder called 'data' inside the main directory. The data consists of 2,802 training documents, 343 development documents, and 348 testing documents. The average length of all documents is 454 words with a maximum length of 4,009 words. The number of mentions and coreferences in each document varies drastically, but is generally correlated with document length.

Because the data require a license from the Linguistic Data Consortium, they are not supplied here. Information on how to download and preprocess them can be found here and here, respectively.

Beyond the data, the source files also assume access to both Turian embeddings and GloVe embeddings.

Problem Definition

Coreference occurs when one or more expressions in a document refer back to an entity that came before them. Coreference resolution, then, is the task of finding all expressions that are coreferent with any of the entities found in a given text. While this problem definition seems simple enough, the nomenclature found in papers on coreference resolution is often confusing. Visualizing the terms makes things a bit easier to understand:

Words are colored according to whether they are entities or not. Different colored groups of words are members of the same coreference cluster. Entities that are the only member of their cluster are known as 'singleton' entities.

Why Coreference Resolution is Hard

Entities can be very long, and coreferent entities can occur extremely far away from one another. A brute-force system would compute every possible span (sequence) of tokens and then compare it to every possible span that came before it. This makes the complexity of the problem O(T^4), where T is the document length. For a 100-word document this is on the order of 100 million possible options, and for the longest document in our dataset (4,009 words) it is on the order of 2.6 × 10^14, or roughly 258 trillion possible combinations.
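To make the growth rate concrete, here is a small sketch that counts spans and span pairs exactly. The function names are illustrative and not taken from the repository. The exact pair count is smaller than T^4 by a constant factor of about 8, which does not change the quartic growth.

```python
def count_spans(num_tokens: int) -> int:
    """Number of contiguous spans in a document of num_tokens tokens."""
    return num_tokens * (num_tokens + 1) // 2


def count_span_pairs(num_tokens: int) -> int:
    """Number of unordered pairs of distinct spans."""
    spans = count_spans(num_tokens)
    return spans * (spans - 1) // 2


if __name__ == "__main__":
    for T in (100, 4009):
        # Columns: document length, spans, span pairs, T^4 upper bound
        print(T, count_spans(T), count_span_pairs(T), T ** 4)
```

For T = 100 this gives 5,050 spans and 12,748,725 pairs, against a T^4 bound of 100,000,000.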

If this does not make it concrete, imagine that we had the sentence

* Arya Stark walks her direwolf, Nymeria. *

Here we have three entities: Arya Stark, her, and Nymeria. For a native speaker of English it is trivial to tell that her refers to Arya Stark. But how should a machine with no prior knowledge know that Arya and Stark form a single entity rather than two separate ones, that Nymeria does not refer back to her even though they are arguably related, or even that Arya Stark walks her direwolf, Nymeria is not just one big entity in and of itself?

For another example, consider the sentence

* Napoleon and all of his marvelously dressed, incredibly well-trained, loyal troops marched all the way across Europe to enter Russia in an ultimately unsuccessful effort to conquer it for their country. *

The word their refers to Napoleon and all of his marvelously dressed, incredibly well-trained, loyal troops; entities can span many, many tokens. Coreferent entities can also occur far away from one another.

Model Architecture

As a forewarning, this paper presents a beast of a model. The authors present the following series of images to provide clarity as to what the model is doing.

1. Token Representation

Tokens are represented using 300-dimensional static GloVe embeddings, 50-dimensional static Turian embeddings, and the output of a character-level CNN over 8-dimensional character embeddings, with 50 filters for each of the window sizes 3, 4, and 5. Dropout with p=0.50 is applied to these embeddings. The token representations are passed into a 2-layer bidirectional LSTM with a hidden state size of 200. Dropout with p=0.20 is applied to the output of the LSTM.
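As a sanity check on the sizes above, the following sketch works out the resulting dimensions. It assumes the three parts are concatenated, that each CNN filter is max-pooled to a single value, and that the forward and backward LSTM states are concatenated; the constant names are illustrative.

```python
GLOVE_DIM = 300          # static GloVe embedding size
TURIAN_DIM = 50          # static Turian embedding size
CHAR_FILTERS = 50        # filters per CNN window size
CHAR_WINDOWS = (3, 4, 5)
LSTM_HIDDEN = 200        # hidden size per direction

# Assumption: one max-pooled value per filter, for every window size
char_cnn_dim = CHAR_FILTERS * len(CHAR_WINDOWS)

# Assumption: the token representation is a plain concatenation
token_dim = GLOVE_DIM + TURIAN_DIM + char_cnn_dim

# Assumption: forward and backward states are concatenated
lstm_out_dim = 2 * LSTM_HIDDEN

if __name__ == "__main__":
    print(char_cnn_dim, token_dim, lstm_out_dim)
```

Under these assumptions each token enters the LSTM as a 500-dimensional vector and leaves it as a 400-dimensional one.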

2. Span Representation

Using the regularized output, span representations are computed by extracting the LSTM hidden states between the index of the first word and the last word of the span. An attention mechanism over these hidden states produces a weighted sum. Then, we concatenate the hidden states at the first and last index with the attention-weighted sum and a 20-dimensional feature embedding for the total width (length) of the span under consideration. This is done for all spans up to length 10 in the document.
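A minimal pure-Python sketch of this step, using plain lists in place of tensors. The function names, the per-token attention scores, and the width embedding passed in as an argument are all illustrative assumptions; in the model these would come from learned layers.

```python
import math


def enumerate_spans(num_tokens, max_width=10):
    """All (start, end) pairs, end inclusive, covering at most max_width tokens."""
    return [(start, end)
            for start in range(num_tokens)
            for end in range(start, min(start + max_width, num_tokens))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_representation(states, attn_scores, width_embedding, start, end):
    """Concatenate [first state, last state, attention-weighted sum, width feature]."""
    weights = softmax(attn_scores[start:end + 1])
    dim = len(states[0])
    # Weighted sum of the hidden states inside the span
    head = [sum(w * states[start + k][d] for k, w in enumerate(weights))
            for d in range(dim)]
    return states[start] + states[end] + head + width_embedding


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-dimensional LSTM outputs
    print(len(enumerate_spans(12)))
    print(span_representation(states, [0.0, 0.0, 0.0], [0.5], 0, 1))
```

With 400-dimensional LSTM outputs and a 20-dimensional width embedding, the same concatenation would give a span vector of 3 × 400 + 20 = 1,220 dimensions.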

3. Pruning

The span representations are passed into a 3-layer, 150-dimensional feedforward network with ReLU activations and p=0.20 dropout applied between each layer. The output of this feedforward network is 1-dimensional and represents the 'mention score' of each span in the document. Spans are then accepted in decreasing order of mention score unless, when considering a span i, there exists a previously accepted span j such that START(i) < START(j) <= END(i) < END(j) or START(j) < START(i) <= END(j) < END(i). Only LAMBDA * T spans are kept at the end, where LAMBDA is set to 0.40 and T is the document length.
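The acceptance rule can be sketched as follows. This is an illustration of the rule as stated above, not the repository's remove_overlapping function; the names crosses and prune are hypothetical, and stopping as soon as the budget is filled is an assumption about how the LAMBDA * T cap is applied.

```python
def crosses(span_i, span_j):
    """True if the two spans partially overlap (neither nested nor disjoint)."""
    (s_i, e_i), (s_j, e_j) = span_i, span_j
    return s_i < s_j <= e_i < e_j or s_j < s_i <= e_j < e_i


def prune(spans, scores, num_tokens, lam=0.40):
    """Accept spans by decreasing score, skipping any that cross an accepted span."""
    budget = int(lam * num_tokens)
    ranked = sorted(zip(spans, scores), key=lambda pair: pair[1], reverse=True)
    kept = []
    for span, _ in ranked:
        if len(kept) >= budget:
            break
        if not any(crosses(span, other) for other in kept):
            kept.append(span)
    return sorted(kept)  # restore document order


if __name__ == "__main__":
    spans = [(0, 2), (1, 3), (1, 2), (4, 4)]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(prune(spans, scores, num_tokens=10))
```

In the toy input, (1, 3) is rejected because it crosses the higher-scoring (0, 2), while (1, 2) is kept because it is nested inside (0, 2) rather than crossing it.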

4. Pairwise Representation

For these spans, pairwise representations are computed for a given span i and its candidate antecedent j by concatenating the span representation for span i, the span representation for span j, the element-wise product of these representations, and 20-dimensional feature embeddings for genre, distance between the spans, and whether or not the two spans have the same speaker.
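A small sketch of the concatenation, again with plain lists standing in for tensors. The names candidate_pairs and pair_representation are illustrative, and phi stands for the already-concatenated genre, distance, and speaker feature embeddings.

```python
def candidate_pairs(num_spans):
    """(i, j) index pairs where span j precedes span i in document order."""
    return [(i, j) for i in range(num_spans) for j in range(i)]


def pair_representation(g_i, g_j, phi):
    """Concatenate [g_i, g_j, g_i * g_j (element-wise), phi]."""
    if len(g_i) != len(g_j):
        raise ValueError("span representations must have the same size")
    similarity = [a * b for a, b in zip(g_i, g_j)]
    return g_i + g_j + similarity + phi


if __name__ == "__main__":
    print(candidate_pairs(3))
    print(pair_representation([1.0, 2.0], [3.0, 4.0], [0.5]))
```

The element-wise product keeps the same size as a span representation, so the pairwise vector is three span vectors plus the feature embeddings.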

5. Final Score and Loss

These representations are passed into a feedforward network similar to the one used to score the spans. Clusters are then formed by identifying chains of coreference links (e.g. span j and span k both refer to span i). The learning objective is to maximize the log-likelihood of all correct antecedents that were not pruned.
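The objective for a single span can be sketched as a marginal log-likelihood over its gold antecedents. This sketch assumes, following the paper, a dummy "no antecedent" option with a fixed score of 0 placed at index 0; the function names are illustrative and the code is not taken from coref.py.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def span_loss(antecedent_scores, gold_mask):
    """Negative log of the total probability assigned to gold antecedents.

    gold_mask has one entry more than antecedent_scores: index 0 is the
    dummy antecedent, which is gold when the span has no true antecedent.
    """
    scores = [0.0] + list(antecedent_scores)  # assumption: dummy score fixed at 0
    if len(gold_mask) != len(scores) or not any(gold_mask):
        raise ValueError("gold_mask must match the scores and mark at least one entry")
    log_probs = log_softmax(scores)
    gold = [lp for lp, is_gold in zip(log_probs, gold_mask) if is_gold]
    peak = max(gold)
    return -(peak + math.log(sum(math.exp(lp - peak) for lp in gold)))


if __name__ == "__main__":
    # One candidate scored 0.0, span has no true antecedent
    print(span_loss([0.0], [True, False]))
```

Summing this quantity over all spans in a document gives the loss to minimize; a span whose gold antecedents were all pruned falls back to the dummy option.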

Results

Originally from the paper,

Recent Work

The authors have since published another paper, which achieves an F1 score of 73.0.

coreference-resolution's People

Contributors

573phn, alexandrauma, shayneobrien

coreference-resolution's Issues

probs = [F.softmax(tensr) for tensr in with_epsilon] may be wrong?

The code "probs = [F.softmax(tensor) for tensor in with_epsilon]" is in class Trainer in coref.py. When I train, I get probs (size: span_size * (antecedent_size + 1) * 1) in which every cell has the fixed value 1. Maybe the right code is "probs = [F.softmax(tensor, dim=0) for tensor in with_epsilon]".
My torch version is 1.4.0.
PS: My English is poor; I hope you can understand what I mean.
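To see why normalizing over a size-1 axis would produce the all-ones output described above, here is a pure-Python illustration. It does not call PyTorch and makes no claim about which axis F.softmax picks by default in any given version; it only shows the effect of the axis choice on a column of shape (antecedent_size + 1, 1).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Scores for one span: a dummy antecedent plus two candidates, shape (3, 1)
column = [[0.0], [2.0], [1.0]]

# Normalizing over the last axis (size 1): each row is its own distribution
over_last_axis = [softmax(row) for row in column]

# Normalizing over the first axis: one distribution over all candidates
over_first_axis = softmax([row[0] for row in column])

if __name__ == "__main__":
    print(over_last_axis)   # every entry is 1.0
    print(over_first_axis)  # entries sum to 1
```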

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Hi, when I ran the coref.py file, I encountered a RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation. I've tried PyTorch 0.4.1 (from requirements.txt) and PyTorch 1.0, but both give the same error. Could you please look into this? Thanks!

  File "coref.py", line 692, in <module>
    trainer.train(150)
  File "coref.py", line 458, in train
    self.train_epoch(epoch, *args, **kwargs)
  File "coref.py", line 488, in train_epoch
    corefs_found, total_corefs, corefs_chosen = self.train_doc(doc)
  File "coref.py", line 555, in train_doc
    loss.backward()
  File "/opt/conda/envs/mlkit36/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/mlkit36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Error during training

I got this error while running
python coref.py

loss = torch.sum(torch.log(torch.sum(torch.mul(probs, gold_indexes), dim=1).clamp_(eps, 1-eps), dim=0) * -1)
TypeError: log() got an unexpected keyword argument 'dim'

model does not predict clusters

Hi,

I ran your code on the OntoNotes dataset and found that the model doesn't predict clusters (it predicts that each span is a cluster containing only itself). I tried training for 10 and 150 epochs with the same outcome.

After training, I load the trained model and predict (with a simple adaptation of the predict function from the Trainer class so that it also returns the clusters variable). I get no output clusters for any document in the dataset.

I have attached the terminal output from the training and was wondering if you could tell me why this is.

While running the code, we found that it had a specific bug that occurs only with some documents (a small subset of the overall dataset). To work around this, we added a try/except in the train_epoch function so that these documents are not trained on, and omitted the evaluation step every ten epochs. No other changes were made to the code.

PS. The error we encountered, RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1 (input tensor's size at dimension 0), but got split_sizes=[], is demonstrated at the beginning of the attached file.

It would be great to get some insight into what might be going wrong during training/prediction.

Thanks!

Terminal Saved Output.txt

Is there a problem in function:"remove_overlapping(sorted_spans):" ?

Hey,
Is there a problem in the "def remove_overlapping(sorted_spans):" function at line 98 of utils.py?
I think we want to accept "span i" when "si.i1 < sj.i1 <= si.i2 < sj.i2 OR sj.i1 < si.i1 <= sj.i2 < si.i2".
But "if len(set(taken)) == 1 or (taken[0] == taken[-1] == False):" seems to do the opposite.
For example, if seen = [2, 3, 4, 5, 6] and sj.i1 = 3, sj.i2 = 5, then taken = [True, True, True], so len(set(taken)) == 1 and the span will be appended to the nonoverlapping[] list.

train issue

Hi, shayneobrien:
I'm Xiangyu from China, and I have some questions about your code. When I train with your code, the loss does not decrease. I have checked the code several times. The recall and the precision are almost zero.
Best wishes!

How to preprocess the data?

How do I preprocess the OntoNotes Release 5.0 data? I can't open the link you gave in the README; the website is gone. Can you give another link or describe some other way to preprocess the data?

Pretrained model?

First of all, thank you for this PyTorch translation.
I (like many others) don't have access to the OntoNotes dataset. I know you're not authorized to give away the data, but could you please share the model weights trained on the dataset?
Thank you

What should I do with the data?

I downloaded OntoNotes Release 5.0 and followed e2e-coref's getting started steps.

I created the directories data/train, data/development, and data/test, and placed the data (the output of getting started) in them, e.g. data/train/train.english.v4_gold_conll.

Did I miss anything or do something wrong?

Thanks.

RuntimeError: received an empty list of sequences

When training reaches the evaluation step, it raises this RuntimeError. I hope someone can help me solve it.
The details of the error message are below:
line 567, in evaluate
    predicted_docs = [self.predict(doc) for doc in tqdm(val_corpus) if len(doc) != 0]
line 567, in <listcomp>
    predicted_docs = [self.predict(doc) for doc in tqdm(val_corpus) if len(doc) != 0]
line 596, in predict
    spans, probs = self.model(doc)
line 1102, in _call_impl
    return forward_call(*input, **kwargs)
line 423, in forward
    states, embeds = self.encoder(doc)
line 1102, in _call_impl
    return forward_call(*input, **kwargs)
line 205, in forward
    packed, reorder = pack(embeds)
line 73, in pack
    packed = pack_sequence(sorted_tensors)
line 398, in pack_sequence
    return pack_padded_sequence(pad_sequence(sequences), lengths, enforce_sorted=enforce_sorted)
line 363, in pad_sequence
    return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: received an empty list of sequences

list index out of range in pad_sequence of torch implementation.

During the evaluation stage on the development dataset, I intermittently hit the error below. Have you ever faced this issue, and how did you resolve it?

Traceback (most recent call last):
  File "coref.py", line 693, in <module>
    trainer.train(150)
  File "coref.py", line 459, in train
    self.train_epoch(epoch, *args, **kwargs)
  File "coref.py", line 490, in train_epoch
    corefs_found, total_corefs, corefs_chosen = self.train_doc(doc)
  File "coref.py", line 523, in train_doc
    spans, probs = self.model(document)
  File "/home/rupimanoj/anaconda3/envs/project/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "coref.py", line 424, in forward
    states, embeds = self.encoder(doc)
  File "/home/rupimanoj/anaconda3/envs/project/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "coref.py", line 206, in forward
    packed, reorder = pack(embeds)
  File "/home/rupimanoj/coref/coreference-resolution/src/utils.py", line 74, in pack
    packed = pack_sequence(sorted_tensors)
  File "/home/rupimanoj/anaconda3/envs/project/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 353, in pack_sequence
    return pack_padded_sequence(pad_sequence(sequences), [v.size(0) for v in sequences])
  File "/home/rupimanoj/anaconda3/envs/project/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 311, in pad_sequence
    max_size = sequences[0].size()
IndexError: list index out of range

error when evaluating

Evaluating on validation corpus...
217it [12:27, 5.54s/it]
Traceback (most recent call last):
  File "./src/coref.py", line 690, in <module>
    trainer.train(150)
  File "./src/coref.py", line 467, in train
    results = self.evaluate(self.val_corpus)
  File "./src/coref.py", line 566, in evaluate
    predicted_docs = [self.predict(doc) for doc in tqdm(val_corpus)]
  File "./src/coref.py", line 566, in <listcomp>
    predicted_docs = [self.predict(doc) for doc in tqdm(val_corpus)]
  File "./src/coref.py", line 595, in predict
    spans, probs = self.model(doc)
  File "/home/xtan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    tracing_state._traced_module_stack.append(self)
  File "./src/coref.py", line 429, in forward
    spans, coref_scores = self.score_pairs(spans, g_i, mention_scores)
  File "/home/xtan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    tracing_state._traced_module_stack.append(self)
  File "./src/coref.py", line 347, in forward
    pairs = torch.cat((i_g, j_g, i_g*j_g, phi), dim=1)
RuntimeError: CUDA error: out of memory

I got this error when evaluating on validation corpus.
