lxucs / coref-hoi

PyTorch implementation of the end-to-end coreference resolution model with different higher-order inference methods.

License: Apache License 2.0

Python 98.84% Shell 1.16%

coreference-resolution higher-order nlp pytorch

coref-hoi's Issues

Running on our own CoNLL-U files

I have been able to run the models on the OntoNotes test set, but how do we get predictions for our own CoNLL-U files?
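As a starting point for the question above, here is a minimal sketch of reading a CoNLL-U file into tokenized sentences. It only follows the CoNLL-U layout (tab-separated columns, `#` comment lines, blank lines between sentences); `read_conllu_sentences` is a hypothetical helper, and how the resulting tokens should be handed to this repository's preprocessing is not shown in the thread, so that step is left open.

```python
from typing import Iterable, List


def read_conllu_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Return one list of word forms per sentence from CoNLL-U lines."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        fields = line.split("\t")
        token_id, form = fields[0], fields[1]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append(form)
    if current:
        sentences.append(current)
    return sentences


# Tiny illustrative input (made up for this sketch, not taken from OntoNotes).
SAMPLE = (
    "# text = She saw him.\n"
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsaw\tsee\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\thim\the\tPRON\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tdo\tdo\tAUX\t_\t_\t0\troot\t_\t_\n"
    "2\tn't\tnot\tPART\t_\t_\t1\tadvmod\t_\t_\n"
)

if __name__ == "__main__":
    for sentence in read_conllu_sentences(SAMPLE.splitlines()):
        print(sentence)
```

CoNLL-U carries no coreference or speaker columns, so anything the model expects beyond tokens and sentence boundaries would have to be supplied separately.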

Train on spanbert large, but get F1 1 point lower than presented in paper

Hi,

I trained the spanbert large model with the default parameters in the config file and got an Avg. F1 of 78.27, lower than the Avg. F1 of 79.9 reported in the paper.
The config is as follows:

num_docs = 2802
bert_learning_rate = 1e-05
task_learning_rate = 0.0003
max_segment_len = 512
ffnn_size = 3000
cluster_ffnn_size = 3000
max_training_sentences = 3
bert_tokenizer_name = bert-base-cased

max_top_antecedents = 50
max_training_sentences = 5
top_span_ratio = 0.4
max_num_extracted_spans = 3900
max_num_speakers = 20
max_segment_len = 256

# Learning

bert_learning_rate = 1e-5
task_learning_rate = 2e-4
loss_type = marginalized # {marginalized, hinge}
mention_loss_coef = 0
false_new_delta = 1.5 # For loss_type = hinge
adam_eps = 1e-6
adam_weight_decay = 1e-2
warmup_ratio = 0.1
max_grad_norm = 1 # Set 0 to disable clipping
gradient_accumulation_steps = 1

# Model hyperparameters.

coref_depth = 1 # when 1: no higher order (except for cluster_merging)
higher_order = attended_antecedent # {attended_antecedent, max_antecedent, entity_equalization, span_clustering, cluster_merging}
coarse_to_fine = true
fine_grained = true
dropout_rate = 0.3
ffnn_size = 1000
ffnn_depth = 1
cluster_ffnn_size = 1000 # For cluster_merging
cluster_reduce = mean # For cluster_merging
easy_cluster_first = false # For cluster_merging
cluster_dloss = false # cluster_merging
num_epochs = 24
feature_emb_size = 20
max_span_width = 30
use_metadata = true
use_features = true
use_segment_distance = true
model_heads = true
use_width_prior = true # For mention score
use_distance_prior = true # For mention-ranking score

# Other.

conll_eval_path = dev.english.v4_gold_conll # gold_conll file for dev
conll_test_path = test.english.v4_gold_conll # gold_conll file for test
genres = ["bc", "bn", "mz", "nw", "pt", "tc", "wb"]
eval_frequency = 1000
report_frequency = 100
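To make the pruning settings in this config easier to reason about, here is a rough sketch of how `top_span_ratio`, `max_num_extracted_spans` and `max_top_antecedents` could bound the work done per document. The formula (keep at most `top_span_ratio * num_words` spans, capped by `max_num_extracted_spans`) is the usual one for end-to-end coreference models and is an assumption here, not a quote from this repository's code; both function names are hypothetical.

```python
def num_top_spans(num_words, top_span_ratio=0.4, max_num_extracted_spans=3900):
    """Assumed span budget after pruning: a fraction of the word count, with a hard cap."""
    return int(min(max_num_extracted_spans, top_span_ratio * num_words))


def num_scored_pairs(num_words, max_top_antecedents=50, **span_kwargs):
    """Assumed number of (span, antecedent) pairs scored in the fine-grained step."""
    kept = num_top_spans(num_words, **span_kwargs)
    return kept * min(max_top_antecedents, kept)


if __name__ == "__main__":
    for words in (1000, 20000):
        print(words, num_top_spans(words), num_scored_pairs(words))
```

Under these assumptions a 1000-word document keeps 400 spans and scores 20000 pairs, while the 3900 cap only starts to matter for very long documents.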

Data Set up issue in Basic Set up

Install Python3 dependencies: pip install -r requirements.txt
Create a directory for data that will contain all data files, models and log files; set data_dir = /path/to/data/dir in experiments.conf

After step 1 and 2 I tried step 3 of the Basic setup

Prepare dataset (requiring OntoNotes 5.0 corpus): ./setup_data.sh /path/to/ontonotes /path/to/data/dir

.
.

reference-coreference-scorers/v8.01/test/DataFiles/TC-N.key
reference-coreference-scorers/v8.01/test/test.pl
reference-coreference-scorers/v8.01/test/TestCases.README
bash: conll-2012/v3/scripts/skeleton2conll.sh: No such file or directory

Though there exists a coref_hoi/data/dir/conll-2012/v3/scripts/skeleton2conll.sh file.
Do I need to change any other file before running setup_data.sh?

CUDA out of memory error

Hi,

First, I want to thank you so much for your valuable efforts, and this perfectly comprehensible and clean code.

I do not know whether I should ask this here, but I ran into a CUDA out of memory error in the evaluation phase, something like this: RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 7.93 GiB total capacity; 4.76 GiB already allocated; 948.81 MiB free; 6.23 GiB reserved in total by PyTorch).

I first hit this error in the training phase. I reduced some parameters in the experiments.conf file to lower GPU usage, and that worked: training now completes. However, the error still appears in the evaluation phase no matter how much I decrease parameters such as the span width, max_sentence_len, or the ffnn size. Have you had the same problem, or do you have any suggestions?

I am currently using GeForce GTX 1080 with 8GB memory.

Many thanks,
Arad
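One possible reason evaluation can need more memory than training, offered as an assumption rather than a confirmed diagnosis: if training truncates each document to `max_training_sentences` segments while evaluation processes whole documents, the number of candidate spans grows with the full document length. The sketch below only counts candidate spans under `max_span_width`; `num_candidate_spans` is a hypothetical helper and the 4000-token document is an invented example.

```python
def num_candidate_spans(num_tokens, max_span_width=30):
    """Count all spans of width 1..max_span_width inside a document of num_tokens tokens."""
    return sum(min(max_span_width, num_tokens - start) for start in range(num_tokens))


if __name__ == "__main__":
    max_training_sentences = 5   # value from the config quoted in another issue
    max_segment_len = 256        # value from the same config
    doc_tokens = 4000            # hypothetical long evaluation document

    # Assumption: training sees at most this many tokens of a document.
    train_tokens = min(doc_tokens, max_training_sentences * max_segment_len)

    print("training candidates:  ", num_candidate_spans(train_tokens))
    print("evaluation candidates:", num_candidate_spans(doc_tokens))
```

If this assumption holds, lowering `max_training_sentences` would help training but not evaluation, which matches the behaviour described above; settings that apply to every document (for example the span width or the span budget) would affect both.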

ValueError when predicting

All the required data and models have been downloaded to the proper paths.

I tried to run predict.py with the command:
python predict.py --config_name=train_spanbert_large_ml0_d2 --model_identifier=May08_12-38-29_58000 --gpu_id=0
and encountered a ValueError:

Traceback (most recent call last):
  File "predict.py", line 71, in <module>
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
  File "/home/qliu/anaconda3/envs/e2e/lib/python3.6/site-packages/spacy/language.py", line 754, in add_pipe
    raise ValueError(err)
ValueError: [E966] nlp.add_pipe now takes the string name of the registered component factory, not a callable component. Expected string, but got <spacy.pipeline.sentencizer.Sentencizer object at 0x7f7fabe3f288> (name: 'None').

If you created your component with nlp.create_pipe('name'): remove nlp.create_pipe and call nlp.add_pipe('name') instead.

If you passed in a component like TextCategorizer(): call nlp.add_pipe with the string name instead, e.g. nlp.add_pipe('textcat').

If you're using a custom component: Add the decorator @Language.component (for function components) or @Language.factory (for class components / factories) to your custom component and assign it a name, e.g. @Language.component('your_name'). You can then run nlp.add_pipe('your_name') to add it to the pipeline.
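The error message itself names the fix for spaCy 3.x: drop `nlp.create_pipe` and pass the component name as a string. Below is a minimal sketch of a helper that tolerates both API styles; `add_sentencizer` is a hypothetical name, and the fallback relies on the assumption that spaCy 2.x raises `ValueError` when `add_pipe` receives a string, which should be checked against the installed version.

```python
def add_sentencizer(nlp):
    """Add the built-in sentencizer to a spaCy pipeline, old or new API."""
    try:
        # spaCy 3.x style: pass the registered factory name as a string.
        nlp.add_pipe("sentencizer")
    except ValueError:
        # Assumed spaCy 2.x behaviour: strings are rejected, so build the
        # component first and add the resulting callable.
        nlp.add_pipe(nlp.create_pipe("sentencizer"))
    return nlp


# Intended usage (requires spaCy to be installed):
#   from spacy.lang.en import English
#   nlp = add_sentencizer(English())
```

If only spaCy 3.x needs to be supported, replacing line 71 of predict.py with `nlp.add_pipe('sentencizer')` is the change the error message suggests; pinning spaCy to a 2.x release is the other way around it.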

train on bert base

Hello, I'd like to know what results this model gets when trained on bert_base. I trained on bert base with c2f using python run.py train_bert_base_ml0_d2, but only got about 67 F1.

License

The repo does not contain any license specification. It would be great if you could license it explicitly under a FOSS license so that further research can build upon this great code!
Personally I'd suggest the MIT license, but an Apache or a GPL variety could also be a great choice.

Most of these licenses require attribution in source code distributions so you would have to be credited (as you should be 😃).

Custom training data for coref-hoi

Hi all,
I was wondering whether it is possible to train this model on custom data that one prepares oneself. If so, how does one do this with coref-hoi? Will it convert a txt file to the right format, or does one have to convert it to a CoNLL file first? Can it be CoNLL-U? Thank you very much.

which checkpoint of the trained weights should I use?

Hi lxucs,
There are 2 checkpoints of the trained weights. Which one was used in your paper?
Thanks

Below is an example:

train_spanbert_large_ml0_cm_fn1000_max_dloss/model_May14_05-15-38_63000.bin
train_spanbert_large_ml0_cm_fn1000_max_dloss/model_May22_23-31-16_66000.bin
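The two file names differ only in their date stamp and trailing number, and they have the same shape as the `--model_identifier` value in the predict.py command quoted in another issue on this page. A small sketch for pulling these parts out of a checkpoint path follows; `parse_checkpoint` is a hypothetical helper, and reading the trailing number as a training step is an assumption.

```python
import re
from pathlib import PurePosixPath

# model_<identifier>.bin, where the identifier ends in _<digits>
CKPT_PATTERN = re.compile(r"^model_(?P<identifier>.+_(?P<step>\d+))\.bin$")


def parse_checkpoint(path):
    """Split 'config_name/model_<identifier>.bin' into (config_name, identifier, step)."""
    posix_path = PurePosixPath(path)
    match = CKPT_PATTERN.match(posix_path.name)
    if match is None:
        raise ValueError(f"unexpected checkpoint file name: {posix_path.name}")
    return posix_path.parent.name, match["identifier"], int(match["step"])


if __name__ == "__main__":
    for ckpt in (
        "train_spanbert_large_ml0_cm_fn1000_max_dloss/model_May14_05-15-38_63000.bin",
        "train_spanbert_large_ml0_cm_fn1000_max_dloss/model_May22_23-31-16_66000.bin",
    ):
        print(parse_checkpoint(ckpt))
```

This only separates the two candidates; which of them produced the paper's numbers is something only the author can confirm.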

Training issue: with bert_base

Hi @lxucs,

I want to train a model for bert_base with no HOI like the spanbert_large_ml0_d1 model

python run.py bert_base 0

Got this issue:

Traceback (most recent call last):
  File "run.py", line 289, in <module>
    model = runner.initialize_model()
  File "run.py", line 51, in initialize_model
    model = CorefModel(self.config, self.device)
  File "/VL/space/sushantakp/research_work/coref-hoi/model.py", line 33, in __init__
    self.bert = BertModel.from_pretrained(config['bert_pretrained_name_or_path'])
  File "/VL/space/sushantakp/.conda/envs/skp_env376/lib/python3.7/site-packages/transformers/modeling_utils.py", line 935, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load weights for 'bert-base-cased'. Make sure that:

- 'bert-base-cased' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'bert-base-cased' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

Do I need to change any parameter in experiments.conf?

To handle above issue
to train with HOI/ No HOI
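The OSError lists two conditions, and the second one can be checked locally: whether `bert-base-cased` resolves to a directory on disk and, if so, whether it contains one of the weight files named in the message. A minimal sketch, using only the file names quoted in the error; `inspect_model_source` is a hypothetical helper, not part of transformers or of this repository.

```python
from pathlib import Path

# File names taken verbatim from the error message.
WEIGHT_FILE_NAMES = ("pytorch_model.bin", "tf_model.h5", "model.ckpt")


def inspect_model_source(name_or_path):
    """Return (is_local_dir, weight_files_found) for a model name or path."""
    path = Path(name_or_path)
    if not path.is_dir():
        # Not a directory relative to the current working directory.
        return False, []
    found = [name for name in WEIGHT_FILE_NAMES if (path / name).is_file()]
    return True, found


if __name__ == "__main__":
    is_dir, found = inspect_model_source("bert-base-cased")
    if is_dir and not found:
        print("A local 'bert-base-cased' directory exists but holds no weight file.")
    elif not is_dir:
        print("No local directory with that name; the string is used as a model identifier.")
    else:
        print("Local weight files:", found)
```

If the name is not a local directory, the remaining condition in the message is that the identifier can be resolved and downloaded, which depends on network access from the training machine.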

trained weights for base

Good work.
I only see weights for large. Could you also provide weights for base? They would be much easier to handle for debugging.
Thanks.

How to analyse the result of a model?

Hi @lxucs
Please share brief information about the use of analyze.py.

Preprocess - Split into segments function

Hi again Liyan,

I have some brief questions about splitting documents into segments. As far as I can tell from the split_into_segments function in preprocess.py, a segment can contain more than one sentence. Wouldn't it be better for each segment to contain a single sentence? I could not see the intuition behind the current approach. Is it simply better to have longer segments, is it for more efficient use of resources, or was it tested in practice and found to give the trained model better accuracy?

Thanks,
Arad
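To illustrate the trade-off being asked about, here is a simplified sketch of greedy packing: consecutive sentences are grouped into one segment as long as the total stays within `max_segment_len`. It is not the repository's split_into_segments function (which is not reproduced in this thread); `pack_sentences` and the sentence lengths are made up for the example.

```python
def pack_sentences(sentence_lengths, max_segment_len):
    """Greedily group consecutive sentences into segments of at most max_segment_len tokens.

    Returns a list of segments, each a list of sentence indices.
    """
    segments, current, current_len = [], [], 0
    for index, length in enumerate(sentence_lengths):
        if length > max_segment_len:
            raise ValueError(f"sentence {index} is longer than max_segment_len")
        if current and current_len + length > max_segment_len:
            # The next sentence would overflow: close the segment at a sentence boundary.
            segments.append(current)
            current, current_len = [], 0
        current.append(index)
        current_len += length
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    lengths = [100, 120, 60, 200, 30]   # hypothetical token counts per sentence
    print(pack_sentences(lengths, max_segment_len=256))
    print([[i] for i in range(len(lengths))])  # one sentence per segment, for comparison
```

With these numbers packing gives three segments instead of five, so fewer encoder passes are needed and each sentence is encoded together with its neighbours; whether that also improves accuracy is the empirical question raised above.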