
self-attentive-parser's Introduction

Berkeley Neural Parser

A high-accuracy parser with models for 11 languages, implemented in Python. Based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018, with additional changes described in Multilingual Constituency Parsing with Self-Attention and Pre-Training.

New February 2021: Version 0.2.0 of the Berkeley Neural Parser is now out, with higher-quality pre-trained models for all languages. Inference now uses PyTorch instead of TensorFlow (training has always been PyTorch-only). Drops support for Python 2.7 and 3.5. Includes updated support for training and using your own parsers, based on your choice of pre-trained model.

Contents

  1. Installation
  2. Usage
  3. Available Models
  4. Training
  5. Reproducing Experiments
  6. Citation
  7. Credits

If you are primarily interested in training your own parsing models, skip to the Training section of this README.

Installation

To install the parser, run the command:

$ pip install benepar

Note: benepar 0.2.0 is a major upgrade over the previous version, and comes with entirely new and higher-quality parser models. If you are not ready to upgrade, you can pin your benepar version to the previous release (0.1.3).

Python 3.6 (or newer) and PyTorch 1.6 (or newer) are required. See the PyTorch website for instructions on how to select between GPU-enabled and CPU-only versions of PyTorch; benepar will automatically use the GPU if it is available to PyTorch.
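To check whether PyTorch can see a GPU (and therefore whether benepar will run on it), you can use PyTorch's standard query:

>>> import torch
>>> torch.cuda.is_available()  # True if a CUDA GPU is visible to PyTorch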

The recommended way of using benepar is through integration with spaCy. If using spaCy, you should install a spaCy model for your language. For English, the installation command is:

$ python -m spacy download en_core_web_md

The spaCy model is only used for tokenization and sentence segmentation. If language-specific analysis beyond parsing is not required, you may also forego a language-specific model and instead use a multi-language model that only performs tokenization and segmentation. One such model, newly added in spaCy 3.0, should work for English, German, Korean, Polish, and Swedish (but not Chinese, since it doesn't seem to support Chinese word segmentation).
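For example, here is a minimal sketch of that setup, assuming the spaCy 3.0 multi-language model referred to above is xx_sent_ud_sm (the model name is an assumption here; any spaCy model that provides tokenization and sentence segmentation should work the same way):

$ python -m spacy download xx_sent_ud_sm

>>> import benepar, spacy
>>> nlp = spacy.load("xx_sent_ud_sm")  # provides tokenization and sentence segmentation only
>>> nlp.add_pipe("benepar", config={"model": "benepar_en3"})
>>> doc = nlp("The time for action is now.")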

Parsing models need to be downloaded separately, using the commands:

>>> import benepar
>>> benepar.download('benepar_en3')

See the Available Models section below for a full list of models.

Usage

Usage with spaCy (recommended)

The recommended way of using benepar is through its integration with spaCy:

>>> import benepar, spacy
>>> nlp = spacy.load('en_core_web_md')
>>> if spacy.__version__.startswith('2'):
        nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
    else:
        nlp.add_pipe("benepar", config={"model": "benepar_en3"})
>>> doc = nlp("The time for action is now. It's never too late to do something.")
>>> sent = list(doc.sents)[0]
>>> print(sent._.parse_string)
(S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. .))
>>> sent._.labels
('S',)
>>> list(sent._.children)[0]
The time for action

Since spaCy does not provide an official constituency parsing API, all methods are accessible through the extension namespaces Span._ and Token._.

The following extension properties are available:

  • Span._.labels: a tuple of labels for the given span. A span may have multiple labels when there are unary chains in the parse tree.
  • Span._.parse_string: a string representation of the parse tree for a given span.
  • Span._.constituents: an iterator over Span objects for sub-constituents in a pre-order traversal of the parse tree.
  • Span._.parent: the parent Span in the parse tree.
  • Span._.children: an iterator over child Spans in the parse tree.
  • Token._.labels, Token._.parse_string, Token._.parent: these behave the same as calling the corresponding method on the length-one Span containing the token.

These methods will raise an exception when called on a span that is not a constituent in the parse tree. Such errors can be avoided by traversing the parse tree starting at either sentence level (by iterating over doc.sents) or with an individual Token object.
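For example, here is a minimal sketch (reusing the doc from the example above) of a safe traversal: it starts at each sentence and descends only through Span._.children, so every span it visits is a constituent and no exception is raised.

>>> def show(span, depth=0):
        # Print this constituent's labels and its text, then recurse into its children.
        print("  " * depth, span._.labels, span.text)
        for child in span._.children:
            show(child, depth + 1)
>>> for sent in doc.sents:
        show(sent)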

Usage with NLTK

There is also an NLTK interface, which is designed for use with pre-tokenized datasets and treebanks, or when integrating the parser into an NLP pipeline that already performs (at minimum) tokenization and sentence splitting. For parsing starting with raw text, it is strongly encouraged that you use spaCy and benepar.BeneparComponent instead.

Sample usage with NLTK:

>>> import benepar
>>> parser = benepar.Parser("benepar_en3")
>>> input_sentence = benepar.InputSentence(
    words=['"', 'Fly', 'safely', '.', '"'],
    space_after=[False, True, False, False, False],
    tags=['``', 'VB', 'RB', '.', "''"],
    escaped_words=['``', 'Fly', 'safely', '.', "''"],
)
>>> tree = parser.parse(input_sentence)
>>> print(tree)
(TOP (S (`` ``) (VP (VB Fly) (ADVP (RB safely))) (. .) ('' '')))

Not all fields of benepar.InputSentence are required, but at least one of words and escaped_words must be specified. The parser will attempt to guess the value for missing fields, for example:

>>> input_sentence = benepar.InputSentence(
    words=['"', 'Fly', 'safely', '.', '"'],
)
>>> parser.parse(input_sentence)

Use parse_sents to parse multiple sentences.

>>> input_sentence1 = benepar.InputSentence(
    words=['The', 'time', 'for', 'action', 'is', 'now', '.'],
)
>>> input_sentence2 = benepar.InputSentence(
    words=['It', "'s", 'never', 'too', 'late', 'to', 'do', 'something', '.'],
)
>>> parser.parse_sents([input_sentence1, input_sentence2])

Some parser models also allow Unicode text input for debugging/interactive use, but passing in raw text strings is strongly discouraged for any application where parsing accuracy matters.

>>> parser.parse('"Fly safely."')  # For debugging/interactive use only.

When parsing from raw text, we recommend using spaCy and benepar.BeneparComponent instead. The reason is that parser models do not ship with a tokenizer or sentence splitter, and some models may not include a part-of-speech tagger either. A toolkit must be used to fill in these pipeline components, and spaCy outperforms NLTK in all of these areas (sometimes by a large margin).

Available Models

The following trained parser models are available. To use spaCy integration, you will also need to install a spaCy model for the appropriate language.

  • benepar_en3 (English): 95.40 F1 on the revised WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-small.
  • benepar_en3_large (English): 96.29 F1 on the revised WSJ test set. Uses the same revised tokenization and annotation guidelines as benepar_en3. Based on T5-large.
  • benepar_zh2 (Chinese): 92.56 F1 on the CTB 5.1 test set. Usage with spaCy supports parsing from raw text, but the NLTK API only supports parsing previously tokenized sentences. Based on Chinese ELECTRA-180G-large.
  • benepar_ar2 (Arabic): 90.52 F1 on the SPMRL 2013/2014 test set. Only supports the NLTK API for parsing previously tokenized sentences; parsing from raw text and spaCy integration are not supported. Based on XLM-R.
  • benepar_de2 (German): 92.10 F1 on the SPMRL 2013/2014 test set. Based on XLM-R.
  • benepar_eu2 (Basque): 93.36 F1 on the SPMRL 2013/2014 test set. Usage with spaCy first requires implementing Basque support in spaCy. Based on XLM-R.
  • benepar_fr2 (French): 88.43 F1 on the SPMRL 2013/2014 test set. Based on XLM-R.
  • benepar_he2 (Hebrew): 93.98 F1 on the SPMRL 2013/2014 test set. Only supports the NLTK API for parsing previously tokenized sentences; parsing from raw text and spaCy integration are not supported. Based on XLM-R.
  • benepar_hu2 (Hungarian): 96.19 F1 on the SPMRL 2013/2014 test set. Usage with spaCy requires a Hungarian model for spaCy; the NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.
  • benepar_ko2 (Korean): 91.72 F1 on the SPMRL 2013/2014 test set. Can be used with spaCy's multi-language sentence segmentation model (requires spaCy v3.0); the NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.
  • benepar_pl2 (Polish): 97.15 F1 on the SPMRL 2013/2014 test set. Based on XLM-R.
  • benepar_sv2 (Swedish): 92.21 F1 on the SPMRL 2013/2014 test set. Can be used with spaCy's multi-language sentence segmentation model (requires spaCy v3.0). Based on XLM-R.
  • benepar_en3_wsj (English): Consider using benepar_en3 or benepar_en3_large instead. 95.55 F1 on the canonical WSJ test set used in decades of English constituency parsing publications. Based on BERT-large-uncased. We believe that the revised annotation guidelines used for training benepar_en3/benepar_en3_large are more suitable for downstream use because they better handle language usage in web text and are more consistent with modern practices in dependency parsing and libraries like spaCy. Nevertheless, we provide benepar_en3_wsj for cases where using the revised treebanking conventions is not appropriate, such as benchmarking different models on the same dataset.

Training

Training requires cloning this repository from GitHub. While the model code in src/benepar is distributed in the benepar package on PyPI, the training and evaluation scripts directly under src/ are not.
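For example, a typical setup (assuming the repository's GitHub location) would be:

$ git clone https://github.com/nikitakit/self-attentive-parser.git
$ cd self-attentive-parser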

Software Requirements for Training

  • Python 3.7 or higher.
  • PyTorch 1.6.0, or any compatible version.
  • All dependencies required by the benepar package, including NLTK 3.2, torch-struct 0.4, and transformers 4.3.0 (or compatible versions).
  • pytokenizations 0.7.2 or compatible.
  • EVALB. Before starting, run make inside the EVALB/ directory to compile an evalb executable. This will be called from Python for evaluation. If training on the SPMRL datasets, you will need to run make inside the EVALB_SPMRL/ directory instead. (Example commands are shown just below this list.)
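For example, from the root of the cloned repository:

$ cd EVALB
$ make
$ cd ..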

Training Instructions

A new model can be trained using the command python src/main.py train .... Some of the available arguments are:

  • --model-path-base: Path base to use for saving models. Default: N/A.
  • --evalb-dir: Path to the EVALB directory. Default: EVALB/.
  • --train-path: Path to training trees. Default: data/wsj/train_02-21.LDC99T42.
  • --train-path-text: Optional non-destructive tokenization of the training data. Default: guess raw text; see --text-processing.
  • --dev-path: Path to development trees. Default: data/wsj/dev_22.LDC99T42.
  • --dev-path-text: Optional non-destructive tokenization of the development data. Default: guess raw text; see --text-processing.
  • --text-processing: Heuristics for guessing raw text from destructively tokenized tree files; see load_trees() in src/treebanks.py. Default: default rules for languages other than Arabic, Chinese, and Hebrew.
  • --subbatch-max-tokens: Maximum number of tokens to process in parallel while training (a full batch may not fit in GPU memory). Default: 2000.
  • --parallelize: Distribute pre-trained model (e.g. T5) layers across multiple GPUs. Default: use at most one GPU.
  • --batch-size: Number of examples per training update. Default: 32.
  • --checks-per-epoch: Number of development evaluations per epoch. Default: 4.
  • --numpy-seed: NumPy random seed. Default: random.
  • --use-pretrained: Use a pre-trained encoder. Default: do not use a pre-trained encoder.
  • --pretrained-model: Model to use if --use-pretrained is passed; may be a path or a model id from the HuggingFace Model Hub. Default: bert-base-uncased.
  • --predict-tags: Add a part-of-speech tagging component and auxiliary loss to the parser. Default: do not predict tags.
  • --use-chars-lstm: Use learned CharLSTM word representations. Default: do not use a CharLSTM.
  • --use-encoder: Use learned transformer layers on top of the pre-trained model or CharLSTM. Default: do not use extra transformer layers.
  • --num-layers: Number of transformer layers to use if --use-encoder is passed. Default: 8.
  • --encoder-max-len: Maximum sentence length (in words) allowed for the extra transformer layers. Default: 512.

Additional arguments are available for other hyperparameters; see make_hparams() in src/main.py. These can be specified on the command line, such as --num-layers 2 (for numerical parameters), --predict-tags (for boolean parameters that default to False), or --no-XXX (for boolean parameters that default to True).
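For illustration only, a command combining several of the flags above might look like the following (the flags come from the table above, but this particular combination and these values are just an example, not a recommended configuration):

python src/main.py train \
  --model-path-base models/en_charlstm \
  --use-chars-lstm \
  --use-encoder --num-layers 2 \
  --predict-tags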

For each development evaluation, the F-score on the development set is computed and compared to the previous best. If the current model is better, the previous model will be deleted and the current model will be saved. The new filename will be derived from the provided model path base and the development F-score.
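For example, with --model-path-base models/en_bert_base, the best checkpoint would be saved under a name of roughly the following form (the F-score in the filename will vary from run to run):

models/en_bert_base_dev=95.67.pt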

Before training the parser, you will need to obtain appropriate training data. We provide instructions on how to process standard datasets like PTB, CTB, and the SPMRL 2013/2014 shared task data. After following the instructions for the English WSJ data, you can use the following command to train an English parser using the default hyperparameters:

python src/main.py train --use-pretrained --model-path-base models/en_bert_base

See EXPERIMENTS.md for more examples of good hyperparameter choices.

Evaluation Instructions

A saved model can be evaluated on a test corpus using the command python src/main.py test ... with the following arguments:

  • --model-path: Path of the saved model. Default: N/A.
  • --evalb-dir: Path to the EVALB directory. Default: EVALB/.
  • --test-path: Path to test trees. Default: data/23.auto.clean.
  • --test-path-text: Optional non-destructive tokenization of the test data. Default: guess raw text; see --text-processing.
  • --text-processing: Heuristics for guessing raw text from destructively tokenized tree files; see load_trees() in src/treebanks.py. Default: default rules for languages other than Arabic, Chinese, and Hebrew.
  • --test-path-raw: Alternative path to test trees, used for evalb only (a double-check that evaluation against pre-processed trees does not contain any bugs). Default: compare to trees from --test-path.
  • --subbatch-max-tokens: Maximum number of tokens to process in parallel (a GPU does not have enough memory to process the full dataset in one batch). Default: 500.
  • --parallelize: Distribute pre-trained model (e.g. T5) layers across multiple GPUs. Default: use at most one GPU.
  • --output-path: Path to write predicted trees to (use "-" for stdout). Default: do not save predicted trees.
  • --no-predict-tags: Use gold part-of-speech tags when running EVALB. This is the standard for publications, and omitting this flag may give erroneously high F1 scores. Default: use predicted part-of-speech tags for EVALB, if available.

As an example, you can evaluate a trained model using the following command:

python src/main.py test --model-path models/en_bert_base_dev=*.pt

Exporting Models for Inference

The benepar package can directly use saved checkpoints by replacing a model name like benepar_en3 with a path such as models/en_bert_base_dev=95.67.pt. However, releasing the single-file checkpoints has a few shortcomings:

  • Single-file checkpoints do not include the tokenizer or pre-trained model config. These can generally be downloaded automatically from the HuggingFace model hub, but this requires an Internet connection and may also (incidentally and unnecessarily) download pre-trained weights from the HuggingFace Model Hub
  • Single-file checkpoints are 3x larger than necessary, because they save optimizer state

Use src/export.py to convert a checkpoint file into a directory that encapsulates everything about a trained model. For example,

python src/export.py export \
  --model-path models/en_bert_base_dev=*.pt \
  --output-dir=models/en_bert_base

When exporting, there is also a --compress option that slightly adjusts model weights, so that the output directory can be compressed into a ZIP archive of much smaller size. We use this for our official model releases, because it's a hassle to distribute model weights that are 2GB+ in size. When using the --compress option, it is recommended to specify a test set in order to verify that compression indeed has minimal impact on parsing accuracy. Using the development data for verification is not recommended, since the development data was already used for the model selection criterion during training.

python src/export.py export \
  --model-path models/en_bert_base_dev=*.pt \
  --output-dir=models/en_bert_base \
  --compress \
  --test-path=data/wsj/test_23.LDC99T42

The src/export.py script also has a test subcommand that's roughly similar to python src/main.py test, except that it supports exported models and has slightly different flags. We can run the following command to verify that our English parser using BERT-large-uncased indeed achieves 95.55 F1 on the canonical WSJ test set:

python src/export.py test --model-path benepar_en3_wsj --test-path data/wsj/test_23.LDC99T42

Reproducing Experiments

See EXPERIMENTS.md for instructions on how to reproduce experiments reported in our ACL 2018 and 2019 papers.

Citation

If you use this software for research, please cite our papers as follows:

@inproceedings{kitaev-etal-2019-multilingual,
    title = "Multilingual Constituency Parsing with Self-Attention and Pre-Training",
    author = "Kitaev, Nikita  and
      Cao, Steven  and
      Klein, Dan",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1340",
    doi = "10.18653/v1/P19-1340",
    pages = "3499--3505",
}

@inproceedings{kitaev-klein-2018-constituency,
    title = "Constituency Parsing with a Self-Attentive Encoder",
    author = "Kitaev, Nikita  and
      Klein, Dan",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P18-1249",
    doi = "10.18653/v1/P18-1249",
    pages = "2676--2686",
}

Credits

The code in this repository and portions of this README are based on https://github.com/mitchellstern/minimal-span-parser

self-attentive-parser's People

Contributors

mitchellstern, nikitakit


self-attentive-parser's Issues

No tensorflow 2.0 support?

I'm getting the error AttributeError: module 'tensorflow' has no attribute 'GraphDef' which appears to be an error resulting from using tensorflow 2.0. Is there any plan to add support for tensorflow 2.0? In the meantime what should I do to fix this?

Error in downloading benepar_en

Hi,

When I try to use NLTK with benepar, I have this problem: I can't download benepar_en.gz. Can anyone help me with this? Thanks a lot!

Resource benepar_en.gz not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
benepar.download('benepar_en.gz')

ValueError: No op named GatherV2 in defined operations

Hello, I want to run parser = benepar.Parser("benepar_en"), but it raises an error like this: "ValueError: No op named GatherV2 in defined operations". Could you tell me your Python version, tensorflow-gpu version, and cuDNN version? I don't know how to fix this error. Thank you very much.

RuntimeError when trying to train a new model

First of all, thank you for sharing the code for this great work.

I'm trying to train a model in the most simple setup using the following command-line arguments:
(python self-attentive-parser-master/src/main.py) train --model-path-base . --train-path self-attentive-parser-master\data\02-21.10way.clean --use-words

using python 3.6 on windows 10 with the latest pytorch, cython etc. I get the following error after "Training...":

Traceback (most recent call last):
  File "self-attentive-parser-master/src/main.py", line 612, in <module>
    main()
  File "self-attentive-parser-master/src/main.py", line 608, in main
    args.callback(args)
  File "self-attentive-parser-master/src/main.py", line 564, in <lambda>
    subparser.set_defaults(callback=lambda args: run_train(args, hparams))
  File "self-attentive-parser-master/src/main.py", line 312, in run_train
    _, loss = parser.parse_batch(subbatch_sentences, subbatch_trees)
  File "self-attentive-parser-master\src\parse_nk.py", line 1010, in parse_batch
    annotations, _ = self.encoder(emb_idxs, batch_idxs, extra_content_annotations=extra_content_annotations)
  File "venv_parsing_36\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "self-attentive-parser-master\src\parse_nk.py", line 607, in forward
    res, timing_signal, batch_idxs = emb(xs, batch_idxs, extra_content_annotations=extra_content_annotations)
  File "venv_parsing_36\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "self-attentive-parser-master\src\parse_nk.py", line 486, in forward
    for x, emb, emb_dropout in zip(xs, self.embs, self.emb_dropouts)
  File "self-attentive-parser-master\src\parse_nk.py", line 486, in <listcomp>
    for x, emb, emb_dropout in zip(xs, self.embs, self.emb_dropouts)
  File "venv_parsing_36\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "venv_parsing_36\lib\site-packages\torch\nn\modules\sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "venv_parsing_36\lib\site-packages\torch\nn\functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.IntTensor instead (while checking arguments for embedding)

Process finished with exit code 1

Trying to run with 'use_cuda = False' in parse_nk.py I get the same error (with 'torch.IntTensor' instead of 'torch.cuda.IntTensor'), so it doesn't seem to be cuda-related.

To make sure this is not a compatibility issue, I tried running in another virtual environment with python 3.6, cython 0.25.2 and pytorch 0.4.1 (with which the code was originally tested, according to the documentation), and I get the same error, with 'torch.cpu.IntTensor' replaced by 'CUDAIntTensor'.

I found some references for this error on the web but nothing helpful. Have you encountered this error? Any idea what's causing it?

Thanks

cannot download any model (certificate verify failed)

benepar.download('benepar_en')
[nltk_data] Error loading benepar_en: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data] (_ssl.c:847)>
False

Attempts to download the models as in the instructions fail due to the above. Any ideas?

Error as I follow the "Usage with spaCy" example

Hi,

On macosx 10.13.5 I got an error at the 'nlp.add_pipe(BeneparComponent("benepar_en"))' statement:

`
$ python
Python 3.6.4 |Anaconda custom (64-bit)| (default, Dec 21 2017, 15:39:08)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

import benepar
/Users/colingoldberg/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
benepar.download('benepar_en')
[nltk_data] Downloading package benepar_en to
[nltk_data] /Users/colingoldberg/nltk_data...
True
import spacy
from benepar.spacy_plugin import BeneparComponent
nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent("benepar_en"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/colingoldberg/anaconda3/lib/python3.6/site-packages/benepar/spacy_plugin.py", line 73, in __init__
    super(BeneparComponent, self).__init__(filename, batch_size)
  File "/Users/colingoldberg/anaconda3/lib/python3.6/site-packages/benepar/base_parser.py", line 163, in __init__
    tf.import_graph_def(graph_def, name='')
  File "/Users/colingoldberg/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 316, in new_func
    return func(*args, **kwargs)
  File "/Users/colingoldberg/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 541, in import_graph_def
    raise ValueError('No op named %s in defined operations.' % node.op)
ValueError: No op named ClipByValue in defined operations.

`

I also tried this on an AWS EC2 (t2.small) instance (2GB memory), and got the following error:

`

python

Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import benepar
benepar.download('benepar_en')
[nltk_data] Downloading package benepar_en to /root/nltk_data...
[nltk_data] Package benepar_en is already up-to-date!
True
import spacy
from benepar.spacy_plugin import BeneparComponent
nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent("benepar_en"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/share/anaconda3/lib/python3.6/site-packages/benepar/spacy_plugin.py", line 73, in __init__
    super(BeneparComponent, self).__init__(filename, batch_size)
  File "/usr/local/share/anaconda3/lib/python3.6/site-packages/benepar/base_parser.py", line 163, in __init__
    tf.import_graph_def(graph_def, name='')
  File "/usr/local/share/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/share/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 486, in import_graph_def
    with c_api_util.tf_buffer(graph_def.SerializeToString()) as serialized:
MemoryError

`

I look forward to having this issue resolved so that I can continue testing.

Thank you.

Colin Goldberg

Problem with benepar.download("benepar_en")

Hi,

I want to use benepar, so I try to download benepar_en with this code:

import benepar
benepar.download("benepar_en")

I have this result:

[nltk_data] Downloading package benepar_en to
[nltk_data]     /Users/azerafelie/nltk_data...

And no download happens (I waited 30 minutes). When I go to that folder, the file benepar_en has been created, but its size is 0 KB...

How can I solve this, please?

I tried this on another computer; same thing...

Best,

Tagging errors

First, thank you for making this wonderful tool available. I can't say enough good things about it. Very impressive indeed.

That makes these bizarre tagger errors all the more surprising.

import spacy
spacy.__version__ # '2.1.3'
import benepar
benepar.download('benepar_en2')
from benepar.spacy_plugin import BeneparComponent
nlp = spacy.load('en_core_web_lg')
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer, first=True)

bnp = BeneparComponent("benepar_en2")
nlp.add_pipe(bnp, before='parser')
nlp.pipe_names # ['sentencizer', 'tagger', 'benepar', 'parser', 'ner']

text = "Justin O'Mara Brown (born April 16, 1982 in Fletcher, Oklahoma) is a former professional American, Canadian football and arena football defensive end. He was signed as an undrafted free agent by the Indianapolis Colts in 2005." # Wikipedia
docs = nlp(text, disable=['benepar'])
docb = nlp(text, disable=['tagger'])

for s, b in zip(docs, docb):
    if s.tag != b.tag:
        print(s.i, s.text, s.tag_, b.tag_)

Output:

18 American JJ NNP
20 Canadian JJ NNP
28 He PRP NNP

In all three, spaCy's tagger was right and BNP's was wrong. The third one was the most troubling. With all due respect to the famous mariner Zheng He, it's not very wise to guess that "He" at the beginning of a sentence is anything other than a pronoun.

When I read that "tagger in models such as benepar_en2 gives better results," I actually went out of my way to make sure that BNP's tags are used instead of spaCy's. Now I'm not so sure. Can you quantify or describe how BNP's tagging is better, please? I don't have time to do an exhaustive test, so your input will carry a lot of weight. How do you think benepar_en2 compares with spaCy 2.1.3's en_core_web_lg for tagging?

Again, I emphasize that these tagging errors in no way detract from the project, which to me is about parsing, and BNP is excellent at that.

Pre-trained model Performance inconsistent with the one in README

Hi,

Thanks for this great work. I installed the parser ("benepar_en") using pip and tested its performance on the WSJ section 23 test set. What I found is that the performance is higher than the number claimed in the README (on my side, it shows 95.98 vs. 95.07 in the README). I am wondering if the pre-trained model in pip has used WSJ section 23 in training? It seems unclear what was used to train the pip model.

Thanks in advance.

Is code for multilingual training available?

Hi, this is Brian.
After reading your paper ("Multilingual Constituency Parsing with Self-Attention and Pre-Training"), I've been searching for the code used to train the model.
However, I couldn't find the code for the multilingual model in this repository.
Is the code for training the multilingual model available?

decode() takes exactly 6 positional arguments (5 given)

Training...
Traceback (most recent call last):
  File "src/main.py", line 612, in <module>
    main()
  File "src/main.py", line 608, in main
    args.callback(args)
  File "src/main.py", line 564, in <lambda>
    subparser.set_defaults(callback=lambda args: run_train(args, hparams))
  File "src/main.py", line 312, in run_train
    _, loss = parser.parse_batch(subbatch_sentences, subbatch_trees)
  File "/home/test/project/self-attentive-parser-master/src/parse_nk.py", line 1095, in parse_batch
    p_i, p_j, p_label, p_augment, g_i, g_j, g_label = self.parse_from_annotations(fencepost_annotations_start[start:end,:], fencepost_annotations_end[start:end,:], sentences[i], golds[i])
  File "/home/test/project/self-attentive-parser-master/src/parse_nk.py", line 1148, in parse_from_annotations
    p_score, p_i, p_j, p_label, p_augment = chart_helper.decode(False, **decoder_args)
  File "chart_helper.pyx", line 11, in chart_helper.decode (/home/test/.pyxbld/temp.linux-x86_64-3.6/pyrex/chart_helper.c:1674)
    def decode(int force_gold, int sentence_len, np.ndarray[DTYPE_t, ndim=3] label_scores_chart, int is_train, gold, label_vocab):
TypeError: decode() takes exactly 6 positional arguments (5 given)

installation fails

Ubuntu 16
anaconda install
pip install benepar
This yields:
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa

How to train on gold tags dataset

I have a copy of the revised PennTreebank that looks like the format of the files in data/.
However, the code breaks when I try to use these files. On further inspection, I'm guessing I need to insert a "TOP" tag at the start of every sentence? I did that and the model starts training, but then the EVAL script doesn't work. My copy of the treebank is somehow also missing a sentence. Is this what's causing the problem for the EVAL script? Can I just copy and paste the sentence that's missing from the silver trees you provided?

Questioning a parse result

I am not sure about the following - please clarify - or report a bug (?)

The text I am parsing:
when a notification is received that a driver is available then update that fact in the database

Result from parse_string:

print(sent._.parse_string)
(FRAG (SBAR (WHADVP (WRB when)) (S (NP (DT a) (NN notification)) (VP (VBZ is) (VP (VBN received) (SBAR (IN that) (S (NP (DT a) (NN driver)) (VP (VBZ is) (ADJP (JJ available))))))))) (ADVP (RB then)) (VP (NN update) (NP (DT that) (NN fact)) (PP (IN in) (NP (DT the) (NN database)))))

In it, I see that the word 'update' is marked as NN. Should it not be a verb?

[The following added later, after further exploration]

A test script (below) produces a dict that is missing the entry for the word "update" - I think because no label was available for the word.

Please excuse the naivete of the test - I am a newcomer to Python as well.

`
import benepar
import spacy
from benepar.spacy_plugin import BeneparComponent
nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent("benepar_en"))

doc = nlp("when a notification is received that a driver is available then update that fact in the database")
sent = list(doc.sents)[0]
print(sent._.parse_string)

def get_children(parent):
    ret_dict = {}
    if len(list(parent._.children)) > 0:
        for child in parent._.children:
            print(child)
            try:
                if len(list(child._.labels)) > 0:
                    lab = list(child._.labels)[0]
                    print(lab)
                    child_dict = {"label": lab, "text": str(child)}
                    gc = get_children(child)
                    child_dict["children"] = gc
                    ret_dict[lab] = child_dict
            except Exception as e:
                pass
    return ret_dict

gc = get_children(sent)
print(gc)
`

The following gc result was output:
{ "SBAR": { "label": "SBAR", "text": "when a notification is received that a driver is available", "children": { "WHADVP": { "label": "WHADVP", "text": "when", "children": {} }, "S": { "label": "S", "text": "a notification is received that a driver is available", "children": { "NP": { "label": "NP", "text": "a notification", "children": {} }, "VP": { "label": "VP", "text": "is received that a driver is available", "children": { "VP": { "label": "VP", "text": "received that a driver is available", "children": { "SBAR": { "label": "SBAR", "text": "that a driver is available", "children": { "S": { "label": "S", "text": "a driver is available", "children": { "NP": { "label": "NP", "text": "a driver", "children": {} }, "VP": { "label": "VP", "text": "is available", "children": { "ADJP": { "label": "ADJP", "text": "available", "children": {} } } } } } } } } } } } } } } }, "ADVP": { "label": "ADVP", "text": "then", "children": {} }, "VP": { "label": "VP", "text": "update that fact in the database", "children": { "NP": { "label": "NP", "text": "that fact", "children": {} }, "PP": { "label": "PP", "text": "in the database", "children": { "NP": { "label": "NP", "text": "the database", "children": {} } } } } } }
Note: Word "update" is missing.

Colin Goldberg

Error when parsing the American Constitution with spacy

I'm trying to run the parse using spacy on the American Constitution (https://pastebin.com/Ss2MWFVr) with this code:

import spacy
from benepar.spacy_plugin import BeneparComponent

nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent("benepar_en"))

nlp(american_constitution_text)

I'm getting an error - the error is different for using CPU or GPU. I've also tried longer text, and there is no problem. If I remove nlp.add_pipe(BeneparComponent("benepar_en")), it works.

CPU Error

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1321     try:
-> 1322       return fn(*args)
   1323     except errors.OpError as e:

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1306       return self._call_tf_sessionrun(
-> 1307           options, feed_dict, fetch_list, target_list, run_metadata)
   1308 

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1408           self._session, options, feed_dict, fetch_list, target_list,
-> 1409           run_metadata)
   1410     else:

InvalidArgumentError: indices[2590] = 29134 is not in [0, 19200)
	 [[Node: GatherV2_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Tile, ToInt32, GatherV2_8/axis)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-10-aeb678c45aba> in <module>()
----> 1 nlp_only_parsing(PROBLEMATIC_TEXT)

~/.local/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable)
    350             if not hasattr(proc, '__call__'):
    351                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 352             doc = proc(doc)
    353             if doc is None:
    354                 raise ValueError(Errors.E005.format(name=name))

~/.local/lib/python3.5/site-packages/benepar/spacy_plugin.py in __call__(self, doc)
     75     def __call__(self, doc):
     76         constituent_data = PartialConstituentData()
---> 77         for parse_raw, sent in self._batched_parsed_raw(self._process_doc(doc)):
     78             # The optimized cython decoder implementation doesn't actually
     79             # generate trees, only scores and span indices. Indices follow a

~/.local/lib/python3.5/site-packages/benepar/base_parser.py in _batched_parsed_raw(self, sentence_data_pairs)
    220             batch_data.append(datum)
    221             if len(batch_sentences) >= self.batch_size:
--> 222                 for chart_np, datum in zip(self._make_charts(batch_sentences), batch_data):
    223                     yield chart_decoder.decode(chart_np), datum
    224                 batch_sentences = []

~/.local/lib/python3.5/site-packages/benepar/base_parser.py in _make_charts(self, sentences)
    203     def _make_charts(self, sentences):
    204         inp_val = self._charify(sentences)
--> 205         out_val = self._sess.run(self._charts, {self._chars: inp_val})
    206         for snum, sentence in enumerate(sentences):
    207             chart_size = len(sentence) + 1

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    898     try:
    899       result = self._run(None, fetches, feed_dict, options_ptr,
--> 900                          run_metadata_ptr)
    901       if run_metadata:
    902         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1133     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1134       results = self._do_run(handle, final_targets, final_fetches,
-> 1135                              feed_dict_tensor, options, run_metadata)
   1136     else:
   1137       results = []

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1314     if handle is None:
   1315       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316                            run_metadata)
   1317     else:
   1318       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333         except KeyError:
   1334           pass
-> 1335       raise type(e)(node_def, op, message)
   1336 
   1337   def _extend_graph(self):

InvalidArgumentError: indices[2590] = 29134 is not in [0, 19200)
	 [[Node: GatherV2_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Tile, ToInt32, GatherV2_8/axis)]]

Caused by op 'GatherV2_1', defined at:
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 486, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.5/dist-packages/tornado/platform/asyncio.py", line 127, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 345, in run_forever
    self._run_once()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 1312, in _run_once
    handle._run()
  File "/usr/lib/python3.5/asyncio/events.py", line 125, in _run
    self._callback(*self._args)
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 759, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 536, in <lambda>
    self.io_loop.add_callback(lambda : self._handle_events(self.socket, 0))
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2903, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-6f0432446d27>", line 6, in <module>
    nlp_only_parsing.add_pipe(BeneparComponent("benepar_en"))
  File "/home/users/shlohod/.local/lib/python3.5/site-packages/benepar/spacy_plugin.py", line 73, in __init__
    super(BeneparComponent, self).__init__(filename, batch_size)
  File "/home/users/shlohod/.local/lib/python3.5/site-packages/benepar/base_parser.py", line 163, in __init__
    tf.import_graph_def(graph_def, name='')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 513, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 303, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3540, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3540, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3428, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): indices[2590] = 29134 is not in [0, 19200)
	 [[Node: GatherV2_1 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Tile, ToInt32, GatherV2_8/axis)]]

GPU Error

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1321     try:
-> 1322       return fn(*args)
   1323     except errors.OpError as e:

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1306       return self._call_tf_sessionrun(
-> 1307           options, feed_dict, fetch_list, target_list, run_metadata)
   1308 

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1408           self._session, options, feed_dict, fetch_list, target_list,
-> 1409           run_metadata)
   1410     else:

ResourceExhaustedError: OOM when allocating tensor of shape [1024,16384] and type float
	 [[Node: ConstantFolding/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/kernel_enter = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024,16384] values: [0.0838829875 0.0388401747 0.173968613]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"](^bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/while/lstm_cell/MatMul/Enter)]]

During handling of the above exception, another exception occurred:

ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-3-8f26748b155f> in <module>()
      5 nlp.add_pipe(BeneparComponent("benepar_en"))
      6 
----> 7 nlp(american)

~/.local/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable)
    350             if not hasattr(proc, '__call__'):
    351                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 352             doc = proc(doc)
    353             if doc is None:
    354                 raise ValueError(Errors.E005.format(name=name))

~/.local/lib/python3.5/site-packages/benepar/spacy_plugin.py in __call__(self, doc)
     75     def __call__(self, doc):
     76         constituent_data = PartialConstituentData()
---> 77         for parse_raw, sent in self._batched_parsed_raw(self._process_doc(doc)):
     78             # The optimized cython decoder implementation doesn't actually
     79             # generate trees, only scores and span indices. Indices follow a

~/.local/lib/python3.5/site-packages/benepar/base_parser.py in _batched_parsed_raw(self, sentence_data_pairs)
    220             batch_data.append(datum)
    221             if len(batch_sentences) >= self.batch_size:
--> 222                 for chart_np, datum in zip(self._make_charts(batch_sentences), batch_data):
    223                     yield chart_decoder.decode(chart_np), datum
    224                 batch_sentences = []

~/.local/lib/python3.5/site-packages/benepar/base_parser.py in _make_charts(self, sentences)
    203     def _make_charts(self, sentences):
    204         inp_val = self._charify(sentences)
--> 205         out_val = self._sess.run(self._charts, {self._chars: inp_val})
    206         for snum, sentence in enumerate(sentences):
    207             chart_size = len(sentence) + 1

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    898     try:
    899       result = self._run(None, fetches, feed_dict, options_ptr,
--> 900                          run_metadata_ptr)
    901       if run_metadata:
    902         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1133     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1134       results = self._do_run(handle, final_targets, final_fetches,
-> 1135                              feed_dict_tensor, options, run_metadata)
   1136     else:
   1137       results = []

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1314     if handle is None:
   1315       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316                            run_metadata)
   1317     else:
   1318       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333         except KeyError:
   1334           pass
-> 1335       raise type(e)(node_def, op, message)
   1336 
   1337   def _extend_graph(self):

ResourceExhaustedError: OOM when allocating tensor of shape [1024,16384] and type float
	 [[Node: ConstantFolding/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/kernel_enter = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024,16384] values: [0.0838829875 0.0388401747 0.173968613]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"](^bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/while/lstm_cell/MatMul/Enter)]]

Tagset

Hi,
what is the full tagset used by the parser?
Thanks!

Installation failure

This is the error

$ pip install benepar[cpu]
Collecting benepar[cpu]
Using cached https://files.pythonhosted.org/packages/12/a7/f5322420e6eb8528a6874e608aabf539a4759cfb324f5a63997c3eda991d/benepar-0.0.3.tar.gz
Requirement already satisfied: cython in /anaconda/lib/python3.6/site-packages (from benepar[cpu]) (0.28.5)
Requirement already satisfied: numpy in /anaconda/lib/python3.6/site-packages (from benepar[cpu]) (1.14.5)
Requirement already satisfied: nltk>=3.2 in /anaconda/lib/python3.6/site-packages (from benepar[cpu]) (3.2.5)
Requirement already satisfied: tensorflow>=1.8.0 in /anaconda/lib/python3.6/site-packages (from benepar[cpu]) (1.10.1)
Requirement already satisfied: six in /anaconda/lib/python3.6/site-packages (from nltk>=3.2->benepar[cpu]) (1.11.0)
Requirement already satisfied: absl-py>=0.1.6 in /anaconda/lib/python3.6/site-packages (from tensorflow>=1.8.0->benepar[cpu]) (0.4.1)
Requirement already satisfied: protobuf>=3.6.0 in /anaconda/lib/python3.6/site-packages (from tensorflow>=1.8.0->benepar[cpu]) (3.6.1)
Requirement already satisfied: astor>=0.6.0 in /anaconda/lib/python3.6/site-packages (from tensorflow>=1.8.0->benepar[cpu]) (0.7.1)
Requirement already satisfied: tensorboard<1.11.0,>=1.10.0 in /anaconda/lib/python3.6/site-packages (from tensorflow>=1.8.0->benepar[cpu]) (1.10.0)
Requirement already satisfied: wheel>=0.26 in /anaconda/lib/python3.6/site-packages (from tensorflow>=1.8.0->benepar[cpu]) (0.31.1)
Requirement already satisfied: gast>=0.2.0 in /anaconda/lib/python3.6/site-packages (from tensorflow>=1.8.0->benepar[cpu]) (0.2.0)
Requirement already satisfied: termcolor>=1.1.0 in /anaconda/lib/python3.6/site-packages (from tensorflow>=1.8.0->benepar[cpu]) (1.1.0)
Requirement already satisfied: grpcio>=1.8.6 in /anaconda/lib/python3.6/site-packages (from tensorflow>=1.8.0->benepar[cpu]) (1.14.2)
Collecting setuptools<=39.1.0 (from tensorflow>=1.8.0->benepar[cpu])
Using cached https://files.pythonhosted.org/packages/8c/10/79282747f9169f21c053c562a0baa21815a8c7879be97abd930dbcf862e8/setuptools-39.1.0-py2.py3-none-any.whl
Requirement already satisfied: werkzeug>=0.11.10 in /anaconda/lib/python3.6/site-packages (from tensorboard<1.11.0,>=1.10.0->tensorflow>=1.8.0->benepar[cpu]) (0.14.1)
Requirement already satisfied: markdown>=2.6.8 in /anaconda/lib/python3.6/site-packages (from tensorboard<1.11.0,>=1.10.0->tensorflow>=1.8.0->benepar[cpu]) (2.6.11)
Building wheels for collected packages: benepar
Running setup.py bdist_wheel for benepar ... error
Complete output from command /anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/gr/plmtcxzd19705bxnz4k6hqz00000gn/T/pip-install-v5le71uh/benepar/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" bdist_wheel -d /private/var/folders/gr/plmtcxzd19705bxnz4k6hqz00000gn/T/pip-wheel-vtm6_ah6 --python-tag cp36:
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-10.9-x86_64-3.6
creating build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/downloader.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/__init__.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/base_parser.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/spacy_plugin.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/nltk_plugin.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/chart_decoder.pyx -> build/lib.macosx-10.9-x86_64-3.6/benepar
running build_ext
building 'benepar.chart_decoder' extension
creating build/temp.macosx-10.9-x86_64-3.6
creating build/temp.macosx-10.9-x86_64-3.6/benepar
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/anaconda/include -mmacosx-version-min=10.9 -m64 -fPIC -I/anaconda/include -mmacosx-version-min=10.9 -m64 -fPIC -I/anaconda/lib/python3.6/site-packages/numpy/core/include -I/anaconda/include/python3.6m -c benepar/chart_decoder.c -o build/temp.macosx-10.9-x86_64-3.6/benepar/chart_decoder.o
In file included from benepar/chart_decoder.c:17:
/anaconda/include/python3.6m/Python.h:25:10: fatal error: 'stdio.h' file not found
#include <stdio.h>
^~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1


Failed building wheel for benepar
Running setup.py clean for benepar
Failed to build benepar
Installing collected packages: benepar, setuptools
Running setup.py install for benepar ... error
Complete output from command /anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/gr/plmtcxzd19705bxnz4k6hqz00000gn/T/pip-install-v5le71uh/benepar/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /private/var/folders/gr/plmtcxzd19705bxnz4k6hqz00000gn/T/pip-record-fy8y1cv0/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.macosx-10.9-x86_64-3.6
creating build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/downloader.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/__init__.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/base_parser.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/spacy_plugin.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/nltk_plugin.py -> build/lib.macosx-10.9-x86_64-3.6/benepar
copying benepar/chart_decoder.pyx -> build/lib.macosx-10.9-x86_64-3.6/benepar
running build_ext
building 'benepar.chart_decoder' extension
creating build/temp.macosx-10.9-x86_64-3.6
creating build/temp.macosx-10.9-x86_64-3.6/benepar
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/anaconda/include -mmacosx-version-min=10.9 -m64 -fPIC -I/anaconda/include -mmacosx-version-min=10.9 -m64 -fPIC -I/anaconda/lib/python3.6/site-packages/numpy/core/include -I/anaconda/include/python3.6m -c benepar/chart_decoder.c -o build/temp.macosx-10.9-x86_64-3.6/benepar/chart_decoder.o
In file included from benepar/chart_decoder.c:17:
/anaconda/include/python3.6m/Python.h:25:10: fatal error: 'stdio.h' file not found
#include <stdio.h>
^~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1

----------------------------------------

Command "/anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/gr/plmtcxzd19705bxnz4k6hqz00000gn/T/pip-install-v5le71uh/benepar/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /private/var/folders/gr/plmtcxzd19705bxnz4k6hqz00000gn/T/pip-record-fy8y1cv0/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/gr/plmtcxzd19705bxnz4k6hqz00000gn/T/pip-install-v5le71uh/benepar/

NonConstituentException on sentence level

To start, thanks for making this available, putting it on PyPI, and providing good documentation.


The following code raises a NonConstituentException, which is unexpected since we're accessing the tree at the sentence level.

import en_core_web_sm
from benepar.spacy_plugin import BeneparComponent

nlp = en_core_web_sm.load()
nlp.add_pipe(BeneparComponent('benepar_en2'))
doc = nlp('Sur la base du nombre de reçus à l‘examen en 2006, les\n chambres de métiers sont nettement surreprésentées \n sont légèrement sous-représentées (52,9 % des reçus).')

list(doc.sents)[0]._.parse_string

Without the newlines there is no exception.
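A minimal workaround sketch, assuming the newlines carry no meaning for the text, is to collapse all whitespace before handing the string to spaCy (nlp is the pipeline built in the snippet above):

import re

def normalize_whitespace(text):
    # Collapse runs of whitespace, including newlines, into single spaces.
    return re.sub(r"\s+", " ", text).strip()

text = 'Sur la base du nombre de reçus à l‘examen en 2006, les\n chambres de métiers sont nettement surreprésentées \n sont légèrement sous-représentées (52,9 % des reçus).'
doc = nlp(normalize_whitespace(text))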

I'm of course aware that analyzing a French sentence with an English model isn't useful, but when analyzing large corpora from the web, it happens from time to time that texts in other languages sneak in.

The following library versions were used:

  • spaCy 2.0.18
  • benepar 0.1.1
  • en-core-web-sm (spaCy model) 2.0.0

Parser produces too many "UNK"s when using self-trained model

When I use "run_parse" routine of the script "src/main.py" and use the self-trained model, it outputs parse trees with many "UNK"s. like that:

(S (INTJ (UNK No)) (UNK ,) (NP (UNK it)) (VP (UNK was) (UNK n't) (NP (UNK Black) (UNK Monday))) (UNK .))
(S (UNK But) (SBAR (UNK while) (S (NP (UNK the) (UNK New) (UNK York) (UNK Stock) (UNK Exchange)) (VP (UNK did) (UNK n't) (VP (UNK fall) (ADVP (UNK apart)) (NP (UNK Friday)) (UNK as) (S (NP (UNK the) (UNK Dow) (UNK Jones) (UNK Industrial) (UNK Average)) (VP (UNK plunged) (NP (UNK 190.58) (UNK points)) (PRN (UNK --) (NP (NP (UNK most)) (PP (UNK of) (NP (UNK it))) (PP (UNK in) (NP (UNK the) (UNK final) (UNK hour)))) (UNK --)))))))) (NP (UNK it)) (VP (ADVP (UNK barely)) (UNK managed) (S (VP (UNK to) (VP (UNK stay) (NP (NP (UNK this) (UNK side)) (PP (UNK of) (NP (UNK chaos)))))))) (UNK .))

It's similar when I use the pretrained model. How can I deal with this?

Do the default models include BERT embeddings?

Just curious whether the default models (specifically benepar_en2 or benepar_en2_large) include BERT embeddings. I see references to BERT and ELMo in the training section. Do we need to do that training ourselves if we want a parser that uses those embeddings, or are they already included?

And more importantly, even if the default models don't include BERT/ELMo embeddings, is it worth doing the training?

Thanks!

binary tree

When I converted the data from regular trees to binary trees, I got:
Error reading EVALB results.
Gold path: /tmp/evalb-jndjlzq5/gold.txt
Predicted path: /tmp/evalb-jndjlzq5/predicted.txt
Output path: /tmp/evalb-jndjlzq5/output.txt
dev-fscore (Recall=nan, Precision=nan, FScore=nan, CompleteMatch=nan) dev-elapsed 0h00m00s total-elapsed 0h00m03s

Can't access leaf labels

Hello! I need to access each tree node's label while traversing the tree. The problem I'm facing is that I don't have access to the leaf nodes' labels.

I have the following code:

import spacy
from benepar.spacy_plugin import BeneparComponent
nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent("benepar_en2"))

text = 'John Moss hates him.'
doc = nlp(text)
for sent in doc.sents:
  print(sent._.parse_string)
  child_1 = list(sent._.children)[0]
  print(child_1._.parse_string, child_1._.labels)
  child_2 = list(child_1._.children)[0]
  print(child_2._.parse_string, child_2._.labels)

which gives such output:

(S (NP (NNP John) (NNP Moss)) (VP (VBZ hates) (NP (PRP him))) (. .))
(NP (NNP John) (NNP Moss)) ('NP',)
(NNP John) ()

As you can see, the last parse string has NNP, but the labels tuple is empty.

Is this a bug, or am I doing something wrong?
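One workaround sketch, not an official answer, is to re-read the parse_string with NLTK, whose Tree.pos() method pairs each leaf token with its part-of-speech label (sent is the variable from the loop above):

from nltk import Tree

tree = Tree.fromstring(sent._.parse_string)
# Pairs each token with its POS label, e.g.
# [('John', 'NNP'), ('Moss', 'NNP'), ('hates', 'VBZ'), ('him', 'PRP'), ('.', '.')]
print(tree.pos())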

Sentence length limit

When I parse a sentence, I get a message saying that the sentence is too long. Why is this?

Spacy - BeneparComponent updates pos_ attribute of spacy tagger output

As I read the benepar documentation, benepar should not update the pos_ attribute; it should only set the span._.* and token._.* custom attributes.

import spacy
from benepar.spacy_plugin import BeneparComponent
spacynlp = spacy.load('en_core_web_sm')
spacynlp.add_pipe(BeneparComponent("benepar_en2"))

tokenizedtext = ['Bütün', 'insanlar', 'hür', ',', 'haysiyet', 've', 'haklar', 'bakımından','eşit', 'doğarlar','.']
doc = spacy.tokens.doc.Doc(spacynlp.vocab, words=tokenizedtext )
for name, proc in spacynlp.pipeline:
      doc = proc(doc)
      print(name, doc[0].pos_)

The output is like this:

tagger  NOUN
parser  NOUN
ner     NOUN
benepar PROPN

Why does this happen?

ImportError: Building module benepar.chart_decoder failed: ["distutils.errors.CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1\n"]

import benepar
/home/jhy/.pyxbld/temp.linux-x86_64-3.6/pyrex/benepar/chart_decoder.c:598:31: fatal error: numpy/arrayobject.h: No such file or directory
compilation terminated.
Traceback (most recent call last):
File "/usr/lib/python3.6/distutils/unixccompiler.py", line 118, in _compile
extra_postargs)
File "/usr/lib/python3.6/distutils/ccompiler.py", line 909, in spawn
spawn(cmd, dry_run=self.dry_run)
File "/usr/lib/python3.6/distutils/spawn.py", line 36, in spawn
_spawn_posix(cmd, search_path, dry_run=dry_run)
File "/usr/lib/python3.6/distutils/spawn.py", line 159, in _spawn_posix
% (cmd, exit_status))
distutils.errors.DistutilsExecError: command 'x86_64-linux-gnu-gcc' failed with exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jhy/py3.6/lib/python3.6/site-packages/pyximport/pyximport.py", line 215, in load_module
inplace=build_inplace, language_level=language_level)
File "/home/jhy/py3.6/lib/python3.6/site-packages/pyximport/pyximport.py", line 191, in build_module
reload_support=pyxargs.reload_support)
File "/home/jhy/py3.6/lib/python3.6/site-packages/pyximport/pyxbuild.py", line 102, in pyx_to_dll
dist.run_commands()
File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/jhy/py3.6/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/home/jhy/py3.6/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 194, in build_extensions
self.build_extension(ext)
File "/usr/lib/python3.6/distutils/command/build_ext.py", line 533, in build_extension
depends=ext.depends)
File "/usr/lib/python3.6/distutils/ccompiler.py", line 574, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/usr/lib/python3.6/distutils/unixccompiler.py", line 120, in _compile
raise CompileError(msg)
distutils.errors.CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/jhy/project/self-attentive-parser-master/benepar/init.py", line 6, in
from .nltk_plugin import Parser
File "/home/jhy/project/self-attentive-parser-master/benepar/nltk_plugin.py", line 4, in
from .base_parser import BaseParser, IS_PY2, STRING_TYPES, PTB_TOKEN_ESCAPE
File "/home/jhy/project/self-attentive-parser-master/benepar/base_parser.py", line 8, in
from . import chart_decoder
File "/home/jhy/py3.6/lib/python3.6/site-packages/pyximport/pyximport.py", line 462, in load_module
language_level=self.language_level)
File "/home/jhy/py3.6/lib/python3.6/site-packages/pyximport/pyximport.py", line 231, in load_module
raise exc.with_traceback(tb)
File "/home/jhy/py3.6/lib/python3.6/site-packages/pyximport/pyximport.py", line 215, in load_module
inplace=build_inplace, language_level=language_level)
File "/home/jhy/py3.6/lib/python3.6/site-packages/pyximport/pyximport.py", line 191, in build_module
reload_support=pyxargs.reload_support)
File "/home/jhy/py3.6/lib/python3.6/site-packages/pyximport/pyxbuild.py", line 102, in pyx_to_dll
dist.run_commands()
File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/jhy/py3.6/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/home/jhy/py3.6/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 194, in build_extensions
self.build_extension(ext)
File "/usr/lib/python3.6/distutils/command/build_ext.py", line 533, in build_extension
depends=ext.depends)
File "/usr/lib/python3.6/distutils/ccompiler.py", line 574, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/usr/lib/python3.6/distutils/unixccompiler.py", line 120, in _compile
raise CompileError(msg)
ImportError: Building module benepar.chart_decoder failed: ["distutils.errors.CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1\n"]
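For reference, the compile failure above means the on-the-fly Cython build cannot find numpy's C headers. Two things worth checking, offered only as general suggestions: that numpy and Cython are installed (and current) in the same environment doing the import, and that numpy can actually report its include directory:

$ pip install --upgrade numpy cython
$ python -c "import numpy; print(numpy.get_include())"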

Incorrect parse_tree when token(s) contain parenthesis

I am using your parser to get the parse_string for a lot of data so that I can calculate parse tree depths. Most sentences work fine; the exception is sentences that contain brackets. Take the following sentence as input.

The Solano(r) range is phthalate-free and contains no heavy metals.

This will be parsed and stringified as

(S (NP (DT The) (NN Solano (-RRB- -RRB-) (NN range)) (VP (VP (VBZ is) (ADJP (NN phthalate) (: -) (JJ free))) (CC and) (VP (VBZ contains) (NP (DT no) (JJ heavy) (NNS metals)))) (. .))

Prettifying this gives us the following:

(S
    (NP
        (DT The)
        (NN Solano(r)
            (-RRB- -RRB-)
            (NN range)
        )
        (VP
            (VP
                (VBZ is)
                (ADJP
                    (NN phthalate)
                    (: -)
                    (JJ free)
                )
            )
            (CC and)
            (VP
                (VBZ contains)
                (NP
                    (DT no)
                    (JJ heavy)
                    (NNS metals)
                )
            )
        )
        (. .)
    )

As is clear, the tree is incomplete. If you try to parse it with NLTK, it will fail: the parentheses don't match. As far as I can see, the NP should have been closed with a parenthesis and the -RRB- should be removed, so the correct parse looks like this:

(S
    (NP
        (DT The)
        (NN Solano(r)
            (NN range)
        )
    )
    (VP
        (VP
            (VBZ is)
            (ADJP
                (NN phthalate)
                (: -)
                (JJ free)
            )
        )
        (CC and)
        (VP
            (VBZ contains)
            (NP
                (DT no)
                (JJ heavy)
                (NNS metals)
            )
        )
    )
    (. .)
)

The issue seems to be the word with the parentheses in it. When a parenthesis is not part of a token (like this), it works fine and is parsed as -LRB-/-RRB-. However, if it's part of a token, things go wrong. Another example sentence is

(They like(d) it a lot.)

The outer parentheses are parsed correctly, but (d) seems to give errors when generating the parse_string.

The above seems to be a bug in how the parse string is generated. Of course everything depends on the parsing earlier in the pipeline, but even then the parse_string method should at least return a valid string (even if it is not 100% accurate).

However, another issue arises even when the parse string is corrected. Assume the 'correct' parse

(S (-LRB- -LRB-) (NP (PRP They)) (VP like(d) (NP (PRP it)) (NP (DT a) (NN lot))) (. .) (-RRB- -RRB-))

The NLTK parse would look like

(S
  (-LRB- -LRB-)
  (NP (PRP They))
  (VP like (d ) (NP (PRP it)) (NP (DT a) (NN lot)))
  (. .)
  (-RRB- -RRB-))

You can see that the (d) is, unsurprisingly, interpreted as a node. We don't want that. This is more an issue with NLTK, but I am not sure how to "escape" parentheses here. Suggestions are welcome, perhaps better suited to this SO post; one possible sketch follows below.
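One possible way to do that escaping, sketched here rather than taken from the library, is to apply the usual PTB escape codes to any parenthesis that occurs inside a token before the bracketed string is assembled:

from nltk import Tree

PTB_ESCAPES = {"(": "-LRB-", ")": "-RRB-"}

def escape_token(token):
    # Replace literal parentheses inside a token with their PTB escape codes
    # so the bracketed tree string stays well-formed.
    for char, escape in PTB_ESCAPES.items():
        token = token.replace(char, escape)
    return token

# e.g. "like(d)" becomes "like-LRB-d-RRB-", which NLTK reads as a single leaf:
tree = Tree.fromstring("(VP (VB {}) (NP (PRP it)))".format(escape_token("like(d)")))
print(tree.leaves())  # ['like-LRB-d-RRB-', 'it']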

Sentence length limit

When my sentence length exceeds 300 I get an error, so I changed SENTENCE_MAX_LEN to 3000, but another error occurred: "tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[483] = 483 is not in [0, 300)
[[{{node GatherV2_1}}]]". What is the cause, and is there any way to solve it?
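For context, the second error suggests the exported model graph itself is built with a 300-position limit, so raising the Python-side constant alone does not change the graph. A workaround sketch, assuming parse() accepts a pre-tokenized list of words (as in the NLTK integration) and that per-chunk trees are acceptable, is to split overlong inputs before parsing:

import benepar

parser = benepar.Parser("benepar_en2")
MAX_LEN = 300  # position limit baked into the exported model graph

def parse_long(tokens):
    # Parse a pre-tokenized sentence in chunks of at most MAX_LEN tokens.
    # Each chunk produces its own tree, so very long inputs are handled piecewise.
    for start in range(0, len(tokens), MAX_LEN):
        yield parser.parse(tokens[start:start + MAX_LEN])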

Cannot load model when its name unicode

One of the most problematic things in Py2 is unicode vs. str. To backport several features, I (like many others) usually use

from __future__ import unicode_literals

This means that when I load your model benepar_en_small, I get an error.

parser = BeneparComponent("benepar_en_small")
Exception: Argument is neither a valid module name nor a path to an existing file: benepar_en_small

This is because of

            if isinstance(name, str) and '/' not in name:
                graph_def = tf.GraphDef.FromString(load_model(name))

I think checking against basestring instead of just str would resolve this issue. Please let me know if this is acceptable.
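For reference, the suggested change could look roughly like this sketch, which uses a small compatibility alias so the same check also works on Python 3 (where basestring no longer exists):

import sys

# On Python 2 cover both str and unicode; on Python 3 plain str is enough.
if sys.version_info[0] == 2:
    string_types = (str, unicode)  # noqa: F821 (unicode only exists on Python 2)
else:
    string_types = (str,)

# The check quoted above would then read:
#     if isinstance(name, string_types) and '/' not in name: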

Terminating on example - out of RAM

Python 3.7.3

In [3]: parser.parse("hi")                                                      
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

support for multiprocessing

Hi, I'm using the CPU version through NLTK. I'd like to parallelize the parser with Python's multiprocessing module, but it fails. Is there any way to solve this?
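A pattern that often works for models that cannot be pickled, sketched here with no claim that it is officially supported, is to create one parser per worker process in a Pool initializer and only pass plain strings between processes:

import multiprocessing as mp

import benepar

parser = None  # set per worker process by init_worker

def init_worker():
    # Load the model inside each worker so it is never pickled across processes.
    global parser
    parser = benepar.Parser("benepar_en2")

def parse_sentence(sentence):
    # Only the input string and the bracketed output string cross process boundaries.
    return str(parser.parse(sentence))

if __name__ == "__main__":
    sentences = ["John Moss hates him.", "The parser runs in each worker process."]
    with mp.Pool(processes=2, initializer=init_worker) as pool:
        for tree in pool.map(parse_sentence, sentences):
            print(tree)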

Deprecation warnings

Hi,

Not sure what impact these messages have (so far) - I am reporting them FYI:

(On macosx 10.14.5, python 3.6.7, tensorflow 1.14.0)

nlp.add_pipe(BeneparComponent("benepar_en"))
WARNING: Logging before flag parsing goes to stderr.
W0718 09:41:51.815104 4736062912 deprecation_wrapper.py:119] From
/Users/.../anaconda3/lib/python3.6/site-packages/benepar/base_parser.py:199: The name
tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

W0718 09:41:53.919924 4736062912 deprecation_wrapper.py:119] From 
/Users/.../anaconda3/lib/python3.6/site-packages/benepar/base_parser.py:202: The name 
tf.Session is deprecated. Please use tf.compat.v1.Session instead.

Regards

Colin Goldberg
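For reference, these messages come from TensorFlow 1.x's compatibility layer rather than from benepar itself; if they are only noise, one general TensorFlow setting (not a benepar option) that usually hides them is:

import tensorflow as tf

# Suppress TF 1.x deprecation warnings; errors are still shown.
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)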

AttributeError: 'LeafParseNode' object has no attribute 'oracle_label'

When I tried to run the following:
python src/main.py train --use-elmo --model-path-base models/en_elmo --num-layers 4

There was an error:

epoch 1 batch 14/112 processed 2,800 batch-loss 101.9406 grad-norm 93.0121 epoch-elapsed 0h01m57s total-elapsed 0h01m57s
epoch 1 batch 15/112 processed 3,000 batch-loss 90.0741 grad-norm 73.8915 epoch-elapsed 0h02m05s total-elapsed 0h02m05s
sentence:  1 [('PU', ')')]
Traceback (most recent call last):
  File "src/main.py", line 615, in <module>
    main()
  File "src/main.py", line 611, in main
    args.callback(args)
  File "src/main.py", line 564, in <lambda>
    subparser.set_defaults(callback=lambda args: run_train(args, hparams))
  File "src/main.py", line 312, in run_train
    _, loss = parser.parse_batch(subbatch_sentences, subbatch_trees)
  File "/home/workspace/self-attentive-parser/src/parse_nk.py", line 1087, in parse_batch
    p_i, p_j, p_label, p_augment, g_i, g_j, g_label = self.parse_from_annotations(fencepost_annotations_start[start:end,:], fencepost_annotations_end[start:end,:], sentences[i], golds[i])
  File "/home/workspace/self-attentive-parser/src/parse_nk.py", line 1140, in parse_from_annotations
    p_score, p_i, p_j, p_label, p_augment = chart_helper.decode(False, **decoder_args)
  File "src/chart_helper.pyx", line 48, in chart_helper.decode
    oracle_label_chart[left, right] = label_vocab.index(gold.oracle_label(left, right))
AttributeError: 'LeafParseNode' object has no attribute 'oracle_label'
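The line printed just before the crash ("sentence:  1 [('PU', ')')]") is a one-token sentence whose gold tree reduces to a bare leaf, which appears to be where oracle_label is missing. A pre-processing sketch, assuming one bracketed tree per line and using placeholder file names, is to drop such trivial trees before training:

from nltk import Tree

def filter_trivial_trees(in_path, out_path, min_leaves=2):
    # Keep only trees with at least `min_leaves` tokens; one-token "sentences"
    # collapse to a bare leaf node during training.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if not line.strip():
                continue
            tree = Tree.fromstring(line, remove_empty_top_bracketing=True)
            if len(tree.leaves()) >= min_leaves:
                fout.write(line)

filter_trivial_trees("train.trees", "train.filtered.trees")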

Compatibility issues with pytorch 0.4.x and allennlp 0.6.1

Hello,

Thanks a lot for making your code available and for the thorough documentation!

I have PyTorch 0.4.x, which is also the default version when installing allennlp 0.6.1. A couple of questions/comments:

  1. [EDIT: never mind]

  2. When using the option --use-elmo, I was able to load the weights and everything seems to be running fine, but I get this warning at the beginning:

In file included from /atm/turkey/vol/transitory/ttmt001/envs/py3.6-gpu/lib/python3.6/site-packages/numpy/core/include/numpy/ndarraytypes.h:1821:0,
from /atm/turkey/vol/transitory/ttmt001/envs/py3.6-gpu/lib/python3.6/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /atm/turkey/vol/transitory/ttmt001/envs/py3.6-gpu/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from /homes/ttmt001/.pyxbld/temp.linux-x86_64-3.6/pyrex/thinc/neural/gpu_ops.c:568:
/atm/turkey/vol/transitory/ttmt001/envs/py3.6-gpu/lib/python3.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
/homes/ttmt001/.pyxbld/temp.linux-x86_64-3.6/pyrex/thinc/neural/gpu_ops.c:570:24: fatal error: _cuda_shim.h: No such file or directory
#include "_cuda_shim.h"
^
compilation terminated.

So the issue seems to be somewhere in the interaction between allennlp, numpy, and perhaps Cython (.pyx)? I'm using a virtual environment (Python 3.6) without a GPU, so I'm not sure why there's a warning about CUDA and gpu_ops. I don't get this warning if I just run from allennlp.modules.elmo import Elmo on the command line (without .pyx dependencies).

Buggy output for parentheticals

Hi, I'm using the pretrained benepar model as described in Usage with NLTK. It does not produce (-LRB- -LRB-)/(-RRB- -RRB-) for parentheticals the way other standard parsers do. For example, parsing this sentence:

Representative George Hansen (R., Idaho) drew a reprimand in nineteen eighty-four after a felony conviction for falsifying his financial disclosures.

gives

(S
(NP
(NP (JJ Representative) (NNP George) (NNP Hansen))
(PRN (( () (NP (NNP R.)) (, ,) (NP (NNP Idaho)) () ))))
(VP
(VBD drew)
(NP (DT a) (NN reprimand))
(PP (IN in) (NP (JJ nineteen) (JJ eighty-four)))
(PP
(IN after)
(NP
(NP (DT a) (NN felony) (NN conviction))
(PP
(IN for)
(S
(VP
(VBG falsifying)
(NP (PRP$ his) (JJ financial) (NNS disclosures))))))))
(. .))

The empty labels are particularly problematic when used with the trees.py module in this repo. Is this a bug or is this your own label convention?

ctb result

I used the default parameters and an 8-layer BERT, trained on the CTB 5.1 data.
I got 'FScore=90.50, CompleteMatch=28.82'.
But the paper 'Cross-Domain Generalization of Neural Constituency Parsers'
reports F1=92.14 and exact match=44.42.
I'm wondering whether something went wrong in my training. Did you fine-tune BERT?

Training script crashes on pytorch 1.2

The current version of the code does not run with PyTorch 1.2, which is currently the latest version. I am running the training script on the PTB data with --use-words as the only flag.

The error is in the call to FeatureDropoutFunction.apply(), in the line

output.mul_(ctx.noise)

output is of shape ([2016, 1024]) and ctx.noise is of shape ([1379, 1024]), which is why the mul_ operation fails.

Note that this does not happen in every call to FeatureDropoutFunction.apply(). While stepping through, the exception appears only in the second call; the first time it's called, the dimensions match and no exception is thrown.

With PyTorch 1.1, these errors do not appear. In a trial run, output and ctx.noise both have shape (1413, 1024) and there is no problem.

I can provide further stack traces if needed.
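Since PyTorch 1.1 is reported above to work, the simplest stopgap until the code is updated for 1.2 may be to pin the version in the training environment:

$ pip install "torch==1.1.0"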

Inquiry on the SPMRL dataset

Hi,
First of all, thanks for your great work and for releasing the code.
I have a few questions about the SPMRL dataset, which was used to test your model in the paper.
It would be much appreciated if you could answer the questions below:

  • Is it possible to get the dataset from its official website?
    I'm not sure I can obtain the dataset to replicate the results reported in the paper, because the SPMRL organizers require anyone who wants to download it to be certified, and it has been a while since the workshop (or shared task) was actually held.
  • I am wondering whether you recently obtained the dataset directly from the workshop organizers, used an in-house copy (which may exist at your school or lab), or obtained it through some indirect route.

Thanks in advance for your response.

Invalid parsing

Hi @nikitakit ,

I was wondering: what does the parser output when it is given a sentence that is not valid according to English grammar?

Thank you very much in advance.

RuntimeError in training: Sizes of tensors must match except in dimension 1

When I run python3 src/main.py train --model-path-base models/ --train-path data/gamalt_icepahc/train_random.clean --dev-path data/gamalt_icepahc/dev_random.clean --predict-tags --epochs 20 --use-words the following error occurs:

Traceback (most recent call last):
  File "src/main.py", line 613, in <module>
    main()
  File "src/main.py", line 609, in main
    args.callback(args)
  File "src/main.py", line 565, in <lambda>
    subparser.set_defaults(callback=lambda args: run_train(args, hparams))
  File "src/main.py", line 313, in run_train
    _, loss = parser.parse_batch(subbatch_sentences, subbatch_trees)
  File "/users/home/tha86/berk/self-attentive-parser-copy/src/parse_nk.py", line 1028, in parse_batch
    annotations, _ = self.encoder(emb_idxs, batch_idxs, extra_content_annotations=extra_content_annotations)
  File "/opt/share/python/3.6.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/users/home/tha86/berk/self-attentive-parser-copy/src/parse_nk.py", line 612, in forward
    res, timing_signal, batch_idxs = emb(xs, batch_idxs, extra_content_annotations=extra_content_annotations)
  File "/opt/share/python/3.6.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/users/home/tha86/berk/self-attentive-parser-copy/src/parse_nk.py", line 505, in forward
    annotations = torch.cat([content_annotations, timing_signal], 1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 627 and 645 in dimension 0 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:83

I am using a Mac with Python 3.6, PyTorch 1.0.0 and Cython 0.29.14. Training succeeds when the number of training sentences is small, but this error appears quickly when the number is increased.

Does anyone know why this error occurs and how it can be fixed?

cannot import name 'chart_decoder'

import benepar
Traceback (most recent call last):
File "", line 1, in
File "/project/self-attentive-parser-acl2018/benepar/init.py", line 6, in
from .nltk_plugin import Parser
File "/project/self-attentive-parser-acl2018/benepar/nltk_plugin.py", line 4, in
from .base_parser import BaseParser
File "/project/self-attentive-parser-acl2018/benepar/base_parser.py", line 6, in
from . import chart_decoder
ImportError: cannot import name 'chart_decoder'
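This usually means the chart_decoder Cython extension was never compiled for the source checkout. One option, assuming the repository's setup.py declares that extension (as the build logs earlier on this page suggest), is to build it in place before importing:

$ pip install cython numpy
$ python setup.py build_ext --inplace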

ImportError: cannot import name 'ssl' from 'urllib3.util.ssl_'

When I tried to run the following:
python3 src/main.py train --model-path-base models/ --train-path data/gamalt_icepahc/train_random.clean --dev-path data/gamalt_icepahc/dev_random.clean --predict-tags --use-bert

An error arose:

Processing trees for training...
Constructing vocabularies...
Initializing model...
Traceback (most recent call last):
 File "src/main.py", line 612, in <module>
    main()
 File "src/main.py", line 608, in main
    args.callback(args)
  File "src/main.py", line 564, in <lambda>
    subparser.set_defaults(callback=lambda args: run_train(args, hparams))
  File "src/main.py", line 203, in run_train
    parser = parse_nk.NKChartParser(
  File "/users/home/tha86/berk/self-attentive-parser-copy/src/parse_nk.py", line 711, in __init__
    self.bert_tokenizer, self.bert = get_bert(hparams.bert_model, hparams.bert_do_lower_case)
  File "/users/home/tha86/berk/self-attentive-parser-copy/src/parse_nk.py", line 569, in get_bert
    from pytorch_pretrained_bert import BertTokenizer, BertModel
  File "/users/home/tha86/.local/lib/python3.8/site-packages/pytorch_pretrained_bert/__init__.py", line 2, in <module>
    from .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer
  File "/users/home/tha86/.local/lib/python3.8/site-packages/pytorch_pretrained_bert/tokenization.py", line 25, in <module>
    from .file_utils import cached_path
  File "/users/home/tha86/.local/lib/python3.8/site-packages/pytorch_pretrained_bert/file_utils.py", line 20, in <module>
    import boto3
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/boto3/__init__.py", line 16, in <module>
    from boto3.session import Session
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/boto3/session.py", line 17, in <module>
    import botocore.session
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/botocore/session.py", line 29, in <module>
    import botocore.credentials
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/botocore/credentials.py", line 34, in <module>
    from botocore.config import Config
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/botocore/config.py", line 16, in <module>
    from botocore.endpoint import DEFAULT_TIMEOUT, MAX_POOL_CONNECTIONS
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/botocore/endpoint.py", line 22, in <module>
    from botocore.awsrequest import create_request_object
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/botocore/awsrequest.py", line 25, in <module>
    import botocore.utils
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/botocore/utils.py", line 31, in <module>
    import botocore.httpsession
  File "/opt/share/python/3.8.1/lib/python3.8/site-packages/botocore/httpsession.py", line 8, in <module>
    from urllib3.util.ssl_ import (
ImportError: cannot import name 'ssl' from 'urllib3.util.ssl_' (/opt/share/python/3.8.1/lib/python3.8/site-packages/urllib3/util/ssl_.py)
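The traceback bottoms out inside botocore (pulled in via pytorch_pretrained_bert and boto3), not in the parser itself: the installed urllib3 does not provide the name botocore expects. A common remedy, offered here only as a guess, is to reinstall boto3, botocore, and urllib3 together so their versions agree:

$ pip install --upgrade boto3 botocore urllib3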
