bjascob / amrlib

A python library that makes AMR parsing, generation and visualization simple.

License: MIT License

Python 98.80% Shell 0.89% Brainfuck 0.16% Forth 0.15%

python text-generation spacy neural-network transformer pytorch amr-parser amr-parsing amr spacy-extension

amrlib's Introduction

amrlib

A python library that makes AMR parsing, generation and visualization simple.

For the latest documentation, see ReadTheDocs.

!! Note: The models must be downloaded and installed separately. See the Installation Instructions.

About

amrlib is a python module designed to make processing for Abstract Meaning Representation (AMR) simple by providing the following functions

Sentence to Graph (StoG) parsing to create AMR graphs from English sentences.
Graph to Sentence (GtoS) generation for turning AMR graphs into English sentences.
A QT based GUI to facilitate conversion of sentences to graphs and back to sentences
Methods to plot AMR graphs in both the GUI and as library functions
Training and test code for both the StoG and GtoS models.
A SpaCy extension that allows direct conversion of SpaCy Docs and Spans to AMR graphs.
Sentence to Graph alignment routines
- FAA_Aligner (Fast_Align Algorithm), based on the ISI aligner code detailed in this paper.
- RBW_Aligner (Rule Based Word) for simple, single token to single node alignment
An evaluation metric API including:
- Smatch (multiprocessed with enhanced/detailed scores) for graph parsing
  see note at the bottom about smatch scoring
- BLEU for sentence generation
- Alignment scoring metrics detailing precision/recall

AMR Models

The system includes different neural-network models for parsing and for generation. !! Note: Models must be downloaded and installed separately. See amrlib-models for all parse and generate model download links.

Parse (StoG) model_parse_xfm_bart_large gives an 83.7 SMATCH score with LDC2020T02.
For a technical description of the parse model see its wiki-page
Generation (GtoS) generate_t5wtense gives a 54 BLEU with tense tags or 44 BLEU with un-tagged LDC2020T02.

AMR View

The GUI allows for simple viewing, conversion and plotting of AMR Graphs.

AMR CoReference Resolution

The library does not contain code for AMR co-reference resolution but there is a related project at amr_coref.

The following papers have GitHub projects/code that have similar or better scoring than the above.

Requirements and Installation

The project was built and tested under Python 3 and Ubuntu but should run on any Linux, Windows, or Mac system.

See Installation Instructions for details on setup.

Library Usage

To convert sentences to graphs

import amrlib
stog = amrlib.load_stog_model()
graphs = stog.parse_sents(['This is a test of the system.', 'This is a second sentence.'])
for graph in graphs:
    print(graph)

To convert graphs to sentences

import amrlib
gtos = amrlib.load_gtos_model()
sents, _ = gtos.generate(graphs)
for sent in sents:
    print(sent)

For a detailed description see the Model API.

Usage as a Spacy Extension

To use as an extension, you need spaCy version 2.0 or later. To set up and use the extension, do the following:

import amrlib
import spacy
amrlib.setup_spacy_extension()
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is a test of the SpaCy extension. The test has multiple sentences.')
graphs = doc._.to_amr()
for graph in graphs:
    print(graph)

For a detailed description see the Spacy API.

Paraphrasing

For an example of how to use the library to do paraphrasing, see the Paraphrasing section in the docs.

SMATCH Scoring

amrlib uses the smatch library for scoring. This is the library most commonly used for scoring AMR parsers and reporting results in the literature. There are some cases where the code may give inconsistent or erroneous results. You may wish to look at smatchpp for an improved scoring algorithm.

Issues

If you find a bug, please report it on the GitHub issues list. Additionally, if you have feature requests or questions, feel free to post there as well. I'm happy to consider suggestions and Pull Requests to enhance the functionality and usability of the module.

amrlib's Issues

Incorrect PENMAN with multi word expression

Hi,
thanks for this great tool !
I am using it with the T5 parser model, and I stumbled across an error for a sentence with an abbreviation: Using SEO to Inform Your Website Content Strategy. Somewhere during decoding it looks like SEO is replaced with search engine (optimization), so the created PENMAN graph is syntactically incorrect.
The graph below was taken from the parse_sents() method in amrlib/amrlib/models/parse_t5/inference.py, just before the call to gstring = PenmanDeSerializer(g).get_graph_string() (variable g in line 70), when the instances have not yet been given variables.

amrlib/amrlib/models/parse_t5/inference.py

Lines 69 to 71 in 7ddb4dd

    
    for bnum, g in enumerate(raw_graphs):
        gstring = PenmanDeSerializer(g).get_graph_string()
        if gstring is not None:

( use-01
    :ARG1 ( search engine    #<--- here is an invalid space
	:name ( name 
		:op1 "SEO" ) )
   :ARG2 ( inform-01 
	:ARG0 use-01
	:ARG1 ( strategy
		 :topic ( content 
			:mod ( website ) )
		 :poss ( you ) ) ) )

I guess this can happen with seq2seq models, but is there anything to do to avoid spaces in concept names ?
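
One possible workaround is sketched below. It assumes the raw model output is the variable-free serialization shown above, and it joins the words of a multi-word concept before the string is deserialized. The helper name `join_multiword_concepts` and the hyphen joiner are illustrative choices, not part of amrlib, and the regex does not try to handle parentheses inside quoted strings.

```python
import re

# A concept is taken to be the text after "(" up to the next role (":") or parenthesis.
_CONCEPT = re.compile(r'\(\s*([^():"]+?)\s*(?=[:()])')

def join_multiword_concepts(raw, joiner='-'):
    """Join whitespace-separated words inside a concept name (hypothetical helper)."""
    def fix(match):
        words = match.group(1).split()
        if '/' in words:              # "(v / concept" style input: leave it untouched
            return match.group(0)
        return '( ' + joiner.join(words) + ' '
    return _CONCEPT.sub(fix, raw)

raw = '( use-01 :ARG1 ( search engine :name ( name :op1 "SEO" ) ) )'
print(join_multiword_concepts(raw))
```

Whether "search-engine" is the concept the model should have produced is a separate question; the sketch only makes the string syntactically acceptable.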

The package fails to find the models?

Hello there!

I am on a M1 Mac (2020) and have python3, spacy and amrlib all installed through Homebrew.

LocalUser@LocalMac ~ % python3
Python 3.9.6 (default, Jun 28 2021, 19:24:41) 
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import amrlib
>>> stog = amrlib.load_stog_model()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.9/site-packages/amrlib/__init__.py", line 35, in load_stog_model
    stog_model = load_inference_model(model_dir, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/amrlib/models/model_factory.py", line 46, in load_inference_model
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), model_directory)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/homebrew/lib/python3.9/site-packages/amrlib/data/model_stog'

Examining a little bit indicates decisively that the directory /opt/homebrew/lib/python3.9/site-packages/amrlib/ does not have a folder named data. Here's some sort of proof that python3 is the right one:

LocalUser@LocalMac ~ % which python3                                       
/opt/homebrew/bin/python3

I'd appreciate your help!
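
A quick way to confirm the diagnosis is sketched below. It builds the same `<package dir>/data/model_stog` path that appears in the traceback and reports whether it exists, without importing amrlib itself. The function name `find_default_model_dir` is hypothetical. The README notes that models must be downloaded and installed separately, so a missing `data` folder is expected on a fresh install.

```python
import importlib.util
from pathlib import Path

def find_default_model_dir(package='amrlib', model_name='model_stog'):
    """Return <package dir>/data/<model_name>, or None if the package is not installed."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = Path(list(spec.submodule_search_locations)[0])
    return package_dir / 'data' / model_name

model_dir = find_default_model_dir()
if model_dir is None:
    print('amrlib is not installed for this interpreter')
elif model_dir.exists():
    print(f'found model directory: {model_dir}')
else:
    print(f'missing model directory: {model_dir}')
```

Another issue on this page loads a model from an explicit location with `amrlib.load_stog_model(model_dir=...)`, which avoids relying on the default path.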

KeyError: 'snt'

Hi,

I got KeyError: 'snt' when I ran 20_Train_Model.py under scripts/31_Model_Parse_T5/. I had already run 10_Collect_AMR_Data.py to get the full training, dev, and test datasets. Also, does anyone know the difference between train.txt and train.txt.nowiki? They seem quite similar, but model_generate_t5.json only uses train.txt.

In scripts/31_Model_Parse_T5/, what is the meaning of the number in each file name? For example, why does 10_Collect_AMR_Data.py start with 10? My understanding is that the numbers give the order in which the scripts should be run to train the model, so I need to run 10_Collect_AMR_Data.py first, then 20_Train_Model.py, and so on. Am I correct?

Here is the processed dataset after I run 10_Collect_AMR_Data.py.

I use the default hyperparameter from model_generate_t5.json

{   "gen_args" :
    {
        "model_name_or_path"            : "t5-base",
        "corpus_dir"                    : "amrlib/data/LDC2020T02/",
        "train_fn"                      : "train.txt",
        "valid_fn"                      : "dev.txt",
        "max_in_len"                    : 512,
        "max_out_len"                   :  90

    },
    "hf_args" :
    {
        "output_dir"                    : "amrlib/data/model_generate_t5",
        "do_train"                      : true,
        "do_eval"                       : false,
        "overwrite_output_dir"          : false,
        "prediction_loss_only"          : true,
        "num_train_epochs"              : 8,
        "save_steps"                    : 1000,
        "save_total_limit"              : 2,
        "per_device_train_batch_size"   : 6,
        "per_device_eval_batch_size"    : 6,
        "gradient_accumulation_steps"   : 4,
        "learning_rate"                 : 1e-4,
        "seed"                          : 42
    }
}

Here is the detailed error when I run 20_Train_Model.py.

(venv) qbao775@Broad-AI-2:/data/qbao775/amrlib/scripts/31_Model_Parse_T5$ python 20_Train_Model.py
Loading model and tokenizer
Building datasets
Loading and converting amrlib/data/tdata_t5/train.txt.nowiki
  0%|                                                                     | 0/55635 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "20_Train_Model.py", line 18, in <module>
  File "/data/qbao775/amrlib/amrlib/models/parse_t5/trainer.py", line 46, in train
    train_dataset   = self.build_dataset(train_file_path)
  File "/data/qbao775/amrlib/amrlib/models/parse_t5/trainer.py", line 73, in build_dataset
    entries = load_and_serialize(fpath) # returns a dict of lists
  File "/data/qbao775/amrlib/amrlib/models/parse_t5/penman_serializer.py", line 23, in load_and_serialize
    serials['sents'].append(serializer.get_meta('snt').strip())
  File "/data/qbao775/amrlib/amrlib/models/parse_t5/penman_serializer.py", line 47, in get_meta
    return self.graph.metadata[key]
KeyError: 'snt'
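
The traceback shows the serializer reading the `snt` metadata key from every graph, so one way to narrow this down is to look for entries in the training file that lack a `# ::snt` line. The sketch below does that with plain string handling. It assumes entries are separated by blank lines and that graphs start with `(`. The name `entries_missing_snt` is illustrative, not an amrlib function.

```python
import re

def entries_missing_snt(text):
    """Return (entry index, first line) for each AMR entry without '::snt' metadata."""
    missing = []
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]
    # Only blocks that contain a graph count as entries; this skips a file header comment.
    entries = [b for b in blocks if any(ln.lstrip().startswith('(') for ln in b.splitlines())]
    for index, entry in enumerate(entries):
        lines = entry.splitlines()
        meta = [ln for ln in lines if ln.startswith('#')]
        if not any('::snt ' in ln for ln in meta):
            missing.append((index, lines[0]))
    return missing

sample = """# ::id 1
# ::snt The boy wants to go.
(w / want-01
   :ARG0 (b / boy))

# ::id 2
(g / go-02)
"""
print(entries_missing_snt(sample))
```

For a real file, the text would be read from disk, for example from the train.txt.nowiki path shown in the log.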

Cannot draw the AMR plot with AMRPlot() function

Hi Brad, thank you for this great library. However, I met a problem when I tried to draw the plot of the AMR graph with AMRPlot() function. May I ask how can I solve this problem?

I installed the package following the instruction page, and tested the sample code in a Jupyter Notebook (Google Colab). It worked perfectly.

stog = amrlib.load_stog_model(model_dir="/my_local_path_to_model/model_parse_xfm_bart_base-v0_1_0")
graphs = stog.parse_sents(['This is a test of the system and this is a second sentence.'])
print(graphs[0])

>>># ::snt This is a test of the system and this is a second sentence.
(a / and
      :op1 (t / test-01
            :ARG0 (t2 / this)
            :ARG1 (s / system))
      :op2 (s2 / sentence
            :ord (o / ordinal-entity
                  :value 2)
            :domain (t3 / this)))

Then I tried to draw the plot with AMRPlot(). However, the graph did not show up, and there were no error messages.
(I checked the sample in the Docs, but it seems test.txt has been removed. I have installed graphviz and it was imported successfully.)

plot = AMRPlot()
plot.build_from_graph(graphs[0], debug=False)
plot.view()

Would you mind giving me some hints on this issue? Thank you!

Consider Adding a New Parse Model

The amrlib/parse_T5 model scores 81 on AMR-3. There are two publicly available models that have slightly better performance.

SapienzaNLP/SPRING
- 83.0 on AMR-3 and 84.5 on AMR-2
- Only works with transformers < 3 (transformers 4.12 is current release)
- Has code and pretrained models on GitHub
IBM/Structured-BART
- 82.7 on AMR-3 and 84.7 on AMR-2
- Doesn't use transformers (trains with FB fairseq lib)
- Has code on GitHub but no pretrained model is publicly available
- Currently code does not support stand-alone parsing.
- Is a "transition parser". (Is this faster than the auto-regressive transformer style models?)
The paper Hierarchical Curriculum learning claims to be able to improve Bart (in SPRING model) by 1 point to 84.1 on AMR-3
- The Instance Curricular portion of this has not proved to be useful in improving the T5 model's scores

Both of these models are based on BART-large which has roughly 2X the parameters of the T5-base model used in amrlib. This may cause issues training on older/smaller GPUs and could be slower for inference.

errors while trying to run "amr_view"

HI:
Thank you for sharing.
I might have missed the instructions, but when I tried to run "amr_view", I got the following errors:
"AMRView Config
stog_model_dir = /home/bancherd/.local/lib/python3.8/site-packages/amrlib/data/model_stog
stog_model_fn = model.pt
stog_device = cpu
gtos_model_dir = /home/bancherd/.local/lib/python3.8/site-packages/amrlib/data/model_gtos
gtos_num_ret_seq = 8
gtos_num_beams = 8
gtos_batch_size = 1
gtos_device = cpu
render_format = pdf
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/bancherd/.local/lib/python3.8/site-packages/amrlib/amr_view/processor_gtos.py", line 39, in load_model
    self.inference = load_inference_model(self.model_dir, num_beams=self.num_beams,
  File "/home/bancherd/.local/lib/python3.8/site-packages/amrlib/models/model_factory.py", line 46, in load_inference_model
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), model_directory)
FileNotFoundError: [Errno 2] No such file or directory: '/home/bancherd/.local/lib/python3.8/site-packages/amrlib/data/model_gtos'
Exception in thread Thread-1:
"

However, I have this in my "amr_view.json" file:

"{
"stog_model_dir": "/home/bancherd/amrlib/amrlib/data/model_stog",
"stog_model_fn": "model.pt",
"stog_device": "cuda:0",
"gtos_model_dir": "Home/amrlib/amrlib/data/model_gtos",
"gtos_num_ret_seq": 8,
"gtos_num_beams": 8,
"gtos_batch_size": 1,
"gtos_device": "cuda:0",
"render_format": "pdf"
}"

Thanks,

Bancherd

Model Stog not found

Hi,
I want to use amrlib with spacy extension but I get this error: No such file or directory: '/usr/local/lib/python3.7/dist-packages/amrlib/data/model_stog' after this block:

import spacy
import amrlib
amrlib.setup_spacy_extension()
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is a test of the spaCy extension. The test has multiple sentences.')
graphs = doc._.to_amr()

How can I solve this? Any help would be appreciated.

Better visualization for big AMR graph

The visualization helps a lot. Thanks for releasing this useful tool.

I may have a small suggestion when dealing with big AMR graphs derived from long sentences.

amrlib/amrlib/graph_processing/amr_plot.py

Line 12 in dadb5c4

self.graph = Digraph('amr_graph', filename=render_fn, format='png')

I was wondering if we could change format="png" -> format="pdf", since the image quality of pdf is better than png when zooming the image.

I tried pdf on my own mac laptop, it works well. Anyway, it is just my personal preference.

Best,
Jiaying

A request for the source code for training the model

Hi,

I saw the models for Parse SPRING, Parse T5, Parse GSII, Generate T5wtense, Generate T5. But I did not see the source code for them. Can you post the source code for training those models? Thank you so much.

Merry Christmas.

AttributeError: 'str' object has no attribute 'size'

Hello there!

I'm not sure if this is a noobish problem, but I am out of my wits!

After setting up the package according to the documentation, I did run the proposed minimal code:

import amrlib
stog = amrlib.load_stog_model()
graphs = stog.parse_sents(['This is a test of the system.', 'This is a second sentence.'],True)
for graph in graphs:
    print(graph)

However, I get the following error:

$ python3 stog_test.py 
Loading model /home/gandalf/.local/lib/python3.7/site-packages/amrlib/data/model_stog/model.pt
Traceback (most recent call last):                                                            
  File "stog_test.py", line 5, in <module>                                                    
    graphs = stog.parse_sents(['This is a test of the system.', 'This is a second sentence.'],True)                                                                                         
  File "/home/gandalf/.local/lib/python3.7/site-packages/amrlib/models/parse_gsii/inference.py", line 65, in parse_sents                                                                    
    return self.parse_file_handle(sio_f, add_metadata)                                        
  File "/home/gandalf/.local/lib/python3.7/site-packages/amrlib/models/parse_gsii/inference.py", line 93, in parse_file_handle                                                              
    res = self._parse_batch(batch)                                                            
  File "/home/gandalf/.local/lib/python3.7/site-packages/amrlib/models/parse_gsii/inference.py", line 152, in _parse_batch
    beams = self.model.work(batch, self.beam_size, self.max_time_step)
  File "/home/gandalf/.local/lib/python3.7/site-packages/amrlib/models/parse_gsii/modules/parser.py", line 81, in work
    data['pos'], data['ner'], data['word_char'], data['bert_token'], data['token_subword_index'])
  File "/home/gandalf/.local/lib/python3.7/site-packages/amrlib/models/parse_gsii/modules/parser.py", line 62, in encode_step_with_bert
    bert_embed, _ = self.bert_encoder(bert_token, token_subword_index=token_subword_index)
  File "/home/gandalf/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/gandalf/.local/lib/python3.7/site-packages/amrlib/models/parse_gsii/bert_utils.py", line 60, in forward
    return self.average_pooling(encoded_layers, token_subword_index), pooled_output
  File "/home/gandalf/.local/lib/python3.7/site-packages/amrlib/models/parse_gsii/bert_utils.py", line 66, in average_pooling
    _, num_total_subwords, hidden_size = encoded_layers.size()
AttributeError: 'str' object has no attribute 'size'

I'm not sure if this is related to amrlib at all, encoded_layers here is a String with the value "last_hidden_layer".

My pip freeze:

amrlib==0.2.1
asn1crypto==0.24.0
beautifulsoup4==4.7.1
blis==0.7.3
Brlapi==0.6.7
catalogue==1.0.0
certifi==2018.8.24
chardet==3.0.4
click==7.1.2
cryptography==2.6.1
cupshelpers==1.0
cymem==2.0.4
dataclasses==0.6
distro==1.3.0
distro-info==0.21
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
entrypoints==0.3
evdev==1.1.2
filelock==3.0.12
future==0.18.2
galternatives==1.0.4
graphviz==0.15
html5lib==1.0.1
httplib2==0.11.3
idna==2.6
importlib-metadata==3.1.1
joblib==0.17.0
keyring==17.1.1
keyrings.alt==3.1.1
louis==3.8.0
lxml==4.3.2
meteo-qt==1.0.0
murmurhash==1.0.4
nltk==3.5
numpy==1.19.4
olefile==0.46
packaging==20.7
Penman==1.1.0
pexpect==4.6.0
Pillow==5.4.1
plac==1.1.3
preshed==3.0.4
pycairo==1.16.2
pycrypto==2.6.1
pycups==1.9.73
pycurl==7.43.0.2
PyGObject==3.30.4
pyparsing==2.4.7
PyQt5==5.15.2
PyQt5-sip==12.8.1
PySimpleSOAP==1.16.2
pysmbc==1.0.15.6
python-apt==1.8.4.1
python-debian==0.1.35
python-debianbts==2.8.2
python-magic==0.4.16
pyxattr==0.6.1
pyxdg==0.25
PyYAML==3.13
regex==2020.11.13
reportbug===7.5.3-deb10u1
reportlab==3.5.13
requests==2.21.0
sacremoses==0.0.43
SecretStorage==2.3.1
setproctitle==1.1.10
six==1.12.0
smatch==1.0.4
soupsieve==1.8
spacy==2.3.4
srsly==1.0.4
thinc==7.4.3
tokenizers==0.9.4
torch==1.7.0+cpu
torchaudio==0.7.0
torchvision==0.8.1+cpu
tqdm==4.54.0
transformers==4.0.0
typing-extensions==3.7.4.3
unattended-upgrades==0.1
Unidecode==1.1.1
urllib3==1.24.1
wasabi==0.8.0
webencodings==0.5.1
word2number==1.1
youtube-dl==2020.12.2
zipp==3.4.0

If you need further information, let me know, I would appreciate any help!

Loading stog_model

Hello 👋
When trying to load the stog_model with load_stog_model I get a FileNotFoundError because the path "amrlib/data/stog_model" does not exist.

Looking at the docs I think stog_model was renamed to parse_t5, correct? There's also no data folder.
Is there a function that downloads the stog_model and creates the data folder?

Thanks for the help! ✨

Question for RBW Aligner

would you mind describing the format of the alignments more specifically?

::alignments 0-1.1.1 1-1.1.2 3-1.1.2.1.2.1 4-1.1.2.1.1.1 5-1.1
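
For readers trying to interpret a line like the one above, the sketch below splits it into pairs. It assumes the ISI-style convention in which each item is a token index, a hyphen, and a dotted node address (with `1` as the root and `1.2` as its second child). That reading is an assumption here and has not been checked against the RBW aligner's code. The name `parse_alignments` is illustrative.

```python
def parse_alignments(line):
    """Split an '::alignments' line into (token_index, node_address) pairs."""
    prefix = '::alignments'
    body = line.split(prefix, 1)[1] if prefix in line else line
    pairs = []
    for item in body.split():
        token, _, address = item.partition('-')   # split at the first hyphen only
        pairs.append((int(token), address))
    return pairs

line = '::alignments 0-1.1.1 1-1.1.2 3-1.1.2.1.2.1 4-1.1.2.1.1.1 5-1.1'
for token_index, address in parse_alignments(line):
    print(token_index, address)
```

The address is kept as a string so that any non-numeric suffix is preserved as-is.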

parse_gsii?

Hello. I've been trying out amrlib, and have a question about the 2 models, parse_t5 vs. parse_gsii. I easily find the first one, but https://github.com/bjascob/amrlib/blob/master/README.md says there's the second one, too, which I can't find. I also went to the place the page says it's from (https://github.com/jcyk/AMR-gs), and can't find it there either. Any guidance you can provide?

Thanks!

Consider Adding BLINK for wiki Tag Annotations

The popular way to add wiki tags today is using BLINK. Consider adding code to utilize this.

Additionally, check to see if there are any updates to the Spotlight DB server and code. The Java code does not run properly under Java 11 and needs to run with Java 8. Also check to see if the online servers are still running and reliable. Compare performance between BLINK and Spotlight. Obsolete this wiki tagging solution if it's no longer reasonably well supported.

The current Spotlight process gives wikification smatch scores of 73 on parse_spring and 72 on parse_t5. In the parse_spring paper, "One SPRING...", they show a wikification score of 84 using the BLINK model. However, note that they are applying empty :wiki tags with the model and then using post-processing to fill them in. The Spotlight process used currently both finds where to apply the tags and then looks them up in the DB. I'm not sure which portion (finding tag locations or finding tag values) is causing the lower scores.

Missing runs folder

Hi,

I have one more question related to the runs folder. I found run_tensorboard.sh under 31_Model_Parse_T5/, and it seems there should be a runs folder after training, but I did not find one. Do you know how I can get the runs folder? Do I need to set some parameters in model_parse_t5.json for that? Thank you so much.

Does the amrlib support the AMR-to-Text model for SPRING

Hi,

I found that amrlib supports the Text-to-AMR model for SPRING. The SPRING GitHub project also provides an AMR-to-Text model. Does amrlib support that as well? I ask because the AMR-to-Text model I downloaded from SPRING is a model.pt file instead of a pytorch_model.bin.

length problem of faa aligner

When I use the FAA aligner, an error is thrown due to a length mismatch. Specifically, I intend to align 1000 AMRs to 1000 sentences, but the FAA aligner returns only 907 alignments, which causes the error. I traced the problem to faa_aligner.py (around line 98).

fwd_out, fwd_err = self.fwd_align.communicate(fa_in, timeout=self.timeout)
rev_out, fwd_err = self.rev_align.communicate(fa_in, timeout=self.timeout)

newly trained parse_xfm_bart models not generating first character

For newly trained parse_xfm models using bart-X, when looking at the raw graph output the first character is missing. This does not impact the released models and I've only confirmed it on bart-base. It likely exists with bart-large but not with the t5 models.

The issue is due to the fact that the huggingface model config.json file has changed. There is now a line "forced_bos_token_id": 0 present. This changes the behavior. The change to the model config happened around the end of February 2022. It looks like it is from this huggingface/transformers#15559 issue.

Adding the line {..., "forced_bos_token_id" : null} to the config's model_args section appears to fix this. The fix needs to be tested with both bart models and t5-base needs to be verified to work correctly without any changes.

Note that it's not obvious that this is happening because the first character is always a "(" for all graphs and the deserializer can generally handle the missing start paren without issue. This means the bug may not introduce any visible changes, but could cause the smatch score to change a very small amount.
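
The config change described above can be sketched as a small patch step. The `model_args` section name and the `forced_bos_token_id` key come from this issue. The rest of the config layout and the `model_name_or_path` value are placeholders, and `disable_forced_bos` is a hypothetical helper rather than part of amrlib.

```python
import json

def disable_forced_bos(config):
    """Return a copy of the config with model_args.forced_bos_token_id set to null."""
    patched = json.loads(json.dumps(config))          # deep copy via a JSON round trip
    patched.setdefault('model_args', {})['forced_bos_token_id'] = None
    return patched

# Placeholder config; a real training config has more sections and keys.
config = {'model_args': {'model_name_or_path': 'facebook/bart-base'}}
print(json.dumps(disable_forced_bos(config), indent=4))
```

Python's `None` is written out as JSON `null`, which matches the line quoted in the issue.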

FAA Aligner output "alignment_strings" are lower-cased

The output graph with surface alignments is fully lower-cased. This is an issue for attributes that may need to be capitalized.

There are 2 places in processing where AMR strings are lower-cased.
1 - feat2tree.py::align()
2 - process_utils.py::get_lineartok_with_rel()

Note that removing the lower() in either of these could impact performance so testing needs to be done before making changes.

A work-around, for now, is to do a lower-case match to the original graph attributes and copy over the originally cased words.
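
The work-around can be sketched as follows, under the assumption that only quoted attribute values need their casing restored. The `~e.N` markers in the sample are just illustrative alignment tags, and `restore_attribute_case` is a hypothetical helper name. If two original attributes differ only by case, the first one seen wins.

```python
import re

_QUOTED = re.compile(r'"[^"]*"')

def restore_attribute_case(original, lowered):
    """Copy the original casing of quoted attributes onto a lower-cased graph string."""
    cased = {}
    for attr in _QUOTED.findall(original):
        cased.setdefault(attr.lower(), attr)      # first original spelling wins
    return _QUOTED.sub(lambda m: cased.get(m.group(0), m.group(0)), lowered)

original = '(n / name :op1 "SEO")'
lowered = '(n / name~e.1 :op1 "seo"~e.0)'
print(restore_attribute_case(original, lowered))
```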

Finetuning gsii

In the comments of a PR, you explained how to fine-tune the trained T5 parse model on additional data. Is there also a simple way to do this for the gsii parse model?

Compatibility issue with transformers >= 4.0.0

See #8 for behavior when running the stog model with transformers 4.0.0

The requirements.txt for amrlib v0.2.1 did not require transformers==3.5.1, but I added the version to that file. Code works with transformers==3.4.0 too, so it's probably OK to just do transformers<=3.5.1.

Review transformers 4 and consider changing code to support this version. Note that while the original error was reported for the stog model, the gtos model could have issues too.

Pytorch Version and Windows Compatibility

From Moha

Many thanks for your quick response. I have done what you wrote and
installed the requirements file. Now I am fighting with this torch compatibility problem:

version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at
..\caffe2\serialize\inline_container.cc:132, please report a bug to PyTorch.
Attempted to read a PyTorch file with version 3, but the maximum supported
version for reading is 2. Your PyTorch installation may be too old. (init at
..\caffe2\serialize\inline_container.cc:132)
(no backtrace available)

It seems to be related to the version of torch. If you have any idea, I would be happy to hear it.

Two questions for parse_xfm_bart_large

Hi,

I got one question for the reference paper for parse_xfm_bart_large. Can you post the reference paper for that? Many thanks.
https://github.com/bjascob/amrlib-models#sentence-to-graph-models

The second question is I cannot load the parse_xfm_bart_large using stog = amrlib.load_stog_model("./models/model_parse_xfm_bart_large-v0_1_0"). I got the following error.

Traceback (most recent call last):
  File "/data/qbao775/amrlib/reclor_if_then_xfm_t5wtense.py", line 36, in <module>
    stog = amrlib.load_stog_model("./models/model_parse_xfm_bart_large-v0_1_0")
  File "/data/qbao775/amrlib/amrlib/__init__.py", line 35, in load_stog_model
    stog_model = load_inference_model(model_dir, **kwargs)
  File "/data/qbao775/amrlib/amrlib/models/model_factory.py", line 62, in load_inference_model
    model_class = dynamic_load(module_name=meta['inference_module'], class_name=meta['inference_class'])
  File "/data/qbao775/amrlib/amrlib/models/model_factory.py", line 13, in dynamic_load
    module = importlib.import_module(module_name, package=package)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'amrlib.models.parse_xfm'

Compatibility with spaCy 3.0

spaCy released v3 on 2/1/2021, with new transformer models. These may work but should be tested. Of specific concern is the issue "v3 is no longer thread-safe when used with transformers" (#6879), since parsing full corpora is done with threading, which greatly decreases the parse time.

Also of concern are the following (from release notes)...

The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas has been removed.
The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas

FileNotFoundError FAA_Aligner

How to speed up my parsing?

Hello, I use the pretrained parse_xfm_bart_base for parsing, but the parsing speed is limited to one sentence per second instead of 31 sentences per second.
How can I speed it up?

Plotting a Graph

@myeghaneh
To plot a graph, see AMRPlot. There's also a button in the UI for this.

Note that this requires graphviz which comes in 2 parts. The pip install, which is a python wrapper, and the non-python Graphviz install which needs to be done manually. See the installation instructions on the pypi page.

How to control the number of generation?

Hi,

I have a question about how to control the number of generations. For example, when I use the T5 parser to parse a sentence into an AMR graph, I only get one graph per sentence. How can I generate five graphs each time? Thanks.

import amrlib
stog = amrlib.load_stog_model()
graphs = stog.parse_sents(['This is a test of the system.', 'This is a second sentence.'])
for graph in graphs:
    print(graph)

Kind regards,
Qiming

What training data was used for GSII model?

Thank you for releasing this useful tool.
I was wondering which AMR release was used for training the Parse GSII?

Spring parser returning nothing for certain strings

Wanted to share some weird behavior I found

For the SPRING parser, I get no graph returned for the following strings

'advertisement To estimate the risk of death from heart disease most doctors use a calculator endorsed by the American Heart Association and the American College of Cardiology'

To muddy the waters further In 2016 a systematic review revealed no association with LDL cholesterol and heart disease in those aged over sixty and an inverse association with all cause mortality in other words the higher your cholesterol in this age group the longer you would live 15

Your body needs cholesterol to Promote brain health

Not sure why! Thought it might be of interest.

Trying To Understand Why Certain Inputs Break Parser

I am using the T5 parser and noticed that the input

Making certain distinctions is imperative in looking back on the past

Causes the model to return the error

gid=x Unhandled token imperative
Failed to deserialize, snum=0, beam=0
gid=x Unhandled token imperative
Failed to deserialize, snum=0, beam=1
gid=x Unhandled token imperative
Failed to deserialize, snum=0, beam=2
gid=x Unhandled token imperative
Failed to deserialize, snum=0, beam=3

[None]

Two questions:

Why does this happen? In general, what kinds of inputs cause such an issue?
Suppose I need the sample parsed, is there anything I can do to get an output?

Thanks!
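
As shown by the `[None]` result above, a sentence that fails to deserialize comes back as `None` in the returned list. Calling code can separate those cases before using the graphs. The sketch below uses a stand-in list in place of a real `parse_sents()` result, and `split_parse_results` is an illustrative name.

```python
def split_parse_results(sents, graphs):
    """Pair each sentence with its graph and collect sentences that failed to parse."""
    parsed, failed = [], []
    for sent, graph in zip(sents, graphs):
        if graph is None:
            failed.append(sent)
        else:
            parsed.append((sent, graph))
    return parsed, failed

sents = ['This is a test.',
         'Making certain distinctions is imperative in looking back on the past']
graphs = ['(t / test-01)', None]      # stand-in for the list returned by parse_sents()
parsed, failed = split_parse_results(sents, graphs)
print(len(parsed), 'parsed;', 'failed:', failed)
```

The failed sentences can then be logged, rephrased, or retried with a different model.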

[Feature request] Alignments: AMR <-> AMR and AMR <-> Sentence

Hi,

first, I want to say: You deserve a medal for creating this library. It is the first time I installed an AMR parser without getting a little headache :-). Also it's a nice idea to wrap the noRECAT variant of GSII and ablate all external java preprocessing. I think the noRECAT version may also be more robust.

I have two suggestions that I think would be cool to have in amrlib:

AMR2sent alignment: As far as I know, there exist aligners (for instance as pre-processing of JAMR parser), that align AMR nodes to tokens. Since often lemmas of the sentence are projected into the AMR graph, a simple string match, maybe with some additional rules, could make up a first solid method. Maybe there are other methods that are more suitable and also easy-to-use.
AMR2AMR variable alignment: This could be useful, e.g., for computing AMR metrics or enabling sentence retrieval via AMR parsed corpora or sentence similarity computation via AMR. It is an NP-hard problem but can be implemented via hill climbing maximizing triple matching. I have been working on this lately, here is a repo containing AMR metrics (Smatch and S2match) that are based on this alignment. Both alignments should be quite easy to implement in the lib, since it's all native python. (It could be worthwhile, though, to make the alignment faster, e.g., using cython, since it can be very slow for graphs with many variables)
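
To make the hill-climbing idea concrete, here is a minimal sketch of variable alignment that maximizes the number of matched triples. It is not the Smatch implementation and not part of amrlib; the `(source, role, target)` triple format, the function names, and the restart count are all assumptions made for illustration.

```python
import itertools
import random


def count_matches(mapping, vars_a, triples_a, triples_b):
    """Count triples of graph A found in graph B after renaming A's variables."""
    target = set(triples_b)

    def rename(x):
        # Variables of A are renamed (None if unmapped); concepts pass through.
        return mapping.get(x) if x in vars_a else x

    return sum((rename(s), r, rename(t)) in target for s, r, t in triples_a)


def neighbours(mapping, vars_b):
    """Yield mappings that differ by one swap or one move to an unused variable."""
    keys = list(mapping)
    for x, y in itertools.combinations(keys, 2):
        cand = dict(mapping)
        cand[x], cand[y] = mapping[y], mapping[x]
        yield cand
    unused = [v for v in vars_b if v not in mapping.values()]
    for x in keys:
        for v in unused:
            cand = dict(mapping)
            cand[x] = v
            yield cand


def hill_climb(vars_a, vars_b, triples_a, triples_b, restarts=5, seed=0):
    """Greedy hill climbing with random restarts; returns (mapping, score)."""
    rng = random.Random(seed)
    vars_a_set = set(vars_a)
    best_map, best_score = {}, -1
    for _ in range(restarts):
        shuffled = list(vars_b)
        rng.shuffle(shuffled)
        mapping = dict(zip(vars_a, shuffled))  # zip stops at the shorter list
        score = count_matches(mapping, vars_a_set, triples_a, triples_b)
        improved = True
        while improved:
            improved = False
            for cand in neighbours(mapping, vars_b):
                cand_score = count_matches(cand, vars_a_set, triples_a, triples_b)
                if cand_score > score:  # take the first improving neighbour
                    mapping, score, improved = cand, cand_score, True
                    break
        if score > best_score:
            best_map, best_score = mapping, score
    return best_map, best_score


# (a / want-01 :ARG0 (b / boy))  vs  (x / want-01 :ARG0 (y / boy))
graph_a = [("a", "instance", "want-01"), ("b", "instance", "boy"), ("a", "ARG0", "b")]
graph_b = [("x", "instance", "want-01"), ("y", "instance", "boy"), ("x", "ARG0", "y")]
mapping, score = hill_climb(["a", "b"], ["x", "y"], graph_a, graph_b)
print(mapping, score)
```

Each restart rescans all neighbours after every improvement, which is simple but quadratic in the number of variables per step; that is the part a faster implementation would want to optimize.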

Alas, these are just suggestions which may or may not be useful to have (in some near or distant future). Again, thanks for your awesome amrlib!

Question for transformers package

The original SPRING code requires transformers<3.0 because of a change in positional embeddings. However, amrlib requires transformers>=3.0. Should I downgrade transformers if I want to use the SPRING model?

Besides, we found that SPRING produces frames that are not in the PropBank frame list. For example, when we parse "It was aged in 30 percent new oak barrels , some coopered from American oak and some from French oak ." the model produces "coch-00" and "coch-01" nodes.

I wonder if this problem is caused by the version of transformers.

Request to add more models

This repo seems to have a model with a higher Smatch score. Could you please include it in this library as well? I tried, and it breaks the code.

Model	Link
AMR2.0+BERT+GR=Smatch80.2	amr2.0.bert.gr.tar.gz
AMR2.0+BERT=Smatch78.7	amr2.0.bert.tar.gz

https://github.com/jcyk/AMR-gs

The T5 model is too slow to be used.

Tips on Debugging FAA Aligner

I have a dataset with hundreds of thousands of parsed AMRs I would like to align. The FAA aligner occasionally fails, most recently with the following error:

/usr/local/lib/python3.7/dist-packages/amrlib/alignments/faa_aligner/feat2tree.py in __init__(self, id, val, type, feats, is_virtual, pi, pi_edge)
     55         alignsplit = val.split('~')
     56         if len(alignsplit) > 1:
---> 57             self.alignset |= set( int(i) for i in alignsplit[1].split('e.')[1].split(',') )
     58             self.val = alignsplit[0]
     59 

IndexError: list index out of range

I have checked that none of the graphs are empty (parser returned nothing). Not sure what else it could be.

If I don't run the FAA aligner on batches of samples it takes a really long time, so I am wondering if there is any way to identify the samples breaking this code without aligning them individually.
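
One generic way to narrow this down without aligning every sample on its own is to bisect the failing batch. The sketch below is not amrlib code: `run_batch` is a hypothetical callable standing in for whatever batched alignment call is in use, and `fake_align` only simulates a failure.

```python
def find_failing(items, run_batch):
    """Return indices of items that make run_batch raise, found by bisection."""
    def search(lo, hi):
        if lo >= hi:
            return []
        try:
            run_batch(items[lo:hi])
            return []  # the whole slice succeeded
        except Exception:
            if hi - lo == 1:
                return [lo]  # a single failing item
            mid = (lo + hi) // 2
            return search(lo, mid) + search(mid, hi)
    return search(0, len(items))


# Stand-in for a batched aligner call: fails if any graph carries a marker.
def fake_align(batch):
    for g in batch:
        if "BAD" in g:
            raise IndexError("list index out of range")


graphs = ["g0", "g1", "BAD g2", "g3", "g4", "BAD g5", "g6"]
failing = find_failing(graphs, fake_align)
print(failing)
```

With few bad samples this needs roughly a logarithmic number of batch runs per failure instead of one run per sample. It only finds items that fail on their own; a batch that fails because of an interaction between samples would be reported as having no single culprit.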

What is the meaning of the alignment result?

Hi, thanks for this work.
I am trying to understand the alignment result of FAA. For example,

    # ::status ParsedStatus.OK
    # ::source QA-pairs-first-round_ali-baba-and-forty-thieves-story_1.txt
    # ::nsent 6
    # ::snt building on their father 's business -
    (z0 / build-01
        :location (z1 / business
                      :poss (z2 / person
                                :ARG0-of (z3 / have-rel-role-91
                                             :ARG1 (z4 / they)
                                             :ARG2 (z5 / father)))))

The corresponding alignment result is:

   0-1 1-1.1.r 5-1.1 4-1.1.1.r 2-1.1.1.1.1 3-1.1.1.1.2

I know that the left part of each left-right pair is the index of the token in the source sentence (::snt). For the right part, does 1 denote the root node, and 1.1 the first child of the root node? What is the meaning of r in 1.1.r? Thanks.
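
As a reading aid, the sketch below splits the alignment string into (token, graph address, is_role) entries. It assumes the common ISI-style convention in which the address 1 is the root, each further `.n` selects the n-th child, and a trailing `.r` marks an alignment to the incoming edge (role) rather than to the node itself. That interpretation is an assumption, not something confirmed in this thread, and `parse_alignments` is a hypothetical helper.

```python
def parse_alignments(align_string, tokens):
    """Turn 'tokidx-address' pairs into (token, address, is_role) tuples."""
    out = []
    for pair in align_string.split():
        tok_idx, address = pair.split("-", 1)
        is_role = address.endswith(".r")
        if is_role:
            address = address[:-2]  # drop the '.r' suffix
        out.append((tokens[int(tok_idx)], address, is_role))
    return out


tokens = "building on their father 's business -".split()
parsed = parse_alignments(
    "0-1 1-1.1.r 5-1.1 4-1.1.1.r 2-1.1.1.1.1 3-1.1.1.1.2", tokens
)
for tok, addr, is_role in parsed:
    print(f"{tok!r:12} -> {addr} ({'edge' if is_role else 'node'})")
```

Under that reading, "on" lines up with the :location edge into 1.1 and "business" with the node 1.1 itself, which matches the graph above.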

Problem when loading amr with snt having new line \x85 character

Hi, I used the function load_and_serialize in penman_serializer to load AMR 2.0. It failed to load and serialize because the snt line contains the character \x85, which might be treated as a newline character and break the line:

['', '# ::id bolt-eng-DF-170-181103-8882248_0335.5 ::date 2015-10-27T06:21:31 ::annotator SDL-AMR-09 ::preferred', '# ::snt That\x92s what we\x92re with\x85You\x92re not sittin\x92 there in a back alley and sayin\x92 hey what do you say, five bucks?', '# ::save-date Thu Oct 29, 2015 ::file bolt-eng-DF-170-181103-8882248_0335_5.txt', '(m / multi-sentence', '      :snt1 (a / accompany-01', '            :ARG0 (t / that)', '            :ARG1 (w / we))', '      :snt2 (a2 / and :polarity -', '            :op1 (s / sit-01', '                  :ARG1 (y / you)', '                  :ARG2 (a3 / alley', '                        :mod (b / back)', '                        :mod (t2 / there)))', '            :op2 (s2 / say-01', '                  :ARG0 y', '                  :ARG1 (s3 / say-01', '                        :ARG0 (y2 / you)', '                        :ARG1 (a4 / amr-unknown)', '                        :topic (m2 / monetary-quantity :quant 5', '                              :unit (b2 / buck))', '                        :mod (h / hey :mode expressive)))))', '']

Do you have any idea how to fix this issue? I am running on macOS.
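
A short illustration of why \x85 can split a line: Python's `str.splitlines()` treats \x85 (NEL) as a line boundary, while `split('\n')` does not. Whether the loader actually uses `splitlines()` is an assumption here, and `clean_line` is a hypothetical clean-up helper, not part of amrlib.

```python
text = "# ::snt with\x85You\n(m / multi-sentence)"

by_splitlines = text.splitlines()  # breaks at both '\x85' and '\n'
by_newline = text.split("\n")      # breaks only at '\n'
print(len(by_splitlines), len(by_newline))

# In Windows-1252, byte 0x85 is an ellipsis and 0x92 a right single quote,
# so mapping them to plain ASCII before loading avoids the stray line break.
CP1252_FIXES = str.maketrans({"\x85": "...", "\x92": "'"})


def clean_line(line):
    """Replace the two Windows-1252 leftovers seen in the example above."""
    return line.translate(CP1252_FIXES)


print(clean_line("That\x92s what we\x92re with\x85You\x92re not"))
```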

Warning from T5 transformer model about line ends </s>

The following is seen during unit tests...

testGtoS (auto.BasicAPITests.BasicAPITests) ... /home/bjascob/.local/lib/python3.8/site-packages/transformers/models/t5/tokenization_t5.py:183: UserWarning: This sequence already has </s>. In future versions this behavior may lead to duplicated eos tokens being added.

Note that I'm adding a </s> to all T5-based serializations (including during training). Find the correct behavior and fix.

amr_view : ImportError: cannot import name 'T5ForConditionalGeneration' from 'transformers'

When running amr_view I am getting an import error, though I'm not sure exactly why. The full error is..

  File "/home/bjascob/.local/lib/python3.8/site-packages/amrlib/models/parse_t5/inference.py", line 7, in <module>
    from   transformers import T5ForConditionalGeneration, T5Tokenizer
ImportError: cannot import name 'T5ForConditionalGeneration' from 'transformers' (/home/bjascob/.local/lib/python3.8/site-packages/transformers/__init__.py)

The library is installed and loading the model from the command line (stog = amrlib.load_stog_model()) works fine. I'm guessing this version of transformers has some dynamic loading issue for that module.

The simple fix is to put import transformers on the line above the error. That extra import seems to fix the issue.

Model generate_t5-v0_1_0 drop in BLEU scores

Recent testing shows that the BLEU scores for model_t5_generate have dropped by about 1.5 points.

Original scores at model release:
num_beams = 1 batch_size = 32 BLEU score: 41.99
num_beams = 16 batch_size = 4 BLEU score: 43.09
Retest 12/26/2020
num_beams = 1 batch_size = 32 BLEU score: 40.65
num_beams = 16 batch_size = 4 BLEU score: 41.69

Likely candidates for the drop in scores are the update to transformers 4.0 and/or the changes to the code for compatibility with that release. More testing is required to determine what's happened.

Training Parse Model with torch 1.7.0

Looks like the parse model has an issue training with torch 1.7.0. torch 1.6.0 should work.

Error message when training with torch 1.7.0:

    Traceback (most recent call last):
      File "./20_Train.py", line 14, in <module>
      File "/home/bjascob/Files/Projects/AMR/012_AMRLib_GitHub/amrlib/models/parse_gsii/trainer.py", line 114, in run_training
        loss.backward()
      File "/home/bjascob/.local/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/home/bjascob/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
        Variable._execution_engine.run_backward(
    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [24, 129, 1536]], which is output 0 of AddBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

more AMR-2-text models?

Hello,

Thanks for the awesome work! Your library helps simplify the task on AMR a lot!

I wonder, do you have plans to support more AMR2text generative models? I find that the T5 model can't handle deeply nested AMR.

Handle graph repair error in deserialize

Rarely I get an error here:

amrlib/amrlib/models/parse_t5/penman_serializer.py

Line 152 in 39c8929

variable = node_stack[-1]

    graphs = self.parser.parse_sents(sents)
  File "xxxx/python3.7/site-packages/amrlib/models/parse_t5/inference.py", line 70, in parse_sents
    gstring = PenmanDeSerializer(g).get_graph_string()
  File "xxxx/python3.7/site-packages/amrlib/models/parse_t5/penman_serializer.py", line 104, in __init__
    self.deserialize(gstring)       # sets self.pgraph and self.gstring
  File "xxxx/site-packages/amrlib/models/parse_t5/penman_serializer.py", line 152, in deserialize
    variable = node_stack[-1]
IndexError: list index out of range

This error, and possibly similar ones, could maybe be mitigated by replacing

        self.deserialize(gstring)       # sets self.pgraph and self.gstring

with

        try:
            self.deserialize(gstring) # sets self.pgraph and self.gstring
        except Exception:             # avoid a bare except, which also catches KeyboardInterrupt
            self.pgraph, self.gstring = None, None

But there may be a better fix.

This tokenizer was incorrectly instantiated with a model max length of 512

After updating my system to transformers 4.19.4 (previously 4.16.2) with SentencePiece 0.1.96, I'm getting the following message: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5. This only happens for the T5Tokenizer, not BART, so it only impacts the generate_t5 and generate_t5wtense models.

To reproduce...

>>> from transformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')

Note that this also happens with AutoTokenizer.from_pretrained('t5-base') (but not bart-base)

It looks like this is an issue with the transformers code, since T5 should have a max length of 512. Since I'm setting max_length=512 during tokenization it shouldn't be an issue anyway. Hopefully this message will go away with later updates to the transformers lib.
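
Until then, the message can be filtered by its text. This is a minimal sketch that assumes the message is emitted through Python's `warnings` module as a `FutureWarning`, as the output quoted above suggests; `silence_t5_max_length_warning` is a hypothetical helper name, and the `catch_warnings` block only demonstrates the effect.

```python
import warnings


def silence_t5_max_length_warning():
    """Ignore only the T5 max-length FutureWarning, matched by message prefix."""
    warnings.filterwarnings(
        "ignore",
        message=r"This tokenizer was incorrectly instantiated",
        category=FutureWarning,
    )


# Demonstration: the targeted message is dropped, other warnings still show.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_t5_max_length_warning()
    warnings.warn(
        "This tokenizer was incorrectly instantiated with a model max length of 512",
        FutureWarning,
    )
    warnings.warn("some other deprecation", FutureWarning)

print([str(w.message) for w in caught])
```

Filtering on the message text rather than on the whole `FutureWarning` category keeps unrelated deprecation notices visible.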

AMR2AMR Variable Alignment

@flipz357 suggests the following addition:

AMR2AMR variable alignment: This could be useful, e.g., for computing AMR metrics or enabling sentence retrieval via AMR parsed corpora or sentence similarity computation via AMR. It is an NP-hard problem but can be implemented via hill climbing maximizing triple matching. I have been working on this lately, here is a repo containing AMR metrics (Smatch and S2match) that are based on this alignment. Both alignments should be quite easy to implement in the lib, since it's all native python. (It could be worthwhile, though, to make the alignment faster, e.g., using cython, since it can be very slow for graphs with many variables)

bjascob comment:
It looks like the feature is implemented in the above-mentioned repo (and may still be under development). If other users feel this is required inside of amrlib, feel free to comment. For now let's keep the functionality separate. Maybe in a future iteration we can develop an uber alignment suite for amrlib that includes multiple and malleable functionality.

AttributeError: 'str' object has no attribute 'size'

without spacy
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/amrlib/models/parse_gsii/bert_utils.py", line 66, in average_pooling
_, num_total_subwords, hidden_size = encoded_layers.size()
AttributeError: 'str' object has no attribute 'size'

with spacy
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/amrlib/models/parse_gsii/bert_utils.py", line 66, in average_pooling
token_index = torch.arange(num_tokens).view(1, -1, 1).type_as(token_subword_index)
AttributeError: 'str' object has no attribute 'size'

Hi! I am getting this error with both the spaCy extension and the first stog option, and troubleshooting does not help much. I understand that encoded_layers is somehow a string rather than an array, but how come? I am just trying to execute your examples:

    import amrlib
    stog = amrlib.load_stog_model()
    graphs = stog.parse_sents(['This is a test of the system.', 'This is a second sentence.'])
    for graph in graphs:
        print(graph)

and

    import amrlib
    import spacy
    amrlib.setup_spacy_extension()
    nlp = spacy.load('en_core_web_sm')
    doc = nlp('This is a test of the SpaCy extension. The test has multiple sentences.')
    graphs = doc._.to_amr()
    for graph in graphs:
        print(graph)
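
One plausible way a string could end up in `encoded_layers`, offered as an assumption rather than a diagnosis: if the model call returns a dict-like output object and the calling code tuple-unpacks it, Python iterates over the keys, so the variables receive strings. The dictionary below is a stand-in, not a real model output.

```python
# Stand-in for a dict-like model output (keys chosen for illustration only).
model_output = {"last_hidden_state": [[0.1, 0.2]], "pooler_output": [0.3]}

# Tuple-unpacking a mapping iterates over its keys, not its values.
unpacked_first, unpacked_second = model_output
print(type(unpacked_first).__name__, unpacked_first)

# An explicit lookup returns the value that was intended.
by_key = model_output["last_hidden_state"]
print(type(by_key).__name__)
```

If that is what happens here, comparing the installed transformers version with the one amrlib was written against would be a reasonable first check.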

Metadata information for AMR strings

Hi,
Would it be possible to output other metadata fields when using the stog parsing model? Such as the node offsets?

How to get alignment between english sentence word and AMR node ?

First of all, I am sorry if this is a silly question. I am new to AMR and I am doing a college project on AMR.
Using amrlib I can parse an English sentence and get its textual AMR representation.

Here is the code that I used to parse an english sentence

import spacy
import amrlib
amrlib.setup_spacy_extension()
nlp = spacy.load('en_core_web_sm')
doc = nlp('What did the girl find ?')
graphs = doc._.to_amr()

for graph in graphs:
    print(graph)

and got the below output

 # ::snt What did the girl find ?
 # ::tokens ["What", "did", "the", "girl", "find", "?"]
 # ::ner_tags ["O", "O", "O", "O", "O", "O"]
 # ::ner_iob ["O", "O", "O", "O", "O", "O"]
 # ::pos_tags ["WP", "VBD", "DT", "NN", "VB", "."]
 # ::lemmas ["what", "do", "the", "girl", "find", "?"]
(f0 / find-01
      :ARG0 (g0 / girl)
      :ARG1 (a0 / amr-unknown))

My question is: how can I get the alignment between the AMR nodes and the words of the input English sentence, i.e. something like
# :: alignments 4-5|0 3-4|0.0 0-1|0.1
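
For reference, this is how an alignment line in that shape could be read once it exists. The sketch assumes a JAMR-style format in which each item is `start-end|path`, the token span is half-open, node paths are 0-based with 0 as the root, and `+` joins several nodes aligned to one span. It does not produce alignments, and `parse_span_alignments` is a hypothetical helper.

```python
def parse_span_alignments(line, tokens):
    """Turn 'start-end|path' items into (span_text, node_path) pairs."""
    out = []
    for item in line.split():
        span, paths = item.split("|", 1)
        start, end = (int(n) for n in span.split("-"))
        for path in paths.split("+"):
            out.append((" ".join(tokens[start:end]), path))
    return out


tokens = ["What", "did", "the", "girl", "find", "?"]
pairs = parse_span_alignments("4-5|0 3-4|0.0 0-1|0.1", tokens)
print(pairs)
```

Read this way, "find" maps to the root, "girl" to its first child and "What" to its second child, which matches the graph printed above.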

pip install doesn't install amrlib.alignments

I'm trying to set up amrlib using a pip install. I added a /data folder under amrlib in my conda env and put the models there, and they (seem to) work.

I then follow the instructions

from amrlib.graph_processing.annotator import add_lemmas
penman_graph = add_lemmas(graph_string, snt_key='snt')

and this works too

but then

correspondingly, the alignments subfolder is missing from the amrlib package in my conda env. is this a mistake?
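
A quick way to check which subpackages a given install actually contains, using only the standard library. `has_module` is a hypothetical helper; note that checking a dotted name imports its parent package as a side effect.

```python
import importlib.util


def has_module(name):
    """Return True if the (possibly dotted) module name can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is itself missing.
        return False


for name in ("amrlib", "amrlib.alignments", "amrlib.graph_processing.annotator"):
    print(name, has_module(name))
```

If `amrlib` is found but `amrlib.alignments` is not, the installed distribution was most likely built without that subpackage.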

thanks in advance

-Jack

AMRPlot Only Creates One Attribute Box

When plotting a graph where the same attribute string appears multiple times, AMRPlot only creates one square box. This isn't technically wrong but looks a little confusing. For this to work correctly in Graphviz I probably need to create unique IDs for each instance of the attribute.
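
The unique-ID idea can be sketched in plain DOT text, independent of AMRPlot's actual code. The sketch assumes attributes arrive as (source variable, role, value) triples with simple values that need no escaping; `to_dot` and the `attrN` naming are hypothetical.

```python
from itertools import count


def to_dot(edges):
    """Build DOT text with one box per attribute occurrence."""
    ids = count()
    lines = ["digraph amr {"]
    for source, role, value in edges:
        node_id = f"attr{next(ids)}"  # unique per occurrence, not per value
        lines.append(f'  {node_id} [label="{value}", shape=box];')
        lines.append(f'  {source} -> {node_id} [label="{role}"];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot([("a", ":polarity", "-"), ("b", ":polarity", "-")])
print(dot)
```

Because the node ID is separate from the label, two attributes with the same text render as two boxes instead of collapsing into one.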

	for bnum, g in enumerate(raw_graphs):
	gstring = PenmanDeSerializer(g).get_graph_string()
	if gstring is not None: