amir-zeldes / depedit

A simple configurable tool for manipulating dependency trees.

Home Page: https://gucorpling.org/depedit/

License: Apache License 2.0

Python 100.00%

depedit's Issues

Basic dependency head of 0 is added for ellipsis node

The ellipsis node should only exist in the enhanced representation, thus its basic dependency head column should be empty (_). E.g. for 20111108072305AAPJTjj_ans.xml.conllu:

54c54
< 10.1	has	have	VERB	VBZ	_	_	_	8:parataxis	CopyOf=-1
---
> 10.1	has	have	VERB	VBZ	_	0	_	8:parataxis	CopyOf=-1
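
A check along the following lines could catch this in output files. It is a minimal sketch and not part of DepEdit: the function name `empty_node_head_errors` is hypothetical, and it assumes the standard 10-column CoNLL-U layout in which empty nodes carry a decimal ID and `_` in the HEAD column.

```python
def empty_node_head_errors(conllu_text):
    """Return the IDs of empty nodes whose HEAD column is not '_'."""
    bad_ids = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a token line in 10-column CoNLL-U
        tok_id, head = cols[0], cols[6]
        # Empty (ellipsis) nodes are the ones with a decimal ID such as 10.1
        if "." in tok_id and head != "_":
            bad_ids.append(tok_id)
    return bad_ids


good = "10.1\thas\thave\tVERB\tVBZ\t_\t_\t_\t8:parataxis\tCopyOf=-1"
bad = "10.1\thas\thave\tVERB\tVBZ\t_\t0\t_\t8:parataxis\tCopyOf=-1"
print(empty_node_head_errors(good))  # []
print(empty_node_head_errors(bad))   # ['10.1']
```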

Value of $1 is sticky across rule applications

Script:

; move PROPN dependents under NOUN (basic, enhanced)
upos=/PROPN/;upos=/NOUN/&func=/flat/;func!=/flat/	#1>#2;#1>#3	#2>#3
upos=/PROPN/;upos=/NOUN/&func=/flat/;func!=/flat/&edep=/(.*)/	#1>#2;#1~#3	#2~#3;#3:edep=$1;#1~#3;#3:edep=OLDA$1
; change external basic head, change flat to compound
upos=/PROPN/&func=/(.*)/;upos=/NOUN/&func=/flat/;lemma=/.*/	#1>#2;#3>#1	#3>#2;#2:func=$1;#2>#1;#1:func=compound

Input:

# sent_id = weblog-juancole.com_juancole_20040604210986_ENG_20040604_210986-0022
# text = The bungling of post-war Iraq by the Bush administration created a weak and failed state.
1	The	the	DET	DT	Definite=Def|PronType=Art	2	det	2:det	_
2	bungling	bungling	NOUN	NN	Number=Sing	10	nsubj	10:nsubj	_
3	of	of	ADP	IN	_	5	case	5:case	_
4	post-war	post-war	ADJ	JJ	Degree=Pos	5	amod	5:amod	_
5	Iraq	Iraq	PROPN	NNP	Number=Sing	2	nmod	2:nmod:of	_
6	by	by	ADP	IN	_	8	case	8:case	_
7	the	the	DET	DT	Definite=Def|PronType=Art	8	det	8:det	_
8	Bush	Bush	PROPN	NNP	Number=Sing	2	nmod	2:nmod:by	_
9	administration	administration	NOUN	NN	Number=Sing	8	flat	8:flat	_
10	created	create	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	0:root	_
11	a	a	DET	DT	Definite=Ind|PronType=Art	15	det	15:det	_
12	weak	weak	ADJ	JJ	Degree=Pos	15	amod	15:amod	_
13	and	and	CCONJ	CC	_	14	cc	14:cc	_
14	failed	fail	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	12	conj	12:conj:and|15:amod	_
15	state	state	NOUN	NN	Number=Sing	10	obj	10:obj	SpaceAfter=No
16	.	.	PUNCT	.	_	10	punct	10:punct	_

Result:

6	by	by	ADP	IN	_	9	case	8:OLDAcase|9:case	_
7	the	the	DET	DT	Definite=Def|PronType=Art	9	det	8:OLDAcase|9:case	_
8	Bush	Bush	PROPN	NNP	Number=Sing	9	compound	2:nmod:by	_
9	administration	administration	NOUN	NN	Number=Sing	2	nmod	8:flat	_

Expected (det in the edeps of node 7):

6	by	by	ADP	IN	_	9	case	8:OLDAcase|9:case	_
7	the	the	DET	DT	Definite=Def|PronType=Art	9	det	8:OLDAdet|9:det	_
8	Bush	Bush	PROPN	NNP	Number=Sing	9	compound	2:nmod:by	_
9	administration	administration	NOUN	NN	Number=Sing	2	nmod	8:flat	_

It seems the case value of $1 from the 2nd rule applied to node 6 is repeated on node 7.

Head field is buggy

The rule

lemma=/^[Tt]he$/&upostag=/^PROPN$/&head=/(.*)/
none
#1:upostag=DET;#1:xpostag=DT;#1:morph=Definite=Def|PronType=Art;#1:deprel=det;#1:deps=$1:amod

is putting 3-digit indices in the deps field where the head index should be.

Optimize precompiled rule matcher objects

I am using DepEdit to modify the .conllu files in English-EWT. One annoyance is that those files end with a blank line that DepEdit deletes. I can hack it to add a newline at the end, but should it support this in a more principled way (e.g. preserve all blank lines outside sentences that are seen in the input)?
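
Until this is handled inside the tool, a wrapper script can copy the trailing newlines of the input onto the output. The sketch below is only an illustration of that workaround: `restore_trailing_newlines` is a hypothetical helper, not a DepEdit function, and it only deals with the end of the file, not with blank lines elsewhere outside sentences.

```python
def restore_trailing_newlines(original, processed):
    """Give `processed` the same run of trailing newlines as `original`."""
    # Count how many newline characters end the original text
    trailing = len(original) - len(original.rstrip("\n"))
    # Strip whatever the processed text ends with and re-add the original run
    return processed.rstrip("\n") + "\n" * trailing


original = "1\ta\n\n"   # token line followed by a blank line
processed = "1\tb\n"    # the blank line was dropped during processing
print(repr(restore_trailing_newlines(original, processed)))  # '1\tb\n\n'
```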

Removing/replacing/changing the head of an enhanced dependency relation

I am looking for a way to replace an edep keeping the relation name but changing the head to a node matched in the query. I don't see any discussion of this case in the docs.

Some things that I tried:

Matching against edep: This adds a new edep but doesn't remove the old one.

upos=/PROPN/;upos=/NOUN/&func=/flat/;func!=/flat/&edep=/(.*)/	#1>#2;#1~#3	#2~#3;#3:edep=$1

Matching against node 3's edom string with #1 in the regex representing node 1 (no effect):

upos=/PROPN/;upos=/NOUN/&func=/flat/;func!=/flat/&edom=/#1:(.*)/	#1>#2	#3:edom=#2:$1

Matching node 1's token number in a capturing group and referencing it from the regex for node 3's edom string (no effect):

upos=/PROPN/&num=/(.*)/;upos=/NOUN/&func=/flat/;func!=/flat/&edom=/^$1:(.*)/	#1>#2	#3:edom=#2:$2

edom as the action only, in hopes of replacing all the node's edeps (no effect):

upos=/PROPN/;upos=/NOUN/&func=/flat/&num=/(.*)/;func!=/flat/&edep=/(.*)/	#1>#2;#1~#3	#3:edom=$1||$2

Clearing the old edep and then replacing it with #3:edep=;#3:edep=$1 as the action (syntax error).
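
For comparison, the intended edit is easy to state outside DepEdit, directly on the DEPS column. The following is a rough Python sketch of that operation, not DepEdit syntax: `rehead_edep` is a hypothetical helper, and it assumes DEPS is a `|`-separated list of `head:relation` pairs kept in head order.

```python
def rehead_edep(deps, old_head, new_head):
    """Point every enhanced relation headed by old_head at new_head, keeping its label."""
    if deps == "_":
        return deps  # no enhanced dependencies on this token
    updated = []
    for entry in deps.split("|"):
        # Split on the first colon only, so labels like conj:and stay intact
        head, _, rel = entry.partition(":")
        if head == old_head:
            head = new_head
        updated.append((head, rel))
    # Drop duplicates and order by head index, then by relation label
    unique = sorted(set(updated), key=lambda hr: (float(hr[0]), hr[1]))
    return "|".join(h + ":" + r for h, r in unique)


print(rehead_edep("8:case", "8", "9"))               # 9:case
print(rehead_edep("12:conj:and|15:amod", "12", "14"))  # 14:conj:and|15:amod
```

The old relation disappears because the pair is rewritten in place rather than a new pair being appended, which is the behaviour the attempts above were trying to get.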

Python 3 error

On my system:

$ python3 ~/dev/nlp-tools/DepEdit/depedit/depedit.py -c fixhash.depedit.ini "reviews/*.conllu"
Traceback (most recent call last):
  File "/Users/nathan/dev/nlp-tools/DepEdit/depedit/depedit.py", line 862, in <module>
    main(parser.parse_args())
  File "/Users/nathan/dev/nlp-tools/DepEdit/depedit/depedit.py", line 838, in main
    f.write(output_trees.encode("utf-8"))
TypeError: write() argument must be str, not bytes
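
The traceback shows bytes being passed to a file opened in text mode, which Python 3 rejects. One possible fix is sketched below; how `f` is actually opened in `depedit.py` is not visible in the traceback, so the `io.open` call and the helper name `write_output` are assumptions made for illustration.

```python
import io
import os
import tempfile


def write_output(path, output_trees):
    """Write unicode text as UTF-8 without calling .encode() first."""
    # io.open accepts an encoding on both Python 2 and Python 3, so the
    # text is handed over as-is and encoded by the file object.
    with io.open(path, "w", encoding="utf8", newline="\n") as f:
        f.write(output_trees)


tmp_path = os.path.join(tempfile.mkdtemp(), "out.conllu")
write_output(tmp_path, u"1\tcaf\u00e9\n")
```

The alternative of opening the file with mode `"wb"` and keeping the `.encode("utf-8")` call would also avoid the error.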

Add relation shorthand

Similarly to AQL shorthand, support an alternative relation notation like this:

#1.#2.#3

Instead of forcing explicit:

#1.#2;#2.#3
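
A preprocessing step of roughly this shape could expand the shorthand before rules are compiled. It is a sketch under assumptions: the function names are hypothetical, and it treats anything between two `#N` references as the operator.

```python
import re


def expand_relation_chain(relation):
    """Expand '#1.#2.#3' into '#1.#2;#2.#3'; leave binary relations unchanged."""
    # Splitting on a capturing group keeps the node references in the result:
    # '#1.#2.#3' -> ['', '#1', '.', '#2', '.', '#3', '']
    parts = re.split(r"(#\d+)", relation)
    nodes = parts[1::2]
    operators = parts[2:-1:2]
    if len(nodes) < 3:
        return relation  # already binary, or not a relation at all (e.g. 'none')
    pairs = [nodes[i] + operators[i] + nodes[i + 1] for i in range(len(operators))]
    return ";".join(pairs)


def expand_relations(relations):
    """Apply the expansion to each ';'-separated relation in a rule column."""
    return ";".join(expand_relation_chain(r) for r in relations.split(";"))


print(expand_relation_chain("#1.#2.#3"))    # #1.#2;#2.#3
print(expand_relations("#1>#2>#3;#1.#3"))   # #1>#2;#2>#3;#1.#3
```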

mtlmodel.py error

I get an error only when running on a list of files.
When running on that same file alone, it runs OK.

(nlp_env) F:\nlp_project\HebPipe\hebpipe>python heb_pipe.py  "F:\nlp_project\responsa_texts\all files\all files\*.txt"  --dirout "F:\nlp_project\responsa_texts\hebpipe_output\all files"  --cpu
! You selected no processing options
! Assuming you want all processing steps

Running tasks:
====================
o Automatic sentence splitting (neural)
o Whitespace tokenization
o Morphological segmentation
o POS and Morphological tagging
o Lemmatization
o Dependency parsing
o Entity recognition
o Coreference resolution

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 216kB [00:00, ?B/s]
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
Processing שו ת אבני נזר חלק אה ע סימן א.txt
C:\Users\msperka\AppData\Local\anaconda3\envs\nlp_env\lib\site-packages\sklearn\base.py:324: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.0.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Processing שו ת אבני נזר חלק אה ע סימן ב.txt
Processing שו ת אבני נזר חלק אה ע סימן ג.txt
Processing שו ת אבני נזר חלק אה ע סימן ד.txt
Processing שו ת אבני נזר חלק אה ע סימן ה.txt
Processing שו ת אבני נזר חלק אה ע סימן ו.txt
Processing שו ת אבני נזר חלק אה ע סימן ז.txt
Processing שו ת אבני נזר חלק אה ע סימן ח.txt
Processing שו ת אבני נזר חלק אה ע סימן ט.txt
Traceback (most recent call last):
  File "heb_pipe.py", line 851, in <module>
    run_hebpipe()
  File "heb_pipe.py", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "heb_pipe.py", line 613, in nlp
    tagged_conllu, tokenized, morphs, words = mtltagger.predict(tokenized,sent_tag=sent_tag,checkpointfile=model_dir + 'heb.sbdposmorph.pt')
  File "F:\nlp_project\HebPipe\hebpipe\lib\mtlmodel.py", line 1273, in predict
    split_indices, pos_tags, morphs, words = self.inference(no_pos_lemma,sent_tag=sent_tag,checkpointfile=checkpointfile)
  File "F:\nlp_project\HebPipe\hebpipe\lib\mtlmodel.py", line 1015, in inference
    for i in range(0, len(preds)):
TypeError: object of type 'int' has no len()
Elapsed time: 0:57:44.609
========================================

Attached input file: [שו ת אבני נזר חלק אה ע סימן ט.txt](https://github.com/amir-zeldes/DepEdit/files/12388027/default.txt)

tok_id = str(int(cols[0]) + tokoffset)

A line that triggers the error:
https://github.com/UniversalDependencies/UD_English/blob/7b898d9ac599b3d21bc36f487234d75943081878/not-to-release/sources/answers/20111108072305AAPJTjj_ans.xml.conllu#L54
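
The linked line has the ID `10.1`, which `int()` cannot parse. A variant of the statement that also accepts empty-node IDs and multiword ranges might look like the sketch below. The helper name `offset_token_id` is hypothetical, and the meaning of `tokoffset` (an integer shift applied to token numbers) is assumed from its name.

```python
def offset_token_id(raw_id, tokoffset):
    """Shift a CoNLL-U ID by tokoffset, also for empty nodes (10.1) and ranges (3-4)."""
    if "-" in raw_id:
        # Multiword token range: shift both ends
        start, end = raw_id.split("-")
        return offset_token_id(start, tokoffset) + "-" + offset_token_id(end, tokoffset)
    major, dot, minor = raw_id.partition(".")
    # Only the integer part is shifted; a '.1' suffix is kept as text
    return str(int(major) + tokoffset) + dot + minor


print(offset_token_id("10", 5))    # 15
print(offset_token_id("10.1", 5))  # 15.1
print(offset_token_id("3-4", 2))   # 5-6
```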

Aliases

Add upos for pos/upostag, xpos for pos/cpostag

documentation not clear?

Given

4	por	por	ADP	PRP|@ADVL>	_	9	advmod	_	MWEPOS=ADV
5	exemplo	exemplo	NOUN	N|M|S|@P<	Gender=Masc|Number=Sing	4	fixed	_	ChangedBy=Issue165|SpaceAfter=No

The ini file

func!=/fixed/&text=/(.*)/&misc=/(.*)/;func=/fixed/&text=/(.*)/	#1>#2	#1:misc=$2|MWE=$1_$3

Why is the system producing the output below? I just want to add an MWE attribute built from the text of the two tokens to the MISC field of the head.

4	por	por	ADP	PRP|@ADVL>	_	9	advmod	_	MWEPOS=ADV|MWE=MWEPOS=ADV_MWEPOS=ADV
5	exemplo	exemplo	NOUN	N|M|S|@P<	Gender=Masc|Number=Sing	4	fixed	_	ChangedBy=Issue165|SpaceAfter=No

The expected output was:

4	por	por	ADP	PRP|@ADVL>	_	9	advmod	_	MWEPOS=ADV|MWE=por_exemplo
5	exemplo	exemplo	NOUN	N|M|S|@P<	Gender=Masc|Number=Sing	4	fixed	_	ChangedBy=Issue165|SpaceAfter=No
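
To make the intended result unambiguous, here is the same transformation written directly in Python rather than as a DepEdit rule. It is only a sketch of the expected behaviour, with a hypothetical function name, and it says nothing about how DepEdit numbers capture groups internally.

```python
def add_mwe_to_head(sentence_lines):
    """Append MWE=<head form>_<dependent form> to the MISC of each head of a 'fixed' token."""
    rows = [line.split("\t") for line in sentence_lines]
    by_id = {row[0]: row for row in rows}
    for row in rows:
        # Column 8 (index 7) is DEPREL, column 7 (index 6) is HEAD
        if row[7] == "fixed" and row[6] in by_id:
            head = by_id[row[6]]
            mwe = "MWE=" + head[1] + "_" + row[1]
            # Column 10 (index 9) is MISC; '_' means it is empty
            head[9] = mwe if head[9] == "_" else head[9] + "|" + mwe
    return ["\t".join(row) for row in rows]


sentence = [
    "4\tpor\tpor\tADP\tPRP|@ADVL>\t_\t9\tadvmod\t_\tMWEPOS=ADV",
    "5\texemplo\texemplo\tNOUN\tN|M|S|@P<\tGender=Masc|Number=Sing\t4\tfixed\t_\tChangedBy=Issue165|SpaceAfter=No",
]
result = add_mwe_to_head(sentence)
print(result[0].split("\t")[9])  # MWEPOS=ADV|MWE=por_exemplo
```

A head with several `fixed` dependents would receive one `MWE=` entry per dependent in this version, which may or may not be what is wanted.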

Empty node indices should not be floats

When I run DepEdit on files containing Enhanced UD ellipsis nodes, it rewrites their IDs as floats, even if no rule affects the token. E.g.:

7.1    found   find    VERB    VBD     Mood=Ind|Tense=Past|VerbForm=Fin        _       _       3:conj  CopyOf=3

Becomes

7.100000000000001      found   find    VERB    VBD     Mood=Ind|Tense=Past|VerbForm=Fin        _       _       3:conj  CopyOf=3
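
Binary floats cannot represent values like 7.1 exactly, so any arithmetic on them risks this kind of output. One way to avoid floats altogether is to keep the ID as the original string and derive an integer tuple only when ordering is needed. This is a minimal sketch with a hypothetical helper name, not a description of DepEdit's internals.

```python
def id_sort_key(raw_id):
    """Turn '7' or '7.1' into an integer tuple for ordering, without using floats."""
    major, _, minor = raw_id.partition(".")
    # Regular tokens get minor index 0, so 7 sorts before 7.1
    return (int(major), int(minor or 0))


ids = ["8", "7.1", "7", "7.10", "7.2"]
# The ID strings themselves are never converted, so they are written back unchanged
print(sorted(ids, key=id_sort_key))  # ['7', '7.1', '7.2', '7.10', '8']
```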