Validation error for PROIEL/TOROT treebanks [L3 Syntax too-many-subjects] Node has multiple subjects not subtyped as ':outer' about docs HOT 7 CLOSED

hanneme commented on August 18, 2024

Validation error for PROIEL/TOROT treebanks [L3 Syntax too-many-subjects] Node has multiple subjects not subtyped as ':outer'

from docs.

Comments (7)

martinpopel commented on August 18, 2024

> marking the non-embedded subject as nsubj:outer automatically would be complicated

It is quite easy using Udapi:

udapy read.Conllu files='!*.conllu' util.Eval node='
inner = [c for c in node.children if c.udeprel.endswith("subj") and c.sdeprel!="outer" and c.misc["Subject"]!="Outer"]
if len(inner) > 1:
  for c in inner[1:] if node < inner[0] else inner[:-1]:
    if not c.sdeprel:
      c.sdeprel = "outer"
    else:
      c.misc["Subject"] = "Outer"' \
write.Conllu path=fixed

You can then visualize the edits using the following command, e.g. with UD_Latin-PROIEL dev branch:

udapy -M read.Conllu files=la_proiel-ud-train.conllu zone=orig read.Conllu files=fixed/la_proiel-ud-train.conllu zone=fixed util.MarkDiff gold_zone=orig write.TextModeTreesHtml attributes=form,upos,deprel,misc > diff-train.html

See the output at https://ufallab.ms.mff.cuni.cz/~popel/ud/la-proiel/ (ignore the Mark=1 - it is there just for the visualization purposes) and make sure all the edits are valid.

Note that:

- It may be difficult to guess which of the subjects is the inner one and which are the outer one(s). I use a simple word-order heuristic: if the verb precedes the first subject, the first subject is the inner one. Otherwise, the last subject is the inner one. Is there any better rule?
- If the original deprel already has a subtype, e.g. nsubj:pass, I keep it and use Subject=Outer in MISC to mark the outer subject instead of the standard deprel=nsubj:outer, so that the information about passivization is not lost.
- I am afraid there are still errors in your data, e.g. in sent_id = 33462, I think that "verba prophetiae" in "qui audiunt verba prophetiae" should be labeled as obj rather than nsubj. So perhaps it is still too early to use the above-mentioned Udapi script, which would hide such annotation/conversion errors.
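
For readers without Udapi installed, the word-order heuristic described above can be sketched in plain Python. This is only an illustration: the `Tok` class and the `mark_outer_subjects` function are hypothetical names, not part of Udapi, and the sketch assumes that each head's children are already available with their word positions.

```python
from dataclasses import dataclass, field


@dataclass
class Tok:
    """Hypothetical stand-in for a dependency node (not the Udapi API)."""
    ord: int      # 1-based word position in the sentence
    deprel: str   # full relation, e.g. "nsubj" or "nsubj:pass"
    misc: dict = field(default_factory=dict)


def mark_outer_subjects(head_ord, children):
    """Mark all but one subject of a head as outer; return the marked ones."""
    # Collect subjects (nsubj, csubj) that are not yet marked as outer.
    subjects = sorted(
        (c for c in children
         if c.deprel.split(":")[0].endswith("subj")
         and not c.deprel.endswith(":outer")
         and c.misc.get("Subject") != "Outer"),
        key=lambda c: c.ord,
    )
    if len(subjects) < 2:
        return []
    # If the head precedes the first subject, the first subject is inner;
    # otherwise the last subject is inner. All the others become outer.
    outer = subjects[1:] if head_ord < subjects[0].ord else subjects[:-1]
    for c in outer:
        if ":" in c.deprel:
            c.misc["Subject"] = "Outer"  # keep e.g. nsubj:pass intact
        else:
            c.deprel += ":outer"
    return outer


# Head at position 5 with two subjects to its left: the last one stays inner.
kids = [Tok(1, "nsubj"), Tok(3, "nsubj:pass"), Tok(7, "obj")]
mark_outer_subjects(5, kids)
print(kids[0].deprel, kids[1].deprel)  # nsubj:outer nsubj:pass
```

As in the Udapi one-liner, the choice of the inner subject is a guess based on word order, so the result still needs manual checking.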


nschneid commented on August 18, 2024

FYI, the statement of the policy: https://universaldependencies.org/changes.html#multiple-subjects


hanneme commented on August 18, 2024

Thank you, @nschneid – I read this as a confirmation that we are not obliged to add :outer. (We are still working on the conversion, so we are well aware that there are still errors.)


nschneid commented on August 18, 2024

Yes, the alternative to :outer is manually checking that any multiple subjects are correct, and registering this information with the validator. @dan-zeman


dan-zeman commented on August 18, 2024

> I read this as a confirmation that we are not obliged to add :outer.

Yes, that's correct, subtypes are optional, even if this one is highly recommended. If you have good reasons for not wanting to use the subtype, you can opt out. To inform the validator that a particular subject has been checked by you and should be allowed despite not having the :outer subtype, you add Subject=Outer to MISC on the line where nsubj:outer would be. Hence, you still have to act on each case individually, and "too complicated" is not a good reason for not using the subtype, but fortunately it is not too complicated, as @martinpopel shows. An example of a good reason is that you would end up with only one instance of nsubj:outer in the whole corpus, and in addition it would occur in the test data, meaning that no parser could ever learn how and where to predict it.
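
To see which heads would still be flagged after such an opt-out, a rough pre-check can be run over a CoNLL-U sentence. The sketch below is not the official validator's logic; it only approximates the rule as described in this comment (a subject is exempt if its deprel has the :outer subtype or its MISC contains Subject=Outer). The function name `unmarked_subject_heads` is hypothetical.

```python
def unmarked_subject_heads(conllu_sentence):
    """Return head IDs having more than one subject not marked as outer."""
    counts = {}
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        head, deprel, misc = int(cols[6]), cols[7], cols[9]
        udeprel, _, subtype = deprel.partition(":")
        if not udeprel.endswith("subj"):
            continue  # only nsubj and csubj count as subjects here
        if subtype == "outer" or "Subject=Outer" in misc.split("|"):
            continue  # explicitly marked as an outer subject
        counts[head] = counts.get(head, 0) + 1
    return sorted(h for h, n in counts.items() if n > 1)
```

An empty result for every sentence suggests that each extra subject has been marked one way or the other; the validator itself remains the authority.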


hanneme commented on August 18, 2024

Many of the structures we have for this are quite complicated, in fact, with many levels of embedding and lots of ellipsis. Just guessing the first subject as outer won't do as the embedded clause can often precede the head. It certainly wouldn't produce data that would allow a parser to learn which subjects are outer! I frankly don't think we have enough examples even if we manually checked all of them either, as they vary quite a lot.


martinpopel commented on August 18, 2024

> Just guessing the first subject as outer won't do

This is not the heuristic I suggested. I suggested that the subject that is closer to the verb is the inner subject. I guess this heuristic may fail as well - this is why I warned about it. It may be especially unreliable if the verb is between the two subjects. There are 28 such cases in the current dev branch of UD_Latin-PROIEL and many of them seem like annotation errors (i.e. one of the subjects should not be labeled so):

cat *.conllu | udapy -TM util.Mark node='
(any(c.udeprel.endswith("subj") for c in node.children(preceding_only=1)) and
 any(c.udeprel.endswith("subj") for c in node.children(following_only=1)))' | less -R
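
The same check, a head with at least one subject on each side, can be sketched without Udapi. The function `heads_between_subjects` is a hypothetical helper, and it assumes the tokens of one sentence are given as `(ord, head, deprel)` tuples.

```python
def heads_between_subjects(tokens):
    """Return heads that have a subject both before and after them.

    tokens: iterable of (ord, head, deprel) tuples for one sentence.
    """
    before, after = set(), set()
    for ord_, head, deprel in tokens:
        if deprel.split(":")[0].endswith("subj"):
            # Record on which side of its head this subject stands.
            (before if ord_ < head else after).add(head)
    return sorted(before & after)


# Word 3 has an nsubj to its left and a csubj to its right.
print(heads_between_subjects([(1, 3, "nsubj"), (3, 0, "root"), (5, 3, "csubj")]))  # [3]
```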

> It certainly wouldn't produce data that would allow a parser to learn which subjects are outer!

Currently, there are 122 instances of double subjects. Even if some of these are annotation errors (as I noted above) that will be fixed in the final conversion, I guess there will still be substantially more than the "one instance of nsubj:outer" mentioned by @dan-zeman as a possible excuse for not using the label. Anyway, many UD corpora contain rare deprel subtypes, and parsers can handle rare events during training, so I think this is not a big issue.
