Comments (7)

martinpopel avatar martinpopel commented on August 18, 2024

marking the non-embedded subject as nsubj:outer automatically would be complicated

It is quite easy using Udapi:

udapy read.Conllu files='!*.conllu' util.Eval node='
inner = [c for c in node.children if c.udeprel.endswith("subj") and c.sdeprel!="outer" and c.misc["Subject"]!="Outer"]
if len(inner) > 1:
  for c in inner[1:] if node < inner[0] else inner[:-1]:
    if not c.sdeprel:
      c.sdeprel = "outer"
    else:
      c.misc["Subject"] = "Outer"' \
write.Conllu path=fixed

You can then visualize the edits using the following command, e.g. with UD_Latin-PROIEL dev branch:

udapy -M read.Conllu files=la_proiel-ud-train.conllu zone=orig read.Conllu files=fixed/la_proiel-ud-train.conllu zone=fixed util.MarkDiff gold_zone=orig write.TextModeTreesHtml attributes=form,upos,deprel,misc > diff-train.html

See the output at https://ufallab.ms.mff.cuni.cz/~popel/ud/la-proiel/ (ignore Mark=1; it is there only for visualization purposes) and make sure all the edits are valid.

Note that

  • It may be difficult to guess which of the subjects is the inner one and which the outer one(s). I use a simple word-order heuristic: if the verb precedes the first subject, the first subject is the inner one. Otherwise, the last subject is the inner one. Is there any better rule?
  • If the original deprel is e.g. nsubj:pass, I keep it and use Subject=Outer in MISC for marking the outer subject instead of the standard deprel=nsubj:outer, so that the information about passivization is not lost.
  • I am afraid there are still errors in your data, e.g. in sent_id = 33462, I think that "verba prophetiae" in "qui audiunt verba prophetiae" should be labeled as obj rather than nsubj. So perhaps it is still too early for using the above-mentioned Udapi script which would hide such annotation/conversion errors.
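
The word-order heuristic from the first bullet can also be sketched in plain Python without Udapi. This is only an illustration: the `Tok` class and the `mark_outer` function are made-up names, not part of Udapi, and the logic simply mirrors the script above (one inner subject per head; the others get `:outer`, or `Subject=Outer` in MISC when a subtype such as `pass` is already present).

```python
from dataclasses import dataclass, field


@dataclass
class Tok:
    """Hypothetical stand-in for a CoNLL-U token (not the Udapi Node API)."""
    ord: int
    head: int
    deprel: str
    misc: dict = field(default_factory=dict)

    @property
    def udeprel(self):
        # Universal part of the relation, e.g. "nsubj" in "nsubj:pass".
        return self.deprel.split(":")[0]

    @property
    def sdeprel(self):
        # Subtype part of the relation, "" if there is none.
        return self.deprel.partition(":")[2]


def mark_outer(tokens):
    """Apply the word-order heuristic to one sentence, editing tokens in place."""
    by_head = {}
    for t in sorted(tokens, key=lambda t: t.ord):
        is_subj = t.udeprel.endswith("subj")
        already = t.sdeprel == "outer" or t.misc.get("Subject") == "Outer"
        if is_subj and not already:
            by_head.setdefault(t.head, []).append(t)
    for head, subjects in by_head.items():
        if len(subjects) < 2:
            continue
        # Head before the first subject: the first subject is inner.
        # Otherwise the last subject is inner.
        outer = subjects[1:] if head < subjects[0].ord else subjects[:-1]
        for t in outer:
            if t.sdeprel:
                t.misc["Subject"] = "Outer"   # keep e.g. nsubj:pass intact
            else:
                t.deprel = t.udeprel + ":outer"
    return tokens
```

As with the Udapi script, the result is only as good as the heuristic and the input annotation, so every edit still needs to be checked by hand.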

from docs.

nschneid avatar nschneid commented on August 18, 2024

FYI, the statement of the policy: https://universaldependencies.org/changes.html#multiple-subjects

hanneme avatar hanneme commented on August 18, 2024

Thank you, @nschneid – I read this as a confirmation that we are not obliged to add :outer. (We are still working on the conversion, so we are well aware that there are still errors.)

nschneid avatar nschneid commented on August 18, 2024

Yes, the alternative to :outer is manually checking that any multiple subjects are correct, and registering this information with the validator. @dan-zeman

dan-zeman avatar dan-zeman commented on August 18, 2024

I read this as a confirmation that we are not obliged to add :outer.

Yes, that's correct, subtypes are optional, even if this one is highly recommended. If you have good reasons for not wanting to use the subtype, you can opt out. To inform the validator that a particular subject has been checked by you and should be allowed despite not having the :outer subtype, you add Subject=Outer to MISC on the line where nsubj:outer would be. Hence, you still have to act on each case individually, and "too complicated" is not a good reason for not using the subtype, but fortunately it is not too complicated, as @martinpopel shows. An example of a good reason is that you would end up with only one instance of nsubj:outer in the whole corpus, and in addition it would occur in the test data, meaning that no parser could ever learn how and where to predict it.
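
As a rough illustration of the rule described above (this is not the actual validator code), the following sketch lists heads that still have more than one subject once subjects marked with `:outer` or with `Subject=Outer` in MISC are set aside. It assumes standard 10-column, tab-separated CoNLL-U; the function name `unmarked_multiple_subjects` is made up for this example.

```python
from collections import defaultdict


def unmarked_multiple_subjects(conllu_sentence):
    """Return {head_id: [subject_ids]} for heads with 2+ subjects not marked as outer."""
    subjects = defaultdict(list)
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and comments
        cols = line.split("\t")
        tok_id, head, deprel, misc = cols[0], cols[6], cols[7], cols[9]
        if not tok_id.isdigit():
            continue                      # skip multiword ranges and empty nodes
        udeprel, _, subtype = deprel.partition(":")
        if not udeprel.endswith("subj"):
            continue
        if subtype == "outer" or "Subject=Outer" in misc.split("|"):
            continue                      # explicitly marked as an outer subject
        subjects[head].append(tok_id)
    return {h: ids for h, ids in subjects.items() if len(ids) > 1}
```

An empty result means every multiple-subject case in the sentence has been acted on individually, either with the subtype or with the MISC attribute.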

hanneme avatar hanneme commented on August 18, 2024

Many of the structures we have for this are quite complicated, in fact, with many levels of embedding and lots of ellipsis. Just guessing the first subject as outer won't do as the embedded clause can often precede the head. It certainly wouldn't produce data that would allow a parser to learn which subjects are outer! I frankly don't think we have enough examples even if we manually checked all of them either, as they vary quite a lot.

martinpopel avatar martinpopel commented on August 18, 2024

Just guessing the first subject as outer won't do

This is not the heuristic I suggested. I suggested that the subject that is closer to the verb is the inner subject. I guess this heuristic may fail as well - this is why I warned about it. It may be especially unreliable if the verb is between the two subjects. There are 28 such cases in the current dev branch of UD_Latin-PROIEL and many of them seem like annotation errors (i.e. one of the subjects should not be labeled so):

cat *.conllu | udapy -TM util.Mark node='
(any(c.udeprel.endswith("subj") for c in node.children(preceding_only=1)) and
 any(c.udeprel.endswith("subj") for c in node.children(following_only=1)))' | less -R
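
For readers without Udapi installed, the same condition (a head with at least one subject on each side) can be sketched in plain Python. The function name `heads_between_subjects` and the `(ord, head, deprel)` tuple layout are assumptions made for this illustration only.

```python
def heads_between_subjects(rows):
    """rows: iterable of (ord, head, deprel) tuples for one sentence.

    Return the sorted list of heads that have a subject both before and after them.
    """
    before, after = set(), set()
    for ord_, head, deprel in rows:
        # Compare only the universal part of the relation, e.g. "nsubj" in "nsubj:pass".
        if deprel.split(":")[0].endswith("subj"):
            (before if ord_ < head else after).add(head)
    return sorted(before & after)
```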

It certainly wouldn't produce data that would allow a parser to learn which subjects are outer!

Currently, there are 122 instances of double subjects. Even if some of these are annotation errors (as I noted above), which are going to be fixed in the final conversion, I guess there will still be substantially more than the "one instance of nsubj:outer" mentioned by @dan-zeman as a possible excuse for not using the label. Anyway, there are many rare deprel subtypes in many UD corpora and parsers can handle rare events during training, so I think this is not a big issue.
