Comments (7)
marking the non-embedded subject as nsubj:outer automatically would be complicated
It is quite easy using Udapi:
udapy read.Conllu files='!*.conllu' util.Eval node='
inner = [c for c in node.children if c.udeprel.endswith("subj") and c.sdeprel!="outer" and c.misc["Subject"]!="Outer"]
if len(inner) > 1:
    for c in inner[1:] if node < inner[0] else inner[:-1]:
        if not c.sdeprel:
            c.sdeprel = "outer"
        else:
            c.misc["Subject"] = "Outer"' \
  write.Conllu path=fixed
You can then visualize the edits using the following command, e.g. with UD_Latin-PROIEL dev branch:
udapy -M read.Conllu files=la_proiel-ud-train.conllu zone=orig read.Conllu files=fixed/la_proiel-ud-train.conllu zone=fixed util.MarkDiff gold_zone=orig write.TextModeTreesHtml attributes=form,upos,deprel,misc > diff-train.html
See the output at https://ufallab.ms.mff.cuni.cz/~popel/ud/la-proiel/ (ignore the Mark=1; it is there just for visualization purposes) and make sure all the edits are valid.
Note that
- It may be difficult to guess which of the subjects is the inner one and which are the outer one(s). I use a simple word-order heuristic: if the verb precedes the first subject, the first subject is the inner one. Otherwise, the last subject is the inner one. Is there any better rule?
- If the original deprel already has a subtype, e.g. nsubj:pass, I keep it and use Subject=Outer in MISC for marking the outer subject instead of the standard deprel=nsubj:outer, so that the information about passivization is not lost.
- I am afraid there are still errors in your data, e.g. in sent_id = 33462, I think that "verba prophetiae" in "qui audiunt verba prophetiae" should be labeled as obj rather than nsubj. So perhaps it is still too early for using the above-mentioned Udapi script, which would hide such annotation/conversion errors.
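For anyone who wants to experiment with the word-order heuristic outside Udapi, here is a minimal sketch in plain Python. The function names split_subjects and mark_outer are made up for this illustration and are not part of Udapi; positions are 1-based word indices, and the logic only mirrors the rule described above, so it inherits the same weaknesses.

```python
def split_subjects(verb_ord, subjects):
    """Split a predicate's subjects into (inner, outer) by word order.

    verb_ord: 1-based position of the predicate in the sentence.
    subjects: list of (ord, deprel) pairs for its subject children.
    Hypothetical helper: it only restates the heuristic, nothing more.
    """
    subjects = sorted(subjects)
    if len(subjects) < 2:
        return (subjects[0] if subjects else None), []
    if verb_ord < subjects[0][0]:
        # The verb precedes the first subject: the first subject is inner.
        return subjects[0], subjects[1:]
    # Otherwise the last subject is taken as the inner one.
    return subjects[-1], subjects[:-1]


def mark_outer(deprel, misc):
    """Mark one outer subject without losing an existing subtype.

    A bare relation (nsubj, csubj) gets the :outer subtype; a relation
    that already has a subtype (e.g. nsubj:pass) keeps it and gets
    Subject=Outer in MISC instead.
    """
    misc = dict(misc)  # do not modify the caller's dictionary
    if ":" in deprel:
        misc["Subject"] = "Outer"
        return deprel, misc
    return deprel + ":outer", misc


if __name__ == "__main__":
    inner, outer = split_subjects(2, [(4, "nsubj"), (7, "nsubj:pass")])
    print("inner:", inner)
    for _, deprel in outer:
        print("outer:", mark_outer(deprel, {}))
```

When the verb stands between the two subjects, this sketch treats the last subject as inner, which is exactly the situation where the guess is least reliable.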
from docs.
FYI, the statement of the policy: https://universaldependencies.org/changes.html#multiple-subjects
Thank you, @nschneid – I read this as a confirmation that we are not obliged to add :outer. (We are still working on the conversion, so we are well aware that there are still errors.)
Yes, the alternative to :outer is manually checking that any multiple subjects are correct, and registering this information with the validator. @dan-zeman
I read this as a confirmation that we are not obliged to add :outer.
Yes, that's correct, subtypes are optional, even if this one is highly recommended. If you have good reasons for not wanting to use the subtype, you can opt out. To inform the validator that a particular subject has been checked by you and should be allowed despite not having the :outer subtype, you add Subject=Outer to MISC on the line where nsubj:outer would be. Hence, you still have to act on each case individually, and "too complicated" is not a good reason for not using the subtype, but fortunately it is not too complicated, as @martinpopel shows. An example of a good reason is that you would end up with only one instance of nsubj:outer in the whole corpus, and in addition it would occur in the test data, meaning that no parser could ever learn how and where to predict it.
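To see which predicates would still need attention under this policy, something like the following plain-Python sketch could be used. It is not the validator's code; the function name unmarked_multiple_subjects is invented here, and the only assumptions are the standard 10-column CoNLL-U layout (HEAD in column 7, DEPREL in column 8, MISC in column 10) and the two markings discussed above.

```python
from collections import defaultdict


def unmarked_multiple_subjects(sentence):
    """Return head IDs that have more than one subject not marked as outer.

    sentence: the CoNLL-U lines of a single sentence, as one string.
    A subject counts as marked if its deprel subtype is "outer" or if
    its MISC column contains Subject=Outer.
    """
    counts = defaultdict(int)
    for line in sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        head, deprel, misc = cols[6], cols[7], cols[9]
        udeprel, _, subtype = deprel.partition(":")
        if not udeprel.endswith("subj"):
            continue
        if subtype == "outer" or "Subject=Outer" in misc.split("|"):
            continue
        counts[head] += 1
    return sorted((h for h, n in counts.items() if n > 1), key=int)


if __name__ == "__main__":
    # Toy sentence, not taken from any treebank.
    rows = [
        ["1", "A", "a", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
        ["2", "B", "b", "NOUN", "_", "_", "3", "nsubj:pass", "_", "_"],
        ["3", "C", "c", "VERB", "_", "_", "0", "root", "_", "_"],
    ]
    text = "\n".join("\t".join(r) for r in rows)
    print(unmarked_multiple_subjects(text))
```

Each head reported this way is a case to check by hand: either one of the subjects is an annotation error, or one of them should receive :outer or Subject=Outer.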
Many of the structures we have for this are quite complicated, in fact, with many levels of embedding and lots of ellipsis. Just guessing the first subject as outer won't do as the embedded clause can often precede the head. It certainly wouldn't produce data that would allow a parser to learn which subjects are outer! I frankly don't think we have enough examples even if we manually checked all of them either, as they vary quite a lot.
Just guessing the first subject as outer won't do
This is not the heuristic I suggested. I suggested that the subject that is closer to the verb is the inner subject. I guess this heuristic may fail as well - this is why I warned about it. It may be especially unreliable if the verb is between the two subjects. There are 28 such cases in the current dev branch of UD_Latin-PROIEL and many of them seem like annotation errors (i.e. one of the subjects should not be labeled so):
cat *.conllu | udapy -TM util.Mark node='
(any(c.udeprel.endswith("subj") for c in node.children(preceding_only=1)) and
any(c.udeprel.endswith("subj") for c in node.children(following_only=1)))' | less -R
It certainly wouldn't produce data that would allow a parser to learn which subjects are outer!
Currently, there are 122 instances of double subjects. Even if some of these are annotation errors (as I noted above), which are going to be fixed in the final conversion, I guess there will still be substantially more than the "one instance of nsubj:outer" mentioned by @dan-zeman as a possible excuse for not using the label. Anyway, there are many rare deprel subtypes in many UD corpora and parsers can handle rare events during training, so I think this is not a big issue.