universaldependencies / docs
Universal Dependencies online documentation
Home Page: http://universaldependencies.org/
License: Apache License 2.0
251d67a documents layered features on http://universaldependencies.github.io/docs/features.html. http://universaldependencies.github.io/docs/format.html, the validate.py script, and test cases should be updated to match.
The relation documentation for English is pulled from the SD manual and is likely not up to date with recent changes made for USD.
I am at
http://universaldependencies.github.io/docs/ud-feat/Gender.html
and I click on "edit page". It takes me to
https://github.com/universaldependencies/docs/edit/pages-source/ud-feat/Gender.md
which does not exist and triggers a 404 error.
The missing bit is probably that the server forgets to translate "ud-feat" to "_ud-feat". I do not know where this is done for the other folders.
The source file "_ud-feat/Gender.md" exists and can be edited off-line, then pushed to GitHub. It is just the on-line editor that does not work.
The bottom of the morphology page (http://universaldependencies.github.io/docs/morphology.html) has a list of Interset features that are not part of the selected subset of universal features (http://universaldependencies.github.io/docs/ud-feat-index.html).
This is probably not the best possible location for this information. @mcdm suggested creating a separate page summarizing language-specific extensions to relations; perhaps we could either 1) create a page for language-specific extensions to features, or 2) create a general language-specific extensions page containing information on both features and relations, and move this material there?
(Assigning @dan-zeman , please reassign or clear if you don't want this!)
It would be really nice if the auto-generated "all" page could auto-include links to the per-relation page (+ possibly also the USD relation documentation). This would be useful for editing these as well as when details are omitted.
The upper middle cell in http://universaldependencies.github.io/docs/relations.html is named "Non-core dependents of clausal predicates". However, some of the relations mentioned in this cell do not depend exclusively on clausal predicates. For instance, consider the nmod relation and its very first example, the Chair 's office.
Some more questions regarding the format spec (http://universaldependencies.github.io/docs/format.html):
[...] [A-Z0-9][a-zA-Z0-9]*. Are dependency relations similarly required to have a particular form (e.g. [a-z]+)?
[...] HEAD = 0 (root relations)?
[...] HEAD = 0 (root relations)? (yes for CoNLL-X)
[...] (root) dependencies occur also in secondary dependencies (DEPS)?
(+extra non-dep question:) May multiword token ranges overlap? Intuitively no, but this doesn't seem to be explicit in the docs. Consider e.g. (nonsense example)
1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _
2-3 haven't _ _ _ _ _ _ _ _
2 have have VERB VB Tens=Pres 0 root _ _
3-4 nota _ _ _ _ _ _ _ _
3 not not ADV RB _ 2 neg _ _
4 a a DET DT _ 5 det _ _
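If overlapping ranges are meant to be disallowed, a check along the following lines could flag them. This is only a sketch: the function name `find_overlapping_ranges` is hypothetical and not part of validate.py, and it assumes the ID column comes first and that range IDs have the form `N-M`.

```python
import re

# Matches a multiword token ID such as "2-3" in the first column.
RANGE_ID = re.compile(r"^(\d+)-(\d+)$")

def find_overlapping_ranges(lines):
    """Return pairs of multiword token ranges that share at least one word ID."""
    ranges = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        match = RANGE_ID.match(line.split()[0])
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    overlaps = []
    for i, (start_a, end_a) in enumerate(ranges):
        for start_b, end_b in ranges[i + 1:]:
            # Two closed intervals overlap when each starts before the other ends.
            if start_a <= end_b and start_b <= end_a:
                overlaps.append(((start_a, end_a), (start_b, end_b)))
    return overlaps

sentence = [
    "1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _",
    "2-3 haven't _ _ _ _ _ _ _ _",
    "2 have have VERB VB Tens=Pres 0 root _ _",
    "3-4 nota _ _ _ _ _ _ _ _",
    "3 not not ADV RB _ 2 neg _ _",
    "4 a a DET DT _ 5 det _ _",
]
print(find_overlapping_ranges(sentence))  # [((2, 3), (3, 4))]
```

For the nonsense example above, the ranges 2-3 and 3-4 are reported because both claim word 3.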
http://universaldependencies.github.io/docs/en-all.html has the literal strings ldots, hspace{3em} (there may be others).
also http://universaldependencies.github.io/docs/en/rel.html is empty.
We need a page summarizing what a language-specific description should contain as a minimum. @jnivre suggested the following:
For the language-specific entry pages, I am not sure how to structure it, but I think the following info should be compulsory:
What else do we need?
Joakim
Revisit http://universaldependencies.github.io/docs/format.html and check feature names and values once (initial version of) features finalized.
(Suggestion from @dan-zeman .)
On http://universaldependencies.github.io/docs/usd/mwe.html, the first example is
I like dogs as well as cats
mwe(well, as)
but should perhaps be
I like dogs as well as cats
mwe(well, as-4)
mwe(as-6, well)
We need a survey of existing corpora w.r.t. the words "yes" and "no" (the latter as a response, not as in "we have no bananas"). How are they annotated there? Do we want to tag them PART, or INTJ?
http://universaldependencies.github.io/docs/ud-dep/remnant.html currently has a few examples such as
<div class="sd-parse">
Marie went to Paris and Miriam went to Prague
nsubj(went-2, Marie-1)
root(root-0, went-2)
[...]
</div>
The parser gives
failed to find token: "root(root-0, went-2)"
for these
The visualization system supports two equivalent ways to write examples, one using Markdown-like block syntax (~~~) and the other HTML (<div class="...">). For example, the first visualization here http://universaldependencies.github.io/docs/embedsd.html can be equivalently written either as
~~~ sdparse
Dogs run
nsubj(run, Dogs)
~~~
or
<div class="sd-parse">
Dogs run
nsubj(run, Dogs)
</div>
(the hyphen in sdparse in the second form only is a historical accident.)
The documentation currently contains a mix of both forms, which is potentially confusing for both authors and readers. It would be better to standardize on just one.
IMHO, the Markdown block (~~~) syntax is not only more compact but also easier to read and write, as well as more consistent with the overall preference for Markdown over HTML in the documentation.
On the other hand, the HTML (<div class="...">) syntax does have the benefit of being more readily recognized by contributors who are familiar with basic web technologies.
On balance, I'd like to propose to use the Markdown block syntax consistently. If this is an acceptable choice, I'd be happy to write a script to implement this change globally in the documentation.
(Pre-empting one potential objection: attributes can also be specified for a Markdown block: http://kramdown.gettalong.org/syntax.html#attribute-list-definitions)
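A conversion script of the kind proposed might look roughly like this. It is a minimal sketch, not the script that was actually used: `div_to_fenced` is a hypothetical name, and the pattern assumes the opening tag is exactly `<div class="sd-parse">` with no further attributes and no nested divs.

```python
import re

# Matches <div class="sd-parse"> ... </div> blocks.
# Assumption: no extra attributes on the tag and no nested <div> elements.
SD_DIV = re.compile(r'<div class="sd-parse">\s*\n(.*?)\n\s*</div>', re.DOTALL)

def div_to_fenced(markdown_text):
    """Rewrite HTML-style sd-parse blocks as ~~~ sdparse fenced blocks."""
    return SD_DIV.sub(
        lambda m: "~~~ sdparse\n" + m.group(1) + "\n~~~",
        markdown_text,
    )

source = '<div class="sd-parse">\nDogs run\nnsubj(run, Dogs)\n</div>'
print(div_to_fenced(source))
```

Blocks that carry additional attributes would be left untouched by this pattern and would need to be handled separately, for example via kramdown attribute lists.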
nmod appears in the index table http://universaldependencies.github.io/docs/en-index.html but the linked page http://universaldependencies.github.io/docs/en/nmod.html is missing (404). There is no _en/nmod.md source document in the repository.
@mcdm : is the issue with the table or the documentation?
When there are several trees on a page, the mouse hover highlights tokens in the same position also in other trees, although it looks like not all of them.
suggestion from @manning :
is there a way to get additional content to propagate
over to the single document version? E.g., at present
the USD relation table doesn’t appear in the single
document version.
to replicate:
cp test-cases/valid/tanl.conll dos-newlines.conll
unix2dos dos-newlines.conll
python validate.py < dos-newlines.conll
result:
[...]
File "validate.py", line 201, in proj
proj(dependent,s,deps)
File "validate.py", line 201, in proj
proj(dependent,s,deps)
RuntimeError: maximum recursion depth exceeded
expected: either accept or reject the file w/o crashing.
related issue: I didn't find any comment in the format document on Unix (LF) vs. DOS (CR/LF) vs. other newline conventions.
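Whichever convention the format ends up requiring, a validator could classify the line endings before parsing and then either normalize or reject the input. The sketch below illustrates the idea; `detect_newline_style` is a hypothetical helper and is not taken from validate.py.

```python
def detect_newline_style(data: bytes) -> str:
    """Classify line endings in raw file content.

    Returns 'LF', 'CRLF', 'CR', 'mixed', or 'none' (no line endings at all).
    """
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # bare LF only
    cr = data.count(b"\r") - crlf   # bare CR only
    styles = [name for name, n in (("LF", lf), ("CRLF", crlf), ("CR", cr)) if n]
    if not styles:
        return "none"
    return styles[0] if len(styles) == 1 else "mixed"

print(detect_newline_style(b"1\tvalid\n2\t.\n"))      # LF
print(detect_newline_style(b"1\tvalid\r\n2\t.\r\n"))  # CRLF
```

Reading the input in binary mode is what makes the distinction visible; text-mode reading may translate line endings before the check runs.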
To replicate, for https://github.com/universaldependencies/tools,
git pull
$ cat test-cases/nonvalid/duplicate-id.conll
# not valid: IDs must be sequential integers (1, 2, ...)
1 valid valid NOUN SP _ 0 ROOT _ _
1 . . . FS _ 1 p _ _
$ cat test-cases/nonvalid/duplicate-id.conll | python validate.py
[...]
File "validate.py", line 211, in proj
proj(dependent,s,deps)
RuntimeError: maximum recursion depth exceeded
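One way to keep malformed input from reaching the recursive step is to check the ID sequence before any tree processing. The following is a sketch under that assumption; `check_sequential_ids` is a hypothetical function, not the fix applied in validate.py.

```python
def check_sequential_ids(ids):
    """Return error messages for word IDs that are not 1, 2, 3, ... in order."""
    errors = []
    for expected, actual in enumerate(ids, start=1):
        if actual != expected:
            errors.append("expected ID %d, found %d" % (expected, actual))
    return errors

# IDs from test-cases/nonvalid/duplicate-id.conll as shown above
print(check_sequential_ids([1, 1]))  # ['expected ID 2, found 1']
```

If the list of errors is non-empty, the sentence can be reported as invalid and skipped rather than passed on to tree validation.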
https://github.com/UniversalDependencies/docs/blob/pages-source/_layouts/base.html#L22
has the overly complex and brittle part
[...] page.url | replace: '/en-dep/', '/_en-dep/' [...]
Use page.path instead of page.url to remove the need to replace here.
Multiple instances can be seen on http://universaldependencies.github.io/docs/fi-dep-all.html (e.g. http://universaldependencies.github.io/docs/fi-dep/cc.html)
Copying from e-mails:
@dan-zeman: I am quite okay with "yes" and "no" being interjections rather than particles. Either way would be a bit arbitrary. It is also clear that ambiguous usages should be separated ("no" in "no way", or in other languages where the same word translates as English responsive "no" or functional "not"). What about usages such as "I am waiting for his "yes" on the matter." Still interjection, or a noun?
@jnivre: The “waiting for his ‘yes’ example” is not peculiar to interjections, but is the general problem of what to do with words that are mentioned rather than used. Is ‘precede’ a verb or a noun in “He pronounced ‘precede’ in a funny way”?
We need to go through the documentation of universal relations and minimally make sure that all the text and examples are compatible with the general principles. Ideally, we should also add more information for relations to which some general principle applies (for example, explain how to treat multiple auxiliaries under "aux" and "auxpass"). Whoever does this may want to address issue #50 (avoid phrase structure language) at the same time.
Yoav says: Maybe we should consider some mechanism for distinguishing phrase level vs. word level modification.
Context: Constituent trees can naturally express distinctions like
(almost (at (my house)))
vs.
(at ((almost my) house))
while in dependency trees this distinction could get lost (disregarding word order, which may be different in other languages anyway).
Another example is a shared modifier of coordination:
(Peter ((bought and ate) (an apple))) ... Peter and apple are arguments of both verbs, i.e. the whole coordination. In dependencies, they will look as if they modify only the first conjunct, i.e. "bought".
(((Peter bought) and (Mary ate)) (an apple))
((Peter (bought (an apple))) and (Mary (ate (a pear))))
No milestone set—we should think about this for the next version.
suggestion from @manning :
it would also be useful to have sections that presented
particular linguistic constructions. The existing manual
section on copulas is an example of this, but I imagine
others ranging from linguistic topics (tough movement,
correlative comparatives, …) to practical topics
(address blocks, itemized lists, …).
(More format documentation nitpicking, thought I'd avoid spamming everybody and try issues instead.)
A few questions re: http://universaldependencies.github.io/docs/format.html#morphological-annotation:
[...] = and , are disallowed to keep the syntax unambiguous, but is e.g. non-ASCII alphabetical OK? Or just [a-zA-Z]?
[...] case feature. Is then case < Def (intuitively correct) or case > Def (ASCII order)? (Naive implementations would tend to produce the latter.)
@fginter , @jnivre : clarifications would be much appreciated!
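The difference between the two orderings is easy to reproduce. In this sketch the feature strings are made-up illustrations and `sort_features` is a hypothetical helper; Python's default string sort compares code points, so uppercase letters sort before lowercase ones.

```python
def sort_features(features, ignore_case=True):
    """Sort feature strings, optionally ignoring letter case."""
    key = str.lower if ignore_case else None
    return sorted(features, key=key)

features = ["case=Nom", "Def=Ind"]  # illustrative values only
print(sort_features(features, ignore_case=False))  # ['Def=Ind', 'case=Nom']
print(sort_features(features, ignore_case=True))   # ['case=Nom', 'Def=Ind']
```

The first call gives the order a naive implementation would produce; the second gives the intuitive alphabetical order.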
To replicate:
~~~ sdparse
extra space
dep(extra, space)
~~~
Adding an extra space character to the end of the "extra space" line causes the SD parser to read the whole entry as text (no dependencies), producing a visualization where the text is "extra space dep(extra, space)".
Some feature names, originally taken from Interset, may seem too long. In general, I do not favor extremely short names because longer names are more self-explanatory. However, I would not mind shortening two names: Definiteness and Negativeness. By removing the ness part, we would get Definite and Negative, which is probably understandable enough. Any opinions?
Many links between pages in a single collection are broken in merged (single-page) documents (see e.g. http://universaldependencies.github.io/docs/ud-pos-all.html).
Documents created as automatic merges of pages in particular collections, such as http://universaldependencies.github.io/docs/ud-pos-all.html, are currently found in the documentation root directory (docs/), while the individual documents are found in the collection-specific subdirectory (e.g. docs/ud-pos/). Consequently, relative links that work between the individual collection documents (e.g. <a href="DET">DET</a>) are broken in the merged document.
Possible ways to resolve this:
[...] (href="http://universaldependencies.github.io/docs/ud-pos/DET.html")
[...] {{ relative }} in links (e.g. href="{{ relative }}ud-pos/DET.html")
[...] ([DET]()) and update the code to adjust accordingly
(Related to #16, but distinct.)
Parts of the documentation make frequent use of phrase structure ideas and terminology in definitions. For example (http://universaldependencies.github.io/docs/u/dep/relcl.html):
A relative clause modifier of an NP is a relative clause modifying the NP.
The relation points from the head noun of the NP to the head of the relative
clause, normally a verb.
It would be better to reduce such usage, as the (exact) definitions of NP, VP, etc. are neither found in the documentation nor always obvious, not all languages intended to be covered by the UD documentation have a broadly accepted standard for phrase structure analysis, and (IMHO) it would be preferable if the dependency analyses could be defined without first defining or assuming a phrase structure analysis.
(Phrase structure terminology is particularly common in the English documentation, suggesting it is at least in part simply left over from the old SD documentation, where the scheme was defined as a conversion from a phrase structure analysis.)
I often use HTML to create hyperlinks, partly because I like it and partly because some types of links are currently not supported in the []() syntax (see #40): <a href="../ud-pos/ADJ.html">adjectives</a>.
Occasionally I also use the []() syntax, as in [ud-dep/case]().
Now I realized that the results differ in style: the HTML-defined link is underlined while the []()-defined link is not. (For example, see the first three paragraphs in http://universaldependencies.github.io/docs/ud-feat/Case.html) Is this intentional?
The docs.git repository cannot be fully cloned in Microsoft Windows because, for silly historical reasons, this system disallows files with certain three-letter names, including "aux" (regardless of case and extension). The files affected in this repository are
_en-dep/aux.md
_fi-dep/aux.md
_ud-dep/aux.md
_ud-pos/AUX.md
Any chance to rename these so that the repository gets more portable?
Thanks
Dan
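To find affected files in a checkout, a check like the one below could be run over the repository's file list. It is only a sketch: `is_reserved_on_windows` is a hypothetical helper, and the set covers the commonly documented reserved device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9).

```python
# Device names reserved by Windows, regardless of case or extension.
RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    "%s%d" % (prefix, n) for prefix in ("COM", "LPT") for n in range(1, 10)
}

def is_reserved_on_windows(path):
    """Return True if the final path component is a reserved device name."""
    base = path.replace("\\", "/").rsplit("/", 1)[-1]  # file name only
    stem = base.split(".", 1)[0]                       # drop any extension
    return stem.upper() in RESERVED

for name in ["_en-dep/aux.md", "_ud-pos/AUX.md", "_ud-pos/NOUN.md"]:
    print(name, is_reserved_on_windows(name))
```

Only the part of the file name before the first dot matters, so a name such as auxpass.md is not affected.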
see http://forum.jquery.com/topic/jquery-ui-tabs-widget-flickers-hidden-divs-on-load for a candidate solution.
For the input
<div class="sd-parse" tabs="yes">
Go to the righ- to the left .
reparandum(left-7, righ--4)
[...]
</div>
jekyll replaces the double-dash in righ--4 with an mdash, which causes the embedded visualization to fail.
General case: jekyll processing should be off in all visualization divs.
As of 7dacf61, mousing over span and relation annotations produces a basic information popup with type and (for spans) marked text.
This information is entirely redundant with that already shown in the visualization.
The info popup is likely to be useful for displaying feature values, but should probably not be shown at all in cases where no additional information (wrt. base visualization) is available.
(comments welcome!)
from @dan-zeman :
"""
can I link to a label using a text other than the label itself? For example, how would I rewrite the following in your syntax?
<a href="../ud-pos/NOUN.html">nouns</a>
"""
this should be supported.
could use a jekyll conditional block like
{% if page.merged != true %}
[details here]
{% endif %}
on each page with details, but some more elegant solution would be preferred.
For https://github.com/universaldependencies/tools:
git pull
$ cat test-cases/nonvalid/empty-head.conll
# not valid: HEAD is empty
1 have have VERB VB Tens=Pres root _ _
$ python validate.py --no-lists < test-cases/nonvalid/empty-head.conll
[Line 2]: Empty value in column HEAD
Traceback (most recent call last):
File "validate.py", line 302, in <module>
validate(inp,out,args,tagsets)
File "validate.py", line 239, in validate
validate_tree(tree)
File "validate.py", line 223, in validate_tree
deps.setdefault(int(cols[HEAD]),set()).add(int(cols[ID]))
ValueError: invalid literal for int() with base 10: ''
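The traceback shows that the empty HEAD is reported but validation then continues into validate_tree, where int() fails on the empty string. A guard along these lines would let the caller skip tree validation for such lines. This is a sketch only; `parse_head` is a hypothetical helper, not code from validate.py.

```python
def parse_head(value):
    """Return HEAD as a non-negative int, or None if it cannot be parsed."""
    try:
        head = int(value)
    except ValueError:
        return None  # empty string or non-numeric content
    return head if head >= 0 else None

print(parse_head(""))   # None
print(parse_head("0"))  # 0
```

A caller could then report the error once and skip the sentence when any HEAD comes back as None, instead of crashing.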
Sampo, is it possible for brat to support interlinear glossing, as is standard in linguistics for texts in different languages, or for cases where you just want to give more information about the morphology, etc. (http://en.wikipedia.org/wiki/Interlinear_gloss)? I think that would be very useful for giving examples in different languages.
The merged docs http://universaldependencies.github.io/docs/ud-dep-all.html and http://universaldependencies.github.io/docs/fi-dep-all.html contain the same relation tables as the corresponding index docs (since #12). However, the links to the relation documentation remain to the per-relation docs even in the merged doc. The merged document links should be document-internal instead.
The cross-reference "#"-character-to-number resolution feature only works page-internally. Fix will likely require resolution during html generation (i.e. in jekyll).
The phrase "nondeterministic word segmentation" (http://universaldependencies.github.io/docs/format.html) may be misleading, as the running examples (e.g. dámelo = da me lo, au = à le) are AFAIK systematically split this way. Similarly for "extraction of words [...] is nondeterministic" (http://universaldependencies.github.io/docs/tokenization.html). Consider rephrasing?
As brat still doesn't have right-to-left support (nlplab/brat#774, nlplab/brat#1057), the visualization can't do e.g. Hebrew. Either true RTL support or some reasonable workaround is needed.