universaldependencies / docs

Universal Dependencies online documentation

Home Page: http://universaldependencies.org/

License: Apache License 2.0

HTML 84.01% Python 1.00% CSS 3.03% JavaScript 11.23% Shell 0.34% Ruby 0.03% Perl 0.37%

docs's Issues

Layered features: format documentation, validator support and test cases

251d67a documents layered features on http://universaldependencies.github.io/docs/features.html. http://universaldependencies.github.io/docs/format.html, the validate.py script, and test cases should be updated to match.

update English relation documentation

The relation documentation for English is pulled from the SD manual and likely not to be up to date wrt recent changes made for USD.

Pages about features cannot be edited on-line

I am at
http://universaldependencies.github.io/docs/ud-feat/Gender.html
and I click on "edit page". It takes me to
https://github.com/universaldependencies/docs/edit/pages-source/ud-feat/Gender.md
which does not exist and triggers a 404 error.

The missing bit is probably that the server forgets to translate "ud-feat" to "_ud-feat". I do not know where this is done for the other folders.

The source file "_ud-feat/Gender.md" exists and can be edited off-line, then pushed to GitHub. It is just the on-line editor that does not work.

Move "Interset features that are not part of this standard"?

The bottom of the morphology page (http://universaldependencies.github.io/docs/morphology.html) has a list of Interset features that are not part of the selected subset of universal features (http://universaldependencies.github.io/docs/ud-feat-index.html).

This is probably not the best possible location for this information. @mcdm suggested creating a separate page summarizing language-specific extensions to relations; perhaps we could either 1) create a page for language-specific extensions to features, or 2) create a general language-specific extensions page containing information on both features and relations, and move this material there?

(Assigning @dan-zeman , please reassign or clear if you don't want this!)

Link to relation page in the auto-generated "all" page

It would be really nice if the auto-generated "all" page could auto-include links to the per-relation page (+ possibly also the USD relation documentation). This would be useful for editing these as well as when details are omitted.

Bad grouping of relations in relations.html?

The upper middle cell in http://universaldependencies.github.io/docs/relations.html is named "Non-core dependents of clausal predicates". However, some of the relations listed in this cell do not depend exclusively on clausal predicates. One example is the nmod relation, whose very first example is the Chair 's office.

format documentation: some (mostly) HEAD and DEPS questions

Some more questions regarding the format spec (http://universaldependencies.github.io/docs/format.html):

It was decided (#33) that feature names and values must have the form [A-Z0-9][a-zA-Z0-9]*. Are dependency relations similarly required to have a particular form (e.g. [a-z]+)?
May a CoNLL-U sentence have no words with HEAD = 0 (root relations)?
May a CoNLL-U sentence have more than one word with HEAD = 0 (root relations)? (yes for CoNLL-X)
May head 0 (root) dependencies occur also in secondary dependencies (DEPS)?
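
A minimal sketch of how the first three questions could be checked mechanically, assuming the ten-column tab-separated layout described on the format page. The feature pattern is the one from #33; the names `valid_feature_name` and `count_roots` are hypothetical and are not taken from validate.py.

```python
import re

# Pattern for feature names and values as decided in #33.
FEATURE_RE = re.compile(r"[A-Z0-9][a-zA-Z0-9]*")

HEAD = 6  # zero-based index of the HEAD column in a ten-column line


def valid_feature_name(name):
    """Return True if the whole string matches the #33 pattern."""
    return FEATURE_RE.fullmatch(name) is not None


def count_roots(sentence_lines):
    """Count word lines whose HEAD column is 0.

    Comment lines and multiword token ranges (IDs such as 2-3) are skipped.
    """
    roots = 0
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0]:
            continue
        if cols[HEAD] == "0":
            roots += 1
    return roots
```

Whether a count of zero or of more than one should be rejected is exactly what the questions above leave open; the sketch only reports the number.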

(+extra non-dep question:) May multiword token ranges overlap? Intuitively no, but this doesn't seem to be explicit in the docs. Consider e.g. (nonsense example)

1   I   I   PRON    PRN Num=Sing|Per=1  2   nsubj   _   _
2-3 haven't _   _   _   _   _   _   _   _
2   have    have    VERB    VB  Tens=Pres   0   root    _   _
3-4 nota    _   _   _   _   _   _   _   _
3   not not ADV RB  _   2   neg _   _
4   a   a   DET DT  _   5   det _   _

`~~~ sdparse` visualizations get different margins than `<div class="sd-parse">` ones

fix conversion artifacts in English documentation

http://universaldependencies.github.io/docs/en-all.html has the literal strings ldots, hspace{3em} (there may be others).

also http://universaldependencies.github.io/docs/en/rel.html is empty.

Guidelines for language-specific documentation

We need a page summarizing what a language-specific description should contain as a minimum. @jnivre suggested the following:

For the language-specific entry pages, I am not sure how to structure it, but I think the following info should be compulsory:

A description of how words and tokens are defined (including whether range tokens are used, etc.)
A description of the morphological annotation, including the use of universal tags, universal features used, and language-specific features if any.
A description of the syntactic annotation, including the use of universal dependencies, and language-specific relations if any.
A list of known discrepancies with the universal guidelines (Ryan’s language-specific diff).

What else do we need?

Joakim

restore and fix sentence sequence numbering

follow-up to #3 and #4.

examples in format.md should contain correct feature names and values

Revisit http://universaldependencies.github.io/docs/format.html and check feature names and values once (the initial version of) the feature inventory is finalized.

(Suggestion from @dan-zeman .)

"as well as" missing a mwe dep (usd/mwe)?

On http://universaldependencies.github.io/docs/usd/mwe.html, the first example is

I like dogs as well as cats
mwe(well, as)

but should perhaps be

I like dogs as well as cats
mwe(well, as-4)
mwe(as-6, well)

yes / no

We need a survey of existing corpora w.r.t. the words "yes" and "no" (the latter as a response, not as in "we have no bananas"). How are they annotated there? Do we want to tag them PART, or INTJ?

Allow `root(root-0, ...)` in SD?

http://universaldependencies.github.io/docs/ud-dep/remnant.html currently has a few examples such as

<div class="sd-parse">
Marie went to Paris and Miriam went to Prague
nsubj(went-2, Marie-1)
root(root-0, went-2)
[...]
</div>

The parser gives

failed to find token: "root(root-0, went-2)"

for these

Standardize on `~~~ sdparse` syntax (instead of `<div class="sd-parse">`)?

The visualization system supports two equivalent ways to write examples, one using Markdown-like block syntax (~~~) and the other HTML (<div class="...">). For example, the first visualization here http://universaldependencies.github.io/docs/embedsd.html can be equivalently written either as

~~~ sdparse
Dogs run
nsubj(run, Dogs)
~~~

or

<div class="sd-parse">
Dogs run
nsubj(run, Dogs)
</div>

(The hyphen in sd-parse, which appears only in the second form, is a historical accident.)

The documentation currently contains a mix of both forms, which is potentially confusing for both authors and readers. It would be better to standardize on just one.

IMHO, the Markdown block (~~~) syntax is not only more compact but also easier to read and write as well as more consistent with the overall preference for Markdown over HTML in the documentation.

On the other hand, the HTML (<div class="...">) syntax does have the benefit of being more readily recognized by contributors who are familiar with basic web technologies.

On balance, I'd like to propose using the Markdown block syntax consistently. If this is an acceptable choice, I'd be happy to write a script to implement this change globally in the documentation.

(Pre-empting one potential objection: attributes can also be specified for a Markdown block: http://kramdown.gettalong.org/syntax.html#attribute-list-definitions)
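
A rough sketch of what such a conversion script might look like. It assumes each div opens with exactly `<div class="sd-parse">` on its own line and closes with `</div>` on its own line; divs carrying extra attributes are left untouched and would need separate handling. The name `div_to_block` is made up for illustration.

```python
import re

# Matches only the plain form <div class="sd-parse"> ... </div>.
# Divs with additional attributes do not match and are left as they are.
DIV_RE = re.compile(r'<div class="sd-parse">\n(.*?)\n</div>', re.DOTALL)


def div_to_block(markdown_text):
    """Rewrite plain sd-parse divs as ~~~ sdparse blocks."""
    return DIV_RE.sub(
        lambda m: "~~~ sdparse\n" + m.group(1) + "\n~~~",
        markdown_text,
    )
```

Applied to the second form of the example above, this yields the first form.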

nmod appears in English relation type table but lacks documentation

nmod appears in the index table http://universaldependencies.github.io/docs/en-index.html but the linked page http://universaldependencies.github.io/docs/en/nmod.html is missing (404). There is no _en/nmod.md source document in the repository.

@mcdm : is the issue with the table or the documentation?

Feature specification in SD visualizations

Suggestion from @yoavg and @jnivre : support features in SD visualizations using the format

~~~ sdparse
Word1/POS1[Feat1=Val1|Feat2=Val2] Word2/POS2[...]
[ parse goes here ]
~~~
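
To make the proposed token format concrete, here is a sketch of how a single whitespace-separated token could be split into word, POS and features. The regular expression and the function `parse_token` are illustrative assumptions, not the visualization system's actual parser.

```python
import re

# word, optionally followed by /POS, optionally followed by [Feat=Val|...]
TOKEN_RE = re.compile(
    r"^(?P<word>.+?)(?:/(?P<pos>[^/\[\]]+))?(?:\[(?P<feats>[^\[\]]*)\])?$"
)


def parse_token(token):
    """Split 'Word/POS[Feat1=Val1|Feat2=Val2]' into (word, pos, feats).

    pos is None and feats is empty when the corresponding part is absent.
    A feature without '=' raises ValueError in this sketch.
    """
    match = TOKEN_RE.match(token)
    feats = {}
    if match.group("feats"):
        for pair in match.group("feats").split("|"):
            name, value = pair.split("=", 1)
            feats[name] = value
    return match.group("word"), match.group("pos"), feats
```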

Mouse hover highlights tokens/deps in all(?) examples on the page

When there are several trees on a page, the mouse hover highlights tokens in the same position also in other trees, although it looks like not all of them.

relation table in merged document

suggestion from @manning :

is there a way to get additional content to propagate
over to the single document version? E.g., at present
the USD relation table doesn’t appear in the single
document version.

validate.py crashes on DOS newlines

to replicate:

check out https://github.com/UniversalDependencies/tools (master branch)
cp test-cases/valid/tanl.conll dos-newlines.conll
unix2dos dos-newlines.conll
python validate.py < dos-newlines.conll

result:

[...]
File "validate.py", line 201, in proj
proj(dependent,s,deps)
File "validate.py", line 201, in proj
proj(dependent,s,deps)
RuntimeError: maximum recursion depth exceeded

expected: either accept or reject the file w/o crashing.

related issue: I didn't find any comment in the format document on Unix (LF) vs. DOS (CR/LF) vs. other newline conventions.
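
One possible way to make a reader indifferent to the newline convention is to normalize the input before splitting it into lines. The sketch below is not taken from validate.py, and whether stray CR characters are in fact what triggers the recursion is not established here.

```python
def normalize_newlines(raw_text):
    """Convert CR/LF and bare CR line endings to LF."""
    return raw_text.replace("\r\n", "\n").replace("\r", "\n")
```

After normalization, a blank separator line is an empty string again rather than a lone CR, so sentence boundaries can be detected the same way for both conventions.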

validate.py crashes on duplicate ID

To replicate, for https://github.com/universaldependencies/tools,

git pull
$ cat test-cases/nonvalid/duplicate-id.conll 
# not valid: IDs must be sequential integers (1, 2, ...)
1   valid   valid   NOUN    SP  _   0   ROOT    _   _
1   .   .   .   FS  _   1   p   _   _
$ cat test-cases/nonvalid/duplicate-id.conll | python validate.py 
[...]
File "validate.py", line 211, in proj
proj(dependent,s,deps)
RuntimeError: maximum recursion depth exceeded
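
A sketch of a guard that could run before any tree traversal, based on the rule stated in the test case comment (IDs must be sequential integers 1, 2, ...). The function name is hypothetical, and the idea of skipping tree validation when it reports errors is an assumption about how the validator could be structured.

```python
def check_sequential_ids(word_ids):
    """Return error messages for IDs that are not 1, 2, ..., n in order.

    An empty list means the IDs are sequential.
    """
    errors = []
    for expected, actual in enumerate(word_ids, start=1):
        if actual != expected:
            errors.append("expected ID %d, found %d" % (expected, actual))
    return errors
```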

use page.path instead of page.url to link to edit page

https://github.com/UniversalDependencies/docs/blob/pages-source/_layouts/base.html#L22

has the overly complex and brittle part

[...] page.url | replace: '/en-dep/', '/_en-dep/' [...]

use page.path instead of page.url to remove the need to replace here.

Cross-language links in relation documentation

from suggestion by @fginter in #9:

we might want to cross-reference to USD as well anyway.
So for a Finnish cc: there could be a [FI] and [USD] link...?

SD parser fails on commas

Multiple instances can be seen on http://universaldependencies.github.io/docs/fi-dep-all.html (e.g. http://universaldependencies.github.io/docs/fi-dep/cc.html)

What to do with words that are mentioned rather than used?

Copying from e-mails:

@dan-zeman: I am quite okay with "yes" and "no" being interjections rather than particles. Either way would be a bit arbitrary. It is also clear that ambiguous usages should be separated ("no" in "no way", or in other languages where the same word translates as English responsive "no" or functional "not"). What about usages such as "I am waiting for his "yes" on the matter." Still interjection, or a noun?

@jnivre: The “waiting for his ‘yes’ example” is not peculiar to interjections, but is the general problem of what to do with words that are mentioned rather than used. Is ‘precede’ a verb or a noun in “He pronounced ‘precede’ in a funny way”?

Update documentation of USD relations in the light of new general principles

We need to go through the documentation of universal relations and minimally make sure that all the text and examples are compatible with the general principles. Ideally, we should also add more information for relations to which some general principle applies (for example, explain how to treat multiple auxiliaries under "aux" and "auxpass"). Whoever does this may want to address issue #50 (avoid phrase structure language) at the same time.

Phrase-level vs. word-level modification

Yoav says: Maybe we should consider some mechanism for distinguishing phrase level vs. word level modification.

Context: Constituent trees can naturally express distinctions like
(almost (at (my house)))
vs.
(at ((almost my) house))
while in dependency trees this distinction could get lost (disregarding word order, which may be different in other languages anyway).

Another example is a shared modifier of coordination:
(Peter ((bought and ate) (an apple))) ... Peter and apple are arguments of both verbs, i.e. the whole coordination. In dependencies, they will look as if they modify only the first conjunct, i.e. "bought".
(((Peter bought) and (Mary ate)) (an apple))
((Peter (bought (an apple))) and (Mary (ate (a pear))))

No milestone set—we should think about this for the next version.

sequence number appearing on visualizations is always "1"

illustration:

merged documents presenting particular linguistic constructions

suggestion from @manning :

it would also be useful to have sections that presented
particular linguistic constructions. The existing manual
section on copulas is an example of this, but I imagine
others ranging from linguistic topics (tough movement,
correlative comparatives, …) to practical topics
(address blocks, itemized lists, …).

format documentation: feature documentation details

(More format documentation nitpicking, thought I'd avoid spamming everybody and try issues instead.)

A few questions re: http://universaldependencies.github.io/docs/format.html#morphological-annotation:

What characters are permitted in feature names and values? Presumably at least = and , are disallowed to keep the syntax unambiguous, but is e.g. non-ASCII alphabetical OK? Or just [a-zA-Z]?
What is the precise definition of alphabetical sort? As the documentation doesn't enforce capitalization, resources may define e.g. a case feature. Is then case < Def (intuitively correct) or case > Def (ASCII order)? (Naive implementations would tend to produce the latter.)
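
The difference between the two orderings in the second question can be seen with Python's built-in sort. The feature strings below are made up for illustration; `case` is deliberately lowercase to mirror the hypothetical resource-defined feature mentioned above.

```python
features = ["case=Nom", "Def=Ind", "Number=Sing"]

# A plain sort compares code points, so every uppercase ASCII letter
# precedes every lowercase one.
ascii_order = sorted(features)

# Sorting on a lowercased key ignores case; ties keep their input order.
caseless_order = sorted(features, key=str.lower)
```

Here `ascii_order` puts `case=Nom` last, while `caseless_order` puts it first.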

@fginter , @jnivre : clarifications would be much appreciated!

SD parser fails on sentence-terminal space

To replicate:

~~~ sdparse
extra space 
dep(extra, space)
~~~

Adding an extra space character to the end of the line "extra space" causes the SD parser to read the whole entry as text (no dependencies), producing a visualization where the text is "extra space dep(extra, space)".

complete CoNLL-U support

full parsing and validation
visualization support for all aspects
...

Shorter feature names?

Some feature names, originally taken from Interset, may seem too long. In general, I do not favor extremely short names because longer names are more self-explanatory. However, I would not mind shortening two names: Definiteness and Negativeness. By removing the ness part, we would get Definite and Negative, which is probably understandable enough. Any opinions?

Broken links in merged documents

Many links between pages in a single collection are broken in merged (single-page) documents (see e.g. http://universaldependencies.github.io/docs/ud-pos-all.html).

Documents created as automatic merges of pages in particular collections, such as http://universaldependencies.github.io/docs/ud-pos-all.html, are currently found in the documentation root directory (docs/), while the individual documents are found in the collection-specific subdirectory (e.g. docs/ud-pos/). Consequently, relative links that work between the individual collection documents (e.g. <a href="DET">DET</a>) are broken in the merged document.

Possible ways to resolve this:

Place each merged document in the same directory as the documents it merges
Only use absolute links (e.g. href="http://universaldependencies.github.io/docs/ud-pos/DET.html")
Use the special variable {{ relative }} in links (e.g. href="{{ relative }}ud-pos/DET.html")
Use the auto-linking mechanism (e.g. [DET]()) and update the code to adjust accordingly
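
The breakage can be reproduced with standard relative URL resolution. The sketch below uses Python's `urllib.parse.urljoin` and the relative link `DET` from the example above; `ud-pos/ADJ.html` stands in for any page inside the collection.

```python
from urllib.parse import urljoin

BASE = "http://universaldependencies.github.io/docs/"

# The same relative link, resolved from a page inside the collection
# and from the merged page in the documentation root.
from_collection_page = urljoin(BASE + "ud-pos/ADJ.html", "DET")
from_merged_page = urljoin(BASE + "ud-pos-all.html", "DET")
```

The first resolves under `docs/ud-pos/`, the second directly under `docs/`, which is why the link only works from the individual collection pages.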

(Related to #16, but distinct.)

Avoid phrase structure language

Parts of the documentation make frequent use of phrase structure ideas and terminology in definitions. For example (http://universaldependencies.github.io/docs/u/dep/relcl.html):

A relative clause modifier of an NP is a relative clause modifying the NP.
The relation points from the head noun of the NP to the head of the relative
clause, normally a verb.

It would be better to reduce such usage, for three reasons: the (exact) definitions of NP, VP, etc. are neither found in the documentation nor always obvious; not all languages intended to be covered by the UD documentation have a broadly accepted standard for phrase structure analysis; and (IMHO) it would be preferable if the dependency analyses could be defined without first defining or assuming a phrase structure analysis.

(Phrase structure terminology is particularly common in the English documentation, suggesting it is at least in part simply left over from the old SD documentation, where the scheme was defined as a conversion from a phrase structure analysis.)

Two styles of hyperlinks

I often use HTML to create hyperlinks, partly because I like it and partly because some types of links are currently not supported in the []() syntax (see #40): <a href="../ud-pos/ADJ.html">adjectives</a>.

Occasionally I also use the []() syntax, as in [ud-dep/case]().

Now I realized that the results differ in style: the HTML-defined link is underlined while the []()-defined link is not. (For example, see the first three paragraphs in http://universaldependencies.github.io/docs/ud-feat/Case.html) Is this intentional?

Files named "aux" cause problems

The docs.git repository cannot be fully cloned in Microsoft Windows because, for silly historical reasons, this system disallows files with certain reserved device names, including "aux" (regardless of case and extension). The files affected in this repository are

_en-dep/aux.md
_fi-dep/aux.md
_ud-dep/aux.md
_ud-pos/AUX.md

Any chance to rename these so that the repository gets more portable?

Thanks
Dan
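
A small sketch for spotting such files before they are committed. The set below lists the commonly documented reserved device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9); the function name is hypothetical and paths are assumed to use forward slashes.

```python
# Device names reserved by Windows; the check is case-insensitive
# and ignores everything after the first dot in the file name.
RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    "%s%d" % (prefix, n) for prefix in ("COM", "LPT") for n in range(1, 10)
}


def is_reserved_on_windows(path):
    """Return True if the file name's stem is a reserved device name."""
    file_name = path.rsplit("/", 1)[-1]
    stem = file_name.split(".", 1)[0]
    return stem.upper() in RESERVED
```

All four files listed above are flagged, while a name such as `_ud-dep/auxpass.md` is not.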

flash of unstyled content on index page (tabs)

see http://forum.jquery.com/topic/jquery-ui-tabs-widget-flickers-hidden-divs-on-load for a candidate solution.

jekyll conflicts w/visualizations w/words ending in dash

For the input

<div class="sd-parse" tabs="yes">
Go to the righ- to the left .
reparandum(left-7, righ--4)
[...]
</div>

jekyll replaces the double-dash in righ--4 with an mdash, which causes the embedded visualization to fail.

General case: jekyll processing should be off in all visualization divs.

more sensible information popups

As of 7dacf61, mousing over span and relation annotations produces a basic information popup with type and (for spans) marked text.

This information is entirely redundant with that already shown in the visualization.

The info popup is likely to be useful for displaying feature values, but should probably not be shown at all in cases where no additional information (wrt. base visualization) is available.

(comments welcome!)

Allow different text for auto-link

from @dan-zeman :

"""
can I link to a label using a text other than the label itself? For example, how would I rewrite the following in your syntax?

<a href="../ud-pos/NOUN.html">nouns</a>

"""

this should be supported.

exclude details by default on merged documentation pages

could use a jekyll conditional block like

{% if page.merged != true %}
[details here]
{% endif %}

on each page with details, but some more elegant solution would be preferred.

validate.py crashes on empty HEAD

For https://github.com/universaldependencies/tools:

git pull
$ cat test-cases/nonvalid/empty-head.conll
# not valid: HEAD is empty
1   have    have    VERB    VB  Tens=Pres       root    _   _
$ python validate.py --no-lists < test-cases/nonvalid/empty-head.conll
[Line         2]: Empty value in column HEAD
Traceback (most recent call last):
File "validate.py", line 302, in <module>
validate(inp,out,args,tagsets)
File "validate.py", line 239, in validate
validate_tree(tree)
File "validate.py", line 223, in validate_tree
deps.setdefault(int(cols[HEAD]),set()).add(int(cols[ID]))
ValueError: invalid literal for int() with base 10: ''
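
The validator already reports the empty column before the crash, so the remaining problem is the unguarded `int()` call in the traceback. A defensive conversion could look like the sketch below; `parse_head` is a hypothetical helper, not a function in validate.py.

```python
def parse_head(value):
    """Return HEAD as a non-negative int, or None if it is not one."""
    try:
        head = int(value)
    except ValueError:
        return None
    return head if head >= 0 else None
```

A caller could then skip tree validation for any sentence in which `parse_head` returns None, rather than letting the exception propagate.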

Multi-line glossing

Sampo, is it possible for brat to support interlinear glossing, as is standard in linguistics for texts in different languages, or simply for giving more information about the morphology, etc. (http://en.wikipedia.org/wiki/Interlinear_gloss)? I think that would be very useful for giving examples in different languages.

use document-internal links in merged document

The merged docs http://universaldependencies.github.io/docs/ud-dep-all.html and http://universaldependencies.github.io/docs/fi-dep-all.html contain the same relation tables as the corresponding index docs (since #12). However, the links to the relation documentation remain to the per-relation docs even in the merged doc. The merged document links should be document-internal instead.

The phrase "nondeterministic word segmentation" (http://universaldependencies.github.io/docs/format.html) may be misleading, as the running examples (e.g. dámelo = da me lo, au = à le) are AFAIK systematically split this way. The same goes for "extraction of words [...] is nondeterministic" (http://universaldependencies.github.io/docs/tokenization.html). Consider rephrasing?

RTL support (or workaround)

As brat still doesn't have right-to-left support (nlplab/brat#774, nlplab/brat#1057), the visualization can't do e.g. Hebrew. Either true RTL support or some reasonable workaround is needed.
