emilstenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
License: MIT License
I found a new case where the regular expression for parsing enhanced representations fails in the Arabic training set (see https://github.com/mdelhoneux/conllu/blob/cfc45fb5e52e4fe714472a2002464db4c6876cec/tests/test_parser.py#L477). I have not yet managed to fix this without breaking other tests.
Hi,
Many compliments on this library. It's by far my preferred one for working with conllu files. I especially appreciate that it's free of dependencies.
One issue that I've encountered is in adding a key-value pair whose value is a float or int to the misc field of a token and then serializing it.
My use case is adding token-level analyses (e.g. dependency length) to the misc field of a new conllu object to be serialized.
For example, I might add a new dict to the field as follows:
new_data = {"DL": 2, "Cosine": 0.45}
token["misc"].update(new_data)
Although one can modify the dictionary in the misc field, adding new key-value pairs, serializing it then fails with the following error in serialize.py:
fields = []
for key, value in field.items():
    if value is None:
        value = "_"
    if value == "":
        fields.append(key)
        continue
>   fields.append('='.join((key, value)))
E   TypeError: sequence item 1: expected str instance, int found
The culprit is this: fields.append('='.join((key, value))). This works fine as long as both the key and the value are str type, but it breaks if the value is, for example, an int or float.
In my fork, I have changed this using an f-string:
def serialize_field(field: T.Any) -> str:
    if field is None:
        return '_'

    if isinstance(field, dict):
        if field == {}:
            return '_'

        fields = []
        for key, value in field.items():
            if value is None:
                value = "_"
            if value == "":
                fields.append(key)
                continue
            fields.append(f'{key}={value}')

        return '|'.join(fields)

    if isinstance(field, tuple):
        return "".join([serialize_field(item) for item in field])

    if isinstance(field, list):
        if len(field[0]) != 2:
            raise ParseException("Can't serialize '{}', invalid format".format(field))
        return "|".join([serialize_field(value) + ":" + str(key) for key, value in field])

    return "{}".format(field)
This appears to solve the issue easily, though of course it then means that string representations of arbitrary datatypes could end up in misc too, which may be undesirable. Perhaps checking whether the value's datatype is in (str, int, float) would constrain this?
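That type check can be sketched as a small standalone helper (a hypothetical illustration, not conllu's own code; the function name is made up):

```python
def serialize_misc(misc):
    # Join a MISC dict into CoNLL-U's "key=value|key=value" form,
    # allowing only str, int and float values so that arbitrary
    # objects can't leak into the serialized output.
    parts = []
    for key, value in misc.items():
        if not isinstance(value, (str, int, float)) or isinstance(value, bool):
            raise TypeError(f"unsupported MISC value type: {type(value).__name__}")
        parts.append(f"{key}={value}")
    return "|".join(parts)
```

With this, serialize_misc({"DL": 2, "Cosine": 0.45}) yields "DL=2|Cosine=0.45", while a list or dict value raises TypeError instead of silently stringifying.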
I still observe a strange thing, possibly linked to this workaround: if there is more than one token with head 0 in a conllu sentence and I try to get the upos of the sentence root (the top-level one), I get a maximum recursion error.
For example, the following script reproduces the error:
#!/usr/bin/env python3
import conllu

sent = """1 the the DET _ _ 0 root _ _
2 mouse mouse NOUN _ _ 3 root _ _
3 sleeps sleep VERB _ _ 0 root _ _
"""

for sentence in conllu.parse(sent):
    # print(sentence)
    root = sentence.to_tree()
    root.print_tree()
    print(root.token)
    print(root.token["upos"])
Running it gives the following output and error message:
(deprel:root) form:_ [0] # <-- the toplevel root does not have an UPOS
(deprel:root) form:the lemma:the upos:DET [1]
(deprel:root) form:sleeps lemma:sleep upos:VERB [3]
(deprel:root) form:mouse lemma:mouse upos:NOUN [2]
{'id': 0, 'form': '_', 'deprel': 'root'}
Traceback (most recent call last):
File "./ex.py", line 15, in <module>
print(root.token["upos"])
File "/home/jh/.local/lib/python3.6/site-packages/conllu/models.py", line 31, in __missing__
return self[self.MAPPING[key]]
File "/home/jh/.local/lib/python3.6/site-packages/conllu/models.py", line 31, in __missing__
return self[self.MAPPING[key]]
File "/home/jh/.local/lib/python3.6/site-packages/conllu/models.py", line 31, in __missing__
return self[self.MAPPING[key]]
[Previous line repeated 246 more times]
RecursionError: maximum recursion depth exceeded
Originally posted by @jheinecke in #44 (comment)
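Until the alias loop in __missing__ is fixed, a defensive lookup avoids the crash on the synthetic root token (a sketch; get_field is a hypothetical helper, not part of conllu):

```python
def get_field(token, key, default=None):
    # Use a plain membership test instead of __getitem__, so a missing
    # key (e.g. 'upos' on the synthetic id-0 root token) returns a
    # default rather than bouncing through __missing__'s alias mapping
    # until the recursion limit is hit.
    return token[key] if key in token else default
```

For the root token {'id': 0, 'form': '_', 'deprel': 'root'} above, get_field(root, 'upos') simply returns None.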
Read column 9 and make the enhanced dependency graph available?
First off, thanks for this package. The API on the README looks great. Can't wait to dig in.
I've forked this repo and will add type annotations to every part that I will make use of (most likely all the .py files in the root dir).
If you'd like, I can submit a PR to merge those type annotations back in (and this repo can then be PEP 561 compliant).
There are some issues with this TypeError, as it mistakes a token containing < or > for a tag. For instance, the Danish sentence:
Antag at a og b er to reelle tal og at 0<a<b.
raises the TypeError.
Moreover, what about actual tags like <p> that can be found quoted in several texts?
Do you have plans to support the https://universaldependencies.org/ext-format.html?
The TreeNode is an OrderedDict.
Question: Is there any way to find the shortest path between two TreeNodes?
Thanks!
There are features for handling strangely formatted files. I should document them.
Hello,
I know this is a malformed line, but apparently new versions of Stanza produce this output:
72 sa sa PRON R PronType=Prs|Reflex=Yes 76 expl:pv _ start_char=633673|end_char=633675
73-73 zaň _ _ _ _ _ _ _ start_char=633676|end_char=633679
73 zaň zaň PRON PFms4 Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs 76 obj _
which raises this exception:
File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 40, in parse_tree
return list(parse_tree_incr(StringIO(data)))
File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 43, in parse_tree_incr
for tokenlist in parse_incr(in_file):
File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 32, in parse_incr
yield parse_token_and_metadata(
File "/usr/local/lib/python3.9/site-packages/conllu/parser.py", line 95, in parse_token_and_metadata
tokens.append(parse_line(line, fields, field_parsers))
File "/usr/local/lib/python3.9/site-packages/conllu/parser.py", line 132, in parse_line
raise ParseException("Failed parsing field '{}': ".format(field) + str(e))
conllu.exceptions.ParseException: Failed parsing field 'id': '72-72' is not a valid ID.
Since I have to parse a lot of files, would it be possible to safely skip malformed lines instead of raising an exception (i.e., terminating the script)?
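One workaround, pending library support, is to split the input into sentence blocks yourself and parse each block separately, skipping the ones that fail. A minimal sketch (parse_leniently is a hypothetical helper; parse_fn would typically be conllu.parse):

```python
def parse_leniently(data, parse_fn):
    # Split the raw file into sentence blocks on blank lines and parse
    # each block on its own, counting (rather than crashing on) the
    # blocks that raise a parse error.
    parsed, skipped = [], 0
    for block in data.strip().split("\n\n"):
        try:
            parsed.append(parse_fn(block + "\n"))
        except Exception:
            skipped += 1
    return parsed, skipped
```

The cost is one parse call per sentence instead of one per file, but a single malformed ID no longer terminates the whole run.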
Try this dataset: https://github.com/ufal/rh_nntagging/tree/master/data/ud-1.2/en
I got the error when parsing:
ParseException: Invalid line format, line must contain either tabs or two spaces.
I noticed some bugs in conllu. When I have misc features of the form Singleton|Key=Value, the Singleton field is left out completely, even though this form is not invalid and is something I have seen in CoNLL-U files. I intend to report these as well to see if anything comes out of it.
From Reddit
According to the format specification, comments such as # newdoc should be allowed. However, when I try to parse and serialize the following text:
# newdoc
# sent_id = 1
# text = They buy and sell books.
1 They they PRON PRP Case=Nom|Number=Plur 2 nsubj 2:nsubj|4:nsubj _
2 buy buy VERB VBP Number=Plur|Person=3|Tense=Pres 0 root 0:root _
3 and and CONJ CC _ 4 cc 4:cc _
4 sell sell VERB VBP Number=Plur|Person=3|Tense=Pres 2 conj 0:root|2:conj _
5 books book NOUN NNS Number=Plur 2 obj 2:obj|4:obj SpaceAfter=No
6 . . PUNCT . _ 2 punct 2:punct _
serialization throws the TypeError: must be str, not NoneType exception. Parsing works fine; I'm able to retrieve the metadata for the sentence above:
>>> sent.metadata
OrderedDict([('newdoc', None),
('sent_id', '1'),
('text', 'They buy and sell books.')])
Would it be possible to add support for such unstructured comments?
Hey, I was wondering if your project supports the default CoNLL format and, if not, if you know of a good converter.
I am talking about this format:
mz/sinorama/10/ectb_1034 0 1 , , * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 2 Lee NNP (NP(NP*) - - - - (PERSON) * * (ARG1*) (ARG0* * * * (23
mz/sinorama/10/ectb_1034 0 3 , , * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 4 who WP (SBAR(WHNP*) - - - - * * * (R-ARG1*) * * * * -
mz/sinorama/10/ectb_1034 0 5 has VBZ (S(VP* have 01 - - * (V*) * * * * * * -
mz/sinorama/10/ectb_1034 0 6 been VBN (VP* be 03 - - * * (V*) * * * * * -
mz/sinorama/10/ectb_1034 0 7 dubbed VBN (VP* dub 01 1 - * * * (V*) * * * * -
mz/sinorama/10/ectb_1034 0 8 " `` * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 9 the DT (S(NP* - - - - * * * (ARG2* * * * * -
mz/sinorama/10/ectb_1034 0 10 Asian NNP * - - - - (NORP) * * * * * * * -
mz/sinorama/10/ectb_1034 0 11 Godfather NNP *)))))))) - - - - * * * *) *) * * * 23)
mz/sinorama/10/ectb_1034 0 12 , , * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 13 " '' * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 14 played VBD (VP* play 02 3 - * * * * (V*) * * * -
mz/sinorama/10/ectb_1034 0 15 a DT (NP* - - - - * * * * (ARG1* * * * -
mz/sinorama/10/ectb_1034 0 16 key JJ * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 17 role NN *) role - 1 - * * * * *) * * * -
mz/sinorama/10/ectb_1034 0 18 in IN (PP* - - - - * * * * (ARGM-LOC* * * * -
mz/sinorama/10/ectb_1034 0 19 the DT (NP* - - - - (DATE* * * * * * * * -
mz/sinorama/10/ectb_1034 0 20 early JJ * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 21 1990s NNS *)) - - - - *) * * * *) * * * -
mz/sinorama/10/ectb_1034 0 22 in IN (PP* - - - - * * * * (ARGM-PRP* * * * -
mz/sinorama/10/ectb_1034 0 23 bringing VBG (S(VP* bring 05 5 - * * * * * (V*) * * -
mz/sinorama/10/ectb_1034 0 24 about RP (PRT*) - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 25 the DT (NP(NP* - - - - * * * * * (ARG1* * * -
mz/sinorama/10/ectb_1034 0 26 first JJ * - - - - (ORDINAL) * * * * * (ARGM-TMP*) * -
mz/sinorama/10/ectb_1034 0 27 round NN *) round 01 2 - * * * * * * (V*) * -
mz/sinorama/10/ectb_1034 0 28 of IN (PP* - - - - * * * * * * (ARG1* * -
mz/sinorama/10/ectb_1034 0 29 talks NNS (NP(NP*) talk 01 3 - * * * * * * * (V*) -
mz/sinorama/10/ectb_1034 0 30 between IN (PP* - - - - * * * * * * * (ARG0* -
mz/sinorama/10/ectb_1034 0 31 Taiwan NNP (NP(NP(NP* - - - - (GPE) * * * * * * * (19
mz/sinorama/10/ectb_1034 0 32 's POS *) - - - - * * * * * * * * 19)
mz/sinorama/10/ectb_1034 0 33 Koo NNP * - - - - (PERSON* * * * * * * * -
mz/sinorama/10/ectb_1034 0 34 Chen NNP * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 35 - HYPH * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 36 fu NNP *) - - - - *) * * * * * * * -
mz/sinorama/10/ectb_1034 0 37 and CC * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 38 Beijing NNP (NP(NP* - - - - (GPE) * * * * * * * (4
mz/sinorama/10/ectb_1034 0 39 's POS *) - - - - * * * * * * * * 4)
mz/sinorama/10/ectb_1034 0 40 Wang NNP * - - - - (PERSON* * * * * * * * -
mz/sinorama/10/ectb_1034 0 41 Daohan NNP *)))))))))) - - - - *) * * * *) *) *) *) -
mz/sinorama/10/ectb_1034 0 42 . . *)) - - - - * * * * * * * * -
Additionally, I was wondering if there is a parser that extracts the verb arguments (from the eleventh to the second-to-last column above) in a structured way. Writing a parser from scratch would take a really long time for me.
Hi, I'm just wondering if it is possible to get CoNLL-U format for a given sentence, something like this:
s = 'The quick brown fox jumps over the lazy dog.'
Then something like
to_conllu(s)
to produce
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
I'm not sure whether this is intentional, but it's causing problems for my app since I want to include these nodes as root nodes of the parse tree. Simply doing contents.replace('-1', '0') won't do it, since I only want to include particular nodes based on their upostag field.
The problematic line is this: https://github.com/EmilStenstrom/conllu/blob/master/conllu/parser.py#L83. '-1'.isdigit() returns False.
Thanks for this library!
The check for a blank line:
Line 55 in 70b144b
is not robust. I am working with the data from WNUT 2017 (https://noisy-text.github.io/2017/emerging-rare-entities.html), where blank lines are denoted as \t\n, so it fails. From what I could find, the CoNLL format only specifies a "blank line" as the sentence separator, so I believe the best check would be:
if len(line.strip()) == 0
What are your thoughts on this?
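A minimal sketch of that check:

```python
def is_blank(line):
    # "\t\n", "  \n" and "\n" all strip down to the empty string,
    # so any whitespace-only line counts as a sentence separator.
    return len(line.strip()) == 0
```

This treats the WNUT-style "\t\n" separator the same as a plain empty line, while a real token line (which always contains non-whitespace) is never misclassified.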
For the following example,
# sent_id = reviews-002288-0001
# newpar id = reviews-002288-p0001
# text = It's well cool. :)
1-2 It's _ _ _ _ _ _ _ _
1 It it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 4 nsubj 4:nsubj _
2 's be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop 4:cop _
3 well well ADV RB Degree=Pos 4 advmod 4:advmod _
4 cool cool ADJ JJ Degree=Pos 0 root 0:root SpaceAfter=No
5 . . PUNCT . _ 4 punct 4:punct _
6 :) :) SYM NFP _ 4 discourse 4:discourse _
The parsed result will be
TokenList<It's, It, 's, well, cool, ., :)>
[{'id': (1, '-', 2),
'form': "It's",
'lemma': '_',
'upos': '_',
'xpos': None,
'feats': None,
'head': None,
'deprel': '_',
'deps': None,
'misc': None},
{'id': 1,
'form': 'It',
'lemma': 'it',
'upos': 'PRON',
'xpos': 'PRP',
'feats': {'Case': 'Nom',
'Gender': 'Neut',
'Number': 'Sing',
'Person': '3',
'PronType': 'Prs'},
'head': 4,
'deprel': 'nsubj',
'deps': [('nsubj', 4)],
'misc': None},
]
The parsing has no problem. However, in some cases one may want to use only those tokens whose id is an integer.
How about making TokenList support filtering for tokens with integer ids only? Like this:
sentence.filter(id=lambda x: type(x) is int)
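Without library support, the same filter can be done by hand: multiword ranges parse to tuple ids like (1, '-', 2) and empty nodes to (8, '.', 1), so an int check keeps only plain word tokens. A sketch over plain dicts (integer_id_tokens is a hypothetical helper):

```python
def integer_id_tokens(tokens):
    # Multiword-token ranges and empty nodes carry tuple ids, so
    # keeping only int ids keeps exactly the ordinary word lines.
    return [tok for tok in tokens if isinstance(tok["id"], int)]
```

Applied to the example above, the "It's" range token with id (1, '-', 2) is dropped while tokens 1-6 are kept.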
pyconll is much faster. Using the UD_French-GSD dev data (36824 tokens, found here), it took 0.41 s to load, while conllu took on the order of minutes (over 10), to the point where I had to simply kill the process. For any meaningful amount of data in CL, conllu is not of much use right now.
From Reddit
Given the following sentence:
Removed.
The code at parser.py:152
looks like this:
def parse_paired_list_value(value):
    if fullmatch(MULTI_DEPS_PATTERN, value):
        return [
            (part.split(":", 1)[1], parse_id_value(part.split(":")[0]))
            for part in value.split("|")
        ]
    return parse_nullable_value(value)
It seems like the definition of MULTI_DEPS_PATTERN is not correct. Most deps values are returned as strings instead of lists of tuples, because the parsing fails. For example, '1:compound|6:ARG1|9:ARG1' is considered bad, but according to the specification it should be fine. In fact, the parsing code inside the if-statement works perfectly on this line. '4:ARG1' is also considered flawed, while '5:measure' is considered okay.
More info available here: https://universaldependencies.org/ext-format.html
Hi,
First, thank you for creating this awesome parser!
Users of the Flair library have an issue with the versions released yesterday.
The issue is that conllu.TokenList cannot be imported anymore; imports need to change to conllu.models.TokenList.
As I suppose that not only Flair runs into this issue, I would suggest adding from conllu.models import TokenList to conllu's __init__.py and releasing a hotfix.
In the long term, it would be nice to add an __all__ declaration to specify the public interface, so it would be clear that conllu.TokenList wasn't intended to be used that way.
I recently upgraded from conllu 1.3.1 to 2.2 due to the latter version's ability to deal with elided tokens/copy nodes (e.g. token 8.1 below), which was addressed in #27.
I am parsing the deps column and have a loop which iterates over the deps tuples to put the heads into a heads list and the relations into a relations list. The upgrade now includes the copy nodes, which is good, but now all 0:root labels are returned as a string rather than a tuple, which breaks my loop.
# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0022
# text = Over 300 Iraqis are reported dead and 500 wounded in Fallujah alone.
1 Over over ADV RB _ 2 advmod 2:advmod _
2 300 300 NUM CD NumType=Card 3 nummod 3:nummod _
3 Iraqis Iraqis PROPN NNPS Number=Plur 5 nsubj:pass 5:nsubj:pass|6:nsubj:xsubj|8:nsubj:pass _
4 are be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 5 aux:pass 5:aux:pass _
5 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root _
6 dead dead ADJ JJ Degree=Pos 5 xcomp 5:xcomp _
7 and and CCONJ CC _ 8 cc 8:cc|8.1:cc _
8 500 500 NUM CD NumType=Card 5 conj 5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj _
8.1 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass _ _ 5:conj:and CopyOf=5
9 wounded wounded ADJ JJ Degree=Pos 8 orphan 8.1:xcomp _
10 in in ADP IN _ 11 case 11:case _
11 Fallujah Fallujah PROPN NNP Number=Sing 5 obl 5:obl:in _
12 alone alone ADV RB _ 11 advmod 11:advmod SpaceAfter=No
13 . . PUNCT . _ 5 punct 5:punct _
I'm just wondering: is this the desired behaviour? E.g. the output of deps looks like:
deps [[('advmod', 2)], [('nummod', 3)], [('nsubj:pass', 5), ('nsubj:xsubj', 6), ('nsubj:pass', 8)], [('aux:pass', 5)], '0:root', [('xcomp', 5)], [('cc', 8), ('cc', (8, '.', 1))], [('conj:and', 5), ('nsubj:pass', (8, '.', 1)), ('nsubj:xsubj', 9)], [('conj:and', 5)], [('xcomp', (8, '.', 1))], [('case', 11)], [('obl:in', 5)], [('advmod', 11)], [('punct', 5)]]
Is there any particular reason why '0:root' shouldn't be [('root', 0)]?
Thanks!
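Until the inconsistency is fixed, callers can normalize on their side. A sketch (normalize_deps is a hypothetical helper; it handles only plain integer heads, not copy-node ids like 8.1):

```python
def normalize_deps(deps):
    # If the library handed back the raw string (e.g. '0:root'),
    # split it into (relation, head) tuples; already-parsed lists
    # pass through unchanged.
    if isinstance(deps, str):
        pairs = []
        for part in deps.split("|"):
            head, _, relation = part.partition(":")
            pairs.append((relation, int(head)))
        return pairs
    return deps
```

With this, normalize_deps('0:root') gives [('root', 0)], matching the shape of the other deps entries, so a single loop over heads and relations works for every token.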
Hey. I have found this library quite useful. Here is a suggestion I ran into: TokenList could maybe inherit from list, and so be treated like one.
I tried manipulating the sentences (TokenLists) as if they were lists, such as using len, but then I saw they weren't. I thought about adding __len__, but I think it's more appropriate to make them lists, so they natively support other things such as concatenation, comparison, and so on.
What do you think?
When I use parse_tree() to get the relationship between terms, it does not show the same format as in your example; instead, the output is on one line. Can you help me get the same format as in your example of parse_tree()? Thanks.
Is it possible to modify a parsed CoNLL-U sentence? I tried to add a new token, exploiting the fact that a sentence is a token list:
import conllu
data = """# sent_id = 1
1 the the DET _ _ 2 det _ _
2 cat cat NOUN _ _ 3 nsubj _ _
3 sleeps sleep VERB _ _ 0 root _ _
"""
sentences = conllu.parse(data)
print(sentences[0].serialize())

newtoken = conllu.models.Token({"id": 4, "form": "well", "head": 3, "deprel": "advmod"})
sentences[0].append(newtoken)
print(sentences[0].serialize())  # ERROR: token 4 incorrectly serialized
Maybe this is not the best practice? (I haven't found any information, though.)
But doing it this way, the final output is incorrect, since my new token 4 was initialized only with id, form, head and deprel:
# sent_id = 1
1 the the DET _ _ 2 det _ _
2 cat cat NOUN _ _ 3 nsubj _ _
3 sleeps sleep VERB _ _ 0 root _ _
4 well 3 advmod
Shouldn't Token() initialize all fields with "_" to ensure a correct serialisation?
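Until then, a helper can fill in the defaults before appending. A sketch (make_token is a hypothetical helper; the field tuple lists the ten standard CoNLL-U columns):

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

def make_token(**values):
    # Start from all ten CoNLL-U columns set to "_", then overlay
    # whatever the caller supplied, so serialization never sees a
    # missing column.
    token = {field: "_" for field in FIELDS}
    token.update(values)
    return token
```

make_token(id=4, form="well", head=3, deprel="advmod") then carries "_" for lemma, upos, xpos, feats, deps and misc, so the serialized line has all ten columns.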
Lots of files have strange formats in their metadata. Adding a way to customize that parsing would likely help.
Hi,
Maybe adding a __next__() method to the TokenList type could be interesting. Currently, when the TokenList is transformed into an iterator with iter(), next(my_list) returns only the first element of the tuple (id/form/lemma/etc.).
37 for key, value in tokenlist.metadata.items():
38 if value:
---> 39 line = "# " + key + " = " + value
40 else:
41 line = "# " + key
TypeError: can only concatenate str (not "int") to str
I have written a metadata_parser that converts values to int, which is used in my downstream tasks. However, the serializer then breaks. A simple fix can be achieved by using f-strings (which we should be using anyway, now that there is a Python 3.6 requirement).
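A sketch of the f-string version of the failing lines (serialize_metadata_line is a hypothetical standalone helper, not conllu's actual function):

```python
def serialize_metadata_line(key, value):
    # f-strings call str() on the interpolated value, so ints produced
    # by a custom metadata_parser serialize cleanly instead of raising
    # TypeError on string concatenation; a None value yields a bare
    # comment like "# newdoc".
    if value is not None:
        return f"# {key} = {value}"
    return f"# {key}"
```

This also covers the related case of unstructured comments whose metadata value is parsed as None.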
It seems that if I write:
with open("my.conllup") as file:
    content = file.read()
    plus_fields = conllu.parse_conllu_plus_fields(file)
then plus_fields is equal to None, while if I run parse_conllu_plus_fields first and then read the file, everything works as expected.
Thanks for the exceptional library - just wanted to note a breaking change we ran into which wasn't documented:
https://github.com/EmilStenstrom/conllu/blob/master/conllu/parser.py#L117
Previously I was relying on float id values being parsed as None, which occurs for words that have been elided in the text but automatically reconstructed in the annotations. This line now parses the id as the id of the word it elides. If you were using this field to check for ellipsis, you should now check whether the head field is None instead.
Hi,
does the conllu package process MWT tokens, like French du, which is a contraction of de and le:
1-2 du _ _ ...
1 de de ADP ...
2 le le DET ...
conllu seems to read them without error, but the MWT tokens are not accessible (and serialize() omits them).
Hello,
In the latest version (1.2.1), parse_tree doesn't work properly if a sentence contains ids with integer ranges or decimal points.
data_file = """
1 En en ADP _ _ 2 case _ _
2 1872 1872 NUM _ _ 26 obl _ SpaceAfter=No
3 , , PUNCT _ _ 26 punct _ _
4 quand quand SCONJ _ _ 8 mark _ _
5 les le DET _ Definite=Def|Number=Plur|PronType=Art 6 det _ _
6 Georgiens Georgiens PROPN _ _ 8 nsubj _ _
7 ont avoir AUX _ Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 8 aux _ _
8 repris reprendre VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 26 advcl _ _
9 les le DET _ Definite=Def|Gender=Fem|Number=Plur|PronType=Art 10 det _ _
10 commandes commande NOUN _ Gender=Fem|Number=Plur 8 obj _ _
11-12 du _ _ _ _ _ _ _ _
11 de de ADP _ _ 13 case _ _
12 le le DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 13 det _ _
13 gouvernement gouvernement NOUN _ Gender=Masc|Number=Sing 10 nmod _ SpaceAfter=No
14 , , PUNCT _ _ 26 punct _ _
15 Barnett Barnett PROPN _ _ 26 nsubj _ _
16 ( ( PUNCT _ _ 20 punct _ SpaceAfter=No
17 qui qui PRON _ PronType=Rel 20 nsubj:pass _ _
18 avait avoir AUX _ Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin 20 aux _ _
19 été être AUX _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 20 aux:pass _ _
20 réélu réélire VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 15 acl:relcl _ _
21 par par ADP _ _ 23 case _ _
22 ce ce DET _ Gender=Masc|Number=Sing|PronType=Dem 23 det _ _
23 point point NOUN _ Gender=Masc|Number=Sing 20 obl _ SpaceAfter=No
24 ) ) PUNCT _ _ 20 punct _ _
25 a avoir AUX _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 26 aux _ _
26 rapporté rapporter VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 0 root _ _
27 le le DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 28 det _ _
28 sceau sceau PROPN _ _ 26 obj _ SpaceAfter=No
29 . . PUNCT _ _ 26 punct _ _
"""
sentence_tree = parse_tree(data_file)
print(sentence_tree)
outputs
[TokenTree<token={id=(11, '-', 12), form=du}, children=None>]
Idea: Use an API that is similar to Django's, where you can write data.filter(id__gt=3)
and get all nodes with an ID greater than 3. data.filter(form="the", head=2)
would return all nodes with both of those properties set.
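The lookup syntax can be sketched over plain token dicts (filter_tokens and its lookup table are hypothetical; the real feature would live on TokenList):

```python
import operator

def filter_tokens(tokens, **criteria):
    # "id__gt" splits into field "id" and lookup "gt"; a plain keyword
    # such as form="the" falls through to the equality lookup. All
    # criteria must match for a token to be kept.
    lookups = {"gt": operator.gt, "lt": operator.lt, "": operator.eq}

    def matches(token):
        for key, expected in criteria.items():
            field, _, lookup = key.partition("__")
            if not lookups[lookup](token.get(field), expected):
                return False
        return True

    return [tok for tok in tokens if matches(tok)]
```

filter_tokens(data, id__gt=3) then returns the nodes with ID greater than 3, and filter_tokens(data, form="the", head=2) the nodes matching both properties, mirroring the Django-style API described above.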
This example from AllenNLP returns deps as a string, as 5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj.
# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0022
# text = Over 300 Iraqis are reported dead and 500 wounded in Fallujah alone.
1 Over over ADV RB _ 2 advmod 2:advmod _
2 300 300 NUM CD NumType=Card 3 nummod 3:nummod _
3 Iraqis Iraqis PROPN NNPS Number=Plur 5 nsubj:pass 5:nsubj:pass|6:nsubj:xsubj|8:nsubj:pass _
4 are be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 5 aux:pass 5:aux:pass _
5 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root _
6 dead dead ADJ JJ Degree=Pos 5 xcomp 5:xcomp _
7 and and CCONJ CC _ 8 cc 8:cc|8.1:cc _
8 500 500 NUM CD NumType=Card 5 conj 5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj _
8.1 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass _ _ 5:conj:and CopyOf=5
9 wounded wounded ADJ JJ Degree=Pos 8 orphan 8.1:xcomp _
10 in in ADP IN _ 11 case 11:case _
11 Fallujah Fallujah PROPN NNP Number=Sing 5 obl 5:obl:in _
12 alone alone ADV RB _ 11 advmod 11:advmod SpaceAfter=No
13 . . PUNCT . _ 5 punct 5:punct _
Since the two formats are nearly identical, can this library read CoNLL? I get errors, probably because the format does not match exactly.
Could it also read CoNLL-U files where the metadata (#-prefixed lines) is missing?
Hello,
Just a heads up, in case you haven't noticed yet: the 5.0.0 wheel on PyPI is empty (probably due to a project build config issue), so currently installing the latest conllu through pip results in a missing conllu module.
Good luck :)
Hi, first of all, thanks for this awesome library, it is really helpful for my current project :-)
Second, for my current project I had to implement some helper methods for searching and collecting subtree tokens, which I want to share with you. You can decide whether you want to incorporate some of them into the library (probably as instance methods on TokenTree). They might help others too.
Code:
from operator import itemgetter
from typing import Optional, List, Text

from conllu import TokenTree, TokenList, Token


def _get_node_by_id(tree_node: TokenTree, id_to_find: int) -> Optional[TokenTree]:
    to_traverse = [tree_node]
    while len(to_traverse):
        node = to_traverse.pop()
        if node.token["id"] == id_to_find:
            return node
        to_traverse.extend(node.children)
    return None


def _get_node_by_id_recursive(tree_node: TokenTree, id_to_find: int) -> Optional[TokenTree]:
    if tree_node.token["id"] == id_to_find:
        return tree_node
    for child_token in tree_node.children:
        if found_node := _get_node_by_id_recursive(child_token, id_to_find):
            return found_node
    return None


def _collect_all_subtree_tokens(tree_node: TokenTree) -> List[Token]:
    subtree_nodes = []
    to_traverse = [tree_node]
    while len(to_traverse):
        node = to_traverse.pop()
        subtree_nodes.append(node.token)
        to_traverse.extend(node.children)
    return subtree_nodes


# More general; could be used in get_word_subtree more or less instead
# of _collect_all_subtree_tokens.
def to_list(root_node: TokenTree) -> TokenList:
    def flatten_tree(root_token: TokenTree, token_list: List[Token]) -> List[Token]:
        token_list.append(root_token.token)
        for child_token in root_token.children:
            flatten_tree(child_token, token_list)
        return token_list

    flatten_list = flatten_tree(root_node, [])
    flatten_list_by_id = sorted(flatten_list, key=itemgetter("id"))
    return TokenList(flatten_list_by_id, root_node.metadata)


def get_word_subtree(tree_node: TokenTree, token_id: int) -> Optional[Text]:
    word_node = _get_node_by_id(tree_node, token_id)
    if word_node is None:
        return None
    subtree_tokens = _collect_all_subtree_tokens(word_node)
    sorted_tokens_by_id = sorted(subtree_tokens, key=itemgetter("id"))
    return " ".join(token["form"] for token in sorted_tokens_by_id)
I think something may be broken when serializing (pickle-dumping) and loading back conllu-parsed sentences.
File "python3.9/site-packages/conllu/models.py", line 103, in extend
self.metadata.update(iterable.metadata)
AttributeError: 'TokenList' object has no attribute 'metadata'
Do you think writing a CoNLL-U writer would be easier starting from your current code, or should I just start one from scratch?
Can conllu be integrated with NLTK?
NLTK has a ConllCorpusReader class, but it currently doesn't support CoNLL-U.
How can we cite this library?
Consider the part of sentence:
17 Sarajewo Sarajewo PROPN NE Case=Dat|Gender=Neut|Number=Sing 15 nmod _ NamedEntity=Yes
18-19 zur _ _ _ _ _ _ _ _
18 zu zu ADP APPR _ 21 case _ _
19 der der DET ART Case=Dat|Definite=Def|Gender=Fem|Number=Sing|PronType=Art 21 det _ _
20 humanitären humanitär ADJ ADJA Case=Dat|Gender=Fem|Number=Sing 21 amod _ _
The TokenList rendition of this sentence returns wrong values. For example:
for x in range(17, 21): print(sentence[x])
returns the output
zur zu der humanitären
thereby messing up indexing during iteration and list comprehensions.
According to UD data (v2.9, example language: Akkadian, treebank: RIAO) as reproduced below,
# sent_id = Q006035-192
# text = narê aškun
1 narê narû NOUN N Gender=Masc|Number=Plur 2 obj _ {NA₄}NA.RU₂.A.MEŠ
2 aškun šakānu VERB V Gender=Com|Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin|VerbStem=G 0 root _ aš-ku-un
The misc column doesn't necessarily contain key-value pairs like SpaceAfter=No; here it holds bare transliteration values. When the block is read as a TokenTree and then re-serialised, it generates the following output. Notice the trailing = at the end:
# sent_id = Q006035-192
# text = narê aškun
1 narê narû NOUN N Gender=Masc|Number=Plur 2 obj _ {NA₄}NA.RU₂.A.MEŠ=
2 aškun šakānu VERB V Gender=Com|Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin|VerbStem=G 0 root _ aš-ku-un=
In such cases the misc column should not be treated as a set of key-value pairs, but rather as a list of values delimited by the pipe character |.
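One way to make bare MISC values round-trip is to parse them with a None marker and skip the "=" on the way back out. A sketch (both helper names are hypothetical, not conllu's own API):

```python
def parse_misc_values(value):
    # Items without "=" map to None, so bare values like
    # "{NA4}NA.RU2.A.MES" round-trip without gaining a trailing "=".
    misc = {}
    for part in value.split("|"):
        key, sep, val = part.partition("=")
        misc[key] = val if sep else None
    return misc

def serialize_misc_values(misc):
    # Emit "key=value" for real pairs, and the bare key otherwise.
    return "|".join(k if v is None else f"{k}={v}" for k, v in misc.items())
```

This also covers the Singleton|Key=Value case reported earlier: the singleton survives parsing and serializes back without an "=".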
This is probably not strictly speaking an issue with this library so feel free to close the issue.
While using ParZu I came across the issue that most outputs define multiple root nodes. Take the example sentence "Ich esse einen Apfel." ("I eat an apple")
ParZu's output for this looks something like this:
1 Ich ich PRO PPER 1|Sg|_|Nom 2 subj _ _
2 esse essen V VVFIN 1|Sg|Pres|_ 0 root _ _
3 einen eine ART ART Indef|Masc|Acc|Sg 4 det _ _
4 Apfel Apfel N NN Masc|Acc|Sg 2 obja _ _
5 . . $. $. _ 0 root _ _
Both the verb and the punctuation claim to be root.
I did write a quick workaround. It just creates a top-level ROOT node for the root nodes to attach to.
The patch is probably not fit for inclusion in its current state. If you have some specific ideas of how the problem of multiple roots like this could be handled, I'd be open to that. If not, that's fine as well; maybe this issue will help someone out some day ;)
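The workaround can be sketched as grouping token ids under their heads before building the tree, so that a single virtual ROOT entry (head 0) collects every token claiming to be root (a hypothetical helper over plain dicts, not the actual patch):

```python
def group_children_by_head(tokens):
    # Map each head id to the ids of its dependents. Every token with
    # head 0, e.g. both the verb and the final punctuation in the ParZu
    # output above, ends up under the single virtual ROOT entry 0, so
    # tree construction sees exactly one root.
    children = {}
    for tok in tokens:
        children.setdefault(tok["head"], []).append(tok["id"])
    return children
```

A tree builder can then start from children[0] and recurse, regardless of how many tokens the parser marked as root.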
I've got the following error in print_tree():
File "/usr/local/lib/python3.6/dist-packages/conllu/tree_helpers.py", line 14, in print_tree
assert isinstance(node, TreeNode), "node not TreeNode %s" % type(node)
AssertionError: node not TreeNode <class 'list'>
... on the following part of data:
data = """
1 Настоящий _ ADJ _ Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|fPOS=ADJ++ 3 amod _ _
2 Федеральный _ ADJ _ Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|fPOS=ADJ++ 3 amod _ _
3 закон _ NOUN _ Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|fPOS=NOUN++ 4 nsubj _ _
4 определяет _ VERB _ Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act|fPOS=VERB++ 0 ROOT _ _
5 особенности _ NOUN _ Animacy=Inan|Case=Acc|Gender=Fem|Number=Plur|fPOS=NOUN++ 4 dobj _ _
6 гражданско-правового _ ADJ _ Case=Gen|Degree=Pos|Gender=Neut|Number=Sing|fPOS=ADJ++ 7 amod _ _
7 положения _ NOUN _ Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing|fPOS=NOUN++ 5 nmod _ _
8 некоммерческих _ ADJ _ Case=Gen|Degree=Pos|Number=Plur|fPOS=ADJ++ 9 amod _ _
9 организаций _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur|fPOS=NOUN++ 7 nmod _ _
10 отдельных _ ADJ _ Case=Gen|Degree=Pos|Number=Plur|fPOS=ADJ++ 12 amod _ _
11 организационно-правовых _ ADJ _ Case=Gen|Degree=Pos|Number=Plur|fPOS=ADJ++ 12 amod _ _
12 форм _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur|fPOS=NOUN++ 9 nmod _ _
13 , _ PUNCT , fPOS=PUNCT++, 12 punct _ _
14 видов _ NOUN _ Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|fPOS=NOUN++ 12 conj _ _
15 и _ CONJ _ fPOS=CONJ++ 12 cc _ _
16 типов _ NOUN _ Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|fPOS=NOUN++ 12 conj _ _
17 , _ PUNCT , fPOS=PUNCT++, 16 punct _ _
18 а _ CONJ _ fPOS=CONJ++ 5 cc _ _
19 также _ ADV _ Degree=Pos|fPOS=ADV++ 20 advmod _ _
20 возможные _ ADJ _ Case=Nom|Degree=Pos|Number=Plur|fPOS=ADJ++ 21 amod _ _
21 формы _ NOUN _ Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur|fPOS=NOUN++ 5 conj _ _
22 поддержки _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing|fPOS=NOUN++ 21 nmod _ _
23 некоммерческих _ ADJ _ Case=Gen|Degree=Pos|Number=Plur|fPOS=ADJ++ 24 amod _ _
24 организаций _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur|fPOS=NOUN++ 22 dobj _ _
25 органами _ NOUN _ Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|fPOS=NOUN++ 5 conj _ _
26 государственной _ ADJ _ Case=Gen|Degree=Pos|Gender=Fem|Number=Sing|fPOS=ADJ++ 27 amod _ _
27 власти _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing|fPOS=NOUN++ 25 nmod _ _
28 и _ CONJ _ fPOS=CONJ++ 5 cc _ _
29 органами _ NOUN _ Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|fPOS=NOUN++ 5 conj _ _
30 местного _ ADJ _ Case=Gen|Degree=Pos|Gender=Neut|Number=Sing|fPOS=ADJ++ 31 amod _ _
31 самоуправления _ NOUN _ Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing|fPOS=NOUN++ 29 nmod _ _
32 . _ PUNCT . fPOS=PUNCT++. 34 punct _ _
33 ( _ PUNCT ( fPOS=PUNCT++( 34 punct _ _
34 п _ VERB _ Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing|fPOS=NOUN++ 4 parataxis _ _
35 . _ PUNCT . fPOS=PUNCT++. 4 punct _ _
"""
t = parse_tree(data)
print_tree(t)