emilstenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
License: MIT License
I found a new case where the regular expression for parsing enhanced representations fails in the Arabic training set (see https://github.com/mdelhoneux/conllu/blob/cfc45fb5e52e4fe714472a2002464db4c6876cec/tests/test_parser.py#L477). I have not yet managed to fix this without breaking other tests.
Hi,
Many compliments on this library. It's by far my preferred one for working with conllu files. I especially appreciate that it's free of dependencies.
One issue that I've encountered is in adding a key-value pair whose value is a float or int to the misc field of a token and then serializing it.
My use case is adding token-level analyses (e.g. dependency length) to the misc field of a new conllu object to be serialized.
For example, I might add a new dict to the field as follows:
new_data = {"DL": 2, "Cosine": 0.45}
token["misc"].update(new_data)
Although one can modify the dictionary in the misc field, adding new key-value pairs, serializing it then fails with the following error in serialize.py:
fields = []
for key, value in field.items():
    if value is None:
        value = "_"
    if value == "":
        fields.append(key)
        continue
>   fields.append('='.join((key, value)))
E   TypeError: sequence item 1: expected str instance, int found
The culprit is this: fields.append('='.join((key, value))). This works fine as long as both the key and the value are str type, but it breaks if the value is, for example, an int or float.
In my fork, I have changed this using an f-string:
def serialize_field(field: T.Any) -> str:
    if field is None:
        return '_'

    if isinstance(field, dict):
        if field == {}:
            return '_'

        fields = []
        for key, value in field.items():
            if value is None:
                value = "_"
            if value == "":
                fields.append(key)
                continue
            fields.append(f'{key}={value}')

        return '|'.join(fields)

    if isinstance(field, tuple):
        return "".join([serialize_field(item) for item in field])

    if isinstance(field, list):
        if len(field[0]) != 2:
            raise ParseException("Can't serialize '{}', invalid format".format(field))
        return "|".join([serialize_field(value) + ":" + str(key) for key, value in field])

    return "{}".format(field)
This appears to solve the issue easily, though of course it then means that string representations of arbitrary datatypes could end up in misc too, which may be undesirable. Perhaps checking whether the value's datatype is in (str, int, float) would constrain this?
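That type check can be sketched as a small standalone helper (a hypothetical illustration, not conllu's own code; the function name is made up):

```python
def serialize_misc(misc):
    # Join a MISC dict into CoNLL-U's "key=value|key=value" form,
    # allowing only str, int and float values so that arbitrary
    # objects can't leak into the serialized output.
    parts = []
    for key, value in misc.items():
        if not isinstance(value, (str, int, float)) or isinstance(value, bool):
            raise TypeError(f"unsupported MISC value type: {type(value).__name__}")
        parts.append(f"{key}={value}")
    return "|".join(parts)
```

With this, serialize_misc({"DL": 2, "Cosine": 0.45}) yields "DL=2|Cosine=0.45", while a list or dict value raises TypeError instead of silently stringifying.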
I still observe a strange thing, possibly linked to this workaround: if there is more than one token with head 0 in a conllu sentence and I try to get the upos of the sentence root (the top-level one), I get a maximum recursion error.
For example, the following script reproduces the error:
#!/usr/bin/env python3
import conllu

sent = """1 the the DET _ _ 0 root _ _
2 mouse mouse NOUN _ _ 3 root _ _
3 sleeps sleep VERB _ _ 0 root _ _
"""

for sentence in conllu.parse(sent):
    # print(sentence)
    root = sentence.to_tree()
    root.print_tree()
    print(root.token)
    print(root.token["upos"])
Running it gives the following output and error message:
(deprel:root) form:_ [0] # <-- the toplevel root does not have an UPOS
(deprel:root) form:the lemma:the upos:DET [1]
(deprel:root) form:sleeps lemma:sleep upos:VERB [3]
(deprel:root) form:mouse lemma:mouse upos:NOUN [2]
{'id': 0, 'form': '_', 'deprel': 'root'}
Traceback (most recent call last):
File "./ex.py", line 15, in <module>
print(root.token["upos"])
File "/home/jh/.local/lib/python3.6/site-packages/conllu/models.py", line 31, in __missing__
return self[self.MAPPING[key]]
File "/home/jh/.local/lib/python3.6/site-packages/conllu/models.py", line 31, in __missing__
return self[self.MAPPING[key]]
File "/home/jh/.local/lib/python3.6/site-packages/conllu/models.py", line 31, in __missing__
return self[self.MAPPING[key]]
[Previous line repeated 246 more times]
RecursionError: maximum recursion depth exceeded
Originally posted by @jheinecke in #44 (comment)
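Until the alias loop in __missing__ is fixed, a defensive lookup avoids the crash on the synthetic root token (a sketch; get_field is a hypothetical helper, not part of conllu):

```python
def get_field(token, key, default=None):
    # Use a plain membership test instead of __getitem__, so a missing
    # key (e.g. 'upos' on the synthetic id-0 root token) returns a
    # default rather than bouncing through __missing__'s alias mapping
    # until the recursion limit is hit.
    return token[key] if key in token else default
```

For the root token {'id': 0, 'form': '_', 'deprel': 'root'} above, get_field(root, 'upos') simply returns None.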
Read column 9 and make the enhanced dependency graph available?
First off, thanks for this package. The API on the README looks great. Can't wait to dig in.
I've forked this repo and will add type annotations to every part that I will make use of (most likely all the .py files in the root dir).
If you'd like, I can submit a PR to merge those type annotations back in (and this repo can then be PEP 561 compliant).
There are some issues with this TypeError, as it mistakes a token containing < or > for a tag. For instance, the Danish sentence:
Antag at a og b er to reelle tal og at 0<a<b.
raises the TypeError.
Moreover, what about actual tags like <p> that can be found quoted in several texts?
Do you have plans to support the https://universaldependencies.org/ext-format.html?
The TreeNode is an OrderedDict.
Question: Is there any way to find the shortest path between two TreeNodes?
Thanks!
There are features for handling strangely formatted files. I should document them.
Hello,
I know this is a malformed line, but apparently new versions of Stanza produce this output:
72 sa sa PRON R PronType=Prs|Reflex=Yes 76 expl:pv _ start_char=633673|end_char=633675
73-73 zaň _ _ _ _ _ _ _ start_char=633676|end_char=633679
73 zaň zaň PRON PFms4 Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs 76 obj _
which raises this exception:
File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 40, in parse_tree
return list(parse_tree_incr(StringIO(data)))
File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 43, in parse_tree_incr
for tokenlist in parse_incr(in_file):
File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 32, in parse_incr
yield parse_token_and_metadata(
File "/usr/local/lib/python3.9/site-packages/conllu/parser.py", line 95, in parse_token_and_metadata
tokens.append(parse_line(line, fields, field_parsers))
File "/usr/local/lib/python3.9/site-packages/conllu/parser.py", line 132, in parse_line
raise ParseException("Failed parsing field '{}': ".format(field) + str(e))
conllu.exceptions.ParseException: Failed parsing field 'id': '72-72' is not a valid ID.
Since I have to parse a lot of files, would it be possible to safely skip malformed lines instead of raising an exception (i.e., terminating the script)?
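One workaround, pending library support, is to split the input into sentence blocks yourself and parse each block separately, skipping the ones that fail. A minimal sketch (parse_leniently is a hypothetical helper; parse_fn would typically be conllu.parse):

```python
def parse_leniently(data, parse_fn):
    # Split the raw file into sentence blocks on blank lines and parse
    # each block on its own, counting (rather than crashing on) the
    # blocks that raise a parse error.
    parsed, skipped = [], 0
    for block in data.strip().split("\n\n"):
        try:
            parsed.append(parse_fn(block + "\n"))
        except Exception:
            skipped += 1
    return parsed, skipped
```

The cost is one parse call per sentence instead of one per file, but a single malformed ID no longer terminates the whole run.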
Try this dataset: https://github.com/ufal/rh_nntagging/tree/master/data/ud-1.2/en
I got the error when parsing:
ParseException: Invalid line format, line must contain either tabs or two spaces.
I noticed some bugs in conllu. When I have misc features of the form Singleton|Key=Value, the Singleton field is left out completely, even though this form is not invalid and is something I have seen in CoNLL-U files. I intend to report these as well to see if anything comes out of it.
From Reddit
According to the format specification, comments such as # newdoc should be allowed. However, when I try to parse and serialize the following text:
# newdoc
# sent_id = 1
# text = They buy and sell books.
1 They they PRON PRP Case=Nom|Number=Plur 2 nsubj 2:nsubj|4:nsubj _
2 buy buy VERB VBP Number=Plur|Person=3|Tense=Pres 0 root 0:root _
3 and and CONJ CC _ 4 cc 4:cc _
4 sell sell VERB VBP Number=Plur|Person=3|Tense=Pres 2 conj 0:root|2:conj _
5 books book NOUN NNS Number=Plur 2 obj 2:obj|4:obj SpaceAfter=No
6 . . PUNCT . _ 2 punct 2:punct _
serialization throws the TypeError: must be str, not NoneType exception. Parsing works fine; I'm able to retrieve the metadata for the sentence above:
>>> sent.metadata
OrderedDict([('newdoc', None),
('sent_id', '1'),
('text', 'They buy and sell books.')])
Would it be possible to add support for such unstructured comments?
Hey, I was wondering if your project supports the default CoNLL format and, if not, if you know of a good converter.
I am talking about this format:
mz/sinorama/10/ectb_1034 0 1 , , * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 2 Lee NNP (NP(NP*) - - - - (PERSON) * * (ARG1*) (ARG0* * * * (23
mz/sinorama/10/ectb_1034 0 3 , , * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 4 who WP (SBAR(WHNP*) - - - - * * * (R-ARG1*) * * * * -
mz/sinorama/10/ectb_1034 0 5 has VBZ (S(VP* have 01 - - * (V*) * * * * * * -
mz/sinorama/10/ectb_1034 0 6 been VBN (VP* be 03 - - * * (V*) * * * * * -
mz/sinorama/10/ectb_1034 0 7 dubbed VBN (VP* dub 01 1 - * * * (V*) * * * * -
mz/sinorama/10/ectb_1034 0 8 " `` * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 9 the DT (S(NP* - - - - * * * (ARG2* * * * * -
mz/sinorama/10/ectb_1034 0 10 Asian NNP * - - - - (NORP) * * * * * * * -
mz/sinorama/10/ectb_1034 0 11 Godfather NNP *)))))))) - - - - * * * *) *) * * * 23)
mz/sinorama/10/ectb_1034 0 12 , , * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 13 " '' * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 14 played VBD (VP* play 02 3 - * * * * (V*) * * * -
mz/sinorama/10/ectb_1034 0 15 a DT (NP* - - - - * * * * (ARG1* * * * -
mz/sinorama/10/ectb_1034 0 16 key JJ * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 17 role NN *) role - 1 - * * * * *) * * * -
mz/sinorama/10/ectb_1034 0 18 in IN (PP* - - - - * * * * (ARGM-LOC* * * * -
mz/sinorama/10/ectb_1034 0 19 the DT (NP* - - - - (DATE* * * * * * * * -
mz/sinorama/10/ectb_1034 0 20 early JJ * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 21 1990s NNS *)) - - - - *) * * * *) * * * -
mz/sinorama/10/ectb_1034 0 22 in IN (PP* - - - - * * * * (ARGM-PRP* * * * -
mz/sinorama/10/ectb_1034 0 23 bringing VBG (S(VP* bring 05 5 - * * * * * (V*) * * -
mz/sinorama/10/ectb_1034 0 24 about RP (PRT*) - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 25 the DT (NP(NP* - - - - * * * * * (ARG1* * * -
mz/sinorama/10/ectb_1034 0 26 first JJ * - - - - (ORDINAL) * * * * * (ARGM-TMP*) * -
mz/sinorama/10/ectb_1034 0 27 round NN *) round 01 2 - * * * * * * (V*) * -
mz/sinorama/10/ectb_1034 0 28 of IN (PP* - - - - * * * * * * (ARG1* * -
mz/sinorama/10/ectb_1034 0 29 talks NNS (NP(NP*) talk 01 3 - * * * * * * * (V*) -
mz/sinorama/10/ectb_1034 0 30 between IN (PP* - - - - * * * * * * * (ARG0* -
mz/sinorama/10/ectb_1034 0 31 Taiwan NNP (NP(NP(NP* - - - - (GPE) * * * * * * * (19
mz/sinorama/10/ectb_1034 0 32 's POS *) - - - - * * * * * * * * 19)
mz/sinorama/10/ectb_1034 0 33 Koo NNP * - - - - (PERSON* * * * * * * * -
mz/sinorama/10/ectb_1034 0 34 Chen NNP * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 35 - HYPH * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 36 fu NNP *) - - - - *) * * * * * * * -
mz/sinorama/10/ectb_1034 0 37 and CC * - - - - * * * * * * * * -
mz/sinorama/10/ectb_1034 0 38 Beijing NNP (NP(NP* - - - - (GPE) * * * * * * * (4
mz/sinorama/10/ectb_1034 0 39 's POS *) - - - - * * * * * * * * 4)
mz/sinorama/10/ectb_1034 0 40 Wang NNP * - - - - (PERSON* * * * * * * * -
mz/sinorama/10/ectb_1034 0 41 Daohan NNP *)))))))))) - - - - *) * * * *) *) *) *) -
mz/sinorama/10/ectb_1034 0 42 . . *)) - - - - * * * * * * * * -
Additionally, I was wondering if there is a parser that extracts the verb arguments (from the eleventh to the second-to-last column above) in a structured way. Writing a parser from scratch would take a really long time for me.
Hi, I'm just wondering if it is possible to get CoNLL-U format for a given sentence, something like this:
s = 'The quick brown fox jumps over the lazy dog.'
Then something like
to_conllu(s)
to produce
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
I'm not sure whether this is intentional, but it's causing problems for my app since I want to include these nodes as root nodes of the parse tree. Simply doing contents.replace('-1', '0') won't do it, since I only want to include particular nodes based on their upostag field.
The problematic line is this: https://github.com/EmilStenstrom/conllu/blob/master/conllu/parser.py#L83. '-1'.isdigit() returns False.
Thanks for this library!
The check for a blank line:
Line 55 in 70b144b
is not robust. I am working with the data from WNUT 2017 (https://noisy-text.github.io/2017/emerging-rare-entities.html), where blank lines are denoted as \t\n, so it fails. From what I could find, the CoNLL format only specifies a "blank line" as the sentence separator, so I believe the best check would be:
if len(line.strip()) == 0
What are your thoughts on this?
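A minimal sketch of that check:

```python
def is_blank(line):
    # "\t\n", "  \n" and "\n" all strip down to the empty string,
    # so any whitespace-only line counts as a sentence separator.
    return len(line.strip()) == 0
```

This treats the WNUT-style "\t\n" separator the same as a plain empty line, while a real token line (which always contains non-whitespace) is never misclassified.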
For the following example,
# sent_id = reviews-002288-0001
# newpar id = reviews-002288-p0001
# text = It's well cool. :)
1-2 It's _ _ _ _ _ _ _ _
1 It it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 4 nsubj 4:nsubj _
2 's be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop 4:cop _
3 well well ADV RB Degree=Pos 4 advmod 4:advmod _
4 cool cool ADJ JJ Degree=Pos 0 root 0:root SpaceAfter=No
5 . . PUNCT . _ 4 punct 4:punct _
6 :) :) SYM NFP _ 4 discourse 4:discourse _
The parsed result will be
TokenList<It's, It, 's, well, cool, ., :)>
[{'id': (1, '-', 2),
'form': "It's",
'lemma': '_',
'upos': '_',
'xpos': None,
'feats': None,
'head': None,
'deprel': '_',
'deps': None,
'misc': None},
{'id': 1,
'form': 'It',
'lemma': 'it',
'upos': 'PRON',
'xpos': 'PRP',
'feats': {'Case': 'Nom',
'Gender': 'Neut',
'Number': 'Sing',
'Person': '3',
'PronType': 'Prs'},
'head': 4,
'deprel': 'nsubj',
'deps': [('nsubj', 4)],
'misc': None},
]
The parsing has no problem. However, in some cases one may want to use only those tokens whose id is an integer.
How about making TokenList support filtering for tokens with integer ids only? Like this:
sentence.filter(id=lambda x: type(x) is int)
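Without library support, the same filter can be done by hand: multiword ranges parse to tuple ids like (1, '-', 2) and empty nodes to (8, '.', 1), so an int check keeps only plain word tokens. A sketch over plain dicts (integer_id_tokens is a hypothetical helper):

```python
def integer_id_tokens(tokens):
    # Multiword-token ranges and empty nodes carry tuple ids, so
    # keeping only int ids keeps exactly the ordinary word lines.
    return [tok for tok in tokens if isinstance(tok["id"], int)]
```

Applied to the example above, the "It's" range token with id (1, '-', 2) is dropped while tokens 1-6 are kept.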
pyconll is much faster. Using the UD_French-GSD dev data (36824 tokens, found here), it took 0.41 s to load, while conllu took on the order of minutes (over 10), to the point where I had to simply kill the process. For any meaningful amount of data in CL, conllu is not of much use right now.
From Reddit
Given the following sentence:
Removed.
The code at parser.py:152
looks like this:
def parse_paired_list_value(value):
    if fullmatch(MULTI_DEPS_PATTERN, value):
        return [
            (part.split(":", 1)[1], parse_id_value(part.split(":")[0]))
            for part in value.split("|")
        ]
    return parse_nullable_value(value)
It seems like the definition of MULTI_DEPS_PATTERN is not correct. Most deps values are returned as strings instead of lists of tuples, because the parsing fails. For example, '1:compound|6:ARG1|9:ARG1' is considered bad, but according to the specification it should be fine. In fact, the parsing code inside the if-statement works perfectly on this line. '4:ARG1' is also considered flawed, while '5:measure' is considered okay.
More info available here: https://universaldependencies.org/ext-format.html
Hi,
First, thank you for creating this awesome parser!
Users of the Flair library have an issue with the versions released yesterday.
The issue is that conllu.TokenList cannot be imported anymore; imports need to change to conllu.models.TokenList.
As I suppose that not only Flair runs into this issue, I would suggest adding from conllu.models import TokenList to conllu's __init__.py and releasing a hotfix.
In the long term, it would be nice to add an __all__ declaration to specify the public interface, so it would be clear that conllu.TokenList wasn't intended to be used that way.
I recently upgraded from conllu 1.3.1 to 2.2 due to the latter version's ability to deal with elided tokens/copy nodes (e.g. token 8.1 below), which was addressed in #27.
I am parsing the deps column and have a loop which iterates over the deps tuples to put the heads into a heads list and the relations into a relations list. The upgrade now includes the copy nodes, which is good, but now all 0:root labels are returned as a string rather than a tuple, which breaks my loop.
# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0022
# text = Over 300 Iraqis are reported dead and 500 wounded in Fallujah alone.
1 Over over ADV RB _ 2 advmod 2:advmod _
2 300 300 NUM CD NumType=Card 3 nummod 3:nummod _
3 Iraqis Iraqis PROPN NNPS Number=Plur 5 nsubj:pass 5:nsubj:pass|6:nsubj:xsubj|8:nsubj:pass _
4 are be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 5 aux:pass 5:aux:pass _
5 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root _
6 dead dead ADJ JJ Degree=Pos 5 xcomp 5:xcomp _
7 and and CCONJ CC _ 8 cc 8:cc|8.1:cc _
8 500 500 NUM CD NumType=Card 5 conj 5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj _
8.1 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass _ _ 5:conj:and CopyOf=5
9 wounded wounded ADJ JJ Degree=Pos 8 orphan 8.1:xcomp _
10 in in ADP IN _ 11 case 11:case _
11 Fallujah Fallujah PROPN NNP Number=Sing 5 obl 5:obl:in _
12 alone alone ADV RB _ 11 advmod 11:advmod SpaceAfter=No
13 . . PUNCT . _ 5 punct 5:punct _
I'm just wondering: is this the desired behaviour? E.g. the output of deps looks like:
deps [[('advmod', 2)], [('nummod', 3)], [('nsubj:pass', 5), ('nsubj:xsubj', 6), ('nsubj:pass', 8)], [('aux:pass', 5)], '0:root', [('xcomp', 5)], [('cc', 8), ('cc', (8, '.', 1))], [('conj:and', 5), ('nsubj:pass', (8, '.', 1)), ('nsubj:xsubj', 9)], [('conj:and', 5)], [('xcomp', (8, '.', 1))], [('case', 11)], [('obl:in', 5)], [('advmod', 11)], [('punct', 5)]]
Is there any particular reason why '0:root' shouldn't be [('root', 0)]?
Thanks!
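Until the inconsistency is fixed, callers can normalize on their side. A sketch (normalize_deps is a hypothetical helper; it handles only plain integer heads, not copy-node ids like 8.1):

```python
def normalize_deps(deps):
    # If the library handed back the raw string (e.g. '0:root'),
    # split it into (relation, head) tuples; already-parsed lists
    # pass through unchanged.
    if isinstance(deps, str):
        pairs = []
        for part in deps.split("|"):
            head, _, relation = part.partition(":")
            pairs.append((relation, int(head)))
        return pairs
    return deps
```

With this, normalize_deps('0:root') gives [('root', 0)], matching the shape of the other deps entries, so a single loop over heads and relations works for every token.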
Hey. I have found this library quite useful. Here is a suggestion I ran into: TokenList could maybe inherit from list, and so be treated like one.
I tried manipulating the sentences (TokenLists) as if they were lists, such as using len, but then I saw they weren't. I thought about adding __len__, but I think it's more appropriate to make them lists, so they natively support other things such as concatenation, comparison, and so on.
What do you think?
When I use parse_tree() to get the relationship between terms, it does not show the same format as in your example; instead, the output is on one line. Can you help me get the same format as in your example of parse_tree()? Thanks.
Is it possible to modify a parsed CoNLL-U sentence? I tried to add a new token, exploiting the fact that a sentence is a token list:
import conllu
data = """# sent_id = 1
1 the the DET _ _ 2 det _ _
2 cat cat NOUN _ _ 3 nsubj _ _
3 sleeps sleep VERB _ _ 0 root _ _
"""
sentences = conllu.parse(data)
print(sentences[0].serialize())

newtoken = conllu.models.Token({"id": 4, "form": "well", "head": 3, "deprel": "advmod"})
sentences[0].append(newtoken)
print(sentences[0].serialize())  # ERROR: token 4 incorrectly serialized
Maybe this is not the best practice? (I haven't found any information, though.)
But doing it this way, the final output is incorrect, since my new token 4 was initialized only with id, form, head and deprel:
# sent_id = 1
1 the the DET _ _ 2 det _ _
2 cat cat NOUN _ _ 3 nsubj _ _
3 sleeps sleep VERB _ _ 0 root _ _
4 well 3 advmod
Shouldn't Token() initialize all fields with "_" to ensure a correct serialisation?
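Until then, a helper can fill in the defaults before appending. A sketch (make_token is a hypothetical helper; the field tuple lists the ten standard CoNLL-U columns):

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

def make_token(**values):
    # Start from all ten CoNLL-U columns set to "_", then overlay
    # whatever the caller supplied, so serialization never sees a
    # missing column.
    token = {field: "_" for field in FIELDS}
    token.update(values)
    return token
```

make_token(id=4, form="well", head=3, deprel="advmod") then carries "_" for lemma, upos, xpos, feats, deps and misc, so the serialized line has all ten columns.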
Lots of files have strange formats in their metadata. Adding a way to customize that parsing would likely help.
Hi,
Maybe adding a __next__() method to the TokenList type could be interesting. Currently, when the TokenList is transformed into an iterator with iter(), next(my_list) returns only the first element of the tuple (id/form/lemma/etc.).
37 for key, value in tokenlist.metadata.items():
38 if value:
---> 39 line = "# " + key + " = " + value
40 else:
41 line = "# " + key
TypeError: can only concatenate str (not "int") to str
I have written a metadata_parser that converts values to int, which is used in my downstream tasks. However, the serializer then breaks. A simple fix can be achieved by using f-strings (which we should be using anyway, now that there is a Python 3.6 requirement).
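A sketch of the f-string version of the failing lines (serialize_metadata_line is a hypothetical standalone helper, not conllu's actual function):

```python
def serialize_metadata_line(key, value):
    # f-strings call str() on the interpolated value, so ints produced
    # by a custom metadata_parser serialize cleanly instead of raising
    # TypeError on string concatenation; a None value yields a bare
    # comment like "# newdoc".
    if value is not None:
        return f"# {key} = {value}"
    return f"# {key}"
```

This also covers the related case of unstructured comments whose metadata value is parsed as None.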
It seems that if I write:
with open("my.conllup") as file:
    content = file.read()
    plus_fields = conllu.parse_conllu_plus_fields(file)
then plus_fields is equal to None, while if I run parse_conllu_plus_fields first and then read the file, everything works as expected.
Thanks for the exceptional library - just wanted to note a breaking change we ran into which wasn't documented:
https://github.com/EmilStenstrom/conllu/blob/master/conllu/parser.py#L117
Previously I was relying on float id values being parsed as None, which occurs for words that have been elided in the text but automatically reconstructed in the annotations. This line now parses the id as the id of the word it elides. If you were using this field to check for ellipsis, you should now check whether the head field is None instead.
Hi,
does the conllu package process MWT tokens, like French du, which is a contraction of de and le:
1-2 du _ _ ...
1 de de ADP ...
2 le le DET ...
conllu seems to read them without error, but the MWT tokens are not accessible (and serialize() omits them).
Hello,
In the latest version (1.2.1), parse_tree doesn't work properly if a sentence contains ids with integer ranges or decimal points.
data_file = """
1 En en ADP _ _ 2 case _ _
2 1872 1872 NUM _ _ 26 obl _ SpaceAfter=No
3 , , PUNCT _ _ 26 punct _ _
4 quand quand SCONJ _ _ 8 mark _ _
5 les le DET _ Definite=Def|Number=Plur|PronType=Art 6 det _ _
6 Georgiens Georgiens PROPN _ _ 8 nsubj _ _
7 ont avoir AUX _ Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 8 aux _ _
8 repris reprendre VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 26 advcl _ _
9 les le DET _ Definite=Def|Gender=Fem|Number=Plur|PronType=Art 10 det _ _
10 commandes commande NOUN _ Gender=Fem|Number=Plur 8 obj _ _
11-12 du _ _ _ _ _ _ _ _
11 de de ADP _ _ 13 case _ _
12 le le DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 13 det _ _
13 gouvernement gouvernement NOUN _ Gender=Masc|Number=Sing 10 nmod _ SpaceAfter=No
14 , , PUNCT _ _ 26 punct _ _
15 Barnett Barnett PROPN _ _ 26 nsubj _ _
16 ( ( PUNCT _ _ 20 punct _ SpaceAfter=No
17 qui qui PRON _ PronType=Rel 20 nsubj:pass _ _
18 avait avoir AUX _ Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin 20 aux _ _
19 été être AUX _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 20 aux:pass _ _
20 réélu réélire VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 15 acl:relcl _ _
21 par par ADP _ _ 23 case _ _
22 ce ce DET _ Gender=Masc|Number=Sing|PronType=Dem 23 det _ _
23 point point NOUN _ Gender=Masc|Number=Sing 20 obl _ SpaceAfter=No
24 ) ) PUNCT _ _ 20 punct _ _
25 a avoir AUX _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 26 aux _ _
26 rapporté rapporter VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 0 root _ _
27 le le DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 28 det _ _
28 sceau sceau PROPN _ _ 26 obj _ SpaceAfter=No
29 . . PUNCT _ _ 26 punct _ _
"""
sentence_tree = parse_tree(data_file)
print(sentence_tree)
outputs
[TokenTree<token={id=(11, '-', 12), form=du}, children=None>]
Idea: Use an API that is similar to Django's, where you can write data.filter(id__gt=3)
and get all nodes with an ID greater than 3. data.filter(form="the", head=2)
would return all nodes with both of those properties set.
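The lookup syntax can be sketched over plain token dicts (filter_tokens and its lookup table are hypothetical; the real feature would live on TokenList):

```python
import operator

def filter_tokens(tokens, **criteria):
    # "id__gt" splits into field "id" and lookup "gt"; a plain keyword
    # such as form="the" falls through to the equality lookup. All
    # criteria must match for a token to be kept.
    lookups = {"gt": operator.gt, "lt": operator.lt, "": operator.eq}

    def matches(token):
        for key, expected in criteria.items():
            field, _, lookup = key.partition("__")
            if not lookups[lookup](token.get(field), expected):
                return False
        return True

    return [tok for tok in tokens if matches(tok)]
```

filter_tokens(data, id__gt=3) then returns the nodes with ID greater than 3, and filter_tokens(data, form="the", head=2) the nodes matching both properties, mirroring the Django-style API described above.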
This example from AllenNLP returns deps as a string, as 5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj.
# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0022
# text = Over 300 Iraqis are reported dead and 500 wounded in Fallujah alone.
1 Over over ADV RB _ 2 advmod 2:advmod _
2 300 300 NUM CD NumType=Card 3 nummod 3:nummod _
3 Iraqis Iraqis PROPN NNPS Number=Plur 5 nsubj:pass 5:nsubj:pass|6:nsubj:xsubj|8:nsubj:pass _
4 are be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 5 aux:pass 5:aux:pass _
5 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root _
6 dead dead ADJ JJ Degree=Pos 5 xcomp 5:xcomp _
7 and and CCONJ CC _ 8 cc 8:cc|8.1:cc _
8 500 500 NUM CD NumType=Card 5 conj 5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj _
8.1 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass _ _ 5:conj:and CopyOf=5
9 wounded wounded ADJ JJ Degree=Pos 8 orphan 8.1:xcomp _
10 in in ADP IN _ 11 case 11:case _
11 Fallujah Fallujah PROPN NNP Number=Sing 5 obl 5:obl:in _
12 alone alone ADV RB _ 11 advmod 11:advmod SpaceAfter=No
13 . . PUNCT . _ 5 punct 5:punct _
Since the two formats are nearly identical, can this library read CoNLL? I get errors, probably because the format does not match exactly.
Could it also read CoNLL-U files where the metadata (#-prefixed lines) is missing?
Hello,
Just a heads up, in case you haven't noticed yet: the 5.0.0 wheel on PyPI is empty (probably due to a project build config issue), so currently installing the latest conllu through pip results in a missing conllu module.
Good luck :)
Hi, first of all, thanks for this awesome library, it is really helpful for my current project :-)
Second, for my current project I had to implement some helper methods for searching and collecting subtree tokens, which I want to share with you. You can decide whether you want to incorporate some of them into the library (probably as instance methods on TokenTree). They might help others too.
Code:
from operator import itemgetter
from typing import Optional, List, Text

from conllu import TokenTree, TokenList, Token


def _get_node_by_id(tree_node: TokenTree, id_to_find: int) -> Optional[TokenTree]:
    to_traverse = [tree_node]
    while len(to_traverse):
        node = to_traverse.pop()
        if node.token["id"] == id_to_find:
            return node
        to_traverse.extend(node.children)
    return None


def _get_node_by_id_recursive(tree_node: TokenTree, id_to_find: int) -> Optional[TokenTree]:
    if tree_node.token["id"] == id_to_find:
        return tree_node
    for child_token in tree_node.children:
        if found_node := _get_node_by_id_recursive(child_token, id_to_find):
            return found_node
    return None


def _collect_all_subtree_tokens(tree_node: TokenTree) -> List[Token]:
    subtree_nodes = []
    to_traverse = [tree_node]
    while len(to_traverse):
        node = to_traverse.pop()
        subtree_nodes.append(node.token)
        to_traverse.extend(node.children)
    return subtree_nodes


# More general; could be used in get_word_subtree more or less instead
# of _collect_all_subtree_tokens.
def to_list(root_node: TokenTree) -> TokenList:
    def flatten_tree(root_token: TokenTree, token_list: List[Token]) -> List[Token]:
        token_list.append(root_token.token)
        for child_token in root_token.children:
            flatten_tree(child_token, token_list)
        return token_list

    flatten_list = flatten_tree(root_node, [])
    flatten_list_by_id = sorted(flatten_list, key=itemgetter("id"))
    return TokenList(flatten_list_by_id, root_node.metadata)


def get_word_subtree(tree_node: TokenTree, token_id: int) -> Optional[Text]:
    word_node = _get_node_by_id(tree_node, token_id)
    if word_node is None:
        return None
    subtree_tokens = _collect_all_subtree_tokens(word_node)
    sorted_tokens_by_id = sorted(subtree_tokens, key=itemgetter("id"))
    return " ".join(token["form"] for token in sorted_tokens_by_id)
I think something may be broken when serializing (pickle-dumping) and loading back conllu-parsed sentences.
File "python3.9/site-packages/conllu/models.py", line 103, in extend
self.metadata.update(iterable.metadata)
AttributeError: 'TokenList' object has no attribute 'metadata'
Do you think writing a CoNLL-U writer would be easier starting from your current code, or should I just start one from scratch?
Can conllu be integrated with NLTK?
NLTK has a ConllCorpusReader class, but it currently doesn't support CoNLL-U.
How can we cite this library?
Consider the part of sentence:
17 Sarajewo Sarajewo PROPN NE Case=Dat|Gender=Neut|Number=Sing 15 nmod _ NamedEntity=Yes
18-19 zur _ _ _ _ _ _ _ _
18 zu zu ADP APPR _ 21 case _ _
19 der der DET ART Case=Dat|Definite=Def|Gender=Fem|Number=Sing|PronType=Art 21 det _ _
20 humanitären humanitär ADJ ADJA Case=Dat|Gender=Fem|Number=Sing 21 amod _ _
The TokenList rendition of this sentence returns wrong values. For example:
for x in range(17, 21): print(sentence[x])
returns the output
zur zu der humanitären
thereby messing up indexing during iteration and list comprehensions.
According to UD data (v2.9, example language: Akkadian, treebank: RIAO) as reproduced below,
# sent_id = Q006035-192
# text = narê aškun
1 narê narû NOUN N Gender=Masc|Number=Plur 2 obj _ {NA₄}NA.RU₂.A.MEŠ
2 aškun šakānu VERB V Gender=Com|Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin|VerbStem=G 0 root _ aš-ku-un
The misc column doesn't necessarily contain key-value pairs like SpaceAfter=No; here it holds bare transliteration values. When the block is read as a TokenTree and then re-serialised, it generates the following output. Notice the trailing = at the end:
# sent_id = Q006035-192
# text = narê aškun
1 narê narû NOUN N Gender=Masc|Number=Plur 2 obj _ {NA₄}NA.RU₂.A.MEŠ=
2 aškun šakānu VERB V Gender=Com|Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin|VerbStem=G 0 root _ aš-ku-un=
In such cases the misc column should not be treated as a set of key-value pairs, but rather as a list of values delimited by the pipe character |.
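One way to make bare MISC values round-trip is to parse them with a None marker and skip the "=" on the way back out. A sketch (both helper names are hypothetical, not conllu's own API):

```python
def parse_misc_values(value):
    # Items without "=" map to None, so bare values like
    # "{NA4}NA.RU2.A.MES" round-trip without gaining a trailing "=".
    misc = {}
    for part in value.split("|"):
        key, sep, val = part.partition("=")
        misc[key] = val if sep else None
    return misc

def serialize_misc_values(misc):
    # Emit "key=value" for real pairs, and the bare key otherwise.
    return "|".join(k if v is None else f"{k}={v}" for k, v in misc.items())
```

This also covers the Singleton|Key=Value case reported earlier: the singleton survives parsing and serializes back without an "=".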
This is probably not strictly speaking an issue with this library so feel free to close the issue.
While using ParZu I came across the issue that most outputs define multiple root nodes. Take the example sentence "Ich esse einen Apfel." ("I eat an apple")
ParZu's output for this looks something like this:
1 Ich ich PRO PPER 1|Sg|_|Nom 2 subj _ _
2 esse essen V VVFIN 1|Sg|Pres|_ 0 root _ _
3 einen eine ART ART Indef|Masc|Acc|Sg 4 det _ _
4 Apfel Apfel N NN Masc|Acc|Sg 2 obja _ _
5 . . $. $. _ 0 root _ _
Both the verb and the punctuation claim to be root.
I did write a quick workaround. It just creates a top-level ROOT node for the root nodes to attach to.
The patch is probably not fit for inclusion in its current state. If you have some specific ideas of how the problem of multiple roots like this could be handled, I'd be open to that. If not, that's fine as well; maybe this issue will help someone out some day ;)
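The workaround can be sketched as grouping token ids under their heads before building the tree, so that a single virtual ROOT entry (head 0) collects every token claiming to be root (a hypothetical helper over plain dicts, not the actual patch):

```python
def group_children_by_head(tokens):
    # Map each head id to the ids of its dependents. Every token with
    # head 0, e.g. both the verb and the final punctuation in the ParZu
    # output above, ends up under the single virtual ROOT entry 0, so
    # tree construction sees exactly one root.
    children = {}
    for tok in tokens:
        children.setdefault(tok["head"], []).append(tok["id"])
    return children
```

A tree builder can then start from children[0] and recurse, regardless of how many tokens the parser marked as root.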
I've got the following error in print_tree():
File "/usr/local/lib/python3.6/dist-packages/conllu/tree_helpers.py", line 14, in print_tree
assert isinstance(node, TreeNode), "node not TreeNode %s" % type(node)
AssertionError: node not TreeNode <class 'list'>
... on the following part of data:
data = """
1 Настоящий _ ADJ _ Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|fPOS=ADJ++ 3 amod _ _
2 Федеральный _ ADJ _ Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|fPOS=ADJ++ 3 amod _ _
3 закон _ NOUN _ Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|fPOS=NOUN++ 4 nsubj _ _
4 определяет _ VERB _ Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act|fPOS=VERB++ 0 ROOT _ _
5 особенности _ NOUN _ Animacy=Inan|Case=Acc|Gender=Fem|Number=Plur|fPOS=NOUN++ 4 dobj _ _
6 гражданско-правового _ ADJ _ Case=Gen|Degree=Pos|Gender=Neut|Number=Sing|fPOS=ADJ++ 7 amod _ _
7 положения _ NOUN _ Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing|fPOS=NOUN++ 5 nmod _ _
8 некоммерческих _ ADJ _ Case=Gen|Degree=Pos|Number=Plur|fPOS=ADJ++ 9 amod _ _
9 организаций _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur|fPOS=NOUN++ 7 nmod _ _
10 отдельных _ ADJ _ Case=Gen|Degree=Pos|Number=Plur|fPOS=ADJ++ 12 amod _ _
11 организационно-правовых _ ADJ _ Case=Gen|Degree=Pos|Number=Plur|fPOS=ADJ++ 12 amod _ _
12 форм _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur|fPOS=NOUN++ 9 nmod _ _
13 , _ PUNCT , fPOS=PUNCT++, 12 punct _ _
14 видов _ NOUN _ Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|fPOS=NOUN++ 12 conj _ _
15 и _ CONJ _ fPOS=CONJ++ 12 cc _ _
16 типов _ NOUN _ Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|fPOS=NOUN++ 12 conj _ _
17 , _ PUNCT , fPOS=PUNCT++, 16 punct _ _
18 а _ CONJ _ fPOS=CONJ++ 5 cc _ _
19 также _ ADV _ Degree=Pos|fPOS=ADV++ 20 advmod _ _
20 возможные _ ADJ _ Case=Nom|Degree=Pos|Number=Plur|fPOS=ADJ++ 21 amod _ _
21 формы _ NOUN _ Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur|fPOS=NOUN++ 5 conj _ _
22 поддержки _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing|fPOS=NOUN++ 21 nmod _ _
23 некоммерческих _ ADJ _ Case=Gen|Degree=Pos|Number=Plur|fPOS=ADJ++ 24 amod _ _
24 организаций _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur|fPOS=NOUN++ 22 dobj _ _
25 органами _ NOUN _ Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|fPOS=NOUN++ 5 conj _ _
26 государственной _ ADJ _ Case=Gen|Degree=Pos|Gender=Fem|Number=Sing|fPOS=ADJ++ 27 amod _ _
27 власти _ NOUN _ Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing|fPOS=NOUN++ 25 nmod _ _
28 и _ CONJ _ fPOS=CONJ++ 5 cc _ _
29 органами _ NOUN _ Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|fPOS=NOUN++ 5 conj _ _
30 местного _ ADJ _ Case=Gen|Degree=Pos|Gender=Neut|Number=Sing|fPOS=ADJ++ 31 amod _ _
31 самоуправления _ NOUN _ Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing|fPOS=NOUN++ 29 nmod _ _
32 . _ PUNCT . fPOS=PUNCT++. 34 punct _ _
33 ( _ PUNCT ( fPOS=PUNCT++( 34 punct _ _
34 п _ VERB _ Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing|fPOS=NOUN++ 4 parataxis _ _
35 . _ PUNCT . fPOS=PUNCT++. 4 punct _ _
"""
t = parse_tree(data)
print_tree(t)