
ud_italian-isdt's Introduction

Summary

The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

Introduction

ISDT is a resource annotated according to the Stanford dependencies scheme (de Marneffe et al. 2008, 2013a, 2013b, 2014), obtained through a semi-automatic conversion process starting from MIDT (the Merged Italian Dependency Treebank). MIDT, in turn, is the result of an earlier effort to improve the interoperability of data sets available for Italian by harmonizing and merging two existing dependency-based resources that differ in both corpus composition and annotation scheme, namely:

  • TUT, the Turin University Treebank (Bosco et al. 2000);
  • ISST-TANL, first released as ISST-CoNLL for the CoNLL-2007 shared task (Montemagni, Simi 2007), which was developed as a joint effort by the Istituto di Linguistica Computazionale (ILC-CNR) and the University of Pisa and originates from the Italian Syntactic-Semantic Treebank (ISST, Montemagni et al. 2003).

The harmonization and conversion process leading to MIDT is discussed in detail in (Bosco, Montemagni, Simi, 2012). The Stanford annotation scheme, obtained from an enriched version of MIDT, was adapted to the specific features of the Italian language; see (Bosco, Montemagni, Simi, 2013 and 2014) for a discussion.

Acknowledgments

We wish to thank all of the contributors to the original annotation efforts, as well as the supporting organizations, i.e., the Institute for Computational Linguistics "A. Zampolli", the University of Pisa, and the University of Torino. Thanks go to Chiara Alzetta and Giulia Venturi for their work in defining the error detection methodology and for the manual revision and correction of automatically identified errors in Version 2.1.

Main contributors

  • Cristina Bosco - Università di Torino, Dipartimento di Informatica
  • Alessandro Lenci - Università di Pisa, Dipartimento di Filologia, Letteratura, Linguistica
  • Simonetta Montemagni - Istituto di Linguistica Computazionale A. Zampolli, CNR, Pisa
  • Maria Simi - Università di Pisa, Dipartimento di Informatica

Corpus composition

| Original format | Source | Genre | Size in tokens | Size in sentences |
|---|---|---|---|---|
| TUT-CONLL | Evalita 2011 Dependency parsing | Legal texts, news articles, Wikipedia articles | 101,309 | 3,842 |
| ISST-TANL | Evalita 2011 Domain adaptation task | Newspaper articles | 80,967 | 4,135 |
| ISST-TANL | SPLeT 2012 | Legal texts: European directives | 6,166 | 260 |
| MIDT | Several QA competitions | Questions | 20,680 | 2,228 |
| MIDT | Evalita 2014 Dependency parsing: test data set (partial) | News articles | 7,618 | 304 |
| TUT-CONLL | Parallel TUT (Italian part) | Various genres | 55,942 | 2,131 |
| UD | Due Parole | Simplified Italian news | 24,977 | 1,421 |
| UD2 | New data | Various sentences | 2,504 | 150 |

Sentence ids explicitly mark the source of each sentence.

Corpus splitting

The Corpus (14,167 sentences; 278,429 tokens; 298,344 words) has been randomly split as follows:

  • it-ud-train.conllu: 257616 tokens (13121 sentences)
  • it-ud-dev.conllu: 11133 tokens (564 sentences)
  • it-ud-test.conllu: 9680 tokens (482 sentences)
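
The token, word and sentence figures above can be recomputed from the `.conllu` files. The following is a minimal sketch, assuming the usual CoNLL-U conventions: multiword tokens appear as range lines (such as `5-6` for "nel") that count as one token, while the numbered lines they cover count as words. The function name `conllu_stats` and the sample data are illustrative and not part of the treebank tooling.

```python
def conllu_stats(text):
    """Return (sentences, tokens, words) for CoNLL-U formatted text."""
    sentences = tokens = words = 0
    covered_until = 0   # last word id covered by the current multiword token
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:                 # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):     # comment lines such as sent_id or text
            continue
        if not in_sentence:
            in_sentence = True
            sentences += 1
        token_id = line.split(maxsplit=1)[0]
        if "." in token_id:          # empty node of the enhanced graph
            continue
        if "-" in token_id:          # multiword token range, e.g. 5-6
            tokens += 1
            covered_until = int(token_id.split("-")[1])
        else:
            words += 1
            if int(token_id) > covered_until:
                tokens += 1          # a word that is also a surface token
    return sentences, tokens, words


# Hypothetical three-token fragment: "nel" is split into "in" + "il".
SAMPLE = "\n".join([
    "# text = nel teatro.",
    "1-2\tnel\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tin\tin\tADP\tE\t_\t3\tcase\t_\t_",
    "2\til\til\tDET\tRD\t_\t3\tdet\t_\t_",
    "3\tteatro\tteatro\tNOUN\tS\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\tFS\t_\t3\tpunct\t_\t_",
    "",
])

if __name__ == "__main__":
    print(conllu_stats(SAMPLE))
```

Running the function on the text of each split file should give counts comparable to those listed above, although the exact figures depend on the release being checked.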

Changelog

  • 2020-11-01 v2.6

    • a few errors corrected
  • 2019-05-01 v2.4

    • validated with stricter validation script
  • 2018-11-01 v2.3

    • added enhanced dependencies
  • 2018-04-01 v2.2

    • Repository renamed from UD_Italian to UD_Italian-ISDT.
    • Additional corrections of 1340 arcs, specifically:
      • 525 arcs retrieved with the methodology already used in the previous release, applied to the rest of the treebank;
      • 815 non-projective arcs were also corrected.
    • Added to the train set a new section of 2Parole, a newspaper of simplified Italian texts (283 sentences, 4985 tokens)
  • 2017-11-01 v2.1

    • Corrected 786 dependency errors distributed across 567 sentences:
      • Auxiliary verbs erroneously treated as head of a dependency relation
      • Bare past participles functioning as adjectival modifiers of nouns erroneously annotated as clausal modifiers
      • Adjectives functioning as secondary predicates erroneously annotated as adjectival modifiers
      • Coordinating conjunctions erroneously headed by the first conjunct
      • Oblique nominal arguments erroneously annotated as nominal modifiers
      • Nonfinite verbs functioning as nominals erroneously annotated as oblique nominals
    • Consistency in the treatment of fixed multi-word expressions has been checked and improved.
  • 2017-02-15 v2.0

    • Changes to comply with V2.
    • Splitting revised to comply with shared task.
  • 2016-11-01 v1.4

    • Complete revision of the treatment of clitic pronouns
    • Added dependency subtype expl:pass, used in passive constructions
    • Added a new collection of texts from 2Parole, a newspaper of simplified Italian texts (25995 tokens)
  • 2016-05-01 v1.3

    • Added feature value PronType=Ord for ordinal pronouns
    • Added feature value PronType=Predet for predeterminers
    • Added feature value NumType=Range
    • Added feature value NumType=Gen
    • Added sentence full text as comment
    • Added SpaceAfter=No, needed for recovering original text
    • Fixed errors found running content validation queries
  • 2015-11-01 v1.2

    • Added dependencies expl:impers as specialization of expl for impersonal clitic pronouns
    • Fixed case in articulated preposition, previously lost during splitting
    • More fixes to xcomp/ccomp distinction
    • Harmonization of case marking for infinitive verbs introduced by articles
    • Harmonization of Light Verb constructions
    • Eliminated duplicated sentences and overlaps between train/dev and train/test
    • Added short sentences to train
  • 2015-05-15 v1.1

    • Added Italian section of ParTUT (71645 tokens)
    • Checked SYM
    • Checked X
    • Added more negation adverbs
    • Eliminated Gender=Com and Number=Com
    • Eliminated Negation=Neg
    • Added language specific feature PronType=Clit
    • Changed 'case' into 'mark' for 'xcomp'
    • Fixed xcomp/ccomp distinction
    • Checked dependencies marked 'dep', and resolved most of them

References

=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.0
License: CC BY-NC-SA 3.0
Includes text: yes
Genre: legal news wiki
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Bosco, Cristina; Lenci, Alessandro; Montemagni, Simonetta; Simi, Maria
Contributing: elsewhere
Contact: [email protected]
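
The metadata block is meant to be read by scripts. As a minimal sketch, assuming the block is laid out with one `Key: value` pair per line under a header line that starts with `=`, it can be turned into a dictionary as follows. The function name `parse_metadata` and the shortened sample are illustrative only.

```python
def parse_metadata(block):
    """Parse 'Key: value' lines into a dict, skipping '===' header lines."""
    meta = {}
    for line in block.splitlines():
        line = line.strip()
        if not line or line.startswith("="):
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


# Shortened copy of the block, for illustration.
SAMPLE = """\
=== Machine-readable metadata (DO NOT REMOVE!) ===
Data available since: UD v1.0
License: CC BY-NC-SA 3.0
Genre: legal news wiki
XPOS: manual native
"""

if __name__ == "__main__":
    meta = parse_metadata(SAMPLE)
    print(meta["License"])
    print(meta["Genre"].split())
```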

ud_italian-isdt's Issues

Sentence where the period should be split

I believe the period at the end of the following train sentence should be split into a separate word. I can do that quite easily if that is correct.

# sent_id = 2Parole_2-176
# text = I soldati sono entrati nel teatro e hanno ucciso tutti i ceceni con un gas velenoso.
1       I       il      DET     RD      Definite=Def|Gender=Masc|Number=Plur|PronType=Art       2       det     2:det   _
2       soldati soldato NOUN    S       Gender=Masc|Number=Plur 4       nsubj   4:nsubj|10:nsubj        _
3       sono    essere  AUX     VA      Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   4       aux     4:aux   _
4       entrati entrare VERB    V       Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part        0       root    0:root  _
5-6     nel     _       _       _       _       _       _       _       _
5       in      in      ADP     E       _       7       case    7:case  _
6       il      il      DET     RD      Definite=Def|Gender=Masc|Number=Sing|PronType=Art       7       det     7:det   _
7       teatro  teatro  NOUN    S       Gender=Masc|Number=Sing 4       obl     4:obl:in        _
8       e       e       CCONJ   CC      _       10      cc      10:cc   _
9       hanno   avere   AUX     VA      Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   10      aux     10:aux  _
10      ucciso  uccidere        VERB    V       Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part        4       conj    4:conj:e        _
11      tutti   tutto   DET     T       Gender=Masc|Number=Plur|PronType=Tot    13      det:predet      13:det:predet   _
12      i       il      DET     RD      Definite=Def|Gender=Masc|Number=Plur|PronType=Art       13      det     13:det  _
13      ceceni  ceceni  NOUN    S       Gender=Masc|Number=Plur 10      obj     10:obj  _
14      con     con     ADP     E       _       16      case    16:case _
15      un      uno     DET     RI      Definite=Ind|Gender=Masc|Number=Sing|PronType=Art       16      det     16:det  _
16      gas     gas     NOUN    S       Gender=Masc     10      obl     10:obl:con      _
17      velenoso.       velenoso.       ADJ     A       Gender=Masc|Number=Sing 16      amod    16:amod _
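
Other cases of this kind could be located with a small script. The sketch below lists non-punctuation words whose form ends in a punctuation mark. It assumes that fields contain no internal spaces, so that splitting on whitespace recovers the columns, and the function name `find_fused_punctuation` is illustrative. Legitimate abbreviations that end in a period would also be reported, so the output is a list of candidates for manual review rather than a list of errors.

```python
TRAILING_MARKS = ".,;:!?"


def find_fused_punctuation(text):
    """Return (sent_id, word_id, form) for words ending in punctuation."""
    hits = []
    sent_id = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Skip short lines, multiword token ranges (5-6) and empty nodes (8.1).
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        word_id, form, upos = cols[0], cols[1], cols[3]
        if upos != "PUNCT" and len(form) > 1 and form[-1] in TRAILING_MARKS:
            hits.append((sent_id, word_id, form))
    return hits


# Two lines taken from the sentence quoted above.
SAMPLE = """\
# sent_id = 2Parole_2-176
16      gas     gas     NOUN    S       Gender=Masc     10      obl     10:obl:con      _
17      velenoso.       velenoso.       ADJ     A       Gender=Masc|Number=Sing 16      amod    16:amod _
"""

if __name__ == "__main__":
    for hit in find_fused_punctuation(SAMPLE):
        print(hit)
```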

xpos and upos meaning

Hi,

I would like to know where I can find the meaning of each acronym in the tagsets used for the upos and xpos fields.

thank you

Fine Grain POS tags documentation

Hi! @dan-zeman @msimi @fginter

I am using spaCy's Italian model for research, and from spaCy's documentation I see that it has been trained on this dataset.

Could you please provide some clarification (or documentation) on the meaning of the following fine-grain POS tags?

A, AP, B, BN, B_PC, CC, CS, DD, DE, DI, DQ, DR, E, E_RD, FB, FC, FF, FS, I, N, NO, PART, PC, PC_PC, PD, PE, PI, PP, PQ, PR, RD, RI, S, SP, SW, SYM, Sw, T, V, VA, VA_PC, VM, VM_PC, VM_PC_PC, V_B, V_PC, V_PC_PC, X

Inspecting the dataset I can see they are used to tag words but I cannot extract their meaning.

I tried to use spaCy's explain() method, but unfortunately it only works with the English and German models.

It would really help me to have something like the following explanation of the Universal POS tags (which are coarse-grained).

ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary
CCONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other

Thank you very much in advance!
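
Until such documentation is located, one way to get a rough idea of what each fine-grained tag covers is to tabulate which universal tag it co-occurs with in the treebank. The sketch below is only an aid for exploration, not a definition of the tagset. It assumes that XPOS is the fifth column and UPOS the fourth, as in the CoNLL-U lines quoted in these issues, and that fields contain no internal spaces. The function name `xpos_to_upos` is illustrative.

```python
from collections import Counter, defaultdict


def xpos_to_upos(text):
    """Map each XPOS tag to a Counter of the UPOS tags it appears with."""
    table = defaultdict(Counter)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Skip short lines, multiword token ranges and empty nodes.
        if len(cols) < 5 or not cols[0].isdigit():
            continue
        upos, xpos = cols[3], cols[4]
        table[xpos][upos] += 1
    return table


# A few lines from sentence 2Parole_2-176, with the FEATS column shortened to "_".
SAMPLE = """\
1       I       il      DET     RD      _       2       det     2:det   _
2       soldati soldato NOUN    S       _       4       nsubj   4:nsubj _
3       sono    essere  AUX     VA      _       4       aux     4:aux   _
5-6     nel     _       _       _       _       _       _       _       _
5       in      in      ADP     E       _       7       case    7:case  _
6       il      il      DET     RD      _       7       det     7:det   _
"""

if __name__ == "__main__":
    for xpos, upos_counts in sorted(xpos_to_upos(SAMPLE).items()):
        print(xpos, dict(upos_counts))
```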

Stray xpos tag: "Sw"

sent_id = 2Parole_4-180
word 19

The tag here is Sw instead of SW:

17      film    film    NOUN    S       Gender=Masc     14      nmod    14:nmod:di      _
18      Les     Les     PROPN   SW      Foreign=Yes     17      nmod    17:nmod _
19      invasions       invasions       PROPN   Sw      _       18      flat:name       18:flat:name    _
20      barbares        barbares        PROPN   SW      Foreign=Yes     18      flat:name       18:flat:name    SpaceAfter=No
21      .       .       PUNCT   FS      _       5       punct   5:punct _
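
Stray tags of this kind can be searched for automatically. The sketch below counts XPOS values and reports groups of tags that differ only in letter case. It assumes that XPOS is the fifth column and that fields contain no internal spaces, and the function name `case_variant_xpos` is illustrative.

```python
from collections import Counter, defaultdict


def case_variant_xpos(text):
    """Return {UPPERCASED_TAG: {tag: count}} for tags differing only by case."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Skip short lines, multiword token ranges and empty nodes.
        if len(cols) < 5 or not cols[0].isdigit():
            continue
        counts[cols[4]] += 1
    groups = defaultdict(dict)
    for tag, n in counts.items():
        groups[tag.upper()][tag] = n
    # Keep only the groups with more than one spelling.
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# The lines quoted above from sentence 2Parole_4-180.
SAMPLE = """\
17      film    film    NOUN    S       Gender=Masc     14      nmod    14:nmod:di      _
18      Les     Les     PROPN   SW      Foreign=Yes     17      nmod    17:nmod _
19      invasions       invasions       PROPN   Sw      _       18      flat:name       18:flat:name    _
20      barbares        barbares        PROPN   SW      Foreign=Yes     18      flat:name       18:flat:name    SpaceAfter=No
21      .       .       PUNCT   FS      _       5       punct   5:punct _
"""

if __name__ == "__main__":
    print(case_variant_xpos(SAMPLE))
```

In a report like this, the spelling with the lower count in each group is the likelier typo, but that is a heuristic and each case should still be checked by hand.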
