ud_estonian-ewt's Introduction

Summary

The UD Estonian EWT treebank consists of texts from different genres of new media. It contains 4,493 trees and 56,399 tokens.

Introduction

Estonian Web Treebank UD v2.7 consists of three parts. The oldest part (1,662 trees, v2.4) is a converted version of the Estonian Web Treebank (EWT), which was originally annotated in the Constraint Grammar (CG) annotation scheme and covers different genres of new media. The second part (1,495 trees, v2.6) consists of internet forum texts; it was annotated with the Stanza parser and then manually post-edited. The third part (v2.7) was annotated in the same way. It consists of user feedback on news about the COVID-19 pandemic in March 2020 (~9,400 tokens).

The treebank consists of 4,493 trees and 56,399 tokens. As for enhanced dependencies, empty nodes for missing predicates have been added, but this version contains no other types of enhanced dependencies.

The treebank has been divided into train, test, and dev parts of 34,287, 13,156, and 8,956 tokens, respectively.
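The figures above can be checked against the data files. The following is a minimal sketch, not an official UD tool: it assumes the standard CoNLL-U layout (tab-separated columns, `#` comment lines, blank lines between sentences, IDs such as `3-4` for multiword ranges and `3.1` for empty nodes). The function name `count_conllu` and the toy sample are hypothetical, and whether the reported token count corresponds exactly to the `words` counter here has not been verified.

```python
from collections import Counter

# Split sizes in tokens, as stated in this README.
SPLIT = {"train": 34287, "test": 13156, "dev": 8956}
assert sum(SPLIT.values()) == 56399  # matches the reported total

def count_conllu(text):
    """Count trees, words, multiword ranges, and empty nodes in CoNLL-U text."""
    counts = Counter()
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        if not in_sentence:           # first token line of a new sentence
            counts["trees"] += 1
            in_sentence = True
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:           # multiword token range, e.g. "3-4"
            counts["multiword_ranges"] += 1
        elif "." in token_id:         # empty node, e.g. "3.1"
            counts["empty_nodes"] += 1
        else:                         # ordinary syntactic word
            counts["words"] += 1
    return counts

# Toy data for illustration only; it is not taken from the treebank.
SAMPLE = (
    "# sent_id = toy-1\n"
    "1\tA\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "1.1\t_\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tB\t_\t_\t_\t_\t1\tdep\t_\t_\n"
    "\n"
    "# sent_id = toy-2\n"
    "1\tC\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(dict(count_conllu(SAMPLE)))
```

To run it on real data, read a `.conllu` file from the repository as UTF-8 text and pass the string to `count_conllu`.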

The treebank covers unedited new media texts.

Acknowledgments

We wish to thank the developers of the Udapi, UD Annotatrix, and ConlluEditor tools.

This work was financed by the National Programme for Estonian Language Technology and the Estonian Ministry of Education and Research (grant 20-56 IUT20-56 "Computational models for Estonian").

References

  • Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Eleri Aedmaa, Riin Kirt, Dage Särg. 2014. Estonian Dependency Treebank and its annotation scheme. In: Proceedings of the 13th Workshop on Treebanks and Linguistic Theories (TLT13), pp. 285–291, ISBN 978-3-9809183-9-8, Tübingen, Germany.
  • Kadri Muischnek, Kaili Müürisep, Dage Särg. 2019. CG Roots of UD Treebank of Estonian Web Language. In: Proceedings of the NoDaLiDa 2019 Workshop on Constraint Grammar - Methods, Tools and Applications, pp. 23–26, Turku, Finland.

Changelog

  • UD v2.7: new texts, extra annotation for typos, better tokenization and sentence segmentation.
  • UD v2.6: new internet forum texts (~15,000 tokens), 0-nodes in clauses.
  • UD v2.4: automatic conversion from CG, manual reannotation.
=== Machine-readable metadata =================================================
Documentation status: stub
Data source: semi-automatic
Data available since: UD v2.4
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: blog web social
Lemmas: converted from manual
UPOS: converted from manual
XPOS: converted from manual
Features: converted from manual
Relations: converted from manual
Contributing: here
Contributors: Muischnek, Kadri; Müürisep, Kaili; Puolakainen, Tiina; Särg, Dage
Contact: [email protected], [email protected]
===============================================================================
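The metadata block above is a list of `Key: value` lines between two `===` rules, so it can be read with a few lines of code. The sketch below is one possible reader, not the validation script used by the UD project; the function name `parse_metadata` and the shortened sample string are assumptions made for illustration.

```python
def parse_metadata(readme_text):
    """Return the 'Key: value' pairs of the machine-readable metadata block."""
    meta = {}
    inside = False
    for line in readme_text.splitlines():
        if line.startswith("==="):
            if inside:                # closing rule: stop reading
                break
            # The opening rule carries the block title.
            inside = "Machine-readable metadata" in line
            continue
        if inside and ":" in line:
            key, value = line.split(":", 1)   # split on the first colon only
            meta[key.strip()] = value.strip()
    return meta

# Shortened sample built from lines of the block above.
SAMPLE_README = """Changelog
=== Machine-readable metadata ===
Data available since: UD v2.4
License: CC BY-NC-SA 4.0
Genre: blog web social
=================================
"""

if __name__ == "__main__":
    meta = parse_metadata(SAMPLE_README)
    print(meta["License"])
    print(meta["Genre"].split())
```

Splitting on the first colon only keeps values that themselves contain a colon intact.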
