ndt-tools

This repository provides NLP resources for Norwegian, based on the Norwegian Dependency Treebank (NDT): a data set split (training/dev/test) of the treebank, as well as PoS tagger models and syntactic parser models trained on the training data of the treebank.

Optimized PoS tag set

Hohle (2016) proposes a tag set optimized for syntactic dependency parsing of Norwegian, henceforth referred to as the optimized tag set. This tag set is based on the original tag set of NDT, with the addition of 20 PoS tags providing more fine-grained morphosyntactic information.

Data set split

This repository provides a data set split (training/dev/test) of NDT. This split follows the commonly used 80-10-10 split, where 80% of the data resides in the training data, 10% is used for testing during development and 10% is held-out and used for final evaluation. In the creation of this split, care was taken to preserve contiguous texts and to keep the split balanced in terms of genre.
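
As an illustration of the idea described above, the following sketch assigns whole documents to training/dev/test within each genre, so that texts stay contiguous and each genre is divided roughly 80-10-10. It is not the code of generate_split.py; the function name, the (doc_id, genre, n_sentences) input shape and the midpoint rule are assumptions made for this example.

```python
from collections import defaultdict


def split_by_genre(documents, train=0.8, dev=0.1):
    """Assign whole documents to training/dev/test, genre by genre.

    `documents` is a list of (doc_id, genre, n_sentences) tuples in corpus
    order. This input shape is hypothetical, chosen for illustration.
    """
    by_genre = defaultdict(list)
    for doc_id, genre, n_sentences in documents:
        by_genre[genre].append((doc_id, n_sentences))

    split = {"training": [], "dev": [], "test": []}
    for docs in by_genre.values():
        total = sum(n for _, n in docs)
        seen = 0
        for doc_id, n in docs:
            # A document is never cut in two: it goes to the part in which
            # its midpoint falls, measured in sentences within its genre.
            fraction = (seen + n / 2) / total
            if fraction < train:
                split["training"].append(doc_id)
            elif fraction < train + dev:
                split["dev"].append(doc_id)
            else:
                split["test"].append(doc_id)
            seen += n
    return split
```

Because documents are kept whole, the resulting proportions only approximate 80-10-10, and the deviation grows when a genre contains few or very long documents.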

Using the original tag set

  • training.conll contains the training data.
  • dev.conll contains the development data.
  • test.conll contains the test data.

Using the optimized tag set

  • training-optimized.conll contains the training data.
  • dev-optimized.conll contains the development data.
  • test-optimized.conll contains the test data.
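
The files listed above can be read with a few lines of Python. The sketch below assumes that they follow the tab-separated, ten-column CoNLL-X layout with a blank line between sentences; this is an assumption that should be checked against the actual files. The names Token, read_conll and SAMPLE are hypothetical, and the sample annotation is made up for illustration rather than taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    index: int
    form: str
    lemma: str
    pos: str
    feats: str
    head: int
    deprel: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence (assumes CoNLL-X columns)."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # Assumed columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL ...
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[4],
                              cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# Made-up two-sentence sample, for illustration only.
SAMPLE = "\n".join([
    "1\tDette\tdette\tpron\tpron\t_\t2\tSUBJ\t_\t_",
    "2\ter\tvære\tverb\tverb\t_\t0\tROOT\t_\t_",
    "",
    "1\tHei\thei\tinterj\tinterj\t_\t0\tROOT\t_\t_",
])

sentences = list(read_conll(SAMPLE.splitlines()))
```

To read one of the files instead of the sample, pass an open file handle, for example read_conll(open("training.conll", encoding="utf-8")).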

PoS tagger models and syntactic parser models

  • svmtool-tagger-model contains the model files for use with the SVMTool tagger, using the original tag set.
  • svmtool-optimized-tagger-model contains the model files for use with the SVMTool tagger, using the optimized tag set.
  • mate-parser-model contains the model file for use with the Mate parser, using the original tag set.
  • mate-optimized-parser-model contains the model file for use with the Mate parser, using the optimized tag set.

Installation

In the evaluation of PoS taggers and syntactic dependency parsers in Hohle (2016), I found that SVMTool was the best tagger and Mate the best parser on NDT.

Both tools must be downloaded separately. Please consult the documentation for SVMTool and Mate for details on how to install and run them.

Scripts

  • generate_split.py generates a data set split (training/dev/test) of the treebank, provided a path to the original treebank files.
  • map_tagset.py modifies the tag set of the treebank by adding the supplied morphological features, which must be present in the treebank, to the PoS tags.
  • tagging_error_analysis.py performs error analysis in terms of precision, recall and F score.
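
To make the measures used in the error analysis concrete, the sketch below computes precision, recall and F score per tag from two aligned tag sequences. It is a minimal illustration, not the implementation of tagging_error_analysis.py; the function name per_tag_scores and its input format are assumptions.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f_score)} for aligned tag lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted tag was assigned wrongly
            fn[g] += 1  # the gold tag was missed

    scores = {}
    for tag in sorted(set(gold) | set(predicted)):
        assigned = tp[tag] + fp[tag]
        expected = tp[tag] + fn[tag]
        precision = tp[tag] / assigned if assigned else 0.0
        recall = tp[tag] / expected if expected else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f_score)
    return scores
```

A tag that is never predicted gets a precision of 0.0 here instead of being left undefined; this is a choice made for the sketch.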

References

Please cite the following paper if you use the data sets in academic works:

Hohle, P., Velldal, E., Øvrelid, L. (2017). Optimizing a PoS Tagset for Norwegian Dependency Parsing. In Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 142-151). Gothenburg, Sweden.

Please cite the following thesis if you use the models or scripts in academic works:

Hohle, P. (2016). Optimizing a PoS Tag Set for Norwegian Dependency Parsing (Master's thesis). University of Oslo, Oslo, Norway.
