ndt-tools

This repository provides NLP resources for Norwegian, based on the Norwegian Dependency Treebank (NDT): a data set split (training/dev/test) of the treebank, as well as PoS tagger models and syntactic parser models trained on the training data of that split.

Optimized PoS tag set

Hohle (2016) proposes a tag set optimized for syntactic dependency parsing of Norwegian, hereafter referred to as the optimized tag set. It is based on the original NDT tag set and adds 20 PoS tags that provide more fine-grained morphosyntactic information.
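
As a rough illustration of how a coarse tag can be made more fine-grained, the sketch below appends selected morphological features to a PoS tag. It is not the procedure from Hohle (2016): the function name refine_tag, the FEATURES_BY_POS table, the "_" joiner and the example tags and features are all hypothetical and chosen only to show the idea.

```python
# Hypothetical table: for each coarse tag, the morphological features
# that should become part of the tag. Not the actual optimized tag set.
FEATURES_BY_POS = {"verb": {"inf", "pres", "pret"}}


def refine_tag(pos, feats, features_by_pos):
    """Return `pos` extended with the wanted features found in `feats`.

    `feats` is a "|"-separated feature string, with "_" meaning "no features".
    """
    wanted = features_by_pos.get(pos, set())
    kept = [feature for feature in feats.split("|") if feature in wanted]
    return "_".join([pos] + kept)


if __name__ == "__main__":
    print(refine_tag("verb", "pres", FEATURES_BY_POS))
    print(refine_tag("subst", "appell|mask", FEATURES_BY_POS))
```

Tags without an entry in the table are returned unchanged, so the refinement can be introduced one tag at a time.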

Data set split

This repository provides a data set split (training/dev/test) of NDT. The split follows the commonly used 80-10-10 scheme: 80% of the data is used for training, 10% for testing during development, and 10% is held out for final evaluation. In creating the split, care was taken to preserve contiguous texts and to keep the split balanced in terms of genre.
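
One way to obtain such a split is sketched below. This is not the code in generate_split.py. It assumes the treebank has already been grouped into documents, each a (genre, sentences) pair. Whole documents are assigned to a part so that texts stay contiguous, and the 80-10-10 proportions are applied within each genre, measured in sentences. The function name and the input layout are assumptions made for the example.

```python
from collections import defaultdict


def split_documents(docs, percentages=(80, 10, 10)):
    """Assign whole documents to training/dev/test, genre by genre.

    `docs` is a list of (genre, sentences) pairs in corpus order.
    Integer percentages avoid floating-point rounding at the boundaries.
    """
    train_end = percentages[0]
    dev_end = percentages[0] + percentages[1]
    split = {"training": [], "dev": [], "test": []}

    by_genre = defaultdict(list)
    for genre, sentences in docs:
        by_genre[genre].append((genre, sentences))

    for genre_docs in by_genre.values():
        total = sum(len(sentences) for _, sentences in genre_docs)
        seen = 0  # sentences of this genre assigned so far
        for doc in genre_docs:
            if seen * 100 < total * train_end:
                part = "training"
            elif seen * 100 < total * dev_end:
                part = "dev"
            else:
                part = "test"
            split[part].append(doc)
            seen += len(doc[1])
    return split


if __name__ == "__main__":
    # Ten toy documents of one genre, ten sentences each.
    toy = [("news", ["sentence"] * 10) for _ in range(10)]
    parts = split_documents(toy)
    print({name: len(documents) for name, documents in parts.items()})
```

Because documents are never cut, the resulting proportions are only approximately 80-10-10 when document lengths vary.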

Using the original tag set

training.conll contains the training data.
dev.conll contains the development data.
test.conll contains the test data.

Using the optimized tag set

training-optimized.conll contains the training data.
dev-optimized.conll contains the development data.
test-optimized.conll contains the test data.
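
The .conll files can be read with a few lines of Python. The sketch below assumes the CoNLL-X layout: one token per line, ten tab-separated columns, and a blank line between sentences. Check the files themselves before relying on the column positions. The function name read_conll and the sample sentences, including their tags and dependency labels, are made up for illustration.

```python
def read_conll(lines):
    """Yield sentences as lists of token rows (each row a list of columns)."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:  # last sentence if the input lacks a trailing blank line
        yield sentence


# Made-up sample in the assumed column order:
# ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
SAMPLE = (
    "1\tHun\thun\tpron\tpron\t_\t2\tSUBJ\t_\t_\n"
    "2\tsover\tsove\tverb\tverb\tpres\t0\tFINV\t_\t_\n"
    "\n"
    "1\tBra\tbra\tadj\tadj\t_\t0\tFRAG\t_\t_\n"
)

if __name__ == "__main__":
    for tokens in read_conll(SAMPLE.splitlines()):
        print([(row[1], row[4]) for row in tokens])  # (form, PoS tag) pairs
```

To read one of the files above, pass an open file object instead of the sample, for example open("training.conll", encoding="utf-8").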

PoS tagger models and syntactic parser models

svmtool-tagger-model contains the model files for use with the SVMTool tagger, using the original tag set.
svmtool-optimized-tagger-model contains the model files for use with the SVMTool tagger, using the optimized tag set.
mate-parser-model contains the model file for use with the Mate parser, using the original tag set.
mate-optimized-parser-model contains the model file for use with the Mate parser, using the optimized tag set.

Installation

In the evaluation of PoS taggers and syntactic dependency parsers in Hohle (2016), I found that SVMTool was the best tagger and Mate the best parser on NDT.

Please consult the documentation for SVMTool and Mate for details on how to install and run these tools once they are downloaded.

Scripts

generate_split.py generates a data set split (training/dev/test) of the treebank, provided a path to the original treebank files.
map_tagset.py maps the tag set of the treebank to a modified tag set by introducing the supplied morphological features that are present in the treebank.
tagging_error_analysis.py performs error analysis in terms of precision, recall and F score.
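
For readers unfamiliar with the metrics, the sketch below shows how per-tag precision, recall and F score can be computed from two aligned tag sequences. It is not taken from tagging_error_analysis.py. The function name and the example tags are hypothetical, and the F score here is the balanced F1.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f1)} for aligned tag sequences."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for gold_tag, predicted_tag in zip(gold, predicted):
        if gold_tag == predicted_tag:
            true_pos[gold_tag] += 1
        else:
            false_pos[predicted_tag] += 1  # predicted tag was wrongly assigned
            false_neg[gold_tag] += 1       # gold tag was missed

    scores = {}
    for tag in set(gold) | set(predicted):
        assigned = true_pos[tag] + false_pos[tag]
        expected = true_pos[tag] + false_neg[tag]
        precision = true_pos[tag] / assigned if assigned else 0.0
        recall = true_pos[tag] / expected if expected else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    gold = ["subst", "verb", "subst", "adj"]
    predicted = ["subst", "subst", "subst", "adj"]
    for tag, values in sorted(per_tag_scores(gold, predicted).items()):
        print(tag, values)
```

Tags that are never predicted or never occur in the gold data get a score of 0.0 rather than raising a division error.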

References

Please cite the following paper if you use the data sets in academic works:

Hohle, P., Velldal, E., Øvrelid, L. (2017). Optimizing a PoS Tagset for Norwegian Dependency Parsing. In Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 142-151). Gothenburg, Sweden.

Please cite the following thesis if you use the models or scripts in academic works:

Hohle, P. (2016). Optimizing a PoS Tag Set for Norwegian Dependency Parsing (Master's thesis). University of Oslo, Oslo, Norway.
