

Indonesian Part-of-Speech (POS) Tagging

This repository contains the implementation of our paper:

Kurniawan, K., & Aji, A. F. (2018). Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging. 2018 International Conference on Asian Language Processing (IALP), 303–307. https://doi.org/10.1109/IALP.2018.8629236

Requirements

Make sure you have the conda package manager installed. Then, create a conda virtual environment with

$ conda env create -f environment.yml

The command will create a virtual environment named id-pos-tagging and also install all the required packages. Once it is done, activate the virtual environment to get started.

$ source activate id-pos-tagging

Dataset

The dataset is available in data/dataset.tar.gz. Decompress this file and you will have train.X.txt, dev.X.txt, and test.X.txt files for all 5 folds, where X is the fold number. Each file contains the indices of the sentences in the original corpus. To obtain the sentences, you must first download the IDN Tagged Corpus. Then, run

$ ./splits2tsv.py data Indonesian_Manually_Tagged_Corpus.tsv

where data is the directory containing the {train,dev,test}.X.txt files. The sentences will then be available in data/{train,dev,test}.X.tsv files.
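The index-to-sentence lookup behind this step can be pictured with the minimal sketch below. It is not the code of splits2tsv.py: the function names, the one-index-per-line file layout, and the zero-based indexing are all assumptions made for illustration, and the toy corpus stands in for the real one.

```python
from typing import List, Sequence


def parse_indices(text: str) -> List[int]:
    """Parse the content of a split file, assuming one sentence index per line."""
    return [int(line) for line in text.splitlines() if line.strip()]


def select_sentences(corpus: Sequence[str], indices: Sequence[int]) -> List[str]:
    """Return the corpus sentences at the given positions (zero-based, assumed)."""
    return [corpus[i] for i in indices]


# Toy corpus standing in for the sentences of the real tagged corpus.
corpus = ["sentence A", "sentence B", "sentence C", "sentence D"]

# Hypothetical content of a dev.X.txt file.
dev_indices = parse_indices("2\n0\n")
dev_sentences = select_sentences(corpus, dev_indices)
print(dev_sentences)
```

Because each split stores only indices, the same corpus file can serve all 5 folds without duplicating the sentences.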

Running experiments

Scripts to run our models are prefixed with run_. For example, to run the CRF model, use the run_crf.py script. All scripts use Sacred to manage the experiment configuration and results. The sections below explain the scripts in more detail, using run_crf.py as the example.

Training

A minimal command to train a model is

$ ./run_crf.py train with corpus.train=train.01.tsv

This will train a CRF model on the given training corpus and save it to a file named model in the current directory. There are many configuration options that can be set, all of which can be listed with

$ ./run_crf.py print_config

The command above will show all configuration options for the script, including those that might be needed for commands other than train. The print_config command is available for the other run_*.py scripts as well.
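To see how an argument such as corpus.train=train.01.tsv after the with keyword addresses a nested configuration entry, here is a minimal sketch of dotted-key updates on a plain dictionary. It is an illustration only, not Sacred's implementation: the apply_updates helper and the default configuration are hypothetical, and values are kept as plain strings for simplicity.

```python
import copy
from typing import Any, Dict, List


def apply_updates(config: Dict[str, Any], updates: List[str]) -> Dict[str, Any]:
    """Apply 'dotted.key=value' updates to a nested dict and return a new dict."""
    result = copy.deepcopy(config)  # leave the defaults untouched
    for update in updates:
        key, value = update.split("=", 1)
        *parents, leaf = key.split(".")
        node = result
        for name in parents:
            # Descend into the nested dict, creating it if it does not exist.
            node = node.setdefault(name, {})
        node[leaf] = value  # stored as a string in this sketch
    return result


# Hypothetical defaults; the real ones are shown by print_config.
defaults = {"corpus": {"train": None}}
config = apply_updates(defaults, ["corpus.train=train.01.tsv"])
print(config)
```

The dot in corpus.train selects the train entry inside the corpus group, which is why related options appear grouped in the print_config output.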

To make reproduction easier, the best configurations reported in the paper are provided as named configurations called tuned_on_foldX, where X is the fold number. For instance, to get our result on fold 1, run

$ ./run_crf.py train with tuned_on_fold1 corpus.train=train.01.tsv

These named configurations are also available for run_memo.py and run_neural.py.
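To reproduce every fold in one go, the training command can be generated per fold, as in the sketch below. It assumes that folds 2 to 5 follow the same naming pattern as fold 1 (tuned_on_foldX and a zero-padded train.0X.tsv). The train_command helper is hypothetical and not part of the repository.

```python
from typing import List


def train_command(script: str, fold: int) -> List[str]:
    """Build the training command for one fold (naming pattern assumed from fold 1)."""
    return [
        script,
        "train",
        "with",
        f"tuned_on_fold{fold}",
        f"corpus.train=train.{fold:02d}.tsv",
    ]


# One command per fold, for the 5 folds of the dataset.
commands = [train_command("./run_crf.py", fold) for fold in range(1, 6)]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) to launch the run. Note that every run saves its model in the current directory, so models from successive folds may overwrite each other unless they are moved in between.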

Evaluation and prediction

To evaluate or predict, use the evaluate and predict commands, respectively. The available configuration options are the same as those for training.

Observing experiments

Sacred allows us to save experiment runs to a MongoDB database. To enable this for our scripts, simply set SACRED_MONGO_URL and SACRED_DB_NAME to point to your MongoDB instance. Once this is done, an experiment run will be saved every time you run any of the run_*.py scripts.
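The check that decides whether runs are saved might look roughly like the sketch below. It assumes that SACRED_MONGO_URL and SACRED_DB_NAME are read as environment variables and that both must be set. The mongo_settings helper is hypothetical and is not taken from the scripts.

```python
import os
from typing import Mapping, Optional, Tuple


def mongo_settings(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (url, db_name) if both settings are present and non-empty, else None."""
    url = env.get("SACRED_MONGO_URL")
    db_name = env.get("SACRED_DB_NAME")
    if url and db_name:
        return url, db_name
    return None


# Assumption: the settings come from the process environment.
settings = mongo_settings(os.environ)
print("MongoDB saving enabled" if settings else "MongoDB saving disabled")
```

With such a check, leaving either setting unset would simply run the experiment without saving it to a database.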

License

MIT

Citation

If you use our work, please cite:

@inproceedings{kurniawan2018,
  place={Bandung, Indonesia},
  title={Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging},
  url={https://ieeexplore.ieee.org/document/8629236},
  DOI={10.1109/IALP.2018.8629236},
  note={arXiv: 1809.03391},
  booktitle={2018 International Conference on Asian Language Processing (IALP)},
  publisher={IEEE},
  author={Kurniawan, Kemal and Aji, Alham Fikri},
  year={2018},
  month={Nov},
  pages={303–307}
}
