The Cantonese Wordnet

This repository contains the data for the Cantonese Wordnet project.

This project was created and is continuously updated by Joanna Ut-Seong Sio (Palacký University, Czech Republic) and Luis Morgado da Costa (Vrije Universiteit Amsterdam, the Netherlands).

Our wordnet contains data both in traditional characters and in Jyutping (a romanisation system for Cantonese developed by the Linguistic Society of Hong Kong in 1993). The Cantonese wordnet is currently supported in two formats:

  • the Lexical Markup Framework (LMF) compatible XML, released and maintained by the Global Wordnet Association;
  • a legacy TSV format adopted by the original version of the Open Multilingual Wordnet (due to format constraints, not all data are available in the legacy format; specifically, Jyutping forms are omitted).
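As a rough illustration of the LMF-compatible XML format, the sketch below reads lexical entries using only the Python standard library. It assumes the usual WN-LMF element names (LexicalEntry, Lemma, Form, Sense). The sample document, its IDs, and the choice to store the Jyutping form in a Form element are assumptions made for this example, not a description of the released file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sample in a WN-LMF-like layout. The IDs and the
# placement of Jyutping in <Form> are invented for illustration only.
SAMPLE_LMF = """
<LexicalResource>
  <Lexicon id="demo-yue" label="Demo lexicon" language="yue" version="0.0">
    <LexicalEntry id="demo-yue-entry-1">
      <Lemma writtenForm="狗" partOfSpeech="n"/>
      <Form writtenForm="gau2"/>
      <Sense id="demo-yue-sense-1" synset="demo-yue-synset-1"/>
    </LexicalEntry>
    <Synset id="demo-yue-synset-1" partOfSpeech="n"/>
  </Lexicon>
</LexicalResource>
"""


def read_entries(xml_text):
    """Return one dict per LexicalEntry found in the XML text."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        entries.append({
            "lemma": lemma.get("writtenForm"),
            "pos": lemma.get("partOfSpeech"),
            # Additional written forms attached to the same entry.
            "forms": [f.get("writtenForm") for f in entry.findall("Form")],
            # Synset IDs referenced by the entry's senses.
            "synsets": [s.get("synset") for s in entry.findall("Sense")],
        })
    return entries


entries = read_entries(SAMPLE_LMF)
for e in entries:
    print(e["lemma"], e["pos"], e["forms"], e["synsets"])
```

For the real file, the same function could be given the file's contents after reading it as UTF-8, provided the element names match the assumptions above.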

Currently, the Cantonese Wordnet contains over 16,500 hand-checked lemmas and their romanisations, distributed across all major parts of speech. More descriptive statistics and the methodology can be found in its canonical citation (see below).

Demo

In the future, the Cantonese Wordnet will be included in the Open Multilingual Wordnet (OMW). However, as OMW is currently undergoing restructuring, we are hosting it here in the meantime.

Notable features

  • Our wordnet is fully hand-checked by trained linguists;
  • For each lemma, both its Jyutping and character representations are included. For Jyutping, we include as much variation in pronunciation as possible (including bin3jam1 變音 ‘changed tone’ and laan5jam1 懶音 ‘lazy pronunciation’). For character representations, we also include as much variation as possible, given that there is no official standardization;
  • Following recent trends, our wordnet is not limited to open-class words; it also includes function words (e.g., classifiers and post-verbal particles);
  • Our wordnet is being developed alongside a companion corpus (The Cantonese Wordnet Corpus), which is also being sense-tagged. This corpus is used to attest senses and to provide example sentences for individual sense usages.
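To make the Jyutping representations above concrete, here is a minimal sketch that splits a romanised form such as bin3jam1 into syllables and tone numbers. It assumes forms are written without spaces, in lowercase, with one tone digit (1 to 6) closing each syllable. The function name split_jyutping is hypothetical and not part of the project.

```python
import re

# One Jyutping syllable: lowercase letters followed by a tone digit 1-6.
SYLLABLE = re.compile(r"([a-z]+)([1-6])")
# A whole form is one or more such syllables with nothing left over.
WHOLE_FORM = re.compile(r"(?:[a-z]+[1-6])+")


def split_jyutping(romanisation):
    """Return a list of (syllable, tone) pairs for a Jyutping string."""
    if not WHOLE_FORM.fullmatch(romanisation):
        raise ValueError(f"not a plain Jyutping form: {romanisation!r}")
    return [(syl, int(tone)) for syl, tone in SYLLABLE.findall(romanisation)]


print(split_jyutping("bin3jam1"))   # [('bin', 3), ('jam', 1)]
print(split_jyutping("laan5jam1"))  # [('laan', 5), ('jam', 1)]
```

A split like this could help when comparing pronunciation variants of the same lemma, for example variants that differ only in tone.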

License

The Cantonese wordnet is released under a Creative Commons Attribution 4.0 International License (CC BY 4.0) and its canonical citation is:

Sio, Joanna Ut-Seong & Morgado da Costa, Luis. (2019). Building the Cantonese Wordnet. In Proceedings of the Tenth Global Wordnet Conference (GWC 2019), pp. 206-215. Wroclaw, Poland.

If you use any data from the Cantonese Wordnet Corpus, please also cite the following paper:

Sio, Joanna Ut-Seong & Morgado da Costa, Luis. (2022). Enriching Linguistic Representation in the Cantonese Wordnet and Building the New Cantonese Wordnet Corpus. Proceedings of the 13th Conference on Language Resources and Evaluation. European Language Resources Association (ELRA). Marseille, France.

References:

Sio, Joanna Ut-Seong & Morgado da Costa, Luís. (2023). The Open Cantonese Sense-Tagged Corpus. Proceedings of the 12th International Global Wordnet Conference. San Sebastian, Spain.

Sio, Joanna Ut-Seong & Morgado da Costa, Luis. (2022). Enriching Linguistic Representation in the Cantonese Wordnet and Building the New Cantonese Wordnet Corpus. Proceedings of the 13th Conference on Language Resources and Evaluation. European Language Resources Association (ELRA). Marseille, France.

Sio, Joanna Ut-Seong & Morgado da Costa, Luis. (2019). Building the Cantonese Wordnet. In Proceedings of the Tenth Global Wordnet Conference (GWC 2019), pp. 206-215. Wroclaw, Poland.

cantonesewn's Issues

How to load the wordnet?

I appreciate this work very much. Could you give me some guidance on how to load it with NLTK or other tools?
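One library-free way to approach this is sketched below for the legacy TSV release. It does not use NLTK. It assumes the tab-separated layout used by the original Open Multilingual Wordnet: a header line starting with "#", then rows of synset ID, a "yue:lemma" type field, and the lemma. The sample rows and the function name load_tsv are invented for illustration, and the layout should be checked against the actual file.

```python
from collections import defaultdict

# Hypothetical sample rows; IDs and lemmas are invented for this sketch.
SAMPLE_TSV = (
    "# Demo Wordnet\tyue\thttps://example.org\tCC BY 4.0\n"
    "00000001-n\tyue:lemma\t狗\n"
    "00000001-n\tyue:lemma\t犬\n"
    "00000002-n\tyue:lemma\t貓\n"
)


def load_tsv(text, lang="yue"):
    """Group lemmas by synset ID, skipping comments and malformed rows."""
    lemmas_by_synset = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # header or blank line
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # row does not match the assumed three-column layout
        synset_id, kind, lemma = fields
        if kind == f"{lang}:lemma":
            lemmas_by_synset[synset_id].append(lemma)
    return dict(lemmas_by_synset)


wordnet = load_tsv(SAMPLE_TSV)
print(wordnet["00000001-n"])  # ['狗', '犬']
```

With the real file, the text could be read with open(path, encoding="utf-8").read() and passed to the same function. Because the legacy format omits Jyutping forms, only character representations would be available this way.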
