japanese-word-frequency's Introduction

Japanese word frequency

Warning

This is a work-in-progress repository. The data has not been properly reviewed or cleaned and may contain errors.

wordfreq_tatoeba_raw.tsv file

Contains data obtained by parsing Japanese sentences from tatoeba.org on 2024-07-04 using spaCy with the ja_core_news_trf model (slower, more accurate).

Format:

  • Tab-separated values
  • Header: lemma<TAB>pos<TAB>count
  • The 1st line after the header contains a special <total> value (pos set to -), representing a total count.
  • Each row: {lemma}<TAB>{pos}<TAB>{count}, where:
    • lemma (string) := the lemma of a token, as produced by spaCy's parser/lemmatizer. Lemmas can be thought of as "dictionary forms," although they also include punctuation, numbers, symbols, etc. For example, in English the nouns "dog" and "dogs" are both lemmatized as "dog" (pos=NOUN), while "dog" in "dog food" would also produce the lemma "dog", but with pos=ADJ.
    • pos (enum) := part-of-speech tag, as described in Universal POS tags.
    • count (number) := number of occurrences of this lemma with this pos tag.
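
As an illustration of the format described above, here is a minimal Python sketch of a reader. The function name `read_wordfreq` and the sample rows are hypothetical (the counts are made up, not taken from the real file). Quoting is disabled because lemmas can be punctuation, including a bare double quote.

```python
import csv
import io

def read_wordfreq(fileobj):
    """Return (total, rows), where rows is a list of (lemma, pos, count)."""
    # QUOTE_NONE: a lemma may be a bare double quote, so quote characters
    # must be treated as ordinary data rather than field delimiters.
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)

    header = next(reader)
    if header != ["lemma", "pos", "count"]:
        raise ValueError(f"unexpected header: {header!r}")

    # The first row after the header is the special <total> row.
    lemma, pos, count = next(reader)
    if lemma != "<total>" or pos != "-":
        raise ValueError("expected the <total> row right after the header")
    total = int(count)

    # Skip empty rows (e.g. a trailing blank line) and convert counts to int.
    rows = [(r[0], r[1], int(r[2])) for r in reader if r]
    return total, rows

# Made-up sample for illustration only; not real counts from the file.
sample = io.StringIO(
    "lemma\tpos\tcount\n"
    "<total>\t-\t10\n"
    "。\tPUNCT\t4\n"
    "猫\tNOUN\t3\n"
    "トム\tPROPN\t2\n"
    "\"\tPUNCT\t1\n"
)
total, rows = read_wordfreq(sample)
```

For the real file, the same function could be called on `open("wordfreq_tatoeba_raw.tsv", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption, since the README does not state it.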

Notes:

  • spaCy's parser is not 100% accurate; it often produces odd results on phrases with uncommon grammatical structures.
  • Lemmas are not necessarily unique; only (lemma, pos) pairs are unique.
  • The file contains proper names, punctuation, symbols, etc.
  • Pay special attention to these pos tags when working with this data: PROPN, PUNCT, CURR, SYM, NUM, X.
    • Numerals (pos=NUM) seem to be especially problematic.
    • Proper names (pos=PROPN) include generic names of people often used in sentences on Tatoeba.org, like "Tom" and "Mary".
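
One possible way to act on these notes is to drop the flagged tags and then merge the remaining rows per lemma. The sketch below is only an illustration: the names `NOISY_POS` and `filter_and_merge` are hypothetical, the sample rows are made up, and whether counts for the same lemma should be summed across tags depends on the use case.

```python
# Tags flagged in the notes above, copied verbatim from that list.
NOISY_POS = {"PROPN", "PUNCT", "CURR", "SYM", "NUM", "X"}

def filter_and_merge(rows, excluded=NOISY_POS):
    """Drop rows with excluded tags, then sum counts per lemma across tags."""
    merged = {}
    for lemma, pos, count in rows:
        if pos in excluded:
            continue
        merged[lemma] = merged.get(lemma, 0) + count
    # Most frequent first; ties are broken by lemma for a stable order.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up (lemma, pos, count) rows for illustration only.
sample_rows = [
    ("する", "VERB", 5),
    ("する", "AUX", 3),
    ("トム", "PROPN", 4),
    ("。", "PUNCT", 9),
    ("猫", "NOUN", 2),
]
ranked = filter_and_merge(sample_rows)
```

Here the two rows for "する" are merged into one entry, while the proper name and the punctuation mark are excluded.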

japanese-word-frequency's People

Contributors

scriptin
