
kanji-frequency's Introduction

Kanji usage frequency

Datasets built from various Japanese language corpora

https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This readme describes only technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

See the scripts section in package.json.

Aozora:

  • aozora:download - use crawler/scraper to collect the data
  • aozora:gaiji:extract - extract gaiji notation data from scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
  • aozora:gaiji:replacements - build gaiji replacements file - produces only partial results, which may need to be manually completed
  • aozora:clean - clean the scraped pages (apply gaiji replacements)
  • aozora:count - create the dataset

Wikipedia:

  • wikipedia:fetch - fetch random pages using MediaWiki API
  • wikipedia:count - create the dataset
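
As a rough illustration of what fetching random pages through the MediaWiki API can look like, here is a minimal sketch. It is not the repository's actual script: the function names, the `host` and `limit` parameters, and the choice to return only titles are assumptions made for this example. It relies on the API's `list=random` query module and on the global `fetch` available in Node.js 18.

```javascript
// Build a MediaWiki API URL that asks for random article titles.
// `host` and `limit` are illustrative parameters, not taken from the repository.
function buildRandomPagesUrl(host, limit = 10) {
  const params = new URLSearchParams({
    action: "query",
    list: "random",
    rnnamespace: "0", // main (article) namespace only
    rnlimit: String(limit),
    format: "json",
  });
  return `https://${host}/w/api.php?${params}`;
}

// Fetch one batch of random titles. Requires network access.
async function fetchRandomTitles(host, limit = 10) {
  const response = await fetch(buildRandomPagesUrl(host, limit));
  if (!response.ok) {
    throw new Error(`MediaWiki API request failed: ${response.status}`);
  }
  const data = await response.json();
  // The random module returns entries of the form { id, ns, title }.
  return data.query.random.map((page) => page.title);
}
```

For example, `fetchRandomTitles("ja.wikipedia.org", 5)` would request five random article titles from Japanese Wikipedia; passing a Wikinews host instead should work the same way, since both sites expose the same API.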

News:

  • news:wikinews:fetch - fetch random pages from Wikinews using MediaWiki API
  • news:count - create the dataset
  • news:dates - create an additional file with the dates of the articles

Building the website

See Astro docs and the scripts section in package.json.


kanji-frequency's Issues

Kanji decomposition (外字注記) bias in the Aozora list

It turns out that Aozora replaces some kanji with images and provides a decomposition in the alt attribute (see 外字注記). Because the dataset was generated by processing HTML files as plain text, many radicals were mistakenly counted as if they actually appeared in the texts.

From http://vtrm.net/japanese/kanji-frequency/en:

Some kanji radicals or elements which are usually not used on their own gathered relatively high rankings. One would expect such elements not to occur at all, or nearly so. For example, in Shpika’s list, 廴, a radical not used on its own, is stated to occur 1595 times and is ranked 2294th most common kanji.
The explanation is simple: when a kanji outside the JIS X 0208 set appears in a text, the Aozora Bunko policy is to break it out into simpler parts. By instance, 𢌞 (it may not be displayed correctly if you don’t have a suitable font installed) is written ※[#「廴+囘」、第4水準2-12-11], where 廴+囘 is the kanji decomposition and 第4水準2-12-11 is the JIS X 0213 code point.

Example (from 蜘蛛の糸):

<img src="../../../gaiji/1-87/1-87-71.png" alt="※(「特のへん+廴+聿」、第3水準1-87-71)" class="gaiji">
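
One way to avoid counting the decomposition parts is to replace each gaiji image with the character it stands for before counting. The sketch below is only an illustration under stated assumptions, not the project's implementation: the function name, the regular expression, the `replacements` map keyed by the JIS code found in the alt text, and the use of 〓 as a placeholder for unknown codes are all hypothetical. The mapping of 1-87-71 to 犍 (U+728D) is likewise assumed for the example.

```javascript
// Placeholder for gaiji with no known replacement (an assumption of this sketch).
const UNKNOWN_GAIJI = "\u3013"; // 〓

// Matches an <img> whose alt text has the form ※(「parts」、... men-ku-ten).
// Both ASCII and fullwidth parentheses are accepted, since scraped pages may use either.
const GAIJI_IMG = /<img\b[^>]*\balt="※[(（]「[^」]*」、[^"]*?(\d+-\d+-\d+)[)）]"[^>]*>/g;

// Replace every gaiji image with a real character, so that the parts listed
// in the alt attribute (for example 廴) are never counted as text.
function replaceGaiji(html, replacements) {
  return html.replace(
    GAIJI_IMG,
    (match, code) => replacements.get(code) ?? UNKNOWN_GAIJI
  );
}

// Usage with the example above; the mapping is an assumption of this sketch.
const replacements = new Map([["1-87-71", "\u728D"]]); // 犍
const cleaned = replaceGaiji(
  '<img src="../../../gaiji/1-87/1-87-71.png" alt="※(「特のへん+廴+聿」、第3水準1-87-71)" class="gaiji">陀多',
  replacements
);
```

With this approach, codes missing from the map show up as 〓 in the cleaned text, which makes incomplete replacement data easy to spot.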

Lists of replaced characters:

Absence of non-BMP characters

While studying the database, I see that there is not a single occurrence of non-BMP characters in it. Was it a consequence of the method used and, if so, would it be possible to ascertain the presence of any U+2XXXX characters within it?
(Similarly, there are no Compatibility characters in the lists, which leads me to suspect that the data were fully Unicode-normalized before analysis; in the Japanese case specifically, that irretrievably discards some information.)
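
Both concerns can be checked with a few lines of JavaScript. The sketch below is a hypothetical illustration and says nothing about how the datasets were actually produced; `countHan` and `nonBmpKeys` are names made up for this example. It counts by code point, so characters in the U+2XXXX range stay whole, and it shows that Unicode normalization folds a CJK compatibility ideograph (U+F900) into its unified counterpart (U+8C48).

```javascript
// Count Han characters by code point, so that characters outside the BMP
// (U+10000 and above) are kept whole instead of being split into surrogates.
function countHan(text) {
  const counts = new Map();
  for (const ch of text) { // for...of iterates by code point, not by UTF-16 unit
    if (/\p{Script=Han}/u.test(ch)) {
      counts.set(ch, (counts.get(ch) ?? 0) + 1);
    }
  }
  return counts;
}

// List the counted characters that lie outside the BMP.
function nonBmpKeys(counts) {
  return [...counts.keys()].filter((ch) => ch.codePointAt(0) > 0xffff);
}

// Sample text: U+2000B twice, U+8C48 once, compatibility ideograph U+F900 once.
const sample = "\u{2000B}\u8C48\uF900\u{2000B}";

const raw = countHan(sample);
const normalized = countHan(sample.normalize("NFC"));

const nonBmp = nonBmpKeys(raw);                     // one entry: U+2000B
const compatSurvivesRaw = raw.has("\uF900");        // true
const compatSurvivesNfc = normalized.has("\uF900"); // false: folded into U+8C48
```

Because CJK compatibility ideographs have canonical decompositions, any normalization form removes them, so a pipeline that needs to preserve them would have to count before normalizing.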

Bias due to kanji repeating within a document

The kanji frequency lists here are skewed due to the counting methodology, as explained in the cited text below. For more accurate results, the frequency formula should be:

f = number of documents in which a kanji appears at least once / total number of documents

I suggest either changing the current lists to the new formula, or adding alternate lists that use this formula.
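
The difference between the two formulas can be sketched as follows. This is a minimal illustration rather than a proposed patch: the function name `frequencies`, the field names, and the assumption that each document is available as a plain string are all hypothetical.

```javascript
// Compute, for every kanji, both the occurrence-based frequency
// (occurrences / total kanji in the corpus) and the document frequency
// (documents containing the kanji at least once / total documents).
function frequencies(documents) {
  const occurrences = new Map(); // kanji -> total number of occurrences
  const docCounts = new Map();   // kanji -> number of documents containing it
  let total = 0;                 // total number of kanji in the corpus

  for (const text of documents) {
    const seen = new Set(); // kanji seen in this document
    for (const ch of text) {
      if (!/\p{Script=Han}/u.test(ch)) continue;
      total += 1;
      occurrences.set(ch, (occurrences.get(ch) ?? 0) + 1);
      seen.add(ch);
    }
    for (const ch of seen) {
      docCounts.set(ch, (docCounts.get(ch) ?? 0) + 1);
    }
  }

  const result = new Map();
  for (const [ch, n] of occurrences) {
    result.set(ch, {
      occurrenceFrequency: n / total,
      documentFrequency: docCounts.get(ch) / documents.length,
    });
  }
  return result;
}

// 山 is repeated four times in one document; 川 appears once in each of three.
const stats = frequencies(["山山山山川", "川", "川"]);
```

In this toy corpus 山 has the higher occurrence-based frequency (4/7 against 3/7), while 川 has the higher document frequency (3/3 against 1/3), which is the kind of reversal the suggested formula is meant to expose.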

"The methodology for counting the characters is quite not right and tends to favor some kanji. Every table of kanji usage frequency I’ve found online, by Shpika or by others, is made by simply counting the number of times a given kanji is found in a whole text corpus and computing its frequency of occurrence using the total number of kanji in the corpus.
However, the resulting data is biased and not really representative of the usage of each kanji, especially for less common ones. The reason for this is that if some uncommon kanji appears in a given book, chances are it appears several times in this book. This is especially the case for character names and place names. Let’s stretch this reasoning to an extreme and consider a book in which a character’s name has a very rare kanji. Let’s say this kanji is so rare that it doesn’t appear in any the other several thousands books in the collection. The character’s name may appear, say, a few dozen times in the whole book. Thus the rare kanji will be counted several dozen times even though it’s never been used by any other author in the collection."

Source:
VTRM
