scriptin / kanji-frequency

Kanji usage frequency data collected from various sources

Home Page: http://scriptin.github.io/kanji-frequency/

License: Creative Commons Attribution 4.0 International

JavaScript 40.93% TypeScript 2.64% Astro 46.51% MDX 9.92%

kanji data japanese japanese-language data-visualization kanji-frequency frequency-lists corpus corpus-linguistics cjk

kanji-frequency's Introduction

Kanji usage frequency

Datasets built from various Japanese language corpora

https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This readme describes only technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

See scripts section in package.json.

Aozora:

aozora:download - use crawler/scraper to collect the data
aozora:gaiji:extract - extract gaiji notation data from scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
aozora:gaiji:replacements - build gaiji replacements file - produces only partial results, which may need to be manually completed
aozora:clean - clean the scraped pages (apply gaiji replacements)
aozora:count - create the dataset
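
The clean and count steps can be pictured roughly as follows. This is a minimal sketch, not the repository's actual code: `replacements`, `cleanPage`, and `countKanji` are hypothetical names, and the single map entry (1-87-71 to 犍) is an illustrative assumption inferred from the decomposition in the alt text.

```javascript
// Hypothetical replacement table: JIS X 0213 code (taken from the alt text) -> character.
const replacements = new Map([['1-87-71', '犍']]);

// Replace each gaiji <img> with its character. When no mapping is known the tag
// is dropped, so the decomposition in the alt text is never counted as real text.
function cleanPage(html) {
  return html.replace(/<img\b[^>]*\bclass="gaiji"[^>]*>/g, (tag) => {
    const code = /第\d水準(\d+-\d+-\d+)/.exec(tag);
    return (code && replacements.get(code[1])) || '';
  });
}

// Count Han-script characters; the "u" flag makes the regex match by code point.
function countKanji(text) {
  const counts = new Map();
  for (const ch of text.match(/\p{Script=Han}/gu) ?? []) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  return counts;
}

const page =
  '<img src="../../../gaiji/1-87/1-87-71.png" alt="※(「特のへん＋廴＋聿」、第3水準1-87-71)" class="gaiji">陀多';
const counts = countKanji(cleanPage(page));
console.log([...counts]);
```

Counting the raw HTML instead of the cleaned text would also count 特, 廴, and 聿 from the alt attribute, which is the bias described in the first issue further down.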

Wikipedia:

wikipedia:fetch - fetch random pages using MediaWiki API
wikipedia:count - create the dataset
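
As a rough illustration of the fetch step, the request for random pages can be built with the MediaWiki API's `list=random` query module. This is only a sketch: `randomPagesUrl` is a hypothetical helper, the host and limit are assumptions, and the repository's own script may request pages differently.

```javascript
// Build a MediaWiki API URL that asks for random page titles.
// Host and limit are illustrative assumptions, not values taken from the repository.
function randomPagesUrl(host, limit) {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    list: 'random',
    rnnamespace: '0', // 0 = main (article) namespace
    rnlimit: String(limit),
  });
  return `https://${host}/w/api.php?${params}`;
}

const url = randomPagesUrl('ja.wikipedia.org', 5);
console.log(url);
// The network call is left out of this sketch. With Node.js 18 or later one could
// call fetch(url) and read the titles from query.random in the JSON response.
```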

News:

news:wikinews:fetch - fetch random pages from Wikinews using MediaWiki API
news:count - create the dataset
news:dates - create an additional file with the dates of the articles

Building the website

See Astro docs and the scripts section in package.json.


kanji-frequency's Issues

Kanji decomposition (外字注記) bias in the Aozora list

It turns out that Aozora replaces some kanji with images, providing a decomposition in the alt attribute (see 外字注記). Since the dataset was generated by processing HTML files as plain text, many radicals were mistakenly counted as actually appearing in the texts.

From http://vtrm.net/japanese/kanji-frequency/en:

Some kanji radicals or elements which are usually not used on their own gathered relatively high rankings. One would expect such elements not to occur at all, or nearly so. For example, in Shpika’s list, 廴, a radical not used on its own, is stated to occur 1595 times and is ranked 2294th most common kanji.
The explanation is simple: when a kanji outside the JIS X 0208 set appears in a text, the Aozora Bunko policy is to break it out into simpler parts. By instance, 𢌞 (it may not be displayed correctly if you don’t have a suitable font installed) is written ※［＃「廴＋囘」、第4水準2-12-11］, where 廴＋囘 is the kanji decomposition and 第4水準2-12-11 is the JIS X 0213 code point.

Example (from 蜘蛛の糸):

<img src="../../../gaiji/1-87/1-87-71.png" alt="※(「特のへん＋廴＋聿」、第3水準1-87-71)" class="gaiji">
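
To make the structure of this notation concrete, here is a minimal sketch that splits such an alt text into its decomposition and JIS X 0213 code. `parseGaijiAlt` is a hypothetical helper, and the pattern is an assumption based only on the example above; alt texts that follow another shape simply yield `null`.

```javascript
// Parse the alt text of an Aozora gaiji image, e.g.
// "※(「特のへん＋廴＋聿」、第3水準1-87-71)".
// Returns null when the text does not follow this shape.
function parseGaijiAlt(alt) {
  const m = /^※[(（]「([^」]+)」、第(\d)水準(\d+)-(\d+)-(\d+)[)）]$/.exec(alt);
  if (!m) return null;
  return {
    parts: m[1].split('＋'), // decomposition elements, split on the full-width plus sign
    level: Number(m[2]),     // JIS level (第N水準)
    plane: Number(m[3]),     // JIS X 0213 plane
    row: Number(m[4]),       // row within the plane
    cell: Number(m[5]),      // cell within the row
  };
}

const parsed = parseGaijiAlt('※(「特のへん＋廴＋聿」、第3水準1-87-71)');
console.log(parsed);
```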

Lists of replaced characters:

Absence of non-BMP characters

While studying the database, I noticed that it does not contain a single non-BMP character. Is this a consequence of the method used and, if so, would it be possible to ascertain whether any U+2XXXX characters are present in it?
(Similarly, there are no compatibility characters in the lists, which leads me to suspect that the data were fully Unicode-normalized before analysis. Normalization irretrievably discards some information, specifically in the Japanese case.)
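
Both questions can be checked mechanically on any source text. The following is a minimal sketch with hypothetical helper names (`nonBmpChars`, `changedByNormalization`); it makes no claim about how the datasets were actually processed.

```javascript
// List the distinct characters of a text that lie outside the Basic Multilingual Plane.
function nonBmpChars(text) {
  const found = new Set();
  for (const ch of text) { // for...of iterates strings by code point
    if (ch.codePointAt(0) > 0xffff) found.add(ch);
  }
  return [...found];
}

// True when Unicode normalization (NFC here) would alter the text, as it does for
// CJK compatibility ideographs, which normalize to their unified counterparts.
function changedByNormalization(text) {
  return text.normalize('NFC') !== text;
}

console.log(nonBmpChars('廴と𢌞'));
console.log(changedByNormalization('\uF900')); // U+F900 is a CJK compatibility ideograph
```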

Bias due to kanji repeating within a document

The kanji frequency lists here are skewed due to the counting methodology, as explained in the cited text below. For more accurate results, the frequency formula should be:

f = number of documents in which a kanji appears at least once / total number of documents

I suggest either changing the current lists to the new formula, or adding alternate lists that use this formula.
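
A minimal sketch of the proposed formula next to the raw count, assuming each document is available as a plain string. `kanjiStats` and the three sample documents are hypothetical and only illustrate the difference between the two measures.

```javascript
// For each kanji, record the raw occurrence count and the number of
// documents in which it appears at least once.
function kanjiStats(docs) {
  const stats = new Map();
  for (const doc of docs) {
    const seen = new Set(); // kanji already credited to this document
    for (const ch of doc.match(/\p{Script=Han}/gu) ?? []) {
      const s = stats.get(ch) ?? { occurrences: 0, documents: 0 };
      s.occurrences += 1;
      if (!seen.has(ch)) {
        s.documents += 1;
        seen.add(ch);
      }
      stats.set(ch, s);
    }
  }
  return stats;
}

const docs = ['犍陀多は犍陀多', '山と川', '山の上'];
const stats = kanjiStats(docs);

// f = documents containing the kanji / total number of documents
const f = (ch) => stats.get(ch).documents / docs.length;

console.log(stats.get('犍'), f('犍'));
console.log(stats.get('山'), f('山'));
```

In this sample, 犍 and 山 both occur twice, but 犍 is confined to one document while 山 appears in two, so the proposed formula ranks 山 higher.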

"The methodology for counting the characters is quite not right and tends to favor some kanji. Every table of kanji usage frequency I’ve found online, by Shpika or by others, is made by simply counting the number of times a given kanji is found in a whole text corpus and computing its frequency of occurrence using the total number of kanji in the corpus.
However, the resulting data is biased and not really representative of the usage of each kanji, especially for less common ones. The reason for this is that if some uncommon kanji appears in a given book, chances are it appears several times in this book. This is especially the case for character names and place names. Let’s stretch this reasoning to an extreme and consider a book in which a character’s name has a very rare kanji. Let’s say this kanji is so rare that it doesn’t appear in any the other several thousands books in the collection. The character’s name may appear, say, a few dozen times in the whole book. Thus the rare kanji will be counted several dozen times even though it’s never been used by any other author in the collection."

Source:
VTRM
