biangbiang

This package provides methods for Chinese language analysis and exploration in Node.js. In particular, it provides three broad functions:

  1. Dictionary definitions of words
  2. Retrieval and calculation of character and word frequency statistics
  3. Hierarchical decomposition of characters into components

Installation

For npm:

npm install biangbiang

For Yarn:

yarn add biangbiang

Getting started

With import:

import biangbiang from "biangbiang";

With require:

var biangbiang = require('biangbiang');

Methods

Dictionary

define(word, dictionary)

Get the pinyin and definition of a word, where dictionary is "simplified", "traditional", or "merged". Also returns the frequency index (rank).

define('面条', 'simplified');

{
    simplified: '面条',
    traditional: '麵條',
    pinyin: 'mian4 tiao2',
    definition: 'noodles',
    index: 6029
}

kind(character)

Check whether a character is simplified or traditional and get its counterpart in the other script. type is 1 for simplified, 2 for traditional, and 3 for both.

kind("面");

{ type: 1, other: '麵' }

wordsContaining(character)

Get a list of all dictionary words containing a character, sorted in order of decreasing frequency.

wordsContaining('面');

[
	{
		word: '面',
		index: 322,
	},
	{
		word: '里面',
		index: 706,
	},
	{
		word: '面对',
		index: 930,
	},
	{
		word: '外面',
		index: 1234,
	},
	{
		word: '后面',
		index: 1270,
	},
  ...
]
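
Because each entry carries its frequency rank in index, the list is easy to trim to common vocabulary. The following is a minimal sketch that runs on a hard-coded sample shaped like the output above rather than on the package itself. commonWords is a hypothetical helper, not part of biangbiang, and the guard against non-positive ranks is an assumption about how missing data might be reported.

```javascript
// Sample copied from the wordsContaining('面') output shown above.
const sample = [
  { word: '面', index: 322 },
  { word: '里面', index: 706 },
  { word: '面对', index: 930 },
  { word: '外面', index: 1234 },
  { word: '后面', index: 1270 },
];

// Hypothetical helper (not part of biangbiang): keep only the words whose
// rank is maxIndex or better. A lower index means a more frequent word.
// Assumption: entries with a non-positive index have no usable rank.
function commonWords(entries, maxIndex) {
  return entries
    .filter((entry) => entry.index > 0 && entry.index <= maxIndex)
    .map((entry) => entry.word);
}

console.log(commonWords(sample, 1000)); // [ '面', '里面', '面对' ]
```

In real use, the array returned by wordsContaining would take the place of sample.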

Frequency

characterFrequency(character)

Get frequency statistics for a character.

characterFrequency('面');

{
	symbol: '面',
	index: 211,
	frequency: 1631866,
	percentage: 0.0006532897206780486,
	cumulativePercentage: 0.7101332080329651,
}

wordFrequency(word)

Get frequency statistics for a word.

wordFrequency('面条');

{
	symbol: '面条',
	index: 6029,
	frequency: 66879,
	percentage: 0.000015823013308250793,
	cumulativePercentage: 0.8864603725508198,
}

multiFrequency(sentence)

Get frequency statistics for a body of text.

multiFrequency('我喜欢吃面条。')

{
	byCharacter: [
		{
			symbol: '我',
			index: 1,
			frequency: 107133693,
			percentage: 0.042889146765223256,
			cumulativePercentage: 0.12608816399204145,
		},
		{
			symbol: '喜',
			index: 479,
			frequency: 681772,
			percentage: 0.0002729357921827617,
			cumulativePercentage: 0.8216732504061582,
		},
		{
			symbol: '欢',
			index: 1490,
			frequency: 140530,
			percentage: 0.000056258788679270345,
			cumulativePercentage: 0.9496496712024702,
		},
		{
			symbol: '吃',
			index: 42,
			frequency: 9348265,
			percentage: 0.0037424184526636244,
			cumulativePercentage: 0.46991986609112824,
		},
		{
			symbol: '面',
			index: 211,
			frequency: 1631866,
			percentage: 0.0006532897206780486,
			cumulativePercentage: 0.7101332080329651,
		},
		{
			symbol: '条',
			index: 169,
			frequency: 2102653,
			percentage: 0.0008417612665824651,
			cumulativePercentage: 0.6785621013285376,
		},
		{
			symbol: '。',
			index: -1,
			frequency: -1,
			percentage: -1,
			cumulativePercentage: -1,
		},
	],
	indices: [1, 479, 1490, 42, 211, 169],
	percentages: [
		0.042889146765223256,
		0.0002729357921827617,
		0.000056258788679270345,
		0.0037424184526636244,
		0.0006532897206780486,
		0.0008417612665824651,
	],
	cumulativePercentages: [
		0.12608816399204145,
		0.8216732504061582,
		0.9496496712024702,
		0.46991986609112824,
		0.7101332080329651,
		0.6785621013285376,
	],
}
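
In the output above, the full stop appears in byCharacter with every statistic set to -1, and it is absent from the indices, percentages, and cumulativePercentages arrays. The sketch below shows one way to work with that shape, assuming -1 consistently marks a symbol with no frequency data. It runs on a trimmed, hard-coded sample; rarestSymbol and unknownSymbols are hypothetical helpers, not part of the package.

```javascript
// Trimmed sample shaped like the multiFrequency output shown above
// (only the symbol and index fields are kept).
const result = {
  byCharacter: [
    { symbol: '我', index: 1 },
    { symbol: '欢', index: 1490 },
    { symbol: '吃', index: 42 },
    { symbol: '。', index: -1 },
  ],
};

// Hypothetical helper: the least frequent symbol that has a rank,
// or null if no symbol in the text has frequency data.
function rarestSymbol(multi) {
  const known = multi.byCharacter.filter((entry) => entry.index !== -1);
  if (known.length === 0) return null;
  return known.reduce((a, b) => (b.index > a.index ? b : a));
}

// Hypothetical helper: symbols reported with -1 (assumed to mean "no data").
function unknownSymbols(multi) {
  return multi.byCharacter
    .filter((entry) => entry.index === -1)
    .map((entry) => entry.symbol);
}

console.log(rarestSymbol(result).symbol); // 欢
console.log(unknownSymbols(result)); // [ '。' ]
```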

Components

decompose(character, depth)

Decompose a character into its components up to a specified depth. If depth is undefined, then the full component tree is returned.

decompose('面')

{
	: {
		'㇐': '㇐',
		'㇓': '㇓',
	},
	: {
		'55103': {
			'10001': {
				'10001': '㇑',
			},
			: {
				: '㇐',
			},
		},
		: {
			'⺆': {
				'㇑': '㇑',
				'㇆': '㇆',
			},
			'㇐': '㇐',
		},
	},
}

charactersWithComponent(component)

Get a list of characters containing a component, sorted in order of decreasing frequency.

charactersWithComponent('囗')

[
	{ character: '回', index: 139 },
	{ character: '图', index: 166 },
	{ character: '口', index: 307 },
	{ character: '因', index: 381 },
	{ character: '西', index: 382 },
	{ character: '团', index: 388 },
	{ character: '困', index: 413 },
	{ character: '国', index: 544 },
	{ character: '围', index: 644 },
	{ character: '圈', index: 717 },
  ...
]
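
The tree returned by decompose appears to use nested objects for intermediate components and plain strings for leaves. Under that assumption, a short recursive walk flattens a tree into its leaf strokes. The sketch below uses a small hand-written fragment taken from the decompose example rather than a live call; collectLeaves and maxDepth are hypothetical helpers, not part of the package.

```javascript
// Fragment copied from the decompose('面') output shown earlier.
// Assumption: inner nodes are objects and leaves are strings.
const tree = {
  '⺆': {
    '㇑': '㇑',
    '㇆': '㇆',
  },
  '㇐': '㇐',
};

// Hypothetical helper: gather every leaf stroke, depth first.
// Object.values follows insertion order for non-numeric keys; keys that
// look like integers (such as '10001') would be visited first.
function collectLeaves(node) {
  if (typeof node === 'string') return [node];
  return Object.values(node).flatMap((child) => collectLeaves(child));
}

// Hypothetical helper: number of nesting levels below this node.
function maxDepth(node) {
  if (typeof node === 'string') return 0;
  const depths = Object.values(node).map((child) => maxDepth(child));
  return 1 + Math.max(0, ...depths);
}

console.log(collectLeaves(tree)); // [ '㇑', '㇆', '㇐' ]
console.log(maxDepth(tree)); // 2
```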

How it works

JSON files containing character/word/component information are generated by /src/prepare.js from raw files contained in /data/raw, with outputs saved to /data/processed.

The preparation script can also be run with npm run prepare or yarn prepare.

Sources

  • Dictionary entries are entirely from CEDICT
  • Frequency statistics are from BCC_LEX_Zh
  • Character composition entries are from CJK-decomp

This project was inspired by HanziJS and offers many of the same functionalities.

Etymology

Biangbiang noodles are a common dish in China's Shaanxi province. The character for 'biáng' is one of the most complicated in modern usage. Ironically, it is not (yet) included in any of our datasets, as the character was only added to Unicode in March of 2020.

