Code Monkey home page Code Monkey logo

character-embedding-chinese-segmentation's Introduction

Chinese Dictionary Construction using word2vec

Abstract

The aim is to construct a Chinese dictionary from large Chinese corpora using character-based word2vec technique.

The motivation of this project is to avoid maintaining a dictionary by man power, and to catch more slangs that are unlikely to be included in those dictionaries.

The dictionary is useful in various further NLP tasks such as Chinese word segmentation. A variation of this project is to directly use the vectors learned to segment Chinese words without intermediate dictionaries. An extension of this project is to find out some interesting language features in Chinese from the result of word2vec.

Description

* Brief intro to word2vec: word2vec is a kind of model that takes a corpus as input and outputs a set of feature vectors that can be projected in a latent space of N dimensions. It is extremely useful since vectors that have smaller cosine distance tend to have higher similarities, and hence word2vec is being used in various NLP deep learning tasks. The classical task using word2vec is to predict target word given some context, or predict context given a target word. Also, interesting relationships e.g. 'man' - 'woman' = 'king' - 'queen' can be identified in the space.

As compared to English, NLP in Chinese requires an additional preprocessing stage, i.e. segment the text into sequence of words, before they can be used in subsequent NLP tasks, including input into word2vec. There are various ways for Chinese word segmentation, but all of them need a dictionary beforehand. Maintaining the dictionary requires additional man power, and the dictionary tend not to be exhaustive enough, for example, it tend to miss some internet slangs.

This project is to actually build a Chinese dictionary from some known Chinese corpora using word2vec technique. Almost all word2vec application targeting on Chinese text tend to segment them into words first, however, Chinese in character unit also can have meanings. If we segment the text into characters, which is an extremely easy task, and feed them into word2vec, there would definitely be something interesting in the output space. Then maybe we can slice the text into 2-character units, and find out valid 2-character words from the frequency information, then based on the individual characters in the words and their position in the space, some cluster-to-cluster relationship may be found, and we can use it to construct possible dictionary words. For example, we have 校 in a cluster, 狗 in a cluster, and 校狗 is found to be a valid word; then we can somehow imply that if 貓 is in the same cluster as 狗, 校貓 is probably a valid word as well and can be included the dictionary, even though it doesn’t appear in the input corpus. This way some informal but reasonable words can be better catched.

As we play with character-based chinese word2vec space, some more interesting features may be identified as well, e.g. 徘 and 徊 should be almost overlapping in the space, since they are indivisible words.

About This Repo

/preparation    # pre-work information e.g. schedule, related work research, etc.
/main           # main project e.g. source code, report, etc.

character-embedding-chinese-segmentation's People

Contributors

pyliaorachel avatar

Stargazers

Sangjee Dondrub avatar

Watchers

James Cloos avatar ufukhurriyet avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.