
vietnamese-syllable-regex

Split a Vietnamese syllable into onset and rime.

Splitting library (written in Python): vietnamese_syllable_regex (test)

Documentation here.


Vowel characters with tone marks:

aàáảãạ
oòóỏõọ
ôồốổỗộ
ơờớởỡợ
uùúủũụ
ưừứửữự

ăằắẳẵặ
âầấẩẫậ

eèéẻẽẹ
êềếểễệ
iìíỉĩị
yỳýỷỹỵ
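
Each row above lists one vowel in the tone order ngang, huyền, sắc, hỏi, ngã, nặng. Most of the patterns below are written without tone marks, so a syllable usually needs its tone separated first. A minimal sketch of that step using Unicode decomposition from the standard library; `split_tone` and `TONE_MARKS` are names made up for this example and are not taken from vietnamese_syllable_regex:

```python
import unicodedata

# Combining marks of the five marked tones; a syllable with none is "ngang".
TONE_MARKS = {
    "\u0300": "huyền",  # grave
    "\u0301": "sắc",    # acute
    "\u0309": "hỏi",    # hook above
    "\u0303": "ngã",    # tilde
    "\u0323": "nặng",   # dot below
}


def split_tone(syllable):
    """Return (syllable without its tone mark, tone name)."""
    tone = "ngang"
    kept = []
    # NFD splits "ố" into o + circumflex + acute, so the tone can be dropped
    # while the vowel-quality mark (circumflex, breve, horn) is kept.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recomposes the remaining marks into single characters such as "ô".
    return unicodedata.normalize("NFC", "".join(kept)), tone
```

For example, `split_tone("quốc")` gives the tone-stripped form `quôc` together with the tone `sắc`.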

Consonants

Special consonants

The qu_ and gi_ groups

  • the u/i is counted as part of the rime (u_/i_)
"quôc"  # -> uôc; the only case is "quốc"

"giê([mu]|[cpt]|ng?)"  # -> iê_
"gin?|gi(p|ich)"       # -> i*
  • the rime is whatever follows qu/gi
"quy.*"   # -> i*
"qu(.+)"  # then check that the rime is valid
"gi(.+)"

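The qu_/gi_ rules above can be read as an ordered list of checks: shared-letter cases first, then quy_, then the general case. A minimal sketch under that reading; the function name `qu_gi_rime` is hypothetical, the input is assumed to be lowercase with the tone mark already removed, and the result is only a rime candidate that still has to be checked against the rime lists:

```python
import re
import unicodedata


def _nfc(text):
    # Use precomposed characters on both the pattern and the input side.
    return unicodedata.normalize("NFC", text)


QUOC = re.compile(_nfc(r"quôc"))
GIE = re.compile(_nfc(r"giê([mu]|[cpt]|ng?)"))
GI_SHORT = re.compile(r"gin?|gi(p|ich)")
QUY = re.compile(r"quy.*")
QU = re.compile(r"qu(.+)")
GI = re.compile(r"gi(.+)")


def qu_gi_rime(syllable):
    """Return the rime candidate of a qu_/gi_ syllable, or None."""
    s = _nfc(syllable)
    # Shared letter: the u or i also belongs to the rime (uôc, iê_, i_).
    if QUOC.fullmatch(s) or GIE.fullmatch(s) or GI_SHORT.fullmatch(s):
        return s[1:]
    # quy_ is checked as the rime i_ (assumed meaning of "-> i*").
    if QUY.fullmatch(s):
        return "i" + s[3:]
    # Otherwise the rime is whatever follows qu or gi.
    m = QU.fullmatch(s) or GI.fullmatch(s)
    return m.group(1) if m else None
```

The order matters: trying `gi(.+)` first on "giêng" would yield "êng", which is not in the rime lists, whereas the shared-letter rule yields "iêng".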
No initial consonant, rime only

-> in this case iê_ is written yê_

"(?!iê).+"  # anything not starting with iê (two-letter rimes such as "in" included); then check that the rime is valid
## the `y_` group
"yê([mu]|ng?)"
"yêt"

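For a syllable without an onset, the rule above comes down to two checks: yê_ spellings are accepted from the `y_` group, and anything else must not start with iê. A minimal sketch, assuming tone-stripped lowercase input; `onsetless_kind` is a name invented here, and "does not start with iê" is written as a negative lookahead so that two-letter rimes such as "in" also pass on to the rime check:

```python
import re
import unicodedata


def _nfc(text):
    # Use precomposed characters on both the pattern and the input side.
    return unicodedata.normalize("NFC", text)


Y_GROUP = re.compile(_nfc(r"yê([mu]|ng?)|yêt"))
NOT_IE = re.compile(_nfc(r"(?!iê).+"))


def onsetless_kind(syllable):
    """Classify an onsetless syllable as 'y-group', 'plain' or None."""
    s = _nfc(syllable)
    if Y_GROUP.fullmatch(s):
        return "y-group"
    if NOT_IE.fullmatch(s):
        # Still a candidate only: the rime lists decide whether it is valid.
        return "plain"
    # Starts with iê, which has to be spelled yê when there is no onset.
    return None
```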
Regular consonants

Group that combines with every rime

"[bdđlmnprsvx]|[cknp]?h|t[hr]?"

Group used before e, i, y ("eêiy"):

"k|n?gh"

Group used before a, o, u ("aăâoôơuư"):

"c|n?g"

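The three onset groups can be combined into one compatibility check between an onset and the first letter of the rime. A minimal sketch; `onset_fits` is a hypothetical helper, and the rime is assumed to be tone-stripped and non-empty:

```python
import re
import unicodedata


def _nfc(text):
    # Use precomposed characters so that "ê", "ư" etc. are single characters.
    return unicodedata.normalize("NFC", text)


ANY_RIME = re.compile(r"[bdđlmnprsvx]|[cknp]?h|t[hr]?")
BEFORE_EIY = re.compile(r"k|n?gh")
BEFORE_AOU = re.compile(r"c|n?g")

EIY = _nfc("eêiy")
AOU = _nfc("aăâoôơuư")


def onset_fits(onset, rime):
    """Check whether a regular onset may precede the given rime."""
    first = _nfc(rime)[0]
    if ANY_RIME.fullmatch(onset):
        return True
    if BEFORE_EIY.fullmatch(onset):
        return first in EIY
    if BEFORE_AOU.fullmatch(onset):
        return first in AOU
    return False
```

For instance, "ngh" fits "iêng" while "ng" does not, and "c" fits "a" while "k" does not.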
Rimes

Single-vowel rimes

"[aàáảãạeèéẻẽẹêềếểễệiìíỉĩịoòóỏõọôồốổỗộơờớởỡợuùúủũụưừứửữựyỳýỷỹỵ]"

"oa", "oe", "uê", "uy" have two ways of placing the tone mark

Old-style tone placement

"[oòóỏõọ][ae]"
"[uùúủũụ][êy]"

New-style tone placement

"o[aàáảãạeèéẻẽẹ]"
"u[êềếểễệyỳýỷỹỵ]"

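The two sets of patterns differ only in which vowel carries the tone mark, so they can be used to tell the two conventions apart. A minimal sketch; `tone_style` is a name made up for this example, and it only looks at the bare rimes "oa", "oe", "uê", "uy":

```python
import re
import unicodedata


def _nfc(text):
    # Use precomposed characters on both the pattern and the input side.
    return unicodedata.normalize("NFC", text)


OLD_STYLE = re.compile(_nfc(r"[oòóỏõọ][ae]|[uùúủũụ][êy]"))
NEW_STYLE = re.compile(_nfc(r"o[aàáảãạeèéẻẽẹ]|u[êềếểễệyỳýỷỹỵ]"))


def tone_style(rime):
    """Return 'old', 'new', 'unmarked', or None for other rimes."""
    s = _nfc(rime)
    old = OLD_STYLE.fullmatch(s) is not None
    new = NEW_STYLE.fullmatch(s) is not None
    if old and new:
        # No tone mark at all: both pattern sets accept the plain spelling.
        return "unmarked"
    if old:
        return "old"  # mark on the first vowel, as in "òa"
    if new:
        return "new"  # mark on the second vowel, as in "oà"
    return None
```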
Regular rimes

These fall into two groups:

  • sắc/nặng tones only
  • any tone

Condensed form

[cpt]|ch: sắc/nặng only
"([ăâeou]|iê|o?a|u?ô|ươ)[cpt]"
"[êiơ][pt]"
"(ư|oă)[ct]"
"ooc"
"(oe|uâ|uyê?)t"
"([êi]|o?a|u[êy])ch"
[aimouy]|n[gh]?: any tone
"([iuư]|uy)a"
"([ouư]|o?a|u?ô|ư?ơ)i"
"o?[ae]o"
"([aâiư]|uy|ươ|i?ê)u"
"(o?a|u?â)y"
"([âeouư]|o?[aă]|iê|u?ô|ươ)(m|ng?)"  # takes m, n, ng
"([êiơ]|oe)[mn]"  # m, n only
"uâng?"  # n, ng only
"uyê?n"  # n only
"oong"  # ng only
"([êi]|o?a|u[êy])nh"

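The condensed lists above can be compiled into two alternations that answer which tones a regular rime may carry. A minimal sketch, assuming a tone-stripped rime as input; `allowed_tones` and the constant names are invented for this example, and single-vowel rimes are not covered:

```python
import re
import unicodedata


def _nfc(text):
    # Use precomposed characters on both the pattern and the input side.
    return unicodedata.normalize("NFC", text)


def _compile_any(patterns):
    return re.compile("|".join(f"(?:{_nfc(p)})" for p in patterns))


# Rimes ending in c, p, t, ch: sắc/nặng only.
STOP_RIMES = _compile_any([
    r"([ăâeou]|iê|o?a|u?ô|ươ)[cpt]",
    r"[êiơ][pt]",
    r"(ư|oă)[ct]",
    r"ooc",
    r"(oe|uâ|uyê?)t",
    r"([êi]|o?a|u[êy])ch",
])
# Rimes ending in a vowel, m, n, ng, nh: any tone.
OPEN_RIMES = _compile_any([
    r"([iuư]|uy)a",
    r"([ouư]|o?a|u?ô|ư?ơ)i",
    r"o?[ae]o",
    r"([aâiư]|uy|ươ|i?ê)u",
    r"(o?a|u?â)y",
    r"([âeouư]|o?[aă]|iê|u?ô|ươ)(m|ng?)",
    r"([êiơ]|oe)[mn]",
    r"uâng?",
    r"uyê?n",
    r"oong",
    r"([êi]|o?a|u[êy])nh",
])

SAC_NANG = frozenset({"sắc", "nặng"})
ALL_TONES = frozenset({"ngang", "huyền", "sắc", "hỏi", "ngã", "nặng"})


def allowed_tones(rime):
    """Return the set of tones a regular rime accepts, or None if unknown."""
    s = _nfc(rime)
    if STOP_RIMES.fullmatch(s):
        return SAC_NANG
    if OPEN_RIMES.fullmatch(s):
        return ALL_TONES
    return None
```

Combined with a tone-splitting step, this would reject a spelling such as "uôc" with a huyền mark while accepting "ương" with any tone.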
Full form

"a([imouy]|n[gh]?)"
"a([cpt]|ch)"

"ă(m|ng?)"
"ă[cpt]"

"â([muy]|ng?)"
"â[cpt]"

"e([mo]|ng?)"
"e[cpt]"

"ê([mu]|nh?)"
"ê([pt]|ch)"

"i([amu]|nh?)"
"i([pt]|ch)"
"iê([mu]|ng?)"
"iê[cpt]"

"o([im]|ng?)"
"o[cpt]"
"oa([imoy]|n[gh]?)"
"oa([cpt]|ch)"
"oă(m|ng?)"
"oă[cpt]"
"oe[mno]"
"oet"
"oong"
"ooc"

"ô([im]|ng?)"
"ô[cpt]"

"ơ[imn]"
"ơ[pt]"

"u([aim]|ng?)"
"u[cpt]"
"uâ(y|ng?)"
"uât"
"uênh"
"uêch"
"uô([im]|ng?)"
"uô[cpt]"
"uơ"
"uy([au]|nh?)"
"uy([pt]|ch)"
"uyên"
"uyêt"

"ư([aimu]|ng?)"
"ư[ct]"  # ư[cpt]
"ươ([imu]|ng?)"
"ươ[cpt]"

"yê([mu]|ng?)"
"yêt"

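The condensed and full forms appear to describe the same set of rimes, so one can be checked against the other mechanically. A minimal sketch of such a cross-check, shown on the stop-final rimes of the nucleus "uy" only; `compile_any` and `disagreements` are hypothetical helpers, not part of vietnamese_syllable_regex:

```python
import re
import unicodedata


def _nfc(text):
    # Use precomposed characters on both the pattern and the input side.
    return unicodedata.normalize("NFC", text)


def compile_any(patterns):
    """Join several rime patterns into one compiled alternation."""
    return re.compile("|".join(f"(?:{_nfc(p)})" for p in patterns))


def disagreements(condensed, full, candidates):
    """List the candidates that only one of the two pattern sets accepts."""
    a, b = compile_any(condensed), compile_any(full)
    return sorted(
        c for c in map(_nfc, candidates)
        if (a.fullmatch(c) is None) != (b.fullmatch(c) is None)
    )


# Stop-final patterns copied from the condensed form.
CONDENSED_STOPS = [
    r"([ăâeou]|iê|o?a|u?ô|ươ)[cpt]",
    r"[êiơ][pt]",
    r"(ư|oă)[ct]",
    r"ooc",
    r"(oe|uâ|uyê?)t",
    r"([êi]|o?a|u[êy])ch",
]
# The matching full-form entry for the nucleus "uy".
FULL_UY_STOPS = [r"uy([pt]|ch)"]

candidates = ["uy" + final for final in ("c", "p", "t", "ch")]
print(disagreements(CONDENSED_STOPS, FULL_UY_STOPS, candidates))
```

With the patterns as listed, the check reports "uyp": the full form accepts it through `uy([pt]|ch)`, while no condensed pattern does. Which of the two is intended is not stated in this document.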
