chinese-rhymer's Introduction

Chinese Rhymer -- Build your own rhyming dictionary from the source you like

This project has two main parts: the Learning System and the Search Module. The Learning System downloads and analyzes text (lyrics, lines of poetry, or any other text) and records the results in the Rhyme Database. Once the Rhyme Database is built, the Search Module matches the rhyme pattern of an input keyword against the words in the database and returns the matches.

Learning System

The Learning System has several main functions:

analyzeSentence(chineseSentence)

Returns the Rhyme Coordinate of a Chinese sentence. If the input contains more than one Chinese character, a list of Coordinates is returned.
The APIs provided by Jieba and xpinyin make it much easier to split sentences into words and obtain the original Pinyin.

Here is one example:

analyzeSentence("这个项目由0-1CxH设计和编写", minLen=2, maxLen=6, allMode=True)

Result:

([[(5, 0), (5, 0)], [(3, 1), (13, 0)], [(5, 0), (9, 0)], [(2, 1), (11, 0)]], ['这个', '项目', '设计', '编写'])

As the example shows, all non-Chinese characters are ignored, and words shorter than minLen=2 or longer than maxLen=6 are ignored as well.
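
The filtering rule can be sketched in a few lines. This is only an illustration, not the project's actual code: the helper name filter_words is hypothetical, the segmented word list is written by hand (in the project the split comes from Jieba), and "Chinese character" is assumed to mean the basic CJK Unified Ideographs range.

```python
import re

# Assumption: a "Chinese" word consists only of characters in the
# basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def filter_words(words, min_len=2, max_len=6):
    """Keep purely Chinese words whose length lies in [min_len, max_len]."""
    return [
        w for w in words
        if CHINESE_ONLY.fullmatch(w) and min_len <= len(w) <= max_len
    ]


# Hand-written segmentation of the example sentence (hypothetical;
# the real segmentation is produced by Jieba).
segmented = ["这个", "项目", "由", "0-1CxH", "设计", "和", "编写"]
kept = filter_words(segmented)
print(kept)
```

With the hand-written segmentation above, the single-character words '由' and '和' and the non-Chinese token '0-1CxH' are dropped, leaving the same four words as in the result shown earlier.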

downloadLyricsOfAPlayList(playlistid,filename)

Given a playlist ID and a filename, the crawler downloads ALL lyrics from the specified NetEase playlist into the given file.
This function is built mainly on getPlaylistInfoByHibaiAPI(id), which returns information about every song in a playlist.
As you may notice, "TransCode=020111" is defined by the API and means that the playlist comes from NetEase Music.
Hibai also offers APIs for crawling other music websites; please contact Hibai for more information.
The source code also uses the API provided by NetEase itself; please contact NetEase for more information.

Here is an example:

downloadLyricsOfAPlayList("313835828","0001.txt")

analyzeAFile(filename,database)

After an empty database has been initialized with "initEmptyDB(dbname)" (Excel is used for convenience), the xls/xlsx file is ready to store analysis results.
"analyzeAFile()" takes a lyrics text file and a database file as input, analyzes the LRC line by line with "analyzeSentence()", and records the returned results in specific workbook sheets.
This process is slow (on average about 2000 lines/hour, or 700 characters/min).

Here is an example:

analyzeAFile("0001.txt","db.xlsx")
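
As a rough sketch of the line-by-line step, the snippet below strips LRC tags before a line would be handed to analyzeSentence(). It assumes the downloaded file still contains standard LRC tags such as [00:12.34] or [ti:Title]; the helper names strip_lrc_tags and iter_lyric_lines are hypothetical and do not appear in the project.

```python
import re

# LRC tags are bracketed: timestamps like [00:12.34] or metadata like [ti:Title].
LRC_TAG = re.compile(r"\[[^\]]*\]")


def strip_lrc_tags(line):
    """Remove every bracketed LRC tag and surrounding whitespace from a line."""
    return LRC_TAG.sub("", line).strip()


def iter_lyric_lines(lines):
    """Yield only the lines that still contain lyric text after tag removal."""
    for raw in lines:
        text = strip_lrc_tags(raw)
        if text:
            yield text


# Illustrative input, not taken from a real playlist.
sample = [
    "[ti:Example]",
    "[00:01.00]我的未来",
    "[00:05.20][00:30.10]等待",
    "[00:09.00]",
]
lyrics = list(iter_lyric_lines(sample))
print(lyrics)
```

Skipping tag-only lines up front keeps empty input away from the slow analysis step.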

Search Module

The Search Module includes:

searchDB(chineseword, dbfile)

This function serves the main query: "Match the words in the database whose Rhyme Coordinate(s) are the same as those of the given word."
The result is returned as (searchRes, DoubleRhyme): "searchRes" contains all words that have the same ending rhyme as the given word, and "DoubleRhyme" contains all words that share at least the last two rhymes with it.
The search module is still crude; future versions should support more complex rhyme patterns.
For convenience, all database files used by this project are in xlsx format.

Here is an example:

searchDB("未来", "db.xlsx")

Result:

([('现在', 41), ('离开', 29), ('不再', 27), ('下来', 27), ('应该', 26), ('未来', 22), ('明白', 20), ('期待', 17), ('静下来', 15), ('醒来', 14), ('后来', 13), ('等待', 12), ('回来', 12), ('不在', 12), ('人海', 12), ('大海', 10), ('存在', 10), ('不该', 9), ('留在', 9), ('从来', 9), ('坐在', 9), ('没来', 8), ('苍白', 8),
...other results omitted here...
], ['没来', '未来', '北海'])

For example, ('现在', 41) means that '现在' has the same ending rhyme as "未来" and has been recorded 41 times in the text learned so far. The words in ['没来', '未来', '北海'] share at least a double rhyme with "未来".
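
The matching described above might look roughly like the sketch below. It is not the project's implementation: it uses a small in-memory dictionary instead of the xlsx database, the function name search_rhymes is hypothetical, "same rhyme" is assumed to mean an identical Rhyme Coordinate, and sorting by count is inferred from the result shown above. The coordinates are assigned by hand from the rhyme table in the Appendix, and the count for '项目' is made up for illustration.

```python
def search_rhymes(target_coords, db):
    """Return (search_res, double_rhyme) for a word's list of coordinates.

    db maps word -> (list of coordinates, occurrence count).
    """
    last = target_coords[-1]
    last_two = target_coords[-2:]
    search_res, double_rhyme = [], []
    for word, (coords, count) in db.items():
        if coords[-1] != last:
            continue  # ending rhyme differs, not a match
        search_res.append((word, count))
        # A double rhyme needs two syllables that both match.
        if len(last_two) == 2 and coords[-2:] == last_two:
            double_rhyme.append(word)
    # Assumption: most frequently recorded words come first.
    search_res.sort(key=lambda item: item[1], reverse=True)
    return search_res, double_rhyme


# Toy database with hand-assigned coordinates (illustrative only).
toy_db = {
    "现在": ([(2, 1), (1, 0)], 41),
    "离开": ([(9, 0), (1, 0)], 29),
    "未来": ([(6, 0), (1, 0)], 22),
    "没来": ([(6, 0), (1, 0)], 8),
    "项目": ([(3, 1), (13, 0)], 5),
}
search_res, double_rhyme = search_rhymes([(6, 0), (1, 0)], toy_db)
print(search_res)
print(double_rhyme)
```

Comparing only the trailing coordinates keeps the check independent of word length, so a three-character word such as '静下来' can still match a two-character keyword on its ending rhyme.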

searchInterface.py

This file is the GUI of the Search Module.

Appendix

A Rhyme Coordinate is the position of a Chinese final (YunMu, 韵母) in the Mandarin Rhyme Table (普通话押韵表), which is listed below:

 1. 佳麻  a ia ua
 2. 开来  ai uai
 3. 先寒  an ian uan üan
 4. 江阳  ang iang uang
 5. 逍遥  ao iao
 6. 国歌  e o uo
 7. 灰微  ei ui
 8. 森林  en in un ün
 9. 冬青  eng ing ong iong
10. 希奇(儿)  i (er is merged into this group)
11. 诗词  i (whole-read syllables such as zhi, chi, shi, ri, zi, ci, si)
12. 别叠  ie (y)e
13. 忧愁  ou iu
14. 读书  u
15. 须臾  ü
16. 绝学  üe

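For example, the Rhyme Coordinate of 'ao' is (4,0): 'ao' is the first final (column 0) of the fifth group (row 4), since both indices start at 0.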
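
A minimal sketch of how the table can be turned into a coordinate lookup is shown below. The names RHYME_TABLE and final_to_coordinate are hypothetical, and two encoding choices are assumptions made only for this example: the whole-read 'i' of the eleventh group is written as "-i" to keep it distinct from the 'i' of the tenth group, and "(y)e" is stored as "ye".

```python
# One inner list per rhyme group, in the order of the table above.
RHYME_TABLE = [
    ["a", "ia", "ua"],
    ["ai", "uai"],
    ["an", "ian", "uan", "üan"],
    ["ang", "iang", "uang"],
    ["ao", "iao"],
    ["e", "o", "uo"],
    ["ei", "ui"],
    ["en", "in", "un", "ün"],
    ["eng", "ing", "ong", "iong"],
    ["i"],          # "er" is merged into this group
    ["-i"],         # assumption: "-i" marks the whole-read syllables
    ["ie", "ye"],   # assumption: "(y)e" is stored as "ye"
    ["ou", "iu"],
    ["u"],
    ["ü"],
    ["üe"],
]

# Map each final to its zero-based (row, column) position.
COORDINATES = {
    final: (row, col)
    for row, finals in enumerate(RHYME_TABLE)
    for col, final in enumerate(finals)
}


def final_to_coordinate(final):
    """Return the (row, column) coordinate of a final, or None if unknown."""
    return COORDINATES.get(final)


print(final_to_coordinate("ao"))
print(final_to_coordinate("iang"))
```

The lookup agrees with the analyzeSentence() example earlier: 'iang' in 项 gives (3, 1) and 'u' in 目 gives (13, 0).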

chinese-rhymer's People

Contributors

tang-tang-tang, 0-1cxh
