moonshile / chinesewordsegmentation

A Chinese word segmentation algorithm that requires no corpus

License: MIT License

Python 100.00%

chinesewordsegmentation's Introduction

ChineseWordSegmentation

Chinese word segmentation algorithm without corpus

Usage

from wordseg import WordSegment
doc = u'十四是十四四十是四十，十四不是四十，四十不是十四'
ws = WordSegment(doc, max_word_len=2, min_aggregation=1, min_entropy=0.5)
ws.segSentence(doc)

This generates the following words:

十四是十四四十是四十，十四不是四十，四十不是十四

In practice, doc should be a reasonably long document string for better results. In that case, min_aggregation should be set much higher than 1 (for example 50), and min_entropy should be set higher than 0.5 (for example 1.5).

Both the input and the output of this function are expected to be unicode strings, so decode any byte strings before passing them in.
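As a small illustration of the unicode requirement, here is a minimal sketch of a helper that decodes byte strings before they are handed to the segmenter. The name to_text and the UTF-8 default are assumptions made for this example; they are not part of wordseg.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper (not part of wordseg): make sure text is unicode.

def to_text(data, encoding="utf-8"):
    """Return data as a unicode string, decoding it first if it is bytes."""
    if isinstance(data, bytes):
        # Byte strings (for example, read from a file in binary mode)
        # are decoded; UTF-8 is only an assumed default.
        return data.decode(encoding)
    # Already a unicode string: return it unchanged.
    return data


raw = u"十四是十四".encode("utf-8")  # bytes, as they might come from disk
doc = to_text(raw)                   # unicode string, ready to pass as doc
```

On Python 3, str is already unicode, so the helper only matters for data read as bytes.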

WordSegment.segSentence takes an optional argument method, which can be WordSegment.L, WordSegment.S or WordSegment.ALL:

WordSegment.L: if a long word is found that is a combination of several shorter words, return only the long word.
WordSegment.S: return only the shorter words.
WordSegment.ALL: return both the long word and the shorter ones.

Reference

Thanks to Matrix67's article.

chinesewordsegmentation's Issues

return sorted(indexes,key=lambda m,n:doc[m:n]) raises an error

return sorted(indexes,key=lambda m,n:doc[m:n]) raises TypeError: <lambda>() missing 1 required positional argument: 'n'. I am using Python 3.6. I would appreciate your help. Thanks!
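The error comes from how sorted calls its key function: it passes each element as a single argument, so a two-parameter lambda is always one argument short. Python 2 allowed a tuple parameter (lambda (m, n): ...), but Python 3 removed that syntax. A minimal sketch of a Python 3 compatible version follows, assuming indexes is a list of (start, end) tuples; the function name and sample data are made up for illustration.

```python
# Hypothetical stand-alone version of the failing line; sample data is invented.

def sort_by_substring(doc, indexes):
    """Sort (start, end) pairs by the substring of doc they delimit."""
    # The key receives one tuple, so index into it instead of
    # declaring two lambda parameters.
    return sorted(indexes, key=lambda span: doc[span[0]:span[1]])


doc = "abcab"
indexes = [(2, 3), (1, 3), (3, 5), (0, 2)]  # "c", "bc", "ab", "ab"
ordered = sort_by_substring(doc, indexes)
# sorted is stable, so the two "ab" spans keep their input order.
```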

Error when running

Traceback (most recent call last):
  File "D:/Desktop/ChineseWordSegmentation-master/wordseg/wordseg.py", line 11, in <module>
    from . probability import entropyOfList
ImportError: attempted relative import with no known parent package
How can I fix this?
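This error usually appears when a file that uses relative imports is executed directly as a script, because Python then has no parent package to resolve the leading dot against. The sketch below reproduces the behaviour with a throwaway package; demo_pkg and its files are invented for the demonstration and have nothing to do with the repository's layout.

```python
import os
import subprocess
import sys
import tempfile


def run_both_ways():
    """Build a throwaway package and run one of its modules two ways."""
    with tempfile.TemporaryDirectory() as root:
        pkg = os.path.join(root, "demo_pkg")
        os.mkdir(pkg)
        open(os.path.join(pkg, "__init__.py"), "w").close()
        with open(os.path.join(pkg, "helper.py"), "w") as f:
            f.write("def value():\n    return 42\n")
        with open(os.path.join(pkg, "main.py"), "w") as f:
            f.write("from .helper import value\nprint(value())\n")

        # 1) Run the file directly: the relative import has no parent package.
        as_script = subprocess.run(
            [sys.executable, os.path.join(pkg, "main.py")],
            cwd=root, capture_output=True, text=True)
        # 2) Run it as a module from the directory that contains the package.
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.main"],
            cwd=root, capture_output=True, text=True)
        return as_script, as_module


as_script, as_module = run_both_ways()
print(as_script.returncode != 0)   # the direct run fails
print(as_module.stdout.strip())    # the module run prints the value
```

Applied to this project, that would presumably mean importing the package from a script outside it (as in the Usage section) or running python -m wordseg.wordseg from the directory that contains the wordseg folder, rather than executing wordseg.py directly.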

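Error in a special case when computing left/right entropy

In the function that computes entropy, if the list ls passed in contains only one element, that is, the left or right neighbor is a single word that occurs only once, a term in the list comprehension ends up with a zero denominator and an error is raised.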
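For reference, here is a minimal sketch of a Shannon entropy function over a list that handles the empty and single-element cases without dividing by zero. It is not the repository's entropyOfList; the name entropy_of_list and the use of the natural logarithm are assumptions for this example.

```python
import math
from collections import Counter


def entropy_of_list(items):
    """Shannon entropy (natural log) of the value frequencies in items."""
    total = len(items)
    if total == 0:
        # No neighbors at all: define the entropy as zero.
        return 0.0
    counts = Counter(items)
    # Each probability is count / total with total >= 1, so the
    # denominator is never zero; a single element gives p = 1, log(p) = 0.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


single = entropy_of_list([u"是"])        # one neighbor seen once
mixed = entropy_of_list([u"是", u"不"])  # two equally likely neighbors
```

A single-element list yields zero entropy here, which matches the intuition that one fixed neighbor carries no uncertainty.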

Are freqitem and hashtree actually unused?

Posted by mistake.

A question about an example

货拉拉拉不拉拉布拉多

How well does your model segment this sentence?

Efficiency with large amounts of data

Will it be very slow when the amount of data is large?

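With a doc input of several hundred thousand lines, even a machine with 100 GB of memory cannot run it

Our corpus has several hundred thousand lines and the file is about 1 GB. Passing this text as doc runs out of memory (OOM) right away. Is there a good way to handle this case?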

I don't understand how to use it

How do I use this? I have no idea at all.
