zhanzecheng / Chinese_segment_augment
A Python 3 implementation of new-word discovery using mutual information (PMI) and left/right entropy.
For long documents, the `add` step up front is very slow. I tried it in C#: adding a dictionary of child nodes to each node gives a noticeable speedup, at the cost of somewhat more memory. For reference:
if (node.DictChilds.ContainsKey(word))
{
    node = node.DictChilds[word];
}
else
{
    var newNode = new TrieNode(word);
    node.Childs.Add(newNode);
    node.DictChilds.Add(word, newNode);
    node = newNode;
}
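The same dict-based child lookup can be sketched in Python. The `TrieNode` fields below are assumptions for illustration, not the repo's exact class:

```python
class TrieNode:
    """Trie node that keeps children in a dict for O(1) lookup by character."""
    def __init__(self, char=''):
        self.char = char
        self.count = 0
        self.children = {}  # char -> TrieNode

def add(root, word):
    """Insert a word character by character, bumping each node's count."""
    node = root
    for ch in word:
        if ch in node.children:          # O(1) dict lookup instead of a
            node = node.children[ch]     # linear scan over a child list
        else:
            node.children[ch] = TrieNode(ch)
            node = node.children[ch]
        node.count += 1
    return node

root = TrieNode()
add(root, "abc")
add(root, "abd")
```

The dict replaces the linear scan over a child list, which is what dominates the cost on long documents.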
PMI = math.log(max(ch.count, 1), 2) - math.log(total, 2) - math.log(one_dict[child.char], 2) - math.log(one_dict[ch.char], 2)
Why does this not look the same as log2( P(X,Y) / (P(X) * P(Y)) )?
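If `one_dict` stores probabilities (count / total) rather than raw counts, the quoted line is algebraically identical to log2(P(X,Y) / (P(X) * P(Y))). A quick numeric check with toy counts (all numbers hypothetical, and the probability semantics of `one_dict` is an assumption):

```python
import math

# Toy counts: N total occurrences, c_x / c_y unigram counts,
# c_xy co-occurrence count of the bigram.
N = 1000
c_xy, c_x, c_y = 40, 100, 200

# Textbook PMI: log2( P(x,y) / (P(x) * P(y)) )
pmi_textbook = math.log((c_xy / N) / ((c_x / N) * (c_y / N)), 2)

# The quoted line, assuming one_dict holds probabilities p = count / N:
# log2(c_xy) - log2(N) - log2(p_x) - log2(p_y)
p_x, p_y = c_x / N, c_y / N
pmi_line = math.log(c_xy, 2) - math.log(N, 2) - math.log(p_x, 2) - math.log(p_y, 2)
```

Expanding the logs: log2(c_xy) - log2(N) - (log2(c_x) - log2(N)) - (log2(c_y) - log2(N)) = log2(c_xy) + log2(N) - log2(c_x) - log2(c_y), which is exactly the textbook form. If `one_dict` instead held raw counts, the line would be off by 2·log2(N).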
What is the purpose of the "load external word-frequency record dict.txt" step?
==>result[key] = (values[0] + min(left[d], right[d])) * values[1]
I can't understand what this step is doing. My understanding is that it should be enough to take the minimum of the left and right entropies as the value assigned here.
def find_word(self, N):
    # Search the trie for the mutual information of each bigram,
    # e.g. dict{ "a_b": (PMI, occurrence frequency), ... }
    bi = self.search_bi()
    # Search the trie for left and right entropy
    left = self.search_left()
    right = self.search_right()
    result = {}
    for key, values in bi.items():
        d = "".join(key.split('_'))
        # Scoring formula: score = (PMI + min(left entropy, right entropy)) * frequency
        # => the smaller the entropy, the more ordered the context, and the more
        #    likely this candidate recurs as a word!
        # PMI measures co-occurrence strength; values[0] is that PMI value.
        result[key] = (values[0] + min(left[d], right[d])) * values[1]
@zhanzecheng Thanks!!
Running it reports this error:
# python3 demo_run.py
Traceback (most recent call last):
  File "demo_run.py", line 44, in <module>
    stopwords = get_stopwords()
  File "/data/home/tengenli/Chinese_segment_augment/utils.py", line 13, in get_stopwords
    stopword = [line.strip() for line in f]
  File "/data/home/tengenli/Chinese_segment_augment/utils.py", line 13, in <listcomp>
    stopword = [line.strip() for line in f]
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
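This error means the stopword file is being decoded with the locale default codec (ASCII here) instead of UTF-8. Passing `encoding='utf-8'` to `open()` in `utils.py` avoids it; a minimal sketch (the helper name and file path below are illustrative, not the repo's exact code):

```python
import os
import tempfile

def get_stopwords(path):
    # Open with an explicit encoding instead of relying on the locale default,
    # so the file decodes correctly even when LANG/LC_ALL is unset.
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f}

# Demo: round-trip a small UTF-8 stopword file.
tmp = os.path.join(tempfile.mkdtemp(), 'stopword.txt')
with open(tmp, 'w', encoding='utf-8') as f:
    f.write('的\n了\n和\n')
stopwords = get_stopwords(tmp)
```

Alternatively, exporting a UTF-8 locale (e.g. `LANG=en_US.UTF-8`) before running changes the default codec, but the explicit `encoding` argument is the robust fix.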
Can't mutual information and left/right entropy be computed from the corpus alone? Why is an external word list needed?
Could someone explain why this happens?
It becomes very time-consuming once the text gets long.
Hello, why do the left and right entropies come out as almost all zeros when I run it?
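For context: the left (or right) entropy of a candidate is the Shannon entropy of the distribution of characters adjacent to it on that side, so a candidate that always appears with the same neighbor gets entropy 0, which is common on small corpora. A minimal sketch (the neighbor counts are hypothetical):

```python
import math
from collections import Counter

def neighbor_entropy(neighbor_counts):
    """Shannon entropy (base 2) of a neighbor-character count distribution."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total, 2)
                for c in neighbor_counts.values())

# Two equally likely left neighbors -> 1 bit of entropy.
h_two = neighbor_entropy(Counter({'天': 5, '明': 5}))
# A single left neighbor every time -> 0 bits.
h_one = neighbor_entropy(Counter({'天': 10}))
```

So all-zero entropies usually mean each candidate was only ever seen in one context, e.g. the corpus is too small or neighbor counting is broken.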
I've also written something similar myself. I've tried several ways of combining left/right entropy with mutual information, including weighted sums and ratios with various parameters, but none were satisfactory. Is there a better method?
There are two main problems:
Suppose there are two word sequences, [a,b,c] and [b,c,a]. When computing left entropy, [a,b,c] gets converted to b->c->a and stored in the tree; when [b,c,a] is stored in order, it also becomes b->c->a in the tree. Then the left-entropy computation for "bc" is problematic: the count for "a" gets incremented one time too many.
The PMI computation produces results that violate mathematical rules.
What is actually being computed is the conditional probability P(w2|w1) = p(word) / p(w1) = I * p(w2), i.e. finding a globally comparable conditional probability between characters to obtain the mapping probability w1->w2, preferring pairs where that probability is high. It's a greedy algorithm: the selected words have high probability, learned from the corpus, which can better improve the accuracy of a statistical language model.
The so-called "degree of freedom" is actually a rather ill-defined notion. For example, if a word repeatedly appears at the end of sentences, should its right entropy really come out as 0?
P(S) = P(word_1) P(word_2|word_1) P(word_3|word_2) ... P(word_n|word_{n-1})
I think training a Bayesian model of this kind, and then using the model's results on the training corpus to obtain a confidence score for candidate words, might be of more practical value. Put simply, for p(w2) you are mapping it through left/right entropy, and I have serious doubts about whether that is sound.
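The chain-rule decomposition above can be estimated directly from bigram counts with MLE. A minimal sketch, where all counts and words are toy values invented for illustration:

```python
import math

# Toy corpus statistics (hypothetical): unigram and bigram counts.
unigram = {'我': 100, '爱': 40, '你': 60}
bigram = {('我', '爱'): 20, ('爱', '你'): 10}
TOTAL = 200  # total unigram tokens in the toy corpus

def sentence_logprob(words):
    """log P(S) = log P(w1) + sum_i log P(w_i | w_{i-1}), MLE estimates:
    P(w1) = count(w1)/TOTAL, P(w_i|w_{i-1}) = count(w_{i-1}, w_i)/count(w_{i-1})."""
    logp = math.log(unigram[words[0]] / TOTAL)
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram[(prev, cur)] / unigram[prev])
    return logp

lp = sentence_logprob(['我', '爱', '你'])
```

Here P(S) = 0.5 * 0.2 * 0.25 = 0.025 for the toy numbers; a real system would add smoothing for unseen bigrams.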