
chinese_sentiment's Introduction

Chinese Sentiment Analysis

Chinese sentiment analysis is in essence a text-classification problem. This project tackles the task with two models, a CNN and a BI-LSTM, and applies them to sentiment analysis with good results. Trained on a small dataset, both models reach close to 90% precision, recall, and F1 on the validation set.

The project is designed to support a variety of classification tasks over different corpora: once a corpus is prepared in the expected format, you can proceed to hyperparameter tuning, training, export, and serving.

Code environment

Works fine under python3.6 & TensorFlow 1.13.

Other environments may also work, but they have not been tested.

The scikit-learn package is also required, for computing metrics such as precision, recall, and F1.

Corpus preparation

The corpus is Tan Songbo's hotel-review corpus, with 2000 positive and 2000 negative examples. It is a relatively small dataset; the raw corpus ships with the project at data/hotel_comment/raw_data/corpus.zip.

After unzipping corpus.zip, run the following inside raw_data:

python fix_corpus.py

This converts the original gb2312-encoded files to utf-8.
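The conversion that fix_corpus.py performs can be sketched as follows (a minimal illustration under the assumption that the script simply re-encodes each file; the helper name `to_utf8` is hypothetical):

```python
from pathlib import Path

def to_utf8(src: Path, dst: Path) -> None:
    """Re-encode one gb2312/gbk file as utf-8."""
    raw = src.read_bytes()
    try:
        text = raw.decode("gb2312")
    except UnicodeDecodeError:
        # gbk is a superset of gb2312; fall back, dropping undecodable bytes
        text = raw.decode("gbk", errors="ignore")
    dst.write_text(text, encoding="utf-8")
```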

Word-vector preparation

This project uses the open-source chinese-word-vectors embeddings.

We use the word vectors trained on the Zhihu corpus. The download link chosen for this project is https://pan.baidu.com/s/1OQ6fQLCgqT43WTwh5fh_lg (a Baidu cloud download). Download the archive, unzip it, and place the file directly in the project root.
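The chinese-word-vectors release is distributed in the word2vec text format: a header line "&lt;count&gt; &lt;dim&gt;" followed by one "&lt;word&gt; &lt;v1&gt; ... &lt;vdim&gt;" line per word. A minimal loader sketch, assuming that format (the function name is illustrative):

```python
import numpy as np

def load_text_vectors(path, limit=None):
    """Parse a word2vec-style text embeddings file into {word: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        _count, dim = (int(x) for x in f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break  # the full file is huge; allow loading only a prefix
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = np.asarray(parts[1:dim + 1], dtype=np.float32)
    return vectors
```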

Training-data format

See the data/hotel_comment/*.txt files.

  • step1

The data are split into a training set and a validation set at a ratio of 4:1: the 4000 samples are divided into 3200 training samples and 800 validation samples.

For both the training and validation sets, the data follow this format: in the {}.words.txt file, each line is the input for one sample, one review per line, segmented with jieba and with tokens separated by spaces:

除了 地段 可以 , 其他 是 一塌糊涂 , 惨不忍睹 。 和 招待所 差不多 。
帮 同事 订 的 酒店 , 他 老兄 刚 从 东莞 回来 , 详细 地问 了 一下 他 对 粤海 酒店 的 印象 , 说 是 硬件 和 软件 : 极好 ! 所以 表扬 一下

In the {}.labels.txt file, each line is the label for one sample:

NEG
POS

In this project, run build_data.py in the data/hotel_comment directory to produce files in this format.
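The 4:1 split described in step 1 can be sketched like this (an illustration only; build_data.py's actual shuffling and file handling may differ):

```python
import random

def split_corpus(samples, labels, ratio=0.8, seed=42):
    """Shuffle paired samples/labels together, then split 4:1
    into a training set and a validation set."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * ratio)
    train = [(samples[i], labels[i]) for i in idx[:cut]]
    valid = [(samples[i], labels[i]) for i in idx[cut:]]
    return train, valid
```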

  • step2

Because this project uses index_table_from_file to map tokens to ids, two files are needed to describe the vocabulary and the label set: vocab.words.txt and vocab.labels.txt, where each line holds one word or one label respectively.

In this project, run build_vocab.py in the data/hotel_comment directory to produce these files.
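What build_vocab.py produces can be sketched as follows (an assumption about its behavior; the real script's token ordering and any frequency threshold may differ):

```python
from collections import Counter

def build_vocab(words_files, min_count=1):
    """Collect every token appearing at least `min_count` times across
    the {}.words.txt files; vocab.words.txt then holds one token per line."""
    counts = Counter()
    for path in words_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    return [w for w, c in counts.items() if c >= min_count]
```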

  • step3

Because the downloaded embeddings file is very large, we extract only the vectors for tokens that actually occur in the training corpus; the result is the data/hotel_comment/w2v.npz file.

In this project, run build_embeddings.py in the data/hotel_comment directory to produce this file.
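The extraction step can be sketched as follows (illustrative only; the real build_embeddings.py, and the array key stored inside w2v.npz, may differ):

```python
import numpy as np

def extract_embeddings(vocab, pretrained, out_path, dim=300):
    """Keep only the pretrained vectors for words in `vocab`;
    out-of-vocabulary rows stay zero. Saved in .npz format."""
    mat = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in pretrained:
            mat[i] = pretrained[word]
    np.savez_compressed(out_path, embeddings=mat)
    return mat
```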

Model 1: CNN

Architecture:

  1. Chinese word embedding
  2. Several fixed-width convolution kernels of different lengths
  3. Max-pooling layer: only the single maximum value of each filter's output is kept
  4. Fully connected layer

(screenshot) The figure comes from the paper https://arxiv.org/abs/1408.5882. Unlike the paper, which combines one pre-trained embedding and one untrained embedding into a two-channel input (analogous to channels in an image), this project uses a single channel with one pre-trained embedding.
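The core of this architecture, convolutions of several widths followed by max-over-time pooling, can be illustrated in plain NumPy (a conceptual sketch, not the project's TensorFlow code):

```python
import numpy as np

def conv_max_pool(embedded, filters):
    """Kim-style text-CNN features.
    embedded: (seq_len, dim) matrix of word vectors for one sentence.
    filters:  list of (width, dim) kernels of different widths.
    Returns one max-pooled value per filter."""
    feats = []
    for kernel in filters:
        width = kernel.shape[0]
        # slide the kernel over every window of `width` consecutive words
        scores = [float(np.sum(embedded[i:i + width] * kernel))
                  for i in range(embedded.shape[0] - width + 1)]
        feats.append(max(scores))  # max pooling keeps only the strongest response
    return np.array(feats)
```

A real model learns many kernels per width and feeds the concatenated pooled features to the fully connected layer.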

To train the CNN model, run the following in the cnn directory:

python main.py

CNN training time

Roughly 2 minutes on a GTX 1060 6G.

CNN training results

Run the following in the model directory:

python score_report.py cnn/results/score/eval.preds.txt

Output:

              precision    recall  f1-score   support

         POS       0.91      0.87      0.89       400
         NEG       0.88      0.91      0.89       400

   micro avg       0.89      0.89      0.89       800
   macro avg       0.89      0.89      0.89       800
weighted avg       0.89      0.89      0.89       800
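The table above has the shape of scikit-learn's classification_report; a script such as score_report.py can produce it from the gold and predicted labels (a sketch of that call only; reading the prediction file is omitted):

```python
from sklearn.metrics import classification_report

y_true = ["POS", "NEG", "POS", "NEG"]  # gold labels
y_pred = ["POS", "NEG", "NEG", "NEG"]  # model predictions, e.g. parsed from eval.preds.txt
print(classification_report(y_true, y_pred, labels=["POS", "NEG"]))
```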

Model 2: BI-LSTM

  1. Chinese word embedding
  2. bi-lstm
  3. Fully connected layer

(screenshot)
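The bidirectional idea, encoding the sequence once left-to-right and once right-to-left and concatenating the two final states, can be sketched with a simple tanh RNN cell standing in for the LSTM (a conceptual illustration, not the project's TensorFlow code):

```python
import numpy as np

def rnn_final_state(seq, W, U, b):
    """Run a simple tanh RNN over seq (seq_len, in_dim); return the final state."""
    h = np.zeros(U.shape[0])
    for x in seq:
        h = np.tanh(W @ x + U @ h + b)
    return h

def bi_rnn_encode(seq, fwd_params, bwd_params):
    """Bidirectional encoding: one pass forward, one over the reversed
    sequence, final states concatenated -- as the BI-LSTM layer does,
    with LSTM cells in place of the tanh cell used here."""
    return np.concatenate([rnn_final_state(seq, *fwd_params),
                           rnn_final_state(seq[::-1], *bwd_params)])
```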

To train the BI-LSTM model, run the following in the lstm directory:

python main.py

BI-LSTM training time

Roughly 5 minutes on a GTX 1060 6G.

BI-LSTM training results

Run the following in the model directory:

python score_report.py lstm/results/score/eval.preds.txt

Output:

              precision    recall  f1-score   support

         POS       0.90      0.87      0.88       400
         NEG       0.87      0.91      0.89       400

   micro avg       0.89      0.89      0.89       800
   macro avg       0.89      0.89      0.89       800
weighted avg       0.89      0.89      0.89       800

Model export and serving (BI-LSTM as the example)

Model export

Run the following in the lstm directory:

python export.py

This exports the estimator inference graph, which can be used for prediction. A saved_model is already included in this repository, so you can run tests without training first.

Run python serve.py in the model/lstm directory to perform sentiment prediction with the exported model. See the code for details.

Test results

(screenshot)

Although the model was trained on real reviews of widely varying length (some exceed 1000 tokens after segmentation), the screenshot above shows that it performs reasonably well on short reviews.

References

[1] http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

[2] https://arxiv.org/abs/1408.5882

chinese_sentiment's People

Contributors

dependabot[bot], linguishi

chinese_sentiment's Issues

Where do the log messages printed during model prediction come from?

2023-10-18 17:24:37.709628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
2023-10-18 17:24:37.716526: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2023-10-18 17:24:37.721962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2023-10-18 17:24:37.728865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2023-10-18 17:24:37.735798: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2023-10-18 17:24:37.747056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2023-10-18 17:24:37.752686: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2023-10-18 17:24:37.764038: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2023-10-18 17:24:37.769760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2023-10-18 17:24:37.776367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-18 17:24:37.783494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2023-10-18 17:24:37.788907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2023-10-18 17:24:37.793320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4831 MB memory) -> physical GPU

Which module emits these messages? I searched all over and could not find it.

Hello

I have recently been studying sentiment analysis. When running your code I reached the last step but could not find the model-export script save_model.py. Could you please upload it?


Hello, may I ask how to solve the overfitting phenomenon?

Error when running the module

Lines 72 and 73 of main.py: embeddings = tf.nn.embedding_lookup(w2v_var, word_ids)
A module used by nn.embedding_lookup has been deprecated; when running on CPU it raises an error instead of returning a zero tensor. What should I do?
InvalidArgumentError (see above for traceback): indices[18,130] = 21492 is not in [0, 3621)
[[node embedding_lookup (defined at main.py:73) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT64, Tparams=DT_FLOAT, _device="/
job:localhost/replica:0/task:0/device:CPU:0"](Variable/read, string_to_index_Lookup, embedding_lookup/axis)]]

score_report.py: error: the following arguments are required: file

Hello, when running score_report.py in the CNN directory I got the following error:
usage: score_report.py [-h] file
score_report.py: error: the following arguments are required: file
How can this be resolved? Thanks.

Also, when training with main.py in the CNN directory, the final accuracy is very low.
Saving dict for global step 16074: acc = 0.610625, global_step = 16074, loss = 0.6398508, precision = 0.5851781, recall = 0.76
Saving 'checkpoint_path' summary for global step 16074: results/model\model.ckpt-16074
Loss for final step: 0.586049.

Hello, running python build_embedings.py raises an error; how can it be fixed?

C:\ai\chinese_sentiment\venv\Scripts\python.exe C:\ai\chinese_sentiment\data\hotel_comment\build_embedings.py
Traceback (most recent call last):
File "C:\ai\chinese_sentiment\data\hotel_comment\build_embedings.py", line 6, in
word_to_idx = {line.strip(): idx for idx, line in enumerate(f)}
File "C:\ai\chinese_sentiment\data\hotel_comment\build_embedings.py", line 6, in
word_to_idx = {line.strip(): idx for idx, line in enumerate(f)}
File "C:\Program Files\Python38\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 3382: invalid continuation byte

Process finished with exit code 1

AttributeError: 'str' object has no attribute 'decode'

Traceback (most recent call last):
File "···/chinese_sentiment-master/data/hotel_comment/raw_data/fix_coupus.py", line 30, in
fix_corpus(POS, FIX_POS)
File "···/chinese_sentiment-master/data/hotel_comment/raw_data/fix_coupus.py", line 15, in fix_corpus
fix_s = s.decode('gb2312')
AttributeError: 'str' object has no attribute 'decode'
The code is as follows:
def fix_corpus(dir_s, dir_t):
    for item in os.listdir(dir_s):
        with open(os.path.join(dir_s, item), 'r',) as f:
            try:
                s = f.read()
                fix_s = s.decode('gb2312')
            except UnicodeDecodeError:
                try:
                    fix_s = s.decode('gbk')
                except UnicodeDecodeError:
                    fix_s = s.decode('gb2312', errors='ignore')
            with codecs.open(os.path.join(dir_t, item), 'w', encoding='utf8') as ff:
                ff.write(fix_s)

TensorFlow version issue

TensorFlow no longer ships a 1.x release, but this code still targets 1.x. After switching to the .compat.v1 style, part of the code still fails with AttributeError: module 'tensorflow._api.v2.compat.v1.compat.v1' has no attribute 'estimator'.
How can this be solved?
