
chinese_sentiment's Introduction

Chinese Sentiment Analysis

Chinese sentiment analysis is in essence a text-classification problem. This project tackles the task with two models, a CNN and a BI-LSTM, and applies them to sentiment analysis with good results. Trained on a small dataset, both models reach close to 90% precision, recall, and F1 on the validation set.

The project is designed to support a variety of classification tasks over different corpora: once a corpus is prepared in the expected format, you can proceed to hyperparameter tuning, training, export, and serving.

Code environment

Works fine under python3.6 & TensorFlow 1.13.

Other environments may also work, but they have not been tested.

The scikit-learn package is also required, for computing metrics such as precision, recall, and F1.

Corpus preparation

The corpus is Tan Songbo's hotel-review corpus, with 2000 positive and 2000 negative examples. It is a relatively small dataset; the raw corpus ships with the project at data/hotel_comment/raw_data/corpus.zip.

After unzipping corpus.zip, run the following inside raw_data:

python fix_corpus.py

This converts the original gb2312-encoded files to utf-8.
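The conversion that fix_corpus.py performs can be sketched as follows (a minimal illustration under the assumption that the script simply re-encodes each file; the helper name `to_utf8` is hypothetical):

```python
from pathlib import Path

def to_utf8(src: Path, dst: Path) -> None:
    """Re-encode one gb2312/gbk file as utf-8."""
    raw = src.read_bytes()
    try:
        text = raw.decode("gb2312")
    except UnicodeDecodeError:
        # gbk is a superset of gb2312; fall back, dropping undecodable bytes
        text = raw.decode("gbk", errors="ignore")
    dst.write_text(text, encoding="utf-8")
```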

Word-vector preparation

This project uses the open-source chinese-word-vectors embeddings.

We use the word vectors trained on the Zhihu corpus. The download link chosen for this project is https://pan.baidu.com/s/1OQ6fQLCgqT43WTwh5fh_lg (a Baidu cloud download). Download the archive, unzip it, and place the file directly in the project root.
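The chinese-word-vectors release is distributed in the word2vec text format: a header line "&lt;count&gt; &lt;dim&gt;" followed by one "&lt;word&gt; &lt;v1&gt; ... &lt;vdim&gt;" line per word. A minimal loader sketch, assuming that format (the function name is illustrative):

```python
import numpy as np

def load_text_vectors(path, limit=None):
    """Parse a word2vec-style text embeddings file into {word: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        _count, dim = (int(x) for x in f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break  # the full file is huge; allow loading only a prefix
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = np.asarray(parts[1:dim + 1], dtype=np.float32)
    return vectors
```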

Training-data format

See the data/hotel_comment/*.txt files.

  • step1

The data are split into a training set and a validation set at a ratio of 4:1: the 4000 samples are divided into 3200 training samples and 800 validation samples.

For both the training and validation sets, the data follow this format: in the {}.words.txt file, each line is the input for one sample, one review per line, segmented with jieba and with tokens separated by spaces:

除了 地段 可以 , 其他 是 一塌糊涂 , 惨不忍睹 。 和 招待所 差不多 。
帮 同事 订 的 酒店 , 他 老兄 刚 从 东莞 回来 , 详细 地问 了 一下 他 对 粤海 酒店 的 印象 , 说 是 硬件 和 软件 : 极好 ! 所以 表扬 一下

In the {}.labels.txt file, each line is the label for one sample:

NEG
POS

In this project, run build_data.py in the data/hotel_comment directory to produce files in this format.
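The 4:1 split described in step 1 can be sketched like this (an illustration only; build_data.py's actual shuffling and file handling may differ):

```python
import random

def split_corpus(samples, labels, ratio=0.8, seed=42):
    """Shuffle paired samples/labels together, then split 4:1
    into a training set and a validation set."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * ratio)
    train = [(samples[i], labels[i]) for i in idx[:cut]]
    valid = [(samples[i], labels[i]) for i in idx[cut:]]
    return train, valid
```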

  • step2

Because this project uses index_table_from_file to map tokens to ids, two files are needed to describe the vocabulary and the label set: vocab.words.txt and vocab.labels.txt, where each line holds one word or one label respectively.

In this project, run build_vocab.py in the data/hotel_comment directory to produce these files.
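What build_vocab.py produces can be sketched as follows (an assumption about its behavior; the real script's token ordering and any frequency threshold may differ):

```python
from collections import Counter

def build_vocab(words_files, min_count=1):
    """Collect every token appearing at least `min_count` times across
    the {}.words.txt files; vocab.words.txt then holds one token per line."""
    counts = Counter()
    for path in words_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    return [w for w, c in counts.items() if c >= min_count]
```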

  • step3

Because the downloaded embeddings file is very large, we extract only the vectors for tokens that actually occur in the training corpus; the result is the data/hotel_comment/w2v.npz file.

In this project, run build_embeddings.py in the data/hotel_comment directory to produce this file.
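The extraction step can be sketched as follows (illustrative only; the real build_embeddings.py, and the array key stored inside w2v.npz, may differ):

```python
import numpy as np

def extract_embeddings(vocab, pretrained, out_path, dim=300):
    """Keep only the pretrained vectors for words in `vocab`;
    out-of-vocabulary rows stay zero. Saved in .npz format."""
    mat = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in pretrained:
            mat[i] = pretrained[word]
    np.savez_compressed(out_path, embeddings=mat)
    return mat
```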

Model 1: CNN

Architecture:

  1. Chinese word embedding
  2. Several fixed-width convolution kernels of different lengths
  3. Max-pooling layer: only the single maximum value of each filter's output is kept
  4. Fully connected layer

(screenshot) The figure comes from the paper https://arxiv.org/abs/1408.5882. Unlike the paper, which combines one pre-trained embedding and one untrained embedding into a two-channel input (analogous to channels in an image), this project uses a single channel with one pre-trained embedding.
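The core of this architecture, convolutions of several widths followed by max-over-time pooling, can be illustrated in plain NumPy (a conceptual sketch, not the project's TensorFlow code):

```python
import numpy as np

def conv_max_pool(embedded, filters):
    """Kim-style text-CNN features.
    embedded: (seq_len, dim) matrix of word vectors for one sentence.
    filters:  list of (width, dim) kernels of different widths.
    Returns one max-pooled value per filter."""
    feats = []
    for kernel in filters:
        width = kernel.shape[0]
        # slide the kernel over every window of `width` consecutive words
        scores = [float(np.sum(embedded[i:i + width] * kernel))
                  for i in range(embedded.shape[0] - width + 1)]
        feats.append(max(scores))  # max pooling keeps only the strongest response
    return np.array(feats)
```

A real model learns many kernels per width and feeds the concatenated pooled features to the fully connected layer.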

To train the CNN model, run the following in the cnn directory:

python main.py

CNN training time

Roughly 2 minutes on a GTX 1060 6G.

CNN training results

Run the following in the model directory:

python score_report.py cnn/results/score/eval.preds.txt

Output:

              precision    recall  f1-score   support

         POS       0.91      0.87      0.89       400
         NEG       0.88      0.91      0.89       400

   micro avg       0.89      0.89      0.89       800
   macro avg       0.89      0.89      0.89       800
weighted avg       0.89      0.89      0.89       800
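The table above has the shape of scikit-learn's classification_report; a script such as score_report.py can produce it from the gold and predicted labels (a sketch of that call only; reading the prediction file is omitted):

```python
from sklearn.metrics import classification_report

y_true = ["POS", "NEG", "POS", "NEG"]  # gold labels
y_pred = ["POS", "NEG", "NEG", "NEG"]  # model predictions, e.g. parsed from eval.preds.txt
print(classification_report(y_true, y_pred, labels=["POS", "NEG"]))
```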

Model 2: BI-LSTM

  1. Chinese word embedding
  2. bi-lstm
  3. Fully connected layer

(screenshot)
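The bidirectional idea, encoding the sequence once left-to-right and once right-to-left and concatenating the two final states, can be sketched with a simple tanh RNN cell standing in for the LSTM (a conceptual illustration, not the project's TensorFlow code):

```python
import numpy as np

def rnn_final_state(seq, W, U, b):
    """Run a simple tanh RNN over seq (seq_len, in_dim); return the final state."""
    h = np.zeros(U.shape[0])
    for x in seq:
        h = np.tanh(W @ x + U @ h + b)
    return h

def bi_rnn_encode(seq, fwd_params, bwd_params):
    """Bidirectional encoding: one pass forward, one over the reversed
    sequence, final states concatenated -- as the BI-LSTM layer does,
    with LSTM cells in place of the tanh cell used here."""
    return np.concatenate([rnn_final_state(seq, *fwd_params),
                           rnn_final_state(seq[::-1], *bwd_params)])
```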

To train the BI-LSTM model, run the following in the lstm directory:

python main.py

BI-LSTM training time

Roughly 5 minutes on a GTX 1060 6G.

BI-LSTM training results

Run the following in the model directory:

python score_report.py lstm/results/score/eval.preds.txt

Output:

              precision    recall  f1-score   support

         POS       0.90      0.87      0.88       400
         NEG       0.87      0.91      0.89       400

   micro avg       0.89      0.89      0.89       800
   macro avg       0.89      0.89      0.89       800
weighted avg       0.89      0.89      0.89       800

Model export and serving (BI-LSTM as the example)

Model export

Run the following in the lstm directory:

python export.py

This exports the estimator inference graph, which can be used for prediction. A saved_model is already included in this repository, so you can run tests without training first.

Run python serve.py in the model/lstm directory to perform sentiment prediction with the exported model. See the code for details.

Test results

(screenshot)

Although the model was trained on real reviews of widely varying length (some exceed 1000 tokens after segmentation), the screenshot above shows that it performs reasonably well on short reviews.

References

[1] http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

[2] https://arxiv.org/abs/1408.5882

chinese_sentiment's People

Contributors

dependabot[bot], linguishi

chinese_sentiment's Issues

Where do the log messages printed during model prediction come from?

2023-10-18 17:24:37.709628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
2023-10-18 17:24:37.716526: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2023-10-18 17:24:37.721962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2023-10-18 17:24:37.728865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2023-10-18 17:24:37.735798: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2023-10-18 17:24:37.747056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2023-10-18 17:24:37.752686: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2023-10-18 17:24:37.764038: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2023-10-18 17:24:37.769760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2023-10-18 17:24:37.776367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-18 17:24:37.783494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2023-10-18 17:24:37.788907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2023-10-18 17:24:37.793320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4831 MB memory) -> physical GPU

Which module emits these messages? I searched all over and could not find it.

Hello

I have recently been studying sentiment analysis. When running your code I reached the last step but could not find the model-export script save_model.py. Could you please upload it?


Hello, may I ask how to solve the overfitting phenomenon?

Error when running the module

Lines 72 and 73 of main.py: embeddings = tf.nn.embedding_lookup(w2v_var, word_ids)
A module used by nn.embedding_lookup has been deprecated; when running on CPU it raises an error instead of returning a zero tensor. What should I do?
InvalidArgumentError (see above for traceback): indices[18,130] = 21492 is not in [0, 3621)
[[node embedding_lookup (defined at main.py:73) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT64, Tparams=DT_FLOAT, _device="/
job:localhost/replica:0/task:0/device:CPU:0"](Variable/read, string_to_index_Lookup, embedding_lookup/axis)]]

score_report.py: error: the following arguments are required: file

Hello, when running score_report.py in the CNN directory I got the following error:
usage: score_report.py [-h] file
score_report.py: error: the following arguments are required: file
How can this be resolved? Thanks.

Also, when training with main.py in the CNN directory, the final accuracy is very low.
Saving dict for global step 16074: acc = 0.610625, global_step = 16074, loss = 0.6398508, precision = 0.5851781, recall = 0.76
Saving 'checkpoint_path' summary for global step 16074: results/model\model.ckpt-16074
Loss for final step: 0.586049.

Hello, running python build_embedings.py raises an error; how can it be fixed?

C:\ai\chinese_sentiment\venv\Scripts\python.exe C:\ai\chinese_sentiment\data\hotel_comment\build_embedings.py
Traceback (most recent call last):
File "C:\ai\chinese_sentiment\data\hotel_comment\build_embedings.py", line 6, in
word_to_idx = {line.strip(): idx for idx, line in enumerate(f)}
File "C:\ai\chinese_sentiment\data\hotel_comment\build_embedings.py", line 6, in
word_to_idx = {line.strip(): idx for idx, line in enumerate(f)}
File "C:\Program Files\Python38\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 3382: invalid continuation byte

Process finished with exit code 1

AttributeError: 'str' object has no attribute 'decode'

Traceback (most recent call last):
File "···/chinese_sentiment-master/data/hotel_comment/raw_data/fix_coupus.py", line 30, in
fix_corpus(POS, FIX_POS)
File "···/chinese_sentiment-master/data/hotel_comment/raw_data/fix_coupus.py", line 15, in fix_corpus
fix_s = s.decode('gb2312')
AttributeError: 'str' object has no attribute 'decode'
The code is as follows:
def fix_corpus(dir_s, dir_t):
    for item in os.listdir(dir_s):
        with open(os.path.join(dir_s, item), 'r',) as f:
            try:
                s = f.read()
                fix_s = s.decode('gb2312')
            except UnicodeDecodeError:
                try:
                    fix_s = s.decode('gbk')
                except UnicodeDecodeError:
                    fix_s = s.decode('gb2312', errors='ignore')
            with codecs.open(os.path.join(dir_t, item), 'w', encoding='utf8') as ff:
                ff.write(fix_s)

TensorFlow version issue

TensorFlow no longer ships a 1.x release, but this code still targets 1.x. After switching to the .compat.v1 style, part of the code still fails with AttributeError: module 'tensorflow._api.v2.compat.v1.compat.v1' has no attribute 'estimator'.
How can this be solved?
