Chinese-corpus-sentiment-data-analysis

├───.gitignore 
├───README.md
├───requirements.txt
├───.vscode
│    └── settings.json  
├───res
│   ├───datanew
│   │   ├───neg
│   │   └───pos
│   └───word-vector
│       └───sgns.zhihu.bigram.bz2
├───src
│   └───run.py
└───tmp
    └───weights.hdf5

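1 Overview

  • Chinese text sentiment analysis on Tan Songbo's hotel review dataset, framed as a binary classification problem
  • The dataset has two labels, pos and neg, with 2,000 txt files each
  • Models: RNN, LSTM, and Bi-LSTM, built and trained with Keras
  • Main package versions: TensorFlow 2.0.0, Keras 2.3.1, and Python 3.6.2
  • Consistently reaches 92% accuracy on the test set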

2 Setup

  • Clone the repo: git clone https://github.com/lunarwhite/Chinese-corpus-sentiment-data-analysis.git
  • Upgrade pip: pip3 install --upgrade pip
  • Create a virtual environment for the project: conda create --name <env_name> python=3.6
  • Activate the env: conda activate <env_name>
  • Install the Python dependencies: pip3 install -r requirements.txt
  • Download the pre-trained Chinese word vectors (this project uses Zhihu_QA Word + Ngram) and place the file in the res/word-vector directory
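
To check that the downloaded word-vector file is readable, a small loader can be sketched with the standard library alone. This is an illustration rather than the loading code in src/run.py: it assumes the .bz2 archive holds word2vec-style text (a header line with the vocabulary size and dimension, then one word and its values per line), and the function name load_word_vectors and its limit parameter are hypothetical.

```python
import bz2


def load_word_vectors(path, limit=None):
    """Read a bz2-compressed word2vec-style text file into a dict.

    Assumed layout (not confirmed by this README): the first line is
    "<vocab_size> <dim>", every following line is "<word> v1 v2 ... v_dim".
    """
    vectors = {}
    with bz2.open(path, mode="rt", encoding="utf-8", errors="ignore") as f:
        header = f.readline().split()
        dim = int(header[1])
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines (e.g. tokens containing spaces)
            vectors[parts[0]] = [float(x) for x in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break  # stop early to keep memory use small
    return vectors, dim
```

Calling load_word_vectors("res/word-vector/sgns.zhihu.bigram.bz2", limit=50000) would then return the first 50,000 entries plus the vector dimension, provided the file follows the assumed layout.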

3 Running

  • Run: python src/run.py
  • Tune: edit the common parameters in src/run.py, shown below
    my_lr = 1e-2 # initial learning rate
    my_test_size = 0.1 # test set fraction
    my_validation_split = 0.1 # validation set fraction
    my_epochs = 40 # number of training epochs
    my_batch_size = 128 # batch size
    my_dropout = 0.2 # dropout rate
    
    my_optimizer = Nadam(lr=my_lr) # optimizer
    my_loss = 'binary_crossentropy' # loss function
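
As a rough guide to what these settings mean for the 4,000-review dataset, the split sizes can be worked out by hand. The sketch below assumes the test set is held out first and the validation fraction is then taken from the remaining data; that ordering, and the helper name split_sizes, are assumptions rather than something read from src/run.py.

```python
import math


def split_sizes(n_samples, test_size, validation_split, batch_size):
    """Return (n_train, n_val, n_test, steps_per_epoch) for a two-stage split."""
    n_test = round(n_samples * test_size)          # held-out test set
    n_remaining = n_samples - n_test
    n_val = round(n_remaining * validation_split)  # taken from the remainder
    n_train = n_remaining - n_val
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, n_test, steps_per_epoch


# 2,000 pos + 2,000 neg reviews with the default parameters listed above
n_train, n_val, n_test, steps = split_sizes(4000, 0.1, 0.1, 128)
print(n_train, n_val, n_test, steps)  # 3240 360 400 26
```

Under these assumptions each epoch runs 26 batches over 3,240 training reviews, so raising my_batch_size or my_test_size changes the number of weight updates per epoch.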

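4 Workflow

  • Inspect the data
    • dataset size
    • sample examples
    • sample length
  • Preprocess the data
    • tokenize (word segmentation)
    • pad short sentences, truncate long ones
    • convert tokens to indices
    • build the word-embedding matrix
  • Build the models
    • RNN
    • LSTM
    • Bi-LSTM
  • Visualize
    • epochs-loss
    • epochs-accuracy
  • Debug
    • callback
    • checkpoint
  • Improve the model
    • loss function
    • optimizer
    • learning rate
    • epochs
    • batch_size
    • dropout
    • early-stopping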
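
The indexing and padding/truncation steps from the preprocessing stage can be sketched in plain Python. This is a minimal illustration, not the code in src/run.py: the helper names, the reserved index 0, and the choice to pad and truncate at the front of the sequence are assumptions, and the input is taken to be already tokenized by a word segmenter.

```python
from collections import Counter

PAD_INDEX = 0  # assumed: 0 is reserved for padding and unknown tokens


def build_vocab(tokenized_texts, max_words=50000):
    """Map the most frequent tokens to indices 1..max_words."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(max_words))}


def to_indices(tokens, vocab):
    """Replace each token by its index; unseen tokens fall back to PAD_INDEX."""
    return [vocab.get(tok, PAD_INDEX) for tok in tokens]


def pad_or_truncate(indices, max_len):
    """Force a sequence to exactly max_len items (max_len must be positive)."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    trimmed = indices[-max_len:]  # long input: keep the last max_len tokens
    return [PAD_INDEX] * (max_len - len(trimmed)) + trimmed  # short input: left-pad


# Two already-tokenized toy reviews (made-up examples, not from the dataset)
texts = [["酒店", "很", "干净"], ["服务", "很", "差"]]
vocab = build_vocab(texts)
batch = [pad_or_truncate(to_indices(t, vocab), max_len=5) for t in texts]
print(batch)
```

Every row of batch has the same length, which is what an embedding layer followed by an RNN, LSTM, or Bi-LSTM expects. Padding at the front keeps the real tokens next to the final recurrent step; padding at the end is an equally valid choice.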

Contributors

lunarwhite, mitsuki97
