bert-textclassification's Introduction

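Text Classification with RNN: 2018 CCF BDCI Automotive User Opinion Extraction

This project extracts automotive user opinions, using the word vectors of a BERT model to initialize the RNN. In the data directory, train_x.npy holds the inputs in BERT's input format, whereas the original dataset has been converted with word2id and padded. The labels y need no changes, so both the plain RNN and the RNN with BERT can use them. For details, see the process_file function under text_Loader.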
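As a rough illustration of initializing an embedding table from pretrained vectors, the sketch below copies known vectors into the rows of a table and fills the rest with small random values. The function name `init_embedding`, the toy vocabulary, and the 3-dimensional vectors are all hypothetical; they stand in for real BERT vectors and are not taken from this repository.

```python
import random


def init_embedding(vocab, pretrained, dim, seed=0):
    """Build a len(vocab) x dim embedding table as a list of rows.

    Rows for tokens found in `pretrained` are copied from it; every other
    row keeps small random values. Row order follows the ids in `vocab`.
    """
    rng = random.Random(seed)
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
             for _ in range(len(vocab))]
    for token, idx in vocab.items():
        vector = pretrained.get(token)
        if vector is None:
            continue  # no pretrained vector: keep the random row
        if len(vector) != dim:
            raise ValueError("vector for %r has the wrong dimension" % token)
        table[idx] = list(vector)
    return table


# Toy data (assumption): 3-dimensional vectors standing in for BERT vectors.
vocab = {"<PAD>": 0, "车": 1, "油": 2, "耗": 3}
pretrained = {"车": [0.1, 0.2, 0.3], "油": [0.4, 0.5, 0.6]}
embedding = init_embedding(vocab, pretrained, dim=3)
```

A table built this way could then serve as the initial value of the RNN's embedding variable, so that training starts from the pretrained vectors rather than from random ones.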

Chinese text classification with a recurrent neural network

Environment

  • Python 2/3
  • TensorFlow 1.3 or later
  • numpy
  • scikit-learn
  • scipy

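Dataset

Training and testing use the automotive user opinion extraction task. Download the dataset yourself from 2018 CCF BDCI (https://www.datafountain.cn/competitions/329/details), and follow the data provider's open-source license.

This training run uses 10 of its categories.

Preprocessing

data/cnews_loader.py is the data preprocessing file.

  • read_file(): reads the data from a file;
  • build_vocab(): builds the vocabulary using a character-level representation; it saves the vocabulary to disk so it is not rebuilt on every run;
  • read_vocab(): reads the vocabulary saved in the previous step and converts it to a {word: id} mapping;
  • read_category(): fixes the list of categories and converts it to a {category: id} mapping;
  • to_words(): converts a sample represented as ids back into text;
  • process_file(): converts the dataset from text into fixed-length id sequences;
  • batch_iter(): prepares shuffled batches of data for training the neural network.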
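The steps listed above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the code in data/cnews_loader.py: the helper `process_text`, the `<PAD>` token at id 0, left-side padding, and the toy sentences are all choices made for this illustration.

```python
import random
from collections import Counter


def build_vocab(texts, vocab_size=5000):
    """Character-level vocabulary; id 0 is reserved for padding (assumption)."""
    counter = Counter(ch for text in texts for ch in text)
    chars = [ch for ch, _ in counter.most_common(vocab_size - 1)]
    return ["<PAD>"] + chars


def process_text(text, word_to_id, max_length=600):
    """Map characters to ids, dropping unknown ones, then left-pad with 0."""
    ids = [word_to_id[ch] for ch in text if ch in word_to_id]
    ids = ids[-max_length:]  # over-long texts keep their last max_length ids
    return [0] * (max_length - len(ids)) + ids


def to_words(ids, words):
    """Turn an id sequence back into text, skipping padding."""
    return "".join(words[i] for i in ids if i != 0)


def batch_iter(x, y, batch_size=128, seed=None):
    """Yield shuffled (x_batch, y_batch) pairs covering the data once."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [x[i] for i in chunk], [y[i] for i in chunk]


# Toy data (assumption): two short reviews with made-up labels.
texts = ["油耗很低", "动力不足"]
labels = [0, 1]
words = build_vocab(texts)
word_to_id = {w: i for i, w in enumerate(words)}
x = [process_text(t, word_to_id, max_length=6) for t in texts]
batches = list(batch_iter(x, labels, batch_size=1, seed=0))
```

Every row of `x` has the same length, which is what lets the sequences be stacked into a single array and fed to the network in batches.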

RNN (recurrent neural network)

Configuration

The configurable RNN parameters are shown below; they are defined in rnn_model.py.

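class TRNNConfig(object):
    """RNN configuration parameters"""

    # model parameters
    embedding_dim = 64      # word vector dimension
    seq_length = 600        # sequence length
    num_classes = 10        # number of classes
    vocab_size = 5000       # vocabulary size

    num_layers = 2          # number of hidden layers
    hidden_dim = 128        # hidden units per layer
    rnn = 'gru'             # lstm or gru

    dropout_keep_prob = 0.8 # dropout keep probability
    learning_rate = 1e-3    # learning rate

    batch_size = 128         # training batch size
    num_epochs = 10          # total number of epochs

    print_per_batch = 100    # print results every this many batches
    save_per_batch = 10      # write to tensorboard every this many batches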
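To make the batch-related settings concrete, the sketch below works out how many batches one run contains and how often results would be printed or written to tensorboard. The function `training_schedule` and the dataset size of 8000 samples are hypothetical, and the counts assume reporting happens on every multiple of the interval; the exact numbers depend on how the training loop counts batches.

```python
import math


class TRNNConfig(object):
    """Only the fields this sketch needs, with the values shown above."""
    batch_size = 128
    num_epochs = 10
    print_per_batch = 100
    save_per_batch = 10


def training_schedule(num_samples, config):
    """Derive batch counts and reporting frequency from the configuration."""
    batches_per_epoch = int(math.ceil(num_samples / float(config.batch_size)))
    total_batches = batches_per_epoch * config.num_epochs
    return {
        "batches_per_epoch": batches_per_epoch,
        "total_batches": total_batches,
        # multiples of the interval within 1..total_batches (assumption)
        "console_reports": total_batches // config.print_per_batch,
        "tensorboard_writes": total_batches // config.save_per_batch,
    }


# 8000 is a made-up dataset size used only for illustration.
schedule = training_schedule(8000, TRNNConfig)
```

With these values, 8000 samples give 63 batches per epoch and 630 batches in total, so results would be printed about 6 times and tensorboard written about 63 times over the run.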

RNN-bert model

See run_rnn_bert.py for the implementation.

For an implementation of the RNN-bert model on the Tsinghua Sina news dataset, see GitHub (https://github.com/a414351664/Bert-THUCNews).

bert-textclassification's People

Contributors

pengwei-iie
