jiangxinyang227 / textclassifier

TensorFlow implementation

Jupyter Notebook 82.16% Python 17.82% Shell 0.02%

textclassifier's Introduction

Text Classification Project

This project implements several common text classification models built on CNNs, RNNs, and pretrained NLP models.

requirements

python==3.5.6
tensorflow-gpu==1.10.0

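1. Dataset

The dataset is the IMDB movie review sentiment analysis dataset, which has three parts:

Labeled training set: labeledTrainData.tsv
Unlabeled training set: unlabeledTrainData.tsv
Test set: testData.tsv

Field meanings:

id: the ID of the movie review
review: the text of the movie review
sentiment: the sentiment label (present only in labeledTrainData.tsv)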

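2. Data preprocessing

The preprocessing code is in /dataHelper/processData.ipynb

It turns the raw data into clean data, which is stored under /data/preProcess. Preprocessing includes:

Removing punctuation
Generating the input data for training the word2vec model: /data/preProcess/wordEmbedding.txt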
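The two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `clean_review` and `write_corpus` are hypothetical, and stripping HTML tags and lowercasing are assumptions added here because IMDB reviews commonly contain tags such as `<br />`.

```python
import re

def clean_review(text):
    """Drop HTML tags and punctuation, lowercase, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)               # assumption: remove tags like <br />
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # replace punctuation with spaces
    return " ".join(text.split())

def write_corpus(reviews, path):
    """Write one cleaned review per line, a common input layout for word2vec training."""
    with open(path, "w", encoding="utf-8") as f:
        for review in reviews:
            f.write(clean_review(review) + "\n")

print(clean_review("It's great!<br />Loved it."))
```

Note that replacing punctuation with spaces splits contractions ("It's" becomes "it s"); whether that is acceptable depends on the downstream vocabulary.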

3. Training word2vec word vectors

The word2vec pretraining code is in /word2vec/genWord2Vec.ipynb

The pretrained word vectors are saved in bin format at /word2vec/word2Vec.bin

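4. textCNN text classification

The textCNN model comes from the paper Convolutional Neural Networks for Sentence Classification

textCNN can be viewed as an ensemble that concatenates the outputs of three single-layer convolutional networks. The author proposes three kernel sizes, [3, 4, 5]. Because a kernel slides over the text, it behaves much like n-grams in NLP, so when you need n-grams at more scales you can add kernels of other sizes; for example, a kernel of size 2 corresponds to 2-grams.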
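To make the n-gram analogy concrete, here is a small pure-Python sketch of the idea: each filter of height k scores every window of k consecutive word vectors, max-over-time pooling keeps the strongest response, and the pooled values from all kernel sizes are concatenated. The names `conv_max_pool` and `text_cnn_features`, the random weights, and the ReLU are illustrative assumptions; the notebook itself uses TensorFlow.

```python
import random

def conv_max_pool(embedded, kernel):
    """Slide one filter over the sentence and keep its maximum response.

    embedded: list of T word vectors (each of length d), with T >= len(kernel)
    kernel:   list of k weight vectors (each of length d)
    """
    k = len(kernel)
    responses = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]              # one k-gram of word vectors
        z = sum(w * x
                for krow, xrow in zip(kernel, window)
                for w, x in zip(krow, xrow))
        responses.append(max(z, 0.0))                   # ReLU non-linearity
    return max(responses)                               # max-over-time pooling

def text_cnn_features(embedded, kernels_by_size):
    """Concatenate the pooled output of every filter of every kernel size."""
    return [conv_max_pool(embedded, kern)
            for size in sorted(kernels_by_size)
            for kern in kernels_by_size[size]]

random.seed(0)
T, d, num_filters = 10, 4, 2                            # toy sizes for illustration
sentence = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
kernels = {k: [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
               for _ in range(num_filters)]
           for k in (3, 4, 5)}
features = text_cnn_features(sentence, kernels)         # 3 sizes x 2 filters
```

Adding a kernel of size 2 to `kernels` would add 2-gram features without touching the rest of the pipeline.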

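The textCNN code is in /textCNN/textCNN.ipynb. The implementation has four parts:

Configuration class Config (training parameters, model parameters, and other parameters)
Data preprocessing class Dataset (builds the vocabulary, loads the pretrained word vectors, and splits the training and validation sets)
textCNN model class TextCNN
Model training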

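5. charCNN text classification

The charCNN model comes from the paper Character-level Convolutional Networks for Text Classification

char-CNN is a character-level text classifier that represents all text as characters. Note that preprocessing here must not remove punctuation or other symbols; it is best to keep the 69 characters proposed in the paper. When I first fed the model text with the special symbols stripped, it failed to converge. In addition, because the training set is fairly small, even the smallest network in the paper fails to converge; in that case you can reduce the model's complexity, for example by removing some convolutional layers.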
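Character-level input can be sketched as below. The alphabet here is an assumption assembled from Python's `string` constants (lowercase letters, digits, ASCII punctuation, and newline), which works out to 69 symbols; check it against the paper's list before relying on it. `encode_chars` is a hypothetical helper, and index 0 is reserved for padding and unknown characters.

```python
import string

# Assumed alphabet for illustration: 26 letters + 10 digits + 32 punctuation marks + newline.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + "\n"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}   # 0 = padding / unknown

def encode_chars(text, max_len):
    """Map text to a fixed-length list of character ids, truncating or zero-padding."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))

print(len(ALPHABET))
print(encode_chars("Hi!", 5))
```

Because punctuation has its own ids, the text should reach this step without the punctuation stripping used for the word-level models.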

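The charCNN code is in /charCNN/charCNN.ipynb. The implementation also has four parts, the same as textCNN, but the data preprocessing is very different and the model structure differs as well. You can also add BN (batch normalization) layers to the model to normalize each layer's output.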

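6. Bi-LSTM text classification

For background on Bi-LSTM, see my blog post Deep Learning: From RNN to LSTM

An LSTM is a kind of RNN, a sequence model. Bi-LSTM is a bidirectional LSTM designed to capture both the preceding and the following context in text, and it performs well on sentiment-analysis problems.

The Bi-LSTM code is in /Bi-LSTM/Bi-LSTM.ipynb. Apart from changes to the model class, the rest of the code is almost identical to textCNN.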

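7. Bi-LSTM + Attention text classification

The Bi-LSTM + Attention model comes from the paper Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification

Bi-LSTM + Attention adds an Attention layer on top of the Bi-LSTM model. In plain Bi-LSTM we take the output vector of the last time step as the feature vector and then classify with softmax. Attention instead first computes a weight for each time step, uses the weighted sum of the vectors from all time steps as the feature vector, and then classifies with softmax. In our experiments, adding Attention did improve the results.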
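The weighting step can be sketched in pure Python as follows, loosely following the paper's formulation (score each time step with a learned vector applied to tanh of the hidden state, softmax the scores, take the weighted sum). `attention_pool` and the toy inputs are illustrative assumptions, not the notebook's TensorFlow code.

```python
import math

def attention_pool(H, w):
    """Collapse T hidden vectors into one feature vector using attention weights.

    H: list of T hidden vectors (each of length d), e.g. Bi-LSTM outputs
    w: attention parameter vector of length d (learned in a real model)
    """
    # One score per time step: w . tanh(h_t)
    scores = [sum(wi * math.tanh(hi) for wi, hi in zip(w, h)) for h in H]
    # Numerically stable softmax over time steps
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Weighted sum of the hidden vectors, followed by tanh
    d = len(H[0])
    r = [sum(a * h[j] for a, h in zip(alpha, H)) for j in range(d)]
    return [math.tanh(x) for x in r], alpha

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three time steps, hidden size 2
features, alpha = attention_pool(H, [0.5, -0.5])
```

With an all-zero `w` the weights are uniform and the result reduces to mean pooling over time, which is a handy sanity check.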

The Bi-LSTM + Attention code is in /Bi-LSTM+Attention/Bi-LSTMAttention.ipynb. Apart from the Attention layer added to the model class, the rest of the code is the same as Bi-LSTM.

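8. RCNN text classification

The RCNN model comes from the paper Recurrent Convolutional Neural Networks for Text Classification

The overall RCNN model is built as follows:

Use a Bi-LSTM to capture context information, similar to a language model
Concatenate the Bi-LSTM hidden outputs with the word vectors: [fwOutput, wordEmbedding, bwOutput]
Map the concatenated vector to a lower dimension with a non-linear projection
For each position in the vector, take the maximum over all time steps to get the final feature vector; this is similar to max-pooling
Classify with softmax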
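Steps two to four of the list above can be sketched in pure Python as below. The Bi-LSTM outputs are taken as given inputs, and `rcnn_features`, the tanh projection, and the toy weights are assumptions made for illustration.

```python
import math

def rcnn_features(fw, emb, bw, W, b):
    """Concatenate context and word vectors, project, then max-pool over time.

    fw, bw: per-time-step forward/backward context vectors (lists of lists)
    emb:    per-time-step word vectors
    W, b:   projection weights (one row per output unit) and biases
    """
    projected = []
    for f, e, k in zip(fw, emb, bw):
        x = f + e + k                                   # [fwOutput, wordEmbedding, bwOutput]
        projected.append([math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
                          for row, bi in zip(W, b)])
    # Element-wise maximum across all time steps (max-pooling over time)
    return [max(column) for column in zip(*projected)]

fw = [[1.0], [0.0]]
emb = [[0.0], [2.0]]
bw = [[0.0], [0.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [0.0, 0.0]
pooled = rcnn_features(fw, emb, bw, W, b)
```

The pooled vector has one entry per projection unit regardless of sentence length, which is what lets it feed a fixed-size softmax layer.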

The RCNN code is in /RCNN/RCNN.ipynb.

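9. adversarialLSTM text classification

The Adversarial LSTM model comes from the paper Adversarial Training Methods For Semi-Supervised Text Classification

The core of adversarialLSTM is to generate adversarial examples by adding noise to the word embeddings. The adversarial examples are fed to the model in the same form as the original examples to obtain an adversarial loss, which is added to the loss on the original examples to give a new loss; the model is trained by optimizing this new loss. The authors argue that this regularizes the word embeddings and helps avoid overfitting.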
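In the paper the noise is not random: it points along the gradient of the loss with respect to the embeddings and is rescaled to a fixed L2 norm. The sketch below shows only that rescaling, with the gradient supplied as plain numbers; `adversarial_perturbation`, `perturb`, and `epsilon` are illustrative names, and a real implementation would obtain the gradient from the framework.

```python
import math

def adversarial_perturbation(grad, epsilon):
    """Scale the loss gradient w.r.t. the embeddings to L2 norm epsilon."""
    norm = math.sqrt(sum(g * g for row in grad for g in row))
    if norm == 0.0:
        return [[0.0 for _ in row] for row in grad]     # no direction to perturb in
    return [[epsilon * g / norm for g in row] for row in grad]

def perturb(embedded, grad, epsilon):
    """Return the adversarial embeddings: original embeddings plus the perturbation."""
    r_adv = adversarial_perturbation(grad, epsilon)
    return [[e + r for e, r in zip(erow, rrow)] for erow, rrow in zip(embedded, r_adv)]

# Toy example: one word with a 2-dimensional embedding.
adv = perturb([[1.0, 1.0]], [[3.0, 4.0]], epsilon=1.0)
```

The training loss would then be the sum of the loss on the original embeddings and the loss on `adv`, as described above.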

The adversarialLSTM code is in /adversarialLSTM/adversarialLSTM.ipynb.

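10. Transformer text classification

The Transformer model comes from the paper Attention Is All You Need

The Transformer has two components, the Encoder and the Decoder. Text classification needs only the Encoder; the Decoder is a generative model used for natural language generation. The core of the Transformer is the self-Attention mechanism; for details see The Transformer Model (Attention Is All You Need).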
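As a rough illustration of self-Attention, the sketch below implements single-head scaled dot-product attention in pure Python, with a mask that stops queries from attending to padded positions. It deliberately omits the learned Q/K/V projections and multiple heads, and `self_attention` and `key_is_pad` are hypothetical names, so treat it as a simplified picture rather than the notebook's model.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, key_is_pad):
    """Single-head scaled dot-product self-attention without learned projections.

    X:          list of T vectors, used as queries, keys, and values
    key_is_pad: list of T booleans, True where the position is padding
    """
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        # Padded keys get a very negative score so their softmax weight is ~0
        scores = [-1e9 if pad else s for s, pad in zip(scores, key_is_pad)]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # last position is padding
out = self_attention(X, [False, False, True])
```

For classification, the encoder outputs are typically pooled or flattened into one vector before the softmax layer.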

The Transformer code is in /Transformer/transformer.ipynb.

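11. ELMo pretrained model for text classification

The ELMo model comes from the paper Deep contextualized word representations

ELMo is structured as a BiLM (bidirectional language model). A pretrained ELMo model generates word vectors dynamically, depending on context; for details see The ELMo Model (Deep contextualized word representations)

The code for text classification with the pretrained ELMo model is in /ELMo/elmo.ipynb. /ELMo/bilm/ contains the source code from the ELMo project, and /ELMo/modelParams/ contains various files.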

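12. BERT pretrained model for text classification

The BERT model comes from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is a language model built on a bidirectional Transformer that combines pretraining and the downstream task in a single model, so we do not need to build our own downstream model and can use BERT directly. We download Google's open-source code into the bert folder; for text classification only run_classifier.py needs to be modified. We also need to split the data into training and validation sets saved in two separate files under /BERT/data, download Google's pretrained model into the /BERT/modelParams folder, and create a /BERT/output folder to hold the trained model files.

After completing the steps above, just run the following script: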

export BERT_BASE_DIR=../modelParams/uncased_L-12_H-768_A-12
export DATASET=../data/

python run_classifier.py \
  --data_dir=$DATASET \
  --task_name=imdb \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --output_dir=../output/ \
  --do_train=true \
  --do_eval=true \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=200 \
  --train_batch_size=16 \
  --learning_rate=5e-5 \
  --num_train_epochs=3.0

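For a detailed walkthrough of using BERT for text classification, see my blog post Text Classification in Practice (10): The BERT Pretrained Model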

textclassifier's Issues

how to save the models?

ELMo input length problem

I also hit the error "ValueError: setting an array element with a sequence." when running ELMo. My question is: how should sentences shorter than Maxlen be handled?

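Closed

Closed!

A question: after tuning the BERT parameters many times, the results are always the same and recall is always 0, whereas with the bilstm model I usually get recall=0.83.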

bert_blstm_atten.py in BERT cannot predict

When running predict.sh:

python bert_blstm_atten.py \
  --task_name=imdb \
  --do_predict=True \
  --data_dir=data/ \
  --vocab_file=modelParams/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=modelParams/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=blstm_output/ \
  --max_seq_length=128 \
  --output_dir=predicts/imdb/

It fails with:
Instructions for updating:
Please use keras.layers.RNN(cell), which is equivalent to this API
INFO:tensorflow:H shape: Tensor("Attention/add:0", shape=(?, 128, 256), dtype=float32)
INFO:tensorflow:r shape: Tensor("Attention/MatMul_1:0", shape=(?, 256, 1), dtype=float32)
INFO:tensorflow:sequeeze_r shape: Tensor("Attention/Squeeze:0", shape=(?, 256), dtype=float32)
INFO:tensorflow:sentence embedding shape: Tensor("Attention/Tanh_1:0", shape=(?, 256), dtype=float32)
INFO:tensorflow:output shape: Tensor("Attention/dropout/mul:0", shape=(?, 256), dtype=float32)
INFO:tensorflow:output_w shape: <tf.Variable 'output/output_w:0' shape=(256, 2) dtype=float32_ref>
INFO:tensorflow:predictions: Tensor("predictions:0", shape=(?,), dtype=int64)
INFO:tensorflow:Error recorded from prediction_loop: Assignment map with scope only name Attention should map to scope only Attention/Variable. Should be 'scope/': 'other_scope/'.
INFO:tensorflow:prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "bert_blstm_atten.py", line 865, in <module>
tf.app.run()
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "bert_blstm_atten.py", line 847, in main
for (i, prediction) in enumerate(result):
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2500, in predict
rendezvous.raise_errors()
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2494, in predict
yield_single_examples=yield_single_examples):
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 611, in predict
features, None, model_fn_lib.ModeKeys.PREDICT, self.config)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2251, in _call_model_fn
config)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2534, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1323, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1593, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "bert_blstm_atten.py", line 539, in model_fn
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 190, in init_from_checkpoint
_init_from_checkpoint, args=(ckpt_dir_or_file, assignment_map))
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1516, in merge_call
return self._merge_call(merge_fn, args, kwargs)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1524, in _merge_call
return merge_fn(self._distribution_strategy, *args, **kwargs)
File "/home/zys/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 246, in _init_from_checkpoint
scopes, tensor_name_in_ckpt))
ValueError: Assignment map with scope only name Attention should map to scope only Attention/Variable. Should be 'scope/': 'other_scope/'.

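do_train and do_eval both work fine, but it fails when I run the predict.sh I wrote. It looks like some variables are missing. Could you take a look at what is going on? Thanks!

Hello, how should the code be changed to use the transformer encoder to encode text with multi-class labels?

As the title says, my experiment needs the transformer encoder to encode sentence word vectors that carry category labels (the labels could take the form of the 0/1 you put at the end, where 0 is the major category and 1 is the minor category). How should the code be changed? I would be very grateful for any guidance. Thank you!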

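The key mask computation in Transformer seems wrong

The position embedding is generated as an identity matrix of shape [batch_size, sequence_length, sequence_length];
it is then concatenated with the word embedding: self.embeddedWords = tf.concat([self.embedded, self.embeddedPosition], -1)
The resulting tensor should have shape [batch_size, sequence_length, embedding_size+sequence_length], and it is passed into multiheadAttention as Q/K/V;
The key mask is computed as keyMasks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))). Suppose the last time step is zero padding, so keys[:, -1, :] = [0,0,0,…0,|0,…,0,1] (the part before "|" is the word embedding, the part after "|" is the position embedding). The function above still returns 1 for it, so the key mask apparently never contains a 0.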
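The concern can be reproduced with a few lines of pure Python, assuming the keys really are the concatenation described in this issue. The helper names are hypothetical; `key_mask_from_sum` mirrors sign(abs(reduce_sum(...))), and `key_mask_from_ids` shows one possible alternative that derives the mask from the token ids (assuming id 0 is the padding id).

```python
def key_mask_from_sum(keys):
    """Mirror tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))) for one sequence."""
    return [0 if sum(k) == 0 else 1 for k in keys]

def key_mask_from_ids(token_ids, pad_id=0):
    """Alternative: mark padding directly from the token ids."""
    return [0 if t == pad_id else 1 for t in token_ids]

seq_len = 4
token_ids = [5, 9, 0, 0]                                  # last two positions are padding
word_emb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pos_onehot = [[1.0 if i == j else 0.0 for j in range(seq_len)] for i in range(seq_len)]
keys = [w + p for w, p in zip(word_emb, pos_onehot)]      # concatenate along the last axis

mask_concat = key_mask_from_sum(keys)        # one-hot position makes every sum non-zero
mask_words = key_mask_from_sum(word_emb)     # summing only the word embeddings
mask_ids = key_mask_from_ids(token_ids)
```

Computing the mask from the word embeddings alone, or from the token ids, recovers the zeros at the padded positions.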

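How well does the author's transformer perform on text classification? I tried a Keras implementation and it is consistently 2-4 points worse than LSTM. Have you run into this?
Thanks!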

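About batchSize in Transformer._positionEmbedding

Hello, I would like to use Transformer._positionEmbedding and ran into a problem when setting batchSize:
During validation, the validation batch size differs from the training batch size (the training batch is larger, to reduce validation time and avoid running out of memory)
Of course, the problem can be solved by feeding PositionEmbedding in as an input, but is there a better way?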

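I still have questions about writing the test script

I still have questions about writing the test script. Following the blog link you gave, I only know how to load the checkpoint; I am not sure how to pass the inputs afterwards or which parameters should be fed. I also could not see in the blog post how the test set is loaded. I hope you can help.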
import tensorflow as tf
import numpy as np

graph = tf.Graph()
with graph.as_default():
    session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    session_conf.gpu_options.allow_growth = True
    session_conf.gpu_options.per_process_gpu_memory_fraction = 0.9  # cap the GPU memory fraction

    sess = tf.Session(config=session_conf)

    with sess.as_default():
        checkpoint_file = tf.train.latest_checkpoint("C:/Users/韩泽峰/Desktop/textClassifier-master/model/transformer3classfier/")
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        saver.restore(sess, checkpoint_file)

        # Get the inputs that must be fed to the model (the values the output depends on)
        inputX = graph.get_operation_by_name("inputX")
        dropoutKeepProb = graph.get_operation_by_name("dropoutKeepProb")
        embeddedPosition = graph.get_operation_by_name("embeddedPosition")

        # Get the output
        pred_all = graph.get_operation_by_name("inputY")

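Hello, I would like to ask: if this is applied to Chinese text, how should the data be processed, and does the code need any changes? Thanks

Input data dimension problem

Hello, in ELMo I see that the input length is only capped at 200, but every sentence has a different length. Is the shape of the final output vectors (batchsize, 200, dim)? Does ELMo accept padded sentences? If so, is a space used as the padding token? That appears to be how tensorflow_hub wraps it.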
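One common way to get a rectangular (batchsize, max_len) batch from sentences of different lengths is to pad the token lists and keep the true lengths alongside, as in the sketch below. `pad_tokens` is a hypothetical helper, and the empty string as the padding token is an assumption; the right padding value depends on how the ELMo model is loaded.

```python
def pad_tokens(batch, max_len, pad_token=""):
    """Truncate or pad every token list to max_len and record the true lengths."""
    padded, lengths = [], []
    for tokens in batch:
        tokens = tokens[:max_len]
        lengths.append(len(tokens))
        padded.append(tokens + [pad_token] * (max_len - len(tokens)))
    return padded, lengths

padded, lengths = pad_tokens([["a", "good", "movie"], ["bad"]], max_len=4)
```

Passing ragged lists straight into an array constructor is a frequent cause of "setting an array element with a sequence", so padding before building the batch usually avoids it; the recorded lengths let later layers ignore the padded positions.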

Path in In[2] of dataProcessor.ipynb under BERT

This could be changed to a relative path to protect personal information:
from /data4T/share/jiangxinyang848/textClassifier/data/preProcess/labeledTrain.csv
to ../data/preProcess/labeledTrain.csv

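Testing question

Hello, I am a beginner. I read your code carefully and it is very thorough, but there does not seem to be a test (or eval) method. How should I measure performance on the test set? Thanks!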

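bert_blstm_atten.py cannot do multi-class classification

Hello, after modifying bert_blstm_atten.py I found that multi-class evaluation does not work, although training does. With plain BERT, modifying run_classifier is enough. I suspect the blstm_atten.py layer only supports binary classification. How should the code be changed so that bert_blstm can do multi-class classification?

Have you run experiments comparing the strengths and weaknesses of text classification models such as TextCNN, RCNN, and BERT?

The author could also try ULMFiT and ResNet for text classification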

How to change the program into a 3-class classifier?

How can I change the program into a 3-class classifier? I want to use it for aspect-based sentiment analysis. Thank you!

Why does the original BERT with only an MLP layer reach 86% accuracy, while adding BiLSTM+Attention drops it to 65%?

With BiLSTM+Attention added
INFO:tensorflow:***** Eval results *****
INFO:tensorflow: eval_accuracy = 0.6544118
INFO:tensorflow: eval_auc = 0.52433944
INFO:tensorflow: eval_precision = 0.69602275
INFO:tensorflow: eval_recall = 0.8781362
INFO:tensorflow: global_step = 917
INFO:tensorflow: loss = 0.77603155

BERT's native MLP layer