This project implements multi-label text classification with Keras and ALBERT, fine-tuning the ALBERT model in the process.
- jclian91
It uses the data from the 2020 Language and Intelligence Technology Competition: Event Extraction task as sample data, treating event-type prediction as a multi-label classification problem.
.
├── albert.py
├── albert_tiny
│ ├── albert_config_tiny.json
│ ├── albert_model.ckpt.data-00000-of-00001
│ ├── albert_model.ckpt.index
│ ├── albert_model.ckpt.meta
│ └── vocab.txt
├── data (dataset)
├── label.json (label dictionary, generated at training time)
├── model_evaluate.py (model evaluation script)
├── model_predict.py (model prediction script)
├── model_train.py (model training script)
└── requirements.txt
Model structure with the albert-tiny pre-trained model:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
input_2 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
model_2 (Model) multiple 4077496 input_1[0][0]
input_2[0][0]
__________________________________________________________________________________________________
lambda_1 (Lambda) (None, 312) 0 model_2[1][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 65) 20345 lambda_1[0][0]
==================================================================================================
Total params: 4,097,841
Trainable params: 4,097,841
Non-trainable params: 0
__________________________________________________________________________________________________
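The summary above ends in a single Dense head: a sigmoid layer mapping the 312-dimensional ALBERT-tiny sentence vector to 65 independent label probabilities. A minimal NumPy sketch (shapes taken from the summary, weights random for illustration only) reproduces the head's parameter count and forward pass:

```python
import numpy as np

# Dimensions from the summary above: ALBERT-tiny hidden size 312, 65 event labels.
hidden_size, num_labels = 312, 65

# The dense head's parameters: a weight matrix plus a bias vector.
W = np.random.randn(hidden_size, num_labels) * 0.02
b = np.zeros(num_labels)
print(W.size + b.size)  # 20345, matching dense_1's param count in the summary

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass for one sentence vector: each label gets an independent
# probability, which is what makes this multi-label rather than softmax
# multi-class.
cls_vec = np.random.randn(hidden_size)
probs = sigmoid(cls_vec @ W + b)
print(probs.shape)  # (65,)
```

Using sigmoid with a per-label binary cross-entropy loss (instead of softmax) is the standard choice here, since any number of event types can co-occur in one sentence.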
- albert_tiny
Model parameters: batch_size = 16, maxlen = 256, epochs = 10
Evaluation results with the albert_tiny pre-trained model:
              precision    recall  f1-score   support
   micro avg     0.9488    0.8606    0.9025      1657
   macro avg     0.9446    0.8084    0.8589      1657
weighted avg     0.9460    0.8606    0.8955      1657
 samples avg     0.8932    0.8795    0.8799      1657
accuracy: 0.828437917222964
hamming loss: 0.0031631919482386773
- albert_base
Model parameters: batch_size = 16, maxlen = 256, epochs = 10
Evaluation results with the albert_base pre-trained model:
              precision    recall  f1-score   support
   micro avg     0.9471    0.9294    0.9382      1657
   macro avg     0.9416    0.9105    0.9208      1657
weighted avg     0.9477    0.9294    0.9362      1657
 samples avg     0.9436    0.9431    0.9379      1657
accuracy: 0.8931909212283045
hamming loss: 0.0020848310567936736
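The accuracy and hamming loss figures above are standard multi-label metrics: accuracy here is subset accuracy (every label of a sample must be exactly right), while hamming loss counts wrong individual label decisions. A small NumPy sketch on toy label matrices (made-up data, not the competition set) shows how both are computed:

```python
import numpy as np

# Toy ground-truth and predicted label matrices: 4 samples, 5 labels.
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0],   # one label missed
                   [1, 1, 0, 0, 0],
                   [0, 0, 1, 1, 0]])  # one spurious label

# Hamming loss: fraction of individual label decisions that are wrong.
hamming = np.mean(y_true != y_pred)

# Subset accuracy: a sample only counts as correct if its entire
# label vector matches exactly.
subset_acc = np.mean(np.all(y_true == y_pred, axis=1))

print(hamming)     # 2 wrong bits out of 20 -> 0.1
print(subset_acc)  # 2 of 4 rows fully correct -> 0.5
```

This explains why subset accuracy (0.83 / 0.89 above) is much lower than the per-label F1 scores: one wrong label anywhere in a sample zeroes out that whole sample.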
- Place the ALBERT Chinese pre-trained model in the corresponding folder
- Install the required Python packages listed in requirements.txt
- Prepare the data you want to classify in the same format as data/train.csv
- Adjust the model parameters and run model_train.py to train the model
- Run model_evaluate.py to evaluate the model
- Run model_predict.py to make predictions on new text
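Prediction presumably turns the model's sigmoid outputs back into label names via the generated label.json. A hedged sketch of that decoding step, with a made-up three-entry label dictionary standing in for the real 65-class file and a hypothetical 0.5 threshold:

```python
import numpy as np

# Hypothetical label dictionary in the style of the generated label.json
# (index -> event type); these entries are made up for illustration.
label_dict = {0: "财经/交易-出售/收购", 1: "产品行为-发布", 2: "组织关系-辞/离职"}

def decode(probs, threshold=0.5):
    """Turn per-label sigmoid probabilities into a list of label names."""
    return [label_dict[i] for i, p in enumerate(probs) if p > threshold]

# Example sigmoid outputs for one sentence: labels 0 and 2 cross the threshold.
probs = np.array([0.91, 0.12, 0.67])
print(decode(probs))
```

The threshold is a tunable choice: lowering it trades precision for recall, which directly moves the micro/macro averages reported above.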