ccf_2020_qa_match's Introduction

Update

After further optimization on top of the current repo, we ranked Top 1 on both the A and B leaderboards. The code is being cleaned up and will be uploaded gradually!

Summary blog post: ccf问答匹配比赛(下):如何只用"bert"夺冠 (CCF QA matching, part 2: how to win with nothing but "BERT")

Optimization ideas

Post training

MLM

Strengthen the masking strategy in the MLM task to make it harder and improve downstream performance: mine new words, add them to the vocabulary, and use whole word masking + dynamic masking.

  • New word mining
python new_words_mining.py 
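
The repo's new_words_mining.py is not reproduced here; the sketch below shows one common recipe (frequent character n-grams filtered by pointwise mutual information), with min_count and min_pmi chosen arbitrarily for illustration.

    import math
    import re
    from collections import Counter

    def mine_new_words(texts, max_n=4, min_count=10, min_pmi=3.0):
        """Collect frequent character n-grams and keep those whose PMI against
        their individual characters is high, i.e. strings that co-occur far more
        often than chance -- candidate new words for the vocabulary."""
        unigrams, ngrams = Counter(), Counter()
        for text in texts:
            text = re.sub(r"\s+", "", text)
            unigrams.update(text)
            for n in range(2, max_n + 1):
                ngrams.update(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(unigrams.values()) or 1
        candidates = []
        for word, count in ngrams.items():
            if count < min_count:
                continue
            p_word = count / total
            p_independent = math.prod(unigrams[ch] / total for ch in word)
            pmi = math.log(p_word / p_independent)
            if pmi >= min_pmi:
                candidates.append((word, count, pmi))
        return sorted(candidates, key=lambda item: -item[2])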

NSP

A sentence-level task is useful, but NSP is replaced with SOP/AOP: for query-answer pairs, swap the positions of query and answer (SOP); for query-answer-list samples, only shuffle the order of the answer list (AOP).
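
As a sketch of how such samples can be constructed (not the repo's exact preprocessing; the 50% swap/shuffle probability is an assumption):

    import random

    def make_sop_sample(query, answer):
        """SOP for query-answer pairs: half the time swap the two segments;
        the label records whether the original order was kept."""
        if random.random() < 0.5:
            return (query, answer), 1      # original order
        return (answer, query), 0          # swapped order

    def make_aop_sample(query, answers):
        """AOP for query-answer-list samples: the query stays first and only
        the order of the answer list may be disturbed."""
        answers = list(answers)
        if random.random() < 0.5 or len(set(answers)) < 2:
            return (query, answers), 1     # original order
        shuffled = answers[:]
        while shuffled == answers:
            random.shuffle(shuffled)
        return (query, shuffled), 0        # shuffled answer list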

model-adaptive

Keeping the post-training sample format consistent with the downstream task also brings gains (in contrast to the conclusion reported in RoBERTa).

The full post-training code comes in two versions, one for the query-answer pair format and one for the query-answer-list format:

python point-post-training-wwm-sop.py
python pair-post-training-wwm-sop.py

PS: after post-training, adding a complex classification head (CNN/RNN/DGCNN/...) on top of BERT brings essentially no further improvement.
Post-training results

Knowledge injection

There are two main ways to inject knowledge: fusing it into BERT's embedding layer, or fusing it into the transformer output layer:

  • Embedding-layer fusion
    external-embedding-bottom
  • Transformer-output-layer fusion
    top-embedding

The injected knowledge is word2vec vectors trained with gensim (dims=100); however, after repeated experiments neither fusion method brought any improvement:

python pair-external-embedding.py

To switch between the two fusion methods, read the code and modify it accordingly.
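
For reference, training the 100-dimensional word2vec vectors with gensim looks roughly like this (the toy corpus, tokenization and all hyperparameters other than the dimension are assumptions; gensim 3.x uses size= instead of vector_size=):

    from gensim.models import Word2Vec

    # sentences: tokenized query/answer texts, e.g. produced by jieba segmentation
    sentences = [["房子", "几年", "了"], ["二年", "了"], ["楼层", "靠", "中间"]]

    w2v = Word2Vec(
        sentences,
        vector_size=100,   # dims=100, as in the experiments above
        window=5,
        min_count=1,
        sg=1,              # skip-gram
        workers=4,
    )
    w2v.save("w2v_100d.model")
    vector = w2v.wv["房子"]  # 100-dim vector to be fused at the embedding or output layer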

Contrastive learning

Contrastive learning was also introduced in an attempt to improve performance. There are two main variants: self-supervised contrastive learning and supervised contrastive learning:

  • Self-supervised contrastive learning
    Build a pair of views by swapping the query/answer positions and randomly masking 10% of the tokens; the two views serve as positive examples of each other:

  • loss
    self-supervised contrastive learning loss

  • model
    self-supervised contrastive learning model

  • Supervised contrastive learning
    Samples sharing the same label are treated as positives of one another (a minimal loss sketch follows the run commands below):

  • loss
    supervised contrastive learning loss

  • model
    supervised contrastive learning model

Run the self-supervised contrastive learning code:

python pair-data-augment-contrstive-learning.py 

Run the supervised contrastive learning code:

python pair-supervised-contrastive-learning.py
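
For reference, the sketch below shows the standard supervised contrastive loss (Khosla et al., 2020) in TensorFlow 2; it illustrates the idea of treating same-label samples as mutual positives rather than reproducing the repo's Keras implementation, and the temperature value is an assumption.

    import tensorflow as tf

    def supervised_contrastive_loss(labels, features, temperature=0.1):
        """Samples with the same label in the batch are positives of each other."""
        features = tf.math.l2_normalize(features, axis=1)
        labels = tf.reshape(labels, [-1, 1])
        batch_size = tf.shape(features)[0]

        # pairwise similarities, scaled by the temperature and stabilized
        logits = tf.matmul(features, features, transpose_b=True) / temperature
        logits = logits - tf.stop_gradient(tf.reduce_max(logits, axis=1, keepdims=True))

        not_self = 1.0 - tf.eye(batch_size)                          # exclude i == j pairs
        same_label = tf.cast(tf.equal(labels, tf.transpose(labels)), tf.float32)
        positives = same_label * not_self

        # log-probability of each pair under a softmax over all non-self pairs
        exp_logits = tf.exp(logits) * not_self
        log_prob = logits - tf.math.log(tf.reduce_sum(exp_logits, axis=1, keepdims=True) + 1e-12)

        # average over each anchor's positives; anchors without positives are skipped
        num_pos = tf.reduce_sum(positives, axis=1)
        mean_log_prob_pos = tf.reduce_sum(positives * log_prob, axis=1) / tf.maximum(num_pos, 1.0)
        valid = tf.cast(num_pos > 0, tf.float32)
        return -tf.reduce_sum(mean_log_prob_pos * valid) / tf.maximum(tf.reduce_sum(valid), 1.0)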

Self-distillation

In self-distillation, the teacher and the student are the same model: the teacher is trained once, labels the training data with soft labels, and this knowledge is then transferred to the student model.

python pair-self-kd.py
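
A minimal sketch of such a distillation objective: the teacher's logits are softened with a temperature and blended with the hard-label loss (the temperature and mixing weight below are illustrative assumptions, not the repo's values).

    import tensorflow as tf

    def self_distillation_loss(y_true, student_logits, teacher_logits,
                               temperature=2.0, alpha=0.5):
        """Hard-label cross-entropy mixed with KL divergence to the teacher's
        temperature-softened predictions (the soft labels)."""
        hard = tf.keras.losses.sparse_categorical_crossentropy(
            y_true, student_logits, from_logits=True)

        soft_teacher = tf.nn.softmax(teacher_logits / temperature)
        log_soft_student = tf.nn.log_softmax(student_logits / temperature)
        kl = tf.reduce_sum(
            soft_teacher * (tf.math.log(soft_teacher + 1e-12) - log_soft_student), axis=-1)

        # temperature**2 keeps the soft-label gradient on the same scale as the hard loss
        return alpha * hard + (1.0 - alpha) * (temperature ** 2) * kl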

Adversarial training

Use FGM to perturb the embedding layer:

python pair-adversarial-train.py
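
FGM perturbs the embedding matrix by epsilon * g / ||g|| along the gradient direction, accumulates the adversarial gradients with the clean ones, then restores the embeddings. Below is a schematic TensorFlow 2 training step rather than the repo's Keras implementation; epsilon and perturbing a single embedding variable are assumptions.

    import tensorflow as tf

    def fgm_train_step(model, emb_var, x, y, loss_fn, optimizer, epsilon=0.5):
        """One training step with an FGM perturbation applied to `emb_var`."""
        def densify(g):
            return tf.convert_to_tensor(g) if isinstance(g, tf.IndexedSlices) else g

        # 1) clean forward/backward pass
        with tf.GradientTape(persistent=True) as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        emb_grad = densify(tape.gradient(loss, emb_var))
        del tape

        # 2) attack: move the embeddings along the normalized gradient direction
        r_adv = epsilon * emb_grad / (tf.norm(emb_grad) + 1e-8)
        emb_var.assign_add(r_adv)

        # 3) adversarial forward/backward pass
        with tf.GradientTape() as adv_tape:
            adv_loss = loss_fn(y, model(x, training=True))
        adv_grads = adv_tape.gradient(adv_loss, model.trainable_variables)

        # 4) restore the embeddings and apply the accumulated gradients
        emb_var.assign_sub(r_adv)
        total = [None if g is None else densify(g) + densify(ag)
                 for g, ag in zip(grads, adv_grads)]
        optimizer.apply_gradients(
            (g, v) for g, v in zip(total, model.trainable_variables) if g is not None)
        return loss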

Data augmentation

Two kinds of data augmentation were tried: EDA and pseudo-labeling.

  • EDA: random deletion / random replacement / random insertion / random repetition, each applied to 10% of the tokens, generating 4 new samples per original sample. Because the word-vector quality was low, the "synonym" used for these operations is simply a word randomly picked from the current sentence.

  • Pseudo-labeling: label the test data with an already-trained model and add it to the training set.

Tips: when augmenting, filter the generated samples with an already-trained model and drop the low-confidence (<0.7) ones to avoid introducing wrongly labeled data. Also, when pseudo-labeling, mind the data ratio: if too much test data enters the training set too early, the final predictions simply converge to the pseudo labels and no improvement is gained.
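
A sketch of the confidence filter described above; the 0.7 threshold comes from the text, while capping how many test samples may enter training (max_ratio, train_size) is an illustrative assumption.

    import numpy as np

    def select_pseudo_labels(test_samples, probs, threshold=0.7,
                             max_ratio=0.3, train_size=None):
        """Keep only high-confidence pseudo-labels, optionally capped at a
        fraction of the training-set size so they cannot dominate training."""
        probs = np.asarray(probs)                       # shape: [num_test, num_classes]
        confidence = probs.max(axis=1)
        labels = probs.argmax(axis=1)

        keep = np.where(confidence >= threshold)[0]
        keep = keep[np.argsort(-confidence[keep])]      # most confident first
        if train_size is not None:
            keep = keep[: int(max_ratio * train_size)]

        return [(test_samples[i], int(labels[i])) for i in keep]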

shuffle

With the query-answer-list sample format, enumerate all permutations of the answer list at decoding time and vote over the predictions. However, answer order matters a lot in this competition's data, so shuffled inputs perform worse and this brought no improvement.
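
In sketch form the decoding trick looks like this; predict_fn is a hypothetical stand-in for the trained query-answer-list model and is assumed to return one label per answer, in the order the answers were passed in.

    from collections import Counter
    from itertools import permutations

    def permutation_vote(query, answers, predict_fn):
        """Run the model on every ordering of the answer list and majority-vote
        the per-answer labels across orderings (len(answers)! runs, so only
        practical for short answer lists)."""
        votes = [Counter() for _ in answers]
        for order in permutations(range(len(answers))):
            preds = predict_fn(query, [answers[i] for i in order])
            for position, answer_index in enumerate(order):
                votes[answer_index][preds[position]] += 1
        return [v.most_common(1)[0][0] for v in votes]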

Summary

-----------------------------------2020.01.18------------------------------------------------------------------

Competition

贝壳找房 (Beike) real-estate chat QA matching. Competition page: https://www.datafountain.cn/competitions/474/datasets

Summary blog post: ccf问答匹配 (CCF QA matching)

Brief description

Each sample is one question with multiple replies; some replies actually answer the question (label 1) and some do not (label 0), and the replies are given in order, i.e. query1: [(answer1, 0), (answer2, 1), ...]. The task is to classify each reply as answering the question or not.

Pretrained model weights

The pretrained model is Huawei's open-source nezha-base-wwm.

Baseline

Approach 1:

Ignore the order relationship between replies and split each sample into query-answer pairs, then classify each pair. For example, the sample {query: "房子几年了", answers: [("二年了", 1), ("楼层靠中间", 0)]} is split into individual query-answer pairs: [{query: "房子几年了", answer: "二年了", label: 1}, {query: "房子几年了", answer: "楼层靠中间", label: 0}]

pair match

Code: pair_match

Single-model submission F1: 0.752
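
The split into independent query-answer pairs is mechanical; a minimal sketch:

    def to_pairs(sample):
        """Flatten one {query, answers} sample into independent query-answer pairs."""
        return [
            {"query": sample["query"], "answer": answer, "label": label}
            for answer, label in sample["answers"]
        ]

    sample = {"query": "房子几年了", "answers": [("二年了", 1), ("楼层靠中间", 0)]}
    print(to_pairs(sample))
    # [{'query': '房子几年了', 'answer': '二年了', 'label': 1},
    #  {'query': '房子几年了', 'answer': '楼层靠中间', 'label': 0}]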

Approach 2:

To preserve dialogue coherence and completeness, concatenate all replies in order and prepend the question, forming query-answer1-answer2, then classify each reply. The example above becomes the sample {query: "房子几年了", answer: "二年了[SEP]楼层靠中间[SEP]", label: [mask, mask, mask, 1, mask, mask, mask, mask, mask, 0]}, i.e. the [SEP] token following each reply is used as that reply's final feature vector for binary classification.

Code: match_point

Single-model submission F1: 0.75
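
A sketch of how such a sample can be assembled: the replies are joined with [SEP], and only each trailing [SEP] position carries a real label while every other position is masked out of the loss (the -100 mask value and the character-level split are illustrative assumptions).

    def build_point_sample(sample, mask_label=-100):
        """Join all replies with [SEP]; each reply's trailing [SEP] token carries
        that reply's label, and every other position is ignored by the loss."""
        answer_tokens, labels = [], []
        for answer, label in sample["answers"]:
            answer_tokens.extend(list(answer) + ["[SEP]"])       # character-level split for illustration
            labels.extend([mask_label] * len(answer) + [label])  # the label sits on the [SEP]
        return answer_tokens, labels

    sample = {"query": "房子几年了", "answers": [("二年了", 1), ("楼层靠中间", 0)]}
    tokens, labels = build_point_sample(sample)
    # tokens: ['二', '年', '了', '[SEP]', '楼', '层', '靠', '中', '间', '[SEP]']
    # labels: [-100, -100, -100, 1, -100, -100, -100, -100, -100, 0]
    # (the query is encoded as the first segment in front of these answer tokens)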

Approach 3:

Pattern-Exploiting Training (PET): add a pattern to turn the task into an MLM task, then decide the class from the pattern's score. Here we can prepend a prefix pattern "间接回答问题" / "直接回答问题", corresponding to labels 0/1; scoring the pattern only requires comparing the probabilities of the tokens "间" and "直" at the first position. In addition, BERT's MLM pretraining objective can be kept during training to improve generalization. For more details see 文本分类秒解.

For this task, the corresponding diagram is as follows:

Code: pet classification

Single-model submission F1: 0.76+
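
The scoring step can be sketched as follows. The repo uses NEZHA via Keras; here the same idea is shown with bert-base-chinese through the HuggingFace transformers API, and the exact pattern string is an assumption based on the description above.

    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    mlm = BertForMaskedLM.from_pretrained("bert-base-chinese")
    mlm.eval()

    def pet_predict(query, answer):
        """Prefix the pair with the pattern "[MASK]接回答问题" and compare the MLM
        probabilities of "直" (label 1) and "间" (label 0) at the masked position."""
        text = f"{tokenizer.mask_token}接回答问题:{query}"
        inputs = tokenizer(text, answer, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
        with torch.no_grad():
            logits = mlm(**inputs).logits[0, mask_pos]
        id_direct = tokenizer.convert_tokens_to_ids("直")
        id_indirect = tokenizer.convert_tokens_to_ids("间")
        return 1 if logits[id_direct] > logits[id_indirect] else 0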

Approach 4

Different transformer layers of BERT capture semantics at different granularities, and information at different granularities contributes differently to classification, so the semantic information from all layers can be concatenated and used as the feature for classification.

For this task, the diagram is as follows:

Code: concat classification. Single-model submission F1: 0.75+
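
A sketch of the layer-concatenation idea, again shown with bert-base-chinese through the HuggingFace transformers API rather than the repo's NEZHA/Keras code:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
    bert.eval()

    def multi_layer_cls_feature(query, answer):
        """Concatenate the [CLS] vector from every transformer layer into one feature."""
        inputs = tokenizer(query, answer, return_tensors="pt")
        with torch.no_grad():
            outputs = bert(**inputs)
        # hidden_states = (embedding output, layer 1, ..., layer 12); skip the embedding output
        cls_vectors = [h[:, 0] for h in outputs.hidden_states[1:]]
        return torch.cat(cls_vectors, dim=-1)        # shape: [1, 12 * 768]

    feature = multi_layer_cls_feature("房子几年了", "二年了")  # feed this into a classification head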

tips

A few text-classification papers I found inspiring:

ccf_2020_qa_match's Issues

ask for dataset download link

Hi, thanks for your great job! But I did not take part in the competition. If convenient, could you share the dataset download link? Thanks a lot.

Error when importing NEZHA

I get an error at runtime. I checked and the downloaded pretrained model does not contain a model.ckpt file. Do I need to convert it first? Thanks.

task-adaptive training

First of all, congratulations on taking first place, and thank you for open-sourcing the code. I have a question: what is the difference between task-adaptive training and model-adaptive? My understanding is that task-adaptive training means continuing to pretrain the model on the competition data to obtain a domain-adapted pretrained model. Is that correct?

What is the model ensembling strategy?

1. For a single model, train multiple copies with different random seeds;
2. For the QA Pair and QA Point model families, average the predicted probabilities as the final prediction?

Thanks a lot!

Self-distillation script (pair_self_kd.py) fails to run

The failure log is as follows.

Traceback (most recent call last):
  File "pair-self-kd.py", line 297, in <module>
    callbacks=[student_evaluator])
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3792, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1605, in __call__
    return self._call_impl(args, kwargs)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1645, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition:  Error while reading resource variable _AnonymousVar409 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar409/N10tensorflow3VarE does not exist.
         [[node ReadVariableOp_1191 (defined at /home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
         [[ReadVariableOp_1190/_12]]
  (1) Failed precondition:  Error while reading resource variable _AnonymousVar409 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar409/N10tensorflow3VarE does not exist.
         [[node ReadVariableOp_1191 (defined at /home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_keras_scratch_graph_150548]

Function call stack:
keras_scratch_graph -> keras_scratch_graph

What is the overall training pipeline?

After post-training NEZHA, did you fine-tune it with the adversarial training, EDA and self-distillation schemes and then ensemble those models together? Was there no further improvement on top of the baselines, or which baseline model did fine-tuning start from? How many models were there in the end, and how were they ensembled?

Requesting an update

Hope you find time to update the repo; I would like to learn from your approach.
