ccf_2020_qa_match's Introduction

Update

After further optimization on top of the current repo, we ranked Top 1 on both the A and B leaderboards. The code is being cleaned up and will be uploaded gradually!

Summary blog post: ccf问答匹配比赛(下):如何只用"bert"夺冠 (CCF QA matching, part 2: how to win with nothing but "BERT")

Optimization ideas

Post training

MLM

Strengthen the masking strategy in the MLM task to make it harder and improve downstream performance: mine new words, add them to the vocabulary, and use whole word masking + dynamic masking.

  • New word mining
python new_words_mining.py 
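
The repo's new_words_mining.py is not reproduced here; the sketch below shows one common recipe (frequent character n-grams filtered by pointwise mutual information), with min_count and min_pmi chosen arbitrarily for illustration.

    import math
    import re
    from collections import Counter

    def mine_new_words(texts, max_n=4, min_count=10, min_pmi=3.0):
        """Collect frequent character n-grams and keep those whose PMI against
        their individual characters is high, i.e. strings that co-occur far more
        often than chance -- candidate new words for the vocabulary."""
        unigrams, ngrams = Counter(), Counter()
        for text in texts:
            text = re.sub(r"\s+", "", text)
            unigrams.update(text)
            for n in range(2, max_n + 1):
                ngrams.update(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(unigrams.values()) or 1
        candidates = []
        for word, count in ngrams.items():
            if count < min_count:
                continue
            p_word = count / total
            p_independent = math.prod(unigrams[ch] / total for ch in word)
            pmi = math.log(p_word / p_independent)
            if pmi >= min_pmi:
                candidates.append((word, count, pmi))
        return sorted(candidates, key=lambda item: -item[2])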

NSP

A sentence-level task is useful, but NSP is replaced with SOP/AOP: for query-answer pairs, swap the positions of query and answer (SOP); for query-answer-list samples, only shuffle the order of the answer list (AOP).
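
As a sketch of how such samples can be constructed (not the repo's exact preprocessing; the 50% swap/shuffle probability is an assumption):

    import random

    def make_sop_sample(query, answer):
        """SOP for query-answer pairs: half the time swap the two segments;
        the label records whether the original order was kept."""
        if random.random() < 0.5:
            return (query, answer), 1      # original order
        return (answer, query), 0          # swapped order

    def make_aop_sample(query, answers):
        """AOP for query-answer-list samples: the query stays first and only
        the order of the answer list may be disturbed."""
        answers = list(answers)
        if random.random() < 0.5 or len(set(answers)) < 2:
            return (query, answers), 1     # original order
        shuffled = answers[:]
        while shuffled == answers:
            random.shuffle(shuffled)
        return (query, shuffled), 0        # shuffled answer list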

model-adaptive

Keeping the post-training sample format consistent with the downstream task also brings gains (in contrast to the conclusion reported in RoBERTa).

The full post-training code comes in two versions, one for the query-answer pair format and one for the query-answer-list format:

python point-post-training-wwm-sop.py
python pair-post-training-wwm-sop.py

PS: after post-training, adding a complex classification head (CNN/RNN/DGCNN/...) on top of BERT brings essentially no further improvement.
Post-training results

Knowledge injection

There are two main ways to inject knowledge: fusing it into BERT's embedding layer, or fusing it into the transformer output layer:

  • Embedding-layer fusion
    external-embedding-bottom
  • Transformer-output-layer fusion
    top-embedding

The injected knowledge is word2vec vectors trained with gensim (dims=100); however, after repeated experiments neither fusion method brought any improvement:

python pair-external-embedding.py

To switch between the two fusion methods, read the code and modify it accordingly.
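
For reference, training the 100-dimensional word2vec vectors with gensim looks roughly like this (the toy corpus, tokenization and all hyperparameters other than the dimension are assumptions; gensim 3.x uses size= instead of vector_size=):

    from gensim.models import Word2Vec

    # sentences: tokenized query/answer texts, e.g. produced by jieba segmentation
    sentences = [["房子", "几年", "了"], ["二年", "了"], ["楼层", "靠", "中间"]]

    w2v = Word2Vec(
        sentences,
        vector_size=100,   # dims=100, as in the experiments above
        window=5,
        min_count=1,
        sg=1,              # skip-gram
        workers=4,
    )
    w2v.save("w2v_100d.model")
    vector = w2v.wv["房子"]  # 100-dim vector to be fused at the embedding or output layer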

Contrastive learning

Contrastive learning was also introduced in an attempt to improve performance. There are two main variants: self-supervised contrastive learning and supervised contrastive learning:

  • Self-supervised contrastive learning
    Build a pair of views by swapping the query/answer positions and randomly masking 10% of the tokens; the two views serve as positive examples of each other:

  • loss
    self-supervised contrastive learning loss

  • model
    self-supervised contrastive learning model

  • Supervised contrastive learning
    Samples sharing the same label are treated as positives of one another (a minimal loss sketch follows the run commands below):

  • loss
    supervised contrastive learning loss

  • model
    supervised contrastive learning model

Run the self-supervised contrastive learning code:

python pair-data-augment-contrstive-learning.py 

Run the supervised contrastive learning code:

python pair-supervised-contrastive-learning.py
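
For reference, the sketch below shows the standard supervised contrastive loss (Khosla et al., 2020) in TensorFlow 2; it illustrates the idea of treating same-label samples as mutual positives rather than reproducing the repo's Keras implementation, and the temperature value is an assumption.

    import tensorflow as tf

    def supervised_contrastive_loss(labels, features, temperature=0.1):
        """Samples with the same label in the batch are positives of each other."""
        features = tf.math.l2_normalize(features, axis=1)
        labels = tf.reshape(labels, [-1, 1])
        batch_size = tf.shape(features)[0]

        # pairwise similarities, scaled by the temperature and stabilized
        logits = tf.matmul(features, features, transpose_b=True) / temperature
        logits = logits - tf.stop_gradient(tf.reduce_max(logits, axis=1, keepdims=True))

        not_self = 1.0 - tf.eye(batch_size)                          # exclude i == j pairs
        same_label = tf.cast(tf.equal(labels, tf.transpose(labels)), tf.float32)
        positives = same_label * not_self

        # log-probability of each pair under a softmax over all non-self pairs
        exp_logits = tf.exp(logits) * not_self
        log_prob = logits - tf.math.log(tf.reduce_sum(exp_logits, axis=1, keepdims=True) + 1e-12)

        # average over each anchor's positives; anchors without positives are skipped
        num_pos = tf.reduce_sum(positives, axis=1)
        mean_log_prob_pos = tf.reduce_sum(positives * log_prob, axis=1) / tf.maximum(num_pos, 1.0)
        valid = tf.cast(num_pos > 0, tf.float32)
        return -tf.reduce_sum(mean_log_prob_pos * valid) / tf.maximum(tf.reduce_sum(valid), 1.0)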

Self-distillation

In self-distillation, the teacher and the student are the same model: the teacher is trained once, labels the training data with soft labels, and this knowledge is then transferred to the student model.

python pair-self-kd.py
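
A minimal sketch of such a distillation objective: the teacher's logits are softened with a temperature and blended with the hard-label loss (the temperature and mixing weight below are illustrative assumptions, not the repo's values).

    import tensorflow as tf

    def self_distillation_loss(y_true, student_logits, teacher_logits,
                               temperature=2.0, alpha=0.5):
        """Hard-label cross-entropy mixed with KL divergence to the teacher's
        temperature-softened predictions (the soft labels)."""
        hard = tf.keras.losses.sparse_categorical_crossentropy(
            y_true, student_logits, from_logits=True)

        soft_teacher = tf.nn.softmax(teacher_logits / temperature)
        log_soft_student = tf.nn.log_softmax(student_logits / temperature)
        kl = tf.reduce_sum(
            soft_teacher * (tf.math.log(soft_teacher + 1e-12) - log_soft_student), axis=-1)

        # temperature**2 keeps the soft-label gradient on the same scale as the hard loss
        return alpha * hard + (1.0 - alpha) * (temperature ** 2) * kl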

Adversarial training

Use FGM to perturb the embedding layer:

python pair-adversarial-train.py
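
FGM perturbs the embedding matrix by epsilon * g / ||g|| along the gradient direction, accumulates the adversarial gradients with the clean ones, then restores the embeddings. Below is a schematic TensorFlow 2 training step rather than the repo's Keras implementation; epsilon and perturbing a single embedding variable are assumptions.

    import tensorflow as tf

    def fgm_train_step(model, emb_var, x, y, loss_fn, optimizer, epsilon=0.5):
        """One training step with an FGM perturbation applied to `emb_var`."""
        def densify(g):
            return tf.convert_to_tensor(g) if isinstance(g, tf.IndexedSlices) else g

        # 1) clean forward/backward pass
        with tf.GradientTape(persistent=True) as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        emb_grad = densify(tape.gradient(loss, emb_var))
        del tape

        # 2) attack: move the embeddings along the normalized gradient direction
        r_adv = epsilon * emb_grad / (tf.norm(emb_grad) + 1e-8)
        emb_var.assign_add(r_adv)

        # 3) adversarial forward/backward pass
        with tf.GradientTape() as adv_tape:
            adv_loss = loss_fn(y, model(x, training=True))
        adv_grads = adv_tape.gradient(adv_loss, model.trainable_variables)

        # 4) restore the embeddings and apply the accumulated gradients
        emb_var.assign_sub(r_adv)
        total = [None if g is None else densify(g) + densify(ag)
                 for g, ag in zip(grads, adv_grads)]
        optimizer.apply_gradients(
            (g, v) for g, v in zip(total, model.trainable_variables) if g is not None)
        return loss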

Data augmentation

Two kinds of data augmentation were tried: EDA and pseudo-labeling.

  • EDA: random deletion / random replacement / random insertion / random repetition, each applied to 10% of the tokens, generating 4 new samples per original sample. Because the word-vector quality was low, the "synonym" used for these operations is simply a word randomly picked from the current sentence.

  • Pseudo-labeling: label the test data with an already-trained model and add it to the training set.

Tips: when augmenting, filter the generated samples with an already-trained model and drop the low-confidence (<0.7) ones to avoid introducing wrongly labeled data. Also, when pseudo-labeling, mind the data ratio: if too much test data enters the training set too early, the final predictions simply converge to the pseudo labels and no improvement is gained.
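
A sketch of the confidence filter described above; the 0.7 threshold comes from the text, while capping how many test samples may enter training (max_ratio, train_size) is an illustrative assumption.

    import numpy as np

    def select_pseudo_labels(test_samples, probs, threshold=0.7,
                             max_ratio=0.3, train_size=None):
        """Keep only high-confidence pseudo-labels, optionally capped at a
        fraction of the training-set size so they cannot dominate training."""
        probs = np.asarray(probs)                       # shape: [num_test, num_classes]
        confidence = probs.max(axis=1)
        labels = probs.argmax(axis=1)

        keep = np.where(confidence >= threshold)[0]
        keep = keep[np.argsort(-confidence[keep])]      # most confident first
        if train_size is not None:
            keep = keep[: int(max_ratio * train_size)]

        return [(test_samples[i], int(labels[i])) for i in keep]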

shuffle

With the query-answer-list sample format, enumerate all permutations of the answer list at decoding time and vote over the predictions. However, answer order matters a lot in this competition's data, so shuffled inputs perform worse and this brought no improvement.
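
In sketch form the decoding trick looks like this; predict_fn is a hypothetical stand-in for the trained query-answer-list model and is assumed to return one label per answer, in the order the answers were passed in.

    from collections import Counter
    from itertools import permutations

    def permutation_vote(query, answers, predict_fn):
        """Run the model on every ordering of the answer list and majority-vote
        the per-answer labels across orderings (len(answers)! runs, so only
        practical for short answer lists)."""
        votes = [Counter() for _ in answers]
        for order in permutations(range(len(answers))):
            preds = predict_fn(query, [answers[i] for i in order])
            for position, answer_index in enumerate(order):
                votes[answer_index][preds[position]] += 1
        return [v.most_common(1)[0][0] for v in votes]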

Summary

-----------------------------------2020.01.18------------------------------------------------------------------

Competition

贝壳找房 (Beike) real-estate chat QA matching. Competition page: https://www.datafountain.cn/competitions/474/datasets

Summary blog post: ccf问答匹配 (CCF QA matching)

Brief description

Each sample is one question with multiple replies; some replies actually answer the question (label 1) and some do not (label 0), and the replies are given in order, i.e. query1: [(answer1, 0), (answer2, 1), ...]. The task is to classify each reply as answering the question or not.

Pretrained model weights

The pretrained model is Huawei's open-source nezha-base-wwm.

Baseline

Approach 1:

Ignore the order relationship between replies and split each sample into query-answer pairs, then classify each pair. For example, the sample {query: "房子几年了", answers: [("二年了", 1), ("楼层靠中间", 0)]} is split into individual query-answer pairs: [{query: "房子几年了", answer: "二年了", label: 1}, {query: "房子几年了", answer: "楼层靠中间", label: 0}]

pair match

Code: pair_match

Single-model submission F1: 0.752
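
The split into independent query-answer pairs is mechanical; a minimal sketch:

    def to_pairs(sample):
        """Flatten one {query, answers} sample into independent query-answer pairs."""
        return [
            {"query": sample["query"], "answer": answer, "label": label}
            for answer, label in sample["answers"]
        ]

    sample = {"query": "房子几年了", "answers": [("二年了", 1), ("楼层靠中间", 0)]}
    print(to_pairs(sample))
    # [{'query': '房子几年了', 'answer': '二年了', 'label': 1},
    #  {'query': '房子几年了', 'answer': '楼层靠中间', 'label': 0}]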

Approach 2:

To preserve dialogue coherence and completeness, concatenate all replies in order and prepend the question, forming query-answer1-answer2, then classify each reply. The example above becomes the sample {query: "房子几年了", answer: "二年了[SEP]楼层靠中间[SEP]", label: [mask, mask, mask, 1, mask, mask, mask, mask, mask, 0]}, i.e. the [SEP] token following each reply is used as that reply's final feature vector for binary classification.

Code: match_point

Single-model submission F1: 0.75
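
A sketch of how such a sample can be assembled: the replies are joined with [SEP], and only each trailing [SEP] position carries a real label while every other position is masked out of the loss (the -100 mask value and the character-level split are illustrative assumptions).

    def build_point_sample(sample, mask_label=-100):
        """Join all replies with [SEP]; each reply's trailing [SEP] token carries
        that reply's label, and every other position is ignored by the loss."""
        answer_tokens, labels = [], []
        for answer, label in sample["answers"]:
            answer_tokens.extend(list(answer) + ["[SEP]"])       # character-level split for illustration
            labels.extend([mask_label] * len(answer) + [label])  # the label sits on the [SEP]
        return answer_tokens, labels

    sample = {"query": "房子几年了", "answers": [("二年了", 1), ("楼层靠中间", 0)]}
    tokens, labels = build_point_sample(sample)
    # tokens: ['二', '年', '了', '[SEP]', '楼', '层', '靠', '中', '间', '[SEP]']
    # labels: [-100, -100, -100, 1, -100, -100, -100, -100, -100, 0]
    # (the query is encoded as the first segment in front of these answer tokens)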

Approach 3:

Pattern-Exploiting Training (PET): add a pattern to turn the task into an MLM task, then decide the class from the pattern's score. Here we can prepend a prefix pattern "间接回答问题" / "直接回答问题", corresponding to labels 0/1; scoring the pattern only requires comparing the probabilities of the tokens "间" and "直" at the first position. In addition, BERT's MLM pretraining objective can be kept during training to improve generalization. For more details see 文本分类秒解.

For this task, the corresponding diagram is as follows:

Code: pet classification

Single-model submission F1: 0.76+
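
The scoring step can be sketched as follows. The repo uses NEZHA via Keras; here the same idea is shown with bert-base-chinese through the HuggingFace transformers API, and the exact pattern string is an assumption based on the description above.

    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    mlm = BertForMaskedLM.from_pretrained("bert-base-chinese")
    mlm.eval()

    def pet_predict(query, answer):
        """Prefix the pair with the pattern "[MASK]接回答问题" and compare the MLM
        probabilities of "直" (label 1) and "间" (label 0) at the masked position."""
        text = f"{tokenizer.mask_token}接回答问题:{query}"
        inputs = tokenizer(text, answer, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
        with torch.no_grad():
            logits = mlm(**inputs).logits[0, mask_pos]
        id_direct = tokenizer.convert_tokens_to_ids("直")
        id_indirect = tokenizer.convert_tokens_to_ids("间")
        return 1 if logits[id_direct] > logits[id_indirect] else 0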

Approach 4

Different transformer layers of BERT capture semantics at different granularities, and information at different granularities contributes differently to classification, so the semantic information from all layers can be concatenated and used as the feature for classification.

For this task, the diagram is as follows:

Code: concat classification. Single-model submission F1: 0.75+
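
A sketch of the layer-concatenation idea, again shown with bert-base-chinese through the HuggingFace transformers API rather than the repo's NEZHA/Keras code:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
    bert.eval()

    def multi_layer_cls_feature(query, answer):
        """Concatenate the [CLS] vector from every transformer layer into one feature."""
        inputs = tokenizer(query, answer, return_tensors="pt")
        with torch.no_grad():
            outputs = bert(**inputs)
        # hidden_states = (embedding output, layer 1, ..., layer 12); skip the embedding output
        cls_vectors = [h[:, 0] for h in outputs.hidden_states[1:]]
        return torch.cat(cls_vectors, dim=-1)        # shape: [1, 12 * 768]

    feature = multi_layer_cls_feature("房子几年了", "二年了")  # feed this into a classification head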

tips

A few text-classification papers I found inspiring:

ccf_2020_qa_match's Issues

ask for dataset download link

Hi, thanks for your great job! But I did not take part in the competition. If convenient, could you share the dataset download link? Thanks a lot.

Error when importing NEZHA

I get an error at runtime. I checked and the downloaded pretrained model does not contain a model.ckpt file. Do I need to convert it first? Thanks.

task-adaptive training

First of all, congratulations on taking first place, and thank you for open-sourcing the code. I have a question: what is the difference between task-adaptive training and model-adaptive? My understanding is that task-adaptive training means continuing to pretrain the model on the competition data to obtain a domain-adapted pretrained model. Is that correct?

What is the model ensembling strategy?

1. For a single model, train multiple copies with different random seeds;
2. For the QA Pair and QA Point model families, average the predicted probabilities as the final prediction?

Thanks a lot!

Self-distillation script (pair_self_kd.py) fails to run

The failure log is as follows.

Traceback (most recent call last):
  File "pair-self-kd.py", line 297, in <module>
    callbacks=[student_evaluator])
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3792, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1605, in __call__
    return self._call_impl(args, kwargs)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1645, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition:  Error while reading resource variable _AnonymousVar409 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar409/N10tensorflow3VarE does not exist.
         [[node ReadVariableOp_1191 (defined at /home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
         [[ReadVariableOp_1190/_12]]
  (1) Failed precondition:  Error while reading resource variable _AnonymousVar409 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar409/N10tensorflow3VarE does not exist.
         [[node ReadVariableOp_1191 (defined at /home/work/.conda/envs/py3-tf.2.2-ccf/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_keras_scratch_graph_150548]

Function call stack:
keras_scratch_graph -> keras_scratch_graph

What is the overall training pipeline?

After post-training NEZHA, did you fine-tune it with the adversarial training, EDA and self-distillation schemes and then ensemble those models together? Was there no further improvement on top of the baselines, or which baseline model did fine-tuning start from? How many models were there in the end, and how were they ensembled?

Requesting an update

Hope you find time to update the repo; I would like to learn from your approach.
