
CCF-BDCI-Sentiment-Analysis-Baseline

1. Adapted from an existing open-source codebase.

2. The model splits the text into k segments, feeds each segment into the language model separately, and then joins the segment representations with a GRU on top. The benefit is that you can set a small max_length and a larger k to reduce GPU memory usage, since memory grows quadratically with sequence length but only linearly with k.
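The splitting step described above can be sketched in plain Python (the function and parameter names here are illustrative, not taken from the repository's code):

```python
# Hypothetical illustration of the k-segment split described above
# (function and argument names are ours, not from the repo).
def split_into_segments(token_ids, split_num, max_seq_length, pad_id=0):
    """Split a token id list into `split_num` segments of `max_seq_length`,
    padding the tail so every segment has the same length."""
    total = split_num * max_seq_length
    ids = token_ids[:total]                    # truncate to k * max_length
    ids = ids + [pad_id] * (total - len(ids))  # pad the remainder
    return [ids[i * max_seq_length:(i + 1) * max_seq_length]
            for i in range(split_num)]

segments = split_into_segments(list(range(10)), split_num=3, max_seq_length=4)
# 3 segments of length 4; the last segment is padded with zeros.
```

With split_num=4 and max_seq_length=128 this covers 512 tokens while each forward pass through the language model only attends over 128.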

| Model | Online F1 |
| --- | --- |
| Bert-base | 80.3 |
| Bert-wwm-ext | 80.5 |
| XLNet-base | 79.25 |
| XLNet-mid | 79.6 |
| XLNet-large | -- |
| Roberta-mid | 80.5 |
| Roberta-large (max_seq_length=512, split_num=1) | 81.25 |

Notes:

1) Effective sequence length = max_seq_length * split_num

2) Effective batch size = per_gpu_train_batch_size * number of GPUs

3) The results above were obtained with 4 GPUs, so the effective batch size is 4. With only a single GPU, set per_gpu_train_batch_size to 4 and use a smaller max_length.

4) If GPU memory is too small, set gradient_accumulation_steps. For example, with gradient_accumulation_steps=2 and batch size=4, each step runs twice with a batch size of 2 and updates only after accumulating gradients. This is equivalent to batch size=4 but twice as slow, and the number of iterations should be raised accordingly, i.e. set train_steps to 10000.
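The equivalence in note 4 between accumulated micro-batches and one large batch can be shown with a toy loss (plain Python; all names are ours, not from the repo):

```python
# Gradient of L(w) = mean((w*x - y)^2) w.r.t. w, for a toy 1-D model.
def grad(w, xs, ys):
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# One full batch of 4 examples.
full = grad(w, xs, ys)

# Two micro-batches of 2, gradients averaged before the weight update --
# this is what gradient_accumulation_steps=2 does.
accum = (grad(w, xs[:2], ys[:2]) + grad(w, xs[2:], ys[2:])) / 2

assert abs(full - accum) < 1e-12  # same update direction either way
```

The averaged micro-batch gradient matches the full-batch gradient exactly whenever the micro-batches are equally sized, which is why the update is equivalent while only half the batch is resident in memory at a time.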

The actual batch size can be checked in the training log, e.g.:

09/06/2019 21:03:41 - INFO - __main__ - ***** Running training *****
09/06/2019 21:03:41 - INFO - __main__ - Num examples = 5872
09/06/2019 21:03:41 - INFO - __main__ - Batch size = 4
09/06/2019 21:03:41 - INFO - __main__ - Num steps = 5000

Competition Description

See the competition website for the task description.

Download the Dataset

Download the dataset from the competition website and unzip it into the ./data directory.

Data Preprocessing

cd data
python preprocess.py
cd ..

Bert-base Model

bash run_bert.sh
# average predictions over the 5 folds
python combine.py --model_prefix ./model_bert --out_path ./sub.csv
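combine.py's internals aren't shown here; assuming it averages each fold's class probabilities before taking the argmax (a common choice for k-fold ensembling), the idea can be sketched as follows (all names and the data layout are hypothetical):

```python
# Hypothetical sketch of 5-fold prediction averaging; combine.py's actual
# implementation may differ, and these names are ours.
def average_folds(fold_probs):
    """fold_probs: list of per-fold probability matrices, each shaped
    [num_examples][num_classes]. Returns one predicted label per example."""
    n_folds = len(fold_probs)
    n_examples = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    labels = []
    for i in range(n_examples):
        avg = [sum(fold[i][c] for fold in fold_probs) / n_folds
               for c in range(n_classes)]
        labels.append(max(range(n_classes), key=lambda c: avg[c]))
    return labels

# Two folds, two examples, three classes.
fold_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
fold_b = [[0.2, 0.7, 0.1], [0.1, 0.6, 0.3]]
print(average_folds([fold_a, fold_b]))  # averaged argmax per example
```

Averaging probabilities before the argmax is usually more stable than majority-voting the per-fold labels, since it preserves each fold's confidence.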

Bert Whole Word Masking Model

Download the PyTorch weights from https://github.com/ymcui/Chinese-BERT-wwm and unzip them into the chinese_wwm_ex_bert directory.

bash run_bert_wwm_ext.sh
python combine.py --model_prefix ./model_bert_wwm_ext --out_path ./sub.csv

XLNet-mid Model

Download the PyTorch weights from https://github.com/ymcui/Chinese-PreTrained-XLNet and unzip them into the ./chinese_xlnet_mid/ directory.

bash run_xlnet.sh
python combine.py --model_prefix ./model_xlnet --out_path ./sub.csv

Roberta-mid Model

Download the TensorFlow weights from https://github.com/brightmart/roberta_zh and unzip them into the ./chinese_roberta/ directory.

mv chinese_roberta/bert_config_middle.json chinese_roberta/config.json
python -u -m pytorch_transformers.convert_tf_checkpoint_to_pytorch --tf_checkpoint_path chinese_roberta/ --bert_config_file chinese_roberta/config.json --pytorch_dump_path chinese_roberta/pytorch_model.bin
bash run_roberta.sh
python combine.py --model_prefix ./model_roberta --out_path ./sub.csv

Contributors

dependabot[bot], guoday

Issues

run_bert.py computes eval_loss incorrectly

During training I noticed the eval_loss curve looked very strange. It turns out the original pytorch_transformers puts the eval_loss computation inside the with torch.no_grad() block, so I changed the code to:

with torch.no_grad():
    tmp_eval_loss = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask, labels=label_ids)
    logits = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask)
    eval_loss += tmp_eval_loss.mean().item()

After that it looked normal. Could this be the cause?

/home/ming/anaconda3/lib/python3.7/site-packages/sklearn/metrics/classification.py:1439: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
'recall', 'true', average, warn_for)
test 0.06457949662369551

The F1 looks normal, but the test value is low and this warning appears. I've searched for a long time without finding a fix. What could be the cause?

Results on XLNet_zh_Large

Could you test and compare performance on XLNet_zh_Large?
(The current XLNet_zh_Large is an early-access release; we will help resolve any issues that come up.)

Error when running English RoBERTa

Hello. I tried to run the English RoBERTa model by using RobertaForSequenceClassification, RobertaConfig, and RobertaTokenizer from pytorch_transformers, but I get:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
What causes this? Thanks for your help.

Where in the code are the k text segments fed into the model separately?

Could you explain how a document split into k segments is fed in for training? I couldn't find the place where the segments are "fed into the language model separately". And if that is how it works, couldn't a document of any length be handled by splitting it into many segments, with no truncation at all?

In BertForSequenceClassification's forward:

def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
            position_ids=None, head_mask=None):

    flat_input_ids = input_ids.view(-1, input_ids.size(-1))
    flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
    flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
    flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None

For example with k=2, input_ids contains the document's two segments, and view flattens them again, so isn't the input length unchanged, the same as without splitting?
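For what it's worth, the view call in the snippet above merges the batch and segment dimensions rather than concatenating tokens: each of the batch*k rows still has length max_seq_length, so every segment is encoded independently and the quadratic attention cost applies per segment, not to the full document. A shape-only illustration (the numbers here are ours, not from the repo):

```python
import numpy as np

# input_ids arrives shaped [batch, k, max_seq_length].
batch, k, max_seq_length = 2, 3, 4
input_ids = np.arange(batch * k * max_seq_length).reshape(batch, k, max_seq_length)

# reshape(-1, last_dim) mirrors input_ids.view(-1, input_ids.size(-1)):
# it merges the batch and segment dimensions into batch*k independent
# sequences, each still of length max_seq_length.
flat = input_ids.reshape(-1, input_ids.shape[-1])
print(flat.shape)  # 6 rows of length 4, not 2 rows of length 12
```

So the input to the encoder is never a length-k*max_seq_length sequence; the segments only meet again when the per-segment representations are combined on top.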

Why does run_roberta.sh use run_bert.py?

Hello,

Why does run_roberta.sh use run_bert.py? Is it because the RoBERTa model can be fine-tuned with the BERT code?
I also tried running an ALBERT pretrained model with run_bert.py, but got a torch size mismatch: one tensor is [21128, 2048], the other is (21128, 128).

A question: can these models run on a laptop without a GPU?

I know they can run in principle; it's just a matter of time. Since I don't have much compute, I've had to skip many models entirely. Kaggle offers free GPUs and I'd like to try there, so I wanted to ask: how long does this model take to train for you?

Since the test set labels are all dummy zeros, why load them into the dataloader?

I hit a pitfall yesterday: PyTorch labels cannot be negative. Is that why the test set labels are set explicitly? Could we instead predict directly on unlabeled data? (Also, I found yesterday that if you download the PyTorch version of the weights, no conversion is needed; just run bash run_roberta.sh directly.)
