cdial-gpt's People

Contributors

crownpku, kepei1106, lemon234071, silverriver, xiejiachen

cdial-gpt's Issues

About the pretrained models

Hello, first of all, thanks for open-sourcing the Chinese corpus!
May I ask whether you have considered training a GPT-2 model with more parameters?

Question on the comparison between GPT and GPT2

Hi, thanks for sharing the models! There's a detail I'm curious about: is there a reason why CDialGPT2LCCC performs worse than CDialGPTLCCC? I understand that GPT2 uses pre-LayerNorm and adds an additional LayerNorm after the final attention block compared with GPT, but I would not expect such a difference to cause CDialGPT2LCCC to perform so much worse.

Besides, in the paper you mention that both CDialGPTLCCC and CDialGPT2LCCC are first pretrained on your Chinese novel dataset. Does this imply that there is also a GPT2_novel model that you did not release (on which CDialGPT2LCCC is post-trained)?

Fine-tuning data format

Hello, sorry to bother you. What format should my fine-tuning training data be in?
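
For reference, a minimal sketch of what the fine-tuning JSON may look like, assuming the structure of the repo's toy_data.json (a dict of splits, each split a list of dialogues, each dialogue an ordered list of utterance strings); the exact schema should be checked against the README:

```python
# Hypothetical fine-tuning file, assuming the toy_data.json layout:
# splits -> dialogues -> utterances. Check against the repo's README.
import json

data = {
    "train": [
        ["你好", "你好，很高兴认识你", "我也是"],   # one multi-turn dialogue
        ["今天天气怎么样", "挺好的，出太阳了"],
    ],
    "valid": [
        ["吃饭了吗", "还没有，你呢"],
    ],
}

with open("my_finetune_data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```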

Discussion about the self-attention mechanism

Actually, I am new to NLP topics, and recently I have been studying the self-attention scheme described at http://jalammar.github.io/illustrated-gpt2/#part-3-beyond-language-modeling , along with the related papers http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf and https://arxiv.org/abs/2006.16362 .

The core idea of attention is to give more related Key/Query pairs higher probability scores via softmax(QK^T), where a larger dot product implies higher correlation, and two fully connected projections W_k and W_q are learned for this.

What confuses me is: is it true that higher correlation = a larger dot product of Q and K?

In my opinion, a clearer approach would be to use a distance directly to express how related a Q/K pair is, such as:

  1. softmax(-abs(Q-K)), giving key/query pairs with a smaller distance a higher probability;
  2. softmax(-min(Q/K, K/Q)), similar to the above, but using division instead of subtraction.

Might these ideas work? I mean, could they lead to some useful research?
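
To make the comparison concrete, here is a toy sketch (illustrative tensors only, not code from this repo) contrasting standard scaled dot-product attention scores with the L1-distance variant from point 1:

```python
import torch
import torch.nn.functional as F

d = 4
q = torch.randn(1, 3, d)   # (batch, n_queries, dim)
k = torch.randn(1, 5, d)   # (batch, n_keys, dim)

# Standard attention: softmax(QK^T / sqrt(d)); larger dot product -> more weight.
dot_weights = F.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)

# Proposed alternative: softmax(-|Q - K|), using the L1 distance between each
# query/key pair so that *closer* vectors get more weight.
l1_dist = (q.unsqueeze(2) - k.unsqueeze(1)).abs().sum(-1)   # (1, 3, 5)
dist_weights = F.softmax(-l1_dist, dim=-1)

print(dot_weights.shape, dist_weights.shape)   # both (1, 3, 5)
```

One observable difference: the dot product grows with vector norm as well as alignment, while a distance score is maximal only when the vectors coincide, so the two schemes rank key/query pairs differently.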

Learning rate for the Noam scheduler

Hi,

I found a possible problem in the training script:

When using the Noam scheduler, the initial lr should be set to 1, because the return value of noam_lambda is multiplied by the initial lr to produce the lr for the current step; setting it to 5e-5 seems to make the lr 20000 times smaller than intended.

Also, for the step in noam and the other parameters with "step" in their names, shouldn't a step be counted per optimizer.step() call rather than per engine iteration?

Apologies if I have misunderstood anything.
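
For context, a minimal sketch of the interaction the reporter describes: torch's LambdaLR multiplies the optimizer's initial lr by whatever the lambda returns, so a Noam-style lambda that already computes an absolute learning rate needs the initial lr set to 1 (names here are illustrative, not the repo's exact code):

```python
import torch

model = torch.nn.Linear(10, 10)
d_model, warmup = 512, 4000

def noam_lambda(step):
    # Returns an absolute learning rate following the Noam schedule.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# LambdaLR sets lr = initial_lr * noam_lambda(step) each step, so with
# lr=5e-5 here every value would be scaled down by 5e-5; lr=1.0 makes the
# lambda's output the effective learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda)
```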

Some questions about loading the fine-tuned model

Hi, according to the official instructions, the model checkpoint used for training and for generating text is the same path. However, after fine-tuning with CDial-GPT_LCCC-large, the model files in that directory do not appear to have been modified (judging by their timestamps). So which path should I load when generating text? And what is in the run folder?

Open-domain text generated by GPT-2 often repeats itself; is there a good way to deal with this?

For example:
这首歌我一直都在听,不知道为什么,我的心好像就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑

Can "THE CURIOUS CASE OF NEURAL TEXT DEGENERATION" solve this problem?
I have actually already used top-p (nucleus) sampling, but the repetition above still occurs frequently.
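
One common mitigation beyond nucleus sampling (not something this repo ships, just a sketch) is to ban n-grams that already occur in the generated prefix, as in no-repeat-n-gram decoding:

```python
import torch

def block_repeated_ngrams(logits, generated_ids, n=3):
    """Set logits of tokens that would complete an already-seen n-gram to -inf."""
    if len(generated_ids) < n - 1:
        return logits
    prefix = tuple(generated_ids[-(n - 1):])
    for i in range(len(generated_ids) - n + 1):
        if tuple(generated_ids[i:i + n - 1]) == prefix:
            logits[generated_ids[i + n - 1]] = float("-inf")
    return logits

logits = torch.zeros(10)
logits = block_repeated_ngrams(logits, [1, 2, 3, 1, 2])
print(logits[3])   # -inf: emitting 3 would repeat the trigram (1, 2, 3)
```

A repetition penalty that divides the logits of already-generated tokens (as in CTRL) is another frequently used option.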

Question about label generation in WBDataset

Hi, while reading the code I have a question about the process method of the WBDataset class:

(screenshot: the label-generation code in WBDataset.process)

When generating labels, in theory each preceding token should be used to predict the next token. But with the processing in the code above, the label for every token of the last turn is the token itself, like this:

(screenshot: the resulting labels)

This differs from my understanding of the training process. Is my understanding wrong? Could you clarify? Thanks!
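
For what it's worth, the Hugging Face LM heads shift logits and labels internally when labels are passed, so labels aligned one-to-one with input_ids are the expected input; the apparent "label equals itself" is resolved by this shift. A sketch of what happens inside the forward pass:

```python
import torch
import torch.nn.functional as F

vocab = 100
logits = torch.randn(1, 6, vocab)          # (batch, seq_len, vocab)
labels = torch.randint(0, vocab, (1, 6))   # aligned with input_ids

# Inside modeling_openai.py / modeling_gpt2.py: position t predicts label t+1.
shift_logits = logits[..., :-1, :]
shift_labels = labels[..., 1:]
loss = F.cross_entropy(shift_logits.reshape(-1, vocab), shift_labels.reshape(-1))
```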

About the training setup

The paper says the model was trained on the dataset for 30 epochs with batch_size=8 per step? The corpus has about 12 million dialogues, so at 0.1 s per step the total time is 12,000,000 * 30 / 8 * 0.1 s ≈ 52 days? 52 days of training on a single GPU??

Questions about fine-tuning on a custom dataset

I first tried fine-tuning on the STC dataset. Since I only wanted to verify that the pipeline runs, I trained for 1 epoch on a single 2080 Ti; the initial time estimate was 17:50:30, but the run actually took about 28 hours...

The command for training on STC was

python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/STC/STC.json --scheduler linear --n_epochs 1

(screenshot: STC run)

Then I started training on my own data (the training JSON file is only about 2 MB) for 5 epochs, and I noticed the estimated time (17:31:46) was about the same as for the STC dataset. My dataset should be much smaller... Moreover, even when I change it to 1 epoch, the total step count shown never changes from 2195633, and the estimate stays around 17 hours. Am I misunderstanding something? What is going on? (See the sketch after the commands below for one possible cause.)

The command for 5 epochs on the custom dataset was

python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/custom_train.json --scheduler linear --n_epochs 5

(screenshot: 5-epoch run)

The command for 1 epoch on the custom dataset was

python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/custom_train.json --scheduler linear --n_epochs 1

(screenshot: 1-epoch run)
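
One plausible cause (an assumption, not confirmed by the maintainers): train.py appears to cache the tokenized dataset (logs elsewhere on this page mention loading from dataset_cache_BertTokenizer), so changing --data_path without clearing that cache would keep training on the previously tokenized data. A quick check:

```python
# Hypothetical fix: delete the tokenized-dataset cache so the new --data_path
# is re-tokenized; "dataset_cache_BertTokenizer" is the cache name seen in the
# logs on this page and may differ in your setup.
import os

cache_path = "dataset_cache_BertTokenizer"
if os.path.exists(cache_path):
    os.remove(cache_path)
```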

OpenAIGPTLMHeadModel not initialized

python3 interact.py --model_checkpoint ./models/Novel_GPT/

INFO:transformers.modeling_utils:loading weights file ./models/Novel_GPT/pytorch_model.bin
INFO:transformers.modeling_utils:Weights of OpenAIGPTLMHeadModel not initialized from pretrained model: ['transformer.h.0.attn.bias', 'transformer.h.1.attn.bias', 'transformer.h.2.attn.bias', 'transformer.h.3.attn.bias', 'transformer.h.4.attn.bias', 'transformer.h.5.attn.bias', 'transformer.h.6.attn.bias', 'transformer.h.7.attn.bias', 'transformer.h.8.attn.bias', 'transformer.h.9.attn.bias', 'transformer.h.10.attn.bias', 'transformer.h.11.attn.bias']

How can I fix this?
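
As far as I can tell, transformer.h.*.attn.bias in Hugging Face's GPT implementation is the registered causal-mask buffer, not a learned weight: it is recreated at initialization, so this warning is normally harmless. A quick sanity check (sketch):

```python
from transformers import OpenAIGPTLMHeadModel

model = OpenAIGPTLMHeadModel.from_pretrained("./models/Novel_GPT/")
mask = model.transformer.h[0].attn.bias
print(mask.shape)   # lower-triangular causal mask, rebuilt at init, not loaded
```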

Training and interacting on different machines

Training was done on a machine with a GPU (Ubuntu 20.04); after fine-tuning, running interact.py there and loading my own trained model worked fine for chatting...
For deployment I put the trained model and interact.py on another machine (Windows 10)... On the deployment machine, loading your pretrained LCCD_GPT model works and I can chat, but loading my own trained model file raises the following error (both the training and deployment machines have tensorflow 1.14.0, not tf2):

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 187, in nti
    n = int(s.strip() or "0", 8)
ValueError: invalid literal for int() with base 8: 's\n_rebui'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 2289, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1095, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1037, in frombuf
    chksum = nti(buf[148:156])
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 189, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 555, in _load
    return legacy_load(f)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 466, in legacy_load
    with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1591, in open
    return func(name, filemode, fileobj, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1621, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1484, in __init__
    self.firstmember = self.next()
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 2301, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\modeling_utils.py", line 626, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location="cpu")
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 386, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 559, in _load
    raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: .\data\Oct07_17-06-42_xxx\pytorch_model.bin is a zip archive (did you mean to use torch.jit.load()?)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".\bot_gpt.py", line 148, in <module>
    model = model_class.from_pretrained(args.model_checkpoint)
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\modeling_utils.py", line 629, in from_pretrained
    "Unable to load weights from pytorch checkpoint file. "
OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

What is going on? Is there some difference between your pretrained models and the models we fine-tune?
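
The tracebacks go through torch/serialization.py, so the TensorFlow version is not involved: PyTorch 1.6 switched torch.save to a zip-based format that torch < 1.6 cannot read, which matches the "is a zip archive" error. Assuming the training machine runs torch >= 1.6 and the Windows machine an older torch, either upgrade torch on the deployment machine or re-save the checkpoint in the legacy format:

```python
# Re-save the fine-tuned checkpoint on the training machine in the pre-1.6
# format so an older torch can load it (sketch; paths are illustrative).
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
torch.save(state_dict, "pytorch_model_legacy.bin",
           _use_new_zipfile_serialization=False)
```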

Pretraining

Hi, I'd like to ask: why pretrain on Chinese novel data first? Is this pretraining step done differently from the dialogue training?

How to train the model from scratch

Since I want to train a language model for speech recognition, I'd like to train from scratch. Which command should I run?
python3 train.py --data_path data/toy_data.json
raises the error
AttributeError: 'SummaryWriter' object has no attribute 'logdir'
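
This looks like a tensorboardX version mismatch: the SummaryWriter directory attribute was renamed between releases, so the installed version probably exposes log_dir where the script expects logdir (or vice versa). Pinning tensorboardX to the version the repo expects should fix it; a version-tolerant workaround, as a sketch:

```python
from tensorboardX import SummaryWriter

writer = SummaryWriter()
# Try both attribute names, since the spelling differs across tensorboardX versions.
log_dir = getattr(writer, "logdir", None) or getattr(writer, "log_dir", None)
print(log_dir)
```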

Fine-tuning experiments on STC

Hi, this is a nice project and I'm very interested in it. Following the parameter settings in the paper, I trained with the command: CUDA_VISIBLE_DEVICES=0 python train.py --pretrained --model_checkpoint ./models/ --data_path data/STC.json --lr 6.25e-5 --train_batch_size 8 --n_epochs 10. After training, however, the ppl on the validation set converges to around 29, which is not very good. Besides the parameters above, are there any other settings I need? Thanks!

Can't download GPT_LCCC-large.zip

<Error>
<Code>AccessDenied</Code>
<Message>You do not have read permission on this object.</Message>
<RequestId>5F7F9DC53ADDB93530C41430</RequestId>
<HostId>coai-dataset.oss-cn-beijing.aliyuncs.com</HostId>
</Error>

Perplexity (ppl) keeps increasing

During fine-tuning the loss keeps decreasing, but the ppl and nll keep increasing: at epoch 1 the ppl is 13.765, while by epoch 67 it has reached 26. Why does the model's perplexity get worse, and the results get poorer, the longer it trains?

Large gap when predicting on the LCCC dataset

Using the pretrained lccc-large to predict on the LCCC validation set, why is the F1 score of the replies so poor (<0.10)?
The pretrained model has effectively already seen the answers, yet the predicted replies still differ greatly from the ground-truth replies...

Reworking your code raises IndexError: Target -1 is out of bounds

I reworked your code with torch==1.7.0 and transformers==3.5.1; in the update method, the line

(lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)

raises an error, but switching the environment to torch==1.4.0 and transformers==2.1.1 makes it go away. It is presumably a version issue, but I don't know how to fix it.

The full error output is as follows:

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
['[CLS] [speaker1] 王 雁 盟 [speaker2] 1 9 9 6 年 , 台 湾 计 算 机 程 序 设 计 师 王 雁 盟 到 欧 洲 旅 游 , 在 布 拉 格 街 头 他 为 街 头 艺 人 的 手 风 琴 演 奏 所 着 迷 。 于 是 在 第 二 年 , 他 拜 巴 黎 手 风 琴 演 奏 家 d o m i n i q u e b o d i n 为 师 , 学 习 手 风 琴 演 奏 技 术 。 1 9 9 8 年 回 台 湾 , 在 街 头 拉 着 他 的 手 风 琴 游 荡 。 之 后 , 他 开 始 为 电 影 、 剧 团 演 出 等 伴 奏 手 风 琴 。 到 2 0 0 3 年 , 他 为 几 米 的 《 地 下 铁 一 个 音 乐 的 旅 程 》 音 乐 剧 作 曲 与 演 出 。 《 漂 浮 的 手 风 琴 》 是 他 自 己 制 作 、 作 曲 并 演 奏 的 第 一 个 专 辑 。 [SEP]', '[CLS] [speaker1] 大 话 西 游 之 月 光 宝 盒 主 演 [speaker2] 罗 家 英 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
['[CLS] [speaker1] [speaker1] [speaker1] [speaker1] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2]', '[CLS] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
['[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 1 9 9 6 年 , 台 湾 计 算 机 程 序 设 计 师 王 雁 盟 到 欧 洲 旅 游 , 在 布 拉 格 街 头 他 为 街 头 艺 人 的 手 风 琴 演 奏 所 着 迷 。 于 是 在 第 二 年 , 他 拜 巴 黎 手 风 琴 演 奏 家 d o m i n i q u e b o d i n 为 师 , 学 习 手 风 琴 演 奏 技 术 。 1 9 9 8 年 回 台 湾 , 在 街 头 拉 着 他 的 手 风 琴 游 荡 。 之 后 , 他 开 始 为 电 影 、 剧 团 演 出 等 伴 奏 手 风 琴 。 到 2 0 0 3 年 , 他 为 几 米 的 《 地 下 铁 一 个 音 乐 的 旅 程 》 音 乐 剧 作 曲 与 演 出 。 《 漂 浮 的 手 风 琴 》 是 他 自 己 制 作 、 作 曲 并 演 奏 的 第 一 个 专 辑 。 [SEP]', '[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 罗 家 英 [SEP] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]']
Current run is terminating due to exception: Target -1 is out of bounds..
Engine run is terminating due to exception: Target -1 is out of bounds..
Traceback (most recent call last):
  File "/t/main.py", line 40, in <module>
    trainer.run(train_dataloader, max_epochs=2)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 691, in run
    return self._internal_run()
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 762, in _internal_run
    self._handle_exception(e)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 730, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 828, in _run_once_on_dataset
    self._handle_exception(e)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 811, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/kingsoft/gang/t/main.py", line 21, in update
    (lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/transformers/modeling_openai.py", line 595, in forward
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 962, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/lib/python3.7/site-packages/torch/nn/functional.py", line 2468, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/lib/python3.7/site-packages/torch/nn/functional.py", line 2264, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target -1 is out of bounds.

Process finished with exit code 1
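
The version difference is almost certainly the ignored-label value: newer transformers use torch.nn.CrossEntropyLoss with its default ignore_index of -100, while older versions (and this repo's labels) pad with -1, which is now an out-of-bounds target. Remapping the padding value in the update step is a minimal fix (sketch, adapting the line from the traceback):

```python
# Map the old ignore value (-1) to the one CrossEntropyLoss now expects (-100)
# before the forward pass; model/input_ids/token_type_ids as in the update
# method above.
lm_labels = lm_labels.masked_fill(lm_labels == -1, -100)
(lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
```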

Why doesn't the model input include an attention_mask?

Hi, I'd like to ask: since this is an LM, trained to predict token n+1 from the first n tokens, why is no attention_mask introduced? Doesn't that mean the model has already seen the ground truth when predicting token n+1? Wouldn't that be inconsistent with the test-time setting? And would adding an attention_mask during training make test-time performance better?
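
For context (a general property of GPT-style decoders, not repo-specific): the model applies a causal lower-triangular mask inside self-attention, so position n can never attend to tokens after n even when the whole sequence is fed in during training; the attention_mask argument in transformers masks padding, not future tokens. A toy illustration:

```python
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)   # row i has ones only in columns 0..i: no peeking ahead
```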

Inputs over 512 tokens are not automatically truncated

I picked a local single-turn dialogue dataset in which some samples are longer than 512 tokens. During training this raises RuntimeError: index out of range: Tried to access index 513 out of table with 512 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
It would be good if the code truncated the inputs automatically during training.
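
Until that lands, a minimal pre-truncation pass over tokenized samples keeps them inside the model's 512-position table (a sketch; it keeps the most recent tokens, since the reply depends on the latest context):

```python
MAX_POSITIONS = 512  # from the error message above

def truncate_sample(token_ids, max_len=MAX_POSITIONS):
    """Drop the oldest tokens so the sample fits the position-embedding table."""
    return token_ids[-max_len:] if len(token_ids) > max_len else token_ids

print(len(truncate_sample(list(range(600)))))   # 512
```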

Following the README steps raises "Your PyTorch installation may be too old"

INFO:train.py:Build train and validation dataloaders
INFO:train.py:Load tokenized dataset from cache at dataset_cache_BertTokenizer
Traceback (most recent call last):
  File "train.py", line 237, in <module>
    train()
  File "train.py", line 116, in train
    train_loader, val_loader, train_sampler, valid_sampler = loader_class(args, tokenizer, logger)
  File "/home/kingsoft/BDCI2020/CDial-GPT/od/inputters/inputter.py", line 47, in build_dataloaders
    datasets, raw_samples = get_data(tokenizer, args.data_path, args.dataset_cache, logger)
  File "/home/kingsoft/BDCI2020/CDial-GPT/od/inputters/inputter.py", line 21, in get_data
    dataset = torch.load(dataset_cache)
  File "/home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/serialization.py", line 527, in load
    with _open_zipfile_reader(f) as opened_zipfile:
  File "/home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/serialization.py", line 224, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f2ad21b4193 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7f2ad533c9eb in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7f2ad533dc04 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x6c6536 (0x7f2b1d66d536 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x295a74 (0x7f2b1d23ca74 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #37: __libc_start_main + 0xf0 (0x7f2b24906840 in /lib/x86_64-linux-gnu/libc.so.6)

My guess is that this might be a tensorflow version problem?

About data organization

Hi, a question I've long had about processing multi-turn dialogue data: does a multi-turn dialogue need to be split into multiple training samples? The provided example data doesn't seem to be split into multiple samples, and from the code it looks like the last utterance is the prediction target while the preceding utterances serve as context. Is my understanding correct?

Code for validation-set perplexity

How is the loss computed on the STC_test.json validation set? In infer.py I added
(loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
but I don't know what lm_labels should be. In train.py there is
input_ids, token_type_ids, lm_labels = tuple(input_tensor.to(args.device) for input_tensor in batch)
while in infer.py there is
instance, sequence = build_input_from_segments(history, reply, tokenizer, current_output, with_eos=False)
input_ids = torch.tensor(instance["input_ids"], dtype=torch.long, device=args.device).unsqueeze(0)

Could you publish the ppl code used during testing?
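
In the meantime, a sketch of validation perplexity: build batches with the same masked lm_labels that train.py produces, average the LM loss, and exponentiate (names follow the snippets above; averaging per batch rather than per token is an approximation):

```python
import math
import torch

def evaluate_ppl(model, val_loader, device):
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids, token_type_ids, lm_labels = (t.to(device) for t in batch)
            (lm_loss), *_ = model(input_ids, labels=lm_labels,
                                  token_type_ids=token_type_ids)
            total_loss += lm_loss.item()
            n_batches += 1
    return math.exp(total_loss / n_batches)
```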

About the distributed training code

Hi, the import in train.py is
from torch.nn.parallel import DataParallel
Is this the wrong import? Should it be
from torch.nn.parallel import DistributedDataParallel

since the distributed parts of the code include init_process_group, the DistributedSampler for data loading, and so on?
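
For reference, a minimal sketch of how DistributedDataParallel is normally wired up with init_process_group (one process per GPU, launched via python -m torch.distributed.launch; illustrative, not this repo's code):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")   # reads rank/world size from the env
local_rank = dist.get_rank()              # equals the local rank on one node
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])
```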

About the evaluation

May I get the evaluation code? The BLEU-2 scores are 67.2 for the Transformer and 66.3 for GPTLCCC-large, respectively. The performance seems amazing; how was it attained? Thanks.

A question about a training detail

Thank you very much for your work. While reading your code and README carefully, I noticed that during training you replace all labels of the history part with [UNK], so the loss is a language-model loss computed only over the response.
I have also tried training a GPT2 dialogue model before (the results were not particularly good, possibly because of the data size: 500k samples), but like the GPT2-chitchat project, I treated history and response together as one sequence and trained the language model on the whole thing.

My question is: is there a particular reason you did not choose the latter and only compute the LM loss on the response? For example, is training the LM only on the response better than training on history + response together?

Looking forward to your reply!!!
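
For readers, a small sketch of the two labeling schemes being compared (the IGNORE placeholder stands for whichever masked-label value the code actually uses):

```python
IGNORE = -1  # placeholder for the masked-label value

history_ids = [10, 11, 12]    # tokenized dialogue history
response_ids = [20, 21, 22]   # tokenized response

input_ids = history_ids + response_ids
# This repo's scheme: loss only on the response, history positions ignored.
labels_response_only = [IGNORE] * len(history_ids) + response_ids
# GPT2-chitchat-style: LM loss over every token, history included.
labels_full_lm = list(input_ids)
```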

Questions about reproducing the STC fine-tuning results

How are the automatic metrics BLEU-2 and BLEU-4 computed for the STC fine-tuning results?

I could not find any BLEU-related code in the project, and after trying fine-tuning myself, I found that my results deviate from yours.
The results below are the generations at the 5th epoch, evaluated with nltk's corpus_bleu tool.
(screenshot: BLEU scores)
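
For comparison, here is roughly how BLEU-2/BLEU-4 can be computed with nltk's corpus_bleu; note that the tokenization granularity (character-level below, an assumption) and the smoothing method both change the score considerably, which alone can explain large deviations:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[list("今天天气很好")]]   # per sample: a list of reference token lists
hypotheses = [list("今天天气不错")]     # per sample: one hypothesis token list

smooth = SmoothingFunction().method1
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5),
                    smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(bleu2, bleu4)
```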

Problems fine-tuning on the STC dataset: on a machine with a single 8 GB GPU, fine-tuning the large STC dataset for 70 epochs took only 10 minutes

The command was python train.py --pretrained --model_checkpoint CDial-GPT_LCCC-large --data_path data/STC.json --train_batch_size 8 --scheduler linear, with STC.json as the dataset. 70 epochs finished in 10 minutes, on a machine with a single 8 GB GPU and gradient accumulation of 64. Even after moving toy_train.txt to another folder it is still this fast, and the biggest problem is that every epoch has 125 steps, exactly the same as when training on toy_train.txt before. Where could the problem be?
Epoch [1/70]: [1/125] 1%|▏ , loss=0.0422, lr=5e-5 [00:00<?]
lm_loss: tensor(3.0413, device='cuda:0', grad_fn=)
Epoch [1/70]: [2/125] 2%|▎ , loss=0.0424, lr=5e-5 [00:00<00:29]
lm_loss: tensor(2.2726, device='cuda:0', grad_fn=)
Epoch [1/70]: [3/125] 2%|▌ , loss=0.0422, lr=5e-5 [00:00<00:19]
lm_loss: tensor(2.3869, device='cuda:0', grad_fn=)
Epoch [1/70]: [4/125] 3%|▋ , loss=0.0421, lr=5e-5 [00:00<00:15]
lm_loss: tensor(2.9688, device='cuda:0', grad_fn=)
Epoch [1/70]: [5/125] 4%|▉ , loss=0.0422, lr=5e-5 [00:00<00:13]
lm_loss: tensor(2.5291, device='cuda:0', grad_fn=)
Epoch [1/70]: [6/125] 5%|█ , loss=0.0422, lr=5e-5 [00:00<00:12]
lm_loss: tensor(2.7023, device='cuda:0', grad_fn=)
Epoch [1/70]: [7/125] 6%|

INFO:ignite.engine.engine.Engine:Epoch[70] Complete. Time taken: 00:00:13
INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:14:38
