thu-coai / cdial-gpt
A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models
License: MIT License
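Have you tried splitting off part of the LCCC data as a validation set? What PPL does the fine-tuned GPT reach on the LCCC validation set? Thanks!
[speaker1] is actually encoded by BertTokenizer as three tokens: [, [UNK], and ].
Hello, first of all, thank you for open-sourcing the Chinese corpus!
A small question: have you considered training a GPT-2 model with more parameters?
Hello, thank you very much for your work. I successfully added the LCCC-Large model to OpenDialog, and the test results are good.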
Hi, thanks for sharing the models! There's a detail that I'm curious about. Is there a reason why CDialGPT2LCCC performs worse than CDialGPTLCCC? I understand that GPT2 uses pre-LayerNorm and adds an additional LayerNorm after the final attention block compared with GPT, but I would not expect that difference to result in much worse performance for CDialGPT2LCCC.
Besides, in the paper you mention that both CDialGPTLCCC and CDialGPT2LCCC are first pretrained on your Chinese novel dataset. Does this imply that there is also a GPT2_novel model that you did not release (on which CDialGPT2LCCC is post-trained)?
Hello, excuse me, I would like to ask what format the training data for fine-tuning should be in.
Actually, I am new to NLP, and recently I have been studying self-attention as described in http://jalammar.github.io/illustrated-gpt2/#part-3-beyond-language-modeling .
I have also been reading related papers, such as http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
and https://arxiv.org/abs/2006.16362 .
The core idea of attention is to give more closely related Key and Query pairs higher probability scores via
softmax(QK^T), where a higher product implies higher correlation. Two fully connected layers, W_k and W_q, are learned to optimize this.
What confuses me is: is it true that higher correlation = higher product QK?
In my opinion, a clearer approach would be to use a distance directly to express how close a pair of Q and K is,
such as
Might these ideas work? I mean, could they lead to some useful research papers?
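To make the question concrete, here is a minimal pure-Python sketch that compares weights from softmax over dot products with weights from softmax over negative squared distances. The function names and the toy vectors are illustrative assumptions, and the usual 1/sqrt(d) scaling is left out for brevity.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dot_product_weights(query, keys):
    # Standard attention scores: softmax over q . k for each key.
    return softmax([dot(query, k) for k in keys])

def distance_weights(query, keys):
    # Hypothetical alternative: softmax over the negative squared distance.
    return softmax([-sum((q - x) ** 2 for q, x in zip(query, k)) for k in keys])

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # all keys have unit norm
dp = dot_product_weights(query, keys)
dist = distance_weights(query, keys)
```

Since -||q - k||^2 = 2 q.k - ||q||^2 - ||k||^2, the two scorings rank keys identically whenever all keys have the same norm, as in this toy example. They differ only when key norms vary.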
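Hello,
I found a possible problem in the training script:
When using the noam scheduler, the initial lr should probably be set to 1, because the return value of noam_lambda is multiplied by the initial lr to get the lr for the current step. Setting it to 5e-5 seems to make the lr 20000 times smaller than expected.
In addition, for the step in noam and for other parameters with "step" in their names, should one step correspond to one call of "optimizer.step()" rather than one iteration of the engine?
Please forgive me if I have misunderstood something.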
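The scaling concern can be checked with a small sketch. It assumes the standard Noam factor from "Attention Is All You Need" and a LambdaLR-style scheduler, which sets the learning rate to the initial lr times the lambda's return value. The `model_size` and `warmup_steps` values are placeholders, not the repository's settings.

```python
def noam_lambda(step, model_size=768, warmup_steps=5000):
    """Standard Noam factor; `step` counts from 1 (values here are assumptions)."""
    return model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def effective_lr(initial_lr, step):
    # A LambdaLR-style scheduler multiplies the initial lr by the lambda's value.
    return initial_lr * noam_lambda(step)

step = 5000
lr_with_one = effective_lr(1.0, step)     # initial lr = 1, as the issue suggests
lr_with_small = effective_lr(5e-5, step)  # initial lr = 5e-5, as in the script
ratio = lr_with_one / lr_with_small
```

The ratio is 1 / 5e-5, which matches the factor of 20000 mentioned in the issue and does not depend on the step.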
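Hello, the paper download link is not available.
index out of range: Tried to access index 513 out of table with 512 rows. I am still not very experienced and could not find a solution online. Is it because my own corpus was not processed properly? I hope someone can give me some pointers.
Does fine-tuning the model on the STC dataset also require 70 epochs?
Hello, according to the official instructions, the model-checkpoint for training and for text generation is the same path. However, after fine-tuning with CDial-GPT_LCCC-large, the model at this path does not seem to have been modified (judging by the timestamps). Which path should I use when generating text? And what is in the run folder?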
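For example:
这首歌我一直都在听,不知道为什么,我的心好像就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑
(Roughly: "I keep listening to this song. I don't know why, my heart is just like a big puddle, always rushing in," with the same clauses repeated over and over.)
Can "The Curious Case of Neural Text Degeneration" solve this problem?
I am already using top-p (nucleus) sampling, but the repetition shown above still occurs frequently.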
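For reference, a minimal sketch of the top-p (nucleus) filtering step on a toy distribution. It is written in pure Python with a hypothetical `top_p_filter` helper and is not the repository's sampling code.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, top_p=0.75)  # keeps only the first two tokens
```

Note that this filter only truncates the tail of the distribution at each step. It does not by itself penalize tokens that were already generated, so loops like the one above can still appear.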
How can I select the most appropriate response using a ranking strategy?
Any good ideas or advice?
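The paper says the model was trained on the dataset for 30 epochs, with a per-step batch_size of 8? I looked at the corpus and it has about 12 million dialogues. Assuming 0.1 s per step, the total time is 12 million * 30 / 8 * 0.1 s, which is about 52 days? Was it trained for 52 days on a single GPU??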
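The estimate above can be reproduced with a few lines. All inputs are the figures quoted in the question, and the 0.1 s per step is the asker's own assumption.

```python
def estimated_training_days(num_samples, epochs, batch_size, seconds_per_step):
    # Total optimizer steps times seconds per step, converted to days.
    steps = num_samples * epochs / batch_size
    return steps * seconds_per_step / 86400

days = estimated_training_days(12_000_000, 30, 8, 0.1)
```

This gives a little over 52 days, so the arithmetic in the question is consistent with its assumptions.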
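I first tried fine-tuning on the STC dataset. Since I only wanted to get it running, I trained for 1 epoch on a single 2080Ti. The initial time estimate was 17:50:30, but it actually took about 28 h in the end...
The command for training on the STC dataset was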
python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/STC/STC.json --scheduler linear --n_epochs 1
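Then I started training for 5 epochs on my own data (the json file used for training is only about 2M). I noticed that the estimated time (17:31:46) was about the same as for the STC dataset, even though my dataset should be much smaller... I also noticed that even if I change to 1 epoch, the total epoch count seems to stay the same, always 2195633, and the estimated time is always about 17 hours... I am not sure whether I have misunderstood something. What is going on?
The command for training 5 epochs on the custom dataset was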
python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/custom_train.json --scheduler linear --n_epochs 5
python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/custom_train.json --scheduler linear --n_epochs 1
python3 interact.py --model_checkpoint ./models/Novel_GPT/
INFO:transformers.modeling_utils:loading weights file ./models/Novel_GPT/pytorch_model.bin
INFO:transformers.modeling_utils:Weights of OpenAIGPTLMHeadModel not initialized from pretrained model: ['transformer.h.0.attn.bias', 'transformer.h.1.attn.bias', 'transformer.h.2.attn.bias', 'transformer.h.3.attn.bias', 'transformer.h.4.attn.bias', 'transformer.h.5.attn.bias', 'transformer.h.6.attn.bias', 'transformer.h.7.attn.bias', 'transformer.h.8.attn.bias', 'transformer.h.9.attn.bias', 'transformer.h.10.attn.bias', 'transformer.h.11.attn.bias']
HOW TO FIX IT
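Training was done on a machine with a GPU (ubuntu20.04). After fine-tuning, running interact.py and loading my own trained model works fine and I can chat...
For deployment I put the trained model and interact.py on another machine (windows 10)... On the deployment machine, loading your pretrained LCCD_GPT model works fine and I can chat, but loading my own trained model file gives the following error (the tensorflow version on both the training machine and the deployment machine is 1.14.0; I am not using tf2):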
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 187, in nti
n = int(s.strip() or "0", 8)
ValueError: invalid literal for int() with base 8: 's\n_rebui'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 2289, in next
tarinfo = self.tarinfo.fromtarfile(self)
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1095, in fromtarfile
obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1037, in frombuf
chksum = nti(buf[148:156])
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 189, in nti
raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 555, in _load
return legacy_load(f)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 466, in legacy_load
with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1591, in open
return func(name, filemode, fileobj, **kwargs)
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1621, in taropen
return cls(name, mode, fileobj, **kwargs)
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1484, in __init__
self.firstmember = self.next()
File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 2301, in next
raise ReadError(str(e))
tarfile.ReadError: invalid header
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\modeling_utils.py", line 626, in from_pretrained
state_dict = torch.load(resolved_archive_file, map_location="cpu")
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 386, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 559, in _load
raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: .\data\Oct07_17-06-42_xxx\pytorch_model.bin is a zip archive (did you mean to use torch.jit.load()?)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".\bot_gpt.py", line 148, in <module>
model = model_class.from_pretrained(args.model_checkpoint)
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\modeling_utils.py", line 629, in from_pretrained
"Unable to load weights from pytorch checkpoint file. "
OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
What is going on? Is there any difference between your pretrained model and the model we get after fine-tuning?
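One possible cause, offered as an assumption rather than a confirmed diagnosis: the "is a zip archive" message is what older PyTorch versions print when given a checkpoint in the zip-based format that `torch.save` uses by default from PyTorch 1.6 on. The stdlib-only sketch below checks which format a file is in. `checkpoint_format` is a hypothetical helper, and the two temporary files only stand in for real checkpoints.

```python
import os
import tempfile
import zipfile

def checkpoint_format(path):
    """Report whether a checkpoint file is a zip archive (hypothetical helper)."""
    return "zip-based" if zipfile.is_zipfile(path) else "legacy (non-zip)"

# Demo on two throwaway files standing in for real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "new_style.bin")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("data.pkl", b"placeholder")
    raw_path = os.path.join(tmp, "old_style.bin")
    with open(raw_path, "wb") as handle:
        handle.write(b"\x80\x02placeholder bytes")
    zip_result = checkpoint_format(zip_path)
    raw_result = checkpoint_format(raw_path)
```

If the fine-tuned `pytorch_model.bin` turns out to be zip-based while the released checkpoint is not, then the PyTorch versions on the two machines probably differ. In that case, upgrading PyTorch on the deployment machine, or re-saving on the training machine with `torch.save(..., _use_new_zipfile_serialization=False)`, would be the usual things to try.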
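Hello, I would like to ask why you first pretrain on Chinese novel data. Is this pretraining step done differently from the dialogue-style training?
GPT2-chitchat scores high but its generation quality is poor. I would like to know which evaluation metrics this judgment is based on. I would also like to know whether these metrics (PPL, dist, Greedy Matching, Embedding Average) were measured on data/STC_test.json. Could you release this part of the code (ppl, dist, Greedy Matching, Embedding Average)?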
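Of the metrics named above, dist (Distinct-n) is simple enough to sketch. The version below is the common corpus-level definition, unique n-grams divided by total n-grams. It is not the authors' evaluation code, and their tokenization and aggregation may differ.

```python
def distinct_n(responses, n):
    """Corpus-level Distinct-n: unique n-grams / total n-grams over all responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Two toy responses, already split into character tokens.
responses = [["我", "很", "好"], ["我", "很", "忙"]]
dist_1 = distinct_n(responses, 1)  # 4 unique unigrams out of 6
dist_2 = distinct_n(responses, 2)  # 3 unique bigrams out of 4
```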
Since I want to use this to train a language model for speech recognition, I would like to train from scratch. Which command should I run?
python3 train.py --data_path data/toy_data.json
gives the error
AttributeError: 'SummaryWriter' object has no attribute 'logdir'
Hello, this project is very nice and I am very interested in it. Following the parameter settings in the paper, I trained with the following command: CUDA_VISIBLE_DEVICES=0 python train.py --pretrained --model_checkpoint ./models/ --data_path data/STC.json --lr 6.25e-5 --train_batch_size 8 --n_epochs 10. However, after training, the ppl on val converges at around 29, which is not ideal. Are there other parameters that need to be set besides those above? Thanks!
<Error>
<Code>AccessDenied</Code>
<Message>You do not have read permission on this object.</Message>
<RequestId>5F7F9DC53ADDB93530C41430</RequestId>
<HostId>coai-dataset.oss-cn-beijing.aliyuncs.com</HostId>
</Error>
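During fine-tuning, the loss keeps decreasing, but the perplexity (ppl) and nll keep increasing. At epoch=1 the ppl is 13.765, while at epoch=67 it has already reached 26. Why does the perplexity get larger the longer the model trains, with results getting worse and worse?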
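As background for the numbers above: perplexity is conventionally the exponential of the average per-token negative log-likelihood, so ppl and nll always move together. A minimal sketch follows, with hypothetical helper names and assuming this conventional definition is the one being reported.

```python
import math

def perplexity(token_nlls):
    # ppl = exp(mean negative log-likelihood per token)
    return math.exp(sum(token_nlls) / len(token_nlls))

def nll_from_ppl(ppl):
    # Inverse relation: mean nll = ln(ppl)
    return math.log(ppl)

nll_epoch_1 = nll_from_ppl(13.765)   # mean nll implied by ppl 13.765
nll_epoch_67 = nll_from_ppl(26.0)    # mean nll implied by ppl 26
```

If the decreasing loss is measured on the training set while ppl and nll are measured on the validation set, this pattern would be consistent with overfitting. The issue does not say which split each number comes from.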
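When I use the pretrained lccc-large to predict on the lccc validation set, why is the f1 of the responses so poor (<0.10)?
The pretrained model has effectively already seen the answers, yet the predicted responses still differ a lot from the ground-truth responses...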
(lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
print("lm_loss:",lm_loss)
loss = lm_loss / int(args.gradient_accumulation_steps)
The printed result is lm_loss:loss
Thanks
I am using torch==1.7.0 and transformers==3.5.1 to refactor your code. In the update method, the line
(lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
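raises an error, but switching the environment to torch==1.4.0 and transformers==2.1.1 makes the problem go away, so it is presumably a version issue. I do not know how to fix it.
The full error output is as follows: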
Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
['[CLS] [speaker1] 王 雁 盟 [speaker2] 1 9 9 6 年 , 台 湾 计 算 机 程 序 设 计 师 王 雁 盟 到 欧 洲 旅 游 , 在 布 拉 格 街 头 他 为 街 头 艺 人 的 手 风 琴 演 奏 所 着 迷 。 于 是 在 第 二 年 , 他 拜 巴 黎 手 风 琴 演 奏 家 d o m i n i q u e b o d i n 为 师 , 学 习 手 风 琴 演 奏 技 术 。 1 9 9 8 年 回 台 湾 , 在 街 头 拉 着 他 的 手 风 琴 游 荡 。 之 后 , 他 开 始 为 电 影 、 剧 团 演 出 等 伴 奏 手 风 琴 。 到 2 0 0 3 年 , 他 为 几 米 的 《 地 下 铁 一 个 音 乐 的 旅 程 》 音 乐 剧 作 曲 与 演 出 。 《 漂 浮 的 手 风 琴 》 是 他 自 己 制 作 、 作 曲 并 演 奏 的 第 一 个 专 辑 。 [SEP]', '[CLS] [speaker1] 大 话 西 游 之 月 光 宝 盒 主 演 [speaker2] 罗 家 英 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
['[CLS] [speaker1] [speaker1] [speaker1] [speaker1] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] 
[speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2]', '[CLS] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
['[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 1 9 9 6 年 , 台 湾 计 算 机 程 序 设 计 师 王 雁 盟 到 欧 洲 旅 游 , 在 布 拉 格 街 头 他 为 街 头 艺 人 的 手 风 琴 演 奏 所 着 迷 。 于 是 在 第 二 年 , 他 拜 巴 黎 手 风 琴 演 奏 家 d o m i n i q u e b o d i n 为 师 , 学 习 手 风 琴 演 奏 技 术 。 1 9 9 8 年 回 台 湾 , 在 街 头 拉 着 他 的 手 风 琴 游 荡 。 之 后 , 他 开 始 为 电 影 、 剧 团 演 出 等 伴 奏 手 风 琴 。 到 2 0 0 3 年 , 他 为 几 米 的 《 地 下 铁 一 个 音 乐 的 旅 程 》 音 乐 剧 作 曲 与 演 出 。 《 漂 浮 的 手 风 琴 》 是 他 自 己 制 作 、 作 曲 并 演 奏 的 第 一 个 专 辑 。 [SEP]', '[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 罗 家 英 [SEP] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]']
Current run is terminating due to exception: Target -1 is out of bounds..
Engine run is terminating due to exception: Target -1 is out of bounds..
Traceback (most recent call last):
File "/t/main.py", line 40, in <module>
trainer.run(train_dataloader, max_epochs=2)
File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 691, in run
return self._internal_run()
File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 762, in _internal_run
self._handle_exception(e)
File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
raise e
File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 730, in _internal_run
time_taken = self._run_once_on_dataset()
File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 828, in _run_once_on_dataset
self._handle_exception(e)
File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
raise e
File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 811, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/kingsoft/gang/t/main.py", line 21, in update
(lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/lib/python3.7/site-packages/transformers/modeling_openai.py", line 595, in forward
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 962, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/lib/python3.7/site-packages/torch/nn/functional.py", line 2468, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/lib/python3.7/site-packages/torch/nn/functional.py", line 2264, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target -1 is out of bounds.
Process finished with exit code 1
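A likely cause, stated as an assumption: the LM head loss in older transformers releases was built with `ignore_index=-1`, while newer releases rely on PyTorch's default `ignore_index=-100`. Labels masked with -1 would then be treated as real class indices, which matches "Target -1 is out of bounds". A minimal sketch of remapping the masked positions on plain Python lists, with a hypothetical helper name:

```python
OLD_IGNORE_INDEX = -1    # masking value that appears in the error message
NEW_IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def remap_ignore_index(lm_labels, old=OLD_IGNORE_INDEX, new=NEW_IGNORE_INDEX):
    """Replace the old masking value with the new one in a batch of label lists."""
    return [[new if token == old else token for token in seq] for seq in lm_labels]

batch = [[-1, -1, 872, 1962, 102], [-1, 2769, 102, -1, -1]]
remapped = remap_ignore_index(batch)
```

On tensors the same idea is usually written as `lm_labels[lm_labels == -1] = -100` before the labels are passed to the model. Changing the value where the labels and padding are built would avoid the extra pass.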
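Hello, I would like to ask: since this is an LM and training uses the n-th token to predict the (n+1)-th token, why is no attention_mask introduced? Doesn't that mean the model has already seen the ground truth when predicting token n+1? In that case, isn't training inconsistent with the test-time setting? Would adding an attention_mask during training improve the test results?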
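For context, GPT-style decoders normally apply a causal (lower-triangular) mask inside every attention layer, independently of any `attention_mask` argument, so position n cannot attend to positions after n. A minimal sketch of such a mask, as a generic illustration rather than the library's internal code:

```python
def causal_mask(seq_len):
    """mask[i][j] == 1 means position i may attend to position j (only j <= i)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    print(row)
```

Under this mask the prediction for token n+1 depends only on tokens 1..n, which is the same information that is available at generation time.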
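Locally I chose a single-turn dialogue dataset in which some samples are longer than 512 tokens. During training, the following error occurs: RuntimeError: index out of range: Tried to access index 513 out of table with 512 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
I hope the code can truncate automatically during training.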
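Until truncation is built in, one workaround is to clip each instance before batching. The sketch below assumes the three parallel lists seen in the training code (`input_ids`, `token_type_ids`, `lm_labels`) and takes the limit of 512 from the error message. The helper name and the choice to keep the tail are assumptions.

```python
MAX_POSITIONS = 512  # from "table with 512 rows" in the error message

def truncate_instance(input_ids, token_type_ids, lm_labels, max_len=MAX_POSITIONS):
    """Keep only the last max_len positions of each parallel list,
    so that the end of the response is preserved."""
    return input_ids[-max_len:], token_type_ids[-max_len:], lm_labels[-max_len:]

# Toy instance that is longer than the position table.
ids = list(range(600))
types = [1] * 600
labels = [-1] * 500 + list(range(500, 600))
short_ids, short_types, short_labels = truncate_instance(ids, types, labels)
```

Keeping the tail drops the leading special token and the oldest context. Truncating the history before the instance is assembled would preserve the sequence layout better, at the cost of touching the input-building code.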
INFO:train.py:Build train and validation dataloaders
INFO:train.py:Load tokenized dataset from cache at dataset_cache_BertTokenizer
Traceback (most recent call last):
File "train.py", line 237, in <module>
train()
File "train.py", line 116, in train
train_loader, val_loader, train_sampler, valid_sampler = loader_class(args, tokenizer, logger)
File "/home/kingsoft/BDCI2020/CDial-GPT/od/inputters/inputter.py", line 47, in build_dataloaders
datasets, raw_samples = get_data(tokenizer, args.data_path, args.dataset_cache, logger)
File "/home/kingsoft/BDCI2020/CDial-GPT/od/inputters/inputter.py", line 21, in get_data
dataset = torch.load(dataset_cache)
File "/home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/serialization.py", line 527, in load
with _open_zipfile_reader(f) as opened_zipfile:
File "/home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/serialization.py", line 224, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f2ad21b4193 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7f2ad533c9eb in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7f2ad533dc04 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x6c6536 (0x7f2b1d66d536 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x295a74 (0x7f2b1d23ca74 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #37: __libc_start_main + 0xf0 (0x7f2b24906840 in /lib/x86_64-linux-gnu/libc.so.6)
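My guess is that this might be a tensorflow version problem?
README.md only includes results on the STC dataset.
Hello, I have a question. I have always been unsure about data processing for multi-turn dialogues: during training, does a multi-turn dialogue need to be split into multiple samples? The example data does not seem to be split into multiple samples, and from the code it looks like the last utterance is predicted with the preceding ones as context. I am not sure whether my understanding is correct.
How do I compute the loss on the STC_test.json validation set? In infer.py I added (loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids), but I do not know the value of lm_labels, whereas in train.py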
input_ids, token_type_ids, lm_labels = tuple(input_tensor.to(args.device) for input_tensor in batch)
In infer.py
instance, sequence = build_input_from_segments(history, reply, tokenizer, current_output, with_eos=False)
input_ids = torch.tensor(instance["input_ids"], dtype=torch.long, device=args.device).unsqueeze(0)
Could you release the ppl code used during testing?
Hello, the package imported in train.py is
from torch.nn.parallel import DataParallel
Is this import wrong? Should it be
from torch.nn.parallel import DistributedDataParallel
I see that the distributed-related code includes init_process_group, DistributedSampler for data loading, and so on.
May I get the evaluation code? The BLEU-2 scores are 67.2 for Transformer and 66.3 for GPTLCCC-large, respectively. This performance seems remarkably high. How was it obtained? Thanks.
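Thank you very much for your work. While carefully reading your code and README, I noticed that during training you replace all the labels of the history part with [UNK], so the loss is computed as a language model over the response only.
I previously tried training a GPT2 dialogue model as well (perhaps because of the data size, which was 500k, the results were not particularly good), but I used an approach similar to the GPT2-chitchat project, where the history and response are treated together as one sequence for training the language model.
My question is: is there a particular reason you did not choose the latter and instead compute the language-model loss on the response only? For example, is training the language model on the response only better than training on all of history + response together?
Looking forward to your reply!!!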
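The two labeling schemes being compared can be sketched as follows. The token ids are made up, and `IGNORE = -1` is an assumption about the masking value (newer PyTorch and transformers setups typically use -100).

```python
IGNORE = -1  # assumed masking value; positions with this label add no loss

def response_only_labels(history_ids, response_ids, ignore=IGNORE):
    # Loss is computed only on response tokens; history positions are masked.
    return [ignore] * len(history_ids) + list(response_ids)

def full_sequence_labels(history_ids, response_ids):
    # Loss is computed on every token, history and response alike.
    return list(history_ids) + list(response_ids)

history = [101, 11, 12, 13]
response = [21, 22, 102]
masked = response_only_labels(history, response)
full = full_sequence_labels(history, response)
```

Both label lists line up position by position with the same input sequence. The only difference is which positions contribute to the loss.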
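What is the key theoretical difference from https://github.com/yangjianxin1/GPT2-chitchat ?
The command was python train.py --pretrained --model_checkpoint CDial-GPT_LCCC-large --data_path data/STC.json --train_batch_size 8 --scheduler linear. The dataset is STC.json, yet 70 epochs finished in about 10 minutes. The machine has only one 8G GPU, and gradient accumulation is 64. Even after I moved the toy_train.txt file to another folder it was still this fast. The biggest problem is that the number of steps per epoch is 125, the same as when I previously trained with toy_train.txt. I do not know where the problem is.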
Epoch [1/70]: [1/125] 1%|▏ , loss=0.0422, lr=5e-5 [00:00<?]
lm_loss: tensor(3.0413, device='cuda:0', grad_fn=)
Epoch [1/70]: [2/125] 2%|▎ , loss=0.0424, lr=5e-5 [00:00<00:29]
lm_loss: tensor(2.2726, device='cuda:0', grad_fn=)
Epoch [1/70]: [3/125] 2%|▌ , loss=0.0422, lr=5e-5 [00:00<00:19]
lm_loss: tensor(2.3869, device='cuda:0', grad_fn=)
Epoch [1/70]: [4/125] 3%|▋ , loss=0.0421, lr=5e-5 [00:00<00:15]
lm_loss: tensor(2.9688, device='cuda:0', grad_fn=)
Epoch [1/70]: [5/125] 4%|▉ , loss=0.0422, lr=5e-5 [00:00<00:13]
lm_loss: tensor(2.5291, device='cuda:0', grad_fn=)
Epoch [1/70]: [6/125] 5%|█ , loss=0.0422, lr=5e-5 [00:00<00:12]
lm_loss: tensor(2.7023, device='cuda:0', grad_fn=)
Epoch [1/70]: [7/125] 6%|
INFO:ignite.engine.engine.Engine:Epoch[70] Complete. Time taken: 00:00:13
INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:14:38