cdial-gpt's People

Contributors

crownpku, kepei1106, lemon234071, silverriver, xiejiachen

cdial-gpt's Issues

About the pretrained models

Hello, first of all, thanks for open-sourcing the Chinese corpus!
May I ask whether you have considered training a GPT-2 model with more parameters?

Question on the comparison between GPT and GPT2

Hi, thanks for sharing the models! There's a detail I'm curious about: is there a reason why CDialGPT2LCCC performs worse than CDialGPTLCCC? I understand that GPT2 uses pre-LayerNorm and adds an additional LayerNorm after the final attention block compared with GPT, but I would not expect such a difference to cause CDialGPT2LCCC to perform so much worse.

Besides, in the paper you mention that both CDialGPTLCCC and CDialGPT2LCCC are first pretrained on your Chinese novel dataset. Does this imply that there is also a GPT2_novel model that you did not release (on which CDialGPT2LCCC is post-trained)?

Fine-tuning data format

Hello, sorry to bother you. What format should my fine-tuning training data be in?
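
For reference, a minimal sketch of what the fine-tuning JSON may look like, assuming the structure of the repo's toy_data.json (a dict of splits, each split a list of dialogues, each dialogue an ordered list of utterance strings); the exact schema should be checked against the README:

```python
# Hypothetical fine-tuning file, assuming the toy_data.json layout:
# splits -> dialogues -> utterances. Check against the repo's README.
import json

data = {
    "train": [
        ["你好", "你好，很高兴认识你", "我也是"],   # one multi-turn dialogue
        ["今天天气怎么样", "挺好的，出太阳了"],
    ],
    "valid": [
        ["吃饭了吗", "还没有，你呢"],
    ],
}

with open("my_finetune_data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```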

Discussion about the self-attention mechanism

Actually, I am new to NLP topics, and recently I have been studying the self-attention scheme described at http://jalammar.github.io/illustrated-gpt2/#part-3-beyond-language-modeling , along with the related papers http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf and https://arxiv.org/abs/2006.16362 .

The core idea of attention is to give more related Key/Query pairs higher probability scores via softmax(QK^T), where a larger dot product implies higher correlation, and two fully connected projections W_k and W_q are learned for this.

What confuses me is: is it true that higher correlation = a larger dot product of Q and K?

In my opinion, a clearer approach would be to use a distance directly to express how related a Q/K pair is, such as:

  1. softmax(-abs(Q-K)), giving key/query pairs with a smaller distance a higher probability;
  2. softmax(-min(Q/K, K/Q)), similar to the above, but using division instead of subtraction.

Might these ideas work? I mean, could they lead to some useful research?
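
To make the comparison concrete, here is a toy sketch (illustrative tensors only, not code from this repo) contrasting standard scaled dot-product attention scores with the L1-distance variant from point 1:

```python
import torch
import torch.nn.functional as F

d = 4
q = torch.randn(1, 3, d)   # (batch, n_queries, dim)
k = torch.randn(1, 5, d)   # (batch, n_keys, dim)

# Standard attention: softmax(QK^T / sqrt(d)); larger dot product -> more weight.
dot_weights = F.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)

# Proposed alternative: softmax(-|Q - K|), using the L1 distance between each
# query/key pair so that *closer* vectors get more weight.
l1_dist = (q.unsqueeze(2) - k.unsqueeze(1)).abs().sum(-1)   # (1, 3, 5)
dist_weights = F.softmax(-l1_dist, dim=-1)

print(dot_weights.shape, dist_weights.shape)   # both (1, 3, 5)
```

One observable difference: the dot product grows with vector norm as well as alignment, while a distance score is maximal only when the vectors coincide, so the two schemes rank key/query pairs differently.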

Learning rate for the Noam scheduler

Hi,

I found a possible problem in the training script:

When using the Noam scheduler, the initial lr should be set to 1, because the return value of noam_lambda is multiplied by the initial lr to produce the lr for the current step; setting it to 5e-5 seems to make the lr 20000 times smaller than intended.

Also, for the step in noam and the other parameters with "step" in their names, shouldn't a step be counted per optimizer.step() call rather than per engine iteration?

Apologies if I have misunderstood anything.
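
For context, a minimal sketch of the interaction the reporter describes: torch's LambdaLR multiplies the optimizer's initial lr by whatever the lambda returns, so a Noam-style lambda that already computes an absolute learning rate needs the initial lr set to 1 (names here are illustrative, not the repo's exact code):

```python
import torch

model = torch.nn.Linear(10, 10)
d_model, warmup = 512, 4000

def noam_lambda(step):
    # Returns an absolute learning rate following the Noam schedule.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# LambdaLR sets lr = initial_lr * noam_lambda(step) each step, so with
# lr=5e-5 here every value would be scaled down by 5e-5; lr=1.0 makes the
# lambda's output the effective learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda)
```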

Some questions about loading the fine-tuned model

Hi, according to the official instructions, the model checkpoint used for training and for generating text is the same path. However, after fine-tuning with CDial-GPT_LCCC-large, the model files in that directory do not appear to have been modified (judging by their timestamps). So which path should I load when generating text? And what is in the run folder?

Open-domain text generated by GPT-2 often repeats itself; is there a good way to deal with this?

For example:
这首歌我一直都在听,不知道为什么,我的心好像就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑,一直往里冲,不知道为什么,我的心就是一个大水坑

Can "THE CURIOUS CASE OF NEURAL TEXT DEGENERATION" solve this problem?
I have actually already used top-p (nucleus) sampling, but the repetition above still occurs frequently.
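
One common mitigation beyond nucleus sampling (not something this repo ships, just a sketch) is to ban n-grams that already occur in the generated prefix, as in no-repeat-n-gram decoding:

```python
import torch

def block_repeated_ngrams(logits, generated_ids, n=3):
    """Set logits of tokens that would complete an already-seen n-gram to -inf."""
    if len(generated_ids) < n - 1:
        return logits
    prefix = tuple(generated_ids[-(n - 1):])
    for i in range(len(generated_ids) - n + 1):
        if tuple(generated_ids[i:i + n - 1]) == prefix:
            logits[generated_ids[i + n - 1]] = float("-inf")
    return logits

logits = torch.zeros(10)
logits = block_repeated_ngrams(logits, [1, 2, 3, 1, 2])
print(logits[3])   # -inf: emitting 3 would repeat the trigram (1, 2, 3)
```

A repetition penalty that divides the logits of already-generated tokens (as in CTRL) is another frequently used option.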

Question about label generation in WBDataset

Hi, while reading the code I have a question about the process method of the WBDataset class:

(screenshot: the label-generation code in WBDataset.process)

When generating labels, in theory each preceding token should be used to predict the next token. But with the processing in the code above, the label for every token of the last turn is the token itself, like this:

(screenshot: the resulting labels)

This differs from my understanding of the training process. Is my understanding wrong? Could you clarify? Thanks!
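
For what it's worth, the Hugging Face LM heads shift logits and labels internally when labels are passed, so labels aligned one-to-one with input_ids are the expected input; the apparent "label equals itself" is resolved by this shift. A sketch of what happens inside the forward pass:

```python
import torch
import torch.nn.functional as F

vocab = 100
logits = torch.randn(1, 6, vocab)          # (batch, seq_len, vocab)
labels = torch.randint(0, vocab, (1, 6))   # aligned with input_ids

# Inside modeling_openai.py / modeling_gpt2.py: position t predicts label t+1.
shift_logits = logits[..., :-1, :]
shift_labels = labels[..., 1:]
loss = F.cross_entropy(shift_logits.reshape(-1, vocab), shift_labels.reshape(-1))
```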

About the training setup

The paper says the model was trained on the dataset for 30 epochs with batch_size=8 per step? The corpus has about 12 million dialogues, so at 0.1 s per step the total time is 12,000,000 * 30 / 8 * 0.1 s ≈ 52 days? 52 days of training on a single GPU??

Questions about fine-tuning on a custom dataset

I first tried fine-tuning on the STC dataset. Since I only wanted to verify that the pipeline runs, I trained for 1 epoch on a single 2080 Ti; the initial time estimate was 17:50:30, but the run actually took about 28 hours...

The command for training on STC was

python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/STC/STC.json --scheduler linear --n_epochs 1

(screenshot: STC run)

Then I started training on my own data (the training JSON file is only about 2 MB) for 5 epochs, and I noticed the estimated time (17:31:46) was about the same as for the STC dataset. My dataset should be much smaller... Moreover, even when I change it to 1 epoch, the total step count shown never changes from 2195633, and the estimate stays around 17 hours. Am I misunderstanding something? What is going on? (See the sketch after the commands below for one possible cause.)

The command for 5 epochs on the custom dataset was

python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/custom_train.json --scheduler linear --n_epochs 5

(screenshot: 5-epoch run)

The command for 1 epoch on the custom dataset was

python train.py --pretrained --model_checkpoint ./models/LCCD_GPT/ --data_path data/custom_train.json --scheduler linear --n_epochs 1

(screenshot: 1-epoch run)
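
One plausible cause (an assumption, not confirmed by the maintainers): train.py appears to cache the tokenized dataset (logs elsewhere on this page mention loading from dataset_cache_BertTokenizer), so changing --data_path without clearing that cache would keep training on the previously tokenized data. A quick check:

```python
# Hypothetical fix: delete the tokenized-dataset cache so the new --data_path
# is re-tokenized; "dataset_cache_BertTokenizer" is the cache name seen in the
# logs on this page and may differ in your setup.
import os

cache_path = "dataset_cache_BertTokenizer"
if os.path.exists(cache_path):
    os.remove(cache_path)
```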

OpenAIGPTLMHeadModel not initialized

python3 interact.py --model_checkpoint ./models/Novel_GPT/

INFO:transformers.modeling_utils:loading weights file ./models/Novel_GPT/pytorch_model.bin
INFO:transformers.modeling_utils:Weights of OpenAIGPTLMHeadModel not initialized from pretrained model: ['transformer.h.0.attn.bias', 'transformer.h.1.attn.bias', 'transformer.h.2.attn.bias', 'transformer.h.3.attn.bias', 'transformer.h.4.attn.bias', 'transformer.h.5.attn.bias', 'transformer.h.6.attn.bias', 'transformer.h.7.attn.bias', 'transformer.h.8.attn.bias', 'transformer.h.9.attn.bias', 'transformer.h.10.attn.bias', 'transformer.h.11.attn.bias']

How can I fix this?
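
As far as I can tell, transformer.h.*.attn.bias in Hugging Face's GPT implementation is the registered causal-mask buffer, not a learned weight: it is recreated at initialization, so this warning is normally harmless. A quick sanity check (sketch):

```python
from transformers import OpenAIGPTLMHeadModel

model = OpenAIGPTLMHeadModel.from_pretrained("./models/Novel_GPT/")
mask = model.transformer.h[0].attn.bias
print(mask.shape)   # lower-triangular causal mask, rebuilt at init, not loaded
```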

Training and interacting on different machines

Training was done on a machine with a GPU (Ubuntu 20.04); after fine-tuning, running interact.py there and loading my own trained model worked fine for chatting...
For deployment I put the trained model and interact.py on another machine (Windows 10)... On the deployment machine, loading your pretrained LCCD_GPT model works and I can chat, but loading my own trained model file raises the following error (both the training and deployment machines have tensorflow 1.14.0, not tf2):

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 187, in nti
    n = int(s.strip() or "0", 8)
ValueError: invalid literal for int() with base 8: 's\n_rebui'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 2289, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1095, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1037, in frombuf
    chksum = nti(buf[148:156])
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 189, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 555, in _load
    return legacy_load(f)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 466, in legacy_load
    with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1591, in open
    return func(name, filemode, fileobj, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1621, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 1484, in __init__
    self.firstmember = self.next()
  File "C:\ProgramData\Anaconda3\lib\tarfile.py", line 2301, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\modeling_utils.py", line 626, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location="cpu")
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 386, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\serialization.py", line 559, in _load
    raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: .\data\Oct07_17-06-42_xxx\pytorch_model.bin is a zip archive (did you mean to use torch.jit.load()?)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".\bot_gpt.py", line 148, in <module>
    model = model_class.from_pretrained(args.model_checkpoint)
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\modeling_utils.py", line 629, in from_pretrained
    "Unable to load weights from pytorch checkpoint file. "
OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

What is going on? Is there some difference between your pretrained models and the models we fine-tune?
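
The tracebacks go through torch/serialization.py, so the TensorFlow version is not involved: PyTorch 1.6 switched torch.save to a zip-based format that torch < 1.6 cannot read, which matches the "is a zip archive" error. Assuming the training machine runs torch >= 1.6 and the Windows machine an older torch, either upgrade torch on the deployment machine or re-save the checkpoint in the legacy format:

```python
# Re-save the fine-tuned checkpoint on the training machine in the pre-1.6
# format so an older torch can load it (sketch; paths are illustrative).
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
torch.save(state_dict, "pytorch_model_legacy.bin",
           _use_new_zipfile_serialization=False)
```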

Pretraining

Hi, I'd like to ask: why pretrain on Chinese novel data first? Is this pretraining step done differently from the dialogue training?

How to train the model from scratch

Since I want to train a language model for speech recognition, I'd like to train from scratch. Which command should I run?
python3 train.py --data_path data/toy_data.json
raises the error
AttributeError: 'SummaryWriter' object has no attribute 'logdir'
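
This looks like a tensorboardX version mismatch: the SummaryWriter directory attribute was renamed between releases, so the installed version probably exposes log_dir where the script expects logdir (or vice versa). Pinning tensorboardX to the version the repo expects should fix it; a version-tolerant workaround, as a sketch:

```python
from tensorboardX import SummaryWriter

writer = SummaryWriter()
# Try both attribute names, since the spelling differs across tensorboardX versions.
log_dir = getattr(writer, "logdir", None) or getattr(writer, "log_dir", None)
print(log_dir)
```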

Fine-tuning experiments on STC

Hi, this is a nice project and I'm very interested in it. Following the parameter settings in the paper, I trained with the command: CUDA_VISIBLE_DEVICES=0 python train.py --pretrained --model_checkpoint ./models/ --data_path data/STC.json --lr 6.25e-5 --train_batch_size 8 --n_epochs 10. After training, however, the ppl on the validation set converges to around 29, which is not very good. Besides the parameters above, are there any other settings I need? Thanks!

Can't download GPT_LCCC-large.zip

<Error>
<Code>AccessDenied</Code>
<Message>You do not have read permission on this object.</Message>
<RequestId>5F7F9DC53ADDB93530C41430</RequestId>
<HostId>coai-dataset.oss-cn-beijing.aliyuncs.com</HostId>
</Error>

Perplexity (ppl) keeps increasing

During fine-tuning the loss keeps decreasing, but the ppl and nll keep increasing: at epoch 1 the ppl is 13.765, while by epoch 67 it has reached 26. Why does the model's perplexity get worse, and the results get poorer, the longer it trains?

Large gap when predicting on the LCCC dataset

Using the pretrained lccc-large to predict on the LCCC validation set, why is the F1 score of the replies so poor (<0.10)?
The pretrained model has effectively already seen the answers, yet the predicted replies still differ greatly from the ground-truth replies...

Reworking your code raises IndexError: Target -1 is out of bounds

I reworked your code with torch==1.7.0 and transformers==3.5.1; in the update method, the line

(lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)

raises an error, but switching the environment to torch==1.4.0 and transformers==2.1.1 makes it go away. It is presumably a version issue, but I don't know how to fix it.

The full error output is as follows:

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
['[CLS] [speaker1] 王 雁 盟 [speaker2] 1 9 9 6 年 , 台 湾 计 算 机 程 序 设 计 师 王 雁 盟 到 欧 洲 旅 游 , 在 布 拉 格 街 头 他 为 街 头 艺 人 的 手 风 琴 演 奏 所 着 迷 。 于 是 在 第 二 年 , 他 拜 巴 黎 手 风 琴 演 奏 家 d o m i n i q u e b o d i n 为 师 , 学 习 手 风 琴 演 奏 技 术 。 1 9 9 8 年 回 台 湾 , 在 街 头 拉 着 他 的 手 风 琴 游 荡 。 之 后 , 他 开 始 为 电 影 、 剧 团 演 出 等 伴 奏 手 风 琴 。 到 2 0 0 3 年 , 他 为 几 米 的 《 地 下 铁 一 个 音 乐 的 旅 程 》 音 乐 剧 作 曲 与 演 出 。 《 漂 浮 的 手 风 琴 》 是 他 自 己 制 作 、 作 曲 并 演 奏 的 第 一 个 专 辑 。 [SEP]', '[CLS] [speaker1] 大 话 西 游 之 月 光 宝 盒 主 演 [speaker2] 罗 家 英 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
['[CLS] [speaker1] [speaker1] [speaker1] [speaker1] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2]', '[CLS] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
['[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 1 9 9 6 年 , 台 湾 计 算 机 程 序 设 计 师 王 雁 盟 到 欧 洲 旅 游 , 在 布 拉 格 街 头 他 为 街 头 艺 人 的 手 风 琴 演 奏 所 着 迷 。 于 是 在 第 二 年 , 他 拜 巴 黎 手 风 琴 演 奏 家 d o m i n i q u e b o d i n 为 师 , 学 习 手 风 琴 演 奏 技 术 。 1 9 9 8 年 回 台 湾 , 在 街 头 拉 着 他 的 手 风 琴 游 荡 。 之 后 , 他 开 始 为 电 影 、 剧 团 演 出 等 伴 奏 手 风 琴 。 到 2 0 0 3 年 , 他 为 几 米 的 《 地 下 铁 一 个 音 乐 的 旅 程 》 音 乐 剧 作 曲 与 演 出 。 《 漂 浮 的 手 风 琴 》 是 他 自 己 制 作 、 作 曲 并 演 奏 的 第 一 个 专 辑 。 [SEP]', '[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 罗 家 英 [SEP] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]']
Current run is terminating due to exception: Target -1 is out of bounds..
Engine run is terminating due to exception: Target -1 is out of bounds..
Traceback (most recent call last):
  File "/t/main.py", line 40, in <module>
    trainer.run(train_dataloader, max_epochs=2)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 691, in run
    return self._internal_run()
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 762, in _internal_run
    self._handle_exception(e)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 730, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 828, in _run_once_on_dataset
    self._handle_exception(e)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 811, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/kingsoft/gang/t/main.py", line 21, in update
    (lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/transformers/modeling_openai.py", line 595, in forward
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 962, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/lib/python3.7/site-packages/torch/nn/functional.py", line 2468, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/lib/python3.7/site-packages/torch/nn/functional.py", line 2264, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target -1 is out of bounds.

Process finished with exit code 1
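
The version difference is almost certainly the ignored-label value: newer transformers use torch.nn.CrossEntropyLoss with its default ignore_index of -100, while older versions (and this repo's labels) pad with -1, which is now an out-of-bounds target. Remapping the padding value in the update step is a minimal fix (sketch, adapting the line from the traceback):

```python
# Map the old ignore value (-1) to the one CrossEntropyLoss now expects (-100)
# before the forward pass; model/input_ids/token_type_ids as in the update
# method above.
lm_labels = lm_labels.masked_fill(lm_labels == -1, -100)
(lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
```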

Why doesn't the model input include an attention_mask?

Hi, I'd like to ask: since this is an LM, trained to predict token n+1 from the first n tokens, why is no attention_mask introduced? Doesn't that mean the model has already seen the ground truth when predicting token n+1? Wouldn't that be inconsistent with the test-time setting? And would adding an attention_mask during training make test-time performance better?
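
For context (a general property of GPT-style decoders, not repo-specific): the model applies a causal lower-triangular mask inside self-attention, so position n can never attend to tokens after n even when the whole sequence is fed in during training; the attention_mask argument in transformers masks padding, not future tokens. A toy illustration:

```python
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)   # row i has ones only in columns 0..i: no peeking ahead
```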

Inputs over 512 tokens are not automatically truncated

I picked a local single-turn dialogue dataset in which some samples are longer than 512 tokens. During training this raises RuntimeError: index out of range: Tried to access index 513 out of table with 512 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
It would be good if the code truncated the inputs automatically during training.
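
Until that lands, a minimal pre-truncation pass over tokenized samples keeps them inside the model's 512-position table (a sketch; it keeps the most recent tokens, since the reply depends on the latest context):

```python
MAX_POSITIONS = 512  # from the error message above

def truncate_sample(token_ids, max_len=MAX_POSITIONS):
    """Drop the oldest tokens so the sample fits the position-embedding table."""
    return token_ids[-max_len:] if len(token_ids) > max_len else token_ids

print(len(truncate_sample(list(range(600)))))   # 512
```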

Following the README steps raises "Your PyTorch installation may be too old"

INFO:train.py:Build train and validation dataloaders
INFO:train.py:Load tokenized dataset from cache at dataset_cache_BertTokenizer
Traceback (most recent call last):
  File "train.py", line 237, in <module>
    train()
  File "train.py", line 116, in train
    train_loader, val_loader, train_sampler, valid_sampler = loader_class(args, tokenizer, logger)
  File "/home/kingsoft/BDCI2020/CDial-GPT/od/inputters/inputter.py", line 47, in build_dataloaders
    datasets, raw_samples = get_data(tokenizer, args.data_path, args.dataset_cache, logger)
  File "/home/kingsoft/BDCI2020/CDial-GPT/od/inputters/inputter.py", line 21, in get_data
    dataset = torch.load(dataset_cache)
  File "/home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/serialization.py", line 527, in load
    with _open_zipfile_reader(f) as opened_zipfile:
  File "/home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/serialization.py", line 224, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f2ad21b4193 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7f2ad533c9eb in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7f2ad533dc04 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x6c6536 (0x7f2b1d66d536 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x295a74 (0x7f2b1d23ca74 in /home/kingsoft/anaconda3/envs/gang_gpt/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #37: __libc_start_main + 0xf0 (0x7f2b24906840 in /lib/x86_64-linux-gnu/libc.so.6)

My guess is that this might be a tensorflow version problem?

About data organization

Hi, a question I've long had about processing multi-turn dialogue data: does a multi-turn dialogue need to be split into multiple training samples? The provided example data doesn't seem to be split into multiple samples, and from the code it looks like the last utterance is the prediction target while the preceding utterances serve as context. Is my understanding correct?

Code for validation-set perplexity

How is the loss computed on the STC_test.json validation set? In infer.py I added
(loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
but I don't know what lm_labels should be. In train.py there is
input_ids, token_type_ids, lm_labels = tuple(input_tensor.to(args.device) for input_tensor in batch)
while in infer.py there is
instance, sequence = build_input_from_segments(history, reply, tokenizer, current_output, with_eos=False)
input_ids = torch.tensor(instance["input_ids"], dtype=torch.long, device=args.device).unsqueeze(0)

Could you publish the ppl code used during testing?
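
In the meantime, a sketch of validation perplexity: build batches with the same masked lm_labels that train.py produces, average the LM loss, and exponentiate (names follow the snippets above; averaging per batch rather than per token is an approximation):

```python
import math
import torch

def evaluate_ppl(model, val_loader, device):
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids, token_type_ids, lm_labels = (t.to(device) for t in batch)
            (lm_loss), *_ = model(input_ids, labels=lm_labels,
                                  token_type_ids=token_type_ids)
            total_loss += lm_loss.item()
            n_batches += 1
    return math.exp(total_loss / n_batches)
```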

About the distributed training code

Hi, the import in train.py is
from torch.nn.parallel import DataParallel
Is this the wrong import? Should it be
from torch.nn.parallel import DistributedDataParallel

since the distributed parts of the code include init_process_group, the DistributedSampler for data loading, and so on?
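
For reference, a minimal sketch of how DistributedDataParallel is normally wired up with init_process_group (one process per GPU, launched via python -m torch.distributed.launch; illustrative, not this repo's code):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")   # reads rank/world size from the env
local_rank = dist.get_rank()              # equals the local rank on one node
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])
```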

About the evaluation

May I get the evaluation code? The BLEU-2 scores are 67.2 for the Transformer and 66.3 for GPTLCCC-large, respectively. The performance seems amazing; how was it attained? Thanks.

A question about a training detail

Thank you very much for your work. While reading your code and README carefully, I noticed that during training you replace all labels of the history part with [UNK], so the loss is a language-model loss computed only over the response.
I have also tried training a GPT2 dialogue model before (the results were not particularly good, possibly because of the data size: 500k samples), but like the GPT2-chitchat project, I treated history and response together as one sequence and trained the language model on the whole thing.

My question is: is there a particular reason you did not choose the latter and only compute the LM loss on the response? For example, is training the LM only on the response better than training on history + response together?

Looking forward to your reply!!!
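
For readers, a small sketch of the two labeling schemes being compared (the IGNORE placeholder stands for whichever masked-label value the code actually uses):

```python
IGNORE = -1  # placeholder for the masked-label value

history_ids = [10, 11, 12]    # tokenized dialogue history
response_ids = [20, 21, 22]   # tokenized response

input_ids = history_ids + response_ids
# This repo's scheme: loss only on the response, history positions ignored.
labels_response_only = [IGNORE] * len(history_ids) + response_ids
# GPT2-chitchat-style: LM loss over every token, history included.
labels_full_lm = list(input_ids)
```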

Questions about reproducing the STC fine-tuning results

How are the automatic metrics BLEU-2 and BLEU-4 computed for the STC fine-tuning results?

I could not find any BLEU-related code in the project, and after trying fine-tuning myself, I found that my results deviate from yours.
The results below are the generations at the 5th epoch, evaluated with nltk's corpus_bleu tool.
(screenshot: BLEU scores)
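
For comparison, here is roughly how BLEU-2/BLEU-4 can be computed with nltk's corpus_bleu; note that the tokenization granularity (character-level below, an assumption) and the smoothing method both change the score considerably, which alone can explain large deviations:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[list("今天天气很好")]]   # per sample: a list of reference token lists
hypotheses = [list("今天天气不错")]     # per sample: one hypothesis token list

smooth = SmoothingFunction().method1
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5),
                    smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(bleu2, bleu4)
```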

Problems fine-tuning on the STC dataset: on a machine with a single 8 GB GPU, fine-tuning the large STC dataset for 70 epochs took only 10 minutes

The command was python train.py --pretrained --model_checkpoint CDial-GPT_LCCC-large --data_path data/STC.json --train_batch_size 8 --scheduler linear, with STC.json as the dataset. 70 epochs finished in 10 minutes, on a machine with a single 8 GB GPU and gradient accumulation of 64. Even after moving toy_train.txt to another folder it is still this fast, and the biggest problem is that every epoch has 125 steps, exactly the same as when training on toy_train.txt before. Where could the problem be?
Epoch [1/70]: [1/125] 1%|▏ , loss=0.0422, lr=5e-5 [00:00<?]
lm_loss: tensor(3.0413, device='cuda:0', grad_fn=)
Epoch [1/70]: [2/125] 2%|▎ , loss=0.0424, lr=5e-5 [00:00<00:29]
lm_loss: tensor(2.2726, device='cuda:0', grad_fn=)
Epoch [1/70]: [3/125] 2%|▌ , loss=0.0422, lr=5e-5 [00:00<00:19]
lm_loss: tensor(2.3869, device='cuda:0', grad_fn=)
Epoch [1/70]: [4/125] 3%|▋ , loss=0.0421, lr=5e-5 [00:00<00:15]
lm_loss: tensor(2.9688, device='cuda:0', grad_fn=)
Epoch [1/70]: [5/125] 4%|▉ , loss=0.0422, lr=5e-5 [00:00<00:13]
lm_loss: tensor(2.5291, device='cuda:0', grad_fn=)
Epoch [1/70]: [6/125] 5%|█ , loss=0.0422, lr=5e-5 [00:00<00:12]
lm_loss: tensor(2.7023, device='cuda:0', grad_fn=)
Epoch [1/70]: [7/125] 6%|

INFO:ignite.engine.engine.Engine:Epoch[70] Complete. Time taken: 00:00:13
INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:14:38
