gaopeng97 / transformer-xl-chinese

Transformer-XL for Chinese text generation (it can write novels and classical poetry)

License: Apache License 2.0

Shell 12.88% Python 87.12%

transformer-xl-chinese's Issues

What is the purpose of mems_id in inference?

In your inference code there are variables such as new_mems_id and mems_id, but they do not seem to be used anywhere. Am I wrong? Is there a purpose for these variables? Thank you.

inference restore model failed

I changed the model parameters to:
#Model
DIV_VAL=1
N_LAYER=6
D_MODEL=500
D_EMBED=500
N_HEAD=10
D_HEAD=50
D_INNER=1000

#Training
TGT_LEN=70
MEM_LEN=70

BSZ=64
NUM_CORE=4

#Testing

TEST_TGT_LEN=70
TEST_MEM_LEN=500
TEST_CLAMP_LEN=400

#TEST_BSZ=10
TEST_BSZ=1

TEST_NUM_CORE=1

Running doupo_base_gpu.sh, training and eval both work, but inference fails with:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [4786] rhs shape= [84827]
The model fails to load. Is something wrong with my parameter settings?
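When reading an error like the one above, it can help to pull the two shapes out of the message. In a Saver restore, `lhs` is normally the variable in the graph being built now and `rhs` the tensor read from the checkpoint, so here the inference graph would have 4786 entries and the checkpoint 84827. A minimal sketch; `parse_assign_mismatch` is a hypothetical helper, not part of the repo:

```python
import re

# Matches messages such as:
#   "Assign requires shapes of both tensors to match. lhs shape= [4786] rhs shape= [84827]"
_SHAPE_RE = re.compile(r"lhs shape= \[([\d,\s]*)\] rhs shape= \[([\d,\s]*)\]")


def parse_assign_mismatch(message):
    """Return (graph_shape, checkpoint_shape) parsed from the error text."""
    match = _SHAPE_RE.search(message)
    if match is None:
        raise ValueError("no 'lhs shape= [...] rhs shape= [...]' pair found")

    def to_shape(text):
        # "5472,410" -> (5472, 410); an empty string gives an empty tuple.
        return tuple(int(part) for part in text.split(",") if part.strip())

    return to_shape(match.group(1)), to_shape(match.group(2))
```

If the first axis differs and the sizes look like vocabulary sizes, the number of tokens used at training time and at inference time is a likely place to look.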

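Repetition

Hello, I trained on articles with your code. Why does the same sentence keep appearing over and over in the output, and how can I deal with this?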
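One common mitigation for repeated output, if decoding always picks the most likely character (an assumption; check how train_gpu.py selects the next token), is to sample from the top few candidates instead. The sketch below is generic top-k sampling with temperature in plain Python; `sample_top_k` is a hypothetical helper and not the author's method:

```python
import math
import random


def sample_top_k(logits, k=5, temperature=1.0, rng=random):
    """Sample a token id from the k highest logits after temperature scaling."""
    if k <= 0 or temperature <= 0:
        raise ValueError("k and temperature must be positive")
    # Keep the ids of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the kept logits.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy decoding; larger `k` or a higher temperature adds variety at the cost of coherence.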

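How to use Transformer-XL for text summarization

Hello, I would like to ask: if I want to use the Transformer-XL model for a text summarization task, do you have any good suggestions?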

Can a 1080Ti train on the shi dataset?

I tried running it myself and hit OOM.

sess.run in inference

for step in progress(range(output_len)):
    time.sleep(0.01)
    feed_dict = {}
    for i in range(FLAGS.num_core_per_host):
        # Feed the cached memory values into the memory placeholders.
        for m, m_np in zip(tower_mems[i], tower_mems_np[i]):
            feed_dict[m] = m_np

        # Feed the cached mems_id values into their placeholders.
        for id, id_np in zip(tower_mems_id[i], tower_mems_id_np[i]):
            feed_dict[id] = id_np

    sess.run(iterator.initializer, feed_dict={test_list: [encoded_input]})
    fetched = sess.run(fetches, feed_dict=feed_dict)

What is the purpose of the loops in this snippet? Why are two for loops used to build feed_dict, and why are there two sess.run calls below them?
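Reading the snippet on its own: the outer loop goes over GPU towers, and each inner loop pairs a list of placeholders with the values cached from the previous step (one loop for the memories, one for mems_id). The first sess.run appears to re-initialize the dataset iterator with the current input, and the second runs the model with the cached values fed in. The pairing pattern can be sketched without TensorFlow; the strings below are stand-ins for placeholders, and `build_memory_feed` is a hypothetical name:

```python
def build_memory_feed(tower_mems, tower_mems_np):
    """Pair each memory placeholder with the value cached from the previous step."""
    feed_dict = {}
    # One entry per GPU tower.
    for placeholders, values in zip(tower_mems, tower_mems_np):
        # One entry per placeholder within the tower (per layer, as in upstream Transformer-XL).
        for placeholder, value in zip(placeholders, values):
            feed_dict[placeholder] = value
    return feed_dict


# Two towers with two layers each; strings stand in for tf placeholders.
feed = build_memory_feed(
    [["t0_l0", "t0_l1"], ["t1_l0", "t1_l1"]],
    [[0.0, 0.1], [1.0, 1.1]],
)
```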

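Is there a way to reduce the execution time of prediction?

After training the model, I used the inference function and found that your code predicts one character at a time and iterates 200 times, with each step taking about 0.5 s (on a GPU, a P40 card). Some material I read says that XL is optimized for speed and can take [1,2,3,4] as input and produce [5,6,7,8]. If I want to predict several following Chinese characters per iteration, what should I modify? Would that help with speed?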

What hardware did you use, and how long did training take?

AttributeError: module 'tensorflow.sparse' has no attribute 'to_dense'

File "/tf/data_utils.py", line 504, in parser
val = tf.sparse.to_dense(val)
AttributeError: module 'tensorflow.sparse' has no attribute 'to_dense'
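This error suggests the installed TensorFlow predates `tf.sparse.to_dense`. Older TensorFlow 1.x releases expose the same operation as `tf.sparse_tensor_to_dense`, so one option besides upgrading is to pick whichever name exists. A minimal sketch; `resolve_first` is a hypothetical helper, and the fake module below only stands in for an old TensorFlow so the example runs without it installed:

```python
from types import SimpleNamespace


def resolve_first(namespace, dotted_names):
    """Return the first attribute path in dotted_names that exists on namespace."""
    for dotted in dotted_names:
        target = namespace
        try:
            for part in dotted.split("."):
                target = getattr(target, part)
        except AttributeError:
            continue
        return target
    raise AttributeError("none of {} found".format(dotted_names))


# Stand-in for an older TensorFlow module (hypothetical object, illustration only):
# it has tf.sparse but no tf.sparse.to_dense, and still offers tf.sparse_tensor_to_dense.
fake_tf = SimpleNamespace(sparse=SimpleNamespace(),
                          sparse_tensor_to_dense=lambda sp: ("dense", sp))
to_dense = resolve_first(fake_tf, ["sparse.to_dense", "sparse_tensor_to_dense"])
```

In data_utils.py the same call would take the real `tf` module in place of `fake_tf`, followed by `val = to_dense(val)`.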

Zhihu data

Hello, I would like to use the Zhihu data but could not find it. Is the Zhihu data still available?

ModuleNotFoundError: No module named 'progressbar'

After preprocessing the data, I hit this import error during training. It seems this repo does not contain a progressbar.py file. Could you provide it?
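`progressbar` is likely a third-party package rather than a file in the repo (an assumption; the repo does not say which distribution it expects). If installing it is not an option, a fallback that keeps a `progress(...)` wrapper callable could look like this sketch:

```python
try:
    import progressbar  # third-party; assumed to be the PyPI package of that name
    progress = progressbar.ProgressBar()
except ImportError:
    def progress(iterable):
        """No-op stand-in: iterate without drawing a progress bar."""
        return iterable


# Works the same way with or without the package installed.
steps = [step for step in progress(range(3))]
```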

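Does word segmentation improve generation quality?

Hello, have you tried training the model on word-segmented input? And do you think word segmentation helps improve generation quality?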

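Training, evaluation, and inference specify different model directories

In doupo_base_gpu.sh, the model_dir folder names specified for training, evaluation, and inference differ. And when training wt103, the evaluation dataset is enwiki8? It feels a bit confusing.

Hello, I sent an email to [email protected]. Please take a look.

I have a few questions:
1. My current task is to generate keywords for a piece of text. It is not simple extraction but requires semantic understanding, for example "Wang stole 100,000 yuan" -> "theft". Can I get this by training directly on "text" -> "keyword" pairs?
2. Which part of the code handles Chinese preprocessing? Is it data_utils_chinese.py?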

why takes only 135M gpu memory but takes 10-50G cpu memory?

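Error during inference

I am using the poetry dataset and changed the dataset name on the inference side accordingly, but there is still a problem. Could you please take a look: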

INFO:tensorflow:n_token 5472
building vocab with min_freq=0, max_size=None
final vocab size 5472 from 5472 unique tokens
WARNING:tensorflow:From d:\anaconda3\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py:1419: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.

InvalidArgumentError Traceback (most recent call last)
d:\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in _create_c_op(graph, node_def, inputs, control_inputs)
1658 try:
-> 1659 c_op = c_api.TF_FinishOperation(op_desc)
1660 except errors.InvalidArgumentError as e:

InvalidArgumentError: Dimension size must be evenly divisible by 2 but is 1
Number of ways to split should evenly divide the split dimension for 'split' (op: 'Split') with input shapes: [], [1,?] and with computed input tensors: input[0] = <0>.

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in
----> 1 main()

in main()
12 train(n_token, cutoffs, "/gpu:0",record_info_dir,train_batch_size,tgt_len,min_lr_ratio ,learning_rate ,model_dir, warm_start_path, num_core_per_host )
13 if do_inference:
---> 14 inference(n_token, cutoffs, "/gpu:0")

in inference(n_token, cutoffs, ps_device)
16 input_feed = iterator.get_next()
17
---> 18 inputs = tf.split(input_feed, num_core_per_host, 0)
19 #inputs = input_feed
20

d:\anaconda3\lib\site-packages\tensorflow\python\ops\array_ops.py in split(value, num_or_size_splits, axis, num, name)
1506 if size_splits._rank() == 0 and size_splits.dtype.is_integer:
1507 return gen_array_ops.split(
-> 1508 axis=axis, num_split=num_or_size_splits, value=value, name=name)
1509
1510 if num is None:

d:\anaconda3\lib\site-packages\tensorflow\python\ops\gen_array_ops.py in split(axis, value, num_split, name)
10740 num_split = _execute.make_int(num_split, "num_split")
10741 _, _, _op = _op_def_lib._apply_op_helper(

10742 "Split", split_dim=axis, value=value, num_split=num_split, name=name)
10743 _result = _op.outputs[:]
10744 _inputs_flat = _op.inputs

d:\anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
786 op = g.create_op(op_type_name, inputs, output_types, name=scope,
787 input_types=input_types, attrs=attr_protos,
--> 788 op_def=op_def)
789 return output_structure, op_def.is_stateful, op
790

d:\anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py in new_func(*args, **kwargs)
505 'in a future version' if date is None else ('after %s' % date),
506 instructions)
--> 507 return func(*args, **kwargs)
508
509 doc = _add_deprecated_arg_notice_to_docstring(

d:\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in create_op(failed resolving arguments)
3298 input_types=input_types,
3299 original_op=self._default_original_op,
-> 3300 op_def=op_def)
3301 self._create_op_helper(ret, compute_device=compute_device)
3302 return ret

d:\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in init(self, node_def, g, inputs, output_types, control_inputs, input_types, original_op, op_def)
1821 op_def, inputs, node_def.attr)
1822 self._c_op = _create_c_op(self._graph, node_def, grouped_inputs,
-> 1823 control_input_ops)
1824
1825 # Initialize self._outputs.

d:\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in _create_c_op(graph, node_def, inputs, control_inputs)
1660 except errors.InvalidArgumentError as e:
1661 # Convert to ValueError for backwards compatibility.
-> 1662 raise ValueError(str(e))
1663
1664 return c_op

ValueError: Dimension size must be evenly divisible by 2 but is 1
Number of ways to split should evenly divide the split dimension for 'split' (op: 'Split') with input shapes: [], [1,?] and with computed input tensors: input[0] = <0>.
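The traceback shows `tf.split(input_feed, num_core_per_host, 0)` receiving an input of shape [1,?] while being asked for 2 splits, so the batch dimension (1) is not divisible by the number of cores (2). A pre-flight check along these lines would surface the problem before the graph is built; `check_even_split` is a hypothetical helper:

```python
def check_even_split(batch_size, num_core_per_host):
    """Return the per-core batch size, or raise if the batch cannot be split evenly."""
    if num_core_per_host <= 0:
        raise ValueError("num_core_per_host must be positive")
    if batch_size % num_core_per_host != 0:
        raise ValueError(
            "batch of {} cannot be split evenly across {} cores".format(
                batch_size, num_core_per_host))
    return batch_size // num_core_per_host
```

For single-sequence inference, setting the number of cores to 1 is the simplest way to satisfy the check.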

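About Chinese preprocessing

Hello, I am new to this area and want to do Chinese text summarization with a transformer framework. I am not clear on the preprocessing. I have read some of the preprocessing code, but do you know of any blog posts or materials on Chinese preprocessing that I could learn from? Thanks!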

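How to add sample weights during training

For example, I want certain samples to carry more weight in the loss function. I do not want to use oversampling, because it increases the number of samples and makes training longer.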
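The usual alternative to oversampling is to scale each example's loss by a weight before averaging. The sketch below shows the arithmetic in plain Python; wiring it into train_gpu.py would require per-example losses and a weight input, which the repo may not expose as-is. `weighted_mean_loss` is a hypothetical name:

```python
def weighted_mean_loss(per_example_losses, weights):
    """Weighted average of per-example losses; a larger weight means more influence."""
    if len(per_example_losses) != len(weights):
        raise ValueError("need one weight per example")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(loss * w for loss, w in zip(per_example_losses, weights))
    return weighted_sum / total_weight
```

With equal weights this is the ordinary mean, so the dataset size and step count stay unchanged while important samples pull the gradient harder.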

Could I add you on WeChat? I have some questions to ask.

After training doupo, running inference shows: ValueError: Can't load save_path when it is None.

Command entered: bash scripts/doupo_base_gpu.sh inference, which shows the error below.
Run inference...
INFO:tensorflow:n_token 84827
building vocab with min_freq=0, max_size=None
final vocab size 4786 from 4786 unique tokens
WARNING:tensorflow:From /home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:1419: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /media/lukuan/s/DL_lk/NLP/transformer-xl-chinese/tf/model.py:616: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /media/lukuan/s/DL_lk/NLP/transformer-xl-chinese/tf/model.py:705: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:#params: 43046566
2019-08-26 08:58:24.718474: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-26 08:58:24.821696: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-26 08:58:24.822190: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55bcb0b0ca80 executing computations on platform CUDA. Devices:
2019-08-26 08:58:24.822203: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-08-26 08:58:24.823656: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4008359999 Hz
2019-08-26 08:58:24.824027: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55bcb0b76b90 executing computations on platform Host. Devices:
2019-08-26 08:58:24.824052: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-08-26 08:58:24.824230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8225
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 6.62GiB
2019-08-26 08:58:24.824245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-08-26 08:58:24.824957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-26 08:58:24.824975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-08-26 08:58:24.824979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-08-26 08:58:24.825117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6442 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
lk_print_show: None
Traceback (most recent call last):
File "train_gpu.py", line 735, in
tf.app.run()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train_gpu.py", line 499, in main
inference(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 578, in inference
saver.restore(sess, eval_ckpt_path)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1264, in restore
raise ValueError("Can't load save_path when it is None.")
ValueError: Can't load save_path when it is None.
Thanks!
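The log prints `lk_print_show: None` just before the traceback, which suggests the checkpoint path resolved to None. `tf.train.latest_checkpoint` returns None when the directory has no `checkpoint` index file, which would happen if inference points at a different model_dir than training (assuming the path is looked up that way, as in upstream Transformer-XL). A guard like this sketch fails earlier with a clearer message; `require_checkpoint_dir` is a hypothetical helper:

```python
import os


def require_checkpoint_dir(model_dir):
    """Raise a descriptive error if model_dir has no TensorFlow 'checkpoint' index file."""
    index_file = os.path.join(model_dir, "checkpoint")
    if not os.path.isfile(index_file):
        raise FileNotFoundError(
            "no 'checkpoint' file in {!r}; is --model_dir the directory "
            "that training wrote to?".format(model_dir))
    return index_file
```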

What is the seed text for?

What is the seed text for?
Where is the result file saved?
With train_steps=1000000, how long did training take, and did you use a GPU?

Where is the zhihu dataset?

Hi, very grateful that you share the code & result.
When I work on your code, I can not find the zhihu dataset. Can you please help me with it?

Hello GaoPeng97, is this GPU-only? Is there no CPU support?

Changing the input sample format

Hello, if I want my input to be in the form keyword + sentence, which part of the code should I modify?

transformer-xl in 1-billion experiment with base model configuration

Have you run https://github.com/kimiyoung/transformer-xl/blob/master/tf/scripts/lm1b_base_gpu.sh successfully?

inference error Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]

I trained on the Poetry data. After training, running inference with the saved model produces the following error:

Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from EXP-poetry_mem50/model-2000.ckpt
Traceback (most recent call last):
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
	 [[{{node save/Assign}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1276, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
	 [[node save/Assign (defined at train_gpu.py:571) ]]

Caused by op 'save/Assign', defined at:
  File "train_gpu.py", line 740, in <module>
    tf.app.run()
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 501, in main
    inference(n_token, cutoffs, "/gpu:0")
  File "train_gpu.py", line 571, in inference
    saver = tf.train.Saver()
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 832, in __init__
    self.build()
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 844, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 881, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
    restore_sequentially, reshape)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 354, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 73, in restore
    self.op.get_shape().is_fully_defined())
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py", line 223, in assign
    validate_shape=validate_shape)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py", line 64, in assign
    use_locking=use_locking, name=name)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
	 [[node save/Assign (defined at train_gpu.py:571) ]]


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_gpu.py", line 740, in <module>
    tf.app.run()
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 501, in main
    inference(n_token, cutoffs, "/gpu:0")
  File "train_gpu.py", line 583, in inference
    saver.restore(sess, eval_ckpt_path)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1312, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
	 [[node save/Assign (defined at train_gpu.py:571) ]]

Caused by op 'save/Assign', defined at:
  File "train_gpu.py", line 740, in <module>
    tf.app.run()
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 501, in main
    inference(n_token, cutoffs, "/gpu:0")
  File "train_gpu.py", line 571, in inference
    saver = tf.train.Saver()
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 832, in __init__
    self.build()
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 844, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 881, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
    restore_sequentially, reshape)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 354, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 73, in restore
    self.op.get_shape().is_fully_defined())
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py", line 223, in assign
    validate_shape=validate_shape)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py", line 64, in assign
    use_locking=use_locking, name=name)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
	 [[node save/Assign (defined at train_gpu.py:571) ]]

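During training, because GPU memory was limited, I changed the model parameters to:
N_LAYER=2 (fewer layers)
BSZ=64, TGT_LEN=100 (so that /data/poetry/record_info-train.bsz-64.tlen-100.json can be found)
train_steps=1000 (to see validation results sooner)
save_steps=400
For inference, I changed dataset_name = "poetry" in the main function at line 504 of train_gpu.py,
but the error above occurs.
Is it caused by the parameter changes above? Thanks.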

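The model lacks an encoder

Compared with the original transformer, this model has no encoder. Isn't that somewhat unsuitable for text generation?

At what ppl is it appropriate to stop training?

As in the title.

Where does inference do word segmentation? (I also wonder whether anyone will reply.)

In rel_multihead_attn in tf.model, is w a batch or a single example?

How much GPU memory is needed to train the Doupo model?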

2021-05-19 18:26:20.655889: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2021-05-19 18:26:20.655945: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at batch_matmul_op_impl.h:586 : Resource exhausted: OOM when allocating tensor with shape[16,10,100,200] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
My graphics card has 3 GB of memory, and I found it was not enough during training.
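For scale, the tensor named in the OOM message can be sized by hand: shape [16,10,100,200] with TensorFlow's `float` type (4 bytes per element). The arithmetic, as a small sketch:

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element corresponds to float32."""
    return reduce(mul, shape, 1) * bytes_per_element


size = tensor_bytes([16, 10, 100, 200])   # 3,200,000 elements * 4 bytes
size_mib = round(size / 2**20, 1)
```

That single tensor is only about 12 MiB, so the failure indicates the GPU was already nearly full from weights, activations, and optimizer state. Reducing the batch size, TGT_LEN, or MEM_LEN shrinks many such tensors at once.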

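A question about the cache memory

Hello, I have a small question from reading the code. At line 510 of model.py, new_mems.append(_cache_mem(output, mems[i], mem_len)) effectively drops the oldest memory and inserts the most recent one. But when i=0, output contains only the position embedding, with no multi-head attention applied yet. Why not move new_mems.append(_cache_mem(output, mems[i], mem_len)) to after the positionwise_FF output inside the for loop, i.e. to line 534?
Thanks.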
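For context: in the Transformer-XL formulation, layer n attends over the cached hidden states of layer n-1 from the previous segment, so what gets cached for layer i is the input to that layer; for i=0 that is the embedding output. The update itself, described in the question as dropping the oldest entry and appending the newest, can be sketched with plain lists standing in for tensors along the time axis. This is an analogue of `_cache_mem`, not the repo's code:

```python
def cache_mem(curr_out, prev_mem, mem_len=None):
    """List-based analogue of a Transformer-XL memory update along the time axis."""
    if mem_len is None or prev_mem is None:
        return list(curr_out)
    if mem_len == 0:
        return list(prev_mem)
    # Append the new steps, then keep only the most recent mem_len steps.
    return (list(prev_mem) + list(curr_out))[-mem_len:]


# Memory holds steps 1..3; two new steps arrive and the two oldest fall out.
updated = cache_mem([4, 5], [1, 2, 3], mem_len=3)
```

Moving the caching call to after positionwise_FF would store each layer's output under that layer's own index, shifting which states every layer attends to.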

AttributeError: 'list' object has no attribute 'get_shape'

I get this error when I run train_gpu.py.
My environment:
ubuntu16.04 with Tesla P100 (cuda9.0,cudnn7.6)
tf 1.12
numpy 1.16.4
python 3.6

Is anything wrong?

====================
INFO:tensorflow:n_token 10091
INFO:tensorflow:[train] File names ['train.bsz-128.tlen-100.tfrecords']
INFO:tensorflow:num of batches 612
WARNING:tensorflow:From \tf_gpu\lib\site-packages\tensorflow\python\ops\sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
Traceback (most recent call last):
File "train_gpu.py", line 735, in
tf.app.run()
File "\tf_gpu\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "train_gpu.py", line 495, in main
train(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 280, in train
mems=mems_i)
File "train_gpu.py", line 229, in single_core_graph
is_training=is_training)
File "train_gpu.py", line 197, in model_fn
proj_same_dim=FLAGS.proj_same_dim)
File "\tf\model.py", line 554, in transformer
proj_same_dim=proj_same_dim)
File "\tf\model.py", line 258, in mask_adaptive_logsoftmax
output = _logit(hidden, params_W, softmax_b, params_projs)
File "\tf\model.py", line 243, in _logit
y = tf.einsum('ibd,ed->ibe', y, proj)
File "\tf_gpu\lib\site-packages\tensorflow\python\ops\special_math_ops.py", line 257, in einsum
axes_to_sum)
File "~\tf_gpu\lib\site-packages\tensorflow\python\ops\special_math_ops.py", line 306, in _einsum_reduction
if len(t1_axis_labels) != len(t1.get_shape()):
AttributeError: 'list' object has no attribute 'get_shape'

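What is the stopping condition for training?

Is the LM judged by training loss or validation loss?
From an earlier issue I understand the model needs 40~60 h of training. I would expect the training loss to drop very low and converge, while the validation loss rises late in training. How should I judge model quality, and does inference use the training data or the validation data?
Thanks!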
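A common convention, not confirmed as the author's practice, is to judge a language model by validation loss (or its perplexity) and stop once it has not improved for a few evaluations. A minimal sketch; `perplexity` and `should_stop` are hypothetical helpers, and the loss is assumed to be a mean per-token negative log-likelihood in nats:

```python
import math


def perplexity(mean_nll):
    """Perplexity from a mean per-token negative log-likelihood measured in nats."""
    return math.exp(mean_nll)


def should_stop(valid_losses, patience=3):
    """True once the best validation loss is at least `patience` evaluations old."""
    if len(valid_losses) <= patience:
        return False
    best_index = valid_losses.index(min(valid_losses))
    return len(valid_losses) - 1 - best_index >= patience


# Validation loss bottoms out at the second evaluation, then rises three times.
stop_now = should_stop([3.0, 2.5, 2.6, 2.7, 2.8], patience=3)
```

Keeping the checkpoint from the best validation step, rather than the last one, avoids using a model that has started to overfit.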

scripts/doupo_base_gpu.sh: line 175: 2335 Killed

The following error occurred while training the model and the process exited. Any suggestions?

$ bash scripts/doupo_base_gpu.sh train
...
...
...
2019-09-27 09:48:18.632038: W tensorflow/core/common_runtime/colocation_graph.cc:1016] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Assign: CPU
ApplyAdam: CPU
VariableV2: CPU
Const: CPU XLA_CPU
Identity: CPU XLA_CPU
Fill: CPU XLA_CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
  transformer/adaptive_softmax/bias/Initializer/zeros/shape_as_tensor (Const)
  transformer/adaptive_softmax/bias/Initializer/zeros/Const (Const)
  transformer/adaptive_softmax/bias/Initializer/zeros (Fill)
  transformer/adaptive_softmax/bias (VariableV2) /gpu:0
  transformer/adaptive_softmax/bias/Assign (Assign) /gpu:0
  transformer/adaptive_softmax/bias/read (Identity) /gpu:0
  transformer/adaptive_softmax/bias/Adam/Initializer/zeros/shape_as_tensor (Const) /gpu:0
  transformer/adaptive_softmax/bias/Adam/Initializer/zeros/Const (Const) /gpu:0
  transformer/adaptive_softmax/bias/Adam/Initializer/zeros (Fill) /gpu:0
  transformer/adaptive_softmax/bias/Adam (VariableV2) /gpu:0
  transformer/adaptive_softmax/bias/Adam/Assign (Assign) /gpu:0
  transformer/adaptive_softmax/bias/Adam/read (Identity) /gpu:0
  transformer/adaptive_softmax/bias/Adam_1/Initializer/zeros/shape_as_tensor (Const) /gpu:0
  transformer/adaptive_softmax/bias/Adam_1/Initializer/zeros/Const (Const) /gpu:0
  transformer/adaptive_softmax/bias/Adam_1/Initializer/zeros (Fill) /gpu:0
  transformer/adaptive_softmax/bias/Adam_1 (VariableV2) /gpu:0
  transformer/adaptive_softmax/bias/Adam_1/Assign (Assign) /gpu:0
  transformer/adaptive_softmax/bias/Adam_1/read (Identity) /gpu:0
  Adam/update_transformer/adaptive_softmax/bias/ApplyAdam (ApplyAdam) /gpu:0
  save/Assign_6 (Assign) /gpu:0
  save/Assign_7 (Assign) /gpu:0
  save/Assign_8 (Assign) /gpu:0

scripts/doupo_base_gpu.sh: line 175:  2335 Killed                  CUDA_VISIBLE_DEVICES='0,1,2,3' python train_gpu.py --data_dir=${DATA_ROOT}/tfrecords --record_info_dir=${DATA_ROOT}/tfrecords/ --corpus_info_path=${DATA_ROOT}/corpus-info.json --model_dir=EXP-doupo4-1_head-1e4 --div_val=${DIV_VAL} --untie_r=True --proj_share_all_but_first=True --n_layer=${N_LAYER} --d_model=${D_MODEL} --d_embed=${D_EMBED} --n_head=${N_HEAD} --d_head=${D_HEAD} --d_inner=${D_INNER} --dropout=0.1 --dropatt=0.0 --learning_rate=0.00010 --warmup_steps=0 --train_steps=1000000 --tgt_len=${TGT_LEN} --mem_len=${MEM_LEN} --train_batch_size=${BSZ} --num_core_per_host=${NUM_CORE} --iterations=200 --save_steps=4000 ${@:2}

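About inference

In typical decoder inference, the conditional probability of output yt is based on the previous output sequence y1,...,yt-1 and the context vector c. But you seem to condition only on yt-1 and c. Why is that?

saver.restore(sess, eval_ckpt_path) in inference fails to restore the model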

tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [2222] rhs shape= [2223]
[[{{node save/Assign_1}}]]

Please help, thank you.

Error generating English text. How to fix?

Hi:

Thanks for the repo. Using google translate I now have the following steps working 100%. I modified one of your bash scripts as per your instructions and use data_utils.py and renamed old_vocabulary.py for English.

I am using values same as "doupo" and --dataset=text8

Input is 8 MB UTF-8 text files (train.txt, valid.txt, test.txt)

text8.sh train_data (works - files created)

text8.sh test_data (works)

text8.sh train (works - models generated)

text8.sh eval (works - report to terminal)

Edit script for inference:

Delete: --do_eval=True \

Add: --do_inference=True \

Run text8.sh eval again ... starts then following error:
etc.
File "train_gpu.py", line 499, in main
inference(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 506, in inference
tmp_Vocab.count_file("../data/{}/train.txt".format(dataset_name), add_eos=False)
File "/home/pixelhead/Desktop/Transformer-XL-PROBLEMS/transformer-xl-textgen/tf/vocabulary.py", line 48, in count_file

Any suggestions on how I can fix this?
Thanks,

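Problem during inference

Assign requires shapes of both tensors to match. lhs shape= [4786] rhs shape= [84827]. This error appears when running inference with the trained model. It has to be changed to n_token = 84827#len(tmp_Vocab), but then the predictions go wrong. The problem is caused by the checkpoint saved during training not matching what is used at test time.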
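Hard-coding n_token makes the shapes match but leaves the token-to-id mapping different from training, which would explain the bad predictions. A check that compares the vocabulary size recorded at training time with the one rebuilt at inference makes the mismatch explicit. In this sketch the `vocab_size` key in corpus-info.json is an assumption based on upstream Transformer-XL, and both function names are hypothetical:

```python
import json


def load_trained_vocab_size(corpus_info_text, key="vocab_size"):
    """Read the vocabulary size recorded at preprocessing time (key name is assumed)."""
    return json.loads(corpus_info_text)[key]


def check_vocab_matches(n_token_at_training, inference_vocab_size):
    """Raise if the graph built for inference would not match the checkpoint."""
    if n_token_at_training != inference_vocab_size:
        raise ValueError(
            "vocab size mismatch: trained with n_token={}, inference vocab has {}".format(
                n_token_at_training, inference_vocab_size))


trained = load_trained_vocab_size('{"vocab_size": 84827}')
```

The more robust fix is to build the inference vocabulary with the same tokenization and the same data that produced the tfrecords, so that both the size and the id assignment agree.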