gaopeng97 / transformer-xl-chinese
An attempt at Chinese text generation with Transformer-XL (can write novels and classical poetry)
License: Apache License 2.0
In your inference code there are variables such as new_mems_id and mems_id, but they don't seem to be used anywhere. Am I wrong? Is there a purpose for these variables? Thank you.
I changed the model parameters to:
#Model
DIV_VAL=1
N_LAYER=6
D_MODEL=500
D_EMBED=500
N_HEAD=10
D_HEAD=50
D_INNER=1000
#Training
TGT_LEN=70
MEM_LEN=70
BSZ=64
NUM_CORE=4
#Testing
TEST_TGT_LEN=70
TEST_MEM_LEN=500
TEST_CLAMP_LEN=400
#TEST_BSZ=10
TEST_BSZ=1
TEST_NUM_CORE=1
When I run doupo_base_gpu.sh, training and eval both work fine, but inference fails with:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [4786] rhs shape= [84827]
The model fails to load. Is there a problem with my parameter settings?
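The two shapes in the error suggest a vocabulary-size mismatch: the inference graph appears to be built for 4786 tokens while the checkpoint holds a tensor sized for 84827. A minimal sketch of a pre-restore check follows, assuming the variable shapes of both sides are available as plain dictionaries. The helper name and the example variable name are illustrative assumptions, not code from this repo.

```python
def find_shape_mismatches(graph_shapes, checkpoint_shapes):
    """Return (name, graph_shape, checkpoint_shape) for every variable
    that exists on both sides but with different shapes."""
    mismatches = []
    for name, graph_shape in sorted(graph_shapes.items()):
        ckpt_shape = checkpoint_shapes.get(name)
        if ckpt_shape is not None and list(ckpt_shape) != list(graph_shape):
            mismatches.append((name, list(graph_shape), list(ckpt_shape)))
    return mismatches


# Illustrative values only: the sizes come from the error message,
# the variable name is an assumption.
graph_shapes = {"transformer/adaptive_softmax/bias": [4786]}
checkpoint_shapes = {"transformer/adaptive_softmax/bias": [84827]}

for name, g, c in find_shape_mismatches(graph_shapes, checkpoint_shapes):
    print("{}: graph {} vs checkpoint {}".format(name, g, c))
```

In TensorFlow 1.x the checkpoint side could probably be filled from tf.train.list_variables, which returns (name, shape) pairs. If the sizes differ, the vocabulary used at inference was most likely not built the same way as the one used for training.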
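Hello, when I train on articles with your code, why does the same sentence keep appearing over and over? How can I fix this?
Hello, I'd like to ask: if I want to use the Transformer-XL model for a text summarization task, do you have any good suggestions?
I ran it myself and hit an OOM.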
for step in progress(range(output_len)):
    time.sleep(0.01)
    feed_dict = {}
    for i in range(FLAGS.num_core_per_host):
        for m, m_np in zip(tower_mems[i], tower_mems_np[i]):
            feed_dict[m] = m_np
        for id, id_np in zip(tower_mems_id[i], tower_mems_id_np[i]):
            feed_dict[id] = id_np
    sess.run(iterator.initializer, feed_dict={test_list: [encoded_input]})
    fetched = sess.run(fetches, feed_dict=feed_dict)
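What is the purpose of the loops in this snippet? Why does building feed_dict need two for loops, and why are there two sess.run calls below it?
After training the model, I used the inference function and found that your code predicts one character at a time and iterates 200 times, with each step taking about 0.5 s (on a GPU, a P40 card). I've read that XL is optimized for speed and can take [1,2,3,4] as input and produce [5,6,7,8]. If I want to predict several following Chinese characters per iteration, what should I change? Would that help with speed?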
File "/tf/data_utils.py", line 504, in parser
val = tf.sparse.to_dense(val)
AttributeError: module 'tensorflow.sparse' has no attribute 'to_dense'
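The AttributeError suggests the installed TensorFlow predates tf.sparse.to_dense. Older 1.x releases expose the same operation as tf.sparse_tensor_to_dense, so one workaround is to pick whichever name is present. The sketch below is a hedged illustration: resolve_to_dense is a hypothetical helper, and a stand-in object replaces the real tensorflow module so the snippet runs without TensorFlow installed.

```python
import types


def resolve_to_dense(tf_module):
    """Return tf.sparse.to_dense when it exists, otherwise fall back to
    the older tf.sparse_tensor_to_dense name."""
    sparse_ns = getattr(tf_module, "sparse", None)
    fn = getattr(sparse_ns, "to_dense", None)
    if fn is None:
        fn = getattr(tf_module, "sparse_tensor_to_dense", None)
    if fn is None:
        raise AttributeError("no sparse-to-dense function found")
    return fn


# Stand-in for an older TensorFlow module, used only to exercise the helper.
old_tf = types.SimpleNamespace(
    sparse=types.SimpleNamespace(),
    sparse_tensor_to_dense=lambda val: ("dense", val),
)

to_dense = resolve_to_dense(old_tf)
print(to_dense("val"))
```

With the real library this would be called as resolve_to_dense(tf), and the result used in place of the failing call in data_utils.py. Upgrading TensorFlow to a release that ships tf.sparse.to_dense is the other option.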
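Hello, I'd like to use the Zhihu data but couldn't find it. Is the Zhihu data still available?
After processing the data, I hit an import error for this file during training. It seems the repo doesn't contain progressbar.py. Could you provide it?
Hello, have you tried training the model on word-segmented input? And do you think word segmentation helps generation quality?
In doupo_base_gpu.sh, the model_dir names specified for training, evaluation, and inference are all different. And when training wt103, the validation dataset used is enwiki8? It feels a bit confusing.
I have a few questions:
1. My current task is to generate keywords for a piece of text. This is not simple extraction; it requires semantic understanding, for example “Wang stole 100,000 yuan” - “theft”. Can I train directly on text-keyword pairs to achieve this?
2. Where exactly in the code is the Chinese preprocessing? Is it data_utils_chinese.py?
I'm using the poetry dataset and have already changed the dataset name on the inference side, but it still fails. Could you please take a look: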
InvalidArgumentError Traceback (most recent call last)
d:\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in _create_c_op(graph, node_def, inputs, control_inputs)
1658 try:
-> 1659 c_op = c_api.TF_FinishOperation(op_desc)
1660 except errors.InvalidArgumentError as e:
InvalidArgumentError: Dimension size must be evenly divisible by 2 but is 1
Number of ways to split should evenly divide the split dimension for 'split' (op: 'Split') with input shapes: [], [1,?] and with computed input tensors: input[0] = <0>.
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
in
----> 1 main()
in main()
12 train(n_token, cutoffs, "/gpu:0",record_info_dir,train_batch_size,tgt_len,min_lr_ratio ,learning_rate ,model_dir, warm_start_path, num_core_per_host )
13 if do_inference:
---> 14 inference(n_token, cutoffs, "/gpu:0")
in inference(n_token, cutoffs, ps_device)
16 input_feed = iterator.get_next()
17
---> 18 inputs = tf.split(input_feed, num_core_per_host, 0)
19 #inputs = input_feed
20
d:\anaconda3\lib\site-packages\tensorflow\python\ops\array_ops.py in split(value, num_or_size_splits, axis, num, name)
1506 if size_splits._rank() == 0 and size_splits.dtype.is_integer:
1507 return gen_array_ops.split(
-> 1508 axis=axis, num_split=num_or_size_splits, value=value, name=name)
1509
1510 if num is None:
d:\anaconda3\lib\site-packages\tensorflow\python\ops\gen_array_ops.py in split(axis, value, num_split, name)
10740 num_split = _execute.make_int(num_split, "num_split")
10741 _, _, _op = _op_def_lib._apply_op_helper(
10742 "Split", split_dim=axis, value=value, num_split=num_split, name=name)
10743 _result = _op.outputs[:]
10744 _inputs_flat = _op.inputs
d:\anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
786 op = g.create_op(op_type_name, inputs, output_types, name=scope,
787 input_types=input_types, attrs=attr_protos,
--> 788 op_def=op_def)
789 return output_structure, op_def.is_stateful, op
790
d:\anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py in new_func(*args, **kwargs)
505 'in a future version' if date is None else ('after %s' % date),
506 instructions)
--> 507 return func(*args, **kwargs)
508
509 doc = _add_deprecated_arg_notice_to_docstring(
d:\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in create_op(failed resolving arguments)
3298 input_types=input_types,
3299 original_op=self._default_original_op,
-> 3300 op_def=op_def)
3301 self._create_op_helper(ret, compute_device=compute_device)
3302 return ret
d:\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in __init__(self, node_def, g, inputs, output_types, control_inputs, input_types, original_op, op_def)
1821 op_def, inputs, node_def.attr)
1822 self._c_op = _create_c_op(self._graph, node_def, grouped_inputs,
-> 1823 control_input_ops)
1824
1825 # Initialize self._outputs.
d:\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in _create_c_op(graph, node_def, inputs, control_inputs)
1660 except errors.InvalidArgumentError as e:
1661 # Convert to ValueError for backwards compatibility.
-> 1662 raise ValueError(str(e))
1663
1664 return c_op
ValueError: Dimension size must be evenly divisible by 2 but is 1
Number of ways to split should evenly divide the split dimension for 'split' (op: 'Split') with input shapes: [], [1,?] and with computed input tensors: input[0] = <0>.
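The traceback shows tf.split(input_feed, num_core_per_host, 0) being asked to cut a first dimension of size 1 into 2 pieces, which cannot be done evenly. A minimal sketch of the same constraint on plain Python lists, assuming only that the batch dimension must be a multiple of the number of cores (split_batch is a hypothetical helper, not from the repo):

```python
def split_batch(batch, num_core_per_host):
    """Split a batch (list of rows) into equal per-core chunks.

    Raises ValueError when the batch size is not a multiple of the core
    count, which is the condition the Split op rejects above.
    """
    batch_size = len(batch)
    if batch_size % num_core_per_host != 0:
        raise ValueError(
            "batch size {} is not divisible by num_core_per_host {}".format(
                batch_size, num_core_per_host))
    chunk = batch_size // num_core_per_host
    return [batch[i * chunk:(i + 1) * chunk] for i in range(num_core_per_host)]


one_row_batch = [[11, 12, 13]]        # batch size 1, as in the traceback
print(split_batch(one_row_batch, 1))  # one core: fine

try:
    split_batch(one_row_batch, 2)     # two cores: same failure mode
except ValueError as err:
    print(err)
```

Under that reading, either the inference batch size would need to be a multiple of num_core_per_host, or num_core_per_host would need to be 1 for single-sample inference.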
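Hello, I'm new to this area and want to do Chinese text summarization with the transformer framework. I'm not clear on the preprocessing; I've read some of the processing code. Do you have any blog posts or materials on Chinese preprocessing that I could study? Thanks!
For example, I'd like certain samples to carry more weight in the loss function, and I don't want to use oversampling, because oversampling increases the number of samples and makes training take longer.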
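One common way to weight samples without oversampling is to multiply each sample's loss by a weight and normalise by the sum of the weights. The sketch below shows the arithmetic on plain Python floats, as an assumption about what such a loss could look like. It is not code from this repo, and weighted_nll is a hypothetical name.

```python
import math


def weighted_nll(target_probs, weights):
    """Weighted mean negative log-likelihood.

    target_probs[i] is the probability the model assigned to the correct
    token of sample i; weights[i] is that sample's importance.
    """
    if len(target_probs) != len(weights):
        raise ValueError("target_probs and weights must have equal length")
    total = sum(w * -math.log(p) for p, w in zip(target_probs, weights))
    return total / sum(weights)


probs = [0.5, 0.25]
print(weighted_nll(probs, [1.0, 1.0]))  # plain average
print(weighted_nll(probs, [1.0, 3.0]))  # second sample counts three times as much
```

The dataset size and the number of training steps stay the same; only the contribution of each sample to the gradient changes.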
Command entered: bash scripts/doupo_base_gpu.sh inference. It shows an error.
Run inference...
INFO:tensorflow:n_token 84827
building vocab with min_freq=0, max_size=None
final vocab size 4786 from 4786 unique tokens
WARNING:tensorflow:From /home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:1419: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /media/lukuan/s/DL_lk/NLP/transformer-xl-chinese/tf/model.py:616: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /media/lukuan/s/DL_lk/NLP/transformer-xl-chinese/tf/model.py:705: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
INFO:tensorflow:#params: 43046566
2019-08-26 08:58:24.718474: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-26 08:58:24.821696: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-26 08:58:24.822190: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55bcb0b0ca80 executing computations on platform CUDA. Devices:
2019-08-26 08:58:24.822203: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-08-26 08:58:24.823656: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4008359999 Hz
2019-08-26 08:58:24.824027: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55bcb0b76b90 executing computations on platform Host. Devices:
2019-08-26 08:58:24.824052: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-08-26 08:58:24.824230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8225
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 6.62GiB
2019-08-26 08:58:24.824245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-08-26 08:58:24.824957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-26 08:58:24.824975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-08-26 08:58:24.824979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-08-26 08:58:24.825117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6442 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
lk_print_show: None
Traceback (most recent call last):
File "train_gpu.py", line 735, in <module>
tf.app.run()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train_gpu.py", line 499, in main
inference(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 578, in inference
saver.restore(sess, eval_ckpt_path)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1264, in restore
raise ValueError("Can't load save_path when it is None.")
ValueError: Can't load save_path when it is None.
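The final ValueError means the path handed to saver.restore was None, which typically happens when no checkpoint is found in the model directory being read. A rough stand-in for that lookup is sketched below, assuming the usual TensorFlow layout where the directory contains a text file named checkpoint with a model_checkpoint_path entry. The helper is hypothetical; real code would normally rely on tf.train.latest_checkpoint instead.

```python
import os
import tempfile


def find_latest_checkpoint(model_dir):
    """Return the path named in model_dir/checkpoint, or None if the
    state file is missing or has no model_checkpoint_path entry."""
    state_file = os.path.join(model_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None
    with open(state_file, encoding="utf-8") as f:
        for line in f:
            if line.startswith("model_checkpoint_path:"):
                name = line.split(":", 1)[1].strip().strip('"')
                return os.path.join(model_dir, name)
    return None


with tempfile.TemporaryDirectory() as model_dir:
    # Empty directory: nothing to restore, mirrors the None in the log above.
    print(find_latest_checkpoint(model_dir))

    # After a state file exists, a usable path comes back.
    with open(os.path.join(model_dir, "checkpoint"), "w", encoding="utf-8") as f:
        f.write('model_checkpoint_path: "model-2000.ckpt"\n')
    print(find_latest_checkpoint(model_dir))
```

If the lookup returns None, it is worth checking that the model_dir passed for inference is the same directory that training wrote to.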
Thanks!
What is the seed text used for?
Where is the result file saved?
train_steps=1000000: how long did that take you, and was it on a GPU?
Hi, I'm very grateful that you shared the code and results.
While working with your code, I could not find the zhihu dataset. Can you please help me with it?
Hello GaoPeng97, is it GPU only? Is there no CPU support?
Hello, if I want my input to be in the form keyword + sentence, which part of the code should I modify?
Have you run https://github.com/kimiyoung/transformer-xl/blob/master/tf/scripts/lm1b_base_gpu.sh successfully?
I trained on the Poetry data, and after training finished, running inference with the saved model produced the following error:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from EXP-poetry_mem50/model-2000.ckpt
Traceback (most recent call last):
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
[[{{node save/Assign}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1276, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
[[node save/Assign (defined at train_gpu.py:571) ]]
Caused by op 'save/Assign', defined at:
File "train_gpu.py", line 740, in <module>
tf.app.run()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train_gpu.py", line 501, in main
inference(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 571, in inference
saver = tf.train.Saver()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 832, in __init__
self.build()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 844, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 881, in _build
build_save=build_save, build_restore=build_restore)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
restore_sequentially, reshape)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 354, in _AddRestoreOps
assign_ops.append(saveable.restore(saveable_tensors, shapes))
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 73, in restore
self.op.get_shape().is_fully_defined())
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py", line 223, in assign
validate_shape=validate_shape)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py", line 64, in assign
use_locking=use_locking, name=name)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
[[node save/Assign (defined at train_gpu.py:571) ]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train_gpu.py", line 740, in <module>
tf.app.run()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train_gpu.py", line 501, in main
inference(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 583, in inference
saver.restore(sess, eval_ckpt_path)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1312, in restore
err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
[[node save/Assign (defined at train_gpu.py:571) ]]
Caused by op 'save/Assign', defined at:
File "train_gpu.py", line 740, in <module>
tf.app.run()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train_gpu.py", line 501, in main
inference(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 571, in inference
saver = tf.train.Saver()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 832, in __init__
self.build()
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 844, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 881, in _build
build_save=build_save, build_restore=build_restore)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
restore_sequentially, reshape)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 354, in _AddRestoreOps
assign_ops.append(saveable.restore(saveable_tensors, shapes))
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 73, in restore
self.op.get_shape().is_fully_defined())
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py", line 223, in assign
validate_shape=validate_shape)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py", line 64, in assign
use_locking=use_locking, name=name)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/lukuan/.pyenv/versions/anaconda3-5.0.1/envs/lk_TC/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [5472,410] rhs shape= [84827,410]
[[node save/Assign (defined at train_gpu.py:571) ]]
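During training, because GPU memory was limited, I modified the model parameters as follows:
N_LAYER=2 (reduced the number of layers)
BSZ=64, TGT_LEN=100 (so that /data/poetry/record_info-train.bsz-64.tlen-100.json can be found)
train_steps=1000 (to see validation results as soon as possible)
save_steps=400
For inference, I changed dataset_name = "poetry" in the main function at line 504 of train_gpu.py,
but the error above occurs.
Is this caused by the parameter changes above? Thanks.
Compared with the original transformer, this one has no encoder layers. Isn't that somewhat unsuitable for text generation?
As the title says.
Where does inference do word segmentation? I'm also not sure whether anyone will reply.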
2021-05-19 18:26:20.655889: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2021-05-19 18:26:20.655945: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at batch_matmul_op_impl.h:586 : Resource exhausted: OOM when allocating tensor with shape[16,10,100,200] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
My GPU has 3 GB of memory, and it ran out during training.
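For scale, the size of the tensor named in the OOM message can be worked out from its shape. The sketch below does that arithmetic, assuming 4 bytes per float32 element. The helper name is hypothetical.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Number of bytes needed for a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


shape = [16, 10, 100, 200]  # shape reported in the OOM message
size = tensor_bytes(shape)
print("{} bytes = {:.1f} MiB".format(size, size / 2 ** 20))
```

A single tensor of this shape is only about 12 MiB, so the failure most likely means the allocator was already nearly full from the many activations kept for backpropagation. The 16 in the shape presumably reflects the per-device batch, so lowering the batch size, tgt_len, or mem_len is the usual way to shrink these tensors.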
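Hello, I have a small question from reading the code. At line 510 of model.py there is new_mems.append(_cache_mem(output, mems[i], mem_len))
What this does is drop the oldest memory and insert the most recent one. But when i=0, output contains only the position embedding and has not been through multi-head attention. Why not move the line new_mems.append(_cache_mem(output, mems[i], mem_len))
to after the positionwise_FF output inside the for loop, that is, to line 534?
Thanks a lot.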
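The "drop the oldest, insert the newest" behaviour described above can be sketched on plain lists: concatenate the previous memory with the current hidden states and keep only the last mem_len entries. This is an illustration under that assumption rather than the repo's _cache_mem, which operates on tensors.

```python
def cache_mem(curr_out, prev_mem, mem_len):
    """Append the current segment's states to the previous memory and
    keep only the most recent mem_len entries."""
    if mem_len <= 0:
        return []
    combined = list(curr_out) if prev_mem is None else list(prev_mem) + list(curr_out)
    return combined[-mem_len:]


mem = None
for segment in ([1, 2], [3, 4], [5, 6]):
    mem = cache_mem(segment, mem, mem_len=3)
    print(mem)
```

As far as the Transformer-XL formulation goes, the memory consumed by layer i is the hidden state entering that layer, i.e. the output of the layer below. That would explain why the value is cached before attention and why, for i=0, it is just the embedding output.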
I get this error when I run train_gpu.py.
My environment:
ubuntu16.04 with Tesla P100 (cuda9.0,cudnn7.6)
tf 1.12
numpy 1.16.4
python 3.6
Is anything wrong?
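tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
Does the LM judge model performance by training loss or by validation loss?
From earlier issues I saw that the model needs 40-60 h of training. I would expect the training loss to drop very low until it converges, while the validation loss should rise late in training. I'm not sure how to judge whether the model is good, or whether inference uses the training data or the validation data.
Thanks!
The following error occurs while training the model and the process exits immediately. Could you give some suggestions?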
$ bash scripts/doupo_base_gpu.sh train
...
...
...
2019-09-27 09:48:18.632038: W tensorflow/core/common_runtime/colocation_graph.cc:1016] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Assign: CPU
ApplyAdam: CPU
VariableV2: CPU
Const: CPU XLA_CPU
Identity: CPU XLA_CPU
Fill: CPU XLA_CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
transformer/adaptive_softmax/bias/Initializer/zeros/shape_as_tensor (Const)
transformer/adaptive_softmax/bias/Initializer/zeros/Const (Const)
transformer/adaptive_softmax/bias/Initializer/zeros (Fill)
transformer/adaptive_softmax/bias (VariableV2) /gpu:0
transformer/adaptive_softmax/bias/Assign (Assign) /gpu:0
transformer/adaptive_softmax/bias/read (Identity) /gpu:0
transformer/adaptive_softmax/bias/Adam/Initializer/zeros/shape_as_tensor (Const) /gpu:0
transformer/adaptive_softmax/bias/Adam/Initializer/zeros/Const (Const) /gpu:0
transformer/adaptive_softmax/bias/Adam/Initializer/zeros (Fill) /gpu:0
transformer/adaptive_softmax/bias/Adam (VariableV2) /gpu:0
transformer/adaptive_softmax/bias/Adam/Assign (Assign) /gpu:0
transformer/adaptive_softmax/bias/Adam/read (Identity) /gpu:0
transformer/adaptive_softmax/bias/Adam_1/Initializer/zeros/shape_as_tensor (Const) /gpu:0
transformer/adaptive_softmax/bias/Adam_1/Initializer/zeros/Const (Const) /gpu:0
transformer/adaptive_softmax/bias/Adam_1/Initializer/zeros (Fill) /gpu:0
transformer/adaptive_softmax/bias/Adam_1 (VariableV2) /gpu:0
transformer/adaptive_softmax/bias/Adam_1/Assign (Assign) /gpu:0
transformer/adaptive_softmax/bias/Adam_1/read (Identity) /gpu:0
Adam/update_transformer/adaptive_softmax/bias/ApplyAdam (ApplyAdam) /gpu:0
save/Assign_6 (Assign) /gpu:0
save/Assign_7 (Assign) /gpu:0
save/Assign_8 (Assign) /gpu:0
scripts/doupo_base_gpu.sh: line 175: 2335 Killed CUDA_VISIBLE_DEVICES='0,1,2,3' python train_gpu.py --data_dir=${DATA_ROOT}/tfrecords --record_info_dir=${DATA_ROOT}/tfrecords/ --corpus_info_path=${DATA_ROOT}/corpus-info.json --model_dir=EXP-doupo4-1_head-1e4 --div_val=${DIV_VAL} --untie_r=True --proj_share_all_but_first=True --n_layer=${N_LAYER} --d_model=${D_MODEL} --d_embed=${D_EMBED} --n_head=${N_HEAD} --d_head=${D_HEAD} --d_inner=${D_INNER} --dropout=0.1 --dropatt=0.0 --learning_rate=0.00010 --warmup_steps=0 --train_steps=1000000 --tgt_len=${TGT_LEN} --mem_len=${MEM_LEN} --train_batch_size=${BSZ} --num_core_per_host=${NUM_CORE} --iterations=200 --save_steps=4000 ${@:2}
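Normally, at inference time a decoder conditions the probability of output yt on the previous output sequence y1,...,yt-1 and the context vector c, but you seem to condition only on yt-1 and c. Is that right, and why do it this way?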
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [2222] rhs shape= [2223]
[[{{node save/Assign_1}}]]
Please help, I'd really appreciate it.
Hi:
Thanks for the repo. Using google translate I now have the following steps working 100%. I modified one of your bash scripts as per your instructions and use data_utils.py and renamed old_vocabulary.py for English.
I am using values same as "doupo" and --dataset=text8
Input is 8 MB UTF-8 text files (train.txt, valid.txt, test.txt)
text8.sh train_data (works - files created)
text8.sh test_data (works)
text8.sh train (works - models generated)
text8.sh eval (works - report to terminal)
Edit script for inference:
Delete: --do_eval=True \
Add: --do_inference=True \
Run text8.sh eval again ... starts then following error:
etc.
File "train_gpu.py", line 499, in main
inference(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 506, in inference
tmp_Vocab.count_file("../data/{}/train.txt".format(dataset_name), add_eos=False)
File "/home/pixelhead/Desktop/Transformer-XL-PROBLEMS/transformer-xl-textgen/tf/vocabulary.py", line 48, in count_file
Any suggestions on how I can fix this?
Thanks,
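Assign requires shapes of both tensors to match. lhs shape= [4786] rhs shape= [84827]: this error appears when running inference with the trained model. To get past it you have to change the code to n_token = 84827#len(tmp_Vocab), but with that change the predictions go wrong. The problem is caused by a mismatch between the checkpoint saved during training and what is used at test time.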