Following the TDM wiki (https://github.com/alibaba/x-deeplearning/wiki/%E6%B7%B1%E5%BA%A6%E6%A0%91%E5%8C%B9%E9%85%8D%E6%A8%A1%E5%9E%8B(TDM)),
I configured Hadoop in the docker image and worked through the "single-machine small-dataset experiment" steps from the beginning up to the training step. There a Hadoop exception caused the script to exit, so training never actually started. The error output is below; what could be the cause?
=========================================================
config: {u'ps': {u'instance_num': 16, u'memory_m': 64000, u'gpu_cores': 0, u'cpu_cores': 16}, u'dependent_dirs': u'/home/hcx/tdm_mock/tdm_ub_att_ubuntu', u'script': u'train.py', u'worker': {u'instance_num': 20, u'memory_m': 100000, u'gpu_cores': 2, u'cpu_cores': 46}, u'max_local_failover_times': 3, u'auto_rebalance': {u'enable': u'false'}, u'min_finish_worker_rate': 100, u'max_failover_times': 3, u'job_name': u'xdl_tdm', u'docker_image': u'trn:img', u'checkpoint': {u'output_dir': u'hdfs:/train_ckpt/checkpoint'}}
mv data/ub_tree.pb data/ub_tree.pb.bak
hadoop fs -get hdfs:/tree_data/data/userbehavoir_tree.pb data/ub_tree.pb
Load successfully, leaf node count:hello1
parsers.txt
fs.hdfs
2018-12-29 11:42:41,021 [main] WARN util.NativeCodeLoader (NativeCodeLoader.java:(60)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hello2
hello3
hello4
hdfsGetPathInfo(hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+, expected: hdfs://localhost:9000java.lang.IllegalArgumentException: Wrong FS: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+, expected: hdfs://localhost:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:240)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
hdfsGetPathInfo(hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+, expected: hdfs://localhost:9000java.lang.IllegalArgumentException: Wrong FS: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+, expected: hdfs://localhost:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:240)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
data parallel for re: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+
hdfsListDirectory(hdfs:/tree_data/data): FileSystem#listStatus error:
IllegalArgumentException: Wrong FS: hdfs:/tree_data/data, expected: hdfs://localhost:9000java.lang.IllegalArgumentException: Wrong FS: hdfs:/tree_data/data, expected: hdfs://localhost:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:240)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1052)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1119)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1116)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1126)
2018-12-29 11:42:42.068194: /home/yue.song/x-deeplearning/xdl/xdl/data_io/fs/file_system_hdfs.cc:129] Check failed: info != nullptr can't open dir hdfs:/tree_data/data
Aborted (core dumped)
=========================================================
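My (possibly mistaken) reading of the stack trace: FileSystem.checkPath rejects the path because it carries the hdfs scheme but no host:port authority, while the DistributedFileSystem handle is bound to hdfs://localhost:9000, hence "Wrong FS: ..., expected: hdfs://localhost:9000". A quick check in Python 2 (the same interpreter train.py uses) shows the two URI forms differ exactly in that authority part:

from urlparse import urlparse  # Python 2 stdlib

# Path as it appears in the log vs. a fully qualified one; netloc is the authority.
for p in ('hdfs:/tree_data/data', 'hdfs://localhost:9000/tree_data/data'):
    u = urlparse(p)
    print '%-40s scheme=%s authority=%r' % (p, u.scheme, u.netloc)
# hdfs:/tree_data/data                     scheme=hdfs authority=''
# hdfs://localhost:9000/tree_data/data     scheme=hdfs authority='localhost:9000'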
Here are the modifications made to train.py (an excerpt of the train() function, with my debug prints added):
def train(is_training=True):
    #np.set_printoptions(threshold='nan')
    if is_training or xdl.get_task_index() == 0:
        init()
    else:
        return

    file_type = xdl.parsers.txt
    if is_training:
        # debug markers; the log above shows hello1-hello4 and then the abort
        print "hello1"
        print file_type
        print xdl.fs.hdfs
        data_io = xdl.DataIO("tdm", file_type=file_type, fs_type=xdl.fs.hdfs,
                             namenode="hdfs://localhost:9000", enable_state=False)
        print "hello2"

        feature_count = 69
        for i in xrange(1, feature_count + 1):
            data_io.feature(name=("item_%s" % i), type=xdl.features.sparse, table=1)
        data_io.feature(name="unit_id_expand", type=xdl.features.sparse, table=0)

        print "hello3"
        data_io.batch_size(intconf('train_batch_size'))
        data_io.epochs(intconf('train_epochs'))
        data_io.threads(intconf('train_threads'))
        data_io.label_count(2)
        base_path = '%s/%s/' % (conf('upload_url'), conf('data_dir'))
        data = base_path + conf('train_sample') + '_' + r'[\d]+'
        sharding = xdl.DataSharding(data_io.fs())
        print "hello4"
        sharding.add_path(data)  # "hello5" never prints, so the crash seems to happen here
        print "hello5"
        paths = sharding.partition(rank=xdl.get_task_index(), size=xdl.get_task_num())
        print "hello6"
        print 'train: sharding.partition() =', paths
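For reference, the failing path is assembled from my job config; if I substitute the values I believe are in effect (a reconstruction on my part, guessing upload_url = 'hdfs:/tree_data', data_dir = 'data', train_sample = 'userbehavoir_train_sample.dat' from the log), it reproduces the exact string in the error:

# Hypothetical expansion of the path that train() hands to sharding.add_path(),
# with conf() values guessed from the log above.
base_path = '%s/%s/' % ('hdfs:/tree_data', 'data')
data = base_path + 'userbehavoir_train_sample.dat' + '_' + r'[\d]+'
print data  # hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+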
I couldn't quite make sense of the wiki's explanation of namenode (the original wording, translated: '# Modify the DataIO parameter in the train code: namenode="hdfs://your/namenode/hdfs/path:9000"; this is the HDFS root path of the directory the samples are read from').
So I tried several values for namenode and eventually settled on "hdfs://localhost:9000". I don't know whether this is what triggers the exception; my Hadoop knowledge is limited, so I can't really tell what's wrong.
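One thing I plan to try (just a guess on my part, not verified): qualify the sample path with the same authority as the namenode, so the URI that checkPath sees matches hdfs://localhost:9000:

# Untested idea: build the data path on top of the full namenode URI instead of
# the bare 'hdfs:/' prefix, e.g. by changing upload_url in the job config.
namenode = 'hdfs://localhost:9000'
data = namenode + '/tree_data/data/userbehavoir_train_sample.dat' + '_' + r'[\d]+'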
I'd appreciate help spotting the likely cause; I just want to get the full TDM training pipeline to run end to end.