rioyokotalab / caffe2
This project forked from facebookarchive/caffe2
Caffe2 is a lightweight, modular, and scalable deep learning framework.
Home Page: https://caffe2.ai
License: Other
This also happens when connecting over SSH; the following warning appears:
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "en_US:en",
LC_ALL = (unset),
LC_MESSAGES = "en_US.UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
Per https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue, the fix is to set the locale (e.g. export LC_ALL="en_US.UTF-8") in ~/.bashrc.
I tried to train ResNet-50 in FP16 at commit 0f72d25.
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 125/128 of epoch 0 (201.81 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 126/128 of epoch 0 (202.05 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 127/128 of epoch 0 (201.43 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 128/128 of epoch 0 (201.80 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
Traceback (most recent call last):
File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 500, in <module>
main()
File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 496, in main
Train(args)
File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 421, in Train
explog
File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 188, in RunEpoch
assert loss < 40, "Exploded gradients :("
AssertionError: Exploded gradients :(
This experiment was executed with a 10-category dataset, so I used --num_labels 10.
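A NaN loss this early in FP16 training is often caused by gradients leaving the representable float16 range. The standard remedy is loss scaling; the sketch below is a minimal NumPy illustration of the idea, not code from resnet50_trainer.py (the function name and parameters are hypothetical):

```python
import numpy as np

def sgd_step(w_master, grad_fn, lr=0.1, loss_scale=1024.0):
    """One SGD step with static loss scaling; master weights stay float32.

    The loss is multiplied by loss_scale before backprop so that small
    gradients survive the cast to float16; the gradient is divided by the
    same factor (in float32) before the master-weight update.
    """
    # Gradient of the *scaled* loss, rounded to float16 as FP16 backprop would do.
    g16 = (grad_fn(w_master) * loss_scale).astype(np.float16)
    g = g16.astype(np.float32) / loss_scale  # unscale in float32
    return w_master - lr * g
```

With loss_scale=1.0, a gradient of 1e-8 underflows to zero in float16 and the weights never move; with loss_scale=1024.0 the scaled gradient stays representable and the update survives.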
$ python resnet50_trainer.py --train_data /path-to/ilsvrc12_train_lmdb
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 1500000
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
Traceback (most recent call last):
File "resnet50_trainer.py", line 490, in <module>
main()
File "resnet50_trainer.py", line 486, in main
Train(args)
File "resnet50_trainer.py", line 339, in Train
optimize_gradient_memory=True,
File "/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py", line 24, in Parallelize_GPU
Parallelize(*args, **kwargs)
File "/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py", line 142, in Parallelize
input_builder_fun(model_helper_obj)
File "resnet50_trainer.py", line 328, in add_image_input
img_size=args.image_size,
File "resnet50_trainer.py", line 61, in AddImageInput
mirror=1
File "/home/hiroki11/caffe2/build/caffe2/python/brew.py", line 104, in scope_wrapper
return func(*args, **new_kwargs)
File "/home/hiroki11/caffe2/build/caffe2/python/helpers/tools.py", line 21, in image_input
data, label = model.net.ImageInput(
File "/home/hiroki11/caffe2/build/caffe2/python/core.py", line 1840, in __getattr__
",".join(workspace.C.nearby_opnames(op_type)) + ']'
AttributeError: Method ImageInput is not a registered operator. Did you mean: []
I'd like to install protobuf, which is a dependency:
cd $SRC_DIR
wget https://github.com/google/protobuf/archive/v3.3.0.tar.gz
tar zxvf v3.3.0.tar.gz
cd protobuf-3.3.0
./autogen.sh
./configure --prefix=$LOCAL_DIR/protobuf-3.3.0
make -j 64
make install
I want to run:
./autogen.sh
but if autoconf is not installed, an error is displayed. Therefore:
cd $SRC_DIR
wget http://ftp.gnu.org/gnu/autoconf/autoconf-2.68.tar.gz
tar zxvf autoconf-2.68.tar.gz
cd autoconf-2.68
./configure --prefix=$LOCAL_DIR/autoconf-2.68
make -j 64
make check -j 64
make install -j 64
This installs to $LOCAL_DIR/autoconf-2.68. Then edit ~/.bashrc and add the following lines:
# For autoconf
export PATH=$LOCAL_DIR/autoconf-2.68:$PATH
export PATH=$LOCAL_DIR/autoconf-2.68/bin:$PATH
Then I tried:
./autogen.sh
and the following error occurred:
+ autoreconf -f -i -Wall,no-obsolete
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Can't exec "aclocal": Permission denied at $LOCAL_DIR/autoconf-2.68/share/autoconf/Autom4te/FileUtils.pm line 326.
autoreconf: failed to run aclocal: Permission denied
Isn't aclocal part of automake? I have not reinstalled automake yet; does "Permission denied" mean I need to reinstall it?
I ran the script below:
#!/bin/bash
for i in {0..3}
do
bsub \
-e error_file.log \
-o output_file.log \
-R "rusage[ngpus_shared=4]" \
-q excl \
python ${CAFFE2_HOME}/caffe2/python/examples/resnet50_trainer.py \
--train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb \
--gpus 0,1,2,3 \
--batch_size 128 \
--num_labels 10 \
--epoch_size 10240 \
--num_epochs 10 \
--num_shards 4 \
--shard_id $i \
--redis_host XXXXXX --redis_port 6379
done
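The loop varies only --shard_id across the four submissions. A small Python sketch (the paths and Redis host are placeholders, exactly as in the script) that builds the same four trainer command lines:

```python
def shard_commands(num_shards=4, redis_host="XXXXXX", redis_port=6379):
    """Build the per-shard trainer command lines that the bsub loop submits."""
    base = ("python resnet50_trainer.py"
            " --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb"
            " --gpus 0,1,2,3 --batch_size 128 --num_labels 10"
            " --epoch_size 10240 --num_epochs 10"
            " --num_shards {n} --shard_id {i}"
            " --redis_host {h} --redis_port {p}")
    # One command per shard; only --shard_id differs between them.
    return [base.format(n=num_shards, i=i, h=redis_host, p=redis_port)
            for i in range(num_shards)]
```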
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 236, in Train
prefix=args.run_id,
File "/path-to/caffe2/build/caffe2/python/core.py", line 324, in CreateOperator
operator.arg.add().CopyFrom(utils.MakeArgument(key, value))
File "/path-to/caffe2/build/caffe2/python/utils.py", line 128, in MakeArgument
key, value, type(value)
ValueError: Unknown argument type: key=prefix value=None, value type=<type 'NoneType'>
(The other three shards exited with the same ValueError.)
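The ValueError comes from CreateOperator receiving prefix=None: the trainer passes prefix=args.run_id, and when --run_id is not given, caffe2's utils.MakeArgument has no branch for NoneType. The sketch below is an illustrative stand-in for that type dispatch, plus a guard that avoids the crash (make_argument, run_prefix, and the "run_0" default are all hypothetical names, not caffe2 code):

```python
def make_argument(key, value):
    """Illustrative version of caffe2's utils.MakeArgument type dispatch."""
    if isinstance(value, (bool, int)):
        return ("i", key, int(value))
    if isinstance(value, float):
        return ("f", key, value)
    if isinstance(value, (str, bytes)):
        return ("s", key, value)
    # None (e.g. an unset --run_id) falls through, just like in caffe2:
    raise ValueError("Unknown argument type: key=%s value=%s, value type=%s"
                     % (key, value, type(value)))

def run_prefix(run_id):
    """Guard applied before building the operator: always pass a string."""
    return str(run_id) if run_id is not None else "run_0"
```

With this guard, make_argument("prefix", run_prefix(args.run_id)) always receives a string and the operator can be serialized.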
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
INFO:data_parallel_model:Parallelizing model for devices: [0, 1, 2, 3]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Model for GPU : 2
INFO:data_parallel_model:Model for GPU : 3
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed computed params all-reduce not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.335288047791 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.353416204453 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.340279817581 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.345367908478 secs
E0719 03:57:34.776022 145363 common_world_ops.h:75] Caught store handler timeout exception: [/path-to/caffe2/caffe2/distributed/file_store_handler.cc:132] Wait timeout for name(s): allreduce_0_cw_op/1/0
E0719 03:57:34.777902 145363 net.cc:145] Operator failed: input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0719 03:57:34.778396 145363 workspace.cc:217] Error when running network resnet50_init
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 350, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/path-to/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
StringifyProto(net),
File "/path-to/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at pybind_state.cc:862] gWorkspace->RunNetOnce(def).
Sender: LSF System <[email protected]>
Subject: Job 327128: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited
Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c110.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 0.99 sec.
Max Memory : 29 MB
Average Memory : 29.00 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 4
Max Threads : 5
Run time : 4 sec.
Turnaround time : 5 sec.
The output (if any) follows:
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
PS:
Read file <error_file.log> for stderr output of this job.
Sender: LSF System <[email protected]>
Subject: Job 327130: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited
Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c055.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 0.99 sec.
Max Memory : 29 MB
Average Memory : 29.00 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 4
Max Threads : 5
Run time : 3 sec.
Turnaround time : 6 sec.
The output (if any) follows:
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
PS:
Read file <error_file.log> for stderr output of this job.
Sender: LSF System <[email protected]>
Subject: Job 327129: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited
Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c041.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 0.99 sec.
Max Memory : 29 MB
Average Memory : 1.00 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 4
Max Threads : 5
Run time : 3 sec.
Turnaround time : 6 sec.
The output (if any) follows:
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
PS:
Read file <error_file.log> for stderr output of this job.
Sender: LSF System <[email protected]>
Subject: Job 327131: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited
Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c110.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 0.90 sec.
Max Memory : 36 MB
Average Memory : 36.00 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 4
Max Threads : 5
Run time : 2 sec.
Turnaround time : 9 sec.
The output (if any) follows:
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
PS:
Read file <error_file.log> for stderr output of this job.
Sender: LSF System <[email protected]>
Subject: Job 327134: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4> in cluster <gargblsf> Exited
Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c143.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 9.58 sec.
Max Memory : 325 MB
Average Memory : 249.67 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 4
Max Threads : 11
Run time : 44 sec.
Turnaround time : 44 sec.
The output (if any) follows:
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback for operator 1069 in network resnet50_init
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:919
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:970
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:983
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:881
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:221
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:462
PS:
Read file <error_file.log> for stderr output of this job.
Rank 0
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 1281024
INFO:data_parallel_model:Parallelizing model for devices: [0, 1, 2, 3]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Model for GPU : 2
INFO:data_parallel_model:Model for GPU : 3
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed computed params all-reduce not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.382014989853 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.402288913727 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.385862827301 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.391633987427 secs
E0804 00:42:11.653461 80925 common_world_ops.h:75] Caught store handler timeout exception: [/home/hiroki11/caffe2/caffe2/distributed/file_store_handler.cc:132] Wait timeout for name(s): allreduce_3_cw_op/3/0
E0804 00:42:11.657723 80925 net.cc:145] Operator failed: input: "store_handler" output: "allreduce_3_cw" name: "allreduce_3_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_3_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0804 00:42:11.658283 80925 workspace.cc:217] Error when running network resnet50_init
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback for operator 1072 in network resnet50_init
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:919
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:970
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:462
Traceback (most recent call last):
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 350, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
StringifyProto(net),
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at pybind_state.cc:862] gWorkspace->RunNetOnce(def).
When running cmake for Caffe2:
CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:108 (message):
Could NOT find CUDA: Found unsuitable version "9.0", but required is exact
version "8.0" (found /usr/local/cuda-9.0)
Call Stack (most recent call first):
/usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:313 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake/Modules/FindCUDA.cmake:806 (find_package_handle_standard_args)
/home/hiroki11/env/local/opencv-2.4.13/share/OpenCV/OpenCVConfig.cmake:45 (find_package)
/home/hiroki11/env/local/opencv-2.4.13/share/OpenCV/OpenCVConfig.cmake:242 (find_host_package)
cmake/Dependencies.cmake:172 (find_package)
CMakeLists.txt:73 (include)
Do I have to rebuild OpenCV? I had installed OpenCV by executing the following commands:
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/home/gi75/i75012/env/local/opencv-2.4.13 -DCMAKE_BUILD_TYPE=RELEASE -DCUDA_NVCC_FLAGS='-std=c++11' -DCUDA_ARCH_BIN="2.0 2.1(2.0) 3.0 3.5 3.7 5.0 5.2 6.0 6.1" -DWITH_FFMPEG=OFF -DCMAKE_CXX_FLAGS=-D_FORCE_INLINES -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME ..
After that, I ran:
make -j 128
Then the build failed with:
[ 27%] Built target pch_Generate_opencv_test_gpu
nvcc fatal : Could not open output file /home/gi75/i75012/env/src/opencv-2.4.13/build/modules/core/CMakeFiles/cuda_compile.dir/__/dynamicuda/src/cuda/cuda_compile_generated_matrix_operations.cu.o.NVCC-depend
CMake Error at cuda_compile_generated_matrix_operations.cu.o.cmake:208 (message):
Error generating
/home/gi75/i75012/env/src/opencv-2.4.13/build/modules/core/CMakeFiles/cuda_compile.dir/__/dynamicuda/src/cuda/./cuda_compile_generated_matrix_operations.cu.o
make[2]: *** [modules/core/CMakeFiles/cuda_compile.dir/__/dynamicuda/src/cuda/./cuda_compile_generated_matrix_operations.cu.o] Error 1
make[1]: *** [modules/core/CMakeFiles/opencv_core.dir/all] Error 2
make: *** [all] Error 2
ATLAS and OpenCV have to be built with CUDA. So do I have to use the interactive job queue?
This comes down to the same issue.
I cannot build the parallel distributed stable version:
https://github.com/rioyokotalab/caffe2/tree/3a2e09674920fa9ac124a4facd6ef90e4eea1b47
However, I can build the version below:
commit c59f291
Author: Yangqing Jia [email protected]
Date: Thu Aug 17 00:03:53 2017 -0700
Adios CNMEM. You will be remembered.
Summary:
As part of the cuda 9 move we have decided to deprecate the cnmem path
as it seems to be superceded by cub if one needs a memory pool.
Closes https://github.com/caffe2/caffe2/pull/1104
Differential Revision: D5647672
Pulled By: Yangqing
fbshipit-source-id: 988af5bf63e24efa1b631fd91ddb58e798ffc5c6
ldd make_image_db
linux-vdso.so.1 => (0x00007ffc179c3000)
libCaffe2_CPU.so => not found
libCaffe2_GPU.so => not found
libprotobuf.so.8 => /usr/lib64/libprotobuf.so.8 (0x00007f253d98e000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f253d772000)
libglog.so.0 => /lustre/gi75/i75012/env/local/glog-0.3.4/lib/libglog.so.0 (0x00007f253d542000)
libgflags.so.2.2 => /lustre/gi75/i75012/env/local/gflags-2.2.0/lib/libgflags.so.2.2 (0x00007f253d322000)
liblmdb.so => /lustre/gi75/i75012/env/local/lmdb-LMDB_0.9.18/lib/liblmdb.so (0x00007f253d10d000)
libhiredis.so.0.13 => /lustre/gi75/i75012/env/local/hiredis/lib/libhiredis.so.0.13 (0x00007f253cefb000)
libopencv_core.so.2.4 => /lustre/gi75/i75012/env/local/opencv-2.4.13/lib/libopencv_core.so.2.4 (0x00007f253ca51000)
libopencv_highgui.so.2.4 => /lustre/gi75/i75012/env/local/opencv-2.4.13/lib/libopencv_highgui.so.2.4 (0x00007f253c68f000)
libopencv_imgproc.so.2.4 => /lustre/gi75/i75012/env/local/opencv-2.4.13/lib/libopencv_imgproc.so.2.4 (0x00007f253c19e000)
libmpicxx.so.12 => /lustre/app/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib/libmpicxx.so.12 (0x00007f253bf7e000)
libmpifort.so.12 => /lustre/app/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib/libmpifort.so.12 (0x00007f253bbd5000)
libmpi.so.12 => /lustre/app/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib/libmpi.so.12 (0x00007f253aec4000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f253acc0000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00007f253aab8000)
libcudart.so.8.0 => /lustre/app/acc/cuda/8.0.44/lib64/libcudart.so.8.0 (0x00007f253a851000)
libcurand.so.8.0 => /lustre/app/acc/cuda/8.0.44/lib64/libcurand.so.8.0 (0x00007f25368e8000)
libcublas.so.8.0 => /lustre/app/acc/cuda/8.0.44/lib64/libcublas.so.8.0 (0x00007f2533f38000)
libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f2533541000)
libnvrtc.so.8.0 => /lustre/app/acc/cuda/8.0.44/lib64/libnvrtc.so.8.0 (0x00007f2532124000)
libcudnn.so.6 => /lustre/gi75/i75012/env/local/cuda/lib/libcudnn.so.6 (0x00007f2528bc2000)
libnccl.so.1 => /lustre/gi75/i75012/env/local/nccl-1.3.4-1/lib/libnccl.so.1 (0x00007f2526566000)
libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f2526350000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f2526048000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00007f2525d45000)
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007f2525b1f000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f252575e000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00007f2525547000)
/lib64/ld-linux-x86-64.so.2 (0x00007f253dcc2000)
libpng15.so.15 => /usr/lib64/libpng15.so.15 (0x00007f252531c000)
libtiff.so.5 => /usr/lib64/libtiff.so.5 (0x00007f25250a7000)
libgthread-2.0.so.0 => /usr/lib64/libgthread-2.0.so.0 (0x00007f2524ea5000)
libglib-2.0.so.0 => /usr/lib64/libglib-2.0.so.0 (0x00007f2524b6e000)
libnvidia-fatbinaryloader.so.375.20 => /usr/lib64/libnvidia-fatbinaryloader.so.375.20 (0x00007f2524921000)
libjbig.so.2.0 => /usr/lib64/libjbig.so.2.0 (0x00007f2524714000)
libjpeg.so.62 => /usr/lib64/libjpeg.so.62 (0x00007f25244bf000)
I tried various things following slayton58@e415b74.
I trained caffe2/caffe2/python/examples/resnet50_trainer.py with fp16 on P100s, modifying the script as follows:
- add output_type='float16' to the brew.image_input arguments
- using caffe2.python.modeling.initializers.pFP16Initializer, add the initializers to the brew.conv arguments:
WeightInitializer=pFP16Initializer,
BiasInitializer=pFP16Initializer,
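Putting those changes together, the relevant portion of the model-building code looks roughly like this. It is a sketch against the Caffe2 API of that era, not a drop-in patch; the batch size and conv hyperparameters are illustrative assumptions, and the surrounding trainer code is omitted:

```python
from caffe2.python import brew
from caffe2.python.modeling.initializers import pFP16Initializer

def add_fp16_input_and_first_conv(model, reader):
    # Decode the input pipeline directly to float16 tensors.
    data, label = brew.image_input(
        model, reader, ["data", "label"],
        batch_size=32,                # illustrative value
        output_type='float16',
    )
    # Create the conv weights/bias as fp16 blobs via pFP16Initializer.
    conv1 = brew.conv(
        model, data, "conv1", 3, 64,  # illustrative dims
        kernel=7, stride=2, pad=3,
        WeightInitializer=pFP16Initializer,
        BiasInitializer=pFP16Initializer,
    )
    return conv1
```

This fragment only configures the model-builder calls and requires a Caffe2 installation to run.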
All changes are below
rioyokotalab/models@cc5f9a9
For intra-node data-parallel training on a machine with four P100s, I ran the following command:
python /path-to-examples/resnet50_trainer.py \
--train_data /path-to-ILSVRC2012-dataset/ilsvrc12_train_lmdb \
--num_gpus 4 \
--num_shards 1 \
--file_store_path . \
--image_size 224 \
--batch_size 128 \
--epoch_size 1281167 \
--num_epochs 1 \
--base_learning_rate 1.0 \
--weight_decay 0.0001 \
--num_labels=1000
INFO:resnet50_trainer:Finished iteration 91/10009 of epoch 0 (400.34 images/sec)
INFO:resnet50_trainer:Training loss: 2.21322536469, accuracy: 0.21875
*** Aborted at 1499852546 (unix time) try "date -d @1499852546" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x58) received by PID 102556 (TID 0x3aff01fff1d0) from PID 88; stack trace: ***
@ 0x3fffa05b0478 ([vdso]+0x477)
@ 0x3fff8dbbe268 (unknown)
@ 0x3fff8dd0ff50 (unknown)
@ 0x3fff8dc73a80 (unknown)
@ 0x3fff8dc7502c (unknown)
@ 0x3fff8dc753bc (unknown)
@ 0x3fff8db997a0 (unknown)
@ 0x3fff8da90ccc (unknown)
@ 0x3fff8dc14310 cuStreamSynchronize
@ 0x3fff9483d120 (unknown)
@ 0x3fff9488d808 cudaStreamSynchronize
@ 0x3fff96dbe440 caffe2::CUDAContext::FinishDeviceComputation()
@ 0x3fff96dbe8a0 caffe2::Operator<>::Run()
@ 0x3fff9678bff4 caffe2::DAGNet::RunAt()
@ 0x3fff96787c98 caffe2::DAGNetBase::WorkerFunction()
@ 0x3fff9678c2a4 std::thread::_Impl<>::_M_run()
@ 0x3fff79abbdd4 (unknown)
@ 0x3fffa0558728 start_thread
@ 0x3fffa034d210 __clone
Segmentation fault
name | description |
---|---|
OS | Red Hat Enterprise Linux Server release 7.3 (Maipo) |
CPU | POWER8NVL revision : 1.0 (pvr 004c 0100) ×8 |
GCC Compiler | gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11) |
GPU | Tesla P100 GPUs × 4 |
nvcc | release 8.0, V8.0.61 |
cuDNN | v6.0 (April 27, 2017), for CUDA 8.0 |
pip install numpy
pip install future
pip install protobuf
CMAKE_PREFIX_PATH=/home/hiroki11x/env/local/opencv-2.4.13:/home/hiroki11x/env/local/snappy-1.1.4 cmake .. \
-DBLAS=Eigen \
-DUSE_CUDA=ON \
-DUSE_ROCKSDB=OFF \
-DUSE_GLOO=ON \
-DUSE_REDIS=OFF \
-DUSE_OPENCV=ON \
-DUSE_GFLAGS=OFF \
-DUSE_MPI=OFF \
-DCUDNN_INCLUDE_DIR=/home/hiroki11x/env/local/cudnn7/cuda/include \
-DCUDNN_LIBRARY=/home/hiroki11x/env/local/cudnn7/cuda/lib/libcudnn.so \
-DCMAKE_INSTALL_PREFIX=/home/hiroki11x/dl/caffe2/local
If you want to use MPI, append the following options:
-DMPI_C_COMPILER=/opt/intel/compilers_and_libraries_2017.1.132/linux/mpi/mic/bin/mpicc \
-DMPI_CXX_COMPILER=/opt/intel/compilers_and_libraries_2017.1.132/linux/mpi/mic/bin/mpicxx \
Then I ran cmake, and it failed as follows (the failing -lpthreads probe is part of CMake's FindThreads check and can be harmless by itself):
Run Build Command:"/usr/bin/gmake" "cmTC_8b05e/fast"
/usr/bin/gmake -f CMakeFiles/cmTC_8b05e.dir/build.make CMakeFiles/cmTC_8b05e.dir/build
gmake[1]: Entering directory `/home/hiroki11x/dl/caffe2/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_8b05e.dir/CheckFunctionExists.c.o
/usr/bin/cc -DCHECK_FUNCTION_EXISTS=pthread_create -o CMakeFiles/cmTC_8b05e.dir/CheckFunctionExists.c.o -c /home/hiroki11x/env/src/cmake-3.4.0-rc3/Modules/CheckFunctionExists.c
Linking C executable cmTC_8b05e
/home/hiroki11x/env/src/cmake-3.4.0-rc3/bin/cmake -E cmake_link_script CMakeFiles/cmTC_8b05e.dir/link.txt --verbose=1
/usr/bin/cc -DCHECK_FUNCTION_EXISTS=pthread_create CMakeFiles/cmTC_8b05e.dir/CheckFunctionExists.c.o -o cmTC_8b05e -rdynamic -lpthreads
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
gmake[1]: *** [cmTC_8b05e] Error 1
gmake[1]: Leaving directory `/home/hiroki11x/dl/caffe2/build/CMakeFiles/CMakeTmp'
gmake: *** [cmTC_8b05e/fast] Error 2
I installed libtool from source. However, the following error occurred:
@warsaw:~/env/src/protobuf-3.3.0$ ./autogen.sh
~~
configure.ac:30: error: possibly undefined macro: AC_PROG_LIBTOOL
If this token and others are legitimate, please use m4_pattern_allow.
See the Autoconf documentation.
autoreconf: /usr/bin/autoconf failed with exit status: 1
Is it necessary to yum install libtool?
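Most likely yes: AC_PROG_LIBTOOL is defined in libtool.m4, which autoreconf can only see when libtool is installed where aclocal looks. On RHEL-family systems a package install usually resolves it (the package names below are the stock ones; adjust for your environment):

```shell
# Install the autotools stack system-wide so aclocal can find libtool.m4
sudo yum install -y autoconf automake libtool

# Re-run the protobuf bootstrap
cd ~/env/src/protobuf-3.3.0
./autogen.sh
```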
I'd like to profile Caffe2 distributed ResNet-50 training. First of all, I have to measure how much faster it is; then I will compare the results. Before that, I should survey existing benchmarks of ResNet-50 training.
CMAKE_INSTALL_PREFIX=/path-to-install cmake -DUSE_REDIS=ON ..
hoge/caffe2/caffe2/operators/resize_op.cu(63): error: identifier "__ldg" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000477b_00000000-20_resize_op.compute_20.cpp1.ii".
CMake Error at Caffe2_GPU_generated_resize_op.cu.o.cmake:260 (message):
Error generating file
hoge/caffe2/build/caffe2/CMakeFiles/Caffe2_GPU.dir/operators/./Caffe2_GPU_generated_resize_op.cu.o
make[2]: *** [caffe2/CMakeFiles/Caffe2_GPU.dir/operators/./Caffe2_GPU_generated_resize_op.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [caffe2/CMakeFiles/Caffe2_GPU.dir/all] Error 2
make: *** [all] Error 2
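__ldg() only exists on GPUs of compute capability 3.5 and above, and the temporary file name (compute_20) shows the kernel was being compiled for sm_20. If this revision's CMake exposes the CUDA_ARCH_NAME option (an assumption; check cmake/Cuda.cmake), limiting code generation to the actual GPU architecture should avoid the error:

```shell
# Build only Pascal (sm_60) kernels for the P100, skipping the sm_20 pass
cmake .. -DCUDA_ARCH_NAME=Pascal
```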
wget http://www.netlib.org/lapack/lapack-3.6.1.tgz
wget https://sourceforge.net/projects/math-atlas/files/Stable/3.10.3/atlas3.10.3.tar.bz2
tar xjvf atlas3.10.3.tar.bz2
cd ATLAS
mkdir build
cd build
../configure -b 64 --prefix=$LOCAL_DIR/ATLAS --shared --with-netlib-lapack-tarfile=../../lapack-3.6.1.tgz
make -j $J
make install
make -j 32
the following error occurred:
make[10]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[9]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[8]: *** [tstlib.grd] Error 2
make[8]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
TST: make drottest urout=rot1_x1y1.c opt=""
make[8]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
cd /home/gi75/i75012/env/src/ATLAS/build/src/testing ; make lib
make[8]: *** read jobs pipe EOF. Stop.
make[8]: *** Waiting for unfinished jobs....
make[9]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make -j 28 dlib.grd
make[9]: *** read jobs pipe EOF. Stop.
make[9]: *** Waiting for unfinished jobs....
make[10]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[10]: warning: -jN forced in submake: disabling jobserver mode.
make[10]: `dlib.grd' is up to date.
make[10]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[9]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[8]: *** [tstlib.grd] Error 2
make[8]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
TST: make drottest urout=rot4_x1y1.c opt=""
make[8]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
cd /home/gi75/i75012/env/src/ATLAS/build/src/testing ; make lib
make[8]: *** read jobs pipe EOF. Stop.
make[8]: *** Waiting for unfinished jobs....
make[9]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make -j 28 dlib.grd
make[9]: *** read jobs pipe EOF. Stop.
make[9]: *** Waiting for unfinished jobs....
make[10]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[10]: warning: -jN forced in submake: disabling jobserver mode.
make[10]: `dlib.grd' is up to date.
make[10]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[9]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[8]: *** [tstlib.grd] Error 2
make[8]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
NO GENERAL CASE SURVIVED!! ABORTING!!
ID incX incY alpha beta ROUT
==== ==== ==== ===== ==== =============
1 0 0 2 2 rot1_x0y0.c
2 1 1 2 2 rot1_x1y1.c
3 1 1 2 2 rot4_x1y1.c
make[7]: *** [dinstall_rot] Error 255
make[7]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
make[6]: *** [Make_drot] Error 2
make[6]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/blas/level1'
make[5]: *** [dgen] Error 2
make[5]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/blas/level1'
make[4]: *** [dlib] Error 2
make[4]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/blas/level1'
make[3]: *** [lib.grd] Error 2
make[3]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/auxil'
make[2]: *** [IStage1] Error 2
make[2]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/bin'
ERROR 712 DURING CACHESIZE SEARCH!!. CHECK INSTALL_LOG/Stage1.log FOR DETAILS.
make[2]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/bin'
cd /home/gi75/i75012/env/src/ATLAS/build ; make error_report
make[3]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build'
make -f Make.top error_report
make[4]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build'
uname -a 2>&1 >> bin/INSTALL_LOG/ERROR.LOG
/usr/bin/x86_64-redhat-linux-gcc -v 2>&1 >> bin/INSTALL_LOG/ERROR.LOG
Using built-in specs.
COLLECT_GCC=/usr/bin/x86_64-redhat-linux-gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)
/usr/bin/x86_64-redhat-linux-gcc -V 2>&1 >> bin/INSTALL_LOG/ERROR.LOG
x86_64-redhat-linux-gcc: error: unrecognized command line option ‘-V’
x86_64-redhat-linux-gcc: fatal error: no input files
compilation terminated.
make[4]: [error_report] Error 4 (ignored)
/usr/bin/x86_64-redhat-linux-gcc --version 2>&1 >> bin/INSTALL_LOG/ERROR.LOG
tar cf error_UNKNOWNx8664AVXMAC.tar Make.inc bin/INSTALL_LOG/*
bzip2 error_UNKNOWNx8664AVXMAC.tar
make[4]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build'
make[3]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build'
make[2]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/bin'
Error report error_<ARCH>.tgz has been created in your top-level ATLAS
directory. Be sure to include this file in any help request.
cat: ../../CONFIG/error.txt: No such file or directory
cat: ../../CONFIG/error.txt: No such file or directory
make[1]: *** [build] Error 255
make[1]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build'
make: *** [build] Error 2
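The repeated `read jobs pipe EOF` failures above are characteristic of passing -j to ATLAS's top-level make: ATLAS schedules its own build jobs, and its install notes advise a plain serial make. Under that assumption, a retry would look like:

```shell
cd ATLAS/build
make          # no -j: ATLAS manages its own parallelism internally
make check    # optional sanity tests before installing
make install
```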
<class 'caffe2.python.core.Net'>
{}
<class 'caffe2.python.core.Net'>
{}
<class 'caffe2.python.core.Net'>
{}
<class 'caffe2.python.core.Net'>
{}
INFO:resnet50_trainer:Finished iteration 1/10009 of epoch 0 (25.41 images/sec)
INFO:resnet50_trainer:Training loss: 7.38396549225, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 2/10009 of epoch 0 (492.02 images/sec)
INFO:resnet50_trainer:Training loss: 190.478805542, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 3/10009 of epoch 0 (550.15 images/sec)
INFO:resnet50_trainer:Training loss: 723.197265625, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 4/10009 of epoch 0 (543.48 images/sec)
INFO:resnet50_trainer:Training loss: 704.564941406, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 5/10009 of epoch 0 (559.24 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 6/10009 of epoch 0 (550.31 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 7/10009 of epoch 0 (545.42 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 8/10009 of epoch 0 (569.45 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 9/10009 of epoch 0 (568.98 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 10/10009 of epoch 0 (543.75 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 11/10009 of epoch 0 (550.41 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
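The jump from triple-digit losses to nan is consistent with numeric overflow: float16 can only represent magnitudes up to 65504, so once activations or the loss exceed that range they become inf, and subsequent arithmetic turns inf into nan. (A base learning rate of 1.0 with no warmup also makes such blow-ups likely regardless of precision.) A minimal numpy demonstration of the fp16 failure mode:

```python
import numpy as np

# float16 tops out at 65504; anything larger overflows to inf.
print(np.finfo(np.float16).max)   # 65504.0

loss = np.float16(70000.0)        # beyond the representable range
print(loss)                       # inf

# inf - inf is undefined, so gradient-style arithmetic yields nan.
print(loss - loss)                # nan
```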
INFO:resnet50_trainer:Finished iteration 2501/2502 of epoch 0 (79.03 images/sec)
INFO:resnet50_trainer:Training loss: 0.432902753353, accuracy: 0.875
INFO:resnet50_trainer:Finished iteration 2502/2502 of epoch 0 (79.26 images/sec)
INFO:resnet50_trainer:Training loss: 0.462416082621, accuracy: 0.8125
Traceback (most recent call last):
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 388, in Train
explog
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 156, in RunEpoch
learning_rate = workspace.FetchBlob(prefix + '/conv1_w_lr')
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 323, in FetchBlob
return C.fetch_blob(StringifyBlobName(name))
RuntimeError: [enforce fail at pybind_state.cc:152] ws->HasBlob(name). Can't find blob: gpu_0/conv1_w_lr
I found this issue. I think it is caused by a version difference in Caffe2's resnet50_trainer.py. The same issue:
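The enforce failure comes from fetching a learning-rate blob by a hard-coded name ('gpu_0/conv1_w_lr') that this Caffe2 revision apparently no longer creates. Caffe2's workspace module does expose a HasBlob() check; the defensive pattern is sketched below, where fetch_blob_safe is a hypothetical helper and a plain dict stands in for the workspace so the example is self-contained:

```python
def fetch_blob_safe(ws, name, default=None):
    """Hypothetical helper: return ws[name] if the blob exists, else default.

    With the real Caffe2 API, the membership test would be
    workspace.HasBlob(name) and the fetch workspace.FetchBlob(name).
    """
    if name in ws:
        return ws[name]
    return default

# A dict standing in for the Caffe2 workspace.
ws = {"gpu_0/conv1_w": [0.1, 0.2]}
print(fetch_blob_safe(ws, "gpu_0/conv1_w_lr", default="<missing>"))  # <missing>
```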
https://github.com/rioyokotalab/caffe2/wiki/Caffe2-build-on-ReedBush
I tried
make install -j 128
and the link failed with protobuf-related undefined references (typically a sign that the protobuf headers and the linked libprotobuf come from different versions):
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::WireFormat::ReadPackedEnumPreserveUnknowns(google::protobuf::io::CodedInputStream*, unsigned int, bool (*)(int), google::protobuf::UnknownFieldSet*, google::protobuf::RepeatedField<int>*)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::IncrementRecursionDepthAndPushLimit(int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::WireFormatLite::Int32Size(google::protobuf::RepeatedField<int> const&)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::ReadVarint32Fallback(unsigned int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteBytesMaybeAliased(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedOutputStream::WriteVarint64SlowPath(unsigned long)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::RegisterAllTypes(google::protobuf::Metadata const*, int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedOutputStream::WriteVarint32SlowPath(unsigned int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::InitProtobufDefaults()'
libCaffe2_CPU.so: undefined reference to `google::protobuf::Message::SpaceUsedLong() const'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteDoubleArray(double const*, int, google::protobuf::io::CodedOutputStream*)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::BytesUntilTotalBytesLimit() const'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::ReadVarintSizeAsIntFallback()'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::ReadTagFallback(unsigned int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::RepeatedPtrFieldBase::InternalExtend(int)'
collect2: error: ld returned 1 exit status
make[2]: *** [caffe2/binaries/blob_test] Error 1
make[1]: *** [caffe2/CMakeFiles/blob_test.dir/all] Error 2
Linking CXX shared module python/caffe2_pybind11_state_gpu.so
Linking CXX shared module python/caffe2_pybind11_state.so
[100%] Built target caffe2_pybind11_state_gpu
[100%] Built target caffe2_pybind11_state
make: *** [all] Error 2
$ pyenv --version
pyenv 1.1.3
$ pyenv versions
system
* 2.7.10 (set by /lustre/gi75/i75012/env/src/pyenv/version)
3.4.3
3.5.0
So I'll try switching to Python 3, then:
$ pip install protobuf
I installed Redis to /path-to-redis/ following these instructions:
https://github.com/kurosawatsuyoshi/doshelper/wiki/1.-redis-Setup%EF%BC%88redis%E3%81%AE%E3%82%BB%E3%83%83%E3%83%88%E3%82%A2%E3%83%83%E3%83%97%EF%BC%89
redis-server
[80640] 17 Jul 22:37:55.764 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
[80640] 17 Jul 22:37:55.765 * Increased maximum number of open files to 10032 (it was originally set to 4096).
[80640] 17 Jul 22:37:55.765 # Creating Server TCP listening socket *:6379: bind: Address already in use
https://redis.io/topics/quickstart
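The bind error on *:6379 simply means another Redis server is already listening there. A quick way to confirm (assuming redis-cli is on the PATH):

```shell
redis-cli -p 6379 ping   # replies PONG if a server is already running
```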
This can be ignored. Next, I installed hiredis (a C client library for Redis):
$ wget -O hiredis.zip https://github.com/redis/hiredis/archive/master.zip
$ unzip hiredis.zip
$ cd hiredis-master/
You should edit the Makefile as follows:
17 # Installation related variables and target
18 PREFIX=/path-to/local/hiredis
19 INCLUDE_PATH=include/hiredis
20 LIBRARY_PATH=lib
then execute
$ make
$ sudo make install
After that, rebuild Caffe2:
CMAKE_PREFIX_PATH=/path-to/opencv-2.4.13:/path-to/snappy_1.1.4:/path-to/redis-2.8.12 cmake .. \
-DBLAS=Eigen \
-DUSE_CUDA=ON \
-DUSE_ROCKSDB=OFF \
-DUSE_GLOO=ON \
-DUSE_REDIS=ON \
-DUSE_OPENCV=ON \
-DUSE_GFLAGS=OFF \
-DCUDNN_INCLUDE_DIR=/path-to/cuda/include \
-DCUDNN_LIBRARY=/path-to/cuda/lib/libcudnn.so \
-DCMAKE_INSTALL_PREFIX=/path-to/caffe2/local \
-DMPI_C_COMPILER=/path-to/openmpi-2.0.1/xl/bin/mpicc \
-DMPI_CXX_COMPILER=/path-to/openmpi-2.0.1/xl/bin/mpicxx
Console output:
-- ******** Summary ********
-- General:
-- Git version :
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- C++ compiler version : 4.8.5
-- Protobuf compiler : /usr/bin/protoc
-- CXX flags : -fopenmp -std=c++11 -fPIC -Wno-narrowing
-- Build type : Release
-- Compile definitions : CAFFE2_USE_EIGEN_FOR_BLAS;CAFFE2_USE_GOOGLE_GLOG;EIGEN_MPL2_ONLY;CAFFE2_FORCE_FALLBACK_CUDA_MPI;CAFFE2_NO_BUILTIN_CPU_SUPPORTS
--
-- BUILD_SHARED_LIBS : ON
-- BUILD_PYTHON : ON
-- Python version : 2.7.5
-- Python library : /usr/lib64/libpython2.7.so
-- BUILD_TEST : ON
-- USE_CUDA : ON
-- CUDA version : 8.0
-- USE_CNMEM : OFF
-- USE_NERVANA_GPU : OFF
-- USE_GLOG : ON
-- USE_GFLAGS : OFF
-- USE_LMDB : ON
-- LMDB version : 0.9.18
-- USE_LEVELDB : ON
-- LevelDB version : 1.20
-- Snappy version : 1.1.4
-- USE_OPENCV : ON
-- OpenCV version : 2.4.13
-- USE_FFMPEG :
-- USE_ZMQ : OFF
-- USE_ROCKSDB : OFF
-- USE_MPI : ON
-- USE_NCCL : ON
-- USE_NNPACK : OFF
-- USE_OPENMP : ON
-- USE_REDIS : ON
-- USE_GLOO : ON
-- Configuring done
-- Generating done
caffe2
$ python -m caffe2.python.operator_test.relu_op_test
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: libCaffe2_CPU.so: cannot open shared object file: No such file or directory
CRITICAL:root:Cannot load caffe2.python. Error: libCaffe2_CPU.so: cannot open shared object file: No such file or directory
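The relu_op_test failure is a runtime linking problem, not a build problem: Python finds the caffe2 package, but the loader cannot locate libCaffe2_CPU.so. Pointing the loader and Python at the install prefix usually fixes it (the path below matches the -DCMAKE_INSTALL_PREFIX used earlier and is otherwise an assumption):

```shell
export LD_LIBRARY_PATH=/path-to/caffe2/local/lib:$LD_LIBRARY_PATH
export PYTHONPATH=/path-to/caffe2/local:$PYTHONPATH
```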