
rioyokotalab / caffe2


This project forked from facebookarchive/caffe2

Size: 148.35 MB

Caffe2 is a lightweight, modular, and scalable deep learning framework.

Home Page: https://caffe2.ai

License: Other

Shell 0.25% CMake 1.12% Makefile 0.01% Protocol Buffer 0.64% C++ 29.46% C 2.56% Python 18.53% Metal 0.54% Objective-C++ 4.00% Objective-C 0.18% Cuda 4.05% CSS 0.02% HTML 0.04% Jupyter Notebook 38.53% Batchfile 0.05%

caffe2's People

Contributors

aazzolini, andrewwdye, bddppq, benzyx, boryiingsu, bwasti, chocjy, ezineo, harouwu, jackielxu, jeffdonahue, jhcross, kennyhorror, kittipatv, kmatzen, lukeyeager, meyering, pietern, prigoyal, salexspb, sf-wind, shen-pan, slayton58, tomdz, urikz, viswanathgs, volkhin, wutao27, xianjiec, yangqing


caffe2's Issues

fp16 training problem

I tried to train ResNet-50 in fp16 at commit 0f72d25 and got the following:

INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 125/128 of epoch 0 (201.81 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 126/128 of epoch 0 (202.05 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 127/128 of epoch 0 (201.43 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 128/128 of epoch 0 (201.80 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
Traceback (most recent call last):
  File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 500, in <module>
    main()
  File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 496, in main
    Train(args)
  File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 421, in Train
    explog
  File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 188, in RunEpoch
    assert loss < 40, "Exploded gradients :("
AssertionError: Exploded gradients :(

This experiment was executed with a 10-category dataset, so I used --num_labels 10.
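A NaN training loss in fp16 usually means gradients overflowed or underflowed the half-precision range, which is exactly what the "Exploded gradients" assertion guards against. A common mitigation is loss scaling; whether this revision of resnet50_trainer exposes a flag for it is not shown here, so the following is a general stdlib-only sketch of the idea, not the trainer's actual code:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE half precision (binary16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Small gradients underflow to zero in fp16:
tiny_grad = 1e-8
print(to_fp16(tiny_grad))        # 0.0 -- the gradient is lost

# Loss scaling: multiply the loss by S before backprop so gradients stay
# representable in fp16, then divide by S (in full precision) at update time.
S = 1024.0
scaled = to_fp16(tiny_grad * S)  # representable after scaling
recovered = scaled / S           # unscale in full precision
print(recovered)                 # close to the original 1e-8
```

Scaling preserves small gradients; overflow in the other direction (loss exceeding the fp16 maximum of 65504) is what turns the running loss into inf and then NaN.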

AttributeError: Method ImageInput is not a registered operator

$ python resnet50_trainer.py --train_data /path-to/ilsvrc12_train_lmdb
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 1500000
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
Traceback (most recent call last):
  File "resnet50_trainer.py", line 490, in <module>
    main()
  File "resnet50_trainer.py", line 486, in main
    Train(args)
  File "resnet50_trainer.py", line 339, in Train
    optimize_gradient_memory=True,
  File "/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py", line 24, in Parallelize_GPU
    Parallelize(*args, **kwargs)
  File "/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py", line 142, in Parallelize
    input_builder_fun(model_helper_obj)
  File "resnet50_trainer.py", line 328, in add_image_input
    img_size=args.image_size,
  File "resnet50_trainer.py", line 61, in AddImageInput
    mirror=1
  File "/home/hiroki11/caffe2/build/caffe2/python/brew.py", line 104, in scope_wrapper
    return func(*args, **new_kwargs)
  File "/home/hiroki11/caffe2/build/caffe2/python/helpers/tools.py", line 21, in image_input
    data, label = model.net.ImageInput(
  File "/home/hiroki11/caffe2/build/caffe2/python/core.py", line 1840, in __getattr__
    ",".join(workspace.C.nearby_opnames(op_type)) + ']'
AttributeError: Method ImageInput is not a registered operator. Did you mean: []
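This error usually indicates that the ImageInput operator was not compiled into this build; ImageInput appears to require the image-input dependencies (OpenCV and a database backend such as lmdb) to be present at build time, so rebuilding with those available typically registers it. The error itself comes from a __getattr__ dispatch into the operator registry; here is a pure-Python sketch of that pattern (a hypothetical Net class for illustration, not Caffe2's real internals):

```python
class Net:
    """Illustration of the registry-dispatch pattern behind the error;
    names here are hypothetical, not Caffe2's actual implementation."""

    def __init__(self, registered_ops):
        self._ops = registered_ops

    def __getattr__(self, op_type):
        # Called only for attributes not found normally, i.e. operator names.
        if op_type not in self._ops:
            raise AttributeError(
                "Method %s is not a registered operator." % op_type)
        def run(*inputs):
            return self._ops[op_type](*inputs)
        return run

# A build without image support simply never registers ImageInput:
net = Net({"Relu": lambda x: max(x, 0.0)})
print(net.Relu(-3.0))  # 0.0
try:
    net.ImageInput()
except AttributeError as e:
    print(e)  # Method ImageInput is not a registered operator.
```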

[warsaw] Caffe2 setup

I'd like to install protobuf, which is a dependency:

cd $SRC_DIR
wget https://github.com/google/protobuf/archive/v3.3.0.tar.gz
tar zxvf v3.3.0.tar.gz
cd protobuf-3.3.0
./autogen.sh
./configure --prefix=$LOCAL_DIR/protobuf-3.3.0
make -j 64
make install

When I run

./autogen.sh

an error is displayed if autoconf is not installed, so I built autoconf from source:

cd $SRC_DIR
wget http://ftp.gnu.org/gnu/autoconf/autoconf-2.68.tar.gz
tar zxvf autoconf-2.68.tar.gz
cd autoconf-2.68
./configure --prefix=$LOCAL_DIR/autoconf-2.68
make -j 64
make check -j 64
make install -j 64

This installs autoconf to $LOCAL_DIR/autoconf-2.68.

Edit ~/.bashrc and add the following lines:

# For autoconf
export PATH=$LOCAL_DIR/autoconf-2.68:$PATH
export PATH=$LOCAL_DIR/autoconf-2.68/bin:$PATH

Then I tried

./autogen.sh

again, and the following error occurred:

+ autoreconf -f -i -Wall,no-obsolete
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Can't exec "aclocal": Permission denied at $LOCAL_DIR/autoconf-2.68/share/autoconf/Autom4te/FileUtils.pm line 326.
autoreconf: failed to run aclocal: Permission denied

aclocal is part of automake, isn't it?
I have not reinstalled automake yet; does the "Permission denied" error mean I need to reinstall it?

cf. http://blog.csdn.net/ldl22847/article/details/8572406
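aclocal is indeed installed by automake, not autoconf, so autoreconf did locate a file named aclocal; "Permission denied" means the file it found is not executable by the current user (a stale install or filesystem permission problem is a common cause, and more likely than needing a full automake reinstall). A small stdlib sketch of that distinction, using a hypothetical diagnose helper:

```python
import os
import stat
import tempfile

def diagnose(tool, path_dirs):
    """Report how a tool would resolve on the given PATH directories:
    missing entirely, present but not executable, or usable."""
    for d in path_dirs:
        candidate = os.path.join(d, tool)
        if os.path.exists(candidate):
            if os.access(candidate, os.X_OK):
                return "ok: " + candidate
            # This is the case that surfaces as "Permission denied".
            return "found but not executable: " + candidate
    return "not found on PATH"

# Simulate a PATH entry containing a non-executable aclocal:
d = tempfile.mkdtemp()
bad = os.path.join(d, "aclocal")
open(bad, "w").close()
os.chmod(bad, stat.S_IRUSR | stat.S_IWUSR)              # no execute bit
print(diagnose("aclocal", [d]))   # found but not executable: ...
os.chmod(bad, stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR)
print(diagnose("aclocal", [d]))   # ok: ...
```

The shell equivalent of the same check is `ls -l $(which aclocal)` to inspect the execute bits on whichever aclocal autoreconf is picking up.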

Tips for Caffe2 ResNet50 Distributed Training

Run the script below:

#!/bin/bash
for i in {0..3}
do


bsub \
-e error_file.log \
-o output_file.log \
-R "rusage[ngpus_shared=4]" \
-q excl \
python ${CAFFE2_HOME}/caffe2/python/examples/resnet50_trainer.py \
--train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb \
--gpus 0,1,2,3 \
--batch_size 128 \
--num_labels 10 \
--epoch_size 10240 \
--num_epochs 10 \
--num_shards 4 \
--shard_id $i \
--redis_host XXXXXX --redis_port 6379

done

The jobs failed with:
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 236, in Train
prefix=args.run_id,
File "/path-to/caffe2/build/caffe2/python/core.py", line 324, in CreateOperator
operator.arg.add().CopyFrom(utils.MakeArgument(key, value))
File "/path-to/caffe2/build/caffe2/python/utils.py", line 128, in MakeArgument
key, value, type(value)
ValueError: Unknown argument type: key=prefix value=None, value type=<type 'NoneType'>
(the same traceback repeats verbatim for the remaining shards)
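The ValueError means args.run_id was None when it was passed as the prefix argument: utils.MakeArgument only knows how to serialize concrete value types, so NoneType fails. A pure-Python stand-in for that behavior (illustrative, not the real MakeArgument):

```python
def make_argument(key, value):
    """Illustrative stand-in for caffe2.python.utils.MakeArgument:
    serialize only known types; anything else (including None) fails."""
    if isinstance(value, bool):        # bool before int: bool is an int subclass
        return (key, "i", int(value))
    if isinstance(value, int):
        return (key, "i", value)
    if isinstance(value, float):
        return (key, "f", value)
    if isinstance(value, (str, bytes)):
        return (key, "s", value)
    raise ValueError(
        "Unknown argument type: key=%s value=%s, value type=%s"
        % (key, value, type(value)))

print(make_argument("prefix", "run_0"))   # ('prefix', 's', 'run_0')
try:
    make_argument("prefix", None)          # what happens when run_id is unset
except ValueError as e:
    print(e)
```

The practical fix is to give every shard the same explicit run identifier (e.g. --run_id 0, assuming this revision of the trainer exposes that flag) so that args.run_id is never None.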
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
INFO:data_parallel_model:Parallelizing model for devices: [0, 1, 2, 3]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Model for GPU : 2
INFO:data_parallel_model:Model for GPU : 3
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed computed params all-reduce not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.335288047791 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.353416204453 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.340279817581 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.345367908478 secs
E0719 03:57:34.776022 145363 common_world_ops.h:75] Caught store handler timeout exception: [/path-to/caffe2/caffe2/distributed/file_store_handler.cc:132] Wait timeout for name(s): allreduce_0_cw_op/1/0
E0719 03:57:34.777902 145363 net.cc:145] Operator failed: input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0719 03:57:34.778396 145363 workspace.cc:217] Error when running network resnet50_init
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 350, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/path-to/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
StringifyProto(net),
File "/path-to/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at pybind_state.cc:862] gWorkspace->RunNetOnce(def).
Sender: LSF System <[email protected]>
Subject: Job 327128: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c110.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   0.99 sec.
Max Memory :                                 29 MB
Average Memory :                             29.00 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   4 sec.
Turnaround time :                            5 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.


PS:

Read file <error_file.log> for stderr output of this job.

Sender: LSF System <[email protected]>
Subject: Job 327130: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c055.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   0.99 sec.
Max Memory :                                 29 MB
Average Memory :                             29.00 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   3 sec.
Turnaround time :                            6 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.


PS:

Read file <error_file.log> for stderr output of this job.

Sender: LSF System <[email protected]>
Subject: Job 327129: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c041.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   0.99 sec.
Max Memory :                                 29 MB
Average Memory :                             1.00 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   3 sec.
Turnaround time :                            6 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.


PS:

Read file <error_file.log> for stderr output of this job.

Sender: LSF System <[email protected]>
Subject: Job 327131: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c110.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   0.90 sec.
Max Memory :                                 36 MB
Average Memory :                             36.00 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   2 sec.
Turnaround time :                            9 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.


PS:

Read file <error_file.log> for stderr output of this job.

Sender: LSF System <[email protected]>
Subject: Job 327134: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c143.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   9.58 sec.
Max Memory :                                 325 MB
Average Memory :                             249.67 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                11
Run time :                                   44 sec.
Turnaround time :                            44 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback for operator 1069 in network resnet50_init
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:919
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:970
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:983
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:881
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:221
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:462


PS:

Read file <error_file.log> for stderr output of this job.

Distributed Multinode Training Error

Rank 0
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 1281024
INFO:data_parallel_model:Parallelizing model for devices: [0, 1, 2, 3]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Model for GPU : 2
INFO:data_parallel_model:Model for GPU : 3
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed computed params all-reduce not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() ----- 
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.382014989853 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.402288913727 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.385862827301 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.391633987427 secs
E0804 00:42:11.653461 80925 common_world_ops.h:75] Caught store handler timeout exception: [/home/hiroki11/caffe2/caffe2/distributed/file_store_handler.cc:132] Wait timeout for name(s): allreduce_3_cw_op/3/0
E0804 00:42:11.657723 80925 net.cc:145] Operator failed: input: "store_handler" output: "allreduce_3_cw" name: "allreduce_3_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_3_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0804 00:42:11.658283 80925 workspace.cc:217] Error when running network resnet50_init
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback for operator 1072 in network resnet50_init
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:919
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:970
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:462
Traceback (most recent call last):
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
    main()
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
    Train(args)
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 350, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
    StringifyProto(net),
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
    raise ex
RuntimeError: [enforce fail at pybind_state.cc:862] gWorkspace->RunNetOnce(def). 
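The CreateCommonWorld timeout means this rank waited for the other shards to publish their endpoints in the shared store (file path or Redis) and they never appeared; typical causes are peer processes that crashed earlier, or shards configured with different store locations so they rendezvous in different places. A pure-Python sketch of the rendezvous-with-timeout pattern (a hypothetical Store class in the spirit of the file/redis store handlers, not Caffe2's code):

```python
import threading
import time

class Store:
    """Tiny key/value rendezvous store: ranks publish their endpoint
    under a key, and wait() blocks until all expected keys appear."""

    def __init__(self):
        self._kv = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._kv[key] = value
            self._cond.notify_all()

    def wait(self, keys, timeout):
        deadline = time.monotonic() + timeout
        with self._cond:
            while not all(k in self._kv for k in keys):
                remaining = deadline - time.monotonic()
                if remaining <= 0 or not self._cond.wait(remaining):
                    missing = [k for k in keys if k not in self._kv]
                    raise TimeoutError("Wait timeout for name(s): %s"
                                       % ", ".join(missing))

store = Store()
store.set("allreduce_3_cw_op/0/0", b"addr-of-rank-0")
try:
    # Rank 0 waits for rank 3's endpoint; that shard died, so it never comes.
    store.wait(["allreduce_3_cw_op/3/0"], timeout=0.2)
except TimeoutError as e:
    print(e)
```

With a file store, also verify that every shard resolves the store path to the same shared filesystem location; with Redis, that every shard uses the same host and port.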

CUDA 9 Update requirements

When running CMake for Caffe2, I get:

CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:108 (message):
  Could NOT find CUDA: Found unsuitable version "9.0", but required is exact
  version "8.0" (found /usr/local/cuda-9.0)
Call Stack (most recent call first):
  /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:313 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake/Modules/FindCUDA.cmake:806 (find_package_handle_standard_args)
  /home/hiroki11/env/local/opencv-2.4.13/share/OpenCV/OpenCVConfig.cmake:45 (find_package)
  /home/hiroki11/env/local/opencv-2.4.13/share/OpenCV/OpenCVConfig.cmake:242 (find_host_package)
  cmake/Dependencies.cmake:172 (find_package)
  CMakeLists.txt:73 (include)

Do I have to rebuild OpenCV?

[OpenCV] Reedbush Setup

I tried to install OpenCV by executing the following commands:

mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/home/gi75/i75012/env/local/opencv-2.4.13 -DCMAKE_BUILD_TYPE=RELEASE -DCUDA_NVCC_FLAGS='-std=c++11' -DCUDA_ARCH_BIN="2.0 2.1(2.0) 3.0 3.5 3.7 5.0 5.2 6.0 6.1" -DWITH_FFMPEG=OFF -DCMAKE_CXX_FLAGS=-D_FORCE_INLINES -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME ..

After that, I ran:

make -j 128

which produced:

[ 27%] Built target pch_Generate_opencv_test_gpu
nvcc fatal   : Could not open output file /home/gi75/i75012/env/src/opencv-2.4.13/build/modules/core/CMakeFiles/cuda_compile.dir/__/dynamicuda/src/cuda/cuda_compile_generated_matrix_operations.cu.o.NVCC-depend
CMake Error at cuda_compile_generated_matrix_operations.cu.o.cmake:208 (message):
  Error generating
  /home/gi75/i75012/env/src/opencv-2.4.13/build/modules/core/CMakeFiles/cuda_compile.dir/__/dynamicuda/src/cuda/./cuda_compile_generated_matrix_operations.cu.o


make[2]: *** [modules/core/CMakeFiles/cuda_compile.dir/__/dynamicuda/src/cuda/./cuda_compile_generated_matrix_operations.cu.o] Error 1
make[1]: *** [modules/core/CMakeFiles/opencv_core.dir/all] Error 2
make: *** [all] Error 2

[ReedBush] Caffe2 build (CMake) Error

This is related to the same issue: #18

I cannot build the parallel-distributed stable version:
https://github.com/rioyokotalab/caffe2/tree/3a2e09674920fa9ac124a4facd6ef90e4eea1b47

However, I can build the version below:

commit c59f291
Author: Yangqing Jia [email protected]
Date: Thu Aug 17 00:03:53 2017 -0700

Adios CNMEM. You will be remembered.

Summary:
As part of the cuda 9 move we have decided to deprecate the cnmem path
as it seems to be superceded by cub if one needs a memory pool.
Closes https://github.com/caffe2/caffe2/pull/1104

Differential Revision: D5647672

Pulled By: Yangqing

fbshipit-source-id: 988af5bf63e24efa1b631fd91ddb58e798ffc5c6

libCaffe2_CPU.so => not found

ldd make_image_db
	linux-vdso.so.1 =>  (0x00007ffc179c3000)
	libCaffe2_CPU.so => not found
	libCaffe2_GPU.so => not found
	libprotobuf.so.8 => /usr/lib64/libprotobuf.so.8 (0x00007f253d98e000)
	libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f253d772000)
	libglog.so.0 => /lustre/gi75/i75012/env/local/glog-0.3.4/lib/libglog.so.0 (0x00007f253d542000)
	libgflags.so.2.2 => /lustre/gi75/i75012/env/local/gflags-2.2.0/lib/libgflags.so.2.2 (0x00007f253d322000)
	liblmdb.so => /lustre/gi75/i75012/env/local/lmdb-LMDB_0.9.18/lib/liblmdb.so (0x00007f253d10d000)
	libhiredis.so.0.13 => /lustre/gi75/i75012/env/local/hiredis/lib/libhiredis.so.0.13 (0x00007f253cefb000)
	libopencv_core.so.2.4 => /lustre/gi75/i75012/env/local/opencv-2.4.13/lib/libopencv_core.so.2.4 (0x00007f253ca51000)
	libopencv_highgui.so.2.4 => /lustre/gi75/i75012/env/local/opencv-2.4.13/lib/libopencv_highgui.so.2.4 (0x00007f253c68f000)
	libopencv_imgproc.so.2.4 => /lustre/gi75/i75012/env/local/opencv-2.4.13/lib/libopencv_imgproc.so.2.4 (0x00007f253c19e000)
	libmpicxx.so.12 => /lustre/app/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib/libmpicxx.so.12 (0x00007f253bf7e000)
	libmpifort.so.12 => /lustre/app/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib/libmpifort.so.12 (0x00007f253bbd5000)
	libmpi.so.12 => /lustre/app/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib/libmpi.so.12 (0x00007f253aec4000)
	libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f253acc0000)
	librt.so.1 => /usr/lib64/librt.so.1 (0x00007f253aab8000)
	libcudart.so.8.0 => /lustre/app/acc/cuda/8.0.44/lib64/libcudart.so.8.0 (0x00007f253a851000)
	libcurand.so.8.0 => /lustre/app/acc/cuda/8.0.44/lib64/libcurand.so.8.0 (0x00007f25368e8000)
	libcublas.so.8.0 => /lustre/app/acc/cuda/8.0.44/lib64/libcublas.so.8.0 (0x00007f2533f38000)
	libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f2533541000)
	libnvrtc.so.8.0 => /lustre/app/acc/cuda/8.0.44/lib64/libnvrtc.so.8.0 (0x00007f2532124000)
	libcudnn.so.6 => /lustre/gi75/i75012/env/local/cuda/lib/libcudnn.so.6 (0x00007f2528bc2000)
	libnccl.so.1 => /lustre/gi75/i75012/env/local/nccl-1.3.4-1/lib/libnccl.so.1 (0x00007f2526566000)
	libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f2526350000)
	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f2526048000)
	libm.so.6 => /usr/lib64/libm.so.6 (0x00007f2525d45000)
	libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007f2525b1f000)
	libc.so.6 => /usr/lib64/libc.so.6 (0x00007f252575e000)
	libz.so.1 => /usr/lib64/libz.so.1 (0x00007f2525547000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f253dcc2000)
	libpng15.so.15 => /usr/lib64/libpng15.so.15 (0x00007f252531c000)
	libtiff.so.5 => /usr/lib64/libtiff.so.5 (0x00007f25250a7000)
	libgthread-2.0.so.0 => /usr/lib64/libgthread-2.0.so.0 (0x00007f2524ea5000)
	libglib-2.0.so.0 => /usr/lib64/libglib-2.0.so.0 (0x00007f2524b6e000)
	libnvidia-fatbinaryloader.so.375.20 => /usr/lib64/libnvidia-fatbinaryloader.so.375.20 (0x00007f2524921000)
	libjbig.so.2.0 => /usr/lib64/libjbig.so.2.0 (0x00007f2524714000)
	libjpeg.so.62 => /usr/lib64/libjpeg.so.62 (0x00007f25244bf000)
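libCaffe2_CPU.so is resolved through LD_LIBRARY_PATH at run time, so the fix is to add the directory containing the Caffe2 libraries to that variable. A small stdlib sketch of the check (the install prefix used here is a hypothetical example):

```python
import os

def dir_on_ld_library_path(libdir, env=None):
    """Return True if libdir appears as an entry in LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    return libdir in entries

# hypothetical install prefix, as used elsewhere on this page
libdir = "/path-to/caffe2/local/lib"
print(dir_on_ld_library_path(libdir, env={"LD_LIBRARY_PATH": "/usr/lib64"}))            # False
print(dir_on_ld_library_path(libdir, env={"LD_LIBRARY_PATH": "/usr/lib64:" + libdir}))  # True
```

If the first check prints False, `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<libdir>` before running `ldd` again.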

Aborted at xxxxxx (unix time) SIGSEGV (@0x0) received by PID xxxx (TID 0xxxxxxxx) from PID 0; stack trace

I tried various things following slayton58@e415b74.

I want to run caffe2/caffe2/python/examples/resnet50_trainer.py with fp16 using P100.

Change

  1. Edit caffe2/caffe2/python/examples/resnet50_trainer.py as follows:

add output_type='float16' to the brew.image_input arguments

  2. Also make the following changes to `caffe2/caffe2/python/models/resnet.py`:

using caffe2.python.modeling.initializers.pFP16Initializer,
add pFP16Initializer to the brew.conv arguments:

WeightInitializer=pFP16Initializer,
BiasInitializer=pFP16Initializer,

All changes are below
rioyokotalab/models@cc5f9a9

Execution

For intra-node parallel training on a machine with four P100s, I run the following command:

python  /path-to-examples/resnet50_trainer.py  \
--train_data /path-to-ILSVRC2012-dataset/ilsvrc12_train_lmdb \
--num_gpus 4   \
--num_shards 1  \
--file_store_path . \
--image_size 224  \
--batch_size 128 \
--epoch_size 1281167   \
--num_epochs 1  \
--base_learning_rate 1.0  \
--weight_decay 0.0001 \
--num_labels=1000

Error

INFO:resnet50_trainer:Finished iteration 91/10009 of epoch 0 (400.34 images/sec)
INFO:resnet50_trainer:Training loss: 2.21322536469, accuracy: 0.21875
*** Aborted at 1499852546 (unix time) try "date -d @1499852546" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x58) received by PID 102556 (TID 0x3aff01fff1d0) from PID 88; stack trace: ***
    @     0x3fffa05b0478 ([vdso]+0x477)
    @     0x3fff8dbbe268 (unknown)
    @     0x3fff8dd0ff50 (unknown)
    @     0x3fff8dc73a80 (unknown)
    @     0x3fff8dc7502c (unknown)
    @     0x3fff8dc753bc (unknown)
    @     0x3fff8db997a0 (unknown)
    @     0x3fff8da90ccc (unknown)
    @     0x3fff8dc14310 cuStreamSynchronize
    @     0x3fff9483d120 (unknown)
    @     0x3fff9488d808 cudaStreamSynchronize
    @     0x3fff96dbe440 caffe2::CUDAContext::FinishDeviceComputation()
    @     0x3fff96dbe8a0 caffe2::Operator<>::Run()
    @     0x3fff9678bff4 caffe2::DAGNet::RunAt()
    @     0x3fff96787c98 caffe2::DAGNetBase::WorkerFunction()
    @     0x3fff9678c2a4 std::thread::_Impl<>::_M_run()
    @     0x3fff79abbdd4 (unknown)
    @     0x3fffa0558728 start_thread
    @     0x3fffa034d210 __clone
Segmentation fault

Machine environment

name          description
OS            Red Hat Enterprise Linux Server release 7.3 (Maipo)
CPU           POWER8NVL revision 1.0 (pvr 004c 0100) × 8
GCC compiler  gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
GPU           Tesla P100 × 4
nvcc          release 8.0, V8.0.61
cuDNN         v6.0 (April 27, 2017), for CUDA 8.0

[Warsaw] Caffe2 build Problem "cannot find -lpthreads"

pip install numpy
pip install future
pip install protobuf

CMAKE_PREFIX_PATH=/home/hiroki11x/env/local/opencv-2.4.13:/home/hiroki11x/env/local/snappy-1.1.4 cmake .. \
-DBLAS=Eigen \
-DUSE_CUDA=ON \
-DUSE_ROCKSDB=OFF \
-DUSE_GLOO=ON \
-DUSE_REDIS=OFF \
-DUSE_OPENCV=ON \
-DUSE_GFLAGS=OFF \
-DUSE_MPI=OFF \
-DCUDNN_INCLUDE_DIR=/home/hiroki11x/env/local/cudnn7/cuda/include \
-DCUDNN_LIBRARY=/home/hiroki11x/env/local/cudnn7/cuda/lib/libcudnn.so \
-DCMAKE_INSTALL_PREFIX=/home/hiroki11x/dl/caffe2/local \
..

If you want to use MPI, append the following options:

-DMPI_C_COMPILER=/opt/intel/compilers_and_libraries_2017.1.132/linux/mpi/mic/bin/mpicc \
-DMPI_CXX_COMPILER=/opt/intel/compilers_and_libraries_2017.1.132/linux/mpi/mic/bin/mpicxx \

Then I executed cmake, and the configure step failed:

Run Build Command:"/usr/bin/gmake" "cmTC_8b05e/fast"
/usr/bin/gmake -f CMakeFiles/cmTC_8b05e.dir/build.make CMakeFiles/cmTC_8b05e.dir/build
gmake[1]: Entering directory `/home/hiroki11x/dl/caffe2/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_8b05e.dir/CheckFunctionExists.c.o
/usr/bin/cc    -DCHECK_FUNCTION_EXISTS=pthread_create   -o CMakeFiles/cmTC_8b05e.dir/CheckFunctionExists.c.o   -c /home/hiroki11x/env/src/cmake-3.4.0-rc3/Modules/CheckFunctionExists.c
Linking C executable cmTC_8b05e
/home/hiroki11x/env/src/cmake-3.4.0-rc3/bin/cmake -E cmake_link_script CMakeFiles/cmTC_8b05e.dir/link.txt --verbose=1
/usr/bin/cc   -DCHECK_FUNCTION_EXISTS=pthread_create    CMakeFiles/cmTC_8b05e.dir/CheckFunctionExists.c.o  -o cmTC_8b05e -rdynamic -lpthreads 
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
gmake[1]: *** [cmTC_8b05e] Error 1
gmake[1]: Leaving directory `/home/hiroki11x/dl/caffe2/build/CMakeFiles/CMakeTmp'
gmake: *** [cmTC_8b05e/fast] Error 2

[warsaw] protobuf Install error

I installed libtool from source

However, the following error occurred:

@warsaw:~/env/src/protobuf-3.3.0$ ./autogen.sh 
~~
configure.ac:30: error: possibly undefined macro: AC_PROG_LIBTOOL
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
autoreconf: /usr/bin/autoconf failed with exit status: 1

maxmind/libmaxminddb#9

Is it necessary to yum install libtool?

Profiling Caffe2 Distributed Training

I'd like to profile Caffe2 distributed training of ResNet-50.

First of all, I have to measure how fast the baseline is:

  • caffe2 fp32 single-GPU training

Then I will compare:

  • caffe2 fp16 single-GPU training
  • caffe2 fp32 single-node (multi-GPU) training
  • caffe2 fp16 single-node (multi-GPU) training
  • caffe2 fp32 multi-node (multi-GPU) training
  • caffe2 fp16 multi-node (multi-GPU) training

Before that, I should survey existing benchmarks of ResNet-50 training.
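As a first step, the throughput figures that resnet50_trainer.py already logs can be averaged directly. A small parser, assuming the log format shown elsewhere on this page:

```python
import re

# sample lines in the format resnet50_trainer.py logs
LOG = """\
INFO:resnet50_trainer:Finished iteration 125/128 of epoch 0 (201.81 images/sec)
INFO:resnet50_trainer:Finished iteration 126/128 of epoch 0 (202.05 images/sec)
"""

def mean_throughput(text):
    """Average the 'images/sec' figures found in trainer log output."""
    rates = [float(m) for m in re.findall(r"\(([\d.]+) images/sec\)", text)]
    return sum(rates) / len(rates)

print(round(mean_throughput(LOG), 2))  # 201.93
```

In practice the first iteration should be excluded, since it includes one-time initialization cost.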

[Reedbush] error: identifier "__ldg" is undefined

CMAKE_INSTALL_PREFIX=/path-to-install cmake -DUSE_REDIS=ON ..
hoge/caffe2/caffe2/operators/resize_op.cu(63): error: identifier "__ldg" is undefined

1 error detected in the compilation of "/tmp/tmpxft_0000477b_00000000-20_resize_op.compute_20.cpp1.ii".
CMake Error at Caffe2_GPU_generated_resize_op.cu.o.cmake:260 (message):
  Error generating file
  hoge/caffe2/build/caffe2/CMakeFiles/Caffe2_GPU.dir/operators/./Caffe2_GPU_generated_resize_op.cu.o


make[2]: *** [caffe2/CMakeFiles/Caffe2_GPU.dir/operators/./Caffe2_GPU_generated_resize_op.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [caffe2/CMakeFiles/Caffe2_GPU.dir/all] Error 2
make: *** [all] Error 2

[Reedbush] ATLAS setup error

wget http://www.netlib.org/lapack/lapack-3.6.1.tgz
wget https://sourceforge.net/projects/math-atlas/files/Stable/3.10.3/atlas3.10.3.tar.bz2
tar xjvf atlas3.10.3.tar.bz2
cd ATLAS
mkdir build
cd build
../configure -b 64 --prefix=$LOCAL_DIR/ATLAS --shared --with-netlib-lapack-tarfile=../../lapack-3.6.1.tgz
make -j $J
make install
make -j 32

The following error occurred:

make[10]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[9]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[8]: *** [tstlib.grd] Error 2
make[8]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
TST: make drottest urout=rot1_x1y1.c opt="" 
make[8]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
cd /home/gi75/i75012/env/src/ATLAS/build/src/testing ; make lib
make[8]: *** read jobs pipe EOF.  Stop.
make[8]: *** Waiting for unfinished jobs....
make[9]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make -j 28 dlib.grd
make[9]: *** read jobs pipe EOF.  Stop.
make[9]: *** Waiting for unfinished jobs....
make[10]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[10]: warning: -jN forced in submake: disabling jobserver mode.
make[10]: `dlib.grd' is up to date.
make[10]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[9]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[8]: *** [tstlib.grd] Error 2
make[8]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
TST: make drottest urout=rot4_x1y1.c opt="" 
make[8]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
cd /home/gi75/i75012/env/src/ATLAS/build/src/testing ; make lib
make[8]: *** read jobs pipe EOF.  Stop.
make[8]: *** Waiting for unfinished jobs....
make[9]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make -j 28 dlib.grd
make[9]: *** read jobs pipe EOF.  Stop.
make[9]: *** Waiting for unfinished jobs....
make[10]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[10]: warning: -jN forced in submake: disabling jobserver mode.
make[10]: `dlib.grd' is up to date.
make[10]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[9]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/testing'
make[8]: *** [tstlib.grd] Error 2
make[8]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
NO GENERAL CASE SURVIVED!!  ABORTING!!
  ID  incX  incY  alpha  beta  ROUT
====  ====  ====  =====  ====  =============
   1     0     0     2     2  rot1_x0y0.c
   2     1     1     2     2  rot1_x1y1.c
   3     1     1     2     2  rot4_x1y1.c

make[7]: *** [dinstall_rot] Error 255
make[7]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/tune/blas/level1'
make[6]: *** [Make_drot] Error 2
make[6]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/blas/level1'
make[5]: *** [dgen] Error 2
make[5]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/blas/level1'
make[4]: *** [dlib] Error 2
make[4]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/blas/level1'
make[3]: *** [lib.grd] Error 2
make[3]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/src/auxil'
make[2]: *** [IStage1] Error 2
make[2]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/bin'
ERROR 712 DURING CACHESIZE SEARCH!!.  CHECK INSTALL_LOG/Stage1.log FOR DETAILS.
make[2]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build/bin'
cd /home/gi75/i75012/env/src/ATLAS/build ; make error_report
make[3]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build'
make -f Make.top error_report
make[4]: Entering directory `/home/gi75/i75012/env/src/ATLAS/build'
uname -a 2>&1 >> bin/INSTALL_LOG/ERROR.LOG
/usr/bin/x86_64-redhat-linux-gcc -v 2>&1  >> bin/INSTALL_LOG/ERROR.LOG
Using built-in specs.
COLLECT_GCC=/usr/bin/x86_64-redhat-linux-gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) 
/usr/bin/x86_64-redhat-linux-gcc -V 2>&1  >> bin/INSTALL_LOG/ERROR.LOG
x86_64-redhat-linux-gcc: error: unrecognized command line option ‘-V’
x86_64-redhat-linux-gcc: fatal error: no input files
compilation terminated.
make[4]: [error_report] Error 4 (ignored)
/usr/bin/x86_64-redhat-linux-gcc --version 2>&1  >> bin/INSTALL_LOG/ERROR.LOG
tar cf error_UNKNOWNx8664AVXMAC.tar Make.inc bin/INSTALL_LOG/*
bzip2 error_UNKNOWNx8664AVXMAC.tar
make[4]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build'
make[3]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build'
make[2]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build/bin'
Error report error_<ARCH>.tgz has been created in your top-level ATLAS
directory.  Be sure to include this file in any help request.
cat: ../../CONFIG/error.txt: No such file or directory
cat: ../../CONFIG/error.txt: No such file or directory
make[1]: *** [build] Error 255
make[1]: Leaving directory `/home/gi75/i75012/env/src/ATLAS/build'
make: *** [build] Error 2

Accuracy is fixed to 1: ResNet-50 fp16 training problem

<class 'caffe2.python.core.Net'>
{}
<class 'caffe2.python.core.Net'>
{}
<class 'caffe2.python.core.Net'>
{}
<class 'caffe2.python.core.Net'>
{}
INFO:resnet50_trainer:Finished iteration 1/10009 of epoch 0 (25.41 images/sec)
INFO:resnet50_trainer:Training loss: 7.38396549225, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 2/10009 of epoch 0 (492.02 images/sec)
INFO:resnet50_trainer:Training loss: 190.478805542, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 3/10009 of epoch 0 (550.15 images/sec)
INFO:resnet50_trainer:Training loss: 723.197265625, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 4/10009 of epoch 0 (543.48 images/sec)
INFO:resnet50_trainer:Training loss: 704.564941406, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 5/10009 of epoch 0 (559.24 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 6/10009 of epoch 0 (550.31 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 7/10009 of epoch 0 (545.42 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 8/10009 of epoch 0 (569.45 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 9/10009 of epoch 0 (568.98 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 10/10009 of epoch 0 (543.75 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 11/10009 of epoch 0 (550.41 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
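One likely contributor to the loss blowing up to NaN within a few iterations is fp16's narrow dynamic range (the largest finite value is 65504), which the growing loss and gradients under base_learning_rate 1.0 can easily exceed. The range limit can be demonstrated with the stdlib alone, via struct's half-precision format (Python 3.6+):

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16(65504.0))   # largest finite fp16 value: survives exactly
print(to_fp16(1e-8))      # below the smallest fp16 subnormal: flushes to 0.0
try:
    struct.pack('<e', 66000.0)
except OverflowError:
    print("66000.0 does not fit in fp16")
```

Loss scaling or a smaller base learning rate are the usual mitigations when fp16 training diverges like this.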

Gloo update problem caused by topology of IBM Power System S822LC for High Performance Computing ("Minsky")

INFO:resnet50_trainer:Finished iteration 2501/2502 of epoch 0 (79.03 images/sec)
INFO:resnet50_trainer:Training loss: 0.432902753353, accuracy: 0.875
INFO:resnet50_trainer:Finished iteration 2502/2502 of epoch 0 (79.26 images/sec)
INFO:resnet50_trainer:Training loss: 0.462416082621, accuracy: 0.8125
Traceback (most recent call last):
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
    main()
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
    Train(args)
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 388, in Train
    explog
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 156, in RunEpoch
    learning_rate = workspace.FetchBlob(prefix + '/conv1_w_lr')
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 323, in FetchBlob
    return C.fetch_blob(StringifyBlobName(name))
RuntimeError: [enforce fail at pybind_state.cc:152] ws->HasBlob(name). Can't find blob: gpu_0/conv1_w_lr

I found this issue

https://stackoverflow.com/questions/45299351/caffe2-obtain-learning-rate-cant-find-blob-gpu-0-conv1-w-lr

I think it is caused by a difference in the Caffe2 (resnet50_trainer.py) version.

same issue

facebookarchive#616 (comment)
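The enforce failure just means the blob named gpu_0/conv1_w_lr no longer exists in this trainer version, so guarding the fetch avoids the crash at the end of the epoch. A defensive sketch, using a minimal stand-in object so it runs without Caffe2 installed (real code would call caffe2.python.workspace.HasBlob / FetchBlob):

```python
def fetch_lr_blob(ws, prefix):
    """Fetch the per-GPU learning-rate blob if present, else return None."""
    name = prefix + '/conv1_w_lr'
    return ws.FetchBlob(name) if ws.HasBlob(name) else None

# minimal stand-in mimicking the workspace API, for illustration only
class FakeWorkspace:
    def __init__(self, blobs):
        self.blobs = blobs
    def HasBlob(self, name):
        return name in self.blobs
    def FetchBlob(self, name):
        return self.blobs[name]

ws = FakeWorkspace({'gpu_0/conv1_w_lr': 0.1})
print(fetch_lr_blob(ws, 'gpu_0'))  # 0.1
print(fetch_lr_blob(ws, 'gpu_1'))  # None
```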

Caffe2 Setup

https://github.com/rioyokotalab/caffe2/wiki/Caffe2-build-on-ReedBush

I tried

make install -j 128
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::WireFormat::ReadPackedEnumPreserveUnknowns(google::protobuf::io::CodedInputStream*, unsigned int, bool (*)(int), google::protobuf::UnknownFieldSet*, google::protobuf::RepeatedField<int>*)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::IncrementRecursionDepthAndPushLimit(int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::WireFormatLite::Int32Size(google::protobuf::RepeatedField<int> const&)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::ReadVarint32Fallback(unsigned int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteBytesMaybeAliased(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedOutputStream::WriteVarint64SlowPath(unsigned long)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::RegisterAllTypes(google::protobuf::Metadata const*, int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedOutputStream::WriteVarint32SlowPath(unsigned int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::InitProtobufDefaults()'
libCaffe2_CPU.so: undefined reference to `google::protobuf::Message::SpaceUsedLong() const'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteDoubleArray(double const*, int, google::protobuf::io::CodedOutputStream*)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::BytesUntilTotalBytesLimit() const'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::ReadVarintSizeAsIntFallback()'
libCaffe2_CPU.so: undefined reference to `google::protobuf::io::CodedInputStream::ReadTagFallback(unsigned int)'
libCaffe2_CPU.so: undefined reference to `google::protobuf::internal::RepeatedPtrFieldBase::InternalExtend(int)'
collect2: error: ld returned 1 exit status
make[2]: *** [caffe2/binaries/blob_test] Error 1
make[1]: *** [caffe2/CMakeFiles/blob_test.dir/all] Error 2
Linking CXX shared module python/caffe2_pybind11_state_gpu.so
Linking CXX shared module python/caffe2_pybind11_state.so
[100%] Built target caffe2_pybind11_state_gpu
[100%] Built target caffe2_pybind11_state
make: *** [all] Error 2

$ pyenv --version
pyenv 1.1.3

$ pyenv versions
  system
* 2.7.10 (set by /lustre/gi75/i75012/env/src/pyenv/version)
  3.4.3
  3.5.0

So I'll try switching to Python 3, then:

$ pip install protobuf

[Redis Set up Error]

I installed Redis to /path-to-redis/ following this instruction:
https://github.com/kurosawatsuyoshi/doshelper/wiki/1.-redis-Setup%EF%BC%88redis%E3%81%AE%E3%82%BB%E3%83%83%E3%83%88%E3%82%A2%E3%83%83%E3%83%97%EF%BC%89

redis-server
[80640] 17 Jul 22:37:55.764 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
[80640] 17 Jul 22:37:55.765 * Increased maximum number of open files to 10032 (it was originally set to 4096).
[80640] 17 Jul 22:37:55.765 # Creating Server TCP listening socket *:6379: bind: Address already in use

https://redis.io/topics/quickstart

so this warning can be ignored.
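The "Address already in use" message means another redis-server is already listening on port 6379, so starting a second instance is unnecessary. A quick stdlib check:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0

print(port_in_use(6379))  # True if a redis-server is already running locally
```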

I installed hiredis (a Redis client library).

$ wget -O hiredis.zip https://github.com/redis/hiredis/archive/master.zip
$ unzip hiredis.zip
$ cd hiredis-master/

Edit the Makefile as follows:

# Installation related variables and target
PREFIX=/path-to/local/hiredis
INCLUDE_PATH=include/hiredis
LIBRARY_PATH=lib

then execute

$ make
$ sudo make install

After that, rebuild Caffe2:

CMAKE_PREFIX_PATH=/path-to/opencv-2.4.13:/path-to/snappy_1.1.4:/path-to/redis-2.8.12 cmake .. \
-DBLAS=Eigen \
-DUSE_CUDA=ON \
-DUSE_ROCKSDB=OFF \
-DUSE_GLOO=ON \
-DUSE_REDIS=ON \
-DUSE_OPENCV=ON \
-DUSE_GFLAGS=OFF \
-DCUDNN_INCLUDE_DIR=/path-to/cuda/include \
-DCUDNN_LIBRARY=/path-to/cuda/lib/libcudnn.so \
-DCMAKE_INSTALL_PREFIX=/path-to/caffe2/local \
-DMPI_C_COMPILER=/path-to/openmpi-2.0.1/xl/bin/mpicc \
-DMPI_CXX_COMPILER=/path-to/openmpi-2.0.1/xl/bin/mpicxx

Console output:

-- ******** Summary ********
-- General:
--   Git version           : 
--   System                : Linux
--   C++ compiler          : /usr/bin/c++
--   C++ compiler version  : 4.8.5
--   Protobuf compiler     : /usr/bin/protoc
--   CXX flags             :  -fopenmp -std=c++11 -fPIC -Wno-narrowing
--   Build type            : Release
--   Compile definitions   : CAFFE2_USE_EIGEN_FOR_BLAS;CAFFE2_USE_GOOGLE_GLOG;EIGEN_MPL2_ONLY;CAFFE2_FORCE_FALLBACK_CUDA_MPI;CAFFE2_NO_BUILTIN_CPU_SUPPORTS
-- 
--   BUILD_SHARED_LIBS     : ON
--   BUILD_PYTHON          : ON
--     Python version      : 2.7.5
--     Python library      : /usr/lib64/libpython2.7.so
--   BUILD_TEST            : ON
--   USE_CUDA              : ON
--     CUDA version        : 8.0
--   USE_CNMEM             : OFF
--   USE_NERVANA_GPU       : OFF
--   USE_GLOG              : ON
--   USE_GFLAGS            : OFF
--   USE_LMDB              : ON
--     LMDB version        : 0.9.18
--   USE_LEVELDB           : ON
--     LevelDB version     : 1.20
--     Snappy version      : 1.1.4
--   USE_OPENCV            : ON
--     OpenCV version      : 2.4.13
--   USE_FFMPEG            : 
--   USE_ZMQ               : OFF
--   USE_ROCKSDB           : OFF
--   USE_MPI               : ON
--   USE_NCCL              : ON
--   USE_NNPACK            : OFF
--   USE_OPENMP            : ON
--   USE_REDIS             : ON
--   USE_GLOO              : ON
-- Configuring done
-- Generating done

Caffe2 update & PYTHONPATH problem


$ python -m caffe2.python.operator_test.relu_op_test
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: libCaffe2_CPU.so: cannot open shared object file: No such file or directory
CRITICAL:root:Cannot load caffe2.python. Error: libCaffe2_CPU.so: cannot open shared object file: No such file or directory
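The CRITICAL line is a dlopen failure: Python finds the caffe2 package (so PYTHONPATH is fine), but the dynamic loader cannot locate libCaffe2_CPU.so, which again points at LD_LIBRARY_PATH missing the Caffe2 lib directory. The failure can be reproduced with ctypes; the library name is the one from the log:

```python
import ctypes

def try_load(libname):
    """Attempt to dlopen a shared library and report the loader error."""
    try:
        ctypes.CDLL(libname)
        return "loaded"
    except OSError as e:
        return "failed: %s" % e

print(try_load("libCaffe2_CPU.so"))  # "failed: ..." until LD_LIBRARY_PATH is fixed
```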
