zengarden / light_head_rcnn
Light-Head R-CNN
Hi, I really appreciate your excellent work! I have tried your fast_nms operator during inference and I wonder how to measure the processing time of NMS. I added some timing code in $light_head_rcnn/lib/detection_opr/rpn_batched/proposal_opr.py, but the time logged in the terminal doesn't look like the real processing time; it seems to be just the cost of constructing the graph or of calling the NMS function. Can you tell me how to measure the NMS processing time? Also, how do tf_nms and fast_nms compare in TensorFlow as the number of anchors grows?
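A possible explanation for the timing question above: in graph-mode TensorFlow, the Python code in proposal_opr.py only adds the op to the graph, so a timer placed there measures graph construction rather than the kernel. Below is a minimal sketch of a timing helper, assuming you wrap whatever actually executes the op (for example a sess.run call that fetches the NMS output) in a callable. The names time_call, warmup and runs are hypothetical and not part of this repository.

```python
import time


def time_call(fn, warmup=3, runs=20):
    """Return the mean wall-clock seconds per call of fn().

    fn must execute the work being measured and block until it finishes,
    e.g. (assumption) lambda: sess.run(nms_output, feed_dict=feed).
    Timing the Python line that only creates the op measures graph
    construction, not the kernel.
    """
    for _ in range(warmup):
        fn()  # warm-up calls absorb one-time costs such as library loading
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Stand-in workload so the sketch runs without TensorFlow installed.
mean_seconds = time_call(lambda: sum(i * i for i in range(100000)))
print("mean time per call: %.6f s" % mean_seconds)
```

To isolate NMS, one option is to time a run that fetches the tensor feeding NMS and a run that fetches the NMS output, then compare the two means.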
Training stops after an unpredictable number of iterations, with no error reported. I use one 1080 Ti and my own dataset converted to the odgt format. It appears to hang in 'socket.recv'. I have ruled out a stop inside the 'get_data_for_singlegpu' function: I added a print at every possible exit and none of them was reached.
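One way to locate a silent hang like this is Python's standard faulthandler module, which can dump every thread's stack if a timer is not reset in time. The sketch below is an assumption about how it could be wired into a training loop. arm_watchdog and disarm_watchdog are hypothetical helper names, and data-loading worker processes would each need their own watchdog.

```python
import faulthandler
import sys


def arm_watchdog(timeout_s=300, file=None):
    """(Re)start a timer that dumps all thread stacks when it expires.

    Calling this again replaces the previous timer, so calling it at the
    top of every training iteration only produces a dump when one
    iteration takes longer than timeout_s seconds.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target)


def disarm_watchdog():
    """Cancel the pending dump, e.g. after the training loop ends."""
    faulthandler.cancel_dump_traceback_later()
```

In the training loop this would look like calling arm_watchdog() before each step and disarm_watchdog() after the last one. The dump shows which call (for example socket.recv) each thread is blocked in.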
Hi, everyone
I have a problem modifying the test code.
I am trying to write a program that runs detection on an arbitrary image.
In this project, the author only released code for evaluation on COCO.
Can someone show me how to use the trained model to detect objects in my own image?
Thanks a lot.
Instead of using just one feature map such as c4 or c5 from ResNet-50 or ResNet-101, how about using FPN as in Mask R-CNN? The original paper (Table 5) also shows that FPN improves mAP.
I'm trying it, but I'm unsure how to apply the four global-context convs (maybe I should adjust the filter size to the feature map size, e.g. smaller filters such as 7x1 / 1x7 for a small feature map like c5) and PSAlign pooling.
Hi, I found that changing the initializer did affect the final loss, but it helped the final accuracy much less in another project of mine. Could you please share the motivation behind this choice? Thank you.
BTW, I'm sorry that this is a question rather than a bug report. Thank you again.
Can anyone please share their result.txt file on COCO? I have pasted my result file below.
I trained the baseline code on COCO and then ran test.py, but I got only 4.9% average precision, whereas the author reports about 35% for the baseline code. Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.049. Can anyone help me understand this result.txt file?
evaluation epoch 26
loading annotations into memory...
Done (t=4.29s)
creating index...
index created!
Loading and preparing results...
DONE (t=3.23s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
DONE (t=79.43s).
Accumulating evaluation results...
DONE (t=17.21s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.049
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.077
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.052
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.031
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.053
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.064
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.038
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.058
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.060
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.041
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.065
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.079
Hi, do you have any plans to support the Pascal VOC dataset?
@zengarden I modified the cuda.h include in cuda_kernel_helper.h and dso_loader.h, and bash make.sh then runs normally. But when I run test.py, the following problem occurs:
Could you please help me solve the problem? Thank you very much.
Hi,
I tried to compile this code for two V100 GPUs using sm_70. I get this warning during compilation and this error when I run test.py:
/usr/local/cuda-9.0/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp(213): here was declared deprecated ("__shfl_down() is not valid on compute_70 and above, and should be replaced with __shfl_down_sync().To continue using __shfl_down(), specify virtual architecture compute_60 when targeting sm_70 and above, for example, using the pair of compiler options: -arch=compute_60 -code=sm_70.")
NotFoundError: /home/edgar/light_head_rcnn/lib/lib_kernel/lib_fast_nms/fast_nms.so: undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumE
Also, when I use -arch=compute_60 -code=sm_70, I get this warning during compilation and the same error when I run test.py:
/usr/local/cuda/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp(213): here was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
The lines to be compiled are:
CUDA_PATH=/usr/local/cuda-9.0/
nvcc -std=c++11 -c -o nms_op.cu.o nms_op.cu.cc \
-I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -arch=compute_60 -code=sm_70 --expt-relaxed-constexpr -Wno-deprecated-declarations
Hello,
I am training a network with a modified base model from scratch, but I get the error below twice before a final error crashes IPython (terminal). Why do I get this error? It looks like a problem with the dataset or with the reading process.
Process _Worker-8:
Traceback (most recent call last):
File "/home/edgar/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/dataset.py", line 95, in get_data_for_singlegpu
record = json.loads(raw_line)
File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 219 (char 218)
##########################
Process _Worker-1:
Traceback (most recent call last):
File "/home/edgar/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/dataset.py", line 95, in get_data_for_singlegpu
record = json.loads(raw_line)
File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 219 (char 218)
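The traceback shows json.loads failing on a single raw line, which would be consistent with a truncated or corrupted record in the annotation file. A minimal sketch for checking this follows, assuming the file holds one JSON object per line. find_bad_lines and the sample keys are hypothetical and not taken from the repository.

```python
import json


def find_bad_lines(lines):
    """Yield (line_number, error_message) for each line that is not valid JSON."""
    for number, raw_line in enumerate(lines, start=1):
        raw_line = raw_line.strip()
        if not raw_line:
            continue  # blank lines are skipped, not reported
        try:
            json.loads(raw_line)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            yield number, str(err)


# Hypothetical records: the second one is cut off inside a string.
sample = ['{"ID": "a.jpg", "gtboxes": []}', '{"ID": "b.jpg", "fpath": "/img/b']
for number, message in find_bad_lines(sample):
    print("line %d: %s" % (number, message))
```

For a real file, pass the open file handle instead of the sample list. If every line parses, the problem is more likely in how workers read the file than in the file itself.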
When I try to use TensorRT to convert this model to a TensorRT model for speed, it shows:
2018-05-07 10:45:45.609149: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:383] MULTIPLE tensorrt candidate conversion: 26
2018-05-07 10:45:45.832778: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::377, condition: isValidDims(dims)
2018-05-07 10:45:45.832832: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:0 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 30 nodes)
2018-05-07 10:45:45.834499: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::377, condition: isValidDims(dims)
2018-05-07 10:45:45.834529: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:1 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 23 nodes)
2018-05-07 10:45:45.836056: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:2 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ExpandDims" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.838489: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:3 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_1" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.840022: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:4 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_60" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.841584: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:5 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_6" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.843144: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:6 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_6" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.844673: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:7 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ones_2" SKIPPING......( 4 nodes)
2018-05-07 10:45:45.846229: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::377, condition: isValidDims(dims)
2018-05-07 10:45:45.846257: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:8 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 33 nodes)
2018-05-07 10:45:45.847794: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:9 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ones_1" SKIPPING......( 3 nodes)
2018-05-07 10:45:45.849348: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:10 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_6/bbox_fc/MatMul" SKIPPING......( 4 nodes)
2018-05-07 10:45:45.850883: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:11 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_7" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.852441: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:12 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ones_3" SKIPPING......( 4 nodes)
2018-05-07 10:45:45.854356: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::377, condition: isValidDims(dims)
2018-05-07 10:45:45.854409: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:13 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 380 nodes)
2018-05-07 10:45:45.855983: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:14 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_8" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.858416: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:15 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_89" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.859942: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:16 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ones" SKIPPING......( 4 nodes)
2018-05-07 10:45:45.861475: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:17 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_5" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.863010: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:18 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_4" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.864538: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:19 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_62" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.866071: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:20 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_3" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.867597: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:21 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_6/ps_fc_1/MatMul" SKIPPING......( 3 nodes)
2018-05-07 10:45:45.869124: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:22 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_33" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.870657: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:23 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_2" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.872184: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:24 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_35" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.873872: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:25 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_87" SKIPPING......( 6 nodes)
Could you give me some advice on this? Or some advice on how to reach the goal of 33 mAP at 60 FPS? Thanks very much.
Hello.
Thank you for your work.
In your code, you convert the JSON annotations of the COCO dataset into ".odgt" files.
I would like to know how they were generated. If there is conversion code, can you provide it?
I want to extract a specific class to train on, such as "person". Is there another way to get the annotations for a specified class?
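Until a conversion script is available, one option is to filter the original COCO JSON down to a single category before converting it. This is a minimal sketch that relies only on the standard COCO annotation keys ("images", "annotations", "categories"). filter_coco_by_category is a hypothetical helper, and the sketch does not produce the .odgt layout.

```python
def filter_coco_by_category(coco, category_name):
    """Return a copy of a COCO annotation dict restricted to one category."""
    wanted = [c for c in coco["categories"] if c["name"] == category_name]
    wanted_ids = {c["id"] for c in wanted}
    annotations = [a for a in coco["annotations"]
                   if a["category_id"] in wanted_ids]
    # Keep only the images that still have at least one annotation.
    image_ids = {a["image_id"] for a in annotations}
    images = [im for im in coco["images"] if im["id"] in image_ids]
    filtered = dict(coco)  # shallow copy keeps other keys such as "info"
    filtered.update(categories=wanted, annotations=annotations, images=images)
    return filtered


# Tiny hand-made example in COCO layout.
tiny = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 5, 5]},
        {"id": 11, "image_id": 2, "category_id": 3, "bbox": [1, 1, 4, 4]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
}
person_only = filter_coco_by_category(tiny, "person")
print(len(person_only["images"]), len(person_only["annotations"]))
```

The number of classes configured for the network would presumably also need to match the reduced category list.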
I added
tfconfig.gpu_options.per_process_gpu_memory_fraction = 0.05
to let it run, but I got the following error:
...
2018-05-18 19:14:25.380430: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 58.69MiB. Current allocation summary follows.
2018-05-18 19:14:25.380546: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 38, Chunks in use: 37. 9.5KiB allocated for chunks. 9.2KiB in use in bin. 7.6KiB client-requested in use in bin.
...
4] 1 Chunks of size 91656192 totalling 87.41MiB
2018-05-18 19:14:25.404137: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 374.93MiB
2018-05-18 19:14:25.404163: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 425407283
InUse: 393138944
MaxInUse: 393138944
NumAllocs: 1096
MaxAllocSize: 91656192
2018-05-18 19:14:25.404278: W tensorflow/core/common_runtime/bfc_allocator.cc:279] **********************************************************_____******************xxxxxxx
2018-05-18 19:14:25.404328: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at conv_ops.cc:672 : Resource exhausted: OOM when allocating tensor with shape[1,64,400,601] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "test.py", line 244, in <module>
eval_all(args)
File "test.py", line 137, in eval_all
result_dict = inference(func, inputs, data_dict)
File "test.py", line 69, in inference
_, scores, pred_boxes, rois = val_func(feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,64,400,601] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: resnet_v1_101/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](resnet_v1_101/conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, resnet_v1_101/conv1/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: resnet_v1_101_5/concat_3/_1133 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2610_resnet_v1_101_5/concat_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
I read your code in lib/detection_opr/rpn_batched/proposal_opr.py. In proposal_opr, anchors is a NumPy array while rpn_bbox_pred is a tensor. Why can the tensor be used like a NumPy array in bbox_transform_inv and other functions?
I also get an error when I convert the graph to a TensorRT graph, and the error points to lib/detection_opr/rpn_batched/proposal_opr.py. Could it be caused by mixing NumPy arrays and tensors?
Where can I find the OHEM implementation?
Hi,
Recently I found that ROIAlign in roi_align_op_gpu.cu.cc produces an incorrect interpolation value in some corner cases.
When the h or w passed to ROIAlignGetInterpolating() is a non-integer value such as 1.001, floor and ceil return 1 and 2; but when w or h is exactly 1.000000, ceil and floor return the same value.
As a result, it interpolates between two points instead of four.
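To illustrate the corner case, here is a sketch of a common bilinear formulation that takes the upper neighbour as floor + 1 (clamped to the border) rather than ceil, so integer coordinates still produce four well-defined weights. It is plain Python for illustration only, not the repository's CUDA kernel, and bilinear is a hypothetical name.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly sample a 2-D list of lists at the real-valued point (y, x)."""
    height, width = len(feature), len(feature[0])
    # Clamp the sampling point to the valid pixel range.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y_low, x_low = int(math.floor(y)), int(math.floor(x))
    # Use low + 1 (clamped) instead of ceil: ceil(1.0) == floor(1.0), so
    # with ceil the two neighbours collapse into one at integer coordinates.
    y_high = min(y_low + 1, height - 1)
    x_high = min(x_low + 1, width - 1)
    ly, lx = y - y_low, x - x_low      # weights of the high neighbours
    hy, hx = 1.0 - ly, 1.0 - lx        # weights of the low neighbours
    return (hy * hx * feature[y_low][x_low]
            + hy * lx * feature[y_low][x_high]
            + ly * hx * feature[y_high][x_low]
            + ly * lx * feature[y_high][x_high])
```

With this formulation the weights along each axis always sum to 1, so a sample at an exact pixel centre returns that pixel's value.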
Looking through the odgt file, I find that a small minority of boxes are marked as ignored via the extra.ignored field. What's the significance of this?
Hello ,
I'm trying to run your test.py in my environment but I hit the problem below. Did I make a mistake?
(I can import /lib/lib_kernel/lib_fast_nms/nms_op.py
but can't use it.)
Caused by op 'resnet_v1_101_5/NMS', defined at:
File "test.py", line 242, in <module>
eval_all(args)
File "test.py", line 162, in eval_all
proc.start()
File "/usr/lib/python3.4/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.4/multiprocessing/context.py", line 212, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.4/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.4/multiprocessing/popen_fork.py", line 21, in __init__
self._launch(process_obj)
File "/usr/lib/python3.4/multiprocessing/popen_fork.py", line 77, in _launch
code = process_obj._bootstrap()
File "/usr/lib/python3.4/multiprocessing/process.py", line 254, in _bootstrap
self.run()
File "/usr/lib/python3.4/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "test.py", line 107, in worker
func, inputs = load_model(model_file, dev)
File "test.py", line 38, in load_model
net.inference('TEST', inputs)
File "/home/aipr/dennis_codebase/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/network_desp.py", line 164, in inference
anchors, num_anchors, is_tfchannel=True, is_tfnms=False)
File "/home/aipr/dennis_codebase/light_head_rcnn/lib/detection_opr/rpn_batched/proposal_opr.py", line 95, in proposal_opr
cur_proposals, nms_thresh, post_nms_topN)
File "<string>", line 43, in nms
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 328, in apply_op
op_type_name, name, **keywords)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NMS' with these attrs. Registered devices: [CPU], Registered kernels:
device='GPU'; T in [DT_FLOAT]
[[Node: resnet_v1_101_5/NMS = NMS[T=DT_FLOAT, max_out=1000, nms_overlap_thresh=0.7](resnet_v1_101_5/Gather)]]
Is there a comparison of accuracy and speed between MobileNetV2 (mobilenetV2_05 or mobilenetV2_035) and the Xception-like backbone?
I tried training light_head_rcnn.ori_res101.coco.ps_roialign with 4 GPUs (same result with 1 GPU), and it simply hangs at the step "Restoring parameters from /home/dsu/ai/lh/data/imagenet_weights/res101.ckpt". The GPUs show 207MB of memory in use and 0% utilization, while many CPU cores (about 30) run at 100%.
When I kill the training, I see this error:
File "/home/dsu/ai/lh/experiments/my/light_head_rcnn.ori_res101.coco/dataset.py", line 129, in get_data_for_singlegpu
img = cv2.imread(image_path, cv2.IMREAD_COLOR)
Same issue with the light_head_rcnn.ori_res101.coco experiment.
Environment: Ubuntu 16.04, TF 1.5, CUDA 9, and a 1080 Ti (TF 1.6 has the same issue; downgrading to 1.5 made no difference).
Has anyone run into this issue?
In the Faster R-CNN paper, the authors say:
In the first step, we train the RPN as described above. This network is initialized with an ImageNet-
pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we
train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN.
This detection network is also initialized by the ImageNet-pre-trained model. At this point the two
networks do not share conv layers. In the third step, we use the detector network to initialize RPN
training, but we fix the shared conv layers and only fine-tune the layers unique to RPN. Now the two
networks share conv layers. Finally, keeping the shared conv layers fixed, we fine-tune the fc layers
of the Fast R-CNN. As such, both networks share the same conv layers and form a unified network.
But I couldn't find these steps in your code. Could you please explain?
I changed a little of the code and I don't know what is wrong.
I train with tensorflow-gpu 1.2.1. The log shows rpn_bbox_loss at 10±2 and bbox_loss at 0.000:
iter 9071, rpn_loss_cls: 0.0063, rpn_loss_box: 9.4851, loss_cls: 0.2539, loss_box: 0.0000, tot_losses: 9.7453, lr: 0.0006, speed: 0.782s/iter: 11%|█▋ | 9071/80000 [2:01:21<15:48:55, 1.25it/s]
Hello.
I'm trying to run your test.py, but I hit the problem below and I don't know what is wrong.
gt@gt:~/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign$ python3 test.py -d 0 -se 26
Traceback (most recent call last):
File "test.py", line 16, in <module>
import network_desp
File "/home/gt/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/network_desp.py", line 26, in <module>
from detection_opr.rpn_batched.proposal_opr import proposal_opr
File "/home/gt/light_head_rcnn/lib/detection_opr/rpn_batched/proposal_opr.py", line 9, in <module>
from lib_kernel.lib_fast_nms import nms_op
File "/home/gt/light_head_rcnn/lib/lib_kernel/lib_fast_nms/nms_op.py", line 10, in <module>
_nms_module = tf.load_op_library(filename)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/gt/light_head_rcnn/lib/lib_kernel/lib_fast_nms/fast_nms.so: undefined symbol: _Z15launch_gen_maskifPKfiPyRKN5Eigen9GpuDeviceE
I also completed the compilation step, shown below.
gt@gt:~/light_head_rcnn/lib$ bash make.sh
~/light_head_rcnn/lib/utils/py_faster_rcnn_utils ~/light_head_rcnn/lib
python3 setup.py build_ext --inplace
running build_ext
skipping 'bbox.c' Cython extension (up-to-date)
skipping 'nms.c' Cython extension (up-to-date)
rm -rf build
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_fast_nms ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_psroi_pooling ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_roi_pooling ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_roi_align ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_psalign_pooling ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_nms_dev ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/datasets_odgt/lib_coco/PythonAPI ~/light_head_rcnn/lib
install pycocotools to the Python site-packages
python3 setup.py build_ext install --user
running build_ext
building 'pycocotools._mask' extension
creating build
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/pycocotools
creating build/common
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.5/dist-packages/numpy/core/include -I../common -I/usr/include/python3.5m -c pycocotools/_mask.c -o build/temp.linux-x86_64-3.5/pycocotools/_mask.o -Wno-cpp -Wno-unused-function -std=c99
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.5/dist-packages/numpy/core/include -I../common -I/usr/include/python3.5m -c ../common/maskApi.c -o build/temp.linux-x86_64-3.5/../common/maskApi.o -Wno-cpp -Wno-unused-function -std=c99
creating build/lib.linux-x86_64-3.5
creating build/lib.linux-x86_64-3.5/pycocotools
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.5/pycocotools/_mask.o build/temp.linux-x86_64-3.5/../common/maskApi.o -o build/lib.linux-x86_64-3.5/pycocotools/_mask.cpython-35m-x86_64-linux-gnu.so
running install
running build
running build_py
copying pycocotools/mask.py -> build/lib.linux-x86_64-3.5/pycocotools
copying pycocotools/coco.py -> build/lib.linux-x86_64-3.5/pycocotools
copying pycocotools/__init__.py -> build/lib.linux-x86_64-3.5/pycocotools
copying pycocotools/cocoeval.py -> build/lib.linux-x86_64-3.5/pycocotools
running install_lib
copying build/lib.linux-x86_64-3.5/pycocotools/_mask.cpython-35m-x86_64-linux-gnu.so -> /home/gt/.local/lib/python3.5/site-packages/pycocotools
running install_egg_info
Removing /home/gt/.local/lib/python3.5/site-packages/pycocotools-2.0.egg-info
Writing /home/gt/.local/lib/python3.5/site-packages/pycocotools-2.0.egg-info
rm -rf build
~/light_head_rcnn/lib
The error occurs in the fast_nms.so file, but I can't read it because it is a compiled binary.
I also changed my gcc and g++ version (5.4 -> 4.8), but nothing changed.
Can you help me with this error?
Hi there,
Awesome research and thanks for open-sourcing your code!
When trying to run train.py:
Traceback (most recent call last):
File "train.py", line 12, in <module>
import network_desp
File "/home/indus/Documents/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/network_desp.py", line 23, in <module>
from detection_opr.rpn_batched.proposal_target_layer import proposal_target_layer
File "/home/indus/Documents/light_head_rcnn/lib/detection_opr/rpn_batched/proposal_target_layer.py", line 14, in <module>
from utils.py_faster_rcnn_utils.cython_bbox import bbox_overlaps
ImportError: No module named cython_bbox
I looked in the lib/utils/py_faster_rcnn_utils/ folder but I couldn't find any file named cython_bbox.py. I saw a file called bbox.py, so I tried changing the line to from utils.py_faster_rcnn_utils.bbox import bbox_overlaps, but to no avail.
Any suggestions as to how I can resolve this issue?
Thanks!
I tried updating msgpack-numpy and msgpack as you said, but it doesn't help. Can you tell us what system you are using? Ubuntu 16.04 automatically loses its IP address after a period of time, but your code uses pip. I suspect the random stops in iteration are caused by a system problem.
The log prints as follows:
ch:0, iter:4399, rpn_loss_cls: 0.0677, rpn_loss_box: 0.0325, loss_cls: 0.3604, loss_box: 0.6708, tot_losses: 1.1314, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 43
epoch:0, iter:4400, rpn_loss_cls: 0.0281, rpn_loss_box: 0.0015, loss_cls: 0.0247, loss_box: 0.0002, tot_losses: 0.0544, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4401, rpn_loss_cls: 0.0373, rpn_loss_box: 0.0025, loss_cls: 0.0489, loss_box: 0.0004, tot_losses: 0.0892, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4402, rpn_loss_cls: 0.0223, rpn_loss_box: 0.0026, loss_cls: 0.0173, loss_box: 0.0130, tot_losses: 0.0551, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4403, rpn_loss_cls: 0.0571, rpn_loss_box: 0.0198, loss_cls: 0.2124, loss_box: 0.2481, tot_losses: 0.5374, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4404, rpn_loss_cls: 0.0359, rpn_loss_box: 0.0135, loss_cls: 0.1283, loss_box: 0.1383, tot_losses: 0.3160, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4405, rpn_loss_cls: 0.0455, rpn_loss_box: 0.0516, loss_cls: 0.1455, loss_box: 0.0754, tot_losses: 0.3181, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4406, rpn_loss_cls: 0.0611, rpn_loss_box: 0.0380, loss_cls: 0.0184, loss_box: 0.0022, tot_losses: 0.1198, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4407, rpn_loss_cls: 0.0297, rpn_loss_box: 0.0195, loss_cls: 0.0216, loss_box: 0.0106, tot_losses: 0.0814, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4408, rpn_loss_cls: 0.0397, rpn_loss_box: 0.0038, loss_cls: 0.0574, loss_box: 0.0496, tot_losses: 0.1505, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 4408/13754 [28:43<57:22, 2.72it/s]
Hi @zengarden,
I am running the test script, but this error came out.
" lib/lib_kernel/lib_fast_nms/fast_nms.so: cannot open shared object file: No such file or directory"
Could you please tell me what is wrong?
Many thanks.
A suggestion for improving speed
I found that your data format is "NHWC", but according to this page: https://www.tensorflow.org/performance/performance_guide
NHWC is the TensorFlow default and NCHW is the optimal format to use when training on NVIDIA GPUs using cuDNN.
You should use NCHW as your CNN format; your NMS and PS RoI pooling ops may also need to change.
Hello,
I'm running this in a docker container with cuda 9.0 and tensorflow 1.5.0 installed from pip3.
When I run make.sh it gets stuck while compiling psroi_pooling_op_gpu.cu.cc.
The exact error message is as follows:
In file included from psalign_pooling_op_gpu.cu.cc:7:0:
/usr/local/lib/python3.5/dist-packages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h:24:31: fatal error: cuda/include/cuda.h: No such file or directory
Do I have to compile tensorflow from source?
The prediction procedure doesn't support multi-image batches, and when we tested your code on an Nvidia 1080Ti GPU it only achieved 2 images per second, far lower than the 100 FPS reported in the original paper.
Traceback (most recent call last):
  File "test.py", line 241, in <module>
    eval_all(args)
  File "test.py", line 131, in eval_all
    func, inputs = load_model(model_file, devs[0])
  File "test.py", line 35, in load_model
    sess = tf.Session(config=tfconfig)
  File "/home/lucifer/anaconda3/envs/slam/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1509, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/lucifer/anaconda3/envs/slam/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 628, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/lucifer/anaconda3/envs/slam/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
While running
root@982633c0cbbf:/dh/home/administrator/users_local/mamta/LightHead/lighthead_ROOT/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco# python3 test.py -d 0 -se 1
I get this error:
2018-08-27 09:37:02.247790: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_tensor.cc:170 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /dh/home/administrator/users_local/LightHead/lighthead_ROOT/light_head_rcnn/output/root/light_head_rcnn.ori_res101.coco/eval_dump/epoch_1.ckpt
I can't figure out where to get epoch_1.ckpt.
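The error above says that TensorFlow found no files matching the checkpoint prefix `epoch_1.ckpt` in the `eval_dump` directory. Judging from the command and the error text, `-se 1` presumably selects epoch 1, but that is an assumption. Checkpoints written by `tf.train.Saver` are usually stored as several files sharing one prefix, for example `epoch_26.ckpt.index` and `epoch_26.ckpt.data-*`. A small sketch for checking which epochs are actually present follows. `list_epoch_checkpoints` is a hypothetical helper, not part of the repository, and the directory path is a placeholder:

```python
import glob
import os
import re

def list_epoch_checkpoints(dump_dir):
    """Return the sorted epoch numbers that have epoch_<n>.ckpt* files in dump_dir."""
    epochs = set()
    for path in glob.glob(os.path.join(dump_dir, "epoch_*.ckpt*")):
        match = re.match(r"epoch_(\d+)\.ckpt", os.path.basename(path))
        if match:
            epochs.add(int(match.group(1)))
    return sorted(epochs)

# Placeholder path: point this at your own eval_dump directory.
print(list_epoch_checkpoints("path/to/eval_dump"))
```

If the list is empty, no checkpoint has been trained or copied there yet. If it contains other epochs, passing one of those numbers to `-se` would be the thing to try.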
Hello, I have a question about the lib_fast_nms/make.sh file.
In your code:
##if you install tf using already-built binary, or gcc version 4.x, uncomment the two lines below
g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o fast_nms.so nms_op.cc
nms_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64 -L$TF_LIB -ltensorflow_framework -I$TF_INC/external/nsync/public
#for gcc5-built tf
#g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=1 -o roi_pooling.so roi_pooling_op.cc
roi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64
If you install tf using an already-built binary, or gcc version 4.x, it makes the fast_nms.so file,
and for gcc5-built tf it makes the roi_pooling.so file.
I don't understand why you create a different file depending on the gcc version.
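Looking at the two quoted commands, the relevant difference appears to be the value of `-D_GLIBCXX_USE_CXX11_ABI` (0 versus 1). The `roi_pooling.so` name in the commented-out line may simply be a leftover, although that is a guess. The ABI value used for a custom op generally has to match the one TensorFlow itself was built with, otherwise loading the `.so` tends to fail with undefined-symbol errors. A sketch for reading that value from the installed TensorFlow is below. `cxx11_abi_from_flags` is a hypothetical helper, while `tf.sysconfig.get_compile_flags()` is the call used in TensorFlow's custom-op build instructions:

```python
def cxx11_abi_from_flags(flags):
    """Return the -D_GLIBCXX_USE_CXX11_ABI value found in a list of compile flags, or None."""
    prefix = "-D_GLIBCXX_USE_CXX11_ABI="
    for flag in flags:
        if flag.startswith(prefix):
            return int(flag[len(prefix):])
    return None

if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        # Prints 0 or 1 when the flag is present in TensorFlow's compile flags.
        print(cxx11_abi_from_flags(tf.sysconfig.get_compile_flags()))
```

Whichever value is reported is the one to pass to g++ when building `fast_nms.so`.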
Hi, thank you for your great work. I have a quick question: what's the difference between nms_fast and standard NMS? Thanks.
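For reference when comparing the two, here is a minimal pure-Python sketch of standard greedy NMS. It is not the repository's fast_nms kernel, whose internals are not described in this thread. The box format `(x1, y1, x2, y2)` and the default threshold are assumptions made for the example:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is kept.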
Hello, thank you for your work. I hit an error when compiling the lib files.
I wonder whether I used the wrong version of CUDA; I used tf1.5.0 + cuda8.0 + py3. The error is:
..................................................................................................................
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(701): error: identifier "__ballot_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(720): error: identifier "__shfl_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(742): error: identifier "__shfl_up_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(758): error: identifier "__shfl_down_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(770): error: identifier "__shfl_down_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(786): error: identifier "__shfl_xor_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(798): error: identifier "__shfl_xor_sync" is undefined
7 errors detected in the compilation of "/tmp/tmpxft_0000050e_00000000-7_psroi_pooling_op_gpu.cu.cpp1.ii".
g++: error: psroi_pooling_op.cu.o: No such file or directory
...........................................................................................................................................................................
I guess the error is caused by the wrong version of CUDA: the undefined functions ("__ballot_sync", "__shfl_sync", ...) are in CUDA 9.0, but I can only use CUDA 8.0 because of the GPU machine.
How can I solve this problem other than by changing the CUDA version?
I tried to download the odformat file from the link but received a redirection error. I have tried many times and still cannot download it.
Does anyone have a similar issue? I am trying to download it from Singapore.
When I run bash make.sh, the following error occurs:
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_fast_nms ~/light_head_rcnn/lib
make.sh: 5: make.sh: nvcc: not found
g++: error: nms_op.cu.o: No such file or directory
What is the problem?
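The message `nvcc: not found` means the shell could not find the CUDA compiler on `PATH`. The g++ error that follows is a consequence, since `nms_op.cu.o` was never produced. A small diagnostic sketch is given below. `find_nvcc` is a hypothetical helper, and `/usr/local/cuda/bin` is only a common default install location, which may differ on your machine:

```python
import os
import shutil

def find_nvcc(extra_dirs=("/usr/local/cuda/bin",)):
    """Return a path to nvcc found on PATH or in extra_dirs, or None if absent."""
    found = shutil.which("nvcc")
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(directory, "nvcc")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

nvcc_path = find_nvcc()
if nvcc_path is None:
    print("nvcc not found: install the CUDA toolkit or check where it is installed")
else:
    print("nvcc found at", nvcc_path, "- make sure its directory is on PATH before running make.sh")
```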
Is there any way to convert the COCO dataset to odformat?
Or is there an already converted odformat file for instances_train2017 and instances_val2017?
Thanks!
I have run test.py on the COCO data using the epoch_26.ckpt provided by the author, but I still get a very poor result, about 4% on COCO. Has anyone else checked on COCO data? Please let me know; maybe I am making a mistake, because the author claims 40% on COCO.
I see this in the paper, but I cannot find it in the code.
Is the position of the subsampling unit in each resnet block correct?
In your network you subsample at the beginning of each block:
https://github.com/zengarden/light_head_rcnn/blob/master/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/network_desp.py#L93
But in the original tensorflow implementation they subsample at the end of each block:
in the current implementation we subsample the output activations in the last residual unit of
each block, instead of subsampling the input activations in the first residual unit of each block.
See:
https://github.com/tensorflow/models/blob/master/research/slim/nets/resnet_utils.py#L30
https://github.com/tensorflow/models/blob/master/research/slim/nets/resnet_v1.py#L271
Is it alright to use their pretrained resnet-101 model in this case?
When I run python3 train.py, there is an error like this:
'tensorflow.python.framework.errors_impl.NotFoundError: ~/light_head_rcnn-master/lib/lib_kernel/lib_fast_nms/fast_nms.so: undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumE'. How can I solve this problem?
Does anyone use TensorBoard or another way to get the loss curve?
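One option that needs no changes to the training code is to parse the console log, which prints `iter:` followed by the individual losses and `tot_losses:` on each step. This is a minimal sketch under the assumption that the log looks like the progress lines quoted in these issues. `parse_losses` and the sample string are made up for illustration:

```python
import re

# One training step: "iter:<n>, <name>: <value>, ..., tot_losses: <value>"
ENTRY = re.compile(r"iter:(\d+),\s*((?:\w+: [\d.]+,\s*)*tot_losses: [\d.]+)")
FIELD = re.compile(r"(\w+): ([\d.]+)")

def parse_losses(log_text):
    """Return one dict per logged step with the iteration number and each loss as a float."""
    records = []
    for iteration, fields in ENTRY.findall(log_text):
        record = {"iter": int(iteration)}
        for name, value in FIELD.findall(fields):
            record[name] = float(value)
        records.append(record)
    return records

sample = (
    "epoch:0, iter:4400, rpn_loss_cls: 0.0281, rpn_loss_box: 0.0015, loss_cls: 0.0247, "
    "loss_box: 0.0002, tot_losses: 0.0544, lr: 0.0006, speed: 0.391s/iter: 32%| 44"
    "epoch:0, iter:4401, rpn_loss_cls: 0.0373, rpn_loss_box: 0.0025, loss_cls: 0.0489, "
    "loss_box: 0.0004, tot_losses: 0.0892, lr: 0.0006, speed: 0.391s/iter: 32%| 44"
)
records = parse_losses(sample)
print([(r["iter"], r["tot_losses"]) for r in records])
```

The resulting list of `(iter, tot_losses)` pairs can then be handed to any plotting tool to draw the curve.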
When I start test.py as follows:
python3 test.py -s -se 26
it fails at the point where it should display the processed image, and returns a couple of X errors like:
X Error: BadAccess (attempt to access private resource denied) 10
Extension: 130 (MIT-SHM)
Minor opcode: 1 (X_ShmAttach)
Resource id: 0x4600003
X Error: BadShmSeg (invalid shared segment parameter) 128
Extension: 130 (MIT-SHM)
Minor opcode: 3 (X_ShmPutImage)
Resource id: 0x4600008
X Error: BadAccess (attempt to access private resource denied) 10
Extension: 130 (MIT-SHM)
Minor opcode: 1 (X_ShmAttach)
Resource id: 0x149
X Error: BadShmSeg (invalid shared segment parameter) 128
Extension: 130 (MIT-SHM)
Minor opcode: 2 (X_ShmDetach)
Resource id: 0x149
Is anyone having the same or a similar problem?
#34 can be closed.
xhost was not configured for Docker. My mistake, sorry for that!
Where is the code for the Xception-like network described in the original paper?
First of all, thanks for the paper, I enjoyed reading it.
In table 8 you compare FPS and COCO's AP with other CNNs, but I can't find the GPU in use.
Could you please share on what GPU these FPS values were obtained?
Is it Titan XP? If so, were the FPS values of the rest of the CNNs (YOLO, SSD, DSSD, etc) also computed on this GPU? (original values from corresponding papers are of Titan X Maxwell if I'm not wrong)
Thanks!
I trained on my own dataset, which has only three classes, and the mAP was 78%.
But when I trained on VOC0712, which has 20 classes, the mAP was only 21%.
I guess the reason is that I only used one 1080Ti, resulting in serious under-fitting.