zengarden / light_head_rcnn

Light-Head R-CNN

light_head_rcnn's Issues

about speed of fast nms

Hi, I really appreciate your excellent work! I have tried your fast_nms operator during inference, and I wonder how to measure the processing time of NMS. I added some timing code in $light_head_rcnn/lib/detection_opr/rpn_batched/proposal_opr.py, but the time logged in the terminal does not look like the real processing time; it seems to be just the cost of constructing the graph or calling the NMS function. Can you tell me how to measure the NMS processing time? And how do tf_nms and fast_nms perform in TensorFlow when processing a given number of anchors?
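On the timing question: in graph-mode TensorFlow, the Python call that creates an op only adds a node to the graph, so a timer wrapped around that call measures construction rather than execution. The effect can be shown without TensorFlow. This is a minimal sketch in which greedy_nms, build_nms_node, timed and the box data are hypothetical stand-ins; it is not the repository's fast_nms kernel.

```python
import time

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, thresh):
    """Plain greedy NMS; returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

def build_nms_node(boxes, scores, thresh):
    """Hypothetical stand-in for adding an op to a graph: nothing runs yet."""
    return lambda: greedy_nms(boxes, scores, thresh)

def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]

# Timing the construction call measures almost nothing useful ...
node, build_time = timed(lambda: build_nms_node(boxes, scores, 0.5))
# ... the real cost only appears when the deferred computation is executed.
kept, run_time = timed(node)
print(kept, build_time, run_time)
```

In TensorFlow 1.x the counterpart of run_time would come from timing the sess.run call (after a few warm-up runs), not the Python call that builds the op.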

iteration stopped randomly

Training stops after an unpredictable number of iterations, and I cannot find the reason. I use one 1080 Ti and my own dataset converted to the od format. It appears to stop inside 'socket.recv'. I have ruled out a stop inside the 'get_data_for_singlegpu' function: I added a print at every possible exit point, and none of them was reached.

Can someone release a code to test detection on any image?

Hi, everyone.
I am having trouble modifying the test code.
I am trying to write a program that runs detection on an arbitrary image.

In this project, the author only released code for evaluation on COCO.
Can someone help me use the trained model to detect objects in my own image?

Thanks a lot.

do you have a plan to apply FPN?

Instead of using just one feature map like c4 or c5 from resnet50 or 101, how about using FPN like Mask R-CNN? The original paper (Table 5) also shows that FPN improves mAP.
I'm trying it, but I'm confused about how to apply the 4 global context convs (maybe I should adjust the filter size to the feature map size, e.g. a smaller filter like 7x1 / 1x7 for a small feature map like c5) and the psalign pooling.

why you change the default initializer to random_normal_initializer before rpn head?

Hi, I have found that changing the initializer does affect the final loss, but it did little for the final accuracy in another project of mine. Could you please share the motivation for this change? Thank you.

BTW, I'm sorry that this is a question rather than a bug report about the project. Thank you again.

understanding of result.txt file

Can anyone please share their result.txt file on COCO? I have pasted my result file below.
I trained the baseline code on COCO and then ran test.py, but I got 4.9% average precision, while the author claims about 35% for the baseline code: Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.049. Can anyone help me understand this result.txt file?

evaluation epoch 26
loading annotations into memory...
Done (t=4.29s)
creating index...
index created!
Loading and preparing results...
DONE (t=3.23s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
DONE (t=79.43s).
Accumulating evaluation results...
DONE (t=17.21s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.049
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.077
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.052
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.031
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.053
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.064
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.038
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.058
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.060
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.041
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.065
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.079
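The summary lines above follow a fixed layout, so they can be pulled into a dictionary to compare runs. This is a minimal sketch; parse_coco_summary is a hypothetical helper and the regular expression is an assumption based only on the lines shown in this issue.

```python
import re

# One summary line: metric name, IoU range, area bucket, maxDets, value.
LINE_RE = re.compile(
    r"Average (Precision|Recall)\s+\((AP|AR)\)\s+@\[\s*IoU=([\d.:]+)\s*\|"
    r"\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]\s*=\s*(-?[\d.]+)"
)

def parse_coco_summary(text):
    """Map (kind, iou, area, max_dets) to the reported value."""
    results = {}
    for match in LINE_RE.finditer(text):
        _, kind, iou, area, max_dets, value = match.groups()
        results[(kind, iou, area, int(max_dets))] = float(value)
    return results

sample = """\
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.049
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.077
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.038
"""
metrics = parse_coco_summary(sample)
print(metrics[("AP", "0.50:0.95", "all", 100)])
```

The first line (AP at IoU=0.50:0.95, all areas, maxDets=100) is the figure being compared with the author's number; in the pasted file it is 0.049, i.e. 4.9%.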

PascalVoc Support

Hi, do you have any plan to support PascalVoc dataset?

light_head_rcnn/lib/lib_kernel/lib_fast_nms/fast_nms.so: undefined symbol:

@zengarden For bash make.sh, I modified cuda.h in cuda_kernel_helper.h and dso_loader.h, and the build runs normally. But when I run test.py, the following problem occurs:

Could you please help me solve this problem? Thank you very much.

compile lib_kernel/lib_fast_nms/fast_nms using GPU V100

Hi,
I tried to compile this code for two V100 GPUs using sm_70. I get this warning during compilation and this error when I run test.py:

/usr/local/cuda-9.0/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp(213): here was declared deprecated ("__shfl_down() is not valid on compute_70 and above, and should be replaced with __shfl_down_sync().To continue using __shfl_down(), specify virtual architecture compute_60 when targeting sm_70 and above, for example, using the pair of compiler options: -arch=compute_60 -code=sm_70.")

NotFoundError: /home/edgar/light_head_rcnn/lib/lib_kernel/lib_fast_nms/fast_nms.so: undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumE

Also, when I use -arch=compute_60 -code=sm_70, I get this warning during compilation and the same error when I run test.py:

/usr/local/cuda/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp(213): here was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")

The lines to be compiled are:

CUDA_PATH=/usr/local/cuda-9.0/
nvcc -std=c++11 -c -o nms_op.cu.o nms_op.cu.cc \
	-I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -arch=compute_60 -code=sm_70 --expt-relaxed-constexpr -Wno-deprecated-declarations

problem reading JSON

Hello,

I am training a network with a modified base model from scratch, but I get this error twice before a final error crashes IPython (the terminal). Why am I getting this error? It looks like a problem with the dataset or with reading it.

Process _Worker-8:
Traceback (most recent call last):
  File "/home/edgar/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/dataset.py", line 95, in get_data_for_singlegpu
    record = json.loads(raw_line)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 219 (char 218)

##########################

Process _Worker-1:
Traceback (most recent call last):
  File "/home/edgar/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/dataset.py", line 95, in get_data_for_singlegpu
    record = json.loads(raw_line)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 219 (char 218)
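The traceback shows json.loads(raw_line) failing on a single record, which is what a truncated or split line produces. This is a minimal sketch for locating such lines before training, assuming the annotation file holds one JSON object per line (as the per-line json.loads in dataset.py suggests). find_bad_lines is a hypothetical helper and the keys in the sample records are illustrative only.

```python
import json

def find_bad_lines(lines):
    """Return (line_number, error_message) for every line that is not valid JSON."""
    bad = []
    for number, raw_line in enumerate(lines, start=1):
        raw_line = raw_line.strip()
        if not raw_line:
            continue  # ignore blank lines
        try:
            json.loads(raw_line)
        except json.JSONDecodeError as err:
            bad.append((number, str(err)))
    return bad

sample = [
    '{"ID": "a", "gtboxes": []}',
    '{"ID": "b", "fpath": "/images/b.jp',  # truncated record
]
for number, message in find_bad_lines(sample):
    print(number, message)
```

The same function accepts an open file object, since iterating over a file yields its lines.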

TensorRT Error for input

When I try to use TensorRT to convert this model to a TensorRT model for speed, it shows:
2018-05-07 10:45:45.609149: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:383] MULTIPLE tensorrt candidate conversion: 26
2018-05-07 10:45:45.832778: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::377, condition: isValidDims(dims)
2018-05-07 10:45:45.832832: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:0 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 30 nodes)
2018-05-07 10:45:45.834499: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::377, condition: isValidDims(dims)
2018-05-07 10:45:45.834529: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:1 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 23 nodes)
2018-05-07 10:45:45.836056: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:2 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ExpandDims" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.838489: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:3 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_1" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.840022: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:4 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_60" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.841584: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:5 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_6" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.843144: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:6 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_6" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.844673: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:7 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ones_2" SKIPPING......( 4 nodes)
2018-05-07 10:45:45.846229: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::377, condition: isValidDims(dims)
2018-05-07 10:45:45.846257: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:8 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 33 nodes)
2018-05-07 10:45:45.847794: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:9 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ones_1" SKIPPING......( 3 nodes)
2018-05-07 10:45:45.849348: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:10 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_6/bbox_fc/MatMul" SKIPPING......( 4 nodes)
2018-05-07 10:45:45.850883: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:11 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_7" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.852441: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:12 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ones_3" SKIPPING......( 4 nodes)
2018-05-07 10:45:45.854356: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::377, condition: isValidDims(dims)
2018-05-07 10:45:45.854409: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:13 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 380 nodes)
2018-05-07 10:45:45.855983: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:14 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_8" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.858416: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:15 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_89" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.859942: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:16 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/ones" SKIPPING......( 4 nodes)
2018-05-07 10:45:45.861475: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:17 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_5" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.863010: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:18 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_4" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.864538: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:19 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_62" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.866071: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:20 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_3" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.867597: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:21 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_6/ps_fc_1/MatMul" SKIPPING......( 3 nodes)
2018-05-07 10:45:45.869124: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:22 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_33" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.870657: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:23 due to: "Unimplemented: Require 4 dimensional input. Got 2 resnet_v1_50_5/Exp_2" SKIPPING......( 9 nodes)
2018-05-07 10:45:45.872184: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:24 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_35" SKIPPING......( 6 nodes)
2018-05-07 10:45:45.873872: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] subgraph conversion error for subgraph_index:25 due to: "Unimplemented: Require 4 dimensional input. Got 1 resnet_v1_50_5/strided_slice_87" SKIPPING......( 6 nodes)
Could you give me some advice on this? Or some advice on reaching the goal of 33 mAP at 60 FPS? Thanks very much.

How to generate ".odgt" annotation file?

Hello.

Thank you for your work.

In your code, you convert the JSON annotations of the COCO dataset into ".odgt".

I want to know how it was generated. If there is conversion code, can you provide it?

I ask because I want to extract a specific class to train on, such as "person". Is there another way to get the annotations of a specified class?
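One way to approach the single-class case is to filter an existing .odgt file instead of regenerating it. This is a minimal sketch under loud assumptions: that each line is one JSON record, that boxes live in a "gtboxes" list, and that each box carries its class name in a "tag" field. None of these names is confirmed by this page, and keep_class is a hypothetical helper.

```python
import json

def keep_class(odgt_lines, wanted_tag):
    """Yield records reduced to the boxes whose (assumed) "tag" equals wanted_tag."""
    for raw_line in odgt_lines:
        if not raw_line.strip():
            continue
        record = json.loads(raw_line)
        # "gtboxes" and "tag" are assumed field names, not confirmed by the source.
        boxes = [b for b in record.get("gtboxes", []) if b.get("tag") == wanted_tag]
        if boxes:  # drop images that do not contain the wanted class
            record["gtboxes"] = boxes
            yield json.dumps(record)

sample = [
    '{"ID": "1", "gtboxes": [{"tag": "person", "box": [0, 0, 5, 9]}, {"tag": "dog", "box": [1, 1, 4, 4]}]}',
    '{"ID": "2", "gtboxes": [{"tag": "car", "box": [2, 2, 8, 6]}]}',
]
filtered = list(keep_class(sample, "person"))
print(filtered)
```

Writing the yielded strings back out, one per line, would give a smaller file in the same layout. The class list used by the training configuration would presumably need to match the filtered tags.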

Has anyone run successfully on a single GTX 1080 GPU? I tried it and ran out of memory.

I added
tfconfig.gpu_options.per_process_gpu_memory_fraction = 0.05
to let it run, but I got the following error output:
...
2018-05-18 19:14:25.380430: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 58.69MiB. Current allocation summary follows.
2018-05-18 19:14:25.380546: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 38, Chunks in use: 37. 9.5KiB allocated for chunks. 9.2KiB in use in bin. 7.6KiB client-requested in use in bin.
...
4] 1 Chunks of size 91656192 totalling 87.41MiB
2018-05-18 19:14:25.404137: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 374.93MiB
2018-05-18 19:14:25.404163: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 425407283
InUse: 393138944
MaxInUse: 393138944
NumAllocs: 1096
MaxAllocSize: 91656192

2018-05-18 19:14:25.404278: W tensorflow/core/common_runtime/bfc_allocator.cc:279] **********************************************************_____******************xxxxxxx
2018-05-18 19:14:25.404328: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at conv_ops.cc:672 : Resource exhausted: OOM when allocating tensor with shape[1,64,400,601] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "test.py", line 244, in <module>
    eval_all(args)
  File "test.py", line 137, in eval_all
    result_dict = inference(func, inputs, data_dict)
  File "test.py", line 69, in inference
    _, scores, pred_boxes, rois = val_func(feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,64,400,601] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: resnet_v1_101/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](resnet_v1_101/conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, resnet_v1_101/conv1/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: resnet_v1_101_5/concat_3/_1133 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2610_resnet_v1_101_5/concat_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
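The numbers in this log can be checked by hand. The failing tensor has shape [1,64,400,601] in float32, and the allocator reports its limit and current usage in bytes. A small worked calculation follows, using only values printed above; tensor_bytes is a hypothetical helper.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size of a dense tensor in bytes; float32 uses 4 bytes per element."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

MIB = 1024 * 1024

needed = tensor_bytes([1, 64, 400, 601])  # shape from the OOM message
limit = 425407283                          # "Limit" from the allocator stats
in_use = 393138944                         # "InUse" from the allocator stats
free = limit - in_use

print(round(needed / MIB, 2))  # 58.69, matching "trying to allocate 58.69MiB"
print(round(in_use / MIB, 2))  # 374.93, matching "Sum Total of in-use chunks"
print(round(free / MIB, 2))    # 30.77
```

So the first convolution output alone needs about 58.69 MiB while only about 30.77 MiB remain under the limit. The limit itself (about 406 MiB) follows from the memory fraction of 0.05, so the setting that was added to let it run appears to be what caps the allocator here.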

why can you use tensor as numpy?

I looked at your code in lib/detection_opr/rpn_batched/proposal_opr.py. In proposal_opr, anchors is a numpy array and rpn_bbox_pred is a tensor, so why can you use the tensor like a numpy array in bbox_transform_inv and other functions?
I also get an error when I convert the graph to a TensorRT graph, and the place where the error appears is lib/detection_opr/rpn_batched/proposal_opr.py. Maybe it is caused by this mixing of numpy arrays and tensors?

ohem

Where can I find the OHEM implementation?

ROIAlign Interpolating is incorrect

Hi,

Recently I found that the ROIAlign in roi_align_op_gpu.cu.cc produces an incorrect interpolation value in a corner case.

The case is this: when the h or w passed to ROIAlignGetInterpolating() is a float value like 1.001, floor and ceil return 1 and 2; but when w or h is exactly 1.000000, ceil and floor return the same value.

As a result, it interpolates between two points instead of four.
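The corner case is easy to reproduce outside CUDA. This is a minimal sketch, not the code in roi_align_op_gpu.cu.cc. The helper names are hypothetical, and the floor / floor + 1 variant is one common way to always get two distinct neighbours per axis, not necessarily the fix the repository would adopt.

```python
import math

def neighbours_floor_ceil(v):
    """Neighbour indices chosen with floor/ceil, as described in the issue."""
    return math.floor(v), math.ceil(v)

def neighbours_floor_plus_one(v):
    """Alternative choice: the upper neighbour is always floor(v) + 1."""
    low = math.floor(v)
    return low, low + 1

def bilinear(grid, h, w):
    """Bilinear interpolation on a 2D list using floor / floor + 1 neighbours."""
    rows, cols = len(grid), len(grid[0])
    h0, h1 = neighbours_floor_plus_one(h)
    w0, w1 = neighbours_floor_plus_one(w)
    dh, dw = h - h0, w - w0  # fractional offsets in [0, 1)
    h1, w1 = min(h1, rows - 1), min(w1, cols - 1)  # clamp at the border
    return ((1 - dh) * (1 - dw) * grid[h0][w0] + (1 - dh) * dw * grid[h0][w1]
            + dh * (1 - dw) * grid[h1][w0] + dh * dw * grid[h1][w1])

grid = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]

print(neighbours_floor_ceil(1.001))    # (1, 2)
print(neighbours_floor_ceil(1.0))      # (1, 1): both neighbours coincide
print(neighbours_floor_plus_one(1.0))  # (1, 2)
print(bilinear(grid, 1.0, 1.0))        # 4.0
print(bilinear(grid, 1.5, 1.5))        # 6.0
```

With floor / floor + 1 an exact integer coordinate still has four neighbours, three of which simply receive zero weight. Whether the coinciding floor/ceil neighbours actually change the output depends on how the weights are computed, which this issue does not show.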

Why are some boxes ignored in the odgt file?

Looking through the odgt file, I find that a small minority of boxes are marked as ignored in the extra.ignored field. What is the significance of this?

No OpKernel was registered to support Op 'NMS' with these attrs

Hello,
I'm trying to run your test.py in my environment, but I hit the problem below. Did I make a mistake?
(I can import /lib/lib_kernel/lib_fast_nms/nms_op.py but can't use it.)

Caused by op 'resnet_v1_101_5/NMS', defined at:
  File "test.py", line 242, in <module>
    eval_all(args)
  File "test.py", line 162, in eval_all
    proc.start()
  File "/usr/lib/python3.4/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.4/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.4/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.4/multiprocessing/popen_fork.py", line 21, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.4/multiprocessing/popen_fork.py", line 77, in _launch
    code = process_obj._bootstrap()
  File "/usr/lib/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib/python3.4/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 107, in worker
    func, inputs = load_model(model_file, dev)
  File "test.py", line 38, in load_model
    net.inference('TEST', inputs)
  File "/home/aipr/dennis_codebase/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/network_desp.py", line 164, in inference
    anchors, num_anchors, is_tfchannel=True, is_tfnms=False)
  File "/home/aipr/dennis_codebase/light_head_rcnn/lib/detection_opr/rpn_batched/proposal_opr.py", line 95, in proposal_opr
    cur_proposals, nms_thresh, post_nms_topN)
  File "<string>", line 43, in nms
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 328, in apply_op
    op_type_name, name, **keywords)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NMS' with these attrs.  Registered devices: [CPU], Registered kernels:
  device='GPU'; T in [DT_FLOAT]

         [[Node: resnet_v1_101_5/NMS = NMS[T=DT_FLOAT, max_out=1000, nms_overlap_thresh=0.7](resnet_v1_101_5/Gather)]]

Have you compared mobilenetV2 and xceptionlike?

Is there a comparison of accuracy and speed between mobilenetV2 (mobilenetV2_05 or mobilenetV2_035) and xceptionlike?

training hang at restoring res101.ckpt

I tried training light_head_rcnn.ori_res101.coco.ps_roialign with 4 GPUs (same result with 1 GPU), and it simply hangs at the step "Restoring parameters from /home/dsu/ai/lh/data/imagenet_weights/res101.ckpt". The GPUs show 207MB of memory used and 0% utilization, while many CPU cores (about 30) run at 100%.

When I kill the training, I see this error:

File "/home/dsu/ai/lh/experiments/my/light_head_rcnn.ori_res101.coco/dataset.py", line 129, in get_data_for_singlegpu
    img = cv2.imread(image_path, cv2.IMREAD_COLOR)

Same issue with the light_head_rcnn.ori_res101.coco experiment.

Ubuntu 16.04, tf1.5, cuda9, and 1080 Ti (tf1.6 has the same issue; downgrading to 1.5 didn't make any difference).

Has anyone run into this issue?

Are the RPN and the R-CNN trained separately?

The Faster R-CNN paper says:
In the first step, we train the RPN as described above. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share conv layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared conv layers and only fine-tune the layers unique to RPN. Now the two networks share conv layers. Finally, keeping the shared conv layers fixed, we fine-tune the fc layers of the Fast R-CNN. As such, both networks share the same conv layers and form a unified network.
But I didn't find these steps in your code. Could you please explain?

Could you share a training log?

I changed a little of the code and I don't know what went wrong.
I train with tensorflow-gpu 1.2.1. The log shows [rpn_bbox_loss 10±2] and [bbox_loss 0.000]:

iter 9071, rpn_loss_cls: 0.0063, rpn_loss_box: 9.4851, loss_cls: 0.2539, loss_box: 0.0000, tot_losses: 9.7453, lr: 0.0006, speed: 0.782s/iter: 11%|█▋ | 9071/80000 [2:01:21<15:48:55, 1.25it/s]
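The log line above can be checked mechanically: the four loss terms should add up to tot_losses, which shows which term dominates. This is a minimal sketch; parse_losses is a hypothetical helper and the regular expression is an assumption based on this one line.

```python
import re

def parse_losses(line):
    """Extract 'name: value' pairs (decimal values only) from one log line."""
    pairs = re.findall(r"(\w+): (\d+\.\d+)", line)
    return {name: float(value) for name, value in pairs}

line = ("iter 9071, rpn_loss_cls: 0.0063, rpn_loss_box: 9.4851, "
        "loss_cls: 0.2539, loss_box: 0.0000, tot_losses: 9.7453, "
        "lr: 0.0006, speed: 0.782s/iter")

values = parse_losses(line)
terms = ["rpn_loss_cls", "rpn_loss_box", "loss_cls", "loss_box"]
parts_sum = sum(values[name] for name in terms)

print(round(parts_sum, 4), values["tot_losses"])
print(max(terms, key=lambda name: values[name]))
```

Here the parts sum to 9.7453, the same as tot_losses, and almost all of it comes from rpn_loss_box, while loss_box contributes exactly 0.0000.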

light_head_rcnn/lib/lib_kernel/lib_fast_nms/fast_nms.so: undefined symbol:

Hello.
I'm trying to run your test.py, but I get the problem below and I don't know what is wrong.

gt@gt:~/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign$ python3 test.py -d 0 -se 26
Traceback (most recent call last):
  File "test.py", line 16, in <module>
    import network_desp
  File "/home/gt/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/network_desp.py", line 26, in <module>
    from detection_opr.rpn_batched.proposal_opr import proposal_opr
  File "/home/gt/light_head_rcnn/lib/detection_opr/rpn_batched/proposal_opr.py", line 9, in <module>
    from lib_kernel.lib_fast_nms import nms_op
  File "/home/gt/light_head_rcnn/lib/lib_kernel/lib_fast_nms/nms_op.py", line 10, in <module>
    _nms_module = tf.load_op_library(filename)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename, status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/gt/light_head_rcnn/lib/lib_kernel/lib_fast_nms/fast_nms.so: undefined symbol: _Z15launch_gen_maskifPKfiPyRKN5Eigen9GpuDeviceE

I also completed the compilation step, shown below.

gt@gt:~/light_head_rcnn/lib$ bash make.sh
~/light_head_rcnn/lib/utils/py_faster_rcnn_utils ~/light_head_rcnn/lib
python3 setup.py build_ext --inplace
running build_ext
skipping 'bbox.c' Cython extension (up-to-date)
skipping 'nms.c' Cython extension (up-to-date)
rm -rf build
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_fast_nms ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_psroi_pooling ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_roi_pooling ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_roi_align ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_psalign_pooling ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_nms_dev ~/light_head_rcnn/lib
~/light_head_rcnn/lib
~/light_head_rcnn/lib/datasets_odgt/lib_coco/PythonAPI ~/light_head_rcnn/lib
install pycocotools to the Python site-packages
python3 setup.py build_ext install --user
running build_ext
building 'pycocotools._mask' extension
creating build
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/pycocotools
creating build/common
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.5/dist-packages/numpy/core/include -I../common -I/usr/include/python3.5m -c pycocotools/_mask.c -o build/temp.linux-x86_64-3.5/pycocotools/_mask.o -Wno-cpp -Wno-unused-function -std=c99
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.5/dist-packages/numpy/core/include -I../common -I/usr/include/python3.5m -c ../common/maskApi.c -o build/temp.linux-x86_64-3.5/../common/maskApi.o -Wno-cpp -Wno-unused-function -std=c99
creating build/lib.linux-x86_64-3.5
creating build/lib.linux-x86_64-3.5/pycocotools
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.5/pycocotools/_mask.o build/temp.linux-x86_64-3.5/../common/maskApi.o -o build/lib.linux-x86_64-3.5/pycocotools/_mask.cpython-35m-x86_64-linux-gnu.so
running install
running build
running build_py
copying pycocotools/mask.py -> build/lib.linux-x86_64-3.5/pycocotools
copying pycocotools/coco.py -> build/lib.linux-x86_64-3.5/pycocotools
copying pycocotools/__init__.py -> build/lib.linux-x86_64-3.5/pycocotools
copying pycocotools/cocoeval.py -> build/lib.linux-x86_64-3.5/pycocotools
running install_lib
copying build/lib.linux-x86_64-3.5/pycocotools/_mask.cpython-35m-x86_64-linux-gnu.so -> /home/gt/.local/lib/python3.5/site-packages/pycocotools
running install_egg_info
Removing /home/gt/.local/lib/python3.5/site-packages/pycocotools-2.0.egg-info
Writing /home/gt/.local/lib/python3.5/site-packages/pycocotools-2.0.egg-info
rm -rf build
~/light_head_rcnn/lib

The error occurs in the fast_nms.so file, but I can't read it because it is a compiled file.
I also changed my gcc and g++ version (5.4 -> 4.8), but nothing changed.
Can you help me with this error?

Error with cython_bbox

Hi there,

Awesome research and thanks for open-sourcing your code!

When trying to run train.py:

Traceback (most recent call last):
  File "train.py", line 12, in <module>
    import network_desp
  File "/home/indus/Documents/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/network_desp.py", line 23, in <module>
    from detection_opr.rpn_batched.proposal_target_layer import proposal_target_layer
  File "/home/indus/Documents/light_head_rcnn/lib/detection_opr/rpn_batched/proposal_target_layer.py", line 14, in <module>
    from utils.py_faster_rcnn_utils.cython_bbox import bbox_overlaps
ImportError: No module named cython_bbox

I looked in the lib/utils/py_faster_rcnn_utils/ folder but I couldn't find any file named cython_bbox.py. I saw a file called bbox.py so I tried changing the line to from utils.py_faster_rcnn_utils.bbox import bbox_overlaps but to no avail.

Any suggestions as to how I can resolve this issue?

Thanks!

Iteration stop randomly again

I updated msgpack-numpy and msgpack as you suggested, but it did not help. Can you tell us which system you are using? Ubuntu 16.04 automatically loses its IP after a period of time, but your code uses pip. I suspect the random stops are caused by a system problem.
The log prints as follows:
ch:0, iter:4399, rpn_loss_cls: 0.0677, rpn_loss_box: 0.0325, loss_cls: 0.3604, loss_box: 0.6708, tot_losses: 1.1314, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 43
epoch:0, iter:4400, rpn_loss_cls: 0.0281, rpn_loss_box: 0.0015, loss_cls: 0.0247, loss_box: 0.0002, tot_losses: 0.0544, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4401, rpn_loss_cls: 0.0373, rpn_loss_box: 0.0025, loss_cls: 0.0489, loss_box: 0.0004, tot_losses: 0.0892, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4402, rpn_loss_cls: 0.0223, rpn_loss_box: 0.0026, loss_cls: 0.0173, loss_box: 0.0130, tot_losses: 0.0551, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4403, rpn_loss_cls: 0.0571, rpn_loss_box: 0.0198, loss_cls: 0.2124, loss_box: 0.2481, tot_losses: 0.5374, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4404, rpn_loss_cls: 0.0359, rpn_loss_box: 0.0135, loss_cls: 0.1283, loss_box: 0.1383, tot_losses: 0.3160, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4405, rpn_loss_cls: 0.0455, rpn_loss_box: 0.0516, loss_cls: 0.1455, loss_box: 0.0754, tot_losses: 0.3181, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4406, rpn_loss_cls: 0.0611, rpn_loss_box: 0.0380, loss_cls: 0.0184, loss_box: 0.0022, tot_losses: 0.1198, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4407, rpn_loss_cls: 0.0297, rpn_loss_box: 0.0195, loss_cls: 0.0216, loss_box: 0.0106, tot_losses: 0.0814, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 44
epoch:0, iter:4408, rpn_loss_cls: 0.0397, rpn_loss_box: 0.0038, loss_cls: 0.0574, loss_box: 0.0496, tot_losses: 0.1505, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 4408/13754 [28:43<57:22, 2.72it/s]

Cannot find fast_nms.so

Hi @zengarden,
I am running the test script, but this error came out.
" lib/lib_kernel/lib_fast_nms/fast_nms.so: cannot open shared object file: No such file or directory"

Could you please tell me what is wrong?

Many thanks.

some suggestion about to accelerate speed

I found that your data format is "NHWC", but according to this page: https://www.tensorflow.org/performance/performance_guide
NHWC is the TensorFlow default and NCHW is the optimal format to use when training on NVIDIA GPUs using cuDNN.
You should use NCHW as your CNN format; your NMS and PS ROI pooling ops may also need to change.
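As a rough illustration of what the layout change means for the data itself, here is a minimal NumPy sketch. The helper names `nhwc_to_nchw` and `nchw_to_nhwc` are hypothetical and are not part of this repository; the custom ops would still need their own changes.

```python
import numpy as np

def nhwc_to_nchw(batch):
    """Reorder a (N, H, W, C) array to (N, C, H, W)."""
    return np.transpose(batch, (0, 3, 1, 2))

def nchw_to_nhwc(batch):
    """Reorder a (N, C, H, W) array back to (N, H, W, C)."""
    return np.transpose(batch, (0, 2, 3, 1))

# Example: a batch of two 600x800 RGB images.
images = np.zeros((2, 600, 800, 3), dtype=np.float32)
print(nhwc_to_nchw(images).shape)
```

Only the axis order changes; the values are the same, so converting back with `nchw_to_nhwc` recovers the original array.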

undefined symbol when import fast_nms.so

I ran make.sh in the lib directory and everything succeeded.
When I run test.py -d 0 -se 26, it reports this error.
I have searched many times but found no useful solution to this problem.
How can I figure this out?

0 fatal error: cuda/include/cuda.h: No such file or directory

Hello,
I'm running this in a docker container with cuda 9.0 and tensorflow 1.5.0 installed from pip3.
When I run make.sh it gets stuck while compiling psroi_pooling_op_gpu.cu.cc.
The exact error message is as follows:

In file included from psalign_pooling_op_gpu.cu.cc:7:0:
/usr/local/lib/python3.5/dist-packages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h:24:31: fatal error: cuda/include/cuda.h: No such file or directory

Do I have to compile tensorflow from source?

the predict procedure don't support multi-batch

The predict procedure doesn't support multi-image batches. We tested your code on an NVIDIA 1080 Ti GPU and achieved only 2 images per second, far lower than the 100 FPS reported in the original paper.

There are a lot of errors in testing. Using 1080ti,tensorflow1.5.0,python3.6

Traceback (most recent call last):
  File "test.py", line 241, in <module>
    eval_all(args)
  File "test.py", line 131, in eval_all
    func, inputs = load_model(model_file, devs[0])
  File "test.py", line 35, in load_model
    sess = tf.Session(config=tfconfig)
  File "/home/lucifer/anaconda3/envs/slam/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1509, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/lucifer/anaconda3/envs/slam/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 628, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/lucifer/anaconda3/envs/slam/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

The output(logs) folder in GoogleDrive is the same as transformed odformat in GoogleDrive

not able to run test.py

while running
root@982633c0cbbf:/dh/home/administrator/users_local/mamta/LightHead/lighthead_ROOT/light_head_rcnn/experiments/lizeming/light_head_rcnn.ori_res101.coco# python3 test.py -d 0 -se 1

getting error
2018-08-27 09:37:02.247790: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_tensor.cc:170 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /dh/home/administrator/users_local/LightHead/lighthead_ROOT/light_head_rcnn/output/root/light_head_rcnn.ori_res101.coco/eval_dump/epoch_1.ckpt

I don't understand where I can get epoch_1.ckpt.

Different in lib_fast_nms/make.sh about gcc version

Hello, I have a question in lib_fast_nms/make.sh file.
In your code,

##if you install tf using already-built binary, or gcc version 4.x, uncomment the two lines below
g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=0 -o fast_nms.so nms_op.cc
nms_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64 -L$TF_LIB -ltensorflow_framework -I$TF_INC/external/nsync/public

#for gcc5-built tf
#g++ -std=c++11 -shared -D_GLIBCXX_USE_CXX11_ABI=1 -o roi_pooling.so roi_pooling_op.cc
roi_pooling_op.cu.o -I $TF_INC -fPIC -lcudart -L $CUDA_PATH/lib64

So if TF is installed from an already-built binary, or with gcc version 4.x, the script builds the fast_nms.so file,
and for a gcc5-built TF it builds the roi_pooling.so file.

I don't understand why a different file is created depending on the gcc version.

difference between nms_fast and standard nms

Hi, thank you for your great work. I have a very quick question, what's the difference between nms_fast and standard nms? Thanks.
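For reference while comparing the two, the textbook greedy NMS can be sketched in NumPy as below. This is only the standard algorithm under common assumptions (boxes as [x1, y1, x2, y2], a single IoU threshold); it is not the repository's fast_nms operator, and the function name `greedy_nms` is hypothetical.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Standard greedy NMS; returns indices of kept boxes, best score first."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        union = areas[i] + areas[rest] - inter
        iou = inter / np.maximum(union, 1e-12)
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep
```

The loop is sequential over kept boxes, which is the part that GPU implementations typically try to parallelize.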

version of cuda?

Hello, thank you for your work. I hit an error when compiling the lib files.
I wonder whether I am using the wrong version of CUDA; I used tf1.5.0 + cuda8.0 + py3. The error is:
..................................................................................................................
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(701): error: identifier "__ballot_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(720): error: identifier "__shfl_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(742): error: identifier "__shfl_up_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(758): error: identifier "__shfl_down_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(770): error: identifier "__shfl_down_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(786): error: identifier "__shfl_xor_sync" is undefined
/usr/local/lib/python3.6/sitepackages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h(798): error: identifier "__shfl_xor_sync" is undefined
7 errors detected in the compilation of "/tmp/tmpxft_0000050e_00000000-7_psroi_pooling_op_gpu.cu.cpp1.ii".
g++: error: psroi_pooling_op.cu.o: No such file or directory
...........................................................................................................................................................................
I guess the error is caused by the wrong version of CUDA: the undefined functions "__ballot_sync", "__shfl_sync", ... are provided by CUDA 9.0, but I can only use CUDA 8.0 because of the GPU machine.
How can I solve this problem other than by changing the CUDA version?

Cannot download odformat from google drive

I tried to download the odformat file from the link but received a redirection error. I have tried many times and still cannot download it.

Does anyone have a similar issue? I am trying to download it from Singapore.

g++ error

When I run bash make.sh, the following error occurs:
~/light_head_rcnn/lib
~/light_head_rcnn/lib/lib_kernel/lib_fast_nms ~/light_head_rcnn/lib
make.sh: 5: make.sh: nvcc: not found
g++: error: nms_op.cu.o: No such file or directory

What is the problem?

Odformat available for COCO2017 data?

Is there any way to transform the COCO dataset to odformat?
Or is there an already transformed odformat for instances_train2017 and instances_val2017?

Thanks!

The difference between tensorflow 1.5.0's nms and earlier versions?

anyone run on coco data

I ran test.py on COCO data using the epoch_26.ckpt provided by the author, but I still get very poor results, around 4% on COCO. Has anyone checked on COCO data? Please let me know; maybe I am making a mistake, because the author claims 40% on COCO.

Where is the combination of FPN and Light_head RCNN.

I see this in the paper, but I cannot find it in the code.

Unit with subsampling in a block

Is the position of the subsampling unit in each resnet block correct?

In your network you subsample at the beginning of each block:
https://github.com/zengarden/light_head_rcnn/blob/master/experiments/lizeming/light_head_rcnn.ori_res101.coco.ps_roialign/network_desp.py#L93

But in the original tensorflow implementation they subsample at the end of each block:

in the current implementation we subsample the output activations in the last residual unit of
each block, instead of subsampling the input activations in the first residual unit of each block.

See:
https://github.com/tensorflow/models/blob/master/research/slim/nets/resnet_utils.py#L30
https://github.com/tensorflow/models/blob/master/research/slim/nets/resnet_v1.py#L271

Is it alright to use their pretrained resnet-101 model in this case?
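To make the two placements concrete, here is a small sketch of the per-unit strides and the resulting spatial sizes within one block. The helper names `unit_strides` and `trace_sizes` are hypothetical, and the size arithmetic assumes ceil division as with "SAME" padding; neither implementation's actual code is reproduced here.

```python
import math

def unit_strides(num_units, block_stride, subsample_at):
    """Per-unit strides for one block with subsampling in the first or last unit."""
    if subsample_at == "first":
        return [block_stride] + [1] * (num_units - 1)
    if subsample_at == "last":
        return [1] * (num_units - 1) + [block_stride]
    raise ValueError("subsample_at must be 'first' or 'last'")

def trace_sizes(input_size, strides):
    """Spatial size after each unit, assuming ceil(size / stride) per unit."""
    sizes = []
    size = input_size
    for stride in strides:
        size = math.ceil(size / stride)
        sizes.append(size)
    return sizes

# A 3-unit block with stride 2 on a 56x56 input.
print(trace_sizes(56, unit_strides(3, 2, "first")))
print(trace_sizes(56, unit_strides(3, 2, "last")))
```

Both placements end the block at the same resolution, but the intermediate units run at different resolutions, which is why weights trained with one placement may not transfer cleanly to the other.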

tensorflow.python.framework.errors_impl.NotFoundError:

When I run python3 train.py, there is an error like this:
'tensorflow.python.framework.errors_impl.NotFoundError: ~/light_head_rcnn-master/lib/lib_kernel/lib_fast_nms/fast_nms.so: undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumE'. How can I solve this problem?

How to visualize the loss curve？

Does someone use Tensorboard or other ways to get the loss curve?
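One option that needs no change to the training code, assuming the console output was saved to a text file, is to parse the loss values out of the log. The sketch below relies on the `iter:` and `tot_losses:` fields as they appear in the training logs posted in other issues here; the function name `parse_losses` is hypothetical.

```python
import re

# Matches "iter:<number>, ... tot_losses: <value>" within one log record.
LOG_PATTERN = re.compile(r"\biter:(\d+),.*?tot_losses:\s*([0-9.]+)")

def parse_losses(log_text):
    """Return a list of (iteration, total_loss) pairs found in the log text."""
    return [(int(it), float(loss)) for it, loss in LOG_PATTERN.findall(log_text)]

sample = (
    "epoch:0, iter:4400, rpn_loss_cls: 0.0281, rpn_loss_box: 0.0015, "
    "loss_cls: 0.0247, loss_box: 0.0002, tot_losses: 0.0544, lr: 0.0006, "
    "speed: 0.391s/iter: 32%"
)
print(parse_losses(sample))
```

The resulting pairs can be passed to any plotting library to draw the curve; the same pattern can be adapted to the other loss fields.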

Images can not be shown

When I start test.py the following way

python3 test.py -s -se 26

It fails at the point it should view the processed image. Returns a couple of X Errors like

X Error: BadAccess (attempt to access private resource denied) 10
   Extension:    130 (MIT-SHM)
   Minor opcode: 1 (X_ShmAttach)
   Resource id:  0x4600003

X Error: BadShmSeg (invalid shared segment parameter) 128
   Extension:    130 (MIT-SHM)
   Minor opcode: 3 (X_ShmPutImage)
   Resource id:  0x4600008

X Error: BadAccess (attempt to access private resource denied) 10
   Extension:    130 (MIT-SHM)
   Minor opcode: 1 (X_ShmAttach)
   Resource id:  0x149

X Error: BadShmSeg (invalid shared segment parameter) 128
   Extension:    130 (MIT-SHM)
   Minor opcode: 2 (X_ShmDetach)
   Resource id:  0x149

Is anyone having the same or a similar problem?

#34 Can be closed.
Xhost was not configured for docker. My mistake - sorry for that!!

where is xception like network code which is written in original paper?

On what GPU have you received FPS of 95/102?

First of all, thanks for the paper, I enjoyed reading it.
In table 8 you compare FPS and COCO's AP with other CNNs, but I can't find the GPU in use.
Could you please share on what GPU these FPS values were obtained?
Is it Titan XP? If so, were the FPS values of the rest of the CNNs (YOLO, SSD, DSSD, etc) also computed on this GPU? (original values from corresponding papers are of Titan X Maxwell if I'm not wrong)

Thanks!

Why is mAP low when training on VOC?

I trained on my own dataset, which has only three classes, and the mAP is 78%.
But when I trained on VOC0712, which has 20 classes, the mAP is only 21%.
I guess the reason is that I used only one 1080 Ti, resulting in serious under-fitting.
