
Comments (20)

ZhihuaGao commented on September 15, 2024

I think whether the warping happens before or after the embedding doesn't matter, because the warping operation doesn't contain any learnable parameters.
That's just my personal opinion; I hope it helps.

from flow-guided-feature-aggregation.

FCInter commented on September 15, 2024

According to my test case, I'm afraid it really does matter: when I build the training network and load the test checkpoint, the model does not converge well. Moreover, although the warping operation contains no parameters, it does change the feature map. That is, warping first and then embedding yields a very different feature map than embedding first and then warping.
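This can be checked with a toy example: because the embedding is nonlinear, warping and embedding do not commute in general. Everything below is made up for illustration; the real embedding is a small conv subnetwork and the real warping is flow-guided bilinear resampling of 2-D feature maps.

```python
import numpy as np

def warp(f, shift=0.5):
    """Bilinear resampling of a 1-D signal by a constant flow `shift`."""
    idx = np.arange(len(f)) - shift
    lo = np.clip(np.floor(idx).astype(int), 0, len(f) - 1)
    hi = np.clip(lo + 1, 0, len(f) - 1)
    w = idx - np.floor(idx)
    return (1 - w) * f[lo] + w * f[hi]

def embed(f):
    """Toy nonlinear 'embedding': ReLU after a 2-tap filter."""
    return np.maximum(np.convolve(f, [1.0, -1.0], mode='same'), 0.0)

f = np.array([0.0, 1.0, 0.0, 2.0, 0.0])
a = embed(warp(f))   # warp first, then embed
b = warp(embed(f))   # embed first, then warp
print(np.allclose(a, b))  # prints False
```

So even with identical (parameter-free) warping, the two orders produce different aggregated features, which is consistent with the checkpoint not transferring cleanly between the two graph layouts.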


ZhihuaGao commented on September 15, 2024

Really? I have trained and tested the network and it works well.
Could you show your logs?


FCInter commented on September 15, 2024

@aresgao I have updated the issue and posted the logs printed during training. The problem is that I cannot get good results when I continue training from the demo checkpoint provided in the README. The demo checkpoint itself yields very good results, but when I continue training from it, the results become terrible. Though I only trained for 4k iterations, I would expect that, since the initial checkpoint is already good, I should not need many more iterations.
BTW, I'm curious why we are advised to train the model from the ResNet-101 and FlowNet checkpoints instead of training directly from the demo checkpoint.
I also tried training from the ResNet-101 and FlowNet checkpoints for 100k+ iterations, and the performance was even worse.

Thank you for your patience and kindness in helping me!


ZhihuaGao commented on September 15, 2024

That's really strange. I trained from the ResNet-101 and FlowNet checkpoints for 100k+ iterations and the performance was even better; here are the test results:
motion [0.0 1.0], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.7648
motion [0.0 0.7], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.5727
motion [0.7 0.9], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.7515
motion [0.9 1.0], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.8444


txf201604 commented on September 15, 2024

@aresgao Can you help me?
I have a problem with "sh ./init.sh":

    Traceback (most recent call last):
      File "setup_linux.py", line 63, in <module>
        CUDA = locate_cuda()
      File "setup_linux.py", line 58, in locate_cuda
        for k, v in cudaconfig.iteritems():
    AttributeError: 'dict' object has no attribute 'iteritems'

If you can reply in time, I will be very grateful.
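For what it's worth, this particular error comes from running a Python 2 script under Python 3: dict.iteritems() was removed in Python 3, while items() works in both. A minimal sketch of the fix (the cudaconfig contents here are an illustrative stand-in for what setup_linux.py actually builds):

```python
# dict.iteritems() was removed in Python 3; items() exists in both 2 and 3,
# so swapping it in keeps setup_linux.py working either way.
# `cudaconfig` is an illustrative stand-in for the dict the script builds.
cudaconfig = {'home': '/usr/local/cuda',
              'nvcc': '/usr/local/cuda/bin/nvcc'}

# was: for k, v in cudaconfig.iteritems():
for k, v in cudaconfig.items():
    print(k, v)
```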


FCInter commented on September 15, 2024

@aresgao What version of mxnet are you using? I was wondering if it's caused by the version, since I got a bug because of the wrong version I was using.


ZhihuaGao commented on September 15, 2024

I use the latest version of mxnet @FCInter


ZhihuaGao commented on September 15, 2024

@txf201604
The function locate_cuda() finds where your CUDA is installed; I suggest you check your CUDA location.

def locate_cuda():
    """Locate the CUDA environment on the system
    Returns a dict with keys 'home', 'nvcc', 'include', and 'lib64'
    and values giving the absolute path to each directory.
    Starts by looking for the CUDAHOME env variable. If not found, everything
    is based on finding 'nvcc' in the PATH.
    """
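For reference, here is a self-contained sketch of what a function with that docstring typically does. It follows the behavior the docstring describes (prefer CUDAHOME, otherwise derive the install root from nvcc on the PATH); the error messages and helper name are illustrative, not the repository's exact code.

```python
import os
from os.path import abspath, dirname, exists, join

def find_in_path(name, path):
    """Return the absolute path of `name` in the first matching
    directory of the search `path`, or None if it is not found."""
    for directory in path.split(os.pathsep):
        candidate = join(directory, name)
        if exists(candidate):
            return abspath(candidate)
    return None

def locate_cuda():
    """Locate CUDA: prefer the CUDAHOME env variable, otherwise
    derive the install root from 'nvcc' found on the PATH."""
    if 'CUDAHOME' in os.environ:
        home = os.environ['CUDAHOME']
        nvcc = join(home, 'bin', 'nvcc')
    else:
        nvcc = find_in_path('nvcc', os.environ.get('PATH', ''))
        if nvcc is None:
            raise EnvironmentError('nvcc not found: set CUDAHOME or add nvcc to PATH')
        home = dirname(dirname(nvcc))
    cudaconfig = {'home': home,
                  'nvcc': nvcc,
                  'include': join(home, 'include'),
                  'lib64': join(home, 'lib64')}
    for k, v in cudaconfig.items():  # items(), not iteritems(), under Python 3
        if not exists(v):
            raise EnvironmentError('CUDA %s path %s does not exist' % (k, v))
    return cudaconfig
```

So if the traceback above fires inside this function, either CUDA is genuinely not where the script expects it, or (as in the iteritems case) the script is being run under the wrong Python version.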

FCInter commented on September 15, 2024

@aresgao Finally I got good results after training for 2 complete epochs!!!

I just have one last question. I find that when saving the checkpoint at the end of each epoch, the following code is used to create two new weights, rfcn_bbox_weight_test and rfcn_bbox_bias_test.

arg['rfcn_bbox_weight_test'] = weight * mx.nd.repeat(mx.nd.array(stds), repeats=repeat).reshape((bias.shape[0], 1, 1, 1))
arg['rfcn_bbox_bias_test'] = arg['rfcn_bbox_bias'] * mx.nd.repeat(mx.nd.array(stds), repeats=repeat) + mx.nd.repeat(mx.nd.array(means), repeats=repeat)

Why do we need to do this?
I have verified that if I do not do this, the checkpoint makes terrible predictions on the test data. This is also why my previous predictions were bad even though the training loss looked good.
Thank you!
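My guess at the rationale (an assumption based on common R-FCN / Faster R-CNN practice, not confirmed from this repository's docs): during training the bbox regression targets are normalized per coordinate as (target - mean) / std, so raw predictions must be de-normalized at test time via deltas = pred * std + mean. Rather than de-normalizing every prediction, the code above folds that affine transform into a "_test" copy of the last layer's weight and bias, so the saved test layer emits un-normalized deltas directly. A NumPy sketch with illustrative shapes and values:

```python
import numpy as np

# Illustrative: num_classes sets of 4 box coordinates per output vector.
num_classes, feat_dim = 2, 16
means = np.tile([0.0, 0.0, 0.0, 0.0], num_classes)   # per-output means
stds = np.tile([0.1, 0.1, 0.2, 0.2], num_classes)    # per-output stds

rng = np.random.default_rng(0)
W = rng.normal(size=(4 * num_classes, feat_dim))     # trained weight
b = rng.normal(size=4 * num_classes)                 # trained bias
x = rng.normal(size=feat_dim)                        # a feature vector

t_norm = W @ x + b                  # network output in normalized space
deltas = t_norm * stds + means      # de-normalized box deltas

W_test = W * stds[:, None]          # fold std into each weight row
b_test = b * stds + means           # fold std and mean into the bias
assert np.allclose(W_test @ x + b_test, deltas)
```

This would also explain the symptom: without the "_test" weights, the deployed network outputs normalized deltas, which decode to wildly wrong boxes even though the training loss is fine.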


samanthawyf commented on September 15, 2024

Hi @aresgao @FCInter @YuwenXiong, I tried training and inference with the code. I used 4 GPUs with all settings unchanged, and the final mAP is 75.78. I am confused about the mAP drop. Did you change any settings, or do you have any advice for my case?


Feywell commented on September 15, 2024

That's really strange. I trained from the ResNet-101 and FlowNet checkpoints for 100k+ iterations and the performance was even better; here are the test results:
motion [0.0 1.0], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.7648
motion [0.0 0.7], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.5727
motion [0.7 0.9], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.7515
motion [0.9 1.0], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.8444

@aresgao Hi! I have just one GPU (a 1080 Ti) and only get mAP = 0.7389 testing with the default settings.
How did you get a better mAP? Can you share your settings in detail, such as epochs, min_diff/max_diff, lr, GPUs, test key frame, and so on?
Thank you!


withinnoitatpmet commented on September 15, 2024

@Feywell Hi Feywell, I have tested the default setting with 2 GPUs and with 4 GPUs; the 4-GPU result is much better than the 2-GPU one. P.S. lr = 0.00025 is equivalent to the 0.001 described in the paper. You can find more details in their code.


Feywell commented on September 15, 2024

@withinnoitatpmet Thank you! So if I have just one GPU, would setting lr = 0.001 be better?


withinnoitatpmet commented on September 15, 2024

@Feywell I think the result could be even worse. Considering the usual relation between batch size and learning rate (I don't know whether it still holds for batch sizes this small), lr should stay at 0.00025.
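The batch-size/learning-rate relation mentioned here is presumably the linear scaling heuristic: keep lr proportional to the total batch size. A sketch, assuming a fixed per-GPU batch size and taking the 0.001 figure and the multi-GPU baseline from the comments above (both are this thread's numbers, not something from the paper I have verified):

```python
def scaled_lr(paper_lr, paper_gpus, my_gpus):
    """Linear-scaling heuristic: keep lr / total_batch_size constant,
    assuming a fixed per-GPU batch size."""
    return paper_lr * my_gpus / paper_gpus

# If the paper's 0.001 corresponds to a 4-GPU batch, one GPU gets 0.00025:
print(scaled_lr(0.001, 4, 1))  # prints 0.00025
```

Under that reading, raising lr to 0.001 on a single GPU would quadruple the effective learning rate rather than reproduce the paper's setting.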


jucaowei commented on September 15, 2024

That's really strange. I trained from the ResNet-101 and FlowNet checkpoints for 100k+ iterations and the performance was even better; here are the test results:
motion [0.0 1.0], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.7648
motion [0.0 0.7], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.5727
motion [0.7 0.9], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.7515
motion [0.9 1.0], area [0.0 0.0 100000.0 100000.0], Mean AP@0.5 = 0.8444

@aresgao Hi, I want to know exactly how many epochs you trained the model for. I trained it for 2 epochs and got a result of about 73.16%. Also, why does the paper always talk about iterations rather than epochs?
I hope to hear from you, thank you.
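On iterations vs. epochs: the two are interchangeable once the dataset size and batch size are fixed, which may be why papers report the more precise iteration count. The conversion is just (numbers below are illustrative, not the actual ImageNet VID training-set size):

```python
def iterations(num_epochs, num_images, total_batch_size):
    """Convert epochs to iterations: one iteration = one mini-batch."""
    return num_epochs * num_images // total_batch_size

# Hypothetical example: 2 epochs over 100k sampled frames, batch size 4.
print(iterations(2, 100000, 4))  # prints 50000
```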


Feywell commented on September 15, 2024

@aresgao Finally I got good results after training for 2 complete epochs!!!

I just have one last question. I find that when saving the checkpoint at the end of each epoch, the following code is used to create two new weights, rfcn_bbox_weight_test and rfcn_bbox_bias_test.

arg['rfcn_bbox_weight_test'] = weight * mx.nd.repeat(mx.nd.array(stds), repeats=repeat).reshape((bias.shape[0], 1, 1, 1))
arg['rfcn_bbox_bias_test'] = arg['rfcn_bbox_bias'] * mx.nd.repeat(mx.nd.array(stds), repeats=repeat) + mx.nd.repeat(mx.nd.array(means), repeats=repeat)

Why do we need to do this?
I have verified that if I do not do this, the checkpoint makes terrible predictions on the test data. This is also why my previous predictions were bad even though the training loss looked good.
Thank you!

Hi @FCInter, do you know why arg['rfcn_bbox_weight_test'] is created here? I am trying to change the detection network to Light-Head, so I did not keep arg['rfcn_bbox_weight_test'], but I got a bad result. Do you know what arg['rfcn_bbox_weight_test'] means?

