OpenGVLab / DCNv4
[CVPR 2024] Deformable Convolution v4
Home Page: https://arxiv.org/pdf/2401.06197.pdf
License: MIT License
I have been using PyTorch's native Automatic Mixed Precision (AMP) feature and specified the data type as bf16 (bfloat16). However, I encountered an error during the backward propagation step. I would like to know if this is due to a lack of framework support or if I made an error in my usage.
Hi! Thx for your great work!!
Could you please provide a pure Python/PyTorch implementation of DCNv4?
It would be easier to migrate and run on a non-CUDA device.
According to the README and https://huggingface.co/OpenGVLab/DCNv4/tree/main, the checkpoint dino_4scale_flash_internimage_b_1x_coco.pth is 1.39GB, while dino_4scale_flash_internimage_l_1x_coco.pth, which uses the supposedly larger backbone, is only 964MB. Is this expected?
Given that the FlashInternImage Large backbone alone (flash_intern_image_l_22k_384.pth) is already 1.25GB, I suspect the checkpoint was saved incorrectly.
The paper makes an interesting statement: "Our observations indicate that substituting the previously used DWConv or Attention with our DCNv4 leads to an increase in inference speed".
Could you provide the implementation details of "substituting the attention with DCNv4"?
I built the DCNv4_op project and installed it via pip install -e .
However, when I try to use it in my project, I get the following error:
(mmd) zrway@zrpgs:/data/yolov8_proj$ python /data/yolov8_proj/project/test_model.py
from n params module arguments
0 -1 1 1392 ultralytics.nn.modules.conv.Conv [3, 48, 3, 2]
1 -1 1 41664 ultralytics.nn.modules.conv.Conv [48, 96, 3, 2]
2 -1 2 130513 ultralytics.nn.modules.customed.PRC2f.PRC2f [96, 96, 2, True]
3 -1 1 166272 ultralytics.nn.modules.conv.Conv [96, 192, 3, 2]
4 -1 4 888481 ultralytics.nn.modules.customed.PRC2f.PRC2f [192, 192, 4, True]
5 -1 1 664320 ultralytics.nn.modules.conv.Conv [192, 384, 3, 2]
6 -1 4 3546433 ultralytics.nn.modules.customed.PRC2f.PRC2f [384, 384, 4, True]
7 -1 1 1991808 ultralytics.nn.modules.conv.Conv [384, 576, 3, 2]
8 -1 2 3512225 ultralytics.nn.modules.customed.PRC2f_DCN_v4.PRC2f_DCN_v4[576, 576, 2, True]
9 -1 1 1670992 ultralytics.nn.modules.customed.SPPF_DCN.SPPF_DCN_v4[576, 576, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 2 2660929 ultralytics.nn.modules.customed.PRC2f.PRC2f [960, 384, 2]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 2 703777 ultralytics.nn.modules.customed.PRC2f.PRC2f [576, 192, 2]
16 -1 1 332160 ultralytics.nn.modules.conv.Conv [192, 192, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 2 2366017 ultralytics.nn.modules.customed.PRC2f.PRC2f [576, 384, 2]
19 -1 1 1327872 ultralytics.nn.modules.conv.Conv [384, 384, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 2 5429089 ultralytics.nn.modules.customed.PRC2f.PRC2f [960, 576, 2]
22 [15, 18, 21] 1 3822016 ultralytics.nn.modules.head.Detect [80, [192, 384, 576]]
Traceback (most recent call last):
File "/data/yolov8_proj/project/test_model.py", line 26, in <module>
model = YOLO('yolov8m.yaml').load('/data/yolov8_proj/project/yolov8m.pt')
File "/data/yolov8_proj/ultralytics/ultralytics/engine/model.py", line 92, in __init__
self._new(model, task)
File "/data/yolov8_proj/ultralytics/ultralytics/engine/model.py", line 142, in _new
self.model = (model or self._smart_load("model"))(cfg_dict, verbose=verbose and RANK == -1) # build model
File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 292, in __init__
m.stride = torch.tensor([s / x.shape[-2] for x in forward(torch.zeros(1, ch, s, s))]) # forward
File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 291, in <lambda>
forward = lambda x: self.forward(x)[0] if isinstance(m, (Segment, Pose, OBB)) else self.forward(x)
File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 84, in forward
return self.predict(x, *args, **kwargs)
File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 102, in predict
return self._predict_once(x, profile, visualize, embed)
File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 123, in _predict_once
x = m(x) # run
File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/data/yolov8_proj/ultralytics/ultralytics/nn/modules/customed/PRC2f_DCN_v4.py", line 177, in forward
y.extend(m(y[-1]) for m in self.m)
File "/data/yolov8_proj/ultralytics/ultralytics/nn/modules/customed/PRC2f_DCN_v4.py", line 177, in <genexpr>
y.extend(m(y[-1]) for m in self.m)
File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/data/yolov8_proj/ultralytics/ultralytics/nn/modules/customed/PRC2f_DCN_v4.py", line 49, in forward
return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/data/yolov8_proj/ultralytics/ultralytics/nn/modules/customed/PRC2f_DCN_v4.py", line 27, in forward
x = self.dcnv4(
File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/data/yolov8_proj/deps/DCNv4/DCNv4_op/DCNv4/modules/dcnv4.py", line 130, in forward
x = DCNv4Function.apply(
File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 86, in decorate_fwd
return fwd(*args, **kwargs)
File "/data/yolov8_proj/deps/DCNv4/DCNv4_op/DCNv4/functions/dcnv4_func.py", line 103, in forward
output = ext.dcnv4_forward(*args)
RuntimeError: Not implemented on the CPU
I'd appreciate it if you could provide some help to solve this.
I loaded the checkpoint file "dino_4scale_flash_internimage_t_1x_coco.pth" and it reported the error below. I downloaded it many times, but it still does not work.
"RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory"
I also tried the dino_large_coco and maskrcnn_tiny_coco files, and those load without problems.
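Checkpoints written by recent PyTorch versions are zip archives, and "failed finding central directory" is what a zip reader reports when the index at the end of the file is missing, which is typical of a truncated or incomplete file. The sketch below checks the archive structure with the standard library only, before torch.load is involved. The function name check_checkpoint_archive is hypothetical, and a negative result can also mean the file uses the older non-zip format, so treat it as a hint and not a verdict.

```python
import zipfile


def check_checkpoint_archive(path):
    """Return (ok, message) describing whether `path` is an intact zip archive."""
    # is_zipfile looks for the end-of-central-directory record near the end
    # of the file; a truncated download normally does not have one.
    if not zipfile.is_zipfile(path):
        return False, "not a readable zip archive (truncated file or legacy format)"
    with zipfile.ZipFile(path) as archive:
        # testzip reads every member and returns the first one with a bad CRC.
        bad_member = archive.testzip()
    if bad_member is not None:
        return False, f"corrupt member: {bad_member}"
    return True, "archive structure looks intact"
```

If the check fails for a freshly downloaded file every time, comparing the local file size with the size shown on the hosting page is a reasonable next step.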
1. n_points = 2, error: invalid k
2. n_points = 8, error: CUDA error: an illegal memory access was encountered
DCNv4/functions/dcnv4_func.py", line 125, in backward
ext.dcnv4_backward(*args)
RuntimeError: expected scalar type Float but found BFloat16
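Adding DCNV2 to the last few stages of ResNet is a common way to improve accuracy in object detection and similar models. Can DCNV4 be used the same way? I have seen others say that DCNV4 does not support downsampling.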
"RuntimeError: batch%im2col_step==0 INTERNAL ASSERT FAILED at './DCNv4_op/src/cuda/dcnv4_cuda.cu':55, please report a bug to PyTorch. batch(504) must divide im2col_step(256)"
During testing I set the batch size to 512, but the size of the test set is not necessarily a multiple of 256; for example, the last batch has 504 samples.
How can I solve this problem?
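The assertion says the batch size must be an exact multiple of im2col_step. Two generic ways around it are to drop the incomplete last batch, or to pick a step that divides whatever batch arrives. The helper below is a minimal sketch of the second option; largest_valid_step is a hypothetical name, and how the chosen value would be handed to the DCNv4 module is not shown here and would have to be checked against the module code.

```python
def largest_valid_step(batch_size, max_step=256):
    """Largest divisor of batch_size that does not exceed max_step."""
    for step in range(min(batch_size, max_step), 0, -1):
        if batch_size % step == 0:
            return step
    raise ValueError("batch_size must be a positive integer")


print(largest_valid_step(512))  # 256
print(largest_valid_step(504))  # 252
```

For the batch of 504 from the report, 252 divides it exactly, whereas the fixed value of 256 does not.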
Thank you very much for your contribution!
When I run the following code:
import DCNv4
help(DCNv4)
The following error message has occurred:
Traceback (most recent call last):
File "mytest.py", line 10, in <module>
import DCNv4
File "/home/coop/anaconda3/envs/v2xvit/lib/python3.7/site-packages/DCNv4/__init__.py", line 1, in <module>
from .functions import DCNv4Function, FlashDeformAttnFunction
File "/home/coop/anaconda3/envs/v2xvit/lib/python3.7/site-packages/DCNv4/functions/__init__.py", line 10, in <module>
from .flash_deform_attn_func import FlashDeformAttnFunction
File "/home/coop/anaconda3/envs/v2xvit/lib/python3.7/site-packages/DCNv4/functions/flash_deform_attn_func.py", line 34, in <module>
raise NotImplementedError
NotImplementedError
free(): invalid pointer
Aborted (core dumped)
I tried installing DCNv4 both with pip and by compiling it with make.sh, but the problem persists.
Hi, thanks for the great repository.
Any plans to add CPU modules?
thanks
Who can provide the Windows compilation environment for dcnv4?
Is there an example of successful compilation on Windows system?
Amazing engineering implementation! I have a question, though. I compared DCNv3 and DCNv4 on an RTX 2080 Ti with PyTorch 1.13.1. DCNv3 uses the built-in PyTorch function, and the measured results (total seconds for 100 forward passes) are shown below. Does DCNv4 have no speed advantage when there are fewer channels and a higher resolution? What could be the reason?
float32
dcnv3: 2.234679698944092
dcnv4: 2.8860323429107666
float16
dcnv3: 1.4734604358673096
dcnv4: 2.88419246673584
The code is:
import time
import torch
# DCNv3 and DCNv4 are assumed to be imported from their respective packages.

block1 = DCNv3(64, 64, group=4).cuda().eval()
block2 = DCNv4(64, 64, group=4).cuda().eval()
a = torch.rand(4, 64, 1920 // 4, 1088 // 4).cuda()
offset = torch.rand(4, 64, 1920 // 4, 1088 // 4).cuda()

# Warm-up so that one-time initialization is not timed.
for i in range(10):
    with torch.no_grad():
        block1(a, offset)
        block2(a, offset)

# Time 100 forward passes of DCNv3.
torch.cuda.synchronize()
start_time = time.time()
for i in range(100):
    with torch.no_grad():
        block1(a, offset)
torch.cuda.synchronize()
print('dcnv3:', time.time() - start_time)

# Time 100 forward passes of DCNv4.
torch.cuda.synchronize()
start_time = time.time()
for i in range(100):
    with torch.no_grad():
        block2(a, offset)
torch.cuda.synchronize()
print('dcnv4:', time.time() - start_time)
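The measurement pattern in the snippet above (warm-up, synchronize, time a fixed number of calls, synchronize again) can be factored into a small reusable helper so both blocks are timed in exactly the same way. This is only a sketch: time_call and its parameters are hypothetical names, and it uses time.perf_counter from the standard library in place of time.time.

```python
import time


def time_call(fn, warmup=10, iters=100, sync=None):
    """Average seconds per call of fn(), measured after a warm-up phase.

    sync is an optional zero-argument callable that blocks until queued
    work has finished; for CUDA code that would be torch.cuda.synchronize.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work is done before the clock starts
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to finish before reading the clock
    return (time.perf_counter() - start) / iters
```

With the objects from the report, a call would look like time_call(lambda: block2(a, offset), sync=torch.cuda.synchronize), run inside a torch.no_grad() block.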
In comparison to DCNV3, it seems that the DCNV4 code is missing the offset variable. Could you tell me where in the DCNV4 code implementation the deformation of convolutional positions is achieved?
Dear author, hello. I compiled DCNv4 on Windows 11. After running the command python setup.py build install, the following errors appeared. What is the problem?
14 errors detected in the compilation of "C:Desktop/yolov8/ultralytics/nn/extra_modules/DCNv4_op/src/cuda/dcnv4_cuda.cu".
error: command 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\nvcc.exe' failed with exit code 1
Dear Author,
In the module file, the DCNv4 function is called with 6 parameters. However, the forward function is given 8 parameters. Could you correct it?
mask2former_flash_internimage_b_640_160k_ade20k_ss.pth file seems broken.
Does DCNv4 support torch 2.0?
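When the apply function is called here, execution enters autocast_decorator, which passes in self as well, so the flashdeforable object itself is added as the first argument. As a result args has 8 entries, and the call to forward fails with "takes from 6 to 7 positional arguments but 8 were given". Please fix this bug. I am using this module in ultralytics.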
Dear everyone
1. Many people have encountered environment installation problems, so let me be the terminator to make the installation easier for everyone.
2. After installation, many people don't know how to modify the model. I have a group where everyone can come together and discuss together to speed up understanding and publish papers.
+V:13862010554 build
+V:wechat group together
what's the problem? HELP!
By the way, my environment is VS2017, torch 2.0.1+cu118, torchvision 0.15.2+cu118.
Dear author, hello. I compiled DCNv4 on Windows 11. After running the command python setup.py build install, the following errors appeared. What is the problem?
16 errors detected in the compilation of "C:/Users/YG/Desktop/lishan/dcn/DCNv4_op/src/cuda/dcnv4_cuda.cu".
error: command 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\nvcc.exe' failed with exit code 1
Hello author, thank you for your work. I am using a 4090 graphics card and an Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. I found that the memory consumption of DCNV4 is indeed smaller than DCNV3, but the GPU utilization has decreased, and its training speed has become slower. What could be the reason for this?
Encountered NaN loss when training a classification task on ImageNet.
Cuda: 11.6
Torch: 1.12.1+cu116
Timm: 0.6.11
Any help is appreciated! Thanks
Hello,
Thanks for releasing the open source code. I tried to follow the instructions (DCNv4/segmentation/readme.md) to install DCNv4, but encountered an error message that says "ERROR: Could not find a version that satisfies the requirement DCNv4==latest ERROR: No matching distribution found for DCNv4==latest". When I typed "pip install DCNv4", I encountered the following error message:
site-packages/torch/cuda/init.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-install-kox_k2e6/dcnv4_191e7104a4a646c5a96941957ab1b8ee/setup.py", line 70, in
ext_modules=get_extensions(),
File "/tmp/pip-install-kox_k2e6/dcnv4_191e7104a4a646c5a96941957ab1b8ee/setup.py", line 48, in get_extensions
raise NotImplementedError('Cuda is not available')
NotImplementedError: Cuda is not available
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.2'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Could you give me some hints for this problem? In any case, thanks a lot!
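In the log above, setup.py stops with "Cuda is not available" right after PyTorch warns that CUDA initialization failed (Error 804), so the build ends before anything is compiled. A quick way to see what the build environment looks like is to gather the relevant facts in one place. The sketch below uses only the standard library plus torch if it is installed; cuda_build_preflight is a hypothetical helper, and it reports the state without diagnosing the driver problem behind Error 804.

```python
import os
import shutil


def cuda_build_preflight():
    """Collect basic facts a CUDA extension build depends on."""
    report = {
        "CUDA_HOME": os.environ.get("CUDA_HOME"),  # toolkit location, if set
        "nvcc": shutil.which("nvcc"),              # compiler found on PATH, or None
        "torch_cuda_available": None,              # stays None if torch is missing
        "torch_cuda_version": None,
    }
    try:
        import torch
    except ImportError:
        return report
    report["torch_cuda_available"] = torch.cuda.is_available()
    report["torch_cuda_version"] = torch.version.cuda
    return report


for key, value in cuda_build_preflight().items():
    print(f"{key}: {value}")
```

If torch_cuda_available is False on a machine that does have an NVIDIA GPU, the mismatch is between the driver and the CUDA runtime that PyTorch was built with, and reinstalling DCNv4 alone is unlikely to help.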
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 52, in build_from_cfg
return obj_cls(**args)
File "/opt/conda/lib/python3.8/site-packages/mmseg/models/segmentors/encoder_decoder.py", line 36, in init
self.backbone = builder.build_backbone(backbone)
File "/opt/conda/lib/python3.8/site-packages/mmseg/models/builder.py", line 20, in build_backbone
return BACKBONES.build(cfg)
File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 215, in build
return self.build_func(*args, **kwargs, registry=self)
File "/opt/conda/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
raise type(e)(f'{obj_cls.name}: {e}')
AttributeError: FlashInternImage: module 'DCNv4' has no attribute 'DCNv4'
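An AttributeError of this kind means Python did import something named DCNv4, but that something does not define the DCNv4 class. One possible cause, stated here as an assumption and not as a diagnosis, is that the name resolves to a source folder (for example a clone of the repository in the working directory) because the compiled package is not installed in the active environment. The sketch below shows where a top-level import would come from; module_origin is a hypothetical helper built on importlib.

```python
import importlib.util


def module_origin(name):
    """Return (origin, search_locations) for a top-level module, or None.

    For a regular package, origin is the path of its __init__.py.
    For a bare directory picked up as a namespace package, origin is None
    and search_locations lists the directory that was found.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin, list(spec.submodule_search_locations or [])


print(module_origin("DCNv4"))
```

An origin inside site-packages points to the installed extension; an origin of None together with a project directory in the locations list points to a plain folder being imported.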
I tried to load the checkpoint from this link: https://huggingface.co/OpenGVLab/DCNv4/resolve/main/dino_4scale_flash_internimage_b_1x_coco.pth and got the following error:
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
I downloaded and loaded the checkpoint several more times and the error remains, so I think the checkpoint itself is probably broken. Please take a look, thank you!
Dear author,
When I install DCNv4 with pip install DCNv4, this error appears:
ERROR: Could not build wheels for DCNv4, which is required to install pyproject.toml-based projects,
Running C:\Users\admin\Desktop\DCNv4-main\DCNv4_op>python setup.py build install also fails.
I don't know what is going wrong.
Congratulations on the results of your paper! However, I have a question regarding its implementation. I've successfully compiled the method you proposed and found the necessary libraries within my Anaconda environment. When executed independently, the input and output dimensions align perfectly. Yet, when I integrate it into the YOLO framework for training, an error emerges, stating that CPU operations are not supported. Could you suggest any solutions to this issue?
Building trt serialized engine using DCNv4 with dynamic batch axis leads to error:
[04/28/2024-10:43:03] [TRT] [E] 4: kOPT values for profile 0 violate shape constraints: IShuffleLayer /model.10/conv/Reshape_1: reshaping failed for tensor: (Unnamed Layer* 1728) [Constant]_output Reshape dimension of -1 has no solution.
[04/28/2024-10:43:03] [TRT] [E] 4: [shapeCompiler.cpp::nvinfer1::builder::DynamicSlotBuilder::evaluateShapeChecks::1276] Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: IShuffleLayer /model.10/conv/Reshape_1: reshaping failed for tensor: (Unnamed Layer* 1728) [Constant]_output Reshape dimension of -1 has no solution.)
Dear author:
Hello, I have encountered the following problem. Could you tell me where the error might come from? I have been investigating for a long time without resolving it.
Error info:
python: /tmp/pip-install-g285lflt/dcnv4_0c9a40fbaa094f858763d45f8220c7e2/src/cuda/dcnv4_im2col_cuda.cuh:301: void _dcnv4_im2col_cuda(cudaStream_t, const scalar_t*, const scalar_t*, scalar_t*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, at::opmath_type<scalar_t>, int, int, int, int) [with scalar_t = float; stride_type = ulonglong4; int d_stride = 8; cudaStream_t = CUstream_st*; at::opmath_type<scalar_t> = float]: Assertion `(B*Q) % block_multiplier == 0' failed.
ckpt = torch.load("mask2former_flash_internimage_t_512_160k_ade20k_ss.pth", map_location='cpu')
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
As we know, FBOCC is an awesome project for both OCC and 3DDet.
Recently, I replaced the InternImage image backbone in FBOCC with FlashInternImage.
With the same project baseline, parameters, and environment, FlashInternImage fails with the weird error below.
I guess the ResNet error is caused by the bev_backbone, which may not be compatible with the FlashInternImage image backbone, but I am not sure.
Traceback (most recent call last):
File "tools/train.py", line 372, in <module>
main()
File "tools/train.py", line 361, in main
train_model(
File "/data/Project/fbocc/mmdet3d/apis/train.py", line 352, in train_model
train_detector(
File "/data/Project/fbocc/mmdet3d/apis/train.py", line 327, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 138, in run
iter_runner(iter_loaders[i], **kwargs)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 68, in train
self.call_hook('after_train_iter')
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
getattr(hook, fn_name)(self)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 272, in after_train_iter
self.loss_scaler.scale(runner.outputs['loss']).backward()
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
return user_fn(self, *args)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 122, in backward
outputs = ctx.run_function(*detached_inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 90, in forward
out = _inner_forward(x)
File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 73, in _inner_forward
out = self.conv1(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 446, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: set_sizes_and_strides is not allowed on a Tensor created from .data or .detach().
If your intent is to change the metadata of a Tensor (such as sizes / strides / storage / storage_offset)
without autograd tracking the change, remove the .data / .detach() call and wrap the change in a `with torch.no_grad():` block.
For example, change:
x.data.set_(y)
to:
with torch.no_grad():
x.set_(y)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11671) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-24_01:52:41
host : localhost.vm
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 11671)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
DCNv4/functions/dcnv4_func.py", line 125, in backward
ext.dcnv4_backward(*args)
RuntimeError: expected scalar type Float but found BFloat16
=.= Is there really nobody here? Don't you use bf16 for training...?
Thanks for your work! I notice that DCNv4 also implements FlashDeformAttn. What's the difference between FlashDeformAttn and DeformableAttn?
Does the DCNv4 kernel size have to be set to 3?
Dear authors,
Congrats on the work optimizing DCNv4; the changes make a lot of sense. I didn't expect that the previous DCNv3 was limited by memory instructions and excess computation, but it makes sense, especially for float16.
One issue I see with the current code is the memory accesses in ms_deform_attn_im2col_bilinear and ms_deform_attn_col2im_bilinear which are under different conditionals.
Similarly, the call to ms_deform_attn_col2im_bilinear is under a conditional itself.
These conditionals are of course necessary to avoid accessing outside the image.
My previous experience with GPU programming has shown me that it is hard for GPU compilers to optimize this case well and to reduce latency by issuing memory instructions well before their results are used.
Possibly recent CUDA compilers handle that just fine, and my past experience is no longer valid, and if you have seen very good GPU code generated, ignore my comment.
However, if this still applies, a solution that I have found to give a very significant performance boost in practice is the following:
At the beginning of your kernel check whether any of the conditions will be false for your warp.
Then if all the conditions will give true, execute a version of the kernel without any condition checks remaining. If any is false, execute the normal kernel.
Most of the time all the conditions will hold. Thus the kernel without any conditions will execute. This kernel will be much better optimized by the GPU compiler and be much faster. The 'slow' kernel will only execute a small minority of cases and not affect performance much.
Sadly I have no time to work on this at all, but I'm hopeful you use this technique and make an even better DCN.
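The idea described above (test once whether every sample in a group is safely inside the image, then run either a version with no bounds checks or the normal checked version) can be illustrated outside CUDA. The sketch below is a plain Python analogy and not the repository's kernel: the "group" stands in for a warp, the all(...) test stands in for a warp-wide vote, and all function names are hypothetical.

```python
import math


def bilinear_checked(img, y, x):
    """Bilinear sample with a bounds check on each of the four corners."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0
    total = 0.0
    for yy, wy in ((y0, 1 - dy), (y0 + 1, dy)):
        for xx, wx in ((x0, 1 - dx), (x0 + 1, dx)):
            if 0 <= yy < h and 0 <= xx < w:  # corners outside the image add zero
                total += wy * wx * img[yy][xx]
    return total


def bilinear_unchecked(img, y, x):
    """Same arithmetic with no bounds checks; valid only for interior points."""
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0][x0]
            + (1 - dy) * dx * img[y0][x0 + 1]
            + dy * (1 - dx) * img[y0 + 1][x0]
            + dy * dx * img[y0 + 1][x0 + 1])


def sample_group(img, points):
    """Check the whole group once, then use the branch-free or the checked path."""
    h, w = len(img), len(img[0])
    all_inside = all(0 <= y < h - 1 and 0 <= x < w - 1 for y, x in points)
    sampler = bilinear_unchecked if all_inside else bilinear_checked
    return [sampler(img, y, x) for y, x in points]
```

In Python the two paths cost about the same; the point of the sketch is the dispatch structure, where the common interior case runs code that contains no conditionals around its memory reads.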
It seems that the shapes of some parameters do not match.
RuntimeError: Error(s) in loading state_dict for FlashInternImage:
size mismatch for levels.0.blocks.0.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.0.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.0.blocks.1.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.1.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.0.blocks.2.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.2.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.0.blocks.3.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.3.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.0.blocks.4.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.4.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.1.blocks.0.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.0.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
size mismatch for levels.1.blocks.1.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.1.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
size mismatch for levels.1.blocks.2.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.2.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
size mismatch for levels.1.blocks.3.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.3.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
size mismatch for levels.1.blocks.4.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.4.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12676) of binary: /opt/conda/bin/python
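In every mismatch listed above, the size expected by the current model is the checkpoint's size rounded up to the next multiple of 8 (270 to 272, 540 to 544). One reading that fits these numbers, offered as an assumption and not confirmed by the log, is that the current layer definition pads the offset_mask output to a multiple of 8 while the checkpoint was saved from a definition without that padding. The arithmetic can be checked directly; round_up is a hypothetical helper.

```python
def round_up(value, multiple=8):
    """Smallest multiple of `multiple` that is greater than or equal to value."""
    return -(-value // multiple) * multiple


for checkpoint_dim in (270, 540):
    print(checkpoint_dim, "->", round_up(checkpoint_dim))
# 270 -> 272
# 540 -> 544
```

If that reading is right, the checkpoint and the model code come from different layer definitions, so checking that the checkpoint was produced for FlashInternImage and not for another backbone would be the first thing to verify.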
DCNv4_op\DCNv4\modules\dcnv4.py, line 128
x_proj = x
x = DCNv4Function.apply(
    x, offset_mask,
    self.kernel_size, self.kernel_size,
    self.stride, self.stride,
    self.pad, self.pad,
    self.dilation, self.dilation,
    self.group, self.group_channels,
    self.offset_scale,
    256,
    self.remove_center
)
x = x.view(N, L, -1)
if self.center_feature_scale:
    center_feature_scale = self.center_feature_scale_module(
        x, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
    center_feature_scale = center_feature_scale[..., None].repeat(
        1, 1, 1, 1, self.channels // self.group).flatten(-2)
    x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
if not self.without_pointwise:
    x = self.output_proj(x)
x_proj has shape [bs, h, w, c]. After x = x.view(N, L, -1), x has shape [bs, h * w, c].
When center_feature_scale is enabled, the line
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
therefore raises a shape mismatch error.
My solution is to move x = x.view(N, L, -1) to just before if not self.without_pointwise:, like this:
x_proj = x
x = DCNv4Function.apply(
    x, offset_mask,
    self.kernel_size, self.kernel_size,
    self.stride, self.stride,
    self.pad, self.pad,
    self.dilation, self.dilation,
    self.group, self.group_channels,
    self.offset_scale,
    256,
    self.remove_center
)
if self.center_feature_scale:
    center_feature_scale = self.center_feature_scale_module(
        x, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
    center_feature_scale = center_feature_scale[..., None].repeat(
        1, 1, 1, 1, self.channels // self.group).flatten(-2)
    x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = x.view(N, L, -1)  # move to here
if not self.without_pointwise:
    x = self.output_proj(x)