opengvlab / dcnv4

[CVPR 2024] Deformable Convolution v4

Home Page: https://arxiv.org/pdf/2401.06197.pdf

License: MIT License

Python 79.52% Shell 0.43% C++ 2.64% Cuda 17.41%

dcnv4's Issues

cuda error during inference

Thank you for your great work! I can train my model using dcnv4 but during inference I encounter this problem. How can I solve it?

RuntimeError: "dcnv4_backward_cuda" not implemented for 'BFloat16'

I have been using PyTorch's native Automatic Mixed Precision (AMP) feature and specified the data type as bf16 (bfloat16). However, I encountered an error during the backward propagation step. I would like to know if this is due to a lack of framework support or if I made an error in my usage.

RuntimeError: false INTERNAL ASSERT FAILED ... , please report a bug to PyTorch. kernel launch error

RuntimeError: false INTERNAL ASSERT FAILED at "/home/user/dcnv4/DCNv4_op/src/cuda/dcnv4_col2im_cuda.cuh":470, please report a bug to PyTorch. kernel launch error

how to solve?

Has anyone managed to run it successfully?

request for a pure Python implemented DCNv4

Hi! Thx for your great work!!

Could you please provide pure Python/PyTorch implemented DCNv4?
It will be easier to migrate and run on a Non-CUDA device.

Difference between dino flashinternimage-b and dino flashinternimage-l

Referring to the README and https://huggingface.co/OpenGVLab/DCNv4/tree/main, the weights file dino_4scale_flash_internimage_b_1x_coco.pth is 1.39GB, while dino_4scale_flash_internimage_l_1x_coco.pth, which supposedly has the larger backbone, is only 964MB. Is this expected?

Given that the FlashInternImage Large backbone checkpoint flash_intern_image_l_22k_384.pth is already 1.25GB, I suspect that the checkpoint was saved incorrectly.

Hello, may I ask which versions of torch are supported by this dcn_v4? What is the minimum supported version of torch?

Detail of "ViT-B + DCNv4".

The paper makes an interesting claim: "Our observations indicate that substituting the previously used DWConv or Attention with our DCNv4 leads to an increase in inference speed".

Could you provide the implementation details of "substituting the attention with DCNv4"?

Can't invoke DCNv4 class after build and install via pip

I have built the DCNv4_op project and installed it via pip install -e .:

But when I try to use it in my project, I get this error:

(mmd) zrway@zrpgs:/data/yolov8_proj$ python /data/yolov8_proj/project/test_model.py

                   from  n    params  module                                       arguments                     
  0                  -1  1      1392  ultralytics.nn.modules.conv.Conv             [3, 48, 3, 2]                 
  1                  -1  1     41664  ultralytics.nn.modules.conv.Conv             [48, 96, 3, 2]                
  2                  -1  2    130513  ultralytics.nn.modules.customed.PRC2f.PRC2f  [96, 96, 2, True]             
  3                  -1  1    166272  ultralytics.nn.modules.conv.Conv             [96, 192, 3, 2]               
  4                  -1  4    888481  ultralytics.nn.modules.customed.PRC2f.PRC2f  [192, 192, 4, True]           
  5                  -1  1    664320  ultralytics.nn.modules.conv.Conv             [192, 384, 3, 2]              
  6                  -1  4   3546433  ultralytics.nn.modules.customed.PRC2f.PRC2f  [384, 384, 4, True]           
  7                  -1  1   1991808  ultralytics.nn.modules.conv.Conv             [384, 576, 3, 2]              
  8                  -1  2   3512225  ultralytics.nn.modules.customed.PRC2f_DCN_v4.PRC2f_DCN_v4[576, 576, 2, True]           
  9                  -1  1   1670992  ultralytics.nn.modules.customed.SPPF_DCN.SPPF_DCN_v4[576, 576, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  2   2660929  ultralytics.nn.modules.customed.PRC2f.PRC2f  [960, 384, 2]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  2    703777  ultralytics.nn.modules.customed.PRC2f.PRC2f  [576, 192, 2]                 
 16                  -1  1    332160  ultralytics.nn.modules.conv.Conv             [192, 192, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  2   2366017  ultralytics.nn.modules.customed.PRC2f.PRC2f  [576, 384, 2]                 
 19                  -1  1   1327872  ultralytics.nn.modules.conv.Conv             [384, 384, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  2   5429089  ultralytics.nn.modules.customed.PRC2f.PRC2f  [960, 576, 2]                 
 22        [15, 18, 21]  1   3822016  ultralytics.nn.modules.head.Detect           [80, [192, 384, 576]]         
Traceback (most recent call last):
  File "/data/yolov8_proj/project/test_model.py", line 26, in <module>
    model = YOLO('yolov8m.yaml').load('/data/yolov8_proj/project/yolov8m.pt')
  File "/data/yolov8_proj/ultralytics/ultralytics/engine/model.py", line 92, in __init__
    self._new(model, task)
  File "/data/yolov8_proj/ultralytics/ultralytics/engine/model.py", line 142, in _new
    self.model = (model or self._smart_load("model"))(cfg_dict, verbose=verbose and RANK == -1)  # build model
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 292, in __init__
    m.stride = torch.tensor([s / x.shape[-2] for x in forward(torch.zeros(1, ch, s, s))])  # forward
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 291, in <lambda>
    forward = lambda x: self.forward(x)[0] if isinstance(m, (Segment, Pose, OBB)) else self.forward(x)
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 84, in forward
    return self.predict(x, *args, **kwargs)
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 102, in predict
    return self._predict_once(x, profile, visualize, embed)
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/tasks.py", line 123, in _predict_once
    x = m(x)  # run
  File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/modules/customed/PRC2f_DCN_v4.py", line 177, in forward
    y.extend(m(y[-1]) for m in self.m)
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/modules/customed/PRC2f_DCN_v4.py", line 177, in <genexpr>
    y.extend(m(y[-1]) for m in self.m)
  File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/modules/customed/PRC2f_DCN_v4.py", line 49, in forward
    return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
  File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/yolov8_proj/ultralytics/ultralytics/nn/modules/customed/PRC2f_DCN_v4.py", line 27, in forward
    x = self.dcnv4(
  File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/yolov8_proj/deps/DCNv4/DCNv4_op/DCNv4/modules/dcnv4.py", line 130, in forward
    x = DCNv4Function.apply(
  File "/home/zrway/anaconda3/envs/mmd/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 86, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/data/yolov8_proj/deps/DCNv4/DCNv4_op/DCNv4/functions/dcnv4_func.py", line 103, in forward
    output = ext.dcnv4_forward(*args)
RuntimeError: Not implemented on the CPU

I'd appreciate it if you could provide some help to solve this.

Checkpoint file corrupted?

I loaded the checkpoint file "dino_4scale_flash_internimage_t_1x_coco.pth", and it reported the error below. I downloaded it many times, but it still does not work.
"RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory"

I tried the dino_large_coco and maskrcnn_tiny_coco files, these files can be loaded.
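Since PyTorch 1.6, torch.save writes a zip archive by default, and "failed finding central directory" is what a zip reader reports when the end of the archive is missing, which is typical of a truncated or interrupted download. A minimal sketch of a pre-load sanity check using only the standard library follows; the function name is hypothetical, and the check only confirms that the zip structure is present, not that the tensors inside are valid. Checkpoints saved in the older non-zip format would also report False here.

```python
import os
import zipfile


def looks_like_complete_checkpoint(path):
    """Cheap structural check for a zip-format PyTorch checkpoint.

    Returns True when the file is non-empty and the zip end-of-central-directory
    record can be found. This does NOT validate the pickled contents.
    """
    return os.path.getsize(path) > 0 and zipfile.is_zipfile(path)
```

Running this on a freshly downloaded .pth file before calling torch.load can help separate a broken download from a file that is broken on the server: if several independent downloads all fail the check, the uploaded file itself is the more likely culprit.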

FlashDeformAttnFunction Error

1. n_points = 2, error: invalid k
2. n_points = 8, error: CUDA error: an illegal memory access was encountered

LINK: fatal error link1881

cuda 11.6
torch 1.11.0
windows10

expected scalar type Float but found BFloat16

DCNv4/functions/dcnv4_func.py", line 125, in backward
ext.dcnv4_backward(*args)
RuntimeError: expected scalar type Float but found BFloat16

DCNV4+ResNet50

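Adding DCNv2 to the last few stages of ResNet is a common way to improve accuracy in object detection and similar models. Can DCNv4 be used the same way? I have seen others say that DCNv4 does not support downsampling.

Input and output feature resolutions do not match

With DCNv4 configured as in the figure above, the output attn and the input x have different sizes. Is this a problem with my configuration?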

Problem, please

"RuntimeError: batch%im2col_step==0 INTERNAL ASSERT FAILED at './DCNv4_op/src/cuda/dcnv4_cuda.cu':55, please report a bug to PyTorch. batch(504) must divide im2col_step(256)". During testing I have already set the batch size to 512, but the test data may not be evenly divisible by 256; for example, the last batch is 504.
How can I solve this problem, please?
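The assertion quoted above says the batch size (504) is not a multiple of the im2col step (256). Assuming the check is exactly the `batch % im2col_step == 0` condition shown in the message, one option is to pick a step that divides the actual batch; other options are to drop or pad the last incomplete batch. The helper below is a hypothetical sketch of the first option and is not part of DCNv4; whether the op accepts an arbitrary step, and how a different step affects speed, is not confirmed by the issue.

```python
def largest_step_dividing(batch, cap=256):
    """Return the largest step s with 1 <= s <= cap and batch % s == 0."""
    if batch < 1 or cap < 1:
        raise ValueError("batch and cap must be positive")
    # Search downward from the cap; 1 always divides, so the loop always returns.
    for step in range(min(batch, cap), 0, -1):
        if batch % step == 0:
            return step


print(largest_step_dividing(504))  # 252, since 504 = 2 * 252
print(largest_step_dividing(512))  # 256
```

With a data loader, the simpler route is often to avoid the odd-sized final batch altogether, for example by dropping it during evaluation if the few skipped samples are acceptable.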

Error in running demo file using DCNV4

Thank you very much for your contribution!
When I run the following code:

import DCNv4
help(DCNv4)

the following error message appears:

Traceback (most recent call last):
  File "mytest.py", line 10, in <module>
    import DCNv4
  File "/home/coop/anaconda3/envs/v2xvit/lib/python3.7/site-packages/DCNv4/__init__.py", line 1, in <module>
    from .functions import DCNv4Function, FlashDeformAttnFunction
  File "/home/coop/anaconda3/envs/v2xvit/lib/python3.7/site-packages/DCNv4/functions/__init__.py", line 10, in <module>
    from .flash_deform_attn_func import FlashDeformAttnFunction
  File "/home/coop/anaconda3/envs/v2xvit/lib/python3.7/site-packages/DCNv4/functions/flash_deform_attn_func.py", line 34, in <module>
    raise NotImplementedError
NotImplementedError
free(): invalid pointer
Aborted (core dumped)

I tried installing DCNv4 both with pip and by compiling it with make.sh, but the problem is still not resolved.

DCNv4 for CPU

Hi, thanks for the great repository.
Any plans to add CPU modules?

thanks

Who can provide the Windows compilation environment for dcnv4

Who can provide the Windows compilation environment for dcnv4?
Is there an example of a successful compilation on Windows?

About running time comparison with dcnv3

Amazing engineering implementation! But I have some questions. I compared DCNv3 and DCNv4 on an RTX 2080 Ti with PyTorch 1.13.1. DCNv3 uses the built-in PyTorch function, and the measured results are as follows. Does DCNv4 not have a speed advantage when there are fewer channels and a higher resolution? What could be the reason?

float32
dcnv3: 2.234679698944092
dcnv4: 2.8860323429107666
float16
dcnv3: 1.4734604358673096
dcnv4: 2.88419246673584

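The code is:

import time
import torch
# DCNv3 and DCNv4 are the module classes from the respective op packages
# (their import lines were not included in the original post).

block1 = DCNv3(64, 64, group=4).cuda().eval()
block2 = DCNv4(64, 64, group=4).cuda().eval()
a = torch.rand(4, 64, 1920//4, 1088//4).cuda()
offset = torch.rand(4, 64, 1920//4, 1088//4).cuda()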

for i in range(10):
    with torch.no_grad():
        block1(a, offset)
        block2(a, offset)

torch.cuda.synchronize()
start_time = time.time()
for i in range(100):
    with torch.no_grad():
        block1(a, offset)
torch.cuda.synchronize()
print('dcnv3:', time.time() - start_time)

torch.cuda.synchronize()
start_time = time.time()
for i in range(100):
    with torch.no_grad():
        block2(a, offset)
torch.cuda.synchronize()
print('dcnv4:', time.time() - start_time)
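The loop above reports one total time per block, so a single slow iteration can skew the comparison. A minimal sketch of a reusable timing helper follows, assuming nothing beyond the standard library: it takes the callable to time and an optional synchronization callback (for GPU work that would be torch.cuda.synchronize), and returns the median per-call time. The function name and parameters are hypothetical, and synchronizing after every call adds a small overhead of its own.

```python
import statistics
import time


def benchmark(fn, sync=None, warmup=10, iters=100):
    """Return the median wall-clock time of fn() in milliseconds."""
    sync = sync or (lambda: None)
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for asynchronous work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)
```

For the setup in this issue, a call would look like benchmark(lambda: block2(a, offset), sync=torch.cuda.synchronize), wrapped in torch.no_grad() as in the original loop.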

In comparison to DCNV3, it seems that the DCNV4 code is missing the offset variable. Could you tell me where in the DCNV4 code implementation the deformation of convolutional positions is achieved?

Failed to build DCNv4！！

Dear author, hello. I am compiling DCNv4 on Windows 11. After running the
python setup.py build install command, the following problem appears. What is wrong?
14 errors detected in the compilation of "C:Desktop/yolov8/ultralytics/nn/extra_modules/DCNv4_op/src/cuda/dcnv4_cuda.cu".
error: command 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\nvcc.exe' failed with exit code 1

DCNv4 function and module

Dear Author,

In the module file, the DCNv4 function is called with 6 parameters. However, the forward function is applied with 8 parameters. Could you correct it?

mask2former_flash_internimage_b_640_160k_ade20k_ss.pth file seems broken.

Can DCNv4 support torch=2.0 ?

flash_internimage_large problem

I tried the tiny (flash_internimage_t) model and it was perfectly fine.
But when I configure the backbone according to the large model (flash_internimage_l), I get the following error and would like your help.

Config as below:

My env:
pytorch 2.0.1
cuda 11.8
GPU A800

forward() takes from 6 to 7 positional arguments but 8 were given

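Hello

When the apply function is called here, execution goes through autocast_decorator,

which causes self to be passed in as well, so the flashdeforable object itself is added as the first argument. As a result args has 8 elements, and calling forward fails with "takes from 6 to 7 positional arguments but 8 were given".

Please fix this bug. I am using this module in ultralytics.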

How to develop with DCNv4 fastest? Windows and Linux are both OK!

Dear everyone,
1. Many people have run into environment installation problems, so let me put an end to that and make installation easier for everyone.
2. After installation, many people don't know how to modify the model. I have a group where everyone can discuss together to speed up understanding and publish papers.
+V：13862010554 build
+V：wechat group together

LINK : fatal error LNK1181: cannot open input file "E:\DCNv4_op\build\temp.win-amd64-cpython-38\Release\DCNv4_op\src\cuda\dcnv4_cuda.obj"

what's the problem? HELP!
btw, my environments are vs2017 torch 2.0.1+cu118 torchvision0.15.2+cu118

Failed to build DCNv4！！

Dear author, hello. I am compiling DCNv4 on Windows 11. After running the
python setup.py build install command, the following problem appears. What is wrong?

16 errors detected in the compilation of "C:/Users/YG/Desktop/lishan/dcn/DCNv4_op/src/cuda/dcnv4_cuda.cu".
error: command 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\nvcc.exe' failed with exit code 1

DCNV4 training speed

Hello author, thank you for your work. I am using a 4090 graphics card and an Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. I found that the memory consumption of DCNV4 is indeed smaller than DCNV3, but the GPU utilization has decreased, and its training speed has become slower. What could be the reason for this?

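DCNv4 parameters

The mmdet framework has the configuration stage_with_dcn=(False, True, True, True), which replaces the ResNet convolutions with DCNv2. I now want to replace them with DCNv4.
How should I set the corresponding parameters?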

Encountered NaN loss when training a classification task

I encountered a NaN loss when training a classification task on ImageNet.

Cuda: 11.6
Torch: 1.12.1+cu116
Timm: 0.6.11

Any help is appreciated! Thanks

No matching distribution found for DCNv4==latest

Hello,

Thanks for releasing the open source code. I tried to follow the instructions (DCNv4/segmentation/readme.md) to install DCNv4, but got an error message that says "ERROR: Could not find a version that satisfies the requirement DCNv4==latest ERROR: No matching distribution found for DCNv4==latest". When I typed "pip install DCNv4", I got the following error message:

site-packages/torch/cuda/init.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-install-kox_k2e6/dcnv4_191e7104a4a646c5a96941957ab1b8ee/setup.py", line 70, in
ext_modules=get_extensions(),
File "/tmp/pip-install-kox_k2e6/dcnv4_191e7104a4a646c5a96941957ab1b8ee/setup.py", line 48, in get_extensions
raise NotImplementedError('Cuda is not available')
NotImplementedError: Cuda is not available
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.2'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Could you give me some hints for this problem? In any case, thanks a lot!

module 'DCNv4' has no attribute 'DCNv4'

Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 52, in build_from_cfg
return obj_cls(**args)
File "/opt/conda/lib/python3.8/site-packages/mmseg/models/segmentors/encoder_decoder.py", line 36, in __init__
self.backbone = builder.build_backbone(backbone)
File "/opt/conda/lib/python3.8/site-packages/mmseg/models/builder.py", line 20, in build_backbone
return BACKBONES.build(cfg)
File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 215, in build
return self.build_func(*args, **kwargs, registry=self)
File "/opt/conda/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
AttributeError: FlashInternImage: module 'DCNv4' has no attribute 'DCNv4'

CKPT for dino_4scale_flash_internimage_b_1x_coco seems to be broken

I try to load the checkpoint from this link: https://huggingface.co/OpenGVLab/DCNv4/resolve/main/dino_4scale_flash_internimage_b_1x_coco.pth , and I've got the following error:
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

I tried to download and load the checkpoint again for several times, and the error remains, so I think the checkpoint itself is probably broken. Please take a look, thank you!

Failed to build DCNv4

Dear author,
When I install DCNv4 with pip install DCNv4, this appears:
ERROR: Could not build wheels for DCNv4, which is required to install pyproject.toml-based projects,

Running C:\Users\admin\Desktop\DCNv4-main\DCNv4_op>python setup.py build install also fails.

I don't know what is going wrong.

Some problems with DCNv4 deployment

Congratulations on the results of your paper! However, I have a question regarding its implementation. I've successfully compiled the method you proposed and found the necessary libraries within my Anaconda environment. When executed independently, the input and output dimensions align perfectly. Yet, when I integrate it into the YOLO framework for training, an error emerges, stating that CPU operations are not supported. Could you suggest any solutions to this issue?

TensorRT dynamic batch axis issue

Building trt serialized engine using DCNv4 with dynamic batch axis leads to error:
[04/28/2024-10:43:03] [TRT] [E] 4: kOPT values for profile 0 violate shape constraints: IShuffleLayer /model.10/conv/Reshape_1: reshaping failed for tensor: (Unnamed Layer* 1728) [Constant]_output Reshape dimension of -1 has no solution.
[04/28/2024-10:43:03] [TRT] [E] 4: [shapeCompiler.cpp::nvinfer1::builder::DynamicSlotBuilder::evaluateShapeChecks::1276] Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: IShuffleLayer /model.10/conv/Reshape_1: reshaping failed for tensor: (Unnamed Layer* 1728) [Constant]_output Reshape dimension of -1 has no solution.)

Assertion `(B*Q) % block_multiplier == 0' failed.

Dear author:
Hello, I have encountered the following question. May I ask where the error may be? I have been investigating for a long time but have not resolved it.
Error info:
python: /tmp/pip-install-g285lflt/dcnv4_0c9a40fbaa094f858763d45f8220c7e2/src/cuda/dcnv4_im2col_cuda.cuh:301: void _dcnv4_im2col_cuda(cudaStream_t, const scalar_t*, const scalar_t*, scalar_t*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, at::opmath_type<scalar_t>, int, int, int, int) [with scalar_t = float; stride_type = ulonglong4; int d_stride = 8; cudaStream_t = CUstream_st*; at::opmath_type<scalar_t> = float]: Assertion `(B*Q) % block_multiplier == 0' failed.

mask2former_flash_internimage_t_512_160k_ade20k_ss.pth is broken

ckpt = torch.load("mask2former_flash_internimage_t_512_160k_ade20k_ss.pth", map_location='cpu')
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

compatibility with [SOTA] FBOCC

As we know:

 FBOCC is an awesome project for both OCC and 3DDet.

 Recently, I replaced the InternImage image backbone in FBOCC with FlashInternImage.

 With the same project baseline, the same parameters, and the same environment, FlashInternImage fails with the weird error below.

 I guess the ResNet error is caused by the bev_backbone, which is not compatible with the FlashInternImage image backbone.
 I am not sure.


  Traceback (most recent call last):
  File "tools/train.py", line 372, in <module>
    main()
  File "tools/train.py", line 361, in main
    train_model(
  File "/data/Project/fbocc/mmdet3d/apis/train.py", line 352, in train_model
    train_detector(
  File "/data/Project/fbocc/mmdet3d/apis/train.py", line 327, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 138, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 68, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 272, in after_train_iter
    self.loss_scaler.scale(runner.outputs['loss']).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 122, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 90, in forward
    out = _inner_forward(x)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 73, in _inner_forward
    out = self.conv1(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,  
RuntimeError: set_sizes_and_strides is not allowed on a Tensor created from .data or .detach().
If your intent is to change the metadata of a Tensor (such as sizes / strides / storage / storage_offset)
without autograd tracking the change, remove the .data / .detach() call and wrap the change in a `with torch.no_grad():` block.
For example, change:
    x.data.set_(y)
to:
    with torch.no_grad():
        x.set_(y)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11671) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-24_01:52:41
  host      : localhost.vm
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11671)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

expected scalar type Float but found BFloat16

DCNv4/functions/dcnv4_func.py", line 125, in backward
ext.dcnv4_backward(*args)
RuntimeError: expected scalar type Float but found BFloat16

=.= Is there really nobody here? Don't any of you train with bf16...

Problem about FlashDeformAttn

Thanks for your work! I notice that the DCNv4 also implements FlashDeformAttn. What's the difference between FlashDeformAttn and DeformableAttn?

about the kernel size

Does the DCNv4 kernel size have to be set to 3?

A technique to get even better speed

Dear authors,

Congrats on the work optimizing DCNv4; the changes make a lot of sense. I didn't expect that the previous DCNv3 was limited by memory instructions and excess computation, but it makes sense, especially for float16.

One issue I see with the current code is the memory accesses in ms_deform_attn_im2col_bilinear and ms_deform_attn_col2im_bilinear which are under different conditionals.

Similarly, the call to ms_deform_attn_col2im_bilinear is under a conditional itself.

These conditionals are of course necessary to avoid accessing outside the image.

My previous experience with GPU programming has shown me that it is hard for GPU compilers to optimize well this case, and reduce latency by issuing memory instructions significantly before their use.

Possibly recent CUDA compilers handle that just fine, and my past experience is no longer valid, and if you have seen very good GPU code generated, ignore my comment.

However if this still applies, a solution that I have found to give very significant performance boost in practice is the following:
At the beginning of your kernel check whether any of the conditions will be false for your warp.
Then if all the conditions will give true, execute a version of the kernel without any condition checks remaining. If any is false, execute the normal kernel.

Most of the time all the conditions will hold. Thus the kernel without any conditions will execute. This kernel will be much better optimized by the GPU compiler and be much faster. The 'slow' kernel will only execute a small minority of cases and not affect performance much.
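The fast-path idea described above can be sketched on the CPU in plain Python. This is only an analogy under stated assumptions: the function names are hypothetical, the image is a list of rows, and the `all(...)` test over a group of points stands in for what would be a warp-level vote inside a real CUDA kernel. It is not the DCNv4 kernel and says nothing about the actual speedup.

```python
import math


def bilinear_checked(img, h, w):
    """Bilinear sample with a bounds check per corner (zero outside the image)."""
    height, width = len(img), len(img[0])
    h0, w0 = math.floor(h), math.floor(w)
    lh, lw = h - h0, w - w0
    corners = (
        (h0, w0, (1 - lh) * (1 - lw)),
        (h0, w0 + 1, (1 - lh) * lw),
        (h0 + 1, w0, lh * (1 - lw)),
        (h0 + 1, w0 + 1, lh * lw),
    )
    total = 0.0
    for y, x, weight in corners:
        if 0 <= y < height and 0 <= x < width:
            total += weight * img[y][x]
    return total


def bilinear_unchecked(img, h, w):
    """Same interpolation with no branches; the caller guarantees all four corners are inside."""
    h0, w0 = int(h), int(w)
    lh, lw = h - h0, w - w0
    return ((1 - lh) * (1 - lw) * img[h0][w0]
            + (1 - lh) * lw * img[h0][w0 + 1]
            + lh * (1 - lw) * img[h0 + 1][w0]
            + lh * lw * img[h0 + 1][w0 + 1])


def sample_group(img, points):
    """Use the branch-free path only when every point in the group is interior."""
    height, width = len(img), len(img[0])
    all_inside = all(0 <= h < height - 1 and 0 <= w < width - 1 for h, w in points)
    sampler = bilinear_unchecked if all_inside else bilinear_checked
    return [sampler(img, h, w) for h, w in points]
```

The single up-front test replaces four checks per sample in the common case, and the checked version remains as the fallback for groups that touch the border, which mirrors the two-kernel structure the suggestion describes.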

Sadly I have no time to work on this at all, but I'm hopeful you use this technique and make an even better DCN.

Error model param shape: cascade_flash_internimage_l_fpn_3x_coco.pth

It seems that some parameter shapes do not match.

RuntimeError: Error(s) in loading state_dict for FlashInternImage:
size mismatch for levels.0.blocks.0.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.0.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.0.blocks.1.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.1.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.0.blocks.2.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.2.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.0.blocks.3.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.3.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.0.blocks.4.dcn.offset_mask.weight: copying a param with shape torch.Size([270, 160]) from checkpoint, the shape in current model is torch.Size([272, 160]).
size mismatch for levels.0.blocks.4.dcn.offset_mask.bias: copying a param with shape torch.Size([270]) from checkpoint, the shape in current model is torch.Size([272]).
size mismatch for levels.1.blocks.0.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.0.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
size mismatch for levels.1.blocks.1.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.1.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
size mismatch for levels.1.blocks.2.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.2.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
size mismatch for levels.1.blocks.3.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.3.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
size mismatch for levels.1.blocks.4.dcn.offset_mask.weight: copying a param with shape torch.Size([540, 320]) from checkpoint, the shape in current model is torch.Size([544, 320]).
size mismatch for levels.1.blocks.4.dcn.offset_mask.bias: copying a param with shape torch.Size([540]) from checkpoint, the shape in current model is torch.Size([544]).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12676) of binary: /opt/conda/bin/python
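The paired sizes in this log (270 vs 272, 540 vs 544) fit a simple pattern: the checkpoint values are multiples of 27, and the model values are the next multiples of 8. The sketch below reproduces that arithmetic under assumptions that are not stated in the issue: a 3x3 kernel, 10 and 20 groups in the first two levels, three predicted values per sampling point (two offset coordinates plus one modulation scalar), and a model build that rounds the layer width up to a multiple of 8. The function name and its parameters are hypothetical.

```python
import math


def offset_mask_dims(group, kernel_size=3, remove_center=0, align=8):
    """Return (unpadded, padded) output widths of the offset/mask projection.

    Assumes 3 values per sampling point and padding up to a multiple of `align`.
    """
    points = group * (kernel_size * kernel_size - remove_center)
    raw = points * 3
    padded = math.ceil(raw / align) * align
    return raw, padded


print(offset_mask_dims(10))  # (270, 272)
print(offset_mask_dims(20))  # (540, 544)
```

If this reading is right, the checkpoint and the installed code would disagree only on the padding, which might point to a version mismatch between the code that saved the weights and the code that loads them rather than to a corrupted file. That remains a guess until the authors confirm it.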

About DCNv4

DCNv4_op\DCNv4\modules\dcnv4.py, line 128

x_proj = x

x = DCNv4Function.apply(
    x, offset_mask,
    self.kernel_size, self.kernel_size,
    self.stride, self.stride,
    self.pad, self.pad,
    self.dilation, self.dilation,
    self.group, self.group_channels,
    self.offset_scale,
    256,
    self.remove_center
    )

x = x.view(N, L, -1)
if self.center_feature_scale:
    center_feature_scale = self.center_feature_scale_module(
        x, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
    center_feature_scale = center_feature_scale[..., None].repeat(
        1, 1, 1, 1, self.channels // self.group).flatten(-2)
    x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
if not self.without_pointwise:
    x = self.output_proj(x)

x_proj has shape [bs, h, w, c]; after x = x.view(N, L, -1), x has shape [bs, h * w, c].
When center_feature_scale is used,
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale raises a shape mismatch error.

My solution is to move x = x.view(N, L, -1) to just before if not self.without_pointwise:,
like this:

x_proj = x

x = DCNv4Function.apply(
    x, offset_mask,
    self.kernel_size, self.kernel_size,
    self.stride, self.stride,
    self.pad, self.pad,
    self.dilation, self.dilation,
    self.group, self.group_channels,
    self.offset_scale,
    256,
    self.remove_center
    )

if self.center_feature_scale:
    center_feature_scale = self.center_feature_scale_module(
        x, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
    center_feature_scale = center_feature_scale[..., None].repeat(
        1, 1, 1, 1, self.channels // self.group).flatten(-2)
    x = x * (1 - center_feature_scale) + x_proj * center_feature_scale

x = x.view(N, L, -1)  # move to here

if not self.without_pointwise:
    x = self.output_proj(x)

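Error when replacing a regular convolution with DCNv4

As shown in the figure, for any input of shape (1, 128, 248, 456), this error is raised whenever the DCNv4 stride is 2, whereas some other shapes do not trigger it even with a stride of 2. Why is that?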
