
simple_bev's People

Contributors

aharley, zfang399


simple_bev's Issues

Estimated seg problem

Hi!

I am currently training and testing the "Cam+Rad" and "Only Cam" BEV models on my dataset.

My dataset sensor configuration is as follows:

  • Number of cameras: 1 (front)
  • Number of radars: 1 (front)

[Cam+Rad BEV]
First of all, the Cam+Rad BEV visualization results work just fine:
image

[Only Cam BEV]
However, there is a problem with the camera-only BEV. The final training result using only the RGB camera is as follows.
image
And the camera-only BEV visualization results look weird:
image

What am I missing?

Thank you so much for always helping.

The problem of offset

def get_center_and_offset_bev(self, lrtlist_cam, seg_bev):
    ......
    ......
    offset = torch.stack([offset[:,0], offset[:,2]], dim=1)
    min_offset = torch.min(offset, dim=3)[0]
    max_offset = torch.max(offset, dim=3)[0]
    offset = min_offset + max_offset

What is the meaning of the sum of min_offset and max_offset?
What are the roles of min_offset and max_offset?
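
One possible reading (my guess, not an authoritative answer): the offsets are zero outside the object masks, and dim 3 is the axis being collapsed, so min recovers the negative entries, max recovers the positive entries, and their sum squashes the sparse signed volume into a single map without averaging in the zeros. A toy illustration of that behaviour:

    # Toy illustration, not the repo's code. Collapse a sparse, signed volume along
    # one axis: entries are zero outside the object masks, so at any location the
    # nonzero value is either negative (recovered by min) or positive (recovered by
    # max); min + max therefore squashes the axis without diluting values with zeros.
    import torch

    vol = torch.zeros(1, 2, 4, 3)   # (batch, xy channel, bev cell, collapsed axis) -- hypothetical layout
    vol[0, 0, 0, :2] = -0.7         # cell 0: negative x-offset, present in two bins of the collapsed axis
    vol[0, 0, 2, 1] = 0.4           # cell 2: positive x-offset

    merged = vol.min(dim=3)[0] + vol.max(dim=3)[0]
    print(merged[0, 0])             # tensor([-0.7000,  0.0000,  0.4000,  0.0000])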

Could the ego car be the scene center?

Hi, thanks a lot for sharing this nice work. I am wondering whether the ego car could be the scene center, as BEVFormer defines it. It seems that the current center is defined at the front camera.

And another small question: in the code there are some customized operations such as
merge_intrinsics, split_intrinsics, and safe_inverse.
I am wondering whether it would hurt performance to use common library functions instead, such as torch.inverse() or np.linalg.inv().
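
For context, here is a sketch of what a safe_inverse-style helper for rigid 4x4 poses typically does (my sketch, not the repo's implementation): it inverts [R | t] in closed form instead of calling a general solver. For matrices this small, swapping in torch.inverse() or np.linalg.inv() should not noticeably change speed; the closed form mainly gives an exact inverse for rigid transforms.

    import torch

    def rigid_inverse(mat):
        # mat: B x 4 x 4 rigid transforms [R | t; 0 0 0 1]
        # inverse is [R^T | -R^T t; 0 0 0 1], computed without a general matrix inverse
        R = mat[:, :3, :3]
        t = mat[:, :3, 3:4]
        inv = torch.eye(4, device=mat.device, dtype=mat.dtype).repeat(mat.shape[0], 1, 1)
        inv[:, :3, :3] = R.transpose(1, 2)
        inv[:, :3, 3:4] = -torch.matmul(R.transpose(1, 2), t)
        return inv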

Thanks again!

Regarding Onnx Conversion of SimpleBEV model

Hi,
I have tried converting the SimpleBEV model to ONNX using torch.onnx.export; however, I am facing an issue generating the dummy input required for the forward pass during conversion.

I am getting the error below:

RuntimeError: Only tuples, lists and Variables are supported as JIT inputs/outputs. Dictionaries and strings are also accepted, but their usage is not recommended. Here, received an input of unsupported type: Vox_util

Can you please help, @aharley @zfang399? Thanks!!
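
Not an official answer, but one common workaround is to wrap the model so that only tensors cross the export boundary, and to precompute anything that normally comes from the Vox_util object outside the traced graph. A hedged sketch (the argument names are illustrative; check the model's forward for the real signature):

    import torch
    import torch.nn as nn

    class ExportWrapper(nn.Module):
        # Wraps a trained model so torch.onnx.export only ever sees tensor inputs.
        # Whatever the model normally derives from Vox_util (e.g. the voxel-grid
        # coordinates) is precomputed once and registered as a buffer, so it is
        # baked into the graph as a constant rather than being a traced input.
        def __init__(self, model, xyz_camA):
            super().__init__()
            self.model = model
            self.register_buffer('xyz_camA', xyz_camA)  # hypothetical precomputed grid

        def forward(self, rgb_camXs, pix_T_cams, cam0_T_camXs):
            # NOTE: adjust this call to match the model's real forward signature;
            # the point is that only tensors appear in the wrapper's own signature.
            return self.model(rgb_camXs, pix_T_cams, cam0_T_camXs, self.xyz_camA)

    # usage sketch:
    # wrapper = ExportWrapper(model, precomputed_xyz)
    # torch.onnx.export(wrapper, (rgb, pix_T_cams, cam0_T_camXs), 'simplebev.onnx', opset_version=16)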

"Lift" operation in code

Hi,

Thank you for your research.

Could you give me some tips about the "Lift" operation in the code, maybe here:

simple_bev/utils/vox.py

Lines 291 to 341 in be46f0e

def unproject_image_to_mem(self, rgb_camB, pixB_T_camA, camB_T_camA, Z, Y, X, assert_cube=False, xyz_camA=None):
    # rgb_camB is B x C x H x W
    # pixB_T_camA is B x 4 x 4
    # rgb lives in B pixel coords
    # we want everything in A memory coords
    # this puts each C-dim pixel in the rgb_camB
    # along a ray in the voxelgrid
    B, C, H, W = list(rgb_camB.shape)
    if xyz_camA is None:
        xyz_memA = utils.basic.gridcloud3d(B, Z, Y, X, norm=False, device=pixB_T_camA.device)
        xyz_camA = self.Mem2Ref(xyz_memA, Z, Y, X, assert_cube=assert_cube)
    xyz_camB = utils.geom.apply_4x4(camB_T_camA, xyz_camA)
    z = xyz_camB[:,:,2]
    xyz_pixB = utils.geom.apply_4x4(pixB_T_camA, xyz_camA)
    normalizer = torch.unsqueeze(xyz_pixB[:,:,2], 2)
    EPS=1e-6
    # z = xyz_pixB[:,:,2]
    xy_pixB = xyz_pixB[:,:,:2]/torch.clamp(normalizer, min=EPS)
    # this is B x N x 2
    # this is the (floating point) pixel coordinate of each voxel
    x, y = xy_pixB[:,:,0], xy_pixB[:,:,1]
    # these are B x N
    x_valid = (x>-0.5).bool() & (x<float(W-0.5)).bool()
    y_valid = (y>-0.5).bool() & (y<float(H-0.5)).bool()
    z_valid = (z>0.0).bool()
    valid_mem = (x_valid & y_valid & z_valid).reshape(B, 1, Z, Y, X).float()
    if (0):
        # handwritten version
        values = torch.zeros([B, C, Z*Y*X], dtype=torch.float32)
        for b in list(range(B)):
            values[b] = utils.samp.bilinear_sample_single(rgb_camB[b], x_pixB[b], y_pixB[b])
    else:
        # native pytorch version
        y_pixB, x_pixB = utils.basic.normalize_grid2d(y, x, H, W)
        # since we want a 3d output, we need 5d tensors
        z_pixB = torch.zeros_like(x)
        xyz_pixB = torch.stack([x_pixB, y_pixB, z_pixB], axis=2)
        rgb_camB = rgb_camB.unsqueeze(2)
        xyz_pixB = torch.reshape(xyz_pixB, [B, Z, Y, X, 3])
        values = F.grid_sample(rgb_camB, xyz_pixB, align_corners=False)
        values = torch.reshape(values, (B, C, Z, Y, X))
    values = values * valid_mem
    return values

Coordinate in vox_util

Hi there,

Thank you for your research.

I don't fully understand the Ref, Center, Mem, camA, camB, and pixB coordinate systems in vox_util.
Could you explain them simply?
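
Not an authoritative answer, but here is how I read the naming: "ref" is the metric reference frame the voxel grid is defined in (the 0th/front camera in this repo), "mem" is the integer voxel-index frame of that grid, "camA"/"camB" are two camera frames (A the reference side, B the source image), and "pixB" is camera B's pixel frame. A transform named A_T_B maps points expressed in frame B into frame A, so chains read right to left. A small self-contained sketch under those assumptions (not the repo's batched helpers):

    import torch

    def apply_4x4(T, xyz):
        # simplified, unbatched version of utils.geom.apply_4x4: xyz is N x 3, T is 4 x 4
        ones = torch.ones(xyz.shape[0], 1)
        xyz1 = torch.cat([xyz, ones], dim=1)            # N x 4 homogeneous
        return (xyz1 @ T.t())[:, :3]

    # chain: mem --ref_T_mem--> ref (metres) --camB_T_ref--> camB --pix_T_camB--> pixB
    xyz_mem = torch.tensor([[100.0, 4.0, 100.0]])       # one voxel index (illustrative)
    ref_T_mem = torch.eye(4)
    ref_T_mem[:3, 3] = torch.tensor([-50.0, -2.0, -50.0])  # toy grid offset; 1 metre per voxel here
    camB_T_ref = torch.eye(4)                           # identity just for the sketch
    xyz_ref = apply_4x4(ref_T_mem, xyz_mem)             # voxel indices -> metres in the reference frame
    xyz_camB = apply_4x4(camB_T_ref, xyz_ref)           # reference frame -> camera B's frame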

BEVFormer code: num_heads and num_points are swapped

Hi,
thank you for the nice work and for sharing your code!

I believe that your implementation of BEVFormer has a small bug:

self.deformable_attention = MSDeformAttn(d_model=dim, n_levels=1, n_heads=4, n_points=8)

It looks like the values of the parameters n_heads and n_points have been swapped compared to the normal initialization:

def __init__(self, d_model=256, n_levels=4, n_heads=8, n_points=4):

See also the original implementation of BEVFormer:
def __init__(self, embed_dims=256, num_heads=8, num_levels=4, num_points=4,
https://github.com/fundamentalvision/BEVFormer/blob/20923e66aa26a906ba8d21477c238567fa6285e9/projects/mmdet3d_plugin/bevformer/modules/decoder.py#L160-L164

as well as the Deformable DETR paper:

M = 8 and K = 4 are set for deformable attentions by default.
K: the number of sampled keys in each feature level for each attention head
M: the number of attention heads

I am not sure how much of a difference it makes, but I wanted to warn other people.
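
If this diagnosis is right, the presumed fix (my guess, not confirmed by the authors) would simply restore the upstream defaults on the quoted line:

    # presumed fix: restore the head/point counts used by Deformable DETR / BEVFormer
    self.deformable_attention = MSDeformAttn(d_model=dim, n_levels=1, n_heads=8, n_points=4)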

Poor data loading performance and memory leaks

During training, every time my GPU worked for 5.5 seconds, it had to wait 11 seconds for the DataLoader. I wonder if you or anyone else has encountered this problem.
I think the following may cause the DataLoader to fetch items slowly (see the sketch below):

  • loops (nuscenesdataset.py L1130, L1174-1175),
  • lots of list.append and torch.stack calls,
  • redundant Tensor.clone operations (nuscenesdataset.py L990, 996, 997),
  • many CPU (numpy) operations, rather than loading the data onto the GPU and then using GPU (torch.Tensor) operations.

This could be a fatal problem, given the weak CPUs of on-board devices such as NVIDIA Orin.
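
Not from the repo, but as one concrete illustration of the kind of change being suggested: replacing a per-item Python loop of appends, clones and stacks inside __getitem__ with a single vectorized conversion. The variable names here are hypothetical:

    import numpy as np
    import torch

    # Before (per-sample Python loop, many small tensors and redundant clones):
    #   feats = []
    #   for box in boxes:                         # boxes: list of per-object numpy arrays
    #       feats.append(torch.from_numpy(box).clone())
    #   feats = torch.stack(feats, dim=0)
    #
    # After (one conversion, no clones, no per-item stacking):
    def boxes_to_tensor(boxes):
        # boxes: list of equally shaped numpy arrays -> single (N, ...) float tensor
        return torch.from_numpy(np.stack(boxes, axis=0)).float()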

Besides, I have frequently encountered cases where child processes were not properly reclaimed when the training program terminated, resulting in memory leaks.

Please check. Thank you for your work.

3D object detection head ?

Hey, really cool to see a simple BEV method which outperforms many strong baselines.

I have a question about 3D bounding box detection. Did the authors also try the object detection task along with the semantic segmentation task? Any insights would be appreciated. Thank you!

Fig. 4 Effect of batch size

Thanks for your excellent work.
I have doubts about figure4 in the paper. I saw the batch-size has such a significant effect. This is a vast difference from our common perception.
If we use the same 25000 iterations, the different batch-size represent a vast difference in the number of iterations over the data. So is this comparison unreasonable?
I would like to understand how this experiment was performed.

For example, if bevformer iterates 25,000 times at batch1, it will have trained only 1 epoch on the nuScenes dataset. while simpleBEV iterates 25,000 times at batch40, meaning it has trained 40 epochs on the nuScenes dataset. right?
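
For a rough sanity check of that arithmetic (assuming roughly 28,130 keyframe samples in the nuScenes train split; my number, not from the paper):

    # epochs ≈ iterations * batch_size / number_of_training_samples
    n_train = 28130
    iters = 25000
    for batch in (1, 8, 40):
        print(batch, round(iters * batch / n_train, 1))
    # -> batch 1: ~0.9 epochs, batch 8: ~7.1 epochs, batch 40: ~35.5 epochs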

What is the use of xyz_camB?

Hello, I want to know why xyz_camB needs to be computed in this line, and what z is used for after that line. Also, why is the vertical dimension Y instead of Z?

Please enlighten me!

How to run other models in nuscenes dataset?

I can see that some other models are implemented in the "nets" folder, such as "liftnet" and "bevformernet". I want to run these models. The class is:

class Bevformernet(nn.Module):

I searched for these models with PyCharm, but I can't find anywhere that uses them.

The only model used is Segnet, as in:

model = Segnet(Z, Y, X, vox_util, use_radar=use_radar, use_lidar=use_lidar, use_metaradar=use_metaradar, do_rgbcompress=do_rgbcompress, encoder_type=encoder_type, rand_flip=rand_flip)

Would you please tell me how to run the other models on the nuScenes dataset? What should I modify in the code?
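
Not an official recipe, but the usual way to try this is to import the other net in train_nuscenes.py and swap the instantiation. The constructor arguments below are a guess patterned on Segnet's; check nets/bevformernet.py for the exact signature:

    # in train_nuscenes.py (sketch; argument list not verified)
    from nets.bevformernet import Bevformernet

    # model = Segnet(Z, Y, X, vox_util, use_radar=use_radar, use_lidar=use_lidar,
    #                use_metaradar=use_metaradar, do_rgbcompress=do_rgbcompress,
    #                encoder_type=encoder_type, rand_flip=rand_flip)
    model = Bevformernet(Z, Y, X, vox_util, rand_flip=rand_flip)  # hypothetical args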

Thank you~

Questions on architecture design choices

Hi, I have a couple of questions regarding your NN architecture design. Could you give the motivation for these particular design choices (or, if they are copied from some other work, point me to it)?

For both ResNet backbones you stopped at the 3rd block and did not use the 4th block:

self.backbone = nn.Sequential(*list(resnet.children())[:-4])

self.layer3 = backbone.layer3

What is the motivation for the use of instance normalization in the decoders?

nn.InstanceNorm2d(shared_out_channels),

Why did you not use activation functions for the up-sampling layers in the BEV grid?

class UpsamplingAdd(nn.Module):

Questions about the results of bevformer in table1.

Hi Dr. Harley, Simple-BEV is really nice work. However, when running the code to reproduce the results in Table 1, I ran into some problems:

  1. Does "simple_bev/nets/bevformernet.py" correspond to the "Deformable attention" row in Table 1, and "simple_bev/nets/bevformernet2.py" to the "Multi-scale deform. attn." row?
  2. The performance of "bevformernet.py" seems to be similar to "segnet" (as shown below). I don't know if I did anything wrong.
  3. When training "bevformernet2" it tends to overfit, and IOU_v is low.

Thank you for your attention; I would be really grateful if you could help me out.

log of Q2:

segnet:
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000356/376; rtime 0.06; itime 0.69 (783.48 ms); loss 2.87869; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000357/376; rtime 0.07; itime 0.80 (783.53 ms); loss 0.26470; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000358/376; rtime 0.07; itime 0.69 (783.27 ms); loss -0.71782; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000359/376; rtime 0.06; itime 0.66 (782.94 ms); loss 0.30850; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000360/376; rtime 0.08; itime 0.66 (782.60 ms); loss -0.15269; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000361/376; rtime 0.04; itime 0.63 (782.17 ms); loss -0.56200; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000362/376; rtime 0.04; itime 0.62 (781.73 ms); loss 0.11772; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000363/376; rtime 0.04; itime 0.63 (781.31 ms); loss 0.36632; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000364/376; rtime 0.04; itime 0.62 (780.87 ms); loss 1.63898; iou_ev 47.6
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000365/376; rtime 0.05; itime 0.64 (780.47 ms); loss 5.04701; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000366/376; rtime 0.04; itime 0.64 (780.08 ms); loss 1.01278; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000367/376; rtime 0.05; itime 0.65 (779.72 ms); loss 0.95337; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000368/376; rtime 0.05; itime 0.66 (779.39 ms); loss 0.78735; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000369/376; rtime 0.05; itime 0.65 (779.04 ms); loss 2.96054; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000370/376; rtime 0.04; itime 0.62 (778.62 ms); loss 2.15018; iou_ev 47.5
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000371/376; rtime 0.04; itime 0.72 (778.45 ms); loss 4.13992; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000372/376; rtime 0.05; itime 0.64 (778.07 ms); loss 1.97022; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000373/376; rtime 0.05; itime 0.63 (777.67 ms); loss 0.56684; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000374/376; rtime 0.04; itime 0.63 (777.27 ms); loss 1.27457; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000375/376; rtime 0.05; itime 0.64 (776.91 ms); loss 3.13456; iou_ev 47.4
8x5_3e-4s_segnet_reproduce_19:22:27_16_eval_20:55:09; step 000376/376; rtime 0.05; itime 0.64 (776.55 ms); loss 1.49759; iou_ev 47.4
final trainval mean iou 47.43055910993624

bevformer
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000351/376; rtime 0.04; itime 1.57 (1583.65 ms); loss 1.46815; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000352/376; rtime 0.05; itime 1.54 (1583.52 ms); loss 1.36396; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000353/376; rtime 0.04; itime 1.52 (1583.35 ms); loss 1.52722; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000354/376; rtime 0.04; itime 1.52 (1583.18 ms); loss 1.45763; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000355/376; rtime 0.05; itime 1.55 (1583.09 ms); loss 2.03853; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000356/376; rtime 0.07; itime 1.54 (1582.97 ms); loss 1.00910; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000357/376; rtime 0.04; itime 1.51 (1582.76 ms); loss 0.24750; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000358/376; rtime 0.04; itime 1.52 (1582.60 ms); loss 0.00532; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000359/376; rtime 0.04; itime 1.52 (1582.43 ms); loss 0.19061; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000360/376; rtime 0.04; itime 1.52 (1582.24 ms); loss 0.04461; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000361/376; rtime 0.04; itime 1.51 (1582.06 ms); loss -0.15120; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000362/376; rtime 0.04; itime 1.51 (1581.86 ms); loss 0.19511; iou_ev 47.5
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000363/376; rtime 0.04; itime 1.51 (1581.67 ms); loss 0.41124; iou_ev 47.5
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000364/376; rtime 0.05; itime 1.54 (1581.56 ms); loss 0.64006; iou_ev 47.5
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000365/376; rtime 0.05; itime 1.53 (1581.42 ms); loss 1.81830; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000366/376; rtime 0.04; itime 1.53 (1581.29 ms); loss 0.45757; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000367/376; rtime 0.04; itime 5.94 (1593.17 ms); loss 0.35694; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000368/376; rtime 0.07; itime 1.55 (1593.07 ms); loss 0.77848; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000369/376; rtime 0.05; itime 1.52 (1592.87 ms); loss 1.30311; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000370/376; rtime 0.04; itime 1.50 (1592.63 ms); loss 1.00361; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000371/376; rtime 0.04; itime 1.51 (1592.39 ms); loss 2.16301; iou_ev 47.3
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000372/376; rtime 0.04; itime 1.52 (1592.19 ms); loss 0.88017; iou_ev 47.3
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000373/376; rtime 0.04; itime 1.52 (1592.00 ms); loss 0.48593; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000374/376; rtime 0.04; itime 1.52 (1591.81 ms); loss 0.77463; iou_ev 47.4
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000375/376; rtime 0.04; itime 1.51 (1591.59 ms); loss 1.96368; iou_ev 47.3
8x5_3e-4s_bevformer_21:07:58_16_eval_01:13:41; step 000376/376; rtime 0.04; itime 1.51 (1591.38 ms); loss 0.78340; iou_ev 47.4
final trainval mean iou 47.36635237197667

log of Q3:

bevformer
8x5_3e-4s_bevformer_21:07:58; step 005610/25000; rtime 0.23; itime 7.22; loss 1.94302; iou_t 29.3; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005611/25000; rtime 0.29; itime 7.15; loss 1.96159; iou_t 28.7; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005612/25000; rtime 0.26; itime 7.15; loss 1.57377; iou_t 26.5; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005613/25000; rtime 0.21; itime 7.30; loss 2.18127; iou_t 25.8; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005614/25000; rtime 0.17; itime 7.03; loss 2.06807; iou_t 25.9; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005615/25000; rtime 0.24; itime 7.30; loss 1.96291; iou_t 25.6; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005616/25000; rtime 0.20; itime 7.23; loss 1.79805; iou_t 26.3; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005617/25000; rtime 0.19; itime 7.26; loss 1.75140; iou_t 26.8; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005618/25000; rtime 0.22; itime 7.31; loss 1.84597; iou_t 27.1; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005619/25000; rtime 0.18; itime 7.34; loss 1.59728; iou_t 27.1; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005620/25000; rtime 0.30; itime 7.09; loss 2.27120; iou_t 26.8; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005621/25000; rtime 0.23; itime 7.14; loss 1.91050; iou_t 26.6; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005622/25000; rtime 0.15; itime 6.57; loss 1.78379; iou_t 27.5; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005623/25000; rtime 0.16; itime 6.59; loss 1.61208; iou_t 28.0; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005624/25000; rtime 0.15; itime 6.66; loss 1.95314; iou_t 27.8; iou_v 29.8
8x5_3e-4s_bevformer_21:07:58; step 005625/25000; rtime 0.19; itime 7.05; loss 1.64065; iou_t 28.0; iou_v 29.8
bevformer2
8x5_3e-4s_bevformer_MS_00:06:32; step 005610/25000; rtime 0.35; itime 5.87; loss 1.30963; iou_t 39.2; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005611/25000; rtime 0.21; itime 5.83; loss 0.97845; iou_t 39.6; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005612/25000; rtime 0.20; itime 6.03; loss 1.67607; iou_t 39.3; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005613/25000; rtime 0.23; itime 6.03; loss 1.30028; iou_t 39.3; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005614/25000; rtime 0.19; itime 6.19; loss 1.44862; iou_t 39.6; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005615/25000; rtime 0.20; itime 5.56; loss 1.33884; iou_t 39.1; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005616/25000; rtime 0.25; itime 6.23; loss 1.39724; iou_t 38.7; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005617/25000; rtime 0.24; itime 5.57; loss 1.90144; iou_t 38.5; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005618/25000; rtime 0.22; itime 6.27; loss 1.33009; iou_t 38.3; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005619/25000; rtime 0.30; itime 5.66; loss 1.22954; iou_t 37.7; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005620/25000; rtime 0.33; itime 5.83; loss 1.67304; iou_t 38.3; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005621/25000; rtime 0.23; itime 6.05; loss 1.51381; iou_t 37.9; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005622/25000; rtime 0.20; itime 4.56; loss 1.54351; iou_t 38.8; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005623/25000; rtime 0.15; itime 4.39; loss 1.39122; iou_t 38.1; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005624/25000; rtime 0.16; itime 5.98; loss 1.22388; iou_t 38.6; iou_v 5.5
8x5_3e-4s_bevformer_MS_00:06:32; step 005625/25000; rtime 0.17; itime 7.95; loss 2.08057; iou_t 38.4; iou_v 5.5

Negative loss

I am currently training the BEV model with my dataset.
However, after enough steps, the total loss becomes negative, as shown below.
image

After debugging, I found that when calculating the uncertainty losses (ce_uncertainty_loss, center_uncertainty_loss, offset_uncertainty_loss),
the weights (model.module.ce_weight, model.module.center_weight, model.module.offset_weight) are negative.
image

Are there any possible causes?
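
Not an official answer, but if the loss uses learned per-task uncertainty weights in the style of Kendall et al. (which the ce_weight/center_weight/offset_weight parameters suggest), a negative total loss can be expected behaviour rather than a bug: each term looks roughly like exp(-w) * L + w, and once the task loss L becomes small, the learned w can go negative and pull the sum below zero while the gradients stay valid. A toy sketch of the pattern (not the repo's exact code):

    import torch

    ce_loss = torch.tensor(0.05)                        # a small task loss late in training
    ce_weight = torch.nn.Parameter(torch.tensor(-2.0))  # learned log-variance-style weight

    weighted = ce_loss * torch.exp(-ce_weight) + ce_weight
    print(weighted.item())   # 0.05 * e^2 - 2.0 ≈ -1.63, i.e. negative but well-defined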

How do I know what REF is?

@aharley Thanks for your sharing. I don't know if I understand correctly. The following code completes the conversion from the ref to the mem coordinate system. Mem is the voxel coordinate system, i.e., the coordinates of the BEV feature. What does ref refer to? How do I determine this reference coordinate system? Can I specify an arbitrary coordinate system? I understand that in the code it is cam_front. If I change to lidar_top as the reference coordinate system, does the code need to be changed? Looking forward to your reply, thank you very much!
def get_mem_T_ref(self, B, Z, Y, X, assert_cube=False, device='cuda'):
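
Not an authoritative answer, but as I read it, "ref" is simply the metric reference frame in which the scene bounds (XMIN..XMAX, etc.) and the sensor extrinsics are expressed, and mem_T_ref is the affine map that shifts and scales those metric coordinates into voxel indices. Any frame can serve as ref (lidar_top included), as long as the bounds and all extrinsics are expressed consistently in it. A sketch of the idea, not the repo's exact matrix layout:

    import torch

    def sketch_mem_T_ref(Z, Y, X, bounds):
        # bounds = (XMIN, XMAX, YMIN, YMAX, ZMIN, ZMAX) in metres, in the reference frame
        XMIN, XMAX, YMIN, YMAX, ZMIN, ZMAX = bounds
        vox_size = torch.tensor([(XMAX - XMIN) / X, (YMAX - YMIN) / Y, (ZMAX - ZMIN) / Z])
        scale = 1.0 / vox_size
        T = torch.eye(4)
        T[0, 0], T[1, 1], T[2, 2] = scale[0], scale[1], scale[2]     # metres -> voxel units
        T[:3, 3] = -torch.tensor([XMIN, YMIN, ZMIN]) * scale         # shift grid origin to index 0
        return T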

seg_bev label

Hi!

I have a question.

Checking seg_bev_g for analysis shows the following.

image

Why are the values (= colors) different for each object?

Evaluation problem?

Why choose BEV segmentation rather than BEV object detection or 3D object detection?

Question about hardware requirement.

How much GPU memory is needed to train this model at batch size 1 (6 cams + radar) at the normal input resolution (448×800)?

Will a single RTX 3090 do? (I only have one RTX 3070.)

Training log

As a test, I'm trying to train a Cam+Rad BEV model with the nuScenes mini dataset.
But iou_v is nan, as shown below.
image

Could you please share your training log for debugging?

KeyError: 'H' in evaluation

Hi,
the following error occurred when I used the recommended method for evaluation:

    File "/home/zhwsh/code/simple_bev-main/nuscenesdataset.py", line 728, in sample_augmentation
        H, W = self.data_aug_conf['H'], self.data_aug_conf['W']
    KeyError: 'H'

Then I noticed that 'H' and 'W' are not defined in data_aug_conf. What should their values be?

    data_aug_conf = {
        'final_dim': final_dim,
        'cams': ['CAM_FRONT_LEFT', 'CAM_FRONT', 'CAM_FRONT_RIGHT',
                 'CAM_BACK_LEFT', 'CAM_BACK', 'CAM_BACK_RIGHT'],
        'ncams': ncams,
    }
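
Not confirmed by the authors, but since raw nuScenes camera images are 1600 x 900 pixels, the usual LSS-style values would presumably be:

    data_aug_conf['H'] = 900    # raw nuScenes image height
    data_aug_conf['W'] = 1600   # raw nuScenes image width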

visualization problem

Thanks for your wonderful work.
I am trying to reproduce it, and the result looks great.
So I want to further analyze the result.
Could you share your visualization command?

issue in bevformer2

When I tried to train bevformer2 using two 3090 GPUs, I got the error ERROR: torch.distributed.elastic.multiprocessing.api: failed (exitcode: -6) local_rank: 1 (pid: 26301). This error does not occur every time, but the probability of occurrence is high.
I noticed that the code already has a comment saying that using multi-scale features will not work.
After checking the code, I found that there was an issue with the parameters of VanillaSelfAttention and SpatialCrossAttention.
When using multi-scale features, n_levels needs to be set to the number of multi-scale feature levels (3) to solve the problem.

Evaluation: filter activated

Hello,

Thank you for your great work. I have a question concerning your results, especially the visibility filter that seems to be activated in your training code (cf. valid_bev_g during loss and metric calculation).

Did you use this filter for the tables in your paper? It seems to have a big impact on the metrics I get, and some of the papers you compare with do not use this filter (cf. FIERY, which overrides the filtering argument to False during training and evaluation).

Thank you in advance

No performance improvement after using radar data

Hi @aharley :

Thanks for your work, which gave me some ideas.
In my BEV detection model, I rasterize the radar data and concatenate it with the BEV features output from LSS along the channel dimension, but the result was 2-3 points lower.

Do you know why? Or could you point out some potential problems?

Issue with visualisation being flipped

Thank you for this contribution; it has been very educational for me!

I am trying to train this model on a custom dataset.

Can you confirm a few assumptions I have made?

  1. This is the coordinate system:
    # COORDS:
    # Z: forward (roll)
    # Y: down (yaw)
    # X: right (pitch)
    #
    #
    # (forward) z
    #           \
    #            \
    #             *---------- x (right)
    #             |
    #             |
    #             |
    #             | Y (down)
  2. For the sensors:
     • roll is along z; +roll is clockwise
     • yaw is along y; +yaw is clockwise
     • pitch is along x; +pitch is down

Secondly, both my result and my prediction for Segnet are flipped 180 degrees. I am hoping you can give me some pointers on where to look.

Actual Values:
image

Expected Values:
image

Raw translation matrix between Camera and Radar Sensors

Translating between different sensor modalities across frames is doable. For example, going from CAM_FRONT frame 0 to RADAR_FRONT frame 1 involves the transformations:

CAM_FRONT F0 -> GLOBAL COORDS -> RADAR_FRONT F1

It is possible to do so using the ego-pose records, as done in the get_radar_data() function from nuscenesdataset.py; this snippet shows the car_from_global transformation matrix, which describes the transformation from GLOBAL to the RADAR_FRONT ego frame:

    # Get reference pose and timestamp.
    ref_sd_token = sample_rec['data']['RADAR_FRONT']
    ref_sd_rec = nusc.get('sample_data', ref_sd_token)
    ref_pose_rec = nusc.get('ego_pose', ref_sd_rec['ego_pose_token'])
    ref_cs_rec = nusc.get('calibrated_sensor', ref_sd_rec['calibrated_sensor_token'])
    ref_time = 1e-6 * ref_sd_rec['timestamp']

    # Homogeneous transformation matrix from global to _current_ ego car frame.
    car_from_global = transform_matrix(ref_pose_rec['translation'], Quaternion(ref_pose_rec['rotation']),inverse=True)

What I am looking for is a 'raw' transformation matrix which describes the transformation between different sensor modalities when the vehicle is stationary (CAM_FRONT to RADAR_FRONT, for example). This is also the case when the sensor modalities' timestamps are EXACTLY the same, as this would mean the ego vehicle has the same global coordinates for both modalities. A case for this is when ref_sd_token = sample_rec['data']['RADAR_FRONT'] is not available.

As I stated, this raw transformation matrix could be extracted when the timestamp difference between CAM_FRONT and RADAR_FRONT is 0. However, I could not find a case where this occurs; since the radar and camera are asynchronous, it would be rare. The closest I could get is 4 microseconds, but I would like the exact raw transformation matrix if this is possible.
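
Not an official answer, but the static sensor-to-sensor transform does not require matched timestamps at all: each calibrated_sensor record stores the fixed sensor-to-ego extrinsics, so composing two of them directly gives the 'raw' CAM_FRONT-to-RADAR_FRONT matrix, independent of when either sensor fired. A sketch using the nuScenes devkit (function and variable names are mine):

    from pyquaternion import Quaternion
    from nuscenes.utils.geometry_utils import transform_matrix

    def sensor_from_sensor(nusc, sample_rec, src='CAM_FRONT', dst='RADAR_FRONT'):
        # Static extrinsics (sensor -> ego) come from the calibrated_sensor records,
        # which do not depend on the (asynchronous) sample_data timestamps.
        src_cs = nusc.get('calibrated_sensor',
                          nusc.get('sample_data', sample_rec['data'][src])['calibrated_sensor_token'])
        dst_cs = nusc.get('calibrated_sensor',
                          nusc.get('sample_data', sample_rec['data'][dst])['calibrated_sensor_token'])

        ego_from_src = transform_matrix(src_cs['translation'], Quaternion(src_cs['rotation']), inverse=False)
        dst_from_ego = transform_matrix(dst_cs['translation'], Quaternion(dst_cs['rotation']), inverse=True)
        return dst_from_ego @ ego_from_src   # 4x4: points in CAM_FRONT -> RADAR_FRONT coordinates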

Thank you.

C+R pretrained model & Vis.

Thanks for the great research!

I have a few questions.

  1. Is there a cam + radar pretrained model? It seems that only camera models are available for download.
  2. What should I do for continuous visualization (i.e., a video)?

Code changes

Hi !

I tested it with the code committed on 31 Aug.
But now it seems that various files (vox.py, nuscenesdataset.py, train_nuscenes.py) have been updated.

What has changed?

Can you explain the motivation and usage of "randomize reference cam augmentation"?

Firstly, I think the motivation of this paper is very insightful and the conclusion is solid; thanks to the authors for your effort.
I'm curious about the "randomize reference cam augmentation" mentioned in the paper. I found that the first camera and a randomly chosen camera are swapped in the code, to make sure the 0th cam is a randomized reference camera, but I don't understand what the motivation of this augmentation is and how it is done in detail. Can you explain it or provide a reference paper (if one exists)? Besides, the paper mentions: "We have observed qualitatively that without this augmentation, the segmented cars have a slight bias for certain orientations in certain positions; with the augmentation added, this bias disappears." Can you show us the bad case with an image to make it more intuitive?
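
For what it is worth, the mechanics as I understand them from the code (paraphrased, not quoted) are just an index swap applied consistently to the images and their calibration, so that the camera anchoring the BEV grid is not always CAM_FRONT. A sketch with illustrative variable names:

    import torch

    def randomize_ref_cam(imgs, rots, trans, intrins):
        # imgs: B x S x C x H x W; rots/trans/intrins: per-camera extrinsics/intrinsics (B x S x ...).
        # Swap camera 0 with a randomly chosen camera for each batch element, so the
        # "reference" slot (index 0) corresponds to a different physical camera each time.
        B, S = imgs.shape[:2]
        for b in range(B):
            j = int(torch.randint(0, S, (1,)))
            for t in (imgs, rots, trans, intrins):
                t[b, [0, j]] = t[b, [j, 0]]
        return imgs, rots, trans, intrins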

Performances as a function of the batch size

Hi, thanks for releasing this cool work!
I have a question about Fig. 4 in your paper and the related paragraph. Since you train for 25k iterations with batch_size = 8, do you also increase (or decrease) the number of iterations when the batch size decreases (or increases, respectively), or is it always kept fixed at 25k?
Thank you in advance for the answer.

Questions regarding backbone network

Hi! First of all thank you for the great quality of this work, both the paper and the code.
I have a couple of doubts regarding the backbone:

  1. As mentioned in issue #24, the image features in the repo come from the concatenation of the output of the second layer and the upsampled output of the third layer. In the paper, it is instead stated that the features come from the concatenation of the output of the third layer with the upsampled output of the last layer, leading to feature maps of dimension C x H/8 x W/8, while the approach in the code produces feature maps of dimension C x H/4 x W/4. Which of the two approaches produced the results reported in the paper? Does this difference have a significant effect on performance (if both have been tested)?
  2. In the paper it is mentioned that the ResNet-101 backbone is initialized from COCO pretraining, citing the DETR paper, while in the code the network is initialized from the torchvision default weights (ImageNet pretraining). In the experiments section of the paper, the effect of input resolution is discussed, and it is hypothesised that the decreasing performance at higher resolution could be explained by worse transfer due to inconsistency with the pretraining scale. Do the results in this section come from the approach described in the paper (COCO pretraining) or the one in the code? In case you have run experiments with both approaches, does this make any significant difference?
    Thanks again.

mask mems about reduce_masked_mean

Thanks for the excellent work! I have a question about reduce_masked_mean.
In order to fuse the per-camera features, you compute mask_mems via mask_mems = (torch.abs(feat_mems) > 0).float() and pass it to reduce_masked_mean. But in vox_util.unproject_image_to_mem there is already a valid_mem, computed as valid_mem = (x_valid & y_valid & z_valid).reshape(B, 1, Z, Y, X).float() and applied to the values via values = values * valid_mem. I'm confused about this: why not use the valid_mem from vox_util.unproject_image_to_mem instead of computing a new mask_mems with mask_mems = (torch.abs(feat_mems) > 0).float()? It does not seem to make sense to me.

image

image

image

Question about translation of Radar coordinates

The radar points are obtained via the get_radar_data() function inside nuscenesdataset.py, and the points are translated into the ego pose relative to RADAR_FRONT:

   # Get reference pose and timestamp.
    ref_sd_token = sample_rec['data']['RADAR_FRONT']
    ref_sd_rec = nusc.get('sample_data', ref_sd_token)
    ref_pose_rec = nusc.get('ego_pose', ref_sd_rec['ego_pose_token'])
    ref_cs_rec = nusc.get('calibrated_sensor', ref_sd_rec['calibrated_sensor_token'])
    ref_time = 1e-6 * ref_sd_rec['timestamp']

    # Homogeneous transformation matrix from global to _current_ ego car frame.
    car_from_global = transform_matrix(ref_pose_rec['translation'], Quaternion(ref_pose_rec['rotation']),inverse=True)

A similar thing happens in get_lidar_data(); these points are translated into the ego pose relative to LIDAR_TOP:

    # Get reference pose and timestamp.
    ref_sd_token = sample_rec['data']['LIDAR_TOP']
    ref_sd_rec = nusc.get('sample_data', ref_sd_token)
    ref_pose_rec = nusc.get('ego_pose', ref_sd_rec['ego_pose_token'])
    ref_cs_rec = nusc.get('calibrated_sensor', ref_sd_rec['calibrated_sensor_token'])
    ref_time = 1e-6 * ref_sd_rec['timestamp']

    # Homogeneous transformation matrix from global to _current_ ego car frame.
    car_from_global = transform_matrix(ref_pose_rec['translation'], Quaternion(ref_pose_rec['rotation']),
                                        inverse=True)

Then these points are translated to CAM_FRONT coordinates in train_nuscenes.py via these transformation matrices:

    cams_T_velo = __u(utils.geom.safe_inverse(__p(velo_T_cams)))

And finally translated:

    xyz_cam0 = utils.geom.apply_4x4(cams_T_velo[:,0], xyz_velo0)
    # apply transformation of radar to camera coords (1st camera coords [:,0])
    rad_xyz_cam0 = utils.geom.apply_4x4(cams_T_velo[:,0], xyz_rad)

I do not understand how this is correct. How would one transformation matrix correctly transform both RADAR and LIDAR into the CAM_FRONT frame if the RADAR and LIDAR points are stored in different coordinate systems (ego poses relative to their respective sensors)?
I am sure I am missing something / have a misunderstanding; would someone be able to point me in the right direction? Thanks.

why "xyz_cam0 = utils.geom.apply_4x4(cams_T_velo[:,0], xyz_velo0)"?

Thanks for your great work.
When I read the code, I had a question about line 145 of "train_nuscenes.py":
xyz_cam0 = utils.geom.apply_4x4(cams_T_velo[:,0], xyz_velo0)
"cams_T_velo[:,0]" means "camera 0 to world", and "xyz_velo0" means the points in the world coordinate system.

It seems that the following code would be correct:
xyz_cam0 = utils.geom.apply_4x4(velo_T_cams[:,0], xyz_velo0)

Looking forward to your reply.
Thanks.
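
Not an authoritative answer, but the naming convention in this codebase appears to be that A_T_B denotes the transform that maps points expressed in frame B into frame A, so the names "cancel" when chained: applying cams_T_velo to xyz_velo yields xyz_cam, which would make the original line consistent. A tiny illustration with made-up values:

    import torch

    # A_T_B maps B-frame points into frame A: xyz_A = A_T_B @ xyz_B (homogeneous)
    cam_T_velo = torch.eye(4)
    cam_T_velo[:3, 3] = torch.tensor([0.0, -1.6, 0.5])    # made-up lidar-to-camera offset

    xyz_velo = torch.tensor([[10.0, 0.0, 0.0, 1.0]]).t()  # one homogeneous point in the lidar frame
    xyz_cam = cam_T_velo @ xyz_velo                       # the same point, now in camera coordinates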
