cen's People

Contributors

yikaiw

cen's Issues

For fusion of 3 modalities

Hi,
I was experimenting with your code on my own dataset. However, I realized that the image2image translation model only supports fusion of two modalities. Looking at the code in detail, it seems that the Exchange class is implemented for exactly two sub-networks:
    import torch
    import torch.nn as nn


    class Exchange(nn.Module):
        def __init__(self):
            super(Exchange, self).__init__()

        def forward(self, x, insnorm, insnorm_threshold):
            # Scaling factors (absolute values) of the two instance-norm layers
            insnorm1, insnorm2 = insnorm[0].weight.abs(), insnorm[1].weight.abs()
            x1, x2 = torch.zeros_like(x[0]), torch.zeros_like(x[1])
            # Keep a modality's channels whose factor is above the threshold,
            # and take the other modality's channels where it is below
            x1[:, insnorm1 >= insnorm_threshold] = x[0][:, insnorm1 >= insnorm_threshold]
            x1[:, insnorm1 < insnorm_threshold] = x[1][:, insnorm1 < insnorm_threshold]
            x2[:, insnorm2 >= insnorm_threshold] = x[1][:, insnorm2 >= insnorm_threshold]
            x2[:, insnorm2 < insnorm_threshold] = x[0][:, insnorm2 < insnorm_threshold]
            return [x1, x2]

You can see here that this is the case. Can you provide an Exchange class for more than two modalities? To make the request concrete, the sketch below shows the kind of generalization I have in mind.
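This is purely my own guess at how it might generalize, not code from this repo; the class name and the choice to average the other modalities are assumptions on my part. The idea: each modality keeps its channels whose scaling factor is above the threshold and, for the remaining channels, takes the average of the corresponding channels from all other modalities.

    import torch
    import torch.nn as nn


    class ExchangeMulti(nn.Module):
        """Hypothetical channel exchange for an arbitrary number of modalities."""

        def forward(self, x, insnorm, insnorm_threshold):
            # x: list of feature maps [B, C, H, W], one per modality
            # insnorm: list of instance-norm layers, one per modality
            scales = [norm.weight.abs() for norm in insnorm]
            out = []
            for i, xi in enumerate(x):
                mask = scales[i] < insnorm_threshold  # low-importance channels of modality i
                others = [x[j] for j in range(len(x)) if j != i]
                mean_others = torch.stack(others, dim=0).mean(dim=0)
                yi = xi.clone()
                yi[:, mask] = mean_others[:, mask]  # borrow these channels from the other modalities
                out.append(yi)
            return out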
Thanks in advance

Formatting iOS Lidar Depth Data For Transfer-Learning

We have 2D depth data corresponding to an RGB image; its values range from 0.0 to 5.0 and represent the straight-line distance in meters from the sensor to the object.

We want to do transfer learning (from your pretrained weights) on our dataset, and it seems that we should save our depth data as PNGs, since that is the format used in the dataset you link to. What preprocessing, if any, should we run so that our depth data is in the correct format? Maybe just scale 0-5 to 0-255 and save as a grayscale PNG, as in the sketch below?
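Concretely, this is the kind of preprocessing we have in mind (only a sketch; it assumes our depth arrives as a float32 NumPy array in meters, and the function name and the 5.0 m cap are ours):

    import numpy as np
    import cv2

    def save_depth_as_png(depth_m, out_path, max_depth=5.0):
        """Scale metric depth (0-5 m) linearly to 0-255 and save as an 8-bit grayscale PNG."""
        depth_8bit = np.clip(depth_m / max_depth, 0.0, 1.0) * 255.0
        cv2.imwrite(out_path, depth_8bit.astype(np.uint8))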

(By the way, in our data 0.0 is the default value when something is too far away or no signal is returned. It seems to be the same for the Kinect depth data, since the black patches are probably 0 values.)

And I don't think the paper mentions any preprocessing done on depth, but the utils/datasets.py file does have this:
    if key == 'depth':
        img = cv2.applyColorMap(cv2.convertScaleAbs(255 - img, alpha=1), cv2.COLORMAP_JET)

What is this doing?

Thanks in advance!
Eli

At the end of CE

Hi, suppose I have two modalities A and B and apply CE between them. If I understand correctly, this will make one of the two feature vectors more and more important and the other less and less important. In the end, should I simply discard the one that has become unimportant, or use soft_alpha to fuse the two?

Question about the final representation

Your work is really excellent. I would like to ask: after the CEN fusion, suppose I want to do multimodal sentiment analysis (with three modalities: text, audio, and video). According to the paper there are still three outputs after fusion, and each output now carries information from the other modalities. For the sentiment prediction, should I take only one of the modality outputs, or concatenate them?

About visualization figures in paper

Hi, thanks for your work. I wonder what 'averaged' means in Figure 3, since the visualized feature maps are chosen by the scaling factors in the BN layers. May I also ask which layer/stage these feature maps belong to, specifically? I would really like to know whether outdoor datasets show the same characteristics. I'd be grateful if you could describe this in more detail.

About where the mean (ensemble) is calculated

Hi, I think the function validate() in the segmentation experiment may be wrong. Its docstring says:

    """Validate segmenter
    Args:
      segmenter (nn.Module) : segmentation network
      val_loader (DataLoader) : training data iterator
      epoch (int) : current epoch
      num_classes (int) : number of classes to consider
    Returns:
      Mean IoU (float)
    """

However, I do not find any operation that calculates the mean over the output of RGB and that of depth. It seems to just return the IoU of depth, not the mean value. Would you mind giving more details on this? For reference, the sketch below shows roughly what I expected to find.
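Just to be explicit, this is roughly the kind of ensembling I expected to see somewhere in validate() (only a sketch; the function and variable names are my own, and I am assuming the segmenter returns one prediction per modality):

    import torch

    def ensemble_prediction(output_rgb, output_depth):
        """Average the per-modality class scores, then take the argmax (my own sketch)."""
        probs = (output_rgb.softmax(dim=1) + output_depth.softmax(dim=1)) / 2
        return probs.argmax(dim=1)  # [B, H, W] labels from the ensembled scores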

About the form of the multimodal data

Hello! Thank you very much for such excellent work!
I do not quite understand how the multimodal data are processed. My understanding is this: the data and label of each modality form one group, the data of the two modalities are trained in parallel at the same time, and channel exchange is performed within each modality. I am not sure whether my understanding is correct, and I very much look forward to your answer!
Thanks!

How can I get the "train" and "val" datasets?

It is my pleasure to see your paper "Deep Multimodal Fusion by Channel Exchanging", and I have downloaded the corresponding code from GitHub, but how can I get the "train" and "val" datasets? Looking forward to your early reply! Thank you!

Sparsity constraint in channel exchanging

Hello,
Thank you for your very interesting work! I was planning on experimenting with CEN, but I couldn't find the implementation of the sparsity constraint on channel exchanging mentioned in Section 3.3, namely that channel exchanging is only performed in different (disjoint) sub-parts of the channels for different modalities. Could you point me to where in the model this is implemented? For reference, the sketch below shows roughly what I was expecting to find.
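This is only my own guess at what such a constraint might look like; the function name, the half-and-half split, and which half belongs to which modality are all assumptions on my part, not code from this repo:

    def disjoint_l1_penalty(insnorm_weights, lamda=1e-4):
        """Sketch: L1 sparsity applied to disjoint channel halves, one half per modality
        (two-modality case)."""
        w1, w2 = insnorm_weights  # 1-D scaling-factor tensors of the two norm layers
        half = w1.shape[0] // 2
        # Modality 1 is pushed to be sparse in the first half of the channels,
        # modality 2 in the second half, so the exchanged sub-parts stay disjoint.
        penalty = w1[:half].abs().sum() + w2[half:].abs().sum()
        return lamda * penalty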

Thanks.

Paper request

Hello, I am interested in your research. I want to gain a deeper understanding by reading your paper, but I could not retrieve it on arxiv.org. Could you send me the paper? I promise that it will only be used for personal research and will not be spread. My email address is: [email protected]. Thank you very much.

Some questions about the image size

Thanks for your excellent work! I have some questions about the input size for NYUDv2:
Why is the processed image size not 480 x 640 in the provided NYUDv2 dataset?
And what is the AlignToMask() transformation used for in the NYUDv2 and SUNRGBD datasets?

Thank you very much, and I look forward to your reply.

Some questions about input image size

Thanks for your excellent work! I want to input images with different heights and widths, but I get an error. Do the input images have to have the same height and width?
Thank you very much, and I look forward to your reply.

Query regarding applicability to other tasks

Hi,
Thank you for your great work. I have a specific query, the answer to which was not mentioned in the paper:

Given homogeneous data (images), but where the different modalities (different image streams) do not correspond to different versions of the same view: for example, in pose regression methods like MapNet, the multimodal input would be two completely different images. So not two images (different views) of the same scene (like RGB+D images), but two images of two different scenes (one taken at time t and the other at time t+1).

Do you feel that there could still be a gain with channel exchanging?

Thank you for your answer.

Method to choose a good lambda (in Equation 4)

Hi, I am trying to use channel exchanging in a multimodal self-supervised network (depth and RGB) and follow this line of code to add the sparsity constraint.

When I plot the scaling factors that are under the sparsity constraint, I find that all of them eventually decrease to zero. However, in Figure 5 of your paper there seems to be a stable fraction of scaling factors that do not become zero (they stay above the threshold and are therefore not exchanged). May I ask whether you have encountered this case, where all scaling factors go to zero, in your experiments? For context, the sketch below shows roughly how I add the constraint.
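The way I add the term in my own training loop is roughly this (my own sketch, not code from this repo; slim_params is my list of the norm-layer scaling factors under the constraint, and lamda is the λ of Equation 4 that I am trying to choose):

    def add_sparsity_penalty(task_loss, slim_params, lamda):
        """Add the lambda * sum(|gamma|) term of Equation 4 to the task loss (my own sketch)."""
        l1 = sum(w.abs().sum() for w in slim_params)
        return task_loss + lamda * l1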

Best,
beniko

Question about CEN

Hello, thank you very much for the code you shared. We used your algorithm to train a model (aerial RGB and elevation modalities, with an ImageNet pre-trained model), but during validation we found that the scaling factors of the two modalities are very close, and as the network gets deeper, the scaling factors of all channels of the two modalities become almost identical, so the effect of the exchange is not obvious.
May I ask where the problem might be?
Thanks in advance!
