
raft's Introduction

RAFT

This repository contains the source code for our paper:

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
ECCV 2020
Zachary Teed and Jia Deng

Requirements

The code has been tested with PyTorch 1.6 and CUDA 10.1.

conda create --name raft
conda activate raft
conda install pytorch=1.6.0 torchvision=0.7.0 cudatoolkit=10.1 matplotlib tensorboard scipy opencv -c pytorch

Demos

Pretrained models can be downloaded by running

./download_models.sh

or downloaded from Google Drive

You can demo a trained model on a sequence of frames

python demo.py --model=models/raft-things.pth --path=demo-frames
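
If you prefer to call a trained model from Python directly, the following is a minimal sketch based on demo.py (it assumes core/ is on sys.path, that args is the namespace demo.py builds with argparse, and that the inputs are (1, 3, H, W) tensors with values in [0, 255]):

import sys
sys.path.append('core')

import torch
from raft import RAFT
from utils.utils import InputPadder

def estimate_flow(args, image1, image2, checkpoint='models/raft-things.pth'):
    # Build the model the same way demo.py does and load the checkpoint.
    model = torch.nn.DataParallel(RAFT(args))
    model.load_state_dict(torch.load(checkpoint))
    model = model.module.cuda().eval()

    with torch.no_grad():
        # Pad both images so their dimensions are divisible by 8.
        padder = InputPadder(image1.shape)
        image1, image2 = padder.pad(image1.cuda(), image2.cuda())
        flow_low, flow_up = model(image1, image2, iters=20, test_mode=True)
    return padder.unpad(flow_up)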

Required Data

To evaluate/train RAFT, you will need to download the required datasets.

By default, datasets.py will search for the datasets in these locations. You can create symbolic links in the datasets folder to wherever the datasets were downloaded.

├── datasets
    ├── Sintel
        ├── test
        ├── training
    ├── KITTI
        ├── testing
        ├── training
        ├── devkit
    ├── FlyingChairs_release
        ├── data
    ├── FlyingThings3D
        ├── frames_cleanpass
        ├── frames_finalpass
        ├── optical_flow

Evaluation

You can evaluate a trained model using evaluate.py

python evaluate.py --model=models/raft-things.pth --dataset=sintel --mixed_precision

Training

We used the following training schedule in our paper (2 GPUs). Training logs will be written to the runs directory, which can be visualized using TensorBoard.

./train_standard.sh

If you have an RTX GPU, training can be accelerated using mixed precision. You can expect similar results in this setting (1 GPU).

./train_mixed.sh

(Optional) Efficient Implementation

You can optionally use our alternate (efficient) implementation by compiling the provided CUDA extension

cd alt_cuda_corr && python setup.py install && cd ..

and running demo.py and evaluate.py with the --alternate_corr flag. Note: this implementation is somewhat slower than all-pairs, but uses significantly less GPU memory during the forward pass.

raft's People

Contributors

cxy1997, jonathonluiten, magehrig, snavely, zachteed


raft's Issues

Why concatenate [x, flow]?

Dear Zachary,

Thanks for your nice work on RAFT. It is elegant, powerful, and highly novel. I am puzzled about why you concatenate [x, flow] before feeding x to the GRU. Does it seriously influence the results? I have not seen any ablation study of this.

cheers

Dropout never runs and if it does throws RuntimeError

Hi,

Thanks for sharing the code!

In this code, I think dropout will always be set to 0 by raft.py line 29. get_kwargs() returns a list of tuples:

[('batch_size', 2), ('clip', 1.0), ('corr_levels', 4), ('corr_radius', 4), ('dataset', 'sintel_1'), ('dropout', 0.2), ('epsilon', 1e-08), ('image_size', [368, 496]), ('iters', 12), ('lr', 2e-06), ('name', 'sintel1_do0'), ('num_steps', 1), ('restore_ckpt', None), ('small', False), ('wdecay', 5e-05)]

Then, the 'dropout' in args.get_kwargs() call should return False and args.dropout is then set to 0.
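
A small, self-contained illustration of that point (the list literal mirrors the output quoted above; a membership test against a list of (key, value) tuples never matches a bare key name):

kwargs = [('dropout', 0.2), ('small', False)]  # shape of the get_kwargs() return value
print('dropout' in kwargs)                     # False -> dropout is reset to 0
print('dropout' in dict(kwargs))               # True  -> converting to a dict would fix the check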

If this code is changed so that there is no reset of dropout, there seems to be a RuntimeError.

The masks for VariationalDropout are set for net and inp here.
However, right before the masks are applied here, inp is concatenated with the motion encoder here.

This means that the mask and the tensor are different sizes and will give the following RuntimeError.

File "core/modules/update.py", line 34, in forward
return self.mask * x
RuntimeError: The size of tensor a (128) must match the size of tensor b (256) at non-singleton dimension 1

Is this just a problem with the public-facing GitHub repo?

Best,
Charles

Why do the edges of the colorized optical flow look jaggy?

First, great work and congratulations on the paper's acceptance at ECCV 2020. I have a short question: why does the output image look like it has been bilinearly interpolated? Is there a setting that makes the edges look more natural?

I would be surprised if those output images have state-of-the-art EPE performance on Sintel.

Implementation of the Efficient computation part differs from the paper

Hi, regarding the Efficient Computation for High-Resolution Images part of the paper:

It mentions that the level-based correlation should be done by half-pooling first and then matrix multiplication. However, in the code (shown as screenshots in the original issue):

You first calculate the correlation between the two feature maps at full resolution, say [384/8, 1024/8] = [48, 128]. After some reshape, view, and matmul operations you get a [6144, 6144] cost volume.
Next, avg_pool2d is used to downsample it to get the following levels.
Such a calculation takes O(N^2), which is the opposite of the description in the paper. I am curious about the reason for the difference. Is this code an early version that doesn't yet include the efficient implementation?
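
For reference, a rough sketch of what the code described above appears to do (assumed shapes, not the repository's exact implementation): the full-resolution all-pairs volume is built with a single matmul and the coarser levels come from pooling it, which is exactly the O(N^2) behaviour mentioned.

import torch
import torch.nn.functional as F

def correlation_pyramid(fmap1, fmap2, num_levels=4):
    b, c, h, w = fmap1.shape
    # Full-resolution all-pairs correlation: memory grows as (h*w)^2.
    corr = torch.matmul(fmap1.view(b, c, h * w).transpose(1, 2),
                        fmap2.view(b, c, h * w)) / c ** 0.5
    corr = corr.view(b * h * w, 1, h, w)
    # Coarser levels are obtained by pooling the last two dimensions.
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, 2, stride=2)
        pyramid.append(corr)
    return pyramid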

DAVIS dataset

Hi, I want to use the DAVIS dataset, as you mention in Section 4.5 of the paper.
I wonder which pretrained model you used.
Also, should I modify evaluate.py to load the DAVIS dataset?

Thank you !!

Inference Time

Hi

Congratulations on winning the best paper award!

In the paper it is mentioned that the inference time of RAFT is around 100 ms. I want to ask what image size you are using for that measurement. Also, which tool are you using to measure the inference time?

Thanks

question on augmentor image size

Thanks for sharing your work. I have a question about the image size setting for the augmentor in the FlyingChairs dataset. Why do you set the crop size (self.image_size) to [368, 496]?

Thanks,

Memory issue with one GPU at inference time - process gets killed

Hi, thanks for the nice work!

I am testing RAFT on Epic Kitchens, and even though I limited the number of frames processed to 300 and I am using the --alternate_corr flag, the Python process gets killed after a certain point, with no warnings or messages (it only says "killed"). Interestingly, when I limit the frames to 100 it works fine.

I was wondering if anyone has had the same problem and how they resolved it. I only have one GPU available and more inference time is okay with me.

Thanks!

Dataloader question when running ./train_mixed.sh

File "train.py", line 247, in
train(args)
File "train.py", line 150, in train
train_loader = datasets.fetch_dataloader(args)
File "core/datasets.py", line 230, in fetch_dataloader
train_loader = data.DataLoader(train_dataset, batch_size=args.batch_size,
File "/data/zzl/anaconda3/envs/raft/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 224, in init
sampler = RandomSampler(dataset, generator=generator)
File "/data/zzl/anaconda3/envs/raft/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 95, in init
raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

CUDA out of memory (when using many frames)

Hi there

I know that a similar issue has been raised before and I already took a look at it. The suggested solution of modifying that line actually helped with images of size 1920x1080. So I made a few modifications to demo.py to read images and save the optical flow visualizations, and it works fine with 20 images. However, when I try a bigger number (300 images), I get the CUDA out of memory error again. I thought that using with torch.no_grad(): freed the memory used by PyTorch every time.

The error happens specifically in this line of code: return padder.pad(images)[0]

I even modified the code so that load_image_list() would be called for every 10 images inside a loop, so that I could use the returned values as input to the model and extract the flow for batches of 10 images, one batch at a time. But I still get an out of memory error. I was sure this would solve the issue, since I was able to work with 20 images before. What am I missing?
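
One way to keep peak memory flat is to load and process exactly one pair of frames per iteration rather than padding the whole list up front. A hedged sketch (load_image() and save_flow_viz() are hypothetical stand-ins for the helpers in demo.py, and model is an already loaded RAFT instance in eval mode):

import torch

@torch.no_grad()
def run_sequence(model, image_paths):
    for path1, path2 in zip(image_paths[:-1], image_paths[1:]):
        image1, image2 = load_image(path1), load_image(path2)  # hypothetical helper
        _, flow_up = model(image1, image2, iters=20, test_mode=True)
        save_flow_viz(flow_up.cpu(), path1)                     # hypothetical helper
        del image1, image2, flow_up                             # drop references before the next pair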

Thanks a lot !!

Results reproduction

Hi, thanks for sharing the code! I would like to reproduce the results of the paper and I have three questions. Let's say I want to fine-tune the chairs+things checkpoint on the Sintel dataset and compare my results to the results in the paper.

  1. What are the relevant metrics? In the paper I see two different tables: Table 4 mentions 'accuracy', and Table 1 does not mention any metrics, just numbers. Looking at Table 2 (KITTI dataset) I can see F1-epe and F1-all metrics. Fig. 6 says "Accuracy is measured by the EPE on the Sintel (train) final pass after training on C+T". So what are the metrics to measure the performance of the fine-tuned model? Looking at the evaluation function, it looks like the relevant metric is EPE. Am I right? (See the sketch after this list.)

  2. Looking at the Sintel dataset, I see two subsets: Clean and Final. The fine-tuning of the model on the Sintel dataset is done by training on both Clean and Final for 60,000 steps. According to Table 1, when a model trained on Chairs+Things is evaluated on Sintel (separately on Clean and Final), the numbers (EPE, I guess?) are Clean: 1.63, Final: 2.83. When a Sintel-fine-tuned model is evaluated, the numbers are 2.42 and 3.39. My question is the following: on which dataset was the Sintel-fine-tuned model evaluated?

  3. The number of fine-tuning steps is basically the number of batches fed to the model during the fine-tuning process, am I right? If I reduce the batch size, should I scale the number of steps as well? I.e., Sintel fine-tuning is 60,000 steps with batch size 4; if I want to use batch size 2, should the number of steps be 120,000?
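
For reference, a hedged sketch of the EPE metric as evaluate.py appears to compute it (assumed, not verbatim): the mean Euclidean distance between predicted and ground-truth flow vectors.

import torch

def end_point_error(flow_pred, flow_gt):
    # flow_pred, flow_gt: (2, H, W) flow fields
    return torch.sum((flow_pred - flow_gt) ** 2, dim=0).sqrt().mean().item()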

Thanks a lot!

I want to know the FLOPs of the model

I may have made some mistakes; the FLOPs I get are too large to believe.

input1 = torch.randn(2, 1, 3, 960, 1280)
input1 = tuple(input1)
flops, params = profile(test_model, input1)

Looking forward to your reply.
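
A hedged sketch of how the profiling call could be structured instead (assuming the thop profiler is the one used above): RAFT's forward takes two separate images, so both should be passed via inputs, and the count also depends on the number of refinement iterations. model is assumed to be an already constructed RAFT instance.

import torch
from thop import profile

def count_flops(model, height=960, width=1280):
    # Two dummy input images, as RAFT's forward expects (B, 3, H, W) pairs.
    image1 = torch.randn(1, 3, height, width)
    image2 = torch.randn(1, 3, height, width)
    macs, params = profile(model, inputs=(image1, image2))
    return macs, params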

Efficient Implementation

Hello,
I would like to ask whether I can use your efficient implementation of the correlation layer during training. I have limited GPU memory, but a long training time is acceptable.

Trained model load error

When running demo.py as described in the README, I get the error below. It seems that the provided trained models contain a zip magic number. Are there any solutions for this error?

(/home/anaconda3/envs/torch-1.0) l:/mnt/workspace/gitlab/RAFT-master$ python demo.py --model=/mnt/workspace/github/RAFT-master/models/raft-things.pth --path=demo-frames
Traceback (most recent call last):
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 189, in nti
n = int(s.strip() or "0", 8)
ValueError: invalid literal for int() with base 8: 'ild_tens'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 2294, in next
tarinfo = self.tarinfo.fromtarfile(self)
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 1090, in fromtarfile
obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 1032, in frombuf
chksum = nti(buf[148:156])
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 191, in nti
raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/site-packages/torch/serialization.py", line 595, in _load
return legacy_load(f)
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/site-packages/torch/serialization.py", line 506, in legacy_load
with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar,
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 1586, in open
return func(name, filemode, fileobj, **kwargs)
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 1616, in taropen
return cls(name, mode, fileobj, **kwargs)
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 1479, in init
self.firstmember = self.next()
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/tarfile.py", line 2306, in next
raise ReadError(str(e))
tarfile.ReadError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "demo.py", line 79, in
demo(args)
File "demo.py", line 52, in demo
model.load_state_dict(torch.load(args.model))
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/site-packages/torch/serialization.py", line 426, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/home/anaconda3/envs/torch-1.0/lib/python3.6/site-packages/torch/serialization.py", line 599, in _load
raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: /mnt/workspace/github/RAFT-master/models/raft-things.pth is a zip archive (did you mean to use torch.jit.load()?)
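
The checkpoints were saved with the zip-based serialization format introduced in PyTorch 1.6, which PyTorch 1.0 cannot read. A hedged workaround sketch (run from an environment with PyTorch >= 1.6) is to re-save the weights in the legacy format:

import torch

state_dict = torch.load('models/raft-things.pth', map_location='cpu')
# Write the same weights back out in the pre-1.6 (non-zip) format.
torch.save(state_dict, 'models/raft-things-legacy.pth',
           _use_new_zipfile_serialization=False)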

Are the pretrained models provided the same as in the paper?

Thank you for your excellent work on optical flow. May I ask whether the provided pretrained models are the same as described in the paper?
Based on the name, I would assume:
raft-chairs.pth: pretrained on FlyingChairs
raft-things.pth: pretrained on FlyingChairs + FlyingThings
raft-small.pth: smaller model pretrained on Chairs + Things
raft-kitti.pth: trained on C+T+K
raft-sintel.pth: trained on C+T+S
Is it right?

failed to load model

Hello,
Thanks for publishing a great code.
While trying to test your model, I have a problem with model loading, as follows.

Here is the command line I used
python ./demo.py --model ./models/raft-small.pth --path /media/F/sample_codes/RAFT/demo_data/cam01

Here is the error message

File "./demo.py", line 81, in
demo(args)
File "./demo.py", line 53, in demo
model.load_state_dict(torch.load(args.model))

RuntimeError: Error(s) in loading state_dict for RAFT:
Missing key(s) in state_dict: "fnet.conv1.weight", "fnet.conv1.bias", "fnet.layer1.0.conv1.weight", "fnet.layer1.0.conv1.bias", "fnet.layer1.0.conv2.weight", "fnet.layer1.0.conv2.bias", "fnet.layer1.1.conv1.weight", "fnet.layer1.1.conv1.bias", "fnet.layer1.1.conv2.weight", "fnet.layer1.1.conv2.bias", "fnet.layer2.0.conv1.weight", "fnet.layer2.0.conv1.bias", "fnet.layer2.0.conv2.weight", "fnet.layer2.0.conv2.bias", "fnet.layer2.0.downsample.0.weight", "fnet.layer2.0.downsample.0.bias", "fnet.layer2.1.conv1.weight", "fnet.layer2.1.conv1.bias", "fnet.layer2.1.conv2.weight", "fnet.layer2.1.conv2.bias", "fnet.layer3.0.conv1.weight", "fnet.layer3.0.conv1.bias", "fnet.layer3.0.conv2.weight", "fnet.layer3.0.conv2.bias", "fnet.layer3.0.downsample.0.weight", "fnet.layer3.0.downsample.0.bias", "fnet.layer3.1.conv1.weight", "fnet.layer3.1.conv1.bias", "fnet.layer3.1.conv2.weight", "fnet.layer3.1.conv2.bias", "fnet.conv2.weight", "fnet.conv2.bias", "cnet.norm1.weight", "cnet.norm1.bias", "cnet.norm1.running_mean", "cnet.norm1.running_var", "cnet.conv1.weight", "cnet.conv1.bias", "cnet.layer1.0.conv1.weight", "cnet.layer1.0.conv1.bias", "cnet.layer1.0.conv2.weight", "cnet.layer1.0.conv2.bias", "cnet.layer1.0.norm1.weight", "cnet.layer1.0.norm1.bias", "cnet.layer1.0.norm1.running_mean", "cnet.layer1.0.norm1.running_var", "cnet.layer1.0.norm2.weight", "cnet.layer1.0.norm2.bias", "cnet.layer1.0.norm2.running_mean", "cnet.layer1.0.norm2.running_var", "cnet.layer1.1.conv1.weight", "cnet.layer1.1.conv1.bias", "cnet.layer1.1.conv2.weight", "cnet.layer1.1.conv2.bias", "cnet.layer1.1.norm1.weight", "cnet.layer1.1.norm1.bias", "cnet.layer1.1.norm1.running_mean", "cnet.layer1.1.norm1.running_var", "cnet.layer1.1.norm2.weight", "cnet.layer1.1.norm2.bias", "cnet.layer1.1.norm2.running_mean", "cnet.layer1.1.norm2.running_var", "cnet.layer2.0.conv1.weight", "cnet.layer2.0.conv1.bias", "cnet.layer2.0.conv2.weight", "cnet.layer2.0.conv2.bias", "cnet.layer2.0.norm1.weight", "cnet.layer2.0.norm1.bias", "cnet.layer2.0.norm1.running_mean", "cnet.layer2.0.norm1.running_var", "cnet.layer2.0.norm2.weight", "cnet.layer2.0.norm2.bias", "cnet.layer2.0.norm2.running_mean", "cnet.layer2.0.norm2.running_var", "cnet.layer2.0.norm3.weight", "cnet.layer2.0.norm3.bias", "cnet.layer2.0.norm3.running_mean", "cnet.layer2.0.norm3.running_var", "cnet.layer2.0.downsample.0.weight", "cnet.layer2.0.downsample.0.bias", "cnet.layer2.0.downsample.1.weight", "cnet.layer2.0.downsample.1.bias", "cnet.layer2.0.downsample.1.running_mean", "cnet.layer2.0.downsample.1.running_var", "cnet.layer2.1.conv1.weight", "cnet.layer2.1.conv1.bias", "cnet.layer2.1.conv2.weight", "cnet.layer2.1.conv2.bias", "cnet.layer2.1.norm1.weight", "cnet.layer2.1.norm1.bias", "cnet.layer2.1.norm1.running_mean", "cnet.layer2.1.norm1.running_var", "cnet.layer2.1.norm2.weight", "cnet.layer2.1.norm2.bias", "cnet.layer2.1.norm2.running_mean", "cnet.layer2.1.norm2.running_var", "cnet.layer3.0.conv1.weight", "cnet.layer3.0.conv1.bias", "cnet.layer3.0.conv2.weight", "cnet.layer3.0.conv2.bias", "cnet.layer3.0.norm1.weight", "cnet.layer3.0.norm1.bias", "cnet.layer3.0.norm1.running_mean", "cnet.layer3.0.norm1.running_var", "cnet.layer3.0.norm2.weight", "cnet.layer3.0.norm2.bias", "cnet.layer3.0.norm2.running_mean", "cnet.layer3.0.norm2.running_var", "cnet.layer3.0.norm3.weight", "cnet.layer3.0.norm3.bias", "cnet.layer3.0.norm3.running_mean", "cnet.layer3.0.norm3.running_var", "cnet.layer3.0.downsample.0.weight", "cnet.layer3.0.downsample.0.bias", "cnet.layer3.0.downsample.1.weight", 
"cnet.layer3.0.downsample.1.bias", "cnet.layer3.0.downsample.1.running_mean", "cnet.layer3.0.downsample.1.running_var", "cnet.layer3.1.conv1.weight", "cnet.layer3.1.conv1.bias", "cnet.layer3.1.conv2.weight", "cnet.layer3.1.conv2.bias", "cnet.layer3.1.norm1.weight", "cnet.layer3.1.norm1.bias", "cnet.layer3.1.norm1.running_mean", "cnet.layer3.1.norm1.running_var", "cnet.layer3.1.norm2.weight", "cnet.layer3.1.norm2.bias", "cnet.layer3.1.norm2.running_mean", "cnet.layer3.1.norm2.running_var", "cnet.conv2.weight", "cnet.conv2.bias", "update_block.encoder.convc1.weight", "update_block.encoder.convc1.bias", "update_block.encoder.convc2.weight", "update_block.encoder.convc2.bias", "update_block.encoder.convf1.weight", "update_block.encoder.convf1.bias", "update_block.encoder.convf2.weight", "update_block.encoder.convf2.bias", "update_block.encoder.conv.weight", "update_block.encoder.conv.bias", "update_block.gru.convz1.weight", "update_block.gru.convz1.bias", "update_block.gru.convr1.weight", "update_block.gru.convr1.bias", "update_block.gru.convq1.weight", "update_block.gru.convq1.bias", "update_block.gru.convz2.weight", "update_block.gru.convz2.bias", "update_block.gru.convr2.weight", "update_block.gru.convr2.bias", "update_block.gru.convq2.weight", "update_block.gru.convq2.bias", "update_block.flow_head.conv1.weight", "update_block.flow_head.conv1.bias", "update_block.flow_head.conv2.weight", "update_block.flow_head.conv2.bias", "update_block.mask.0.weight", "update_block.mask.0.bias", "update_block.mask.2.weight", "update_block.mask.2.bias".

    Unexpected key(s) in state_dict: "module.fnet.conv1.weight", "module.fnet.conv1.bias", "module.fnet.layer1.0.conv1.weight", "module.fnet.layer1.0.conv1.bias", "module.fnet.layer1.0.conv2.weight", "module.fnet.layer1.0.conv2.bias", "module.fnet.layer1.0.conv3.weight", "module.fnet.layer1.0.conv3.bias", "module.fnet.layer1.1.conv1.weight", "module.fnet.layer1.1.conv1.bias", "module.fnet.layer1.1.conv2.weight", "module.fnet.layer1.1.conv2.bias", "module.fnet.layer1.1.conv3.weight", "module.fnet.layer1.1.conv3.bias", "module.fnet.layer2.0.conv1.weight", "module.fnet.layer2.0.conv1.bias", "module.fnet.layer2.0.conv2.weight", "module.fnet.layer2.0.conv2.bias", "module.fnet.layer2.0.conv3.weight", "module.fnet.layer2.0.conv3.bias", "module.fnet.layer2.0.downsample.0.weight", "module.fnet.layer2.0.downsample.0.bias", "module.fnet.layer2.1.conv1.weight", "module.fnet.layer2.1.conv1.bias", "module.fnet.layer2.1.conv2.weight", "module.fnet.layer2.1.conv2.bias", "module.fnet.layer2.1.conv3.weight", "module.fnet.layer2.1.conv3.bias", "module.fnet.layer3.0.conv1.weight", "module.fnet.layer3.0.conv1.bias", "module.fnet.layer3.0.conv2.weight", "module.fnet.layer3.0.conv2.bias", "module.fnet.layer3.0.conv3.weight", "module.fnet.layer3.0.conv3.bias", "module.fnet.layer3.0.downsample.0.weight", "module.fnet.layer3.0.downsample.0.bias", "module.fnet.layer3.1.conv1.weight", "module.fnet.layer3.1.conv1.bias", "module.fnet.layer3.1.conv2.weight", "module.fnet.layer3.1.conv2.bias", "module.fnet.layer3.1.conv3.weight", "module.fnet.layer3.1.conv3.bias", "module.fnet.conv2.weight", "module.fnet.conv2.bias", "module.cnet.conv1.weight", "module.cnet.conv1.bias", "module.cnet.layer1.0.conv1.weight", "module.cnet.layer1.0.conv1.bias", "module.cnet.layer1.0.conv2.weight", "module.cnet.layer1.0.conv2.bias", "module.cnet.layer1.0.conv3.weight", "module.cnet.layer1.0.conv3.bias", "module.cnet.layer1.1.conv1.weight", "module.cnet.layer1.1.conv1.bias", "module.cnet.layer1.1.conv2.weight", "module.cnet.layer1.1.conv2.bias", "module.cnet.layer1.1.conv3.weight", "module.cnet.layer1.1.conv3.bias", "module.cnet.layer2.0.conv1.weight", "module.cnet.layer2.0.conv1.bias", "module.cnet.layer2.0.conv2.weight", "module.cnet.layer2.0.conv2.bias", "module.cnet.layer2.0.conv3.weight", "module.cnet.layer2.0.conv3.bias", "module.cnet.layer2.0.downsample.0.weight", "module.cnet.layer2.0.downsample.0.bias", "module.cnet.layer2.1.conv1.weight", "module.cnet.layer2.1.conv1.bias", "module.cnet.layer2.1.conv2.weight", "module.cnet.layer2.1.conv2.bias", "module.cnet.layer2.1.conv3.weight", "module.cnet.layer2.1.conv3.bias", "module.cnet.layer3.0.conv1.weight", "module.cnet.layer3.0.conv1.bias", "module.cnet.layer3.0.conv2.weight", "module.cnet.layer3.0.conv2.bias", "module.cnet.layer3.0.conv3.weight", "module.cnet.layer3.0.conv3.bias", "module.cnet.layer3.0.downsample.0.weight", "module.cnet.layer3.0.downsample.0.bias", "module.cnet.layer3.1.conv1.weight", "module.cnet.layer3.1.conv1.bias", "module.cnet.layer3.1.conv2.weight", "module.cnet.layer3.1.conv2.bias", "module.cnet.layer3.1.conv3.weight", "module.cnet.layer3.1.conv3.bias", "module.cnet.conv2.weight", "module.cnet.conv2.bias", "module.update_block.encoder.convc1.weight", "module.update_block.encoder.convc1.bias", "module.update_block.encoder.convf1.weight", "module.update_block.encoder.convf1.bias", "module.update_block.encoder.convf2.weight", "module.update_block.encoder.convf2.bias", "module.update_block.encoder.conv.weight", "module.update_block.encoder.conv.bias", 
"module.update_block.gru.convz.weight", "module.update_block.gru.convz.bias", "module.update_block.gru.convr.weight", "module.update_block.gru.convr.bia

demo does not work

I've been able to run the demo with the code from this commit: 559176d
But on the latest commit, when trying to run the demo I get this error:

Traceback (most recent call last):
File ".../RAFT/demo.py", line 75, in
demo(args)
File ".../RAFT/demo.py", line 44, in demo
model.load_state_dict(torch.load(args.model))
File "G:\Software\Anaconda\envs\scopeflow\lib\site-packages\torch\nn\modules\module.py", line 847, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
Missing key(s) in state_dict: "module.update_block.mask.0.weight", "module.update_block.mask.0.bias", "module.update_block.mask.2.weight", "module.update_block.mask.2.bias".

Data changes on KITTI2015

I have noticed that the data about KITTI2015 in this article has changed on the KITTI website (from 6.30 to 5.10).

What happened?

Are you going to correct the paper and update the KITTI weight file on github?

chairs checkpoint reproduction

Dear authors

The raft-chairs.pth checkpoint you provide, was it trained using multi-gpu with the command in train_standard.sh, or with train_mixed.sh?

flow_low ?

Hey, I'm a total noob at optical flow estimation and I don't understand what flow_low is.
flow_low, flow_up = model(image1, image2, iters=20, test_mode=True)

An explanation would be greatly appreciated.

A consultation on the use of RAFT

Dear author of RAFT:
It is very nice that your work on RAFT has advanced optical flow estimation. Recently, I have been preparing to use this work to generate optical flow maps for DAVIS 2016, which contains 50 short videos widely used for video object segmentation. However, I am new to optical flow estimation and am a little puzzled by the provided demo.py and evaluate.py. I want to know which .py file I should use to estimate flow for DAVIS 2016.

Question about dataset

Hi, Thanks for sharing the work. I have a question about the training dataset.

  1. In the paper, "C+T+S+K+H" is a combination of KITTI, HD1K, and Sintel data when fine-tuning on Sintel. However, the code shows it is a combination of KITTI, HD1K, Sintel, and FlyingThings3D. C+T+S/K also includes FlyingThings3D. Is there a mistake in the code?

Question about training

Hi!
Thanks for sharing such excellent work!
I'm wondering what GPU you used for training, and how long each training phase takes.

Best wishes!

Not really an issue, per se

But in core/utils/utils.py in InputPadder, you have a line that looks like

(((self.ht // 8) + 1) * 8 - self.ht) % 8

And I just want to say that that's mathematically equivalent to (-self.ht)%8, and yields the exact same result for all integers and floats.
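
A quick check of the claim for non-negative integers:

for ht in range(0, 1024):
    assert (((ht // 8) + 1) * 8 - ht) % 8 == (-ht) % 8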

Question about stacking multiple frames

Hi,
congratulations, I tried your repo and it's a very good work.
I'm a newcomer to optical flow, and by looking at your code I see that you compute optical flow for a pair of frames. Reading some papers, I found that some works stack multiple consecutive frames (e.g. 6) and compute optical flow for them. Is there a way to do something similar in your framework, i.e. computing optical flow across six consecutive frames instead of two?

Thank you!

Do the neighbor lookups offset the all-pairs correlation?

Hi, I am curious about the neighbor lookup operation applied to the all-pairs correlation.
As I understand it, each entry of the all-pairs cost volume is a correlation value between a vector in fmap1 and a vector in fmap2.
However, the neighbor lookups only capture a limited range of correlation values to be concatenated into a feature map. Thus, the all-pairs correlation is not fully used in the following steps, and the correlation is still consumed according to a search range.

For example, consider the corr at level 1: [48x128, 1, 48, 128] = [6144, 1, 48, 128] (ignoring the batch).
coords_lvl has shape [6144, 7, 7, 2], which means each pixel in fmap1 has a (7, 7) grid of neighbor coordinates.
Then bilinear_sampler samples the all-pairs cost volume according to coords_lvl.
Finally, the resulting corr is [48, 128, 49]: each entry stores the correlation value between a feature in fmap1 and its 7x7 neighborhood in fmap2.
So where does the all-pairs aspect come in?

Confusion about the representation in formula (1)

It's a nice work, congratulations!
After going through the paper, I tried to understand the meaning of formula (1), since the paper doesn't define the indices i, j, h, k, l.

I think (i, j) should be the coordinates in feature map 1, (k, l) the coordinates in feature map 2, and h the channel index of feature map 1. However, all feature pairs should have the same number of channels, so the formula is presumably supposed to be
C_ijkl = sum_h( g(I1)_ijh * g(I2)_klh )
instead of the version printed in the paper, where the summation is over the channel but g(I2) is written without a channel index, which doesn't make sense.
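
A small sanity check (a sketch with random tensors) that the corrected formula is exactly the all-pairs matrix multiplication over the shared channel dimension:

import torch

c, H, W = 4, 3, 5
f1, f2 = torch.randn(c, H, W), torch.randn(c, H, W)
# C_ijkl = sum over the channel of g(I1)[:, i, j] * g(I2)[:, k, l]
corr_formula = torch.einsum('cij,ckl->ijkl', f1, f2)
corr_matmul = (f1.reshape(c, H * W).t() @ f2.reshape(c, H * W)).reshape(H, W, H, W)
assert torch.allclose(corr_formula, corr_matmul)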

Schematic + minor bug

Hi!
Nice work, I especially appreciate the code clarity.
I just created a schematic of RAFT in drawio and wanted to share:
https://drive.google.com/file/d/1R9NeeKfHLCyMm6S7wH8sAjvI98Sc7sDg/view?usp=sharing
If you find any mistakes or have suggestions to make it more clear please tell me.

Also, just a minor bug - I think when constructing the correlation pyramid:

for i in range(self.num_levels):

you meant to write

for i in range(self.num_levels - 1):

as currently the code constructs one extra level that is not used.

Have a nice day,
Tomas

What metrics should we refer to during training?

Hi,
How should I interpret the metrics during training? Or could you please provide a reference loss curve?
As I read your code, the metrics are divided into four parts: the average EPE over all pixels, and the average EPE over pixels with error < 1 px, < 3 px, and < 5 px. Based on my training records, the all-pixel, < 1 px, and < 3 px values become larger at first while the < 5 px value decreases. My understanding is that the predicted flow at first is inaccurate and quite large, so the < 5 px value is large, while a small overall EPE does not by itself indicate a correct prediction; is that right?
Therefore, what value should we watch during training to check convergence, since optical flow does not have a straightforward evaluation like the accuracy of a classification task?
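
For reference, a paraphrased sketch of how the training metrics appear to be computed (assumed, not verbatim from train.py): 'epe' is the mean end-point error over all pixels, while '1px', '3px', and '5px' are the fractions of pixels whose error falls below those thresholds, so the threshold metrics should increase as training converges while 'epe' decreases.

import torch

def training_metrics(flow_pred, flow_gt):
    # flow_pred, flow_gt: (B, 2, H, W) flow fields
    epe = torch.sum((flow_pred - flow_gt) ** 2, dim=1).sqrt()
    return {
        'epe': epe.mean().item(),                # mean end-point error (lower is better)
        '1px': (epe < 1).float().mean().item(),  # fraction of pixels with error under 1 px
        '3px': (epe < 3).float().mean().item(),
        '5px': (epe < 5).float().mean().item(),
    }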

Question about the flow estimation.

Hi! First of all, thank you so much for sharing this work, it is really amazing!

I'm using this repo to generate the optical flow of a generic video. We are working on face anti-spoofing techniques and want to use an approach like this, but I'm getting weird results and I don't know whether they are normal; you can see the output here:
https://drive.google.com/file/d/1-ALLjgPthC4P52F-V35dVDkTZq2EbQyN/view?usp=sharing

Is this the normal output of this framework? It seems a little noisy, but I don't know if I messed something up somewhere.

These are the arguments I'm using; thanks in advance for any advice or suggestions!

parser.add_argument('--model', default = 'models/chairs+things.pth', help="restore checkpoint")
parser.add_argument('--small', action='store_true', help='use small model')
parser.add_argument('--iters', type=int, default=50)

About the up mask

Hi! First of all, thank you so much for sharing this work.
Actually, we can upsample the predictions after all iterations.

EPE performance on the FlyingChairs dataset

Hi Zach,

Thanks for your excellent work. The provided pretrained models have been trained on both the Chairs and Things3D datasets. It is quicker to validate ideas on the Chairs dataset. I was wondering whether you could provide the RAFT model pretrained only on the Chairs dataset, or its performance.

Best,
Jianyuan

How to deal with the DataParallel imbalanced memory usage problem?

Hi, I am very interested in your work and am trying to train your model using your training scheme. But when using FlyingChairs as the training dataset, the maximum batch size I can set is 5, whether I use one 2080 Ti GPU or two, because of DataParallel's imbalanced memory usage. Could you tell me how to deal with this problem?

Why scale mask by 0.25 to balance gradients?

Hi Zachary:

I read your code in the update block.

mask = .25 * self.mask(net)

It mentions that the mask is scaled to balance gradients. I wonder what the rationale is for choosing 0.25 in the upsample-by-8 scenario. What would you choose if you were to upsample by 2 or by 4?

Thanks!

What does '2-view' mean in the paper?

Hi, thanks for sharing the excellent work!!

A quick question: could you please explain what '2-view' means in Table 1 of the original paper, e.g. Ours (2-view)? There seems to be no explanation in the paper or the supplementary material.

Does it mean a stereo setting, or just the normal model with 4.8M parameters (compared to the small model with 1M parameters)?

Thanks!!

about pretrained models

Thank you very much for your work. For some reason I want to run your code on PyTorch 1.3; could you please provide the small version of the model in the non-zip format?

GPU Memory Usage

Thanks for your code!

How much GPU memory is used during inference with one 1080x1920 image? And how about the speed: ~500 ms (100 ms for 1036x482 -> 500 ms for 1080x1920)?

Why do you use batch normalization in the context encoder?

Hi, this is a very nice work! Thanks for your contribution!

In your paper, you use instance normalization in the feature encoder but batch normalization in the context encoder. In the code, however, I find that the batch norm is frozen after training on Chairs. In this case, I am confused about why you do not use instance normalization in both the feature encoder and the context encoder. Or perhaps the feature encoder and context encoder could be shared, which would further reduce the parameters.
