efnet's Issues

Running EFNet on custom dataset

Hi, how can I run EFNet on a custom dataset of video frames in which the root folder contains subfolders, each holding the frames of a separate video? The YAML configs here don't cover datasets of this type; only the h5 format is mentioned.
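As a starting point for the layout described above, the frames can be enumerated per video before any conversion or loading step. This is only a minimal sketch: `collect_video_frames` and `IMAGE_SUFFIXES` are hypothetical names that are not part of EFNet, and the sketch does not produce the h5 format the configs expect.

```python
from pathlib import Path

# Hypothetical helper, not part of the EFNet code base.
# Assumption: frames are stored as ordinary image files with these suffixes.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_video_frames(root):
    """Return {subfolder name: sorted list of frame paths} for each subfolder of root."""
    root = Path(root)
    videos = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        # Lexicographic sort: frame names should be zero-padded (0001.png, 0002.png, ...)
        # for this to match temporal order.
        frames = sorted(
            f for f in sub.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
        if frames:  # skip subfolders that contain no frames
            videos[sub.name] = frames
    return videos
```

Each entry of the returned dictionary is one video, so a loader or converter can iterate over videos without mixing frames from different sequences.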

running on my own frames

Hi,
I want to run the test script on my own frames without creating a dataset, but I am not sure how to do it. Could you please explain how to run the test script on my own frames?

Log Files from Training

Thank you for your awesome code!

I am hoping you might open-source the log files you have from training. Maybe the training and validation loss as a function of epoch (and/or batch) with an estimate of the runtime?

Loss is NaN during training

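Hello author, while training your network I ran into the loss becoming NaN partway through the iterations, so training cannot proceed normally. How can I solve this?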
2023-06-16 21:58:43,765 INFO: [debug..][epoch: 0, iter: 151, lr:(2.000e-04,2.000e-05,)] [eta: 395 days, 5:42:49, time (data): 0.489 (0.001)] l_pix: -3.0743e+01
2023-06-16 21:58:44,519 INFO: [debug..][epoch: 0, iter: 152, lr:(2.000e-04,2.000e-05,)] [eta: 392 days, 15:56:30, time (data): 0.482 (0.001)] l_pix: -2.8084e+01
2023-06-16 21:58:44,519 INFO: Saving models and training states.
Test 00001088: 100%|██████████| 1089/1089 [23:56<00:00, 1.29s/image]
2023-06-16 22:22:48,051 INFO: Validation debug, # psnr: 28.4838 # ssim: 0.9036
2023-06-16 22:22:48,546 INFO: [debug..][epoch: 0, iter: 153, lr:(2.000e-04,2.000e-05,)] [eta: 411 days, 19:13:58, time (data): 0.490 (0.001)] l_pix: -3.0197e+01
2023-06-16 22:22:49,036 INFO: [debug..][epoch: 0, iter: 154, lr:(2.000e-04,2.000e-05,)] [eta: 409 days, 3:35:46, time (data): 0.489 (0.001)] l_pix: -2.7506e+01
2023-06-16 22:22:49,505 INFO: [debug..][epoch: 0, iter: 155, lr:(2.000e-04,2.000e-05,)] [eta: 406 days, 12:46:06, time (data): 0.470 (0.001)] l_pix: inf
2023-06-16 22:22:49,988 INFO: [debug..][epoch: 0, iter: 156, lr:(2.000e-04,2.000e-05,)] [eta: 403 days, 22:44:43, time (data): 0.482 (0.002)] l_pix: nan
2023-06-16 22:22:50,474 INFO: [debug..][epoch: 0, iter: 157, lr:(2.000e-04,2.000e-05,)] [eta: 401 days, 9:30:33, time (data): 0.487 (0.002)] l_pix: nan
2023-06-16 22:22:50,938 INFO: [debug..][epoch: 0, iter: 158, lr:(2.000e-04,2.000e-05,)] [eta: 398 days, 21:02:06, time (data): 0.464 (0.001)] l_pix: nan
2023-06-16 22:22:51,750 INFO: [debug..][epoch: 0, iter: 159, lr:(2.000e-04,2.000e-05,)] [eta: 396 days, 9:26:15, time (data): 0.487 (0.001)] l_pix: nan
2023-06-16 22:22:52,240 INFO: [debug..][epoch: 0, iter: 160, lr:(2.000e-04,2.000e-05,)] [eta: 393 days, 22:28:10, time (data): 0.490 (0.002)] l_pix: nan
2023-06-16 22:22:52,240 INFO: Saving models and training states.
Test 00001088: 100%|██████████| 1089/1089 [21:33<00:00, 1.19s/image]
2023-06-16 22:44:32,789 INFO: Validation debug, # psnr: -42.1933 # ssim: 0.0002
2023-06-16 22:44:33,276 INFO: [debug..][epoch: 0, iter: 161, lr:(2.000e-04,2.000e-05,)] [eta: 410 days, 1:52:18, time (data): 0.481 (0.001)] l_pix: nan
2023-06-16 22:44:33,763 INFO: [debug..][epoch: 0, iter: 162, lr:(2.000e-04,2.000e-05,)] [eta: 407 days, 13:36:31, time (data): 0.485 (0.001)] l_pix: nan
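To narrow down where training diverges, the log can be scanned for the first iteration whose `l_pix` is not finite. The sketch below assumes the line layout shown in the log above (`iter: N, ...` followed later by `l_pix: VALUE`); `first_non_finite_iter` is a hypothetical helper, not part of the repository, and it only locates the iteration without explaining the cause.

```python
import math
import re

# Matches the "iter: N," and "l_pix: VALUE" fields as they appear in the log lines above.
LINE_RE = re.compile(r"iter:\s*(\d+),.*l_pix:\s*(\S+)")


def first_non_finite_iter(lines):
    """Return the first iteration whose l_pix is inf or nan, or None if all are finite."""
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # e.g. "Saving models" or validation lines
        try:
            value = float(match.group(2))  # float() accepts "inf" and "nan"
        except ValueError:
            continue  # unparseable value, ignore the line
        if not math.isfinite(value):
            return int(match.group(1))
    return None
```

Applied to the excerpt above, this would point at iteration 155, where `l_pix` first becomes `inf` one step before the `nan` values begin.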

Loss is NaN

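Hello author, while training your network I get a PSNR of -42 and an l_pix of nan, so training cannot proceed normally. How can I solve this? I only changed the dataset path; everything else was left unchanged.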

2023-06-15 21:08:13,778 INFO: Model [ImageEventRestorationModel] is created.
2023-06-15 21:08:13,804 INFO: Resuming training from epoch: 0, iter: 648.
2023-06-15 21:08:13,951 INFO: Start training from epoch: 0, iter: 648
2023-06-15 21:08:18,450 INFO: [debug..][epoch: 0, iter: 649, lr:(2.000e-04,2.000e-05,)] [eta: 5 days, 8:38:06, time (data): 4.499 (1.987)] l_pix: nan
2023-06-15 21:08:19,032 INFO: [debug..][epoch: 0, iter: 650, lr:(2.000e-04,2.000e-05,)] [eta: 4 days, 0:29:47, time (data): 0.581 (0.003)] l_pix: nan
2023-06-15 21:08:19,602 INFO: [debug..][epoch: 0, iter: 651, lr:(2.000e-04,2.000e-05,)] [eta: 3 days, 8:16:14, time (data): 0.570 (0.003)] l_pix: nan
2023-06-15 21:08:20,166 INFO: [debug..][epoch: 0, iter: 652, lr:(2.000e-04,2.000e-05,)] [eta: 2 days, 22:27:08, time (data): 0.563 (0.002)] l_pix: nan
2023-06-15 21:08:20,727 INFO: [debug..][epoch: 0, iter: 653, lr:(2.000e-04,2.000e-05,)] [eta: 2 days, 15:53:43, time (data): 0.561 (0.002)] l_pix: nan
2023-06-15 21:08:21,331 INFO: [debug..][epoch: 0, iter: 654, lr:(2.000e-04,2.000e-05,)] [eta: 2 days, 11:32:32, time (data): 0.603 (0.002)] l_pix: nan
2023-06-15 21:08:21,946 INFO: [debug..][epoch: 0, iter: 655, lr:(2.000e-04,2.000e-05,)] [eta: 2 days, 8:21:24, time (data): 0.615 (0.002)] l_pix: nan
2023-06-15 21:08:22,558 INFO: [debug..][epoch: 0, iter: 656, lr:(2.000e-04,2.000e-05,)] [eta: 2 days, 5:51:33, time (data): 0.611 (0.002)] l_pix: nan
2023-06-15 21:08:22,558 INFO: Saving models and training states.
2023-06-15 21:26:51,089 INFO: Validation debug, # psnr: -42.1933 # ssim: 0.0002
2023-06-15 21:26:51,656 INFO: [debug..][epoch: 0, iter: 657, lr:(2.000e-04,2.000e-05,)] [eta: 257 days, 21:51:19, time (data): 0.563 (0.002)] l_pix: nan
2023-06-15 21:26:52,259 INFO: [debug..][epoch: 0, iter: 658, lr:(2.000e-04,2.000e-05,)] [eta: 234 days, 14:08:50, time (data): 0.602 (0.003)] l_pix: nan
2023-06-15 21:26:52,840 INFO: [debug..][epoch: 0, iter: 659, lr:(2.000e-04,2.000e-05,)] [eta: 215 days, 3:37:24, time (data): 0.581 (0.002)] l_pix: nan

Distributed training error

Distributed training on 4 GPUs fails. My machine has 8 Titan GPUs, and the error message is: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0; torch.distributed.elastic.multiprocessing.errors.ChildFailedError. I used the training command given in the README.
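A note on reading the exit code: on POSIX systems, Python's process APIs report a child killed by signal N as exit code -N, so the number can be mapped back to a signal name with the standard `signal` module. The helper name `describe_exit_code` below is hypothetical, and the sketch only names the signal; it does not identify what triggered it.

```python
import signal


def describe_exit_code(exitcode):
    """Map a negative exit code (killed by signal) to the signal's name."""
    if exitcode < 0:
        # -N means the process was terminated by signal N.
        return signal.Signals(-exitcode).name
    return f"exit status {exitcode}"


if __name__ == "__main__":
    print(describe_exit_code(-11))
```

On Linux, -11 corresponds to SIGSEGV, meaning the worker process crashed with a segmentation fault rather than raising a Python exception.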

Question about training time

Hi, I really appreciate your excellent work and am trying to retrain the network, but it is taking quite a long time to reach the results reported in the paper. Is it normal for training to take about 5 days with a batch size of 4 for 800k iterations on a GTX 1080Ti? Could you please share some details about the training time?
I would appreciate an early reply.
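For comparison with other runs, the figures in this question can be converted to a per-iteration time. This is plain arithmetic on the numbers quoted above (5 days, 800k iterations); the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_iteration(days, iterations):
    """Average wall-clock seconds spent on one iteration."""
    return days * SECONDS_PER_DAY / iterations


def days_for(iterations, sec_per_iter):
    """Total training time in days at a given per-iteration cost."""
    return iterations * sec_per_iter / SECONDS_PER_DAY


if __name__ == "__main__":
    # 5 days for 800k iterations, as quoted in the question.
    print(seconds_per_iteration(5, 800_000))
```

Five days over 800k iterations works out to 0.54 seconds per iteration, which can be compared directly against the `time` field printed in a training log.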

What is the mask used for?

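Dear author, first of all thank you for your excellent work. I noticed that every image has a corresponding single-channel mask. How is this mask generated, and what is it used for?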

Training from scratch (200k iters + 100k iters) doesn't achieve the reported performance (PSNR 35.46) with GoPro

Hello, and first of all thank you for your good research.

I was trying to reproduce the performance reported in your paper with SCER-GoPro dataset that you shared as a link.
(Before I started training from scratch, I had checked that your pre-trained weights gave me PSNR 35.44.
I thought this difference was not that big.)

When I trained on the SCER-GoPro dataset with the code here, the total number of iterations was set to 200k, which does not match the paper (the paper says the total number of iterations was 300k). I assumed the discrepancy came from that difference.
I therefore trained for an additional 100k iterations; however, the performance dropped below that at 200k iterations.

Is this caused by a problem with resuming training, or do I need to modify some part of the experimental settings in the code to obtain the same performance?
The results are as follows:

psnr 35.46 ssim 0.972 at 200K iter
psnr: 34.4930 ssim: 0.9662 at 200K + 100K iter

I would really appreciate it if you could answer my question!

About qualitative results

Hello!
Thank you for the nice work!
I really appreciate it.
I'm curious about the qualitative results files you uploaded.
I think the links for GoPro test, REBlur test, and REBlur addition are all identical (they all point to the GoPro test results).
Can you update the Google Drive file links for the REBlur test and REBlur addition results?
Thanks!

Dataset release

Hi,

Thanks for your great work. Any plans for releasing the dataset?
I find the link in the repo is unavailable.

Thanks a lot.

multi-GPU

How can I modify the code to run on multiple GPUs? Thanks!

train error

raise subprocess.CalledProcessError(returncode=process.returncode,

subprocess.CalledProcessError: Command '['/root/miniconda3/bin/python', '-u', 'basicsr/train.py', '--local_rank=0', '-opt', 'options/train/GoPro/EFNet.yml', '--launcher', 'pytorch']' died with <Signals.SIGKILL: 9>.
(base) root@autodl-container-5d4911a352-200b5df6:~/autodl-tmp/EFNet-main/EFNet-main# /root/miniconda3/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 35 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
