
neosr's People

Contributors

corpsecreate, csjliang, cugtyt, henrymai, ibobbyts, ira7bar, jeremyiv, joeyballentine, kim2091, liangbinxie, ljzycmd, lotayou, marcobarbierato, mikewando, muslll, my-zhu, orgoro, phhofm, rundevelopment, terrainer, umzi2, wenlongzhang0517, wwhio, xinntao, xpixeler, zenjieli, zestloveheart


neosr's Issues

wrapper around train.py, ... scripts and release on pypi

That would heavily simplify update management for users and make installation much simpler: just run pip install neosr and you'd be ready to go.

As the config templates wouldn't be available locally anymore, neosr template ls and neosr template write commands may also be of interest.
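For illustration, a rough sketch of what that command-line surface could look like (entirely hypothetical: the neosr command, the template subcommand, and the neosr.templates package don't exist today):

import argparse
from importlib import resources


def main() -> None:
    parser = argparse.ArgumentParser(prog="neosr")
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train", help="wrapper around train.py")
    train.add_argument("-opt", required=True, help="path to a config yml")

    template = sub.add_parser("template", help="manage config templates")
    template.add_argument("action", choices=["ls", "write"])
    template.add_argument("name", nargs="?", help="template to write out")

    args = parser.parse_args()
    if args.command == "template" and args.action == "ls":
        # assumes templates ship as package data inside a neosr.templates package
        for entry in resources.files("neosr.templates").iterdir():
            print(entry.name)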

Feature request: allow toggling attention implementation via options

Some networks have been modified from their official versions to use scaled_dot_product_attention. I understand this has some benefits, but for the time being it also prevents ONNX export. There has been some discussion on Discord about adding the ability to toggle the implementation (initially, I think, to turn it on for the archs that have it off at the moment). The primary concern with adding the option, as I understand it, is breaking compatibility with e.g. chaiNNer, but with an agreed-upon convention it should be possible to support both. I'm opening this issue to track progress on the topic more easily than digging up Discord links every time.
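For what it's worth, a minimal sketch of what such a toggle could look like on the Python side (the flash_attn kwarg name is made up, and torch >= 2.1 is assumed for the scale argument):

import torch.nn.functional as F

def attention(q, k, v, scale, flash_attn: bool = True):
    if flash_attn:
        # fused path; fast, but currently blocks ONNX export
        return F.scaled_dot_product_attention(q, k, v, scale=scale)
    # explicit path; slower, but exports to ONNX
    attn = (q @ k.transpose(-2, -1)) * scale
    return attn.softmax(dim=-1) @ v

The toggle would presumably be read from network_g in the yml and threaded into the arch constructor.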

Automatically remove old checkpoints and visualizations

I recently trained 2 models with neosr and my hard drive filled up pretty quickly with old checkpoints and visualizations that were of no interest to me. I do want to see the visualizations of recent iterations and compare them with older ones, but I don't need every saved visualization to do that.

So I would like to request an option for train.py that automatically removes old checkpoints and their visualizations to take up less space. I think the option could be called --auto-clean or similar.

Since I'm talking about checkpoints and visualizations in this issue, --auto-clean would only work when we can map visualizations to checkpoints. Basically, we need val.val_freq == logger.save_checkpoint_freq. If there are no visualizations, then that works too of course. If the option is enabled, then only the following checkpoints (and associated visualizations) should be kept:

  • Keep the latest 10 checkpoints.
  • Keep every 10th checkpoint.
  • Keep every checkpoint that has been the best in at least one validation metric.

All other checkpoints (and their associated validations) should be removed.
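A minimal sketch of the selection logic, assuming checkpoints are identified by iteration number and the "best" iterations are already known:

def select_kept(iters: list[int], best_iters: set[int]) -> set[int]:
    """Return the iteration numbers whose checkpoints should be kept."""
    iters = sorted(iters)
    keep = set(iters[-10:])  # keep the latest 10 checkpoints
    keep.update(it for i, it in enumerate(iters, 1) if i % 10 == 0)  # every 10th
    keep.update(best_iters)  # best in at least one validation metric
    return keep

Everything not in the returned set (checkpoint, state, and its associated visualizations) would then be deleted.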

What do you think?

Perceptual Loss and AMP - discussion

Currently, Perceptual Loss and AMP are causing issues; everything points towards GradScaler zeroing values.
Read more in the pytorch forums thread.
A temporary solution was committed, moving perceptual loss completely outside of autocast and doing backward() without GradScaler. This is not an optimal solution, however.
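For reference, a rough sketch of the shape of that temporary workaround (assumed names and dtype; not the actual committed code):

import torch

def train_step(net_g, pixel_loss, perceptual_loss, optimizer, lq, gt):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = net_g(lq)
        l_pix = pixel_loss(output, gt)

    # perceptual loss in full precision, outside the autocast region
    l_percep = perceptual_loss(output.float(), gt.float())

    # backward() without GradScaler, since the scaler was zeroing values
    (l_pix + l_percep).backward()
    optimizer.step()
    optimizer.zero_grad()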

Non-RGB SPAN models

Hi @muslll!

I just read through the SPAN code again and wondered whether SPAN even supports anything other than RGB images as input. If I understand PyTorch tensors correctly, this line:

x = (x - self.mean) * self.img_range

will fail for non-RGB inputs because self.mean is defined like this:

self.mean = torch.Tensor(rgb_mean).view(1, 3, 1, 1)

and will always have 3 channels.

So the torch.Tensor(rgb_mean).view(1, 3, 1, 1) should probably be changed to torch.Tensor(rgb_mean).view(1, in_channels, 1, 1). Alternatively, we could use the same approach as SwinIR:

https://github.com/muslll/neosr/blob/master/neosr/archs/swinir_arch.py#L775-L779

        if in_chans == 3:
            rgb_mean = (0.4488, 0.4371, 0.4040)
            self.mean = torch.Tensor(rgb_mean).view(1, 3, 1, 1)
        else:
            self.mean = torch.zeros(1, 1, 1, 1)

Correction: the suggestions above are not backwards compatible, because single-channel images are currently broadcast against the 3-channel mean. So we need to keep the current behavior for in_chans in (1, 3).
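Something like this would keep backwards compatibility (a sketch of the relevant lines in __init__, not a tested patch):

if in_chans in (1, 3):
    # current behavior, kept for backwards compatibility
    # (single-channel inputs are broadcast against the 3-channel mean)
    self.mean = torch.Tensor(rgb_mean).view(1, 3, 1, 1)
else:
    self.mean = torch.zeros(1, in_chans, 1, 1)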

What do you think?

Mismatched resume state issue

When resuming training from a manually specified resume state, if the model name at the top of the config doesn't match the resume state, it throws an ambiguous error about the yml file being missing. It would be good to have a proper error message for this condition.

OmniSR variable passing bug.

I was training a 2x OmniSR model, and encountered the following error:

Traceback (most recent call last):
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\train.py", line 238, in <module>
    train_pipeline(root_path)
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\train.py", line 139, in train_pipeline
    model = build_model(opt)
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\models\__init__.py", line 28, in build_model
    model = MODEL_REGISTRY.get(opt['model_type'])(opt)
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\models\default.py", line 35, in __init__
    self.net_g = build_network(opt['network_g'])
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\__init__.py", line 21, in build_network
    net = ARCH_REGISTRY.get(network_type)(**opt)
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\omnisr_arch.py", line 883, in omnisr
    return omnisr_net(
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\omnisr_arch.py", line 834, in __init__
    up_scale    = kwargs["upsampling"]
KeyError: 'upsampling'

These were my network_g settings:

network_g:
  type: omnisr
  upsampling: 2
  window_size: 16


It seems like the solution is to revise line 830 of omnisr_arch.py from:

def __init__(self,num_in_ch=3,num_out_ch=3,num_feat=64, window_size=8, upsampling=4, **kwargs):

to

def __init__(self,num_in_ch=3,num_out_ch=3,num_feat=64, **kwargs):

Might be easiest to require window_size and upsampling to be specified in the config.

Models and states saving/loading issue

The latest update introduces a bug where models and states are saved as "{current_iter}.0.pth" and "{current_iter}.0.state". Models are then looked up with the new naming scheme, but the state is still loaded with the expectation that there is no ".0", so resuming returns an error because it cannot find "{current_iter}.state".

Tested this on a clean neosr install.

Cannot start training

PS D:\AIGC\Net-Train\neosr-new> python train.py -opt train_compact.yml
Path already exists. Renaming it to D:\AIGC\Net-Train\neosr-new\experiments\2xFilm_archived_20240622_233806
Path already exists. Renaming it to D:\AIGC\Net-Train\neosr-new\experiments\tb_logger\2xFilm_archived_20240622_233806
22-06-2024 11:38 PM | INFO:
------------------------ neosr ------------------------
Pytorch Version: 2.3.1+cu121
22-06-2024 11:38 PM | INFO: Dataset [paired] is built.
22-06-2024 11:38 PM | INFO: Training statistics:
Starting model: 2xFilm
Number of train images: 200
Dataset enlarge ratio: 1
Batch size per gpu: 2
Accumulated batches: 2
World size (gpu number): 1
Required iters per epoch: 100
Total epochs: 20. Total iters: 2000
22-06-2024 11:38 PM | INFO: Using network [compact].
22-06-2024 11:38 PM | INFO: Using network [unet].
22-06-2024 11:38 PM | INFO: Loss [mssim] enabled.
22-06-2024 11:38 PM | INFO: Loss [PerceptualLoss] enabled.
22-06-2024 11:38 PM | INFO: Loss [GANLoss] enabled.
22-06-2024 11:38 PM | INFO: Loss [colorloss] enabled.
22-06-2024 11:38 PM | INFO: Loss [lumaloss] enabled.
22-06-2024 11:38 PM | INFO: Using model [default].
22-06-2024 11:38 PM | INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\spawn.py", line 131, in _main
    prepare(preparation_data)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "<frozen runpy>", line 286, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "D:\AIGC\Net-Train\neosr-new\train.py", line 11, in <module>
    from neosr.data import build_dataloader, build_dataset
  File "D:\AIGC\Net-Train\neosr-new\neosr\data\__init__.py", line 24, in <module>
    dataset_modules = [importlib.import_module(
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\importlib\__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "D:\AIGC\Net-Train\neosr-new\neosr\data\otf_dataset.py", line 11, in <module>
    from neosr.data.degradations import circular_lowpass_kernel, random_mixed_kernels
  File "D:\AIGC\Net-Train\neosr-new\neosr\data\degradations.py", line 7, in <module>
    from scipy.stats import multivariate_normal
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\site-packages\scipy\stats\__init__.py", line 606, in <module>
    from ._stats_py import *
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\site-packages\scipy\stats\_stats_py.py", line 49, in <module>
    from . import distributions
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\site-packages\scipy\stats\distributions.py", line 10, in <module>
    from . import _continuous_distns
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\site-packages\scipy\stats\_continuous_distns.py", line 33, in <module>
    import scipy.stats._boost as boost
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\site-packages\scipy\stats\_boost\__init__.py", line 37, in <module>
    from scipy.stats._boost.nct_ufunc import (
ImportError: DLL load failed while importing nct_ufunc: The paging file is too small for this operation to complete.

GPU memory not fixed when training

Thank you for your great work!
Why does GPU memory usage keep changing during training instead of staying fixed?
Except for changing scale to 2, I use the default configs.

cv2.error: Caught error in DataLoader worker process 2.

Shortly after starting the training following error occurs:

2023-09-06 18:24:25,932 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
  File "G:\_AI\UPSCALE\neosr\train.py", line 241, in <module>
    train_pipeline(root_path)
  File "G:\_AI\UPSCALE\neosr\train.py", line 215, in train_pipeline
    train_data = prefetcher.next()
                 ^^^^^^^^^^^^^^^^^
  File "G:\_AI\UPSCALE\neosr\neosr\data\prefetch_dataloader.py", line 97, in next
    self.preload()
  File "G:\_AI\UPSCALE\neosr\neosr\data\prefetch_dataloader.py", line 83, in preload
    self.batch = next(self.loader)  # self.batch is a dict
                 ^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1345, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1371, in _process_data
    data.reraise()
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\_utils.py", line 644, in reraise
    raise exception
cv2.error: Caught error in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "G:\_AI\UPSCALE\neosr\neosr\data\paired_dataset.py", line 87, in __getitem__
    img_gt = imfrombytes(img_bytes, float32=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\_AI\UPSCALE\neosr\neosr\utils\img_util.py", line 133, in imfrombytes
    img = cv2.imdecode(img_np, imread_flags[flag])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
cv2.error: OpenCV(4.8.0) D:\a\opencv-python\opencv-python\opencv\modules\imgcodecs\src\loadsave.cpp:802: error: (-215:Assertion failed) !buf.empty() in function 'cv::imdecode_'

The strange thing is that, looking at the last line, I don't even have that path or drive on my PC??

I followed the installation instructions, so all dependencies should be installed.

Windows 10
Python 3.11.5
Torch 2
CUDA 11.8

Config File attached.
train_realesrgan.txt

incorrect conversion to grayscale

def __getitem__(self, index):

In this piece of code there is an error: you are converting float32 (range 0–1) to uint8 directly, so the cast truncates everything below 1 to 0.

img_lq = Image.fromarray(img_lq.astype(np.uint8)) — if we add multiplication by 255 here
img_lq = np.array(img_lq, dtype=np.float32) — and division by 255 here

then the output will be correct, in the range 0 to 1. Otherwise, everything below 1 becomes 0 and only exactly 1 stays 1.
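In other words, the fix would look something like this (the convert step in between is an assumption about what __getitem__ does there):

import numpy as np
from PIL import Image

img_lq = Image.fromarray((img_lq * 255.0).astype(np.uint8))  # scale up before the uint8 cast
img_lq = img_lq.convert("L")  # whatever conversion the dataset applies
img_lq = np.array(img_lq, dtype=np.float32) / 255.0  # back to the 0-1 range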

but this is a straightforward solution, I just want to report a bug

AMP enabled info spamming the console

Not sure why this is happening, but I wanted to report it as a potential bug (it's driving me nuts). I tried to figure out where the AMP printing code is located, but no such luck. =/

[screenshot: console log flooded with repeated "AMP enabled" messages]

Print file path that causes read error upon dataloader crash

This would be extremely helpful for troubleshooting a large dataset. I just had an issue where I was getting errors about a corrupted file in my dataset, and it turned out it was trying to read my meta_info file as an image all along. I was only able to diagnose this after hacking together a modified dataloader script.

I will share my code if desired, but it is not pretty.
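For illustration, the change being requested could look roughly like this in paired_dataset.py (attribute names are assumptions, not the actual code):

def __getitem__(self, index):
    gt_path = self.paths[index]  # assumed attribute name
    img_bytes = self.file_client.get(gt_path)
    try:
        img_gt = imfrombytes(img_bytes, float32=True)
    except Exception as err:
        # surface the offending file path instead of a bare cv2.error
        raise RuntimeError(f"Failed to decode image: {gt_path}") from err
    # rest of __getitem__ unchanged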

All architectures with `rgb_mean = (0.5, 0.5, 0.5)` are incompatible

Hey musl.

Some people on Discord recently asked whether 417432c was a breaking change. I always thought it didn't matter, but I was wrong. For some architectures this changes the results only slightly; for others, the results change drastically.

So please (1) document all of these changes and (2) make them detectable for spandrel.


As for how to make them detectable: I would suggest adding a neosr_version tensor to each model that just contains a single int32. This int is the version number of neosr changes. The idea is that each of those changes you made is essentially a new version of the architecture. This number just tracks the version and allows others to detect it.
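Concretely, the idea is just a persistent buffer (a sketch; neosr has nothing like this yet):

import torch
from torch import nn

class some_arch(nn.Module):
    def __init__(self):
        super().__init__()
        # bump this int whenever the arch diverges from the official code;
        # it is saved into the .pth and can be read back by spandrel
        self.register_buffer("neosr_version", torch.tensor(1, dtype=torch.int32))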

Of course, this addition is unnecessary for architectures you created or did not modify.

Python version problem

Thank you for your work.
I want to test this work with Python 3.10, since I am not a sudo user and don't have the authority to change the Python version myself...

Is there any way to run your code with Python 3.10?
If there are any limitations in some packages, I will try to adjust them.

Thank you.

Cannot resume training ESRGAN

There is a chance I am doing something wrong, but I have tried everything to resume training of an ESRGAN model and it doesn't seem to work properly. I've tried starting from a pretrained model, and also tried without a pretrained model.

Whenever I try to resume (by either setting the resume state and commenting out the pretrain or by using auto-resume) it gives me a lot of warnings like this:

------------------------ neosr ------------------------
Pytorch Version: 2.1.1+cu118
2023-12-15 17:59:51,497 INFO: Dataset [paired] is built.
2023-12-15 17:59:51,497 INFO: Training statistics:
        Starting model: 2x_hdphoto
        Number of train images: 1000
        Dataset enlarge ratio: 5
        Batch size per gpu: 8
        World size (gpu number): 1
        Require iter number per epoch: 625
        Total epochs: 256; iters: 160000.
2023-12-15 17:59:51,497 INFO: Dataset [paired] is built.
2023-12-15 17:59:51,497 INFO: Number of val images/folders in val_1: 10
2023-12-15 17:59:51,716 INFO: Network [esrgan] is created.
2023-12-15 17:59:52,044 INFO: Network [unet] is created.
2023-12-15 17:59:52,263 INFO: Loading esrgan model from C:\neosr\experiments\2x_hdphoto\models\net_g_26000.pth, with param key: [None].
2023-12-15 17:59:52,310 WARNING: Current net - loaded net:
2023-12-15 17:59:52,310 WARNING:   body.0.rdb1.conv1.bias
2023-12-15 17:59:52,310 WARNING:   body.0.rdb1.conv1.weight
2023-12-15 17:59:52,310 WARNING:   body.0.rdb1.conv2.bias
2023-12-15 17:59:52,310 WARNING:   body.0.rdb1.conv2.weight
.....
2023-12-15 17:59:52,357 WARNING:   conv_up2.bias
2023-12-15 17:59:52,357 WARNING:   conv_up2.weight
2023-12-15 17:59:52,357 WARNING: Loaded net - current net:
2023-12-15 17:59:52,357 WARNING:   params
2023-12-15 17:59:52,419 INFO: Loading unet model from C:\neosr\experiments\2x_hdphoto\models\net_d_26000.pth, with param key: [params].
2023-12-15 17:59:52,419 INFO: Loss [HuberLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [PerceptualLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [GANLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [colorloss] is created.
2023-12-15 17:59:53,873 INFO: Model [default] is created.
2023-12-15 17:59:53,888 INFO: Resuming training from epoch: 41, iter: 26000.
2023-12-15 18:00:10,845 INFO: Using CUDA prefetch dataloader.
2023-12-15 18:00:10,845 INFO: AMP enabled.
2023-12-15 18:00:10,845 INFO: Start training from epoch: 41, iter: 26000

And it SEEMS like it has resumed, as it continues on from there in iterations. But looking at the visualization, I can see that it has started from scratch. All of the prior learning is lost and the generations look like garbage again.

I am pretty new to training upscale models, so there is a good possibility that I am doing something wrong. But I feel like I have followed the instructions in the "configuration walkthrough" carefully.

BTW, I am training on windows, CUDA, Python 3.11, with a RTX 4080

model zoo

Thank you to the author, this repository is very valuable.

I would like to ask if it is possible to add a model zoo in the future, including pretrained weights for all supported architectures for direct testing?

Question regarding net_d

Hi muslll,

Thank you for your great work. It works perfectly fine in Python 3.10 now, and I fixed the yaml configuration.
I had another question and I guess it was better if I open a new issue thread.

For networks such as SAFMN and SPAN, there are no net_d (no discriminator) from what I read in the paper.
However, when I run train_span.yml or train_safmn.yml with the basic settings equivalent to the repo's, training works with the discriminator turned on and goes smoothly.

So the question is,
are there explicit differences in results when we use a UNet discriminator + gan_loss (net_d) on networks that don't inherently require a net_d (like SPAN or SAFMN)? Or does adding net_d during training improve performance for any model?

Thank you in advance :).

SPAN incompatible with official arch code

Hi. I just added SPAN to spandrel (the library we now use in chaiNNer) and was unpleasantly surprised that Kim's model didn't work. When you removed eval_conv, you made the models produced by neosr incompatible.

Here's the diff of what you changed (shortened):

 class Conv3XC(nn.Module):
     def __init__(
         self,
         c_in: int,
         c_out: int,
         gain1=1,
         gain2=0,
         s=1,
         bias=True,
         relu=False,
     ):
         super().__init__()
         self.weight_concat = None
         self.bias_concat = None
         self.update_params_flag = False
         self.stride = s
         self.has_relu = relu
         gain = gain1
 
         self.sk = nn.Conv2d(...)
         self.conv = nn.Sequential(...)

-        self.eval_conv = nn.Conv2d(...)
-        self.eval_conv.weight.requires_grad = False
-        self.eval_conv.bias.requires_grad = False  # type: ignore
-        self.update_params()
-
-    def update_params(self):
-        ...
-
     def forward(self, x):
-        if self.training:
-            pad = 1
-            x_pad = F.pad(x, (pad, pad, pad, pad), "constant", 0)
-            out = self.conv(x_pad) + self.sk(x)
-        else:
-            self.update_params()
-            out = self.eval_conv(x)
+        pad = 1
+        x_pad = F.pad(x, (pad, pad, pad, pad), "constant", 0)
+        out = self.conv(x_pad) + self.sk(x)

         if self.has_relu:
             out = F.leaky_relu(out, negative_slope=0.05)
         return out

The issue is self.eval_conv. Its weights and biases are saved in the .pth file. So when I use official arch code to load your model, those weights and biases are missing.

So I suggest re-adding self.eval_conv in __init__ and not using it. This will restore compatibility.
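Roughly like this, back in Conv3XC.__init__ (the exact Conv2d arguments here are an assumption, not the official values):

# re-create eval_conv so its weights land in the state dict,
# but never call it in forward()
self.eval_conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=s, padding=0)
self.eval_conv.weight.requires_grad = False
self.eval_conv.bias.requires_grad = False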


That all said, self.eval_conv should have been non-persistent in the first place. The SPAN authors messed this one up.

train.py throws an error, unable to proceed to training.

Here's the output on the terminal.

(venv) [username@archlinux neosr]$ python train.py -opt train_realesrgan.yml
Traceback (most recent call last):
  File "/home/username/neosr/train.py", line 241, in <module>
    train_pipeline(root_path)
  File "/home/username/neosr/train.py", line 101, in train_pipeline
    opt, args = parse_options(root_path, is_train=True)
  File "/home/username/neosr/neosr/utils/options.py", line 163, in parse_options
    opt['dist'] = False
TypeError: 'str' object does not support item assignment

Python 3.11.7
Latest pytorch
Cuda 12.1

Wiki feedback

Hi! I read through the Configuration Walkthrough wiki page and have a few questions. Could you answer these questions and maybe add the answers to the wiki?

The compile option is experimental. This option enables pytorch's new torch.compile(), which can speed up training. As of this writing, pytorch 2.0.0 does not have support for compile using Python 3.11. Pytorch version 2.1 is expected to fix this, as well as better Windows support.

NeoSR now requires 2.1, so is the last part still relevant?

The manual_seed option enables deterministic training. If you're not doing precise comparisons, this option should be commented out, otherwise training performance will decrease.

What are "precise comparisons"? Or rather, when should this be used?

The gt_size is one of the most important options you have to change. It sets the size each image will be cropped to before being sent to the network.

How will it be cropped/which portion will be used? Will it use random offsets?

Document which architectures have been modified and are incompatible with the original architecture

Hey musl.

Kim just informed me that ATD light models from neosr don't load with spandrel because of the added norm parameter. IMO it's not a problem for neosr to make improvements to existing architectures, but I think this has to be communicated. Just like with SPAN, ATD models with norm=False (which is the default) are incompatible with the original arch code.

Could you please document somewhere (1) whether an architecture is compatible with original arch code, and (2) which specific parameter configurations are compatible/incompatible? E.g. for SPAN and ATD, I would document both as "partially compatible" because they are only compatible if norm=True is set.


In general, I would also be interested to hear what your take is on neosr models being incompatible with original arch code? E.g. would you accept major changes to an existing arch that majorly improves it (idk, 3x speed) but loses compatibility?

Huge performance drop after update

Today I updated neosr and noticed a huge drop in performance.

  • RGT: from 2.9 it/s to 1.6 it/s
  • Swinir_small: from 6.6 it/s to 2.3 it/s

I went back commit by commit and found that 2dcafe4 (change no_grad -> inference_mode) was causing it.

Was it supposed to happen?

My setup: Ubuntu 23.04, RTX 4090, torch 2.2.0
Train config: RGT and swinir_small, both otf and paired, amp + bfloat16 enabled.

Before update (RGT):
[screenshot of training log]

After update:
[screenshot of training log]

Validation fails for HAT models

I was trying to train a 4x HAT model. I have some 1440x1080 images I use for validation (lr: 360x270) and validation fails with a shape error. Other architectures (at least esrgan, omnisr, and swinir) work with the same config.

2023-10-20 14:09:08,553 INFO: Saving models and training states.
  File "neosr/train.py", line 242, in <module>
    train_pipeline(root_path)
  File "neosr/train.py", line 211, in train_pipeline
    model.validation(val_loader, current_iter,
  File "neosr/neosr/models/default.py", line 462, in validation
    self.nondist_validation(
  File "neosr/neosr/models/otf.py", line 198, in nondist_validation
    super(otf, self).nondist_validation(
  File "neosr/neosr/models/default.py", line 377, in nondist_validation
    self.test()
  File "neosr/neosr/models/default.py", line 338, in test
    self.output = self.net_g(img)
                  ^^^^^^^^^^^^^^^
  File "neosr/venv/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "neosr/venv/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "neosr/neosr/archs/hat_arch.py", line 952, in forward
    x = self.conv_first(x)
                           
  File "neosr/neosr/archs/hat_arch.py", line 929, in forward_features                                                                                                       
    attn_mask = self.calculate_mask(x_size).to(x.device)                                                                                                                                             
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                          
  File "neosr/neosr/archs/hat_arch.py", line 909, in calculate_mask                                                                                                         
    mask_windows = window_partition(img_mask, self.window_size)  # nw, window_size, window_size, 1                                                                                                   
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                      
  File "neosr/neosr/archs/hat_arch.py", line 91, in window_partition                                                                                                        
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)                                                                                                                   
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                   
RuntimeError: shape '[1, 17, 16, 22, 16, 1]' is invalid for input of size 97920  

and again rgb_mean

Hello. I noticed a while ago that in commit 417432c you changed rgb_mean to (0.5, 0.5, 0.5), but I didn't ask about it at the time. Now I've noticed that, because of this, a color shift occurs when models are run with the official code. I think it's worth returning to the official values, or adding flags to models whose architectures use rgb_mean.
https://slow.pics/s/2GVfqi66

just wanted to express my respect

Thank you for the wonderful dataset and training code. I am a member of the open-source community interested in super-resolution training. However, I cannot buy you coffee in China, so I can only express my gratitude through words. Thank you once again!

Modifications for CRAFT, SwinIR, and DCTLAS are not detectable

Hey musl. I finally found some time to make spandrel compatible with your modified architectures, and it's not looking good. ATD was easy to detect and support because of the no_norm trick, but the others are hopeless. The state dicts for CRAFT and SwinIR are the same whether flash attention is used or not, and the dropout you added to DCTLAS also can't be detected.

This unfortunately means that the pth files of all CRAFT and SwinIR models trained with flash attention are essentially useless. The issue is that spandrel will see them as regular CRAFT and SwinIR model and silently use the wrong inference code, which leads to decreased performance. So the models will work, but they will produce worse outputs. It's the worst kind of unsupported models.

I'm not sure whether the modifications to DCTLAS are a problem. You just added a nn.Dropout2d(p=0.5) here. I don't know much about these things, so I'm not sure whether this will even affect the inference result.

What to do now

Going forward, I think the best option would be to use the no_norm trick for flash attention in CRAFT and SwinIR. If we add a flash_attention tensor when flash attention is enabled, spandrel can detect this and then use the correct inference code.
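That is, something along these lines in the arch's __init__ (a sketch; the flash_attn kwarg is hypothetical):

if flash_attn:
    # marker tensor saved into the .pth; its presence tells spandrel
    # to pick the flash-attention inference code
    self.register_buffer("flash_attention", torch.zeros(1))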

About DCTLAS: I'm not sure whether we even need to do anything, so I would like to hear your thoughts on the matter. If the dropout does affect inference results, then we need to use the same trick again.


Related to #61 and chaiNNer-org/spandrel#230

Add tile option

Real-ESRGAN and HAT-GAN offer the option tile in yml files when doing inference. See for example:

https://github.com/XPixelGroup/HAT/blob/c7e0b2b9c9a8d37a3cf3e7dfa46698d0507d8f1f/options/test/HAT_GAN_Real_SRx4.yml#L1-L11

name: HAT_GAN_Real_SRx4
model_type: HATModel
scale: 4
num_gpu: 1  # set num_gpu: 0 for cpu mode
manual_seed: 0

tile: # use the tile mode for limited GPU memory when testing.
  tile_size: 512 # the higher, the more utilized GPU memory and the less performance change against the full image. must be an integer multiple of the window size.
  tile_pad: 32 # overlapping between adjacent patches. Must be an integer multiple of the window size.

datasets:
  test_1:  # the 1st test dataset
    name: custom
    type: SingleImageDataset
    dataroot_lq: input_dir
    io_backend:
      type: disk

EDIT: I've seen that there's already something commented in the code:

# TODO: verify
def test(self):
    self.tile = self.opt['val'].get('tile', False)

It doesn't make use of custom tile size or pad, but out of the box it does work for RealESRGAN.
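For completeness, a generic sketch of what tile mode does (not neosr's or HAT's actual implementation): split the input into tiles, run each tile with some padding for context, and paste only the unpadded centers back together.

import torch

def tiled_forward(net, img, scale, tile_size=512, tile_pad=32):
    """Upscale img tile by tile, pasting back only each tile's center."""
    b, c, h, w = img.shape
    out = img.new_zeros(b, c, h * scale, w * scale)
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            # input window with padding, clamped to the image borders
            y0, y1 = max(y - tile_pad, 0), min(y + tile_size + tile_pad, h)
            x0, x1 = max(x - tile_pad, 0), min(x + tile_size + tile_pad, w)
            with torch.inference_mode():
                sr = net(img[:, :, y0:y1, x0:x1])
            # copy only the unpadded center back into the output
            th, tw = min(tile_size, h - y), min(tile_size, w - x)
            ys, xs = (y - y0) * scale, (x - x0) * scale
            out[:, :, y * scale:(y + th) * scale, x * scale:(x + tw) * scale] = \
                sr[:, :, ys:ys + th * scale, xs:xs + tw * scale]
    return out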

Add Anime4K models

Very useful repo. Can you please add Anime4K models too?
I guess we need an anime4k_arch.py and a .yaml file. I tried something myself, but the .pth was only 70 KB, so I'm not sure if I did things correctly.
