Hello, thanks for your nice work. I met a bug on --continue training.

question about --continue training (deeplabv3plus-pytorch, OPEN)

vainf commented on July 3, 2024

question about --continue training

from deeplabv3plus-pytorch.

Comments (8)

PytaichukBohdan commented on July 3, 2024 1

@kinfeparty @VainF Found the issue.

According to the PyTorch optimizer documentation:

If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.

The fix is to move the model to CUDA before loading the state dict into the optimizer:

`# Excerpt from the training script; assumes os, torch and torch.nn (as nn) are imported.
if opts.ckpt is not None and os.path.isfile(opts.ckpt):
    # Load onto the CPU first so the checkpoint does not depend on the GPU it was saved from.
    checkpoint = torch.load(opts.ckpt, map_location=torch.device('cpu'))
    # checkpoint = torch.load(opts.ckpt)
    model.load_state_dict(checkpoint["model_state"])

    # Move the model to the target device BEFORE restoring the optimizer state.
    model = nn.DataParallel(model)
    model.to(device)

    if opts.continue_training:
        optimizer.load_state_dict(checkpoint["optimizer_state"])
        scheduler.load_state_dict(checkpoint["scheduler_state"])
        cur_itrs = checkpoint["cur_itrs"]
        best_score = checkpoint['best_score']
        print("Training state restored from %s" % opts.ckpt)
    print("Model restored from %s" % opts.ckpt)
    del checkpoint  # free memory
else:
    print("[!] Retrain")

    model = nn.DataParallel(model)
    model.to(device)`
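For anyone who wants to check the ordering in isolation, here is a minimal sketch of the same idea on a toy model. It is not the repository's code: `make_optimizer`, `save_checkpoint`, `resume`, the `nn.Linear(4, 2)` model and the SGD hyperparameters are all made up for illustration, and the checkpoint is kept in an in-memory buffer instead of a file. It falls back to the CPU when no GPU is available.

```python
import io

import torch
import torch.nn as nn


def make_optimizer(model):
    # Hypothetical hyperparameters, chosen only for illustration.
    return torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)


def save_checkpoint(buffer, model, optimizer, cur_itrs):
    # Same key names as the snippet above, minus the scheduler and best score.
    torch.save({
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "cur_itrs": cur_itrs,
    }, buffer)


def resume(buffer, device):
    # Load to CPU so the checkpoint does not depend on the device it was saved from.
    checkpoint = torch.load(buffer, map_location=torch.device("cpu"))
    model = nn.Linear(4, 2)
    model.load_state_dict(checkpoint["model_state"])
    model.to(device)                                           # 1. move the model
    optimizer = make_optimizer(model)                          # 2. build the optimizer
    optimizer.load_state_dict(checkpoint["optimizer_state"])   # 3. restore its state
    return model, optimizer, checkpoint["cur_itrs"]


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Take one training step so the optimizer has momentum buffers worth saving.
model = nn.Linear(4, 2).to(device)
optimizer = make_optimizer(model)
model(torch.randn(8, 4, device=device)).sum().backward()
optimizer.step()

buffer = io.BytesIO()
save_checkpoint(buffer, model, optimizer, cur_itrs=1)
buffer.seek(0)

model2, optimizer2, cur_itrs = resume(buffer, device)

# Each restored momentum buffer should sit on the same device as its parameter.
state_devices_match = all(
    optimizer2.state[p]["momentum_buffer"].device == p.device
    for p in model2.parameters()
)

# A further step should run without a device-mismatch error.
optimizer2.zero_grad()
model2(torch.randn(8, 4, device=device)).sum().backward()
optimizer2.step()
```

The point of the ordering is that `optimizer.load_state_dict` places the restored buffers next to the parameters it currently holds, so the model should already be on its final device at that moment.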

VainF commented on July 3, 2024 1

@PytaichukBohdan thanks!

VainF commented on July 3, 2024

Hi @kinfeparty, I added the missing map_location in the latest commit. Please try again.

kinfeparty commented on July 3, 2024

Hi @VainF, I modified the code but hit the same bug.

PytaichukBohdan commented on July 3, 2024

Hi @VainF, I got the same issue.
Do you know what it could be related to?

YLiu-creator commented on July 3, 2024

When continuing training, ASPPPooling raises this error:
Original Traceback (most recent call last):
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/utils.py", line 16, in forward
    x = self.classifier(features)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 84, in forward
    low_output_feature= self.aspp(low_level_beforeFPM)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 265, in forward
    res.append(conv(x))
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 233, in forward
    x = super(ASPPPooling, self).forward(x)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/functional.py", line 1652, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])

ASPPPooling works when retraining.
I don't know how to debug this. Please give me some help.

YLiu-creator commented on July 3, 2024

I know the "1" is caused by AdaptiveAvgPool2d, but why does the error occur only when continuing training?
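The error itself can be reproduced outside the model, which may help narrow it down. The sketch below is a rough illustration, not a diagnosis: it feeds a BatchNorm2d layer the input shape from the traceback, and `runs_ok` plus the 9-sample toy dataset are invented for the example. It shows that the check fires only in training mode when a single sample reaches the layer after pooling to 1x1, and that `drop_last=True` on a DataLoader discards a trailing undersized batch. Whether a one-sample batch (per DataParallel replica) is what happens on resume here is an assumption; the thread does not confirm it.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def runs_ok(module, x):
    # Hypothetical helper: report whether the forward pass raises ValueError.
    try:
        module(x)
        return True
    except ValueError:
        return False


bn = nn.BatchNorm2d(512)
pooled_single = torch.randn(1, 512, 1, 1)  # the shape reported in the traceback
pooled_pair = torch.randn(2, 512, 1, 1)    # same, but with two samples

bn.train()
single_in_train = runs_ok(bn, pooled_single)  # one value per channel: rejected
pair_in_train = runs_ok(bn, pooled_pair)      # two values per channel: accepted

bn.eval()
single_in_eval = runs_ok(bn, pooled_single)   # eval uses running stats: accepted

# A dataset whose size is not a multiple of the batch size ends with a short batch.
dataset = TensorDataset(torch.randn(9, 3))
sizes_default = [batch[0].shape[0] for batch in DataLoader(dataset, batch_size=4)]
sizes_drop_last = [
    batch[0].shape[0]
    for batch in DataLoader(dataset, batch_size=4, drop_last=True)
]
```

Printing the batch size that reaches the failing forward pass when resuming would show whether this is the situation in your run.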

from deeplabv3plus-pytorch.

longphamkhac commented on July 3, 2024

How can I make my output segmentation image look like the second image? Thanks very much.
