
Comments (18)

wangg12 avatar wangg12 commented on August 29, 2024 36

@ssnl @xingyizhou Does this bug still exist with pytorch >= 1.0?

from pytorch-pose-hg-3d.

xingyizhou avatar xingyizhou commented on August 29, 2024 6

Hi,
I have investigated this problem on another project (I cannot reproduce the bug on this one). It seems to be caused by very large intermediate features (e.g. > 10000) before batch normalization. When train() mode is on, each batch is normalized by its own statistics, so training is OK. But when eval() mode is on, a slight difference between the intermediate features and the BN mean/std accumulated during training results in large offsets in the output. I don't know the root cause of the problem, but it looks mathematically reasonable. However, downgrading PyTorch to 0.1.12 eliminates the problem. Please notify me if you have any other observations on this bug. Thanks!
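
The arithmetic behind this explanation can be sketched in plain Python. This is only an illustration under assumed numbers: the activations, the shift of 5.0, and the helper names `batch_norm` and `batch_stats` are hypothetical, and the running statistics are simply taken from one batch rather than accumulated with momentum as a real BN layer does.

```python
import math

EPS = 1e-5  # assumed epsilon, matching the usual BatchNorm default


def batch_stats(x):
    """Mean and biased variance of a batch (what train() mode normalizes with)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return mean, var


def batch_norm(x, mean, var):
    """Normalize every value in x with the given mean and variance."""
    return [(v - mean) / math.sqrt(var + EPS) for v in x]


# Hypothetical activations: very large magnitude, very small spread.
train_batch = [10000.0, 10001.0, 10002.0, 10003.0]

# Stand-in for the running mean/var a BN layer would store during training.
running_mean, running_var = batch_stats(train_batch)

# A later batch whose features are shifted by only 0.05% in relative terms.
test_batch = [v + 5.0 for v in train_batch]

# train() mode: normalized by its own statistics, so the shift cancels out.
train_mode_out = batch_norm(test_batch, *batch_stats(test_batch))

# eval() mode: normalized by the stored statistics, so the shift is divided
# by a small std and shows up as an offset of several standard deviations.
eval_mode_out = batch_norm(test_batch, running_mean, running_var)

print("train() mode max |output|:", max(abs(v) for v in train_mode_out))
print("eval() mode  max |output|:", max(abs(v) for v in eval_mode_out))
```

With these made-up numbers the train() mode outputs stay centered on zero, while the eval() mode outputs are all pushed several standard deviations away, even though the input changed by a tiny fraction of its magnitude.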


xingyizhou avatar xingyizhou commented on August 29, 2024 4

Hi all,
As pointed out by @leoxiaobin, turning off cudnn for the BN layers resolves the issue. This can be done by setting torch.backends.cudnn.enabled = False in main.py, which disables cudnn for all layers and slows down training by about 1.5x, or by rebuilding pytorch from source with cudnn disabled in the BN layers: https://github.com/pytorch/pytorch/blob/e8536c08a16b533fe0a9d645dd4255513f9f4fdd/aten/src/ATen/native/Normalization.cpp#L46 .
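
A possible middle ground between the two options, sketched here as an assumption rather than something tested in this thread: wrap only the BN layers so that their forward pass runs inside the `torch.backends.cudnn.flags` context manager. The class name `NoCudnnBatchNorm2d` is hypothetical, and whether this actually avoids the reported validation drop has not been verified. Note that `flags()` also applies its default values for `benchmark` and `deterministic` while the block is active.

```python
import torch
import torch.nn as nn


class NoCudnnBatchNorm2d(nn.BatchNorm2d):
    """BatchNorm2d that runs its forward pass with cuDNN disabled (hypothetical helper)."""

    def forward(self, x):
        # The context manager restores the previous cuDNN flags on exit,
        # so convolutions and other layers can keep using cuDNN.
        with torch.backends.cudnn.flags(enabled=False):
            return super().forward(x)


# Minimal usage: a drop-in replacement for nn.BatchNorm2d.
bn = NoCudnnBatchNorm2d(3)
bn.eval()  # use the stored running statistics, as in testing
x = torch.randn(2, 3, 4, 4)
y = bn(x)
print(y.shape)
```

The global switch from the comment above remains the simplest option: put `torch.backends.cudnn.enabled = False` near the top of main.py and accept the slowdown.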


xingyizhou avatar xingyizhou commented on August 29, 2024

Hi,
As far as I know, it is caused by an internal pytorch bug in BN. You can comment out model.eval() in testing to see if the validation acc gets better (but it still won't match the desired performance). The bug is not reliably reproducible, so re-training the network once more (preferably on another machine) should give different results. Or you can downgrade pytorch to 0.1.12 or below, where I haven't met or heard about this bug (but this is still not guaranteed). Please let me know if the above solutions help. Thanks!


FANG-Xiaolin avatar FANG-Xiaolin commented on August 29, 2024

I tried another 3 times. The train acc is approximately 0.87 during the last epoch (the 60th), but the validation acc changes every time and is always lower than 0.50. The validation acc is around 0.80 at the 55th epoch, so there seems to be a sudden drop during the last epoch, and I notice that the training loss gets slightly higher during the last epoch.


xingyizhou avatar xingyizhou commented on August 29, 2024

Hi,
Thanks for reporting the problem. However, I don't have other solutions yet and will keep looking into it. It might not be a bug in this code, since an isolated implementation of HourglassNet also has this problem (bearpaw/pytorch-pose#33); I am not sure whether the bug comes from the network architecture. People there suggest using a learning rate of 1e-4. You can give it a try to see if the bug still exists.


FANG-Xiaolin avatar FANG-Xiaolin commented on August 29, 2024

Hi,
Thanks for your advice. Yes, it works with LR 1e-4. The val acc is 0.80+ this way.


FANG-Xiaolin avatar FANG-Xiaolin commented on August 29, 2024

Hi,
Yes I think it is reasonable. Sure I will notify you if I observe something new. Thanks for your reply!


ssnl avatar ssnl commented on August 29, 2024

IIRC, your repo sets batch size to 1. If that is the case it's not really a PyTorch bug. Running stats with batch size = 1 is unstable itself.


xingyizhou avatar xingyizhou commented on August 29, 2024

Thanks for the suggestion! The training batch size is 6 and testing is 1. When testing, eval() mode is on and the batch size does not affect the computation.


ssnl avatar ssnl commented on August 29, 2024


FANG-Xiaolin avatar FANG-Xiaolin commented on August 29, 2024

Get it. Thanks.


xingyizhou avatar xingyizhou commented on August 29, 2024

Oh, I would still like this issue to stay open while we wait for better solutions...


FANG-Xiaolin avatar FANG-Xiaolin commented on August 29, 2024

Sure! My bad.


ujsyehao avatar ujsyehao commented on August 29, 2024

@wangg12 I am doing experiments to observe if the bug exists in pytorch >= 1.0.


qiangruoyu avatar qiangruoyu commented on August 29, 2024

@wangg12 I am doing experiments to observe whether the bug exists in pytorch >= 1.0.

Do you still hit this error with pytorch >= 1.0?


zhouyuangan avatar zhouyuangan commented on August 29, 2024

@ujsyehao Hello, may I ask how your experiments turned out?


sisrfeng avatar sisrfeng commented on August 29, 2024

Hi all,
As pointed by @leoxiaobin, turn off cudnn of BN layer resolves the issue. It can be realized by set torch.backends.cudnn.enabled = False in main.py, which disables cudnn for all layers and slows down the training by about 1.5x time, or re-build pytorch from source by hacking cudnn in BN layers https://github.com/pytorch/pytorch/blob/e8536c08a16b533fe0a9d645dd4255513f9f4fdd/aten/src/ATen/native/Normalization.cpp#L46 .

Regarding "torch.backends.cudnn.enabled = False in main.py":
Should it be "torch.backends.cudnn.benchmark = False"?

If I have followed the step below, I do not need to modify main.py, right?
"For other pytorch version, you can manually open torch/nn/functional.py and find the line with torch.batch_norm and replace the torch.backends.cudnn.enabled with False"

