While training on a custom dataset, I found that when the Lr is too small, the Loss…

Loss becomes NaN when Lr is too small (plsc, CLOSED)

geoexploring commented on May 19, 2024

The loss becomes NaN when the Lr is too small.

from plsc.

Comments (9)

GuoxiaWang commented on May 19, 2024

@geoexploring

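I took a look: the model has not learned anything at all. I can't see an obvious problem yet. I notice you are training on a single GPU, so you can troubleshoot with the following steps.
(1) Try changing sample_ratio: 0.1 to sample_ratio: 1.0 first, to rule out a PartialFC problem.

Also, which paddle version are you using? I recommend the stable release 2.2.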
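For readers unfamiliar with the setting, here is a minimal sketch of what a PartialFC-style sample_ratio controls: every class that appears in the batch is kept, and random negative classes are added until roughly sample_ratio * num_classes class centers are selected. This is not PLSC's implementation; the function name `sample_class_centers` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def sample_class_centers(labels, num_classes, sample_ratio, seed=0):
    """Pick the class centers used for one step (illustrative only).

    All classes present in `labels` (positives) are always kept; random
    negatives fill the rest up to int(num_classes * sample_ratio).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    positives = np.unique(labels)
    num_sample = max(int(num_classes * sample_ratio), len(positives))

    # Classes that do not appear in this batch are candidates for sampling.
    negatives = np.setdiff1d(np.arange(num_classes), positives)
    extra = rng.choice(negatives, size=num_sample - len(positives), replace=False)

    sampled = np.sort(np.concatenate([positives, extra]))
    # Re-index the labels so they point into the sampled subset.
    remapped = np.searchsorted(sampled, labels)
    return sampled, remapped


labels = np.array([3, 7, 7, 42])
subset, new_labels = sample_class_centers(labels, num_classes=100, sample_ratio=0.1)
full, same_labels = sample_class_centers(labels, num_classes=100, sample_ratio=1.0)
```

With sample_ratio set to 1.0 every class center is selected and the labels are unchanged, so the sampling step becomes a no-op. That is why switching to 1.0 is a quick way to rule sampling out as the cause.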

geoexploring commented on May 19, 2024

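@GuoxiaWang, thank you for the prompt reply.

I am training on Baidu's AI Studio, and the Paddle version is paddlepaddle-gpu 2.2.2.post101.

Following your suggestion, I changed sample_ratio to 1.0, but the problem still occurs:

The loss becomes NaN when the Lr is too small. In particular, whenever the Lr is reduced to one tenth of its previous value (for example 0.025 to 0.0025, or 0.1 to 0.01), the loss turns into NaN.
Evaluation on the validation set gives the same result.

In addition, here is the error output: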

Training: 2022-03-24 15:02:19,865 - loss nan, lr: 0.010000, epoch: 11, step: 18000, eta: 1.72 hours, throughput: 433.87 imgs/sec
testing verification..
[[ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]
Traceback (most recent call last):
  File "/home/aistudio/PLSC/train_aio.py", line 304, in <module>
    train(args)
  File "/home/aistudio/PLSC/dynamic/train_aio.py", line 228, in train
    best_metric = callback_verification(global_step, backbone)
  File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 211, in __call__
    best_metric = self.ver_test(backbone, num_update)
  File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 143, in ver_test
    nfolds=10)
  File "<decorator-gen-287>", line 2, in test
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 351, in _decorate_function
    return func(*args, **kwargs)
  File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 89, in test
    embeddings = sklearn.preprocessing.normalize(embeddings)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 1905, in normalize
    estimator='the normalize function', dtype=FLOAT_DTYPES)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 721, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 106, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Thanks!
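The ValueError in the traceback above is raised by sklearn because some embedding rows are already NaN by the time verification runs. As a minimal sketch, assuming the embeddings are available as a 2-D NumPy array, a check like the following could report which rows are non-finite before they are normalized. The helper name `find_non_finite_rows` is hypothetical and is not part of PLSC or sklearn.

```python
import numpy as np


def find_non_finite_rows(embeddings):
    """Return the indices of rows that contain NaN or +/-inf."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    row_is_finite = np.isfinite(embeddings).all(axis=1)
    return np.flatnonzero(~row_is_finite)


# Toy data shaped like the printed matrix: zero rows plus a NaN row.
emb = np.array([[0.0, 0.0, 0.0],
                [0.5, -1.0, 2.0],
                [np.nan, np.nan, np.nan]])
bad_rows = find_non_finite_rows(emb)
if bad_rows.size:
    print(f"{bad_rows.size} of {len(emb)} embeddings are non-finite, "
          f"first bad row: {bad_rows[0]}")
```

A check of this kind only makes the failure easier to read. The NaN values come from the training run itself, so the underlying cause still has to be found there.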

GuoxiaWang commented on May 19, 2024

Is your dataset publicly accessible? If so, I can try to reproduce the problem with your data and your config.

geoexploring commented on May 19, 2024

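@GuoxiaWang, thank you!

The dataset contains company business data, and the way it is organized is quite cumbersome, so it could take up a lot of your time. I will keep investigating on my own.

Many thanks!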

GuoxiaWang commented on May 19, 2024

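@geoexploring You could first try a public dataset with your config. If the problem also appears on the public dataset, the code is at fault; if the public dataset trains fine, the problem is in your data processing.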

geoexploring commented on May 19, 2024

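@GuoxiaWang, thank you!

Our dataset belongs to a different type of problem, and there is no public dataset for it yet. Thanks for the suggestion; I will check whether anything is wrong with the network architecture or the data loading.

Thanks!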

geoexploring commented on May 19, 2024

@GuoxiaWang

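A new finding: when training with FP16, the loss no longer becomes NaN midway through training as described above. However, the following messages appear frequently: "Found inf or nan of distributed parameter, dtype is paddle.float16" and "Found inf or nan, current scale is: 13824.0, decrease to: 13824.0*0.5". The loss also decreases much more slowly than with FP32. What is the reason for this?

Thanks!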

GuoxiaWang commented on May 19, 2024

@geoexploring

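Found inf or nan of distributed parameter, dtype is paddle.float16; Found inf or nan, current scale is: 13824.0, decrease to: 13824.0*0.5

This is normal; I print it on purpose. FP16 training uses a mechanism called loss scaling. The message above means that the gradients became nan/inf during the computation in the model-parallel FC layer. When that happens, the update for the current step is skipped and the loss scale is halved, and training continues with the next step. Once 2000 steps have passed without nan/inf, the loss scale is doubled again.

As for the loss decreasing much more slowly than with FP32, I would finish training first and then look at the result. If the final accuracy on the validation set is reasonable, then it is fine.
In my experience the loss with FP16 training is usually somewhat higher than with FP32, because FP16 has lower precision than FP32.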
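The loss scaling behaviour described above can be sketched in a few lines of plain Python. This is only an illustration of the rule as stated in the comment (skip the step and halve the scale on nan/inf, double it after 2000 clean steps), not PLSC's or Paddle's actual code; the class `DynamicLossScaler` and its parameter names are hypothetical.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative, not the PLSC implementation)."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000,
                 backoff_factor=0.5, growth_factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.backoff_factor = backoff_factor
        self.growth_factor = growth_factor
        self.good_steps = 0  # consecutive steps without nan/inf gradients

    def update(self, found_inf):
        """Update the scale; return True if the optimizer step should run."""
        if found_inf:
            # Overflow: shrink the scale and skip this parameter update.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=13824.0)
applied = scaler.update(found_inf=True)   # overflow: step skipped
scale_after_overflow = scaler.scale       # 13824.0 * 0.5
for _ in range(2000):                     # 2000 clean steps in a row
    scaler.update(found_inf=False)
scale_after_recovery = scaler.scale
```

Under this rule an occasional "Found inf or nan" message only costs one skipped update. Frequent messages mean that many steps are skipped, which would be one possible reason for slower progress compared with FP32.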

geoexploring commented on May 19, 2024

@GuoxiaWang, thanks for the quick reply!

That is indeed a nice design. I will email you about my other questions. Thanks!
