Comments (9)
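I took a look, and the model has not learned anything at all. I can't spot an obvious problem yet. I see you are training on a single GPU, so you can troubleshoot with the following steps.
(1) Try changing sample_ratio: 0.1 to sample_ratio: 1.0 first, to rule out PartialFC as the cause.
Also, which Paddle version are you using? The stable release 2.2 is recommended.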
from plsc.
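@GuoxiaWang, thank you for the prompt reply.
I am training on Baidu AI Studio, and the Paddle version is paddlepaddle-gpu 2.2.2.post101.
Following your suggestion, I changed sample_ratio to 1.0, but the problem persists:
When Lr is too small, Loss becomes nan. In particular, whenever Lr is reduced to one tenth of its value (for example 0.025 to 0.0025, or 0.1 to 0.01), Loss becomes nan.
The evaluation results on the validation set show the same behavior.
In addition, here is the error output: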
Training: 2022-03-24 15:02:19,865 - loss nan, lr: 0.010000, epoch: 11, step: 18000, eta: 1.72 hours, throughput: 433.87 imgs/sec
testing verification..
[[ 0. 0. 0. ... 0. 0. 0.]
[ 0. 0. 0. ... 0. 0. 0.]
[ 0. 0. 0. ... 0. 0. 0.]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
Traceback (most recent call last):
File "/home/aistudio/PLSC/train_aio.py", line 304, in <module>
train(args)
File "/home/aistudio/PLSC/dynamic/train_aio.py", line 228, in train
best_metric = callback_verification(global_step, backbone)
File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 211, in __call__
best_metric = self.ver_test(backbone, num_update)
File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 143, in ver_test
nfolds=10)
File "<decorator-gen-287>", line 2, in test
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 351, in _decorate_function
return func(*args, **kwargs)
File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 89, in test
embeddings = sklearn.preprocessing.normalize(embeddings)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 1905, in normalize
estimator='the normalize function', dtype=FLOAT_DTYPES)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 721, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 106, in _assert_all_finite
msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Thanks!
from plsc.
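The traceback above ends in sklearn.preprocessing.normalize rejecting embeddings that contain NaN. A minimal sketch of a guard that could run before normalization is shown below. The helper name check_embeddings and the toy array are hypothetical and are not part of PLSC; the sketch only assumes the embeddings arrive as a 2-D NumPy array, as the printed output suggests.

```python
import numpy as np


def check_embeddings(embeddings):
    """Return the indices of rows that contain NaN or infinity."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # A row is "bad" if any of its entries is not finite.
    return np.where(~np.isfinite(embeddings).all(axis=1))[0]


# Toy stand-in for the backbone output printed in the log (hypothetical data).
embeddings = np.array([[0.0, 0.0, 0.0],
                       [np.nan, np.nan, np.nan]])
bad_rows = check_embeddings(embeddings)
if bad_rows.size > 0:
    print(f"{bad_rows.size} of {len(embeddings)} embeddings are non-finite; "
          "skipping verification for this checkpoint")
```

A check like this does not fix the divergence itself; it only makes the evaluation step report the problem instead of crashing inside sklearn.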
Is your dataset publicly accessible? I could try to reproduce the problem with your data and your configuration.
from plsc.
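@GuoxiaWang, thank you!
The data is tied to our company's business, and it is organized in a rather complicated way, so it could take up a lot of your time. I will keep investigating on my own.
Many thanks!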
from plsc.
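@geoexploring You could first try your configuration on a public dataset. If the problem also appears on the public dataset, then the code is at fault; if the public dataset trains fine, then the problem is in your data processing.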
from plsc.
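@GuoxiaWang, thank you!
Our dataset belongs to a different type of problem, and there is no public dataset for it yet. Thanks for the suggestion; I will check again whether anything is wrong with the network architecture or the data loading.
Thanks!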
from plsc.
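A new observation: when training with FP16, the Loss no longer turns into nan partway through training as described above, but this message pops up frequently: Found inf or nan of distributed parameter, dtype is paddle.float16;Found inf or nan, current scale is: 13824.0, decrease to: 13824.0*0.5
In addition, the Loss decreases much more slowly than with FP32. What is the reason for this?
Thanks!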
from plsc.
Found inf or nan of distributed parameter, dtype is paddle.float16;Found inf or nan, current scale is: 13824.0, decrease to: 13824.0*0.5
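This is normal; I print it on purpose. FP16 training uses a mechanism called loss scaling. The message above means that the gradient became nan/inf during the computation in the model-parallel FC layer. When that happens, the update for the current step is skipped and the loss scale is halved, then training continues with the next step. Once 2000 steps pass without nan/inf, the loss scale is doubled again.
As for the loss decreasing much more slowly than with FP32, I suggest finishing the training first. If the accuracy on the validation set is reasonable at the end, then the behavior is fine.
In my experience, the loss from FP16 training is usually somewhat higher than with FP32, because FP16 has lower precision than FP32.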
from plsc.
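The loss scaling behavior described in the reply above can be sketched in plain Python. This is an illustrative toy, not PLSC's or Paddle's implementation: the class name DynamicLossScaler and its parameters are hypothetical, the starting scale of 13824.0 is taken from the logged message, and the 2000-step growth interval is the figure quoted in the reply.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, double after a clean streak."""

    def __init__(self, init_scale, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive steps without nan/inf

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: skip this update and halve the scale.
            self.scale *= 0.5
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            # A full interval without nan/inf: double the scale again.
            self.scale *= 2.0
            self.good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=13824.0)
applied = scaler.update([0.1, float("inf")])  # overflow, so the step is skipped
scale_after_overflow = scaler.scale
for _ in range(2000):
    scaler.update([0.1, -0.2])                # 2000 clean steps in a row
scale_after_recovery = scaler.scale
```

Under these assumptions, frequent overflow messages mean many skipped updates, which is one plausible reason the loss could fall more slowly in FP16 than in FP32.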
@GuoxiaWang, thank you for the quick reply!
This is indeed a nice design. I will email you about my other questions. Thanks!
from plsc.