dingxiaoh / repoptimizers
Official repo of RepOptimizers and RepOpt-VGG
License: MIT License
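While training is running, in every epoch the code jumps back into the functions that create the model and the optimizer, and the message about loading scale.pth is printed again each time.
A question for the author: running the hyper-parameter search with RepOpt-VGG-A0-hs on CIFAR-100 reaches only about 56 accuracy, and training RepOpt-VGG-A0-target on ImageNet with the searched hyper-parameters reaches only a little over 70, about 2 points below RepVGG-A0. Why is that?
Hello, I noticed that your work only demonstrates accuracy parity for the large models from B1 upward. Can the smaller models below B1 also maintain this parity? Backbones this large are rarely used in downstream tasks.
I don't quite understand the statement in the remark that BN is non-linear at training time.
Hello, when RepOptimizer is applied to other tasks, does the optimizer only need to be designed for the backbone network, or for the whole network?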
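One possible reading of that remark, as a minimal sketch rather than the authors' own explanation: in training mode BN normalizes with the mean and variance of the current batch, so the normalizing constants depend on the input and the mapping is not linear. In eval mode the running statistics are fixed, so BN reduces to an affine map. The function names and the running statistics below are hypothetical, and the learnable affine parameters of BN are left out for brevity.

```python
import math

def bn_train(xs, eps=1e-5):
    """Training-mode BN (no affine): statistics come from the batch itself."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def bn_eval(xs, running_mean, running_var, eps=1e-5):
    """Eval-mode BN (no affine): statistics are fixed constants."""
    std = math.sqrt(running_var + eps)
    return [(x - running_mean) / std for x in xs]

batch = [1.0, 2.0, 4.0]
tripled = [3.0 * x for x in batch]

# Training mode: tripling the input leaves the output almost unchanged,
# so BN(3x) is far from 3 * BN(x). A linear map could not behave this way.
train_out = bn_train(batch)
train_out_tripled = bn_train(tripled)

# Eval mode with hypothetical running statistics: after removing the
# constant offset BN(0), tripling the input triples the output.
running_mean, running_var = 0.5, 2.0
offset = bn_eval([0.0], running_mean, running_var)[0]
eval_out = bn_eval(batch, running_mean, running_var)
eval_out_tripled = bn_eval(tripled, running_mean, running_var)

print(train_out, train_out_tripled)
print([e - offset for e in eval_out], [t - offset for t in eval_out_tripled])
```

Because the eval-mode map is affine with constant coefficients, it can be folded into an adjacent convolution, whereas the training-mode map cannot.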
Thank you for reading.
In this experiment, I set out to train yolov6s with the RepOpt method on the DOTA dataset. Following the official documentation, I first trained the model in hs mode to search for the optimizer's hyper-parameters, but the val/mAP curves oscillate like a sine function. In the figure, the orange curve is the yolov6s model trained for 400 epochs, the blue curve is yolov6s in hs mode trained for 250 epochs, and the red curve is yolov6s in hs mode trained for 400 epochs.
As far as I know, in hs mode the scales (hyper-parameters) are trained as ordinary parameters together with the other model parameters, so training should behave like the orange curve. What makes the hs-mode mAP curves oscillate like a sine function?
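To make "scales trained as ordinary parameters" concrete, here is a toy sketch under my own assumptions, not the repo's hs-mode code: a single scalar weight `w` multiplied by a learnable scale `s`, both updated by the same plain gradient-descent step on a squared-error loss. The function name, the data, and the learning rate are all hypothetical.

```python
def train_with_learnable_scale(data, steps=200, lr=0.02):
    """Fit y = s * w * x, treating the scale s like any other parameter."""
    w, s = 0.5, 1.0  # hypothetical initial weight and scale
    for _ in range(steps):
        grad_w = 0.0
        grad_s = 0.0
        for x, target in data:
            err = s * w * x - target
            # Gradients of the mean squared error w.r.t. w and s.
            grad_w += 2.0 * err * s * x / len(data)
            grad_s += 2.0 * err * w * x / len(data)
        # The same update rule is applied to the weight and to the scale.
        w -= lr * grad_w
        s -= lr * grad_s
    return w, s

# Toy data generated by y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, s = train_with_learnable_scale(data)
print(w, s, w * s)
```

In this toy setting the product `w * s` converges to 2 and the scale ends up away from its initial value of 1. It says nothing about why the mAP curve oscillates in the real run; it only shows what joint training of weights and scales means.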
The data is loaded through the torchvision.datasets.ImageNet interface, but training fails with the following error:
Traceback (most recent call last):
File "main_repopt.py", line 461, in <module>
main(config)
File "main_repopt.py", line 199, in main
train_one_epoch(config, model, criterion, data_loader_train, optimizer, epoch, mixup_fn, lr_scheduler, model_ema=model_ema)
File "main_repopt.py", line 298, in train_one_epoch
loss.backward()
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 264, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 153, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
terminate called after throwing an instance of 'c10::Error'
what(): NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:161, unhandled cuda error, NCCL version 21.1.4
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ncclCommAbort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:161 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6c (0x7f37de87663c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xfa (0x7f37de841a28 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: + 0x3c1e92e (0x7f361ae5892e in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xac (0x7f361ae393fc in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xd (0x7f361ae395cd in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x10f3211 (0x7f366cc99211 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1105810 (0x7f366ccab810 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0xa71082 (0x7f366c617082 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0xa72043 (0x7f366c618043 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0xf8b98 (0x56188c6e6b98 in /opt/conda/bin/python3)
frame #10: + 0xfa78b (0x56188c6e878b in /opt/conda/bin/python3)
frame #11: + 0xf8b4f (0x56188c6e6b4f in /opt/conda/bin/python3)
frame #12: + 0x1ef516 (0x56188c7dd516 in /opt/conda/bin/python3)
frame #13: + 0x11c574 (0x56188c70a574 in /opt/conda/bin/python3)
frame #14: _PyGC_CollectNoFail + 0x2b (0x56188c8435db in /opt/conda/bin/python3)
frame #15: PyImport_Cleanup + 0x371 (0x56188c85d7b1 in /opt/conda/bin/python3)
frame #16: Py_FinalizeEx + 0x7a (0x56188c85da9a in /opt/conda/bin/python3)
frame #17: Py_RunMain + 0x1b8 (0x56188c8625c8 in /opt/conda/bin/python3)
frame #18: Py_BytesMain + 0x39 (0x56188c862939 in /opt/conda/bin/python3)
frame #19: __libc_start_main + 0xf3 (0x7f37f39470b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: + 0x1e8f39 (0x56188c7d6f39 in /opt/conda/bin/python3)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 8744) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 187, in main
launch(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in launch
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 688, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main_repopt.py FAILED
Other Failures:
<NO_OTHER_FAILURES>
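I don't know what causes this error. I hope someone can help me analyze it.
I compared the segmentation accuracy of u2netp, u2netp-repconv, and u2netp-repopt on my own training dataset; the mIoU values are 0.9159, 0.9169, and 0.9170 respectively. After PTQ uint8 quantization, however, the original structure shows only a small quantization loss, while both repconv and repopt show large quantization losses. Is a large quantization loss for repopt normal?
I have reproduced the B1 training, but the evaluation results remain incorrect. Could you release the recommended eval commands?
@DingXiaoH Hello, the RepOptimizer paper reports a PTQ accuracy of about 54.5 for RepVGG-B1. Under what settings was this measured? In my local PTQ test, the B1 accuracy drops straight to below 10. I need to align with your results for some ongoing work, so I would appreciate the specific PTQ configuration 🙏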
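For readers unfamiliar with the term, a minimal sketch of what "quantization loss" can mean at the level of a single tensor: per-tensor affine uint8 quantization followed by dequantization, with the round-trip error measured. This is an illustration under my own assumptions (hypothetical function names and values), not the PTQ pipeline used in the question or in the paper, and it does not establish the cause of the loss reported here. It only shows the general mechanism by which one large outlier widens the range and coarsens the grid for all other values.

```python
def quantize_uint8(values):
    """Per-tensor affine quantization to the integer range [0, 255]."""
    lo = min(min(values), 0.0)  # the representable range must include zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map the integers back to approximate real values."""
    return [(v - zero_point) * scale for v in q]

def mean_abs_error(values):
    """Average round-trip error of quantize followed by dequantize."""
    q, scale, zero_point = quantize_uint8(values)
    restored = dequantize(q, scale, zero_point)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# Hypothetical weights: a narrow distribution, and the same one plus an outlier.
narrow = [-0.10, -0.05, 0.0, 0.03, 0.08]
wide = narrow + [6.0]

err_narrow = mean_abs_error(narrow)
err_wide = mean_abs_error(wide)
print(err_narrow, err_wide)
```

With these toy values the tensor containing the outlier has a clearly larger mean round-trip error, because the step size grows with the range that has to be covered.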