dingxiaoh / repoptimizers

Official repo of RepOptimizers and RepOpt-VGG

License: MIT License

Python 100.00%

repoptimizers's Issues

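After porting the code to my own project, the optimizer is created and logged repeatedly

While running, in every epoch the code jumps back into the function that creates the model and optimizer, and the message about loading scale.pth is printed again and again.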

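RepOpt-VGG-A0-target cannot reach the accuracy of repvggA0?

A question for the author: training RepOpt-VGG-A0-hs on CIFAR-100 for the hyper-parameter search reaches only about 56 accuracy, and using the searched hyper-parameters to train RepOpt-VGG-A0-target on ImageNet gives only a little over 70, about 2 points below repvggA0. Why is that?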

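Is the method equally effective for networks smaller than B1?

Hello, I see that your work only demonstrates accuracy alignment for the larger models, B1 and above. Can smaller models below B1 also keep this accuracy alignment? Models this large are rarely used as backbones in downstream tasks.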

Is downloading the ImageNet dataset required to run the code? Is there a way to avoid downloading 138 GB of data?

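How can RepOptimizer be applied to downstream tasks?

Hello, when RepOptimizer is applied to other tasks, does the optimizer need to be designed only for the backbone network, or for the whole network?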

RepOPT in Yolov6: training in hyper-search (hs) mode produces a sine-like mAP curve

Thank you for reading.

In this experiment, I trained yolov6s with the RepOpt method on the DOTA dataset. Following the official documentation, I first trained the model in hs mode to search for the optimizer's hyper-parameters, but the val/mAP curves oscillate like a sine function. In the figure, the orange curve is the yolov6s model trained for 400 epochs, the blue one is yolov6s in hs mode trained for 250 epochs, and the red one is yolov6s in hs mode trained for 400 epochs.

As far as I know, in hs mode the scales (hyper-parameters) are trained as normal parameters together with the other model parameters, so training should behave like the orange curve. What makes the hs-mode mAP curves oscillate like a sine function?

The equivalency GR = CSLA could not be verified with two 3x3 convolutions.

When we replace the 1x1 conv of CSLA with a 3x3 conv, the relative difference becomes larger.
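For two branches of the same shape, the equivalence can be checked in a simplified setting. The sketch below is an illustration under stated assumptions, not a reproduction of the reporter's experiment: it replaces the convolutions with dense linear weights, uses plain SGD without momentum or weight decay, and assumes the GR weight is initialised to the merged CSLA weight with its gradient multiplied by alpha_a^2 + alpha_b^2. The function name `csla_vs_gr` and all parameter names are made up for this example.

```python
import random


def csla_vs_gr(alpha_a=0.5, alpha_b=2.0, lr=0.01, steps=50, dim=4, seed=0):
    """Return the largest output gap between a CSLA model and its GR twin."""
    rng = random.Random(seed)
    # CSLA: two same-shape linear branches with constant scales.
    w_a = [rng.uniform(-1, 1) for _ in range(dim)]
    w_b = [rng.uniform(-1, 1) for _ in range(dim)]
    # GR: one weight, initialised to the merged CSLA weight.
    w_gr = [alpha_a * a + alpha_b * b for a, b in zip(w_a, w_b)]
    # Assumed gradient multiplier for two branches of identical shape.
    grad_mult = alpha_a ** 2 + alpha_b ** 2

    max_diff = 0.0
    for _ in range(steps):
        x = [rng.uniform(-1, 1) for _ in range(dim)]
        target = rng.uniform(-1, 1)

        # Forward passes of both models on the same input.
        y_csla = sum((alpha_a * a + alpha_b * b) * xi
                     for a, b, xi in zip(w_a, w_b, x))
        y_gr = sum(w * xi for w, xi in zip(w_gr, x))
        max_diff = max(max_diff, abs(y_csla - y_gr))

        # Loss is 0.5 * (y - target)^2, so dL/dy = y - target.
        g_csla = y_csla - target
        g_gr = y_gr - target

        # Plain SGD on each CSLA branch (chain rule adds one scale factor).
        w_a = [a - lr * alpha_a * g_csla * xi for a, xi in zip(w_a, x)]
        w_b = [b - lr * alpha_b * g_csla * xi for b, xi in zip(w_b, x)]
        # SGD on the single GR weight with the multiplied gradient.
        w_gr = [w - lr * grad_mult * g_gr * xi for w, xi in zip(w_gr, x)]
    return max_diff
```

Calling `csla_vs_gr()` returns a gap at the level of floating-point rounding. In this simplified setting, a growing gap would point to a mismatch in initialisation, in the multiplier, or in optimizer state such as momentum or weight decay, rather than to the kernel size itself.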

Training error

The data is loaded through the torchvision.dataset.imagenet interface, but training fails with the following error:
Traceback (most recent call last):
  File "main_repopt.py", line 461, in <module>
    main(config)
  File "main_repopt.py", line 199, in main
    train_one_epoch(config, model, criterion, data_loader_train, optimizer, epoch, mixup_fn, lr_scheduler, model_ema=model_ema)
  File "main_repopt.py", line 298, in train_one_epoch
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 264, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 153, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
terminate called after throwing an instance of 'c10::Error'
what(): NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:161, unhandled cuda error, NCCL version 21.1.4
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ncclCommAbort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:161 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6c (0x7f37de87663c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xfa (0x7f37de841a28 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: + 0x3c1e92e (0x7f361ae5892e in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xac (0x7f361ae393fc in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xd (0x7f361ae395cd in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x10f3211 (0x7f366cc99211 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1105810 (0x7f366ccab810 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0xa71082 (0x7f366c617082 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0xa72043 (0x7f366c618043 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0xf8b98 (0x56188c6e6b98 in /opt/conda/bin/python3)
frame #10: + 0xfa78b (0x56188c6e878b in /opt/conda/bin/python3)
frame #11: + 0xf8b4f (0x56188c6e6b4f in /opt/conda/bin/python3)
frame #12: + 0x1ef516 (0x56188c7dd516 in /opt/conda/bin/python3)
frame #13: + 0x11c574 (0x56188c70a574 in /opt/conda/bin/python3)
frame #14: _PyGC_CollectNoFail + 0x2b (0x56188c8435db in /opt/conda/bin/python3)
frame #15: PyImport_Cleanup + 0x371 (0x56188c85d7b1 in /opt/conda/bin/python3)
frame #16: Py_FinalizeEx + 0x7a (0x56188c85da9a in /opt/conda/bin/python3)
frame #17: Py_RunMain + 0x1b8 (0x56188c8625c8 in /opt/conda/bin/python3)
frame #18: Py_BytesMain + 0x39 (0x56188c862939 in /opt/conda/bin/python3)
frame #19: __libc_start_main + 0xf3 (0x7f37f39470b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: + 0x1e8f39 (0x56188c7d6f39 in /opt/conda/bin/python3)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 8744) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 187, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 688, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

         main_repopt.py FAILED

================================================
Root Cause:
[0]:
time: 2022-07-20_08:16:08
rank: 0 (local_rank: 0)
exitcode: -6 (pid: 8744)
error_file: <N/A>
msg: "Signal 6 (SIGABRT) received by PID 8744"

Other Failures:
<NO_OTHER_FAILURES>

I don't know what causes this error. I would appreciate any help analyzing it.
