
models' Introduction

MegEngine

MegEngine is a fast, scalable, and user-friendly deep learning framework with three key features.

  • Unified framework for both training and inference
    • Quantization, dynamic shape/image pre-processing, and even derivation with a single model.
    • After training, put everything into your model and run inference on any platform at full speed and precision. Check here for a quick guide.
  • The lowest hardware requirements
    • GPU memory usage can be reduced to one-third of the original when the DTR algorithm is enabled (a minimal sketch of enabling DTR follows this feature list).
    • Inference models with the lowest memory usage by leveraging our Pushdown memory planner.
  • Inference efficiently on all platforms
    • Inference with high speed and precision on x86, Arm, CUDA, and ROCm.
    • Supports Linux, Windows, iOS, Android, TEE, etc.
    • Optimize performance and memory usage by leveraging our advanced features.
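The DTR feature mentioned above can be turned on with a couple of lines. This is a minimal sketch, assuming a MegEngine release that ships the megengine.dtr module; attribute names may differ between versions:

    import megengine as mge

    # Assumed API: recent MegEngine releases expose DTR controls under megengine.dtr.
    mge.dtr.eviction_threshold = "5GB"  # cap on resident tensor memory before eviction kicks in
    mge.dtr.enable()

    # Training then proceeds as usual: evicted activations are recomputed on demand,
    # trading extra compute for a much smaller GPU memory footprint.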

Installation

NOTE: MegEngine now supports Python installation on Linux-64bit/Windows-64bit/MacOS(CPU-Only)-10.14+/Android 7+(CPU-Only) platforms with Python from 3.6 to 3.9. On Windows 10 you can either install the Linux distribution through Windows Subsystem for Linux (WSL) or install the Windows distribution directly. Many other platforms are supported for inference.

Binaries

To install the pre-built binaries via pip wheels:

python3 -m pip install --upgrade pip
python3 -m pip install megengine -f https://megengine.org.cn/whl/mge.html
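To verify the installation, import the package and print its version (the exact output depends on the wheel you installed):

python3 -c "import megengine; print(megengine.__version__)"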

Building from Source

How to Contribute

We strive to build an open and friendly community. We aim to power humanity with AI.

How to Contact Us

Resources

License

MegEngine is licensed under the Apache License, Version 2.0

Citation

If you use MegEngine in your publication, please cite it using the following BibTeX entry.

@Misc{MegEngine,
  institution = {megvii},
  title =  {MegEngine: A fast, scalable and easy-to-use deep learning framework},
  howpublished = {\url{https://github.com/MegEngine/MegEngine}},
  year = {2020}
}

Copyright (c) 2014-2021 Megvii Inc. All rights reserved.

models' People

Contributors

asthestarsfalll, blablabiu, dc3671, fatescript, flashrunrun, lijiansong, lixiangyin666, megvii-mge, mondayzhou, pepperonibo, randolph87xxx, rlee719, tpoisonooo, xpmemeda, yzchen, zhiy-zhang, zhouyizhuang-megvii

models' Issues

Reproduce EfficientNet

Task description

  • Reproduce EfficientNet: training converges normally, the acceptance metrics meet expectations, and the code is submitted under official/vision/classification/models

Goals

  • Dataset: ImageNet
  • Accuracy equal to or higher than the paper
  • The script can run the full training procedure
  • Provide the trained weight file
  • Submit to https://github.com/MegEngine/Hub

Unable to dump the quantized resnet50 model

Environment

1. OS:
2. MegEngine version:
3. Python version: python3.7
4. Model name:

Steps to reproduce

Please provide the key code snippet to help track down the issue

python3 inference.py -a resnet50 --mode quantized --dump

Please provide the full log and error message

Traceback (most recent call last):
File "inference.py", line 110, in <module>
main()
File "inference.py", line 50, in main
model = models.__dict__[args.arch]()
File "/home/liujunjie/AIBenchmark/models/model-zoo/megengine_quant/quantization/models/resnet.py", line 296, in resnet50
m = ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
File "/home/liujunjie/AIBenchmark/models/model-zoo/megengine_quant/quantization/models/resnet.py", line 162, in __init__
self.layer1 = self._make_layer(block, 64, layers[0], norm=norm)
File "/home/liujunjie/AIBenchmark/models/model-zoo/megengine_quant/quantization/models/resnet.py", line 235, in _make_layer
norm=norm,
TypeError: __init__() got an unexpected keyword argument 'norm'
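The traceback suggests a version mismatch between resnet.py's _make_layer (which forwards a norm argument) and the block class it instantiates. A quick, hypothetical check is to confirm which copy of resnet.py actually gets imported and whether its Bottleneck constructor accepts norm; the import path below is only illustrative and must be adapted to the checkout shown in the traceback:

    import inspect

    # Hypothetical import path; point it at the resnet.py shown in the traceback.
    from quantization.models import resnet

    print(resnet.__file__)                                # which copy is actually imported
    print(inspect.signature(resnet.Bottleneck.__init__))  # does it accept a `norm` keyword?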

Error when running Models/official/vision/detection/inference.py

Environment

1. OS: ubuntu 18
2. MegEngine version: MegEngine-0.3.1
3. Python version: python3.6.9
4. Model name: resnet50

Steps to reproduce

1. Download the pretrained weights from the network:
from megengine import hub
model = hub.load('megengine/models', 'resnet50', pretrained=True)
2. Run Models/official/vision/detection/inference.py:

python3 tools/inference.py -f retinanet_res50_1x_800size.py -i ../../assets/cat.jpg -m /home/liyeguang/python_example/resnet50/resnet50_fbaug_76254_4e14b7d1.pkl
3. The error occurs.

Please provide the key code snippet to help track down the issue

Please provide the full log and error message

30 15:43:50 Load Model : /home/AAA/python_example/resnet50/resnet50_fbaug_76254_4e14b7d1.pkl completed
30 15:43:50[mgb] WRN cuda unavailable: no CUDA-capable device is detected(100) ndev=-1
Traceback (most recent call last):
File "tools/inference.py", line 68, in
main()
File "tools/inference.py", line 50, in main
model.load_state_dict(mge.load(args.model)["state_dict"])
KeyError: 'state_dict'
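The checkpoint loaded here is a classification backbone whose weights are stored directly, while tools/inference.py expects a detection checkpoint wrapped under a "state_dict" key. A tolerant loading pattern, similar to the conversion script quoted further down this page, is sketched below (`args.model` and `model` are the names used in tools/inference.py); note that a backbone-only checkpoint still cannot populate the full RetinaNet:

    import megengine as mge

    checkpoint = mge.load(args.model)
    # Some checkpoints wrap the weights under a "state_dict" key, others store them directly.
    if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
        checkpoint = checkpoint["state_dict"]
    model.load_state_dict(checkpoint)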

About DetectionPadCollator

(screenshot from the original issue)
This seems to pad input images of different sizes to the same size and then concat them?
Doesn't that have any adverse effect?
Or am I misunderstanding it?

Unexpected problem when quantizing a model

Environment

1. OS: ubuntu1804
2. MegEngine version: 1.4.0
3. Python version: 3.6.8

Steps to reproduce

  1. git clone https://github.com/MegEngine/Models.git
  2. python3 train.py -a resnet18 -d /path/to/imagenet --mode normal triggers the problem

Please provide the key code snippet to help track down the issue

1. Both train.py and finetune.py hit this problem

Please provide the full log and error message

Traceback (most recent call last):
File "finetune.py", line 315, in
main()
File "finetune.py", line 69, in main
train_proc(world_size, args)
File "finetune.py", line 159, in worker
data.RandomSampler(train_dataset, batch_size=cfg.BATCH_SIZE, drop_last=True)
File "/usr/local/python/lib/python3.6/site-packages/megengine/data/sampler.py", line 322, in init
self.sampler_iter = iter(self.sampler)
File "/usr/local/python/lib/python3.6/site-packages/megengine/data/sampler.py", line 100, in iter
return self.batch()
File "/usr/local/python/lib/python3.6/site-packages/megengine/data/sampler.py", line 146, in batch
if self.drop_last and len(batch_index[-1]) < self.batch_size:
IndexError: list index out of range

The resnet classification model cannot use a custom number of classes

When training an apple/orange classifier with the classification resnet and setting the number of classes to 2, training fails.

Models/official/vision/classification/resnet/model.py
resnet50 only supports training on a 1000-class dataset; when I build a 2-class dataset and set num_classes=2, training does not complete normally.
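Assuming the constructors in official/vision/classification/resnet/model.py forward keyword arguments to the ResNet class (as standard ResNet implementations do), a 2-class model would be built roughly as below; note that the 1000-class pretrained fully connected layer cannot be reused unchanged, and the dataset must yield integer labels 0 and 1:

    # Hypothetical usage; the import path depends on how the Models repo sits on PYTHONPATH.
    from official.vision.classification.resnet.model import resnet50

    model = resnet50(num_classes=2)  # do not load the 1000-class pretrained head into this model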

Task description

The dataset layout is as follows:
/path/to/imagenet
    train
        apple
            xxx.jpg
            ...
        orangle
            xxx.jpg
            ...
        ...
    val
        apple
            xxx.jpg
            ...
        orangle
            xxx.jpg
            ...

Goal

Since I am a beginner and could not solve this after trying various changes myself, I need some help. Thank you.

hub.load fails when importing "retinanet_res50_1x_800size"

Environment

1. OS: Ubuntu 18.04
2. MegEngine version: 0.4.0
3. Python version: 3.6.8
4. Model name: retinanet_res50_1x_800size

Steps to reproduce

Please provide the key code snippet to help track down the issue

from megengine import hub
model = hub.load(
"megengine/models",
"retinanet_res50_1x_800size",
pretrained=True,
)

Please provide the full log and error message

Traceback (most recent call last):
File "", line 4, in
File "/usr/local/lib/python3.6/dist-packages/megengine/hub/hub.py", line 202, in load
raise RuntimeError("Cannot find callable {} in hubconf.py".format(entry))
RuntimeError: Cannot find callable retinanet_res50_1x_800size in hubconf.py
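hub.load fetches the Models branch matching the installed MegEngine version (0.4.0 here), and that branch's hubconf.py may simply not export retinanet_res50_1x_800size. Assuming your MegEngine build provides the hub.list helper, the available entries can be checked like this:

    from megengine import hub

    # Assumed helper: lists the callables exported by the repository's hubconf.py
    # for the branch that matches the installed MegEngine version.
    print(hub.list("megengine/models"))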

DataLoader error when running the resnet example with multiple GPUs on a single machine

Environment

1. OS: ubuntu
2. MegEngine version: 0.3
3. Python version: 3.7.3
4. Model name: resnet

Steps to reproduce

  1. cd official/vision/classification/resnet
  2. python train.py --data=/path/to/imagenet --ngpus=8 --workers=8

Please provide the full log and error message

Process Process-2:
Traceback (most recent call last):
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/ssd/ssd0/MegEngine_Models/official/vision/classification/resnet/train.py", line 188, in worker
train_func, train_queue, optimizer, args, epoch=epoch
File "/ssd/ssd0/MegEngine_Models/official/vision/classification/resnet/train.py", line 219, in train
for step, (image, label) in enumerate(data_queue):
File "/home/cmq/anaconda3/lib/python3.7/site-packages/megengine/data/dataloader.py", line 122, in iter
return _ParallelDataLoaderIter(self)
File "/home/cmq/anaconda3/lib/python3.7/site-packages/megengine/data/dataloader.py", line 216, in init
worker.start()
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in init
self._launch(process_obj)
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/cmq/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle weakref objects
Exception ignored in: <function _ParallelDataLoaderIter.__del__ at 0x7f42a9f22620>
Traceback (most recent call last):
File "/home/cmq/anaconda3/lib/python3.7/site-packages/megengine/data/dataloader.py", line 544, in del
if self.__initialized:
AttributeError: '_ParallelDataLoaderIter' object has no attribute '_ParallelDataLoaderIter__initialized'

Reproduce Googlenet

Task description

  • Reproduce Googlenet: training converges normally, the acceptance metrics meet expectations, and the code is submitted under official/vision/classification/models

Goals

  • Dataset: ImageNet
  • Accuracy equal to or higher than the paper
  • The script can run the full training procedure
  • Provide the trained weight file
  • Submit to https://github.com/MegEngine/Hub

Improve the quantization of the existing RetinaNet

Background

1. Implement a fully quantized RetinaNet
2. Merge the quantized detection model into vision/detection

Task description

1. Limited by the MegEngine version, the existing RetinaNet only quantizes the backbone; a fully quantized RetinaNet is needed as a follow-up
2. The detection quantization code is partly redundant and needs to be merged

Goal

Merge the quantized RetinaNet into vision/detection

Question about the learning rate for shufflenet training

Your torch code for training shufflenet uses torch.nn.DataParallel for multi-GPU training, which runs rather slowly, so I rewrote it with torch.distributed (DistributedDataParallel), but I cannot reproduce the original results. I then found that MegEngine also implements multi-GPU training, and I have some questions about its learning rate setting.

I noticed the README says:

--learning-rate: the initial learning rate for training, 0.0625 by default; under distributed training, the actual learning rate equals the initial learning rate multiplied by the number of nodes/GPUs.

1. In the shufflenetv2 implemented with torch.nn.DataParallel, the initial learning rate is set to 0.5, while in the MegEngine shufflenet here it is set to 0.5/8 = 0.0625. Are these two settings equivalent?

2. If I want to do multi-GPU training with torch.distributed, how should I set the initial learning rate to be equivalent?

Thanks o( ̄▽ ̄)ブ
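The README rule quoted above is the linear scaling rule: the effective learning rate is the per-process rate multiplied by the number of processes, which suggests the two settings are indeed meant to be equivalent. A small sketch of the arithmetic, using the figures from the question:

    base_lr_per_gpu = 0.0625   # value passed via --learning-rate (the default)
    world_size = 8             # number of GPUs / processes

    effective_lr = base_lr_per_gpu * world_size
    print(effective_lr)        # 0.5, the same total learning rate as the DataParallel run

For torch.distributed training, the same idea applies: decide on the total learning rate you want (e.g. 0.5) and apply whatever per-process scaling your training script expects.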

Missing Code (a missing line makes the script fail to run)

processed_img = transform.apply(image)[np.newaxis, :]

Missing Code

There is one line of code missing that converts the ndarray into a tensor, which leads to an error at runtime:

TypeError: op Convolution expect type Tensor as inputs, got numpy.ndarray actually

Solution

Add the following below L63:

    processed_img = megengine.tensor(processed_img, dtype="float32")

Help needed: intermittent 'mgb::CudaError' when training semantic segmentation with a custom data input

Background

While testing the training of a semantic segmentation model, I train with a dataset generator I wrote myself because my data format is different.

Task description

Training intermittently fails with the following error.

terminate called after throwing an instance of 'mgb::CudaError'
what(): failed to query event: 700: an illegal memory access was encountered

backtrace:
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb9CudaErrorC1ERKSs+0x54) [0x7f30e95ae164]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb12CudaCompNode9EventImpl11do_finishedEv+0xbc) [0x7f30e956358c]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb18CompNodeImplHelper15EventImplHelper8finishedEv+0x57) [0x7f30e957bf47]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x3d7c3d) [0x7f31433bac3d]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x4fcdff) [0x7f31434dfdff]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f3143bb6b43]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f3143c47bb4]
(last_err=700(an illegal memory access was encountered) device=0 mem_free=0.000MiB mem_tot=0.000MiB)
Aborted (core dumped)

I suspect a GPU memory leak, but I am not sure where it would be.
I monitored the GPU throughout training; memory usage never reached the limit, only about half.
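An intermittent illegal memory access is not necessarily a leak; with a custom dataset generator, one common cause worth ruling out is label values outside the valid class range, which can make the loss kernel index out of bounds. A hedged sanity-check sketch over a hypothetical dataset object:

    import numpy as np

    NUM_CLASSES = 19      # replace with your model's number of classes
    IGNORE_LABEL = 255    # replace or drop if you do not use an ignore index

    # `dataset` stands for your custom generator yielding (image, label) pairs.
    for i, (image, label) in enumerate(dataset):
        label = np.asarray(label)
        bad = (label >= NUM_CLASSES) & (label != IGNORE_LABEL)
        if bad.any():
            print(f"sample {i}: out-of-range label values {np.unique(label[bad])}")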

Spiky GPU utilization when training classification task

Hi,
When running the training script for the image classification directories (resnet, shufflenet, etc.), I notice very spiky GPU utilization.

Scale lr with batch size and ngpus:

python3 train.py --data /home/ubuntu/imagenet --arch resnet50 --batch-size 64 --learning-rate 0.1 --ngpus 8

Open a separate terminal window and run:
watch -n0.1 nvidia-smi

You should see GPU utilization oscillate between roughly 60% and 100%.
This is consistently reproducible across all trials.

If spikes are expected, my question is why. If they are not, could you try to reproduce this GPU spike behavior and pin down which part of the code causes it? Performance might improve if utilization did not oscillate.

Is this expected behavior? I observe it within a single training run, and it appears during every epoch. Something seems off, and I don't know how to debug or diagnose it within the framework; pointers to specific diagnostic tools would be appreciated.
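One way to narrow this down is to time the data-fetch and compute portions of each step separately; if the utilization dips line up with long fetch times, the oscillation is an input-pipeline stall rather than a kernel issue. A rough sketch around a training loop such as the one in resnet/train.py (names are illustrative, not the script's exact API):

    import time

    fetch_start = time.perf_counter()
    for step, (image, label) in enumerate(data_queue):
        fetch_time = time.perf_counter() - fetch_start

        compute_start = time.perf_counter()
        loss = train_func(image, label)   # forward/backward/optimizer step
        _ = loss.numpy()                  # MegEngine runs asynchronously; reading the value forces a sync
        compute_time = time.perf_counter() - compute_start

        if step % 20 == 0:
            print(f"step {step}: fetch {fetch_time * 1e3:.1f} ms, compute {compute_time * 1e3:.1f} ms")
        fetch_start = time.perf_counter()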

Training error (https://github.com/MegEngine/Models/tree/master/official/quantization)

Running python3 train.py -a resnet18 -d /home/rootx/Models/official/quantization/dataset/flowers-recognition/ --mode normal
fails because a file is missing. What should I do?

root@rootx-virtual-machine:/home/rootx/Models/official/quantization# python3 train.py -a resnet18 -d /home/rootx/Models/official/quantization/dataset/flowers-recognition/ --mode normal
err: Failed to load CUDA Driver API library
err: failed to load cuda func: cuCtxGetCurrent
err: failed to load cuda func: cuCtxGetCurrent
05 09:43:15[mgb] ERR cudaGetDeviceCount failed: CUDA driver version is insufficient for CUDA runtime version (err 35)
05 09:43:15[mgb] WRN cuda unavailable: CUDA driver version is insufficient for CUDA runtime version(35) ndev=-1
05 09:43:16 preparing dataset..
05 09:43:16 WRN devkit directory /home/rootx/Models/official/quantization/dataset/flowers-recognition/ILSVRC2012_devkit_t12 does not exists
05 09:43:16 checksum devkit tar file /home/rootx/Models/official/quantization/dataset/flowers-recognition/ILSVRC2012_devkit_t12.tar.gz ...
Traceback (most recent call last):
File "train.py", line 309, in
main()
File "train.py", line 65, in main
worker(0, 1, args)
File "train.py", line 152, in worker
train_dataset = data.dataset.ImageNet(args.data, train=True)
File "/usr/local/lib/python3.6/dist-packages/megengine/data/dataset/vision/imagenet.py", line 97, in init
self._prepare_devkit()
File "/usr/local/lib/python3.6/dist-packages/megengine/data/dataset/vision/imagenet.py", line 245, in _prepare_devkit
calculate_md5(raw_file) == checksum
File "/usr/local/lib/python3.6/dist-packages/megengine/data/dataset/vision/utils.py", line 56, in calculate_md5
with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/rootx/Models/official/quantization/dataset/flowers-recognition/ILSVRC2012_devkit_t12.tar.gz'
root@rootx-virtual-machine:/home/rootx/Models/official/quantization#
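The log shows that data.dataset.ImageNet insists on the ILSVRC2012 devkit archives, so it cannot point at an arbitrary folder-per-class dataset such as flowers-recognition. Assuming your MegEngine version ships megengine.data.dataset.ImageFolder (check your local docs; the exact signature may differ), a folder-per-class dataset could be loaded like this instead, with train.py's class count and transforms adjusted to match:

    from megengine.data.dataset import ImageFolder

    # Assumed API: ImageFolder scans root/<class_name>/<image files> and yields (image, label).
    # The path below is hypothetical; point it at your folder-per-class training split.
    train_dataset = ImageFolder("/path/to/flowers-recognition/train")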

Environment

1. OS:
2. MegEngine version:
3. Python version:
4. Model name:

Steps to reproduce

Please provide the key code snippet to help track down the issue

Please provide the full log and error message

Help needed: when training atss detection, dumping to .mge following the official steps fails, and the .tm model cannot be visualized

Environment

1. OS: Ubuntu 22.04.2
2. MegEngine version: 1.13.0
3. Python version: 3.10.12
4. Model name: atss_res18_coco_3x_800size

Steps to reproduce

1. Train

python  official/vision/detection/tools/train.py    -f official/vision/detection/configs/atss_res18_coco_3x_800size.py -n 1 -d data/coco/

2. Convert

python convert.py -f official/vision/detection/configs/atss_res18_coco_3x_800size.py -w log-of-atss_res18_coco_3x_800size/epoch_9.pkl -i official/assets/cat.jpg

Please provide the key code snippet to help track down the issue

Conversion code:
import numpy as np
import megengine.functional as F
import megengine.hub
from megengine import jit, tensor
import megengine as mge
import megengine.distributed as dist
from megengine.autodiff import GradManager
from megengine.data import DataLoader, Infinite, RandomSampler
from megengine.data import transform as T
from megengine.optimizer import SGD

from official.vision.detection.tools.data_mapper import data_mapper
from official.vision.detection.tools.utils import DetEvaluator, import_from_file
import megengine.traced_module as tm
import argparse
import bisect
import copy
import os
import time
import cv2

def make_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-f", "--file", default="net.py", type=str, help="net description file"
    )
    parser.add_argument(
        "-w", "--weight_file", default=None, type=str, help="weights file",
    )
    parser.add_argument("-i", "--image", type=str)
    return parser


if __name__ == "__main__":
    parser = make_parser()
    args = parser.parse_args()
    
    current_network = import_from_file(args.file)
    cfg = current_network.Cfg()
    cfg.backbone_pretrained = False
    model = current_network.Net(cfg)
    
    ori_img = cv2.imread(args.image)
    image, im_info = DetEvaluator.process_inputs(
        ori_img.copy(), model.cfg.test_image_short_size, model.cfg.test_image_max_size,
    )
    state_dict = mge.load(args.weight_file)
    if "state_dict" in state_dict:
        state_dict = state_dict["state_dict"]
    model.load_state_dict(state_dict)
    model.eval()
    
    traced_resnet = tm.trace_module(model, mge.tensor(image),im_info=mge.tensor(im_info))
    # Graph surgery based on trace_module, as well as model conversion, can be done here
    traced_resnet.eval()
    mge.save(traced_resnet,"test.tm")
    @jit.trace(symbolic=True, capture_as_const=True)
    def infer_func(data, im_info, model):
        pred = model(data,im_info)
        return pred

    output = infer_func(mge.tensor(image),im_info=mge.tensor(im_info), model=traced_resnet)
    infer_func.dump("log-of-atss_res18_coco_3x_800size/test.mge", arg_names=["data"])

Please provide the full log and error message

25 17:37:50[mgb] WRN [dnn]
            Cudnn8 will jit ptx code with cache. You can set
            CUDA_CACHE_MAXSIZE and CUDA_CACHE_PATH environment var to avoid repeat jit(very slow).
            For example `export CUDA_CACHE_MAXSIZE=2147483647` and `export CUDA_CACHE_PATH=/data/.cuda_cache`
25 17:37:53[mgb] ERR error while applying optimization pass PassConvertToCompatible: bad input shape for polyadic operator: {256}, {1,256,136,100}

backtrace:
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb13MegBrainErrorC1ERKSs+0x4a) [0x7fdfdad5590a]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(+0x2db7867) [0x7fdfdadb7867]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN6megdnn12ErrorHandler15on_megdnn_errorERKSs+0x14) [0x7fdfde9d86a4]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(+0x6a078d8) [0x7fdfdea078d8]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(+0x6a07a91) [0x7fdfdea07a91]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb3opr8Elemwise20get_output_var_shapeEN6megdnn5param8Elemwise4ModeERKNS2_11SmallVectorINS2_11TensorShapeELj4EEE+0x37) [0x7fdfdaf150e7]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZNK3mgb3opr8Elemwise20get_output_var_shapeERKN6megdnn11SmallVectorINS2_11TensorShapeELj4EEERS5_+0x29) [0x7fdfdaf1e0c9]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb2cg5mixin24OutshapePureByInshapeOpr10infer_descEmRN6megdnn11TensorShapeERKNS0_12static_infer6InpValE+0x19d) [0x7fdfdade9e0d]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb2cg12static_infer22StaticInferManagerImpl13TagShapeTrait8do_inferERKNS1_6InpValE+0x57) [0x7fdfdae0b437]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(+0x2e0b292) [0x7fdfdae0b292]

Traceback (most recent call last):
  File "/home/csy/megvii/Models/convert.py", line 63, in <module>
    infer_func.dump("log-of-atss_res18_coco_3x_800size/test.mge", arg_names=["data"])
  File "/home/csy/.local/lib/python3.10/site-packages/megengine/jit/tracing.py", line 1183, in dump
    dump_content, dump_info = G.dump_graph(
  File "/home/csy/.local/lib/python3.10/site-packages/megengine/core/tensor/megbrain_graph.py", line 456, in dump_graph
    dump_content = _imperative_rt.dump_graph(
RuntimeError: bad input shape for polyadic operator: {256}, {1,256,136,100}

backtrace:
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb13MegBrainErrorC1ERKSs+0x4a) [0x7fdfdad5590a]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(+0x2db7867) [0x7fdfdadb7867]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN6megdnn12ErrorHandler15on_megdnn_errorERKSs+0x14) [0x7fdfde9d86a4]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(+0x6a078d8) [0x7fdfdea078d8]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(+0x6a07a91) [0x7fdfdea07a91]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb3opr8Elemwise20get_output_var_shapeEN6megdnn5param8Elemwise4ModeERKNS2_11SmallVectorINS2_11TensorShapeELj4EEE+0x37) [0x7fdfdaf150e7]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZNK3mgb3opr8Elemwise20get_output_var_shapeERKN6megdnn11SmallVectorINS2_11TensorShapeELj4EEERS5_+0x29) [0x7fdfdaf1e0c9]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb2cg5mixin24OutshapePureByInshapeOpr10infer_descEmRN6megdnn11TensorShapeERKNS0_12static_infer6InpValE+0x19d) [0x7fdfdade9e0d]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb2cg12static_infer22StaticInferManagerImpl13TagShapeTrait8do_inferERKNS1_6InpValE+0x57) [0x7fdfdae0b437]
/home/csy/.local/lib/python3.10/site-packages/megengine/core/lib/libmegengine_shared.so(+0x2e0b292) [0x7fdfdae0b292]

The NLP models cannot be run by following the readme

Environment

1. OS: MegStudio
2. MegEngine version:
3. Python version: 3.7
4. Model name: bert cls mgpc

Steps to reproduce

  1. Download models
  2. Run on MegStudio following the readme
  3. Error: there is no tuple

Please provide the key code snippet to help track down the issue

Please provide the full log and error message

(error screenshot attached in the original issue)

Abnormal test speed on the COCO dataset

Background

Testing on the COCO dataset is abnormally slow.

(screenshot attached in the original issue)

Task description

I installed megengine and Models following the front-page readme, downloaded the pretrained weights via hub.py, and ran the test directly on the coco dataset without any code changes. Testing on 8 GPUs, the final mAP is 42.6 (as expected), but it takes nearly 6m30s (abnormal); the normal 8-GPU time should be around 1 minute.

Python: 3.6
MegEngine version: v1.4.0

Goal

Multi-GPU test speed should return to normal; please confirm the problem and help fix it.

python3 train.py ERROR

Traceback (most recent call last):
File "train.py", line 41, in
import megengine as mge
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/megengine/init.py", line 3, in
raise ValueError("This is an placeholder only")
ValueError: This is an placeholder only
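This error comes from a placeholder package rather than a real MegEngine build. Reinstalling from the official wheel index given in the Installation section above should replace it (also note the platform restrictions listed there):

python3 -m pip uninstall -y megengine
python3 -m pip install megengine -f https://megengine.org.cn/whl/mge.html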

Reproduce InceptionNet

Task description

  • Reproduce InceptionNet: training converges normally, the acceptance metrics meet expectations, and the code is submitted under official/vision/classification/models

Goals

  • Dataset: ImageNet
  • Accuracy equal to or higher than the paper
  • The script can run the full training procedure
  • Provide the trained weight file
  • Submit to https://github.com/MegEngine/Hub

The imagenet dataset

For https://github.com/MegEngine/Models/tree/master/official/quantization, where should the imagenet dataset used by the command python3 train.py -a resnet18 -d /path/to/imagenet --mode normal be downloaded? Is it http://www.image-net.org/? Can the flower dataset (https://www.kaggle.com/c/flower-classification-with-tpus/data) be used with this command and run correctly?
(screenshot attached in the original issue)
When creating a project on megstudio and choosing a dataset, why is there no kaggle flower dataset? It was available in the lecture; where can it be found?

Background

Task description

Goal
