It seems that the code cannot generate "./train_pkl/samples_bytes_0.pkl" successfully.


Error when using --cache-mode part (svip-lab/as-mlp)

Comments (9)

niujinshuchong commented on August 18, 2024

@HantingChen
Please create train_pkl or val_pkl manually. Otherwise, the first run cannot save the *.pkl file in train_pkl or val_pkl because it cannot find those folders.

Thank you very much for sharing the error.
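
The folders can also be created from code before the first run. This is a minimal sketch, assuming the cache folders are `train_pkl` and `val_pkl` relative to the working directory, as the paths in this thread suggest. The helper name `ensure_cache_dirs` is hypothetical and not part of the repository.

```python
import os


def ensure_cache_dirs(root=".", names=("train_pkl", "val_pkl")):
    """Create the cache folders if they are missing and return their paths."""
    paths = []
    for name in names:
        path = os.path.join(root, name)
        # exist_ok=True makes this safe to call on every run and from every rank.
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `ensure_cache_dirs()` once at startup should have the same effect as creating the two folders by hand.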

from as-mlp.

HantingChen commented on August 18, 2024

> @HantingChen
> Please create train_pkl or val_pkl manually. Otherwise, the first run cannot save the *.pkl file in train_pkl or val_pkl because it cannot find those folders.
>
> Thank you very much for sharing the error.

I have created the train_pkl folder, and the first run did save the *.pkl file.
You can see that the second run did not try to produce the pkl file.
However, the size of the pkl file is 0. Something may have gone wrong when saving the pkl file.
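
A zero-byte cache file cannot be loaded: `pickle.load` raises `EOFError` on an empty file. Below is a minimal sketch for checking a cache file before reusing it. The helper name `cache_is_usable` is hypothetical and not part of the repository.

```python
import os
import pickle


def cache_is_usable(path):
    """Return True only if `path` exists, is non-empty, and unpickles cleanly."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    try:
        with open(path, "rb") as handle:
            pickle.load(handle)
    except (EOFError, pickle.UnpicklingError):
        # Empty or truncated files end up here.
        return False
    return True
```

If `cache_is_usable("./train_pkl/samples_bytes_0.pkl")` returns `False`, deleting the file before the next run should force it to be regenerated. Fully unpickling a large cache is slow, so the size check alone may be enough in practice.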

niujinshuchong commented on August 18, 2024

@HantingChen
Also, please note that if you create the *.pkl files using 8 GPUs and then want to train a model with a different number of GPUs, you should regenerate those *.pkl files.
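
One way to see why: if each process caches only its own share of the samples, that share depends on how many processes there are. The sketch below assumes a round-robin split (`index % world_size == rank`). That split is an assumption for illustration; the thread does not confirm how the repository partitions samples.

```python
def shard_indices(n_samples, rank, world_size):
    """Indices a given rank would cache under an assumed round-robin split."""
    return [i for i in range(n_samples) if i % world_size == rank]


# Rank 0 owns different samples when the number of processes changes,
# so a cache written for 8 processes does not match a 4-process run.
print(shard_indices(16, rank=0, world_size=8))  # [0, 8]
print(shard_indices(16, rank=0, world_size=4))  # [0, 4, 8, 12]
```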

niujinshuchong commented on August 18, 2024

> > @HantingChen
> > Please create train_pkl or val_pkl manually. Otherwise, the first run cannot save the *.pkl file in train_pkl or val_pkl because it cannot find those folders.
> > Thank you very much for sharing the error.
>
> I have created the train_pkl folder, and the first run did save the *.pkl file.
> You can see that the second run did not try to produce the pkl file.
> However, the size of the pkl file is 0. Something may have gone wrong when saving the pkl file.

@HantingChen I just cloned the code and tested it. It can create *.pkl files with cache-mode part. (PS. my pickle version is 4.0)

Would you please try it again and attach the full log?

HantingChen commented on August 18, 2024

The full log is attached below. My pickle version is also 4.0.

Log from the first run:

./train_pkl/samples_bytes_0.pkl
global_rank 0 cached 0/1281167 takes 0.00s per block
global_rank 0 cached 128116/1281167 takes 21.24s per block
global_rank 0 cached 256232/1281167 takes 19.01s per block
global_rank 0 cached 384348/1281167 takes 18.36s per block
global_rank 0 cached 512464/1281167 takes 29.66s per block
global_rank 0 cached 640580/1281167 takes 35.94s per block
global_rank 0 cached 768696/1281167 takes 36.32s per block
global_rank 0 cached 896812/1281167 takes 35.54s per block
global_rank 0 cached 1024928/1281167 takes 37.19s per block
global_rank 0 cached 1153044/1281167 takes 46.55s per block
global_rank 0 cached 1281160/1281167 takes 50.39s per block
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ma-user/anaconda3/envs/Pytorch-1.4.0/bin/python', '-u', 'main.py', '--local_rank=0', '--cfg', 'configs/as_base_patch4_shift5_224.yaml', '--data-path', '/cache/imagenet/imagenet/', '--eval', '--resume', '/cache/model/asmlp_base_patch4_shift5_224.pth', '--moxfile', '0']' died with <Signals.SIGKILL: 9>.

Log from the second run:

./train_pkl/samples_bytes_0.pkl
Traceback (most recent call last):
  File "main.py", line 349, in <module>
    main(config)
  File "main.py", line 78, in main
    dataset_train, dataset_val, data_loader_train, data_loader_val, mixup_fn = build_loader(config)
  File "/home/ma-user/work/AS-MLP-main/data/build.py", line 17, in build_loader
    dataset_train, config.MODEL.NUM_CLASSES = build_dataset(is_train=True, config=config)
  File "/home/ma-user/work/AS-MLP-main/data/build.py", line 80, in build_dataset
    cache_mode=config.DATA.CACHE_MODE if is_train else 'part')
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 250, in __init__
    cache_mode=cache_mode)
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 122, in __init__
    self.init_cache()
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 137, in init_cache
    self.samples = pickle.load(handle)
EOFError: Ran out of input
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ma-user/anaconda3/envs/Pytorch-1.4.0/bin/python', '-u', 'main.py', '--local_rank=0', '--cfg', 'configs/as_base_patch4_shift5_224.yaml', '--data-path', '/cache/imagenet/imagenet/', '--eval', '--resume', '/cache/model/asmlp_base_patch4_shift5_224.pth', '--moxfile', '0']' returned non-zero exit status 1.
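
The first log ends with SIGKILL and the second with `EOFError` on the same file, which would fit a cache file that was opened for writing but never filled. One possible way to avoid leaving such a file behind is to write to a temporary file and rename it only after the dump succeeds. This is a sketch of that general pattern, not the repository's code; `save_pickle_atomically` is a hypothetical helper.

```python
import os
import pickle
import tempfile


def save_pickle_atomically(obj, path):
    """Dump `obj` to a temp file in the target folder, then rename it to `path`."""
    folder = os.path.dirname(os.path.abspath(path))
    os.makedirs(folder, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file; `path` is left untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, a process killed mid-write leaves only a stray `.tmp` file, so a later run would not mistake it for a finished cache.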

niujinshuchong commented on August 18, 2024

I also tested the code with 1 GPU. The output looks like this:

`
CUDA_VISIBLE_DEVICES=9 python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --cfg configs/as_tiny_patch4_shift5_224.yaml --data-path /root/fake_data/ImageNet-Zip/ --batch-size 64 --cache-mode part --accumulation-steps 2

=> merge config from configs/as_tiny_patch4_shift5_224.yaml
RANK and WORLD_SIZE in environ: 0/1
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
[2021-07-22 09:01:08 asmlp_tiny_patch4_shift5_224](main.py 340): INFO Full config saved to output/asmlp_tiny_patch4_shift5_224/default/config.json
[2021-07-22 09:01:08 asmlp_tiny_patch4_shift5_224](main.py 343): INFO AMP_OPT_LEVEL: O1
AUG:
  AUTO_AUGMENT: rand-m9-mstd0.5-inc1
  COLOR_JITTER: 0.4
  CUTMIX: 1.0
  CUTMIX_MINMAX: null
  MIXUP: 0.8
  MIXUP_MODE: batch
  MIXUP_PROB: 1.0
  MIXUP_SWITCH_PROB: 0.5
  RECOUNT: 1
  REMODE: pixel
  REPROB: 0.25
BASE:
- ''
DATA:
  BATCH_SIZE: 64
  CACHE_MODE: part
  DATASET: imagenet
  DATA_PATH: /root/fake_data/ImageNet-Zip/
  IMG_SIZE: 224
  INTERPOLATION: bicubic
  NUM_WORKERS: 8
  PIN_MEMORY: true
  ZIP_MODE: false
EVAL_MODE: false
LOCAL_RANK: 0
MODEL:
  ASMLP:
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    IN_CHANS: 3
    MLP_RATIO: 4.0
    PATCH_NORM: true
    PATCH_SIZE: 4
    SHIFT_SIZE: 3
  DROP_PATH_RATE: 0.2
  DROP_RATE: 0.0
  LABEL_SMOOTHING: 0.1
  NAME: asmlp_tiny_patch4_shift5_224
  NUM_CLASSES: 1000
  RESUME: ''
  TYPE: asmlp
OUTPUT: output/asmlp_tiny_patch4_shift5_224/default
PRINT_FREQ: 10
SAVE_FREQ: 1
SEED: 0
TAG: default
TEST:
  CROP: true
THROUGHPUT_MODE: false
TRAIN:
  ACCUMULATION_STEPS: 2
  AUTO_RESUME: true
  BASE_LR: 0.000125
  CLIP_GRAD: 5.0
  EPOCHS: 300
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    NAME: cosine
  MIN_LR: 1.25e-06
  OPTIMIZER:
    BETAS:
    - 0.9
    - 0.999
    EPS: 1.0e-08
    MOMENTUM: 0.9
    NAME: adamw
  START_EPOCH: 0
  USE_CHECKPOINT: false
  WARMUP_EPOCHS: 20
  WARMUP_LR: 1.25e-07
  WEIGHT_DECAY: 0.05

in part /root/fake_data/ImageNet-Zip/
./train_pkl/samples_bytes_0.pkl
global_rank 0 cached 0/50000 takes 0.00s per block
global_rank 0 cached 5000/50000 takes 1.76s per block
global_rank 0 cached 10000/50000 takes 1.71s per block
global_rank 0 cached 15000/50000 takes 1.76s per block
global_rank 0 cached 20000/50000 takes 1.77s per block
global_rank 0 cached 25000/50000 takes 1.88s per block
global_rank 0 cached 30000/50000 takes 1.87s per block
global_rank 0 cached 35000/50000 takes 1.83s per block
global_rank 0 cached 40000/50000 takes 1.80s per block
global_rank 0 cached 45000/50000 takes 1.86s per block
local rank 0 / global rank 0 successfully build train dataset
in part /root/fake_data/ImageNet-Zip/
./val_pkl/samples_bytes_0.pkl
global_rank 0 cached 0/50000 takes 0.00s per block
global_rank 0 cached 5000/50000 takes 1.54s per block
global_rank 0 cached 10000/50000 takes 1.60s per block
global_rank 0 cached 15000/50000 takes 1.39s per block
global_rank 0 cached 20000/50000 takes 1.48s per block
global_rank 0 cached 25000/50000 takes 1.40s per block
global_rank 0 cached 30000/50000 takes 1.24s per block
global_rank 0 cached 35000/50000 takes 1.53s per block
global_rank 0 cached 40000/50000 takes 1.52s per block
global_rank 0 cached 45000/50000 takes 1.46s per block
local rank 0 / global rank 0 successfully build val dataset
[2021-07-22 09:02:18 asmlp_tiny_patch4_shift5_224](main.py 76): INFO Creating model:asmlp/asmlp_tiny_patch4_shift5_224
[2021-07-22 09:02:18 asmlp_tiny_patch4_shift5_224](main.py 79): INFO AS_MLP(
(patch_embed): PatchEmbed(
(proj): Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4))
(norm): GroupNorm(1, 96, eps=1e-05, affine=True)
)
(pos_drop): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0): BasicLayer(
dim=96, input_resolution=(56, 56), depth=2
(blocks): ModuleList(
(0): AxialShiftedBlock(
dim=96, input_resolution=(56, 56), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 96, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=96, shift_size=3
(conv1): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 96, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 96, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): Identity()
(norm2): GroupNorm(1, 96, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(96, 384, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(384, 96, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): AxialShiftedBlock(
dim=96, input_resolution=(56, 56), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 96, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=96, shift_size=3
(conv1): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 96, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 96, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 96, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(96, 384, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(384, 96, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(56, 56), dim=96
(reduction): Conv2d(384, 192, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm): GroupNorm(1, 384, eps=1e-05, affine=True)
)
)
(1): BasicLayer(
dim=192, input_resolution=(28, 28), depth=2
(blocks): ModuleList(
(0): AxialShiftedBlock(
dim=192, input_resolution=(28, 28), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 192, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=192, shift_size=3
(conv1): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 192, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 192, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 192, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(192, 768, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(768, 192, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): AxialShiftedBlock(
dim=192, input_resolution=(28, 28), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 192, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=192, shift_size=3
(conv1): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 192, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 192, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 192, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(192, 768, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(768, 192, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(28, 28), dim=192
(reduction): Conv2d(768, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm): GroupNorm(1, 768, eps=1e-05, affine=True)
)
)
(2): BasicLayer(
dim=384, input_resolution=(14, 14), depth=6
(blocks): ModuleList(
(0): AxialShiftedBlock(
dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=384, shift_size=3
(conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): AxialShiftedBlock(
dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=384, shift_size=3
(conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
(2): AxialShiftedBlock(
dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=384, shift_size=3
(conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
(3): AxialShiftedBlock(
dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=384, shift_size=3
(conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
(4): AxialShiftedBlock(
dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=384, shift_size=3
(conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
(5): AxialShiftedBlock(
dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=384, shift_size=3
(conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 384, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 384, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(14, 14), dim=384
(reduction): Conv2d(1536, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm): GroupNorm(1, 1536, eps=1e-05, affine=True)
)
)
(3): BasicLayer(
dim=768, input_resolution=(7, 7), depth=2
(blocks): ModuleList(
(0): AxialShiftedBlock(
dim=768, input_resolution=(7, 7), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 768, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=768, shift_size=3
(conv1): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 768, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 768, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 768, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(768, 3072, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(3072, 768, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): AxialShiftedBlock(
dim=768, input_resolution=(7, 7), shift_size=3, mlp_ratio=4.0
(norm1): GroupNorm(1, 768, eps=1e-05, affine=True)
(axial_shift): AxialShift(
dim=768, shift_size=3
(conv1): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1))
(conv2_1): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1))
(conv2_2): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1))
(conv3): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1))
(actn): GELU()
(norm1): GroupNorm(1, 768, eps=1e-05, affine=True)
(norm2): GroupNorm(1, 768, eps=1e-05, affine=True)
(shift_dim2): Shift()
(shift_dim3): Shift()
)
(drop_path): DropPath()
(norm2): GroupNorm(1, 768, eps=1e-05, affine=True)
(mlp): Mlp(
(fc1): Conv2d(768, 3072, kernel_size=(1, 1), stride=(1, 1))
(act): GELU()
(fc2): Conv2d(3072, 768, kernel_size=(1, 1), stride=(1, 1))
(drop): Dropout(p=0.0, inplace=False)
)
)
)
)
)
(norm): GroupNorm(1, 768, eps=1e-05, affine=True)
(avgpool): AdaptiveAvgPool2d(output_size=1)
(head): Linear(in_features=768, out_features=1000, bias=True)
)
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
[2021-07-22 09:02:19 asmlp_tiny_patch4_shift5_224](main.py 88): INFO number of params: 28282696
[2021-07-22 09:02:19 asmlp_tiny_patch4_shift5_224](main.py 91): INFO number of GFLOPs: 4.3585536
All checkpoints founded in output/asmlp_tiny_patch4_shift5_224/default: []
[2021-07-22 09:02:19 asmlp_tiny_patch4_shift5_224](main.py 116): INFO no checkpoint found in output/asmlp_tiny_patch4_shift5_224/default, ignoring auto resume
[2021-07-22 09:02:19 asmlp_tiny_patch4_shift5_224](main.py 129): INFO Start training
[2021-07-22 09:02:24 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][0/781] eta 1:08:59 lr 0.000000 time 5.3000 (5.3000) loss 3.4992 (3.4992) grad_norm 2.2539 (2.2539) mem 8882MB
[2021-07-22 09:02:28 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][10/781] eta 0:10:30 lr 0.000000 time 0.3332 (0.8183) loss 3.4760 (3.4774) grad_norm 2.2970 (2.6936) mem 8882MB
[2021-07-22 09:02:31 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][20/781] eta 0:07:34 lr 0.000000 time 0.3261 (0.5978) loss 3.4740 (3.4767) grad_norm 2.3040 (2.6371) mem 8882MB
[2021-07-22 09:02:35 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][30/781] eta 0:06:33 lr 0.000000 time 0.3421 (0.5244) loss 3.4737 (3.4762) grad_norm 2.6535 (2.6483) mem 8882MB
[2021-07-22 09:02:38 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][40/781] eta 0:05:59 lr 0.000000 time 0.3265 (0.4854) loss 3.4857 (3.4764) grad_norm 2.1506 (2.6657) mem 8882MB
[2021-07-22 09:02:42 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][50/781] eta 0:05:37 lr 0.000001 time 0.3316 (0.4612) loss 3.4687 (3.4765) grad_norm 2.1028 (2.6646) mem 8882MB
[2021-07-22 09:02:46 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][60/781] eta 0:05:21 lr 0.000001 time 0.3295 (0.4454) loss 3.4673 (3.4756) grad_norm 2.2019 (2.6883) mem 8882MB
[2021-07-22 09:02:49 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][70/781] eta 0:05:09 lr 0.000001 time 0.3385 (0.4347) loss 3.4785 (3.4754) grad_norm 2.2327 (2.6915) mem 8882MB
[2021-07-22 09:02:53 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][80/781] eta 0:04:59 lr 0.000001 time 0.3303 (0.4267) loss 3.4808 (3.4752) grad_norm 2.3357 (2.6958) mem 8882MB
[2021-07-22 09:02:57 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][90/781] eta 0:04:50 lr 0.000001 time 0.3215 (0.4198) loss 3.4
`

niujinshuchong commented on August 18, 2024

@HantingChen Would you please try with a small dataset? You can replace the train data with the val data by running
mv train train_backup
ln -s val train
in the imagenet folder.

HantingChen commented on August 18, 2024

> @HantingChen Would you please try with a small dataset? You can replace the train data with the val data by running
> mv train train_backup
> ln -s val train
> in the imagenet folder.

I am using --eval mode, so it already uses the val data. I think this error may be caused by my environment. I will test it on another machine.

Thanks for your reply!

dongzelian commented on August 18, 2024

@HantingChen Hi, if you store the ImageNet dataset on an SSD, you can also use --cache-mode no; the training speed is similar.
