ffcv-imagenet's Introduction

ffcv ImageNet Training

A minimal, single-file PyTorch ImageNet training script designed for hackability. Run train_imagenet.py to get...

  • ...high accuracies on ImageNet
  • ...with as many lines of code as the PyTorch ImageNet example
  • ...in 1/10th the time.

Results

Train models more efficiently, either with 8 GPUs in parallel or by training 8 ResNet-18s at once.

See benchmark setup here: https://docs.ffcv.io/benchmarks.html.

Citation

If you use this setup in your research, cite:

@misc{leclerc2022ffcv,
    author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
    title = {ffcv},
    year = {2022},
    howpublished = {\url{https://github.com/libffcv/ffcv/}},
    note = {commit xxxxxxx}
}

(Make sure to replace xxxxxxx above with the hash of the commit used!)

Configurations

The configuration files corresponding to the above results are:

Link to Config  top_1  top_5  # Epochs  Time (mins)  Architecture  Setup
Link            0.784  0.941  88        77.2         ResNet-50     8 x A100
Link            0.780  0.937  56        49.4         ResNet-50     8 x A100
Link            0.772  0.932  40        35.6         ResNet-50     8 x A100
Link            0.766  0.927  32        28.7         ResNet-50     8 x A100
Link            0.756  0.921  24        21.7         ResNet-50     8 x A100
Link            0.738  0.908  16        14.9         ResNet-50     8 x A100
Link            0.724  0.903  88        187.3        ResNet-18     1 x A100
Link            0.713  0.899  56        119.4        ResNet-18     1 x A100
Link            0.706  0.894  40        85.5         ResNet-18     1 x A100
Link            0.700  0.889  32        68.9         ResNet-18     1 x A100
Link            0.688  0.881  24        51.6         ResNet-18     1 x A100
Link            0.669  0.868  16        35.0         ResNet-18     1 x A100

Training Models

First pip install the requirements file in this directory:

pip install -r requirements.txt

Then, generate an ImageNet dataset. To make the dataset used for the results above, run the following command (IMAGENET_DIR should point to a PyTorch-style ImageNet dataset):

# Required environment variables for the script:
export IMAGENET_DIR=/path/to/pytorch/format/imagenet/directory/
export WRITE_DIR=/your/path/here/

# Starting in the root of the Git repo:
cd examples;

# Serialize images with:
# - 500px side length maximum
# - 50% JPEG encoded
# - quality=90 JPEGs
./write_imagenet.sh 500 0.50 90

Then, choose a configuration from the configuration table. With the config file path in hand, train as follows:

# 8 GPU training (use only 1 for ResNet-18 training)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Set the visible GPUs according to the `world_size` configuration parameter
# Modify `data.in_memory` and `data.num_workers` based on your machine
python train_imagenet.py --config-file rn50_configs/<your config file>.yaml \
    --data.train_dataset=/path/to/train/dataset.ffcv \
    --data.val_dataset=/path/to/val/dataset.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder=/your/path/here

Adjust the configuration by either changing the passed YAML file or by specifying arguments via fastargs (i.e. how the dataset paths were passed above).
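
As a rough illustration of how dotted command-line overrides layer on top of values from a YAML file, the sketch below merges `--section.key=value` arguments into a nested dictionary. It is not the fastargs implementation: `apply_overrides` and the sample values are hypothetical, and fastargs performs its own parsing and type validation.

```python
def apply_overrides(config, args):
    """Return a copy of `config` with `--section.key=value` overrides applied.

    Hypothetical helper for illustration only; values stay as strings.
    """
    merged = {section: dict(values) for section, values in config.items()}
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            continue  # not an override of the form --section.key=value
        dotted, value = arg[2:].split("=", 1)
        if "." not in dotted:
            continue  # no section prefix, ignore in this sketch
        section, key = dotted.split(".", 1)
        merged.setdefault(section, {})[key] = value
    return merged


# Values as they might come from a YAML file (made up for this example).
yaml_defaults = {"data": {"num_workers": "8", "in_memory": "0"}}
cli_args = [
    "--data.num_workers=12",
    "--data.in_memory=1",
    "--logging.folder=/your/path/here",
]

config = apply_overrides(yaml_defaults, cli_args)
print(config["data"]["num_workers"])  # the command-line value wins
```

The point of the layering is that the YAML file holds a full, reproducible configuration, while the command line changes only machine-specific values such as paths and worker counts.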

Training Details

System setup. We trained on p4.24xlarge EC2 instances (8 A100s).

Dataset setup. Generally larger side length will aid in accuracy but decrease throughput:

  • ResNet-50 training: 50% JPEG 500px side length
  • ResNet-18 training: 10% JPEG 400px side length

Algorithmic details. We use a standard ImageNet training pipeline (à la the PyTorch ImageNet example) with only the following differences/highlights:

  • SGD optimizer with momentum and weight decay on all non-batchnorm parameters
  • Test-time augmentation over left/right flips
  • Progressive resizing from 160px to 192px: 160px training until 75% of the way through training (by epochs), then 192px until the end of training.
  • Validation set sizing according to "Fixing the train-test resolution discrepancy": 224px at test time.
  • Label smoothing
  • Cyclic learning rate schedule
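
The progressive resizing bullet above can be sketched as a simple step schedule. This follows the description in the list (160px for the first 75% of epochs, then 192px); the training script may switch resolutions differently, for example with a gradual ramp, so treat `resolution_for_epoch` and its parameter names as hypothetical.

```python
def resolution_for_epoch(epoch, total_epochs, low_res=160, high_res=192,
                         switch_fraction=0.75):
    """Training resolution for a 0-indexed epoch under a step schedule.

    Assumption: a hard switch at `switch_fraction` of the way through
    training, as described in the list above.
    """
    switch_epoch = int(total_epochs * switch_fraction)
    return low_res if epoch < switch_epoch else high_res


# For a 16-epoch run, the switch happens at epoch 12.
schedule = [resolution_for_epoch(epoch, 16) for epoch in range(16)]
print(schedule)
```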

Refer to the code and configuration files for a more exact specification. To obtain configurations we first gridded for hyperparameters at a 30 epoch schedule. Fixing these parameters, we then varied only the number of epochs (stretching the learning rate schedule across the number of epochs as motivated by Budgeted Training) and plotted the results above.
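
To make "stretching the learning rate schedule" concrete, here is a minimal sketch of a triangular schedule: a linear warmup to a peak, then a linear decay to zero at the final epoch. The function name, the peak value, and the peak epoch are assumptions chosen for illustration; the configuration files hold the real values.

```python
def triangular_lr(epoch, total_epochs, peak_lr, peak_epoch):
    """Linear warmup to `peak_lr` at `peak_epoch`, then linear decay to 0.

    `epoch` may be fractional. Assumes 0 < peak_epoch < total_epochs.
    """
    if epoch <= peak_epoch:
        return peak_lr * epoch / peak_epoch
    return peak_lr * (total_epochs - epoch) / (total_epochs - peak_epoch)


# The same shape stretched over two budgets (illustrative values only).
for total_epochs in (16, 88):
    midpoint = (total_epochs + 2) / 2  # halfway through the decay phase
    print(total_epochs, triangular_lr(midpoint, total_epochs, 1.0, 2))
```

With a fixed peak epoch, a longer budget only lengthens the decay phase, so every run ends at a learning rate of zero regardless of its length.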

FAQ

Why is the first epoch slow?

The first epoch can be slow if the dataset hasn't been cached in memory yet.

What if I can't fit my dataset in memory?

See this guide here.

Other questions

Please open up a GitHub discussion for non-bug related questions; if you find a bug please report it on GitHub issues.

ffcv-imagenet's People

Contributors

andrewilyas, bamps53, kentaroy47, lengstrom

ffcv-imagenet's Issues

DEFAULT_CROP_RATIO wrong?

The README gives the following settings, with an image resolution of 500px:

# Required environment variables for the script:
export IMAGENET_DIR=/path/to/pytorch/format/imagenet/directory/
export WRITE_DIR=/your/path/here/

# Starting in the root of the Git repo:
cd examples;

# Serialize images with:
# - 500px side length maximum
# - 50% JPEG encoded
# - quality=90 JPEGs
./write_imagenet.sh 500 0.50 90

However, train_imagenet.py defines the following constant:

DEFAULT_CROP_RATIO = 224/256

I think it should be 224/500. Is that right?
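
One way to reason about this question: a center-crop ratio is normally taken relative to the image's own shorter side, not to the maximum side length used when the dataset was written. The sketch below assumes the crop is a centered square whose side is `ratio` times the shorter side, which is then resized to the evaluation resolution. `center_crop_box` is a hypothetical helper, not the FFCV decoder itself.

```python
def center_crop_box(height, width, ratio):
    """Return (top, left, side) of a centered square crop.

    Assumption: the crop side is `ratio` times the shorter image side.
    """
    side = int(ratio * min(height, width))
    top = (height - side) // 2
    left = (width - side) // 2
    return top, left, side


ratio = 224 / 256  # 0.875, the value of DEFAULT_CROP_RATIO
for height, width in [(256, 256), (375, 500), (500, 333)]:
    print((height, width), center_crop_box(height, width, ratio))
```

Under this reading, 224/256 keeps the central 87.5% of the shorter side for any stored image size, whereas 224/500 would keep only 44.8% of it.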

train_loss variable is None and val_loss variable in log not showing up

Hi,

Thank you for the awesome project and the training script. I was able to replicate the result for resnet18 for 16 epochs (as per the resnet18 dataset settings), which came out to roughly the same accuracy. My question is related to the train_loss variable coming out as None and the validation loss isn't being recorded in the log. Based on the top1/5 accuracies, it looks like it is working but it would still be nice to have both losses logged though. Can you confirm if this is the expected behavior?

=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.06347999721765518, 'top_5': 0.18308000266551971, 'val_time': 9.070343971252441, 'train_loss': None, 'epoch': 0}

Thanks!

Imagenet dataset preparation size

In an attempt to replicate results as a sanity test, I ran the data preparation script as ./write_imagenet.sh 500 0.50 90 in its default configuration on the ImageNet dataset. I can see from the documentation provided at https://docs.ffcv.io/benchmarks.html that initializing the writer with RGBImageField(write_mode=proportion, compress_probability=0.5, max_resolution=512, jpeg_quality=90) should generate a dataset of size 202.04 GB. However, when I ran this myself, I got a train dataset of 337 GB and a val dataset of 15 GB.

I am wondering if the compress_probability value used in the documentation at https://docs.ffcv.io/benchmarks.html was higher than 0.5, which would lead to a smaller dataset than the one I got. It's a little unclear why my train dataset is roughly 67% larger (337 GB vs. 202.04 GB) with similar configuration values.

I'm also a bit confused with the comment below, as per my understanding using prob=0.5 means that you use JPEG encoding for 50% of the images, and raw pixel values for 50% of the images (not 90%?)

# Serialize images with:
# - 500px side length maximum
# - 50% JPEG encoded, 90% raw pixel values
# - quality=90 JPEGs
./write_imagenet.sh 500 0.50 90

yaml file for imagenet writer hyperparameters?

I couldn't find the YAML file for the ImageNet writer hyperparameters (listed below):

@param('dataset')
@param('split')
@param('data_dir')
@param('write_path')
@param('max_resolution')
@param('num_workers')
@param('chunk_size')
@param('subset')
@param('jpeg_quality')
@param('write_mode')
@param('compress_probability')

Are they all set to the defaults in the source code (shown below)?

Section('cfg', 'arguments to give the writer').params(
    dataset=Param(And(str, OneOf(['cifar', 'imagenet'])), 'Which dataset to write', default='imagenet'),
    split=Param(And(str, OneOf(['train', 'val'])), 'Train or val set', required=True),
    data_dir=Param(str, 'Where to find the PyTorch dataset', required=True),
    write_path=Param(str, 'Where to write the new dataset', required=True),
    write_mode=Param(str, 'Mode: raw, smart or jpg', required=False, default='smart'),
    max_resolution=Param(int, 'Max image side length', required=True),
    num_workers=Param(int, 'Number of workers to use', default=16),
    chunk_size=Param(int, 'Chunk size for writing', default=100),
    jpeg_quality=Param(float, 'Quality of jpeg images', default=90),
    subset=Param(int, 'How many images to use (-1 for all)', default=-1),
    compress_probability=Param(float, 'compress probability', default=None)
)

How to enable Multi-GPU training (1 model, multiple GPUs) under the server with limited memory?

Description

Hi, @lengstrom . Thanks for your wonderful work!

My goal is to train a ResNet18 on ImageNet on my server using a multi-GPU training strategy to speed up the training process. The server has 4 RTX 2080 Ti GPUs and 46 GB of memory, which is not large enough to load ImageNet into memory.

I have read the instructions at https://docs.ffcv.io/parameter_tuning.html (Scenario: Large scale datasets and Scenario: Multi-GPU training (1 model, multiple GPUs)).

Right now, I can run a ResNet18 on a single card by using os_cache=False. However, if I use in_memory=0 and distributed = 1 to run the provided train_imagenet.py code as follows, some errors are reported, which are listed at the bottom. Would you please tell me how to solve this issue?


Command

python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    ... \
    --data.in_memory=0 \
    --training.distributed=1

Message

Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.

=> Logging in ...

Not enough memory; try setting quasi-random ordering
(OrderOption.QUASI_RANDOM) in the dataloader constructor's order argument.

Full error below:
0%| | 0/1251 [00:01<?, ?it/s]
Exception ignored in: <function EpochIterator.__del__ at 0x7f528d4f04c0>
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 161, in __del__
    self.close()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 158, in close
    self.memory_context.__exit__(None, None, None)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 59, in __exit__
    self.executor.__exit__(*args)
AttributeError: 'ProcessCacheContext' object has no attribute 'executor'
Traceback (most recent call last):
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 510, in <module>
    ImageNetTrainer.launch_from_args()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 461, in launch_from_args
    ch.multiprocessing.spawn(cls._exec_wrapper, nprocs=world_size, join=True)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 468, in _exec_wrapper
    cls.exec(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 478, in exec
    trainer.train()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 300, in train
    train_loss = self.train_loop(epoch)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 361, in train_loop
    for ix, (images, target) in enumerate(iterator):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/loader.py", line 214, in __iter__
    return EpochIterator(self, selected_order)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 43, in __init__
    raise e
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 37, in __init__
    self.memory_context.__enter__()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 32, in __enter__
    self.memory = np.zeros((self.schedule.num_slots, self.page_size),
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 229. GiB for an array with shape (29251, 8388608) and data type uint8
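
The size in the final error line follows directly from the array shape it reports. A short check of the arithmetic, using only the numbers in the message and the 46 GB figure from the description above:

```python
num_slots = 29251      # first dimension of the requested array
page_size = 8388608    # second dimension: bytes per slot
bytes_per_gib = 2 ** 30

total_gib = num_slots * page_size / bytes_per_gib
print(f"{total_gib:.1f} GiB requested")  # → 228.5 GiB requested

# Each slot is exactly 8 MiB, so the request is about 29251 * 8 MiB.
print(page_size // 2 ** 20, "MiB per slot")  # → 8 MiB per slot
```

The request is roughly five times the server's 46 GB of RAM, which is consistent with the "Not enough memory" hint printed before the traceback.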

PR for resuming training from checkpoint and validate model loaded from disk

Hello! I have a modified FFCV branch which I have been using to load a model from a checkpoint (to avoid mid-training disconnect/random failures), as well as validate the final_weights.pt loaded from disk (eg, if we wish to run it on another val.ffcv file).

I would gladly raise a PR for review for this, if you'd like this feature to be added.

ImportError: libopencv_imgproc.so

Hi,
After downgrading torch (and torchvision) version from 1.10 to 1.9, the import ffcv command raises ImportError:

File "/my_home/miniconda3/envs/test/lib/python3.9/site-packages/ffcv/libffcv.py", line 5, in <module>
    import ffcv._libffcv
ImportError: libopencv_imgproc.so.405: cannot open shared object file: No such file or directory

To reproduce:

conda create -y -n test python=3.9 cupy pkg-config compilers libjpeg-turbo opencv pytorch torchvision cudatoolkit=11.3 numba -c pytorch -c conda-forge
conda activate test
pip install ffcv

Running python and importing ffcv works fine at this point. But if we try to reinstall pytorch for 1.9 version with:

conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge

the import ffcv command breaks. Reinstalling ffcv doesn't help.

Do you have an idea how to resolve this?
(installing 1.9 from the beginning works fine and doesn't break ffcv, but I guess that shouldn't be the preferred way of solving this issue)

Potential minor speedup: Gaussian blur is a separable 2d convolution

blurred = F.conv2d(x, self.blur_filter, stride=1, padding=(1, 1),

Not to be nitpicky, but this could actually be replaced with two "1d" convolutions, one for width and one for height, which would use ~2K operations instead of ~K^2:

import torch.nn.functional as F
from torch import Tensor

def separable_conv2d(inputs: Tensor, k_h: Tensor, k_w: Tensor) -> Tensor:
    kernel_size = max(k_h.shape[-2:])
    pad_amount = kernel_size // 2 #'same' padding.
    # Gaussian filter is separable:
    out_1 = F.conv2d(inputs, k_h, padding=(0, pad_amount))
    out_2 = F.conv2d(out_1, k_w, padding=(pad_amount, 0))
    return out_2
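
The separability claim can be checked without PyTorch. The sketch below uses plain nested lists and illustrative 3-tap weights (an assumption, not necessarily the filter in the training script) to confirm that a row pass followed by a column pass matches the full 2D filter built from their outer product.

```python
def conv2d_same(image, kernel):
    """Zero-padded 'same' 2D cross-correlation on nested lists."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[a][b]
            out[i][j] = acc
    return out


taps = [0.25, 0.5, 0.25]                 # illustrative 1D blur weights
row_kernel = [taps]                      # shape 1 x 3, filters along width
col_kernel = [[t] for t in taps]         # shape 3 x 1, filters along height
full_kernel = [[a * b for b in taps] for a in taps]  # 3 x 3 outer product

image = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(5)]

direct = conv2d_same(image, full_kernel)
separable = conv2d_same(conv2d_same(image, row_kernel), col_kernel)
max_diff = max(abs(d - s)
               for d_row, s_row in zip(direct, separable)
               for d, s in zip(d_row, s_row))
print(max_diff)
```

For a K x K kernel, the two passes cost about 2K multiplications per output pixel instead of K^2 (6 versus 9 here), so the saving grows with kernel size and is modest for a 3 x 3 filter.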

Reproducing Validation Numbers

In an attempt to replicate your numbers, we trained for 40 epochs on a single A100 GPU with the ffcv dataset files generated from the bash script provided with the config specified in rn50_40_epochs.yaml .

After training for ~5 hours, we observed top1 = 0.729 and top5 = 0.915, in contrast to your quoted numbers of 0.772 and 0.932 from the configuration table in the README. The primary difference was that we used 1xA100 instead of the 8xA100 that you used, and we observed a total training time roughly 8x what you quote (35.6 minutes for 8xA100).

I don't believe that using a single GPU instead of 8 should impact validation accuracy to this extent (4.3 percentage points for top-1 and 1.7 for top-5). Could you suggest why this might be happening, or whether it is indeed due to using a single A100 GPU instead of 8?

ValueError: could not broadcast input array from shape (80,) into shape (160,)

Hi there 👋

I am trying to write an object detection dataset

...
ds = YOLODataset(Path("/home/zuppif/Documents/neatly/detector/datasets/train"), padding=True)
print([el.shape for el in ds[0]])

writer = DatasetWriter("dataset.beton", {
    'image': RGBImageField(),
    'label': NDArrayField(shape=(ds.max_num_of_labels, 1), dtype=np.dtype(np.int64)),
    'bbox': NDArrayField(shape=(ds.max_num_of_labels, 4), dtype=np.dtype(np.float32)),
}, num_workers=4)


writer.from_indexed_dataset(ds)

YOLODataset returns a tuple with shapes [(1080, 1920, 3), (20, 1), (20, 4)]

I've got the following error

Traceback (most recent call last):
  File "/home/zuppif/miniconda3/envs/ffcv/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/zuppif/miniconda3/envs/ffcv/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zuppif/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/writer.py", line 113, in worker_job_indexed_dataset
    handle_sample(sample, dest_ix, field_names, metadata, allocator, fields)
  File "/home/zuppif/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/writer.py", line 51, in handle_sample
    field.encode(destination, field_value, allocator.malloc)
  File "/home/zuppif/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/fields/ndarray.py", line 99, in encode
    data_region[:] = field.reshape(-1).view('<u1')
ValueError: could not broadcast input array from shape (80,) into shape (160,)

It looks like it is trying to view the bboxes as 160 elements, but I'm not sure why.
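
Because the failing line views the array as '<u1', the shapes in the error are byte counts rather than element counts. A quick check of how many bytes each declared field would occupy, using only the shapes and dtypes shown above, may help narrow down which field is involved. One reading consistent with the numbers (an assumption, not something confirmed here) is that the label array arrives as int32 while the field declares int64.

```python
import struct


def nbytes(shape, fmt):
    """Bytes occupied by an array of `shape` whose items use struct format `fmt`."""
    count = 1
    for dim in shape:
        count *= dim
    return count * struct.calcsize(fmt)


label_shape, bbox_shape = (20, 1), (20, 4)

print(nbytes(label_shape, "<q"))  # int64 labels, as declared  → 160
print(nbytes(label_shape, "<i"))  # int32 labels               → 80
print(nbytes(bbox_shape, "<f"))   # float32 boxes, as declared → 320
```

The 160-byte destination matches the int64 label field, and an 80-byte source matches the same shape stored as int32, so casting the labels with `astype(np.int64)` before writing would be a reasonable thing to try.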

Thanks a lot in advance

Cheers,

Fra

A complete example for imagenet data loading

I've been trying to use your FFCV data loader for imagenet training. I find the provided example hard to follow as you use progressive resizing. I wonder if you could provide a complete example with the most commonly used resolution 224.

I have also coded it up myself, but I found the validation accuracy is significantly lower than the training accuracy in my case (see attached code snippet below). For example, after 3 epochs, the training ACC is around 40%, but the validation is only 15%.

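from pathlib import Path
from typing import List

import numpy as np
from ffcv.fields.basics import IntDecoder
from ffcv.fields.rgb_image import (CenterCropRGBImageDecoder,
                                   RandomResizedCropRGBImageDecoder)
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import (NormalizeImage, RandomHorizontalFlip, Squeeze,
                             ToDevice, ToTensor, ToTorchImage)

# Assumed values: the usual ImageNet statistics scaled to the 0-255 range.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]) * 255
IMAGENET_STD = np.array([0.229, 0.224, 0.225]) * 255

def get_ffcv_trainloader(train_dataset, device, batch_size, num_workers=12, in_memory=True):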
    train_path = Path(train_dataset)
    assert train_path.is_file()

    decoder = RandomResizedCropRGBImageDecoder((224, 224))
    image_pipeline: List[Operation] = [
        decoder,
        RandomHorizontalFlip(),
        ToTensor(),
        ToDevice(device, non_blocking=True),
        ToTorchImage(),
        NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float32)
    ]

    label_pipeline: List[Operation] = [
        IntDecoder(),
        ToTensor(),
        Squeeze(),
        ToDevice(device, non_blocking=True)
    ]

    order = OrderOption.QUASI_RANDOM
    loader = Loader(train_dataset,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    order=order,
                    os_cache=in_memory,
                    drop_last=True,
                    pipelines={
                        'image': image_pipeline,
                        'label': label_pipeline
                    })

    return loader


def get_ffcv_valloader(val_dataset, device, batch_size, num_workers=12):
    val_path = Path(val_dataset)
    assert val_path.is_file()
    cropper = CenterCropRGBImageDecoder((224, 224), ratio=224/256)
    image_pipeline = [
        cropper,
        ToTensor(),
        ToDevice(device, non_blocking=True),
        ToTorchImage(),
        NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float32)
    ]

    label_pipeline = [
        IntDecoder(),
        ToTensor(),
        Squeeze(),
        ToDevice(device, non_blocking=True)
    ]

    loader = Loader(val_dataset,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    order=OrderOption.SEQUENTIAL,
                    drop_last=False,
                    pipelines={
                        'image': image_pipeline,
                        'label': label_pipeline
                    })
    return loader

Error in val_loop

Hi, I'm encountering the following error in val_loop:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/dist-packages/ffcv/loader/epoch_iterator.py", line 84, in run
    result = self.run_pipeline(b_ix, ixes, slot, events[slot])
  File "/usr/local/lib/python3.10/dist-packages/ffcv/loader/epoch_iterator.py", line 146, in run_pipeline
    results = stage_code(**args)
  File "", line 2, in stage_code_1
  File "/usr/local/lib/python3.10/dist-packages/ffcv/transforms/ops.py", line 56, in to_device
    dst.copy_(inp, non_blocking=self.non_blocking)
RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See https://github.com/pytorch/rfcs/pull/17 for more details.

The training loop works fine.

Any idea of the causes and how to solve it?

ValueError(“total size of new array must be changed”)

Hi, I’m trying to train a model by following the guide.
But I get a ValueError and a SystemError when I run the following command:

python train_imagenet.py --config-file rn50_configs/<your config file>.yaml \
--data.train_dataset=/path/to/train/dataset.ffcv \
--data.val_dataset=/path/to/val/dataset.ffcv \
--data.num_workers=12 --data.in_memory=1 \
--logging.folder=/your/path/here

[Screenshot 2022-05-23 1:27:25 PM]

[Screenshot 2022-05-23 1:27:32 PM]

How can I solve this?
Thanks in advance for your replies.

Training extremely slow

Hello,

I closely followed the README and launched training with the following command on a server with 8 V100 GPUs:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train_imagenet.py --config-file rn50_configs/rn50_88_epochs.yaml \
    --data.train_dataset=$HOME/data/imagenet_ffcv/train_500_0.50_90.ffcv \
    --data.val_dataset=$HOME/data/imagenet_ffcv/val_500_0.50_90.ffcv \
    --data.num_workers=3 --data.in_memory=1 \
    --logging.folder=$HOME/experiments/ffcv/rn50_88_epochs

Training took almost an hour per epoch, and the second epoch is almost as slow as the first one. The output of the log file is as follows:

cat ~/experiments/ffcv/rn50_88_epochs/d9ef0d7f-17a3-4e57-8d93-5e7c9a110d66/log 
{"timestamp": 1650641704.0822473, "relative_time": 2853.3256430625916, "current_lr": 0.8473609134615385, "top_1": 0.07225999981164932, "top_5": 0.19789999723434448, "val_time": 103.72948884963989, "train_loss": null, "epoch": 0}
{"timestamp": 1650644358.3394542, "relative_time": 5507.582849979401, "current_lr": 1.6972759134615385, "top_1": 0.16143999993801117, "top_5": 0.3677400052547455, "val_time": 92.9171462059021, "train_loss": null, "epoch": 1}

Is there anything I should check?

Thank you in advance for your response.

Question about the parameter of the `write_imagenet.py`

@lengstrom
Thanks for your wonderful work!

I have a question about the parameters of the write_imagenet.py.

From the repo of ffcv, we can see
https://github.com/libffcv/ffcv/blob/bfd9b3d85e31360fada2ecf63bea5602e4774ba3/ffcv/fields/rgb_image.py#L337

        write_mode = self.write_mode
        as_jpg = None

        if write_mode == 'smart':
            as_jpg = encode_jpeg(image, self.jpeg_quality)
            write_mode = 'raw'
            if self.smart_threshold is not None:
                if image.nbytes > self.smart_threshold:
                    write_mode = 'jpg'
        elif write_mode == 'proportion':
            if np.random.rand() < self.proportion:
                write_mode = 'jpg'
            else:
                write_mode = 'raw'

The default write mode in https://github.com/libffcv/ffcv-imagenet/blob/main/write_imagenet.py is smart,
and the smart_threshold is None.
So the script is actually running in RAW write mode?
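
The branch logic quoted above can be traced with a small stand-alone function. This is a simplified restatement of the quoted snippet only (no JPEG encoding, no other code paths); `choose_write_mode` and its argument names are hypothetical.

```python
import random


def choose_write_mode(write_mode, image_nbytes, smart_threshold=None,
                      proportion=None, rng=random):
    """Return 'jpg' or 'raw' following the branches of the quoted snippet."""
    if write_mode == "smart":
        chosen = "raw"
        if smart_threshold is not None and image_nbytes > smart_threshold:
            chosen = "jpg"
        return chosen
    if write_mode == "proportion":
        return "jpg" if rng.random() < proportion else "raw"
    return write_mode  # 'raw' or 'jpg' passed through unchanged


# 'smart' with no threshold never switches to JPEG, whatever the image size.
print(choose_write_mode("smart", image_nbytes=10 ** 9))

# 'proportion' with 0.5 compresses about half of the images.
rng = random.Random(0)
modes = [choose_write_mode("proportion", 0, proportion=0.5, rng=rng)
         for _ in range(10000)]
jpg_fraction = modes.count("jpg") / len(modes)
print(round(jpg_fraction, 2))
```

Read this way, the snippet does fall back to raw storage when the mode is smart and no threshold is set. Whether the shell script actually leaves the mode at its default is a separate question that the snippet alone does not answer.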

Related issues are #1

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 2.73 TiB for an array with shape (124994103978,) and data type [('sample_id', '<u8'), ('ptr', '<u8'), ('size', '<u8')]

I want to train on ImageNet-21k rather than ImageNet-1k, so I downloaded ImageNet-21k (winter) from the official ImageNet site.

I then ran write_imagenet.sh with the default argument values (500 0.50 90), pointing it at the ImageNet-21k (winter) dataset.

Finally, I ran train_imagenet.py and got this error:

"numpy.core._exceptions._ArrayMemoryError: Unable to allocate 2.73 TiB for an array with shape (124994103978,) and data type [('sample_id', '<u8'), ('ptr', '<u8'), ('size', '<u8')]"

What does this error mean, and why does it occur? (When I build a dataset from ImageNet-1k and run training, there is no error.)
Is there any way to train on ImageNet-21k?

(For reference: the FFCV dataset produced by write_imagenet.sh is 2.73 TB in size.)
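
The 2.73 TiB in the message can be reproduced from the reported shape and dtype alone: each record holds three 8-byte unsigned integers. A short check of the arithmetic:

```python
record_bytes = 8 + 8 + 8        # sample_id, ptr and size are each '<u8'
num_records = 124994103978      # array length from the error message

total_bytes = record_bytes * num_records
print(f"{total_bytes / 2 ** 40:.2f} TiB")  # → 2.73 TiB
```

A record count in the hundreds of billions is far more than the number of images in any ImageNet release, so the table length itself looks suspect. That is only a guess from the numbers; the message does not say where the length comes from.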

batch_size=1 causes error when Squeeze() is in the "label" pipeline

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/cbotos/miniconda3/envs/ffcv/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/cbotos/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 79, in run
    result = self.run_pipeline(b_ix, ixes, slot, events[slot])
  File "/home/cbotos/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 138, in run_pipeline
    return tuple(x[:len(batch_indices)] for x in args)
  File "/home/cbotos/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 138, in <genexpr>
    return tuple(x[:len(batch_indices)] for x in args)
IndexError: slice() cannot be applied to a 0-dim tensor.

I would say that this error is sorta unexpected, but I could have anticipated it since the Squeeze is also squishing the batch dimension in this case (if I understood the situation correctly)
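
The shape reasoning can be sketched without any tensor library. The functions below operate on shape tuples only and assume that labels arrive as (batch, 1) and that an unrestricted squeeze drops every size-1 axis, as `torch.squeeze` does when no dimension is given. Both helper names are hypothetical.

```python
def squeeze_all(shape):
    """Shape after removing every axis of length 1."""
    return tuple(dim for dim in shape if dim != 1)


def squeeze_axis(shape, axis):
    """Shape after removing only `axis`, and only if its length is 1."""
    if shape[axis] != 1:
        return shape
    return shape[:axis] + shape[axis + 1:]


print(squeeze_all((512, 1)))      # → (512,)
print(squeeze_all((1, 1)))        # → ()
print(squeeze_axis((1, 1), 1))    # → (1,)
```

With a batch of one, the unrestricted version leaves a 0-dim result, which matches the "0-dim tensor" in the IndexError; squeezing only the trailing axis would keep the batch dimension, if the pipeline operation allows a specific axis to be chosen.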
