pytorch-lmdb's Introduction

pytorch-lmdb

Forked from https://github.com/Lyken17/Efficient-PyTorch/ and simplified. Fixed quite a few warnings and made it easier to use via command line. Tested on both Windows and Linux systems using Python 3.8.

Speed overview

Trained on the Cats versus Dogs dataset available on Kaggle. The results compare torchvision's ImageFolder (torchvision.datasets.ImageFolder) with our LMDB implementation. These are the results using a local SSD:

Timings for lmdb:
Avg data time: 0.011866736168764075
Avg batch time: 0.10090051865091129
Total data time: 2.325880289077759
Total batch time: 19.776501655578613

Timings for imagefolder: 
Avg data time: 0.017892257291443493 
Avg batch time: 0.1053010200967594  
Total data time: 3.506882429122925  
Total batch time: 20.638999938964844

These are the results using a network file system (NFS) drive:

Timings for lmdb:
Avg data time: 0.040608997247657
Avg batch time: 0.06778134983413074
Total data time: 7.9593634605407715
Total batch time: 13.285144567489624

Timings for imagefolder: 
Avg data time: 0.056209570291090985
Avg batch time: 0.08088788086054277
Total data time: 11.017075777053833
Total batch time: 15.854024648666382

LMDB

The converted LMDB file has the following layout:

key value
img-id1 (jpeg_raw1, label1)
img-id2 (jpeg_raw2, label2)
img-id3 (jpeg_raw3, label3)
... ...
img-idn (jpeg_rawn, labeln)
__keys__ [img-id1, img-id2, ... img-idn]
__len__ n

For details on reading and writing, please refer to the code.
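
To make the layout concrete, here is a minimal sketch of the same key/value scheme. It is not the repository's actual code: a plain dict stands in for the LMDB environment, pickle is an assumed serialization format, and the helper names build_store and read_sample are hypothetical.

```python
import pickle


def build_store(samples):
    """Build a dict that mimics the key/value layout described above.

    `samples` is an iterable of (jpeg_bytes, label) pairs. A plain dict stands
    in for the LMDB environment; pickle is an assumed serialization format.
    """
    store = {}
    keys = []
    for idx, (jpeg_raw, label) in enumerate(samples):
        key = f"img-id{idx + 1}".encode("ascii")  # LMDB keys are bytes
        store[key] = pickle.dumps((jpeg_raw, label))
        keys.append(key)
    # Two bookkeeping entries: the ordered key list and the record count.
    store[b"__keys__"] = pickle.dumps(keys)
    store[b"__len__"] = pickle.dumps(len(keys))
    return store


def read_sample(store, index):
    """Look up the index-th record via the __keys__ list."""
    keys = pickle.loads(store[b"__keys__"])
    jpeg_raw, label = pickle.loads(store[keys[index]])
    return jpeg_raw, label


# Fake "JPEG" payloads, only to exercise the layout.
store = build_store([(b"\xff\xd8cat", 0), (b"\xff\xd8dog", 1)])
print(pickle.loads(store[b"__len__"]))
print(read_sample(store, 1))
```

With the lmdb Python package, the same writes would typically go through txn.put(key, value) inside a write transaction opened with env.begin(write=True), and reads through txn.get(key).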

Convert ImageFolder to LMDB

The folder2lmdb script converts a default image-label folder structure into an LMDB file with the layout described above. For example, to run it on Linux, assuming the Dogs vs Cats dataset is in ~/pytorch-lmdb/data/cats_vs_dogs and has a subfolder called "train":

python folder2lmdb.py -f ~/pytorch-lmdb/data/cats_vs_dogs -s "train"
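
As an illustration of what a "default image-label structure" usually means, the sketch below assumes the torchvision ImageFolder convention: one subfolder per class, with class names sorted alphabetically and mapped to integer labels. The function list_samples is hypothetical and simplified (one directory level only); it is not taken from folder2lmdb.py.

```python
import os
import tempfile

IMG_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_samples(root):
    """Return (classes, samples) for a root/<class_name>/<image> layout."""
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = []
    for name in classes:
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.lower().endswith(IMG_EXTENSIONS):
                samples.append((os.path.join(class_dir, fname), class_to_idx[name]))
    return classes, samples


# Tiny demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for cls, fname in [("cat", "1.jpg"), ("dog", "2.jpg"), ("dog", "3.jpg")]:
        os.makedirs(os.path.join(root, cls), exist_ok=True)
        open(os.path.join(root, cls, fname), "wb").close()
    classes, samples = list_samples(root)
    labels = [label for _, label in samples]
    print(classes, labels)
```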

ImageFolderLMDB

ImageFolderLMDB is used in the same way as a torchvision.datasets dataset such as ImageFolder.

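from torch.utils.data import DataLoader
from main import ImageFolderLMDB  # the class is defined in main.py

# path: LMDB file created by folder2lmdb.py; transform and
# target_transform: optional callables, as in torchvision.datasets.ImageFolder
dst = ImageFolderLMDB(path, transform, target_transform)
loader = DataLoader(dst, batch_size=64)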

Run the test tool

The main script includes the ImageFolderLMDB class. It can be run from the command line and takes an ImageFolder path and an LMDB database path. It then runs training on the Dogs vs Cats dataset and prints the execution times of the two file storage strategies. For example, to run it on Linux, assuming the Dogs vs Cats dataset and the previously created LMDB file are both in ~/pytorch-lmdb/data/cats_vs_dogs:

python main.py -f ~/pytorch-lmdb/data/cats_vs_dogs/train -l ~/pytorch-lmdb/data/cats_vs_dogs/train.lmdb
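
The reported numbers distinguish data time from batch time. The sketch below shows one common way to measure the two, assuming "data time" is the time spent waiting for the loader to yield a batch and "batch time" is the full iteration including the training step. main.py may measure them differently; time_loader and step_fn are hypothetical names.

```python
import time


def time_loader(loader, step_fn):
    """Measure per-batch data-loading time and total per-batch time.

    Data time: how long the loop waited for the loader to yield a batch.
    Batch time: data time plus the time spent in step_fn(batch).
    """
    data_times, batch_times = [], []
    end = time.perf_counter()
    for batch in loader:
        data_times.append(time.perf_counter() - end)
        step_fn(batch)
        batch_times.append(time.perf_counter() - end)
        end = time.perf_counter()
    n = max(len(batch_times), 1)  # avoid division by zero on an empty loader
    return {
        "avg_data_time": sum(data_times) / n,
        "avg_batch_time": sum(batch_times) / n,
        "total_data_time": sum(data_times),
        "total_batch_time": sum(batch_times),
    }


# Stand-in loader and training step; a real run would pass a DataLoader
# and a function that does the forward/backward pass.
stats = time_loader(range(5), lambda batch: sum(range(1000)))
for name, value in stats.items():
    print(f"{name}: {value}")
```

Measuring both values in the same loop makes it easy to see how much of each iteration is spent waiting on storage rather than on computation.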

pytorch-lmdb's People

Contributors

thecml

pytorch-lmdb's Issues

lmdb read very slow when multi-processing

First of all, thank you for sharing your wonderful code.

Using your code, I'm trying to use PyTorch DDP. When num_workers > 1 or when using DDP, data loading can take a very long time.
When the LMDB file is cached, loading takes only about 1 ms, but when it is not cached, it sometimes takes tens of seconds. Do you know why?

When reading lmdb in a single process, it reads in about 0.2 seconds even if caching is not enabled.
