
torchdatasets's Introduction

Package renamed to torchdatasets!

  • Use map, apply, reduce or filter directly on Dataset objects
  • cache data in RAM/disk or via your own method (partial caching supported)
  • Full support for PyTorch's Dataset and IterableDataset
  • General torchdatasets.maps like Flatten or Select
  • Extensible interface (your own cache methods, cache modifiers, maps etc.)
  • Useful torchdatasets.datasets classes designed for general tasks (e.g. file reading)
  • Support for torchvision datasets (e.g. ImageFolder, MNIST, CIFAR10) via td.datasets.WrapDataset
  • Minimal overhead (single call to super().__init__())

💡 Examples

Check documentation here: https://szymonmaszke.github.io/torchdatasets

General example

  • Create an image dataset, convert it to Tensors, cache it and concatenate it with smoothed labels:
import pathlib

from PIL import Image

import torchdatasets as td
import torchvision

class Images(td.Dataset): # The only difference: inherit from td.Dataset
    def __init__(self, path: str):
        super().__init__() # This is the only change needed in __init__
        self.files = list(pathlib.Path(path).glob("*"))

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)


images = Images("./data").map(torchvision.transforms.ToTensor()).cache()

You can concatenate the above dataset with another (say, labels) and iterate over both as usual:

for data, label in images | labels:
    # Do whatever you want with your data
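
For illustration, here is a minimal sketch of what such a labels dataset could look like; the labels.csv path, number of classes and smoothing factor are made up:

import csv

import torch
import torchdatasets as td

class Labels(td.Dataset):  # same td.Dataset inheritance as Images above
    def __init__(self, path: str, num_classes: int = 10, smoothing: float = 0.1):
        super().__init__()
        with open(path) as f:
            self.labels = [int(row[0]) for row in csv.reader(f)]
        self.num_classes = num_classes
        self.smoothing = smoothing

    def __getitem__(self, index):
        # Smoothed one-hot target: the correct class gets 1 - smoothing,
        # the remaining probability mass is spread over the other classes
        target = torch.full(
            (self.num_classes,), self.smoothing / (self.num_classes - 1)
        )
        target[self.labels[index]] = 1.0 - self.smoothing
        return target

    def __len__(self):
        return len(self.labels)

labels = Labels("./labels.csv")
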
  • Cache the first 1000 samples in memory and save the rest on disk in the folder ./cache:
images = (
    ImageDataset.from_folder("./data").map(torchvision.transforms.ToTensor())
    # First 1000 samples in memory
    .cache(td.modifiers.UpToIndex(1000, td.cachers.Memory()))
    # Samples from 1000 to the end saved with Pickle on disk
    .cache(td.modifiers.FromIndex(1000, td.cachers.Pickle("./cache")))
    # You can define your own cachers, modifiers, see docs
)

To see what else you can do, please check the torchdatasets documentation.

Integration with torchvision

Using torchdatasets you can easily split torchvision datasets and apply augmentation only to the training part of the data:

import torch
import torchvision

import torchdatasets as td

# Wrap torchvision dataset with WrapDataset
dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder("./images"))

# Split dataset into train/validation/test (roughly 60/20/20)
train_size = int(0.6 * len(dataset))
validation_size = int(0.2 * len(dataset))
test_size = len(dataset) - train_size - validation_size

train_dataset, validation_dataset, test_dataset = torch.utils.data.random_split(
    dataset, (train_size, validation_size, test_size)
)

# Apply torchvision mappings ONLY to train dataset
train_dataset.map(
    td.maps.To(
        torchvision.transforms.Compose(
            [
                torchvision.transforms.RandomResizedCrop(224),
                torchvision.transforms.RandomHorizontalFlip(),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ),
    # Apply the transformation only to the zeroth element of each sample (the image);
    # element one is the label and is left unchanged
    0,
)

Note that you can use td.datasets.WrapDataset with any existing torch.utils.data.Dataset instance to give it additional caching and mapping powers!
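
As a short sketch of that idea (the MNIST path and transform below are illustrative), wrapping an existing torchvision dataset immediately gives it .cache() and .map():

import torchvision

import torchdatasets as td

# Any existing torch.utils.data.Dataset can be wrapped
mnist = td.datasets.WrapDataset(
    torchvision.datasets.MNIST(
        "./mnist", download=True, transform=torchvision.transforms.ToTensor()
    )
)

# Decoded tensors are cached in memory after the first pass over the data
mnist = mnist.cache()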

🔧 Installation

๐Ÿ pip

Latest release:

pip install --user torchdatasets

Nightly:

pip install --user torchdatasets-nightly

๐Ÿ‹ Docker

A standalone CPU image and various GPU-enabled images are available on Docker Hub.

For a CPU quickstart, run:

docker pull szymonmaszke/torchdatasets:18.04

Nightly builds are also available; just prefix the tag with nightly_. If you are going for a GPU image, make sure you have nvidia-docker installed and its runtime set.

โ“ Contributing

If you find an issue, or you think some functionality may be useful to others and fits this library, please open a new Issue or create a Pull Request.

To get an overview of things you can do to help this project, see the Roadmap.

torchdatasets's People

Contributors

bearpaw, carloalbertobarbano, imgbotapp, joanna-janos, neverix


torchdatasets's Issues

Document gotcha when using DataLoader with workers

Using .cache() (with the default memory cacher) does nothing when the Dataset is used in a multi-process DataLoader. This is a gotcha that should probably be pointed out in the documentation and the tutorial, as it is easy to overlook.

In my case I was dropping torchdatasets' cache into a program that already had its DataLoaders defined. The DataLoaders were initialized with a positive num_workers. It took me a while to figure out why the cache didn't seem to work at all.
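
A minimal sketch reproducing the gotcha (the dataset below is made up purely for illustration):

import torch
import torchdatasets as td

class Squares(td.Dataset):
    def __init__(self):
        super().__init__()

    def __getitem__(self, index):
        print(f"computing sample {index}")  # visible whenever the cache is missed
        return torch.tensor(index ** 2)

    def __len__(self):
        return 8

if __name__ == "__main__":
    dataset = Squares().cache()  # default in-memory cacher

    # With num_workers=0 the second epoch is served from the cache.
    # With num_workers > 0 each worker process holds its own copy of the
    # in-memory cache and discards it when it exits, so samples are
    # recomputed every epoch.
    loader = torch.utils.data.DataLoader(dataset, num_workers=2)
    for _ in range(2):
        for _ in loader:
            pass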

Multiple concatenation with logical or operator yields nested concatenation

Concatenation of two datasets with the | operator works as intended:
concat_2 = images | images

Concatenating more datasets (concat_3 = images | images | images), however, yields a nested concatenated dataset.

The code is equivalent to:
concat_3 = torchdata.datasets.ConcatDataset([torchdata.datasets.ConcatDataset([images, images]), images])

I'd argue a more intuitive result would be something equivalent to this instead:
concat_3 = torchdata.datasets.ConcatDataset([images, images, images])

In short, concatenating with the | operator to an already concatenated dataset should add the new dataset in the list of concatenations, instead of creating a nested concatenated dataset.

Apparent mismatch between official pip version `0.2.0` and GitHub tagged version of `0.2.0`

First, thanks for an elegant library that has saved me a significant amount of time over the past couple years.

Now, the problem: since the name change, I've been trying to refactor some old code to work with newer versions of torch (specifically torch==1.8.1+cu101). While doing so, I seem to have uncovered a confusing issue that shows up upon installation of torchdatasets. The problem is that neither pip install torchdatasets nor pip install torchdatasets==0.2.0 results in a version of the repo identical to the one tagged as 0.2.0 on GitHub. This became a problem for me when I tried simply importing:

(ffcv-test) jrose3@serrep3:/media/data/jacob/GitHub/ffcv/examples/cifar$ python
Python 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:42:07)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torchdatasets as torchdata
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torchdatasets/__init__.py", line 60, in <module>
    from . import cachers, datasets, maps, modifiers, samplers
  File "/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torchdatasets/datasets.py", line 28, in <module>
    from torch.utils.data import _typing
ImportError: cannot import name '_typing' from 'torch.utils.data' (/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torch/utils/data/__init__.py)

Examining the 2 implicated Python scripts (1 in torch and 1 in torchdatasets), I realized that the module torch.utils.data._typing isn't actually introduced into torch until version torch==1.9.0, while I'm currently using torch==1.8.1, and as far as I can tell the only stated requirement for the torchdatasets library is torch>=1.2.0 listed in requirements.txt.

Looking further into the torchdatasets file that relies on torch.utils.data._typing, namely torchdatasets.datasets.py, I found that it's only used once, and for a comically unnecessary type hint used in a placeholder class's definition!

class MetaIterableWrapper(MetaIterable, GenericMeta, _typing._DataPipeMeta): pass

My assumption is that this was introduced as part of an effort to integrate the new torch data pipe pattern, but at some point it leaked into the main repo and broke a bunch of other, significant assumptions necessary to install smoothly. Since I can only find it via my locally installed pip version and not on GitHub, I have no clear way of tracking down when it was introduced or by whom.

My recommendation is removing these 2 lines from the file torchdatasets/datasets.py hosted on pip for version 0.2.0 (I'm not sure if these can be revised without updating the version as well). Thoughts?

Pickle support for Storage will be removed in torch 1.5

Hi!
This is an awesome package, but I ran into a warning while using the Pickle cacher:

***/site-packages/torch/storage.py:34: FutureWarning: pickle support for Storage will be removed in 1.5. Use `torch.save` instead
  warnings.warn("pickle support for Storage will be removed in 1.5. Use `torch.save` instead", FutureWarning)

As far as I understand, it can be fixed by creating a new cacher that will use torch.save and torch.load. I was planning to open this as a PR, but I couldn't figure out how to set up the test environment so this is an issue instead.
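
For reference, a minimal sketch of such a cacher; it assumes the __contains__/__setitem__/__getitem__ interface that the built-in cachers follow and stores one file per sample index:

import pathlib

import torch

class TensorCacher:
    # Disk cacher using torch.save/torch.load instead of pickle;
    # the three methods below mirror the interface of td.cachers.Pickle
    def __init__(self, path: str):
        self.path = pathlib.Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def __contains__(self, index: int) -> bool:
        return (self.path / f"{index}.pt").exists()

    def __setitem__(self, index: int, data):
        torch.save(data, self.path / f"{index}.pt")

    def __getitem__(self, index: int):
        return torch.load(self.path / f"{index}.pt")

# usage: dataset.cache(TensorCacher("./cache"))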

HDF5 Support

Thanks for this amazing library. I was wondering, for large datasets with millions of images, would it make sense to cache everything in a single file (e.g., HDF5) instead of creating millions of cache files? Do you have any plan to support the HDF5 format? Thanks!
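
Until something like this exists in the library, a single-file cache could be sketched as a custom cacher backed by h5py (not a torchdatasets dependency); this assumes the same __contains__/__setitem__/__getitem__ cacher interface and that each sample is a single tensor:

import h5py
import torch

class HDF5Cacher:
    # All samples are stored as datasets inside one HDF5 file,
    # keyed by their integer sample index
    def __init__(self, path: str):
        self.file = h5py.File(path, "a")

    def __contains__(self, index: int) -> bool:
        return str(index) in self.file

    def __setitem__(self, index: int, data):
        self.file.create_dataset(str(index), data=data.numpy())

    def __getitem__(self, index: int):
        return torch.from_numpy(self.file[str(index)][()])

# usage: dataset.cache(HDF5Cacher("./cache.h5"))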

metaclass conflict

I get TypeError: metaclass conflict when importing torchdata.

Python Version: Python 3.6.13
PyTorch:

torch==1.7.1+cu110
torchaudio==0.7.2
torchdata==0.2.0
torchvision==0.8.2+cu110
In [1]: import torchdata as td
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-d739c1dd990c> in <module>
----> 1 import torchdata as td

~/git/envs/pyenv/lib/python3.6/site-packages/torchdata/__init__.py in <module>
     58 """
     59 
---> 60 from . import cachers, datasets, maps, modifiers, samplers
     61 from ._version import __version__
     62 from .datasets import Dataset, Iterable

~/git/envs/pyenv/lib/python3.6/site-packages/torchdata/datasets.py in <module>
    155 
    156 
--> 157 class Iterable(TorchIterable, _DatasetBase, metaclass=MetaIterable):
    158     r"""`torch.utils.data.IterableDataset` **dataset with extended capabilities**.
    159 

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
type(TorchIterable)
Out[2]: typing.GenericMeta

Dataset inherits from torch's Dataset, whose metaclass is already GenericMeta, but it also specifies MetaDataset as its metaclass, which causes the conflict.

You can reproduce with this

class M_A(type):
    pass

class M_B(type):
    pass

class A(metaclass=M_A):
    pass

class C(A, metaclass=M_B):
    pass

This can be solved with a wrapper:

class MetaDatasetWrapper(MetaDataset, GenericMeta): pass

class Dataset(TorchDataset, _DatasetBase, metaclass=MetaDatasetWrapper):

Support stratified subsampler

Hi! From what I can see, there is currently no simple way in PyTorch to perform stratified subsampling of the training dataset.
I think it fits this library's scope perfectly.
Let me know what you think about it.
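
For the record, a sketch of one way to get a stratified split today with scikit-learn (not a torchdatasets feature); the resulting subsets could then be wrapped with td.datasets.WrapDataset:

import torch
import torchvision
from sklearn.model_selection import train_test_split  # not a torchdatasets dependency

dataset = torchvision.datasets.ImageFolder("./images")
targets = dataset.targets  # one class label per sample

# Stratified 80/20 split of indices, preserving class proportions
train_indices, val_indices = train_test_split(
    list(range(len(dataset))), test_size=0.2, stratify=targets, random_state=42
)

train_dataset = torch.utils.data.Subset(dataset, train_indices)
val_dataset = torch.utils.data.Subset(dataset, val_indices)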

AttributeError: 'Subset' object has no attribute 'map'

Hello! Thanks so much for this wonderful library.

This is my first time using it, and I'm following the README example and a StackOverflow post to apply transformations after splitting the data. However, I am getting the above error when I try to run train_dataset.map(train_transform). Does the wrapper still work, or did I make a mistake somewhere?

dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder('./root'))

total_num = len(dataset)
train_num = int(0.7 * total_num)
val_num = int(0.2 * total_num)
test_num = total_num - train_num - val_num
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(
    dataset, (train_num, val_num, test_num)
)

train_dataset = train_dataset.map(train_transform)

pip install doesn't work in Google Colab

I was introduced to this library when I asked this question on Stack Overflow.

I was able to do a pip install and get my work done on my local machine.
But, I also need to be able to share some of the code with a teammate via Google Colab.
So, I put the local Jupyter notebook into Colab and tried !pip install torchdata.
Turns out that doesn't work. It gives the following error message.

ERROR: Could not find a version that satisfies the requirement torchdata (from versions: none)
ERROR: No matching distribution found for torchdata

Details of the Colab environment

OS: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9 (default, Apr 18 2020, 01:56:04)  [GCC 8.4.0]
numpy version: 1.18.4
future version: 0.16.0
PyTorch version: 1.5.0+cu101
Torchvision Version: 0.6.0+cu101

Is there any way I can get the library to work in Colab?
Thank you.

Python 3.6: cannot create a consistent method resolution order (MRO) for bases type, GenericMeta, _DataPipeMeta

Traceback (most recent call last):
  File "test-dataset.py", line 1, in <module>
    from dataset import func
  File "/root/torch-cache-test/dataset/func.py", line 3, in <module>
    from . import nocache
  File "/root/torch-cache-test/dataset/nocache.py", line 1, in <module>
    from . import utils
  File "/root/torch-cache-test/dataset/utils.py", line 8, in <module>
    import torchdatasets as td
  File "/opt/conda/lib/python3.6/site-packages/torchdatasets/__init__.py", line 60, in <module>
    from . import cachers, datasets, maps, modifiers, samplers
  File "/opt/conda/lib/python3.6/site-packages/torchdatasets/datasets.py", line 164, in <module>
    class MetaIterableWrapper(MetaIterable, GenericMeta, _typing._DataPipeMeta): pass
TypeError: Cannot create a consistent method resolution
order (MRO) for bases type, GenericMeta, _DataPipeMeta

Metaclass issue with Python 3.7/3.8

The metaclass trick doesn't seem to work on later versions of Python.
Any ideas?
This can easily be checked on Colab, for example (please see below).

Thanks for the amazing work here

!pip install torchdata
!python --version
Collecting torchdata
  Downloading torchdata-0.2.0-py3-none-any.whl (27 kB)
Requirement already satisfied: torch>=1.2.0 in /usr/local/lib/python3.7/dist-packages (from torchdata) (1.9.0+cu102)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch>=1.2.0->torchdata) (3.7.4.3)
Installing collected packages: torchdata
Successfully installed torchdata-0.2.0
Python 3.7.12

import torchdata as td
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-4-d739c1dd990c> in <module>()
----> 1 import torchdata as td

1 frames

/usr/local/lib/python3.7/dist-packages/torchdata/__init__.py in <module>()
     58 """
     59 
---> 60 from . import cachers, datasets, maps, modifiers, samplers
     61 from ._version import __version__
     62 from .datasets import Dataset, Iterable

/usr/local/lib/python3.7/dist-packages/torchdata/datasets.py in <module>()
    155 
    156 
--> 157 class Iterable(TorchIterable, _DatasetBase, metaclass=MetaIterable):
    158     r"""`torch.utils.data.IterableDataset` **dataset with extended capabilities**.
    159 

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

Benchmark

Can you share some benchmarks comparing using this library vs. not using it?

Thank you.

support for pytorch 1.3.0

First, this is a great package! It seems to bring the only cool part of TensorFlow (tf.data) to PyTorch. Thanks for the effort.

Is torch 1.3.0 supported in the current release?
