
vision's Introduction

torchvision


The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision.

Installation

Please refer to the official instructions to install the stable versions of torch and torchvision on your system.

To build from source, refer to our contributing page.

The following table shows the corresponding torchvision versions and the supported Python versions.

torch torchvision Python
main / nightly main / nightly >=3.8, <=3.12
2.3 0.18 >=3.8, <=3.12
2.2 0.17 >=3.8, <=3.11
2.1 0.16 >=3.8, <=3.11
2.0 0.15 >=3.8, <=3.11
older versions
torch torchvision Python
1.13 0.14 >=3.7.2, <=3.10
1.12 0.13 >=3.7, <=3.10
1.11 0.12 >=3.7, <=3.10
1.10 0.11 >=3.6, <=3.9
1.9 0.10 >=3.6, <=3.9
1.8 0.9 >=3.6, <=3.9
1.7 0.8 >=3.6, <=3.9
1.6 0.7 >=3.6, <=3.8
1.5 0.6 >=3.5, <=3.8
1.4 0.5 ==2.7, >=3.5, <=3.8
1.3 0.4.2 / 0.4.3 ==2.7, >=3.5, <=3.7
1.2 0.4.1 ==2.7, >=3.5, <=3.7
1.1 0.3 ==2.7, >=3.5, <=3.7
<=1.0 0.2 ==2.7, >=3.5, <=3.7

Image Backends

Torchvision currently supports the following image backends:

  • torch tensors
  • PIL images

Read more in our docs.
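
For illustration, a minimal sketch of applying the same transform to both backends (this assumes a recent torchvision where the transforms accept tensors as well as PIL images):

import torch
from PIL import Image
from torchvision import transforms

resize = transforms.Resize((224, 224))

pil_img = Image.new("RGB", (64, 64))       # PIL backend
tensor_img = torch.rand(3, 64, 64)         # tensor backend

out_pil = resize(pil_img)                  # returns a PIL image
out_tensor = resize(tensor_img)            # returns a tensor of shape (3, 224, 224)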

[UNSTABLE] Video Backend

Torchvision currently supports the following video backends:

  • pyav (default) - Pythonic binding for ffmpeg libraries.
  • video_reader - This needs ffmpeg to be installed and torchvision to be built from source. There shouldn't be any conflicting version of ffmpeg installed. Currently, this is only supported on Linux.
conda install -c conda-forge 'ffmpeg<4.3'
python setup.py install
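
At runtime, the backend can be selected with torchvision.set_video_backend; a minimal sketch (the video_reader backend must have been built from source as described above):

import torchvision

torchvision.set_video_backend("video_reader")   # default is "pyav"
print(torchvision.get_video_backend())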

Using the models in C++

TorchVision provides an example project showing how to use the models in C++ using JIT Script.

Installation from source:

mkdir build
cd build
# Add -DWITH_CUDA=on to enable CUDA support if needed
cmake ..
make
make install

Once installed, the library can be accessed in cmake (after properly configuring CMAKE_PREFIX_PATH) via the TorchVision::TorchVision target:

find_package(TorchVision REQUIRED)
target_link_libraries(my-target PUBLIC TorchVision::TorchVision)

The TorchVision package will also automatically look for the Torch package and add it as a dependency to my-target, so make sure that it is also available to cmake via the CMAKE_PREFIX_PATH.

For an example setup, take a look at examples/cpp/hello_world.

Python linking is disabled by default when compiling TorchVision with CMake; this allows you to run models without any Python dependency. In some special cases where TorchVision's operators are used from Python code, you may need to link to Python. This can be done by passing -DUSE_PYTHON=on to CMake.

TorchVision Operators

In order to get the torchvision operators registered with torch (e.g. for the JIT), all you need to do is ensure that you #include <torchvision/vision.h> in your project.

Documentation

You can find the API documentation on the pytorch website: https://pytorch.org/vision/stable/index.html

Contributing

See the CONTRIBUTING file for how to help out.

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

Pre-trained Model License

The pre-trained models provided in this library may have their own licenses or terms and conditions derived from the dataset used for training. It is your responsibility to determine whether you have permission to use the models for your use case.

More specifically, SWAG models are released under the CC-BY-NC 4.0 license. See SWAG LICENSE for additional details.

Citing TorchVision

If you find TorchVision useful in your work, please consider citing the following BibTeX entry:

@software{torchvision2016,
    title        = {TorchVision: PyTorch's Computer Vision library},
    author       = {TorchVision maintainers and contributors},
    year         = 2016,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/pytorch/vision}}
}


vision's Issues

Random transforms for both input and target?

In some scenarios (like semantic segmentation), we might want to apply the same random transform to both the input and the GT labels (cropping, flip, rotation, etc).
I think we can get this behaviour emulated in a segmentation dataset class by resetting the random seed before calling the transform for the labels.
This sounds a bit fragile though.

One other possibility is to have the transforms accept both inputs and targets as arguments.
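
For illustration, a minimal sketch of such a paired transform (the class name and two-argument interface are hypothetical, not an existing torchvision API):

import random
from PIL import Image

class PairedRandomHorizontalFlip(object):
    """Flips the image and its target together so they stay aligned."""
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, img, target):
        if random.random() < self.p:
            img = img.transpose(Image.FLIP_LEFT_RIGHT)
            target = target.transpose(Image.FLIP_LEFT_RIGHT)
        return img, target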

Do you have any better solutions?

Pretrained VGG models

Is there any plan for pretrained VGG models, especially VGG16 or VGG19? These are needed for style transfer; AlexNet and ResNets tend not to work as well in my experience.

Crash with LSUN dataset in testing

The lmdb file I downloaded with the LSUN dataset for testing was not named category_[train/val/test]_lmdb, which is what torchvision/datasets/lsun.py expects. It's named test_lmdb, without any category.

One approach to solve this problem is replacing line #73 with an if statement like:

if classes != 'test':
    classes = [c + '_' + classes for c in categories]
else:
    classes = [classes]

new transform functions

Just curious: is the plan to keep transforms such as RandomCrop, HorizontalFlip, etc. only working on PIL images, or would you prefer *Tensor support as well?

Also, are you all open to adding new transforms such as rotations, shearing, shifting, zooming, etc. -- for instance transforms.RandomRotation(-30,30), transforms.HorizontalShift(0.2), or transforms.RandomZoom(0.8,1.2) -- or are these already supported elsewhere?

I would be willing to contribute in these areas. It is particularly important to combine these transforms so that only one interpolation is required, and I have experience with that.
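
For example, a minimal sketch of what a RandomRotation for PIL images could look like (hypothetical, not an existing torchvision transform):

import random
from PIL import Image

class RandomRotation(object):
    """Rotates a PIL image by a random angle drawn from [min_deg, max_deg]."""
    def __init__(self, min_deg, max_deg, resample=Image.BILINEAR):
        self.min_deg = min_deg
        self.max_deg = max_deg
        self.resample = resample

    def __call__(self, img):
        angle = random.uniform(self.min_deg, self.max_deg)
        return img.rotate(angle, resample=self.resample)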

Update conda package to 0.1.7?

Calling conda install torchvision -c soumith installs version 0.1.6:

(pytorch) daviddelaiglesia@daviddelaiglesia:~$ conda install torchvision -c soumith
Fetching package metadata ...........
Solving package specifications: .
Package plan for installation in environment /home/daviddelaiglesia/miniconda3/envs/pytorch:
The following NEW packages will be INSTALLED:
    torchvision: 0.1.6-py35_19 soumith

Whereas pip install torchvision installs the newest version:

Collecting torchvision
  Downloading torchvision-0.1.7-py2.py3-none-any.whl
Installing collected packages: torchvision
Successfully installed torchvision-0.1.7

Installing pytorch from the homepage binaries (I can confirm this one: conda install pytorch torchvision cuda80 -c soumith) also installs torchvision 0.1.6.

I found this because the 0.1.6 version didn't include this fix c0a6cfe

Do we still need `requirements.txt`?

I was building from source today and noticed that the requirements are directly in the setup.py file now. Do we still need the requirements.txt?

DenseNet FCN

Hi @orashi and @gpleiss - thank you for adding DenseNet!

I was wondering if you have looked at the DenseNet Fully Convolutional paper, which gives phenomenal scores for segmentation: https://arxiv.org/pdf/1611.09326v1.pdf

I will owe you all a round of beers if you can consider implementing DenseNet FCN too.

Great day,
FC

Missing blank space on bottom and right side of image generated using make_grid

There is only blank space on the top and left side of the whole image generated using make_grid; the bottom and right side don't have any blank space. I think changing line #68 in utils.py can fix this problem:
grid = tensor.new(3, height * ymaps, width * xmaps).fill_(0) ----->
grid = tensor.new(3, height * ymaps + padding, width * xmaps + padding).fill_(0)

mismatch between documentation and code in torchvision.transforms.ToTensor

torchvision.transforms.ToTensor documentation says that the numpy.ndarray (H x W x C) is converted to torch.FloatTensor of shape (C x H x W):
Converts a PIL.Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0].

However, the code does not permute the dimensions:
img = torch.from_numpy(pic)

Related issue: the documentation says that the np.array with values in [0, 255] is converted to a torch.FloatTensor with values in [0, 1]. This is not happening.
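
For reference, a minimal sketch of the documented behaviour for ndarray input (assuming an H x W x C uint8 array):

import numpy as np
import torch

def to_tensor(pic):
    # (H, W, C) uint8 array in [0, 255] -> (C, H, W) float tensor in [0.0, 1.0]
    img = torch.from_numpy(pic.transpose((2, 0, 1)).copy())
    return img.float().div(255)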

Transformations on GPU

Is there any plan to support image transformations on the GPU?
Doing big transformations, e.g. resizing (224x224) <-> (64x64), with PIL seems a bit slow.
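
As a point of comparison, resizing can already be done on the GPU with plain tensor ops; a minimal sketch using torch.nn.functional.interpolate (not a torchvision transform):

import torch
import torch.nn.functional as F

imgs = torch.rand(8, 3, 224, 224, device="cuda")   # a batch of images on the GPU
small = F.interpolate(imgs, size=(64, 64), mode="bilinear", align_corners=False)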

load image dataset from list files

I suggest adding an IO helper for reading images from a list like this, to support custom image data input:

img1 label1
img2 label2
...

which is like Caffe's LMDB list format.

I implemented it by referencing torchvision/datasets/folder.py:

import torch.utils.data as data

from PIL import Image
import os
import os.path

def default_loader(path):
	return Image.open(path).convert('RGB')

def default_flist_reader(flist):
	"""
	flist format: impath label\nimpath label\n ...(same to caffe's filelist)
	"""
	imlist = []
	with open(flist, 'r') as rf:
		for line in rf.readlines():
			impath, imlabel = line.strip().split()
			imlist.append( (impath, int(imlabel)) )
					
	return imlist

class ImageFilelist(data.Dataset):
	def __init__(self, root, flist, transform=None, target_transform=None,
			flist_reader=default_flist_reader, loader=default_loader):
		self.root   = root
		self.imlist = flist_reader(flist)		
		self.transform = transform
		self.target_transform = target_transform
		self.loader = loader

	def __getitem__(self, index):
		impath, target = self.imlist[index]
		img = self.loader(os.path.join(self.root,impath))
		if self.transform is not None:
			img = self.transform(img)
		if self.target_transform is not None:
			target = self.target_transform(target)
		
		return img, target

	def __len__(self):
		return len(self.imlist)

The usage is the same as the ImageFolder class:

train_loader = torch.utils.data.DataLoader(
    ImageFilelist(root="../place365_challenge/data_256/", flist="../place365_challenge/places365_train_challenge.txt",
        transform=transforms.Compose([transforms.RandomSizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(), normalize,
    ])),
    batch_size=64, shuffle=True,
    num_workers=4, pin_memory=True)

val_loader = torch.utils.data.DataLoader(
    ImageFilelist(root="../place365_challenge/val_256/", flist="../place365_challenge/places365_val.txt",
        transform=transforms.Compose([transforms.Scale(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(), normalize,
    ])),
    batch_size=16, shuffle=False,
    num_workers=1, pin_memory=True)

VGG classifier setting different from Original paper

One of the dropout layers was "wrongly" inserted.
The original final layers of the Caffe version (https://gist.github.com/ksimonyan/211839e770f7b538e2d8) are:
self.classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096),
    nn.ReLU(True),
    nn.Dropout(),
    nn.Linear(4096, 4096),
    nn.ReLU(True),
    nn.Dropout(),
    nn.Linear(4096, 1000),
)
This won't make a difference when we use model.eval(), but it will cause a discrepancy if we want to finetune VGGNet by loading Caffe's parameters.

`save_image` adds an offset to images?

In the save_image function, there's an offset added to images. Assuming that my image tensor has values between 0 and 1, my "pre-save" tensor will have values between 0.5 and 1, resulting in a final PIL image with values between 128 and 255.

This seems like a bug? I'd propose having no offset.
Thanks! 😄

Python path.py usage ?

Hello, thanks for this awesome work!

Have you considered using path.py to wrap the os.path functions?
https://pypi.python.org/pypi/path.py

As code compactness and readability seem to be a major concern, it could be nice to use it. It's available for Python 2.7 and 3.5, OS independent, and can be downloaded via pip.

Here is an example for the ImageFolder dataset:

def make_dataset(dir, class_to_idx):
    images = []
    for target in os.listdir(dir):
        d = os.path.join(dir, target)
        if not os.path.isdir(d):
            continue

        for filename in os.listdir(d):
            if is_image_file(filename):
                path = '{0}/{1}'.format(target, filename)
                item = (path, class_to_idx[target])
                images.append(item)

    return images

and now with the path.py wrapper:

from path import Path
def make_dataset(dir, class_to_idx):
    dir = Path(dir)
    images = []
    for d in dir.dirs():
        target = str(d.basename())

        for path in d.files():
            if path.ext in IMG_EXTENSIONS:
                item = (path, class_to_idx[target])
                images.append(item)

    return images

Maybe you considered it but still decided not to use it? In that case, why? (Not the accusing 'why'; it's a genuine question, as I am not very experienced in deploying big Python frameworks.)

support int16 grayscale images

This is often the case with medical (MRI) data.

Required changes would be in ToTensor probably something like:

# PIL image mode: 1, L, P, I, F, RGB, YCbCr, RGBA, CMYK
if pic.mode == 'YCbCr':
  nchannel = 3
else:
  nchannel = len(pic.mode)
# handle PIL Image
buf = pic.tobytes()
if len(buf) > pic.width * pic.height * nchannel:
  img = torch.LongTensor(torch.LongStorage.from_buffer(buf))
else:
  img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
img = img.view(pic.size[1], pic.size[0], nchannel)

as well as in ToPILImage (just remove normalization to [0, 255] here?).

However, I can't assess possible side effects. int16 support may not be very good in Pillow (e.g. plt.imshow(Image.fromarray(int16_np_array)) does not work), and there may be other transforms which depend on the [0, 255] byte range.
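
For what it's worth, a minimal sketch of converting a 16-bit grayscale PIL image without the [0, 255] assumption (a hypothetical helper, not part of ToTensor):

import numpy as np
import torch

def int16_to_tensor(pic):
    # 'I;16' / 'I' mode PIL image -> (1, H, W) integer tensor, no rescaling
    arr = np.array(pic, dtype=np.int16)
    return torch.from_numpy(arr).unsqueeze(0)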

SVHN dataset

Hi, I'm interested in adding format 2 (the classification task) of the SVHN dataset: http://ufldl.stanford.edu/housenumbers/

I'm planning on looking at the current dataset code in this repo and will add a similar svhn.py with a stubbed option for format 1. Let me know if there are any fundamental issues I may hit when trying to add this, or if anybody has an uncommitted version of this that they can add.
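
For context, a minimal sketch of reading the format 2 archives with scipy (the file name and key layout follow the SVHN download page, so treat this as an assumption):

import numpy as np
import scipy.io as sio
import torch

mat = sio.loadmat("train_32x32.mat")                                  # format 2 archive
images = np.ascontiguousarray(np.transpose(mat["X"], (3, 2, 0, 1)))   # (N, C, H, W)
labels = mat["y"].astype(np.int64).squeeze()
labels[labels == 10] = 0                                              # SVHN stores digit '0' as 10
images, labels = torch.from_numpy(images), torch.from_numpy(labels)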

-Brandon.

Problem with current transforms.Scale implementation

The current Scale implementation accepts as argument a single integer size, and rescales the image making size the new size of the smaller edge.

This results in a RuntimeError: inconsistent tensor sizes when the Scale transform is used on datasets containing images where the smaller edge is not always the same.

Is transforms.Scale only intended for datasets where the image shape is consistent?

As an example, the dataset of this Kaggle competition contains images of varying sizes, which shows the problem with the current implementation:

d = "/train/Type_1/"
s = Scale(300)
for i in os.listdir(d):
    img = Image.open(d+i)
    print("Original: {}".format(img.size))
    resized = s(img)
    print("Resized: {}".format(resized.size))
Original: (3096, 4128)
Resized: (300, 400)

Original: (4128, 3096)
Resized: (400, 300)

And it raises the previously mentioned error when you try to use the Scale transform as follows:

train_set = ImageFolder(train_path, 
                        transform=Compose([Scale(300),
                                           ToTensor()
                                          ]))
train_loader = DataLoader(train_set, 
                          batch_size=8, 
                          shuffle=True)
for i, (images, labels) in enumerate(train_loader):
    images = Variable(images)
    labels = Variable(labels)

RuntimeError: inconsistent tensor sizes at /data/users/soumith/miniconda2/conda-bld/pytorch-cuda80-0.1.10_1488757768560/work/torch/lib/TH/generic/THTensorMath.c:2548

Right now I'm using my own version of the Scale transform:

class Reshape(object):
    """ Rescales the input PIL.Image to the given 'shape'.
    Parameters
    ----------
    shape : tuple of int
    interpolation: Default: PIL.Image.BILINEAR
    
    """

    def __init__(self, shape, interpolation=Image.BILINEAR):
        self.shape = shape
        self.interpolation = interpolation

    def __call__(self, img):
        
        return img.resize(self.shape, self.interpolation)

Should I submit a pull request for this as a new transform, or is there interest in rewriting Scale?

ToTensor transform function, ndarray dimensions are not rearranged

From the code in transforms.py :

Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range
[0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0].

However, for an ndarray only .from_numpy(pic) is called, without reshaping the tensor. Is this normal?

Also, assuming the range (0, 255) for PIL images seems reasonable, but ndarrays might have different distributions, especially for target images. Would it be interesting to have a normalize parameter which lets you choose whether to divide the values by 255 or not?

(linear) range normalization? std normalization?

I don't see transforms for linear normalization (e.g. between 0-1 or arbitrary ranges) or std normalization. Do these exist somewhere else, or do people just implement this in the Dataset class? Any plans for this? It's useful when sampling from folders. It could also be useful because you could then potentially remove the automatic division by 255 in the ToTensor() transform and therefore support loading arbitrary numpy arrays from file (an area for which there is high user demand). Just a thought.

Anyways, here's some code to do these things:

class RangeNormalize(object):
    """Given min_val: (R, G, B) and max_val: (R,G,B),
    will normalize each channel of the torch.*Tensor to
    the provided min and max values.

    Works by efficiently calculating a linear transform:
        a = (max'-min')/(max-min)
        b = max' - a * max
        new_value = a * value + b
    where min' & max' are given values, 
    and min & max are observed min/max for each channel

    Example:
        >>> x = torch.rand(3,50,50)
        >>> rn = RangeNormalize((0,0,10),(1,1,11)) # normalize last channel between 10-11
        >>> x_norm = rn(x) 

    Also works with just one value for min/max across all channels:
        >>> x = torch.rand(3,50,50)
        >>> rn = RangeNormalize(-1,1)
        >>> x_norm = rn(x)
    """
    def __init__(self, min_, max_):
        if not isinstance(min_, list) and not isinstance(min_, tuple):
            min_ = [min_]*3
        if not isinstance(max_, list) and not isinstance(max_, tuple):
            max_ = [max_]*3

        self.min_ = min_
        self.max_ = max_

    def __call__(self, tensor):
        for t, min_, max_ in zip(tensor, self.min_, self.max_):
            max_val = torch.max(t)
            min_val = torch.min(t)
            a = (max_-min_)/float(max_val-min_val)
            b = max_ - a * max_val
            t.mul_(a).add_(b)
        return tensor

and

class StdNormalize(object):

    def __init__(self):
        pass

    def __call__(self, tensor):
        for t in tensor:
            mean = torch.mean(t)
            std  = torch.std(t)
            t.sub_(mean).div_(std)
        return tensor    

suggestion for transforms.Scale()

I suggest modifying transforms.Scale() to accept two types of size input:
1. If isinstance(self.size, int), resize the image's shorter side to self.size.
2. Otherwise, resize each side of the image according to self.size.

The code looks like this:

class Scale(object):
    def __init__(self, size, interpolation=Image.BILINEAR):
        self.size = size
        self.interpolation = interpolation

    def __call__(self, img):
        if isinstance(self.size, int):
            w, h = img.size
            if (w <= h and w == self.size) or (h <= w and h == self.size):
                return img
            if w < h:
                ow = self.size
                oh = int(self.size * h / w)
                return img.resize((ow, oh), self.interpolation)
            else:
                oh = self.size
                ow = int(self.size * w / h)
                return img.resize((ow, oh), self.interpolation)
        else:
            return img.resize(self.size, self.interpolation)

Feature Request: Support for non-image dataset

I like this modular design very much and want to stick with it. However, my dataset is not images; it seems to cause a runtime error when not using Image.fromarray(), and the dataloader does not iterate to the next batch.

Is it possible to support non-image data with the same modular design?

i.e. define a Dataset class inherited from torch.utils.data.Dataset,
then define a DataLoader and iterate over it with iter(dataloader).
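
For illustration, a minimal sketch of a custom non-image dataset (the class name is hypothetical; it just wraps tensors so they work with DataLoader):

import torch
from torch.utils.data import Dataset, DataLoader

class TensorPairDataset(Dataset):
    def __init__(self, inputs, targets, transform=None):
        self.inputs = inputs          # e.g. a FloatTensor of shape (N, D)
        self.targets = targets        # e.g. a LongTensor of shape (N,)
        self.transform = transform

    def __getitem__(self, index):
        x = self.inputs[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.targets[index]

    def __len__(self):
        return len(self.inputs)

ds = TensorPairDataset(torch.randn(100, 16), torch.randint(0, 3, (100,)))
loader = DataLoader(ds, batch_size=8, shuffle=True)
for x, y in loader:
    pass  # train on the batch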

Error loading image from folder

The current default_loader in ImageFolder uses PIL to load images. It has problems loading some truncated images (I don't know what truncated images are or how they are produced). How can this error be avoided?

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-96-09289af6659d> in <module>()
----> 1 a=Image.open(img).convert('RGB')

/data/wanggu/anaconda3/lib/python3.6/site-packages/PIL/Image.py in convert(self, mode, matrix, dither, palette, colors)
    842                 return self.copy()
    843 
--> 844         self.load()
    845 
    846         if matrix:

/data/wanggu/anaconda3/lib/python3.6/site-packages/PIL/ImageFile.py in load(self)
    224                             else:
    225                                 raise IOError("image file is truncated "
--> 226                                               "(%d bytes not processed)" % len(b))
    227 
    228                         b = b + s

OSError: image file is truncated (54 bytes not processed)
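
One common workaround is to tell Pillow to tolerate truncated files (note that this is a global Pillow setting, so use it knowingly):

from PIL import Image, ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True

img = Image.open("some_image.jpg").convert("RGB")   # no longer raises on truncated files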

Unit tests for Vision

This issue fleshes out the full details and scope of the unit tests needed for torchvision.

There are very limited unit tests under test/ which don't cover the transform outputs themselves, but are limited to dimension and shape checks.

First, let's start with quantitative tests on known results.

We need a set of 10 test images, then apply each transformation in vision.transforms to these 10 images and compare the outputs pixel-wise with known results. If the computed result is within some threshold of the known result, the test passes.

Some of the transforms such as Horizontal / Vertical flip can also have exact numerical unit-tests.

The test images:

  • 2 monochrome images
  • 2 3-channel images
  • 2 4-channel PNG images with an Alpha component

We can find some on Wikipedia that are freely licensed.

The tests need to cover all transforms under: https://github.com/pytorch/vision#transforms

For similar testing, you can have a look at:

https://github.com/torch/image/blob/master/test/test.lua#L258-L646

ImageFolder hanging in wait state

I am using ImageFolder to load images for training, but it always seems to hang here. Substituting "ImageFolder" with "CIFAR10" always works.

^CTraceback (most recent call last):
  File "train.py", line 291, in <module>
    main()
  File "train.py", line 132, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "train.py", line 157, in train
    for i, (input, target) in enumerate(train_loader):
  File "/conda3/envs/idp/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/conda3/envs/idp/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/conda3/envs/idp/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt

category file for trained models ZOO

Hi guys, where is the category file list for the pre-trained models in the ZOO?
We will need to upload that in order to use the models, right?

Error when loading pretrained VGG19

I encountered an error when loading a pretrained vgg19 model. I tried other networks like AlexNet and they seemed fine.

I use torch==0.1.10 and torchvision==0.1.7, just updated from the master branch. Also, it's under Python 3.6 on the latest macOS.

Does anyone have any idea how this happened and how it can be fixed? The error message seems to complain about incompatible un-pickling.

model = models.vgg19(True)
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.6/site-packages/torchvision/models/vgg.py", line 141, in vgg19
model.load_state_dict(model_zoo.load_url(model_urls['vgg19']))
File "/usr/local/lib/python3.6/site-packages/torch/utils/model_zoo.py", line 57, in load_url
return torch.load(cached_file)
File "/usr/local/lib/python3.6/site-packages/torch/serialization.py", line 222, in load
return _load(f, map_location, pickle_module)
File "/usr/local/lib/python3.6/site-packages/torch/serialization.py", line 370, in _load
result = unpickler.load()
AttributeError: Can't get attribute '_rebuild_tensor' on <module 'torch._utils' from '/usr/local/lib/python3.6/site-packages/torch/_utils.py'>

option for pretrained model for vgg net is not available.

I installed torchvision today with conda on a new machine and found that the argument for pretrained vgg models is not available now. But the corresponding option for resnet is still there.

An older version of torchvision installed on Mar 7th works fine.
I forgot that PyTorch on this machine was installed from source.


finetune resnet18

Hi, I am trying to finetune the ResNet model with my own data. I follow the ImageNet example's main.py to modify the fc layer in this way (I only finetune ResNet, not AlexNet):

def main():
    global args, best_prec1
    args = parser.parse_args()

    # create model
    if args.pretrained:
        print("=> using pre-trained model '{}'".format(args.arch))
        model = models.__dict__[args.arch](pretrained=True)
      #modify the fc layer
        model.fc=nn.Linear(512,100)
    else:
        print("=> creating model '{}'".format(args.arch))
        model = models.__dict__[args.arch]()

    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.features = torch.nn.DataParallel(model.features)
        model.cuda()
    else:
        model = torch.nn.DataParallel(model).cuda()

    # optionally resume from a checkpoint
    if args.resume:
        if os.path.isfile(args.resume):
            print("=> loading checkpoint '{}'".format(args.resume))
            checkpoint = torch.load(args.resume)
            args.start_epoch = checkpoint['epoch']
            best_prec1 = checkpoint['best_prec1']
            model.load_state_dict(checkpoint['state_dict'])
            print("=> loaded checkpoint '{}' (epoch {})"
                  .format(args.resume, checkpoint['epoch']))
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))

    cudnn.benchmark = True

And when testing the model I trained, I found the fc layer still has 1000 classes.
I have struggled with this for a long time, but it stays the same and I don't know why:

    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (downsample): Sequential (
      (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    )
  )
  (1): BasicBlock (
    (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
  )
)
(avgpool): AvgPool2d (
)

(fc): Linear (512 -> 1000)
)
)

Here is my testing code:

import torch
import torch.nn as nn
#from __future__ import print_function
import argparse
from PIL import Image
import torchvision.models as models
import skimage.io
from torch.autograd import Variable as V
from torch.nn import functional as f
from torchvision import transforms as trn

# define image transformation
centre_crop = trn.Compose([
        trn.ToPILImage(),
        trn.Scale(256),
        trn.CenterCrop(224),
        trn.ToTensor(),
        trn.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
filename=r'2780-0-20161221_0001.jpg'
img = skimage.io.imread(filename)
x = V(centre_crop(img).unsqueeze(0), volatile=True)
model = models.__dict__['resnet18']()
model = torch.nn.DataParallel(model).cuda()
checkpoint = torch.load('model_best1.pth.tar')
model.load_state_dict(checkpoint['state_dict'])
best_prec1 = checkpoint['best_prec1']
logit = model(x)
print(logit)
print(len(logit))
h_x = f.softmax(logit).data.squeeze()

Can anyone tell me where I went wrong? Thank you so much!
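
One likely cause (an assumption based on the snippets above) is that the testing code rebuilds resnet18 with its default 1000-way fc layer and then loads the checkpoint into that mismatched model. A minimal sketch of rebuilding the fc layer before loading the checkpoint:

import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()
model.fc = nn.Linear(512, 100)                  # match the finetuned head
model = torch.nn.DataParallel(model).cuda()
checkpoint = torch.load('model_best1.pth.tar')
model.load_state_dict(checkpoint['state_dict'])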

Load gray images

How do I load gray images in torchvision? I find that folder.py only opens images in RGB format:
return Image.open(path).convert('RGB')
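
One option is to pass a custom loader that keeps the image in grayscale mode; a minimal sketch (assuming ImageFolder's loader argument):

from PIL import Image
from torchvision.datasets import ImageFolder

def gray_loader(path):
    return Image.open(path).convert('L')        # 'L' keeps a single channel

dataset = ImageFolder('path/to/data', loader=gray_loader)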

support numpy arrays as input (over PIL.Image?)

This issue is meant more as a discussion. Wouldn't it be a better approach to use numpy arrays as input instead of PIL.Images?

Benefits:

  • numpy is more general than PIL.Image
  • doing some transforms with numpy ops should be faster
  • the transformations provided by PIL are also available via scipy and/or skimage for numpy arrays
  • there are fewer problems with using different value ranges, types (e.g. float64) with numpy

What are your ideas on this?

Embed LR schedule and initialization with the model

I tried to implement SqueezeNet as a torchvision model and train it via the ImageNet example, and found that it doesn't converge as is. The reference code differs in two aspects:

  • All but the last convolution are initialized with the Xavier (Glorot) initializer; the last is initialized from a normal distribution with stdev 0.01
  • The learning rate is linearly decreased (polynomial schedule with power=1).

In PyTorch these aspects are hard-coded inside the ImageNet example, but I think it makes sense to make them part of the model definition in torch.vision. What's your position on it?
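
For concreteness, a minimal sketch of those two reference-code details using current PyTorch APIs (the optimizer hyperparameters here are placeholders, not the reference values):

import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

model = models.squeezenet1_0()
final_conv = model.classifier[1]                # last conv layer in torchvision's SqueezeNet

for m in model.modules():
    if isinstance(m, nn.Conv2d):
        if m is final_conv:
            nn.init.normal_(m.weight, mean=0.0, std=0.01)   # last conv: normal, stdev 0.01
        else:
            nn.init.xavier_uniform_(m.weight)               # other convs: Xavier/Glorot
        if m.bias is not None:
            nn.init.zeros_(m.bias)

num_epochs = 90
optimizer = optim.SGD(model.parameters(), lr=0.04, momentum=0.9)
scheduler = optim.lr_scheduler.LambdaLR(                     # linear decay, i.e. power=1
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / num_epochs)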

Variable input resolutions for ResNets

The current implementation of the ResNet models has a fixed 7x7 average pooling layer before the final FC layer.

If this were changed to nn.AdaptiveAvgPool2d(1), it would allow for variable-sized input batches (e.g. like the 320x320 batches used at test time in the 'Identity Mappings in Deep Residual Networks' paper). Although I'm not sure if this would incur a performance hit?
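
For experimentation, a minimal sketch of swapping the pooling layer on an existing model (not how the library defines it):

import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()
model.avgpool = nn.AdaptiveAvgPool2d(1)          # pool to 1x1 regardless of input size
model.eval()

with torch.no_grad():
    out_224 = model(torch.rand(1, 3, 224, 224))
    out_320 = model(torch.rand(1, 3, 320, 320))  # larger inputs now work too
print(out_224.shape, out_320.shape)              # both (1, 1000)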

Precision doesn’t improve when training alexnet and vgg16 on custom dataset

I use the pytorch imagenet example on a custom dataset. My dataset has nearly 300 categories and 12000 images in total. The dataset is organized into train and val directories.

  • alexnet with learning rate 0.1 - python main.py --arch=alexnet dataset
Epoch: [49][0/29]	Time 3.213 (3.213)	Data 3.137 (3.137)	Loss 5.7120 (5.7120)	Prec@1 0.781 (0.781)	Prec@5 1.172 (1.172)
Epoch: [49][10/29]	Time 0.182 (0.869)	Data 0.000 (0.713)	Loss 5.7154 (5.7094)	Prec@1 0.000 (0.426)	Prec@5 0.781 (1.456)
Epoch: [49][20/29]	Time 2.013 (0.829)	Data 1.931 (0.696)	Loss 5.7096 (5.7113)	Prec@1 0.781 (0.316)	Prec@5 2.734 (1.376)
Test: [0/10]	Time 3.072 (3.072)	Loss 5.7060 (5.7060)	Prec@1 0.000 (0.000)	Prec@5 0.000 (0.000)
 * Prec@1 0.333 Prec@5 1.667
  • alexnet with learning rate 0.01 - python main.py --arch=alexnet --lr=0.01 dataset
Epoch: [89][0/29]	Time 3.110 (3.110)	Data 3.040 (3.040)	Loss 4.7523 (4.7523)	Prec@1 5.469 (5.469)	Prec@5 19.922 (19.922)
Epoch: [89][10/29]	Time 0.189 (0.831)	Data 0.070 (0.700)	Loss 4.7577 (4.8041)	Prec@1 6.250 (5.611)	Prec@5 19.141 (17.685)
Epoch: [89][20/29]	Time 2.163 (0.831)	Data 2.079 (0.705)	Loss 4.8331 (4.8019)	Prec@1 4.688 (5.673)	Prec@5 19.531 (17.839)
Test: [0/10]	Time 3.048 (3.048)	Loss 4.6815 (4.6815)	Prec@1 8.203 (8.203)	Prec@5 23.047 (23.047)
 * Prec@1 7.458 Prec@5 22.833
  • vgg16 with learning rate 0.01 - basically the same as alexnet; the training didn't converge.

  • However, I can train this dataset with resnet18:
    python main.py --arch=resnet18 --batch-size=128 dataset

Epoch: [89][40/57]	Time 0.536 (0.446)	Data 0.413 (0.165)	Loss 0.3770 (0.5072)	Prec@1 92.188 (88.529)	Prec@5 97.656 (95.332)
Epoch: [89][50/57]	Time 0.369 (0.440)	Data 0.000 (0.166)	Loss 0.4453 (0.5025)	Prec@1 89.844 (88.664)	Prec@5 95.312 (95.374)
Test: [0/19]	Time 1.668 (1.668)	Loss 0.8600 (0.8600)	Prec@1 81.250 (81.250)	Prec@5 94.531 (94.531)
Test: [10/19]	Time 0.104 (0.463)	Loss 1.5666 (1.5452)	Prec@1 67.188 (67.827)	Prec@5 84.375 (84.659)
 * Prec@1 67.375 Prec@5 84.208

Ignore hidden files in datasets.ImageFolder?

For example, OS X likes to create .DS_Store files, and they confuse datasets.ImageFolder (more precisely, the find_classes function): it thinks this is one of the classes. Maybe it makes sense to ignore hidden files when determining the classes present in a folder, what do you think? This particular case could also be solved if only directories were considered; I'm not sure if it's ok to just skip regular files.
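
A minimal sketch of the suggested change to find_classes (only keep directories and skip hidden entries; this mirrors the existing function rather than quoting it):

import os

def find_classes(dir):
    classes = [d for d in os.listdir(dir)
               if os.path.isdir(os.path.join(dir, d)) and not d.startswith('.')]
    classes.sort()
    class_to_idx = {classes[i]: i for i in range(len(classes))}
    return classes, class_to_idx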
