lightly-ai / lightly
A Python library for self-supervised learning on images.
Home Page: https://docs.lightly.ai/self-supervised-learning/
License: MIT License
This is required for #103.
We want to add a tutorial which highlights how a user can use custom augmentations to do self-supervised learning with lightly.
After sampling, we get a tag back. Translate the tag into a list of filenames.
Got the following error while uploading embeddings:
lightly-upload token='' dataset_id='' embeddings='/home/usr/embeddings.csv'
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/effd/lib/python3.7/site-packages/lightly/cli/upload_cli.py", line 107, in upload_cli
_upload_cli(cfg)
File "/home/ubuntu/anaconda3/envs/effd/lib/python3.7/site-packages/lightly/cli/upload_cli.py", line 52, in _upload_cli
embedding_name=cfg['embedding_name']
File "/home/ubuntu/anaconda3/envs/effd/lib/python3.7/site-packages/lightly/api/upload.py", line 115, in upload_embeddings_from_csv
raise RuntimeError(msg)
RuntimeError: Forbidden upload to dataset with no existing tags.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Currently, the BaseCollateFunction concatenates the input images as follows:
# three concatenations
b0 = torch.cat([transform(img) for img in imgs], 0)
b1 = torch.cat([transform(img) for img in imgs], 0)
return torch.cat([b0, b1], 0)
This is inefficient because new memory has to be allocated for every concatenation operation (of which there are three).
An alternative with a single concatenation would be:
# single concatenation
b = [transform(imgs[i % bsz]) for i in range(2*bsz)]
return torch.cat(b, 0)
A short experiment for bsz=512 and image_height=128 shows the potential speed-up:
Required time (so far): 0.1442425012588501s
Required time (new): 0.09266984462738038s
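For reference, a minimal benchmark sketch of the two variants (the use of random tensors and RandomHorizontalFlip as the transform is an assumption, not the original experiment code):

import time
import torch
import torchvision.transforms as T

bsz, h = 512, 128
imgs = [torch.rand(1, 3, h, h) for _ in range(bsz)]  # stand-ins for real images
transform = T.RandomHorizontalFlip()

start = time.time()
b0 = torch.cat([transform(img) for img in imgs], 0)
b1 = torch.cat([transform(img) for img in imgs], 0)
batch = torch.cat([b0, b1], 0)  # three concatenations
print(f'Required time (so far): {time.time() - start}s')

start = time.time()
batch = torch.cat([transform(imgs[i % bsz]) for i in range(2 * bsz)], 0)  # one concatenation
print(f'Required time (new): {time.time() - start}s')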
When trying to use the lightly-magic CLI command without fine-tuning a model, the CLI fails because it tries to load a checkpoint during the embedding phase that doesn't exist.
Command to reproduce: lightly-magic input_dir=raw trainer.max_epochs=0
Error message:
Training: 0it [00:00, ?it/s]Saving latest checkpoint...
[2020-11-26 15:47:06,327][lightning][INFO] - Saving latest checkpoint...
Training: 0it [00:00, ?it/s]
Best model is stored at: /datasets/videos/lightly_outputs/2020-11-26/15-46-37/
Traceback (most recent call last):
File "/opt/conda/envs/lightly/lib/python3.7/site-packages/lightly/cli/lightly_cli.py", line 72, in lightly_cli
return _lightly_cli(cfg)
File "/opt/conda/envs/lightly/lib/python3.7/site-packages/lightly/cli/lightly_cli.py", line 28, in _lightly_cli
embeddings = _embed_cli(cfg, is_cli_call)
File "/opt/conda/envs/lightly/lib/python3.7/site-packages/lightly/cli/embed_cli.py", line 85, in _embed_cli
checkpoint, map_location=device
File "/opt/conda/envs/lightly/lib/python3.7/site-packages/torch/serialization.py", line 581, in load
with _open_file_like(f, 'rb') as opened_file:
File "/opt/conda/envs/lightly/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/opt/conda/envs/lightly/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
IsADirectoryError: [Errno 21] Is a directory: '/datasets/videos/lightly_outputs/2020-11-26/15-46-37/'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Add hooks for validation at the end of each epoch (only informative if there are labels). Let's make sure it can be switched off and that the user can pass a validation set of their choice.
The following larger functions are currently untested:
General thoughts:
The config.yaml does not contain the latest collate parameters; e.g. vf_prob and hf_prob are missing.
Additionally, there is no information about using half-precision. This should be added.
When uploading images, my PC went to sleep for about a minute. After waking up, the upload stopped with the following error:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /boris-platform-dev/google-oauth2%7C108619227381715356556/5ff6ff536580b3000accacaf/training/training/n7/n7039.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=boris-platform-dev%40boris-250909.iam.gserviceaccount.com%2F20210107%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210107T125248Z&X-Goog-Expires=3601&X-Goog-SignedHeaders=host&X-Goog-Signature=228f3730fa7f8e31d8540e0a0261d61b8232220a416e663f7f3bb556f253b19f258edeb32a00e482b78a901e4f2f97d5448ba643e00daffdec08f7fb815c93eb1335d556b7074d13974a21ba221224f656db3bed34506607f3e0b69b5e8c21ef71cefe79510d30ab5be186f9d98bd7fb9ffaec7bcce5c1014b8d6aaf096671cc196c35ccc0e2dbc7f34554a01fc778166f958a3552f52162c122f532b4e2857a6bf63f63ad8dc5c58acab304465fa86631ee3395dbcba17df617d99d21f183b950bc6f77741510d0437f9b31c188132c52fc72b8f782ed31f956ddd73ebeaf41d64f55aace58952e75d0067d83ebd5d76e84a1445db846f874ff512b7b970193 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x15780a400>: Failed to establish a new connection: [Errno 51] Network is unreachable'))
Let's add some high-level information about the structure of the package and explain how the different modules are connected. Maybe even add an illustration?
Currently, the ResNet implementation always adds a classification head, but for almost all of the self-supervised approaches you only care about the actual backbone (and so you strip away this linear layer).
I'd propose to make the num_classes argument optional and, if None, omit the final linear layer so you only get the actual backbone. I can open a pull request for this if you're open to it.
Also, it looks like the library's implementation is not totally consistent with the default PyTorch ones (some missing max pools, different kernel sizes). I think a lot of people are likely to use the default PyTorch models for a lot of things, which would make the ResNets trained using Lightly incompatible. Would there be an interest in switching to the default PyTorch implementations?
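A minimal sketch of the proposed change; the stand-in backbone below is only illustrative since the real ResNet internals are omitted here:

import torch.nn as nn

class ResNet(nn.Module):
    def __init__(self, num_classes=None):
        super().__init__()
        self.backbone = nn.Sequential(      # stands in for the actual conv layers
            nn.Conv2d(3, 512, kernel_size=3),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # omit the final linear layer when num_classes is None
        self.fc = nn.Linear(512, num_classes) if num_classes is not None else nn.Identity()

    def forward(self, x):
        return self.fc(self.backbone(x))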
Using files on disk to store images is often inefficient due to the large number of files.
Providing a way to use a memory abstraction instead would be really helpful.
It could be https://www.tensorflow.org/datasets or https://pytorch.org/docs/stable/data.html
What do you think? Is this already possible but simply not documented?
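As a sketch of what such a memory abstraction could look like (the class below is hypothetical, not an existing lightly API):

import torch
from torch.utils.data import Dataset

class InMemoryImageDataset(Dataset):
    # holds all images as a single tensor instead of many files on disk
    def __init__(self, images: torch.Tensor, transform=None):
        self.images = images            # shape (N, C, H, W)
        self.transform = transform

    def __len__(self):
        return self.images.size(0)

    def __getitem__(self, i):
        img = self.images[i]
        return self.transform(img) if self.transform else img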
I found the one by https://github.com/swagger-api/swagger-codegen.git to work best, as the models it creates have exactly and only the parameters from the spec as arguments.
Add an implementation of the CoMatch framework.
With the current setup, the embedding module SelfSupervisedEmbedding automatically creates checkpoints when trained. This also causes problems if checkpoints already exist, basically making it hard to use on its own. I would propose to remove the checkpoint_callback code (lightly/lightly/embedding/_base.py, line 46 in d1b4711) from the embedding module.
I think that should be a rather small change but it could help with usability.
The SimCLR and MoCo architectures are currently implemented as standalone architectures.
Before adding new architectures like BYOL or SimSiam, we should probably make use of the existing code, e.g.
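One possibility would be to factor out shared building blocks such as the projection head, since SimCLR, MoCo, BYOL, and SimSiam all attach an MLP head to a backbone (a sketch; this class is an assumption, not existing library code):

import torch.nn as nn

class ProjectionHead(nn.Module):
    # shared MLP head reusable across the architectures above
    def __init__(self, in_dim=512, hidden_dim=512, out_dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.layers(x)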
Currently, if one has created a LightlyDataset and wants to apply transforms at a later stage, we have to do it as follows:
dataset = data.LightlyDataset(root="./", name='CIFAR10', download=True)
# ... use the dataset to train a SimCLR model, for example ...
test_transforms = torchvision.transforms.ToTensor()
dataset.dataset.transform = test_transforms
We should extend the Dataset wrapper to directly support transforms.
This will save the user a line of code when trying to reproduce results and can make our example code leaner.
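Usage could then look like this (a sketch, assuming the wrapper accepts a transform keyword as proposed):

import torchvision

test_transforms = torchvision.transforms.ToTensor()
# proposed: pass the transform directly to the wrapper (hypothetical keyword)
dataset = data.LightlyDataset(root='./', name='CIFAR10', download=True,
                              transform=test_transforms)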
I suggest something like the following lines in lightly.data.collate.py
class SimCLRCollateFunction(ImageCollateFunction):
    """Description...
    """
    def __init__(self, input_size=32):
        super(SimCLRCollateFunction, self).__init__(
            input_size=input_size,
            # put all the SimCLR settings here
        )
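Usage would then be straightforward (a sketch, assuming a dataset has already been created):

from torch.utils.data import DataLoader

collate_fn = SimCLRCollateFunction(input_size=32)
dataloader = DataLoader(dataset, batch_size=256, collate_fn=collate_fn)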
The PIP package should provide a function to get the current tag from the API.
https://github.com/lightly-ai/lightly-core/issues/86
Let's add the loss from SwAV to the package such that it can be used with and without a memory bank.
We'll put it in a separate file at lightly/loss/swav.py.
Following closely our implementation of the NT-Xent loss, the backbone should look like this:
class SwAVLoss(MemoryBankModule):
    def __init__(self,
                 # whatever arguments we need
                 memory_bank_size: int = 3000):
        super(SwAVLoss, self).__init__(size=memory_bank_size)
        ...

    def forward(self, output: torch.Tensor, labels: torch.Tensor = None):
        output, negatives = super(SwAVLoss, self).forward(output)
        if negatives is None:
            # calculate loss from batch only
            ...
        else:
            # calculate loss from batch and negatives
            ...
        return loss
Currently, the collate function converts each image into a pair of transformed images before concatenating them to a batch. This has led to a lot of confusion. Ideally, the transform could be passed to the LightlyDataset constructor.
This would also have an impact on #68. Furthermore, a LightlyDataset should have a train and an inference mode to switch between augmentations for contrastive learning and for inferring image representations.
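A minimal sketch of such a mode switch; all names below are assumptions for illustration:

from torch.utils.data import Dataset

class ModalDataset(Dataset):
    # wraps a base dataset and switches transforms by mode
    def __init__(self, base, train_transform, inference_transform):
        self.base = base
        self.train_transform = train_transform
        self.inference_transform = inference_transform
        self.mode = 'train'

    def __len__(self):
        return len(self.base)

    def __getitem__(self, i):
        img, label = self.base[i]
        t = self.train_transform if self.mode == 'train' else self.inference_transform
        return t(img), label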
Pre-trained models from the model zoo can only be used from the CLI. It would be better to move them to the models module and add the option to load a pre-trained model.
Since there is no evidence that pre-trained models can hurt performance, I would use them by default.
Then we could create them using, e.g.:
model = lightly.models.ResNetMoCo(num_ftrs=128, pretrained=True)
To improve the speed of the upload process, don't open the images prior to uploading, since we are not using the extracted metadata anywhere.
As proposed by users u/lfotofilter and u/AiDreamer on Reddit.
Add an implementation of the SimSiam Representation Learning framework.
As described in https://github.com/lightly-ai/lightly-core/issues/123, the pip package needs a mapping between the following four representations of the samples belonging to a tag:
The mapping can be done using the following two lists downloaded from the API:
The following mapping functions are needed:
from 1. to 2.: implemented in the bitmask class
from 2. to 3. by simple indexing, using A)
from 2. to 4. by simple indexing, using B)
from 4. to 2. by reverse indexing using A)
from 4. to 3. by reverse indexing using B)
from 2. to 1.: implemented in the bitmask class
Furthermore, the following workflows have to be updated to use these mappings:
To make it easier for other contributors to work on lightly, it would be good to outline the structure of the PIP package a bit and the reasoning behind it. There were many internal discussions on how to derive a scalable architecture.
It should be possible to request a sampling from our API using the package directly (no need to go through the web-app).
Currently, all CLI configs are stored in a single file, config.yaml. In the future, the file structure should look as follows:
config/
|-- config.yaml
|-- model/
|   |-- resnet-18.yaml
|   |-- resnet-34.yaml
|   |-- ...
|-- data/
|   |-- data.yaml
|   |-- ...
This way, a user can still overwrite the default arguments like so:
lightly-train input_dir=my-dir model.num_ftrs=32
However, one could easily switch between different default settings and even write custom config files:
# this will use the default settings from resnet-34.yaml
lightly-train input_dir=my-dir model=resnet-34
# this will use the settings in the custom config file
lightly-train input_dir=my-dir model=my-model
As implemented in PIL.ImageOps. May be used to replace ImageNet normalization when working on X-ray images. This normalization increases contrast in the image.
Example:
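A minimal usage sketch, assuming the operation meant is histogram equalization via PIL.ImageOps.equalize (the filename is a placeholder):

from PIL import Image, ImageOps

img = Image.open('xray.png').convert('L')   # X-ray images are grayscale
img_eq = ImageOps.equalize(img)             # spreads the histogram, increasing contrast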
The new Lightly platform will use a new tag representation. The transmitted data will be encoded as (16bit) hex strings.
The goal is to provide a simple helper class to work with this new format and switch between hex representation and binary representation.
Some functionalities which need to be implemented:
'ab3f' --> '1010101100111111'
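A minimal helper sketch covering this conversion; the class and method names are assumptions:

class BitMask:
    def __init__(self, value: int, num_bits: int):
        self.value = value
        self.num_bits = num_bits

    @classmethod
    def from_hex(cls, hexstring: str) -> 'BitMask':
        # 4 bits per hex digit, e.g. 'ab3f' -> 16 bits
        return cls(int(hexstring, 16), 4 * len(hexstring))

    def to_bin(self) -> str:
        return format(self.value, f'0{self.num_bits}b')

    def to_hex(self) -> str:
        return format(self.value, f'0{self.num_bits // 4}x')

assert BitMask.from_hex('ab3f').to_bin() == '1010101100111111'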
When I use the lightly CLI, it always creates a new output folder with a timestamp. This is good for making experiments reproducible. However, for some CLI commands such as lightly-upload or lightly-download, creating folders does not add any value.
Furthermore, I would rename the folder to something more descriptive than just outputs.
- Make lightly-upload and lightly-download not create new output folders by default.
- Rename outputs to lightly_outputs or similar to make it clear which folders have been created by the lightly CLI.
Support for data augmentation pipelines, optimally data augmentation pipelines that support higher-level optimization like policy search or HPO.
Data augmentation would have its own abstraction but be bound to a pipeline abstraction; pipelines would then be passed to the collate function for the dual augmentation of batches specific to many self-supervision tasks. Collation could be associated more with descriptive methods for how much divergence is introduced (via randomness) for each of the two "paths".
This would allow researchers to use strategies such as curriculum/active learning, with augmentation diverging as the consistency loss reduces (for example).
To reduce constructor clutter and to allow for easy logging and experiment design, I believe that we should allow common hyper-parameters to be bundled and make assertions about these parameters in one place.
There are two options: the first is a top-level params API that bundles one object to be passed to the data loader and the model; the second uses two objects, data params and model params.
I am biased towards the former, specifically because of the framework-wide approach of self-supervision: there is a deep interconnection between data, data transformations, and models (perhaps a bit of which is present in all of deep learning), but I believe this paradigm in particular forces one to think of all of this as very interconnected.
This also forces the good practice of researchers thinking up-front about their entire experimental design, a kind of literate experimental design. With rich logging, enforced invariants, and sensible defaults it could also be very beginner-friendly.
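A sketch of the former option as a single bundled object; the fields below are assumptions:

from dataclasses import dataclass

@dataclass
class ExperimentParams:
    # one bundle passed to both the data loader and the model
    input_size: int = 64
    batch_size: int = 256
    num_ftrs: int = 32
    lr: float = 1e-3

    def __post_init__(self):
        # invariants about common hyper-parameters live in one place
        assert self.input_size > 0 and self.batch_size > 0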
Requirements
Currently, the primary ResNet backbone uses a custom implementation that is not directly compatible with the default PyTorch implementations (in addition to having a slightly different layer configuration). I think it would be advantageous to move to the standard models offered in torchvision, as most people likely default to those.
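For example, a backbone can be built from a standard torchvision model by stripping the classification head (a common pattern, not current library code):

import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet18()
backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer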
Depends on #111
The goal is to make sure the pip package can communicate using the new tag format.
As proposed by user u/extracoffeeplease on Reddit.
Add an implementation of the BYOL framework.
At the moment, our CLI only supports two ways of downloading datasets:
However, it would be nice to be able to download the full dataset (original images), if available from the platform, to a destination folder.
The CLI documentation mentions that we have parameters, but there is no list of them. It would be great to provide a list, so users know what kind of parameters they can set.
Currently, in lightly.cli.embed_cli, the following lines make it impossible to embed eval/test data from the CLI:
dataset = LightlyDataset(root, name=data, train=True, download=download,
                         from_folder=input_dir, transform=transform)
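A possible fix would be to expose the flag instead of hardcoding it (the config key used here is an assumption):

dataset = LightlyDataset(root, name=data, train=cfg.get('train', True),
                         download=download, from_folder=input_dir,
                         transform=transform)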
For SimCLR and MoCo there are two versions. Our implementations are closer to v2 of both, from what I've seen, but I'm not sure we use all the changes from the papers. I think it makes sense to clarify which version lightly is using, and if there is a difference to the paper we should mention it. Maybe also mention how one could use SimCLRv1 or SimCLRv2 by changing some of the parameters?
In their paper, the authors of MoCo mention that they shuffle their batches in order to prevent a flow of information between the key encoder and the query encoder (if the positive pairs are normalized with the same statistics, the model can cheat). As a solution, they shuffle the batches and split them into smaller sub-batches on which the batch norm is then calculated. We can and should implement a similar strategy.
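A minimal single-GPU sketch of the idea (the sub-batch split is a further refinement; the function name is an assumption):

import torch

def forward_key_encoder_shuffled(key_encoder, x):
    # shuffle so the batch norm statistics differ between query and key encoders
    idx = torch.randperm(x.size(0), device=x.device)
    k = key_encoder(x[idx])
    # undo the shuffle to restore the original pair ordering
    return k[torch.argsort(idx)]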
At the moment, tox installs the dependencies in the local environment which can overwrite the current installation of required packages.
Use CI to track the test coverage
Tasks: