smlr's Introduction

Smlr - A Simple Image Clustering Script using CLIP and Hierarchical Clustering

Smlr is a simple script for machine learning hobbyists looking for a quick way to minimize overrepresentation of concepts in image datasets. This script groups similar images in a directory using hierarchical clustering with CLIP embeddings and the Annoy library for efficient similarity search.

Requirements

To use this script, ensure you have the following packages installed in your Python environment:

  • torch
  • torchvision
  • numpy
  • Pillow
  • transformers
  • tqdm
  • annoy
  • scipy

pathlib is also used, but it ships with the Python 3 standard library and does not need to be installed.

You can install all of the required packages in one go with pip:

pip install torch torchvision numpy Pillow transformers tqdm annoy scipy

How it Works

  1. Checks if there's an existing embeddings file and loads it if found. If not, it generates embeddings for each image in the specified directory using a CLIP model from the HuggingFace Transformers library.
  2. Saves the generated embeddings to a file in the image directory for future use*.
  3. Builds an Annoy index for efficient similarity search using the generated embeddings.
  4. Computes a distance matrix from the Annoy index.
  5. Applies hierarchical clustering to the distance matrix and assigns clusters based on a given threshold.
  6. Moves images corresponding to the embeddings in each cluster to separate folders, with names like "cluster_0", "cluster_1", and so on.
  7. Moves images not included in any cluster to a folder named "unique".
  8. If there are any corrupted or damaged images, they get moved to a folder named "corrupted".
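Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: it computes exact cosine distances with NumPy instead of querying an Annoy index, it assumes average linkage, and the function name cluster_embeddings is hypothetical. Because the distance measure may differ from the one the script uses, thresholds here are not guaranteed to behave the same way; 0.22 is simply the documented default.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_embeddings(embeddings, threshold=0.22):
    """Return one cluster label per embedding (hypothetical helper)."""
    # Normalise rows so that the dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Exact pairwise cosine distances (assumption: stands in for Annoy).
    distances = np.clip(1.0 - unit @ unit.T, 0.0, None)
    distances = (distances + distances.T) / 2.0  # enforce exact symmetry
    np.fill_diagonal(distances, 0.0)
    # linkage expects the condensed (upper-triangle) form of the matrix.
    condensed = squareform(distances, checks=False)
    tree = linkage(condensed, method="average")  # linkage method is assumed
    # Cut the tree at the threshold: merges above it are not applied.
    return fcluster(tree, t=threshold, criterion="distance")


# Toy data: two near-duplicate vectors plus one unrelated vector.
toy = np.array([[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]])
labels = cluster_embeddings(toy)
```

In this toy example the first two vectors end up with the same label and the third gets its own, which corresponds to an image that would be treated as unique.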

*The script saves the generated embeddings in the image directory as 'embeddings.npy'. If this file is detected during future runs, the script will ask you if you want to use the existing embeddings or make new ones. This is useful if something goes wrong during the clustering process or if you want to try different thresholds without going through the whole embeddings generation process again.
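The caching behaviour described above might look something like this. It is a sketch only: load_or_create_embeddings and compute_fn are hypothetical names, and the reuse flag stands in for the interactive prompt the script shows when it finds an existing file.

```python
from pathlib import Path

import numpy as np


def load_or_create_embeddings(image_dir, compute_fn, reuse=True):
    """Load embeddings.npy from image_dir if allowed, else compute and save."""
    cache = Path(image_dir) / "embeddings.npy"
    if reuse and cache.exists():
        # Skip the expensive embedding pass and reuse the saved array.
        return np.load(cache)
    # compute_fn is a placeholder for the CLIP embedding step.
    embeddings = np.asarray(compute_fn())
    np.save(cache, embeddings)
    return embeddings
```

Calling it twice on the same directory runs compute_fn only the first time, which is what makes repeated runs with different thresholds cheap.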

Usage

If you're running on an RTX 3090 or a similarly beefy GPU, you can get started quickly with the following command:

python smlr.py --image_directory /path/to/your/image_directory

Some Extra Options for those who like to tinker or have less VRAM to spare

  • --clip_model: Choose the pre-trained CLIP model for generating embeddings. Options from least to most demanding: openai/clip-vit-base-patch32, openai/clip-vit-base-patch16, openai/clip-vit-large-patch14, openai/clip-vit-large-patch14-336 (default).
  • --threshold: Set the threshold value for hierarchical clustering. Lower values reduce false positives (dissimilar images grouped together) but may leave more genuinely similar images unclustered. Default is 0.22.
  • --batch_size: Pick the batch size when generating CLIP embeddings. Higher values need more VRAM. Default is 192.
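For reference, the options above could be declared with argparse along these lines. The flag names and defaults come from this README; the help strings, the build_parser name, and the choice to make --image_directory required are assumptions, and the real script may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Group similar images using CLIP embeddings."
    )
    # Assumed to be required, since the script needs a directory to work on.
    parser.add_argument("--image_directory", required=True,
                        help="Directory containing the images to cluster.")
    parser.add_argument("--clip_model",
                        default="openai/clip-vit-large-patch14-336",
                        help="HuggingFace name of the CLIP model to use.")
    parser.add_argument("--threshold", type=float, default=0.22,
                        help="Clustering threshold.")
    parser.add_argument("--batch_size", type=int, default=192,
                        help="Images per batch when generating embeddings.")
    return parser


# Example: lighter settings for a GPU with less VRAM.
args = build_parser().parse_args([
    "--image_directory", "/path/to/your/image_directory",
    "--clip_model", "openai/clip-vit-base-patch32",
    "--batch_size", "32",
])
```

The equivalent command line would be:

python smlr.py --image_directory /path/to/your/image_directory --clip_model openai/clip-vit-base-patch32 --batch_size 32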

After running the script, you'll find the clustered images in separate folders within the input directory, the unique images in the "unique" folder, and any corrupted images in the "corrupted" folder.
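The resulting folder layout can be illustrated with a small sketch. It assumes that a cluster with a single member counts as "unique" and that cluster folders are numbered in label order; sort_into_folders and labels_by_file are hypothetical names, and the handling of corrupted images is left out.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def sort_into_folders(image_dir, labels_by_file):
    """Move files into cluster_N folders; singletons go to 'unique'."""
    image_dir = Path(image_dir)
    # Group file names by their cluster label.
    groups = defaultdict(list)
    for name, label in labels_by_file.items():
        groups[label].append(name)
    cluster_index = 0
    for label in sorted(groups):
        names = groups[label]
        if len(names) > 1:
            target = image_dir / f"cluster_{cluster_index}"
            cluster_index += 1
        else:
            # Assumption: a one-image cluster is treated as unique.
            target = image_dir / "unique"
        target.mkdir(exist_ok=True)
        for name in names:
            shutil.move(str(image_dir / name), str(target / name))
```

Since files are moved rather than copied, it is worth trying a sketch like this on a copy of the dataset first.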

A Few Notes

  • The default CLIP model and batch size work well on a 24 GB RTX 3090. If you have less VRAM, you might run into OOM errors with these settings. If your graphics card is not as chunky, try starting with a batch size of 32 and the clip-vit-base-patch32 model, and then work up (or down) from there until you find what works best for you.
  • Different CLIP models might need different threshold values to work well. Feel free to play around with the options to see what works best for your specific case.
  • While I've tested this script a fair few times and have not suffered any data loss, it's always good practice to back up your data before doing anything with it.

Thanks

Cheers to Theodoros Ntakouris for outlining a clear starting point for this script in his Medium article.


smlr's Issues

clustering methods

Good idea and nice job! It works well with CLIP embeddings.

I also tried clustering CLIP embeddings with DBSCAN, but the results were not as good as with hierarchical clustering. Could you explain the reason for using hierarchical clustering and why it works? Thanks a lot.
